Learning Day 67: Semantic segmentation 1 — FCN; Deconvolution

De Jun Huang · Jun 22, 2021

Image segmentation

  • Image segmentation with conventional CV techniques was covered in Day 47.
  • It can be done based on features like colour, grayscale, texture and shape

Semantic segmentation

  • Understand and recognize the content of an image at the pixel level
  • A pixel-wise classification based on semantic information
  • INPUT: images
  • OUTPUT: pixel-wise labels with the same spatial size as the input
  • It can be applied in robotics, scene understanding, autonomous driving and medical diagnostics

Before and after deep learning

  • Before: manually extracted features + CRF (Conditional Random Field)
  • After: CNNs improved into Fully Convolutional Networks (FCN)

FCN

  • FCN has been mentioned in Day 65 for object detection
  • It solves the problems with FC layers, namely that 1) spatial information is lost, 2) the input image size has to be fixed, and 3) there are too many weights, making the network prone to overfitting
  • It can be as simple as replacing the FC layers with Conv layers of the same depth, e.g. FC layer 4096 → Conv layer 16x16x4096 (see the sketch after this list)
Illustration of FCN (ref)
  • However, there is a problem at the last layer, where a huge upsampling is needed: a small feature map must be turned into a pixel-wise prediction of the same size as the input image.
  • Therefore, more sophisticated upsampling methods are needed
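
To make the FC-to-Conv swap concrete, here is a minimal PyTorch sketch; the backbone depth and feature map sizes are illustrative assumptions, not taken from the original figure:

```python
import torch
import torch.nn as nn

# An FC head expects a fixed input size; a Conv head of the same depth does not.
# Assumed sizes: a backbone that outputs 512 channels, trained on 7x7 feature maps.
fc_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096),  # breaks if the feature map is not 7x7
    nn.ReLU(inplace=True),
)

conv_head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7),  # same weights, reshaped to 4096x512x7x7
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 512, 16, 16)  # a larger-than-training input still works
print(conv_head(x).shape)        # torch.Size([1, 4096, 10, 10]) -> spatial map preserved
```

The Conv head outputs a spatial grid of predictions instead of a single vector, which is exactly what pixel-wise labelling needs.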

Upsampling method in FCN

1. Deconvolution / Transposed convolution (learnable)

  • Deconvolution has been mentioned in Day 40 regarding GAN
  • It can be treated as applying thick padding to increase the feature map size and then performing a normal convolution to obtain a larger feature map (sketched in code below)
Transposed convolution different padding strategies (ref)
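
A minimal PyTorch sketch of a learnable transposed convolution; the channel count, kernel size and stride are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Transposed convolution: a learnable upsampling layer.
# With kernel_size=4, stride=2, padding=1, the output is exactly twice the input size.
upsample = nn.ConvTranspose2d(in_channels=21, out_channels=21,
                              kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 21, 8, 8)  # e.g. a coarse 21-class score map
print(upsample(x).shape)      # torch.Size([1, 21, 16, 16])
```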

2. Bilinear interpolation (non-learnable)

  • Simple and fast calculation
  • Bilinear just means the interpolation takes in 2 variables (x and y in the case below), as compared to linear interpolation, which takes one (a one-line example follows the figure)
The red dots are data, green dot is the point to be interpolated (ref)
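
In PyTorch this non-learnable upsampling is a single call; the tensor sizes below are illustrative:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 21, 8, 8)  # coarse score map
y = F.interpolate(x, scale_factor=4, mode="bilinear",
                  align_corners=False)  # fixed weights, nothing to learn
print(y.shape)                # torch.Size([1, 21, 32, 32])
```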

3. Unpooling (non-learnable)

  • It was mentioned in Day 55 for backpropagation through pooling layers
  • For max pooling, the locations of the max values are recorded before pooling is performed. During unpooling, the values just need to be placed back at those max-value locations (sketched in code after the figure).
Max pooling and its corresponding unpooling operation. Switch variables are the location records that were kept for unpooling. (ref)
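
A small PyTorch sketch of max pooling with recorded switch locations, followed by unpooling; the input values are arbitrary:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)  # keep the "switch variables"
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
pooled, indices = pool(x)          # indices record where each max value came from
restored = unpool(pooled, indices)

print(pooled.squeeze())            # the 2x2 map of max values
print(restored.squeeze())          # max values placed back in position, zeros elsewhere
```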

Another component working with upsampling: Skip-layer

  • In order not to lose too much information, we upsample not only from the last layer, but also from shallower layers (where the feature maps are bigger) and combine them to form the final output (see the sketch after this list)
Illustration of skip-layer (ref)
  • Shallow layers (bigger feature maps, e.g. pool3 above) extract details
  • Deep layers (smaller feature maps, e.g. conv7 above) extract semantics
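
A rough PyTorch sketch of the skip-layer idea, fusing a shallow and a deep score map by elementwise summation; the channel counts and feature map sizes are illustrative assumptions, not the exact FCN architecture:

```python
import torch
import torch.nn as nn

num_classes = 21

# 1x1 convs turn feature maps from different depths into class score maps.
score_pool3 = nn.Conv2d(256, num_classes, kernel_size=1)   # shallow layer: details
score_conv7 = nn.Conv2d(4096, num_classes, kernel_size=1)  # deep layer: semantics

# Upsample the deep scores to the shallow map's resolution, then fuse by summation.
up2x = nn.ConvTranspose2d(num_classes, num_classes,
                          kernel_size=4, stride=2, padding=1)

pool3 = torch.randn(1, 256, 28, 28)   # bigger feature map from a shallow layer
conv7 = torch.randn(1, 4096, 14, 14)  # smaller feature map from a deep layer

fused = score_pool3(pool3) + up2x(score_conv7(conv7))  # elementwise summation
print(fused.shape)                    # torch.Size([1, 21, 28, 28])
```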

Combining upsampling and skip-layer

  • The final outputs from the last layer and from all the branches formed by skip-layers are upsampled with interpolation
  • The intermediate stages in the skip-layer branches use deconvolution with learnable weights whose initial values are obtained via interpolation (a common initialization recipe is sketched after this list)
AlexNet-based FCN with the details in skip-layer FCN-8s (shallow feature map) (ref)
  • The FCN layers are colour-coded as follows:
  • — Blue: Conv layers
  • — Green: Pooling layer
  • — Yellow: Elementwise summation
  • — Orange: Deconvolution
  • — Grey: Crop (to unify the size)
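
A common recipe (an assumption here, not quoted from the post) for initializing a deconvolution so that it starts out as exact bilinear upsampling and is then refined by learning:

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Build a (channels, channels, k, k) weight tensor that makes a
    ConvTranspose2d perform bilinear upsampling, channel by channel."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size, dtype=torch.float32)
    filt = 1 - torch.abs(og - center) / factor        # 1D triangular filter
    kernel_2d = filt[:, None] * filt[None, :]         # outer product -> 2D kernel
    weight = torch.zeros(channels, channels, kernel_size, kernel_size)
    for c in range(channels):
        weight[c, c] = kernel_2d                      # each channel upsamples independently
    return weight

up = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1, bias=False)
with torch.no_grad():
    up.weight.copy_(bilinear_kernel(21, 4))           # start as bilinear, then learn
```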

Training guideline

  • SGD with momentum (0.9)
  • Learning rate: 1e-3 (AlexNet), 1e-4 (VGG16), 1e-5 (GoogLeNet)
  • Minibatch size: 20
  • The first 5 conv layers are loaded with pre-trained weights from the original CNN
  • The 6th and 7th conv layers are initialized to zero
  • The last upsampling layer uses fixed bilinear interpolation, with no learning
  • The remaining upsampling layers are initialized with interpolation values and then learned as deconvolutions (these settings are sketched in code after this list)
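
These settings translate into roughly the following PyTorch setup; the placeholder model and the loss choice are assumptions for illustration:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 21, kernel_size=1)  # placeholder standing in for a real FCN
optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-4,     # 1e-4 for a VGG16 backbone
                            momentum=0.9)
criterion = nn.CrossEntropyLoss()        # pixel-wise classification loss

images = torch.randn(20, 3, 64, 64)          # minibatch of 20
labels = torch.randint(0, 21, (20, 64, 64))  # per-pixel class labels

logits = model(images)                   # (20, 21, 64, 64) score maps
loss = criterion(logits, labels)         # averaged over all pixels
loss.backward()
optimizer.step()
```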

FCN performance

  • Among the skip-layer variants, FCN-8s gives the best results
  • Among base networks, FCN-VGG16 has the best performance, but it is also the slowest
  • Accuracy along object edges is still not fantastic, due to the thick padding and cropping operations

Reference

link1
