Learning Day 67: Semantic segmentation 1 — FCN; Deconvolution

De Jun Huang · Jun 22, 2021

Image segmentation

  • Image segmentation with conventional CV techniques was covered in Day 47.
  • It can be done based on features like colour, grayscale, texture and shape

Semantic segmentation

  • Understand and recognize the content of an image at the pixel level
  • A pixel-wise classification based on semantic information
  • INPUT: images
  • OUTPUT: pixel-wise labels with the same spatial size as the input
  • It can be applied in robotics, scene understanding, autonomous driving and medical diagnostics

Before and after deep learning

  • Before: manually extracted features + CRF (Conditional Random Field)
  • After: CNNs improved into Fully Convolutional Networks (FCN)

FCN

  • FCN has been mentioned in Day 65 for object detection
  • It solves the problems with FC layers, namely that 1) spatial information is lost, 2) the input image size has to be fixed, and 3) there are too many weights, making the network prone to overfitting
  • It can be as simple as replacing the FC layers with Conv layers of the same depth, e.g. FC layer 4096 → Conv layer 16x16x4096 (see the sketch after this list)
Illustration of FCN (ref)
  • However, there is a problem at the last layer, where a huge upsampling is needed: a small feature map must be turned into a pixel-wise prediction of the same size as the input image.
  • Therefore, more sophisticated upsampling methods are needed
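
To make the FC-to-Conv swap concrete, here is a minimal PyTorch sketch; the backbone depth and feature map sizes are illustrative assumptions, not taken from the original figure:

```python
import torch
import torch.nn as nn

# An FC head expects a fixed input size; a Conv head of the same depth does not.
# Assumed sizes: a backbone that outputs 512 channels, trained on 7x7 feature maps.
fc_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096),  # breaks if the feature map is not 7x7
    nn.ReLU(inplace=True),
)

conv_head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7),  # same weights, reshaped to 4096x512x7x7
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 512, 16, 16)  # a larger-than-training input still works
print(conv_head(x).shape)        # torch.Size([1, 4096, 10, 10]) -> spatial map preserved
```

The Conv head outputs a spatial grid of predictions instead of a single vector, which is exactly what pixel-wise labelling needs.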

Upsampling method in FCN

1. Deconvolution / Transposed convolution (learnable)

  • Deconvolution has been mentioned in Day 40 regarding GAN
  • It can be treated as applying thick padding to increase the feature map size and then performing a normal convolution to obtain a larger feature map (sketched in code below)
Transposed convolution different padding strategies (ref)
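
A minimal PyTorch sketch of a learnable transposed convolution; the channel count, kernel size and stride are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Transposed convolution: a learnable upsampling layer.
# With kernel_size=4, stride=2, padding=1, the output is exactly twice the input size.
upsample = nn.ConvTranspose2d(in_channels=21, out_channels=21,
                              kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 21, 8, 8)  # e.g. a coarse 21-class score map
print(upsample(x).shape)      # torch.Size([1, 21, 16, 16])
```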

2. Bilinear interpolation (non-learnable)

  • Simple and fast calculation
  • Bilinear just means the interpolation takes in 2 variables (x and y in the case below), as compared to linear interpolation, which takes one (a one-line example follows the figure)
The red dots are data, green dot is the point to be interpolated (ref)
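
In PyTorch this non-learnable upsampling is a single call; the tensor sizes below are illustrative:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 21, 8, 8)  # coarse score map
y = F.interpolate(x, scale_factor=4, mode="bilinear",
                  align_corners=False)  # fixed weights, nothing to learn
print(y.shape)                # torch.Size([1, 21, 32, 32])
```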

3. Unpooling (non-learnable)

  • It was mentioned in Day 55 for backpropagation through pooling layers
  • For max pooling, the locations of the max values are recorded before pooling is performed. During unpooling, the values just need to be placed back at those max-value locations (sketched in code after the figure).
Max pooling and its corresponding unpooling operation. Switch variables are the location records that were kept for unpooling. (ref)
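
A small PyTorch sketch of max pooling with recorded switch locations, followed by unpooling; the input values are arbitrary:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)  # keep the "switch variables"
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
pooled, indices = pool(x)          # indices record where each max value came from
restored = unpool(pooled, indices)

print(pooled.squeeze())            # the 2x2 map of max values
print(restored.squeeze())          # max values placed back in position, zeros elsewhere
```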

Another component working with upsampling: Skip-layer

  • In order not to lose too much information, we upsample not only from the last layer, but also from shallower layers (where the feature maps are bigger) and combine them to form the final output (see the sketch after this list)
Illustration of skip-layer (ref)
  • Shallow layers (bigger feature maps, e.g. pool3 above) extract details
  • Deep layers (smaller feature maps, e.g. conv7 above) extract semantics
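
A rough PyTorch sketch of the skip-layer idea, fusing a shallow and a deep score map by elementwise summation; the channel counts and feature map sizes are illustrative assumptions, not the exact FCN architecture:

```python
import torch
import torch.nn as nn

num_classes = 21

# 1x1 convs turn feature maps from different depths into class score maps.
score_pool3 = nn.Conv2d(256, num_classes, kernel_size=1)   # shallow layer: details
score_conv7 = nn.Conv2d(4096, num_classes, kernel_size=1)  # deep layer: semantics

# Upsample the deep scores to the shallow map's resolution, then fuse by summation.
up2x = nn.ConvTranspose2d(num_classes, num_classes,
                          kernel_size=4, stride=2, padding=1)

pool3 = torch.randn(1, 256, 28, 28)   # bigger feature map from a shallow layer
conv7 = torch.randn(1, 4096, 14, 14)  # smaller feature map from a deep layer

fused = score_pool3(pool3) + up2x(score_conv7(conv7))  # elementwise summation
print(fused.shape)                    # torch.Size([1, 21, 28, 28])
```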

Combining upsampling and skip-layer

  • The final outputs from the last layer and from all the branches formed by skip-layers are upsampled with interpolation
  • The intermediate stages in the skip-layer branches use deconvolution with learnable weights whose initial values are obtained via interpolation (a common initialization recipe is sketched after this list)
AlexNet-based FCN with the details in skip-layer FCN-8s (shallow feature map) (ref)
  • The FCN layers are colour-coded as follows:
  • — Blue: Conv layers
  • — Green: Pooling layer
  • — Yellow: Elementwise summation
  • — Orange: Deconvolution
  • — Grey: Crop (to unify the size)
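
A common recipe (an assumption here, not quoted from the post) for initializing a deconvolution so that it starts out as exact bilinear upsampling and is then refined by learning:

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Build a (channels, channels, k, k) weight tensor that makes a
    ConvTranspose2d perform bilinear upsampling, channel by channel."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size, dtype=torch.float32)
    filt = 1 - torch.abs(og - center) / factor        # 1D triangular filter
    kernel_2d = filt[:, None] * filt[None, :]         # outer product -> 2D kernel
    weight = torch.zeros(channels, channels, kernel_size, kernel_size)
    for c in range(channels):
        weight[c, c] = kernel_2d                      # each channel upsamples independently
    return weight

up = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1, bias=False)
with torch.no_grad():
    up.weight.copy_(bilinear_kernel(21, 4))           # start as bilinear, then learn
```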

Training guideline

  • SGD with momentum (0.9)
  • Learning rate: 1e-3 (AlexNet), 1e-4 (VGG16), 1e-5 (GoogLeNet)
  • Minibatch size: 20
  • The first 5 conv layers are loaded with pre-trained weights from the original CNN
  • The 6th and 7th conv layers are initialized to zero
  • The last upsampling layer uses fixed bilinear interpolation, with no learning
  • The remaining upsampling layers are initialized with interpolation values and then learned as deconvolutions (these settings are sketched in code after this list)
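
These settings translate into roughly the following PyTorch setup; the placeholder model and the loss choice are assumptions for illustration:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 21, kernel_size=1)  # placeholder standing in for a real FCN
optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-4,     # 1e-4 for a VGG16 backbone
                            momentum=0.9)
criterion = nn.CrossEntropyLoss()        # pixel-wise classification loss

images = torch.randn(20, 3, 64, 64)          # minibatch of 20
labels = torch.randint(0, 21, (20, 64, 64))  # per-pixel class labels

logits = model(images)                   # (20, 21, 64, 64) score maps
loss = criterion(logits, labels)         # averaged over all pixels
loss.backward()
optimizer.step()
```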

FCN performance

  • Among the skip-layer variants, FCN-8s gives the best results
  • Among base networks, FCN-VGG16 has the best performance, but it is also the slowest
  • Accuracy along object edges is still not fantastic, due to the thick padding and cropping operations

Reference

link1
