Learning Day 68: Semantic segmentation 2 — DeepLab, atrous/dilated convolution

Published in

dejunhuang

3 min read · Jun 23, 2021


Background

The FCN from Day 67 still has a weakness: it upsamples in big steps from small feature maps to the final output, which loses fine detail.
DeepLab aims to solve this problem and make object boundaries more accurate.

DeepLab v1

CNN + CRF
Use atrous/dilated convolution in the deeper layers of the CNN

Atrous/Dilated convolution

Compared with a standard 3x3 filter, holes (zeros) are inserted between the filter weights so that the same 9 weights cover a 5x5 area

Atrous/Dilated convolution using a dilated 3x3 filter (effective coverage is 5x5) (ref)
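
To make the picture above concrete, here is a minimal pure-Python sketch of a dilated convolution. The function names (effective_size, dilated_conv2d) and the toy 7x7 input are my own choices for illustration, not anything from the DeepLab code. The only difference from a standard convolution is that the filter taps are spaced rate pixels apart.

```python
def effective_size(k, rate):
    """Side length of the area covered by a k x k filter at this dilation rate."""
    return k + (k - 1) * (rate - 1)


def dilated_conv2d(image, kernel, rate=1):
    """'Valid' 2D cross-correlation of a list-of-lists image with a dilated kernel."""
    k = len(kernel)
    span = effective_size(k, rate)
    out_h = len(image) - span + 1
    out_w = len(image[0]) - span + 1
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            total = 0
            for i in range(k):
                for j in range(k):
                    # Taps are `rate` pixels apart; the skipped pixels are the "holes".
                    total += kernel[i][j] * image[y + i * rate][x + j * rate]
            row.append(total)
        out.append(row)
    return out


image = [[r * 7 + c for c in range(7)] for r in range(7)]  # toy 7x7 ramp
ones = [[1] * 3 for _ in range(3)]                          # 9 weights in both cases

standard = dilated_conv2d(image, ones, rate=1)  # each output sees a 3x3 area
dilated = dilated_conv2d(image, ones, rate=2)   # each output sees a 5x5 area
print(effective_size(3, 1), effective_size(3, 2))  # 3 5
```

Both calls use the same 9 weights; only the spacing of the taps changes, which is why the field of view grows without adding parameters.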

With a similar number of weights, the field of view is bigger
Use it in the deeper layers in place of striding, so the feature map is not downsampled as much and there is less need for a big upsampling/deconvolution step. The resulting feature map keeps more detail, as the comparison below shows.

Compare standard convolution with down- and up-sampling (top) with atrous convolution (bottom) (ref)
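
The comparison can be reproduced in 1D with a small sketch. The signal values and the [1, 2, 1] kernel are arbitrary, and conv1d_same is a helper I made up for this note. The low-resolution path (downsample by 2, then convolve) gives responses only at every other position, while the atrous path (rate 2 on the full-resolution signal) gives the same responses plus the ones in between.

```python
def conv1d_same(signal, kernel, rate=1):
    """1D cross-correlation with zero padding, output length equals input length."""
    half = len(kernel) // 2
    out = []
    for n in range(len(signal)):
        total = 0
        for j, w in enumerate(kernel):
            idx = n + (j - half) * rate  # taps are `rate` samples apart
            if 0 <= idx < len(signal):   # outside the signal counts as zero
                total += w * signal[idx]
        out.append(total)
    return out


signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
kernel = [1, 2, 1]

# Path 1: downsample by 2, then a standard convolution (8 sparse responses).
low_res = conv1d_same(signal[::2], kernel)

# Path 2: atrous convolution with rate 2 on the original signal (16 dense responses).
dense = conv1d_same(signal, kernel, rate=2)

print(len(low_res), len(dense))   # 8 16
print(dense[::2] == low_res)      # True
```

The sparse path still has to be upsampled to get back to 16 values, which is where detail is lost; the atrous path computes them directly.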

This is controlled by the dilation rate. The number of holes inserted between adjacent weights is rate - 1, e.g. rate = 2 inserts 1 hole. However, the rate cannot be too big: if rate ≥ the input size, only the center weight of the filter lands on the input, so it behaves like a convolution with a 1x1 filter.
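
A quick way to see the "too big" problem is to count how many of the 9 weights actually land inside the feature map. This is a rough sketch under my own assumptions: a 3x3 filter, zero padding outside the map, and a toy 5x5 map. The helper names are hypothetical.

```python
def holes_between_weights(rate):
    """Number of holes inserted between two adjacent filter weights."""
    return rate - 1


def valid_taps(n, rate, k=3):
    """Average number of filter weights that land inside an n x n map."""
    half = k // 2
    offsets = [(i - half) * rate for i in range(k)]
    total = 0
    for y in range(n):
        for x in range(n):
            for dy in offsets:
                for dx in offsets:
                    if 0 <= y + dy < n and 0 <= x + dx < n:
                        total += 1
    return total / (n * n)


for rate in (1, 2, 4, 5):
    print(rate, holes_between_weights(rate), valid_taps(5, rate))
```

On the 5x5 map, rate = 5 leaves exactly 1 valid weight (the center) at every position, which is the 1x1 behavior described above.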

CRF (Conditional Random Field)

Take the rough segmentation results from the CNN and refine the boundaries using a fully connected CRF
I don’t quite understand the theory of CRF
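
For reference, this is the energy function as I recall it from the DeepLab paper, so treat the exact notation as an assumption to check against the original. The label assignment x is chosen to minimize an energy with a per-pixel term from the CNN and a pairwise term over every pair of pixels:

```latex
E(x) = \sum_i \theta_i(x_i) + \sum_{i<j} \theta_{ij}(x_i, x_j),
\qquad
\theta_i(x_i) = -\log P(x_i)

\theta_{ij}(x_i, x_j) = \mu(x_i, x_j)\left[
  w_1 \exp\!\left(-\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\alpha^2}
                  -\frac{\lVert I_i - I_j \rVert^2}{2\sigma_\beta^2}\right)
+ w_2 \exp\!\left(-\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\gamma^2}\right)
\right],
\qquad
\mu(x_i, x_j) = \begin{cases} 1 & x_i \neq x_j \\ 0 & \text{otherwise} \end{cases}
```

Here P(x_i) is the CNN's probability for the label at pixel i, p is the pixel position and I is the pixel color. Roughly, two pixels that are close together and similar in color are penalized for taking different labels, which pulls the boundary toward image edges.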

DeepLab v2

The base model can be VGG16 or ResNet101
Introduce Atrous Spatial Pyramid Pooling (ASPP)

Atrous Spatial Pyramid Pooling (ASPP)

Use different dilation rates in parallel to capture features at different scales

ASPP (ref)

An illustration of ASPP at Conv6 layer (right after Pool5) (ref)
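
A toy 1D sketch of the idea: run the same input through several branches, one per dilation rate, and keep all branch outputs for each position. The rates (1, 2, 4), the shared [1, 1, 1] kernel, and the helper names are illustrative assumptions only; the real ASPP uses much larger rates on 2D feature maps and learns separate weights per branch.

```python
def conv1d_same(signal, kernel, rate=1):
    """1D cross-correlation with zero padding, output length equals input length."""
    half = len(kernel) // 2
    out = []
    for n in range(len(signal)):
        total = 0
        for j, w in enumerate(kernel):
            idx = n + (j - half) * rate
            if 0 <= idx < len(signal):
                total += w * signal[idx]
        out.append(total)
    return out


def aspp_1d(signal, kernel, rates):
    """Apply one dilated branch per rate and stack the results per position."""
    branches = [conv1d_same(signal, kernel, rate) for rate in rates]
    return [tuple(branch[n] for branch in branches) for n in range(len(signal))]


impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]  # single active position at index 4
features = aspp_1d(impulse, [1, 1, 1], rates=(1, 2, 4))

for n, feat in enumerate(features):
    print(n, feat)
```

Position 0 only sees the impulse through the rate-4 branch, while position 3 only sees it through the rate-1 branch, so each branch contributes context at a different scale.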

DeepLab v3

More universal as it can use any CNN structure as a backbone
Use batch-norm in ASPP
No CRF
Use “series” and “parallel” connections of atrous convolution layers

Deep CNN with atrous convolution in “series” connection or in cascade (ref)

ASPP in “parallel” connection (ref)
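
The difference between the two connections shows up in the receptive field. A minimal sketch, assuming 3x3 filters, stride 1 everywhere, and illustrative rates (2, 4, 8) that I picked for the example: in series the fields of view add up layer by layer, while in parallel each branch keeps its own.

```python
def receptive_field(rates, k=3):
    """Receptive field (in input pixels) of stacked k x k convs, all stride 1."""
    rf = 1
    for rate in rates:
        rf += (k - 1) * rate  # each layer widens the view by (k - 1) * rate
    return rf


rates = [2, 4, 8]

plain = receptive_field([1, 1, 1])                  # three standard 3x3 convs
series = receptive_field(rates)                     # cascade of atrous convs
parallel = [receptive_field([r]) for r in rates]    # one branch per rate

print(plain, series, parallel)  # 7 29 [5, 9, 17]
```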

DeepLab v3+

Expands on v3
Adds an encoder-decoder structure to preserve boundary information
The original DeepLab v3 is used as the encoder, applying atrous convolution at multiple scales
The decoder takes low-level features from the backbone, concatenates them with the upsampled encoder output, then applies a few conv layers and a final upsampling

Encoder-decoder structure in DeepLab v3+ (ref)
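
The decoder is easier to follow as shape bookkeeping. The numbers below (encoder output stride 16, low-level features at stride 4, 256 encoder channels, 48 low-level channels after a 1x1 conv) are my recollection of the commonly cited v3+ configuration, not something stated in these notes, so treat them as assumptions. The function only tracks (channels, height, width) and does no real convolution.

```python
def decoder_shapes(height, width, output_stride=16, low_level_stride=4,
                   encoder_channels=256, low_level_channels=48):
    """Track tensor shapes through a DeepLab v3+ style decoder (assumed config)."""
    # Encoder (v3 with ASPP) output at 1/output_stride of the input size.
    encoder = (encoder_channels, height // output_stride, width // output_stride)

    # First upsampling brings the encoder output to the low-level resolution.
    factor = output_stride // low_level_stride
    encoder_up = (encoder[0], encoder[1] * factor, encoder[2] * factor)

    # Low-level backbone features after a 1x1 conv that reduces channels.
    low_level = (low_level_channels, height // low_level_stride,
                 width // low_level_stride)
    assert encoder_up[1:] == low_level[1:], "spatial sizes must match to concat"

    # Concatenate along channels, refine with conv layers, upsample to full size.
    fused = (encoder_up[0] + low_level[0],) + low_level[1:]
    output = (fused[1] * low_level_stride, fused[2] * low_level_stride)
    return {"encoder": encoder, "encoder_up": encoder_up,
            "low_level": low_level, "fused": fused, "output": output}


shapes = decoder_shapes(512, 512)
for name, shape in shapes.items():
    print(name, shape)
```

For a 512x512 input this gives a 32x32 encoder output that is upsampled to 128x128, concatenated with the 128x128 low-level features, and finally upsampled back to 512x512.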

Reference


Written by De Jun Huang
