1. Abstract

  • We address the scene segmentation task by capturing rich contextual dependencies based on the self-attention mechanism.

  • Previous approaches capture contexts through multi-scale feature fusion.

  • In contrast, DANet integrates local features with their global dependencies.

  • Two types of attention modules: **spatial** and **channel**.

  • New state-of-the-art segmentation results on three challenging scene segmentation datasets.

Main contributions :

  1. We propose a novel DANet with self-attention mechanism to enhance the discriminant ability of feature representations for scene segmentation.

  2. A position attention module is proposed to learn the spatial interdependencies of features, and a channel attention module is designed to model channel interdependencies. Together they significantly improve the segmentation results by modeling rich contextual dependencies over local features.

  3. We achieve new state-of-the-art results on three popular benchmarks including Cityscapes, PASCAL Context dataset and COCO stuff dataset.

2. Dual Attention Network

Backbone: Modified ResNet

  • we design two types of attention modules to draw global context over local features generated by a dilated residual network, thus obtaining better feature representations for pixel-level prediction.

  • We employ a pretrained residual network with the dilated strategy as the backbone.

  • Note that we remove the downsampling operations and employ dilated convolutions in the last two ResNet blocks, thus enlarging the final feature map to 1/8 of the input image size. This retains more details without adding extra parameters.
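A quick sketch of the 1/8-resolution arithmetic above, assuming the standard ResNet stage strides (the stage layout is an assumption for illustration; only the stride bookkeeping is shown, not the dilated convolutions themselves):

```python
# Sketch: how replacing the strides of the last two ResNet blocks with
# dilation keeps the feature map at 1/8 of the input resolution.

def feature_map_size(input_size, stage_strides):
    """Divide the spatial size by each stage's stride in turn."""
    size = input_size
    for s in stage_strides:
        size //= s
    return size

# Plain ResNet: stem stride 4, then stage strides 1, 2, 2, 2 -> 1/32 resolution.
plain = feature_map_size(512, [4, 1, 2, 2, 2])
# Dilated variant: strides of the last two blocks set to 1 (dilation used
# instead), so the spatial size is only divided by 8.
dilated = feature_map_size(512, [4, 1, 2, 1, 1])

print(plain, dilated)  # 16 64
```

For a 512-pixel input, the plain backbone yields a 16-pixel map (1/32) while the dilated backbone yields 64 pixels (1/8).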

Core Modules

  • Then the features from the dilated residual network would be fed into two parallel attention modules.

  • Take the position attention module in the upper part of Figure 2 as an example: we first apply a convolution layer to obtain dimension-reduced features. Then we feed these features into the position attention module and generate new features carrying spatial long-range contextual information through the following three steps.

  1. The first step is to generate a spatial attention matrix which models the spatial relationship between any two pixels of the features.

  2. Next, we perform a matrix multiplication between the attention matrix and the original features.

  3. Third, we perform an element-wise sum between the resulting matrix and the original features to obtain the final representations reflecting long-range contexts.
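The three steps above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the learned query/key convolutions are replaced by fixed random projections, and shapes are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4                      # hypothetical channel / spatial sizes
feats = rng.standard_normal((C, H, W))
x = feats.reshape(C, H * W)            # flatten spatial dims: C x N, N = H*W

# Step 1: spatial attention matrix S (N x N) modeling the relationship
# between any two pixels (random projections stand in for query/key convs).
q = rng.standard_normal((C, C)) @ x
k = rng.standard_normal((C, C)) @ x
S = softmax(q.T @ k, axis=-1)          # row i: how pixel i attends to all pixels

# Step 2: matrix multiplication between the attention matrix and features,
# so every pixel aggregates information from all other pixels.
out = x @ S.T

# Step 3: element-wise sum with the original features (residual connection)
# yields representations reflecting long-range contexts.
result = (out + x).reshape(C, H, W)
print(result.shape)  # (8, 4, 4)
```

Each row of `S` sums to 1, so step 2 is a convex combination of all pixel features, and the residual sum in step 3 keeps the original local signal intact.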

  • The process of capturing the channel relationship is similar to the position attention module except for the first step, in which the channel attention matrix is calculated in the channel dimension.
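The channel variant can be sketched the same way. Only step 1 changes: the C x C attention matrix is computed directly from the features, with no query/key convolutions (a plain softmax is used here as a simplification; shapes are hypothetical):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4                      # hypothetical channel / spatial sizes
x = rng.standard_normal((C, H, W)).reshape(C, H * W)

# Step 1 (the differing step): channel-to-channel affinity matrix (C x C)
# computed directly from the features themselves.
A = softmax(x @ x.T, axis=-1)

# Steps 2-3 mirror the position module: reweight channels by the attention
# matrix, then add the original features back (residual sum).
out = (A @ x + x).reshape(C, H, W)
print(out.shape)  # (8, 4, 4)
```

Because `A` is only C x C rather than N x N, the channel module is much cheaper than the position module at high resolutions.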

  • Finally we aggregate the outputs from the two attention modules to obtain better feature representations for pixel-level prediction.

2.1 Position Attention Module

2.2 Channel Attention Module