Coordinate Attention: Principles and Implementation

lijingle · 2021-6-8 18:03:07

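With CVPR 2021, arguably this year's largest computer vision conference, adding yet another attention mechanism, the race for the best attention mechanism continues. The paper in question is Coordinate Attention for Efficient Mobile Network Design. At first glance, the mechanism looks like a hybrid of Triplet Attention and Strip Pooling, but it is aimed more specifically at lightweight networks for mobile deployment.

We will first look at the motivation behind the work, then briefly review Triplet Attention (Rotate to Attend: Convolutional Triplet Attention Module) and Strip Pooling (Strip Pooling: Rethinking Spatial Pooling for Scene Parsing). After that, we will analyze the structure of the proposed mechanism and close with the results reported in the paper.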

Contents:

Motivation
Coordinate Attention
PyTorch Code
Results
Conclusion
References

Abstract:
Recent studies on mobile network design have demonstrated the remarkable effectiveness of channel attention (e.g., the Squeeze-and-Excitation attention) for lifting model performance, but they generally neglect the positional information, which is important for generating spatially selective attention maps. In this paper, we propose a novel attention mechanism for mobile networks by embedding positional information into channel attention, which we call "coordinate attention". Unlike channel attention that transforms a feature tensor to a single feature vector via 2D global pooling, the coordinate attention factorizes channel attention into two 1D feature encoding processes that aggregate features along the two spatial directions, respectively. In this way, long range dependencies can be captured along one spatial direction and meanwhile precise positional information can be preserved along the other spatial direction. The resulting feature maps are then encoded separately into a pair of direction-aware and position-sensitive attention maps that can be complementarily applied to the input feature map to augment the representations of the objects of interest. Our coordinate attention is simple and can be flexibly plugged into classic mobile networks, such as MobileNetV2, MobileNeXt, and EfficientNet with nearly no computational overhead. Extensive experiments demonstrate that our coordinate attention is not only beneficial to ImageNet classification but more interestingly, behaves better in down-stream tasks, such as object detection and semantic segmentation.

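Motivation
Architecture design has been a major research area in computer vision in recent years. Many might argue that this line of research is reaching saturation, but new work keeps challenging that view, with fresh advances in the design of custom layers and attention mechanisms. This post focuses on the latter. An attention mechanism essentially provides additional information about "where" and "what" to focus on, based on the input data. Methods such as Squeeze and Excitation (SE), Convolutional Block Attention Module (CBAM), Triplet Attention, Global Context (GC), and many others have demonstrated the efficiency of such plug-in modules, which noticeably improve the performance of conventional baseline models at minimal added computational cost. Take the word "minimal" with a grain of salt, though, since these models are constantly trying to deliver the benefits of attention at the lowest possible overhead. Moreover, the design of these attention mechanisms has focused mainly on large-scale networks, because the computational overhead they introduce makes them impractical for mobile networks with limited capacity. In addition, most attention mechanisms attend only to channel information and lose expressive power because they discard spatial information.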

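Given these shortcomings:
Building on this earlier work, the paper proposes a novel and efficient attention mechanism that embeds positional information into channel attention, allowing mobile networks to attend over large regions while avoiding significant computational overhead.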

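The authors call this new attention mechanism Coordinate Attention because its operation distinguishes between spatial directions (i.e., coordinates) and generates coordinate-aware attention maps.
Coordinate Attention has the following advantages. First, it captures not only cross-channel information but also direction-aware and position-sensitive information, which helps the model locate and recognize objects of interest more accurately. Second, the method is flexible and lightweight, and can easily be plugged into the classic building blocks of mobile networks, such as the inverted residual block proposed in MobileNetV2 and the sandglass block proposed in MobileNeXt, to augment features by emphasizing informative representations. Third, as a pretrained model, Coordinate Attention can bring significant performance gains to downstream tasks built on mobile networks, especially those involving dense prediction (e.g., semantic segmentation).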

Coordinate Attention

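As hinted in the introduction to this post, and as shown in panel (c) of the figure above, Coordinate Attention is structurally somewhat similar to Triplet Attention and Strip Pooling, published at WACV 2021 and CVPR 2020 respectively. (Both papers also share an author with Coordinate Attention.)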

Strip Pooling (CVPR 2020)

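Although the main focus of that paper is scene parsing, structurally we can see the similarities between the designs of Strip Pooling and Coordinate Attention. Strip Pooling essentially takes an input tensor $X \in \mathbb{R}^{C \times H \times W}$ and, for each spatial feature map, reduces it to two spatial vectors of size $H \times 1$ and $W \times 1$. These two vectors are passed through two 1D convolution kernels and then through bilinear interpolation to recover the original $H \times W$ shape, after which they are added element-wise. The resulting map is passed through a pointwise convolution, a sigmoid activation is applied to it, and it is then multiplied element-wise with the original feature map.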
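The Strip Pooling steps described above can be sketched in PyTorch roughly as follows. This is a minimal illustration rather than the official module: the class name `StripPoolingSketch`, the kernel size of 3, and the choice to keep the channel count unchanged are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StripPoolingSketch(nn.Module):
    """Simplified strip-pooling gate (hypothetical name, not the official code)."""

    def __init__(self, channels):
        super().__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (N, C, H, W) -> (N, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (N, C, H, W) -> (N, C, 1, W)
        # 1D convolutions written as 2D convolutions with strip-shaped kernels.
        # Kernel size 3 is an assumption for this sketch.
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
        self.conv_w = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)  # pointwise convolution

    def forward(self, x):
        _, _, h, w = x.shape
        s_h = self.conv_h(self.pool_h(x))  # (N, C, H, 1)
        s_w = self.conv_w(self.pool_w(x))  # (N, C, 1, W)
        # Bilinear interpolation back to H x W, then element-wise addition.
        s_h = F.interpolate(s_h, size=(h, w), mode="bilinear", align_corners=False)
        s_w = F.interpolate(s_w, size=(h, w), mode="bilinear", align_corners=False)
        # Pointwise convolution, sigmoid, then gate the original feature map.
        gate = torch.sigmoid(self.fuse(s_h + s_w))
        return x * gate
```

Because the gate lies in (0, 1), the output has the same shape as the input and each activation is only ever scaled down, never amplified.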

Triplet Attention (WACV 2021)

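Triplet Attention is closer to Coordinate Attention in terms of domain. It provides a structure that isolates each spatial dimension through permutation operations and thereby computes, in effect, how spatial attention corresponds to channel information. The paper calls this concept "Cross Dimension Interaction (CDI)". For an in-depth analysis of Triplet Attention, please see my other blog posts.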
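To make the permutation idea more concrete, here is a rough sketch of a three-branch module in the spirit of Triplet Attention. The details are assumptions drawn from my reading of the Triplet Attention paper rather than from this post: the max-plus-mean pooling (`ZPool`), the 7×7 convolution, and the plain averaging of the three branches. All class names are illustrative.

```python
import torch
import torch.nn as nn


class ZPool(nn.Module):
    """Stack the max and the mean taken over dim 1, giving 2 channels."""

    def forward(self, x):
        return torch.cat(
            [x.max(dim=1, keepdim=True)[0], x.mean(dim=1, keepdim=True)], dim=1
        )


class AttentionGate(nn.Module):
    """Pool over dim 1, then conv + BN + sigmoid to get a single-channel gate."""

    def __init__(self, kernel_size=7):  # kernel size is an assumption
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):
        return x * torch.sigmoid(self.bn(self.conv(self.pool(x))))


class TripletAttentionSketch(nn.Module):
    """Three gates, each applied to a differently permuted view of the input."""

    def __init__(self):
        super().__init__()
        self.gate_cw = AttentionGate()
        self.gate_hc = AttentionGate()
        self.gate_hw = AttentionGate()

    def forward(self, x):
        # Swap C and H: the gate pools over H and attends over the (C, W) plane.
        x1 = self.gate_cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # Swap C and W: the gate pools over W and attends over the (H, C) plane.
        x2 = self.gate_hc(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        # No permutation: ordinary spatial attention over the (H, W) plane.
        x3 = self.gate_hw(x)
        return (x1 + x2 + x3) / 3.0
```

Both permutations used here are their own inverses, so applying the same permutation again restores the original (N, C, H, W) layout.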

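Now let's analyze what happens inside the Coordinate Attention (Coord Att) module itself, as shown in the figure at the start of this section. The module takes an input tensor $X \in \mathbb{R}^{C \times H \times W}$ and applies average pooling along the two spatial dimensions H and W, producing two tensors of shape $C \times H \times 1$ and $C \times 1 \times W$. One of them is transposed so the two can be concatenated along the spatial axis into a single tensor of shape $C \times (H+W) \times 1$, which is passed through a 2D convolution that reduces the number of channels from $C$ to $C/r$ according to a specified reduction ratio r. This is followed by a normalization layer (Batch Norm in this case) and an activation function (HardSwish in this case). The tensor is then split back into two tensors of shape $C/r \times H \times 1$ and $C/r \times 1 \times W$. Each of them passes through its own 2D convolution, which raises the channel count from $C/r$ back to $C$, and a sigmoid activation is applied to both results. Finally, the resulting attention maps are multiplied element-wise with the original input tensor X.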
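The steps above can be traced shape by shape with a few lines of PyTorch. This is only a sketch with freshly initialized layers and illustrative sizes (N=2, C=64, H=5, W=7, r=32), none of which come from the paper. `nn.Hardswish` is used here as the built-in equivalent of the hard swish activation.

```python
import torch
import torch.nn as nn

n, c, h, w, r = 2, 64, 5, 7, 32      # illustrative sizes, not from the paper
mip = max(8, c // r)                 # reduced channel count, floored at 8
x = torch.randn(n, c, h, w)

# 1. Directional average pooling.
x_h = x.mean(dim=3, keepdim=True)    # (N, C, H, 1): pooled over W
x_w = x.mean(dim=2, keepdim=True)    # (N, C, 1, W): pooled over H

# 2. Transpose x_w so both line up, then concatenate along the spatial axis.
y = torch.cat([x_h, x_w.permute(0, 1, 3, 2)], dim=2)   # (N, C, H+W, 1)

# 3. Shared 1x1 convolution, batch norm, hard swish.
squeeze = nn.Sequential(nn.Conv2d(c, mip, 1), nn.BatchNorm2d(mip), nn.Hardswish())
y = squeeze(y)                       # (N, mip, H+W, 1)

# 4. Split back into the two directions.
y_h, y_w = torch.split(y, [h, w], dim=2)
y_w = y_w.permute(0, 1, 3, 2)        # (N, mip, 1, W)

# 5. Separate 1x1 convolutions restore C channels; sigmoid gives the gates.
a_h = torch.sigmoid(nn.Conv2d(mip, c, 1)(y_h))   # (N, C, H, 1)
a_w = torch.sigmoid(nn.Conv2d(mip, c, 1)(y_w))   # (N, C, 1, W)

# 6. Broadcasting applies both gates to every (h, w) position.
out = x * a_h * a_w
print(out.shape)
```

The two gates never materialize a full H×W attention map. Broadcasting combines a per-row weight and a per-column weight at each position, which is what keeps the module cheap.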

PyTorch Code
The snippet below gives PyTorch code for the Coordinate Attention module, which can be plugged into any classic backbone.

import torch
import torch.nn as nn


class h_sigmoid(nn.Module):
    """Hard sigmoid: ReLU6(x + 3) / 6."""

    def __init__(self, inplace=True):
        super(h_sigmoid, self).__init__()
        self.relu = nn.ReLU6(inplace=inplace)

    def forward(self, x):
        return self.relu(x + 3) / 6


class h_swish(nn.Module):
    """Hard swish: x * h_sigmoid(x)."""

    def __init__(self, inplace=True):
        super(h_swish, self).__init__()
        self.sigmoid = h_sigmoid(inplace=inplace)

    def forward(self, x):
        return x * self.sigmoid(x)


class CoordAtt(nn.Module):
    def __init__(self, inp, oup, reduction=32):
        super(CoordAtt, self).__init__()
        # Pool along W (keep H) and along H (keep W).
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))

        # Reduced channel count, never below 8.
        mip = max(8, inp // reduction)

        self.conv1 = nn.Conv2d(inp, mip, kernel_size=1, stride=1, padding=0)
        self.bn1 = nn.BatchNorm2d(mip)
        self.act = h_swish()

        self.conv_h = nn.Conv2d(mip, oup, kernel_size=1, stride=1, padding=0)
        self.conv_w = nn.Conv2d(mip, oup, kernel_size=1, stride=1, padding=0)

    def forward(self, x):
        identity = x

        n, c, h, w = x.size()
        x_h = self.pool_h(x)                      # (n, c, h, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)  # (n, c, w, 1)

        # Shared transform on the concatenated directional features.
        y = torch.cat([x_h, x_w], dim=2)          # (n, c, h + w, 1)
        y = self.conv1(y)
        y = self.bn1(y)
        y = self.act(y)

        # Split back into the two directions.
        x_h, x_w = torch.split(y, [h, w], dim=2)
        x_w = x_w.permute(0, 1, 3, 2)             # (n, mip, 1, w)

        a_h = self.conv_h(x_h).sigmoid()          # (n, oup, h, 1)
        a_w = self.conv_w(x_w).sigmoid()          # (n, oup, 1, w)

        # The gates broadcast over the input, so oup must equal inp.
        out = identity * a_w * a_h

        return out
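As one illustration of the "plug into a backbone" idea, the sketch below places a compact Coordinate Attention layer inside a MobileNetV2-style inverted residual block. The block layout, the position of the attention layer (after the depthwise convolution), the expansion ratio, and the class name `InvertedResidualCA` are assumptions made for this example. The compact `CoordAtt` here uses `nn.Hardswish` in place of a hand-written hard swish.

```python
import torch
import torch.nn as nn


class CoordAtt(nn.Module):
    """Compact Coordinate Attention with equal input and output channels."""

    def __init__(self, channels, reduction=32):
        super().__init__()
        mip = max(8, channels // reduction)
        self.squeeze = nn.Sequential(
            nn.Conv2d(channels, mip, 1), nn.BatchNorm2d(mip), nn.Hardswish()
        )
        self.conv_h = nn.Conv2d(mip, channels, 1)
        self.conv_w = nn.Conv2d(mip, channels, 1)

    def forward(self, x):
        _, _, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                      # (N, C, H, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (N, C, W, 1)
        y = self.squeeze(torch.cat([x_h, x_w], dim=2))         # (N, mip, H+W, 1)
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (N, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (N, C, 1, W)
        return x * a_h * a_w


class InvertedResidualCA(nn.Module):
    """Inverted residual block (stride 1) with attention on the expanded features."""

    def __init__(self, channels, expand_ratio=6):
        super().__init__()
        hidden = channels * expand_ratio
        self.block = nn.Sequential(
            # 1x1 expansion
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # attention placement is an assumption of this sketch
            CoordAtt(hidden),
            # 1x1 linear projection
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)
```

Since the attention layer preserves the tensor shape, it can be dropped between any two layers of an existing block without changing the surrounding code.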

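Results:
For the full range of experiments the authors ran, I recommend reading the paper. Here we only highlight the headline results: image classification on ImageNet with MobileNet and MobileNeXt, and object detection on MS COCO.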

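Conclusion
Although the results themselves are good, the method undermines one of the paper's main motivations, low-cost attention for mobile networks, because in terms of both parameters and FLOPs, Coordinate Attention is actually more expensive than the two methods it is compared against (CBAM and SE). The second shortcoming is the limited number of comparisons. SE and CBAM are prominent attention methods, but more performant and cheaper attention modules such as Triplet Attention and ECA have since appeared, and they were not included in the comparison. Furthermore, the comparison with CBAM is not apples-to-apples: the Coordinate Attention module uses hard swish as its activation, which gives it a significant performance boost, whereas CBAM uses ReLU (which is inferior to hard swish).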
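The parameter-cost point can be checked with a quick count. The sketch below compares the learnable layers of an SE-style block with those of a Coordinate Attention block at the same channel width and reduction ratio. The helper names, the choice of C=256 and r=16, and the use of equal ratios for both modules are assumptions for illustration, so the numbers do not reproduce the paper's tables.

```python
import torch.nn as nn


def count_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())


def se_layers(c, r):
    # SE: two fully connected layers, C -> C/r -> C.
    return nn.Sequential(
        nn.Linear(c, c // r), nn.ReLU(), nn.Linear(c // r, c), nn.Sigmoid()
    )


def coord_att_layers(c, r):
    # Coordinate Attention: one shared 1x1 conv (with BN) plus two 1x1 convs.
    mip = max(8, c // r)
    return nn.ModuleList([
        nn.Conv2d(c, mip, 1),
        nn.BatchNorm2d(mip),
        nn.Conv2d(mip, c, 1),
        nn.Conv2d(mip, c, 1),
    ])


c, r = 256, 16  # illustrative values
se = count_params(se_layers(c, r))
ca = count_params(coord_att_layers(c, r))
print(f"SE: {se} params, CoordAtt: {ca} params, ratio: {ca / se:.2f}")
```

Under these assumptions Coordinate Attention carries roughly one and a half times the parameters of SE, since it has two expansion convolutions instead of one. Its shared convolution also runs over H+W positions rather than a single pooled vector, which adds to the FLOPs.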
