Abstract:
Recent studies on mobile network design have demonstrated the remarkable effectiveness of channel attention (e.g., the Squeeze-and-Excitation attention) for lifting model performance, but they generally neglect the positional information, which is important for generating spatially selective attention maps. In this paper, we propose a novel attention mechanism for mobile networks by embedding positional information into channel attention, which we call "coordinate attention". Unlike channel attention that transforms a feature tensor to a single feature vector via 2D global pooling, the coordinate attention factorizes channel attention into two 1D feature encoding processes that aggregate features along the two spatial directions, respectively. In this way, long-range dependencies can be captured along one spatial direction and meanwhile precise positional information can be preserved along the other spatial direction. The resulting feature maps are then encoded separately into a pair of direction-aware and position-sensitive attention maps that can be complementarily applied to the input feature map to augment the representations of the objects of interest. Our coordinate attention is simple and can be flexibly plugged into classic mobile networks, such as MobileNetV2, MobileNeXt, and EfficientNet, with nearly no computational overhead. Extensive experiments demonstrate that our coordinate attention is not only beneficial to ImageNet classification but, more interestingly, behaves better in downstream tasks such as object detection and semantic segmentation.
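The factorization described in the abstract can be made concrete with a minimal NumPy sketch. This is an illustrative simplification, not the paper's implementation: the function name `coordinate_attention` and the weight matrices `w_reduce`, `w_h`, and `w_w` are hypothetical stand-ins for the 1x1 convolutions in the actual module, and batch normalization and the paper's non-linear activation are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(x, w_reduce, w_h, w_w):
    """Sketch of coordinate attention on a single (C, H, W) feature map.

    w_reduce: (C_mid, C) weight of the shared 1x1 conv (channel reduction);
    w_h, w_w: (C, C_mid) weights producing the two directional attention maps.
    """
    C, H, W = x.shape
    # 1D global pooling along each spatial direction
    z_h = x.mean(axis=2)                    # (C, H): pooled along width
    z_w = x.mean(axis=1)                    # (C, W): pooled along height
    # concatenate along the spatial axis, shared 1x1 conv + ReLU
    z = np.concatenate([z_h, z_w], axis=1)  # (C, H + W)
    f = np.maximum(w_reduce @ z, 0.0)       # (C_mid, H + W)
    f_h, f_w = f[:, :H], f[:, H:]
    # per-direction 1x1 convs followed by sigmoid gating
    g_h = sigmoid(w_h @ f_h)                # (C, H) attention along height
    g_w = sigmoid(w_w @ f_w)                # (C, W) attention along width
    # apply both direction-aware maps multiplicatively to the input
    return x * g_h[:, :, None] * g_w[:, None, :]
```

Because each attention value lies in (0, 1), the output is an element-wise re-weighting of the input: position (i, j) is scaled by the product of its row gate and column gate, which is how positional information survives the pooling.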
Motivation
Architecture design has been an important research area in computer vision in recent years. Many might argue that this direction is approaching saturation, yet ongoing research challenges that view, with fresh progress in both custom layer design and attention mechanisms. This article focuses on the latter. An attention mechanism essentially supplies additional information about "where" and "what" based on the given input. Methods such as Squeeze-and-Excitation (SE), the Convolutional Block Attention Module (CBAM), Triplet Attention, and Global Context (GC), among others, have demonstrated the efficiency of such plug-in modules, significantly improving the performance of conventional baseline models with minimal added computational complexity. "Minimal" is the operative word here, as work in this area keeps striving to maximize the performance benefit of attention modules at the lowest possible overhead. However, most of these attention mechanisms were designed with large-scale networks in mind, because the computational overhead they introduce makes them infeasible for capacity-limited mobile networks. Moreover, most attention mechanisms focus only on channel information, and by discarding spatial information they sacrifice representational power.