2.1.2. Attention Module

To enlarge the effective receptive field and capture a global response without the added complexity of stacking convolutional layers, we use the non-local module [49] to enhance the fused feature map. Through this self-attention-based module, the network can pay more attention to the more important global information. The non-local module establishes correlations between global and local information without stacking convolution kernels, which is equivalent to expanding the field of view, so that the network can better integrate the global information of the image. The non-local attention operation is expressed as follows:

$$H_i = \frac{1}{C(F)} \sum_{\forall j} f\left(F_i, F_j\right) g\left(F_j\right) \tag{2}$$

where $F_i$ represents the $i$-th location of the input feature map, the function $f(\cdot)$ computes the similarity between $F_i$ and $F_j$, the function $g(\cdot)$ computes the representation of the input feature map at the $j$-th location, and the coefficient $\frac{1}{C(F)}$ normalizes the response.

For conciseness, the function $g(\cdot)$ can be regarded as a linear embedding, i.e.,

$$g\left(F_j\right) = W_g F_j \tag{3}$$

where $W_g$ is the weight matrix.

In this module, we use the embedded Gaussian as the function $f(\cdot)$. The embedded Gaussian is a simple extension of the Gaussian function that computes similarity in an embedding space:

$$f\left(F_i, F_j\right) = e^{\theta\left(F_i\right)^T \varphi\left(F_j\right)} \tag{4}$$

$$\theta\left(F_i\right) = W_\theta F_i$$

$$\varphi\left(F_j\right) = W_\varphi F_j$$

where $\theta(\cdot)$ and $\varphi(\cdot)$ are linear embeddings with weight matrices $W_\theta$ and $W_\varphi$, respectively. In addition, the normalization coefficient $C(\cdot)$ is

$$C(F) = \sum_{\forall j} f\left(F_i, F_j\right) \tag{5}$$

Therefore, combining the formulas above, the output can be written as

$$H_i = \left(\sum_{\forall j} e^{\theta(F_i)^T \varphi(F_j)} W_g F_j\right) \Big/ \sum_{\forall j} f\left(F_i, F_j\right) \tag{6}$$
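Equation (6) amounts to a softmax-weighted sum over all locations. The following NumPy sketch illustrates it for a flattened feature map; the weight matrices are random placeholders standing in for the learned embeddings, and the shapes are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

# Minimal sketch of Eq. (6): embedded-Gaussian non-local attention over
# N spatial locations with C input channels and C_emb embedding channels.
rng = np.random.default_rng(0)
N, C, C_emb = 6, 8, 4

F = rng.standard_normal((N, C))          # flattened input feature map
W_theta = rng.standard_normal((C_emb, C))
W_phi = rng.standard_normal((C_emb, C))
W_g = rng.standard_normal((C_emb, C))

theta = F @ W_theta.T                    # theta(F_i) = W_theta F_i, (N, C_emb)
phi = F @ W_phi.T                        # phi(F_j)   = W_phi F_j
g = F @ W_g.T                            # g(F_j)     = W_g F_j

logits = theta @ phi.T                   # theta(F_i)^T phi(F_j), (N, N)
# Subtracting the row max stabilizes exp(); the normalized weights are
# unchanged because the factor cancels in numerator and denominator.
f = np.exp(logits - logits.max(axis=1, keepdims=True))
attn = f / f.sum(axis=1, keepdims=True)  # divide by C(F) = sum_j f(F_i, F_j)

H = attn @ g                             # H_i = sum_j attn_ij * g(F_j)
print(H.shape)                           # (6, 4)
```

Note that dividing each exponential by the row sum is exactly a softmax over $j$, which is why the module in Figure 5 can implement the normalization with a softmax layer.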

Figure 5 shows the architecture of the non-local module. We feed the input feature map *F* into three 1 × 1 convolutional layers simultaneously to compute *θ*, *ϕ*, and *g*. Then, we flatten the H and W dimensions of *θ* and *ϕ*. With the flattened tensors, the similarity *f* is computed by matrix multiplication. Finally, the similarity *f* is normalized by a softmax function, and the result is multiplied by the flattened feature map *g*. The output is processed by another 1 × 1 convolutional layer so that it matches the size of the input feature map. In addition, a skip connection is added to the architecture, so the final refined feature map *H′* is

$$H' = W_H H + F \tag{7}$$

where $W_H$ is the weight matrix.
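The whole block of Figure 5 can be sketched as follows. Since a 1 × 1 convolution acts independently on each spatial location, it reduces to a matrix multiplication after the H and W dimensions are flattened; all weights below are random placeholders for the learned parameters, and the tensor sizes are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of the full non-local block: 1x1 conv embeddings, softmax
# attention, a final 1x1 conv restoring the channel count, and the skip
# connection of Eq. (7).
rng = np.random.default_rng(1)
C, H_dim, W_dim, C_emb = 8, 4, 4, 4

F = rng.standard_normal((C, H_dim, W_dim))   # input feature map
W_theta = rng.standard_normal((C_emb, C))
W_phi = rng.standard_normal((C_emb, C))
W_g = rng.standard_normal((C_emb, C))
W_H = rng.standard_normal((C, C_emb))        # final 1x1 conv, back to C channels

Ff = F.reshape(C, -1).T                      # flatten H, W -> (N, C), N = H*W

theta, phi, g = Ff @ W_theta.T, Ff @ W_phi.T, Ff @ W_g.T
logits = theta @ phi.T                       # pairwise similarities, (N, N)
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)      # softmax over j

H_feat = attn @ g                            # refined features, (N, C_emb)
out = (H_feat @ W_H.T).T.reshape(C, H_dim, W_dim) + F   # Eq. (7): W_H H + F
print(out.shape)                             # (8, 4, 4)
```

The skip connection lets the block be inserted into a pretrained backbone without disturbing its initial behavior: if $W_H$ starts at zero, the block is initially an identity mapping.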

**Figure 5.** Architecture of non-local module.

For example, for the input feature map *F*, the following operations are performed in the non-local module:

