3.1.1. Water Attention Module

Attention mechanisms have been successfully applied in the field of image segmentation, allowing a network to highlight the features that deserve attention based on context. Fu et al. [34] proposed a dual attention network to capture rich contextual dependencies for scene segmentation by combining local features with their global dependencies. Li et al. [35] designed a pyramid attention network, which combined an attention mechanism with a spatial pyramid to extract precise, dense object features for semantic segmentation. To optimize and stabilize the segmentation model in terms of memory and computation, an expectation–maximization attention module was developed and encapsulated into a neural network [36]. In our GAN-GL, a water index is used in the water attention module to obtain the initial glacial lake extent. Combined with convolution features, possible lake pixels are highlighted, and potential water areas are given a relatively high weight. The structure of this module is shown in Figure 4.

Given an input Landsat-8 OLI image $I \in \mathbb{R}^{H \times W \times C}$, features $F_1$ and $F_2$ are calculated through the convolution operation with a $1 \times 1$ kernel size, so that $\{F_1, F_2\} \in \mathbb{R}^{H \times W \times 1}$. Feature $F_3$ refers to the water index. Due to the simplicity of its expression and the relatively stable thresholds used for the classification of lakes [13,37], NDWI was selected in this study, as follows:

$$NDWI = \frac{\rho_{\text{green}} - \rho_{NIR}}{\rho_{\text{green}} + \rho_{NIR}} \tag{1}$$

where $\rho_{\text{green}}$ and $\rho_{NIR}$ represent top-of-atmosphere (TOA) reflectance values in the green and NIR bands measured by the Landsat-8 OLI sensor, respectively.
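Equation (1) can be sketched as a simple per-pixel band ratio. The following is a minimal illustration, not the authors' code; the toy reflectance values and the `eps` guard against division by zero are assumptions added for the example.

```python
import numpy as np

def ndwi(green: np.ndarray, nir: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """NDWI per Eq. (1) from TOA reflectance; eps avoids division by zero."""
    green = green.astype(np.float64)
    nir = nir.astype(np.float64)
    return (green - nir) / (green + nir + eps)

# Toy reflectance values: water reflects strongly in green and weakly in NIR,
# so NDWI is positive over water and negative over land/vegetation.
green = np.array([[0.10, 0.08], [0.12, 0.05]])
nir   = np.array([[0.02, 0.30], [0.03, 0.25]])
print(ndwi(green, nir))
```

Thresholding this map (values near or above 0 indicating water) yields the initial lake extent used as $F_3$ in the attention module.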

After the calculation of all the feature maps, $F_1$ and $F_2$ are both reshaped to $\mathbb{R}^{N \times 1}$, where $N = H \times W$. Then, matrix multiplication is performed on the reshaped $F_1$ and the transpose of the reshaped $F_2$, and a softmax layer is used for normalization to obtain the feature map $A \in \mathbb{R}^{N \times N}$:

$$A_{ji} = \frac{\exp\left(F_{1i} \otimes F_{2j}\right)}{\sum_{i=1}^{N} \exp\left(F_{1i} \otimes F_{2j}\right)} \tag{2}$$

The operator ⊗ denotes ordinary matrix multiplication. Similarly, feature $F_3$ is also reshaped to $\mathbb{R}^{N \times 1}$, and matrix multiplication is performed on the transpose of the reshaped $F_3$ and feature $A$ to enhance the water information in the water attention map $W$:

$$W_{j} = \sum_{i=1}^{N} \left( A_{ji} \odot F_{3i} \right) \tag{3}$$

Note that here, $W \in \mathbb{R}^{1 \times N}$ should be reshaped to $\mathbb{R}^{H \times W}$.
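Equations (2) and (3) can be sketched together in a few lines of NumPy. This is an illustrative reconstruction, not the authors' implementation: the function name, the random stand-in features, and the use of a stabilized softmax are assumptions; since $F_1$, $F_2$, and $F_3$ come from $1 \times 1$ convolutions and the water index, each is one scalar per pixel.

```python
import numpy as np

def water_attention(F1, F2, F3):
    """Sketch of Eqs. (2)-(3): F1, F2, F3 are per-pixel features of length N = H*W."""
    N = F1.size
    f1 = F1.reshape(N, 1)                             # reshaped F1 in R^{N x 1}
    f2 = F2.reshape(N, 1)
    S = f1 @ f2.T                                     # N x N, S[i, j] = F1_i * F2_j
    # Eq. (2): softmax over i, i.e. A[j, i] = exp(S[i, j]) / sum_i exp(S[i, j])
    E = np.exp(S - S.max(axis=0, keepdims=True))      # subtract max for stability
    A = (E / E.sum(axis=0, keepdims=True)).T          # A in R^{N x N}, row j sums to 1
    # Eq. (3): W_j = sum_i A[j, i] * F3_i
    W = A @ F3.reshape(N, 1)
    return W.reshape(-1)                              # reshape to H x W outside

H, W_ = 2, 3
rng = np.random.default_rng(0)
F1, F2 = rng.normal(size=H * W_), rng.normal(size=H * W_)
F3 = rng.uniform(-1, 1, size=H * W_)                  # stand-in for the NDWI map
att = water_attention(F1, F2, F3)
print(att.reshape(H, W_))
```

Because each row of $A$ sums to 1, every entry of $W$ is a convex combination of the water-index values $F_3$, which is how high-NDWI (likely water) pixels receive relatively high weights.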

**Figure 4.** Structure of our water attention module.

#### 3.1.2. Image Segmentation Module

The attention results give the weight with which each pixel belongs to a glacial lake. To fully utilize this information and further segment glacial lakes, a U-Net-based segmentation module was incorporated into the generator. Figure 3 shows that the input of this module is the element-wise product between the water attention map and the Landsat imagery. We exploited five down-sampling operations to capture lake information at different scales, each of which contains two convolution layers with a rectified linear unit (ReLU) activation function and one convolution layer with a stride of 2. The ReLU function, defined as $f(x) = \max(0, x)$, activates the input data $x$ and extends the nonlinear capacity of deep learning models. An input image with a size of $H \times W \times C$ is down-sampled to $(H/16) \times (W/16) \times C$. Because some small glacial lakes can only be extracted from shallow layers, feature maps of the same size during down-sampling and up-sampling are concatenated, known as skip connections, to integrate features at different scales. Finally, lake binary masks are produced by processing the concatenated features in the last two convolution layers.
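One down-sampling stage of the kind described above can be sketched in PyTorch as follows. This is a hedged sketch, not the authors' network: the 3 × 3 kernel sizes, channel widths, and the choice to down-sample with a strided convolution in the same block are assumptions, since the paper does not specify them.

```python
import torch
from torch import nn

class DownBlock(nn.Module):
    """One down-sampling stage: two 3x3 conv + ReLU layers, then a stride-2 conv.
    Kernel sizes and channel widths are illustrative assumptions."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.down = nn.Conv2d(c_out, c_out, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        skip = self.body(x)            # same-resolution features, kept for the skip connection
        return self.down(skip), skip   # halved resolution + features to concatenate later

x = torch.randn(1, 4, 64, 64)          # e.g. attention-weighted multispectral input
down, skip = DownBlock(4, 32)(x)
print(down.shape, skip.shape)          # stride-2 conv halves the spatial size
```

During up-sampling, the decoder feature map at each scale would be concatenated with the matching `skip` tensor before further convolution, which is the skip-connection scheme the text describes.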
