*Article* **Empirical Estimation of Nutrient, Organic Matter and Algal Chlorophyll in a Drinking Water Reservoir Using Landsat 5 TM Data**

**Md Mamun <sup>1</sup>, Jannatul Ferdous <sup>2</sup> and Kwang-Guk An <sup>1,\*</sup>**


**Abstract:** The main objective of this study was to develop empirical models from Landsat 5 TM data to monitor nutrients (total phosphorus: TP), organic matter (biological oxygen demand: BOD), and algal chlorophyll (chlorophyll-a: CHL-a). Such models could substitute for traditional monitoring techniques in water quality assessment of aquatic systems. A set of models was generated relating the surface reflectance values of four Landsat 5 TM bands to in-situ data through multiple linear regression analysis. Radiometric and atmospheric corrections improved the satellite image quality. A total of 32 compositions of different Landsat 5 TM bands were examined for their correlation coefficients (*r*) with in-situ measurements of TP, BOD, and CHL-a collected from five sampling sites in 2001, 2006, and 2010. The results showed that TP, BOD, and CHL-a correlate well with Landsat 5 TM band reflectance values. TP (*r* = −0.79) and CHL-a (*r* = −0.79) showed the strongest relations with B1 (Blue). In contrast, BOD showed the highest correlations with B1 (Blue) (*r* = −0.75) and B1\*B3/B4 (Blue\*Red/Near-infrared) (*r* = −0.76). Based on the *r* values, significant bands and band compositions were identified and used to generate linear equations. These equations for Landsat 5 TM could detect TP, BOD, and CHL-a with accuracies of 67%, 65%, and 72%, respectively. The developed empirical models were then applied to all study sites on the Paldang Reservoir to monitor the spatio-temporal distributions of TP, BOD, and CHL-a for the month of September using Landsat 5 TM images of the years 2001, 2006, and 2010. The results showed that TP, BOD, and CHL-a decreased from 2001 to 2006 and 2010. However, sites S3 and S4 still have water quality issues and are influenced by climatic and anthropogenic factors, which could significantly affect the reservoir's drinking water quality. Overall, the present study suggests that Landsat 5 TM data may be appropriate for estimating and monitoring water quality parameters in the reservoir.

**Keywords:** empirical models; multiple regression; Paldang Reservoir; water quality parameters

#### **1. Introduction**

Freshwater reservoirs are significant natural resources within the biosphere that function as sources of drinking, irrigation, and industrial water, tourism attractions, and habitats for aquatic organisms [1–3]. These reservoirs face a number of stressors, including land use change, pollution, intensive farming, climate change, and human activities, causing several water quality issues [1,4,5]. About fifty percent of the world's population lives near water resources, and human activities accelerate aquatic stressors like eutrophication and algal blooms [4]. Due to rapid urban population growth, industrialization, intensive agricultural farming, and global climate change, reservoirs are facing significant challenges, the most important of which are rises in nutrients, algal blooms, and organic matter pollution [6–8]. This is a global environmental issue and a current research subject [3,9,10].

Paldang Reservoir is one of the main reservoirs in South Korea, formed by the construction of a hydroelectric dam in 1973 [11]. It has been used for various purposes such as irrigation, hydroelectric power, fishing, recreation, and drinking water supply [12]. It has been declared a nationally protected resource and provides water for the Seoul metropolitan and surrounding areas [13]. Approximately half of the Korean population depends on the Paldang Reservoir for drinking water [14]. Simply put, the water quality of the reservoir is crucial to the Korean government. However, human activities have risen in the watershed, resulting in short-term algal blooms and organic pollution in the reservoir [15,16]. Urbanization, municipal pollutants, livestock farms, intensive farming practices, domestic and industrial wastewater, and inflowing rivers contribute to the water contamination of the reservoir [17,18]. Hence, monitoring the nutrient, organic matter, and algal chlorophyll concentrations and determining their spatial and temporal dynamics are essential to managing the reservoir's water quality [3].

**Citation:** Mamun, M.; Ferdous, J.; An, K.-G. Empirical Estimation of Nutrient, Organic Matter and Algal Chlorophyll in a Drinking Water Reservoir Using Landsat 5 TM Data. *Remote Sens.* **2021**, *13*, 2256. https://doi.org/10.3390/rs13122256

Academic Editor: Giacomo De Carolis

Received: 26 April 2021; Accepted: 8 June 2021; Published: 9 June 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Traditional monitoring approaches, including in-situ measurements and laboratory analysis, have allowed us to understand and categorize water quality parameters [1,3,19]. Though these techniques yield accurate measurements, they are time-intensive and laborious and may not provide an overview of water quality at a broad spatial scale [20]. Furthermore, current monitoring techniques cannot cover the wider spectrum of spatial and temporal analysis necessary to resolve aquatic integrity and public health issues [4]. This is particularly true for large water bodies like Paldang Reservoir, one of Korea's largest freshwater sources.

Satellite remote sensing is currently one of the most powerful and reliable methods for monitoring and managing water quality [3,21]. Readily accessible remote sensing data offer a cost-effective and less time-intensive alternative to in-situ methods by providing continuous spatial and temporal coverage of environmental processes [1]. This approach delivers a large-scale synoptic view of the systems [19,22]. Spectral satellite radiance measurements are related to many water quality variables influencing an aquatic ecosystem's optical properties [23,24]. Several previous studies have shown that satellite systems' brightness data are closely associated with water quality variables [3,4,19–22,24,25].

Miller et al. [26] noted that the "*Landsat series provided an approximate annual economic benefit of 2.19 billion US dollars spread across several study areas for only the USA*". Since 1984, Landsat 5 has provided a steady stream of data with a moderate spatial resolution (30 m), multispectral imagery, and a revisit interval of 16 days [4,19,24]. These images are therefore well suited to studies of aquatic resources. Their moderate spatial resolution allows the study of water bodies as small as about 8 ha [25]. This indicates that the Landsat Thematic Mapper (TM) sensor can be used extensively to form empirical relationships between water quality parameters and spectral reflectance values. The most common way to determine such a relationship is through regression analysis. The most critical aspect of running a regression analysis is choosing a regression model with appropriate independent parameters (single bands, band ratios, and combinations of bands) that yield a high R<sup>2</sup> value. A high R<sup>2</sup> value reveals that the resulting equation is highly correlated with the existing data and provides a relatively accurate model. However, previous studies have demonstrated that the bands that best predict water quality parameters differ with water conditions and ecosystems. Therefore, empirical models must be developed individually for each variable in different systems. Researchers have used the Landsat 5 sensor to determine the spatial and temporal distribution of water quality parameters throughout the world, including chlorophyll-a (CHL-a), turbidity, Secchi depth (SD), total suspended solids (TSS), total phosphorus (TP), organic matter (BOD), electrical conductivity, etc. [4,19,21,22,25,27–30].

This study aimed to develop a method for using Landsat 5 TM data to determine TP, BOD, and CHL-a concentrations in Paldang Reservoir, Korea. This reservoir was selected for study due to its status as a nationally protected resource. The primary objectives of this research were to: (i) determine the relationships of TP, BOD, and CHL-a with TM bands, band ratios, and combinations of bands and (ii) develop empirical models using regression analysis for monitoring TP, BOD, and CHL-a. The developed models were also used to evaluate the spatio-temporal variations in TP, BOD, and CHL-a among study sites during 2001, 2006, and 2010.

#### **2. Materials and Methods**

#### *2.1. Study Area*

The Paldang Reservoir is situated approximately 45 km northeast of Seoul and provides drinking water for 24 million people [14]. It has an area of 38.2 km<sup>2</sup> and a volume of 250 × 10<sup>6</sup> m<sup>3</sup> [11]. The mean and maximum depths of the reservoir are 6.5 m and 25 m, respectively [11]. Five reservoir sampling sites, labelled S1–S5, were selected for this study. Sites S1 and S2 were located in the South Han River part of the reservoir, while S3, S4, and S5 were situated at the North Han River, Kyoungan Stream, and dam, respectively. The water intake tower for Paldang Reservoir is located at S5 (Figure 1). The reservoir receives water from three sources, namely the Kyoungan Stream and the South and North Han Rivers, which directly affect its hydrodynamics and water quality [2,11,12,16]. About 95% of the reservoir's water comes from the North and South Han Rivers, which have relatively good water quality [11]. In contrast, the Kyoungan Stream has a small flow rate and lower water quality. The drinking water supply tower is located near the Kyoungan Stream's confluence, and this significantly impacts drinking water quality (Figure 1).

**Figure 1.** Map of the study sites in Paldang Reservoir.

#### *2.2. Methodological Approach*

This study relies solely on secondary data. To monitor the water quality parameters (WQPs: TP, BOD, and CHL-a) of the reservoir, several Landsat 5 TM images with band values were acquired and processed. Finally, regression analysis was carried out to establish a relationship between the band values of the Landsat images and in-situ measurements of the WQPs. Figure 2 illustrates the methodological approach of this study.

**Figure 2.** Methodological flow chart of the study.

2.2.1. Acquisition and Processing of Satellite Data

A total of 19 Landsat 5 TM images (path/row: 116/34 and 115/34) were downloaded from the United States Geological Survey (USGS) website (https://earthexplorer.usgs.gov/, accessed on 26 April 2021). Seven images from 2001 (dated 11 January, 16 March, 17 April, 19 May, 15 September, 27 November, and 13 December), six images from 2006 (dated 25 January, 17 February, 14 March, 5 August, 13 September, and 18 December), and six images from 2010 (dated 5 February, 16 March, 24 September, 19 October, 11 November, and 29 December) were selected based on the availability of cloud-free images. To make these raw images more suitable for use, appropriate radiometric and atmospheric corrections were carried out using the semi-automatic classification plugin (SCP) of QGIS. To remove the effect of haze, this plugin employs the dark object subtraction (DOS) method. SCP is a widely used plugin for preprocessing satellite images [31,32]. SCP uses the spectral radiance scaling method to convert the digital numbers (DN) to reflectance in two steps [33]. The procedure is described below. First, the spectral radiance at the sensor's aperture, *L<sub>λ</sub>* (W m<sup>−2</sup> sr<sup>−1</sup> µm<sup>−1</sup>), is computed from the DN (Equation (1)) [34]:

$$L_{\lambda} = M_L \times Q_{cal} + A_L \tag{1}$$

where *M<sub>L</sub>* = band-specific multiplicative rescaling factor from the Landsat metadata (RADIANCE\_MULT\_BAND\_x, where x is the band number), *A<sub>L</sub>* = band-specific additive rescaling factor from the Landsat metadata (RADIANCE\_ADD\_BAND\_x, where x is the band number), and *Q<sub>cal</sub>* = quantized and calibrated standard product pixel value (DN). After that, the DOS image-based atmospheric correction is carried out in SCP to calculate the land surface reflectance (Equation (3)) for each pixel by first calculating the path radiance (Equation (2)):

$$L_p = M_L \times DN_{min} + A_L - 0.01 \times E_{SUN\lambda} \times \cos\theta_s / (\pi \times d^2) \tag{2}$$

where *L<sub>p</sub>* = path radiance, *DN<sub>min</sub>* = minimum DN value of the scene, *d* = Earth–Sun distance in astronomical units, *E<sub>SUNλ</sub>* = mean solar exo-atmospheric irradiance, and *θ<sub>s</sub>* = solar zenith angle in degrees, equal to 90° − *θ<sub>e</sub>*, where *θ<sub>e</sub>* is the Sun elevation:

$$\rho = \left[ \pi \times (L_{\lambda} - L_p) \times d^2 \right] / (E_{SUN\lambda} \times \cos\theta_s) \tag{3}$$

where *ρ* = land surface reflectance, *L<sub>λ</sub>* = spectral radiance at the sensor's aperture, and *L<sub>p</sub>* = path radiance.
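The two-step DOS1 conversion above can be sketched as follows. This is a minimal illustration of Equations (1)–(3), not SCP's implementation; the rescaling factors, ESUN value, and Sun geometry used in the usage example are placeholder values, which in practice are read from the scene's MTL metadata file.

```python
import math

def dn_to_radiance(dn, m_l, a_l):
    """Equation (1): spectral radiance at the sensor's aperture."""
    return m_l * dn + a_l

def path_radiance(dn_min, m_l, a_l, esun, theta_s_deg, d):
    """Equation (2): DOS1 path radiance (theta_s = 90 deg - Sun elevation)."""
    cos_ts = math.cos(math.radians(theta_s_deg))
    return m_l * dn_min + a_l - 0.01 * esun * cos_ts / (math.pi * d ** 2)

def surface_reflectance(l_lambda, l_p, esun, theta_s_deg, d):
    """Equation (3): land surface reflectance."""
    cos_ts = math.cos(math.radians(theta_s_deg))
    return (math.pi * (l_lambda - l_p) * d ** 2) / (esun * cos_ts)

# Illustrative values (NOT from the study's scenes): a DN of 80, a scene
# minimum DN of 50, placeholder gain/offset, ESUN, 40 deg zenith, d = 1 AU.
l = dn_to_radiance(80, 0.7658, -2.29)
lp = path_radiance(50, 0.7658, -2.29, 1983.0, 40.0, 1.0)
rho = surface_reflectance(l, lp, 1983.0, 40.0, 1.0)
```

Subtracting the path radiance before normalizing by the solar irradiance is what removes the additive haze component from each band.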

#### 2.2.2. Assembling WQPs Data and Associated Band Values

The concentrations of the different WQPs (TP, BOD, and CHL-a) at five sample collection points in Paldang Reservoir were collected from the South Korean Ministry of Environment for 2001, 2006, and 2010. These measurements are usually collected once a month. The acquisition dates of the satellite images were near the sampling days. A total of 95 pixel values for each band associated with these sample points were extracted from the processed satellite images in the ArcGIS platform (Esri Inc., Redlands, CA, USA). For this analysis, four bands (blue, green, red, and near-infrared) of the Landsat 5 TM images were selected for extraction, and a total of 32 band compositions were calculated in Microsoft Excel (Microsoft Office, Redmond, WA, USA).
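The band-composition step can be sketched for a single extracted pixel as below. The reflectance values are hypothetical, and only a few of the 32 compositions are shown; the full set combines the four bands in similar products and ratios.

```python
# Hypothetical surface-reflectance values for one pixel; in the study these
# are extracted from the corrected images at the sampling sites.
b1, b2, b3, b4 = 0.06, 0.08, 0.05, 0.10  # blue, green, red, near-infrared

# A few representative entries of the 32 band compositions examined.
compositions = {
    "B1": b1,
    "B2/B4": b2 / b4,
    "B1*B3/B4": b1 * b3 / b4,
    "B1*B2*B3/B4": b1 * b2 * b3 / b4,
}
```

Each composition value is then paired with the in-situ TP, BOD, and CHL-a measurements for the correlation analysis that follows.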

#### 2.2.3. Development of Multiple Regression Equation between WQPs and Landsat Band Values

After arranging the data, outliers in the dataset were identified by plotting box-whisker plots (Supplementary File Figures S1–S3). These box-whisker plots identified one outlier for BOD (3.5 mg/L), six for TP (123, 138, 140, 142, 228, 236 µg/L), and four for CHL-a (49.1, 56.3, 71.9, 132 µg/L). These outliers were omitted when developing the empirical models. Pearson's coefficient of correlation (*r*) was calculated to find the strength of association between the band values and the WQPs. To identify the significant band values, a threshold of |*r*| equal to or greater than 0.7 was used, which represents a strong correlation [21]. Multiple regression analysis was carried out in an online-based calculator to generate equations for each WQP. The analysis iterated until a significant *p*-value was obtained. This online-based calculator considers all the assumptions of linear regression analysis (https://www.statskingdom.com/doc\_linear\_regression.html#multi, accessed on 26 April 2021). The assumptions are: (i) linearity—there is a linear relationship between the dependent variable, Y, and the independent variables, Xi; (ii) residual normality; (iii) homoscedasticity (homogeneity of variance)—the variance of the residuals is constant and does not depend on the independent variables, Xi; (iv) variables—the dependent variable, Y, should be continuous while the independent variables, Xi, should be continuous or ordinal; (v) multicollinearity—there is no perfect correlation among two or more independent variables, Xi.
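The two screening steps described above can be sketched as follows: box-whisker outlier flagging with the conventional 1.5 × IQR fences, and Pearson-*r* thresholding of candidate band compositions. The data in the test are synthetic, not the study's measurements.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag values beyond the box-whisker fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def strong_predictors(band_values, target, threshold=0.7):
    """Keep band compositions whose |Pearson r| with the target >= threshold."""
    kept = {}
    for name, x in band_values.items():
        r = np.corrcoef(x, target)[0, 1]
        if abs(r) >= threshold:
            kept[name] = r
    return kept
```

Only the compositions surviving the threshold enter the multiple regression step.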

To determine the efficiency of the generated models, the root mean squared error (RMSE), root mean squared log error (RMSLE), mean relative error (MRE), and mean absolute error (MAE) were computed along with the coefficient of determination (*r*<sup>2</sup>) and *p*-values. The validated equations can then be applied to radiometrically and atmospherically corrected Landsat 5 TM images to predict the specific water properties (TP, BOD and CHL-a). The RMSE, RMSLE, MRE, and MAE are defined as follows:

$$RMSE = \sqrt{\frac{\sum (P_i - O_i)^2}{n}} \tag{4}$$

$$RMSLE = \sqrt{\frac{\sum (\log(P_i + 1) - \log(O_i + 1))^2}{n}} \tag{5}$$

$$MRE = \frac{\sum \left| \frac{P_i}{O_i} - 1 \right|}{n} \tag{6}$$

$$MAE = \frac{\sum |P_i - O_i|}{n} \tag{7}$$

where *P<sub>i</sub>* = predicted values of the WQPs, *O<sub>i</sub>* = observed values of the WQPs, and *n* = sample size.
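Equations (4)–(7) translate directly into code; a minimal sketch, with `p` and `o` as equal-length arrays of predicted and observed WQP values:

```python
import numpy as np

def rmse(p, o):
    """Equation (4): root mean squared error."""
    return np.sqrt(np.mean((p - o) ** 2))

def rmsle(p, o):
    """Equation (5): root mean squared log error."""
    return np.sqrt(np.mean((np.log(p + 1) - np.log(o + 1)) ** 2))

def mre(p, o):
    """Equation (6): mean relative error."""
    return np.mean(np.abs(p / o - 1))

def mae(p, o):
    """Equation (7): mean absolute error."""
    return np.mean(np.abs(p - o))
```

RMSE and MAE share the units of the parameter (e.g. µg/L for TP), while RMSLE and MRE are dimensionless, which is why both kinds are reported.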

#### *2.3. Spatio-Temporal Variation of WQPs*

The spatio-temporal variations in the WQPs of Paldang Reservoir for the years 2001, 2006, and 2010 were studied using the generated equations. Landsat 5 TM images from September of these years were processed in the SCP of QGIS, and the area of interest (Paldang Reservoir) was extracted from the images. From their band values, WQP values were computed in the Raster Calculator (Esri Inc., Redlands, CA, USA) and analyzed for the change detection study.
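The raster-calculator step amounts to applying a fitted linear equation pixel-wise to the reflectance bands. A sketch is given below; the coefficients are invented for illustration only, and the study's actual fitted equations are those listed in Table 4.

```python
import numpy as np

# Invented coefficients for illustration; NOT the study's fitted values.
INTERCEPT, C1, C2, C3 = 150.0, -900.0, -400.0, 2000.0

def tp_raster(b1, b2, b3, b4):
    """Pixel-wise TP map (ug/L) from the four reflectance band arrays."""
    x1 = b1 * b4 / b2          # blue * NIR / green
    x2 = b2 * b3 / b4          # green * red / NIR
    x3 = b1 * b2 * b3 / b4     # blue * green * red / NIR
    return INTERCEPT + C1 * x1 + C2 * x2 + C3 * x3
```

Because numpy broadcasts element-wise, passing whole band arrays produces the full TP map in one call, which is exactly what the Raster Calculator does for each September scene.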

#### **3. Results**

#### *3.1. Reservoir Conditions*

The water quality parameters (TP, BOD, and CHL-a) of the Paldang Reservoir showed significant site variations (Table 1). The mean TP varied from 34.75 to 92.06 µg L<sup>−1</sup> across sites S1–S5. Site S4 showed the highest TP value (92.06 µg L<sup>−1</sup>) of all sites due to the reception of wastewater from industry and household activities. Moreover, site S4 is highly impacted by the Kyoungan Stream. Nürnberg [35] proposed that TP concentrations > 30 µg L<sup>−1</sup> indicate a eutrophic reservoir. Mean TP levels above 30 µg L<sup>−1</sup> were observed at all sites in this study. High BOD values suggest that organic matter pollution is linked to wastewater effluents. The mean BOD ranged from 1.05 to 1.72 mg L<sup>−1</sup> in the Paldang Reservoir. As with TP, the highest BOD was observed at site S4 (1.72 mg L<sup>−1</sup>). It is well known that CHL-a is the primary indicator of eutrophication in lentic systems [5,36]. The mean CHL-a varied from 10.89 to 27.74 µg L<sup>−1</sup>. Nürnberg [35] proposed that a eutrophic reservoir is indicated by CHL-a concentrations greater than 9 µg L<sup>−1</sup>. Mean CHL-a concentrations at all five sites were above 9 µg L<sup>−1</sup>. As with TP and BOD, the highest CHL-a was also observed at site S4. Industrial and household wastewater and the Kyoungan Stream strongly affect the water quality of site S4. Eun and Seok [11] and Mamun et al. [2] found that the water quality of the Kyoungan Stream is poor compared to the South Han River (sites S1 and S2) and North Han River (site S3), and this could have a major effect on the reservoir's water quality.


**Table 1.** Summary statistics of Total Phosphorus (TP), Biological Oxygen Demand (BOD) and Chlorophyll-a (CHL-a) in Paldang Reservoir.

A Pearson correlation analysis was used to identify the relationships among TP, BOD, and CHL-a (*p* < 0.05; Table 2). BOD showed positive correlations with TP (*r* = 0.249) and CHL-a (*r* = 0.627). The positive correlation between BOD and TP suggests that nutrients (TP) flow into the Paldang Reservoir along with organic matter (BOD). The high positive correlation between BOD and CHL-a indicates that autochthonous organic matter production results primarily from phytoplankton processes. CHL-a was positively related to TP (*r* = 0.375), which is the key factor regulating algal growth in freshwater lentic systems [5,9,10].

**Table 2.** Pearson correlation among total phosphorus (TP), biological oxygen demand (BOD) and chlorophyll-a (CHL-a).


#### *3.2. Relations of Band Compositions with TP, BOD, and CHL-a*

Values of the 32 band compositions and the associated TP, BOD, and CHL-a concentrations were used to compute correlation (*r*) values for the Landsat 5 TM sensor. The band compositions and their associated *r* values are shown in Table 3. Only four bands (blue, green, red, and near-infrared; 0.4–0.9 µm) capture the spectral reflectance of the optically active water quality parameters, which is why these four bands were used to determine TP, BOD, and CHL-a [20]. TP is a significant factor in determining eutrophication in freshwater systems. The correlations of the TM bands with TP ranged from −0.07 (B2/B4) to −0.79 (B1). In particular, TP showed the strongest correlation with B1 (*r* = −0.79). The present findings concur with some previous studies [20]. BOD is an indicator of organic pollution in aquatic systems. The correlations of BOD with the TM bands varied from −0.15 (B1/B3) to −0.76 (B1\*B3/B4). Like TP, BOD showed the highest correlations with B1 (*r* = −0.75) and B1\*B3/B4 (*r* = −0.76). CHL-a is a good indicator of overall algal biomass in aquatic systems. The present results showed a dynamic relationship between the TM bands and CHL-a. The correlations between the TM bands and CHL-a ranged from 0.12 (B2/B3) to −0.79 (B1). Like TP and BOD, CHL-a showed the highest correlation with B1 (*r* = −0.79). From the *r* values, influential bands and band compositions were identified to generate empirical models for TP, BOD, and CHL-a (marked in bold; Table 3).

**Table 3.** Correlation matrix values for different band compositions of Landsat 5 TM with TP, BOD and CHL-a (B1: Blue, B2: Green, B3: Red, B4: NIR—Near-Infrared, TP: Total Phosphorus, BOD: Biological Oxygen Demand, CHL-a: Chlorophyll-a; influential bands and band compositions (|*r*| ≥ 0.7) are marked in bold).


#### *3.3. Empirical Model Development of TP, BOD, and CHL-a from Landsat 5 TM Data*

Only variables with high correlation values (|*r*| ≥ 0.70) were used to generate the empirical models for TP, BOD, and CHL-a (Table 4). The analysis was performed in an online-based calculator until a significant relationship was indicated by the *p*-value (*p* < 0.01). Due to the online-based calculator's automatic iteration, it is not easy to control the inclusion of any specific independent variable. The *p*-values show that the models have significant relationships. The developed model using Landsat 5 TM images can detect TP with 67% accuracy, while the accuracies were 65% and 72% for BOD and CHL-a, respectively (Table 4). The values of RMSE, RMSLE, MRE, and MAE also depict the efficiency of the developed models (Table 4). More efficient TP, BOD, and CHL-a models from Landsat 5 TM could be developed using data from more sampling points [21]. Considering our findings, further studies should be carried out with satellite sensor data to develop empirical models of TP, BOD, and CHL-a. Scatter plots of the observed TP, BOD, and CHL-a data against the TP, BOD, and CHL-a values predicted by the generated regression models are shown in Figure 3. For TP, the relationship between observed and predicted values displayed a correlation of 0.82 with *p* < 0.01, while the correlations were 0.81 and 0.85 for BOD and CHL-a, respectively, with *p* < 0.01.
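The model-fitting step itself is ordinary least squares on the selected band compositions. The sketch below uses synthetic data with invented coefficients (the study used an online calculator with iterative variable selection, and its fitted equations appear in Table 4); it shows how the intercept, slopes, and *r*<sup>2</sup> are recovered.

```python
import numpy as np

# Synthetic stand-in data: 30 samples of three band compositions.
rng = np.random.default_rng(0)
X = rng.uniform(0.02, 0.15, size=(30, 3))
true_beta = np.array([50.0, -300.0, 120.0, 80.0])  # invented intercept + slopes
y = true_beta[0] + X @ true_beta[1:]               # noise-free for illustration

# Ordinary least squares with an intercept column in the design matrix.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Coefficient of determination of the fit.
r2 = 1 - np.sum((y - A @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
```

With real, noisy measurements the recovered coefficients and *r*<sup>2</sup> would of course be imperfect, which is what the RMSE, RMSLE, MRE, and MAE metrics quantify.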

**Table 4.** Linear equations for Landsat 5 TM to detect TP, BOD and CHL-a of Paldang Reservoir (B: Blue, G: Green, R: Red, NIR: Near-Infrared, TP: Total Phosphorus, BOD: Biological Oxygen Demand, CHL-a: Chlorophyll-a, WQPs: Water Quality Parameters).



**Figure 3.** The relationship among observed and predicted TP, BOD and CHL-a for Landsat 5 TM (TP: Total Phosphorus, BOD: Biological Oxygen Demand, and CHL-a: Chlorophyll-a).


#### *3.4. Spatial and Temporal Patterns of Water Quality Parameters*

The developed empirical models were applied to all study sites on the Paldang Reservoir to monitor the spatio-temporal distributions of the WQPs on 15 September 2001, 13 September 2006, and 24 September 2010, using Landsat 5 TM images (Figures 4–6). Sites S1 and S2 are influenced by the South Han River, while S3 and S4 are affected by the North Han River and Kyoungan Stream. The TP concentration varied from 5.82 to 80.60 µg L<sup>−1</sup> on 15 September 2001. The maximum TP concentration gradually decreased from 2001 (80.60 µg L<sup>−1</sup>) to 2006 (57.83 µg L<sup>−1</sup>) and 2010 (55.32 µg L<sup>−1</sup>) in the Paldang Reservoir due to new treatment facilities in the sewage treatment plants and the development of water quality management strategies in the reservoirs (Figure 4). However, the TP concentrations still indicated a eutrophic state at sites S3 and S4 during 2006 and 2010. Like TP, the maximum BOD concentration also showed a decreasing pattern from 2001 (2.28 mg L<sup>−1</sup>) to 2006 (2.10 mg L<sup>−1</sup>) and 2010 (2.0 mg L<sup>−1</sup>) (Figure 5). This indicates that the biological effluent treatment process may efficiently degrade the influent's degradable organic matter [16]. The highest BOD level was also observed at S4 in September 2010. Like TP and BOD, the maximum level of CHL-a also showed a declining trend from 2001 (35.46 µg L<sup>−1</sup>) to 2006 (14.96 µg L<sup>−1</sup>) and 2010 (14.31 µg L<sup>−1</sup>) (Figure 6). In September 2001, the entire reservoir showed a eutrophic state (>9 µg L<sup>−1</sup>) based on the CHL-a concentration. Although the water quality improved in terms of TP, BOD, and CHL-a from 2001 to 2010, sites S3 and S4 still face some problems. The authors suggest that sites S3 and S4 should be taken under special consideration, as S4 highly influences the intake tower water quality.

**Figure 4.** Spatial and temporal pattern of total phosphorus (TP) in September 2001, 2006 and 2010.

**Figure 5.** Spatial and temporal pattern of biological oxygen demand (BOD) in September 2001, 2006 and 2010.

**Figure 6.** Spatial and temporal pattern of chlorophyll-a (CHL-a) in September 2001, 2006 and 2010.

#### **4. Discussion**

The present study shows that remote sensing technology can be a handy tool for detecting water quality parameters. The Landsat data series is useful for monitoring the water quality of freshwater bodies. Paldang Reservoir has experienced significant water quality changes due to urbanization, land use change, and intensive agricultural farming [14,15,17]. The observed mean TP and CHL-a concentrations at all sites in the Paldang Reservoir indicated eutrophic conditions. This suggests a moderate risk of cyanobacterial exposure in the reservoir [37]. Previous studies stated that blooms of cyanobacteria are associated with eutrophic conditions in water bodies [36]. CHL-a is a good predictor of total phytoplankton biomass, and monitoring CHL-a is a direct tool for the semiquantitative estimation of cyanobacterial biomass in aquatic environments [2]. Previous studies of Paldang Reservoir have suggested that cyanobacterial blooms occur during the spring season and identified the following genera: *Anabaena, Aphanocapsa, Chroococcus, Coelosphaerium, Dactylococcopsis, Microcystis, Merismopedia, Phormidium, Oscillatoria*, and *Pseudanabaena* [2,16,18]. In addition, TP, BOD, and CHL-a levels at site S4 were constantly elevated. The water quality of site S4 is heavily impacted by industrial and domestic wastewater and the Kyoungan Stream. Eun and Seok [11] and Mamun et al. [2] found that the water quality of the Kyoungan Stream is poor in comparison to the South Han River (sites S1 and S2) and North Han River (site S3) based on nutrients, organic matter, and algal chlorophyll. This could have significant effects on the reservoir's water quality.

Variation in the TP, BOD, and CHL-a concentrations of the Paldang Reservoir was prominent during the pre-monsoon, monsoon, and post-monsoon seasons [18]. TP concentrations were higher during the monsoon period due to intense precipitation, while BOD and CHL-a levels at Paldang Reservoir were highest in the spring period [2,18]. The summer monsoon significantly influences the nutrient, organic matter, and algal chlorophyll levels in Korean reservoirs [5,16,38]. Organic matter in aquatic systems may come from allochthonous or autochthonous sources. Allochthonous organic matter enters the environment during precipitation events, while algae produce autochthonous organic matter by photosynthesis [18]. Notably, 69% of the total organic matter in the Paldang Reservoir was allochthonous during the monsoon season [18]. Conversely, during winter and spring, a high load of autochthonous organic matter was observed because of low flow rates and high water residence time [16]. Previous research on Paldang Reservoir indicated that 73% of autochthonous organic matter loading happens during the spring [16,18]. The high level of organic matter during spring corresponds to the maximum production of algae [18]. This suggests that autochthonous production by algae (CHL-a) drives the accumulation of organic matter in the reservoir during spring; hence, the threat to the reservoir's water quality is highest in spring [18].

The water quality of the Paldang Reservoir varied from site to site and season to season due to climatic factors and anthropogenic activities. Since climatic conditions are uncontrollable, anthropogenic impacts should be kept to a minimum. For that reason, regular monitoring of water quality parameters is essential. Traditional monitoring approaches are time-intensive and laborious and cannot provide an overview of water quality at a broader scale [1,19].

On the other hand, satellite remote sensing is presently one of the most potent and reliable approaches for monitoring and managing water quality [4,20,21]. This study has confirmed the applicability of Landsat 5 TM for identifying and mapping water quality parameters in the reservoir. The empirical models developed by multiple linear regression analysis can identify TP, BOD, and CHL-a from Landsat 5 TM images with accuracies of 67%, 65%, and 72%, respectively. As shown in Table 4, the blue\*near-infrared/green (B1\*B4/B2), green\*red/near-infrared (B2\*B3/B4), and blue\*green\*red/near-infrared (B1\*B2\*B3/B4) bands and band ratios are the significant predictors of TP concentrations in Paldang Reservoir. Previous studies also found that the three visible bands (blue, green, and red), the NIR band, and their ratios are suitable for estimating TP concentrations in freshwater systems [4,20]. The blue\*green (B1\*B2), blue\*green/near-infrared (B1\*B2/B4), blue\*red/green (B1\*B3/B2), blue\*red/near-infrared (B1\*B3/B4), and blue\*green\*red/near-infrared (B1\*B2\*B3/B4) bands and band ratios are the significant predictors of BOD concentrations in the reservoir. Quibell [39] reported that the NIR and red band ratio was a good predictor of CHL-a concentration in waters. Other bands and band ratios are also good indicators of CHL-a [40]. The present results indicated that CHL-a was better explained by the green (B2), red (B3), blue\*red (B1\*B3), and blue\*green/red (B1\*B2/B3) bands and band ratios.

#### **5. Conclusions**

Nutrient and organic pollution and algal blooms regulate water quality in freshwater systems. For this reason, it is essential to develop a cost-effective remote sensing monitoring tool to estimate water quality parameters for maintaining an effective water management system. The present study has successfully established the applicability of Landsat 5 TM data for detecting TP, BOD, and CHL-a in the surface water of Paldang Reservoir (Korea). The results showed that TP, BOD, and CHL-a are closely related to the Landsat 5 surface reflectance band values. TP (r = −0.79) and CHL-a (r = −0.79) showed the strongest relations with the B1 (blue) band. By contrast, BOD showed the highest negative correlations with the B1 (blue) (r = −0.75) and B1\*B3/B4 (blue\*red/near-infrared) (r = −0.76) bands. The developed empirical models based on Landsat 5 TM data can estimate TP, BOD, and CHL-a with accuracies of around 67%, 65%, and 72%, respectively, for the reservoir. The results presented here revealed that the surface water quality of the reservoir varied from site to site. The water quality of sites S3 and S4 is affected by anthropogenic factors, which significantly impact the reservoir's water quality. Considering the present findings, sites S3 and S4 warrant particular attention to maintain the water quality. The developed models and methods could be applied to other Korean reservoirs for validation.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10.3390/rs13122256/s1: Figure S1: Outlier detection for BOD; Figure S2: Outlier detection for TP; Figure S3: Outlier detection for CHL-a.

**Author Contributions:** Conceptualization, M.M.; methodology, M.M. and J.F.; software, M.M. and J.F.; validation, M.M. and J.F.; formal analysis, M.M. and J.F.; resources, M.M. and J.F.; data curation, M.M.; writing—original draft preparation, M.M. and J.F.; writing—review and editing, M.M., J.F. and K.-G.A.; visualization, M.M. and J.F.; supervision, K.-G.A.; project administration, K.-G.A.; funding acquisition, K.-G.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Korea Environmental Industry & Technology Institute (KEITI) project (Project:2020-1008-01) "Development of Longitudinal Connectivity Evaluation System, Based on the Fish Movement".

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The datasets presented in this study are available on reasonable request from the corresponding author.

**Acknowledgments:** The authors would like to acknowledge the Korea Environmental Industry & Technology Institute (KEITI) for their assistance.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **A Deep Learning Model Using Satellite Ocean Color and Hydrodynamic Model to Estimate Chlorophyll-***a* **Concentration**

**Daeyong Jin <sup>1</sup> , Eojin Lee <sup>1</sup> , Kyonghwan Kwon <sup>2</sup> and Taeyun Kim 1,\***


**Abstract:** In this study, we used convolutional neural networks (CNNs), well-known deep learning models suitable for image data processing, to estimate the temporal and spatial distribution of chlorophyll-*a* in a bay. The training data required for the construction of the deep learning model were acquired from satellite ocean color data and a hydrodynamic model. Chlorophyll-*a*, total suspended sediment (TSS), visibility, and colored dissolved organic matter (CDOM) were extracted from the satellite ocean color data, and water level, currents, temperature, and salinity were generated from the hydrodynamic model. We developed CNN Model I, which estimates the concentration of chlorophyll-*a* using a 48 × 27 sized overall image, and CNN Model II, which uses a 7 × 7 segmented image. Because CNN Model II conducts estimation using only data around the points of interest, the quantity of training data is more than 300 times larger than that of CNN Model I. Consequently, it was possible to extract and analyze the inherent patterns in the training data, improving the predictive ability of the deep learning model. The average root mean square error (RMSE), calculated by applying CNN Model II, was 0.191, and when the prediction was good, the coefficient of determination (R<sup>2</sup>) exceeded 0.91. Finally, we performed a sensitivity analysis, which revealed that CDOM is the most influential variable in estimating the spatiotemporal distribution of chlorophyll-*a*.

**Keywords:** deep learning; convolutional neural network; chlorophyll-*a*; satellite; hydrodynamic model

#### **1. Introduction**

Marine environments experience continuous deterioration owing to the influx of pollutants from rivers and various infrastructure projects including breakwater construction, dredging, and reclamation. To restore marine environments, numerous mitigation plans have been established using various prediction and evaluation techniques. Nevertheless, several limitations still remain: first, the ocean is a complex three-dimensional system that is difficult to model accurately; second, sea water constituents exhibit dynamic movements due to external forces such as wind, tides, currents, density, etc.; third, a significant amount of time and effort is required to observe oceanic trends; and finally, despite significant developments in marine environment prediction technology, several assumptions and additional research area information are still required [1–4].

The water quality model has been widely employed in marine environment prediction, although professional knowledge and experience, various input data, and model validation procedures are required to utilize it. However, owing to the complex and interconnected nature of marine environments, major problems such as eutrophication, harmful algal blooms (HABs), and hypoxia are difficult to identify and solve. Consequently, considerable research has been conducted on the development of efficient and reliable prediction techniques. Since 2015, deep learning technology that makes predictions using big data has been widely used in various atmospheric, financial, medical, and scientific fields [5–8].

**Citation:** Jin, D.; Lee, E.; Kwon, K.; Kim, T. A Deep Learning Model Using Satellite Ocean Color and Hydrodynamic Model to Estimate Chlorophyll-*a* Concentration. *Remote Sens.* **2021**, *13*, 2003. https://doi.org/ 10.3390/rs13102003

Academic Editor: Giacomo De Carolis

Received: 6 April 2021 Accepted: 17 May 2021 Published: 20 May 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Marine research using deep learning technology can be divided into prediction-related research, classification-related research, and research on methods to correct missing values. Prediction-related research has been applied to various topics, such as the El Niño Index, chlorophyll-*a* time series, and sea surface temperature [9–11]. Classification-related research has been conducted to classify marine life using image data. For example, studies have been conducted to identify the harmful algae that adversely affect marine ecosystems and to classify coral reefs and monitor aquatic ecosystems [12–14]. However, observations using sensors can contain a significant amount of missing data. Consequently, various methods have been developed to estimate the missing data using deep learning techniques [15].

In addition to water quality modeling and deep learning studies, significant research has also been conducted to evaluate the status of plankton and other environmental factors related to marine environments using remote sensing. Ocean color sensors have been used in remote sensing satellites for decades. Those currently in operation include the Chinese Ocean Color and Temperature Scanner (COCTS) onboard HY-1D; Geostationary Ocean Color Imager (GOCI) onboard the Communication, Ocean, and Meteorological Satellite (COMS); Moderate Resolution Imaging Spectroradiometer (MODIS) onboard Aqua; Multi-Spectral Instrument (MSI) onboard Sentinel-2A and Sentinel-2B; Ocean and Land Color Instrument (OLCI) onboard Sentinel-3A and Sentinel-3B; Visible Infrared Imaging Radiometer Suite (VIIRS) onboard Suomi NPP; and Second-Generation Global Imager (SGLI) onboard GCOM-C [16,17]. Ocean color sensors provide vast amounts of spatial data that cannot be obtained from in situ measurements, and consequently, various analyses of spatiotemporal trends are possible. Therefore, extensive research has been conducted to retrieve marine inherent optical properties from ocean color remote sensing and verify ocean color data [18–21]. The data obtained from ocean color sensors are calibrated and verified by comparing them with in situ measurements and the results of existing ocean color sensors [22,23]. Recently, the measurement of ocean color data products such as colored dissolved organic matter (CDOM), chlorophyll-*a*, and total suspended sediment (TSS) has been improved using various neural network methods [24–26].

Another significant problem is the occurrence of HABs, which induce hypoxia and kill fish in marine environments. An HAB is caused by complex external environmental processes and factors such as eutrophication, currents, and salinity gradients [27,28]. Monitoring and predicting the spatiotemporal distribution of chlorophyll-*a* are vital to minimize the damage of HABs [29]. A variety of spatial information is required to predict the spatiotemporal distribution of chlorophyll-*a*, owing to the complex interaction of various physical, chemical, and biological factors. Although CDOM, TSS, and chlorophyll-*a* data can be obtained using ocean color sensors, the extraction of physical information such as currents, velocity, and salinity is limited, and in situ measurements can only provide some information. The continued development of hydrodynamic models has significantly improved their prediction ability, providing physical information with root mean square errors (RMSE) of ±10%, ±10% to ±20%, ±0.5 °C, and ±1 psu for water level, velocity, temperature, and salinity, respectively [30].

In this study, we aim to develop a tool that can estimate the spatial distribution of chlorophyll-*a* using deep learning technology. Satellite ocean color and hydrodynamic model data are used as the training data for the deep learning model. The CDOM, TSS, visibility, and chlorophyll-*a* data recorded on an hourly basis were extracted from a geostationary satellite. The hydrodynamic model data include temperature, salinity, water level, and velocity. The developed tool estimates the spatial distribution of chlorophyll-*a* using the spatial information of CDOM, TSS, visibility, water level, velocity, temperature, and salinity. The accuracy and applicability of the developed prediction tool are demonstrated by comparing the predicted results against the satellite data. As the variables applied to the prediction of chlorophyll-*a* contribute both individually and collectively, the contribution of each variable to the estimation of chlorophyll-*a* is examined as well.

#### **2. Material and Methods**

#### *2.1. Study Area*

The study area is a semi-closed maritime region surrounded by Hadong-gun, Sacheon, and Namhae-gun in South Korea, and is connected to the sea through the Daebang channel to the east, the Noryang channel to the west, and the Changsun channel to the south, as shown in Figure 1. The study area is approximately 19 km long along the north–south direction, and 13 km long along the east–west direction. The length of the coastline is approximately 136 km and the bounded area is approximately 180 km<sup>2</sup>. The average depth is approximately 3.6 m, the depth of the central area is approximately 10 m, and the deepest area—in the channels—is approximately 30–40 m. In summer, a large volume of river water flows into the study area through the channels due to high rainfall. Consequently, although it is a semi-closed sea area, seawater exchange occurs. Sprayed shellfish farming is actively carried out in the region, gradually increasing from 230 tons in 2000, to 730 tons in 2010, and 2410 tons in 2014 [31]. Consequently, sustainable water quality management is vital in such semi-closed marine environments with active aquaculture.


**Figure 1.** Study area in South Korea (Source: Google Earth).

#### *2.2. Satellite Ocean Color*

Various satellites with ocean color sensors have been launched around the world, and Korea launched COMS in 2010 for ocean observation [32,33]. COMS performs meteorological and ocean observations and provides communication services. Ocean color observations are made using the GOCI. The GOCI observes an area of 2500 km × 2500 km, centered on the Korean Peninsula. The resolution of each grid is 500 m, both in width and height, as shown in Figure 2. As COMS is a geostationary satellite, the GOCI records data eight times a day (from 9:00 to 16:00), with images recorded for 30 min every hour. The primary role of the GOCI is to monitor the marine ecosystems around the Korean Peninsula, including long- and short-term marine environmental and climatic changes, coastal and marine environmental monitoring, coastal and marine resource management, and the generation of marine and fishery information [34,35].

**Figure 2.** Spatial information observed by the GOCI (http://kosc.kiost.ac.kr/p20/kosc\_p21.html, accessed on 26 May 2020).

The GOCI has six visible bands with band centers of 412 nm (B1), 443 nm (B2), 490 nm (B3), 555 nm (B4), 660 nm (B5), and 680 nm (B6), and two near-infrared bands with band centers of 745 nm (B7) and 865 nm (B8). Bands B1–B5 are used to record the water quality parameters. The main applications of each band are B1 for yellow substances and turbidity; B2 for the chlorophyll absorption maximum; B3 for chlorophyll and other pigments; B4 for turbidity and suspended sediment; and B5 for the baseline of the fluorescence signal, chlorophyll, and suspended sediment [36]. The amount of light recorded by the optical sensor onboard the satellite is converted to an electronic value and stored in the satellite image. Radiometric calibration is used to precisely define the relationship between the amount of light and the electronic value, and geometric correction is performed to correct the positional information of each pixel in the image. Subsequently, first-order outputs, such as the top-of-atmosphere radiance, and secondary outputs, such as the remote sensing reflectance, chlorophyll-*a*, TSS, and CDOM concentrations, are verified. Various calibration and validation studies have been performed on the GOCI data to improve its accuracy [35,37–39]. The ocean data products used herein were obtained from the GOCI using the GDPS software, which includes atmospheric correction and ocean environment analysis algorithms. The GDPS enables real-time data processing using a Windows-based GUI. The data products obtained from the GDPS include the water leaving radiance (Lw), normalized water leaving radiance (nLw), chlorophyll-*a*, TSS, and CDOM [40].
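For reference, the band specifications above can be captured in a small lookup structure. This is only a restatement of the centers and applications listed in the text (the text gives applications for B1–B5 only, so the other bands carry no application entry):

```python
# GOCI band centers in nm, as listed in the text.
BAND_CENTER_NM = {"B1": 412, "B2": 443, "B3": 490, "B4": 555,
                  "B5": 660, "B6": 680, "B7": 745, "B8": 865}

# Primary applications given in the text for the water-quality bands B1-B5.
APPLICATION = {
    "B1": "yellow substances and turbidity",
    "B2": "chlorophyll absorption maximum",
    "B3": "chlorophyll and other pigments",
    "B4": "turbidity and suspended sediment",
    "B5": "fluorescence baseline, chlorophyll, suspended sediment",
}

def visible_bands():
    """Bands with centers below 700 nm (the six visible bands B1-B6)."""
    return [b for b, nm in BAND_CENTER_NM.items() if nm < 700]
```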

#### *2.3. Hydrodynamic Model*

A hydrodynamic model was used to generate marine physical factors, such as the currents, water level, salinity, and temperature, in the study area. The Delft3D model, which has been applied in several research areas, was used to simulate three-dimensional hydrodynamics [41–44]. The model domain extended for 58 km along the north–south direction and 53 km along the east–west direction, to sufficiently cover the study area. The model grid contained 155 × 245 horizontal cells and, to optimize the computational time, fine and coarse grids were formed in the study area and open sea area, respectively. A total of five vertical layers were modeled to replicate the interaction between the vertical layers and the vertical distribution of salinity and water temperature. Bathymetry for the study area was obtained from the latest navigational charts and the survey data of the Korea Hydrographic and Oceanographic Agency (KHOA). As shown in the bathymetry chart in Figure 3, the bay has a relatively shallow depth and the channels are relatively deep.

**Figure 3.** (**a**) Model grid in the study area; (**b**) Bathymetry in the study area.

The boundary conditions of the study area must be defined to execute the hydrodynamic model. The water levels, salinity, and temperatures observed at different measurement sites (GoSung-JaRan, TongYong3, NamHae3) by the Korea Marine Environment Management Corporation (KOEM) were set as the sea boundary conditions, and the monthly average flow rates at GwanGok, BakRyeon, MukGok, GaWa, and SaCheon were set as the river boundary conditions. Meteorological data, such as the wind direction, wind speed, air temperature, and relative humidity, measured at the NamHae site of the Korea Meteorological Administration (KMA), were also used as model input data. The initial conditions of the water level and velocity were set to zero, and the initial conditions of temperature and salinity were derived from the measured data at the five KOEM stations shown in Figure 4. The hydrodynamic model was simulated for a total of five years from 1 January 2015 to 31 December 2019. As the data used in the deep learning model include the water level, current, salinity, and temperature, these data were verified. The water level was verified using the data observed at the T1 site operated by KHOA, which is located inside the bay. The current was validated against the data recorded at the PC1 site operated by KHOA, between 24 July 2015 and 26 August 2015. The salinity and water temperature were validated against the data measured at the JinJuMan 1 and JinJuMan 2 sites, operated by KOEM, and the SamCheonPo site, operated by KHOA, as shown in Figure 4.

The water levels in the study area fluctuated by approximately 3 m and were primarily affected by the tides. The average difference in the water level between the hydrodynamic model and the observed values was approximately 10 cm, and the absolute error was within 8–10%, with slight differences every year. The currents observed between 24 July 2015 and 26 August 2015 were classified into a U-component, moving east–west, and a V-component, moving north–south. As shown in Figure 5, the U-component was the dominant current in the study area. The U-component current flowed as fast as 0.5 m/s and fluctuated based on the tidal cycle. Although the hydrodynamic model appears to underestimate the current magnitudes, the current patterns are reproduced well. The temperature was below 10 °C during winter and almost 30 °C during summer, with clearly noticeable seasonal variations. The water temperature varied between 13 °C and 20 °C during spring and autumn, with the lowest temperature in February and the highest temperature in August. Considering the predicted daily temperatures, the hydrodynamic model adequately reproduced the annual temperature-change pattern, and the average RMSE of the temperature was 0.862 °C. The salinity was highly influenced by the river flow, i.e., during spells of high rainfall, the salinity temporarily decreased before increasing to approximately 32–33 psu. The average RMSE of the salinity was 0.6 psu, as shown in Figure 5.

**Figure 4.** Locations of the KMA, KOEM, KHOA, and river monitoring stations in the study area. (**a**) Measurement sites used for boundary conditions. (**b**) Measurement sites used to validate the hydrodynamic model.


#### *2.4. Data Structure for Deep Learning Model*

The satellite data of the study area, which was required to construct the deep learning model, was provided by the Korea Ocean Satellite Center (KOSC) in the Korea Institute of Ocean Science and Technology. The data were recorded eight times per day between 9:00 and 16:00, from January 2015 to December 2019. The data obtained included the entire Korean Peninsula, and the total size of the data was approximately 14 TB. No satellite data could be extracted when the study area was covered by clouds. The total number of extracted data was 391 in 2015, 276 in 2016, 266 in 2017, 271 in 2018, and 128 in 2019. Generally, a large amount of data were recorded during winter, when the weather was good, and a small amount of data were recorded during summer, owing to the increased rainfall and typhoons.

The hydrodynamic model results were extracted for the same area as the satellite measurements, as shown in Figure 6. The hourly salinity, temperature, currents, and water levels between 2015 and 2019 were converted into a grid format. As the resolution of the satellite data was 500 m, the data from the area adjacent to the coastline could not be obtained. Therefore, only the data pertaining to the sea area 500 m away from the coastline were used to train the deep learning model. Accordingly, the hydrodynamic model results of the area adjacent to the coastline were also neglected.
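The coastline-exclusion rule can be sketched as a simple neighborhood mask. This is purely illustrative: the actual land/sea mask is not given in the text, so a fake one is constructed here, assuming the 48 × 27 grid described in the next subsection.

```python
import numpy as np

# Hypothetical land/sea mask on the 48 x 27 satellite grid (True = sea);
# here we pretend the first column of cells is coastline/land.
sea = np.ones((48, 27), dtype=bool)
sea[:, 0] = False

# Exclude any sea cell touching land, mirroring the rule that only water
# at least one 500 m pixel away from the coastline is used for training.
# Padding with False also drops the outer edge of the grid.
padded = np.pad(sea, 1, constant_values=False)
usable = sea.copy()
for di in (-1, 0, 1):
    for dj in (-1, 0, 1):
        usable &= padded[1 + di:49 + di, 1 + dj:28 + dj]
```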

#### *2.5. Deep Learning Model Structure*

As the satellite and hydrodynamic model data were in the form of a 48 × 27 grid, they could be treated as image data. Consequently, an image-based deep learning method was applied herein. Each 48 × 27 grid was referred to as an 'image,' and each point in the image was referred to as the 'data' or 'point'. The satellite chlorophyll-*a* data were treated as ground-truth data, as several studies have shown a high correlation between the ground-truth chlorophyll-*a* data and satellite chlorophyll-*a* data. Accordingly, we constructed a deep learning model to estimate the temporal and spatial distribution of chlorophyll-*a* using both the satellite and the hydrodynamic model data. Specifically, the deep learning model estimated the temporal and spatial distribution of chlorophyll-*a* at a given time (t) by integrating the satellite data, such as the CDOM, TSS, and visibility, and the hydrodynamic model data, such as the currents, water level, temperature, and salinity, at the same time (t), as illustrated in Figure 7.


**Figure 5.** (**a**) Temporal variation of water level; (**b**) temporal variation of currents; (**c**) temporal variation of salinity; (**d**) temporal variation of temperature (points are observations and lines are model results).

**Figure 6.** Spatial distribution of training data in the study area: salinity, temperature, currents, and water levels from the hydrodynamic model, and CDOM, chlorophyll-*a*, TSS, and visibility from the satellite ocean color data.


**Figure 7.** Construction of the deep learning model for estimating the temporal and spatial distribution of chlorophyll-*a*. To utilize spatial information, the input data were organized in a matrix accumulated over time. The row and column of each value correspond to the latitude and longitude of each data point.

A convolutional neural network (CNN) is a well-known deep learning model that is suitable for image data processing. A CNN model consists of multiple convolutional layers that extract features from an image and pooling layers that subsample the feature maps, leaving only the important patterns behind. Classification and estimation are performed through iterative convolutional and pooling operations. We designed two approaches to estimate chlorophyll-*a* based on a CNN. The first CNN model, called 'CNN Model I', estimates the chlorophyll-*a* concentration from an image in a 48 × 27 grid format by integrating a total of seven images: three images from the satellite data (CDOM, TSS, and visibility) and four images from the hydrodynamic model data (currents, water level, temperature, and salinity), as shown in Figure 8. Notably, as the image size was small, a pooling layer for information compression would be ineffective; consequently, no pooling layer was used. The second CNN model, called 'CNN Model II', predicted the chlorophyll-*a* concentration using segmented images.
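To make the convolution step concrete, here is a minimal NumPy sketch of a single 3 × 3 convolutional layer over the seven stacked input channels. The number of layers, kernel sizes, output channels, and weights are all illustrative assumptions; the paper's actual architecture is only described at the level above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Seven stacked input channels (CDOM, TSS, visibility, currents, water
# level, temperature, salinity) on the 48 x 27 grid; values illustrative.
x = rng.standard_normal((7, 48, 27))

# One 3 x 3 convolution (stride 1, zero padding) with 16 output channels,
# standing in for the convolutional layers of CNN Model I, which omits
# pooling because the input images are small.
w = rng.standard_normal((16, 7, 3, 3)) * 0.1
b = np.zeros(16)

def conv2d(x, w, b):
    """Naive 'same'-padded 2-D convolution over a channel-first image."""
    c_out, c_in, kh, kw = w.shape
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))  # zero padding keeps 48 x 27
    out = np.empty((c_out, h, wd))
    for o in range(c_out):
        acc = np.zeros((h, wd))
        for i in range(c_in):
            for ki in range(kh):
                for kj in range(kw):
                    acc += w[o, i, ki, kj] * xp[i, ki:ki + h, kj:kj + wd]
        out[o] = acc + b[o]
    return out

feat = np.maximum(conv2d(x, w, b), 0.0)  # ReLU feature maps, 16 x 48 x 27
```

In practice such layers would be stacked and trained with a deep learning framework; the loop form here only shows the operation itself.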

Additional preprocessing is required to use segmented images as the model input. For example, in the case of 7 × 7 segmented images, the chlorophyll-*a* value is estimated by using segmented images of seven individual input variables. The difference between CNN Model I and CNN Model II is that the former estimates one chlorophyll-*a* image by integrating the images of seven individual input variable changes, whereas the latter estimates the chlorophyll-*a* value by integrating segmented images of seven individual input variables, as shown in Figure 9. As CNN Model II estimates the chlorophyll-*a* value using the data around a point of interest, we believe that it also reflects the local characteristics well.
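The segmentation step described above can be sketched as a sliding-window extraction. This is illustrative only: it ignores the land and cloud masking applied to the real data, so the patch count per image differs from the counts reported in Table 1.

```python
import numpy as np

rng = np.random.default_rng(0)
# Seven input variables stacked on the 48 x 27 grid (illustrative values).
image = rng.standard_normal((7, 48, 27))

def segment(image, k=7):
    """Slide a k x k window cell-by-cell over the grid. Each patch becomes
    one CNN Model II sample, associated with the chlorophyll-a value at
    the patch centre; the outer k//2 rows/columns get no centred patch."""
    c, h, w = image.shape
    patches, centres = [], []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            patches.append(image[:, i:i + k, j:j + k])
            centres.append((i + k // 2, j + k // 2))
    return np.stack(patches), centres

patches, centres = segment(image)  # 42 x 21 = 882 patches, each 7 x 7 x 7
```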

**Figure 8.** (**a**) Algorithm of CNN Model I and (**b**) CNN Model II. CNN Model I uses seven images of 48 × 27 grid size and estimates the chlorophyll-*a* value in a 48 × 27 grid format. CNN Model II uses segmented images in a 7 × 7 grid format and estimates the chlorophyll-*a* value.


**Figure 9.** Schematic diagram of the application of segmented images in the CNN Model II; segmented images are generated by iteratively moving the window cell-by-cell. The CNN Model II estimates a chlorophyll-*a* value by integrating segmented images of seven individual input variables.

To verify the reliability of the deep learning model, the data were divided into training data, validation data, and test data, considering the seasonal characteristics over an entire year. For CNN Model I, 932 images were used for training, 271 images for validation, and 128 images for testing. For CNN Model II, the images in a 48 × 27 grid format were divided into segmented images with a 7 × 7 grid format. Consequently, the number of images used for training, validation, and testing increased to 293,580, 85,365, and 40,320, respectively. As CNN Model II did not have the segmented images required to estimate the values of three columns and three rows at the edge of each image, the values related to these regions were not predicted. The quantity of available data varied from one year to another as the satellite measurements could not be obtained on days with poor weather. In particular, the quantity of data obtained during summer was relatively small compared to that obtained during the other seasons owing to increased rainfall and typhoons, as shown in Table 1.

**Table 1.** Information of training data, validation data, and test data in the CNN Model I and CNN Model II.

| Category | Training Data | Validation Data | Test Data |
| --- | --- | --- | --- |
| Period (year) | 2015–2017 | 2018 | 2019 |
| CNN Model I (# of images) | 932 | 271 | 128 |
| CNN Model II (# of segmented images (7 × 7)) | 293,580 | 85,365 | 40,320 |


#### **3. Results**

#### *3.1. CNN Model I*

The RMSE, which quantifies the difference between the predicted chlorophyll-*a* and the satellite chlorophyll-*a* values, was used to evaluate the accuracy of the CNN models designed herein. The RMSE was calculated as:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (pred(i) - target(i))^2} \tag{1}$$

where *pred(i)* represents the predicted chlorophyll-*a* pixel value of the *i*th point and *target(i)* represents the satellite chlorophyll-*a* pixel value of the *i*th point in each image.
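Equation (1) can be implemented directly; the pixel values below are illustrative, not taken from the paper's data.

```python
import numpy as np

def rmse(pred, target):
    """Equation (1): root mean square error between the predicted and
    satellite chlorophyll-a pixel values of one image."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# Illustrative pixel values:
print(rmse([1.0, 2.0, 3.0], [1.2, 1.8, 3.1]))  # ~0.173
```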

CNN Model I was used to estimate the chlorophyll-*a* value of 128 images recorded in 2019. In most cases, the RMSE was approximately 0.2–0.6 and the average RMSE was 0.436, as shown in Figure 10. The minimum RMSE was 0.106 and the maximum RMSE was 1.242, which is a significant gap. Therefore, specific analyses were performed for the cases with RMSE = 0.106, RMSE = 0.506, and RMSE = 1.209, as shown in Figure 11.


**Figure 10.** RMSE distribution for 128 images using CNN Model I: histogram with the range of RMSE values on the *X*-axis and the number of images on the *Y*-axis.


‐

‐

‐ ‐

In the case with the lowest RMSE (RMSE = 0.106), the model results showed that there was a slight predictive error in the image, but the overall trend was well estimated. In the case with the RMSE close to the average value (RMSE = 0.506), the overall change in chlorophyll-*a* in the entire image was clearly estimated, but the accuracy of the estimation of the local changes in chlorophyll-*a* was limited. In the case with the high RMSE (RMSE = 1.209), the model was unable to estimate the satellite chlorophyll-*a* value. The measured values clearly indicate a change in the spatial chlorophyll-*a* values, whereas the estimated values tend to converge to the average value at most points. Thus, the model appeared to have a tendency to approximate the average value as the estimated value when the training data were insufficient, as shown in Figure 11. Consequently, the coefficient of determination (R<sup>2</sup>), which represents how well the model results fit the satellite data, was applied herein. R<sup>2</sup> takes a value of 0.0–1.0, where a value of 1.0 indicates a perfect fit. When the RMSE was relatively low, R<sup>2</sup> was around 0.673, and when the RMSE was high, R<sup>2</sup> < 0.5. When R<sup>2</sup> < 0.5, the higher the chlorophyll-*a* value of the satellite data, the lower the predictive ability, as shown in Figure 12.
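The R<sup>2</sup> used here can be computed with the standard definition of the coefficient of determination (a sketch; the paper does not spell out its exact formula):

```python
import numpy as np

def r_squared(pred, target):
    """Coefficient of determination: the fraction of variance in the
    satellite chlorophyll-a values explained by the model predictions
    (1.0 = perfect fit; near 0 = no better than predicting the mean)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    ss_res = np.sum((target - pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((target - target.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)
```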

The results of CNN Model I tended to be averaged by assimilating the surrounding values instead of estimating local changes. As deep learning models such as a CNN estimate values by analyzing patterns from training data, the prediction patterns could not be determined from insufficient training data. Therefore, CNN Model I, which was trained using only 1203 training and validation images, could predict the overall trends but failed to predict local changes. Notably, if additional training data is provided, the prediction accuracy of CNN Model I can be improved.

#### *3.2. CNN Model II*

Chlorophyll-*a* estimation was also performed using CNN Model II, which utilized 300 times more training and validation data than CNN Model I, owing to the use of segmented images. The RMSE values of CNN Model II were around 0.05–0.8. Most of the RMSE values were less than or equal to 0.2, with an average of 0.167. Compared to the results of CNN Model I, the RMSE values of CNN Model II were significantly lower, confirming the excellent predictive ability of the latter. Notably, RMSE was less than or equal to 0.12 in almost half the total number of predictions. A detailed analysis was performed by classifying the RMSE values of CNN Model II into good, average, and bad cases, as shown in Figure 13.

In the case of a low RMSE value (RMSE = 0.055), the predicted chlorophyll-*a* values were almost the same as the satellite chlorophyll-*a* values. Furthermore, the spatial variations of chlorophyll-*a* concentration were properly estimated. The case with an RMSE value close to the average value (RMSE = 0.204) also demonstrated similar results to the observed values. In particular, the changes in the spatial concentration were estimated accurately. In the case of a high RMSE value (RMSE = 0.775), the model accurately reproduced the spatial concentration pattern but tended to underestimate the concentration at some points. The satellite data exhibited large variations in the concentration between adjacent points, whereas the deep learning model smoothed this drastic change in its spatial estimates, as shown in Figure 14.



**Figure 11.** Chlorophyll-*a* results estimated using the CNN Model I: The left section shows the predicted chlorophyll-*a* values and the right section shows the satellite chlorophyll-*a* values corresponding to the left section. The RMSE values for the three cases are (**a**) 0.106, (**b**) 0.506, and (**c**) 1.209, respectively.

In the case with the lowest RMSE (RMSE = 0.106), the model results showed that there was a slight predictive error in the image, but the overall trend was well estimated. In the case with the RMSE close to the average value (RMSE = 0.506), the overall change in chlorophyll-*a* in the entire image was clearly estimated, but the accuracy of the estimation of the local changes in chlorophyll-*a* was limited. In the case with the high RMSE (RMSE = 1.209), the model was unable to estimate the satellite chlorophyll-*a* value. The measured values clearly indicate a change in the spatial chlorophyll-*a* values, whereas the estimated values tend to converge to the average value at most points. Thus, the model appeared to have a tendency to approximate the average value as the estimated value when the training data were insufficient, as shown in Figure 11. Consequently, the

ellite data, was applied herein. R2 is represented by a value of 0.0–1.0, where a value of

< 0.5. When R<sup>2</sup>

of the satellite data, the lower the predictive ability, as shown in Figure 12.

), which represents how well the model results fit the sat-

< 0.5, the higher the chlorophyll-*a* value

was around 0.673,

coefficient of determination (R<sup>2</sup>

(**a**) RMSE = 0.106, R<sup>2</sup>= 0.673 (**b**) RMSE = 1.209, R<sup>2</sup>= 0.481

‐

‐

‐

**Figure 12.** Examples of (**a**) good R<sup>2</sup> and (**b**) bad R<sup>2</sup> values among the results of the CNN Model I. **Figure 12.** Examples of (**a**) good R<sup>2</sup> and (**b**) bad R<sup>2</sup> values among the results of the CNN Model I.

The results of CNN Model I tended to be averaged by assimilating the surrounding

*Remote Sens.* **2021**, *13*, x FOR PEER REVIEW 13 of 20

confirming the excellent predictive ability of the latter. Notably, RMSE was less than or equal to 0.12 in almost half the total number of predictions. A detailed analysis was per-‐ ‐ **Figure 13.** RMSE distribution for 128 images using the CNN Model II: histogram with the range of RMSE values on the *X*-axis and the number of images on the *Y*-axis.

formed by classifying the RMSE values of CNN Model II into good, average, and bad cases, as shown in Figure 13. ‐ ‐ ‐ Compared to CNN Model I, CNN Model II has significantly better chlorophyll-*a* estimation ability, and the spatial change pattern of chlorophyll-*a* was successfully estimated in all the model results. Furthermore, the coefficient of determination (R<sup>2</sup> ) improved significantly. When RMSE = 0.055, R<sup>2</sup> = 0.91, and when RMSE = 0.775, which suggests a high degree of error, the overall trend was reproduced well and R<sup>2</sup> = 0.661, as shown in Figure 15. Although both models used the same CNN technique, the difference in their estimation abilities is likely due to the large difference in their respective training data volumes.

**Figure 13.** RMSE distribution for 128 images using the CNN Model II: histogram with the range of

In the case of a low RMSE value (RMSE = 0.055), the predicted chlorophyll-*a* values were almost the same as those of the satellite chlorophyll-*a* values. Furthermore, the spatial variations of chlorophyll-*a* concentration were properly estimated. The case with an RMSE value close to the average value (RMSE = 0.204) also demonstrated similar re-

RMSE values on the X-axis and the number of images on the Y-axis.

sults to the observed values. In particular, the changes in the spatial concentration were estimated accurately. In the case of a high RMSE value (RMSE = 0.775), the model accurately reproduced the spatial concentration pattern but tended to underestimate the concentration at some points. The satellite data exhibited large variations in the concentration between adjacent points, whereas the deep learning model corrected this drastic

*Remote Sens.* **2021**, *13*, x FOR PEER REVIEW 14 of 20

change and estimated it smoothly in space, as shown in Figure 14.

**Figure 14.** Chlorophyll-*a* results estimated using CNN Model II. The left section shows the predicted chlorophyll-*a* values and the right section shows the corresponding satellite chlorophyll-*a* image values. The corresponding RMSE values are (**a**) 0.055, (**b**) 0.204, and (**c**) 0.775, respectively. **Figure 14.** Chlorophyll-*a* results estimated using CNN Model II. The left section shows the predicted chlorophyll-*a* values and the right section shows the corresponding satellite chlorophyll-*a* image values. The corresponding RMSE values are (**a**) 0.055, (**b**) 0.204, and (**c**) 0.775, respectively. *Remote Sens.* **2021**, *13*, x FOR PEER REVIEW 15 of 20

Compared to CNN Model I, CNN Model II has significantly better chlorophyll-*a* es-

**Figure 15.** Examples of (**a**) good R<sup>2</sup> and (**b**) bad R<sup>2</sup> values among the results of the CNN Model II. **Figure 15.** Examples of (**a**) good R<sup>2</sup> and (**b**) bad R<sup>2</sup> values among the results of the CNN Model II.
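As a concrete illustration, the image-wise RMSE of Equation (1) and the coefficient of determination used to score the CNN outputs can be computed as follows. This is an illustrative NumPy sketch, not the authors' code; `pred` and `target` stand for a predicted and a satellite chlorophyll-*a* image.

```python
import numpy as np

def rmse(pred, target):
    """Root-mean-square error between predicted and satellite chlorophyll-a images."""
    pred = np.asarray(pred, float).ravel()
    target = np.asarray(target, float).ravel()
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def r2(pred, target):
    """Coefficient of determination: how well the predictions fit the satellite data."""
    pred = np.asarray(pred, float).ravel()
    target = np.asarray(target, float).ravel()
    ss_res = np.sum((target - pred) ** 2)        # residual sum of squares
    ss_tot = np.sum((target - target.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)
```

A perfect prediction gives RMSE = 0 and R<sup>2</sup> = 1; both metrics degrade together as the predicted field drifts from the satellite field.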


#### **4. Discussion**

Plankton growth is affected by various factors such as water flow, water temperature, nutrients, and light. The concentration of plankton is relatively high in shallow water coastal areas and upwelling regions, as they have a rich supply of nutrients. The surface salinity and temperature of the study area change significantly as high salinity and low temperature seawater flows through the Daebang channel, located in the northeast. The satellite data reveals that the seawater flowing in from the Daebang channel contains low concentrations of chlorophyll-*a*, resulting in a relatively low chlorophyll-*a* concentration in the center of the study area. Moreover, the study area is connected to a river, and large amounts of river water flow into the study area during the rainy summer season, affecting the growth of plankton. As the growth of each type of plankton depends on the water temperature, it is important to predict the seasonal changes in plankton concentration.

The monthly averaged satellite data and model data were compared to determine whether the prediction model developed herein can adequately estimate the seasonal changes in plankton concentration. In 2016 and 2018, the chlorophyll-*a* concentration was low in January—the winter season—but high during spring and summer. The concentration decreased again in November, which clearly demonstrates the seasonal fluctuations in plankton concentration in the study area. The developed model successfully estimated the seasonal fluctuations in plankton concentration in 2016 and 2018. Notably, although the seasonal fluctuations in 2019 were relatively small compared to those in 2016 and 2018, the developed model accurately estimated the small seasonal and local concentration changes, as shown in Figure 16.
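The monthly comparison described above reduces to a simple grouping operation over the image time series. A minimal sketch, where the `(year, month)` keys and array shapes are illustrative assumptions rather than the paper's data format:

```python
import numpy as np
from collections import defaultdict

def monthly_means(dates, images):
    """Average chlorophyll-a images by calendar month.

    `dates` is a sequence of (year, month) tuples and `images` a sequence of
    same-shape 2-D arrays; returns a dict mapping (year, month) to the mean grid.
    """
    groups = defaultdict(list)
    for (year, month), img in zip(dates, images):
        groups[(year, month)].append(np.asarray(img, float))
    return {key: np.mean(stack, axis=0) for key, stack in groups.items()}
```

Applying the same grouping to both the satellite and the predicted image series yields the pairs of monthly maps compared in Figure 16.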

We performed a sensitivity analysis to determine the influence of each input variable in the model results. To do so, the performance of the model was investigated by only using individual input variables as training data for the deep learning model. The results of the sensitivity analysis (Table 2) indicated that CDOM contributes significantly to the estimation of chlorophyll-*a*, with an RMSE of 0.231. The visibility, TSS, and temperature are also relatively important variables, whereas the remaining input variables have a relatively low contribution to the improvements in model performance. Notably, when all the input variables, except for CDOM, were integrated, the RMSE increased to 0.330. Thus, although the individual input variables have a negligible effect on the model performance, the integration of the input variables has a complementary effect and improves model prediction. When all the input variables were used, the RMSE was 0.191, which represents the best model performance.
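The leave-one-in sensitivity procedure can be sketched as follows. A least-squares fit on synthetic data stands in for the authors' CNN, so all variable names and numbers here are illustrative, not the paper's results; the point is the structure of the experiment, namely training on single input variables versus all of them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data: 8 input variables (e.g., CDOM, TSS, visibility, ...) and a
# chlorophyll-a target; a linear least-squares fit replaces the CNN for brevity.
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=500)

def fit_rmse(cols):
    """Train using only the selected input variables and report training RMSE."""
    A = X[:, cols]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))

single = {i: fit_rmse([i]) for i in range(X.shape[1])}  # one variable at a time
combined = fit_rmse(list(range(X.shape[1])))            # all variables together

# Integrating the variables is complementary: the combined model fits best.
assert combined <= min(single.values())
```

The same pattern of results appears in Table 2: no single variable matches the model trained on all inputs together.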

Predictive studies on plankton concentrations have been conducted for decades using various water quality models. However, there are numerous challenges and limitations owing to the complex interactions between water quality parameters, uncertainty of hydrodynamic information, and lack of boundary nutrient loadings and validation data. For example, the results of studies that predicted the level of chlorophyll-*a* in Chesapeake Bay by employing a 3D water quality model had a correlation coefficient of less than 0.5 [45,46]. The main objective of this study was to develop a prediction tool that can be used in combination with existing water quality models, wherein the currents, water level, salinity, and temperature calculated from the hydrodynamic model were used to predict chlorophyll-*a* concentration. As the hydrodynamic model results have an error of only 10–20%, they can be used as training data for deep learning models [30]. Accordingly, satellite data such as CDOM, TSS, and visibility, which were validated through various studies, were used as training data to develop a chlorophyll-*a* prediction tool. The prediction model developed herein (CNN Model II) has good accuracy in the estimation of chlorophyll-*a* concentration, as evidenced by an R<sup>2</sup> of 0.66–0.91 and an RMSE of 0.055–0.775. Although the data used in the model are not in situ measurements, satellite data and hydrodynamic model data have continuously improved in recent years, and provide spatiotemporal data that cannot be obtained from in situ measurements. In addition, the developed model can predict the spatiotemporal chlorophyll-*a* concentration based on changes in individual parameters, such as an increase in water temperature due to climate change, an increase in CDOM due to land development, and an increase in TSS as a result of poor flushing due to the presence of coastal structures.

**Figure 16.** Monthly averaged spatial distribution of model results and satellite chlorophyll-*a* images (CNN Model II).


The model results must be compared to real-world measurement data to validate the performance of the model. However, spatiotemporal chlorophyll-*a* data cannot be obtained through in situ measurements. The performance of the chlorophyll algorithms used for the GOCI radiometric data was evaluated using in situ measurements collected at 491 stations [47]. The evaluation of the coincident in situ pairs of Rrs and chlorophyll measurements demonstrated that the mean uncertainty was <35%, with a correlation of around 0.8. Therefore, assuming that the data from GOCI are close to the real-world values, the model results were validated by comparing them against the satellite data. To improve the developed model, it is necessary to conduct a validation study with measurement data from the study area and a comparative study against state-of-the-art methods.

#### **5. Conclusions**

In this study, we developed a deep learning model using a CNN to predict the spatiotemporal changes in chlorophyll-*a* in a bay in Korea. The data used to train the deep learning model were the spatial data of chlorophyll-*a*, total suspended sediment (TSS), visibility, and colored dissolved organic matter (CDOM) obtained from the Geostationary Ocean Color Imager (GOCI) on board COMS, and the water level, currents, temperature, and salinity calculated by a verified hydrodynamic model. CNN Model I, which estimates chlorophyll-*a* images in a 48 × 27 grid format, was developed using the same 48 × 27 grid size of the CDOM, TSS, visibility, water level, currents, temperature, and salinity data. The RMSE between the satellite image and the predicted image from the model was calculated, and was between 0.2 and 0.6 in most cases. Although CNN Model I was able to estimate the overall trend, there were significant differences between the predicted results and the satellite data in some cases. As a deep learning model improves its predictive ability by extracting and analyzing the inherent patterns in the training data, the predictive ability of the model decreases significantly when the training data are insufficient.

To solve the problem of insufficient data, we designed another deep learning model, CNN Model II, using segmented images in a 7 × 7 grid format. CNN Model II estimates target values using only the data around the point of interest and, consequently, the volume of training data used in CNN Model II is around 300 times that of CNN Model I. Therefore, CNN Model II can extract and analyze inherent patterns in the training data more accurately. The average RMSE of CNN Model II was 0.191, significantly lower than the 0.463 of CNN Model I. Moreover, the spatial concentration of chlorophyll-*a* was well estimated by CNN Model II, thereby proving the efficacy of the deep learning model.
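The 7 × 7 segmentation that multiplies the training volume can be sketched as a sliding window over the full-scene grid. The stride and any land masking the authors may have applied are not specified, so the patch counts below are properties of this sketch only:

```python
import numpy as np

def segment(image, k=7, stride=1):
    """Slide a k x k window over a 2-D grid and return all fully contained patches."""
    image = np.asarray(image)
    rows, cols = image.shape
    patches = [image[r:r + k, c:c + k]
               for r in range(0, rows - k + 1, stride)
               for c in range(0, cols - k + 1, stride)]
    return np.stack(patches)

# A 48 x 27 chlorophyll-a grid yields (48-7+1) * (27-7+1) = 42 * 21 = 882 patches,
# showing how one full-scene image multiplies into many local training samples.
patches = segment(np.zeros((48, 27)))
assert patches.shape == (882, 7, 7)
```

Each patch becomes one training sample whose center pixel is the prediction target, which is what lets CNN Model II learn from far more examples than CNN Model I.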

A sensitivity analysis was performed to determine the influence of each input variable on the model performance, and CDOM was found to have the most influence on the prediction of chlorophyll-*a*. The visibility, TSS, and temperature were also relatively important variables. The input variables with a strong influence on the model performance have a direct relationship with nutrients, photosynthesis, and temperature, which influence plankton growth. Therefore, the data-based deep learning model considers the major factors related to the growth of plankton and makes predictions. Additionally, the predictive accuracy of the deep learning model was improved if the training data also included the currents, velocity, and salinity.

**Author Contributions:** Conceptualization, D.J. and T.K.; methodology, E.L. and T.K.; software, D.J., T.K. and K.K.; validation, D.J. and K.K.; data curation, E.L.; writing—original draft preparation, D.J. and T.K.; writing—review and editing, D.J. and T.K.; visualization, D.J. and K.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data sharing is not applicable to this article.

**Acknowledgments:** This paper was written following the research work "A Study on Marine Pollution Using Deep Learning and its Application to Environmental Impact Assessment (II)" (RE2021-08), funded by the Korea Environment Institute (KEI).

**Conflicts of Interest:** The authors declare that they have no conflict of interest.

#### **References**


## *Article* **RCSANet: A Full Convolutional Network for Extracting Inland Aquaculture Ponds from High-Spatial-Resolution Images**

**Zhe Zeng <sup>1</sup> , Di Wang <sup>2</sup> , Wenxia Tan <sup>3</sup> , Gongliang Yu 4,\*, Jiacheng You <sup>1</sup> , Botao Lv <sup>1</sup> and Zhongheng Wu <sup>5</sup>**


**Abstract:** Numerous aquaculture ponds are intensively distributed around inland natural lakes and mixed with cropland, especially in areas with high population density in Asia. Information about the distribution of aquaculture ponds is essential for monitoring the impact of human activities on inland lakes. Accurate and efficient mapping of inland aquaculture ponds using high-spatial-resolution remote-sensing images is a challenging task because aquaculture ponds are mingled with other land cover types. Considering that aquaculture ponds have intertwining regular embankments and that these salient features are prominent at different scales, a Row-wise and Column-wise Self-Attention (RCSA) mechanism that adaptively exploits the identical directional dependency among pixels is proposed. A fully convolutional network (FCN) combined with the RCSA mechanism (RCSANet) is then proposed for large-scale extraction of aquaculture ponds from high-spatial-resolution remote-sensing imagery. In addition, a fusion strategy is implemented using a water index and the RCSANet prediction to further improve extraction quality. Experiments on high-spatial-resolution images using pansharpened multispectral and 2 m panchromatic images show that the proposed methods gain at least 2–4% overall accuracy over other state-of-the-art methods regardless of region, achieving an overall accuracy of 85% in the Lake Hong region and 83% in the Lake Liangzi region in aquaculture pond extraction.

**Keywords:** aquaculture ponds; extraction; inland lake; self-attention

**Citation:** Zeng, Z.; Wang, D.; Tan, W.; Yu, G.; You, J.; Lv, B.; Wu, Z. RCSANet: A Full Convolutional Network for Extracting Inland Aquaculture Ponds from High-Spatial-Resolution Images. *Remote Sens.* **2021**, *13*, 92. https://doi.org/10.3390/rs13010092

Received: 10 December 2020; Accepted: 26 December 2020; Published: 30 December 2020

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Aquaculture has become one of the main sources of animal protein and increasingly contributes to food security for many inland cities with large populations in Asia. Freshwater aquaculture products such as fish, crustaceans, and molluscs are supplied from aquaculture ponds built around natural lakes. Aquaculture in China already accounts for 60% of global production [1], and foods provided by inland aquaculture ponds have become predominant contributors of aquatic foods in Chinese banquets [2]. Provinces in the middle and lower reaches of the Yangtze River basin account for more than half the country's total freshwater production. In recent years, pond aquaculture has become predominant, contributing on average 71 percent of total freshwater production (China Fishery Statistical Yearbook 2004–2016) and maintaining an average growth rate of 5.8 percent per year. The area under pond aquaculture has greatly increased. However, intensive aquaculture has a severely destructive effect on the environment, including high levels of water use, local environmental pollution, and the loss of services provided by the freshwater ecosystems of natural lakes [3,4].

Remotely sensed imagery has been used as an effective means for global monitoring of aquaculture ponds in coastal areas [5–7] and nearby inland lakes [8]. An object-based image analysis (OBIA) method was used on Landsat TM images to extract aquaculture ponds in coastal areas of south-eastern China [9]. Tran et al. used maximum likelihood classification on Landsat and SPOT5 images to obtain long-term land-cover and land-use changes in a delta in Vietnam, where aquaculture ponds were one of the classes [10]. Ottinger et al. used the geometric features of aquaculture ponds for image segmentation on Sentinel-1 Synthetic Aperture Radar (SAR) images to extract fish ponds in several delta areas in Asia [5,11]. Zeng et al. used Landsat and Gaofen-1 satellite images to extract aquaculture ponds around inland lakes using boundary curve features and a Support Vector Machine (SVM) classifier [8]. In state-of-the-art methods for aquaculture pond extraction, object-oriented classification is usually integrated with hand-crafted features, and the spatial resolution of the satellite images commonly used is generally 10 meters or coarser. However, aquaculture ponds close to inland lakes are mixed with water bodies that are approximately the same size as these ponds. Accurately mapping aquaculture ponds using finer spatial resolution (up to a few meters) remote-sensing images and applying a more generalized approach, rather than manual feature engineering, remains a technical challenge for inland lake mapping.

Because semantic segmentation can understand images at the pixel level, statistics- and geometry-based image segmentation methods have been replaced by methods that depend on Deep Convolutional Neural Networks (DCNNs) [12]. DCNNs have been recognized by industry and have become widely used, advancing from LeNet-5's success in handwritten ZIP code recognition in the 1980s to AlexNet's victory in the 2012 ImageNet competition [13]. Subsequently, the deep CNN architecture proposed by the Visual Geometry Group of Oxford University (VGG) [14], the residual network architecture proposed by He et al. (ResNet) [15], and other DCNN structures have become the basic learning frameworks for advanced feature extraction from visual images. The fully convolutional network (FCN) constitutes a breakthrough in semantic image segmentation by converting the fully connected layers in traditional DCNNs, such as VGG, into fully convolutional layers, thereby successfully achieving end-to-end labelling [16]. Badrinarayanan et al. [17] proposed SegNet to achieve pixel-level classification through a deep convolutional encoder-decoder architecture in which the decoder upsamples the lower-resolution feature maps. Chen et al. proposed the DeepLab architecture and its revised versions, which introduced atrous convolution and atrous spatial pyramid pooling (ASPP) modules into the deep encoder-decoder architecture for semantic segmentation [18–20]. Deep learning techniques for semantic segmentation have been developed in recent years for various computer vision tasks, such as autonomous vehicles and medical research, among many other applications [21]. However, implementing semantic segmentation with deep neural networks on remote-sensing images must overcome specific problems, including different data sources and scales [22]. For example, SegNet and ResNet have been efficiently implemented on multi-modal remote-sensing data using the FuseNet principle [23]. FCN has been used for slum mapping by transfer learning [24], and has been re-designed for automatic raft labelling in offshore waters through a dual-scale structure [25] or a U-Net [26].

Aquaculture ponds are shallow artificial water bodies that commonly have distinctly man-made shapes for efficient aquaculture production [10]. The ponds around inland lakes are formed gradually by embankment, partition, and regularization of other land cover types, such as cropland or natural lake water bodies. Because the shoreline of a natural lake winds along the surrounding terrain, its boundary shape is generally extremely irregular. On the other hand, the borders of aquaculture ponds are constructed on the principle of cost-saving, and straight lines are often used to delimit the boundary in a local area. Hence, the boundaries of aquaculture ponds have more regular shapes overall. Furthermore, when the human eye perceives satellite images where aquaculture ponds are densely distributed, the aquaculture ponds with their intertwining regular boundaries will be visual attention areas, because human perception commonly pays attention to parts of visual space where patterns can be acquired, according to the neuroscience and cognitive science literature [27].

Attention mechanisms have been extensively used for various visual tasks. The recurrent attention model is used for object recognition through a recurrent neural network (RNN) integrated with reinforcement learning to mimic the process of the human visual system as it recurrently determines the attention region. The attention mechanism on top of the RNN proposed by the neural machine translation community [28,29] was also adopted to perform image captioning by assigning different weights to image representations [30]. The self-attention mechanism without the RNN model is exploited in a super-resolution image generator [31], which is a variant of the Transformer [32], a cutting-edge deep neural network for language translation. Furthermore, self-attention mechanisms have been introduced into scene segmentation for modelling feature dependencies along the spatial and channel dimensions [33]. In remote sensing, attention models have also been used for object classification in various satellite images. For instance, attention mechanisms are integrated into multi-scale and feedback strategies of deep neural networks for pixel-wise classification of very-high-resolution satellite images [34]. The attention model is combined with a learning layer to capture class-specific feature dependencies [35].

When human beings visually identify densely distributed aquaculture ponds on remote-sensing images, the intertwining regular embankments around these ponds are prominent visual attention features. This paper is inspired by this visual attention mechanism used for human interpretation of satellite images. Moreover, the intertwining regular embankments are a salient feature that is available at different scales. The two motivations of this study are, first, to develop a novel attention mechanism that can mimic the process of the human visual system in recurrently determining the attention region, namely the intertwining regular embankments of aquaculture ponds, and, second, to evolve multi-scale visual attention through an encoder-decoder, fully convolutional network architecture that integrates the attention mechanism with atrous convolutions to better extract aquaculture ponds.
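A minimal sketch of row-wise self-attention follows, under the assumption that it uses standard scaled dot-product attention restricted to each row; the authors' RCSA formulation may differ, for example in its learned query/key/value projections, which are omitted here for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def row_wise_self_attention(feat):
    """Scaled dot-product self-attention within each row of an (H, W, C) feature map.

    Each pixel attends only to pixels in its own row, capturing the horizontal
    dependencies of regular embankments; identity projections stand in for the
    learned Q/K/V mappings of a full implementation.
    """
    h, w, c = feat.shape
    out = np.empty_like(feat)
    for r in range(h):
        q = k = v = feat[r]                    # (W, C)
        attn = softmax(q @ k.T / np.sqrt(c))   # (W, W) row-internal weights
        out[r] = attn @ v
    return out

x = np.random.default_rng(1).normal(size=(4, 6, 8))
y = row_wise_self_attention(x)
assert y.shape == x.shape
```

Applying the same function to the feature map with its spatial axes transposed gives the column-wise branch of the mechanism.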

Therefore, the main contributions of the paper can be summarized as follows:


#### **2. Materials**

#### *2.1. Study Area*

Hubei Province, known as the province of thousands of lakes, lies in the middle reaches of the Yangtze River and has densely distributed lakes. Hubei has a mature freshwater aquaculture industry, with large numbers of aquaculture ponds developed surrounding natural lakes. As shown in Figure 1, six regions with densely distributed aquaculture ponds were selected as study areas from three large lakes (Lake Liangzi, Lake Futou, and Lake Hong) along the Yangtze River because these are typical inland aquaculture areas in China. Among them, Lake Hong and Lake Liangzi are the two largest freshwater lakes in Hubei Province. The population in this part of China is dense, and aquaculture is highly developed. Lake Liangzi and its surroundings, however, have been relatively well protected since the 1980s. The six selected regions were divided into two categories: type I and type II. The type I regions, including regions A and B, are used for testing, whereas the type II regions are used for training. Region A is an area of 73.76 km<sup>2</sup> close to eastern Lake Hong, which is an artificial lake, and region B is an area of 33.92 km<sup>2</sup> close to eastern Lake Liangzi, which has been preserved in a state more like a natural lake.

**Figure 1.** Location of the study area. The pseudo-colour images (**A**,**B**) are pansharpening images using the near infrared, the red and the green band as red, green and blue. The corresponding labelling image for each is given below.

#### *2.2. Dataset*

The Landsat multispectral images were selected because of their long history, and bands such as the near infrared are beneficial for extracting water bodies. However, the spatial resolution of Landsat multispectral data is only 30 m. Panchromatic images with 2–2.5 m spatial resolution from the panchromatic and multispectral (PMS) camera of the GaoFen-1 (GF-1) satellite [36], the panchromatic remote-sensing instrument for stereo mapping (PRISM) of the ALOS satellite, and the NAD panchromatic sensor of the ZiYuan-3 (ZY-3) satellite [37] were also used to improve the recognition and extraction of aquaculture ponds and natural water bodies. Table 1 lists the images used for the selected study regions. The Landsat multi-spectral images used in this study were captured in the winters of 2010–2011 and 2013–2014 and the spring of 2015. High-resolution panchromatic images were used for fusion with the multi-spectral images. The panchromatic images were mainly selected from the GF-1 satellite and had acquisition dates close to those of the corresponding OLI images from the Landsat satellite, whereas panchromatic images from the ALOS satellite were used instead for the 2010 TM images. When ALOS or GF-1 panchromatic images with similar acquisition dates could not be found, panchromatic images from the ZY-3 satellite captured in the same season as the Landsat images in a nearby year were selected, because the ZY-3 satellite was only launched in 2012.

Three classes were included in the reference dataset (Figure 1): aquaculture ponds (artificial water surfaces), natural water surfaces, and background (non-water surfaces). The reference dataset was mainly generated by human visual interpretation. Field investigations were also conducted on some difficult-to-identify features, in cases where aquaculture ponds were mixed with small natural water surfaces (Figure 2).



**Figure 2.** Field photos of inland aquaculture ponds in Hubei Province, China. Aquaculture ponds are usually equipped with air pumps. (**A**) A branch of a natural lake. (**B**) An aquaculture pond equipped with oxygen pumps. (**C**) Below is a small river (natural water body) and above are several aquaculture ponds.

#### **3. Methodology**

To better understand the effectiveness of the proposed method for aquaculture pond segmentation, the methodology will be introduced in three parts: data pre-processing, the basic model, and a fusion strategy designed to further improve accuracy. In the preprocessing stage, the multi-spectral image and the corresponding 2 m panchromatic image were pansharpened. The pansharpened image was then fed into the proposed network, i.e., RCSANet, for semantic segmentation. The result generated from the network was finally fused with a water surface extraction image using the water index to further improve segmentation quality.

#### *3.1. Preprocessing*

Multi-spectral satellite images contain more spectral information, especially in the infrared spectral bands, which is beneficial for aquaculture pond identification, whereas panchromatic satellite images have higher spatial resolution, which helps to better distinguish the shape of the aquaculture ponds. To use both together, the multi-spectral and high-spatial-resolution panchromatic images must be pansharpened to obtain images with both rich spectral information and higher spatial resolution. First, multi-spectral images were synthesized by selecting the three bands (green, red, NIR) that are useful for water body identification. The pixel values were normalized and then mapped to the range (0, 255). Similarly, the gray values of the panchromatic images were also normalized and mapped to the range (0, 255). The multi-spectral images were re-projected into the coordinate system of the corresponding panchromatic images to ensure consistent coordinates. The multi-spectral and panchromatic images were then fused by the Gram-Schmidt method [38], a widely used high-quality pansharpening method that fuses a panchromatic image with multi-spectral images having any number of bands through orthogonalization of the different multi-spectral bands [39].
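The min-max mapping onto the (0, 255) range can be sketched as below. The Gram-Schmidt fusion itself is typically performed with remote-sensing software, so only the normalization step is shown, and the function name is illustrative:

```python
import numpy as np

def normalize_to_8bit(band):
    """Min-max normalize a band and map it onto the 0-255 range,
    as done for both the multi-spectral and panchromatic inputs."""
    band = np.asarray(band, float)
    lo, hi = band.min(), band.max()
    if hi == lo:
        # Flat band: avoid division by zero and return all zeros.
        return np.zeros_like(band)
    return (band - lo) / (hi - lo) * 255.0
```

For example, digital numbers of 100, 150, and 200 map to 0, 127.5, and 255, putting every band on a common radiometric scale before pansharpening.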

#### *3.2. Basic Model*

#### 3.2.1. Network Architecture

The deep neural network architecture for semantic segmentation of aquaculture ponds in the proposed method, depicted in Figure 3a, is based on an FCN framework that uses ResNet-101 [15] as the encoder to generate multiple semantic features. The encoding part produces feature maps through five convolution layers: the first convolution layer (Conv1), followed by a pooling layer, and four further convolution layers (Res-1 to Res-4), which are all residual subnetworks. The feature maps are abstract representations of the input image at different levels. Semantic segmentation by the FCN framework is a dense prediction procedure in that the coarse outputs of the convolution layers are connected by upsampling to produce pixel-level predictions. In the proposed method, the RCSA mechanism (introduced in Section 3.2.2) is applied to the coarse outputs at different levels of abstract representation (detailed in Section 3.2.3). Next, channel attention blocks (CAB), which were designed to assign different weights to features at different stages for consistency [40], are used to connect the coarse abstract representations from the encoder with the upsampling features at the decoder throughout the dense prediction procedure. The spatial sizes of the coarse outputs derived from the different convolution layers are kept consistent by the upsampling blocks (Figure 3c) to achieve end-to-end learning through backpropagation. Specifically, to accurately capture aquaculture ponds and their context information at multiple scales, the ASPP module combined with the RCSA mechanism (ASPP-RC) forms a branch from Conv1 to the end of the decoder before a 1 × 1 convolution layer and is integrated with the corresponding feature as a skip connection. To extract spatial context information at different scales, atrous convolutions with different rates, followed by the RCSA mechanism, are performed in parallel on the low-level feature map in the ASPP-RC module.
These branches for capturing features at different scales are connected by weighting each branch in terms of its own importance (Figure 3b, introduced in Section 3.2.4).

#### 3.2.2. RCSA Mechanism

When human beings use visual perception to understand remote-sensing images containing inland lakes with densely distributed aquaculture ponds, the ponds as a group are eye-catching. Attention focuses on the spatial dependencies between aquaculture ponds and their surroundings. To mimic this human visual mechanism, the proposed model first establishes inter-pixel contextual dependencies through bidirectional gated recurrent units (GRUs) [41], a powerful variant of the RNN, and then self-attention modules are applied on top of the bidirectional GRUs to establish this visual attention.

The self-attention mechanism is essentially a special case of the attention model. The unified attention model contains three types of inputs: key, value, and query [42], as depicted in Figure 4. The key and the value form a pair of data representations. Assume that there are *T* pairs ⟨*k<sub>i</sub>*, *v<sub>i</sub>*⟩ (*i* ∈ 1, . . . , *T*). By evaluating the similarity between a query *q* and each key, the model captures the weight coefficient of each key and then weights the corresponding values to derive the final attention values. The attention mechanism first scores the similarity between a query and a key by the *f* function:

$$e\_i = f(k\_i, q) \tag{1}$$

Then the original scores *e<sub>i</sub>* are normalized by a Softmax function to obtain the weight coefficients:

$$\begin{aligned} a\_i &= \mathbf{g}(e\_i) \\ &= \operatorname{softmax}(e\_i) \\ &= \frac{\exp(e\_i)}{\sum\_{j=1}^{T} \exp(e\_j)} \end{aligned} \tag{2}$$

Finally, the context vector *c<sub>t</sub>* is evaluated by a weighted sum of the values:

$$c\_t = \sum\_i a\_i v\_i \tag{3}$$

The attention model can be presented in the unified form:

$$c\_t = \text{Attention}(K, Q, V) = \text{Softmax}(f(K, q))V \tag{4}$$

The attention model becomes a self-attention mechanism when all inputs, including the query, the key, and the value, have the same value.
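Equations (1)–(4) together amount to scaled dot-product attention. The following minimal NumPy sketch (an illustration, not the authors' implementation) treats every element of the input as a query in turn and assumes the dot-product similarity *f*(*K*, *Q*) = *QK*<sup>T</sup>/√*d<sub>k</sub>* that the paper adopts in Section 3.2.2:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax, as in Eq. (2)."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(K, Q, V):
    """Unified attention of Eq. (4): score each key against the queries,
    softmax-normalize the scores (Eq. 2), and return the weighted sum
    of the values (Eq. 3)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # f(k_i, q) for every (query, key) pair
    weights = softmax(scores, axis=-1)   # weight coefficients a_i
    return weights @ V                   # context vectors c_t

T, d = 5, 8
X = np.random.randn(T, d)
# Self-attention: key, query, and value all take the same value.
C = attention(X, X, X)
```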

**Figure 3.** RCSANet: FCN architecture combined with RCSA mechanism for semantic segmentation of aquaculture ponds: (**a**) Network architecture; (**b**) ASPP-RC module; (**c**) Upsampling block. The input image of the entire deep neural network is a 256 × 256 pansharpening patch with three spectral channels. Through encoding and decoding, a three-channel matrix for classification was output through a 1 × 1 convolution layer at the end, and finally a Softmax layer gave a prediction map with the same size as the input image.

**Figure 4.** Attention models.

The RCSA mechanism takes a feature map, which is the convolutional result from the previous layer or the input image, as an input *x* ∈ R<sup>*h*×*w*×*C*</sup>, where *h*, *w*, and *C* are the numbers of rows, columns, and channels, respectively. The feature map can be spatially divided into *h* rows *r<sub>i</sub>* ∈ R<sup>1×*w*×*C*</sup> (*i* ∈ 1 . . . *h*) or *w* columns *c<sub>j</sub>* ∈ R<sup>*h*×1×*C*</sup> (*j* ∈ 1 . . . *w*). RCSA constructs spatial dependencies between pixels within a row or a column through the self-attention mechanism. Hence, the RCSA mechanism consists of two parallel branches, column-wise and row-wise self-attention, which are subsequently combined by summation, as shown in detail in Figure 5. In the upper branch, the row-wise self-attention mechanism first uses the bidirectional GRU model to depict the dependencies between the pixels in a row of the feature map:

$$r\_i' = BiGRU(r\_i) \tag{5}$$

Then the outcome *r*′<sub>*i*</sub> from the GRUs is fed into the self-attention model, by which the importance of the dependencies between pixels in the row is evaluated. The self-attention model is a specific variant of the attention model in which the input query, key, and value have the same value, as shown in Figure 4b. Three 1 × 1 convolution kernels, *W<sub>Q</sub>*, *W<sub>K</sub>*, and *W<sub>V</sub>*, are applied to *r*′<sub>*i*</sub>, so that the query, the key, and the value are obtained as *Q* = *W<sub>Q</sub>* ∗ *r*′<sub>*i*</sub>, *K* = *W<sub>K</sub>* ∗ *r*′<sub>*i*</sub>, and *V* = *W<sub>V</sub>* ∗ *r*′<sub>*i*</sub>, where "∗" denotes the convolution operation. These are then substituted into Equation (4):

$$\begin{aligned} c\_t &= \text{Attention}(W\_K r\_i', W\_Q r\_i', W\_V r\_i') \\ &= \text{Softmax}(f(W\_K r\_i', W\_Q r\_i')) \, W\_V r\_i' \end{aligned} \tag{6}$$

where the similarity function is *f*(*K*, *Q*) = *QK*<sup>T</sup>/√*d<sub>k</sub>*, with *d<sub>k</sub>* the dimension of the key. The computation for one row is traversed over each row of the feature map. Equivalently, in the bottom branch, the same operations are performed in parallel on each column of the feature map. Eventually, the two branches are combined with equal weights.
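A simplified NumPy sketch of the row-wise branch follows, with the column-wise branch obtained by transposing the feature map. For brevity the BiGRU of Eq. (5) is omitted, and the 1 × 1 convolutions *W<sub>Q</sub>*, *W<sub>K</sub>*, *W<sub>V</sub>* are modelled as per-pixel matrix multiplications; both are simplifying assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def row_self_attention(x, Wq, Wk, Wv):
    """Apply Eq. (6) to every row r_i of an (h, w, C) feature map.
    The 1x1 convolutions act as a linear map on each pixel's channel vector.
    (The BiGRU step of Eq. (5) is omitted in this sketch.)"""
    h, w, C = x.shape
    out = np.empty_like(x)
    for i in range(h):
        Q, K, V = x[i] @ Wq, x[i] @ Wk, x[i] @ Wv   # each (w, C)
        scores = Q @ K.T / np.sqrt(K.shape[-1])      # f(K, Q) within the row
        out[i] = softmax(scores, axis=-1) @ V
    return out

h, w, C = 4, 6, 8
x = np.random.randn(h, w, C)
Wq, Wk, Wv = (np.random.randn(C, C) for _ in range(3))
rows = row_self_attention(x, Wq, Wk, Wv)
# Column-wise branch: same operation on the transposed map, then
# the two branches are combined with equal weights by summation.
cols = row_self_attention(x.transpose(1, 0, 2), Wq, Wk, Wv).transpose(1, 0, 2)
rcsa = rows + cols
```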

**Figure 5.** Attention layer consisting of column-wise and row-wise self-attention models.

#### 3.2.3. RCSA for Dense Prediction

In semantic segmentation of remote-sensing images, dense prediction must fuse abstract representations of different levels from the encoder to improve pixel-level prediction. Visual attention to densely distributed aquaculture ponds can be incorporated into the dense prediction procedure. Consequently, the outputs of the different convolution blocks in the encoding part are processed by RCSA and then participate in dense prediction. These RCSA modules in the lateral connections enhance the features pixel-wise by assigning different weights, achieving a reasonable allocation of visual attention. This optimization takes place in a two-dimensional space made up of row and column vectors; however, the importance of different band channels must also be emphasized. The CAB module is therefore used directly to fuse encoder and decoder features by assigning different weights to channels.

#### 3.2.4. ASPP-RC Module

Atrous convolutions at different rates can enlarge the field of view so that spatial information at different scales can be extracted. Aquaculture ponds, which are water bodies of regular shape surrounded by dikes, are densely distributed close to inland lakes. These features show visual salience in remote-sensing images. Hence, the RCSA block is arranged next to each atrous convolution to selectively focus attention. In the ASPP-RC module, after the first convolution block of the encoder, the low-level feature map is processed in parallel by atrous convolutions with different rates combined with RCSA. Eventually, the branches are connected by:

$$I = \sum\_{i=1}^{5} w\_i \cdot b\_i \tag{7}$$

where *b<sub>i</sub>* is the feature map produced by the *i*th branch, in which the atrous convolution and RCSA are conducted in sequence, and *w<sub>i</sub>* is the weight of the *i*th branch, which evaluates the importance of the different scales. This is unlike the original ASPP structure, in which each branch has the same importance; in the proposed ASPP-RC module, the importance of each branch is adjusted adaptively. All weight parameters are initially defined by a random vector *w*<sup>0</sup><sub>*i*</sub>, which is optimized during backpropagation when training the whole network. Finally, these weights are normalized using a Softmax function:

$$w\_i = \operatorname{softmax}(w\_i^0) \tag{8}$$
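The adaptive branch weighting of Eqs. (7) and (8) can be sketched as follows. In the real network the weights *w*<sup>0</sup><sub>*i*</sub> are learnable parameters updated by backpropagation; here they are fixed random values for illustration only.

```python
import numpy as np

def fuse_branches(branches, w0):
    """Eqs. (7)-(8): softmax-normalize the raw branch weights w0 and
    form the weighted sum of the five branch feature maps."""
    w = np.exp(w0 - w0.max())
    w /= w.sum()                               # softmax over the 5 weights
    return sum(wi * bi for wi, bi in zip(w, branches))

# Five branch outputs b_i (atrous convolution + RCSA), same spatial size.
branches = [np.random.randn(16, 16, 8) for _ in range(5)]
w0 = np.random.randn(5)   # randomly initialized; learned during training
I = fuse_branches(branches, w0)
```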

#### *3.3. Fusion Strategy*

To further improve the segmentation quality of aquaculture ponds, normalized difference water index (NDWI) maps derived from the pansharpened images are fused with the prediction probability matrices from the proposed network to produce the final classification result (Figure 6). This implementation is called "RCSANet-NDWI". The classification probability matrices are produced from the probability maps of the three classes (aquaculture ponds, natural water surfaces, and background) after the Softmax layer. Both aquaculture ponds and natural water surfaces are water bodies surrounding inland lakes. Hence, the water extraction index, a typical representation of the spectral characteristics of a water body used to distinguish ground features, was employed: the NDWI maps provide prior knowledge for aquaculture pond extraction. Through OTSU threshold binary segmentation [43], the NDWI maps were divided into water and non-water parts, and the water parts were used to refine the three-class probability matrix described earlier. Assume that the original probability matrix *P*<sub>0</sub> and the refined matrix *P* are both *h* × *w* × *c* in size, whereas the NDWI map *S* is *h* × *w* in size; *c* is the number of channels, *k* is the channel ID, and the *k*th channel represents the *k*th class. Hence, *k* = 1, 2, 3 represent background, water, and aquaculture ponds, respectively. The fusion operation can be defined as:

**Figure 6.** Fusion strategy.

$$P^{ijk} = \begin{cases} P\_0^{ijk}, & k \neq 1 \\ y^{ij} \cdot P\_0^{ijk}, & k = 1 \end{cases} \tag{9}$$

where *y* is the indicator variable:

$$y^{ij} = \begin{cases} 0, & S^{ij} = 1 \\ 1, & S^{ij} = 0 \end{cases} \tag{10}$$

For the pixel in the *i*-th row and *j*-th column of *S*, if its value is 1 (representing water), the corresponding background probability (*k* = 1) in *P*<sub>0</sub> is set to 0, and the fused matrix *P* is generated. The final classification maps can then be obtained using the maximum-probability judgment, the usual method for mapping a probability matrix to the final label image: the classification label of a pixel is the class with maximum probability, *l<sub>ij</sub>* = arg max<sub>*k*</sub> (*p*<sup>*k*</sup><sub>*ij*</sub>), where *p*<sup>*k*</sup><sub>*ij*</sub> is the probability that the pixel in the *i*-th row and *j*-th column belongs to class *k*. With the NDWI, interference from the background in water body extraction is eliminated, because the background probability is set to 0 wherever the NDWI map indicates water.
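A minimal NumPy sketch of the fusion operation in Eqs. (9) and (10). Note two assumptions: the channel indices here are 0-based (background = channel 0) rather than the 1-based *k* = 1, 2, 3 used in the text, and the Dirichlet-sampled probabilities are placeholders for the network's Softmax output.

```python
import numpy as np

def fuse_with_ndwi(P0, S):
    """Eqs. (9)-(10): where the binary NDWI map S marks water (S == 1),
    zero out the background probability (channel 0 here) and then apply
    the maximum-probability judgment to obtain the label map."""
    P = P0.copy()
    P[..., 0] = np.where(S == 1, 0.0, P0[..., 0])
    return P.argmax(axis=-1)

h, w = 8, 8
# Placeholder probabilities for (background, water, aquaculture ponds).
P0 = np.random.dirichlet(np.ones(3), size=(h, w))
S = (np.random.rand(h, w) > 0.5).astype(int)   # binary NDWI water mask
labels = fuse_with_ndwi(P0, S)
```

Because the background probability of every NDWI-water pixel is forced to 0, such pixels can only be labelled water or aquaculture pond, which is exactly how the fusion removes background interference.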

#### **4. Experiments**

This section describes a series of qualitative and quantitative comprehensive evaluations that were conducted using the proposed methods with the dataset introduced in Section 2.

#### *4.1. Experimental Set-Up*

The inputs of the proposed network were 256 × 256 pansharpened patches with three spectral channels. Table 2 lists the parameters of the convolution kernels, which are the basic operators of the different modules in the proposed RCSANet. The rate parameter means that the convolution kernels in the different atrous convolution branches of the ASPP-RC module have different padding and dilation configurations, which are set to 6, 12, and 18, respectively, according to Figure 3b. Validation consisted of two parts:


user's accuracy, producer's accuracy, and kappa coefficients were calculated to assess aquaculture pond extraction accuracy on the 2 m spatial resolution pansharpened images.

**Table 2.** Parameters used for the convolution kernels in the various modules in RCSANet.


In addition, the proposed methods were divided into two versions, RCSANet (without NDWI fusion) and RCSANet-NDWI (with NDWI fusion), to verify the role of NDWI fusion. Three state-of-the-art segmentation methods, DeeplabV3+ [20], Reseg [44], and the Homogeneous Convolutional Neural Network (HCN) [25], were selected for comparison. In addition, the performance of SVM was assessed as a representative of traditional machine learning methods that directly use each pixel as a feature. DeeplabV3+ is an FCN method for semantic segmentation with the help of an ASPP module. Reseg is a hybrid deep network for semantic segmentation; in addition to CNN layers, it also uses bidirectional GRUs to capture contextual dependencies. HCN was originally proposed for automatic raft labelling and is considered to have potential for aquaculture pond extraction. HCN was implemented following the settings in [25], and Resnet-101 was used as the encoder in DeeplabV3+, Reseg, and the proposed methods.

In the present experiments, the parameters of the proposed methods were optimized by mini-batch stochastic gradient descent using a momentum algorithm with a batch size of 2. The learning rate was set to 10<sup>−2</sup> and decayed with training epoch according to the "polynomial" strategy. The number of training epochs was set to 40. The SVM was implemented with the help of the LIBSVM package [45], and its two important factors, *C* and *γ*, were determined through a five-fold cross-validation grid search. Except for HCN, which was run using TensorFlow 1.9.0, the deep learning-based algorithms were implemented in Pytorch 1.1.0. All deep learning methods were run on a single NVIDIA GeForce GTX 1080 GPU.
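The "polynomial" decay strategy mentioned above can be sketched as follows; the exponent (power = 0.9) is a common default and an assumption here, since the paper does not state it.

```python
def poly_lr(base_lr: float, epoch: int, max_epochs: int, power: float = 0.9) -> float:
    """Polynomial learning-rate decay: shrink base_lr toward 0 over training.
    power = 0.9 is a typical choice, assumed here rather than taken from the paper."""
    return base_lr * (1 - epoch / max_epochs) ** power

# Schedule matching the stated setup: base LR 1e-2, 40 training epochs.
lrs = [poly_lr(1e-2, e, 40) for e in range(40)]
```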

#### *4.2. Results*

The performance of the various semantic segmentation methods in Part 1 of the experiments is depicted in Table 3. Clearly, the deep learning-based methods perform better than the traditional SVM algorithm because the latter cannot perceive spatial semantic information in the image. DeeplabV3+ is a state-of-the-art FCN method that has been widely used; Resnet-101 was also chosen as its backbone. HCN is a deep convolutional neural network for automatic raft labelling, and Reseg is a deep recurrent neural network for semantic segmentation. The classification accuracy of the proposed methods for natural water surfaces and aquaculture ponds was consistently better than that of the other methods. Meanwhile, compared with DeeplabV3+, the two versions of the proposed method improved overall accuracy by more than 7%, and their Kappa coefficients were greater than 0.72, indicating that the proposed method is significantly better than DeeplabV3+. Moreover, the results also demonstrate the effect of the proposed fusion strategy, because RCSANet-NDWI further surpassed RCSANet on most metrics.


**Table 3.** Performance comparison of different methods for semantic segmentation of aquaculture ponds in Part 1 of the experiments (%).

Figure 7 gives a detailed display of the classification results in Part 1 of the experiment. Inland water areas contain various natural water bodies as well as aquaculture ponds. These natural water bodies greatly interfere with the segmentation of aquaculture ponds, making pixel-scale classification intricate. Figure 7 shows that SVM misclassified many aquaculture ponds as natural water bodies and many natural water bodies as aquaculture ponds, indicating that the traditional pixel-based method cannot efficiently distinguish the two. The segmentation maps created by DeeplabV3+ look significantly better than those from SVM, but in some difficult zones where natural water bodies resemble aquaculture ponds, DeeplabV3+ also ran into its performance limits, misclassifying natural water bodies as aquaculture ponds (area in the 7th row) or aquaculture ponds as natural water bodies (areas in the 5th row). HCN, which performs well for raft-culture extraction in offshore waters, performed poorly on semantic segmentation of inland aquaculture ponds, with serious misclassifications. Reseg, which combines CNN and bidirectional GRUs, can perform semantic segmentation of aquaculture ponds. However, its identification of natural water bodies that closely resemble aquaculture ponds around inland lakes is not as good as that of the proposed methods. In Table 3, the overall accuracy of Reseg exceeds 80%, but its Kappa coefficient is less than 0.7. This shows that Reseg establishes spatial relationships through its GRUs, which has a certain effect on the segmentation of aquaculture ponds around inland lakes, but the effect is not good enough. In the Reseg segmentation map, many objects are stuck together, and the edges of aquaculture ponds are not well delineated.
Among these result maps, the two versions of the proposed method separated natural water bodies and aquaculture ponds more satisfactorily than the other methods. The ASPP-RC module of the proposed method feeds back the details at different scales into the low-level feature map, which can draw visual attention to the decoding part. This facilitates identification of the thin edges surrounding the aquaculture ponds in semantic segmentation. Hence, the edges of aquaculture ponds were clearly identified in most cases, as shown in the results from RCSANet and RCSANet-NDWI. Finally, note that RCSANet-NDWI further improved the quality of aquaculture pond extraction compared with RCSANet.

Table 4 provides assessment results for the various algorithms in Part 2 of the experiment, showing the extraction accuracies for the aquaculture pond and natural water surface classes in the two experimental areas (regions A and B in Figure 1) for different sensors. The overall accuracy and Kappa coefficient show that the two versions of the proposed method (RCSANet and RCSANet-NDWI) both performed better than the other methods, regardless of sensor or area. Moreover, compared with RCSANet, the accuracy of RCSANet-NDWI was further improved with the aid of NDWI fusion. In region A, their overall accuracies on pansharpened images from different sensors were greater than 85%, and the Kappa coefficients were all greater than 0.7. These results were better than those of the other deep learning-based methods, let alone SVM. In region B, the proposed methods still performed best. Unlike region A, where the lake is greatly influenced by nearby residents, causing the aquaculture ponds to be neatly and regularly distributed, the aquaculture ponds in region B are more sparsely distributed.

Region B is relatively well protected, and some small natural water bodies, which are easily confused with aquaculture ponds and interfere with network identification, were produced when the lake was split for artificial development. Hence, the situations in the two regions are completely different, which demonstrates the stability of the proposed methods under various scenarios. The overall accuracies of the proposed methods on pansharpened images from the different regions were close to or greater than 80%. In addition, it should be noted that the user's accuracy in identifying natural water bodies is relatively high for almost all methods. This is because natural water bodies tend to be extensive, homogeneous, self-contained, and distributed in aggregates, which makes them easier for a classifier to recognize. Compared with the proposed methods, Reseg and DeeplabV3+ may also obtain higher user's accuracy in some cases. However, because of their limited recognition ability, they cannot explicitly judge the difference between aquaculture ponds and natural water bodies (Figure 8).

**Figure 7.** Semantic segmentation results for 256 × 256 pixel image patches from test set in Part 1 of the experiment. The leftmost column gives the sensors or satellites to which the multispectral and panchromatic data of the pansharpened images belong, and the bottom row lists the different methods by which the semantic segmentation images in the same column were obtained.


**Table 4.** Accuracy evaluation for the two classes, aquaculture ponds and natural water surfaces, in each experimental area (%).

Figure 8 shows the classification results in the two study regions. Extracting aquaculture ponds in region B is more difficult than in region A because region B contains more natural water bodies that are hard to distinguish from aquaculture ponds. The two versions of the proposed method performed significantly better than the other methods for aquaculture pond extraction. The proposed methods were predominantly successful in predicting aquaculture ponds that are divided into regular shapes by embankments, as well as the natural water bodies in the two regions. In region A, the proposed methods extracted almost all aquaculture ponds compared with the ground truth, whereas the other methods failed, especially in the upper part of the scene. In region B, compared with Reseg, the proposed methods had lower misclassification rates, and the natural rivers located at the bottom, which could not be identified by Reseg, were not misclassified as aquaculture ponds by the proposed methods. Moreover, the shapes of the ponds are best retained in the results of the proposed method. The advantage of the proposed method is the RCSA mechanism for determining salient pixels in a row or column, which is essentially a description of the pixel-level context. This enables the proposed method to identify detailed features of the 2 m spatial resolution imagery, where the dikes around aquaculture ponds are exactly such pixel-level details. Hence, the aquaculture ponds in region B were more fully extracted by the proposed RCSANet than by other state-of-the-art methods such as DeeplabV3+ and Reseg. On the other hand, fusion with NDWI better distinguishes water surfaces, including natural water bodies and aquaculture ponds, from the background. In effect, the proposed method with NDWI recovers water surfaces leaked into the background, which improves the producer's accuracy for aquaculture ponds. However, it also causes a small part of the background to be mistakenly classified as water surface.

**Figure 8.** Semantic segmentation results for aquaculture ponds and natural water bodies by various methods: (**a**) in region A, using the pansharpened image with the TM multispectral image captured in January 2011 and the ZY-3 panchromatic image in January 2013; and (**b**) in region B, using the pansharpened image with the OLI multispectral image captured in February 2014 and the GF-1 panchromatic image in January 2014.

#### **5. Discussion**

This study has used a fully convolutional network architecture with row- and column-wise self-attention to semantically segment aquaculture ponds around inland lakes. Artificial aquaculture ponds around inland lakes are small, and the dikes between these ponds are only about 2 m wide. On medium-resolution multispectral images, water pixels are first separated from land, water objects are then formed based on connectivity, and finally these water objects are classified as natural water bodies or aquaculture ponds using geometric characteristics [8]. However, for inland lake areas where aquaculture ponds are intensively distributed with narrow dikes (e.g., Lake Hong), the 15–30 m spatial resolution of the imagery limits the capability of such object-based methods to accurately extract aquaculture ponds. Hence, finer-spatial-resolution images were considered for pond extraction. By fusing multi-spectral information into panchromatic images from the GF-1, ZY-3, or ALOS satellites, the spatial resolution of the resulting images reaches 2 m, enabling the identification of thin, narrow dams, while the multi-spectral information is used to recognize water. The segmentation results show that the proposed network structure is capable of extracting these regular pond boundaries, mainly because semantic segmentation of the aquaculture ponds benefits from establishing spatial relationships between pixels in the same direction through the self-attention model. Although HCN is also an FCN-based method, originally used for automatic raft labelling [25], its performance in extracting aquaculture ponds around inland lakes is not as effective as for raft-culture labelling, because the spatial context of raft culture in coastal areas is much simpler than that of inland lake areas.
In general, through high-spatial-resolution images that incorporate multi-spectral and panchromatic data, the proposed RCSANet enables the extraction of large-scale aquaculture ponds around inland lakes where complex spatial contexts of water surfaces exist. However, recognizing small water bodies in such complex spatial contexts remains challenging. Experimental region B was in the process of recovering aquaculture ponds and farmland as lake area from 2011 to 2014. Therefore, various aquaculture ponds and natural water bodies are spatially mixed in the images pansharpened from the 2011 and 2014 multispectral and panchromatic data, which poses great challenges for semantic segmentation of aquaculture ponds. For example, Figure 9c,d show images of the same area, which changed significantly between 2011 and 2014. Several small reservoirs are apparent in Figure 9c, but their profiles had changed significantly in Figure 9d, and the left side of this area had been recovered into a large lake. The segmentation results in Figure 9g,h show that the restored large lake is well segmented, but the small reservoirs are easily misclassified as aquaculture ponds or missed altogether.

In this paper, aquaculture pond extraction is performed on images that pansharpen multi-spectral data from Landsat satellites with panchromatic data from other satellites in the same period, and therefore the semantic segmentation might also be affected by the spectral range of the panchromatic image. Table 5 gives the results of an accuracy analysis that divided the training data of Part 2 of the experiment into two portions: pansharpened TM images and pansharpened OLI images. The predicted results for pansharpened TM images from region B are based on RCSANet trained by fusing TM images with panchromatic images from the ZY-3 or ALOS satellites, whereas the predicted results for pansharpened OLI images from region B are based on RCSANet trained by fusing OLI images with panchromatic images from the GF-1 satellite. Table 5 shows that the results for OLI images pansharpened with GF-1 panchromatic images are significantly better than those for TM images pansharpened with ZY-3 or ALOS panchromatic images. The spectrum of GF-1 panchromatic images ranges from 0.45 to 0.90 µm, which completely covers the NIR, red, and green bands of Landsat OLI data, whereas the spectrum of ZY-3 or ALOS panchromatic images only partly covers the NIR band of the TM sensor. Because the TM images were acquired before 2012, it is difficult to use GF-1 panchromatic images to pansharpen TM images.

**Figure 9.** Influence of multi-year changes on semantic segmentation results for aquaculture ponds. (**a**,**b**) are pansharpened images of the significantly changed area from Region B in 2010 and 2014 respectively. (**c**,**d**) are magnified images. (**e**,**f**) are labelling images for (**c**,**d**). (**g**,**h**) are semantic segmentation results for (**c**,**d**) using the proposed method.

RCSANet can extract aquaculture ponds around inland lakes from 2 m satellite images more accurately than other methods because of the involvement of two connection groups from the encoder to the decoder. The first connection group is the combination of the RCSA module and the ASPP-RC module, which links Conv1 of the encoder to the decoder. The second consists of the RCSA modules linking Res-1, Res-2, and Res-3 of the encoder to the decoder. Table 6 indicates that the first connection group of RCSANet achieves an additional 2.32% overall accuracy gain over RCSANet<sup>1</sup>, and the second connection group brings a 3.59% overall accuracy gain over RCSANet<sup>2</sup>, i.e., a plain FCN architecture based on the ResNet-101 model. Nevertheless, the connections consume more computing resources because they involve the non-local self-attention mechanism, which contains many inner-product operations. Moreover, RCSANet is an encoder-decoder architecture in which gradual upsampling is conducted, requiring more memory and calculation time. Table 7 shows that RCSANet consumes more memory and more training and prediction time than the DeeplabV3+ and Reseg methods. Sacrificing some computing resources to achieve higher aquaculture pond extraction accuracy is feasible, especially as GPU performance continues to improve.

**Table 5.** Accuracy comparison of semantic segmentation results for region B, dividing the training data of Part 2 of the experiment into pansharpened TM images and pansharpened OLI images (%).


**Table 6.** Accuracy evaluation for RCSANet and its two variants RCSANet<sup>1</sup> and RCSANet<sup>2</sup> in Part 1 of the experiments (%). RCSANet<sup>1</sup> is an RCSANet variant ablating the first connection group, and RCSANet<sup>2</sup> is the variant ablating both connection groups.


**Table 7.** Performance evaluation of different deep-learning based methods for semantic segmentation of aquaculture ponds in Part 2 of the experiments.


#### **6. Conclusions**

This study has implemented a semantic segmentation network on high-spatial-resolution satellite images for aquaculture pond extraction. A row- and column-wise self-attention (RCSA) mechanism has been proposed to capture the intertwining regular embankments of aquaculture ponds in feature maps, and then a fully convolutional network framework combined with the RCSA mechanism is proposed for semantic segmentation of aquaculture ponds. The proposed methods have been evaluated on high-spatial-resolution pansharpened images obtained by fusing multi-spectral and panchromatic images in typical regions with inland lakes and densely distributed aquaculture ponds. Experiments on satellite images of both a highly developed lake and a reserved lake show that the overall accuracy of the proposed method is significantly better than those of other methods (3–8% overall accuracy gains at Lake Liangzi and 1–2% overall accuracy gains at Lake Hong over the best of other methods). Specifically, from the experimental semantic segmentation results for large regions, detailed information, such as the embankments of aquaculture ponds, can be more accurately identified by the proposed method. It can be concluded that the proposed method is effective for large-scale extraction of aquaculture ponds. In addition, RCSANet-NDWI further improves the accuracy of the proposed method compared with RCSANet, indicating the significance of the proposed NDWI fusion strategy. For future study, the proposed methods can be extended to raft-culture extraction in offshore waters.

**Author Contributions:** Conceptualization, Z.Z., W.T. and G.Y.; methodology, Z.Z. and D.W.; software, D.W. and J.Y.; validation, D.W., J.Y., B.L. and Z.W.; investigation, Z.Z., W.T. and G.Y.; writing—original draft preparation, Z.Z.; writing—review and editing, W.T., D.W. and G.Y.; visualization, D.W. and J.Y. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported partially by the National Key Research and Development Program of China under Grant 2017YFC1405600, partially by the Fundamental Research Funds for the Central Universities under Grant 18CX02060A and CCNU19TD002, and partially by the National Natural Science Foundation of China under grant 41506208.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author.

**Acknowledgments:** We are particularly grateful for help in the field and with high-resolution imagery from Jianhua Huang, National & Local Joint Engineering Research Center of Satellite Navigation and Location Service, Guilin University of Electronic Technology. We would like to acknowledge the invaluable input and assistance from the rest of the College of Oceanography and Space Informatics, China University of Petroleum, including Xuxu Kong, Weipeng Lu, Yachao Chen and Sai Zhang.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

