## *4.1. CNN Data Processing*

When the traditional method retrieves AOD and haze concentrations from remote sensing images, the angle information must be extracted, cut, and synthesized, which is a relatively cumbersome process. After evaluating the zenith and azimuth angles, we decided not to model their changes, since these angles are relatively fixed when the satellite passes over the same area in the same season. Therefore, when preprocessing the remote sensing data, the CNN method only needs to extract, cut, and synthesize the reflectivity and emissivity. The processed data are then stored chronologically by season. A convolutional neural network is then used to fit their nonlinear relationship, and finally the haze level is determined by the classification layer.

According to the channel information, channels 1–7 monitor the edges and characteristics of land and clouds. The wavelength and spatial resolution of each channel are shown in Table 1. We want to convert the satellite image into a three-channel RGB image to feed into the convolutional neural network. Given the wavelength ranges of visible light shown in Table 2, the three channels that best fit the R, G, and B bands are channel 1, channel 4, and channel 3, so we combine the data of these three channels to obtain a true-color image. The synthesized image is shown in Figure 4. The AQI and PM2.5 concentration corresponding to each haze level are shown in Table 3.
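The channel combination above can be sketched as a simple band-stacking step. This is a minimal illustration, not the authors' actual preprocessing pipeline; the function name and the min–max scaling are assumptions, and the toy inputs stand in for calibrated MODIS channel-1, channel-4, and channel-3 reflectances.

```python
import numpy as np

# Hypothetical band arrays: reflectances of MODIS channels 1, 4, 3
# (red, green, blue), each of shape (height, width).
def compose_true_color(ch1_red, ch4_green, ch3_blue):
    """Stack three single-band arrays into an RGB image scaled to [0, 255]."""
    rgb = np.stack([ch1_red, ch4_green, ch3_blue], axis=-1).astype(np.float64)
    rgb = (rgb - rgb.min()) / (rgb.max() - rgb.min() + 1e-12)  # min-max scale to [0, 1]
    return (rgb * 255).astype(np.uint8)

# Toy example with random reflectance fields in place of real data
h, w = 4, 4
img = compose_true_color(np.random.rand(h, w),
                         np.random.rand(h, w),
                         np.random.rand(h, w))
print(img.shape)  # (4, 4, 3)
```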

**Table 1.** Spatial resolution and internal wavelength of each channel in MOD02-1 km.


**Table 2.** Internal visible light wavelength.


**Figure 4.** (**a**) is the satellite image with full channel information, and (**b**) is the synthesized RGB image.


**Table 3.** Correspondence table among haze level, air quality index, and PM2.5 concentration.

## *4.2. CNN Structure*

The general structure of the CNN network proposed in this article is shown in Figure 5.

**Figure 5.** Structure of convolutional neural network (CNN).

Input layer: If the input data are RGB true-color images, the input format is *nh* × *nw* × 3; if they are grayscale images, the input format is *nh* × *nw* × 1. The input data should be normalized, and the image size and number of channels must be consistent across samples.
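The input-layer preprocessing can be sketched as below. This is a minimal illustration, assuming 8-bit pixel values scaled to [0, 1]; the paper does not specify the exact normalization, so the division by 255 is an assumption.

```python
import numpy as np

# Sketch of the input-layer preprocessing: every image has a fixed shape
# (n_h, n_w, channels) and its pixel values are normalized.
def normalize_input(image):
    """Scale 8-bit pixel values to [0, 1] (assumed normalization scheme)."""
    return image.astype(np.float64) / 255.0

# Toy RGB input of shape (n_h, n_w, 3)
x = normalize_input(np.full((8, 8, 3), 255, dtype=np.uint8))
print(x.shape)  # (8, 8, 3)
```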

Convolution layer: The convolution operation is denoted C, where *f* represents the length and width of the convolution kernel. The length and width are equal, and the number of kernel channels equals the number of input channels. *m* represents the number of convolution kernels. The convolution operation is shown in Equation (2):

$$y\_{i,j}^{l} = \sum\_{r=0}^{m-1} \sum\_{s=0}^{f-1} \sum\_{t=0}^{f-1} W\_{s,t}^{(r,l)} x\_{i+s,j+t}^{l-1} + b^{l} \tag{2}$$

The first summation traverses all convolution kernels once. The second and third summations perform the convolution of an *f* × *f* kernel with the input, where *W* is the weight and *b* is the bias. The indices *i*, *j* denote the position in the output layer, as shown in Equation (3):

$$\begin{array}{l} i = 1, 2, \ldots, (n\_h - f + 1) \\ j = 1, 2, \ldots, (n\_w - f + 1) \end{array} \tag{3}$$
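Equations (2) and (3) can be written directly as a loop-based sketch. This is an illustrative NumPy implementation, not the authors' code; the function name and toy inputs are assumptions, but the arithmetic follows the equations term by term.

```python
import numpy as np

# Direct implementation of Equation (2): m kernels of size f x f x c slide
# over an input of shape (n_h, n_w, c); each output position (i, j) of
# feature map r is a windowed weighted sum plus a per-layer bias.
def conv2d(x, W, b):
    """x: (n_h, n_w, c); W: (m, f, f, c); b: (m,). Returns (n_h-f+1, n_w-f+1, m)."""
    n_h, n_w, _ = x.shape
    m, f, _, _ = W.shape
    out = np.zeros((n_h - f + 1, n_w - f + 1, m))
    for r in range(m):                      # first sum: traverse all kernels
        for i in range(n_h - f + 1):        # output positions from Eq. (3)
            for j in range(n_w - f + 1):
                # inner sums: f x f window across all input channels
                out[i, j, r] = np.sum(W[r] * x[i:i+f, j:j+f, :]) + b[r]
    return out

# Toy check: all-ones input and kernels, zero bias
x = np.ones((5, 5, 3))
W = np.ones((2, 3, 3, 3))
b = np.zeros(2)
y = conv2d(x, W, b)
print(y.shape)            # (3, 3, 2)
print(float(y[0, 0, 0]))  # 27.0 = 3*3*3 ones
```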

Activation function: The Sigmoid activation function is used, as shown in Equation (4):

$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{4}$$
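Equation (4) in code, as a one-line sketch:

```python
import numpy as np

# Equation (4): the logistic (Sigmoid) activation function.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))  # 0.5
```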

Pooling layer: This is an essential step in a convolutional neural network, also called the down-sampling layer; the pooling window is generally square, with equal length and width. The pooling process is shown in Equation (5):

$$y\_{i,j}^l = \max\_{0 \le u,v < f} \left[ \mathrm{ReLU}\left( \sum\_{r=0}^{m-1} \sum\_{s=0}^{f-1} \sum\_{t=0}^{f-1} W\_{s,t}^{(r,l)} x\_{i+u+s,\,j+v+t}^{l-1} + b^{l} \right) \right] \tag{5}$$

When the input is slightly translated, the pooled output does not change, which improves the robustness of the features extracted by the convolutional network. This translation invariance is a very practical property.
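The pooling step and its translation invariance can be demonstrated with a short sketch. This is an illustrative max-pooling over a single pre-activated feature map (the convolution and ReLU of Equation (5) are assumed to have been applied already); the non-overlapping stride is an assumption.

```python
import numpy as np

# Sketch of the pooling step: a non-overlapping f x f max-pooling window
# applied to an already-activated feature map.
def max_pool(x, f):
    """x: (n_h, n_w); returns (n_h // f, n_w // f) of window maxima."""
    n_h, n_w = x.shape
    out = np.zeros((n_h // f, n_w // f))
    for i in range(n_h // f):
        for j in range(n_w // f):
            out[i, j] = np.max(x[i*f:(i+1)*f, j*f:(j+1)*f])
    return out

# Translation robustness: shifting a feature within a pooling window
# leaves the pooled output unchanged.
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0   # same feature shifted by one pixel
print(np.array_equal(max_pool(a, 2), max_pool(b, 2)))  # True
```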

Fully connected layer: In the studied model there are six haze-level categories, with corresponding one-hot labels (0 0 0 0 0 0 1), (0 0 0 0 0 1 0), ..., (0 1 0 0 0 0 0). Data that are disturbed by clouds or other interference and cannot be identified are marked by setting the first position to 1, i.e., (1 0 0 0 0 0 0).
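The label scheme above can be sketched as a 7-way one-hot encoding. The mapping of haze levels to vector positions beyond what the text states (index 0 = unidentifiable, cloud-covered data) is an illustrative assumption.

```python
import numpy as np

# Sketch of the 7-dimensional one-hot labels: index 0 marks cloud-covered,
# unidentifiable data; the remaining indices cover the six haze levels
# (the exact level-to-index mapping here is assumed for illustration).
def one_hot(label_index, n_classes=7):
    v = np.zeros(n_classes, dtype=int)
    v[label_index] = 1
    return v

print(one_hot(0))  # [1 0 0 0 0 0 0] -> unidentifiable (cloud-covered)
print(one_hot(6))  # [0 0 0 0 0 0 1] -> one of the six haze levels
```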

SoftMax classification layer: The classification process estimates the probability that the feature vector belongs to each category, and the category with the highest probability is the classification result, as shown in Equation (6):

$$h\_{\theta}\left( x\_{i}^{l-1} \right) = \begin{bmatrix} p(y\_{i} = 1 \mid x\_{i}^{l-1}; \theta) \\ p(y\_{i} = 2 \mid x\_{i}^{l-1}; \theta) \\ \vdots \\ p(y\_{i} = n \mid x\_{i}^{l-1}; \theta) \end{bmatrix} = \frac{1}{\sum\_{j=1}^{n} e^{\theta\_{j}^{T} x\_{i}^{l-1}}} \begin{bmatrix} e^{\theta\_{1}^{T} x\_{i}^{l-1}} \\ e^{\theta\_{2}^{T} x\_{i}^{l-1}} \\ \vdots \\ e^{\theta\_{n}^{T} x\_{i}^{l-1}} \end{bmatrix} \tag{6}$$

where $p(y\_i = n \mid x\_i^{l-1}; \theta)$ represents the estimated probability that the data belong to the *n*th category, and *θ* represents the model's parameters. The rightmost expression of the equation is the normalized form of the probabilities, so that all the probabilities sum to 1.
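Equation (6) can be sketched directly in code. The parameter matrix and feature vector below are toy values chosen for illustration; the max-subtraction is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

# Equation (6) in code: softmax over the class scores theta_j^T x for an
# input feature vector x.
def softmax_probs(theta, x):
    """theta: (n, d) parameter matrix; x: (d,). Returns (n,) class probabilities."""
    scores = theta @ x
    scores = scores - scores.max()   # numerical stability; result is unchanged
    exp = np.exp(scores)
    return exp / exp.sum()

# Toy example: 3 classes, 2-dimensional features
theta = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
x = np.array([2.0, 0.0])
p = softmax_probs(theta, x)
print(abs(p.sum() - 1.0) < 1e-12)  # True: probabilities normalize to 1
print(int(np.argmax(p)))           # 0: the highest-probability class is the result
```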
