## **1. Introduction**

Precipitation can be divided into stratiform precipitation and convective precipitation [1]. A convective precipitation system is generally characterized by strong upward motion, small areal coverage and high precipitation intensity, while a stratiform precipitation system is characterized by weak upward motion, large areal coverage and weak precipitation intensity. The classification of precipitation can be used in meteorological research, weather forecasting and meteorological disaster prevention. First, convective and stratiform precipitation have different precipitation growth mechanisms and follow different physical principles. Research on convective and stratiform classification can provide a better understanding of the physical mechanisms of clouds. Additionally, convective systems have an important influence on the thermal balance of the atmosphere [2,3], and thermodynamic differences can lead to different latent heat distributions, moisture cycling, and cold rain and warm rain processes, which can have different effects on cloud lifetime and Earth's climate [1,4]. The energy of a convective system is often expressed in terms of the apparent heat sources and apparent moisture sinks. An apparent heat source is difficult to measure but can be estimated from precipitation [5,6], and different types of precipitation can reflect different thermodynamic structures. Finally, precipitation estimation plays an important role in understanding the hydrological cycle, reducing uncertainties in global climate change model predictions for future environmental scenarios, weather forecasting and disaster prevention [7–9]. In the traditional method of precipitation estimation, a Z-R relationship is adopted, and the estimated precipitation is calculated through the relation between the radar echo intensity and precipitation. However, different types of clouds have different structures and precipitation growth mechanisms, and a single Z-R relation cannot provide a good estimate of precipitation. Classifying precipitation clouds and adopting different Z-R relations for different types of clouds helps to improve the accuracy of precipitation estimation [10,11]. However, there is no clear boundary between stratiform precipitation and convective precipitation, and it is difficult to distinguish one from the other directly from radar data. As a result, a number of methods have been developed to classify precipitation types.

In an early study, ground rain gauges were used for classification. This method classifies rainfall as convective when the gauge data exceed some background level by a certain amount [12,13]. The background-exceedance technique (BET) uses radar reflectivity to identify the convective core in a certain plane and sets a radius of influence; the area inside the influence radius is considered the convective rain zone, and the area outside the influence radius is the stratiform rain zone [14]. Steiner, et al. [15] modified the BET method by using a variable radius of influence and a variable threshold instead of fixed values. The method was proposed in 1995 and was named SHY95 after the initials of the three authors and the year of the proposal. An extended SHY95 method was applied by DeMott, et al. [16], who used a two-dimensional BET at each height level within a volume of radar reflectivity to extend this approach to three dimensions. They suggested that using low-level data may lead to the misclassification of convective cells that tilt strongly with height and showed that using three-dimensional data can improve the accuracy of precipitation classification. Biggerstaff and Listemaa [5] modified the classification results of SHY95 by considering the vertical structure of the radar reflectivity factor and found that the modified method yielded higher accuracy than SHY95. Bringi, et al. [17] classified precipitation types by calculating the standard deviation of the drop size distribution (DSD). When the standard deviation of the DSD is smaller than a certain threshold, the precipitation is classified as stratiform, and when it is larger than this threshold, the precipitation is classified as convective.
Instead of using the traditional method based on the BET, Anagnostou [18] proposed an algorithm for classifying stratiform and convective clouds using an artificial neural network (ANN). The cloud-top height, reflectivity at a height of 2 km and 4 other features were used in the ANN training. Compared with other traditional algorithms based on the BET, the ANN exhibited better performance. The DSD has also been used to classify precipitation [19]. Based on a large number of rain events and by computing the Z-R relationship, the average DSD and the corresponding parameters, microphysical analysis can be performed; the rain distribution and precipitation type can be adequately characterized by a gamma DSD. Zhang and Qi [20] developed a method that automatically corrects for large errors due to the bright band in a real-time national radar quantitative precipitation estimation product, and the performances were good [21–23]. Yang, et al. [24] applied the fuzzy logic (FL) method for precipitation classification research using the 2-km height echo reflectivity, vertical integral liquid water content and other characteristics for classification, and the FL classification results were more natural and realistic than those of other methods. Yang, et al. [25] used FL to classify precipitation types and estimate precipitation. The results showed that compared with the Z-R relationship, FL can reduce the underestimation of precipitation and improve the accuracy of estimating precipitation using radar data.

Some studies have used satellite data to classify precipitation types. Adler and Negri [26] used infrared satellite data and applied a variant of the BET to classify convective and stratiform precipitation. Whereas the BET identifies convective cores from radar reflectivity, they used the minimum cloud-top temperatures to identify the convective core area. The radius of influence of each core was dependent on the magnitude of the infrared brightness temperature of the core [26,27]. Goldenberg, et al. [28] used an infrared cloud-top temperature method similar to the BET to classify convective and stratiform precipitation for a tropical cloud cluster. Awaka, et al. [29] used TRMM precipitation radar data. Two algorithms, the vertical contour mode (V-method) and horizontal contour mode (H-method), were used in the study to classify precipitation types. If the classification results of the two algorithms are the same, the classification result is accepted, and if the classification results differ, fusion-based classification results are used. The V-method can be used to detect the bright band. Once the bright band is detected, the precipitation type is classified as stratiform precipitation. Then, the V-method continues to detect convective precipitation according to the radar reflectivity. If the precipitation type is neither stratiform nor convective, it is classified as another type of precipitation. The H-method is based on Houze's classification model [15] and uses the horizontal echo intensity at a height of 2 km to assess the type of precipitation.

The precipitation process involves complex thermodynamic mechanisms and cloud microphysical mechanisms during sedimentation. These principles are difficult to fully explore. Thus, it is difficult to classify precipitation types based on these mechanisms. Machine learning can be used to build models and capture the characteristics of data such that changes in the data can be predicted and the data can be classified into different categories based on the relevant characteristics. When using machine learning to classify precipitation types, it is not necessary to understand the precipitation mechanisms of convective precipitation or stratiform precipitation; representative and appropriate variables for classification and labeling can be selected to achieve optimal classification. Machine learning is a discipline that uses experience to improve the performance of a system by means of calculations [30]. In computer systems, experience often exists in the form of data; thus, the main area of machine learning research involves computer algorithms that generate models from data. The main types of machine learning include supervised learning, unsupervised learning and semisupervised learning. Supervised learning is a method of adjusting the parameters of a model with a set of samples of known classes to achieve the required performance. Supervised learning includes decision trees, boosting and bagging algorithms, support vector machines, etc. In semisupervised learning, data sets contain both labeled and unlabeled data, and labels for the unlabeled data are inferred using the labeled data. Semisupervised learning usually includes the semisupervised support vector machine (SVM), semisupervised clustering, etc. In unsupervised learning, training samples do not have known labels. Unsupervised learning reveals the inherent nature and laws of data by learning from unlabeled training samples, providing a basis for further data analysis. This approach is commonly used in clustering.

The K-nearest neighbor (KNN) method is a type of supervised learning algorithm that has been widely used in pattern recognition and classification. KNN relies on the nearest k samples instead of all the samples for classification and is most suitable for classifying samples with overlap and unclear boundaries. KNN was proposed by Fix and Hodges [31], and Cover and Hart [32] further developed and improved the algorithm. KNN has fewer tunable parameters and provides faster calculations for small data sets than other methods. Thus, this approach has advantages in solving classification problems involving precipitation types. Machine learning is seldom used to classify precipitation types. KNN is a mature classification algorithm with many advantages and has been used in many fields, but there is no relevant study to prove the applicability of KNN in the classification of precipitation types, and the present study attempts to use and explore the applicability of KNN to classify precipitation types.

This paper consists of the following: Section 2 introduces radar data and satellite data used in this paper, Section 3 describes the implementation process of the KNN algorithm and the performance of the selected variables under different conditions, Section 4 presents the results of the KNN classification of precipitation type, and the final section provides a summary and conclusion.

## **2. Data Description**

The Doppler radar data used in this study are from the six S-band China Next-Generation Weather Radars (CINRAD/SA), and the site information and usage period of the radars are shown in Table 1. The radars are 10-cm wavelength Doppler radars with a 1° half-power beam width. The radar data consist of volume scans of the radar reflectivity, average radial velocity and spectral width. The radars are operated in 360° azimuthal volume scan mode with steps in elevation angles from 0.5° to 19.5° during periods of precipitation. The number of elevation steps and temporal resolution of the data depend on the operational mode of each radar. The radial bin spacing is 250 m. The radar data used in this paper are interpolated by the Barnes interpolation algorithm to a horizontal grid with a resolution of 1 km × 1 km [33] and a vertical resolution of 500 m over a depth of 18 km in the Cartesian coordinate system. The origin of the coordinates is the position of the radar. The data are quality controlled.

The precipitation radar (PR) is mounted on the TRMM satellite. The satellite takes 92.5 min to orbit Earth and completes about 16 orbits a day. The scanning range is from 38°N to 38°S and 180°W to 180°E, and the scanning swath width is 247 km. The spatial resolution is 5 km. As TRMM uses a low-altitude orbit, the PR can provide measurements of 3D rainfall distributions with unprecedented accuracy in the tropics and subtropics. The products of TRMM have been widely used in a variety of studies, such as studies of precipitation distribution patterns in tropical and subtropical regions [34], efforts to improve the accuracy of precipitation prediction [35], and research on precipitation structure and properties [36], which have demonstrated the reliability of TRMM and its products. In addition to the basic information provided by the PR, the 2A23 product includes rain characteristics observed by the PR. Based on the high vertical resolution of the PR data, the 2A23 product can accurately detect the bright band (BB) occurrence and its height. The following variables are used in this paper: the rain flag, which indicates the possibility of precipitation in a grid; the rain type, which is the classification of the precipitation type, including stratiform, convective and others; and the height of the bright band, which indicates whether a BB is detected in a grid and the height of the BB if there is one. As warm and cold rain precipitation are not directly classified and interpreted in the 2A23 product, the classification results do not include the classification of cold or warm rain. The precipitation radar has a wavelength of 2.2 cm, and the ground-based radar used in this paper has a wavelength of 10 cm. Therefore, the precipitation radar will be subject to more two-way path attenuation. In addition, the scanning angle, signal frequency and sensitivity of the ground-based radar differ from those of the PR. The main purpose of this paper is to classify the types of precipitation, taking the 2A23 precipitation classification product of the PR, rather than the echo reflectivity data of the PR, as the training sample label for KNN and evaluating the training results. These differences are therefore not taken into consideration and have no effect on the results.

The 2A23 product has a horizontal resolution of 5 km, and the horizontal resolution of the radar data is 1 km. To make these two datasets comparable, the interpolation scheme and data selection are described below.

Instantaneous 2A23 data and ground radar data that are separated in time by no more than 3 min are projected onto a Cartesian coordinate system with a 5 km × 5 km horizontal resolution. Each ray of a PR swath is projected onto the Cartesian grid using the status of the nearest pixel.
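The matching step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the PR pixel positions have already been converted to radar-centred Cartesian coordinates in km, and the function names, the input layout and the small 20 km × 20 km domain are hypothetical. A practical version would probably also reject grid cells whose nearest pixel is too far away.

```python
import math

def within_time_lag(t_pr_s, t_radar_s, max_lag_s=180.0):
    """True if the PR overpass and the radar volume are at most 3 min apart."""
    return abs(t_pr_s - t_radar_s) <= max_lag_s

def regrid_nearest(pixels, half_width_km=10.0, cell_km=5.0):
    """Give each 5 km grid-cell centre the rain type of the nearest PR pixel.

    pixels: list of (x_km, y_km, rain_type) tuples (hypothetical layout).
    Returns a dict mapping (x_centre_km, y_centre_km) -> rain_type.
    """
    n_cells = int(2 * half_width_km / cell_km)
    grid = {}
    for i in range(n_cells):
        for j in range(n_cells):
            cx = -half_width_km + (i + 0.5) * cell_km
            cy = -half_width_km + (j + 0.5) * cell_km
            # Brute-force nearest-neighbour search over all PR pixels.
            nearest = min(pixels, key=lambda p: math.hypot(p[0] - cx, p[1] - cy))
            grid[(cx, cy)] = nearest[2]
    return grid

pixels = [(-6.0, -6.0, "stratiform"), (4.0, 6.0, "convective")]
grid = regrid_nearest(pixels)
```

The brute-force search is quadratic in the number of cells and pixels, which is acceptable for a sketch but would normally be replaced by a spatial index for full swaths.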

Further steps are needed to make the comparison of the two datasets meaningful. These steps are as follows: (1) a pixel is classified as convective by the 2A23 product if a BB is not detected and ref2km is greater than 40 dBz or if there is a BB detected and ref2km is greater than 42 dBz with a horizontal gradient greater than 3 dB/km; (2) a pixel is classified as stratiform by the 2A23 product if no BB is detected but ref2km is less than 40 dBz; and (3) a pixel is classified as stratiform by the 2A23 product if a BB is detected.


**Table 1.** Radar site information and data usage time.

## **3. Algorithm and Features**

#### *3.1. Overview of the K-Nearest Neighbor Method*

KNN is the classification algorithm used to classify precipitation types in this paper. KNN does not have an explicit learning process. In the training phase, KNN simply saves the training samples and processes them after receiving the test samples [37]. Samples with classification labels are used as the KNN training samples. To achieve satisfactory classification results, a large number of training samples is needed, and the proportion of each class in the training samples should be as uniform as possible. In the actual precipitation process, the spatial and temporal extents of stratiform precipitation are usually larger than those of convective precipitation. In the interpolated and screened samples, the number of stratiform precipitation grid points is much larger than the number of convective precipitation grid points. If such data sets are used as KNN training samples, the classification results will generally be biased toward stratiform precipitation. The number of samples of each precipitation type in the training set therefore needs to be adjusted. Samples of different types of precipitation were randomly selected, and the training sample set was reconstructed so that stratiform precipitation, convective precipitation and other precipitation had a ratio of 1:1:1.
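One way to obtain a 1:1:1 training set is random undersampling to the size of the smallest class. The sketch below illustrates that idea only; the text does not state how many samples were drawn per class, so the choice of the smallest class size, the fixed seed and the function name `balance_classes` are assumptions.

```python
import random
from collections import defaultdict

def balance_classes(samples, seed=0):
    """Randomly undersample so that every label has the same sample count.

    samples: list of (features, label) pairs.
    """
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append((features, label))
    # Assumption: keep as many samples per class as the smallest class has.
    n_keep = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n_keep))
    rng.shuffle(balanced)
    return balanced

# Synthetic, imbalanced example: (ref2km in dBz, VIL in kg/m^2), label.
raw = ([((30.0, 1.5), "stratiform")] * 60
       + [((42.0, 5.0), "convective")] * 10
       + [((20.0, 0.5), "other")] * 15)
balanced = balance_classes(raw)
```

With the synthetic input above, each of the three classes ends up with 10 samples, the size of the smallest class.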

To classify a sample, the distance between that sample and every training sample is calculated, and the k training samples with the smallest distances are selected. The k training samples are weighted equally, and the probability that the sample is classified as type j is as follows:

$$P\_j = \frac{N\_j}{k},\tag{1}$$

where *P<sub>j</sub>* is the probability that the sample is classified as type *j* and *N<sub>j</sub>* is the number of training samples with a classification label of type *j* among the k nearest training samples. The type *j* with the largest *P<sub>j</sub>* is taken as the classification result.
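Equation (1) amounts to an equally weighted majority vote. The following sketch shows the vote for precomputed distances; the function name `knn_vote` and the sample values are illustrative, and the tie-breaking behaviour (the first label encountered wins) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter

def knn_vote(distances, labels, k):
    """Return (best_label, probabilities) from the k nearest training samples."""
    # Indices of the training samples, ordered from nearest to farthest.
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    nearest_labels = [labels[i] for i in order[:k]]
    counts = Counter(nearest_labels)
    # Equation (1): P_j = N_j / k, with every neighbour weighted equally.
    probs = {label: n / k for label, n in counts.items()}
    best = max(probs, key=probs.get)
    return best, probs

distances = [0.4, 1.2, 0.3, 2.5, 0.9, 0.7]
labels = ["convective", "stratiform", "convective",
          "other", "stratiform", "convective"]
best, probs = knn_vote(distances, labels, k=5)
```

Here the five nearest samples contain three convective and two stratiform labels, so the convective probability is 3/5 and the sample is classified as convective.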

#### *3.2. Selection of Features*

Using KNN to classify different precipitation types requires that the classification variables differ significantly between precipitation types, such as stratiform and convective precipitation, so that the precipitation types can be well distinguished. Using the 2A23 product as a reference, the frequency distribution of each candidate variable is determined for each precipitation type, and the distributions are compared to validate the classification variable. However, if the bright band is present, the reflectivity will increase significantly, which will negatively influence the classification results. The bright band should therefore not appear in the data used for classification. An altitude of 2 km is high enough to provide a sufficient amount of radar data out to a radius of approximately 150 km, and a 2-km altitude is low enough to avoid serious effects of the bright band, which usually appears at a height of 2.5 km to 4.5 km in tropical and subtropical areas [38].

Feature 1: Horizontal distribution characteristics of radar reflectivity at a height of 2 km (ref2km) [18]. ref2km can often reflect the horizontal structural characteristics of convective systems. For stratiform systems, this height should be adjusted appropriately. In some cases, the temperature at a height of 2 km is close to 0 °C, and there is a mixture of liquid and solid phase water and transitions between the two phases. As solid water melts, a film of liquid water forms on the surface of the melting particles, and the difference between the refractive indices of the liquid phase particles and the solid phase particles causes the reflectivity measured by the radar to increase, which may result in a flat and strong echo band. If such strong echo bands are not distinguished, the type of precipitation in the area may be assessed incorrectly. Figure 1a is the frequency distribution diagram of ref2km. The frequencies of stratiform precipitation and convective precipitation both increase with reflectivity below 30 dBz. In the range of 30–35 dBz, the frequency of stratiform precipitation reaches a maximum, and then the frequency decreases gradually with larger reflectivity values. At 40–45 dBz, the frequency drops to almost zero. The convective precipitation reaches a maximum frequency at 35–40 dBz, and the frequency decreases gradually with larger reflectivity values. However, convective precipitation still occurs when the reflectivity is above 50 dBz. The frequency graph of ref2km shows that the value of ref2km differs greatly between precipitation types and can be used to discriminate among them. It is reasonable to use ref2km for the classification of precipitation types.

Feature 2: Vertically integrated liquid-water content (VIL) [39]. The liquid water content M and radar reflectivity Z can be defined as follows:

$$M = \frac{\rho\_w \pi}{6} \int\_0^x n(a) a^3 da,\tag{2}$$

$$Z = \int\_0^x n(a) a^6 da,\tag{3}$$

where *a* is the drop diameter, *n*(*a*) is the drop size distribution, *x* is the maximum drop diameter, and ρ<sub>*w*</sub> is the density of water. When the Marshall-Palmer drop size distribution, *n*(*a*) = *N*<sub>0</sub> exp(−*ba*), is used in Equations (2) and (3), the error is small if the upper limit of integration, *x*, is replaced by ∞:

$$M = \frac{N\_0 \rho\_w \pi}{6} \int\_0^\infty \exp(-ba) a^3 da = \frac{N\_0 \rho\_w \pi}{6} \frac{\Gamma(4)}{b^4} = \frac{N\_0 \rho\_w \pi}{b^4},\tag{4}$$

$$Z = N\_0 \int\_0^\infty \exp(-ba) a^6 da = \frac{N\_0 \Gamma(7)}{b^7} = \frac{720 N\_0}{b^7}.\tag{5}$$

Eliminating the parameter *b* from Equations (4) and (5), and expressing Z in mm<sup>6</sup>/m<sup>3</sup> (which introduces the factor of 10<sup>18</sup>), yields

$$M = \frac{N\_0 \rho\_w \pi}{\left[720 \times 10^{18} N\_0\right]^{4/7}} Z^{4/7}. \tag{6}$$

For *N*<sub>0</sub> = 8 × 10<sup>6</sup> m<sup>−4</sup> and ρ<sub>*w*</sub> = 10<sup>6</sup> g/m<sup>3</sup>,

$$M = 3.44 \times 10^{-3} Z^{4/7},\tag{7}$$

where the units of M are g/m<sup>3</sup> and those of Z are mm<sup>6</sup>/m<sup>3</sup>.
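The coefficient in Equation (7) can be checked numerically from Equation (6). The short sketch below does only that arithmetic; the variable and function names are illustrative.

```python
import math

N0 = 8.0e6      # Marshall-Palmer intercept parameter, m^-4
RHO_W = 1.0e6   # density of water, g/m^3

# Equation (6): M = N0 * rho_w * pi / (720e18 * N0)^(4/7) * Z^(4/7),
# with Z in mm^6/m^3 and M in g/m^3.
coefficient = N0 * RHO_W * math.pi / (720.0e18 * N0) ** (4.0 / 7.0)

def liquid_water_content(z_mm6_m3):
    """Liquid water content M (g/m^3) for linear reflectivity Z (mm^6/m^3)."""
    return coefficient * z_mm6_m3 ** (4.0 / 7.0)

print(f"coefficient = {coefficient:.3e} g/m^3 per (mm^6/m^3)^(4/7)")
```

The computed value agrees with the 3.44 × 10<sup>−3</sup> quoted in Equation (7) to the precision given there.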

$$M^\* = \int\_{h\_{\rm base}}^{h\_{\rm top}} M dh' = 3.44 \times 10^{-6} \int\_{h\_{\rm base}}^{h\_{\rm top}} Z^{4/7} dh',\tag{8}$$

Here, *M*\* is VIL, which is given in units of kg/m<sup>2</sup>; Z is radar reflectivity, with units of mm<sup>6</sup>/m<sup>3</sup>; and *h*<sub>top</sub> and *h*<sub>base</sub> are the heights of the top and base of the radar echo, with units of meters. VIL reflects the overall vertical state of the echo area, and it can filter out the effects of false high echoes caused by bright bands and topographical factors. At the same time, changes in VIL are a good reflection of changes in a convective system. However, in nonconvective areas, VIL changes little, and its reference value decreases accordingly. Figure 1b shows the frequency distribution of VIL. The frequency of stratiform precipitation reaches a maximum for a VIL of 2 kg/m<sup>2</sup> and then decreases rapidly. Almost no stratiform precipitation has a VIL value that reaches 4 kg/m<sup>2</sup>. In contrast, the frequency of convective precipitation reaches a maximum near a VIL value of 4 kg/m<sup>2</sup> and then decreases, although convective precipitation exists even when the VIL value reaches 18 kg/m<sup>2</sup>. VIL therefore varies considerably between precipitation types and reflects the differences between them. Thus, it is reasonable to use VIL for the classification of precipitation types.
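As an illustration of the VIL integral, the sketch below integrates a gridded reflectivity column with the trapezoidal rule. It uses the coefficient and the exponent 4/7 carried over from Equation (7), converted from g/m<sup>3</sup> to kg/m<sup>3</sup>. The function name, the trapezoidal rule and the assumption that every level between echo base and echo top holds a valid dBz value are choices made for this example, not details taken from the text.

```python
def vil_from_profile(heights_m, dbz_profile):
    """Vertically integrated liquid water (kg/m^2) for one grid column.

    heights_m:   level heights in metres, from echo base to echo top.
    dbz_profile: reflectivity in dBz at those levels.
    """
    # dBz -> linear Z (mm^6/m^3) -> liquid water content in kg/m^3.
    m_kg_m3 = [3.44e-6 * (10.0 ** (dbz / 10.0)) ** (4.0 / 7.0)
               for dbz in dbz_profile]
    vil = 0.0
    for k in range(len(heights_m) - 1):
        dh = heights_m[k + 1] - heights_m[k]
        vil += 0.5 * (m_kg_m3[k] + m_kg_m3[k + 1]) * dh  # trapezoidal rule
    return vil

heights = [500.0 * k for k in range(11)]   # 0 to 5 km in 500 m steps
profile = [30.0] * 11                      # uniform 30 dBz column
vil = vil_from_profile(heights, profile)
```

For this uniform 30 dBz column that is 5 km deep, the result is roughly 0.89 kg/m<sup>2</sup>, which sits in the range where Figure 1b shows mostly stratiform precipitation.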

**Figure 1.** The variations in different types of precipitation frequency as a function of (**a**) Horizontal distribution characteristics of radar reflectivity at a height of 2 km (ref2km) and (**b**) Vertically integrated liquid water content (VIL).

Variables ref2km and VIL have different scales. During precipitation, ref2km usually has a minimum of 16 dBz and a maximum of 50 dBz, while VIL has a minimum of 0 kg/m<sup>2</sup> and a maximum of 10 kg/m<sup>2</sup>. When calculating the Euclidean distance, the effect of VIL on the distance can be very small because of its smaller numerical range. Thus, the data need to be normalized or standardized before calculating the distance, which decreases the influence of the different scales.

The standardized Euclidean distance can decrease the influence of variables with different scales by standardizing the data. The standardized Euclidean distance between sample x and sample y is calculated as follows:

$$\mathbf{d}(\mathbf{x}, \mathbf{y}) = \sqrt{\sum\_{i=1}^{n} \frac{\left(\mathbf{x}\_i - \mathbf{y}\_i\right)^2}{s\_i^2}},\tag{9}$$

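where *x<sub>i</sub>* and *y<sub>i</sub>* are the values of the *i*-th variable for samples x and y, *n* is the number of variables, and *s<sub>i</sub>* is the standard deviation of the *i*-th variable over the sample set.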
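A minimal sketch of Equation (9) for the two features used here is given below. The text does not say whether the sample or the population standard deviation is used; `statistics.stdev` (the sample form) is an assumption, and the function names and feature values are illustrative.

```python
import math
import statistics

def feature_stds(samples):
    """Standard deviation of each feature over the sample set."""
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return [statistics.stdev(column) for column in zip(*samples)]

def standardized_euclidean(x, y, stds):
    """Equation (9): Euclidean distance with each term scaled by s_i."""
    return math.sqrt(sum(((xi - yi) / s) ** 2
                         for xi, yi, s in zip(x, y, stds)))

# Features are (ref2km in dBz, VIL in kg/m^2).
training = [(20.0, 1.0), (30.0, 2.0), (40.0, 3.0)]
stds = feature_stds(training)
d = standardized_euclidean(training[0], training[2], stds)
```

In this example the standard deviations are 10 dBz and 1 kg/m<sup>2</sup>, so both features contribute equally to the standardized distance, whereas the plain Euclidean distance between the same two samples (about 20.1) would be dominated by ref2km.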

The Euclidean distance, Manhattan distance, and standardized Euclidean distance were each used to classify the same cases. Although the scales of ref2km and VIL are not the same, the classification results obtained with the standardized Euclidean distance, Euclidean distance and Manhattan distance do not differ substantially. To avoid possible adverse effects of unstandardized variables, the standardized Euclidean distance is used as the distance in the KNN in this study.

#### *3.3. Training and Classification*

In this paper, two variables, ref2km and VIL, are used as classification variables, and the corresponding precipitation classification results from the 2A23 product are used as classification labels; together they form the training samples stored by the KNN algorithm. An appropriate k value is selected, and the ref2km and VIL values of the sample to be classified are input into the KNN. The standardized Euclidean distance between the sample to be classified and each training sample stored in the KNN is calculated. Then, the class that is most common among the k nearest training samples is taken as the classification result.
