*Article*

## **Ice Detection on Aircraft Surface Using Machine Learning Approaches Based on Hyperspectral and Multispectral Images**

### **Maria Angela Musci 1,2, Luigi Mazzara 1,2,\* and Andrea Maria Lingua 1,2**


Received: 7 July 2020; Accepted: 14 August 2020; Published: 18 August 2020

**Abstract:** Aircraft ground de-icing operations play a critical role in flight safety. However, a considerable quantity of de-icing fluid is commonly employed to handle aircraft de-icing. Moreover, some pre-flight inspections are carried out with engines running; thus, a large amount of fuel is wasted and CO2 is emitted. This implies substantial economic and environmental impacts. In this context, the European project (reference call: MANUNET III 2018, project code: MNET18/ICT-3438) called SEI (Spectral Evidence of Ice) aims to provide innovative tools to identify ice on aircraft and improve the efficiency of the de-icing process. The project includes the design of a low-cost UAV (uncrewed aerial vehicle) platform and the development of a quasi-real-time ice detection methodology to ensure a faster and semi-automatic activity with a reduction of operating time and de-icing fluids. The purpose of this work, developed within the activities of the project, is to define and test the most suitable sensor using a radiometric approach and machine learning algorithms. The adopted methodology consists of classifying ice through spectral imagery collected by two different sensors: a multispectral and a hyperspectral camera. Since the UAV prototype is under construction, the experimental analysis was performed on a simulation dataset acquired on the ground. A comparison between the two approaches and their related algorithms (random forest and support vector machine) for image processing is presented: practical results show that it is possible to identify the ice in both cases. Nonetheless, the hyperspectral camera guarantees a more reliable solution, reaching a higher accuracy in the classification of iced surfaces.

**Keywords:** hyperspectral images; multispectral data; machine learning; ice detection

#### **1. Introduction**

Human safety is one of the main concerns at airports, and aircraft icing represents a significant hazard in aviation [1]. Ice formation compromises the aircraft's balance and can lead to a loss of control; de-icing and anti-icing are therefore necessary treatments for flight safety during the winter [2,3]. However, de-icing operations require the employment of chemicals such as ethylene glycol (EG) or propylene glycol (PG) that can damage the environment, in particular the nearby surface water and groundwater [3].

Ice accumulation can occur when supercooled droplets collide with a hard surface, forming an ice film [4], at air temperatures between 0 and −20 °C [5]. As reported by the FAA (Federal Aviation Administration), structural (in-flight) ice and ground ice can be identified [6,7]. The former occurs when the aircraft is flying through visible water such as rain or cloud droplets. The latter, instead, may accumulate on parked aircraft due to precipitation and atmospheric conditions.


According to the temperature, liquid water content, speed of the formation process, aircraft surface temperature and shape, and particle concentration and size, it is possible to distinguish three different types of structural ice [8]: rime ice, clear ice, and mixed ice (Figure 1).


**Figure 1.** Ice types: rime ice, clear ice, and mixed-ice. (Photo credit: NASA, adapted from [8]).

Frost, snow (or slush), fog, drizzle, rain (and their freezing states), and ice pellets can be considered the foremost examples of ground icing.

Frost can form when water vapor is deposited directly onto the aircraft surface at temperatures at or below freezing. A fog formed of supercooled water droplets that freeze upon impact with the aircraft surface, also known as freezing fog, produces a coating of rime/clear ice.

Rain and drizzle, uniform precipitations of liquid water particles, can be distinguished by drop diameter and proximity. Rain is characterized by widely separated drops with diameters greater than 0.5 mm, whereas drizzle has closely spaced drops with diameters smaller than 0.5 mm. These two kinds of precipitation, in the freezing state, can create ice deposits with a transparent appearance. Snow and slush are precipitations of ice crystals; slush is formed by water-saturated snow.

As gathered from the Manual of Aircraft Ground De-icing/Anti-icing Operations [2], the difference between in-flight and on-ground icing relates not mainly to the characteristics of the ice, but to its impact on the flight and on the de-icing procedures. As the definitions of freezing fog, frost, and freezing rain make clear, clear and rime ice can also occur on the ground.

Due to the physical characteristics of these types of ice, their identification is currently based on visual (e.g., for rime ice, snow) and tactile (e.g., for clear ice, frost) inspections carried out by trained and qualified ground crew or flight crew [11].

Moreover, the cleaning process involves the use of a considerable amount of aircraft de-icing fluids (ADFs), because targeted operations are not achievable. The contamination check shall cover all surfaces that have an aerodynamic, control, sensing, movement, or measuring function, such as wings, tail surfaces, engines, fuselage, antennas, and sensors. This investigation requires sufficient visibility of these parts. A verification of the cleaned surfaces shall always be made after de-icing/anti-icing, and this inspection can be either visual or tactile. The whole procedure is time-consuming and demanding, especially since it is crucial to maintain the flight schedule. A time-effective strategy for ice detection is required to limit ADF use and improve the management of the crew's operations.

In this context, UAV (uncrewed aerial vehicle) [12] imagery combined with machine learning algorithms has shown excellent potential for rapid, remote, cost-effective detection tasks. This approach allows ice identification from multiple views with an automatic check-up operation.

The SEI (Spectral Evidence of Ice) project [13,14] proposes an integrated solution that can handle the automatic pre-de-icing inspection, ice detection, and cleaning verification procedure. The expert crew manages the de-icing request and sends the UAV to the parking area (or hangar) of the specific aircraft that needs the procedure. The UAV can autonomously recognize the location of the aircraft and start the inspection. Indeed, the multi-sensor UAV platform, equipped with a hyperspectral or multispectral camera, has been designed to monitor and inspect aircraft in the specific de-icing area of the airport. The main task of the drone is the identification of the location and extension of the ice-contaminated area. For this purpose, an automatic methodology for geometric and radiometric detection of the ice has been defined. The development of computer-oriented methods for ice detection is still challenging due to the physical characteristics of the ice, variable atmospheric conditions, and the lack of autonomous technology in this application field.

Several devices have been developed for ice characterization in on-ground and in-flight inspection [15,16]. Some examples are based on ultrasonic, magnetostrictive, and electromagnetic sensors [17,18]. Some researchers, such as Gong et al. [19], have discussed the use of mid-infrared sensors for ice detection. In this field, spectral imagery, not only in the mid-infrared but across the whole electromagnetic range, is an emerging technology because of its high spectral and spatial resolution [20]. Our study aims to fill this gap and presents the potential of hyperspectral and multispectral imaging techniques for ice detection on aircraft.

Regardless of the devices used for data acquisition, machine learning approaches such as random forest (RF) [14] and support vector machine (SVM) [15] have been utilized for material detection and characterization [21]. These algorithms perform well in reducing the complexity of the classification task associated with spectral data, because they can handle high-dimensional input spaces and noisy datasets [22,23].

This work, within the activities of the SEI project, tests the feasibility of using spectral sensors, namely hyperspectral and multispectral cameras, together with random forest and support vector machine as machine learning algorithms.

Firstly, the purpose is the selection of the most suitable sensor to mount on a UAV prototype that has to meet cost requirements. For this reason, a multispectral camera was examined as a low-cost sensor to reduce the system production cost. At the same time, the paper addresses the definition of a time-effective automatic methodology for ice detection using the machine learning approach. As is known, a hyperspectral camera has a spectral resolution of more than 100 bands, while a multispectral camera has only a few bands (in most cases from three to 15). A dimensionality reduction process was applied to accurately compare the performance of the two algorithms on images with sharply different spectral resolutions.

Since the UAV prototype is under construction, the experimental analysis was performed with a simulation dataset acquired on the ground. However, the methodology can be easily transferred to a UAV application.

#### **2. Materials and Methods**

This section describes the two sensors (Section 2.1), the methodology (Section 2.2), and the algorithms and accuracy assessment (Sections 2.3–2.5).

#### *2.1. Sensor Description*

The data acquisition was performed with a hyperspectral camera (Senop Rikola) and a multispectral camera (MAPIR Survey3N). The Senop Rikola hyperspectral camera is a snapshot camera based on a Fabry–Perot interferometer [24,25]. It includes two non-aligned sensors: one acquires near-infrared bands (659.2–802.6 nm) and the second captures visible bands (from 502.8 to 635.1 nm). The MAPIR Survey3N is a multispectral camera that records RGN (red, green, and near-infrared) images as red (660 nm), green (550 nm), and near-infrared (850 nm) bands [26]. An RGB camera with specifications comparable with the MAPIR (same spatial resolution, optics, and pixel size) was used to include the blue band (475 nm). The reason for the introduction of an additional band is explained in the Methodology (Section 2.2).

The Senop and the MAPIR are lightweight UAV sensors, and they were selected because they have a similar spectral range from 500 to 950 nm. Table 1 summarizes the specifications of the two sensors.

**Table 1.** Sensor specifications: the Senop Rikola hyperspectral camera and the MAPIR Survey3N multispectral camera.


#### *2.2. Methodology*

The overall methodology is shown in the schema below (Figure 2).

For data collection, ice samples were generated in the laboratory using molds and a real section of an aircraft wing. Since the idea was to have ice samples similar to rime ice (white ice) and clear ice (transparent ice), two types of ice were created. Snow and the other varieties of ice cited above are not considered in this analysis because their production in our laboratory was not possible. The details of sample production and data acquisition are explained in Section 3.

The hyperspectral and multispectral images were radiometrically corrected using the empirical line method (ELM) and the reference panel.
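The ELM fits, for each band, a linear gain and offset that map raw digital numbers (DN) onto the known reflectances of the reference panel patches. The following is a minimal sketch of this step only (the actual correction was done with ENVI's Empirical Line Calibration tool, Section 3.1; the function and array names below are illustrative assumptions):

```python
import numpy as np

def empirical_line_correction(image, panel_dn, panel_reflectance):
    """Per-band linear mapping DN -> reflectance, fitted on reference panels.

    image: (rows, cols, bands) raw digital numbers
    panel_dn: (n_panels, bands) mean DN extracted over each panel patch
    panel_reflectance: (n_panels,) known reflectance of each patch (0-1)
    """
    corrected = np.empty(image.shape, dtype=np.float64)
    for b in range(image.shape[2]):
        # Least-squares fit of reflectance = gain * DN + offset for band b
        gain, offset = np.polyfit(panel_dn[:, b], panel_reflectance, 1)
        corrected[:, :, b] = gain * image[:, :, b] + offset
    return corrected
```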

After that, dimensionality reduction with principal component analysis (PCA) was performed on the hyperspectral data. This step defines two new datasets: one composed of principal component hypercubes and one with a reduced number of bands. The original hyperspectral dataset, the two new datasets, and the multispectral images were then classified to evaluate the detection performance and the computational time. Moreover, for the multispectral case, a further analysis was made using RGBN (red, green, blue, near-infrared) images, to assess the improvement brought by the additional band to the classification.

#### *2.3. Dimensionality Reduction of Hyperspectral Images: Feature Extraction and Feature Selection*

The high dimensionality of hyperspectral images is a crucial problem in real-time applications because it consumes time both in the acquisition and in the ice detection steps. Moreover, it can produce the so-called Hughes phenomenon [27]. To address this issue, the most popular methods for dimensionality reduction are feature extraction and feature selection. Feature extraction refers to a linear or nonlinear transformation that reduces data redundancy in the spatial and spectral domains. Feature selection refers to a process that defines a subset of the original features without a transformation [28–30]. PCA is widely used as a feature extraction method, but it can also be used for feature selection.

The PCA dimensionality reduction is based on the estimation of the eigenvalues of the covariance matrix [31–33]. For each pair of bands, the covariance is calculated as (1):

$$\sigma_{i,j} = \frac{1}{N-1} \sum_{p=1}^{N} \left(DN_{p,i} - \mu_i\right)\left(DN_{p,j} - \mu_j\right) \tag{1}$$

where *DNp,i* and *DNp,j* are the digital numbers of a pixel *p* in the bands *i* and *j*, respectively, and μ*i* and μ*j* are the averages of the DN for bands *i* and *j*. Then the covariance matrix is defined as (2):

$$\mathbf{C}_{b,b} = \begin{pmatrix} \sigma_{1,1} & \dots & \sigma_{1,j} \\ \dots & \dots & \dots \\ \sigma_{i,1} & \dots & \sigma_{i,j} \end{pmatrix} \tag{2}$$

The roots of the characteristic equation provide the eigenvalues λ (3):

$$\det(\mathbf{C} - \lambda I) = 0 \tag{3}$$

where **C** refers to the covariance matrix (2), and *I* is the identity matrix.

The eigenvalues indicate how much of the original information each component compresses. The variance percentage of each principal component is calculated as the ratio between its eigenvalue and the sum of all eigenvalues. The components that contain the minimum variance, and thus the least information, can be discarded. The matrix form of the principal components can be expressed as (4):

$$\mathbf{Y}_i = \begin{pmatrix} w_{1,1} & \dots & w_{1,j} \\ \dots & \dots & \dots \\ w_{i,1} & \dots & w_{i,j} \end{pmatrix} \mathbf{X}_j \tag{4}$$

where *Y* is the vector of the principal components (PCs), *W* the transformation matrix, and *X* the vector of the original data. The coefficients *wi,j* are the eigenvectors; they link the PCs with the original variables, providing information on their relationship. The eigenvectors can be calculated for each λ*k* as (5):

$$(\mathbf{C} - \lambda_k I)\, w_k = 0 \tag{5}$$

where **C** and *I* are defined as in (3), while λ*k* is the *k*-th eigenvalue and *wk* the corresponding eigenvector. There are three practical criteria to select the most representative PCs [34]: a threshold on the cumulative variance explained by the first components, Kaiser's rule on the magnitude of the eigenvalues (variance), and the 'elbow' of the scree plot (referred to as the first, second, and third rule in Section 3.2).



Once the PCs have been chosen, their interpretation is based on the eigenvectors derived from (5). The meaning of each PC can be determined by looking at the coefficients *wi,j* of the variables *Xj*: the greater *wi,j* is, the higher the correlation, and the more important *Xj* is for that PC [36].
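Equations (1)–(5) amount to an eigendecomposition of the band covariance matrix. A minimal Python sketch under that reading (the authors used the ArcGIS Pro "Principal Components" tool, Section 3.2; the array names and the 90% cumulative-variance threshold are illustrative):

```python
import numpy as np

def pca_reduce(hypercube, variance_threshold=0.90):
    """Project a (rows, cols, bands) hypercube onto its leading PCs."""
    r, c, b = hypercube.shape
    X = hypercube.reshape(-1, b).astype(np.float64)  # pixels x bands
    Xc = X - X.mean(axis=0)                          # remove band means
    C = np.cov(Xc, rowvar=False)                     # covariance, Eqs. (1)-(2)
    eigvals, eigvecs = np.linalg.eigh(C)             # Eqs. (3) and (5)
    order = np.argsort(eigvals)[::-1]                # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()       # cumulative variance share
    k = int(np.searchsorted(ratio, variance_threshold)) + 1
    pcs = Xc @ eigvecs[:, :k]                        # PC scores, Eq. (4)
    return pcs.reshape(r, c, k), eigvals, eigvecs
```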

#### *2.4. Machine Learning: Random Forests and Support Vector Machine*

The random forest (RF) algorithm builds multiple decision trees (a forest) that operate as an ensemble trained with a bagging mechanism [37,38]. The bagging mechanism samples N random bootstraps of the training set with replacement. The number of trees characterizes the forest: the higher the number of trees, the more accurate the classification [39]. Moreover, the following parameters can affect the performance of the RF classifier: the tree depth, i.e., the number of splits of each tree; the split criterion, which handles the split at each node (such as the GINI index); and the minimum split [40,41].
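As a concrete illustration of how these hyperparameters appear in practice, here is a minimal scikit-learn sketch with synthetic stand-ins for the annotated pixel spectra (the authors used the ArcGIS Pro classifier, Section 3.3, not this library):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data: 2000 pixel spectra with 100 bands, 10 classes.
rng = np.random.default_rng(0)
X_train = rng.random((2000, 100))
y_train = rng.integers(0, 10, 2000)

rf = RandomForestClassifier(
    n_estimators=50,       # number of trees in the forest
    max_depth=30,          # tree depth: maximum number of splits per tree
    criterion="gini",      # split criterion handled at each node
    min_samples_split=2,   # minimum split
    bootstrap=True,        # bagging: bootstrap samples with replacement
    n_jobs=-1,
)
rf.fit(X_train, y_train)
```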

Support vector machine (SVM) is a binary algorithm that constructs an optimal hyperplane, or a set of hyperplanes, to be employed for the classification task [42]. The best hyperplane separates the data points of different classes and is usually the plane with the largest margin between the two classes [40]. SVM can be extended to the multiclass problem through two different approaches: one-against-all or one-against-one. In the one-against-all approach, a set of N binary classifiers is applied to the N-class problem. The second approach, one-against-one, trains a binary classifier for each pair of classes. The training sample size has a high impact on the performance of the SVM, as shown by Myburgh et al. [43].
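The two multiclass extensions can be sketched in the same hedged way: scikit-learn's SVC applies the one-against-one scheme internally, while a wrapper turns a binary SVC into the one-against-all scheme (again illustrative only, not the authors' tooling):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical training data: 1000 pixel spectra with 100 bands, 10 classes.
rng = np.random.default_rng(0)
X, y = rng.random((1000, 100)), rng.integers(0, 10, 1000)

one_against_one = SVC(kernel="rbf").fit(X, y)                       # N(N-1)/2 classifiers
one_against_all = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)  # N classifiers
```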

#### *2.5. Evaluation Metrics*

For both the random forest and the support vector machine, the accuracy assessment for the performance evaluation can be achieved with different parameters based on the error matrix. According to the literature [44,45], the selected parameters are the overall accuracy (OA) and the user's accuracy (UA).


Moreover, in this specific real-time application, the computational time for the classification part was assessed. The processing time of the training procedure was not taken into account because the final goal was to use transfer learning.
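Both metrics follow directly from the error matrix; the following is a minimal sketch (the label encoding and function name are assumptions, not the authors' tooling):

```python
import numpy as np

def error_matrix_metrics(y_true, y_pred, n_classes):
    """Overall accuracy (OA) and per-class user's accuracy (UA)."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                    # rows: reference, columns: predicted
    oa = np.trace(cm) / cm.sum()         # correctly classified / total pixels
    ua = np.diag(cm) / cm.sum(axis=0)    # correct / all pixels assigned to class
    return oa, ua
```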

#### **3. Ice Detection: An Experimental Analysis**

For the experimental analysis, it was not possible to collect the real types of ice; thus, different kinds of ice, as similar as possible to the real case, were generated in the laboratory. Two types of ice were produced: the first one similar to rime ice, with a milky-white color, and the second one to glassy clear ice.

The former was created from water vapor condensed in a freezer at a temperature of −15 °C, and its thickness reaches values between 2 and 6 cm. The latter was generated by freezing tap water within plastic molds at a temperature of −15 °C. Different plastic molds were used to produce different blocks of ice, containing from 5 to 20 mL of water, with a thickness of approximately 3 cm. The ice blocks were placed on a section of an aircraft wing to simulate the typical conditions in which the ice is present.

The aluminum panel used in the tests was a section of a Socata MS.894 Rally Minerva with a dimension of 400 × 400 × 2 mm. Before icing, the panel was stored in a freezer so the icing would start with low surface temperature. Figure 3 shows the configuration of the samples.

**Figure 3.** Example of annotated image with reference data. Training set sample (**a**) and validation set sample (**b**).

#### *3.1. From Data Collection to Sample Annotation*

The dataset was built by collecting ground measurements. The acquisitions were performed at the Photogrammetry, Geomatics and GIS Laboratory of DIATI (Department of Environment, Land and Infrastructure Engineering) at Politecnico di Torino (Italy) [46]. During this campaign, 10 images for the hyperspectral sensor and eight for the multispectral sensor were collected under different illumination conditions (18 images overall). The various illumination conditions were generated using a different number of lamps and a combination of lamps and natural light, to simulate the real scenario in which the drones will be used in the parking area or the hangar. In this paper, the term "Test" refers to each image with different environmental conditions. All data were recorded maintaining stable positions and varying the rotations of the camera slightly.

The hyperspectral camera was used in manual mode, connected to the computer through a USB cable. The selected image resolution was 1010 × 1010 pixels. The images were composed of 100 bands covering the spectral range from 502 to 906 nm, with a wavelength step of 4 nm and the corresponding Full Width at Half Maximum (FWHM) resolution. The integration time was set to 450 ms based on the environmental illumination conditions. The sequence of the bands was automatically generated using the Rikola Hyperspectral Imager software v2.0. These parameters were chosen to cover the whole spectral range, which also makes it possible to identify the most characteristic bands and features of the studied materials. For the MAPIR, instead, the camera's sensitivity was set to ISO 800, and the exposure time was fixed to 1/15 s.

The two datasets of images were radiometrically calibrated using the Empirical Line Calibration tool of ENVI 4.7 [47]. Then, the images of each sensor were manually annotated. In both cases, the same 10 classes were considered: rime ice, clear ice, white aluminum, aluminum, floor tile, wood, and the reference panel patches (white, black, grey 21%, and grey 27%). The representative classes were only the rime ice, the clear ice, and the white aluminum (Figure 3). These classes were chosen according to the materials that can be distinguished in the real case at aircraft scale. The selected materials were related to the object (in our case, the aircraft) and the ice. The other materials were assigned to different background classes to improve the performance of the classification, since a single class for all background materials would alter the accuracy of the outcomes. The number of samples per class for each dataset is reported in Table 2.


**Table 2.** Hyperspectral and multispectral reference samples per class.

The training and test samples were collected based on visual interpretation. ArcGIS Pro 2.5.0 toolbox was used to create polygons as reference data for each class.

#### *3.2. Dimensionality Reduction of Hyperspectral Data: Results*

To reduce the hyperspectral data dimensionality, PCA was carried out using the "Principal components tools" of ArcGIS Pro 2.5.0 [48]. As described in Section 2.3, it was possible to adopt the PCA as feature extraction and band selection method.

As the first step, feature extraction was performed to define the principal components. In the second step, the selected PCs were used for significant band selection. Both feature extraction and band selection were applied to identify the best solution for ice detection.

Therefore, for the feature extraction, the eigenvalues and the cumulative variance were computed to identify the number of principal components (PCs), i.e., the new dimensionality. Only the outcomes of the first image are reported as an example, because the selection process and the conclusions were the same for the other images. Table 3 shows the percentages of the first five components of the sample image. As can be seen from Table 3, three PCs reach 90.31% of the total variance in the original data for the first criterion and pass the 1% threshold of Kaiser's rule (Section 2.3). As a consequence, the dimensionality of the new representation is three, and the remaining components can be discarded. Moreover, the scree plot in Figure 4 also indicates three as the number of PCs (third rule described in Section 2.3).

**Figure 4.** Scree plot of a sample image, with a zoom of the 'elbow'.


**Table 3.** Principal component analysis (PCA): example of eigenvalue and cumulative variance in percentage on a single sample image.

After the identification of the number of PCs, the first three principal components were used to select a reduced number of original bands for the classification task. The band selection was carried out using the eigenvectors of each PC: the higher the absolute value of a band's eigenvector component, the higher the importance of that band for the specific principal component. According to this criterion, and considering that the number of significant bands is strictly related to the application, a threshold on the eigenvector values, defined for each component, allows identifying the significant bands. In the plot of the eigenvector values, the spikes of the function correspond to the representative bands. Figures 5–7 show the eigenvector values with respect to the band number for the three selected principal components in four representative images (Test1, Test2, Test6, Test10) among the 10 hyperspectral images. These four tests are characterized by different illuminations and changes in the state of the ice, and the comparison was made to check the recurrence of the most significant bands to be selected. The spikes in the eigenvector functions allow recognizing the bands for all the images and for the three selected PCs (Table 4):

**Figure 5.** First principal component eigenvector plot.

**Figure 6.** Second principal component eigenvector plot.

**Figure 7.** Third principal component eigenvector plot.

As can be noticed in Table 4, there are recurring bands in each test. Taking into account all the identified bands, a new hypercube with 27 bands can be generated: bands 1 (506.31 nm), 3–7 (from 514.48 to 530.11 nm), 14 (558.28 nm), 25 (602.47 nm), 32–38 (from 630.2 to 654.19 nm), and 78–89 (from 817.58 to 861.65 nm). However, considering only the bands recurring in each principal component of all images, the number of significant bands can be further reduced to 10. In this latter case, the significant bands are 4–5 (from 518.12 to 522.48 nm), 33–37 (from 634.36 to 650.38 nm), and 83–85 (from 837.98 to 846.21 nm).


**Table 4.** Band selection for the three principal components (PC) in the representative tests (Test1, Test2, Test6, Test10).

Three new datasets came from the dimensionality reduction process: 10 new images composed of the three PCs, 10 new hypercubes with 10 bands, and 10 hypercubes with 27 bands. The first set was created through the "Principal Component Analysis" toolbox of ArcGIS Pro 2.5.0, selecting three as the maximum number of principal components. The two remaining datasets with reduced hypercubes were generated using a customized MATLAB routine for hyperspectral data decomposition and the "Composite Bands" tool of ArcGIS Pro 2.5.0 [49] for the selected band composition.
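The eigenvector-threshold selection described above can be condensed into a few lines; the following sketch assumes the eigenvector matrix from the PCA step, with an illustrative threshold (the paper defines the threshold per component and per application):

```python
import numpy as np

def select_significant_bands(eigvecs, n_pcs=3, threshold=0.2):
    """Original bands whose absolute loadings spike in the first n_pcs PCs."""
    selected = set()
    for k in range(n_pcs):
        spikes = np.flatnonzero(np.abs(eigvecs[:, k]) > threshold)
        selected.update(int(band) for band in spikes)
    return sorted(selected)  # indices of the retained original bands
```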

#### *3.3. Hyperparameter Tuning for Random Forest and Support Vector Machine*

The hyperparameter tuning process plays a crucial role in improving the accuracy of RF and SVM algorithms.

Before starting with the hyperparameter adjustment, the data were split into 80% for training and 20% for testing. The tuning of the hyperparameters was made on the training set to define a model. The accuracy assessment was carried out on both the training and testing sets to verify the performance of the model in the classification task. The validation curve allows visualizing the behavior of a model hyperparameter, showing the accuracy trend for different values of that single hyperparameter.
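A validation curve of this kind can be obtained, for instance, with scikit-learn (a generic sketch with synthetic data, not the ArcGIS Pro workflow actually used):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Hypothetical training data: pixel spectra and class labels.
rng = np.random.default_rng(0)
X, y = rng.random((2000, 100)), rng.integers(0, 10, 2000)

depths = [5, 10, 15, 20, 25, 30]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, param_name="max_depth", param_range=depths, cv=5,
)
# Pick the depth where training and validation accuracies are closest
# while the validation accuracy stays high (the two criteria below).
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))
```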

The optimized hyperparameters were chosen according to two criteria: the first one is the minimum difference between the overall accuracy of the training and validation models, and the second one is the best user's accuracy for the validation only. It should be noted that the random choice of the samples affects the accuracy evaluation; thus, a tolerance has to be considered. The accuracy analyses of training and validation are presented for both algorithms.

Moreover, among the classes, the accuracy assessment of the rime ice, clear ice, and white aluminum receives particular attention, since these three classes are the distinctive ones in the real de-icing application. The clear ice, as explained in the Introduction (Section 1), is difficult to identify by visual inspection; thus, it has a relevant weight in this analysis.

The tuning was implemented for both RF and SVM on a single image (Test\_1) of both datasets (hyperspectral and multispectral images), using the "Segmentation and Classification tools" of ArcGIS Pro 2.5.0 [50]. The tests were made on a Windows 10 workstation with an Intel® Core™ i7-6500U CPU at 2.50 GHz, an AMD Radeon™ R7 M360 (Iceland) GPU (six compute units at 980 MHz, 2048 MB), and 16 GB of RAM.

Since the SEI project application requires a near real-time approach (Section 1), the computational time was also evaluated in this section, because one of the aims of the optimization is the definition of the trade-off between accuracy and processing time.

As described in Section 2.4, the hyperparameters of each classifier have to be tuned. They are the same for both the hyperspectral and multispectral data. The optimization was executed in manual mode. For the RF algorithm, the maximum number of samples for each class was fixed, and two hyperparameters were tuned: the tree depth and the number of trees. For the SVM, instead, only the maximum number of samples per class was tuned.

#### 3.3.1. Hyperparameters Tuning for the Hyperspectral Dataset

Starting with the RF, the sample size was set to 2000 for each class for tuning the tree depth and the number of trees. The tree depth optimization was done by varying its value from 5 to 30, while the number of trees was fixed to 50. As reported in Table 5, the difference between the overall accuracies (OA) is comparable in all training and validation configurations. The case with a depth equal to 5 was left out because it is not reasonable with a low number of trees. Therefore, looking only at the validation results, a tree depth equal to 30 produces the best OA and clear ice accuracy. For these reasons, the selected tree depth was 30.

**Table 5.** Training accuracy (on the left) and validation accuracy (on the right) for random forest (RF) tree depth optimization. R\_i stands for rime ice, C\_i for clear ice, and W\_a for white aluminum. In yellow, the selected optimized hyperparameter.


Instead, for the selection of the number of trees, the tree depth was fixed to 30, according to Table 5. The number of trees was varied from 5 to 50 (Table 6). Following the same reasoning used for the tree depth, the cases with a lower number of trees were excluded; the OA differences are small in the other configurations. According to Table 6, the cases 15, 30, and 50 show a similar OA value, which is also the highest one (87.9% on average). However, the accuracy of the C\_i class is clearly higher in case 50. As a consequence, 50 was the optimized number of trees.

**Table 6.** Training accuracy (on the left) and validation accuracy (on the right) optimization of RF number of trees. R\_i stands for rime ice, C\_i for clear ice, and W\_a for white aluminum. In yellow the selected optimized hyperparameter is shown.


In Figure 8, the comparison between training and validation overall trends among all the considered depths and numbers of trees can be appreciated. The validation curves confirm the previous observations and the criteria used for the optimized hyperparameters selection.

Table 7 presents the RF processing time for the training in the analyzed configurations, considering the tuning of both hyperparameters. In the case of D\_Trees, the processing times are not excessively influenced by the increase of the depth. In the case of N\_trees, the higher the number of trees, the higher the computational time; however, the computational time is stable after 15 trees, because beyond that point the selected sample size limits the effect of the number of trees. It is thus possible to choose 50 as the N\_trees value.

**Figure 8.** RF validation curve for maximum depth (on the **left**) and maximum number of trees (on the **right**).

**Table 7.** Processing time for the training in RF using different values of tree depth (on the left) and the number of trees (on the right). In yellow the time for training the model with the selected optimized hyperparameter is shown.


For the SVM, instead, the maximum number of samples per class ranged between 100 and 5000. The difference between the training and validation OA remains 8% on average in all configurations (Table 8). The highest validation OA occurs in the case of 5000 samples, but this configuration was excluded because the related computational time is too long (3 h 4′25″) (Table 9). Thus, the cases with 100, 750, and 1000 samples were taken into account. The 100-sample configuration was discarded because the sample size was small, and the random choice of the samples strongly affects the overall accuracy in the images.

**Table 8.** Training accuracy (on the left) and validation accuracy (on the right) optimization of support vector machine (SVM) number of the sample. R\_i stands for rime ice, C\_i for clear ice, and W\_a for white aluminum. In yellow the selected optimized hyperparameter is shown.



**Table 9.** Processing time for the training in SVM using different values of sample size. In yellow the time for training the model with the selected optimized hyperparameter is shown.

For the remaining cases, since the OA values are comparable, the hyperparameter selection was based on the accuracy of the clear ice. In the validation, the C\_i accuracy is 94.0% in the 750-sample case against 92.7% in the 1000-sample case. According to this consideration, the selected number of samples was 750. The validation curves confirm that this parameter is the best fit (Figure 9).

**Figure 9.** SVM validation curve for the maximum number of samples.

Table 9 reports the processing time for each configuration. As expected, the computational time increases with the sample size.

#### 3.3.2. Hyperparameter Tuning for the Multispectral Dataset

With the RF, the sample size was set to 10,000 for each class. The ranges of the tree depth and the number of trees were chosen according to the number of samples and the image resolution: for multispectral images, the training sample size is five times larger than for the hyperspectral ones, and the resolution is 4000 × 3000 pixels instead of 1010 × 1010 pixels. The case 50\_30 was chosen as the starting point according to the previous tuning on hyperspectral data (Section 3.3.1). The tree depth was varied from 30 to 60, and the number of trees from 50 to 125.

Table 10 presents the training and validation accuracies considering all the combinations of the number of trees (xx in the test code) and the depth tree (yy in the test code).

The OA in training and validation is nearly constant in all configurations, 81% and 77%, respectively. As a consequence, the best configuration can be defined by looking only at the validation accuracy. The OA is not strongly affected by the different hyperparameters; however, the test 100\_30 presents the highest OA value (77.8%). Looking at the clear ice UA, the best case should be 125\_40 with a value of 69.8%, against 69.7% for the case 100\_30. From these observations, it is not possible to recognize the best fit without the computational time analysis, because the UA for C\_i is very similar (Table 11).


**Table 10.** Training accuracy (on the left) and validation accuracy (on the right) RF tree depth and the number of trees optimization. R\_i stands for rime ice, C\_i for clear ice, and W\_a for white aluminum. The test name is defined as xx\_yy, where xx is the number of trees, yy is the depth. In yellow the selected optimized hyperparameter is shown.

**Table 11.** Processing time for the training in RF using different values of tree depth and the number of trees. In yellow the time for training the model with the selected optimized hyperparameter is shown.


As for the processing time, in all configurations the trend increases with the number of trees. The case 100\_30 was selected because it has a good trade-off between processing time (6′55″ instead of 8′58″ for the case 125\_40) and overall and clear ice accuracy.

For the SVM classifier, only the sample size per class was tuned, ranging between 500 and 2000 samples. In Table 12, it can be noticed that the cases with 1500 and 2000 samples have the best OA, and the latter also has the highest value of C\_i accuracy (68.6%). Nonetheless, taking the computational time into account (Table 13), the test with 2000 samples lasts around 20 min more than the test with 1500 samples (1 h 5′43″). The configuration with 1500 samples can therefore be considered the best fit.


**Table 12.** Training accuracy (on the left) and validation accuracy (on the right) optimization of SVM number of the sample. R\_i stands for rime ice, C\_i for clear ice, and W\_a for white aluminum. In yellow the selected optimized hyperparameter is shown.

**Table 13.** Processing time for the training in SVM using different values of the number of samples. In yellow the time for training the model with the selected optimized hyperparameter is shown.


Table 13 provides the computational time for all configurations and shows that the processing time increases proportionally with the sample size.

#### *3.4. Ice Detection using Hyperspectral Data: Results*

The ice detection was performed on three types of hypercubes: the original hypercubes, the reduced hypercubes (27 and 10 bands), and the principal component (PC) images.


For the classification, the "Classify Raster" tool of ArcGIS Pro 2.5.0 [51] was used, and Test\_1 was employed for the training. The analysis in this section focuses on two main parameters: the accuracy and the computational time of the classification only.

In general, as explained in Section 2.5, both overall accuracy and user's accuracy were used to assess the classification. As mentioned in Section 3.1, some materials are included in the background but annotated as different classes to check the performance on different materials. Since these classes will not appear in the real scenario (where the background will be different, e.g., asphalt instead of floor tile), the overall accuracy is reported just to show the general performance of the algorithms. The primary parameter, however, is the user's accuracy, because the object of this study is the detection of the ice, and in particular of the clear ice due to its transparency.

For each dataset (original hypercubes, reduced hypercubes, and PC images), random forest and SVM with the optimized hyperparameters derived from Section 3.3.1 were used. For the RF, the hyperparameters selected for the classification are the number of trees equal to 50, tree depth equal to 30 and 2000 samples. For the SVM, the classification with 750 samples was performed.

As in the case of PCA (Section 3.2), the classification evaluation is shown only for four representative images (Test\_1, Test\_2, Test\_6, and Test\_10). Test\_1, Test\_2, and Test\_6 present varied environmental conditions, and Test\_10 was included to display the behavior of the model in the presence of the ice phase transition.

For the original dataset classified with the RF, the overall accuracy reaches a maximum of 88%, and the computational time is 14 min on average (Table 14). The classification performs better on the clear ice than on the other classes, reaching a maximum of ca. 96%. The rime ice accuracy is on average under 50%, since its radiometric response is similar to that of the white aluminum (67% on average).

**Table 14.** Accuracy and processing time on the original dataset with random forest. R\_i stands for rime ice, C\_i for clear ice, and W\_a for white aluminum.


With the SVM classifier (Table 15), the overall accuracy reaches a maximum of 92%, and the computational time is 17 min on average. Also in this case, the C\_i user's accuracy is higher than that of the other significant classes, reaching a maximum of ca. 97%. The rime ice accuracy is on average under 60%, since its radiometric response is similar to that of the white aluminum (78% on average).

**Table 15.** Accuracy and processing time on the original dataset with SVM. R\_i stands for rime ice, C\_i for clear ice, and W\_a for white aluminum.


As presented in Tables 14 and 15, the SVM reaches a better average accuracy than the RF: the overall accuracy is 86.2% for the RF and 88.8% for the SVM. The drawback of the SVM is the processing time. The average computational time is 14′22″ for the RF and 17′5″ for the SVM; therefore, the RF is faster in the classification process.

For both algorithms, Test\_10 reports low user's accuracy values compared with the other tests because, in this case, the ice was starting to melt. The same behavior recurs in the reduced hypercube datasets and the PCA dataset. There is only one test in which the ice is melting; thus, it is predictable that the algorithm performs worse in this case, and the detection of ice in other physical states was outside the scope of this preliminary study. It is well known that ice changes its features according to its state; thus, for the real application, further acquisitions will be made to train the algorithm and improve the detection of the ice during its transition to the liquid state.

The same analysis was carried out for the reduced hypercube datasets (27 bands and 10 bands) (Table 16). The OA reached with the RF classifier has a maximum value of 83.8% for the 27-band hypercubes and 80.6% for the 10-band hypercubes. The computational time varies from 28.5″ for the 27 bands to 26.5″ for the 10 bands. These observations show that the two cases are comparable, and the 10-band dataset can be considered reliable. Moreover, the C\_i user's accuracy indicates that the model still performs well. Consequently, the set of bands selected using the PCA is adequate for the classification task.


**Table 16.** Accuracy and processing time on the reduced hypercube (27 and 10 bands) with RF. R\_i stands for rime ice, C\_i for clear ice, and W\_a for white aluminum.

Table 17 presents the outcomes of the SVM. The overall accuracy reaches a maximum of 87.3% for the 27-band hypercubes and 80.8% for the 10-band hypercubes. The processing time varies from 1′50″ to 1′40″ for the 27 and 10 bands, respectively. Regarding the clear ice, the dimensionality reduction does not affect its accuracy. Even if the SVM accuracy follows the same trend as the RF on the two datasets, there are still slight differences (Tables 16 and 17). Referring to the average OA, the SVM performs better than the RF for the 27 bands, while for the 10 bands the opposite holds. For both reduced datasets, the SVM has a higher C\_i user's accuracy and is slower than the RF.

**Table 17.** Accuracy and processing time on the reduced hypercube (27 and 10 bands) with SVM. R\_i stands for rime ice, C\_i for clear ice, and W\_a for white aluminum.


Turning now to the analysis of the PC dataset with the RF algorithm, Table 18 indicates a perceivable reduction of the accuracy compared with the original and reduced hypercubes. The average OA does not exceed 72%, and the computational time is around 38″ on average.

**Table 18.** Accuracy and processing time on the PC images with RF. R\_i stands for rime ice, C\_i for clear ice, and W\_a for white aluminum.


For the SVM algorithm, Table 19 presents findings comparable to the RF ones. The average OA does not exceed 76%, and the computational time is around 1′17″ on average.


**Table 19.** Accuracy and processing time on the PC images with SVM. R\_i stands for rime ice, C\_i for clear ice, and W\_a for white aluminum.

These observations demonstrate that both SVM and RF produce ambiguous and inaccurate outcomes in some cases, and the resulting average accuracy for clear ice (77.8% for SVM and 73.4% for RF) may not be acceptable for our application. Finally, the comparison between the two algorithms shows that the SVM behaves better than the RF: the overall and user's accuracies for all classes are always higher for the SVM, but its computational time is about twice that of the RF.

According to the above analysis of ice detection on all datasets (original, reduced 27 bands, reduced 10 bands, and PC), some general considerations can be made. The SVM and RF accuracies are comparable in all cases; indeed, the difference between them ranges from 0.1% to 2%. Despite these minimal differences, the SVM presents on average higher user's and overall accuracies than the RF classifier. The dimensionality reduction affects the overall accuracy only slightly: between the original hypercube and the PC hypercube, the average OA decrease is 12% for the SVM classifier and 14% for the RF.

For the user's accuracy of C\_i, a descending trend based on the dimensionality of the feature space cannot be defined. However, the difference in clear ice accuracy between the best case (10-band hypercube) and the worst case (PC images) is 7% for the SVM and 9% for the RF.

The processing time is strictly related to the size of the feature space, and the dimensionality reduction helps to shorten it. In general, the analysis of the computational time shows that the RF is faster than the SVM in both the training and classification parts.

Figure 10 illustrates the classification results for each dataset, where the discrepancy related to the reduction of the number of bands can be appreciated graphically. Specifically, Figure 10 refers to the Test\_1 hypercube classified with the SVM; the RF results are not included because their visual differences from the SVM ones are imperceptible. As can be seen, the clear ice is well detected in all cases, while the rime ice identification worsens as the number of bands is reduced.

**Figure 10.** The classification results on Test\_1 with SVM. (**a**) Original classified hypercube, (**b**) reduced classified hypercube-27 bands, (**c**) reduced classified hypercube-10 bands, and (**d**) PC classified hypercube.

*3.5. Ice Detection Using Multispectral Data: Results*

The ice detection was performed, in the case of multispectral images, on two datasets: the RGN images and the RGBN images (the RGN images integrated with the blue band).


The optimized hyperparameters identified after the tuning (Section 3.3.2) were used. For the RF, the hyperparameters selected for the classification are the number of trees equal to 100, tree depth equal to 30 and 10,000 samples. For the SVM, the number of samples is 1500.

Considering the RGN dataset, the classification assessment is described for four representative images (Test\_1, Test\_2, Test\_6, and Test\_10) that have the same characteristics as the hypercubes described in Section 3.4. For the RGBN dataset, only Test\_1 integrated with the blue band is reported, to demonstrate the accuracy improvement related to the presence of the blue band.

With the RF, the evaluation of the outcomes on the RGN images shows an average OA of 49.5%, with an average computational time of 6′43″ (Table 20). As expected, the lack of the blue band also deeply degrades the UA values compared with the hyperspectral hypercubes. For example, the maximum clear ice accuracy is lower than 80%; thus, it cannot be considered sufficiently accurate.


**Table 20.** Accuracy and processing time on the RGN (red, green, and near infrared) images with RF. R\_i stands for rime ice, C\_i for clear ice, and W\_a for white aluminum.

With the SVM, the accuracy assessment of the RGN dataset shows an average OA of 49.2%, with an average computational time of 23′40″ (Table 21). The missing blue band problem is still visible: indeed, the clear ice accuracy does not surpass 67% in the best configuration.

**Table 21.** Accuracy and processing time on the RGN images with SVM. R\_i stands for rime ice, C\_i for clear ice, and W\_a for white aluminum.


Nonetheless, the comparison between the two algorithms shows that the RF performs a better classification on average, and its computational time is lower than the SVM's.

Considering that the accuracy obtained for the RGN dataset is not comparable with that obtained using hypercubes of similar feature-space size (e.g., the PC images), the blue band was added to create RGBN images. Indeed, the results of the band selection (Section 3.2) show that the blue band carries an essential weight.

Table 22 gives an overview of the accuracy and the processing time in the RGBN case. The OA shows a significant increase compared with the RGN images, exceeding 80% for both algorithms. The UA values are higher than the averages of the respective RGN values. At the same time, the computational time decreases for the RGBN images.


**Table 22.** Accuracy and processing time on the RGBN (red, green, blue, near-infrared) image with RF and SVM. R\_i stands for rime ice, C\_i for clear ice, and W\_a for white aluminum.

Regarding the comparison between the two algorithms in the RGBN case, the SVM produces a better classification, but its computational time is longer than the RF's.

Figure 11 highlights the classification results on the two datasets using the SVM. The improvement related to the introduction of the additional band is evident: the blue band reduces the classification noise and, at the same time, enables a better identification of all materials. Moreover, the improved distinction between rime ice or clear ice and white aluminum is clear.

**Figure 11.** The classification results on Test\_1 with SVM. (**a**) RGN classified images and (**b**) RGBN classified images.

#### **4. Discussion**

Since the previous section already describes the outcomes related to the dimensionality reduction of the hyperspectral data and to the classification with the two algorithms on the different datasets, the discussion focuses on: the comparison between the two sensors, the comparison between the two classifiers, and the reliability of the material detection.


Concerning the first point, this study confirms that hyperspectral images are more reliable for the ice detection task. On the other hand, it highlights that the advantage of operating with multispectral data (with the same spatial resolution as the hyperspectral images) is the computational time, which is indeed one of the crucial problems in real-time applications.

The outcomes demonstrate that the hyperspectral data are suitable for real-time applications after an a priori analysis. A dimensionality reduction process can easily compress the size of hyperspectral data, preparing them for the classification task; this step overcomes the limits related to the computational time of the original hypercubes. The experimental analysis shows that the processing time can be improved by downscaling the spectral resolution: the reduced hypercube with 10 bands can be considered a trade-off between accuracy and computational time, regardless of the employed algorithm.

Moreover, starting from the analysis of the significant bands, a multispectral sensor can be defined to facilitate the acquisition and classification operations. It is important to take into account that, in the case of the multispectral images, the spatial resolution is four times greater than the hyperspectral one. The results obtained with the introduction of a significant band (in this case, the blue band) show the effectiveness of predefined-band knowledge in the classification.

Regarding the classifiers, the SVM performs better in terms of accuracy, while the RF classifier is faster than the SVM. This observation holds for both the multispectral and the hyperspectral dataset. However, the accuracy reached with the multispectral data is not comparable with that of the hyperspectral camera, regardless of the selected algorithm.

Finally, turning now to material detection, this study focuses attention on the classification of clear ice. As explained in the Introduction (Section 1), clear ice is not visible to the naked eye and requires tactile inspection. In the real case, the ice could have different features (e.g., density, shape) from the ice samples generated in this study. At this stage of our work, other types of ice were examined to understand whether the classification was able to distinguish, as two classes, different conditions of the same material, such as transparent ice and rime ice. Due to its characteristics, the rime ice is more visible; thus, its detection is more straightforward. Concerning the reliability of the ice detection, both algorithms recognize both forms of ice conservatively: if an area is contaminated by ice, it is very unlikely that the algorithms classify it as aluminum or another material, which would be a false negative. Conversely, the algorithms may classify an aluminum area as ice; as can be noticed in Figures 10 and 11, the main misclassification involves white aluminum identified as ice.

In the real case, not applying de-icing fluids to areas with ice is more dangerous than applying the de-icing procedure to areas that do not need it. Therefore, some de-icing fluid may be wasted on areas where ice is actually not present; however, safety, which is the primary goal, is not compromised.

Moreover, although the radiometric classification is noisy for the rime ice, the results can be improved with the use of geometric features. The previous activities of the SEI project [14] demonstrated that the rime ice is accurately identifiable in this application, using the RGB sensor and a photogrammetric approach.

#### **5. Conclusions**

In this paper, the feasibility of using spectral images for ice detection was studied, testing sensors with different spectral resolutions, namely the Senop Rikola and the MAPIR. For this purpose, two different types of ice samples were created to understand whether it was possible to distinguish clear ice (transparent ice) from rime ice (white ice). Images were then collected under different illumination conditions, because there is no open-source, ready-to-use dataset for this specific task. Moreover, semantic segmentation algorithms (RF and SVM) were defined, also evaluating the accuracy and the processing time.

The main challenges of this work were the definition of the efficient use of hyperspectral data in the near real-time application and the research of spectral resolution and algorithms capable of providing higher accuracy and limited computational time.

This experimental analysis demonstrates the possibility of using the reduced hypercubes with both the RF and the SVM as classifiers, with an OA higher than 80% on average.

As future work, we plan to transfer the knowledge and the promising outcomes acquired through this simulation to a drone application. Moreover, the drone application can also help to consider other kinds of ice that cannot be reproduced in the laboratory, such as snow and freezing rain.

**Author Contributions:** Conceptualization, A.M.L., L.M. and M.A.M.; methodology and validation, A.M.L., L.M. and M.A.M.; writing—original draft preparation, L.M. and M.A.M.; writing—review and editing, A.M.L., L.M. and M.A.M.; visualization, L.M. and M.A.M.; supervision, A.M.L.; project administration, M.A.M.; funding acquisition, A.M.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the European Union and the Piedmont Region funds within the framework of the Action "MANUNET III—POR FESR 2014–2020 (project code: MNET18/ICT-3438)".

**Acknowledgments:** This project was carried out within the activities of the PoliTO Interdepartmental Centre for Service Robotics (PIC4SeR).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

1. European Organisation for the Safety of Air Navigation. The Flight Safety Foundation Aircraft Ground De/Anti-Icing. Available online: https://www.skybrary.aero/index.php/Aircraft\_Ground\_De/Anti-Icing (accessed on 28 June 2020).


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Deep Learning Classification of 2D Orthomosaic Images and 3D Point Clouds for Post-Event Structural Damage Assessment**

#### **Yijun Liao, Mohammad Ebrahim Mohammadi and Richard L. Wood \***

Department of Civil and Environmental Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0531, USA; yijun.liao419@huskers.unl.edu (Y.L.); me.m@huskers.unl.edu (M.E.M.) **\*** Correspondence: rwood@unl.edu

Received: 15 May 2020; Accepted: 20 June 2020; Published: 22 June 2020

**Abstract:** Efficient and rapid data collection techniques are necessary to obtain transitory information in the aftermath of natural hazards, which is useful not only for post-event management and planning but also for post-event structural damage assessment. Aerial imaging from unpiloted (gender-neutral, but also known as unmanned) aerial systems (UASs) or drones permits highly detailed site characterization, in particular in the aftermath of extreme events with minimal ground support, to document the current conditions of the region of interest. However, aerial imaging results in a massive amount of data in the form of two-dimensional (2D) orthomosaic images and three-dimensional (3D) point clouds. Both types of datasets require effective and efficient data processing workflows to identify the various damage states of structures. This manuscript introduces two deep learning models, based on 2D and 3D convolutional neural networks, to process orthomosaic images and point clouds for post-windstorm classification. In detail, 2D convolutional neural networks (2D CNN) are developed via transfer learning from two well-known networks, AlexNet and VGGNet. In contrast, a 3D fully convolutional network (3D FCN) with skip connections was developed and trained on the available point cloud data. Within this study, the datasets were created from data collected in the aftermath of Hurricanes Harvey (Texas) and Maria (Puerto Rico). The developed 2D CNN and 3D FCN models were compared quantitatively based on the performance measures, and it was observed that the 3D FCN was more robust in detecting the various classes. This demonstrates the value and importance of 3D datasets, particularly the depth information, for distinguishing between instances that represent different damage states in structures.

**Keywords:** convolutional neural network; deep learning; transfer learning; point clouds; structural damage assessment

#### **1. Introduction**

One of the emerging approaches for aerial image collection is to utilize the unpiloted (or unmanned) aerial system (UAS), commonly known as a drone [1–3]. Following natural hazard events, data collection is often limited by time and site accessibility imposed by precarious structures, debris, road closures, curfews, and other restrictions. However, UAS imagery enables first responders and emergency managers to perform effective logistical planning, loss estimates, and infrastructure assessment for insurance adjusters, engineers, and researchers [4]. A UAS with an onboard camera enables assessors to collect numerous images over large areas efficiently, as well as to reconstruct the three-dimensional (3D) scene via three steps: Scale Invariant Feature Transform (SIFT), Structure-from-Motion (SfM), and Multi-View Stereo (MVS). Here, the SfM reconstruction is generated using two-dimensional (2D) aerial images [5]. The SfM-derived point cloud has relative accuracy at the centimeter level [1]. The creation of a 3D SfM point cloud is a time-consuming process; however, it reconstructs the depth information in the scene, which may improve various analyses. Recently, deep learning techniques have become a more common approach for developing computer vision workflows. These techniques have been utilized to create workflows that investigate damage in the aftermath of extreme events from aerial images, in particular over large areas and at the community scale.

The main objective of this manuscript is to study and compare three deep learning models based on 2D aerial images and 3D SfM-derived point clouds to detect damaged structures following two hurricanes. In addition, the study investigates the application of transfer learning for the 2D convolutional neural network (CNN) as a rapid post-event strategy to develop a model for damage assessment of built-up areas with minimal to no prior data. The model for damage assessment using 2D images is developed via transfer learning from two well-known image classification networks, namely AlexNet and VGGNet [6]. During the training process, the pre-trained weights of these models were further modified to match the user-defined classes. Moreover, a 3D fully convolutional network (3D FCN) with skip connections is developed by expanding the 3D FCN model proposed by Mohammadi et al. [7]. While the goal of the 2D CNN is to classify the aerial images based on the most prominent object observed in each image, the 3D FCN with skip connections semantically classifies the SfM-derived point cloud. Both 2D and 3D models were trained on similar datasets with identical numbers of classes and were compared based on precision, recall, and overall accuracy. The comparison between the 2D and 3D models demonstrates the information content and value associated with the depth information present in the 3D datasets.

#### **2. Literature Review**

#### *2.1. Studies Used 2D Images for Detection and Classification*

The task of object detection or classification of a set of images has been investigated by various studies using CNNs with different architectures. Among all proposed methods, transfer learning has become one of the most popular techniques. Transfer learning corresponds to the process of fine-tuning the upper layers of a pre-trained model based on a new dataset for a newly proposed task [8]. Models developed with a transfer learning strategy not only demonstrated improved performance in comparison to other models but could also be developed more efficiently. In an early study within the area of deep learning, Bengio discussed transfer learning algorithms and their effectiveness in classifying new instances based on pre-trained models and demonstrated the process through numerous examples of transfer learning [8]. One of the most referenced studies was performed by Krizhevsky et al. [9]. The authors developed a CNN model using a subset of the ImageNet dataset [10] and modified the fully connected layers to accommodate the new labels [9]. The modified network architecture contained eight layers, trained to maximize the probability of the correct label under the prediction distribution. The authors reported test error rates of 17% to 37.5% on the datasets used. Another application of transfer learning was studied by Oquab et al. [11]. The authors developed various CNN models for the task of visual recognition based on transfer learning from the pre-trained ImageNet model [10]. Within this study, the training images mainly comprised centered objects with a clear background, and the authors reported that the model was able to classify images with a high level of accuracy after an extended training process. Different CNNs perform differently as their architectures vary. As a result, Shin et al. listed several popular image classification networks used for transfer learning, such as CifarNet, AlexNet, and GoogLeNet [6], and compared their performance on a set of medical images via transfer learning. It was concluded that transfer learning has been consistently beneficial for classification experiments.

Various studies have investigated the application of CNN models for post-event assessments using aerial images. For example, Hoskere et al. proposed post-earthquake inspections based on UAS imagery and CNN models [12]. Within this study, the authors developed a fully convolutional network to semantically segment images into three classes of pixels. The developed model was able to segment the images with an average accuracy of 91.1%. More recently, Xu et al. studied the post-earthquake scene classification task using three deep learning methods: a Single Shot MultiBox Detector (SSD), post-earthquake multiple scene recognition (PEMSR) based on transfer learning from SSD, and Histogram of Oriented Gradients with a Support Vector Machine (HOG+SVM) [13]. Within the proposed method, the aerial images were initially classified into six classes, including landslide, houses, ruins, trees, clogged, and ponding. The dataset was created from web-searched images of the 2014 Mw 6.5 Ludian earthquake (China), which were preprocessed, downsampled to 300 × 300 pixels, and manually classified into the six aforementioned classes. The authors reported that the PEMSR model demonstrated higher efficiency (0.4565 s versus 8.3472 s for HOG+SVM) as well as higher accuracy. In their work, the transfer learning strategy also improved the overall accuracy and performance, although the average processing time was slightly higher than that of the SSD method. Moreover, in addition to the effect of transfer learning in increasing the accuracy and performance of 2D CNN models, Simonyan and Zisserman pointed out that CNN performance improvement can be achieved by increasing the network depth [14]. As a result, Gao and Mosalam developed a deep 2D CNN based on transfer learning from the VGGNet model for Structural Health Monitoring (SHM) and rapid post-event damage detection [15]. The 2D image-based SHM used red, green, blue (RGB) information with unsupervised training algorithms and obtained 90% accuracy for binary classification.

#### *2.2. Studies Used 3D Point Clouds for Detection and Classification*

With the rapid development of technologies to collect remotely sensed 3D point clouds and the growing application of these data in various fields of civil engineering, many researchers have proposed methods to analyze 3D point clouds, in particular for routine inspections or post-event data collection and analysis [16,17]. The datasets here are considered non-temporal, i.e., single post-event datasets that do not rely on change detection against a baseline (pre-event) dataset. For example, Axia et al. proposed a workflow to classify an aerial 3D point cloud into damaged and undamaged classes [18]. Within the proposed workflow, a normal vector was estimated for each point within the point cloud as the key damage-sensitive feature, and the variation of these normal vectors with respect to a global reference vector was identified. Lastly, the study used a region-growing approach based on the variation of the normal vectors to classify the point cloud. Axia et al. reported that while the proposed method can classify collapsed structures, it may misclassify partially damaged structures.

In general, one of the main steps in point cloud analysis workflows is to classify the points into a set of predefined classes. As a result, multiple workflows have been introduced to classify point clouds through machine learning and, more recently, deep learning techniques. Hackel et al. introduced one of the most successful workflows to classify dense point clouds of urban areas into multiple classes, including building façades, ground, cars, motorcycles, traffic signals, and pedestrians [19]. In this study, the authors extracted a series of features for each point over various neighborhood sizes using eigendecomposition, the height of the points, and first and second statistical moments, and used a random forest learning algorithm to classify each point. The proposed method achieved a mean overall accuracy of 95%. More recently, Xing et al. used the Hackel et al. workflow as a basis and developed a more robust workflow by adding a series of features computed from the difference of normal vectors for better identification [19,20]. Their study demonstrated a 2% improvement on average.
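For illustration, a simplified Python sketch of such eigenvalue-based geometric features with a random forest classifier is given below; the neighborhood size and the reduced feature set are assumptions of this sketch, not the exact configuration of Hackel et al. [19].

```python
# Sketch: covariance-based geometric features per point plus a random forest,
# in the spirit of Hackel et al. [19]. The neighborhood size (k) and the
# three-feature set are simplifying assumptions.
import numpy as np
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestClassifier

def eigen_features(points, k=20):
    """Eigenvalue features (linearity, planarity, scattering) plus height."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)          # k nearest neighbors per point
    feats = np.zeros((len(points), 4))
    for i, nbrs in enumerate(idx):
        cov = np.cov(points[nbrs].T)          # 3 x 3 neighborhood covariance
        w = np.sort(np.linalg.eigvalsh(cov))[::-1]
        l1, l2, l3 = np.maximum(w, 1e-12)     # l1 >= l2 >= l3, guarded
        feats[i] = [(l1 - l2) / l1,           # linearity
                    (l2 - l3) / l1,           # planarity
                    l3 / l1,                  # scattering
                    points[i, 2]]             # height of the point
    return feats

# Hypothetical labeled cloud: N x 3 coordinates and per-point class labels.
pts = np.random.rand(1000, 3)
y = np.random.randint(0, 3, size=1000)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(eigen_features(pts), y)
```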

Recently, deep learning methods have become more widespread for analyzing 3D datasets in science and engineering, and various deep learning-based workflows have been developed to classify 3D point cloud datasets. The main advantage of deep learning-based algorithms over more traditional learning algorithms (e.g., artificial neural networks) is their capability to learn feature extractors directly from the input data. Therefore, deep learning algorithms, in particular CNN architectures, eliminate the need to engineer feature extractors based on the geometry of the objects within the dataset and the background. One of the early studies to investigate the application of deep learning to 3D point cloud classification was performed by Prokhorov [21], who proposed a 3D network architecture similar to a CNN to classify point clouds of various objects by converting the point cloud data into 3D grid representations. The developed 3D CNN had one convolutional layer, one pooling layer, and two fully connected layers, followed by a 2-class output layer. The weights within the convolutional layers were pre-trained using lobe component analysis and updated using the stochastic meta-descent method [22]. Following this study, Maturana and Scherer proposed a 3D CNN for object recognition similar to that of Prokhorov [22,23]. The proposed network had two tandem convolutional layers, one max-pooling layer, and one fully connected layer, followed by the output layer. In contrast to the study conducted by Prokhorov [22], Maturana and Scherer did not pre-train the developed network, yet it performed on par with or better than Prokhorov's network. This highlights that the developed 3D CNN was able to extract features effectively during the training process.
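A minimal Keras sketch in the spirit of the Maturana and Scherer architecture [23] (two convolutional layers, one max-pooling layer, one fully connected layer, and an output layer) might look as follows; the grid resolution, filter counts, and class count are assumptions of this sketch.

```python
# Sketch of a VoxNet-style 3D CNN over binary occupancy grids, after
# Maturana and Scherer [23]. Grid size, filters, and class count are assumed.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 32, 1)),          # occupancy grid
    tf.keras.layers.Conv3D(32, 5, strides=2, activation="relu"),
    tf.keras.layers.Conv3D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling3D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),         # fully connected
    tf.keras.layers.Dense(10, activation="softmax"),       # object classes
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
```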

Recently, Hackel et al. introduced a point cloud classification network based on a 3D CNN architecture. The proposed network accepts five occupancy grid models with different resolutions for each instance as input and has five convolutional layers in parallel with an organization similar to VGGNet, followed by a series of fully connected layers and one output layer [14,24]. The authors reported a maximum overall accuracy of 88% and an intersection-over-union value of 62% for datasets collected from urban environments. This work classifies the scene into natural terrain, high vegetation, low vegetation, buildings, hardscape, vehicles, and human-made terrain. More recently, Zhang et al. proposed a network to semantically segment point clouds based on a model consisting of three distinct networks [25]. The first network encodes the point cloud into 2D instances. The second network consists of a series of fully connected and max-pooling layers followed by convolutional layers. Finally, the third network converts the 2D encoded data into 3D grid models, semantically classifying the voxels in the grid and creating a bounding box for each detected object. The authors reported that the experimental results demonstrate an overall accuracy improvement of 10% in comparison to the network developed by Maturana and Scherer [25].

#### *2.3. Knowledge Gap*

Previous studies have explored the application of CNNs to post-natural-hazard event assessment using aerial images. Both deep learning-based methods and unsupervised learning have been implemented for 2D and 3D datasets, while the difference between 2D and 3D datasets in deep learning has yet to be fully understood through quantitative comparisons. As reviewed, the majority of studies developed to analyze 3D point clouds for post-event applications were based on traditional methods. In contrast, deep learning models developed via transfer learning for 2D aerial images have been investigated in various studies. However, due to the lack of depth information, limitations in damage and structural component recognition still exist. As a result, this study investigates the application of deep learning-based models using 2D images and 3D SfM-derived point clouds corresponding to the same post-event scenes.

#### **3. Datasets**

#### *3.1. Introduction to Hurricane Harvey and Maria*

Within this study, three orthomosaic image and point cloud datasets were collected in the aftermath of Hurricanes Harvey and Maria. Hurricane Harvey made landfall on 25 August 2017 on the coastline of Texas as a Category 4 hurricane, producing wind gusts over 215 km/h and storm surges as high as 3.6 m. This incident resulted in the destruction of more than 15,000 and partial damage to more than 25,000 residential and industrial structures, as well as other critical infrastructure in coastal communities, including the towns of Rockport and Port Aransas [26]. Hurricane Maria made landfall on 19 September 2017 in Puerto Rico. Hurricane Maria was classified as a Category 5 hurricane and produced wind gusts over 280 km/h and storm surges as high as 2.3 m, making it the most severe natural hazard event recorded in history to affect Puerto Rico and the other islands in the region [27]. As a result of this extreme event, the power grid of Puerto Rico was significantly damaged, a major dam for the Guajataca reservoir sustained critical structural damage, and more than 60,000 buildings were damaged [28].

#### *3.2. Data Collection Method*

To carry out the data collection over the selected areas, a medium-size drone with an onboard camera was deployed: a DJI Phantom 4 UAS collected high-resolution aerial images. The selected flight paths were flown fully autonomously using the Pix4Dcapture application on a handheld tablet. The data collection in Puerto Rico produced 4077 images in 7 flights, covering approximately a 1.75 km<sup>2</sup> area with a 53.5 m elevation change. The Texas Salt Lake dataset contained 1379 images from 2 flights, covering a 0.75 km<sup>2</sup> area with an elevation range of 9.3 m. The Texas Port Aransas site had 1424 images collected from 4 flights, covering a 0.88 km<sup>2</sup> area with an elevation range of 1.9 m. The collected images were further processed using the SfM workflow, which uses a series of two-dimensional images with sufficient overlap to generate the 3D point cloud and orthomosaic datasets of the surveyed area [1]. The SfM-derived point clouds for the three sites are shown in Figures 1–3. Other key characteristics of these datasets are presented in Table 1.

**Figure 1.** Puerto Rico point cloud (scale in meters): (**a**) top view and (**b**) side view.

**Figure 2.** Texas Salt Lake point cloud (scale in meters): (**a**) top view and (**b**) side view.

**Figure 3.** Texas Port Aransas point cloud (scale in meters): (**a**) top view and (**b**) side view.


**Table 1.** Summary characteristics of the datasets.

#### *3.3. Dataset Classes*

Within this study, each dataset was segmented manually into one of the following seven classes: undamaged structures, partially damaged structures, completely damaged structures, debris, roadways, terrain, and vehicles. An earlier study by Mohammadi et al. informs the classification used here [7]; however, the scope of damaged structures is expanded by splitting the instances into two damaged-structure classes based on the level of damage sustained during the event. A partially damaged structure includes any building that does not exhibit visible physical changes but whose roof is covered by tarps, which are typically blue or red. Completely damaged structures are buildings that underwent physical changes due to the event, such as roof damage without tarp coverings and with visible structural components such as beams, columns, or walls. If a structure has collapsed such that no structural component can be identified, it is classified as debris. The debris class consists of everything that is not in its native state; debris, in general, comprises rooftop shingles, fallen trees, downed utility or light poles, and other wind-blown objects. Terrain incorporates any region comprising grass, low-height vegetation, water, sand, trees, exposed soil, fences, or utility poles. Note that any non-building structural objects represented by a cylindrical shape (e.g., utility and light poles) are considered terrain [29]. Lastly, the vehicle class corresponds to objects used for the transportation of people or goods, including cars, SUVs, trucks, carts, recreational vehicles, trailers, construction vehicles (e.g., excavators), and any water-borne vessels that can be propelled by oar, sail, or engine. Figures 4 and 5 show examples of each class as point clouds and corresponding images. Lastly, Tables 2 and 3 list the numbers of point cloud and image instances, respectively, that were manually identified from the datasets.

**Figure 4.** Examples of instances from all datasets: (**a**,**b**) Undamaged structure, (**c**,**d**) partially damaged structure, and (**e**,**f**) roadways.

**Figure 5.** Examples of instances existed from all datasets: (**a**,**b**) vehicles, (**c**,**d**) terrain, (**e**,**f**) debris field, and (**g**,**h**) completely damaged structure.

**Table 2.** Summary of point cloud instances for Salt Lake, Puerto Rico, and Port Aransas.



**Table 3.** Summary of image instances for Salt Lake, Puerto Rico, and Port Aransas.

#### **4. Methodology**

#### *4.1. Dataset Preparation for 2D Images*

The process of creating image instances started with creating an orthomosaic image of the entire scene using Pix4Dmapper. Afterward, the orthomosaic image was segmented into a series of 256 × 256 images. As a result, approximately 18,000 images were created from the Puerto Rico dataset, the Salt Lake dataset resulted in a total of 60,000 images, and the Port Aransas dataset was divided into 120,000 segmented images. The next step in preparing the image instances was to assign a label to each 256 × 256 image based on the seven classes described in Section 3. Within this study, the image classes are determined by the most prominent object visible in the image. The Salt Lake and Puerto Rico datasets were used for model development, and the Port Aransas dataset was used to test and validate the developed models.
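A minimal sketch of this tiling step is shown below; the file name and the use of PIL/NumPy are assumptions of this illustration (the study's own preprocessing toolchain is not specified beyond Pix4Dmapper).

```python
# Sketch: tiling an orthomosaic into 256 x 256 image instances, as described
# above. The file name is a hypothetical placeholder.
import numpy as np
from PIL import Image

Image.MAX_IMAGE_PIXELS = None               # orthomosaics can be very large
ortho = np.asarray(Image.open("orthomosaic.tif"))

tiles = []
for r in range(0, ortho.shape[0] - 255, 256):
    for c in range(0, ortho.shape[1] - 255, 256):
        tiles.append(ortho[r:r + 256, c:c + 256])
# Each tile is then labeled manually by its most prominent object (Section 3).
print(f"{len(tiles)} tiles of 256 x 256 created")
```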

#### *4.2. 2D Convolutional Neural Network Architecture*

Pre-trained CNNs have advantages in their relative stability during training, their efficiency, and their higher performance across diverse tasks. Among the various networks available for transfer learning, AlexNet and VGGNet were selected as the basis for developing the 2D CNN models in MATLAB 2020a. These two networks were pre-trained on millions of images covering more than 1000 classes, and the selected networks represent different architectures. The AlexNet model was developed in 2012 and was the first CNN model to perform well on the ImageNet database; it still performs consistently well on diverse datasets [9,30]. This network contains five layers, including convolutional and max-pooling layers, and two fully connected layers, as illustrated in Figure 6. The developed model based on AlexNet had an architecture identical to the AlexNet network; however, within the fully connected layers, the dropout regularization method was applied to combat overfitting during training [31]. The input images were also augmented through rotation and reflection to reduce the generalization error of the models. The second CNN model was developed via transfer learning from VGGNet (2014). The VGGNet model has 16 convolutional and max-pooling layers followed by the fully connected layers, as shown in Figure 7. The small filter sizes (i.e., 3 × 3 kernels) in VGGNet capture and learn the small details of input instances, while larger filter sizes (i.e., 5 × 5) permit the network to extract features corresponding to larger regions. Developing the networks via transfer learning permitted modifying the previously learned feature extractors of these networks for a new task using a smaller number of training images and epochs [30].

**Figure 6.** AlexNet network architecture.

**Figure 7.** VGGNet network architecture.
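The models in this study were built in MATLAB 2020a; as a rough analogue only, a Keras sketch of the VGGNet-based transfer learning (frozen pre-trained base, new fully connected head with dropout, rotation and reflection augmentation) might look as follows. The head size and dropout rate are assumptions of this sketch.

```python
# Hedged Keras analogue of the VGGNet transfer learning described above;
# not the study's MATLAB implementation. Head size and dropout are assumed.
import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                      # keep the pre-trained extractors

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),  # reflection
    tf.keras.layers.RandomRotation(0.25),                   # rotation
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),           # dropout against overfitting
    tf.keras.layers.Dense(7, activation="softmax"),         # seven classes
])
```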

During the training of the models developed with the transfer learning strategy, the 256 × 256 image instances were rescaled to 227 × 227 and 224 × 224 for AlexNet and VGGNet, respectively. In addition, the batch size, which represents the number of images input into the network at once, was set to 64 [15]. While the number of epochs was originally set as high as 2000, training was terminated when the computed losses reached a plateau, to combat overfitting. The learning rate was set to 0.01 for both networks. Besides these parameters, the remaining hyperparameters were kept identical to the original networks [32]. Note that the training images for both AlexNet and VGGNet were identical in order to compare the results; because of the augmentation process, over 10,000 images across the seven classes were used in network training. The training performance was evaluated by computing the losses and the validation accuracy. The training of AlexNet required approximately 300 iterations, while VGGNet required approximately 500 iterations. Both developed networks reached optimized accuracy for the seven classes: 88.7% for AlexNet and 91.0% for VGGNet. Figure 8 shows the confusion matrices for the developed networks, which demonstrate that both networks were able to detect the majority class (terrain) with a high level of accuracy.
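Continuing the sketch above, the reported training setup (224 × 224 inputs, batch size 64, learning rate 0.01, and termination when the loss plateaus) could be expressed as follows; the placeholder data and the plateau patience are assumptions.

```python
# Sketch of the reported training configuration; "model" is the transfer
# learning model from the previous sketch. Placeholder data stand in for
# the labeled, resized tiles.
import numpy as np
import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

x = np.random.rand(128, 224, 224, 3).astype("float32")   # hypothetical tiles
y = np.random.randint(0, 7, size=128)                     # hypothetical labels

stop_on_plateau = tf.keras.callbacks.EarlyStopping(
    monitor="loss", patience=10, min_delta=1e-4)          # loss plateau stop

model.fit(x, y, batch_size=64, epochs=2000,               # 2000 = upper bound
          callbacks=[stop_on_plateau])
```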

The evaluation of the results indicated that the models struggled to learn the differences among some of the original classes, particularly those related to structural damage assessment. Consequently, selected classes were merged to reduce the total number from the original seven to five and then four classes. This was done to determine whether the models could distinguish a structural class in general, and an improvement was noted. However, none of the networks were able to learn the remaining damage-related classes, including partially damaged structures, completely damaged structures, and debris, due to significant similarities between partially and completely damaged structures within the segmented orthoimages.

**Figure 8.** 2D convolutional neural network (CNN) confusion matrices during the training process: (**a**) AlexNet and (**b**) VGGNet models.

#### *4.3. Dataset Preparation for 3D Point Clouds*

Raw, unstructured point clouds are typically incompatible with CNN architectures because, unlike images, point clouds generally lack a grid structure. Consequently, the raw point cloud instances were converted into volumetric or occupancy grid models, which are 3D arrays. Occupancy grid models provide a suitable data structure for point clouds that can be used within robust CNN learning models. To convert the point cloud instances to occupancy grid models, the method proposed by Mohammadi et al. was used [7]. Initially, the point cloud instances were created by slicing the labeled point cloud dataset into roughly 10 m × 10 m segments. Then, the coordinates within each segment, which consisted of objects with various labels, were shifted to contain only positive values and normalized [7]. Afterward, each segment was downsampled based on the selected occupancy grid dimensions. Within this study, an occupancy grid model of 64<sup>3</sup> was used, as it results in a sampling of 10 to 16 cm for 10 m × 10 m segments, which is a sufficient resolution to perform per-building damage assessment in the aftermath of windstorm events [17]. Lastly, an extra label corresponding to the empty cells within the 3D arrays was assigned to each instance and denoted as neutral. This allowed the network to learn not only the label instances but also the geometry of the output based on the input instances, since occlusions or gaps in point clouds are common.
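A minimal sketch of this occupancy-grid conversion is given below; the label encoding and the exact normalization details are assumptions based on the description above.

```python
# Sketch of the occupancy-grid conversion described above: a roughly
# 10 m x 10 m segment is shifted to positive coordinates, normalized, and
# downsampled into a 64^3 grid; empty cells carry the "neutral" label.
import numpy as np

NEUTRAL = 0  # assumed encoding for the empty-cell (neutral) label

def to_occupancy_grid(points, labels, dim=64):
    """points: N x 3 coordinates; labels: N per-point class ids (>= 1)."""
    shifted = points - points.min(axis=0)          # positive values only
    scaled = shifted / shifted.max() * (dim - 1)   # normalize into the grid
    grid = np.full((dim, dim, dim), NEUTRAL, dtype=np.int32)
    i, j, k = np.round(scaled).astype(int).T
    grid[i, j, k] = labels                         # downsample: last point wins
    return grid

segment = np.random.rand(5000, 3) * 10.0           # hypothetical 10 m segment
seg_labels = np.random.randint(1, 8, size=5000)    # seven classes, ids 1..7
grid = to_occupancy_grid(segment, seg_labels)
```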

#### *4.4. 3D Fully Convolutional Network Architecture with Skip Connections*

The model developed to learn 3D point cloud instances was guided by the previous work of Long et al., as discussed in Mohammadi et al. [7,33]. However, the authors reported that the developed 3D FCN required a large number of training iterations to achieve an acceptable level of accuracy. As a result, the 3D FCN architecture was modified within this study with skip connections, such that the network can recover the most useful features during the training process at a faster rate [25,34]. The 3D FCN was implemented in TensorFlow v1.15, and the developed model had an overall architecture similar to that presented in Mohammadi et al. [7]. In summary, the network comprised an input layer that accepted three 3D arrays corresponding to the red, green, and blue channels, followed by an encoding part and a decoding part: the encoder comprised six 3D convolutional layers, and the decoder consisted of six 3D transpose convolutional layers. Note that the network did not use any max-pooling layers. Lastly, the output layer was a single occupancy grid model in which each cell represented the label of the input point cloud instance (Figure 9). Skip connections added the output of the convolutional layers within the encoder to the corresponding inputs of the transpose convolutional layers in the decoder. The skip connections conceptually help the network recover fine details in the prediction and reduce gradient vanishing issues. Figure 9 illustrates the skip connections with arrows.

**Figure 9.** The developed 3D fully convolutional network with skip connections pipeline.
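A condensed Keras sketch of this architecture is given below (the original model was implemented in TensorFlow v1.15); the filter counts and stride pattern are assumptions chosen only so that the encoder and decoder shapes match for the additive skip connections.

```python
# Hedged sketch of the described 3D FCN with skip connections: RGB occupancy
# grids in, per-cell labels out, six encoder convolutions mirrored by six
# transpose convolutions, no max-pooling, additive skips. Filters/strides
# are assumptions, not the paper's exact values.
import tensorflow as tf
from tensorflow.keras import layers

def conv(x, f, s):
    return layers.Conv3D(f, 3, strides=s, padding="same", activation="relu")(x)

def deconv(x, f, s):
    return layers.Conv3DTranspose(f, 3, strides=s, padding="same",
                                  activation="relu")(x)

inputs = tf.keras.Input(shape=(64, 64, 64, 3))  # red, green, blue channels

# Encoder: six 3D convolutional layers (three strided in this sketch).
e1 = conv(inputs, 16, 1)   # 64^3
e2 = conv(e1, 16, 2)       # 32^3
e3 = conv(e2, 32, 1)       # 32^3
e4 = conv(e3, 32, 2)       # 16^3
e5 = conv(e4, 64, 1)       # 16^3
e6 = conv(e5, 64, 2)       # 8^3

# Decoder: six transpose convolutions; encoder outputs are added to the
# matching decoder inputs (the skip connections drawn as arrows in Figure 9).
d = layers.Add()([deconv(e6, 64, 2), e5])
d = layers.Add()([deconv(d, 32, 1), e4])
d = layers.Add()([deconv(d, 32, 2), e3])
d = layers.Add()([deconv(d, 16, 1), e2])
d = layers.Add()([deconv(d, 16, 2), e1])
d = deconv(d, 16, 1)

# Per-cell class scores: seven classes plus the "neutral" empty-cell label.
outputs = layers.Conv3D(8, 1, activation="softmax")(d)
model = tf.keras.Model(inputs, outputs)
```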

The developed 3D FCN with skip connections was optimized using stochastic gradient descent, and the cells that contained labels other than the neutral class were weighted by a factor of 2.0 while updating the learnable parameters, to boost training and reduce the convergence time. The model was trained on instances from the Salt Lake and Puerto Rico datasets. To further improve the network's generalization, the training instances were augmented by randomly rotating each instance two times, resulting in a total of 10,958 training instances. In addition, it was observed that network convergence improved as the mini-batch size increased from 64 to 256; therefore, the model was trained with a mini-batch size of 256. To evaluate the training process, three performance measures were calculated in addition to the loss: recall, precision, and cell accuracy, as shown in the equations below:

$$\text{Recall} = \frac{C_{ii}}{C_{ii} + \sum_{j \neq i} C_{ij}} \tag{1}$$

$$\text{Precision} = \frac{C_{ii}}{C_{ii} + \sum_{j \neq i} C_{ji}} \tag{2}$$

$$\text{Cell accuracy} = \frac{\sum_{i} C_{ii}}{\sum_{i} \sum_{j} C_{ji}} \tag{3}$$

where $C_{ii}$ represents the diagonal of the confusion matrix, which corresponds to the true predictions, $\sum_{j \neq i} C_{ij}$ denotes the false negatives, $\sum_{j \neq i} C_{ji}$ denotes the false positive predictions, $\sum_{i} C_{ii}$ represents the total count of true predictions, and $\sum_{i} \sum_{j} C_{ji}$ represents the total count of all predictions. Table 4 presents these performance measures for the developed model during training for a total of 2500 epochs, and Figure 10 shows the training losses, measured as mean squared error (MSE). Lastly, Figure 11 shows the confusion matrix for the trained model. The training results demonstrate that while the model learned the geometry of the input instances with a high level of accuracy (cell accuracy of 98.1%), it could not distinguish between partially damaged structures, completely damaged structures, debris, and vehicles. Lastly, the developed network with skip connections demonstrated a substantial improvement over the 3D FCN model introduced by Mohammadi et al., as it achieved a similar level of accuracy with almost 25% of the training iterations [7].
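As an illustration, the three measures can be computed from a confusion matrix as follows; the matrix orientation (rows as true classes, columns as predictions) and the example values are assumptions of this sketch.

```python
# Sketch: per-class recall and precision (Equations (1)-(2)) and overall
# cell accuracy (Equation (3)) from a confusion matrix C, where C[i, j]
# counts cells of true class i predicted as class j (assumed orientation).
import numpy as np

def performance_measures(C):
    tp = np.diag(C).astype(float)          # C_ii: true predictions
    recall = tp / C.sum(axis=1)            # row sums add the false negatives
    precision = tp / C.sum(axis=0)         # column sums add the false positives
    cell_accuracy = tp.sum() / C.sum()     # all true over all predictions
    return recall, precision, cell_accuracy

C = np.array([[90,  5,  5],                # hypothetical 3-class example
              [10, 80, 10],
              [ 0, 20, 80]])
r, p, acc = performance_measures(C)
print(r, p, f"cell accuracy = {acc:.1%}")
```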


**Table 4.** Quantified performance measures for the training dataset.

**Figure 10.** Loss progress (MSE) during model training.

**Figure 11.** 3D fully convolutional network (FCN) confusion matrix for the training dataset.

#### **5. Discussion**

#### *5.1. 2D CNN Experiment*

The developed 2D CNN networks demonstrated a significant difference between training and testing performance measures. The network accuracy during training reached 88.7% and 91.0% for AlexNet and VGGNet, respectively, while lower accuracy was demonstrated in testing. This could be caused by the limitation of 2D CNN classification based only on RGB information, which lacks depth information; it also indicates that the networks were not able to learn useful features to distinguish between certain classes. To investigate this further, the model developed via transfer learning from VGGNet was retrained using five and four classes, where the structure-related classes were grouped together. As VGGNet demonstrated better performance, this model was selected for a more detailed performance investigation. The combined classes represent more general object classes than the original seven-class instances. To reduce the classes to five, the completely damaged and partially damaged classes were merged into a single class named damaged. Similarly, to reduce the total to four classes, the completely damaged, partially damaged, and undamaged structure classes were combined into a general class of structures. Identical parameters and architecture were used to train the new networks on the reduced numbers of classes. Training accuracy improved, reaching 92.0% and 94.6% for the five and four classes, respectively. The original seven-class confusion matrix is shown in Figure 12; confusion matrices for the merged five- and four-class models are shown in Figures 13 and 14, respectively. In the end, the four-class VGGNet transfer learning showed a significant improvement in both training accuracy and testing performance, as expected. However, this model is not ideal for the targeted structural damage classification following natural hazard events, because the structural classes were combined and the VGGNet training (in all models) cannot reliably distinguish between undamaged, partially damaged, and completely damaged structures. Instead, the general object classification of structures, roadways, terrain, and vehicles proved to perform well. The improved performance when the classes were combined demonstrates that the depth information within 3D point clouds is critical to automatically distinguish damaged structures from undamaged structures.
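As a minimal illustration of this class reduction, a label remapping along these lines could be used; the class names are shorthand for the seven classes defined in Section 3, and the encoding is an assumption.

```python
# Sketch: merging the seven classes into five or four via label remapping.
SEVEN = ["undamaged", "partial", "complete", "debris",
         "roadway", "terrain", "vehicle"]

# Seven -> five: merge partially and completely damaged into "damaged".
to_five = {"partial": "damaged", "complete": "damaged"}
# Seven -> four: merge all structure-related classes into "structure".
to_four = {"undamaged": "structure", "partial": "structure",
           "complete": "structure"}

def remap(labels, mapping):
    return [mapping.get(lbl, lbl) for lbl in labels]

print(remap(["partial", "terrain", "undamaged"], to_four))
# -> ['structure', 'terrain', 'structure']
```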

**Figure 12.** VGGNet transfer learning confusion matrix of testing results on the Port Aransas dataset in seven classes.

**Figure 13.** VGGNet transfer learning confusion matrix results for five classes: (**a**) training and (**b**) testing.

**Figure 14.** VGGNet transfer learning confusion matrix results for four classes: (**a**) training and (**b**) testing.

#### *5.2. 3D FCN Experiment*

Similar to the 2D CNN network, the 3D FCN model was developed and trained based on Salt Lake and Puerto Rico instances and was tested on the Port Aransas instances. To create the testing dataset, a procedure similar to that of creating the training dataset was followed; however, the testing instances were not augmented. Figure 15 shows the confusion matrix for testing on the Port Aransas dataset, and Table 5 provides the performance measures for each class.

The 3D FCN network prediction results on the test dataset demonstrated performance measures overall similar to those observed during the training process. The overall cell accuracy of the network was 97.8%. The network was able to predict the terrain instances with a high level of accuracy; this was unexpected, as the general terrain within the testing dataset differs in texture and geometry from the training dataset. This suggests that the model learned features that generalize well between datasets with moderate to low similarity. A pattern similar to that of training was observed in detecting the classes of partially damaged structures, completely damaged structures, debris, and vehicles. The authors expect that by performing extended training and using more learnable parameters, the network will learn features to distinguish between these classes with a higher level of accuracy.

**Figure 15.** 3D FCN confusion matrix for the Port Aransas dataset (in 7 classes).


**Table 5.** Quantified performance measures on the Port Aransas dataset.

#### *5.3. Comparison of 2D CNN and 3D FCN*

The detection accuracies of the 2D CNN models were consistently lower than the performance obtained with the 3D FCN network: 91.0% for the 2D model versus 97.8% for the 3D model. Comparing structural damage detection performance, the 3D FCN demonstrated a clear advantage over the 2D models developed with various numbers of classes. Key advantages of the 2D CNN and images are the smaller number of learnable parameters and the reduced data sizes in comparison to the 3D FCN model. While 2D CNN performance improved from 92.0% to 94.6% for general object classification (structures, terrain, roadways, and vehicles), this basic detection was not adequate for distinguishing between structural damage classes such as completely damaged, partially damaged, and debris. These results demonstrate a significant limitation of classification based solely on RGB information (2D images) in comparison to RGB with depth information (3D point clouds). Consequently, the 3D FCN performs with a marked improvement in structural damage detection compared to the 2D CNN.

#### **6. Conclusions**

Aerial image data collection provides an efficient technique to collect perishable data following a natural hazard event. Both 2D orthomosaic images and 3D point clouds can be obtained and processed for analyses and automated classification. This study compared post-event site damage classification using 2D and 3D datasets following two separate hurricanes from 2017, using a 2D CNN and a 3D FCN. The 2D CNN was developed via transfer learning from two pre-trained networks, AlexNet and VGGNet; its inputs are segmented 2D images, and it outputs a label for each image segment. The 3D FCN was applied to aerial-image-derived point clouds, which it semantically classifies into various classes. To keep the parameters consistent, both the 2D CNN and the 3D FCN initially used identical classes. To further examine the 2D CNN classification performance, a reduction and combination of the classes was used for performance evaluation; the combination merged the damaged and undamaged structure classes, removing the distinction between damage states.

With the reduced numbers of classes for 2D CNN training, the accuracy improved at the cost of reducing and eliminating the classes corresponding to structural damage. The accuracy improvement demonstrates that 2D deep learning classification is well suited for general object detection (terrain, structures, vehicles, roadways, etc.). However, it demonstrated limited capability to distinguish structural characteristics among undamaged, partially damaged, and completely damaged structures as well as debris. This limitation was overcome when using a 3D point cloud dataset in deep learning, which contains both RGB and depth information. The model developed on 2D data was only able to learn the dominant class (i.e., terrain) effectively, resulting in lower precision and accuracy for the other classes in both the training and testing phases. On the contrary, the model developed on 3D point clouds was able to learn the other classes in addition to the dominant class. Classification for damage detection is a known class-imbalance scenario, where the instances that represent damage or debris are often minority classes that follow random and unique geometric and color patterns.

Comparing the training durations, the 2D CNN requires a significantly shorter time (from a few hours to a day), while the 3D FCN requires several days. The 2D CNN training accuracy reached 88.7% and 91.0% for seven classes, and the highest accuracy, 94.6%, was achieved by VGGNet trained on four classes, while the 3D FCN training accuracy was as high as 97.8%. However, in testing, the 2D CNN had significantly lower accuracy than the 3D FCN. The accuracy decrease on the 2D dataset is expected due to the lack of depth information: classification on 2D images is based on RGB only, which can be influenced by object surface reflection, sunlight, shadows, and similar factors. Moreover, although 3D dataset preparation and network development are more time-consuming, higher accuracy and reliability can be achieved, especially when classifying the location and severity of damage following natural hazard events.

**Author Contributions:** Conceptualization, Y.L., M.E.M., and R.L.W.; data curation, Y.L. and M.E.M.; formal analysis, Y.L. and M.E.M.; funding acquisition, R.L.W.; methodology, Y.L. and M.E.M.; project administration, R.L.W.; supervision, R.L.W.; validation, Y.L., M.E.M., and R.L.W.; writing—original draft, Y.L., M.E.M., and R.L.W.; writing—review and editing, Y.L., M.E.M., and R.L.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** No external funding directly supported this work.

**Acknowledgments:** This work was completed utilizing the Holland Computing Center of the University of Nebraska, which receives support from the Nebraska Research Initiative. Data related to the site in Texas were collected by Michael Starek of Texas A & M at Corpus Christi, and its availability is greatly appreciated by the authors, as published on the National Science Foundation's Natural Hazards Engineering Research Infrastructure (NSF-NHERI) DesignSafe cyberinfrastructure. Data related to Puerto Rico were collected by researchers under the supervision of Matt Waite of the University of Nebraska-Lincoln, and the sharing of this data with the authors is greatly appreciated.

**Conflicts of Interest:** The authors declare no conflict of interest. In addition, the funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
