*2.3. Raw Samples of Continuous Bridge Deflection*

The spatial resolution in the motion direction refers to the sampling interval between adjacent points of the device in Figure 2. This parameter is determined by the wheel diameter and the line count of the encoder code wheel, and is approximately 1.48 mm. For the continuous deflection of the main span, taking the intact scenario shown in Figure 4 as an example, the deformation response **U**<sup>0</sup> consisted of a 6554-dimensional data sequence covering the 9.7 m length of the main span. Owing to the high spatial resolution, the continuous curve clearly reflected a certain degree of pre-camber applied to the main span. Moreover, the curve revealed that the experimental platform did not deform in a completely symmetrical manner, because the cable forces were adjusted by hand.

**Figure 4.** Deformation of main span depicted by continuous deflection.

The local continuous deflection curves **u**<sup>0</sup>, **u**<sup>1</sup>, **u**<sup>2</sup>, and **u**<sup>3</sup> are shown in Figure 5. Each local continuous deflection curve contained a 390-dimensional data sequence. The coverage length of the area affected by the pad was taken as the primary basis for determining the length of the local continuous deflection. Through preliminary data observation, the length of the region with the largest influence range among the three pad disturbance positions was selected, rounded, and defined as the final truncated length, which guaranteed the consistency of the sample dimensions across groups. The continuous-curve test technique was used to collect the structural response of the scaled-down bridge model under the intact condition and under each simulated structural damage condition, yielding five groups each of **U**<sup>0</sup>, **U**<sup>1</sup>, **U**<sup>2</sup>, and **U**<sup>3</sup>. Accordingly, five groups each of **u**<sup>0</sup>, **u**<sup>1</sup>, **u**<sup>2</sup>, and **u**<sup>3</sup> corresponded to the four structural conditions, namely intact, damage1/4, damage1/2, and damage3/4, and these were used as the raw samples for the following study.

**Figure 5.** Truncated deformations depicted by continuous deflections.

#### **3. Detection Methodology Based on Deep CNN**

#### *3.1. Data Augmentation and Pre-processing*

Data augmentation and pre-processing are two essential tasks before carrying out deep learning. The former is often the first choice for boosting the performance of a deep network. For image recognition based on deep CNNs, there is a wide range of ways to perform data augmentation [12,44,45]. However, those approaches are not suitable for signal-based pattern classification with a deep CNN. As shown in Figure 6a, dividing the raw acquired signals directly into sub-fragments of equal length is a common means of data augmentation [24,25]. It can be seen from Figure 6b that, for fragments of the same length as in Figure 6a, the overlapping zone set between adjacent fragments makes the number of fragments *m* larger than the *n* obtained in Figure 6a, which effectively increases the data size.

**Figure 6.** Data augmentation of (**a**) common means and (**b**) adopted operation.

Since the number of original experimental samples was small, the shift between adjacent overlapping fragments was taken as *g* = 1. Obviously, the larger the value of *k* in Figure 6, the smaller the number of fragments after data augmentation, and vice versa; a greater number of fragments in turn required more computational overhead for model training. As a tradeoff between the training objective of the model and the computational overhead, the fragment length was set to *k* = 50, so that each 390-dimensional original sequence became 341 50-dimensional sequence samples. For the five raw groups of **u**<sup>0</sup>, **u**<sup>1</sup>, **u**<sup>2</sup>, and **u**<sup>3</sup>, data augmentation produced the sample sets **u**′<sup>0</sup>, **u**′<sup>1</sup>, **u**′<sup>2</sup>, and **u**′<sup>3</sup>, each containing 1705 samples; these are shown as mesh graphics in Figure 7a–d.
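
As a minimal sketch of this augmentation step, the sliding-window slicing can be written as follows (assuming the truncated deflection is a 1-D NumPy array; the names `make_fragments` and `raw_sequence` are illustrative, not from the original implementation):

```python
import numpy as np

def make_fragments(sequence: np.ndarray, k: int = 50, g: int = 1) -> np.ndarray:
    """Cut a 1-D sequence into overlapping fragments of length k,
    shifting the window by g points between adjacent fragments."""
    n_fragments = (len(sequence) - k) // g + 1
    return np.stack([sequence[i * g : i * g + k] for i in range(n_fragments)])

# A 390-dimensional truncated deflection yields (390 - 50) / 1 + 1 = 341 fragments.
raw_sequence = np.random.rand(390)   # placeholder for one local deflection curve
fragments = make_fragments(raw_sequence)
print(fragments.shape)               # -> (341, 50)
```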

**Figure 7.** Sample set represented by mesh graphics through data augmentation for (**a**) intact, (**b**) damage1/4, (**c**) damage1/2, and (**d**) damage3/4.

To eliminate the difference in deflection amplitudes among the four types of samples in Figure 7 and to achieve a better classification effect [46,47], the min-max normalization [48] expressed in Equation (1) is used to rescale all amplitudes to the range 0–1.

$$\mathbf{u}\_{i}^{\prime\prime} = \frac{\mathbf{u}\_{i}^{\prime} - \min(\mathbf{u}\_{i}^{\prime})}{\max(\mathbf{u}\_{i}^{\prime}) - \min(\mathbf{u}\_{i}^{\prime})}, \quad (i = 0, 1, 2, 3) \tag{1}$$
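
In code, Equation (1) amounts to a min-max rescaling. The sketch below normalizes each fragment independently, which is an assumption; the text does not fully specify whether the min/max were taken per fragment or per sample set (`fragments` is the array from the previous sketch):

```python
def min_max_normalize(u: np.ndarray) -> np.ndarray:
    """Rescale amplitudes to [0, 1] per fragment, as in Equation (1)."""
    u_min = u.min(axis=1, keepdims=True)
    u_max = u.max(axis=1, keepdims=True)
    return (u - u_min) / (u_max - u_min)

normalized = min_max_normalize(fragments)   # all amplitudes now in [0, 1]
```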

As shown in Table 1, "raw" and "truncated" refer to the continuous deflections of the test area and the analysis area shown in Figure 3, respectively. After data augmentation and normalization of the truncated data, each of the four categories, i.e., intact and the three types of simulated damage, contained 1705 samples. The four sample sets **u**″<sup>0</sup>, **u**″<sup>1</sup>, **u**″<sup>2</sup>, and **u**″<sup>3</sup> were used as input data, in which **u**″<sup>0</sup> represented the intact baseline and the remaining three represented the different damage scenarios. The output labels of the four categories were described in one-hot form: each label vector had all zero elements except at position *j*, where *j* was the index of the structural state.
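
The one-hot labels can be generated, for instance, with the Keras utility below (a sketch; class indices 0–3 stand for intact, damage1/4, damage1/2, and damage3/4):

```python
from tensorflow.keras.utils import to_categorical

class_indices = [0, 1, 2, 3]          # intact, damage1/4, damage1/2, damage3/4
one_hot_labels = to_categorical(class_indices, num_classes=4)
# e.g. damage1/2 (j = 2) -> [0., 0., 1., 0.]
```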

**Table 1.** Details of the training and test datasets.


The entire measurement process was performed under a stable temperature field, and the data in this study came from actual measurements, which already contained the noise disturbances present in the indoor environment. Therefore, additional simulated noise and temperature effects were not considered here.

#### *3.2. Descriptions of the Proposed CNN Architecture*

Table 2 gives the details of the proposed CNN structure, obtained by trial and error under the available computing resources. The model structure was inspired by the Cifar-10 network [49], in which convolution and pooling operations are not used strictly in pairs. Figure 8 shows a graphical representation of the CNN structure for an input sample length of 50, where green, blue, and yellow refer to the convolution kernels, max-pooling, and fully connected layers, respectively.


**Table 2.** The details of CNN structure.

Note: PReLU—parametric rectified linear unit.

**Figure 8.** The proposed deep CNN architecture.

Layer 0, the input layer in Figure 8, was convolved with a kernel of size 2 to produce Layer 1. Convolution and cross-correlation are used interchangeably in deep learning [50] and can be described as:

$$\mathbf{f}(i) = \sum\_{n=1}^{N} \mathbf{s}(i+n)\mathbf{k}(n) \tag{2}$$

where **s** is the input signal, **k** is the filter, and *N* is the number of elements in **k**. The output vector **f** is the cross-correlation of **s** and **k**. Next, Layer 1 was convolved with a kernel of the same size to produce Layer 2. After these two convolutions, max-pooling of size 2 was applied to every feature map (Layer 3). By repeating the above operations twice more, another four convolutional layers and two max-pooling layers were created. The neurons of Layer 9 were then flattened and fully connected to 200 neurons in Layer 10. Layer 10 was fully connected to 128 neurons in Layer 11, and Layer 11 was fully connected to 64 neurons in Layer 12. Finally, Layer 12 was connected to the last layer (Layer 13) with 4 output neurons representing intact, damage1/4, damage1/2, and damage3/4.
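
The layer sequence described above can be sketched in Keras as follows. The filter counts and padding mode are assumptions (Table 2 fixes the actual values); the layer types, kernel and pooling sizes, fully connected widths, PReLU placement, and dropout follow the text:

```python
from tensorflow.keras import layers, models

def build_model(input_length=50, n_classes=4,
                filters=(16, 16, 32, 32, 64, 64)):   # filter counts assumed
    """Sketch of Figure 8: three (conv, conv, pool) blocks, then FC 200-128-64-4."""
    m = models.Sequential()
    m.add(layers.Input(shape=(input_length, 1)))
    for i in range(0, 6, 2):                         # Layers 1-9 in three blocks
        m.add(layers.Conv1D(filters[i], kernel_size=2, padding='same'))
        m.add(layers.PReLU())                        # PReLU on every conv layer
        m.add(layers.Conv1D(filters[i + 1], kernel_size=2, padding='same'))
        m.add(layers.PReLU())
        m.add(layers.MaxPooling1D(pool_size=2))
    m.add(layers.Flatten())
    m.add(layers.Dense(200))                         # Layer 10
    m.add(layers.Dense(128))                         # Layer 11
    m.add(layers.PReLU())
    m.add(layers.Dense(64))                          # Layer 12
    m.add(layers.PReLU())
    m.add(layers.Dropout(0.35))                      # dropout before classification
    m.add(layers.Dense(n_classes, activation='softmax'))  # Layer 13
    return m
```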

Because the gradient on the negative side of the rectified linear unit (ReLU) [51], shown in Equation (3), is always zero, a neuron may become permanently inactive during training if a large gradient update drives its input into the negative region, after which its weights can no longer be updated.

$$f(x) = \begin{cases} x, & x \ge 0 \\ 0, & x < 0 \end{cases} \tag{3}$$

The Leaky ReLU method [52] is a good alternative that addresses this problem by introducing a parameter α, as in Equation (4):

$$f(x) = \begin{cases} x, & x \ge 0 \\ \alpha x, & x < 0 \end{cases} \tag{4}$$

where α is usually set to a small value that remains constant once chosen. This allows a small, non-zero gradient when the unit is not active. The parametric rectified linear unit (PReLU) [53], which has the same mathematical expression as Leaky ReLU, takes this idea further by turning the coefficient α into a parameter learned along with the other network parameters. Since it removes the need to specify α by hand, PReLU was used in place of Leaky ReLU in this work as the activation function for the convolutional layers (1, 2, 4, 5, 7, and 8) and the two fully connected layers (11 and 12).
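
The distinction can be made concrete in a few lines of NumPy (a sketch): the forward pass of the two activations is identical; only the status of α differs, fixed for Leaky ReLU and trainable for PReLU.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha is a fixed hyperparameter chosen in advance (Equation (4))
    return np.where(x >= 0, x, alpha * x)

def prelu(x, alpha):
    # same expression, but alpha is a learnable parameter that is
    # updated by backpropagation along with the network weights
    return np.where(x >= 0, x, alpha * x)
```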

Further, the Softmax function was used to compute the probability distribution over the four output classes, which can be expressed as follows:

$$p\_k = \frac{e^{x\_k}}{\sum\_{i=1}^{n} e^{x\_i}} \tag{5}$$

where *x<sub>k</sub>* is the *k*-th input to the last layer, *n* is the number of output nodes, and the outputs *p<sub>k</sub>* lie between 0 and 1 and sum to one. Equation (5) was applied to Layer 13 in Figure 8 to predict which category (intact, damage1/4, damage1/2, or damage3/4) the input signal belonged to.
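
Equation (5) can be implemented directly (a sketch; subtracting the maximum before exponentiation is a standard numerical-stability trick not mentioned in the text):

```python
def softmax(x: np.ndarray) -> np.ndarray:
    """Turn the last-layer inputs into class probabilities, Equation (5)."""
    e = np.exp(x - x.max())        # stabilized exponentials
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1, -1.0])))   # four values summing to 1.0
```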

Compared with shallow neural networks, a deep CNN is a more complicated model with more hidden layers and weights, and is therefore particularly prone to overfitting. In the proposed deep CNN, a dropout rate of 0.35 was applied before the classification layer (Layer 13), as shown in Table 2; together with the early stopping [54] mentioned below, this effectively suppressed overfitting during all training processes.

#### *3.3. Training Setting*

Ten percent of the total dataset was used for testing, while the remaining 90% was divided into training (80%) and validation (20%) parts. The validation set was used to evaluate the model's performance after each epoch and to prevent overfitting.
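
One way to realize this split with scikit-learn is sketched below (`X` holds the normalized samples and `y` the integer class indices; stratification and the random seed are assumptions, not stated in the text):

```python
from sklearn.model_selection import train_test_split

# 10% held out for test; the remaining 90% split 80/20 into train/validation
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.20, stratify=y_trainval, random_state=0)
```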

Because the cross-entropy function is more sensitive to prediction errors, learning rules derived from it generally yield better performance. Here, categorical cross entropy was used as the objective function to estimate the difference between the original and predicted damage types, expressed as follows:

$$J = \sum\_{i=1}^{k} \left[ -d\_i \ln(y\_i) - (1 - d\_i) \ln(1 - y\_i) \right] \tag{6}$$

where *J* is the cross entropy, *y<sub>i</sub>* is the predicted output for class *i*, *d<sub>i</sub>* is the true class label in the training data, and *k* is the number of output nodes.
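
For a single sample, Equation (6) can be written out directly (a sketch; `d` is the one-hot target, `y` the softmax output, and the small `eps` guards against log(0), an implementation detail not in the text):

```python
import numpy as np

def cross_entropy(d: np.ndarray, y: np.ndarray, eps: float = 1e-12) -> float:
    """Equation (6), summed over the k output nodes."""
    y = np.clip(y, eps, 1.0 - eps)
    return float(np.sum(-d * np.log(y) - (1.0 - d) * np.log(1.0 - y)))

print(cross_entropy(np.array([0., 1., 0., 0.]),
                    np.array([0.1, 0.7, 0.1, 0.1])))   # small loss, correct class
```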

To minimize the above objective function, adaptive moment estimation (Adam) was selected as the optimization algorithm. Adam computes an adaptive learning rate for each parameter and stores both an exponentially decaying average of past squared gradients and an exponentially decaying average of past gradients [55]. The training parameters used in this work are given in Table 3; the early stopping technique controlled the number of training epochs and further guarded against overfitting, and the Adam parameters followed the suggestions in [56].
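
In Keras, this training setup might look as follows (a sketch reusing the `build_model` sketch above; the learning rate, patience, batch size, and epoch cap are placeholders for the values listed in Table 3, and `y_train`/`y_val` are assumed to be one-hot labels):

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

model = build_model()
model.compile(optimizer=Adam(learning_rate=1e-3),    # Adam settings per [56]
              loss='categorical_crossentropy',
              metrics=['accuracy'])
early_stop = EarlyStopping(monitor='val_loss', patience=500,   # patience assumed
                           restore_best_weights=True)
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=5000, batch_size=64,
                    callbacks=[early_stop])
```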


**Table 3.** The training parameters of CNN structure in this work.

Moreover, a ten-fold cross-validation approach was used in this study to reduce the sensitivity of the algorithm's performance to the data partitioning and to extract as much valid information as possible from the augmented data. First, the prepared dataset was randomly divided into ten equal parts. Nine of the ten parts were used to train the proposed deep CNN, while the remaining one-tenth was used to test the model's performance. This procedure was repeated ten times by shifting the training and test sets. The accuracies reported in this paper are the averages over the ten evaluations.
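
A minimal sketch of the procedure with scikit-learn's `KFold` follows (whether the folds were stratified is not stated, so plain `KFold` is assumed; `X` and `Y` are the full sample array and one-hot labels, and `build_model` is the sketch from Section 3.2):

```python
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.callbacks import EarlyStopping

accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                 random_state=0).split(X):
    model = build_model()                       # fresh model for every fold
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(X[train_idx], Y[train_idx], validation_split=0.2,
              epochs=5000, verbose=0,
              callbacks=[EarlyStopping(monitor='val_loss', patience=500)])
    _, acc = model.evaluate(X[test_idx], Y[test_idx], verbose=0)
    accuracies.append(acc)

print(f"mean accuracy over ten folds: {np.mean(accuracies):.3f}")
```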

#### **4. Results and Discussion**

The proposed CNN model was implemented with the Python packages TensorFlow and Keras [57]. The average training runtime per fold was approximately 15 minutes on a machine with a GTX 1080 Ti GPU and a twelve-core 2.20 GHz Intel Xeon E5-2650 v4 CPU. With the settings in Table 3, the training processes showed that convergence was rather fast within the initial 500 epochs for all folds of the ten-fold cross-validation, but approximately 3000 to 4500 epochs were still needed to reach the best performance under the patience rule of early stopping. A typical training process, represented by fold 2, is shown in Figure 9 in terms of accuracy and loss; it stopped at epoch 3036.

**Figure 9.** Typical training processes about (**a**) accuracy and (**b**) loss.

The confusion matrix accumulated across all ten folds is presented in Figure 10a. It can be observed that 98.3% of the **u**″<sup>0</sup> signals were correctly classified as intact, while 1.7% were erroneously assigned to the damage categories. A high percentage, 98.4%, of the **u**″<sup>1</sup> signals were correctly classified as damage1/4, with 1.3% wrongly classified as damage3/4. For **u**″<sup>2</sup>, the accuracy for damage1/2 reached 96.8%, with 2.9% wrongly predicted as damage3/4. Similarly, 94.2% of the **u**″<sup>3</sup> signals were correctly classified as damage3/4, with the remaining 5.8% wrongly classified as intact (1.1%), damage1/4 (1.9%), and damage1/2 (2.8%).

**Figure 10.** The confusion matrices of (**a**) CNN, (**b**) random forest, (**c**) support vector machine, (**d**) k-nearest neighbor, and (**e**) decision trees.

Furthermore, to evaluate the capability in each fold of cross-validation, the average accuracies shown in Figure 11 were compared between the proposed model and four other pattern recognition methods. When the samples were classified directly, without heavy feature-extraction engineering, the detection accuracy of the proposed CNN model (96.9%) was clearly better than that of random forest (RF) (81.6%), support vector machine (SVM) (79.9%), k-nearest neighbor (KNN) (77.7%), and decision trees (DT) (74.8%). The dataset allocation for the four comparison methods was kept consistent with that of the proposed deep CNN. To make the comparison as fair as possible, the key hyperparameters of RF, SVM, KNN, and DT were tuned by trial and error in sklearn [58]. To further quantify the behavior of the classifiers, Figure 10b–e show the confusion matrices of the four comparison methods. The best per-class accuracy among the comparison methods reached 90.3%, as shown in Figure 10b, which was still inferior to the lowest per-class accuracy of 94.2% in Figure 10a.
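
The four baselines can be reproduced with scikit-learn roughly as follows (a sketch using the key parameters recovered in Table 4; whether the entropy criterion applies to RF, DT, or both is ambiguous in the source, and any parameter not listed is left at its sklearn default):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

baselines = {
    'RF':  RandomForestClassifier(criterion='entropy'),
    'SVM': SVC(gamma=10, C=10),
    'KNN': KNeighborsClassifier(n_neighbors=6, metric='euclidean'),
    'DT':  DecisionTreeClassifier(criterion='entropy'),
}
# The raw 50-dimensional fragments are fed in directly, without feature engineering.
for name, clf in baselines.items():
    clf.fit(X_train, y_train)                  # integer class labels, not one-hot
    print(name, clf.score(X_test, y_test))
```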

**Figure 11.** Comparisons of average accuracy in each fold of cross-validation.

Next, as shown in Figure 12a, for all five methods the classification results on damage1/4 clearly outperformed those of the other three categories. Moreover, the detection results for damage3/4 were the worst for all methods, which directly lowered each method's average accuracy. Further, as shown in Figure 12b, the classification imbalance visible in the confusion matrices was most severe when KNN was used as the classifier; this may be related to the relatively low algorithmic complexity of KNN [59,60] compared with the other methods considered here. Only the proposed deep CNN approach effectively mitigated this imbalance, although the accuracy for damage3/4 in Figure 10a was still slightly lower than for the other three classes. The limited number of data samples is likely a major cause of this imbalance. In addition, the current tests were all taken in one travel direction, without data from the opposite direction, which may introduce a cumulative systematic error into the measured structural response. Further, only light pre-processing was applied to the original dataset, which limited the learning ability of each method considered in this paper. In fact, as shown in Table 4, the four comparison machine learning methods had fewer key parameters than the proposed method with which to balance training accuracy against training error. This lower complexity, determined by the principles of the algorithms, resulted in poorer predictive performance on training and validation. Therefore, under the same circumstances, the proposed approach clearly demonstrated better overall capability for automatic feature extraction than the comparison methods.

**Figure 12.** Accuracy distribution based on (**a**) damage category and (**b**) detection method.


**Table 4.** Key parameters set in the comparison methods.

| Method | Key Parameters |
| --- | --- |
| SVM | gamma = 10, *C* = 10 |
| KNN | *K* = 6, DM = euclidean |
| RF, DT | FS criteria = entropy |

Note: FS—feature selection, gamma—the influence of the kernel radius, *C*—the penalty parameter, *K*—the k value defined in KNN, DM—distance metric.
