5.1. Distribution of Datasets
The most important step is to determine the underlying distribution of data before applying any ML approach. In an ML context, the input data for processing comprise only numerical values. A histogram is a visual representation of the dataset distribution that shows any outliers or gaps in the data. ML performance depends on the distribution of the datasets. If the data distribution is not normal, then it can be skewed to the left or right or be completely random. Whether or not the shear wave velocity obtained from the field is normally distributed is discussed below.
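The distribution check described above can be sketched numerically. The following is a minimal illustration, not the procedure used in the study: it computes the moment-based sample skewness of one layer and lists the zero-frequency bins of an equal-width histogram. The function names, the bin count, and the V_S values are hypothetical and chosen only for demonstration.

```python
def sample_skewness(values):
    """Moment coefficient of skewness, g1 = m3 / m2**1.5 (positive = right skewed)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def histogram_counts(values, n_bins):
    """Counts in equal-width bins spanning min..max (assumes the values are not all equal)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value falls on the last edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return counts, edges


def empty_bins(counts, edges):
    """Return the (start, end) ranges of bins with zero frequency."""
    return [(edges[i], edges[i + 1]) for i, c in enumerate(counts) if c == 0]


# Hypothetical V_S values (m/s) for a single layer; NOT data from the study.
vs_layer = [180.0, 195.0, 200.0, 205.0, 210.0, 215.0, 220.0, 230.0, 250.0, 600.0]

skew = sample_skewness(vs_layer)
counts, edges = histogram_counts(vs_layer, n_bins=6)
gaps = empty_bins(counts, edges)

print(f"skewness = {skew:.2f}")
print("zero-frequency ranges:", gaps)
```

A clearly positive skewness together with empty bins between the bulk of the data and an isolated high value is the pattern that a right-skewed, outlier-contaminated layer would be expected to show.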
After analyzing Figure 12a–l, it can be concluded that all of the datasets below 3 m are positively skewed. Moreover, the layers below 3 m of depth are contaminated with outliers. Another observation from Table 2 is that the number of observations decreases with increasing depth. However, the histograms show that the range of the data widens with increasing depth.
Figure 12a illustrates the histogram of V_S at 0–0.5 m depth. The histogram has a maximum value of 116.79 m s−1 and a minimum value of 0 m s−1. The frequency is zero within 81.36–90.21 m s−1 and 214.22–240.79 m s−1.
Figure 12b illustrates the histogram of V_S at 0.5–1 m depth, and the majority of the values of shear wave velocity lie in the range of 120.47–250.67 m s−1. The frequency is zero between 258.81 and 275.09 m s−1. The maximum value of the third histogram (Figure 12c) is 173.71 m s−1, with the lowest being 0 m s−1. Two outliers at 12.06 and 402.7 m s−1 can be observed in the datasets. In addition, five zero-frequency ranges, 12.06–92.89, 106.35–133.29, 200.65–214.12, 254.53–267.99, and 294.93–389.23 m s−1, can be found in the datasets.
Figure 12d is the histogram of V_S at 1.5–2 m depth. These data can be regarded as normally distributed. The majority of the values of shear wave velocity lie in the range of 155.66–285.45 m s−1. No outliers can be observed in the dataset. The histogram of V_S at 2–2.5 m depth is depicted in Figure 12e. It has one peak at 219.93 m s−1 and two gaps (zero frequency) within 86.47–136.52 and 336.7–353.38 m s−1. The histogram at 2–2.5 m depth follows a similar pattern to the fourth histogram. Figure 12f shows the histogram of V_S at 2.5–3 m depth. There is one outlier at a V_S of 459.79 m s−1.
The histogram of V_S at 3–3.5 m depth is shown in Figure 12g. These datasets are right (positively) skewed and contain a significant number of outliers. The majority of the data lie in the 206.65–251.24 m s−1 range, and the outliers are found from 429.6 to 697.14 m s−1. The histogram of V_S at 3.5–4 m depth shows a similar pattern to the histogram at 3–3.5 m. The distribution of the datasets is also right skewed, as shown in Figure 12h. These datasets are contaminated with outliers detected from 322.59–697.14 m s−1.
Figure 12i,j shows that the histograms of V_S at the 4–4.5 m and 4.5–5 m layers are also right skewed. Moreover, outliers were detected from 421.46–799.44 m s−1 at 4–4.5 m and 455.36–875.71 m s−1 at 4.5–5 m depth.
Figure 12k,l illustrates that the V_S histograms at the 5–5.5 and 5.5–6 m layers are right (positively) skewed. Both layers are highly contaminated with outliers, detected from 857.49–1541.83 m s−1 at 5–5.5 m depth and 512.79–1361.85 m s−1 at 5.5–6 m depth. Therefore, it can be observed that the range of V_S starts to increase from 3 m: it spans 72.88–697.14 m s−1 at the seventh (3–3.5 m) and eighth (3.5–4 m) layers, 119.07–799.44 m s−1 at the ninth (4–4.5 m) layer, 100–900 m s−1 at the tenth (4.5–5 m) layer, and 173.17–1541.83 m s−1 for the last two layers (5–6 m), even though the number of observations in these two layers is low, at 77 and 60, respectively (refer to Table 2). Real field data are rarely normally distributed. In skewed data, the tail region may act as an outlier for a statistical model. Moreover, outliers have an adverse effect on model performance, especially for regression-based models.
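The text does not state which criterion was used to identify outliers in the histograms. As a hedged illustration only, the sketch below applies Tukey's interquartile-range fences, a common rule of thumb, to a hypothetical right-skewed layer. The function name `tukey_outliers`, the factor `k = 1.5`, and the V_S values are assumptions made for this example.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)


# Hypothetical V_S values (m/s) for one deep layer; NOT data from the study.
vs_layer = [190.0, 205.0, 210.0, 215.0, 220.0, 225.0, 235.0, 250.0, 430.0, 697.0]

outliers, fences = tukey_outliers(vs_layer)
print("fences:", fences)
print("outliers:", outliers)
```

With a long right tail, only the high values exceed the upper fence, which matches the observation that the tail region of a skewed layer behaves like a set of outliers for a regression model.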
5.2. Comparative Analysis of Shear Wave Velocity for ML Algorithms
A comparison of the predicted shear wave velocity profiles from all ML algorithms with the shear wave velocity profiles from conventional SASW inversion is illustrated in Figure 13, Figure 14, Figure 15, Figure 16 and Figure 17. The figures show tests 1 to 10 in order, two tests per figure. Within the depth of 0.5 m, the V_S obtained from conventional SASW is 206.7 m s−1, whereas the V_S values obtained from MLP, RF, SVR, and LR are 213.01, 226.65, 213.25, and 213.74 m s−1, respectively, for test 1. Similarly, for the second layer, the V_S values obtained are 240.73, 227.89, 211., 234.84, and 224.18 m s−1 from SASW, MLP, RF, SVR, and LR, respectively. The predicted V_S deviated from the SASW value at 4–5 m in test 1, as shown in Figure 13a.
Figure 13b shows high similarity between SASW and SVR. However, a significant difference (up to 100 m s−1) can be noticed over the depth interval of 1–2 m. Figure 14a illustrates that the V_S from SVR was close to the SASW values down to 6 m and was consistent throughout the depth. Figure 14b also shows significant similarities down to 6 m, except at 2–3 m, where the SVR prediction drifted from the V_S from SASW.
SVR showed notable results down to 6 m for Figure 15a,b and Figure 16a,b as well. The values obtained from SVR deviated at the first layer for tests 5 and 6.
SVR makes quite a good prediction for test 9, as shown in Figure 17a. Furthermore, SVR predicts the V_S profile quite well for test 10, except for the second layer (1–1.5 m) and seventh layer (3.5–4 m), as illustrated in Figure 17b.
It is apparent from these figures that SVR outperforms the other algorithms in predicting shear wave velocities. The few drifts occurred because there were insufficient data points for training in those layers. The appendices compare the experimental data with the theoretical dispersion curve corresponding to the SASW profile, in order to show how this difference affects the effective theoretical dispersion curves that correspond to the shear wave velocity profiles from SASW and ML and, subsequently, how well those curves match the experimental data.
The confidence limit, percentage error, RMSE and R² were measured for every layer (0–6 m depth) so as to produce a better evaluation of the ML algorithms. Table 4 shows the confidence limit and the percentage error associated with the RMSE of composite values for the 10 test cases. Below 3 m depth, the confidence limit of MLP is lower than 80%, which is not reliable for design purposes [45]. The limit is greater than 85% for SVR over the full 0–6 m depth. A detailed guideline on target confidence levels was provided by Lorig and Stacey [46]. Among all of the algorithms, SVR gives the best prediction, as its confidence limit stays above 80%. Similarly, RF performs well, except for the last layer of depth. The highest percentage error is less than 15% for SVR, whereas it is greater than 20% for all of the other algorithms.
The RMSE of composite values for the 10 test cases for the different ML algorithms relative to conventional SASW inversion is shown in Figure 18. Down to 3 m, the RMSE ranges from 6.79 to 34.44 m s−1 for SVR, from 9.05 to 36.84 m s−1 for MLP, from 10.13 to 29.29 m s−1 for RF, and from 10.66 to 31.20 m s−1 for LR. Below 3 m, the RMSE increases nonlinearly with depth: for SVR, MLP, RF, and LR, it ranges from 46.03 to 51.15 m s−1, 76.96 to 135.56 m s−1, 39.98 to 81.51 m s−1, and 49.28 to 87.95 m s−1, respectively. The resolution of shear wave velocity is therefore noticeably poorer below 3 m. The distribution of the datasets below 3 m of depth (refer to Figure 12) may cause this sudden increase of RMSE in the ML models.
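For reference, a per-layer RMSE between the SASW values and an ML prediction can be computed as sketched below. This is a minimal illustration under stated assumptions: the exact definition of the percentage error associated with the RMSE is not given in the text, so the normalization used here (RMSE divided by the mean SASW value) is an assumption, and the layer values are hypothetical.

```python
import math


def rmse(reference, predicted):
    """Root-mean-square error between reference (SASW) and predicted (ML) values."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_percent(reference, predicted):
    """RMSE as a percentage of the mean reference value (assumed normalization)."""
    mean_ref = sum(reference) / len(reference)
    return 100.0 * rmse(reference, predicted) / mean_ref


# Hypothetical V_S values (m/s) per layer across three tests; NOT data from the study.
layers = {
    "0-0.5 m": ([200.0, 210.0, 190.0], [205.0, 204.0, 195.0]),
    "3-3.5 m": ([300.0, 420.0, 250.0], [340.0, 360.0, 290.0]),
}

report = {
    name: (rmse(ref, pred), rmse_percent(ref, pred))
    for name, (ref, pred) in layers.items()
}

for name, (err, pct) in report.items():
    print(f"{name}: RMSE = {err:.2f} m/s ({pct:.1f}% of mean SASW value)")
```

Because the errors are squared before averaging, a few large misfits in a sparsely sampled deep layer raise the RMSE sharply, which is consistent with the jump observed below 3 m.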
R² assesses how strong the linear relationship is between the V_S from conventional SASW and that predicted by the ML algorithms. Figure 19 represents the R² of composite values for the 10 test cases. R² is highest in the 0–0.5 m layer, at 0.98, 0.95, 0.97, and 0.95 for RF, MLP, SVR, and LR, respectively. The lowest values are all found at 2–2.5 m depth: 0.72 for RF, 0.55 for SVR, 0.52 for MLP, and 0.55 for LR. The R² value of LR also decreases with depth, but it increases marginally at 4–4.5 m and then begins to decline again.
The values of mean square error (MSE), RMSE, and R² are given in Table 5.
Figure 20a–l presents the R² of composite values for the 10 test cases between SVR and SASW. In the first layer (0–0.5 m), the R² between SVR and SASW is high. The second layer (0.5–1 m) also reveals a strong association. The association between SVR and SASW is fairly high for the third layer (1–1.5 m) and the fourth layer (1.5–2 m). At a depth of 3–3.5 m, the R² value is 0.85, which is still high despite a few outliers, as seen in Figure 20g. The coefficient of determination increases to 0.92 and 0.93 below 4 m, which indicates a good association between the shear wave velocity obtained from SVR and SASW, as seen in Figure 20i,j. At these layers, there are very few data points, and the distribution of the datasets can produce a greater RMSE than in the other layers.
Although an R² close to 1 indicates a reasonable match, this value alone does not reveal whether the predictions are biased. The performance also depends on the outliers and on how close the regression line lies to the data points.
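The point that R² alone cannot reveal bias can be illustrated with a short sketch. It assumes that R² is taken as the squared Pearson correlation, in line with its description above as a measure of linear relationship; the text does not state the exact formula used. The V_S values and the 50 m/s offset are hypothetical.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def mean_bias(reference, predicted):
    """Average signed error (predicted minus reference)."""
    return sum(p - r for r, p in zip(reference, predicted)) / len(reference)


# Hypothetical SASW values (m/s) and a prediction that is offset by a constant 50 m/s.
sasw = [200.0, 240.0, 260.0, 300.0, 350.0]
biased_prediction = [v + 50.0 for v in sasw]

r2 = pearson_r2(sasw, biased_prediction)
bias = mean_bias(sasw, biased_prediction)

print(f"R^2 = {r2:.3f}, mean bias = {bias:.1f} m/s")
```

A constant offset leaves the correlation untouched, so the prediction scores an R² of essentially 1 while overestimating every value by 50 m/s. Reporting RMSE or mean bias alongside R² exposes this kind of systematic error.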
It can be concluded that SVR performs better than the other algorithms in predicting the shear wave velocity. The other algorithms have high RMSE below 3 m of depth, where some outliers can also be observed. The range of the datasets is broad, but the number of data points obtained from SASW is low at this depth, which may affect the ML models.