5.1. Input Layer Performance
Human eyes distinguish watercourses better in layers such as TPI and sky view than in a DEM. In our study, however, the DEM produced the best results based on f1-scores. One possible explanation, besides information loss when generating the derived layers, is that although these layers highlight watercourses well in many terrain types, they render watercourses in less detail in others. Perhaps surprisingly, combining input datasets did not increase the f1-score. Because most of the other layers are derived from the DEM, the results may indicate that they provide no additional information for training the CNN. Although the orthophoto is not a good input dataset on its own, because trees block the view of small watercourses, watercourses in agricultural fields and along roads are often visible in it. However, orthophotos also did not provide enough additional information to improve the results. Further research is needed to fully understand why adding datasets did not improve the results.
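The layer comparison above rests on the pixel-wise f1-score. As a minimal illustration, the function and toy arrays below are our own sketch of how such a score is computed from binary prediction and label masks, not the study's implementation:

```python
import numpy as np

def f1_score(pred: np.ndarray, label: np.ndarray) -> float:
    """Pixel-wise f1-score for binary masks (1 = watercourse, 0 = background)."""
    tp = np.sum((pred == 1) & (label == 1))  # true positives
    fp = np.sum((pred == 1) & (label == 0))  # false positives
    fn = np.sum((pred == 0) & (label == 1))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy 2x3 masks: 2 TP, 1 FP, 1 FN -> precision = recall = f1 = 2/3
pred = np.array([[1, 1, 0], [0, 1, 0]])
label = np.array([[1, 0, 0], [0, 1, 1]])
print(f"f1 = {f1_score(pred, label):.3f}")  # f1 = 0.667
```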
The relatively large differences in f1-scores between folds indicate that the results were sensitive to the choice of validation area. The study would therefore have benefitted from a larger training and validation dataset, which would have yielded more precise metrics. On the other hand, the scores of the ten models trained with the DEM had small variance, meaning that the random elements of the study, including the initial weights of the network, the random rotation and mirroring augmentations, and the random locations from which the training data were sampled from the full training dataset, did not cause major variance in the scores.
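The two sources of variance discussed above can be compared directly by computing the spread of scores over folds versus over repeated trainings. The values below are illustrative placeholders, not the study's actual scores:

```python
from statistics import mean, stdev

# Hypothetical f1-scores (illustrative values only):
# five cross-validation folds over different geographic areas, versus
# ten repeated trainings where only random elements (initial weights,
# augmentation, training-sample locations) vary.
fold_scores = [0.62, 0.71, 0.55, 0.68, 0.64]
repeat_scores = [0.640, 0.645, 0.638, 0.642, 0.641,
                 0.644, 0.639, 0.643, 0.640, 0.642]

print(f"between-fold:   mean={mean(fold_scores):.3f} sd={stdev(fold_scores):.4f}")
print(f"between-repeat: mean={mean(repeat_scores):.3f} sd={stdev(repeat_scores):.4f}")
```

A much larger standard deviation between folds than between repeats is what supports attributing the fold differences to geography rather than to training randomness.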
5.2. Causes of False Predictions
Almost a third of false-negative predictions (FNPs) were caused by the offset between the labeled data and the predictions. In vector data derived from the predictions, these false predictions cause a positional offset (assuming the center line is correct in the labeled data) rather than a missing feature or feature section. Because the displacement is at most one meter, it affects few, if any, use cases of small-watercourse data. The results also show that relaxed recall removes a higher percentage of FNPs for the higher CCs. Although not surprising, this highlights that relaxed recall may explain a larger percentage of FNPs when only clearer features are semantically segmented. The results indicate that the number of FN predictions under regular recall that become TP under relaxed recall increases as the features become less clear. This may be because partially predicted watercourse features become more complete not only across the width of the watercourse but also along its length; the more watercourses are predicted incompletely along their length, the more such predictions relaxed recall recovers.
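The distinction between regular and relaxed recall can be sketched as follows, assuming, for illustration, that relaxed recall accepts a predicted pixel within a one-pixel neighborhood of a labeled pixel (the study's exact tolerance and implementation may differ):

```python
import numpy as np

def dilate(mask: np.ndarray) -> np.ndarray:
    """Binary dilation with a 3x3 structuring element (pure NumPy)."""
    padded = np.pad(mask, 1)
    out = np.zeros_like(mask)
    h, w = mask.shape
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out |= padded[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
    return out

def recall(pred: np.ndarray, label: np.ndarray, relaxed: bool = False) -> float:
    """Regular recall, or relaxed recall tolerating a one-pixel offset."""
    hits = dilate(pred) if relaxed else pred
    tp = np.sum((hits == 1) & (label == 1))
    fn = np.sum(label == 1) - tp
    return tp / (tp + fn)

# A prediction offset by one pixel: regular recall 0.0, relaxed recall 1.0
pred = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
label = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 0]])
print(recall(pred, label), recall(pred, label, relaxed=True))
```

The toy example mirrors the offset case described above: the watercourse is found, only displaced, so relaxed recall counts it as correct.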
Of the remaining FNPs under relaxed recall, roughly seven out of eight are caused by CC4 and CC5, which contain a lot of uncertainty. The remaining FNPs in CC1–CC3 account for only 3.88% of the total watercourse pixels in the labeled data. The visual analysis showed that some of these prediction errors can be explained by the CC being overestimated during digitizing, or by small unclear sections in otherwise clear watercourses. Improving the architecture or optimizing hyperparameters might therefore not significantly increase the quantitative results. On the other hand, the metrics would likely benefit from increasing the quantity and quality of the labeled data. Visual inspection showed that many of the unclear features can be explained by an uneven point-cloud distribution that leaves some areas with sparsely distributed points; filling these gaps in the point cloud would likely solve the issue. Roadside ditches were common among both FP and FN results. Interpreting roadside ditches can be difficult because there are typically elevation changes on both sides of the road: when observed in layers such as TPI and relief shading, these changes can be misinterpreted as ditches, and actual ditches can be misinterpreted as other features. FNPs were also relatively common when the watercourse was wide but shallow, in contrast to the most typical watercourse type in the area, narrow ditches. In such cases, modifying the CNN method to account for the imbalance in feature types and/or increasing the quantity of training data, so that the model can train more on such features, could improve their prediction.
Most FPPs were caused by narrow, hard-to-see watercourses that were not digitized into the ground-truth data, suggesting that the completeness of the digitized watercourse dataset could be improved; doing so would likely improve precision. Broersen et al. [2] noted a similar finding in their study, stating that roughly half of their false-positive findings were in fact watercourses. Quality improvements could also be made to the already digitized watercourses by correcting the CCs of features and by allowing sections of a watercourse to belong to different classes. Nevertheless, because the visual assessment did not identify cases of FNPs caused by inaccurate digitizing of the ground-truth data, the results suggest that the original dataset is mostly accurate in terms of position, and quality improvements to the data may therefore not significantly improve the predictions for the higher CCs. Increasing the quantity of data could potentially improve the results for some feature types, such as wide watercourse sections or natural streams. Additional tests could be conducted to determine how much training data is enough for optimal results. Stanislawski et al. [5] used varying amounts of their 4600 km² of training data (at 5 m resolution) for segmenting watercourses with U-Net, starting from 3% and increasing up to 35% of the available data. They found that increasing the amount of data beyond 15% did not improve the model metrics. Although their results cannot be transferred directly to our study, due to differences in resolution and in the density of watercourses between the datasets, the training data in our study exceeds, in terms of pixels, 15% of their data, while the density of watercourses in our study area was high.
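As a rough check on the pixel comparison above, the pixel count of 15% of 4600 km² at 5 m resolution can be computed directly (a back-of-the-envelope calculation, not a figure from either study):

```python
# 15% of 4600 km^2 of training data at 5 m resolution
area_m2 = 4600 * 0.15 * 1e6    # 690 km^2 expressed in square meters
pixels = area_m2 / (5 * 5)     # 25 m^2 per pixel at 5 m resolution
print(f"{pixels:.3e} pixels")  # 2.760e+07
```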
The results showed that excluding CC5 watercourses from the labeled dataset improved the f1-score but reduced the recall of the remaining CCs, compared to training with CC1–CC5 features in the labeled dataset. This means that to achieve optimal completeness for the clearer features, the less clear ones need to be included in the dataset. Completeness is often important in the automatic extraction of watercourses because it ensures that the extracted features are continuous and that the hydrological network is depicted correctly. Based on the results, the typical machine-learning metrics, f1-score, recall, and precision, may not validate the result accurately; appropriate validation depends both on the requirements of the final watercourse dataset derived from the predictions and on the post-processing steps. For accurate validation, the accuracy and completeness of such a final watercourse dataset need to be assessed. Watercourses are one part of a hydrological network that also includes other features, for example, underground watercourses, culverts, ponds, and lakes. Automatic methods have been developed, for example, for detecting culverts from DEMs [36] and for detecting ponds [37]. As these methods develop and mature, they need to be combined into a fully automated workflow that produces an optimal final hydrological network.
Multiple factors that were not accounted for in the study could have impacted both the comparison of input datasets and the analysis of the causes of false predictions. The quality of the digitized training dataset was shown to have caused false predictions, which means it could also have affected the results of the input data comparison. For example, errors of omission by the digitizer would falsely increase the f1-score of layers in which the missing watercourses are not found and decrease the score of layers in which they are found. The study did not consider alternative loss functions or optimizer algorithms, nor the resolution of the input and training data. Different loss functions have been shown to give different results, for example, in road segmentation from remotely sensed data with CNNs, and the function should be selected based on the use case and dataset [38]. Śliwiński et al. [39] found that when delineating watercourses from DEMs of different resolutions, moving from a 1 m to a 1.5 m resolution DEM resulted in only a small decrease in the lengths of the watercourse lines, which indicates that the 0.5 m resolution of the DEM used in our study is sufficient for capturing most watercourses.