*4.2. CNN for Semantic Segmentation*

Semantic segmentation is the process of assigning a label to each pixel in an image such that pixels with the same label share some visual or semantic property [79]. To carry out semantic segmentation, spatial information needs to be retained; hence no fully connected layers are used, which is why such networks are called fully convolutional networks.

#### 4.2.1. Moving from a Fully Connected to a Fully Convolutional Network

This is where all fully connected layers are converted into 1 × 1 convolutional layers. In the case of labelling, the output is a 1D vector giving the probabilities of the input belonging to each of n classes. In the case of segmentation, the output layer is a group of 2D probability maps of each pixel belonging to each class, known as score maps. The score maps are coarse because, throughout the network, the information (image) has been down-sampled to absorb minute details. Therefore, to make the output size-compatible with the input, up-sampling is required.

Up-sampling can be done using bilinear interpolation, cubic interpolation, or similar techniques. One way of up-sampling is via skip-connections (shortcut connections): the feature maps output from the max-pooling layers are up-sampled using bilinear interpolation and added to the output score maps. The method works well but requires some amount of learning to up-sample the score maps and feature maps to match the size of the input image. To minimise the amount of learning, another method, the encoder-decoder, is widely used. Here, the layers which down-sample the input form the encoder, and the layers which up-sample form the decoder. Three key fully convolutional models, SegNet [80], UNet [81], and Pyramid Scene Parsing Network (PSPNet) [82], are used in this study. A brief description of each model is given in the following subsections.
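As a concrete illustration of the bilinear up-sampling step described above, the following is a minimal NumPy sketch (not the paper's implementation) that doubles the resolution of a 2D score map by mapping each output pixel back into the input grid and interpolating between its four neighbours:

```python
import numpy as np

def bilinear_upsample(score_map, factor=2):
    """Up-sample a 2D score map by `factor` using bilinear interpolation."""
    h, w = score_map.shape
    new_h, new_w = h * factor, w * factor
    # Coordinates of the output pixels mapped back into input space
    ys = np.linspace(0, h - 1, new_h)
    xs = np.linspace(0, w - 1, new_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :]   # horizontal interpolation weights
    top = score_map[np.ix_(y0, x0)] * (1 - wx) + score_map[np.ix_(y0, x1)] * wx
    bot = score_map[np.ix_(y1, x0)] * (1 - wx) + score_map[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy
```

A 2 × 2 score map becomes 4 × 4, with corner values preserved and intermediate pixels blended smoothly, which is exactly the property that makes bilinear interpolation the default choice for restoring coarse score maps.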

#### 4.2.2. SegNet Model

SegNet uses an encoder-decoder architecture, followed by a pixel-wise classification layer for multiple classes. The encoder extracts the most relevant features from the given input, and the decoder uses the information from the encoder to up-sample the output (Figure 6).

**Figure 6.** SegNet architecture for bog-ecotope semantic segmentation.

The up-sampling technique used by the decoder is known as max-unpooling. Max-unpooling eliminates the need to learn to up-sample (as was required with skip-connections), as shown in Figure 7. The max-pooled values are placed back at the locations of the original maxima, and the remainder of the matrix is filled with zeros. Convolution with this layer can be done using any of the CNN models discussed in Section 4.1.
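The pooling/unpooling pair described above can be sketched in a few lines of NumPy (an illustrative toy, not SegNet's actual implementation): the pooling pass records where each maximum came from, and unpooling scatters the pooled values back to those positions, leaving zeros everywhere else.

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max-pooling that also records the flat index of each maximum."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2))
    idx = np.zeros((h // 2, w // 2), dtype=int)
    for i in range(h // 2):
        for j in range(w // 2):
            window = x[2*i:2*i+2, 2*j:2*j+2]
            k = np.argmax(window)
            pooled[i, j] = window.flat[k]
            # Convert the 2x2-window-local argmax to a flat index in x
            idx[i, j] = (2*i + k // 2) * w + (2*j + k % 2)
    return pooled, idx

def max_unpool(pooled, idx, shape):
    """Place pooled values back at their recorded positions; zeros elsewhere."""
    out = np.zeros(shape)
    out.flat[idx.ravel()] = pooled.ravel()
    return out
```

Because only the stored indices are needed to invert the pooling, no weights have to be learned for this up-sampling step, which is the efficiency SegNet exploits.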

*Remote Sens.* **2020**, *12*, x FOR PEER REVIEW 12 of 26


**Figure 7.** Pooling and unpooling for semantic segmentation.

#### 4.2.3. UNet Model

The UNet network carries out transpose convolution (encoder-decoder) and also uses skip connections (Figure 8).

**Figure 8.** UNet architecture for bog-ecotope semantic segmentation.

At every layer on the decoder side, the network finds a corresponding feature map (of the same size) from the encoder and adds it (via 1 × 1 convolution) to the score map. This way, the sizes of the feature maps are always in sync. Due to its architecture and depth, UNet is most widely used in biomedical image analysis.
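The merge step described above can be sketched as follows (a simplified NumPy illustration, not UNet's actual code): the encoder and decoder feature maps are combined along the channel axis, and a 1 × 1 convolution, which is simply a per-pixel linear map over channels, mixes them back down to the desired channel count.

```python
import numpy as np

def skip_connect(encoder_feat, decoder_feat, weights):
    """UNet-style skip connection: join encoder and decoder feature maps
    (channels, height, width) along the channel axis, then apply a 1x1
    convolution given as a (out_channels, in_channels) weight matrix."""
    merged = np.concatenate([encoder_feat, decoder_feat], axis=0)
    # 1x1 convolution == linear map over the channel dimension at each pixel
    return np.einsum('oc,chw->ohw', weights, merged)
```

Because both inputs must share the same spatial size, the skip connection forces the decoder to stay "in sync" with the encoder at every depth, which is the property the paragraph above describes.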

#### 4.2.4. PSPNet Model

PSPNet stands for Pyramid Scene Parsing Network. This network incorporates scene and global features for scene parsing and semantic segmentation, as shown in Figure 9.

The pyramid pooling module in PSPNet fuses the features at four scales: coarse (1 × 1), 2 × 2, 3 × 3, and 6 × 6. The up-sampling used is bilinear interpolation, and all the features are concatenated as the final pyramid pooling global feature [82]. The spatial pyramid pooling technique, as used in SPPNet [83], eliminates the need for an input image of a specific size.
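The pyramid pooling idea can be sketched as below. This is a simplified NumPy toy, not PSPNet itself: each scale is produced by adaptive average pooling, and nearest-neighbour up-sampling (`np.kron`) stands in for PSPNet's bilinear up-sampling and per-level convolutions for brevity.

```python
import numpy as np

def adaptive_avg_pool(feat, bins):
    """Average a square 2D feature map into a bins x bins grid."""
    h = feat.shape[0]
    edges = np.linspace(0, h, bins + 1).astype(int)
    return np.array([[feat[edges[i]:edges[i+1], edges[j]:edges[j+1]].mean()
                      for j in range(bins)] for i in range(bins)])

def pyramid_pool(feat, scales=(1, 2, 3, 6)):
    """Pool at each scale, up-sample back to full size, and concatenate
    with the original map to form the global pyramid feature."""
    h = feat.shape[0]
    levels = [feat]
    for s in scales:
        pooled = adaptive_avg_pool(feat, s)
        # Nearest-neighbour up-sample back to h x h (bilinear in real PSPNet)
        up = np.kron(pooled, np.ones((h // s, h // s)))
        levels.append(up)
    return np.stack(levels)  # (1 + len(scales), h, h)
```

The 1 × 1 level reduces the whole map to its global mean, so every output pixel carries scene-level context alongside the original local features.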

**Figure 9.** Pyramid Scene Parsing Network (PSPNet) architecture for bog-ecotope semantic segmentation.

#### *4.3. Methodology for the Comparison between CNN Models for the Case Study on Raised Bog Drone Images*

Using the drone images captured on 21st April 2019, semantic segmentation using various CNN architectures was applied to identify and label the ecotopes present on Clara Bog. The entire computation was performed in Python v3.7 [84] using a GPU (NVIDIA Tesla K40C 12GB CUDA), accessed remotely from the Trinity College high-performance computer (TCHPC), and partly on a Google virtual machine (Tesla K40C 12GB). The study uses the repository in [85].

#### 4.3.1. Training Data Preparation

In order to smoothly run the semantic segmentation, the preparation of the training data was done as follows:


Steps (2–5) were repeated for all 40 images containing all four ecotope classes mentioned in step 1. The final training data consisted of 40 images (both RGB and labelled) of size 2<sup>9</sup> × 2<sup>10</sup>, which were fed to the CNN models described in the next subsection. Testing was carried out on the remaining 20 images.

#### 4.3.2. Models Used for Semantic Segmentation

The models were created using a base network (tested on ImageNet) along with a segmentation architecture. Since CNNs take a considerable amount of time to train, only the most frequently used and tested models (in the literature) were compared. The optimisation algorithm used was SGD Adam with initial LR = 0.05 and L2 regularisation. A high LR was used initially, as it is reduced over the epochs by a factor of 10. The maximum number of epochs was 100, images were shuffled at every epoch, and the mini-batch size was 64. The loss between the labels given by the model and the actual (training) labels at every epoch was calculated using the cross-entropy loss described in Equation (9). Cross-entropy loss is commonly applied in classification applications, whereas a loss like half mean square error is more common for regression tasks; therefore, cross-entropy loss was used here.

$$cross\ entropy = -\frac{1}{N} \sum\_{i=1}^{N} \sum\_{j=1}^{n} \left( \hat{x}\_{i,j} \log x\_{i,j} + \left(1 - \hat{x}\_{i,j} \right) \log \left(1 - x\_{i,j} \right) \right) \tag{9}$$

where *N* is the total number of pixels, *n* is the total number of classes, *x* is the training label (input), and *x*ˆ is the output label as predicted by the models. Instead of training the network from scratch, one of the most common techniques is to use a pre-trained network. The idea is to transfer the information learned by the network and then fine-tune and train the classification layer of the model for our specific task. In this manner, given that the weights are already pre-trained on a large dataset, performance is much improved even with a small dataset. Pre-trained weights also speed up the convergence process (reaching a local minimum, i.e., minimising the overall loss) and are considered better than random initialisation. For the four models listed below, the ImageNet dataset [86] was used to initialise the weights. Other details are given in [85]. The architecture of these models is shown in Figure 10.

1. VGG16 base model with SegNet architecture.
2. ResNet50 base model with SegNet.
3. VGG16 with UNet.
4. ResNet50 with UNet.
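A minimal NumPy sketch of the per-pixel cross-entropy of Equation (9) follows (written with the one-hot labels multiplying the log-probabilities, the usual convention; this is an illustration, not the study's training code):

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Binary cross-entropy per Equation (9), averaged over the N pixels.
    probs: predicted class probabilities, labels: one-hot ground truth;
    both of shape (N, n). eps guards against log(0)."""
    probs = np.clip(probs, eps, 1 - eps)
    n_pixels = labels.shape[0]
    return -np.sum(labels * np.log(probs)
                   + (1 - labels) * np.log(1 - probs)) / n_pixels
```

A perfectly confident correct prediction gives a loss near zero, while a uniform 0.5/0.5 prediction on a two-class pixel gives 2 ln 2, which is the gradient signal the training loop in Section 4.3.2 minimises.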

**Figure 10.** CNN Network architecture. (**a**) VGG16 + SegNet, (**b**) ResNet50 + SegNet, (**c**) VGG16 + UNet, (**d**) ResNet50 + UNet.


Figure 10a,b represents the SegNet architecture with VGG16 and ResNet50 as the base model, respectively. The left-hand side is the encoder, which has five blocks, and the layers are from the original VGG16 and ResNet models. The max-pooling operation is depicted by the red arrows; this operation reduces the image dimensions by 2 × 2. The unpooling is depicted by the green arrows on the right-hand side of the figure(s); this operation restores the image to its original size, so the output image has the same spatial dimensions as the input image. Figure 10c,d represents the UNet architecture with the VGG16 and ResNet50 models, respectively. The network uses the original layers from VGG16 and ResNet50 with the UNet architecture; a clear U-connection can be seen in the figure. Skip connections were used, and upsampling was performed to restore the image dimensions. A concatenation operation was applied to implement the skip connections, combining them with the corresponding feature map (image). The unpooling, skip connections, and upsampling functions were used to ensure that the size of the output image is the same as that of the input image mentioned in Section 4.3.1.

For the specific task of semantic segmentation, dedicated segmentation datasets were also used for initialising the weights. For PspNet, the pre-training was done using the ADE20K dataset [87] and the Cityscapes dataset [88]. The ADE20K dataset has 21,200 images of various day-to-day scenes. The Cityscapes dataset contains images taken from video frames (≈20,000 coarse images) from 50 cities in the spring, summer, and fall seasons. The models used are listed below, and the layers and architecture are described in Figure 9.


#### **5. Results**

Figure 11 depicts the segmentation results from both ML and DL techniques for a drone image (sized 512 × 1024) taken of Clara Bog. The segmentation was carried out for the four ecotope classes present in the drone image captured in the spring season. The accuracy and results are discussed further in this section.

**Figure 11.** Segmentation results. (**a**) Drone image, (**b**) ground truth labelled image, (**c**) machine learning (ML) (random forest (RF) + Graph cut) segmentation using RGB features, (**d**) ML (RF + Graph cut) segmentation using RGB and textural features, (**e**) deep learning (DL) semantic segmentation using SegNet and VGG 16 model, (**f**) DL semantic segmentation using SegNet and ResNet50 model, (**g**) DL semantic segmentation using UNet with VGG16 model, (**h**) DL semantic segmentation using UNet with ResNet50 model, (**i**) DL semantic segmentation using PSPNet (Cityscapes), (**j**) DL semantic segmentation using PSPNet (ADE 20k).

#### *5.1. Machine Learning*

As discussed in Section 3, the ML classifiers were tested for model accuracy (5-fold validation), misclassification cost, and training time. Table 2 depicts the metrics calculated over the entire 70% training data (as discussed in Section 3).

RF was chosen as the best performing classifier, and further segmentation using the Graph-cut algorithm was performed using the results from RF. The segmentation is a post-classification, area-based smoothing process. The final segmented image was checked against a fully manually labelled image to give the overall accuracy (*OA*). The *OA* is the ratio of true positives (TP) to the total number of pixels (Equation (10)).

$$OA = \frac{TP}{TP + FP + FN + TN} \tag{10}$$

where *TP* = true positives, *FP* = false positives, *FN* = false negatives, and *TN* = true negatives. This was done for visible bands (RGB) and RGB + textural features. For a proper comparison between ML and DL, the image was resampled from its original size (3000 × 4000) to a smaller scale (512 × 1024). The resampling of the image was done using bilinear interpolation [89]. Table 3 depicts the accuracies obtained by using a random forest classifier along with graph cut segmentation for both sizes of the image. Since the image used in segmentation with the DL techniques is also resized, for an accurate comparison, the resized image (512 × 1024) was used in the further analysis.
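Since a pixel is counted correct exactly when its predicted label matches the ground truth, Equation (10) reduces to correct pixels over total pixels. A minimal sketch (illustrative, not the study's code):

```python
import numpy as np

def overall_accuracy(pred, truth):
    """OA per Equation (10): correctly labelled (true positive) pixels
    divided by the total number of pixels."""
    return float(np.mean(pred == truth))
```

Applied to a full 512 × 1024 label map, this single comparison produces the OA values reported in Table 3.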

**Table 3.** Segmentation accuracies using ML.

| Image Size | RGB Features | RGB + Textural Features |
|---|---|---|
| Original size (3000 × 4000) | 83.3 | 85.1 |
| Resized (512 × 1024) | 82.9 | 84.8 |
As can be seen from Figure 11c,d, there is not much difference between the segmentations using RGB with or without textural features. Textural features do add extra information and are known to be highly useful when there is terrain variation in the scene. However, in this application, where the ecotopes under consideration are low-lying, homogeneous communities, the addition of textural features did not improve accuracy very significantly, with the OA increasing by only approximately 2%.

#### *5.2. Deep Learning*

The semantic segmentation using CNNs was performed for 100 epochs. The LR was decreased by a factor of 10 each time a model's accuracy saturated. The OA of all the models, calculated on the testing data, is shown in Figure 12.

There is an average jump of ≈32% in OA from the first to the last epoch, with the PspNet model and ResNet50 + SegNet showing the maximum increases in OA (≈30% and 25%, respectively). The cross-entropy loss decreased by an average of ≈28% for the CNN models under consideration; this decrease comes from reducing the LR. Although the accuracies are high, a detailed analysis of per-class accuracy is required to make an informed decision about the best CNN architecture for segmentation in this particular application, the identification of raised bog vegetation ecotopes. The per-class analysis is done to make sure there is no overfitting: as seen in Figure 11i, a model can overfit, giving sufficient accuracy but incorrect classification.
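The "decrease the LR by a factor of 10 when accuracy saturates" policy can be sketched as a simple reduce-on-plateau rule. The `patience` and `min_delta` values below are illustrative assumptions, not parameters reported in the study:

```python
def reduce_lr_on_plateau(lr, oa_history, factor=0.1, patience=5, min_delta=1e-3):
    """Drop the learning rate by `factor` once the overall accuracy has not
    improved by at least `min_delta` for `patience` epochs."""
    if len(oa_history) <= patience:
        return lr
    best_before = max(oa_history[:-patience])
    recent_best = max(oa_history[-patience:])
    if recent_best < best_before + min_delta:
        return lr * factor   # accuracy has plateaued: decay the LR
    return lr                # still improving: keep the current LR
```

Starting from the study's initial LR of 0.05, a flat accuracy curve triggers a drop to 0.005, which is what produces the faster convergence noted in Section 6.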


**Figure 12.** Overall accuracy over 100 epochs used for all CNN architectures for semantic segmentation of ecotopes in Clara Bog.

Table 4 describes the confusion matrix for every community for both the ML and DL algorithms, which is discussed further in Section 6. Other accuracy parameters, such as precision, recall, and F1-score, were also calculated for every community (ecotope) under consideration. Equations (11)–(13) give the formulas for these accuracy parameters.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{11}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{12}$$

$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{13}$$
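Equations (11)–(13) can be computed per class directly from the predicted and ground-truth label maps (an illustrative sketch, not the study's evaluation code):

```python
import numpy as np

def precision_recall_f1(pred, truth, cls):
    """Per-class precision, recall, and F1-score (Equations (11)-(13))."""
    tp = np.sum((pred == cls) & (truth == cls))  # correctly claimed pixels
    fp = np.sum((pred == cls) & (truth != cls))  # wrongly claimed pixels
    fn = np.sum((pred != cls) & (truth == cls))  # missed pixels
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Running this once per ecotope (SMSC, C, AF, M) yields the per-community figures summarised in Table 4.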


**Table 4.** Confusion matrix per community per segmentation model (ML and DL).

#### **6. Discussion**

The study describes methods to map vegetation communities in a raised bog, Clara Bog, located in Ireland, using drone images from a DJI Inspire 1™ drone captured during the spring season. The size of the images was 3000 × 4000, and 40 images were used for training. Furthermore, both the ML and DL algorithms were tested on the remaining 20 images. The study shows that high-resolution (1.8 cm) RGB images are adequate for mapping vegetation communities. However, a key challenge associated with RGB images is the change in intensity due to sunlight conditions, particularly in a temperate climate like Ireland, where sunlight levels are rarely constant for long. Therefore, in this study, all the images with significantly different light conditions were removed. The use of a colour correction technique could be a possible solution to this problem, which is a domain yet to be explored. Similarly, the addition of textural properties does create the challenge of increased computation (time and complexity): the segmentation is done using 13 features instead of three, thereby being more computationally expensive.

Initially, a comparative analysis of the state-of-the-art classifiers was performed (Table 2). It was seen that the RF ensemble classifier outperformed the other classifiers. The RF classifier uses bootstrapping to form multiple trees, leading to a reduced possibility of overfitting the data. The SVM classifier with RBF kernel had similar accuracy and misclassification cost to RF, but with twice the training time. Hence, RF was deemed the best choice for drone image classification, with a model accuracy of 92%. As pixel-based segmentation often fails to take contextual (area-based) information into account, graph cut segmentation was subsequently applied to form segments based on area. Out of the 40 training images, only a portion of the labelled pixels (≈12k) was input to the ML model. The entire processing time of the ML segmentation was ≈30 min.

This was done for both RGB and RGB + textural images. The images were resampled to 2<sup>9</sup> × 2<sup>10</sup> for proper comparison with the deep learning algorithms (discussed later). It has to be noted that the aspect ratio of the imagery was maintained while resampling, mainly to keep the textural properties intact. The authors of [37] explain that in order to capture textural properties, the size of the image/sliding window should be carefully chosen; therefore, a decrease in the size of the image (or a change in aspect ratio) can lead to a change in textural properties. Table 3 shows that the resampling using bilinear interpolation did not make a big difference to the OA. The resampled image with textural properties performs comparably to the original image. The OA with textural properties is also comparable to the OA with just RGB for this application, with its low-lying, homogeneous area of interest. Overall, the textural properties give the best segmentation with both the original-sized image and the rescaled image.

From Figure 11c,d, it can be seen that using textural properties, ecotopes like SMSC and AF are differentiated better (see Table 4). Likewise, from Table 4, it can be seen that the TP for the C ecotope increases with the addition of textural properties but decreases for the AF ecotope. The decrease in misclassified pixels (FP, FN) between SMSC and AF has led to an increase in precision and recall for the SMSC ecotope. There is a definite increase in accuracy for the C and AF ecotopes from using textural features, whereas the SMSC and M ecotopes are identified with similar precision, recall, and F1-score values for both images.

The deep learning technique used for segmentation is semantic segmentation using CNN models. In this study, six different models were tested for the semantic segmentation to identify the different bog ecotopes. The training data for the CNN models consisted of 40 images containing all the ecotopes in different orientations and lighting (brightness). The size of the training data is a notable factor in this study, as for many applications, hundreds of images are usually required to train such CNN models. This study demonstrates the use of a minimal number of labelled training images for attaining the segmentation: 40 images seemed to be sufficient for this application, as the weights were initialised using the ImageNet dataset with its 1000 different classes. This reduces the dependence on an extensive training dataset and is also faster [90]. All 40 images were resized to 2<sup>9</sup> × 2<sup>10</sup> for efficiently performing semantic segmentation. For an application involving a prominent area such as this, the classes are also sparsely located; therefore, cropping or extracting patches from the images led to a reduction in the classes (ecotopes) covered in an image. To make sure the model identifies all the ecotopes, the images were resized. Nevertheless, for an application where the classes are located close together (spatially), cropping/extracting patches can be a viable option.

The algorithms were run for 100 epochs, after which the accuracy had saturated. The computation time was ≈700 min per model for 100 epochs. It was decided not to increase the number of epochs, as this may lead to overfitting of the model [91]. The LR was decreased with the epochs when the OA saturated; this decrease leads to faster convergence and an increase in accuracy. If the same LR were continued without decreasing it, one may still obtain high accuracy, but it would require a massive number of epochs and is therefore not recommended. There is an apparent increase in accuracy using DL methods when compared to ML methods. At the end of the epochs, it is clear that the SegNet and UNet architectures with ResNet50 yield the best results for the semantic segmentation of bog ecotopes. In comparison, the VGG16 base model led to the over-classification of ecotopes such as M and AF. The VGG model has been shown to be effective when there is noise in the data but does not perform well when the brightness of the images changes [92]. This explains the low accuracy of the model, as the images had different lighting due to variable weather conditions. Figure 11e–j depicts the DL segmentation results. It can be seen that the segmentation using SegNet and UNet is similar for ecotopes like SMSC and C, but differs for the AF and M ecotopes.

The study also demonstrates the use of transfer learning by using a segmentation-specific pre-trained PspNet model. This model was pre-trained using the ADE20K and Cityscapes image sets instead of the widely applicable ImageNet. In our application, the use of these segmentation datasets was not successful, as the weights were calculated for the specific task of segmenting areas of traffic, cars, houses, pavements, etc. Additionally, due to the uniqueness of these communities, the weights transferred from the pre-trained models were not accurate. In order to use transfer learning, the model selected should be pre-trained on categories similar to those of the application.

For making the final decision on the best CNN architecture, the accuracy parameters for every ecotope were considered. Table 4 shows that the SMSC ecotope is identified quite well by all the CNN models, with the exception of the PspNet model pre-trained with Cityscapes images. Using the base model ResNet50, the ecologically important, peat-forming communities (the SMSC and C ecotopes) are better identified using SegNet than UNet. Using PspNet (ADE20K), the C ecotope was identified best, although the OA of the model is low. Therefore, taking into consideration the OA, precision, recall, and F1-score of all the communities, the SegNet architecture with the ResNet50 base model appears to be the best choice for drone image segmentation in relation to identifying raised bog vegetation types.

The best OA recorded from ML was 85%, and from DL 91%. However, the most appropriate technique for this study was not decided purely on the basis of OA; when applying a technique to new applications, other parameters cannot be ignored. For example, far more training data was required for DL than for ML. Similarly, time and hardware also play a significant role in deciding the best technique. Table 5 summarises the essential pros and cons of the two techniques.

Both techniques clearly have pros and cons, as described in this study. The main idea behind using remote sensing techniques is to reduce the amount of manual fieldwork required for monitoring wetlands, which includes minimising the training data given as input to the classifiers. Additionally, the aim is to automate the process in the simplest way possible and to optimise speed, given that the availability of high-performance computers or GPUs cannot always be guaranteed. Keeping in mind these requisites, ML is the clear choice for our application. DL techniques can be used once enough labelled data has been created from all the wetlands such that all the species are covered; however, for a new wetland containing new species, a clear indication of the species (with full coverage) is required. Therefore, DL is more advantageous for global or broadly applied mapping tasks, whereas for a specific application such as this, where not enough training data is available, ML can produce accuracies almost comparable to DL.


**Table 5.** Pros and cons of ML vs. DL for mapping ecotopes, Clara bog.

Finally, drone images are very high resolution and can be captured at any given time. However, although practical, drone image capture does have certain limitations. The first is battery life: for example, the battery of a drone such as the DJI Inspire 1 lasts only approximately 15 min on average, so covering a large area requires many batteries. Therefore, in the future, to make the process cost-effective, drone images could be used in conjunction with freely available satellite images. Satellite images give excellent coverage and good temporal resolution, meaning that using drones and satellites together should be advantageous.

#### **7. Conclusions**

This study aimed to map vegetation in wetlands using image segmentation. For this, ML and DL algorithms were compared by applying them to a set of drone images of Clara Bog, a raised bog located in the middle of Ireland. The images were captured using a DJI Inspire 1™ drone (RGB sensor), with the open-source and freely available Pix4DCapture application. Using ML, six different state-of-the-art pixel-based classifiers were compared, of which the best for the given dataset was shown to be the RF classifier (model accuracy = 92.9%). For ML image segmentation, the RF classifier was used with maximum a-posteriori graph-cut segmentation. Accuracy improved by ≈2% after the addition of textural features (OA = 85.1%) compared with the original RGB image (83.3%), and ecologically important communities such as the central ecotope were mapped better.

Apart from ML, the study also describes image segmentation, or 'semantic segmentation', using DL methods. A detailed account of the selection of variables for the DL segmentation was presented. The study covered a combination of six architectures and base models. For the given dataset, the ResNet50 base model with both the UNet (OA = 91.5%) and SegNet (OA = 89.9%) architectures performs very well. The ResNet50+SegNet model was deemed the best, as it better identified complex vegetation communities such as the SMSC ecotope. All the models were run for 100 epochs, and the accuracy was seen to saturate after ≈35 epochs. For mapping ecotopes in a raised bog, transferring initial weights from the broad ImageNet dataset outperformed segmentation-specific datasets such as ADE20K or Cityscapes.

Overall, the accuracy of DL was ≈4% higher than that of the ML methods. Additionally, the DL method does not require any colour correction or the addition of extra textural features. However, DL requires a large amount of initial labelled training data (≈48 × 10<sup>6</sup> pixels). On the other hand, the ML algorithm requires much less training data (≈12,000 labelled pixels) and is much faster (≈30 times) than the CNNs. Therefore, in retrospect, for a specific application such as wetland mapping, the ML approach was considered more suitable. This would be particularly useful for any un-surveyed wetland, where a minimum amount of information on the vegetation communities is required to produce accurate maps.

**Author Contributions:** Conceptualization, S.B. and B.G.; methodology, S.B.; software, S.B.; validation, S.B., B.G. and L.G.; formal analysis, S.B.; investigation, S.B., B.G.; resources, S.B., L.G.; data curation, S.B.; writing—original draft preparation, S.B.; writing—review and editing, S.B., L.G., B.G.; visualisation, B.G.; supervision, B.G.; project administration, L.G.; funding acquisition, L.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study is funded by the Environmental Protection Agency of Ireland (EPAIE) (Grant No.: 2016-W-LS-13).

**Acknowledgments:** The authors would like to thank Jonathan Munro (University of Bristol) and Yunis Ahmad Lone for their advice, and Trinity College High Power Computing (TCHPC) department for providing the necessary hardware and support.

**Conflicts of Interest:** The authors declare no conflict of interest.
