#### **1. Introduction**

The use of drones for different types of vegetation classification has increased manyfold over the last decade, driven by the technological development of affordable and lightweight drones. With drones, a very high and flexible spatial resolution can be achieved, which is not possible with satellite imagery due to fixed satellite orbits. Satellite data are available from both open-access and commercial platforms. Some of the most popular open-access satellites include the Sentinel and Landsat series. These satellites provide global coverage but lack high spatial resolution, with the best available resolution being 10 m from Sentinel-2 (S2). S2 imagery has been widely used for classification; for example, a study carried out by [1] used S2 imagery for temporal mapping of wetland vegetation communities. However, one conclusion from that study was that accuracy decreases for smaller wetlands. In many cases, in Ireland at least, the area of wetlands can be relatively small, so satellite-based classification is not sensitive enough and can produce large errors. One of the significant problems is pixel mixing: when the pixel size is 10 m, for example, each pixel can contain a combination of species. This affects the overall reflectance value of the pixel, and hence a good boundary or extent of the species cannot be achieved. There are several ways to reduce the error in satellite images, but most of them require extensive hyperspectral bands. Another method for detailed monitoring of small areas is to use unmanned aerial vehicles (UAVs), more commonly known as drones.

Most drones typically carry optical (RGB) cameras and occasionally a thermal sensor, while some can also carry more expensive hyperspectral cameras. The presence of a thermal/hyperspectral sensor allows more detail to be gathered and improves spectral resolution. However, the dilemma of spectral versus spatial resolution remains unresolved. Missions like the Airborne Visible InfraRed Imaging Spectrometer (AVIRIS) and the hyperspectral satellite Hyperion provide high spectral resolution. A study [2] used AVIRIS hyperspectral data with 224 bands and 20 m spatial resolution to detect an invasive plant species (*Colubrina asiatica*) in Florida. Another study [3] used Hyperion (30 m) hyperspectral data to detect *Phragmites australis* in coastal wetlands and states that, due to the low spatial resolution, the analysis was affected by pixel mixing. Therefore, apart from high spectral resolution, a proper spatial resolution is also required for closely monitoring vegetation communities. Drone images have much higher spatial resolution than satellite images, and drones have been used explicitly for species detection [4–7]. A study by [8] states many advantages of using a drone over satellite imagery for identification of land cover communities such as water, land, *Avicennia alba*, *Nypa fruticans*, *Rhizophora apiculata*, and *Casuarina equisetifolia*. Drone imagery has also been applied for specific applications such as analysing vegetation under shallow water and tracking waterbirds and their habitats [9,10]. A study by [11] concluded that a thermal (infrared) sensor on its own performs comparably to an RGB sensor, but a multispectral sensor (with multiple spectral bands and indices) is required for the best analysis of nitrogen on rice fields. Multispectral sensors, however, although useful, are costly [12]. A study by [13] supports the hypothesis that proper spatial resolution with an RGB sensor is sufficient for the analysis of wetland delineation, classification, and health assessment. Taking all these points into consideration, an RGB camera was used in this study as an alternative to a more expensive sensor.

For the analysis of drone data, many techniques are available. The state-of-the-art techniques in drone image analysis comprise both machine learning (ML) and deep learning (DL). A study by [14] demonstrates the application of ML techniques to classify drone images into roads, vineyards, asphalt, and roofs, using ensemble decision trees with an object-oriented approach. The study [15] used object-based multi-resolution segmentation (eCognition software) of UAV imagery for the segmentation of agricultural fields. Besides object-based approaches, there are multiple pixel-based studies, such as [16,17], which use a support vector machine (SVM) classifier for agricultural mapping and reef monitoring using drone imagery. The study [18] applies multiple ML algorithms, including random forest (RF), SVM, and gradient boosting decision tree (GBDT), to classify trees, grasses, bare gravel/sand bed, and water surface, achieving an accuracy of up to 98% using the RF classifier on UAV images. ML algorithms have also been used for vegetation segmentation: a study by [19] used simple linear iterative clustering (SLIC) for mangrove segmentation, and another study by [1] used graph cut for the segmentation of vegetation communities in wetlands. Hence, the segmentation of drone images can help in the identification of subtle changes in vegetation communities. Apart from ML, advanced deep learning techniques are now also widely applied for drone image segmentation. A recent state-of-the-art review [20] shows the surge in applying DL in the field of remote sensing; it also details various convolutional neural network (CNN) models and suggests that ≈20% of all studies since 2012 use DL with UAV imagery. The study [21] used DL to segment concrete cracks in drone images. Segmentation using a CNN is known as semantic segmentation, and it has been applied to various applications like urban land classification [22,23], forest cover classification [24], and wetland type classification [25,26]. A study by [27] uses ResNet50 and UNet for classification of forest tree species, and [28] used transfer learning to obtain the best semantic segmentation of aerial images in the AeroScapes dataset. Both [27,28] suggest that the use of transfer learning enhances the analysis. A study by [29] utilised both ML (linear regression) and DL (neural network) for predicting water and chlorophyll content in citrus leaves, and suggested that both give comparable results for predictions using UAV images. From the literature, it is apparent that both ML and DL can be applied for drone image segmentation.

However, it is not clear which is better for the identification of the communities: traditional state-of-the-art machine learning or advanced deep learning. Therefore, in this study, we applied both ML and DL techniques to classify different vegetation communities on a raised bog wetland. Our study also demonstrates the pros and cons of both methods and gives a clear insight into both techniques and their applicability for future studies on vegetation identification.

#### **2. Study Area and Materials**

The area of study, called Clara Bog, is one of the largest intact raised bogs in Ireland, covering approximately 460 ha in the Irish midlands. The two sides of the bog are divided by a road: East Clara is a restored bog (after years of drainage and peat cutting), whereas West Clara remains a natural active raised bog. This study concentrates on a small part of West Clara Bog (as shown in Figure 1). The different vegetation species have been grouped into communities on the basis of similar habitats, which are termed 'ecotopes' [30].

**Figure 1.** Study area: (**a**) map of Ireland (with the highlighted area: Clara Bog). (**b**) West of Clara Bog, County Offaly (with the highlighted area covered by drone). (**c**) Area covered by DJI Inspire 1™ drone.

The major ecotopes present in Clara bog are Central (C), Subcentral (SC), Submarginal (SM), Marginal (M), and Active flush (or flush) (AF). Other ecotopes, like Inactive flush (IAF) and Facebank (FB), are also present in this bog but have not been considered in this study due to their low ecological impact. Of all these ecotopes, the main focus is on the conservation of the active peat-forming areas [1,30,31], which are considered to be the C, SC, and AF ecotopes. These areas have high sphagnum moss coverage, with hummocks, hollows, lawns, and many pools. The SM ecotope that appears at the boundaries of the SC ecotope can appear almost homogenous, which makes it hard to distinguish between them. The SM and M ecotopes are located on drier areas with vegetation reflective of such conditions.

For capturing high-resolution images, a DJI Inspire 1™ drone was used with a Zenmuse X3 camera. This is an optical camera with a 100–1600 ISO range (for photos) and a 94° field of view (FOV). The lens is anti-distortion with autofocus (20 mm, 35 mm format equivalent). The aspect ratio while capturing the images was kept at 4:3. The images were captured on 21 April 2019 at around noon; the highest temperature recorded that day was 19 °C. The flight height was ≈100 m, and the spatial resolution of the captured images was 1.8 cm. The drone mission was pre-loaded using Google Maps in the Pix4DCapture application on an iOS 12 device to capture ≈8 ha of the area. The images were captured individually with 70% frontal and 80% sideways overlap at an average speed of 3 m/s. Figure 1c provides the drone imagery of the study area. For georeferencing, the drone imagery had geo-tags (latitude-longitude locations) embedded in it. For better orientation, the imagery was overlaid on high-resolution DigitalGlobe World Imagery (spatial resolution = 30 cm), available as a base map in ArcMap v.10.6.1 [32,33]. Using the 'georeferencing' toolbox in [32], 3–4 ground control points (GCPs) were identified for every image, and the projection was rectified to the Geographic Coordinate System—World Geodetic System 84 (GCS WGS 84). In this study, the C, SC, SM, M, and AF ecotopes were all captured using high-resolution drone imagery (Figure 2).

**Figure 2.** Ecotopes in Clara bog. Drone images, April 2019.

The SM and SC ecotopes are highly homogenous and appear to be mixed throughout the bog [1]; these communities were therefore merged for the rest of the study. In total, ≈75 images of dimension 3000 × 4000 pixels were captured. Of these, 15 images were discarded due to differences in light intensity, motion blur, and camera tilt. The usable 60 images were divided randomly into 70% training and 30% testing, i.e., around 40 images for training and 20 images for testing. In order to have a correct idea of mapping accuracy, all the images were labelled for the four vegetation communities (M, SMSC, C, AF). For ML, only a part of the labelled training data was required, whereas for DL, fully labelled images were used; this is discussed further in Sections 3 and 4. For the creation of a training dataset, it is essential for all the images to have a similar intensity range. Depending on the lighting conditions when a picture is taken, the colour properties may change, even though the textural properties remain unchanged. In a temperate climate like Ireland's, this change in sunlight while capturing drone images is unavoidable. Therefore, for future studies, the use of colour correction techniques for drone images is recommended so that all captured images can be used.
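A minimal sketch of such a random 70/30 image split follows; the directory name and seed are hypothetical, not from the study.

```python
# Minimal sketch of the random 70/30 image split described above; the
# directory name and the seed are illustrative, not from the study.
import random
from pathlib import Path

random.seed(42)                                  # make the split reproducible
images = sorted(Path("clara_bog_images").glob("*.jpg"))
random.shuffle(images)
n_train = int(0.7 * len(images))                 # ~40 of the 60 usable images
train, test = images[:n_train], images[n_train:]
print(len(train), "training images /", len(test), "testing images")
```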

#### **3. Segmentation Using Machine Learning**

The segmentation of images using machine learning techniques utilises combinations of intensity, colour, texture, and motion attributes to come up with hierarchical segments [34]. The drone images used for this study have intensity and colour information. Although textural information is not present in the original image, textural features were subsequently calculated using the parameters mentioned in Table 1 [35]. This was done by converting the RGB image into a grayscale image. The textural information presented in Table 1 was added as features along with the RGB layers. The entire computation of the machine learning techniques and the steps described below was performed in MATLAB v.2019b using the Image Processing Toolbox [36].

**Table 1.** Textural properties calculated using drone imagery (columns: Property, Description).
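Since Table 1's full property list is not reproduced here, the sketch below illustrates the general recipe in Python with scikit-image rather than the MATLAB tooling actually used, assuming GLCM-style properties (contrast, correlation, energy, homogeneity) computed over local windows of the grayscale image; the window size, grey-level count, and property set are illustrative.

```python
# Illustrative texture-band construction (not the authors' MATLAB code):
# grey-level co-occurrence (GLCM) properties computed per window on the
# grayscale image, to be stacked with the RGB bands as extra features.
import numpy as np
from skimage.color import rgb2gray
from skimage.util import img_as_ubyte
from skimage.feature import graycomatrix, graycoprops

PROPS = ["contrast", "correlation", "energy", "homogeneity"]  # assumed set

def texture_bands(rgb, win=32, levels=32):
    gray = img_as_ubyte(rgb2gray(rgb)) // (256 // levels)  # quantise greys
    h, w = gray.shape
    bands = np.zeros((len(PROPS), h // win, w // win))
    for i in range(h // win):
        for j in range(w // win):
            patch = gray[i*win:(i+1)*win, j*win:(j+1)*win]
            glcm = graycomatrix(patch, distances=[1], angles=[0],
                                levels=levels, symmetric=True, normed=True)
            for k, prop in enumerate(PROPS):
                bands[k, i, j] = graycoprops(glcm, prop)[0, 0]
    return bands  # coarse maps; upsample to (h, w) before stacking with RGB

demo = texture_bands(np.random.rand(128, 128, 3))
print(demo.shape)  # (4, 4, 4): four texture bands on a 4x4 window grid
```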

The segmentation technique used in this study, called graph cut, is based on max-flow min-cut [43]. It operates on the posterior probabilities associated with every pixel for every class. In order to calculate the posterior probabilities, an initial classification of the drone images was carried out. Based on texture and colour intensity, a total of 13 bands were used for this classification of the drone images. The type and choice of classifier are discussed in the following subsection.

#### *3.1. Choice of the ML Classifier*

For efficient classification, the choice of classifier is the most crucial decision to be made. Multiple studies have applied the hyperplane-based SVM [44,45] for image classification. Other studies, like [46], have used decision trees. Studies [47,48] suggest that there is an advantage in using ensemble classifiers over other state-of-the-art classifiers. The most commonly used ensemble classifier is built on a tree model. Tree models are easy to understand and can be used for both classification and regression. There is no need for variable selection (since it is automatic) or variable transformation. They are robust to outliers and missing data, and particularly useful for large datasets.

In this study, in order to provide proper comparative analyses, the drone images captured on 21 April 2019 were classified using multiple classifiers. The training dataset (≈12k pixels from 40 images) was the input for all the classifiers. The classifiers were compared on model accuracy, misclassification cost (i.e., the total number of incorrectly identified pixels per 10,000 pixels), and training time (the time taken by the classifier for training). The model accuracy for each ML model was calculated using 5-fold cross-validation over the entire 70% training dataset; this accuracy indicates the capability of the model to label the pixels correctly. The results (Table 2) describe all the classifiers and the corresponding accuracy metrics. All the calculations were performed using MATLAB v.2019b [36].
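An analogous comparison loop, sketched here with scikit-learn rather than the MATLAB tooling actually used, could look as follows; the feature matrix and labels are random placeholders standing in for the ≈12k labelled pixels and 13 bands.

```python
# Illustrative classifier comparison (not the authors' MATLAB code):
# 5-fold cross-validated accuracy and training time for the six models
# named in the text, on a placeholder pixel feature matrix.
import time
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    "decision tree": DecisionTreeClassifier(),
    "naive Bayes": GaussianNB(),
    "discriminant analysis": LinearDiscriminantAnalysis(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "random forest": RandomForestClassifier(n_estimators=100),
}

X = np.random.rand(12000, 13)          # placeholder for ~12k training pixels
y = np.random.randint(0, 4, 12000)     # placeholder labels (M, SMSC, C, AF)

for name, clf in classifiers.items():
    t0 = time.time()
    acc = cross_val_score(clf, X, y, cv=5).mean()  # 5-fold model accuracy
    print(f"{name}: accuracy={acc:.3f}, time={time.time() - t0:.1f}s")
```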

The preliminary comparison was made using six classifiers, namely, decision trees [49], naïve Bayes [50], discriminant analysis [51], SVM [52], k-nearest neighbour (KNN) [53], and random forest (RF) [54]. Based on the misclassification rate, model accuracy, and training time (see Table 2), RF was found to be the best classifier. Random forest, or bagging, is a general-purpose procedure for reducing the variance of a predictive model. When applied to trees, the number of trees (t) is bootstrapped, each having a variance σ². In RF, each tree can split on only a random subset of the features (hence the name). RF requires an attribute (sample) selection and a pruning method. The information gain ratio criterion [55] and the Gini index [56] are the most common attribute selection methodologies. For this study, the Gini index criterion was used to decide the attributes. The Gini index (G) is given in Equation (1); based on the value of G, the attribute was decided automatically.

$$G = \sum_{n} \sum_{i=1}^{N} \left( p_i \times (1 - p_i) \right)_n \tag{1}$$

where $p_i$ is the proportion of the pixels (i = 1 to N) belonging to a particular class $n$, i.e., the prior probability. A minimum of 10% of the entire ground truth image should be given as training, and the rest can be used for testing [1]. The samples were divided into 100 random subsets (with repetition), and for each tree, the attributes (splitting criteria: which of the bands to split on) were decided using Equation (1). The final class selection for every pixel was made using majority voting. The workflow of the RF classifier is given in Figure 3.
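As a quick numeric check, for a single node Equation (1) reduces to the standard Gini form $G = \sum_i p_i (1 - p_i)$ over the class proportions at that node; a toy computation (values illustrative) is shown below.

```python
# Worked check of the per-node Gini index from Equation (1), assuming
# `p` holds the class proportions of the pixels at a candidate split node.
import numpy as np

def gini(p):
    """G = sum_i p_i * (1 - p_i); lower values indicate a purer node."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

print(gini([0.25, 0.25, 0.25, 0.25]))  # 0.75: maximally impure for 4 classes
print(gini([1.0, 0.0, 0.0, 0.0]))      # 0.0: pure node
```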

**Table 2.** Comparison of ML classification techniques.


**Figure 3.** Workflow—random forest classifier.

#### *3.2. Segmentation*

Once the drone images were classified, they were segmented using the maximum-a-posteriori (energy minimisation) technique. The technique uses contextual (area) information to form proper segments from pixels. The pixels, therefore, are no longer treated as single entities but as parts of more significant segments. It can be considered a post-classification smoothing process based on spatial similarities. The formation of segments was done using a max-flow min-cut algorithm, commonly known as graph cut. This algorithm uses data and smoothness costs [57]. The graph cut segmentation was performed in MATLAB v.2019b [40] using a MATLAB wrapper mex file function that enables the user to call C/C++ files [58]. The steps for the segmentation include calculation of the data cost, smoothness cost, and energy using the posterior probability from the pixel-based classification map. Based on the maximum probability of the pixels, the segments were formed, and the pixels were joined.

The data cost ($D_p$) is based on the individual labels of pixels and their likelihood function. $D_p$ measures the cost of assigning the class $n$ to the pixel $p$ for a given set of features $U_N$ in the vectorised image having $N$ pixels. In image processing, $D_p$ is typically expressed as in [59], given by Equation (2).

$$D_p = \left\| U_p(n) - I(p) \right\|^2 \tag{2}$$

where $I(p)$ is the observed reflectance of the pixel $p$.

The smoothness cost ($V_{p,q}$), on the other hand, was used to promote grouping. It was assumed that neighbouring pixels should belong to the same class; hence, this cost was assigned based on the likelihood of pixels $p$ and $q$ belonging to the same class $n$, where $n_p$ and $n_q$ are the labels of pixels $p$ and $q$, respectively. It is defined as described in Equation (3).

$$V_{p,q}(n_p, n_q) = c \times \exp\left( -\Delta(p,q)/\sigma \right) \times T(n_p \neq n_q) \tag{3}$$

where $\Delta(p,q) = \left\| I(p) - I(q) \right\|$ denotes how different the reflectance values of $p$ and $q$ are, $c > 0$ is a smoothness factor, the standard deviation $\sigma > 0$ is used to control the contribution of $\Delta(p,q)$ to the penalty, and $T = 1$ if $n_p \neq n_q$ and 0 otherwise.

As described in [1], the steps followed for the drone images were the same as for satellite image segmentation. The main difference comes in the choice of the smoothness factor. Since a drone image is much more detailed than a satellite image, forming distinct segments requires a high smoothness factor. After an iterative parameter optimisation exercise, a smoothness factor of $c > 5$ was chosen for the drone images. This can be compared to the optimum value of $c < 1$ when processing satellite images [1]. Therefore, for a high resolution (1.8 cm), a higher value of $c$ was required, whereas for the 10 m spatial resolution of satellite images, a small value of $c$ suffices.

The pioneering work by [59] explains that energy ($E$) minimisation can be interpreted directly as posterior maximisation. Using the probability functions from the previous steps, we get the energy function described in Equation (4).

$$E(U_N, n) = \sum_{p \in N} D_p + \sum_{p,q \in N} V_{p,q} \tag{4}$$

Therefore, $E(U_N, n)$, i.e., the energy for the image vector with a total of $N$ pixels ($U_N$) over all $n$ classes, is minimised, leading to the formation of smooth segments. The pixels with the least $E$ are joined together to form segments, depending on their initial labels as obtained from the pixel-based RF classification. The results of the segmentation are discussed further in Section 5.1.
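The two cost terms are straightforward to compute; the sketch below (Python, not the authors' MATLAB/C++ wrapper) builds a data cost from the RF posteriors as a negative log-probability, which is one common choice consistent with the text, and the smoothness cost of Equation (3) for horizontal neighbours only. An actual max-flow/min-cut solver (such as the C/C++ code wrapped via mex in the study, or a library like PyMaxflow) would then minimise the total energy of Equation (4).

```python
# Hedged sketch of the graph-cut energy terms; `probs` stands in for the
# per-pixel RF posteriors (H x W x n_classes) and `img` for grayscale
# reflectance. Parameter values (c, sigma) are illustrative.
import numpy as np

rng = np.random.default_rng(0)
H, W, K = 64, 64, 4
probs = rng.dirichlet(np.ones(K), size=(H, W))   # placeholder posteriors
img = rng.random((H, W))                         # placeholder reflectance
labels = probs.argmax(axis=2)                    # initial RF labelling

def data_cost(probs, eps=1e-9):
    """Per-pixel, per-class cost; low where the posterior is high."""
    return -np.log(probs + eps)

def smoothness_cost(img, labels, c=5.0, sigma=0.1):
    """V_pq of Eq. (3) for horizontal neighbour pairs only (for brevity)."""
    delta = np.abs(img[:, :-1] - img[:, 1:])             # ||I(p) - I(q)||
    t = (labels[:, :-1] != labels[:, 1:]).astype(float)  # T(n_p != n_q)
    return c * np.exp(-delta / sigma) * t

D = data_cost(probs)
E = D[np.arange(H)[:, None], np.arange(W), labels].sum() \
    + smoothness_cost(img, labels).sum()                 # Eq. (4)
print(f"total energy for the initial labelling: {E:.1f}")
```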

#### **4. Segmentation Using Deep Learning**

#### *4.1. Parameters in Convolutional Neural Network*

Convolutional neural networks (CNNs or ConvNets) have caused a step change in pattern recognition. Here, each neuron is connected to only a local region of the input, making the network faster and less prone to overfitting on a large dataset. Therefore, CNNs, when compared to traditional NNs, can have fewer parameters. In addition, the same parameters are used in more than one place in a CNN, making the model both statistically and computationally efficient. The initial layers of a CNN identify lines, corners, edges, and textures; the deeper the network goes, the more precisely it can learn from the features. Figure 4 gives the general architecture of a CNN, and the different layers used are described in detail in the following subsections.


**Figure 4.** The general architecture of convolutional neural network (CNN).

#### 4.1.1. Convolutional Layer

Convolution in a CNN is the mathematical operation that combines signals *a* and *b* [*a* ∗ *b*], i.e., filtering input *a* with kernel *b*. It is a process to overlay '*b*' on '*a*', multiply the numbers, sum the products, and move on. In a CNN, the convolutional layer is used instead of only fully connected layers. For visualising, convolution may look like a sliding window operation, but it is implemented as matrix multiplication: the input, as well as the kernels, is divided into arrays and rearranged into columns.
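The "rearranged into columns" implementation is the classic im2col trick; a toy sketch with illustrative sizes follows (note that, as in most CNN frameworks, the kernel is applied without flipping).

```python
# Sketch of convolution-as-matrix-multiplication (im2col): unroll each
# kernel-sized window of input `a` into a column, then a single matrix
# product applies kernel `b` everywhere at once. Sizes are illustrative.
import numpy as np

a = np.arange(16, dtype=float).reshape(4, 4)   # input signal
b = np.array([[1., 0.], [0., -1.]])            # 2x2 kernel

# gather every 2x2 window of `a` as a column
cols = np.array([a[i:i+2, j:j+2].ravel()
                 for i in range(3) for j in range(3)]).T   # shape (4, 9)
out = (b.ravel() @ cols).reshape(3, 3)         # one matmul == convolution

# cross-check against an explicit sliding-window loop
ref = np.array([[np.sum(a[i:i+2, j:j+2] * b) for j in range(3)]
                for i in range(3)])
print(np.allclose(out, ref))                   # True
```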

#### 4.1.2. Pooling Layer

The pooling layer downsamples the input by locally summarising the data in it. The two types of pooling, shown in Figure 5, are:


1. Max pooling: where the local maximum of the filtered region is carried forward.
2. Average pooling: where the local average of the filtered region is carried forward.

Of the two methods, max pooling was used for this study, as it is the more efficient pooling technique [60]. A feature existing in the input layer is fed forward regardless of its initial position (as the local maximum will still make it to the next layer). The advantages of pooling include decreasing the size of the activation layer that is fed forward to the next layer and increasing the receptive field of the subsequent units.

**Figure 5.** Types of pooling used in CNN.
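As a toy numerical check of the two pooling types (2 × 2 windows, stride 2; values are illustrative):

```python
# Max vs. average pooling on a 4x4 input with non-overlapping 2x2 windows.
import numpy as np

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 2],
              [2, 2, 3, 4]], dtype=float)

blocks = x.reshape(2, 2, 2, 2).swapaxes(1, 2)   # four 2x2 windows
print(blocks.max(axis=(2, 3)))    # max pooling     -> [[4. 2.] [2. 5.]]
print(blocks.mean(axis=(2, 3)))   # average pooling -> [[2.5 1.] [1.25 3.5]]
```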

#### 4.1.3. Kernel Size

Kernels, or filters, are used to down-sample the layers in a CNN. It is preferable to use smaller kernels stacked on top of one another rather than a single large kernel [61]. Using smaller kernels decreases the number of parameters and also increases the nonlinearity (see Section 4.1.6). For example, a stack of two 3 × 3 kernels and one 5 × 5 kernel have the same receptive field; however, the 3 × 3 stack has fewer parameters (as the same kernel is used twice) and more nonlinearity. Therefore, in this study, a kernel of size 3 × 3 was used.

#### 4.1.4. Stride

Stride defines by how much the kernel moves in the convolution layer. The stride can be used to increase the receptive field; for example, with stride = 2, the kernel moves two pixels at a time, halving the output's spatial dimension. Using stride > 1 provides a down-sampling effect and can be used as an alternative to the pooling layer.

#### 4.1.5. Padding

Padding is required to maintain the spatial resolution of the input image. Padding can be of two types: valid and same. In valid padding, the spatial dimension of the output shrinks by one pixel less than the kernel's spatial dimension. In same padding, the input is surrounded with zeros such that the spatial dimension of the output is the same as that of the input layer. Therefore, same padding was used in this study in order to maintain the same dimensions between input and output.

#### 4.1.6. Activation Function

The activation function ($f(x)$) defines the output for a given input. It also imparts nonlinearity to the network.

#### *Why do we need nonlinearity?*

Combining linear functions yields another linear function; however, in order to compute more in-depth features, nonlinearity is required. With only linear functions, the model is no more expressive than a logistic regression model without any hidden layer; hence, without any nonlinearity, the entire network behaves as a single linear function.

The study [62] describes the types of activation functions. Some of the most commonly used and well-known activation functions are identity (when a linear relation is required), binary step (nonlinear, good for binary classification), sigmoid (nonlinear, ranging from 0 to 1), hyperbolic tangent (tanH) (like sigmoid, but ranging from −1 to +1), and rectified linear unit (ReLU) (nonlinear, zeroing out the negative part of the input). Sigmoid, tanH, and ReLU also have other variants; see [63]. Other studies, like [64,65], compare the various activation functions. A study by [66] presents a comparison between 11 activation functions and suggests ReLU to be the best. Additionally, the ReLU function is much more computationally efficient, and therefore, the ReLU activation function was used for this study. Equation (5) describes the ReLU function.

$$f(x) = \begin{cases} 0, & x < 0 \\ x, & x \ge 0 \end{cases} \tag{5}$$

#### 4.1.7. Softmax Classifier

The softmax classifier is an activation function typically used as the top layer (after a fully connected layer). It yields the probability of each input belonging to each output class when there are more than two outputs. For $n$ classes, the softmax activation ($\sigma$) is defined by Equation (6).

$$\sigma(\mathbf{x})_j = \frac{e^{x_j}}{\sum_{k=1}^{n} e^{x_k}}, \quad j = 1, \dots, n \tag{6}$$

#### 4.1.8. Batch Normalisation

Each layer depends on its previous layer; therefore, even the smallest error in one layer can be magnified in subsequent layers, causing much more significant errors in the final output. To avoid this, a batch normalisation layer is used. This layer normalises the hidden nodes before they are fed into an activation function.
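Pulling Sections 4.1.1–4.1.8 together, a minimal PyTorch sketch of such a block might look as follows; the channel widths, depth, and tile size are illustrative choices, not the study's architecture.

```python
# A minimal sketch combining the layers discussed above: 3x3 kernels with
# 'same' padding, batch normalisation, ReLU, stride-2 max pooling, and a
# softmax over four ecotope classes. Widths/depth are illustrative.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 'same' padding for 3x3
    nn.BatchNorm2d(16),                          # normalise before activation
    nn.ReLU(),                                   # f(x) = max(0, x), Eq. (5)
    nn.MaxPool2d(2),                             # downsample by a factor of 2
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.Conv2d(32, 4, kernel_size=1),             # per-pixel class scores
)

x = torch.randn(1, 3, 64, 64)                    # one RGB tile
logits = block(x)                                # shape (1, 4, 32, 32)
probs = torch.softmax(logits, dim=1)             # Eq. (6) over the 4 classes
print(probs.shape, probs.sum(dim=1).mean())      # probabilities sum to 1
```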

#### 4.1.9. Additional Parameters in CNN

An essential parameter in CNN training is optimisation. Training a network can be considered an optimisation problem where the goal is to minimise the loss function. Various optimisation algorithms can be used to minimise the loss function, such as online learning [67], batch learning [68], and stochastic gradient descent (SGD) [69]. As described in [70], taking a subset of the data at a time gives faster and more efficient processing; therefore, stochastic gradient descent was used for optimisation in this study. The subset of data is called a mini-batch, and the number of samples in a mini-batch is called the batch size.

Another important parameter in CNN training is regularisation. Regularisation can be applied to keep the model simple but effective; it reduces overfitting and adds additional (prior) information, ensuring that augmenting the input will not change the quality of the output. Regularisation can be done by adding a weight penalty term to the loss function (Equation (7)).

$$\text{Loss}_{\text{total}} = \text{Loss} + \text{weight penalty}(w) \tag{7}$$

L2, or ridge, regularisation leads to the formation of small weights [71]. Additionally, L2 regularisation never causes a degradation in performance, even with the addition of kernels [72]. Therefore, L2 regularisation was used in the CNNs for this study. For a given input $x$ and its corresponding output $\hat{x}$, the regularised loss function is given in Equation (8).

$$\text{Loss} = \sum_{i} (x_i - \hat{x}_i)^2 + \alpha \sum_{i} w_i^2 \tag{8}$$

A third important parameter for a CNN architecture is the learning rate (LR), defined as the rate at which the weights are updated during the training of the network. The study [73] suggests starting with a larger learning rate and gradually decreasing it when getting closer to the local minima of the loss function. Since adaptive momentum estimation (ADAM) is fast and requires little memory for computation [74], it was selected as the optimiser for the network used in this study. ADAM learns the LR on a per-parameter basis and is a combination of adaptive gradient (AdaGrad) and root mean square propagation (RMSProp).
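The training-side choices above, mini-batch updates, ADAM, and an L2 weight penalty (Equation (8), exposed in PyTorch as `weight_decay`), can be sketched as follows; the stand-in model and all hyperparameter values are illustrative, not the study's settings.

```python
# Hedged sketch: ADAM optimiser with an L2 weight penalty (weight_decay
# plays the role of alpha in Eq. (8)) applied over mini-batches. The tiny
# stand-in model and random mini-batch are placeholders, not study data.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 4, kernel_size=3, padding=1)   # stand-in network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 3, 32, 32)              # one mini-batch (batch size 8)
y = torch.randint(0, 4, (8, 32, 32))       # per-pixel labels for 4 classes
for step in range(3):                      # a few gradient-descent updates
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)            # loss on this mini-batch
    loss.backward()
    optimizer.step()
    print(step, float(loss))
```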

#### 4.1.10. Popular CNN Models

CNN models are formed using combinations of the parameters mentioned in the above subsections. The combinations of layers and the types of parameters used are often application-specific and applied to solve a bigger problem. In this study, VGG16 [75] and ResNet50 [76] were applied based on the work done by [77,78]; the models and their salient features are briefly discussed below.
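Both backbones are commonly initialised from ImageNet-pretrained weights as the transfer-learning starting point; a torchvision sketch follows (the weight tags reflect current torchvision conventions and are illustrative of the general approach, not the study's exact setup).

```python
# Loading the two backbone families named above with pretrained weights;
# the "IMAGENET1K_V1" tags are illustrative torchvision weight identifiers.
import torchvision.models as models

vgg16 = models.vgg16(weights="IMAGENET1K_V1")
resnet50 = models.resnet50(weights="IMAGENET1K_V1")
print(sum(p.numel() for p in vgg16.parameters()) / 1e6, "M parameters")
print(sum(p.numel() for p in resnet50.parameters()) / 1e6, "M parameters")
```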

*VGGNet*

