**1. Introduction**

In agricultural fields, the main causes of quality and yield losses at harvest are viruses, bacteria, fungi and pests [1]. To control these harmful pathogens, farmers generally treat the entire crop against different diseases. However, using large amounts of chemicals has a negative impact on human health and ecosystems. This constitutes a significant problem to be solved, and precision agriculture presents an interesting alternative.

In recent decades, precision agriculture [2,3] has introduced many new farming methods to improve and optimize crop yields, constituting a research field in continuous evolution. New sensing technologies and algorithms have enabled the development of several applications, such as water stress detection [4], vigour evaluation [5], estimation of evapotranspiration and crop coefficients [6], weed localization [7,8], and disease detection [9,10].

Disease detection in vine is an important topic in precision agriculture [11–22]. The aim is to detect and treat the infected area at the right place, at the right time, and with the right dose of phytosanitary products. At an early stage, diseases are easier to control with small amounts of chemical products. Indeed, intervention before the infection spreads offers many advantages, such as preservation of the vine, grape production and the environment, and reduction of economic losses. To achieve this goal, frequent monitoring of the parcel is necessary. Remote sensing (RS) methods are among the most widely used for that purpose and are essential in precision agriculture. RS images can be obtained at the leaf or parcel scale. At the leaf level, images are acquired using a photo sensor, either held by a person [23] or mounted on a mobile robot [24]. At the parcel level, satellites were the standard RS imaging systems [25,26]. Recently, drones or UAVs have gained popularity due to their low cost, high-resolution images, flexibility, customization and easy data access [27]. In addition, unlike satellite imaging, UAV imaging does not suffer from cloud cover, which has helped to solve many remote sensing problems.

Parcel monitoring generally requires building orthophotos from geo-referenced visible and infrared UAV images. However, using two separate sensors generates a spatial shift between the images of the two modalities, and this shift persists in the resulting orthophotos. It has been established that combining the information from the two sensors increases the efficiency of disease detection. Therefore, image registration is required.

Existing registration algorithms rely on either area-based or feature-based approaches. The most commonly used in precision agriculture are feature-based methods, which match features between images [28]. In this study, we adopted the feature-based approach to align orthophotos of the visible and infrared ranges. The two are then combined for the disease detection procedure, where the problem consists of assigning a class label to each pixel. For that purpose, deep learning is currently the most preferred approach.

Deep learning methods [29] have achieved a high level of performance in many applications, for which different network architectures have been proposed. For instance, R-CNN [30], Siamese [31], ResNet [32] and SegNet [33] are architectures used for object detection, tracking, classification and segmentation, respectively, and operate in most cases in the visible range. However, in certain situations, the input data are not only visible images but can be combined with multispectral or hyperspectral images [34], and even depth information [35]. In these contexts, the architectures can be modified to improve performance [36]. Thus, in some studies [37–40], depth information is used as input data. These data generally provide valuable information about the scene or environment.

Depth or height information is extracted from 3D reconstruction or photogrammetry processing. In UAV remote sensing imagery, photogrammetry processing can build a digital surface model (DSM) before creating the orthophoto. The DSM can provide much information about the parcel, such as terrain variation and the objects on its surface. Some research works have shown the ability to extract vinerows by generating a depth map from the DSM [41–43]. These solutions were proposed to solve the vinerow misextraction that occurs with the NDVI vegetation index. Indeed, in some situations, the NDVI method cannot extract vinerows when the parcel has green, grassy soil. The advantage of the depth map is its ability to separate above-ground areas from the ground, even when all zones have the same color. To date, there has been no work on vine disease detection that combines depth and multispectral information with a deep learning approach.

This paper presents a new system for vine disease detection using multispectral UAV images. It combines a highly accurate orthophotos registration method, a depth map extraction method and a deep learning network adapted to the vine disease detection data.

The article is organized as follows. Section 2 presents a review of related works. Section 3 describes the materials and methods used in this study. Section 4 details the experiments. Section 5 discusses the performances and limitations of the proposed method. Finally, Section 6 concludes the paper and introduces ideas to improve the method.

#### **2. Related Work**

Plant disease detection is an important issue in precision agriculture. Much research has been carried out, and extensive surveys have been provided by Mahlein (2016) [44], Kaur et al. (2018) [45], Saleem et al. (2019) [46], Sandhu et al. (2019) [47] and Loey et al. (2020) [48]. Schor et al. (2016) [49] presented a robotic system for detecting powdery mildew and wilt virus in tomato crops. The system is based on an RGB sensor mounted on a robotic arm. Image processing and analysis were developed using principal component analysis and coefficient-of-variation algorithms. Sharif et al. (2018) [50] developed a hybrid method for disease detection and identification in citrus plants. It consists of lesion detection on citrus fruits and leaves, followed by a classification of the citrus diseases. Ferentinos (2018) [51] and Argüeso et al. (2020) [52] built CNN models to perform plant diagnosis and disease detection using images of plant leaves. Jothiaruna et al. (2019) [53] proposed a segmentation method for disease detection at the leaf scale using color features and a region-growing method. Pantazi et al. (2019) [54] presented an automated approach for crop disease identification on images of various leaves. The approach consists of using a local binary patterns algorithm for extracting features and performing classification into disease classes. Abdulridha et al. (2019) [55] proposed a remote sensing technique for the early detection of avocado diseases. Hu et al. (2020) [56] combined an Internet of Things (IoT) system with deep learning to create a solution for automatically detecting various crop diseases and communicating the diagnostic results to farmers.

Disease detection in vineyards has been increasingly studied in recent years [11–22]. Some works are carried out at the leaf scale, and others at the crop scale. MacDonald et al. (2016) [11] used Geographic Information System (GIS) software and multispectral images for detecting the leafroll-associated virus in vine. Junges et al. (2018) [12] investigated vine leaves affected by esca in hyperspectral ranges, and di Gennaro et al. (2016) [13] worked at the crop level (UAV images). Both studies concluded that the reflectance of healthy and diseased leaves is different. Albetis et al. (2017) [14] studied Flavescence dorée detection in UAV images. The results obtained showed that vine disease detection using aerial images is feasible. A second study by Albetis et al. (2019) [15] examined the potential of UAV multispectral imagery in the detection of symptomatic and asymptomatic vines. Al-Saddik et al. conducted three studies on vine disease detection using hyperspectral images at the leaf scale. The aim of the first one (Al-Saddik et al. 2017) [16] was to develop spectral disease indices able to detect and identify Flavescence dorée on grape leaves. The second one (Al-Saddik et al. 2018) [17] was performed to differentiate yellowing leaves from leaves diseased by esca through classification. The third one (Al-Saddik et al. 2019) [18] consisted of determining the best wavelengths for the detection of Flavescence dorée. Rançon et al. (2019) [19] conducted a similar study for detecting esca disease. Image sensors were embedded on a mobile robot that moved along the vinerows to acquire images. To detect esca disease, two methods were used: the Scale-Invariant Feature Transform (SIFT) algorithm and the MobileNet architecture. The authors concluded that the MobileNet architecture provided a better score than the SIFT algorithm. In the framework of previous works, we carried out three studies on vine disease detection using UAV images. The first one (Kerkech et al. 2018) [20] was devoted to esca disease detection in the visible range using the LeNet5 architecture combined with several color spaces and vegetation indices. In the second study (Kerkech et al. 2019) [21], we used near-infrared and visible images. Disease detection was treated as a semantic segmentation problem performed by the SegNet architecture. Two parallel SegNets were applied, one per imaging modality, and the results obtained were merged to generate a disease map. In (Kerkech et al. 2020) [22], a correction process using a depth map was added to the output of the previous method. Post-processing with this depth information demonstrated the advantage of this approach in reducing detection errors.

#### **3. Materials and Methods**

This section presents the materials and each component of the vine disease detection system. Figure 1 provides an overview of the method, which includes the following steps: data acquisition, orthophotos registration, depth map building and orthophotos segmentation (disease map generation). The following sections detail these steps.

**Figure 1.** The proposed vine disease detection system.

#### *3.1. Data Acquisition*

Multispectral images are acquired using a quadricopter UAV that embeds a MAPIR Survey2 camera and a Global Navigation Satellite System (GNSS) module. This camera integrates two sensors in the visible and infrared ranges with a resolution of 16 megapixels (4608 × 3456 pixels). The visible sensor captures the red, green, and blue (RGB) channels and the infrared sensor captures the red, green, and near-infrared (R-G-NIR) channels. The wavelength of the near-infrared channel is 850 nm. The accuracy of the GNSS module is approximately 1 m.

The acquisition protocol consists of a drone flying over the vines at an altitude of 25 m and an average speed of 10 km/h. During flights, the sensors acquire an image every 2 s. Each image has a 70% overlap with the previous and next ones, so each point of the vineyard has six different viewpoints (it can be observed in six different images). The flight system is managed by a GNSS module. The flight plans include topographic monitoring aimed at guaranteeing a constant distance from the soil. Images are recorded with their GNSS position. Flights are performed with the sun at its zenith to avoid shadows, and in moderate weather conditions (light wind and no rain) to avoid UAV flight problems.

#### *3.2. Orthophotos Registration*

The multispectral acquisition protocol using two sensors causes a shift between visible and infrared images. Hence, a shift in the multispectral images automatically implies a shift in the orthophotos. Usually, orthophotos registration is performed manually using the QGIS software. The manual method is time-consuming, requires great care to select many key points between the visible and infrared orthophotos, and the result is not very accurate. To overcome this problem, a new method for automatic and accurate orthophotos registration is proposed.

The proposed orthophotos registration method is illustrated in Figure 2 and is divided into two steps. The first one concerns the UAV multispectral image registration, and the second builds the registered multispectral orthophotos. In this study, the first step uses the optimized multispectral image registration method proposed in [21]. Based on the Accelerated-KAZE (AKAZE) algorithm, the registration method matches key points extracted from the visible and infrared images and computes the homographic matrix for geometric correction. In order to increase accuracy, the method uses an iterative process to reduce the Root Mean Squared Error (RMSE) of the registration. The second step consists of using the Agisoft Metashape software to build the registered visible and infrared orthophotos. The Metashape software is based on the Structure from Motion (SfM) algorithm for photogrammetry processing. Building orthophotos requires the UAV images and the digital surface model (DSM). To obtain the DSM, the software must go through photogrammetry processing and perform the following steps: alignment of the images to build a sparse point cloud, then a dense point cloud, and finally the DSM. Orthophotos are built with the software's "build orthomosaic" process. To build the visible orthophoto, the visible UAV images and the DSM are used, while, to build a registered infrared orthophoto, the registered infrared UAV images and the same DSM as for the visible orthophoto are used. The parameters used in the Metashape software are detailed in Table 1.

**Figure 2.** The proposed orthophotos registration method.
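As an illustration of the first step, the following is a minimal sketch of AKAZE-based feature matching and homography estimation, assuming OpenCV. The function name and the RANSAC threshold are our own choices, and the iterative RMSE-reduction loop of [21] is not reproduced here.

```python
import cv2
import numpy as np

def register_infrared_to_visible(visible, infrared):
    """Estimate a homography mapping the infrared image onto the visible one."""
    akaze = cv2.AKAZE_create()
    kp_vis, des_vis = akaze.detectAndCompute(visible, None)
    kp_ir, des_ir = akaze.detectAndCompute(infrared, None)

    # Brute-force matching with Hamming distance (AKAZE descriptors are binary)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_ir, des_vis), key=lambda m: m.distance)

    src = np.float32([kp_ir[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_vis[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC rejects outlier matches before computing the homographic matrix
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = visible.shape[:2]
    return cv2.warpPerspective(infrared, H, (w, h)), H
```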

#### *3.3. Depth Map*

The DSM previously built during the orthophotos registration process is used here to obtain the depth map. In fact, the DSM represents the terrain surface variation and includes all objects found on it (in this case, vine trees). Therefore, some processing is required to retain only the vine height. To extract the depth map from the DSM, the method proposed in [41] is used. It consists of the following steps: the DSM is first filtered using a low-pass filter of size 20 × 20; this filter is chosen to smooth the image and keep only the terrain surface variations, also called the digital terrain model (DTM). The DTM is then subtracted from the DSM to eliminate the terrain variations and retain only the vine height. Due to the weak contrast of the result, contrast enhancement was necessary; it is performed here using a histogram-based method (histogram normalization). The result is an image with a good difference in grey levels between vine and non-vine areas. Once the contrast is corrected, automatic thresholding using Otsu's algorithm is applied to obtain a binary image representing the depth map.
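This processing chain can be sketched as follows, assuming the DSM is available as a floating-point array; the mean filter stands in for the low-pass filter, and the function name is illustrative.

```python
import cv2
import numpy as np

def depth_map_from_dsm(dsm):
    # 20 x 20 mean filter smooths out the vines, leaving the terrain trend (DTM)
    dtm = cv2.blur(dsm.astype(np.float32), (20, 20))

    # Subtracting the DTM removes terrain variation, keeping above-ground height
    height = dsm - dtm

    # Histogram normalization: stretch the heights to the full 8-bit range
    norm = cv2.normalize(height, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    # Otsu's method picks the threshold separating vine canopy from ground
    _, binary = cv2.threshold(norm, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```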

#### *3.4. Segmentation and Classification*

The last stage of the vine disease detection system concerns data classification. This step is performed using a deep learning architecture for segmentation. Deep learning has proven its performance in numerous research studies and in various domains. Many architectures have been developed, such as SegNet [33], U-Net [57], DeepLabv3+ [58], and PSPNet [59]. Each architecture can provide good results in a specific domain and be less efficient in others. These architectures are generally used for the segmentation of complex indoor/outdoor scenes, medical ultrasound images, or even in agriculture. They typically take one channel for greyscale medical imaging or three channels for visible RGB color images, and are therefore not always adapted to a specific problem. Indeed, in this study, multispectral and depth map data offer additional information, which can improve the segmentation representation and the final disease map. For this purpose, we designed a deep learning architecture adapted to the vine disease detection problem and compared it to the most well-known deep learning architectures. In the following sections, we describe the proposed architecture and the training process.

#### 3.4.1. VddNet Architecture

The Vine Disease Detection Network (VddNet), shown in Figure 3, is inspired by VGG-Net [60], SegNet [33], U-Net [57] and the parallel architectures proposed in [37,61–63]. VddNet is a parallel architecture based on the VGG encoder; it takes three types of data as inputs: a visible RGB image, a near-infrared image and a depth map. VddNet is dedicated to segmentation, so the output has the same size as the input, with a number of channels equal to the number of classes (four). It is designed with three parallel encoders and one decoder. Each encoder can typically be considered as a convolutional neural network without the fully connected layers. Each encoder stage applies the convolutional operation twice using a 3 × 3 mask, a rectified linear unit (ReLU) and batch normalization, followed by subsampling using a 2 × 2 max pooling function with a stride of 2. The number of feature map channels is doubled at each subsampling step. The idea of VddNet is to encode each type of data separately and, at the same time, concatenate the near-infrared and depth map feature maps with the visible feature maps before each subsampling. Hence, the central encoder preserves the near-infrared and depth map features merged with the visible feature maps. The decoder phase consists of upsampling and a convolution with a 2 × 2 mask, followed by two convolution layers with a 3 × 3 mask, a rectified linear unit and batch normalization. In contrast to the encoder phase, the number of feature map channels is halved after each upsampling operation. Thanks to the concatenation of the near-infrared and depth map feature maps, the decoder retrieves features lost during the merging and subsampling process. The decoder follows the same steps until the final layer, which is a convolution with a 1 × 1 mask and a softmax providing class probabilities at the pixel level.

**Figure 3.** VddNet architecture.
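The overall wiring can be sketched in Keras as follows. This is a simplified sketch under our own assumptions: the encoder depth, base filter count and exact channel arithmetic around the concatenations are illustrative and may differ from the published configuration.

```python
from keras import layers, models

def conv_block(x, filters):
    """Two 3x3 convolutions, each followed by batch normalization and ReLU."""
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    return x

def vddnet_sketch(size=256, n_classes=4, base=32, depth=4):
    vis = layers.Input((size, size, 3))   # visible RGB
    nir = layers.Input((size, size, 1))   # near-infrared
    dpt = layers.Input((size, size, 1))   # depth map

    x_vis, x_nir, x_dpt = vis, nir, dpt
    for d in range(depth):
        f = base * 2 ** d                 # channels double at each subsampling
        x_nir = conv_block(x_nir, f)
        x_dpt = conv_block(x_dpt, f)
        x_vis = conv_block(x_vis, f)
        # merge auxiliary feature maps into the central stream before pooling
        x_vis = layers.Concatenate()([x_vis, x_nir, x_dpt])
        x_vis = layers.MaxPooling2D(2, strides=2)(x_vis)
        x_nir = layers.MaxPooling2D(2, strides=2)(x_nir)
        x_dpt = layers.MaxPooling2D(2, strides=2)(x_dpt)

    x = x_vis
    for d in reversed(range(depth)):
        f = base * 2 ** d                 # channels halved after each upsampling
        x = layers.UpSampling2D(2)(x)
        x = layers.Conv2D(f, 2, padding="same")(x)
        x = conv_block(x, f)

    out = layers.Conv2D(n_classes, 1, activation="softmax")(x)  # pixel-wise classes
    return models.Model([vis, nir, dpt], out)
```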

#### 3.4.2. Training Dataset

In this study, one crop is used for model training and validation, and two crops for testing. To build the training dataset, four steps are required: data source selection, class definition, data labelling, and data augmentation.

The first step is probably the most important one. Indeed, to allow good learning, the data source feeding the models must represent the global data in terms of richness, diversity and classes. In this study, a particular area was chosen that contains a slightly shadowed area, brown ground (soil) and a vine partially affected by mildew.

Once the data source has been selected, it is necessary to define the different classes present in these data. For that purpose, each type of data (visible, near-infrared and depth map) is important. In the visible and near-infrared images, four classes can be distinguished. On the other hand, the depth map contains only two distinct classes: the vine canopy and the non-vine area. Therefore, the choice of classes must match all data types. Shadow is the first class; it is any dark zone, either on the vine or on the ground. This class was created to avoid confusion and misclassification of non-visible patterns. Ground is the second class; from one parcel to another, the ground is generally different. Indeed, the ground can have many colors: brown, green, grey, etc. To solve this color confusion, ground is defined as any pixel in the non-vine zone of the depth map. Healthy vine is the third class; it corresponds to the green leaves of the vine. Usually, these data are easy to classify, but when the ground is also green, this leads to confusion between vine and ground in 2D images. To avoid that, the healthy class is defined as green in the visible spectrum and belonging to the vine canopy according to the depth map. The fourth and last class corresponds to diseased vine. Disease symptoms can present several colors in the visible range: yellow, brown, red, golden, etc. In the near-infrared, it is only possible to differentiate between healthy and diseased reflectances. In general, diseased leaves have a different reflectance than healthy leaves [17], but some confusion between the disease and ground classes may occur when the two colors are similar. The ground must therefore also be eliminated from the disease class using the depth map.

Data labelling was performed with the semi-automatic labelling method proposed in [21]. The method consists of automatic labelling in a first step, followed by manual labelling in a second step. The first step is based on the LeNet-5 [64] deep learning architecture, where classification is carried out using a 32 × 32 sliding window with a 2 × 2 stride. The result is equivalent to a coarse image segmentation that contains some misclassifications. To refine the segmentation, the output results were manually corrected using the Paint.Net software. This task was conducted based on the ground truth (established in the field by a professional reporting the occurring diseases) and on observations in the orthophotos.

The last stage is the generation of a training dataset from the labelled data. In order to enrich the training dataset and avoid overfitting of the networks, data augmentation methods [65] are used in this study. A dataset of 256 × 256 pixel patches is generated from the data source matrix and its corresponding labelled matrix. The data source consists of multimodal and depth map data and has a size of 4626 × 3904 × 5. Four data augmentation methods are used: translation, rotation, under- and oversampling, and brightness variation. Translation was performed with an overlap of 50% using a sliding window in the horizontal and vertical directions. The rotation angle was set at 30°, 60° and 90°. Under- and oversampling were parametrized to obtain 80% and 120% of the original data size. Brightness variation is only applied to the multispectral data: pixel values are multiplied by coefficients of 0.95 and 1.05, which introduces a brightness variation of ±5%. These methods allow the networks to learn, respectively, translation invariance, vinerow orientations, acquisition scale variation and weather conditions. In the end, the data augmentation generated 35,820 patches.
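The pipeline can be sketched as follows; the parameters follow the text, while the helper names are illustrative and the re-cropping of rescaled patches back to 256 × 256 is omitted for brevity.

```python
import numpy as np
from scipy import ndimage

def translated_patches(source, labels, size=256):
    """Sliding-window translation with 50% overlap over the 4626 x 3904 x 5 source."""
    step = size // 2
    for y in range(0, source.shape[0] - size + 1, step):
        for x in range(0, source.shape[1] - size + 1, step):
            yield source[y:y + size, x:x + size], labels[y:y + size, x:x + size]

def augmented_variants(patch, label):
    """Yield rotated, rescaled and brightness-shifted variants of one patch."""
    for angle in (30, 60, 90):            # vinerow orientations
        yield (ndimage.rotate(patch, angle, reshape=False, order=1),
               ndimage.rotate(label, angle, reshape=False, order=0))
    for zoom in (0.8, 1.2):               # under-/oversampling (scale variation)
        yield (ndimage.zoom(patch, (zoom, zoom, 1), order=1),
               ndimage.zoom(label, zoom, order=0))
    for coeff in (0.95, 1.05):            # ±5% brightness on spectral bands only
        bright = patch.astype(np.float32).copy()
        bright[..., :4] *= coeff          # channels 0-3: visible + NIR; channel 4: depth
        yield np.clip(bright, 0, 255), label
```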

#### **4. Experimentations and Results**

This section presents the different experimental devices, as well as qualitative and quantitative results. The experiments are performed in Python 2.7, using the Keras 2.2.0 library for the development of the deep learning architectures and GDAL 3.0.3 for orthophotos management. The Agisoft Metashape software version 1.6.2 is also used for photogrammetry processing. The codes were developed under the Linux Ubuntu 16.04 LTS 64-bit operating system and run on hardware with an Intel Xeon 3.60 GHz × 8 processor, 32 GB RAM, and an NVIDIA GTX 1080 Ti graphics card with 11 GB of internal RAM. The cuDNN 7.0 library and the CUDA 9.0 Toolkit are used for deep learning processing on the GPU.

#### *4.1. Orthophotos Registration and Depth Map Building*

To realize this study, multispectral and depth map orthophotos were required. Two parcels were selected and data were acquired at two different times to construct the orthophotos dataset. Each parcel had one or more of the following characteristics: with or without shadow, green or brown ground, healthy or partially diseased. Registered visible and infrared orthophotos were built from the multispectral images using the optimized image registration algorithm [21] and the Agisoft Metashape software version 1.6.2. Orthophotos were saved in the geo-referenced "TIFF" file format. The parameters used in the Metashape software are listed in Table 1.
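For reference, loading such a geo-referenced orthophoto with GDAL can be sketched as follows; the file name is illustrative.

```python
from osgeo import gdal
import numpy as np

ds = gdal.Open("visible_orthophoto.tif")          # geo-referenced TIFF
ortho = np.dstack([ds.GetRasterBand(b + 1).ReadAsArray()
                   for b in range(ds.RasterCount)])
geotransform = ds.GetGeoTransform()               # origin and pixel size
```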

To evaluate the registration and depth map quality, a chessboard test pattern was used. Figure 4 presents an example of visible and infrared orthophotos registration. As can be seen, the alignment between the two orthophotos is accurate. The registration of the depth map with the visible range also provides good results (Figure 6).

**Table 1.** The parameters used for the orthophotos building process in the Agisoft Metashape software.


#### *4.2. Training and Testing Architectures*

In order to determine the best parameters for each deep learning architecture, four optimizers combined with two loss functions were compared. Architectures were compiled using either the "cross entropy" or the "mean squared error" loss function, together with one of four optimizers: SGD [66], Adadelta [67], Adam [68], or Adamax [69]. Once the best parameters were defined for each architecture, a final fine-tuning was performed on the learning rate to obtain the best results (a good model without overfitting). The best parameters found for each architecture are presented in Table 2.

**Table 2.** The parameters used for the different deep learning architectures. LR means learning rate.
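The comparison can be sketched as a simple grid search, assuming Keras 2.2; build_model() and evaluate_on_validation() are placeholder routines, and the learning rates shown are illustrative.

```python
from keras import optimizers

losses = ["categorical_crossentropy", "mean_squared_error"]
make_opts = {
    "SGD": lambda: optimizers.SGD(lr=1e-3),
    "Adadelta": lambda: optimizers.Adadelta(),
    "Adam": lambda: optimizers.Adam(lr=1e-4),
    "Adamax": lambda: optimizers.Adamax(lr=1e-4),
}

results = {}
for loss in losses:
    for name, make_opt in make_opts.items():
        model = build_model()             # any of the compared architectures
        model.compile(optimizer=make_opt(), loss=loss, metrics=["accuracy"])
        results[(loss, name)] = evaluate_on_validation(model)  # placeholder
best = max(results, key=results.get)      # best (loss, optimizer) pair
```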


For training the VddNet model, the visible, near-infrared and depth map data are fed separately into the network inputs. For the other architectures, the input is a multi-data matrix of size 256 × 256 with five channels: the first three channels correspond to the visible spectrum, the fourth channel to the near-infrared data and the fifth channel to the depth map. Each multi-data matrix has a corresponding labelled matrix. Model training is an iterative process fixed at 30,000 epochs for each model. At each iteration, a batch of five multi-data matrices with their corresponding labelled matrices is randomly selected from the dataset and fed to the model. In order to check the convergence of the model, a test using validation data is performed every 10 iterations.
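A schematic version of this loop is given below; the array names are assumptions, and for VddNet the batch would be split into its three separate inputs rather than a single five-channel matrix.

```python
import numpy as np

for it in range(30000):                                      # training iterations
    idx = np.random.choice(len(train_x), 5, replace=False)   # random batch of 5
    model.train_on_batch(train_x[idx], train_y[idx])
    if it % 10 == 0:                                         # convergence check
        val_loss, val_acc = model.test_on_batch(val_x, val_y)
```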

A qualitative study was conducted to determine the importance of the depth map information. For this purpose, an experiment was conducted by training the deep learning models with multispectral data only, and with the combination of multispectral and depth map data. The comparison results are shown in Figures 7 and 8.

To test the deep learning models, the test areas are segmented using a 256 × 256 sliding window (without overlap). For each position of the sliding window, the visible, near-infrared and depth map data are sent to the network inputs (respecting the data order of each architecture) to perform segmentation. The output of the networks is a matrix of size 256 × 256 × 4. The results are saved after applying the argmax function, and are then stitched together to recover the original size of the tested orthophoto.
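This tiled inference can be sketched as follows; the single-matrix input applies to the compared architectures, while VddNet would receive three separate inputs.

```python
import numpy as np

def segment_orthophoto(model, data, size=256):
    """Segment a test area tile by tile and stitch the label maps together."""
    h, w = data.shape[:2]
    label_map = np.zeros((h, w), dtype=np.uint8)
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            tile = data[y:y + size, x:x + size][np.newaxis]   # add batch axis
            probs = model.predict(tile)[0]                    # 256 x 256 x 4
            label_map[y:y + size, x:x + size] = probs.argmax(axis=-1)
    return label_map
```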

#### *4.3. Segmentation Performance Measurements*

Segmentation performance is measured in terms of recall, precision, F1-Score/Dice and accuracy (using Equations (1)–(5)) for each class (shadow, ground, healthy and diseased) at the grapevine scale. Grapevine-scale assessment was chosen because pixel-wise evaluation is not suitable for providing disease information. Moreover, the imprecision of the ground truth, the small surface of the diseased areas and the differences among the deep learning segmentation results do not allow a good pixel-wise evaluation of the different architectures. These measurements use a sliding window equivalent to the average size of a grapevine (in this study, approximately 64 × 64 pixels). At each position of the sliding window, the evaluated class is the dominant class in the ground truth. The window is counted as a "true positive" if the dominant predicted class matches the ground truth; otherwise, it is counted as a "false positive". The confusion matrix is updated at each step. Finally, the scores are given by

$$Recall = \frac{TP}{TP + FN} \tag{1}$$

$$Precision = \frac{TP}{TP + FP} \tag{2}$$

$$F1\text{-}Score = 2\frac{Recall \times Precision}{Recall + Precision} = \frac{2TP}{FP + 2TP + FN} \tag{3}$$

$$Dice = \frac{2|X \cap Y|}{|X| + |Y|} = \frac{2TP}{(FP + TP) + (TP + FN)} = \frac{2TP}{FP + 2TP + FN} \tag{4}$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$

where TP, TN, FP and FN are the numbers of "true positive", "true negative", "false positive" and "false negative" samples, respectively. In the Dice equation, X is the set of ground truth pixels and Y the set of classified pixels.
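A sketch of this grapevine-scale scoring is given below: a 64 × 64 window slides over the predicted and ground-truth label maps, each window votes with its dominant class, and the per-class scores of Equations (1)–(5) are then derived from the resulting confusion matrix. The counting details are our reading of the procedure.

```python
import numpy as np

def grapevine_confusion(pred, truth, n_classes=4, win=64):
    """Accumulate a confusion matrix from dominant classes per 64x64 window."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for y in range(0, truth.shape[0] - win + 1, win):
        for x in range(0, truth.shape[1] - win + 1, win):
            t = np.bincount(truth[y:y + win, x:x + win].ravel(),
                            minlength=n_classes).argmax()   # dominant ground truth
            p = np.bincount(pred[y:y + win, x:x + win].ravel(),
                            minlength=n_classes).argmax()   # dominant prediction
            cm[t, p] += 1                                   # diagonal = true positives
    return cm

def class_scores(cm, c):
    """Recall, precision, F1/Dice and accuracy for class c, per Equations (1)-(5)."""
    TP = cm[c, c]
    FN = cm[c].sum() - TP
    FP = cm[:, c].sum() - TP
    TN = cm.sum() - TP - FN - FP
    recall = TP / (TP + FN)
    precision = TP / (TP + FP)
    f1 = 2 * TP / (FP + 2 * TP + FN)      # identical to Dice, Equations (3)-(4)
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    return recall, precision, f1, accuracy
```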
