1. Introduction
Global population growth has resulted in an increase in food demand. To meet the anticipated demand, agricultural production needs to increase by approximately 70% [1]. Farm output and quality, along with crop cultivation, are, however, adversely affected by a number of factors. Among these issues is the growth of weeds, which occurs simultaneously with crop growth. A variety of weed plants exist that spread quickly and thus negatively impact crop yield. These weeds directly compete with crops for resources such as water, nutrients and sunlight, which leaves the crops prone to a number of diseases. Studies show that vegetable yield decreases by 45%, and by up to 95%, in cases of weed–vegetable competition [2]. The problem extends beyond crop cultivation; weed growth is also an issue on turfed surfaces such as football pitches, golf courses, residential lawns, parks and sports fields. To tackle the issue of weed growth, appropriate measures must be taken. The focus of this paper is weed control for the latter case of turf grass management, but the same technology might be used for the former case of crop cultivation.
Weed control is a very challenging task. Various strategies can be employed by farmers for weed reduction in targeted areas. These methodologies can be divided into five main categories: (a) preventative—prevent weed growth preemptively; (b) mechanical—mowing, hand weeding and mulching; (c) cultural—maintaining field hygiene; (d) biological—utilizing weeds’ natural adversaries, such as insects, grazing animals, etc.; (e) chemical—spraying herbicides [3]. All of these approaches entail drawbacks, whether in terms of the required cost and time or in terms of crop and environmental contamination. As a result, an optimal solution for both economic and environmental interests is the development of a vision-based system for the automatic removal of weed plants.
Over the past few years, deep learning (DL) has made huge advancements in multiple domains, including vision, audio and text. Object detection and segmentation models designed with the deep learning approach have exhibited high precision in identifying target objects. As mentioned, weed detection is a daunting problem owing to the semantic similarities between weeds and their surrounding background. Common challenges faced during weed detection are similarities in color and texture, occlusion and visual similarities between different weeds. However, advancements in vision-based intelligent machines have made it possible to design an accurate system for detecting weeds under such complex background conditions.
The target of this work is to propose a deep learning-based model that is able to classify and localize the weed area precisely on grassy fields, i.e., perform weed segmentation, using Visual Transformer [4]. In contrast to weed detection in a crop-filled setting, this work focuses on detecting weeds in a grassy environment. Owing to the success of the Transformer model in NLP, recent studies have focused on importing it into the visual domain, where it has shown great potential. In the context of weed detection, Youjie et al. [5] have already established the effectiveness of the attention mechanism for the precise segmentation of weeds of varying shapes, and Visual Transformer is an extension of the non-local attention technique.
In this work, we explore the applications of the Transformer model in the context of weed detection and localization. The successful implementation of such a system would greatly reduce the time and effort required for weed identification and removal. This DL-based system should also be robust to a variety of real-life visual challenges, such as deformation, different illumination conditions, occlusion, etc. Such a detection system can be deployed within an autonomous wheeled robot, capable of surveilling the entire grass field and identifying weeds using vision alone. An actuator functioning similarly to hand weeding might be used to mechanically pull the detected target weeds. Alternatively, sprinklers could act on the exact location of each weed to remove it with a minimal amount of chemicals, making the system eco-friendly and cost/time efficient.
Following our experiments, we report the results of three different Transformer architectures, namely Swin Transformer [6], SegFormer [7] and Segmenter [8], on our in-house weed dataset. The weed dataset consisted of 1006 images that allowed us to segment 10 types of weeds in grass. The dataset was further augmented to meet the large-sample requirements of Transformer models. To increase the trainable sample size in terms of both quality and quantity, we applied a range of augmentation techniques. The results from these trained models were reported using different metrics. SegFormer achieved the best result on our dataset, with a final mAcc and mIoU of 75.18% and 65.74%, respectively. Swin Transformer showed performance comparable to SegFormer, albeit with a much higher number of network parameters. Thus, it may be inferred that a SegFormer-based system would be the most suitable for the automation of weed removal from grassy surfaces.
The key contributions are two-fold:
We predict accurate segmentation masks for weeds using Transformer-based architectures for the purpose of automating weed control, with a focus on turf management.
We investigate a range of recent Transformer models using our weed dataset and make detailed comparisons in terms of performance and complexity.
In Section 2, we provide a detailed review of previously designed methods for similar purposes. Section 3 contains information about the Transformer model architectures employed in the study. Section 4 contains details about our dataset, along with the applied augmentations and evaluation metrics. Subsequently, in Section 5, we provide the details of our experiments, the comparison of different models and the extracted conclusions. Finally, in Section 6, we provide a brief summary of the work performed.
3. Methods
For our experiments, we selected three high-performing Transformer-based segmentation models: Swin Transformer, SegFormer and Segmenter. Public implementations were used for network training. A brief description of each model is provided in the following sections.
3.1. Swin Transformer
Swin Transformer is built by replacing the standard multi-head self-attention (MSA) module in a Transformer block with a module based on shifted windows, while the other layers are kept the same. As illustrated in Figure 1b, a Swin Transformer block consists of a shifted window-based MSA module, followed by a 2-layer Multilayer Perceptron (MLP) with GELU nonlinearity in between. A LayerNorm (LN) layer is applied before each MSA module and MLP, and a residual connection is applied after each module. In addition, Swin Transformer uses hierarchical feature maps constructed by the Patch Merging block to compute the representation of the input. The process of Patch Merging is shown in Figure 2.
As shown in Figure 1, the architecture alternates between Patch Merging and Swin Transformer blocks. Starting from an input image of size H × W, the initial Patch Splitting module splits the image into non-overlapping patches, each of which is then treated as a ‘token’ in the input sequence. Each patch is 4 × 4 pixels, with a feature dimension of 4 × 4 × 3 = 48. A linear embedding is applied to these raw-pixel-valued vectors in order to project them into an arbitrary dimension C. Within the whole architecture, the Patch Merging module builds hierarchical feature maps by concatenating the features of each group of 2 × 2 neighboring patches along the channel dimension. This results in a 2× downsampling of resolution, so the H/4 × W/4 tokens, or patches, are reduced to H/8 × W/8. The number of tokens is further reduced in the subsequent modules, as visualized in Figure 1.
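To make the Patch Merging operation concrete, the following PyTorch sketch gathers each 2 × 2 group of neighboring tokens into the channel dimension and then projects the 4C channels down to 2C; the batch size, spatial sizes and channel dimension C = 96 are illustrative assumptions rather than values fixed by the architecture description above.

```python
import torch
import torch.nn as nn

def patch_merging(x, reduction):
    # x: (B, H, W, C). Concatenate each 2x2 group of neighboring tokens
    # along the channel dimension, halving the spatial resolution and
    # giving 4C channels, then linearly project 4C -> 2C.
    x = torch.cat([x[:, 0::2, 0::2, :], x[:, 1::2, 0::2, :],
                   x[:, 0::2, 1::2, :], x[:, 1::2, 1::2, :]], dim=-1)
    return reduction(x)

C = 96
reduction = nn.Linear(4 * C, 2 * C, bias=False)
x = torch.randn(1, 56, 56, C)     # e.g., the H/4 x W/4 tokens of a 224 x 224 image
y = patch_merging(x, reduction)   # (1, 28, 28, 192): now H/8 x W/8 tokens
```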
The features coming from the Patch Merging modules are passed through a Swin Transformer block that applies self-attention to the partitioned image. The input sequence length is preserved after the application of the attention blocks. Self-attention is implemented in two steps, Window-based Self-Attention (W-MSA) and Shifted Window Self-Attention (SW-MSA), with the two modules placed sequentially. In W-MSA, self-attention is applied locally within each window, which leads to a linear increase in complexity with respect to the number of windows or patches. This is an improvement over the earlier ViT model, where attention was calculated between every pair of patches/tokens, resulting in quadratic complexity with respect to the number of tokens. The SW-MSA approach introduces connections between neighboring non-overlapping windows from the previous layer by shifting the window configuration slightly.
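The window partitioning underlying W-MSA and SW-MSA can be sketched in PyTorch as follows; the window size of 4 and the toy tensor shapes are assumptions for illustration, and the attention mask that Swin applies after the cyclic shift is omitted here.

```python
import torch

def window_partition(x, ws):
    # x: (B, H, W, C) -> (num_windows * B, ws * ws, C). Self-attention is
    # then computed independently within each ws x ws window (W-MSA).
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

x = torch.randn(1, 8, 8, 96)           # toy feature map
w_msa_tokens = window_partition(x, 4)  # (4, 16, 96)

# SW-MSA: cyclically shift the map by half a window before partitioning,
# so that tokens from neighboring windows of the previous layer interact.
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))
sw_msa_tokens = window_partition(shifted, 4)
```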
3.2. SegFormer
SegFormer is an efficient semantic segmentation framework based on an encoder–decoder design. The encoder outputs multi-scale features, and a simple All-MLP decoder aggregates this multi-scale information from different layers, combining both local and global attention to compute rich representations for semantic segmentation.
Figure 3 shows the architecture of SegFormer, which is divided into two sections, the encoder and the decoder. The input image is first divided into 4 × 4 patches, unlike ViT, which uses a patch size of 16 × 16; the smaller patches result in better performance on dense prediction tasks. The Transformer block in the encoder is composed of three sub-modules: (a) Efficient Self-Attention, (b) Mix-Feedforward Network (Mix-FFN) and (c) Overlapping Patch Merging. Efficient Self-Attention is similar to the multi-head self-attention in the original Transformer model; however, it employs a sequence reduction process, as introduced in [7], which shortens the sequence length by a reduction ratio. This helps to lower the computational cost of the self-attention process. ViT uses fixed-resolution Positional Encodings (PEs) to incorporate positional information, which degrades performance when the test and training resolutions differ, since the positional code has to be interpolated for the new resolution. To solve this, SegFormer uses a 3 × 3 convolution in the feed-forward network for data-driven positional encoding. Lastly, the Overlapping Patch Merging block is used to reduce the feature map size throughout the architecture. This results in a hierarchical feature representation comprising both high-resolution coarse features and low-resolution fine-grained features. Hierarchical feature maps of sizes 1/4, 1/8, 1/16 and 1/32 of the original image resolution are obtained in this way.
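The sequence reduction idea can be illustrated with the following minimal PyTorch sketch, in which the keys and values are spatially downsampled by a reduction ratio R before attention; the class name, parameters and shapes are our own illustrative assumptions, and the actual SegFormer implementation integrates this into its own attention layers.

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    # Self-attention with sequence reduction: keys/values are downsampled
    # by a ratio R, so the attention cost drops by roughly a factor of R^2.
    def __init__(self, dim, num_heads, R):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=R, stride=R)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W tokens
        B, N, C = x.shape
        kv = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.sr(kv).reshape(B, C, -1).transpose(1, 2)  # (B, N / R^2, C)
        kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv)  # queries keep the full resolution
        return out

attn = EfficientSelfAttention(dim=64, num_heads=2, R=4)
y = attn(torch.randn(1, 64 * 64, 64), H=64, W=64)  # (1, 4096, 64)
```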
The decoder contains the All-MLP module, which takes the features from the encoder and aggregates them. The process is performed in four steps: (a) multi-level features from the encoder go through an MLP layer to unify their channel dimensions; (b) the features are then upsampled to 1/4 of the original image resolution and concatenated together; (c) an MLP layer then fuses the concatenated features; (d) lastly, an MLP takes these fused feature maps to predict the final segmentation mask of size H/4 × W/4 × N, where N refers to the number of categories.
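The four decoder steps can be sketched as follows; the stage channel dimensions, embedding size and the choice of N = 11 classes (10 weed categories plus grass background) are illustrative assumptions rather than the configuration used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoder(nn.Module):
    # A minimal sketch of the four decoder steps, assuming four encoder
    # stages with channel dims `dims` at strides 4, 8, 16 and 32.
    def __init__(self, dims, embed_dim, num_classes):
        super().__init__()
        self.unify = nn.ModuleList([nn.Linear(d, embed_dim) for d in dims])
        self.fuse = nn.Linear(4 * embed_dim, embed_dim)
        self.classify = nn.Linear(embed_dim, num_classes)

    def forward(self, feats):                        # feats[i]: (B, Ci, Hi, Wi)
        B = feats[0].shape[0]
        h4, w4 = feats[0].shape[2:]                  # the 1/4-resolution grid
        outs = []
        for f, lin in zip(feats, self.unify):
            _, _, Hi, Wi = f.shape
            f = lin(f.flatten(2).transpose(1, 2))    # (a) unify channel dims
            f = f.transpose(1, 2).reshape(B, -1, Hi, Wi)
            f = F.interpolate(f, size=(h4, w4), mode="bilinear",
                              align_corners=False)   # (b) upsample to 1/4 scale
            outs.append(f)
        x = torch.cat(outs, dim=1)                   # (b) concatenate
        x = self.fuse(x.flatten(2).transpose(1, 2))  # (c) fuse with an MLP
        x = self.classify(x)                         # (d) per-patch class logits
        return x.transpose(1, 2).reshape(B, -1, h4, w4)  # (B, N, H/4, W/4)

dec = AllMLPDecoder(dims=[32, 64, 160, 256], embed_dim=256, num_classes=11)
feats = [torch.randn(1, c, 128 // s, 128 // s)
         for c, s in zip([32, 64, 160, 256], [1, 2, 4, 8])]
mask = dec(feats)  # (1, 11, 128, 128) for a 512 x 512 input image
```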
3.3. Segmenter
Segmenter is also a Transformer-based image segmentation model built upon the original Vision Transformer (ViT), which allows global dependencies to be modeled early in the architecture. The decoder module of Segmenter is likewise based on the Transformer framework. The Mask Transformer decoder adds K learnable class embeddings, which are input to the Transformer together with the patch embeddings; a multiplication operation is then performed between the class and patch embeddings, followed by a softmax and conversion to a 2D feature map, which is upsampled to restore the original input image size. The final class labels can be obtained from these embeddings using either a Point-wise Linear decoder or a Mask Transformer decoder. The structure of Segmenter is shown in Figure 4.
The input image, x ∈ ℝ^(H×W×C), is first split into a sequence of patches. The raw RGB values are flattened, and the resulting vectors are passed through a linear embedding to produce a sequence of patch embeddings. A learnable position embedding is added to each element of the sequence to incorporate location information. These embeddings are then passed through standard Transformer blocks consisting of multi-head self-attention and feed-forward layers to obtain a contextualized encoding containing rich semantic information.
This sequence of embeddings is then passed to the decoder, which learns to map these patch-level encodings to patch-level class scores, which are then upsampled using bilinear interpolation to obtain pixel-level scores. This can be performed using a Point-wise Linear decoder or a Mask Transformer decoder. For the Point-wise Linear decoder, a point-wise linear layer is applied to the encoder outputs to produce patch-level class logits. This sequence is reshaped into a 2D map and upsampled to the original image size. Final segmentation maps are obtained by applying softmax along the class dimension. For the Mask Transformer decoder, a set of K learnable class embeddings, where K refers to the number of classes, is introduced. Each is assigned to a specific semantic class and is used to predict the class map. These class embeddings are processed together with the output embeddings of the encoder. The decoder is itself a Transformer encoder by design; it generates K masks by computing the scalar product between the L2-normalized patch embeddings and the aforementioned class embeddings. The resulting set of mask sequences is reshaped into 2D masks and upsampled to the original image size. The final segmentation map is obtained after the application of softmax followed by LayerNorm.
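The core mask computation of the Mask Transformer decoder can be sketched as follows; all shapes are illustrative assumptions (e.g., a 512 × 512 image with 16 × 16 patches gives N = 32 × 32 = 1024 patch tokens and K = 10 classes), and the surrounding Transformer blocks are omitted.

```python
import torch
import torch.nn.functional as F

B, N, K, D, H, W = 1, 1024, 10, 256, 512, 512

patch_emb = torch.randn(B, N, D)   # decoder outputs for the image patches
class_emb = torch.randn(B, K, D)   # decoder outputs for the K class tokens

# Scalar product between L2-normalized patch and class embeddings -> K masks.
masks = F.normalize(patch_emb, dim=-1) @ F.normalize(class_emb, dim=-1).transpose(1, 2)
masks = masks.softmax(dim=-1)      # (B, N, K): per-patch class distribution

# Reshape into 2D masks and bilinearly upsample to the original image size.
masks = masks.transpose(1, 2).reshape(B, K, 32, 32)
seg = F.interpolate(masks, size=(H, W), mode="bilinear", align_corners=False)
```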
4. Dataset
As part of the evaluation, we constructed a weed dataset that could be used to assess the models’ performance. The dataset included 10 categories of weeds: clover (Trifolium repens), common ragweed (Ambrosia artemisiifolia), crabgrass (Digitaria), dandelion (Taraxacum), ground ivy (Glechoma hederacea), lambsquarter (Chenopodium album), pigweed (Amaranthus), plantain (Plantago), tall fescue (Festuca arundinacea) and unknown weed. The unknown weed category contained weeds with features different from those of the other classes, to support general weed detection. An example of every category is visualized in Figure 5, where diverse colors, textures and weed shapes can be seen on grassy backgrounds. Note that the density and color of the grass differ between images, producing cluttered backgrounds.
The dataset contained 1006 images in total, as shown in Table 1. All images were taken by lab members using cell phone cameras in Jeonju and Wanju, Jeonbuk Province, South Korea. As the images were taken in real fields instead of a laboratory, they involved a number of visual challenges, including complex background conditions, differing illumination settings, etc. In addition, the density and growth state of the grass varied between fields. Furthermore, there were also intra-class variations within each weed class in terms of color, texture and shape. In Figure 5, we can see complex backgrounds for clover and unknown weed and different illuminations between crabgrass and lambsquarter, along with various stages of grass growth in most of the images. In Figure 6, we can find examples of intra-class variations. All these challenges must be dealt with properly to achieve accurate weed segmentation.
Note that such diversity in a training dataset may help to train a model with high robustness, on the condition that its sample size is above a certain threshold. That is part of the reason why the training data should be augmented to enhance the diversity. For our training, we split the dataset into 805 and 201 images for training and testing, respectively.
4.1. Data Augmentation
Data augmentation is used to increase the training data in order to avoid overfitting and develop powerful models from a limited number of initial training samples. However, the results of augmentation should look similar to images captured in real fields. For augmentation, we used multi-scale training and geometric transforms, including random cropping, random flipping and random rotation, along with photometric distortions, including brightness and contrast changes.
Figure 7 and Figure 8 show some examples of augmented images using geometric transforms and photometric distortions.
In multi-scale training, an original image of size 512 × 512 is randomly rescaled to a size in the range of 512–2048 pixels during training. Multi-scale training increases the robustness of the model by training it on images of different sizes.
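A pipeline combining these augmentations might look like the following sketch using the albumentations library; the probabilities and limits are illustrative assumptions, not the exact settings used in our experiments.

```python
import numpy as np
import albumentations as A

train_transform = A.Compose([
    A.RandomScale(scale_limit=(0.0, 3.0), p=1.0),  # multi-scale: 512 up to 2048
    A.RandomCrop(height=512, width=512),           # random cropping
    A.HorizontalFlip(p=0.5),                       # random flipping
    A.Rotate(limit=90, p=0.5),                     # random rotation
    A.RandomBrightnessContrast(brightness_limit=0.2,
                               contrast_limit=0.2, p=0.5),  # photometric distortion
])

image = np.zeros((512, 512, 3), dtype=np.uint8)    # placeholder image
mask = np.zeros((512, 512), dtype=np.uint8)        # placeholder label mask
out = train_transform(image=image, mask=mask)      # mask follows the same geometry
```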
4.2. Evaluation Metrics
We evaluated the semantic segmentation results in terms of two metrics, pixel accuracy and IoU (Intersection over Union). It is important that the metrics reflect the purpose of weed segmentation. Since the segmentation results can be utilized to control a robot manipulator or to drive a weedicide spray nozzle, exact localization of the weed area is important in order not to damage any healthy grass.
4.2.1. Pixel Accuracy (PA) and Mean PA (mPA)
The pixels belonging to a class are specified by the target mask, which can be compared with the results on test data. The pixel accuracy for a class can be calculated as the ratio of the number of correctly classified pixels to the total number of pixels in that class as

$$\mathrm{PA}_i = \frac{n_{ii}}{\sum_{j} n_{ij}},$$

where $n_{ij}$ denotes the number of pixels of class $i$ predicted as class $j$. The class-wise PA can be averaged over all classes of weed objects to calculate the mean pixel accuracy (mPA). Because the exact ground-truth mask for weeds is impossible to specify due to their complicated boundaries, the PA can be treated as an approximate measure of the weed area.
4.2.2. Intersection over Union (IoU) and Mean IoU (mIoU)
The IoU is the area of overlap between the predicted segmentation mask and the ground truth, divided by the area of their union. In segmentation, the area is calculated as the number of pixels in a segment. In addition, the object-wise IoUs can be averaged over all objects included in an image to produce the mIoU. From an implementation point of view, a high IoU for weed objects is important so that a robot end effector or a weedicide spray nozzle can be localized exactly and the weed removed properly.
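Both metrics can be computed from a confusion matrix, with per-class IoU equal to TP/(TP + FP + FN). The following NumPy sketch illustrates this; the choice of 11 classes (10 weed categories plus grass background) and the random label maps are illustrative assumptions.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    # pred, gt: integer label maps of identical shape
    valid = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[valid].astype(int) + pred[valid]
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes,
                                                                num_classes)

def pa_and_iou(cm):
    tp = np.diag(cm)
    pa = tp / cm.sum(axis=1)                           # correct / ground-truth pixels
    iou = tp / (cm.sum(axis=1) + cm.sum(axis=0) - tp)  # TP / (TP + FP + FN)
    return pa, iou

cm = confusion_matrix(np.random.randint(0, 11, (512, 512)),
                      np.random.randint(0, 11, (512, 512)), num_classes=11)
pa, iou = pa_and_iou(cm)
print(pa.mean(), iou.mean())                           # mPA and mIoU
```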
In this study, we focused on the weeds that needed to be removed; however, the background grass area is usually much wider than the sparse weed areas, resulting in an mIoU that is larger than the IoU of each individual weed object.