1. Introduction
Beach litter (BL) refers to all persistent, manufactured, or processed solid material discarded, disposed of, or abandoned in the marine and coastal environment [
1,
2]. Preserving coastal ecosystems is imperative in the face of escalating environmental challenges. One of the most pressing issues affecting coastal areas worldwide is the increase in beach litter pollution, especially in semi-enclosed basins with heavily anthropized coasts (such as the Mediterranean basin [
3]) and with intense tourist use [
4]. Several transport processes, both natural (i.e., streams, rainfall, and marine currents) and anthropogenic (i.e., drainage ditches and sewage discharges), move BL items toward coastal environments. The GRID-Arendal Environmental Communication Center, a UNEP partner, has identified numerous litter sources on land and in marine environments [
5]. It is estimated that most of the litter coming from these sources settles on the bottom of the ocean (70%), and the remaining part (30%) is retained equally between the water column and beaches [
6]. In Europe, statistics made available for 2016 by the Joint Research Centre recorded a high percentage (84%) of plastic material among the total marine litter items found along European coasts [
7]. These figures confirm the need to counter the increase in BL abundance. The scientific community has shown growing interest in the issue over the past decade, as evidenced by the increasing use of the terms “marine litter” and “beach litter” in European and international peer-reviewed papers [
8].
BL elements are classified based on their size (micro-litter < 5 mm; 5 mm < meso-litter < 2.5 cm; 2.5 cm < macro-litter < 1.0 m; mega-litter > 1.0 m) [
9] in order to define a uniform scale for field monitoring and data acquisition. This classification also affects methodologies for the activities to be carried out, as elements of different sizes require different tools and approaches to be sampled and analyzed. In this study, the analysis is focused on the macro- and mega-litter categories, as also suggested for the beach monitoring activities performed at the national and international levels. BL assessment has traditionally relied on manual surveys conducted by human observers. These approaches have yielded valuable insights into the composition and distribution of debris along coastlines [
10,
11,
12,
13]. However, manual surveys are time-consuming and labor-intensive, which motivates the development of automated alternatives.
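The size classes listed above can be expressed as a simple lookup. The following is a minimal sketch, assuming the item size is its longest dimension in millimeters; the function name and the handling of values falling exactly on a class boundary are illustrative assumptions, not part of the cited guidelines.

```python
def litter_size_class(length_mm: float) -> str:
    """Return the size class of a litter item from its longest dimension (mm).

    Thresholds follow the classification quoted in the text:
    micro < 5 mm, meso 5 mm to 2.5 cm, macro 2.5 cm to 1 m, mega > 1 m.
    Assigning exact boundary values to the upper class is an assumption.
    """
    if length_mm < 0:
        raise ValueError("length_mm must be non-negative")
    if length_mm < 5:
        return "micro"
    if length_mm < 25:
        return "meso"
    if length_mm <= 1000:
        return "macro"
    return "mega"


# This study focuses on the last two classes (macro- and mega-litter).
for size in (3, 10, 300, 1500):
    print(size, "mm ->", litter_size_class(size))
```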
Recent advances in machine learning (ML) techniques, particularly deep learning (DL) [
14], represent a challenge in the field of BL assessment. Indeed, an emerging research branch has focused on developing methodologies to address this issue, spanning conventional human-driven approaches and innovative DL approaches, and exploiting drone-acquired data for comprehensive assessments [
15,
16]. The field of BL assessment has witnessed significant advances, with a growing emphasis on integrating ML, DL, and computer vision algorithms to assess items’ densities and distribution. Fallati et al. [
17] used a domain-specific deep learning architecture called PlasticFinder on beaches in the Republic of Maldives. The authors stated a 78% F-score as the best average result on Adangau Island (peaking at 81%). Gonçalves et al. [
18] used a multidisciplinary approach comprising photogrammetry, geomorphology, machine learning, and hydrodynamic modeling to make the mapping of marine litter items with UAS in coastal environments more efficient. They reached an F-score of 75% in marine litter object detection compared to the manual procedure. Among the several computer vision algorithms commonly used for the segmentation and classification tasks on images acquired with Unmanned Aerial Vehicles (UAVs), such as Deep Neural Networks [
19], Random Forests [
16], and other kinds of neural networks [
20], in this study, a CNN-based architecture was chosen to develop a new DL computer vision tool for the automatic identification and classification of BL items on UAV images. Indeed, the introduction of Convolutional Neural Networks (CNNs) in litter detection and classification tasks has revolutionized the field of data analysis. Winans et al. [
21] applied ML over a very large geographic area, the Hawaiian Islands. Using three ML models trained on aerial images collected over 1900 km of Hawaiian coastline, the authors achieved, at best, an average precision of 72%. Duarte et al. [
22] dealt with the imbalance problem for training ML algorithms, using DenseNet121 combined with oversampling, and obtaining an F-score of 68%. Similarly, Scarrica et al. [
23] employed deep learning for litter detection in aerial images acquired in southern Italy, showcasing the efficacy of Mask Region-based CNN models and achieving an F-score of 0.96 on real data. In general, CNNs have shown remarkable efficacy in automating litter detection and classification tasks on specific images. These methods not only promise enhanced accuracy but also have the potential to significantly reduce the time and resources required for large-scale monitoring efforts. In addition, integrating UAV technology into BL assessment campaigns has opened up new possibilities. Drones equipped with high-resolution cameras enable rapid and comprehensive data acquisition, covering extensive coastal areas in great detail and within a short time. This approach facilitates more frequent and detailed surveys and minimizes disturbance to fragile coastal habitats. Furthermore, UAVs with multispectral cameras have been used to classify litter types via the Spectral Angle Mapping (SAM) technique, improving the speed and robustness of traditional image screening [
24].
Additionally, UAVs equipped with hyperspectral imaging have shown promise in identifying plastics in coastal environments. By using a push-broom sensor and Linear Discriminant Analysis (LDA), materials such as polyethylene (PE) and PET were successfully detected, demonstrating the feasibility of real-time plastic identification in both beach and marine environments [
25]. However, relying solely on UAV photogrammetry, which captures imagery in the visible spectrum, presents more challenges, as distinguishing between various materials becomes less precise compared to using multispectral or hyperspectral sensors. The complexity increases in environments where overlapping textures or buried items exist, further limiting the efficacy of traditional photogrammetric techniques.
In this study, we contribute to the existing literature by comprehensively evaluating four distinct beach litter detection and classification methods. Our objectives are (i) to perform direct in situ BL assessment; (ii) to manually identify BL items on georeferenced orthomosaics; (iii) to improve the performance of the segmentation and classification tasks of a deep learning model previously developed [
23] for the automated BL recognition and single-class classification on UAV images; and (iv) to evaluate the performance of a machine learning-based SVM algorithm provided by the Orfeo Toolbox plugin for QGIS (v. 3.16). Because the DL method requires orthorectified images as the source of the training dataset for the classification task, a survey protocol reporting the main technical and logistic parameters was drafted.
This research provides valuable insights into the strengths and limitations of the applied methods, shedding light on their potential applications in addressing beach litter pollution and furthering the cause of coastal sustainable management through efficient and cost-effective survey techniques.
3. Materials and Methods
In this section, the methodological approaches proposed for both direct (in situ) and indirect (visual screening and automatic detection) BL distribution analysis are described (
Figure 2). Field activities were carried out at Torre Guaceto Beach in May 2023. During these activities, in situ BL surveys were carried out following the international guidelines [
29], which identify a 100 m long beach sector as a standard site for litter monitoring. Furthermore, a set of images was collected using a UAV system for photogrammetric surveys. The UAV images and related orthomosaics were exploited for both the visual screening and the automatic detection techniques. To assemble a training dataset covering different environmental characteristics for the training phase of the Mask-RCNN-based and SVM algorithms, additional UAV flights were performed in different coastal sectors in the Apulia Region (Capitolo Beach and Torre Guaceto Beach) and central Portugal (Leirosa Beach). Leirosa Beach, located 43 km west of the city of Coimbra along the Atlantic coast of Portugal, is a wide sandy beach sector characterized by a meso-tidal environment and backed by an eroded dune system from which litter items are gathered. Moreover, for a homogeneous comparison of the four BL assessment methodologies applied in this study (i.e., in situ assessment, visual screening on the orthomosaic, and the two machine learning techniques), six boxes with a mean surface area of 45 m² were defined within the study area, from the shoreline up to the foredune limit (
Figure 3). Litter density was calculated for the test boxes using the results of the in situ counting, as described in sub-paragraph 3.1. Sub-paragraph 3.2 describes the photogrammetric workflow, i.e., the acquisition and post-processing of UAV images, in detail. The indirect beach litter detection methodologies are then illustrated in sub-paragraphs 3.3, 3.4, and 3.5. All four methodologies were applied to the test tiles enclosed in the six boxes, covering a total surface of 274.76 m² of the overall test site.
3.1. In Situ Beach Litter Survey
The in situ visual assessment was conducted to measure litter abundance in Torre Guaceto Beach. Following the international monitoring guidelines [
30,
31], the study area covers two 100 m long beach sectors, north of the Canale Reale mouth (cf.,
Figure 1c). This activity was conducted taking into consideration six boxes identified from the shoreline up to the inland foredune limit (
Figure 3). The identification process focused on all litter items larger than 2.5 cm, i.e., those in the macro-litter category, which were listed and classified according to the
Joint list of litter categories for marine macro-litter monitoring proposed by the European Commission [
31]. BL density values were computed by dividing the number of items by the box area. To make the in situ counting easier, each box was split into two parts (e.g., Box A was divided into A1 and A2, Box B into B1 and B2, and so on, as shown in
Figure 3). In addition to their georeferencing purpose, the GCP targets were used to delimit the boxes.
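The density computation described above (number of items divided by box area) can be sketched as follows. The box names, item counts, and areas in the usage example are hypothetical placeholders chosen for illustration; they are not the survey results.

```python
def litter_density(item_count: int, box_area_m2: float) -> float:
    """Return litter density in items per square meter for one box."""
    if box_area_m2 <= 0:
        raise ValueError("box_area_m2 must be positive")
    if item_count < 0:
        raise ValueError("item_count must be non-negative")
    return item_count / box_area_m2


# Hypothetical counts and areas, for illustration only.
boxes = {
    "A": {"items": 30, "area_m2": 45.0},
    "B": {"items": 18, "area_m2": 45.0},
}

densities = {
    name: litter_density(box["items"], box["area_m2"])
    for name, box in boxes.items()
}
mean_density = sum(densities.values()) / len(densities)
print(densities, mean_density)
```

Note that the mean of the per-box densities generally differs from the total item count divided by the total area unless all boxes have the same surface.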
3.2. UAV Surveys: Image Acquisition and Post-Processing
UAV surveys were planned to build the dataset for the training phases of the ML algorithms. Furthermore, in the Torre Guaceto test site, UAV surveys performed in May 2023 were also exploited to monitor the environmental status of the coastal sector soon after the period of occurrence of winter marine events. As for the in situ activities, surveyed areas covered 100 m long beach sectors.
In detail, for the UAV flights in the test site, twenty wooden targets (
Figure 3) were used as Ground Control Points (GCPs) to obtain georeferenced products in the photogrammetric post-processing. To this end, their coordinates and elevation were measured with a “Stonex S9III-N” GNSS receiver in Real Time Kinematic (GNSS-RTK) mode. The UAV images were acquired using a “DJI Inspire 2” quadcopter equipped with a “DJI Zenmuse X5S” optical camera (20.8 MP, DJI MFT 15 mm/1.7 ASPH lens, 4/3” CMOS sensor, FOV 72°, and image resolution of 5280 × 3956 pixels), property of the Department of Earth and Geoenvironmental Sciences of the University of Bari (Italy). The UAV flight missions were set up with a single grid path perpendicular to the shoreline, a nadir camera angle (−90°), and image forward and side overlaps of 80% and 65%, respectively. Four flights were performed at 10 m and 15 m Above Ground Level (AGL), referred to the take-off location, to record the existing beach litter distribution at the site. Then, operators placed additional litter items along the beach, and two additional flights, performed at the same altitudes as the previous ones, were conducted specifically for the algorithms’ training phase (new litter distribution). This last step was necessary to construct an ideal litter pattern with known items and thus provide a valid dataset for the Mask-RCNN training phase. To the same end, and thanks to the collaboration with the University of Coimbra, an additional flight was carried out along the Leirosa Beach sector (Portugal). In this way, the machine learning algorithms were trained on beach sectors with different background features. For technical and environmental acquisition specs of this flight, see
Appendix B.
The post-processing of all the acquired UAV images (listed in
Table 1) was executed using Agisoft Metashape Professional (v. 1.6.5) following the principles of the Structure from Motion (SfM) technique [
32].
A sequential step-by-step workflow (
Figure 4) was followed to obtain a Digital Surface Model (DSM) and an orthomosaic for each flight. First, raw images were aligned to produce a sparse point cloud, representing the image tie points identified by the software. Then, a high-precision georeferencing procedure was carried out using the GCP coordinates collected in the field, with vertical and horizontal accuracy of about 0.02 m and 0.01 m, respectively. Finally, a dense point cloud was generated and used as input to obtain the DSM and, consequently, the RGB orthomosaic. The spatial resolution of the DSM and orthomosaic, strongly influenced by the processing parameters (e.g., the dense cloud downscale factor), is expressed by the Ground Sampling Distance (GSD), a parameter that relates the distance on the ground to the resolution of the final orthorectified image, as previously indicated by Ventura et al. [
33] with the following equation:

$$\mathrm{GSD} = \frac{S_w \times F_h}{F_L \times I_w}$$

where
GSD is the image spatial resolution at ground level,
Sw is the sensor width expressed in millimeters,
Fh is the flight altitude in meters,
FL is the camera’s focal length expressed in millimeters, and
Iw is the image width expressed in pixels. In
Table 1, details of all UAV flights performed for this study are provided.
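As a worked illustration of the quantities defined above, the sketch below assumes the usual form of the relation, GSD = (Sw × Fh) / (FL × Iw), which yields meters per pixel when the sensor width and focal length are in millimeters and the flight altitude is in meters. The sensor width of 17.3 mm is an assumed nominal value for a 4/3” sensor and is not stated in the text; the result is a nominal image-level value and may differ from the resolution of the processed orthomosaic.

```python
def ground_sampling_distance(
    sensor_width_mm: float,
    flight_height_m: float,
    focal_length_mm: float,
    image_width_px: int,
) -> float:
    """Return the nominal GSD in meters per pixel.

    Assumed form: GSD = (Sw * Fh) / (FL * Iw).
    """
    if focal_length_mm <= 0 or image_width_px <= 0:
        raise ValueError("focal length and image width must be positive")
    return (sensor_width_mm * flight_height_m) / (focal_length_mm * image_width_px)


# Camera values from the text: 15 mm focal length, 5280 px image width.
# ASSUMPTION: 17.3 mm nominal sensor width (not given in the text).
for altitude_m in (10.0, 15.0):
    gsd = ground_sampling_distance(17.3, altitude_m, 15.0, 5280)
    print(f"{altitude_m:.0f} m AGL -> {gsd * 1000:.2f} mm/pixel")
```

Since the GSD scales linearly with flight altitude, the 15 m flights yield a pixel footprint 1.5 times larger than the 10 m flights.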
3.3. Visual Screening for Litter Identification
Manual visual screening was carried out to recognize and digitize all litter items in the boxes identified at Torre Guaceto Beach. This operation was performed on the very high-spatial-resolution orthomosaics (1 mm/pixel) obtained from UAV images acquired at a 10 m altitude, using QGIS software (v. 3.16). To this end, a specific geo-database was assembled by compiling the attribute table of the related shapefile and reporting the following fields for each item: X and Y coordinates, European classification code, and composition material.
3.4. Mask-RCNN-Based Algorithm Dataset Building and Application
To build the dataset, each orthomosaic was split into 1000 × 1000 pixel tiles in Global Mapper GIS software (v. 20.1) to improve performance and processing times during the elaboration steps. The training dataset is composed of tiles from UAV images collected for training purposes (cf.
Table 1), while the test dataset comprises 104 tiles covering the area inside the six boxes. Three different settings of the training dataset (Dataset 1, Dataset 2, and Dataset 3) were proposed to conduct three experiments on the box areas at the Torre Guaceto test site. In detail, Dataset 1 and Dataset 2 consider the classes “Bottles”, “Worked wood”, and “Other”, accounting both for the availability of BL items at each specific site and for the most common categories of BL on the Italian coastline [
34]. The label “Bottles” covers only common drink bottles, whether made of glass or plastic, and excludes vials and flasks for cleaning products. The label “Worked wood” includes wooden artifacts such as beams, decks, or platforms, and the label “Other” includes all remaining waste items, regardless of the material. The network was trained three times because the three settings of the training dataset differ in the number of polygons per class, which is crucial for the balance among classes and for the final performance. In the first setting (Dataset 1), all objects other than “Bottles” or “Worked wood” were grouped into a single class, “Other”. In Dataset 2, the number of elements in the “Other” class was reduced, while in Dataset 3, the class “Other” was replaced by the “Nets” class, encompassing fishing nets primarily used for mussels, which represented one of the most abundant and characteristic objects in the coastal test site. In all experiments, the background was not assigned to any class, since it consisted of sand, sea, and vegetation whose chromatic features vary widely depending on several factors, e.g., the site-specific morphodynamic characteristics, the time of day, and the weather conditions during the acquisition flight.
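The tiling step was performed by the authors in Global Mapper; the following is only a minimal NumPy sketch of the same idea, splitting an image array into 1000 × 1000 pixel tiles. The function name and the choice to keep smaller tiles at the right and bottom edges are assumptions made for illustration.

```python
import numpy as np


def split_into_tiles(image: np.ndarray, tile_size: int = 1000):
    """Yield (row_offset, col_offset, tile) for each tile of the image.

    Edge tiles are smaller than tile_size when the image dimensions are
    not exact multiples of it (an assumption of this sketch).
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    height, width = image.shape[:2]
    for row in range(0, height, tile_size):
        for col in range(0, width, tile_size):
            yield row, col, image[row:row + tile_size, col:col + tile_size]


# Synthetic RGB array standing in for an orthomosaic.
orthomosaic = np.zeros((2500, 3000, 3), dtype=np.uint8)
tiles = list(split_into_tiles(orthomosaic))
print(len(tiles), tiles[0][2].shape, tiles[-1][2].shape)
```

Keeping the row and column offsets of each tile is what later allows pixel positions inside a tile to be related back to the georeferenced orthomosaic.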
On each tile (training and test), BL items were manually digitized and labeled in QGIS software (v. 3.16), producing polygon shapefiles. Polygons labeled on test tiles represented the ground truth for the accuracy evaluation, while polygons labeled on training tiles were used for the training phase. The DL algorithm for automatic BL analysis exploits the Mask-RCNN architecture [
35]. This is based on Faster-RCNN and performs an instance segmentation task with a pre-training phase on the COCO dataset [
36]. The subsequent training phase was performed in one epoch consisting of 100 intermediate steps. This algorithm works on a two-stage network (
Figure 5): in the first stage, regions of interest (ROIs) are generated, while in the second stage, the ROIs are classified and a refined bounding box and a mask are returned for each of them. In other words, the algorithm adds a parallel branch for object mask segmentation alongside the box recognition branch. ResNet101 was used as the backbone. Since the manual labeling of the ground truth polygons was carried out in QGIS software (v. 3.16), a conversion from shapefile format to JSON was needed to make the shapefile metadata available to the Mask-RCNN algorithm. The overall dataset was uploaded to Google Drive and analyses were conducted on the Google Colab platform, which provides an Nvidia Tesla K80 GPU and 12 GB of RAM. As a result of the analysis, images in .png format were obtained, on which the beach litter items identified by the Mask-RCNN algorithm were segmented with masks and framed in rectangular bounding boxes (
Figure 5). Furthermore, the accuracy of the ML tools was evaluated through Mean Average Precision using Intersection over Union (mAP@IoU), as already used in Scarrica et al. [
23]. It is derived from the precision–recall curve and incorporates the concept of Intersection over Union (IoU). IoU is the ratio of the overlapping area, expressed in pixels, to the area of the union of two polygons, typically a predicted bounding box/polygon and a ground truth box/polygon. In the context of mAP, IoU is used to determine whether a detection is a true positive based on a set threshold. More specifically, mAP@IoU is defined as the mean of the areas under the precision–recall curve across all classes and IoU thresholds, where true positives are detections with an IoU higher than the given threshold. This allows mAP to account for the accuracy of localization in object detection tasks [
34]. The values of IoU range between 0 and 1. In this work, a threshold of 0.1 was chosen. Since the Mask-RCNN-based algorithm does not handle geospatial data, a code block for the georeferencing task was added to the script. In detail, a .txt file is associated with each exported test image and reports the projected coordinates of each segmented and classified polygon. The algorithm reads the georeferenced input tiles (.tiff format) and computes the bounding box centroid coordinates (expressed in pixels) on the output images. Then, knowing the spatial resolution (GSD) and the projected coordinates of the top-left vertex (origin point) of the georeferenced input image, it is possible to compute the projected coordinates (X and Y) of each centroid. The georeferencing code relies on the input images in .tiff format, which retain the metric coordinates of their origin point (top-left image vertex) because they are derived from the georeferenced orthomosaic. This procedure can be applied to any projected reference system. Besides the coordinates, each .txt file reports the object ID, category name, and score value.
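To make the IoU criterion concrete, the sketch below computes the IoU of two axis-aligned bounding boxes and applies the 0.1 threshold used in this work. It is a simplified illustration on boxes given as (xmin, ymin, xmax, ymax) in pixels; the function names are hypothetical, and the evaluation in the study may operate on masks or polygons rather than boxes.

```python
def box_iou(box_a, box_b) -> float:
    """Return the Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(predicted, ground_truth, threshold: float = 0.1) -> bool:
    """A detection counts as a true positive when its IoU exceeds the threshold."""
    return box_iou(predicted, ground_truth) > threshold


predicted = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
print(box_iou(predicted, ground_truth))
print(is_true_positive(predicted, ground_truth, threshold=0.1))
print(is_true_positive(predicted, ground_truth, threshold=0.5))
```

In this example the boxes overlap on half of their area, so the same detection is accepted at a threshold of 0.1 but rejected at 0.5, which shows how permissive the lower threshold is.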
The pseudo-code used to extract the coordinates of each detected item is reported in
Appendix C.
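Independently of the authors' pseudo-code, the centroid georeferencing described in this sub-paragraph can be sketched as follows. The sketch assumes a north-up tile with square pixels, so that the projected X grows with the pixel column and the projected Y decreases with the pixel row; the function names and the numeric values in the example are hypothetical.

```python
def bbox_centroid_px(xmin: float, ymin: float, xmax: float, ymax: float):
    """Return the (column, row) centroid of a bounding box in pixel units."""
    return (xmin + xmax) / 2.0, (ymin + ymax) / 2.0


def pixel_to_projected(col_px: float, row_px: float,
                       origin_x: float, origin_y: float, gsd_m: float):
    """Convert pixel coordinates to projected coordinates (meters).

    origin_x, origin_y: projected coordinates of the top-left image vertex.
    ASSUMPTION: north-up image with square pixels of size gsd_m.
    """
    x = origin_x + col_px * gsd_m
    y = origin_y - row_px * gsd_m  # image rows increase downward
    return x, y


# Hypothetical tile origin and resolution, for illustration only.
col, row = bbox_centroid_px(100, 200, 300, 400)
x, y = pixel_to_projected(col, row, origin_x=700000.0, origin_y=4500000.0, gsd_m=0.01)
print(col, row, x, y)
```

Because only the tile origin and the pixel size are needed, the same conversion applies to any projected reference system, as stated in the text.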
3.5. SVM Dataset Building and Application
The Support Vector Machine (SVM) algorithm [
37] was selected for the classification task conducted entirely in a GIS environment. This tool is included in the Orfeo Toolbox (OTB) suite, an open-source library available as a plugin in the QGIS environment [
38], which provides many algorithms for high-resolution optical, multispectral, and SAR image post-processing. Moreover, OTB’s algorithms perform object detection on images following the typical object-based image analysis (OBIA) workflow [
39]. Also in this case, different settings of the training datasets were organized. The training database was assembled with manually digitized polygons, considering the items belonging to Dataset S1 and Dataset S3, also used for the Mask-RCNN-based algorithm classification. However, because SVM executes a panoptic pixel-based classification on the entire input image, a fourth class, labeled “Sediment”, was added to include the background. To distinguish them from the Mask-RCNN datasets, the SVM datasets are called Dataset S1, Dataset S2, and Dataset S3. Moreover, a third experiment was conducted with a third dataset (Dataset S3) composed of five classes, namely “Bottle”, “Worked Wood”, “Other”, “Nets”, and “Sediment”. In this case, polygons were digitized directly on the orthomosaic chosen as the training site (Capitolo Beach), on which additional BL items were placed by operators to obtain as many training objects as possible. The metadata of each polygon were characterized by two attributes, “Class” and “Label”, the latter of which was used for the training and validation phases. To this end, a file containing the pixel zonal statistics of the RGB orthomosaic was also needed in order to assign the label of each polygon to a specific cluster of pixels (ground truth). A split value of 0.5 was set between the numbers of training and validation pixels. Then, the training phase was performed using the “TrainImageClassifier” tool provided by OTB, with a polynomial kernel set up for the SVM algorithm, since a “linear” kernel would have been too limiting for our class geometry. This step generated the model file required by the subsequent test phase, conducted on the 104 test tiles. In this case, the SVM performance accuracy was evaluated through the K index [
40] and confusion matrices referring to the validation phase for each dataset setting. The kappa coefficient (K) is a statistical measure used to evaluate the degree of agreement or consistency between two or more raters or observers when they classify items into categories. Unlike simple percentage agreements, “K” considers the agreement occurring by chance. Varying K-values across different datasets when using an SVM for pixel classification reflects the differences in model performance, dataset characteristics, and quality. Higher K-values signify better model performance and alignment with true classifications, while lower values indicate potential areas for improvement in data quality or model tuning.
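As an illustration of how the kappa coefficient corrects the observed agreement for chance agreement, the sketch below computes Cohen's kappa from a square confusion matrix. It is a generic implementation, not the routine used by OTB, and the matrix in the example is hypothetical.

```python
import numpy as np


def cohen_kappa(confusion) -> float:
    """Return Cohen's kappa for a square confusion matrix of pixel counts."""
    matrix = np.asarray(confusion, dtype=float)
    if matrix.ndim != 2 or matrix.shape[0] != matrix.shape[1]:
        raise ValueError("confusion matrix must be square")
    total = matrix.sum()
    if total == 0:
        raise ValueError("confusion matrix is empty")
    # Observed agreement: fraction of pixels on the diagonal.
    observed = np.trace(matrix) / total
    # Chance agreement: product of the marginal proportions, summed over classes.
    expected = float((matrix.sum(axis=0) * matrix.sum(axis=1)).sum()) / total**2
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical two-class matrix, for illustration only.
example = [[20, 5],
           [10, 15]]
print(cohen_kappa(example))
```

In this example the overall agreement is 0.7, but half of it is expected by chance, so kappa is lower than the simple percentage agreement.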
5. Discussion
In this study, a multiple analysis of both direct and indirect methodologies for the identification and classification of BL items was conducted. From the analysis of the international literature, it emerged that the number of studies focusing on beach litter assessment is constantly growing [
8,
41,
42]. Most of the studies, both at the national and international scale, are based on BL data collected through in situ surveys and are aimed at defining the overall environmental quality of the investigated beach sectors by applying specific indices [
11,
12,
43,
44,
45,
46]. Due to the human efforts required to carry out in situ surveys for long coastal sectors, the most recent investigations exploit the use of drones to collect tailored images [
47,
48,
49]. Using GIS tools, the UAV images can be used to identify and map BL items by performing ex situ visual screening. The accuracy of the identification process depends on the image resolution that, in turn, depends on the technical parameters for drone flight deployment, such as flight altitude [
50].
UAV images used in this study were acquired in two different beach contexts (Italian and Portuguese coastlines), with different sediment characteristics, to carry out the training task under different backgrounds. Regarding the physical characteristics of the coastal areas, variations in sediment texture and composition were the leading cause of differences in background features: the grain size and composition of the sediment affect its light reflection and absorption, giving the sand a specific color. This aspect deeply affected the automatic segmentation and classification tasks.
All methodologies were applied at Torre Guaceto Beach while images from other sites were used as supplementary material to train machine learning algorithms. The results of the in situ surveys and the manual visual screening showed how BL items were mostly located on the southernmost part of the Torre Guaceto Beach, near the Canale Reale mouth. They were mostly located on the inland part of the backshore (
Figure 7). As also highlighted in previous studies [
10], this peculiar distribution could be related to the material transported by the creek and reallocated on the beach face by swash and backwash currents.
The average litter density values from the in situ survey and the visual screening (0.51 and 0.3 items/m², respectively) and the total numbers of detected BL items (158 and 92, respectively) revealed that visual screening detected fewer objects than the in situ survey (
Figure 8). Although the orthomosaic resolution was very high (0.03–0.04 cm/pixel), the detection of small objects (centimeters in scale, e.g., caps, cigarette butts, and undefined small pieces) was sometimes very difficult, and it was therefore impossible to catalog them according to litter categories. So, the visual screening methodology could be very useful for detecting and mapping macro- and mega-litter, but it showed limitations for classification purposes, especially for the meso-litter and the smaller objects belonging to the macro-litter category. This is in line with the outcomes of the research carried out by Moy et al. [
51], in which precise measurements of the quantity, location, type, and size of macro-debris were provided using manual screening conducted on orthomosaics.
The Mask-RCNN-based algorithm quantitative results exhibited both an underestimation and an overestimation issue for the detection task. This phenomenon can be traced back to the quantitative equilibrium among the classes, as inferred from the comparison of the conducted experiments (
Table 4). In the case of unbalanced classes (Dataset 1), the number of detected elements was higher, but the accuracy was low. On the other hand, in the case of more balanced classes (Dataset 2 and Dataset 3), the number of detected elements was lower, but the classification accuracy was higher. The Mask-RCNN-based algorithm was trained using three dataset settings that strongly differed in the number of training polygons, a crucial aspect for the balance among classes. In detail, in Dataset 1, classes proved to be unbalanced and the accuracy value returned was very low (
Figure 11, mAP@IoU = 0.017). In Dataset 2, the number of elements composing the “Other” class was lowered to 123 to reach a better balance with the other two classes. In this case, the final performance was lower than in the previous experiment and an underestimation issue occurred. To fix this issue, the class “Other” was replaced with the class “Nets”, which is considered one of the most common BL items identifiable in coastal environments. In this way, a balanced number of polygons (28, 38, and 38;
Table 3) was reached in Dataset 3. Nevertheless, the number of training polygons was still not high enough to return an acceptable classification output. This suggests that the Mask-RCNN-based algorithm could achieve good results with a higher number of training elements. Moreover, a main factor influencing segmentation and the subsequent classification outcome was the shape of the training polygons. So, considering both the segmentation accuracy assessment (
Table 3) and classification accuracy (
Figure 9), it can be deduced that the Mask-RCNN-based algorithm performs better as a segmentation tool than as a classifier. Therefore, to reach an acceptable performance in multi-class classification, the shapes of the training elements must be as well defined as possible and faithfully reproduce the objects to be recognized.
The approach proposed in this study allowed the researchers to geolocate BL items in the site-specific UTM Reference System (
Figure 10). Pfeiffer et al. [
52] proposed an analogous way to obtain a georeferenced output based on bounding boxes and framed in a geographic coordinate reference system (WGS84). Nevertheless, the approach proposed here proved to be very accurate (with a horizontal error of about 2–3 cm) and is linked to the geometrical centroid of each object, ensuring a very detailed analysis of BL item distribution along the beach profile.
As far as the SVM algorithm provided by the Orfeo Toolbox plugin is concerned, the obtained validation results reached an acceptable accuracy especially with Dataset S3 (
Figure 12c), where the main objects were distinct from the background. The same cannot be said for the visual results reported in
Figure 13, where, using completely new images different from those of the training and validation phases, some critical issues emerged. The SVM panoptic classification is purely pixel-based and does not consider polygon shape during the training phase, which resulted in an indirect and passive segmentation. “Bottle” and “Worked wood” proved to be the most problematic classes. In the first case, the material lying beneath the bottles on the beach training sector (in most cases, sand) was clearly visible, since bottles are almost always made of transparent glass or plastic. This generated a high degree of heterogeneity in the pixel zonal statistics of this class, leading to misclassification by SVM (
Figure 13a). A similar issue was encountered with the “Worked wood” class, as the wooden objects labeled for the training phase had a chromatic range similar to that of the sand. Indeed, as can be seen from the confusion matrices of the validation phase (
Figure 12), a high number of pixels belonging to the “Sediment” class were classified as “Bottle” (5655 in Dataset S1) or “Worked wood” (3182 in Dataset S1). On the other hand, the “Other” class showed the highest P.A. value (0.915 in Dataset S2 and 0.918 in Dataset S3;
Figure 12) along with the “Sediment” class, as the labeled training objects had a chromatic range that contrasted with that of the background. Moreover, object shadows and sand ripples were treated as independent objects, thus increasing the total number of masks produced. Trials were also conducted to verify processing time, and test tiles larger than 1000 pixels led to a significant increase in processing time. Finally, because of the pixel-based nature of the SVM classification process, it was impossible to obtain a density value from the raster output without adding further processing steps, which would greatly increase the time needed for a preliminary and immediate estimate of the BL items’ distribution. Despite the high accuracy values achieved during the validation phase (>80%), the aspects discussed above make the SVM classifier unsuitable for a multi-class detection task on very high-resolution images with three spectral bands, although it proved to be a good tool for a segmentation task. Indeed, better performance could be achieved using multi-band images, since this kind of algorithm, like most of the other algorithms available in the OTB suite, is well suited to remote sensing processing [
53,
54].
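To make the accuracy figures discussed above concrete, the sketch below derives the producer's accuracy (P.A.), user's accuracy, and overall accuracy from a confusion matrix. It assumes that rows hold the reference classes and columns the predicted classes, which may differ from the layout used in Figure 12, and the matrix values are invented for illustration only.

```python
import numpy as np


def accuracy_metrics(confusion):
    """Return (producer's accuracy, user's accuracy, overall accuracy).

    Assumed layout: rows = reference classes, columns = predicted classes.
    """
    cm = np.asarray(confusion, dtype=float)
    diag = np.diag(cm)
    ref_totals = cm.sum(axis=1)   # pixels that truly belong to each class
    pred_totals = cm.sum(axis=0)  # pixels assigned to each class
    with np.errstate(divide="ignore", invalid="ignore"):
        producers = np.where(ref_totals > 0, diag / ref_totals, np.nan)
        users = np.where(pred_totals > 0, diag / pred_totals, np.nan)
    overall = diag.sum() / cm.sum()
    return producers, users, overall


# Invented three-class example (not data from this study).
cm = [
    [80, 15, 5],
    [10, 85, 5],
    [0, 10, 90],
]
producers, users, overall = accuracy_metrics(cm)
```

A class can show a high producer's accuracy while still absorbing many pixels from other classes, which lowers its user's accuracy. This is the pattern described above, where “Sediment” pixels were assigned to “Bottle” or “Worked wood”.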
As a final remark, the Mask-RCNN-based algorithm proposed in this study proved to be a promising alternative tool for the automatic detection of BL items, although it is still not accurate enough (mAP@IoU = 0.02), since object shape is a crucial element for the segmentation and classification tasks. In Scarrica et al. [
23] a preliminary version of the Mask-RCNN-based algorithm was used to detect and classify BL items on UAV images through a single-class classification approach. In this study, the algorithm was extended to a multi-class classification task, and its performance in classifying BL items into three classes was evaluated. The robustness of the Mask-RCNN-based algorithm was evaluated with an IoU threshold of 0.1, but further experiments are planned to reach a threshold value of 0.5. The number of training polygons should be as high as possible to obtain acceptable performance, and this could be readily achieved through additional aero-photogrammetric surveys.
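The role of the IoU threshold can be illustrated with a short sketch that computes the intersection over union of two binary masks and checks whether a detection counts as a match at the 0.1 and 0.5 thresholds mentioned above. The function name and the two toy masks are hypothetical and are not taken from the evaluation code of this study.

```python
import numpy as np


def mask_iou(pred, truth):
    """Intersection over union of two boolean masks with the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 0.0  # both masks empty: define IoU as 0
    return np.logical_and(pred, truth).sum() / union


# Two hypothetical 4 x 4 pixel objects that overlap on a 2 x 2 corner.
pred = np.zeros((10, 10), dtype=bool)
truth = np.zeros((10, 10), dtype=bool)
pred[0:4, 0:4] = True
truth[2:6, 2:6] = True

iou = mask_iou(pred, truth)   # 4 shared pixels over 28 in the union
match_at_01 = iou >= 0.1      # accepted with the lenient threshold
match_at_05 = iou >= 0.5      # rejected with the stricter threshold
```

In this toy case the same prediction is a true positive at a threshold of 0.1 and a false positive at 0.5, so raising the threshold requires masks that follow the object outline much more closely.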
Based on the field photogrammetric surveys conducted both in this study and in previous activities [
13,
23,
50,
51], a protocol for a potential standardization of a drone-based BL survey was proposed. In detail, specific logistic, technical, and environmental parameters are defined in
Table 5.
6. Conclusions
In the past twenty years, the interest of the international scientific community in the environmental issue of litter in coastal areas has been increasing. In this study, an integrated analysis of different methodologies for macro-litter (i.e., objects greater than 2.5 cm) identification and classification is provided. In detail, data directly collected in situ at Torre Guaceto Beach (southern Italy) were used to evaluate beach litter density, and the obtained values were used as a reference for those calculated from indirect analyses based on UAV photogrammetric surveys (i.e., manual visual screening and automatic detection through ML algorithms). These results indicate that, although further improvements in the training phase are needed to overcome the overestimation/underestimation issues, the performance of the automatic classification tool is promising.
As far as the novel Mask-RCNN-based algorithm is concerned, it turned out to be poorly suited to multi-class classification because of low accuracy values (less than 10%). Nevertheless, the overall performance of this tool could be improved by providing additional training polygons for a set of selected BL classes (representing the most common BL items), which can be obtained by increasing the number of available training images. To this end, a flight protocol reporting the mechanical and environmental settings is proposed to carry out tailored UAV surveys. The simplicity of the training and testing steps makes the Mask-RCNN-based algorithm a versatile tool that can be further exploited to support the integrated management of the coastal zone. In addition, thanks to the introduction of a code block that provides a very accurate georeferenced position for each identified BL item, the current version of the proposed Mask-RCNN method allows the assessment of BL item spatial distribution along the investigated beach sector. It can therefore be exploited to identify accumulation zones and define priority areas where beach management action should be directed.
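As a hedged example of how georeferenced item positions could be turned into a density map for locating accumulation zones, the sketch below bins item centroids into square cells and divides the counts by the cell area. The cell size, the bounding box, and the coordinates (given relative to an arbitrary local origin, in metres) are assumptions for illustration, not values from the surveyed beach.

```python
import numpy as np


def density_grid(eastings, northings, bounds, cell_size):
    """Return item density (items per square metre) on a regular grid.

    bounds = (xmin, ymin, xmax, ymax) in the same metric units as the points.
    The result has shape (n_cells_x, n_cells_y).
    """
    xmin, ymin, xmax, ymax = bounds
    nx = int(np.ceil((xmax - xmin) / cell_size))
    ny = int(np.ceil((ymax - ymin) / cell_size))
    x_edges = xmin + cell_size * np.arange(nx + 1)
    y_edges = ymin + cell_size * np.arange(ny + 1)
    counts, _, _ = np.histogram2d(eastings, northings, bins=[x_edges, y_edges])
    return counts / cell_size**2, x_edges, y_edges


# Hypothetical centroids of five detected items (metres from a local origin).
eastings = np.array([0.5, 1.0, 1.9, 3.0, 7.5])
northings = np.array([0.5, 1.5, 0.2, 3.0, 1.0])

density, x_edges, y_edges = density_grid(
    eastings, northings, bounds=(0.0, 0.0, 8.0, 4.0), cell_size=2.0
)

# Index of the cell with the highest density, a candidate accumulation zone.
hotspot = np.unravel_index(np.argmax(density), density.shape)
```

The cell size controls the trade-off between spatial detail and noise: small cells isolate individual items, whereas larger cells smooth the map and highlight broader accumulation patterns along the beach profile.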
On the other hand, although the SVM classifier available in the QGIS environment exhibited a relatively high accuracy during the validation phase, it did not reach a satisfactory performance for a multi-class pixel-based detection task on very high-resolution images with three spectral bands. In this regard, integration with multi-spectral or hyper-spectral images could be very useful to improve the suitability of this tool for BL detection.
Improving the automatic detection approaches means facilitating the autonomous use of ML-based tools, reducing the time and effort needed for coastal monitoring through in situ field activities. In compliance with the objectives of sustainable development [
55,
56], further investigations will be carried out to provide an adequate and definitive coastal monitoring tool to be exploited for the integrated geo-environmental characterization of both natural and highly anthropized coastal areas [
57].