1. Introduction
With the development of industry and agriculture, large amounts of wastewater are discharged, and the ecological environment of watersheds is seriously damaged. Floating objects on the water [1], water pollution [2], and water eutrophication [3] all degrade the water environment; they not only damage water resources but also threaten human safety and health. The traditional monitoring method is to deploy devices in the water that make use of sensors to monitor and analyze the water quality. This method can accurately measure local water pollution, but it cannot carry out quick statistical analysis of pollution over large areas of water. For visible floating objects on the water surface, monitoring is often accomplished by staff watching surveillance screens. Although this method is simple, its coverage is limited, and it requires a lot of labor and material resources; thus, it is not very efficient.
In addition to traditional methods, an increasing number of technologies are being used to protect water resources. The use of remote sensing satellites [
4] for monitoring the pollution of target waters enables the acquisition of information over a large area in a short period. With the development of deep learning, a method for detecting floating objects based on a convolutional neural network has been proposed [
5]. The convolutional neural network (CNN) can recognize and classify floating objects by extracting, training, and learning their features. In practice, target detection in streaming video involves huge computational effort and therefore has high hardware requirements. Edge computing [
6] has been proposed as a new method for intelligent video surveillance, which can significantly reduce video-processing latency and ensure real-time performance. In comparison with traditional methods, these methods exhibit the following characteristics:
Large coverage area: remote sensing technology can achieve comprehensive scanning and monitoring of target waters and areas, and coverage of inland waters can reach 100%.
High detection accuracy: deep-learning-based target detection can achieve highly accurate detection and classification of floating objects on the water surface.
Good real-time performance: edge computing computes, analyzes, and stores data near the data source to reduce redundant data transmission and meet the demand for real-time performance in practical application scenarios.
Remote sensing technology often needs to be combined with ground sampling and analysis data to invert the overall data of the watershed. Chlorophyll content is an important indicator for detecting water-surface algae. Deep-learning-based target detection can help in classifying floating objects, although different network models perform differently. In addition to pursuing higher detection accuracy and faster detection speed, many researchers have also investigated the balance between detection speed and accuracy. Edge computing is mainly performed at edge nodes, whose limited computing power restricts the size and computational cost of the detection model. Optimization can yield a smaller and more accurate network model. 6G networks will be a fully connected domain with integrated terrestrial wireless and satellite communications. To realize the comprehensive monitoring of floating objects on the water surface, a new framework based on 6G is proposed for the warning and identification of such objects. Remote sensing technology is used to issue warnings regarding pollution of large water areas. A new method based on SSD-MobileNetV3 in edge computing is used to monitor floating objects on small water areas. The convolutional block attention module (CBAM), system-collaborative optimization, image preprocessing, and key-frame extraction were designed to reduce the interference of complex backgrounds and improve calculation speed, detection accuracy, and overall stability.
2. Related Works
Ecological monitoring is dynamic, large-scale, long-term work. With the rapid development of science and technology, remote sensing technology has been widely used in ecological environment monitoring. Using remote sensing for pollution monitoring is a new application for this technology. It can achieve rapid and large-scale monitoring of the ground environment, and is often used for water pollution monitoring [
7] and water eutrophication monitoring [
8]. Qun et al. [
9] used remote sensing technology for detection of the water of Nansihu Lake. First, the relevant data were subjected to sensor, geometric, and atmospheric correction, and water extraction. Taking chlorophyll and suspended solids as important water quality indicators, the authors established a remote sensing inversion model of the water area. The inversion model was then combined with the conventional water detection method to invert the water quality parameters. The method was found to be faster, broader in scope, and more credible for water quality evaluation. Xiao et al. [
10] proposed a random forest-based algorithm to distinguish
Ulva prolifera and
Sargassum from multispectral satellite images. Differential analysis was performed, mainly by capturing the spectra of
Ulva prolifera and
Sargassum using the GF-1 satellite sensor. The method can be used in marine waters with similar environments for phytoplankton traceability and competitive succession with high accuracy and stability. It provides reference values for identifying algae, monitoring water blooms, and providing early warning in inland waters.
Researchers have used the Gaussian mixture model (GMM), deep learning, and other methods to study the recognition and classification of floating objects on the water surface. In 2019, Jin et al. [
11] proposed an improved GMM-based automatic segmentation method (IGASM) to detect floating objects on the water surface. The method first maps the GMM results onto the HSV color space and detects light and shadow using the light and shadow discriminant function. Then, floating objects on still water are segmented by the background update strategy combined with the graph cuts algorithm to optimize the segmentation results based on the spatial information of video images. The experimental results showed that the method could effectively eliminate the effects of light and shadow and water ripples. The improved background update strategy enabled better segmentation of floating objects on the water surface. In 2021, Zhang et al. [
12] added an improved anchor refinement module to the convolutional layer of the RefineDet model. High-level semantic features could be extracted, and different levels of features could be fused to improve detection accuracy. The parameter settings of anchor points could be adjusted according to the scale and aspect ratio distribution, and the focal loss function could be used to solve the foreground–background imbalance problem caused by too many anchor points. The experiments showed that the method could basically meet the requirements of real-time performance and precision. He et al. [13] proposed an improved YOLOv5 water surface floating object detection algorithm. The method suppressed overfitting in network training by introducing label smoothing. The original topology was also used to enhance the feature extraction of floating objects and to reduce the number of parameters and computational effort. The loss function of the model was also optimized to improve speed and accuracy. The experiments showed that the strategy was feasible for detecting floating objects on the water surface.
The emergence of edge computing makes up for the shortcomings of cloud computing. As a result of the proliferation of mobile and IoT devices, large volumes of multimodal data (e.g., audio, picture, and video) of physical surroundings can be continuously sensed on the device side [
14]. Taking intelligent video surveillance as an example, video data must be processed, computed on, and stored 24 h a day, which imposes high requirements on the equipment environment. Cloud computing alone cannot meet this demand at acceptable network and computing costs, whereas edge computing offers low latency, low bandwidth consumption, and low cost, and has been applied in various fields.
Sun et al. [
15] proposed an edge computing-enabled mobile video processing system. Because edge devices have limited resources, high-precision network inference models cannot be deployed on them. The authors proposed using mobile edge computing units with cameras and cloud nodes as edge cloud nodes through which video streams are processed. This method first preprocesses the video stream, then uploads the results to the upper node for processing, and utilizes the computing resources of the cloud to speed up data analysis. The experimental results showed that this method can reduce video transmission delay and network overhead, and it provides a new idea for video processing in edge computing. Wang et al. [
16] proposed an edge computing environment for accurate part model classification using a convolutional neural network-based element segmentation method. Xu et al. [
17] proposed a cloud–edge collaboration framework for video surveillance in coal mines. The two are integrated, with cloud computing used for non-real-time and global tasks and edge computing used for real-time processing of local surveillance videos. A mixed edge-based and cloud-based framework with the final goal of PM2.5 value prediction is proposed in the literature [
18]. In this scheme, the original and preprocessed data on a real-world dataset from air quality sensors distributed in Calgary, Canada, is used to evaluate the quality of predictions. The above methods effectively solve practical problems such as industrial production, environmental monitoring, and urban security through the collaboration of cloud computing and edge computing. The main work related to floating objects is summarized in
Table 1.
However, with the development of the chip industry, system-on-a-chip (SOC) [
19] is widely used, which can provide richer computing resources for edge devices. In this paper, SOC and edge devices are combined as edge computing nodes to implement real-time floating object detection on water.
With the above-mentioned surface floating object detection methods, real-time video detection cannot be performed directly, and traditional cloud-based intelligent video surveillance has difficulty meeting the demand for real-time performance. The emergence of edge computing can help to change this situation. Running AI models at the edge requires not only improving the computational storage capacity of edge devices, but also optimizing network models to adapt to them. Compared to existing methods for detecting floating objects on the water, our proposed scheme has the following improvements:
Remote sensing satellites can provide a larger range of monitoring data. We can monitor the chlorophyll content of water using remote sensing satellites and use that value to determine whether to conduct early warning.
After the early warning is issued, a UAV is used to perform aerial photography for detection of target waters, and the classification and detection of identifiable planktonic algae. The UAV and surveillance cameras around the water are the edge devices that generate data, combined with the SOC as the edge computing node.
The SSD network is deployed at edge nodes for floating object detection. Prior to that, we replaced VGG16 in the SSD network with MobileNet, reducing the computational cost to accommodate edge nodes.
By adding the key-frame extraction module, the frame difference method can effectively determine changes of floating objects on the water surface, and the detection of floating objects in key frames will help in capturing important information.
Image preprocessing is applied to key frames, including median filtering to remove noise, and Laplace sharpening, which can help the detection model to extract floating object features.
This paper is organized as follows:
Section 3 describes the work related to water basin monitoring and early warning by satellite remote sensing;
Section 4 presents the SSD-MobileNet network and the optimization improvement of the model; and
Section 5 presents the deployment of the edge computing architecture and the analysis of the experimental results.
3. Framework of Remote Sensing Monitoring and Early Warning
The chlorophyll content in a water body is an important indicator to evaluate water quality and eutrophication. A normal water body reflects mainly in the blue and green wavelengths and absorbs other wavelengths to a certain degree, with the strongest absorption in the near-infrared band. Due to the steep slope effect of chlorophyll in phytoplankton between the visible and near-infrared wavelengths, an increase in chlorophyll concentration will weaken the absorption capacity of the water column. Estimating chlorophyll concentration from the spectral reflectance of water provided by satellites is a commonly used monitoring method. We used this method to determine the planktonic biomass at the water surface.
The remote sensing monitoring and early warning process used in this paper included data collection [
20] (shown in
Figure 1), data preprocessing, water quality inversion, and abnormal warning.
The process of remote sensing monitoring and early warning based on 6G is shown in
Figure 2. First, the satellite collects global remote sensing image data and transmits the data to the ground receiving station through 6G. Several high-definition images in different wavelength bands of the water to be monitored can be acquired through ground receiving stations. Then, preprocessing procedures, such as radiometric calibration, atmospheric correction, multispectral correction, and image stitching, are performed [
21]. The accuracy of radiometric correction is an important indicator of the quality of satellite images. The China Resources Satellite Application Center provides absolute radiometric calibration coefficients, which can be used to calibrate GF-1 data and realize the conversion of DN values to radiance values. The atmospheric correction module in the software is then used for atmospheric correction processing. Geometric correction then digitally performs a point-by-point fine correction of the image, and finally multiple images of the target water are stitched into a mosaic to ensure coverage of the whole water body. Remote sensing data and water quality analysis data are combined to invert the water quality parameters. When the detection result is abnormal, the ground station sends the abnormal location information to the UAV through 6G. Then, the UAV performs aerial photography of the target area and detects the planktonic algae in the water with real-time target detection.
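The conversion of DN values to radiance described above is commonly a per-band linear calibration. The following is a minimal sketch under that assumption; the function name and the gain and offset values are hypothetical placeholders, not the coefficients published for GF-1.

```python
import numpy as np


def dn_to_radiance(dn: np.ndarray, gain: float, offset: float) -> np.ndarray:
    """Linear radiometric calibration of one band: L = gain * DN + offset."""
    return gain * dn.astype(np.float64) + offset


# Hypothetical coefficients for a single band. Real values must be taken
# from the calibration tables published for the sensor and acquisition year.
GAIN, OFFSET = 0.2, 0.0

dn_band = np.array([[100, 200], [300, 400]], dtype=np.uint16)
radiance = dn_to_radiance(dn_band, GAIN, OFFSET)
```

Each band has its own coefficient pair, so in practice the function would be applied band by band before atmospheric correction.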
When the planktonic algae in the water grow in large numbers, the chlorophyll content of the water column increases, and the absorption of NIR wavelengths at the water surface is significantly weakened. In this paper, chlorophyll-a content is used as the main reference indicator. When the index is abnormal, UAV target detection is carried out in the designated area according to the analysis results of the remote sensing data. The UAV is equipped with the SSD-MobileNetV3 target detection algorithm, which can detect and classify a variety of planktonic algae and other common floating debris on the target water surface. The detection data are uploaded in real time using the edge computing method to facilitate the next step.
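The warning decision can be illustrated with a simple sketch. The inversion model used in this paper is not reproduced here; as a stand-in, the sketch assumes a normalized difference of the NIR and red bands as a chlorophyll proxy, and the two thresholds and all function names are hypothetical.

```python
import numpy as np


def bloom_index(nir: np.ndarray, red: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalized NIR/red difference, used here only as a chlorophyll proxy."""
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)


def needs_warning(nir, red, water_mask, index_threshold=0.1, area_fraction=0.05):
    """Return (warn, abnormal_mask).

    A warning is raised when the share of water pixels whose index exceeds
    index_threshold is larger than area_fraction. Both limits are assumed
    values that would have to be fitted to ground sampling data.
    """
    abnormal = (bloom_index(nir, red) > index_threshold) & water_mask
    n_water = int(water_mask.sum())
    if n_water == 0:
        return False, abnormal
    return bool(abnormal.sum() / n_water > area_fraction), abnormal


# Toy 2 x 2 reflectance tile: one pixel with unusually high NIR reflectance.
nir = np.array([[0.05, 0.30], [0.04, 0.05]])
red = np.array([[0.10, 0.10], [0.10, 0.10]])
mask = np.ones((2, 2), dtype=bool)
warn, abnormal = needs_warning(nir, red, mask)
```

The abnormal mask gives the locations that would be forwarded to the UAV for aerial detection.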
5. System Optimization
5.1. System Analysis
In the early stages of model design, we had to consider application scenarios and some of the constraints imposed by resource allocation. Complex models often require a large amount of computation, which is difficult to afford with the resource allocation of edge devices. Adapting the network model to edge devices and making it perform better is also an important task. As shown in
Figure 8, a number of influencing factors are tuned and optimized according to the application requirements, resulting in faster calculations and more stable overall performance. In real-time detection, the amount of model computation is an important factor affecting the speed of detection. In this paper, we replace the VGG16 network in SSD with MobileNet and quantize the model; this significantly reduces the amount of model computation. The SOC in the edge device is the main component of the whole system and the most critical unit for performing data-processing calculations. Adjusting the model parameters according to the hardware performance enables the hardware to perform better.
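The saving obtained by replacing VGG16 with MobileNet comes largely from depthwise separable convolutions, the basic building block of MobileNet. The sketch below compares the multiply-accumulate counts of a standard convolution and its depthwise separable counterpart; the layer dimensions are hypothetical and serve only to illustrate the ratio.

```python
def standard_conv_macs(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    """Multiply-accumulates of a k x k standard convolution on an h x w output."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_macs(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 256 -> 256 channels, 38 x 38 feature map.
k, c_in, c_out, h, w = 3, 256, 256, 38, 38
ratio = depthwise_separable_macs(k, c_in, c_out, h, w) / standard_conv_macs(k, c_in, c_out, h, w)
# The ratio equals 1 / c_out + 1 / k**2, i.e. roughly one ninth for a 3 x 3 kernel.
```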
In addition, in order to further simplify the process and improve the detection accuracy, we optimized and improved the processing system (PS). As shown in
Figure 9, we added a key-frame extraction module and an image-preprocessing module to the programmable logic (PL). The main function of the key-frame extraction module is to extract the video frames in which floating objects move from the video stream. These video frames contain important information for the detection of floating objects, and detecting only these key frames avoids processing redundant data, which effectively reduces the amount of calculation. The image-preprocessing module is used to de-noise and sharpen key frames to further improve the detection accuracy. These two modules are described in detail in the following subsections.
5.2. Extracting Key Frames
To further simplify the detection process, improve the speed of detection, and meet the demand for real-time performance in practical application scenarios for edge devices fixed on the shore, a key-frame extraction module was added to their detection process. In this paper, video frames that can reflect increase and decrease, and displacement changes of floating objects in the water, are used as key frames. The inter-frame difference method [
27] can help us to quickly calculate and extract key frames. In this paper, the two-frame difference method is adopted to perform the difference operation between the nth and (n − 1)th frames of two temporally consecutive images, and the specific algorithm is as follows:
Let A be the whole frame image, and let the nth frame image and the (n − 1)th frame image in the video sequence be $f_n$ and $f_{n-1}$.
The grayscale values of the corresponding pixel points of the two frames are denoted as $f_n(x, y)$ and $f_{n-1}(x, y)$. Then, the absolute value of the difference between the grayscale values of the corresponding pixel points in the two frames is calculated. The calculation process is given in Equation (1):
$$D_n(x, y) = \left| f_n(x, y) - f_{n-1}(x, y) \right| \qquad (1)$$
When $D_n(x, y)$ exceeds a certain threshold $T$, it is determined that there is a floating object moving in this video frame, and the frame is used as a key frame. A threshold value that is too small cannot suppress many noise points in the image, and a threshold value that is too large tends to obscure the target information. Fixed thresholds cannot adapt to light changes in the scene. In this paper, we added an additional term to the determination condition to adjust the threshold value according to the overall lighting. The key-frame determination condition is given in Equation (2):
$$D_n(x, y) > T + \lambda \frac{1}{N_A} \sum_{(x, y) \in A} \left| f_n(x, y) - f_{n-1}(x, y) \right| \qquad (2)$$
where $N_A$ is the total number of pixels in the area to be detected, λ is the suppression factor for illumination, and A is the whole image. The additional term $\lambda \frac{1}{N_A} \sum_{(x, y) \in A} \left| f_n(x, y) - f_{n-1}(x, y) \right|$ indicates the change in illumination in the whole image.
If the change in illumination in the scene is small, the value of this term tends to be zero. If the change in illumination in the scene is significant, the value of this term increases significantly, and the right-hand side of the judgment condition increases adaptively, thus effectively suppressing the effect of light changes on the detection results of moving targets.
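The key-frame test can be sketched in a few lines of NumPy. This is a minimal illustration, assuming that a frame is a key frame when the absolute grayscale difference of at least one pixel exceeds the base threshold plus λ times the mean absolute difference over the whole image; the function name and the default values of the threshold and λ are hypothetical.

```python
import numpy as np


def is_key_frame(prev_gray: np.ndarray, curr_gray: np.ndarray,
                 base_threshold: float = 25.0, lam: float = 2.0) -> bool:
    """Two-frame difference test with an illumination-adaptive threshold."""
    # Cast to float first so that uint8 subtraction cannot wrap around.
    diff = np.abs(curr_gray.astype(np.float64) - prev_gray.astype(np.float64))
    # Mean absolute difference over the region: close to zero for stable
    # lighting, large when the whole scene becomes brighter or darker.
    illumination_term = lam * diff.mean()
    return bool((diff > base_threshold + illumination_term).any())


background = np.full((100, 100), 50, dtype=np.uint8)

moving = background.copy()
moving[10:15, 10:15] = 250          # a small object enters the scene

brighter = background + 40          # uniform lighting change, no object

static_result = is_key_frame(background, background)
moving_result = is_key_frame(background, moving)
lighting_result = is_key_frame(background, brighter)
```

In this toy case the local change is reported as a key frame, whereas the uniform brightening raises the right-hand side of the condition and is rejected.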
5.3. Image Preprocessing
The water-surface environment is complex and easily disturbed by other factors in the process of floating object detection, resulting in loss of detection accuracy and even false detection or omission. In this paper, an image-preprocessing module was added before the model detection for median filter noise elimination and Laplacian sharpening of key frames, to preserve floating object edge information while eliminating noise. Passing the processed image into the model detection is beneficial to the feature extraction of the image by the model, which can effectively improve the detection accuracy.
Median filtering is a nonlinear, order-statistic filter. First, we specify the sliding window size; then, for each pixel, we take the median of the grayscale values of the pixels inside the window centered on it and replace the value of the center pixel with that median. The key frames are de-noised using median filtering, which effectively suppresses noise while preserving the edges of the image without making it too blurry. The image is then further processed with Laplacian sharpening. When the grayscale value of the central pixel is lower than the average grayscale of the other pixels in its neighborhood, the grayscale of the central pixel is further reduced. When it is higher than the average of the other pixels in its neighborhood, the grayscale of the central pixel is further increased. By sharpening the image in this way, the details can be enhanced, and the edges can be highlighted. As shown in
Figure 10, the detection precision of this method is slightly improved compared with the original model.
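The two preprocessing steps can be sketched with NumPy alone. This is an illustrative implementation, assuming a 3 × 3 window and the 4-neighbor Laplacian kernel; the window size, kernel, and strength used in the deployed module are not specified here, and the function names are hypothetical.

```python
import numpy as np


def median_filter(img: np.ndarray, size: int = 3) -> np.ndarray:
    """Replace each pixel by the median of the size x size window around it."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (size, size))
    return np.median(windows, axis=(-2, -1)).astype(img.dtype)


def laplacian_sharpen(img: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Subtract the 4-neighbor Laplacian so that edges are emphasized."""
    f = np.pad(img.astype(np.float64), 1, mode="edge")
    center = f[1:-1, 1:-1]
    neighbours = f[:-2, 1:-1] + f[2:, 1:-1] + f[1:-1, :-2] + f[1:-1, 2:]
    laplacian = neighbours - 4.0 * center
    # A pixel darker than its neighbors gets darker, a brighter one brighter.
    return np.clip(center - strength * laplacian, 0, 255).astype(np.uint8)


# Salt noise on a flat background is removed by the median filter.
noisy = np.full((7, 7), 100, dtype=np.uint8)
noisy[3, 3] = 255
denoised = median_filter(noisy)

# A vertical step edge becomes steeper after sharpening.
edge = np.zeros((6, 6), dtype=np.uint8)
edge[:, :3] = 50
edge[:, 3:] = 150
sharpened = laplacian_sharpen(edge)
```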
5.4. Edge Deployment
The advent of convolutional neural network-based target detection has rapidly moved intelligent video analysis from theory to practical application. Deep convolutional neural networks require large amounts of computation and must rely on hardware such as a graphics processing unit (GPU) to achieve this. The traditional cloud-based real-time video streaming analysis model is shown in
Figure 11. The video data are transmitted to the cloud server in the network center in real time through the Internet, and the data are cleaned, stored, analyzed, and reasoned by the cloud server, then the reasoning results are returned to the terminal device. This model has a stable overall structure and is widely used in various business scenarios. However, problems such as large bandwidth consumption, high-transmission delay, unreliable network, and difficult privacy protection still need to be solved.
The emergence of SOC has provided arithmetic support for edge computing, making its deployment and popularity possible. In this paper, we apply the model of edge computing to intelligent video surveillance by sinking the cloud server at the center of the network to an edge node that is physically close to the video source. SOCs with some computational power are embedded in the camera as edge nodes, and the above detection model is deployed. The camera transmits the captured video stream data to the SOC, which then decodes it according to the frame rate, resolution, and other parameters, and the video coding protocol. The above key-frame extraction algorithm is then used to extract key frames from the video stream and pass them into the model for target detection. The edge analysis architecture [
28] is shown in
Figure 12.
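The processing order at the edge node (decode, key-frame extraction, preprocessing, detection) can be summarized as a small generator. This is only a structural sketch: the three callables are placeholders for the modules described in this section, and the toy integer frames stand in for decoded images.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

Frame = TypeVar("Frame")


def process_stream(
    frames: Iterable[Frame],
    is_key_frame: Callable[[Frame, Frame], bool],
    preprocess: Callable[[Frame], Frame],
    detect: Callable[[Frame], List[str]],
) -> Iterator[Tuple[int, List[str]]]:
    """Run detection only on key frames and yield (frame index, detections)."""
    previous = None
    for index, frame in enumerate(frames):
        if previous is not None and is_key_frame(previous, frame):
            yield index, detect(preprocess(frame))
        previous = frame


# Toy usage: integers stand in for decoded frames.
toy_frames = [0, 0, 5, 5, 9]
results = list(
    process_stream(
        toy_frames,
        is_key_frame=lambda a, b: abs(b - a) > 2,
        preprocess=lambda f: f,
        detect=lambda f: [f"object@{f}"],
    )
)
```

Because redundant frames never reach the detector, the cost of the detection model is paid only when the scene changes.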
5.5. Limitations of the Method
The computation at the edge nodes of edge computing platforms mainly depends on the SOC, which has limited computing power. In this paper, channel pruning of the trained network model is carried out to reduce the computational burden and largely meet the requirements of edge adaptation. However, in the actual deployment of the network model, problems still occur, such as the inability to detect the target continuously, missed detections, and occasional jumping and drifting of the detection box. Therefore, reducing the computational burden of the model while limiting the accuracy loss and ensuring the real-time performance of object detection remains an important challenge. In the future, we will use filtering and smoothing methods to predict targets and reduce missed detections. Meanwhile, multi-thread optimization, inter-frame optimization, and algorithm co-optimization will be used to shorten the processing delay.
In addition, small object detection has always been a difficult problem in the field of object detection. In the target detection task, a convolutional neural network achieves localization and classification by extracting the feature information of the target, so the amount of feature information carried by the target directly affects the final prediction result. Small objects occupy a low proportion of pixels in the image and carry less effective feature information, which makes their detection and recognition more difficult. In the water environment, special background factors such as light, ripples, and reflections have to be considered, which can lead to false detections. Meanwhile, in practical applications, floating objects on the water surface vary widely in type and size, which also brings great challenges to identification and detection. To address these problems, a data enhancement method is used for small targets to increase the number of small target samples and improve the generalization ability of the model. The attention mechanism is added to make the network pay more attention to the key information carried by small targets. Experiments show that these methods can improve the accuracy of small target detection. However, there are only four types of floating objects in the dataset used in this paper. Objects with few available samples and insignificant features are not included in the dataset due to the difficulty of collection. In the future, we will collect and sort more of such image data and improve the floating object dataset. The multi-size detection strategy and cross-feature layer-fusion method will be used to improve the accuracy of small object detection.
6. Experimental Analysis
6.1. Datasets
In this paper, the experimental data on river floaters were mainly obtained from publicly available datasets, such as ImageNet [
29] and COCO [
30], manual photography, and relevant images gathered with web crawlers. Then, LabelImg software was used to label the images, and a dataset of floating objects on the water surface was produced in VOC format. The dataset consisted of four main categories: bottles, plastic bags, planktonic algae, and dead fish. The dataset was expanded by rotation, contrast enhancement, and mirroring, as shown in Figure 13. The dataset contains a total of 22,000 images, and the statistics are shown in
Table 2.
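The three expansion operations can be sketched as follows. The rotation angle and contrast factor are assumptions chosen for illustration, since the exact augmentation parameters are not stated here, and the function names are hypothetical.

```python
import numpy as np


def mirror(img: np.ndarray) -> np.ndarray:
    """Horizontal flip (mirroring)."""
    return img[:, ::-1]


def rotate90(img: np.ndarray, k: int = 1) -> np.ndarray:
    """Rotate by k * 90 degrees counterclockwise."""
    return np.rot90(img, k)


def enhance_contrast(img: np.ndarray, factor: float = 1.5) -> np.ndarray:
    """Stretch grayscale values around the image mean by the given factor."""
    f = img.astype(np.float64)
    mean = f.mean()
    return np.clip((f - mean) * factor + mean, 0, 255).astype(np.uint8)


def augment(img: np.ndarray) -> list:
    """Return the augmented variants of one image."""
    return [mirror(img), rotate90(img), enhance_contrast(img)]


sample = (np.arange(12, dtype=np.uint8) * 20).reshape(3, 4)
mirrored, rotated, contrasted = augment(sample)
```

For a detection dataset, the bounding boxes in the VOC annotations must be flipped or rotated with the same transform; contrast enhancement leaves them unchanged.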
6.2. Experimental Environment
The experimental environment was divided into model-training and edge deployment environments. The model-training software environment was Ubuntu 18.04 and Python 3.8.8, using the PyTorch framework. The model-training hardware environment was a GeForce RTX 3090 GPU and an Intel(R) Xeon(R) E5-2678 v3 CPU. The edge deployment environment used a 2 megapixel 1/1.8-inch (charge coupled device, CCD) CMOS smart capture camera and an RV1126 chip.
6.3. Experimental Results and Analysis
The metrics generated during the training of the network model are the criteria for evaluating the quality of the model and provide an objective picture of the model’s performance. For the performance evaluation, we selected precision P, recall R, average precision AP, and detection speed FPS to represent the performance of the model. P and R are defined as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
where TP indicates the number of correctly detected floating objects, FP indicates the number of non-floating objects incorrectly detected as floating objects, and FN indicates the number of undetected floating objects.
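As a short worked example of these two metrics, the sketch below computes P and R from detection counts. The counts are made up for illustration and are not results from this paper.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Return (P, R) with P = TP / (TP + FP) and R = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts: 90 correct detections, 10 false alarms, 30 misses.
p, r = precision_recall(tp=90, fp=10, fn=30)
```

With these counts, P is 90/100 and R is 90/120, which shows how a detector can raise few false alarms yet still miss a quarter of the objects.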
In this paper, we replaced the SE module in the network with CBAM before training SSD-MobileNetV3 and added the small target data augmentation (STDA) module during network training. An ablation study was conducted for the above improvements and the experimental results are shown in
Table 3.
Here, ✓ indicates that the CBAM or STDA has been added. As can be seen from the experimental data, the values of the indicators are significantly lower when no improvements are made to the model. When CBAM and STDA were added, P improved by 2.01 and 2.48%, R improved by 3.44 and 2.10%, and AP improved by 2.66 and 4.76%, respectively. When both modules were added at the same time, the three metrics improved by 3.34, 3.41, and 5.53%, respectively. This shows that including CBAM in the network and using the STDA method proposed in this paper can effectively improve the model detection accuracy.
To verify the effectiveness of the improvements, we deployed SSD, SSD-MobileNetV3, and the improved methods on edge devices for experiments. In all, 1000 data items were used as the validation dataset, including 800 small target and 200 regular data items. The evaluation metrics included P, mean average precision (mAP) (0.5), mAP (0.75), and frames per second (FPS). P is the detection accuracy; the two evaluation metrics, mAP (0.5) and mAP (0.75), are set according to different intersection over union (IOU) thresholds; and FPS is the number of image frames per second detected.
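The role of the IOU threshold in mAP (0.5) and mAP (0.75) can be illustrated with a small example. The box format (x1, y1, x2, y2) and the coordinates below are assumptions chosen for illustration.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (2, 0, 12, 10)          # shifted two pixels to the right

overlap = iou(ground_truth, prediction)
match_at_050 = overlap >= 0.5        # counted as a correct detection
match_at_075 = overlap >= 0.75       # rejected under the stricter threshold
```

The same prediction therefore contributes a true positive to mAP (0.5) but a false positive to mAP (0.75), which is why the stricter metric rewards tighter localization.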
Through edge-testing experiments, SSD still maintained high detection precision, while SSD-MobileNetV3, which replaces the computationally intensive VGG16 network, achieved significantly higher detection speed. In Table 4, the addition of the key-frame extraction module and image-preprocessing module introduced additional computation into the system but still gave good results in terms of speed and accuracy. The experimental data show that our method improved detection accuracy by 2.9% and 5.5% compared to the other two methods, and detection speed by 55% compared to SSD. A detection speed of 33 frames per second meets the real-time requirements at the edge.