This section analyzes the ensemble of neural network pipelines developed in this research study.
Figure 1 shows the schematic of the proposed framework for our waste management system. The input images are fed into the preprocessing network, where normalization, flipping, subsetting, and cropping are performed. The preprocessed images are then passed on to the four neural network modules of waste detection, classification, volume calculation, and organic content estimation, using Mask Region-Based Convolutional Neural Network (RCNN) [15], You Only Look Once (YOLO) [6], Very Deep Convolutional Neural Network (VGG16) [24], and Structure from Motion (SFM) [25] models, respectively. All the modules function independently and are integrated into our fully functional mobile app.
3.1. Data Preparation
To train the different neural network models for waste management, we combined two recently released trash image datasets [5]. The datasets contain images of six object categories, namely plastic, metal, glass, paper, cardboard, and other organic trash. Conventional trash datasets do not account for the varied forms of organic waste, which can range from food scraps such as orange peels to hospital waste; our dataset rectifies this issue so that the model can classify trash accurately. We included images of trash from various perspectives, as we cannot account for user behavior, and the different angles and sizes of trash enable the neural pipeline to train and predict robustly. There are 2527 labeled images in our dataset.
We ensure that the trash classes are balanced, allowing our classification network to achieve greater accuracy and to address trash variance across different user environments. The 2527 images are split in a 70:15:15 ratio for training, validation, and testing, respectively. The experiments were run on an Ubuntu-based system with 16 GB RAM, 1 TB HDD storage, and an NVIDIA GeForce RTX 3090 as the primary GPU, using Python 3.7+ for the neural network models.
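The split procedure is not described in further detail; a minimal sketch of a stratified 70:15:15 split, assuming a directory-per-class layout (the paths and random seed are illustrative), is given below.

```python
# Minimal sketch of the 70:15:15 stratified split; directory layout is assumed.
from pathlib import Path
from sklearn.model_selection import train_test_split

DATASET_DIR = Path("dataset")  # hypothetical root: dataset/<class>/<image>.jpg
paths, labels = [], []
for class_dir in DATASET_DIR.iterdir():
    if class_dir.is_dir():
        for img in class_dir.glob("*.jpg"):
            paths.append(img)
            labels.append(class_dir.name)

# Carve out 70% for training, then split the remaining 30% in half
# (15% validation, 15% test), stratifying to preserve class balance.
train_x, rest_x, train_y, rest_y = train_test_split(
    paths, labels, test_size=0.30, stratify=labels, random_state=42)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.50, stratify=rest_y, random_state=42)

print(len(train_x), len(val_x), len(test_x))  # roughly 1769 / 379 / 379 of 2527
```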
3.2. Waste Image Segmentation and Detection
The next important step was to perform waste detection and bounding box generation around the waste pile or litter object. Image segmentation was performed initially with Otsu thresholding, which was applied to the waste images to separate foreground from background and make the waste region more prominent for further analysis and implementation.
Figure 2 is the resultant image after performing the Otsu method on an input image depicting a lineup of litter piles along bayside regions. Otsu's method is a well-known image segmentation technique that selects the threshold minimizing the weighted intra-class (within-class) variance, which is equivalent to maximizing the between-class variance computed from the histogram bins, as shown in Figure 3.
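The paper does not list the thresholding code; a minimal sketch of Otsu thresholding with OpenCV, using an illustrative file name, would look as follows.

```python
# Minimal Otsu thresholding sketch with OpenCV; the file name is illustrative.
import cv2

gray = cv2.imread("waste_pile.jpg", cv2.IMREAD_GRAYSCALE)

# With THRESH_OTSU set, OpenCV ignores the threshold argument (0 here) and
# selects it automatically from the image histogram.
thresh_val, binary = cv2.threshold(gray, 0, 255,
                                   cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print("Otsu threshold:", thresh_val)
cv2.imwrite("waste_pile_otsu.png", binary)  # foreground/background separation
```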
The next step in image segmentation was edge thresholding, accomplished in our research using a multiscale LoG detector, also known as the Marr–Hildreth edge detector. It uses a threshold value that tracks multi-level intensity variation; an intensity variation produces a peak or a drop in the scaled response. The Gaussian and Laplacian filters are jointly applied as a single differential operator, with the scale fine-tuned. After convolution, the zero-crossing areas of the output image highlight the regions of intensity change. The sigma values of the Gaussian filters were experimented with and tuned to obtain the optimal threshold value. The image is first smoothed with the Gaussian filter and then passed through the Laplacian filter, finally giving the output shown in Figure 4.
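A minimal sketch of this Gaussian-Laplacian (Marr–Hildreth) step is shown below; the sigma value is an assumption to be tuned, and SciPy's gaussian_laplace stands in for the exact operator used.

```python
# Marr-Hildreth (LoG) sketch: Gaussian-Laplacian filtering followed by
# zero-crossing detection. Sigma is an assumed value to be tuned.
import cv2
import numpy as np
from scipy.ndimage import gaussian_laplace

gray = cv2.imread("waste_pile.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float64)
log = gaussian_laplace(gray, sigma=2.0)  # LoG response at a single scale

# A pixel is an edge candidate where the LoG response changes sign between
# horizontally or vertically adjacent pixels (zero crossing).
signs = np.sign(log)
zero_cross = np.zeros(gray.shape, dtype=np.uint8)
zero_cross[:-1, :] |= (signs[:-1, :] != signs[1:, :]).astype(np.uint8)
zero_cross[:, :-1] |= (signs[:, :-1] != signs[:, 1:]).astype(np.uint8)
cv2.imwrite("waste_pile_log_edges.png", zero_cross * 255)
```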
Finally, Mask RCNN is used for trash detection on images with dimensions of 1024 × 1024 × 3. Mask RCNN automatically performs image segmentation with bounding box and contour identification, followed by object detection. This deep neural network architecture first generates a feature map for obtaining the regions of interest in the image, and the proposed regions are then mapped to object classes using ROIAlign. In our model, a learning rate of 0.001 and a learning momentum of 0.9 were used to train the data. A 28 × 28 mask was applied with a mask pool size of 14. The model was trained using 17 steps per epoch with a validation step count of 50 on 200 regions of interest for each image. The segmentation masks (visible as different colored sections) after applying Mask RCNN are displayed in Figure 5, Figure 6 and Figure 7, which accurately identify the waste pile location or individual trash items in cases of no pileups. Figure 5 depicts a case where detection is performed on a collective waste pile, in contrast to Figure 6, in which distributed waste piles are identified. To make the system more diversified, Figure 7 shows a case where individual pieces of garbage are present. This makes our detection model more robust, covering the various situations in which garbage can be identified in real time. The final accuracy of the Mask RCNN model was determined to be 92%, with an average loss of 0.452.
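The hyperparameters above can be summarized in a configuration sketch; the snippet assumes the widely used Matterport Mask R-CNN implementation (the paper does not state which implementation was used), and the class count and log directory are illustrative.

```python
# Hyperparameter sketch for the detection module, assuming the Matterport
# Mask R-CNN implementation; class count and log directory are illustrative.
from mrcnn.config import Config
from mrcnn import model as modellib

class WasteConfig(Config):
    NAME = "waste"
    IMAGE_MIN_DIM = 1024
    IMAGE_MAX_DIM = 1024            # 1024 x 1024 x 3 inputs
    LEARNING_RATE = 0.001
    LEARNING_MOMENTUM = 0.9
    MASK_SHAPE = [28, 28]           # 28 x 28 predicted masks
    MASK_POOL_SIZE = 14
    TRAIN_ROIS_PER_IMAGE = 200      # 200 regions of interest per image
    STEPS_PER_EPOCH = 17
    VALIDATION_STEPS = 50
    NUM_CLASSES = 1 + 1             # background + waste (assumed)

config = WasteConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs/")
```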
3.3. Waste Classification and Real-Time Labeling
Waste classification is one of the most critical aspects of the proposed application. We primarily used the YOLO v5 framework to carry out the prediction.
YOLO v5 is a lightweight convolutional neural network (CNN) for real-time object detection with improved accuracy. YOLO v5 uses a single neural network to process the image: the image is divided into segments, and bounding box coordinates and class probabilities are predicted for each segment. The predicted probabilities are then used to weigh the detected bounding boxes. The method requires only one forward propagation through the neural network to make the predictions, followed by non-max suppression for the final prediction results.
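A minimal inference sketch with YOLO v5 via torch.hub illustrates this single-pass behavior; the custom weights file and image name are illustrative.

```python
# Single-pass YOLO v5 inference sketch via torch.hub; the custom weights file
# "waste_yolov5.pt" and the image name are illustrative.
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="waste_yolov5.pt")
model.conf = 0.25  # confidence threshold applied before non-max suppression

results = model("litter_scene.jpg")   # one forward pass + NMS internally
results.print()                       # class labels with confidence scores
boxes = results.xyxy[0]               # (x1, y1, x2, y2, confidence, class) rows
```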
The YOLO v5 architecture can be divided into three parts as follows:
Backbone: The backbone is the key element for feature extraction given an input image. Robust and highly useful image characteristics can be extracted from an input image in this backbone layer using CSPNet (Cross Stage Partial Network), which is used as a backbone in YOLO v5.
Neck: The neck is specifically used to create the feature pyramids, which are useful for object scaling. This is particularly useful for detecting objects of varied scales and sizes. PANet is used as the neck layer in YOLO v5 for obtaining the feature pyramids.
Head: The model head is the final detection module. It makes use of anchor boxes for the construction of resultant output vectors with their classification scores, bounding boxes, and class probabilities.
The whole network is an ensemble system, with several learners comprising the backbone, neck, and head of the neural network. The image first enters a custom neural network and is resized to a 608 × 608 pixel resolution; this resolution preserves enough detail for accurate detection and is easily produced by most handheld devices. The network itself makes the image compatible for further processing. The image is separated into color planes and passed into the ensemble pipeline. One part of the neural network has an architecture of six convolutional layers with max pooling blocks, extracting valuable information at every layer. The max pooling layers reduce the spatial information at each successive layer, making the representation suitable for convergence and for use by the YOLO network in the next part of the pipeline. In this part of the neural pipeline, the image first passes through the 53 Darknet layers, which act as an additional feature detector for the last part of the network. The previously learned features and the currently identified ones comprise the backbone of the network, enhancing the accuracy of correct detection; this is a typical example of deep transfer learning. Simultaneously, another parallel pipeline processes the same 608 × 608 pixel image using a different procedure.
The three losses, Loss1, Loss2, and Loss3, are the box, classification, and objectness losses, respectively. The box loss represents how precisely the algorithm locates the object center and how well the predicted bounding box covers the object. The classification loss measures the accuracy of correct class prediction by the algorithm. Objectness is essentially a probability measure of the existence of an object in a proposed region of interest; the higher the objectness, the greater the likelihood that a window contains the object.
where $h_i$, $w_i$, $x_i$, and $y_i$ are the height, width, and centroid coordinates, respectively, of the specific anchor box. The aggregate total loss is calculated as $\mathrm{Loss}_{\mathrm{Total}} = \mathrm{Loss}_1 + \mathrm{Loss}_2 + \mathrm{Loss}_3$. $c_i$ is the resultant computed confidence score of object $p_i(c)$, which pertains to the classification loss. The parameters with hats correspond to the estimated values, and $c$ denotes the respective classes. $\varepsilon_{ij}^{\mathrm{obj}}$ is 1 only when there is an object in the grid cell and 0 in all other cases.
The second learner primarily uses the EfficientNet network, a more carefully scaled network than the previous neural pipeline. EfficientNet balances the network width and depth and optimizes the input resolution for better accuracy, using a mixture of standard convolutions and mobile inverted bottleneck convolutions, and in practice it has outperformed several well-known networks. After both pipelines (in ensemble terms, learners) have processed all the convolutions, their outputs are passed into a decider, which considers each system's weaknesses: if one of the pipelines misses an important feature, the decider adjusts the weights accordingly. After several iterations of this process, the decider converges to a standard weight value, detecting the correct features of the trash in a scene image. Initially, InceptionV3 was also incorporated in this pipeline, but it acted as an inhibitor in the current neural network and decreased the accuracy. Thus, after the decision and a series of forward and backward propagations, the entire system produced a feature map of 7 × 7 × 256, with appreciable downsampling of the image structure. The final part of the network passed this map through a further set of convolutions to obtain a 7 × 7 × 21 output, converting the representation into the tensor form needed to establish clear bounding boxes.
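The decider can be illustrated with a small fusion sketch; the second backbone (ResNet-18) is only a stand-in for the custom CNN and Darknet pipeline, and the learnable blend weight is an assumption about how the decider adjusts the learners' contributions.

```python
# Hedged sketch of the two-learner "decider" idea: class scores from an
# EfficientNet learner and a stand-in second learner are blended with a
# learnable weight. Module choices are assumptions, not the exact architecture.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0, resnet18

NUM_CLASSES = 6  # plastic, metal, glass, paper, cardboard, organic

class Decider(nn.Module):
    def __init__(self):
        super().__init__()
        self.learner_a = efficientnet_b0(num_classes=NUM_CLASSES)
        self.learner_b = resnet18(num_classes=NUM_CLASSES)  # stand-in learner
        self.alpha = nn.Parameter(torch.tensor(0.5))         # learnable fusion weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.sigmoid(self.alpha)             # keep the blend weight in (0, 1)
        return w * self.learner_a(x) + (1 - w) * self.learner_b(x)

logits = Decider()(torch.randn(1, 3, 608, 608))   # 608 x 608 input, as in the text
```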
The last part of the YOLO framework now comes into play: in contrast to the previous part, which acted as the feature detector augmented with the additional rectified features from EfficientNet, the weights are passed into the object detector part of the YOLO v5 framework. The successive 53 layers allow widespread detection of small, medium, and large objects. This is crucial, as the user might supply an image of a trash pile from a variable distance. The object detector is efficient enough to categorize a variety of trash types from a pile with an accuracy of 93.65%. Several fully connected layers sit at the end of the network, ending in the total number of possible classification classes in the data. The bounding boxes created by the network precisely delineate the blobs and generate the labels classified in real time, according to the final deciding output of the entire network, depicted in Figure 8. In this figure, the model identifies recyclable waste items from a trash pile; the network outputs bounding boxes with the identified class and its confidence score. The MobileNetV3 and Detectron2 models were also tested for the waste classification module to compare the candidates and select the best one. MobileNetV3 is tuned for mobile devices, which is appropriate for the use case of this research. It internally uses AutoML and leverages two primary techniques, MnasNet and NetAdapt: it first searches for a coarse architecture using the former, applying reinforcement learning to choose an optimal configuration, and the latter then fine-tunes the model, trimming underutilized network activation channels in small decrementing steps at every iteration. Detectron2 is a powerful network model. Its backbone network is a Base-RCNN-FPN that extracts feature maps efficiently and comprehensively. The next part of the network is the region proposal subnetwork, which detects object regions from multiscale features. The final part is the box head component, which warps and crops the feature maps produced by the previous components and obtains fine-tuned boxes locating the region and object of interest.
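For the Detectron2 baseline, a standard model-zoo setup along the following lines could be used; the configuration file and score threshold are illustrative rather than the exact settings employed.

```python
# Hedged sketch of a Detectron2 baseline using the standard model zoo; the
# chosen config (Mask R-CNN R50-FPN) and threshold are illustrative.
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("litter_scene.jpg"))
print(outputs["instances"].pred_classes, outputs["instances"].scores)
```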
3.4. Organic Waste Estimation
Waste management is an important problem worldwide. In addition, the estimation and classification of waste as organic or recyclable is another important task that is difficult to annotate manually. The approach used in this paper applied a convolutional neural network model to a household waste dataset consisting of 22,500 samples, split into 18,000 images for training and 4500 images for validation. The VGG16 network architecture is used as the base for model training, and the model architecture is shown in Figure 9. The classification head takes an input shape of (7, 7, 512), corresponding to the VGG16 convolutional feature map, where the last index is the number of channels. Transfer learning has been implemented in this network by adding one flattened layer and three dense, normalization, dropout, and activation layers, for a total of 41,564,993 trainable parameters. The model was then compiled with binary cross-entropy loss and the OPT optimizer. The results were impressive, with a training loss and AUC of 0.184 and 0.9779, respectively, and a validation loss and AUC of 0.33 and 0.9399, respectively. The accuracy on validation data was 95.60%, and the accuracy on test data was 94.98%.
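A transfer-learning sketch of this head is shown below; the dense layer widths and the Adam optimizer are stand-ins, since the text specifies only the layer types, the loss, and the "OPT" optimizer name.

```python
# Hedged sketch of the VGG16 transfer-learning head; dense widths and the Adam
# optimizer are assumed stand-ins, not the exact configuration from the text.
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import VGG16
from tensorflow.keras.metrics import AUC

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # transfer learning: freeze the base

x = layers.Flatten()(base.output)           # (7, 7, 512) feature map flattened
for units in (512, 256, 128):               # three dense blocks (widths assumed)
    x = layers.Dense(units)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Dropout(0.3)(x)
output = layers.Dense(1, activation="sigmoid")(x)  # organic vs. recyclable

model = Model(base.input, output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=[AUC()])
```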
The images were successfully separated into recyclable and organic waste categories. In the application, the predicted class is presented to the cleaner along with the geotagged location and volumetric content of the waste.
3.5. Volumetric Analysis of Waste Using STL and Point Cloud Models via SFM
The volumetric analysis of the waste pile was an essential component of the research. The Structure from Motion (SFM) approach was used to model the 2D to 3D reconstruction of the waste pile. The research considered the possibility of waste being in a pile or scattered as litter. If a waste pile is detected, the user is prompted to take a short video clip of the waste pile with a 360-degree view. Our model then captures 100 frames from the video clip and uses them in the SFM module. SFM is a 2D to 3D reconstruction technique that uses a series of frames or images to reconstruct a 3D view of the object; point clouds are used in SFM to generate the 3D model. Another important reason for selecting the SFM model was its minimal dependence on high-resolution camera equipment: everyday smartphone cameras can easily capture the images fed into our SFM model. If the images have a sufficient degree of overlap, the SFM model yields accurate results, and this is ensured by the user-shot, 360-degree video of the waste pile. The first stage of SFM is matching image features using SIFT or SURF methods and estimating the distances between them. This is followed by the important stage of recovering the structure by calculating the camera positions and orientations.
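The feature-matching stage can be sketched as follows with SIFT and Lowe's ratio test; the frame file names and the ratio threshold are illustrative.

```python
# Minimal sketch of the first SFM stage: feature matching between two frames
# with SIFT and Lowe's ratio test; frame file names are illustrative.
import cv2

img1 = cv2.imread("frame_000.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_001.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # ratio test
print(f"{len(good)} putative correspondences between the two frames")
```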
The image features and parameters are related to the scene structure, whereas the camera positions are referred to as motion; hence, the term “Structure from Motion” was coined. A point cloud is essentially a collection of data points in three-dimensional space.
Figure 10 shows a point cloud representation of a trash pile. Since volume is an essential morphological characteristic of an object, estimating volume from point cloud structures is key. The 3D surface construction uses dense point cloud generation, shown in Figure 11 and Figure 12. The point cloud generated from the SFM methodology is used in our work for volume estimation: the edge points of the top-most and bottom-most pixels of the generated point cloud were measured to approximate the height of the waste pile. The height obtained was fed into our STL and Trimesh model. A conical mesh was created and stored in STL format, serving as our image mask. Using the Pillow Python library, one of the waste pile images was read, and after surface construction (Figure 11 and Figure 12) and plugging in the height value, the Python Trimesh library was used to estimate the volume of the waste pile, as shown in Figure 13. The volumetric estimation pipeline had an accuracy of 85.3%.
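A sketch of the conical-mesh volume estimate with Trimesh is given below; the base radius and the point cloud file are assumed inputs, since the text only specifies how the height is obtained.

```python
# Hedged sketch of the conical-mesh volume estimate: the pile height comes from
# the extremes of the SFM point cloud, while the base radius is an assumed
# input (the text does not state how it is derived).
import numpy as np
import trimesh

points = np.load("waste_pile_points.npy")        # illustrative: N x 3 point cloud
height = points[:, 2].max() - points[:, 2].min() # top-most minus bottom-most point

base_radius = 0.8                                # assumed for illustration
cone = trimesh.creation.cone(radius=base_radius, height=height)
cone.export("waste_pile_mask.stl")               # stored as the STL image mask

print(f"Estimated pile volume: {cone.volume:.3f} cubic units")
```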
3.6. WDCVA App Module Integration
Each application user falls into one of three categories: admin, worker, and citizen. The user must first log in or sign up using phone number authentication, regardless of the user type. Phone authentication is powered by Firebase Authentication, which validates the phone number and returns an OTP to verify the user. Apart from the phone number, the user must submit personal information, including name, age, and gender. All this information is stored in Firestore as user-specific documents, which are used to categorize the user into the above-mentioned types. Hence, when a user logs in, if the phone number is found in the database, the app displays information according to the user type.
For citizens, the landing page displays a dashboard where the user can track the progress of previously submitted reports; this provides transparency and encourages citizens to submit more reports. There is also a quick-access floating button that allows citizens to make a quick report. When the button is pressed, the application displays a screen where the user can capture a picture of the litter to be reported; the image can be taken from phone storage or captured through the camera. After the image is chosen, the app shows a preview, and once the citizen is satisfied with the image, the citizen can report the litter. When the report is submitted, the user's geolocation is tagged with it, enabling workers to find the litter. The report details are submitted to a custom Flask API, which analyzes the submitted image at the backend, and the analysis is returned to the citizen as proof of submission.
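A minimal sketch of such a report endpoint in Flask could look as follows; the route, field names, and the analyze_image helper are illustrative, not the authors' actual API.

```python
# Hedged sketch of a report-submission endpoint; route, field names, and the
# analyze_image helper are illustrative placeholders.
from flask import Flask, request, jsonify

app = Flask(__name__)

def analyze_image(image_bytes: bytes) -> dict:
    # Placeholder for the detection / classification / volume pipeline.
    return {"classes": [], "volume": None}

@app.route("/report", methods=["POST"])
def submit_report():
    image = request.files["image"].read()   # litter photo from the app
    lat = request.form.get("latitude")
    lon = request.form.get("longitude")     # geotag attached to the report
    analysis = analyze_image(image)
    return jsonify({"location": {"lat": lat, "lon": lon},
                    "analysis": analysis,
                    "status": "pending"})

if __name__ == "__main__":
    app.run()
```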
The reports are stored separately in Firestore as documents and displayed in the user interface, as shown in Figure 14, comprising the identified trash contents, the type of garbage identified, and the confidence score. The interface also shows the volume calculated for the localized garbage in the image. Apart from containing the location and picture of the litter, the report document also contains a status field, which can hold the values pending, in progress, and cleaned, allowing the litter cleaning process to be tracked.
For workers, the dashboard displays a map with pins marking the locations of litter reports that are still pending. The workers can choose the reports closest to them and start cleaning those locations. It is the worker's responsibility to mark the report as "in progress" when work has started and as "cleaned" upon completion.