*Article* **Hybrid Deep Learning Algorithm for Forecasting SARS-CoV-2 Daily Infections and Death Cases**

**Fehaid Alqahtani 1, Mostafa Abotaleb 2,\*, Ammar Kadi 3,\*, Tatiana Makarovskikh 2, Irina Potoroko 3, Khder Alakkari 4,\* and Amr Badr 5**


**Abstract:** The prediction of new cases of infection is crucial for authorities to get ready for early handling of the virus spread. This research presents the analysis and forecasting of epidemic patterns in new SARS-CoV-2 positive cases using a hybrid deep learning algorithm. The hybrid deep learning method is employed for improving the parameters of long short-term memory (LSTM). To evaluate the effectiveness of the proposed methodology, a dataset was collected based on the recorded cases in the Russian Federation and Chelyabinsk region between 22 January 2020 and 23 August 2022. In addition, five regression models were included in the conducted experiments to show the effectiveness and superiority of the proposed approach. The achieved results show that the proposed approach could reduce the root mean square error (RMSE), relative root mean square error (RRMSE), mean absolute error (MAE), and mean bias error (MBE) and increase the coefficient of determination (R Square) and coefficient of correlation (R) when compared with the five base models. The achieved results confirm the effectiveness, superiority, and significance of the proposed approach in predicting the infection cases of SARS-CoV-2.

**Keywords:** hybrid deep learning; time series; LSTM; Stacked LSTM; CNN-LSTMs; BDLSTM; CNN; GRU; modeling; SARS-CoV-2

**MSC:** 35-00; 35-01; 35-02; 35-03; 35-04; 35-06; 35-11

### **1. Introduction**

The outbreak of the coronavirus infection known as SARS-CoV-2 was reported in Wuhan city, China, in December 2019, and it spread to more than 200 countries in less than a year [1]. The World Health Organization (WHO) named the disease COVID-19, which stands for "Coronavirus Disease 2019"; the causative virus, a successor of the previously known severe acute respiratory syndrome coronavirus (SARS-CoV), is identified in short as SARS-CoV-2 [2]. There have been regular restrictions to prevent the infection from spreading in all countries, including Russia. In almost all of the countries currently being impacted by the SARS-CoV-2 pandemic, the rate at which patients are becoming infected with and succumbing to the disease is alarmingly high [3]. The treatment of patients who required intensive care was one of the most influential factors in determining the death and case rates associated with SARS-CoV-2. A significant challenge for healthcare systems all over the world is posed by the administration of SARS-CoV-2 treatment to patients who require acute or critical respiratory care [4].


Artificial intelligence and machine learning, two non-clinical computer-aided rapid fixes, are needed to battle SARS-CoV-2 and halt its global expansion [5]. Intelligent healthcare is increasingly relying on AI, in particular machine learning algorithms [6]. More and more, these technologies are referred to as the brains of intelligent healthcare services [7]. Deep learning, a kind of machine learning in artificial intelligence, comprises networks that can learn from unstructured or unlabeled data without supervision [8]. SARS-CoV-2 forecasting is just one of the numerous applications that have heavily incorporated deep learning [9]. These solutions are also required in order to prevent the disease from becoming more widespread. Techniques for making predictions about the future are based on the evaluation of the past [10]. People are under the impression that nothing will be the same as it was before as a result of the widespread coronavirus pandemic, which has numerous global implications. The three most significant things being explored at the moment are figuring out the causes, implementing preventative measures, and attempting to develop an effective cure [11]. In Russia, there were more than 20 million confirmed cases and 386 thousand death cases as of September 2022 [12].

Continued research is being conducted on related diseases, as well as on public health policies and containment mechanisms. Quarantine procedures vary from nation to nation, but their overall goal is the same: to slow or stop the spread of infectious diseases in order to keep hospitals operational and able to meet the rising demand for medical care [13]. If the number of patients diagnosed with SARS-CoV-2 continues to rise, it is possible that healthcare facilities will be unable to meet the needs of their patients and provide the services they require; this is the worst-case scenario that can be anticipated. It is crucial that the nations' health capabilities be used properly and that the demand for the supplies needed for medical infrastructure be predictable, with infection rates also taken into consideration [14]. In this regard, it is recommended that public health strategies be developed and implemented [15].

As a consequence, deep learning (DL) models are considered precise tools that may aid in the development of prediction models [16]. Although several neural networks (NNs) have been reported in the past, the recurrent neural network (RNN) and the long short-term memory (LSTM) network are the ones explored in the SARS-CoV-2 forecasting process because they utilize temporal data [17]. Deep learning networks, such as RNN and LSTM, were utilized in this investigation; these networks were selected because, by analyzing time series data, they are able to provide an accurate forecast of what will occur in the future [18]. An SIR model is a type of epidemiological model that estimates the total number of people in a closed community who could potentially become infected with an infectious illness over a period of time. This category of models gets its name from the fact that they use coupled equations relating the number of susceptible people *S*(*t*), the number of infected people *I*(*t*), and the number of recovered people *R*(*t*) to one another; the initial letters of the three compartments (susceptible, infected, and recovered) form the acronym SIR [19].
The simulation of SARS-CoV-2 in the Isfahan province of Iran from 14 February 2020 to 11 April 2020 was the subject of one of the first published articles. The authors of this study made a prognosis of the remaining infectious cases using three different scenarios, which differed from one another in the extent of social distancing required. In spite of the fact that it was able to estimate infectious cases over shorter time intervals, the developed SIR model was not successful in predicting the actual spread and pattern of the epidemic over a longer period of time. Surprisingly, the majority of the published SIR models constructed to predict SARS-CoV-2 for different communities all suffer from the same shortcoming. The SIR models are predicated on assumptions that do not appear to hold in the circumstances surrounding the SARS-CoV-2 epidemic. Therefore, in order to foresee the pandemic, more complex modeling methodologies and extensive knowledge of the biological and epidemiological features of the disease are required [20]. In addition to more conventional methods, two deep learning models have demonstrated a significant amount of success in the forecasting of temporal data. In the first place, recurrent neural networks (RNNs) have been put to use for the processing of time series and sequential data [18]. These networks are also useful for modeling sequence data. RNNs are a type of artificial neural network derived from feed-forward networks that exhibits behavior analogous to that of the human brain [21]. To put it another way, RNNs have the ability to predict outcomes based on sequence data, whereas other algorithms do not. Subsequently, LSTMs, which have complex gated memory units designed to handle the vanishing-gradient problems that limit the efficiency of simple RNNs, have been used [22]. The average prediction errors for SARS-CoV-2 infection cases using machine learning models are roughly equal to those using statistical models, and machine learning algorithms can be used to forecast long-term time series [23]. One study compared the errors of statistical time-series (TS) systems with those of deep learning model (DLM) systems based on LSTM, Bi-LSTM, and GRU; ensemble models produced fewer errors than the DLM models at the level of four countries, and hence the ensemble model outperformed the DLM deep learning models [24].

In this research, we aim to forecast SARS-CoV-2 cases (infections and deaths) in Russia and Chelyabinsk using hybrid deep learning models, which are based on different assumptions about data estimation; the data are split into 80% for training and 20% for testing.

### **2. Related Work**

Researchers have been focusing on X-ray image diagnosis of SARS-CoV-2 and, on the other hand, on using time series models and artificial intelligence for the prediction of daily infection, recovery, and death cases of SARS-CoV-2. X-ray images for SARS-CoV-2 were diagnosed using neural networks. In [25], the authors created a system using five models and deep learning algorithms: Xception, VGG19, ResNet50, DenseNet121, and Inception for binary classification of X-ray images for SARS-CoV-2. In order to aid medical efforts and lessen the strain on medical professionals dealing with SARS-CoV-2, they provided deep learning models and algorithms that have been developed and evaluated. Based on machine learning and deep learning approaches, a survey of recent works on misleading information detection (MLID) in the health sectors is presented in [26]. Other research focused on a database called COVIDGR-1.0, which covers all severity levels, from normal with positive RT-PCR to mild, moderate, and severe. With accuracies of 97.72 ± 0.95%, 86.90 ± 3.20%, and 61.80 ± 5.49% for the severe, moderate, and mild SARS-CoV-2 severity levels, the technique produced excellent and steady results [27]. The use of user-generated data is envisioned as a low-cost method to increase the accuracy of epidemic tolls in marginalized populations; the authors of [28] suggested utilizing the potential of user-posted data on the web. In addition, on social media channels, bogus news about the SARS-CoV-2 epidemic may be automatically classified and located using deep neural networks. In that investigation, the CNN model performed better than the other deep neural networks, with the greatest accuracy of 94.2% [29]. A brand-new interactive visualization system illustrates and contrasts the SARS-CoV-2 pandemic's pace of spread over time in various nations. The method used by the system, called knee detection, splits the exponential spread into many linear components; it may be used to analyze and forecast upcoming pandemics [30]. In [31], the authors provided a technique for extracting implicit responses from huge Twitter collections. Tweets were cleaned up and turned into a vector format that could be used by various machine learning methods. For both informational and non-informational classes, the deep neural network (DNN) classifier had the highest accuracy (95.2%) and F1 score (73.6%). Other research has developed a brand-new relation-driven collaborative learning strategy for segmenting SARS-CoV-2 CT lung infections; extensive experiments demonstrate that using shared information from non-SARS-CoV-2 lesions may enhance current performance by up to 3.0% in the Dice similarity coefficient [32]. A domain-specific Bidirectional Encoder Representations from Transformers (BERT) language model called COVID-Twitter BERT (CT-BERT) has been introduced in recent sentiment analysis research on SARS-CoV-2. CT-BERT does not always perform better at comprehending sentiments than BERT, although one would expect a domain-specific language model to perform better than a broad language model. An auxiliary technique using BERT was developed to address performance concerns with the single-sentence categorization of SARS-CoV-2-related tweets [33].

In our work, we built a hybrid deep learning algorithm, as well as an application that makes use of this algorithm, with the goal of forecasting the number of daily SARS-CoV-2 infections and deaths in the Russian Federation and the Chelyabinsk region. Therefore, we use hybrid deep learning models for modeling and forecasting SARS-CoV-2 daily infection and death cases in Russia and Chelyabinsk. Chelyabinsk is located in the Ural Federal District in central Russia [34]. The most important contribution made by this study is the development of DL prediction models that, when applied to historical and recent data, are capable of producing the most accurate forecasts of confirmed positive SARS-CoV-2 cases and cases in which SARS-CoV-2 was determined to be the cause of death in Russia and Chelyabinsk [35].

### **3. Data and Materials**

When preparing data, deep learning faces some issues with long sequences in a dataset [36]. First, training is time-consuming and demands a lot of memory. Second, back-propagating through extended sequences results in an incorrectly trained model. Data must therefore be prepared and preprocessed before being imported into neural networks; normalization and standardization are two aspects of this preparation. We used a scaling procedure that sets the mean and standard deviation to 0 and 1, respectively [37]. We used daily data on SARS-CoV-2 infection and death cases in the Russian Federation and Chelyabinsk region. The dataset was obtained from the official website of the World Health Organization and covers the period from 22 January 2020 to 23 August 2022. The dataset is then split such that the first eighty percent is used for training while the remaining twenty percent is used for testing (the last 20% of this dataset corresponds approximately to the last 6 months, i.e., the last 190 days). The training dataset was used to train and improve the models, and 20% of the training data was utilized to analyze whether the models were overfitting or underfitting. The performance of the models is evaluated on the test set. Ref. [38] provides both the method and the daily SARS-CoV-2 infection and death case data.
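The preparation just described can be sketched in a few lines of Python. This is a minimal illustration, assuming the scaling statistics are fitted on the training segment only (a standard precaution the text does not spell out); the function and variable names are ours, not from the paper's code.

```python
import numpy as np

def standardize_and_split(series, train_frac=0.8):
    """Scale a 1-D series to zero mean and unit standard deviation,
    then split it chronologically into training and test segments."""
    series = np.asarray(series, dtype=float)
    split = int(len(series) * train_frac)
    # Fit the scaling statistics on the training segment only, so no
    # information from the test period leaks into the preprocessing.
    mu, sigma = series[:split].mean(), series[:split].std()
    scaled = (series - mu) / sigma
    return scaled[:split], scaled[split:], (mu, sigma)
```

With this dataset, `train_frac=0.8` leaves approximately the last 190 days for testing, matching the split described above.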

Figure 1 shows a visual depiction of SARS-CoV-2 infection cases (Figure 1A,C) and death cases (Figure 1B,D) in Russia and Chelyabinsk, respectively. Figure 1A shows that the month with the maximum total infection cases in Russia is February 2021. Figure 1C shows the same situation for infection cases in Chelyabinsk in the same months (February 2021 and 2022); Chelyabinsk had close to 100 thousand infection cases in 2022 when the Omicron variant appeared. We also note an upward trend in the development of death cases in Russia and Chelyabinsk (Figure 1B,D), with the emergence of volatility in death cases during the period. Figure 1B shows that the month with the maximum total death cases in Russia is February 2022; Figure 1D shows that the maximum total number of death cases in November, December, and February in Chelyabinsk exceeded 800 death cases in November 2021. After that month, death cases decreased as a result of precautionary measures taken in both regions. One of the clear patterns in the visual is a similar trend in cases and deaths in both Russia and Chelyabinsk, which reflects the unification of anti-SARS-CoV-2 policies. Using a heatmap enables us to extract further features from the SARS-CoV-2 data.

Figure 2 presents the heatmap for total monthly infection and death cases. Figure 2A shows that the month with the maximum total infection cases in Russia is February 2021, and the same month in 2022 had close to 5 million infection cases when the Omicron variant appeared. Figure 2B shows that the month with the maximum total death cases in Russia is February 2022, and a similar situation occurred in February 2021 when the Delta variant appeared. Figure 2C shows the same situation for infection cases in Chelyabinsk in the same months (February 2021 and 2022); Chelyabinsk had close to 100 thousand infection cases in 2022 when the Omicron variant appeared. Figure 2D shows that the maximum total number of death cases in November, December, and February in Chelyabinsk exceeded 800 death cases in November 2021.

**Figure 1.** Daily infections and death cases SARS-CoV-2 in Russian Federation and Chelyabinsk. (**A**): Daily SARS-CoV-2 infection cases in Russian Federation. (**B**): Daily SARS-CoV-2 death cases in Russian Federation. (**C**): Daily SARS-CoV-2 infection cases in Chelyabinsk. (**D**): Daily SARS-CoV-2 death cases in Chelyabinsk.


**Figure 2.** Heatmap of SARS-CoV-2 total monthly infection and death cases in the Russian Federation and Chelyabinsk.

### **4. Proposed Framework Algorithm and Methodology**

The mechanism underlying our proposed approach for modeling and forecasting SARS-CoV-2 is depicted in Figure 3. The following stages are carried out.

**Figure 3.** Proposed framework schematic schema.

### *4.1. Proposed Framework Algorithm*

First step → Input the time series data of daily infection and death cases into our algorithm, then input the parameters for the deep learning model (number of neural networks, number of epochs, loss function, and optimizer) and start running the algorithm.

Second step → Preprocessing. Training long sequences takes time and memory, and back-propagating through extended sequences creates a poorly trained model, so the data are prepared before being imported into the neural networks. Normalization and standardization are the data preparation steps; we scale the data so that the mean and standard deviation are 0 and 1, respectively.

Third step → Separate the dataset of SARS-CoV-2 infection and death cases into training, validation, and testing sets. Data covering 22 January 2020 to 23 August 2022 were collected from the WHO website. The dataset is divided such that 80% is used for training and 20% for testing (the last 190 days); 20% of the training data is utilized to check for overfitting and underfitting, and the test set is used to evaluate model performance (a windowing sketch that frames the series as supervised samples is given after this step list).

Fourth step → Modeling. In this stage, we execute our algorithm for LSTM, stacked LSTM (LSTMs), bidirectional LSTM (BDLSTM), ConvLSTMs, and the other forecasting models.

Fifth step → Performance and model evaluation.

Sixth step → Forecasting using the best models.
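Before the modeling step can run, the scaled series must be framed as supervised samples: each window of past observations predicts the next day's value. The following sketch shows one common way to do this; the window length `n_steps = 7` is an illustrative assumption, not a value stated in the paper.

```python
import numpy as np

def make_windows(series, n_steps=7):
    """Frame a univariate series as (window, next value) pairs."""
    X, y = [], []
    for i in range(len(series) - n_steps):
        X.append(series[i:i + n_steps])   # n_steps past observations
        y.append(series[i + n_steps])     # the value to forecast
    X = np.asarray(X)[..., np.newaxis]    # shape (samples, n_steps, 1) for RNN layers
    return X, np.asarray(y)
```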

### *4.2. Methodology*

### **(A) LSTM Model (long short-term memory model)**

One of the first and most successful techniques for addressing vanishing gradients came in the form of long short-term memory (LSTM) due to [39].

The "long-term memory" part of the name reflects the fact that simple recurrent neural networks already have long-term memory in the form of weights, which change slowly during training and encode general knowledge about the data. The "short-term memory" part refers to ephemeral activations, which pass from each node to successive nodes. The LSTM model introduces an intermediate type of storage via the memory cell. A memory cell is a composite unit built from simpler nodes in a specific connectivity pattern, with the novel inclusion of multiplicative nodes. A generalized LSTM unit consists of three gates (input, output, and forget). The LSTM transition equations are given as follows [40].

**Input gate:** this gate decides whether or not new information will be added to the LSTM memory. It consists of two layers: (1) a sigmoid layer and (2) a tanh layer. The sigmoid layer defines the values to be updated, and the tanh layer creates a vector of new candidate values that will be added to the LSTM memory. The outputs of these layers are calculated by:

$$i_t = \sigma\left(W^i x_t + U^i h_{t-1} + b^i\right) \tag{1}$$

$$u_t = \tanh\left(W^u x_t + U^u h_{t-1} + b^u\right) \tag{2}$$

where *i<sub>t</sub>* denotes the values to be updated, *u<sub>t</sub>* the new candidate values, σ the sigmoid layer (or nonlinear function), *x<sub>t</sub>* the input at time step *t*, *b* a constant bias, and *h<sub>t−1</sub>* the RNN memory at the previous time step; *W* and *U* are weight matrices.

**Forget gate:** the sigmoid function of this gate is used to decide what information to remove from the LSTM memory. This decision is made mainly based on the values of *h<sub>t−1</sub>* and *x<sub>t</sub>*. The output of this gate is *f<sub>t</sub>*, a value between 0 and 1, where 0 indicates completely eliminating the learned value and 1 indicates preserving the entire value. This output is calculated as:

$$f_t = \sigma\left(W^f x_t + U^f h_{t-1} + b^f\right) \tag{3}$$

where *f<sub>t</sub>* is the output of the forget gate; σ, *x<sub>t</sub>*, *b*, *h<sub>t−1</sub>*, *W*, and *U* are as defined for Equations (1) and (2).

**Output gate:** this gate first uses a sigmoid layer to decide which part of the LSTM memory contributes to the output. It then applies the nonlinear tanh function to map the values to the range [−1, 1]. Finally, the result is multiplied by the output of the sigmoid layer. The following equations give the formulas for calculating the output:

$$o_t = \sigma\left(W^o x_t + U^o h_{t-1} + b^o\right) \tag{4}$$

$$h_t = o_t \tanh(c_t) \tag{5}$$

where *o<sub>t</sub>* is the output gate and *h<sub>t</sub>* is the hidden state, a value in the range [−1, 1].

Combining these layers provides the update to the LSTM memory: the current value is partially forgotten by multiplying the old value *c<sub>t−1</sub>* by the forget gate output, and then the candidate value *i<sub>t</sub>u<sub>t</sub>* is added. The following equation represents this update:

$$c_t = i_t u_t + f_t c_{t-1} \tag{6}$$

where *c<sub>t</sub>* is the memory cell and *f<sub>t</sub>* is the result of the forget gate, a value between 0 and 1, where 0 indicates completely discarding the value and 1 implies completely preserving it. The combination of these units is illustrated in Figure 4.

**Figure 4.** Long-short-term memory layer.
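Equations (1)–(6) can be read directly as code. The following NumPy sketch implements a single LSTM time step; the parameter dictionary `p` (weight matrices `W*`, `U*` and biases `b*`) is a hypothetical container for the trained weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following Equations (1)-(6)."""
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])  # input gate, Eq. (1)
    u_t = np.tanh(p["Wu"] @ x_t + p["Uu"] @ h_prev + p["bu"])  # candidate values, Eq. (2)
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])  # forget gate, Eq. (3)
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])  # output gate, Eq. (4)
    c_t = i_t * u_t + f_t * c_prev                             # memory cell update, Eq. (6)
    h_t = o_t * np.tanh(c_t)                                   # hidden state, Eq. (5)
    return h_t, c_t
```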

### **(B) Stacked LSTM (Stacked long-short-term memory model)**

The stacked LSTM model is an extension of the LSTM model, as it consists of multiple hidden layers where each layer contains multiple memory cells. It was introduced by [41], who found that the depth of the network was more important for model skill than the number of memory cells in a given layer.

A stacked LSTM architecture can be defined as an LSTM model comprised of multiple LSTM layers. Each LSTM layer provides a sequence output rather than a single value output to the subsequent LSTM layer: one output per input time step, rather than a single output for all input time steps. This is illustrated in Figure 5.

**Figure 5.** A stacked LSTM architecture.
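As a minimal sketch of this architecture (assuming TensorFlow/Keras; the layer widths and window length are illustrative, not the paper's configuration, whose hyper-parameters are given in Table 3):

```python
from tensorflow import keras
from tensorflow.keras import layers

n_steps = 7  # assumed window length
# The first LSTM layer returns the full hidden-state sequence (one output
# per input time step), so the second LSTM layer receives a sequence
# rather than a single vector.
model = keras.Sequential([
    layers.LSTM(64, return_sequences=True, input_shape=(n_steps, 1)),
    layers.LSTM(32),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```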

### **(C) Bi LSTM model (Bidirectional long-short-term memory model)**

The Bi-LSTM model puts two independent RNNs together. This architecture allows the network to obtain both backward and forward information about the sequence at each time step [42].

Using a Bi-LSTM runs the inputs in two ways, one from past to future and one from future to past. This approach differs from a unidirectional LSTM in that the LSTM running backward preserves information from the future; using the two hidden states together, the network can at any time hold information from both the past and the future. The calculation of the output *y* at time *t* is given by Equation (7) and illustrated in Figure 6.

$$y\_t = \sigma(\mathcal{W}\_y[h\_t^{\rightarrow}, h\_t^{\leftarrow}] + b\_y) \tag{7}$$

where σ is a nonlinear function, *W<sub>y</sub>* is a weight matrix, *b<sub>y</sub>* is a constant bias, and *h<sub>t</sub>* denotes the forward and backward hidden states.

**Figure 6.** Bidirectional long-short-term memory layer with both forward and backward LSTM layers.


Figure 6 shows how the Bi-LSTM model works: information from past and future time steps (green) flows from the inputs *x<sub>t</sub>*, is collected in the hidden layers *h<sub>t</sub>*, and features are extracted through the nonlinear function σ to predict the output *y<sub>t</sub>* at each moment.
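A minimal Keras sketch of the Bi-LSTM (again with illustrative sizes): the `Bidirectional` wrapper runs one LSTM forward and one backward over each window and concatenates their hidden states, as in Equation (7).

```python
from tensorflow import keras
from tensorflow.keras import layers

n_steps = 7  # assumed window length
model = keras.Sequential([
    # One LSTM reads the window past-to-future, the other future-to-past;
    # their final hidden states are concatenated before the dense output.
    layers.Bidirectional(layers.LSTM(64), input_shape=(n_steps, 1)),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```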

### **(D) GRU model (Gated Recurrent Unit model)**

The Gated Recurrent Unit (GRU) is an advanced and more streamlined version of the LSTM, and it is also a type of recurrent neural network. It uses fewer parameters because it has a reset gate and an update gate, in contrast to the three gates of the LSTM. The update gate and reset gate are essentially vectors that decide which information should be passed to the output [43]. The reset gate controls how much of the previous state to remember, while the update gate controls the extent to which the new state is a copy of the old state. The two gate outputs are given by two fully connected layers with a sigmoid activation function; Figure 7 shows the inputs for both the reset and update gates in the GRU. Mathematically, the outputs are calculated as follows:

$$\mathbf{r}\_{t} = \sigma(\mathbf{W}^{r}\mathbf{x}\_{t} + \mathbf{U}^{r}h\_{t-1} + b^{r}) \tag{8}$$

$$z\_t = \sigma(\mathcal{W}^z \mathbf{x}\_t + \mathcal{U}^z h\_{t-1} + b^z) \tag{9}$$

where *r<sub>t</sub>* is the reset gate, *z<sub>t</sub>* is the update gate, σ is the sigmoid activation function, *W* and *U* are weight parameters, *h<sub>t−1</sub>* is the hidden state of the previous time step, and *b* is a constant bias. Next, we combine the reset gate with the regular update mechanism, which is given mathematically by the following equation:

$$i_t = \sigma\left(W^i x_t + U^i h_{t-1} + b^i\right) \tag{10}$$

**Figure 7.** Gated Recurrent Unit (GRU) layer.

Which leads to the next candidate hidden state:

$$a_t = \tanh\left(W x_t + r_t \, U h_{t-1} + b^h\right) \tag{11}$$

where *a<sub>t</sub>* is the candidate hidden state, tanh is the activation function, *W* and *U* are weight parameters, *r<sub>t</sub>* is the reset gate, *h<sub>t−1</sub>* is the hidden state of the previous time step, and *b* is a constant bias. Finally, we need to incorporate the effect of the update gate, which determines how close the new hidden state *h<sub>t</sub>* is to the old state *h<sub>t−1</sub>* versus the new candidate state *a<sub>t</sub>*. The update gate can be used for this purpose simply by taking element-wise convex combinations of *h<sub>t−1</sub>* and *a<sub>t</sub>*, which leads to the final update equation for the GRU:

$$h\_t = z\_t h\_{t-1} + (1 - z\_t) a\_t \tag{12}$$

where *z<sub>t</sub>* is the update gate, *a<sub>t</sub>* is the candidate hidden state, and *h<sub>t</sub>* is the output hidden state. Figure 7 illustrates this model.
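As with the LSTM, Equations (8), (9), (11), and (12) translate directly into a NumPy sketch of a single GRU time step; the parameter dictionary `p` is a hypothetical container for the trained weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU time step following Equations (8), (9), (11), and (12)."""
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])          # reset gate, Eq. (8)
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])          # update gate, Eq. (9)
    a_t = np.tanh(p["Wh"] @ x_t + r_t * (p["Uh"] @ h_prev) + p["bh"])  # candidate state, Eq. (11)
    h_t = z_t * h_prev + (1.0 - z_t) * a_t                             # convex combination, Eq. (12)
    return h_t
```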

### **(E) Conv and CNN-LSTM Model**

The convolutional neural network consists of two convolutional layers, which allows for spatial feature extraction. A one-dimensional convolution operation is performed over the data flow *x<sup>s</sup><sub>t</sub>* at each time step *t*; a one-dimensional convolution kernel filter acquires the local perceptual domain by sliding over the input [44]. The convolution kernel filtering process can be expressed as follows:

$$Y^s_t = \sigma\left(W_s * x^s_t + b_s\right) \tag{13}$$

where *Y<sup>s</sup><sub>t</sub>* is the output of the convolutional layer, *W<sub>s</sub>* denotes the weights of the filter, *x<sup>s</sup><sub>t</sub>* is the input traffic flow at time *t*, and σ is the activation function.

The CNN-LSTM model is a combination of Conv and LSTM; the input of the CNN-LSTM is a spatial-temporal traffic flow matrix *x<sup>s</sup><sub>t</sub>*, as follows [2]:

$$x^s_t = \begin{bmatrix} x^s_{t-n} \\ x^s_{t-(n-1)} \\ \vdots \\ x^s_t \end{bmatrix} = \begin{bmatrix} f^1_{t-n} & f^1_{t-(n-1)} & \cdots & f^1_t \\ f^2_{t-n} & f^2_{t-(n-1)} & \cdots & f^2_t \\ \vdots & \vdots & \ddots & \vdots \\ f^m_{t-n} & f^m_{t-(n-1)} & \cdots & f^m_t \end{bmatrix} \tag{14}$$

where *x<sup>s</sup><sub>t</sub>* = (*f*<sup>1</sup><sub>*t*</sub>, ..., *f<sup>m</sup><sub>t</sub>*) denotes the traffic flow of the prediction region at time *t*, which represents the historical traffic flow of the point of interest (POI) to be predicted and its neighbors, as shown in Figure 8:

**Figure 8.** CNN-LSTMs Model is combination of Conv and LSTM.

Figure 8 shows how the CNN-LSTM model works: a CNN layer on the front end (left panel) is followed by LSTM layers with a dense layer on the output (right panel). The CNN model extracts features, and the LSTM model interprets them across time steps.
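A minimal Keras sketch of this combination (illustrative sizes; the paper's own hyper-parameters are in Table 3): a `Conv1D` front end extracts local features from each window, an LSTM interprets the resulting feature sequence, and a dense layer emits the forecast.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_steps = 7  # assumed window length
model = keras.Sequential([
    layers.Conv1D(filters=32, kernel_size=3, activation="relu",
                  input_shape=(n_steps, 1)),   # local feature extraction, Eq. (13)
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(32),                           # interpretation over time steps
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```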

### **(F) Adam Optimization Algorithm**

Adam optimization extends stochastic gradient descent to update network weights more efficiently. The method uses adaptive moment estimation for stochastic optimization, which allows the learning rate to adjust over the course of training. Adam is the result of combining two methods (momentum and RMSprop), as shown in Algorithm 1, which presents the method in greater detail, followed by its pseudo-code.

Adam is an algorithm for stochastic optimization with a slightly more efficient order of computation. *g*<sup>2</sup><sub>*t*</sub> indicates the element-wise square *g<sub>t</sub>* ⊙ *g<sub>t</sub>*. Good default settings for the tested machine learning problems are α = 0.001, β<sub>1</sub> = 0.9, β<sub>2</sub> = 0.999, and ε = 10<sup>−8</sup>. All operations on vectors are element-wise. β<sup>*t*</sup><sub>1</sub> and β<sup>*t*</sup><sub>2</sub> denote β<sub>1</sub> and β<sub>2</sub> to the power *t* [19].

**Algorithm 1:** Adam algorithm for stochastic optimization [19].

```text
Require: α: stepsize
Require: β1, β2 ∈ [0, 1): exponential decay rates for the moment estimates
Require: f(θ): stochastic objective function with parameters θ
Require: θ0: initial parameter vector
m0 ← 0   (initialize 1st moment vector)
v0 ← 0   (initialize 2nd moment vector)
t  ← 0   (initialize timestep)
while θt not converged do
    t  ← t + 1
    gt ← ∇θ ft(θt−1)                  (get gradients w.r.t. stochastic objective at timestep t)
    mt ← β1 · mt−1 + (1 − β1) · gt    (update biased first moment estimate)
    vt ← β2 · vt−1 + (1 − β2) · gt²   (update biased second raw moment estimate)
    m̂t ← mt / (1 − β1^t)              (compute bias-corrected first moment estimate)
    v̂t ← vt / (1 − β2^t)              (compute bias-corrected second raw moment estimate)
    θt ← θt−1 − α · m̂t / (√v̂t + ε)    (update parameters)
end while
return θt   (resulting parameters)
```

### **Adaptive Moment Estimation (Adam) Pseudo-code: Adam algorithm for stochastic optimization**

Note: there are two separate beta coefficients, one for each optimization part, and bias correction is applied to each moment estimate. On iteration *t*, the gradients dW and db are computed for the current mini-batch and the parameters are updated as in the following runnable form of the pseudo-code (shown for a weight W; the bias b is updated identically):

```python
import numpy as np

def adam_update(W, dW, v_dW, s_dW, t,
                alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step for parameter W with gradient dW at iteration t >= 1."""
    # Momentum: biased first moment estimate and its bias correction
    v_dW = beta1 * v_dW + (1 - beta1) * dW
    v_dW_corrected = v_dW / (1 - beta1 ** t)
    # RMSprop: biased second raw moment estimate and its bias correction
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2
    s_dW_corrected = s_dW / (1 - beta2 ** t)
    # Combine the two corrected estimates for the parameter update
    W = W - alpha * v_dW_corrected / (np.sqrt(s_dW_corrected) + epsilon)
    return W, v_dW, s_dW
```

**Coefficients:** alpha: the learning rate, 0.001; beta1: momentum weight, defaults to 0.9; beta2: RMSprop weight, defaults to 0.999; epsilon: divide-by-zero failsafe, defaults to 10<sup>−8</sup>.

### **(G) Performance indicators**

To compare the prediction performance of the models used, we calculate the root mean square error (RMSE) between the estimated and actual data:

$$\text{RMSE} = \sqrt{\frac{\sum_{t=1}^{n}(y_t - \hat{y}_t)^2}{n}} \tag{15}$$

where *ŷ<sub>t</sub>* is the forecast value, *y<sub>t</sub>* the actual value, and *n* the number of fitted observations. Calculating relative root mean square error (RRMSE):

$$\text{RRMSE} = \sqrt{\frac{\frac{1}{n}\sum_{t=1}^{n}(y_t - \hat{y}_t)^2}{\sum_{t=1}^{n}\hat{y}_t^2}} \tag{16}$$

Calculating mean absolute error (MAE):

$$\text{MAE} = \frac{1}{n} \sum\_{t=1}^{n} |y\_t - \hat{y}\_t| \tag{17}$$

Calculating mean bias error (MBE):

$$\text{MBE} = \frac{\sum\_{t=1}^{n} (y\_t - \hat{y}\_t)}{n} \tag{18}$$

Calculating Coefficient of correlation (R):

$$R = \frac{\mathrm{Cov}(y_t, \hat{y}_t)}{\sqrt{V(y_t)\, V(\hat{y}_t)}} \tag{19}$$

Calculating Coefficient of determination (R Square):

$$R^2 = 1 - \frac{\sum_{t=1}^{n}(y_t - \hat{y}_t)^2}{\sum_{t=1}^{n}(y_t - \overline{y})^2} \tag{20}$$

The model that has the smallest values of RMSE, RRMSE, MAE, and MBE and the greatest values of R and R Square is the best model.
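The six indicators can be computed in a single helper function. This is a direct transcription of Equations (15)–(20), assuming both inputs are 1-D arrays over the same test period:

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute the six performance indicators of Equations (15)-(20)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))                          # Eq. (15)
    rrmse = np.sqrt(np.mean(err ** 2) / np.sum(y_pred ** 2))   # Eq. (16)
    mae = np.mean(np.abs(err))                                 # Eq. (17)
    mbe = np.mean(err)                                         # Eq. (18)
    r = np.corrcoef(y_true, y_pred)[0, 1]                      # Eq. (19)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # Eq. (20)
    return {"RMSE": rmse, "RRMSE": rrmse, "MAE": mae,
            "MBE": mbe, "R": r, "R2": r2}
```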

### **5. Results**

To prove the effectiveness and superiority of the proposed approach, several experiments were conducted to predict SARS-CoV-2 cases. First, a set of baseline experiments was conducted using six base models: LSTM, stacked LSTM (LSTMs), BDLSTM, GRU, Conv, and CNN-LSTMs. The results of these models were compared with the achieved results using the BDLSTM, LSTM, CNN, and CNN-LSTMs algorithms for daily infections and deaths from *SARS-CoV-2* in Russia and Chelyabinsk, respectively. Table 1 presents the testing results for each of the base models along with the proposed approach based on the adopted evaluation criteria.

**Table 1.** Comparison of the six methods' evaluation on the 20% test set of SARS-CoV-2 daily infection and death cases in the Russian Federation and Chelyabinsk.


As presented in the table, the proposed approach achieves the best values over all the evaluation criteria, which confirms its superiority. The RMSE achieved on the test set using **BDLSTM** for infection cases of **SARS-CoV-2 in Russia** is (**2611.48**); the RRMSE, MAE, R2, r, and MBE on the test set are (**0.11**), (**1417.74**), (**0.99**), (**1**), and (**−59.11**), respectively. The RMSE achieved on the test set using **LSTM** for death cases of **SARS-CoV-2 in Russia** is (**24.46**); the RRMSE, MAE, R2, r, and MBE are (**0.12**), (**20.19**), (**0.99**), (**1**), and (**13.85**), respectively. The RMSE achieved on the test set using **Conv** for infection cases of **SARS-CoV-2 in the Chelyabinsk region** is (**24.69**); the RRMSE, MAE, R2, r, and MBE are (**0.13**), (**14.36**), (**0.96**), (**0.98**), and (**3.86**), respectively. The RMSE achieved on the test set using **CNN-LSTMs** for death cases of **SARS-CoV-2 in the Chelyabinsk region** is (**1.60**); the RRMSE, MAE, R2, r, and MBE are (**0.51**), (**1.29**), (**0.54**), (**0.78**), and (**0.63**), respectively. These values demonstrate the effectiveness of the proposed approach.

Table 2 shows the large difference between the maximum and minimum values of all variables, which affects the shape of their distributions. The estimators (mean, median, mode, and SD) are therefore of limited use, as they are sensitive to these extremes. We notice from the table that the largest difference is for the number of infections in Russia, from 0 to 202,211 cases, which leads to a kurtosis far greater than three (a sharply peaked distribution) and a larger standard error (more difficulty in predicting), with the distribution skewed to the right, since values greater than the average occur more frequently for this variable; the infection variable in Russia took 700 days to move from the lowest value to the largest value. The same holds for infections in Chelyabinsk, with a smaller difference between the maximum and minimum values leading to a smaller SD. As for death cases, we notice a negative kurtosis, which indicates less volatility for both variables and, therefore, a smaller SD than for infection cases, with slight skewness due to the closeness of the values to the arithmetic mean; death cases thus evolved less than infection cases, given the preventive measures taken in these areas.


**Table 2.** Descriptive statistics of SARS-CoV-2.

Table 1 shows that the best model for predicting SARS-CoV-2 infection cases in Russia is BDLSTM because it has the smallest values of RMSE, RRMSE, MAE, and MBE and, therefore, the smallest difference between the real and estimated values. We also note that the model is able to explain the volatility in the variable through the high value of the coefficient of determination (R Square = 99%); there is a nearly perfect linear correlation between the estimated and actual values. Likewise, the best model for SARS-CoV-2 death cases in Russia is LSTM, for SARS-CoV-2 infection cases in the Chelyabinsk region it is Conv, and for SARS-CoV-2 death cases in the Chelyabinsk region it is CNN-LSTMs. These models achieve convergence between the actual and estimated values of the training and test data, and they are able to capture extreme (maximum and minimum) values. This is illustrated by the following figures:

Figure 9 shows the convergence of the actual daily SARS-CoV-2 infections in Russia with the values estimated using the BDLSTM model (training and testing). We notice a close agreement between the actual and estimated data and the ability of the model to capture the volatility in SARS-CoV-2 infections and the structural points; thus, this model can be used to predict daily SARS-CoV-2 infections in Russia.

**Figure 9.** Comparison of the forecasting SARS-CoV2 infection cases and the real infection cases for BDLSTM.

Figure 10 shows the convergence of the actual daily SARS-CoV-2 deaths in Russia with the values estimated using the LSTM model (training and testing). We notice a close agreement between the actual and estimated data and the ability of the model to capture the volatility in SARS-CoV-2 deaths and the trend changes; thus, this model can be used to predict daily SARS-CoV-2 deaths in Russia.

**Figure 10.** Comparison of the forecast SARS-CoV-2 death cases and the real death cases for LSTM.

Figure 11 shows the convergence of the actual daily SARS-CoV-2 infections in the Chelyabinsk region with the values estimated using the CNN model (training and testing). We notice a close agreement between the actual and estimated data and the ability of the model to capture the volatility in SARS-CoV-2 infections and the structural points; thus, this model can be used to predict daily SARS-CoV-2 infections in the Chelyabinsk region.

**Figure 11.** Comparison of the forecasting SARS-CoV-2 infection cases and the real infection cases for CNN.

Figure 12 shows the convergence of the actual daily SARS-CoV-2 deaths in the Chelyabinsk region with the values estimated using the CNN-LSTMs model (training and testing). We notice a close agreement between the actual and estimated data and the ability of the model to capture the volatility in SARS-CoV-2 deaths and the structural points; thus, this model can be used to predict daily SARS-CoV-2 deaths in the Chelyabinsk region. The hyper-parameters of the deep learning models are shown in Table 3.

**Figure 12.** Comparison of the forecast SARS-CoV-2 death cases and the real death cases for CNN-LSTMs.



### **6. Conclusions and Future Research**

In this study, a hybrid deep learning algorithm was used to improve the performance of a standard LSTM network in the analysis and forecasting of SARS-CoV-2 infections and death cases in the Russian Federation and the Chelyabinsk region. This was accomplished by combining traditional LSTM networks with hybrid deep learning models. To demonstrate that the offered strategy is effective, a dataset was gathered for analysis and prediction. The suggested method was evaluated by applying it to datasets obtained from an official data source representative of the Russian Federation and the Chelyabinsk region. Six key performance indicators allow the performance of the suggested methodology to be evaluated and analyzed. In addition, the performance of the suggested method was compared with that of the other five prediction models in order to demonstrate its superiority. The compiled data provided unmistakable evidence that the recommended strategy (hybrid deep learning models) is not only successful but also significantly more advantageous and important. The results can also serve as a reference for the health sector in Russia in particular, for the World Health Organization (WHO), and, more generally, for the health sectors of other nations. As for future research directions, it is planned to enable medium- and long-term forecasting of time series in weakly structured situations, to develop mechanisms for correcting long-term forecasts, to force a set of forecasting models to account for forecasting quality in previous periods, and to consider the possibility of employing nonlinear forecasting models for weakly structured data. All of these, along with the use of additional criteria for the verification of the best models, can be used to expand and enhance the algorithm discussed in this study and to create a new Python package for modeling and forecasting not only SARS-CoV-2 data but any univariate time series data.

**Author Contributions:** Methodology, M.A.; software, M.A. and T.M.; validation, M.A., A.K. and I.P.; formal analysis, F.A.; investigation, A.K.; resources, M.A.; data curation, M.A.; writing—original draft preparation, K.A.; writing—review and editing, F.A.; visualization, A.B.; supervision, M.A.; project administration, M.A.; funding acquisition, M.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The research was supported by RSF grant 22-26-00079.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


## *Article* **Score-Guided Generative Adversarial Networks**

**Minhyeok Lee <sup>1</sup> and Junhee Seok 2,\***

**\*** Correspondence: jseok14@korea.ac.kr

**Abstract:** We propose a generative adversarial network (GAN) that introduces an evaluator module using pretrained networks. The proposed model, called a score-guided GAN (ScoreGAN), is trained using an evaluation metric for GANs, i.e., the Inception score, as a rough guide for the training of the generator. Using another pretrained network instead of the Inception network, ScoreGAN circumvents overfitting of the Inception network such that the generated samples do not correspond to adversarial examples of the Inception network. In addition, evaluation metrics are employed only in an auxiliary role to prevent overfitting. When evaluated using the CIFAR-10 dataset, ScoreGAN achieved an Inception score of 10.36 ± 0.15, which corresponds to state-of-the-art performance. To generalize the effectiveness of ScoreGAN, the model was evaluated further using another dataset, CIFAR-100. ScoreGAN outperformed other existing methods, achieving a Fréchet Inception distance (FID) of 13.98.

**Keywords:** generative adversarial network; image generation; image synthesis; GAN; generative model; Inception score; scoreGAN

**MSC:** 68T45

**1. Introduction**

A recent advancement in artificial intelligence is the implementation of deep learning algorithms to generate synthetic samples [1–3]. These types of neural networks are able to learn how to map inputs to outputs after being trained on large datasets. In the past few years, researchers have used deep learning algorithms to create synthetic samples in various domains such as music, images, and speech [4–6]. One important application of synthetic sample generation is in the field of data augmentation [3,7]. Data augmentation is a technique used in machine learning to increase the size of the training datasets. Synthetic samples can be used to create new data points that are similar to existing data points, but may have different labels or attributes. This can help improve the performance of machine learning algorithms by providing them with more data to train on.

Due to their innovative training algorithm and superb performance in image generation tasks, generative adversarial networks (GANs) have been widely studied in recent years [8–12]. GANs generally employ two artificial neural network (ANN) modules, called a generator and a discriminator, which are trained with an adversarial process to detect and deceive each other. Specifically, the discriminator aims at detecting synthetic samples that are produced by the generator; meanwhile, the generator is trained by errors that are obtained from the discriminator. By such a competitive learning process, the generator can produce fine synthetic samples whose features are remarkably similar to those of actual samples [13,14].

However, the performance evaluation of GAN models is a challenging task since the quality and diversity of generated samples should be assessed from the human perspective [15,16]; furthermore, unbiased evaluations are also difficult because each person can have different views on the quality and diversity of samples.


Therefore, several studies have introduced quantitative metrics to evaluate GAN models in a measurable manner [16,17].

The Inception score is one of the most representative metrics to evaluate GAN models for image generation [16]. A conventional pretrained ANN model for image classification, called the Inception network [18], is employed to assess both the quality and diversity of the generated samples, by measuring entropies of inter- and intra-samples in terms of estimated probabilities for each class. The Fréchet Inception distance (FID) is another metric to measure GAN performance, in which the distance between feature distributions of real samples and generated samples are calculated [17].

From the adoption of the evaluation metrics, the following questions arise: Can the evaluation metrics be used as targets for the training of GAN models, since the metrics reasonably represent the quality and diversity of samples? By backpropagating gradients of the score or distance, is it possible to maximize or minimize them? Such an approach seems feasible since the metrics are generally differentiable; therefore, the gradients can be computed and backpropagated.

However, simply backpropagating the gradients and training with the metrics corresponds, in general, to learning adversarial examples [19,20]. Since the complexity of ANN models is significantly high, we can easily make a sample be incorrectly predicted by adding minimal noise to it; such a noisy sample is called an adversarial example [20]. In short, samples of fine quality and rich diversity have a high Inception score, while the reverse is not always true.

Barratt and Sharma [21] studied this problem and found that directly maximizing the score does not guarantee that the generator produces fine samples. They trained a GAN model to maximize the Inception score; the trained model then produced image samples with a very high Inception score. While the Inception score of real samples in the CIFAR-10 dataset is around 10.0, the produced images achieved an Inception score of 900.15 [21]. However, the produced images were entirely different from the real images in the CIFAR-10 dataset; instead, they looked like noise.

In this paper, to address such a problem and utilize the evaluation metric as a training method, we propose a score-guided GAN (ScoreGAN) that employs an evaluator ANN module using pretrained networks with the evaluation metrics. While the aforementioned problems exist in ordinary GANs, ScoreGAN solves the problems through two approaches as follows.

First, ScoreGAN uses the evaluation metric as an auxiliary target, while the target function of ordinary GANs is mainly used. Using the evaluation metric as the only target causes overfitting of the network used for the metric, instead of learning meaningful information from the network, as shown in related studies [21]. Thus, the evaluation metric is employed as the auxiliary target in ScoreGAN.

Second, in order to backpropagate gradients and train the generator in ScoreGAN, we employ a different pretrained model called MobileNet [22]. This prevents the generator from overfitting on the Inception network. To the best of our knowledge, employing a pretrained MobileNet with an additional score function for the training of the generator has not been explored thus far. Additionally, this approach allows us to validate that the generator has actually learned features, rather than simply memorizing details from the Inception network. In this process, we can assess whether ScoreGAN is able to achieve a high Inception score without using the Inception network, which can prove the effectiveness of ScoreGAN.

The main contributions of this paper are as follows:


- ScoreGAN outperforms other existing methods over the CIFAR-10 and CIFAR-100 datasets, where its Inception score in the CIFAR-10 is 10.36 ± 0.15, and the FID in the CIFAR-100 is 13.98.

### **2. Background**

Generative models aim to learn sample distributions and produce realistic samples. For instance, generative models can be trained with an image dataset; a successfully trained generative model then produces realistic but synthetic images whose features are extremely similar to those of the original images in the training set. The GAN is one of the representative generative models; it uses deep learning architectures and a training algorithm rooted in game theory. In recent years, diffusion models have also been employed as generative models and have demonstrated superior performance [2,23,24]. In Section 2.1, we discuss a variant of the GAN called the controllable GAN, which is the baseline of the proposed model. Additionally, two metrics to assess the images produced by generative models are presented in Sections 2.2 and 2.3.

### *2.1. Controllable Generative Adversarial Networks*

The conventional GAN model consists of two ANN modules, i.e., the generator and the discriminator. The two modules are trained by playing a game to deceive or detect each other [15,25]. The game to train a GAN can be represented as follows:

$$\hat{\boldsymbol{\theta}}\_{\mathcal{D}} = \operatorname\*{arg\,min}\_{\boldsymbol{\theta}\_{\mathcal{D}}} \{ L\_{\mathcal{D}}(1, \boldsymbol{D}(\boldsymbol{X}; \boldsymbol{\theta}\_{\mathcal{D}})) + L\_{\mathcal{D}}(0, \boldsymbol{D}(\mathcal{G}(\mathcal{Z}; \boldsymbol{\hat{\theta}}\_{\mathcal{G}}); \boldsymbol{\theta}\_{\mathcal{D}})) \},\tag{1}$$

$$\hat{\boldsymbol{\theta}}\_{G} = \operatorname\*{arg\,min}\_{\boldsymbol{\theta}\_{G}} \{ L\_{D}(1, D(G(Z; \boldsymbol{\theta}\_{G}); \boldsymbol{\theta}\_{D})) \}\,. \tag{2}$$

where *G* and *D* denote the generator and the discriminator, respectively, *X* is a training sample, *Z* represents a noise vector, *θ* is a set of weights of an ANN model, and *LD* indicates a loss function for the discriminator.

However, the ordinary GAN can hardly produce the desired samples since each feature in a dataset is randomly mapped into each variable of the input noise vector. Therefore, it is hard to discover which noise variable corresponds to which feature. To overcome this problem, conditional variants of GAN that introduce conditional input variables have been studied [26–28].

Controllable GAN (ControlGAN) [29] is one of the conditional variants of GANs; it uses an independent classifier and data augmentation techniques to train the classifier. While a conventional model, called the auxiliary classifier GAN (ACGAN) [28], has an overfitting issue in the classification loss and a trade-off when using the data augmentation technique [29], ControlGAN breaks this trade-off by introducing the independent classifier as well as the data augmentation technique. The training of ControlGAN is performed as follows:

$$\hat{\boldsymbol{\theta}}\_{D} = \operatorname\*{arg\,min}\_{\boldsymbol{\theta}\_{D}} \{ L\_{D}(1, D(\mathbf{X}; \boldsymbol{\theta}\_{D})) + L\_{D}(0, D(\mathbf{G}(\mathcal{Z}, \mathcal{L}; \hat{\boldsymbol{\theta}}\_{\mathcal{G}}); \boldsymbol{\theta}\_{D})) \},\tag{3}$$

$$\hat{\theta}_G = \operatorname*{arg\,min}_{\theta_G}\{L_D(1, D(G(Z;\theta_G);\theta_D)) + \gamma_t \cdot L_C(\mathcal{L}, C(G(Z, \mathcal{L};\theta_G);\hat{\theta}_C))\},\tag{4}$$

$$\hat{\theta}\_{\mathbb{C}} = \operatorname\*{arg\,min}\_{\theta\_{\mathbb{C}}} \{ L\_{\mathbb{C}} (\mathcal{L}, \mathbb{C}(X; \theta\_{\mathbb{C}})) \},\tag{5}$$

where *C* represents the independent classifier, L denotes the input labels, and *γ<sup>t</sup>* is a learning parameter that modulates the training of the generator in terms of the classification loss.

### *2.2. The Inception Score*

To assess the quality and diversity of the generated samples by GANs, the Inception score [16] is one of the most conventional evaluation metrics, which has been extensively employed in many studies [8,14,16,21,26,27,29]. For the quantitative evaluation of GANs, the Inception score introduces the Inception network, which was initially used for image classification [18]. The Inception network is pretrained to solve the image classification task over the ImageNet dataset [30], which contains more than one million images of 1000 different classes; then, the network learns the general features of various objects.

Through the pretrained Inception network, the quality and diversity of the generated samples can be obtained from two aspects [16,21]: First, the high quality of an image can be guaranteed if the image is firmly classified into a specific class. Second, a high entropy in the marginal probability of the generated samples indicates a rich diversity of the samples since such a condition signifies that the generated samples are different from each other.

Therefore, the entropies of the intra- and inter-samples are calculated over the generated samples; then, these two entropies compose the Inception score as follows:

$$IS\left(G\big(\cdot;\hat{\theta}_G\big)\right) = \exp\left(\frac{1}{N}\sum_{i=1}^{N} KL\Big(\Pr\big(Y \mid \hat{X}_i\big) \,\Big\|\, \Pr(Y)\Big)\right),\tag{6}$$

where *X̂* denotes a generated sample, *KL* indicates the Kullback–Leibler (KL) divergence, namely the relative entropy, and *N* is the number of samples in a batch. Since a high KL divergence signifies a large difference between the two distributions, a higher Inception score indicates better quality and a wider variety of samples. Generally, ten sets, each containing 5000 generated samples, are used to calculate the Inception score [16,21].
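For reference, a minimal NumPy sketch of (6) follows, assuming `probs` is the (N, K) matrix of softmax outputs of the pretrained Inception network for the generated samples:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score of Eq. (6) from an (N, K) matrix of class probabilities."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal distribution Pr(Y)
    # KL(Pr(Y|x_i) || Pr(Y)) for each sample, averaged and exponentiated
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```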

### *2.3. The Fréchet Inception Distance*

The FID is another metric to evaluate the generated samples in which the Inception network is employed as well [17]. Instead of the predicted probabilities, the FID introduces the feature distribution of the generated samples that can be represented as the outputs of the penultimate layer of the Inception network.

With the assumption that the feature distribution follows a multivariate normal distribution, the distance between the feature distributions of the real samples and generated samples is calculated as follows:

$$FID\big(X,\hat{X}\big) = \big\|\mu_X - \mu_{\hat{X}}\big\|_2^2 + \mathrm{Tr}\left(\Sigma_X + \Sigma_{\hat{X}} - 2\big(\Sigma_X \Sigma_{\hat{X}}\big)^{1/2}\right),\tag{7}$$

where *X* and *X̂* are the data matrices of the real samples and generated samples, respectively, *μ* denotes the mean vector of a data matrix, and Σ denotes its covariance matrix. In contrast to the Inception score, a lower FID indicates greater similarity between the feature distributions, since the FID measures a distance.
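A corresponding sketch of (7), assuming `feats_real` and `feats_fake` are (N, d) matrices of penultimate-layer Inception features:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_fake):
    """FID of Eq. (7) between two (N, d) feature matrices."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # discard numerical imaginary noise
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```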

### **3. Methods**

In this paper, we propose ScoreGAN, which uses an additional target, derived from the evaluation metrics in Section 2.2. The proposed ScoreGAN uses the Inception score as a target of the generator. However, directly targeting the Inception score leads to an overfitting issue; thus, in ScoreGAN, a pretrained MobileNet is used for the training. Then, the trained model is evaluated with the conventional Inception score and FID using the Inception network. This method is elaborated in Section 3.1. The training details of ScoreGAN are described in Section 3.2.

### *3.1. Score-Guided Generative Adversarial Network*

The main idea of ScoreGAN is straightforward: For its training, the generator in ScoreGAN utilizes an additional loss that can be obtained from the evaluation metric for GANs. Since it has been verified that the evaluation metric strongly reflects the quality and diversity of the generated samples [8,16], it is expected that the performance of GAN models can be enhanced by optimizing the metrics.

Therefore, the architecture of ScoreGAN corresponds to ControlGAN with an additional evaluator; the evaluator is used to calculate the score, then gradients are backpropagated to train the generator. The other neural network structures are the same as those of ControlGAN.

However, due to the high complexity of GANs, it is not guaranteed that such an approach can work properly, as described in the previous section. Directly optimizing the Inception score can cause overfitting over the network that is used to compute the metric; then, the overfitted GANs produce noises instead of realistic samples even if the score of the generated noise is high [21].

In this paper, we circumvent this problem through two different approaches, i.e., employing the metric as an auxiliary cost instead of the main target of the generator and adopting another pretrained network as an evaluator module as a replacement of the Inception network.

### 3.1.1. The Auxiliary Costs Using the Evaluation Metrics

ScoreGAN mainly uses the ordinary GAN cost, with which the adversarial training process is performed, while the evaluation metric is utilized as an auxiliary cost. Therefore, the training of the generator in ScoreGAN is conducted by adding the cost of the evaluation metric to (4). Such a method using an auxiliary cost was introduced in ACGAN [28] and has since been widely studied in many recent works [27], including ControlGAN [29]. These works have demonstrated that auxiliary costs serve as a "rough guide" for a generator to be trained with additional information. The proposed technique using the evaluation metrics in this paper corresponds to a variant of such a method, where the metrics are used as rough guides to generate high-quality samples with rich variety. In short, the generator in ScoreGAN aims at maximizing a score in addition to the original cost, which can be represented as follows:

$$\hat{\theta}_G = \operatorname*{arg\,min}_{\theta_G} \left\{ L_G - \delta \cdot IS\big(\hat{X}\big) \right\},\tag{8}$$

where *L<sub>G</sub>* denotes the regular cost for a generator, such as the optimization target in (4), *δ* is a parameter for the score, and *IS* is the score that can be obtained from the evaluator. Since (6) is differentiable with respect to *G*, *θ<sub>G</sub>* can be optimized by the gradients in such a manner.
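A minimal sketch of (8) in PyTorch; `evaluator_probs` (the evaluator's softmax outputs for a batch of generated images) and the function names are illustrative assumptions:

```python
import torch

def differentiable_is(probs, eps=1e-12):
    """Score of Eq. (6) on the evaluator's softmax outputs, written in torch
    so that gradients flow back into the generator."""
    p_y = probs.mean(dim=0, keepdim=True)
    kl = (probs * (torch.log(probs + eps) - torch.log(p_y + eps))).sum(dim=1)
    return torch.exp(kl.mean())

def generator_objective(l_g, evaluator_probs, delta=0.5):
    """Eq. (8): the regular generator cost minus the weighted score."""
    return l_g - delta * differentiable_is(evaluator_probs)
```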

### 3.1.2. The Evaluator Module with MobileNet

To obtain the *IS* in (8), originally, the Inception network [18] is required as the evaluator in ScoreGAN since the metrics are calculated through the network. However, as described in the previous sections, directly optimizing the score leads to overfitting the network, thereby making the generator produce noises instead of fine samples. Furthermore, if the Inception network is used for the training, it is challenging to validate whether the generator actually learns features rather than memorizes the network, since the generator trained by the Inception network certainly achieves a high Inception score, regardless of the actual learning.

Therefore, ScoreGAN introduces another network, called MobileNet [22], as the evaluator module whose score is to be maximized. MobileNet [22,31,32] is a comparatively small classifier for mobile devices, which is trained with the ImageNet dataset as well. Its compact network size keeps the GAN training tractable, which is why MobileNet is used in this study. The score is calculated over the feature distribution of MobileNet; then, the generator aims to maximize the score, as described in (8). For MobileNet, the pretrained model in the Keras library is used in this study.

Furthermore, to prevent overfitting on MobileNet, ScoreGAN uses a regularized score, which can be represented as follows:

$$RIS_{mobile}\big(\hat{X}\big) := \min\left\{ IS_{mobile}(X),\, IS_{mobile}\big(\hat{X}\big) \right\},\tag{9}$$

where *RIS* represents the regularized score and *IS<sub>mobile</sub>* denotes the score calculated in the same manner as (6) through MobileNet instead of the Inception network. Since a perfect GAN model would achieve a score similar to that of real data, the score of real data can be regarded as the maximum value a GAN model can attain. Therefore, the approach in (9) assists the GAN training by reducing the overfitting on the target network.
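Reusing `differentiable_is` from the sketch above, the regularization of (9) can be sketched as follows:

```python
import torch

def regularized_score(probs_real, probs_fake):
    """Eq. (9): cap the optimized score at the score of the real data, so the
    generator cannot 'outscore' reality by overfitting the evaluator."""
    is_real = differentiable_is(probs_real).detach()  # no gradient through real data
    is_fake = differentiable_is(probs_fake)
    return torch.minimum(is_real, is_fake)
```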

The evaluation, however, is performed with the Inception network, as well as the Inception score, instead of MobileNet and *ISmobile*, which can generalize the performance of ScoreGAN. If ScoreGAN is trained to optimize MobileNet, the training ensures maximizing the score obtained with MobileNet, irrespective of the learning of actual features. Therefore, to validate the performance, the model must be evaluated with the original metric, the Inception score.

Furthermore, the model is further evaluated and cross-validated through the FID. Since the score and the FID measure different aspects of the generated samples, the maximization of the score does not guarantee a low FID. Instead, the model can achieve a lower FID than the baseline only if ScoreGAN produces realistic samples whose feature distributions are highly similar to those of real data. Therefore, by using the FID, we can properly cross-validate the model even though the score is used as the target.

### *3.2. Network Structures and Regularization*

Since ScoreGAN employs the ControlGAN structure as the baseline and integrates an evaluator measuring the score with the baseline, ScoreGAN consists of four ANN modules, namely the generator, discriminator, classifier, and evaluator. In short, ScoreGAN additionally uses the evaluator, attached to the original ControlGAN framework. The structure of ScoreGAN is illustrated in Figure 1.

**Figure 1.** The structure of ScoreGAN. The training of each module is represented with arrows. E: evaluator; C: classifier; D: discriminator; G: generator.

As described in Figure 1 and (8), the generator is trained by targeting the three other ANN modules to maximize the score and minimize the losses, simultaneously. Meanwhile, the discriminator tries to distinguish between the real samples and generated samples. The classifier is trained only with the real samples in which the data augmentation is applied; then, the loss for the generator can be obtained with the trained classifier. The evaluator is a pretrained network and fixed during the training of the generator; thereby, the generator learns general features of various objects from the pretrained evaluator by maximizing the score of the evaluator.

Due to the vulnerable nature of the training of GANs, regularization methods for the ANN modules in GANs are essential [33,34]. Accordingly, ScoreGAN uses the regularization methods that are widely employed in various GAN models for its training. Spectral normalization [35] and the hinge loss [36], which are commonly used in state-of-the-art GAN models, are employed in ScoreGAN as well. The gradient penalty with a weight parameter of 10 is used [33]. Furthermore, according to recent studies showing that a regularized discriminator requires intense training [8,35], multiple training iterations for the discriminator are applied; the discriminator is trained five times per training iteration of the generator. For the generator and the classifier, the conditional batch normalization (cBN) [37] and layer normalization (LN) [38] techniques are used, respectively.

For the neural network structures in ScoreGAN, we followed a typical architecture that is generally introduced in many other studies [27,39]. The detailed structures are shown in Table 1. The two time-scale update rule (TTUR) [17] is employed with learning rates of 4 × 10<sup>−4</sup> and 2 × 10<sup>−4</sup> for the discriminator and the generator, respectively. The learning rates are halved after 50,000 iterations; then, the models are trained with the halved learning rates for another 50,000 iterations. The Adam optimization method is used with the parameters *β*<sub>1</sub> = 0 and *β*<sub>2</sub> = 0.9, the same setting as in other recent studies [29,35]. The maximum threshold for the training from the classifier was set to 0.1. The parameter *δ* in (8) that modulates the training from the evaluator was set to 0.5.
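A sketch of this schedule in PyTorch, with placeholder modules standing in for the actual networks of Table 1:

```python
import torch
import torch.nn as nn

# Placeholder modules; the real generator and discriminator follow Table 1
generator = nn.Sequential(nn.Linear(128, 3 * 32 * 32))
discriminator = nn.Sequential(nn.Linear(3 * 32 * 32, 1))

# TTUR: distinct Adam learning rates for D and G, with beta1 = 0, beta2 = 0.9
opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.0, 0.9))
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.0, 0.9))

# Halve both learning rates once after 50,000 iterations
sched_d = torch.optim.lr_scheduler.MultiStepLR(opt_d, milestones=[50_000], gamma=0.5)
sched_g = torch.optim.lr_scheduler.MultiStepLR(opt_g, milestones=[50_000], gamma=0.5)
```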

**Table 1.** Architecture of neural network modules. The values in the brackets indicate the number of convolutional filters or nodes of the layers. Each ResBlock is composed of two convolutional layers with pre-activation functions.


### **4. Results**

In this section, we discuss the performance of ScoreGAN with respect to the Inception score, the FID, and the quality of the generated images. In the experiments, three image datasets, called CIFAR-10, CIFAR-100, and LSUN, were used. The three subsections in this section present the performance results on each dataset. The characteristics of the datasets are described in Table 2.

**Table 2.** Datasets used in the experiments.


### *4.1. Image Generation with CIFAR-10 Dataset*

The proposed ScoreGAN was evaluated over the CIFAR-10 dataset, which is conventionally employed as a standard dataset to assess the image generation performance of GAN models in many studies [26,27,29,35,39–42]. The training set of the CIFAR-10 dataset is composed of 50,000 images that are from 10 different classes. To train the models, we used a minibatch size of 64, and the generator was trained over 100,000 iterations. The other settings and the structure of ScoreGAN that was used to train the CIFAR-10 dataset are described in the previous section. Since the proposed ScoreGAN introduces an additional evaluator compared to ControlGAN, we used ControlGAN as the baseline; thereby, we can properly assess the effect of the additional evaluator.

To evaluate the image generation performance of the models, the Inception score and FID were employed. As described in the previous sections, since the Inception score is the average of the relative entropy between each prediction and the marginal predictions, a higher Inception score signifies better quality and richer diversity of the generated samples; conversely, a lower FID indicates that the feature distributions of the generated samples are similar to those of the real samples. Notice that, for ScoreGAN, the Inception score and FID were measured after the training iterations (100,000). The results could presumably be improved further if the models were repeatedly measured during training and the best model were selected among the iterations, as conducted in several studies [8,39].

Table 3 shows the performance of the GAN models in terms of the Inception score and FID. While the neural network architecture of ScoreGAN is the same as that of ControlGAN, the proposed ScoreGAN demonstrates superior performance, which verifies the effectiveness of the additional evaluator in ScoreGAN. The Inception score increased by 20.5%, from 8.60 to 10.36, which corresponds to state-of-the-art performance among the existing models thus far. The FID also decreased by 21.1% in ScoreGAN compared to ControlGAN, with FID values of 8.66 and 10.97, respectively. Random examples generated by ScoreGAN are shown in Figure 2.

**Figure 2.** Random examples of the generated images by ScoreGAN with the CIFAR-10 dataset. Each column represents each class in the CIFAR-10 dataset. All images have a 32 × 32 resolution.


**Table 3.** Performance of GAN models over the CIFAR-10 dataset. IS indicates the Inception score; FID indicates the Fréchet Inception distance. The best performances are highlighted in bold.

The results of this study appear to validate the effectiveness of both the additional evaluator and the auxiliary score in ScoreGAN. The generator in ScoreGAN appears to properly learn general features through the pretrained evaluator and is then driven to produce a variety of samples by maximizing the score. This is reflected not only in an increase in the Inception scores but also in a decrease in the FIDs. Since the FID measures the similarity between feature distributions, it is less related to the objective of ScoreGAN; therefore, the decreased FID could be evidence that ScoreGAN does not overfit on the Inception score and that the proposed evaluator enhances the performance. Furthermore, since ScoreGAN uses neither the Inception network as the evaluator nor the Inception score itself for training, the generated samples can hardly be regarded as adversarial examples of the Inception network, as shown by the examples in Figure 2, where the images are far from noise.

The detailed Inception score and FID over iterations are shown in Figure 3. As shown in the figures, the training of ControlGAN becomes slow after 30,000 iterations, while the proposed ScoreGAN continues its training. For example, the Inception score of ControlGAN at 35,000 iterations is 8.48, which is 98.6% of the final Inception score, while, at the same time, the Inception score of ScoreGAN is 9.34, which corresponds to 90.2% of its final score. The FID demonstrates similar results to those of the Inception score. In ControlGAN, the FID decreases by 10.7% from 50,000 to 100,000 iterations; in contrast, it declines by 26.9% in ScoreGAN. Such a result implies that the generator in ScoreGAN can be further trained by the proposed evaluator, although the training of the discriminator is saturated.

**Figure 3.** The performance of ScoreGAN in terms of the Inception score and Fréchet Inception distance over iterations. (**A**) The Inception scores; (**B**) the Fréchet Inception distance (FID). The baseline is ControlGAN with a neural network architecture identical to that of ScoreGAN.

### *4.2. Image Generation with CIFAR-100 Dataset*

To generalize the effectiveness of ScoreGAN, the CIFAR-100 dataset was employed for the evaluation of the GAN models. The CIFAR-100 dataset is similar to the CIFAR-10 dataset, where each dataset contains 50,000 images of size 32 × 32 in the training set. The difference between the CIFAR-100 dataset and the CIFAR-10 dataset is that the CIFAR-100 dataset is composed of 100 different classes. Therefore, it is generally regarded that the training of the CIFAR-100 dataset is more challenging than that of the CIFAR-10 dataset. The architectures used in this experiment are shown in Appendix A.

Since existing methods in several recent studies have been evaluated over the CIFAR-100 dataset [43], we compared the performance between ScoreGAN and the existing methods. The performance in terms of the Inception score and FID is demonstrated in Table 4. The results show that ScoreGAN outperforms the other existing models. While the same neural network architectures are used in both methods, the performance of ScoreGAN is significantly superior to that of the baseline. For instance, the FID significantly declines from 18.42 to 13.98, which corresponds to a state-of-the-art result. Random examples of the generated images with ScoreGAN trained with CIFAR-100 are shown in Figure 4.

**Figure 4.** Random examples of the generated images by ScoreGAN with the CIFAR-100 dataset. Each column represents each class in the CIFAR-100 dataset. All images have a 32 × 32 resolution.


**Table 4.** Performance of the GAN models over the CIFAR-100 dataset. IS indicates the Inception score; FID indicates the Fréchet Inception distance. The best performances are highlighted in bold.

While the Inception score of ScoreGAN is slightly lower than that of MHingeGAN [39], such a disparity results from a difference in the assessment of the scores: for MHingeGAN, the Inception score is continuously measured during the training iterations, and the best score is selected among them, whereas the Inception score of ScoreGAN is computed only once after 100,000 iterations. Moreover, in terms of the FID, ScoreGAN demonstrates superior results compared to MHingeGAN. Finally, it is reported that the training of MHingeGAN over the CIFAR-100 dataset collapses before 100,000 iterations.

### *4.3. Image Generation with LSUN Dataset*

For an additional experiment, ScoreGAN was applied to another dataset, called LSUN [44]. LSUN is a large-scale image dataset with 10 million images in 10 different scene categories, such as bedroom and kitchen. Furthermore, different from the CIFAR-10 and CIFAR-100 datasets, LSUN is composed of high-resolution images; therefore, we evaluated ScoreGAN with LSUN to verify that the proposed framework can be performed with high-resolution images. In this experiment, ScoreGAN produces 128 × 128 resolution images.

The training process is the same as in the previous experiments with the CIFAR datasets, while different training parameters were used; a learning rate of 5 × 10<sup>−5</sup> was used for both the generator and discriminator, and the weights of the discriminator were updated twice for each update of the generator. Furthermore, the number of layers of the generator and discriminator was increased due to the resolution of the produced images. Since the resolution of the images is four times that of the CIFAR datasets, two additional residual modules were employed, which correspond to four additional convolutional layers for both the generator and discriminator.

Examples of the generated images by ScoreGAN are shown in Figure 5. The proposed model produced fine images for each category in the LSUN dataset. These results confirm that the proposed model can be applied to higher-resolution images than those in the CIFAR datasets, which demonstrates the generality of the performance of the proposed model. The result of the additional experiments signifies that the proposed model can be trained with various image datasets that have many image categories, such as CIFAR-100, as well as datasets with high-resolution images, such as LSUN.

**Figure 5.** Random examples of the generated images by ScoreGAN with the LSUN dataset. The images have a 128 × 128 resolution. Each column represents each class in the LSUN dataset, i.e., bedroom, bridge, church outdoor, classroom, conference room, dining room, kitchen, living room, restaurant, and tower.

### **5. Conclusions**

In this paper, the proposed ScoreGAN introduces an evaluator module that can be integrated with conventional GAN models. While it is known that directly using the Inception score to train a generator amounts to producing noise-like adversarial examples of the Inception network, we circumvented this problem by using the score as an auxiliary target and by employing MobileNet instead of the Inception network. The proposed ScoreGAN was evaluated over the CIFAR-10 and CIFAR-100 datasets. As a result, ScoreGAN demonstrated an Inception score of 10.36, which is the best score among the existing models. Furthermore, evaluated over the CIFAR-100 dataset in terms of the FID, ScoreGAN outperformed the other models with an FID of 13.98.

Although the proposed evaluator is integrated with the ControlGAN architecture and demonstrated fine performance, it remains to be investigated whether the evaluator module performs properly when it is added to other GAN models. Since the evaluator module can be employed along with various GANs, the performance could be further enhanced by adopting other GAN models. Furthermore, in this paper, only the Inception score is used to train the generator, while the other metric used to assess GANs, i.e., the FID, could also serve as a score. This possibility of using the FID as a score should be studied in future work.

**Author Contributions:** Conceptualization, M.L. and J.S.; methodology, M.L.; software, M.L.; validation, M.L.; formal analysis, M.L.; investigation, M.L. and J.S.; writing—original draft preparation, M.L.; writing—review and editing, M.L. and J.S.; supervision, J.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by the National Research Foundation of Korea (NRF) grants funded by the Korean government (MSIT) (Nos. 2022R1A2C2004003 and 2021R1F1A1050977).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **Appendix A. Neural Network Architectures of ScoreGAN for the CIFAR-100 Dataset**

**Table A1.** Architecture of the neural network modules for the training of the CIFAR-100 dataset. The values in the brackets indicate the number of convolutional filters or nodes of the layers. Each ResBlock is composed of two convolutional layers. The difference from the architecture for the CIFAR-10 dataset lies in the classifier, in which 256 filters are used in the last three ResBlocks.


### **References**


## *Article* **Improved Method for Oriented Waste Detection**

**Weizhi Yang 1,\*, Yi Xie <sup>2</sup> and Peng Gao <sup>3</sup>**


**Abstract:** Waste detection is one of the main problems preventing the realization of automated waste classification, which is a basic function for robotic arms. In addition to object identification in general image analysis, a waste-sorting robotic arm not only needs to identify a target object but also needs to accurately judge its placement angle so that it can determine an appropriate angle for grasping. In order to solve the problem of low-accuracy image detection caused by irregular placement angles, in this work, we propose an improved oriented waste detection method based on YOLOv5. By optimizing the detection head of the YOLOv5 model, this method can generate an oriented detection box for a waste object that is placed at any angle. Based on the proposed scheme, we further improved three aspects of the performance of YOLOv5 in the detection of waste objects: the angular loss function was derived based on dynamic smoothing to enhance the model's angular prediction ability, the backbone network was optimized with enhanced shallow features and attention mechanisms, and the feature aggregation network was improved to enhance the effects of multi-scale feature fusion. The experimental results showed that the detection performance of the proposed method for waste targets was better than that of other deep learning methods. Its average accuracy and recall were 93.9% and 94.8%, respectively, which were 11.6% and 7.6% higher than those of the original network.

**Keywords:** waste classification; angle detection box; dynamic smoothing; YOLOv5

**MSC:** 68T20; 68T45; 68U10

### **1. Introduction**

Waste disposal is an important problem worldwide that must be addressed. Classifying waste and implementing differentiated treatments can help to improve resource recycling and promote environmental protection. However, many countries and regions still rely on manual waste classification. The main drawbacks of this are twofold. First, the health of operators can be seriously threatened by the large number of bacteria carried by waste [1]. Second, manual sorting is not only costly but also inefficient. Consequently, automated waste management and classification approaches have received extensive attention [2].

Using a robotic arm is a common method for replacing the manual mode with automated waste sorting [3]. In order to enable the robot arm to correctly classify and grasp the target object, each robot arm needs to have the functions of object recognition and placement angle judgment.

Wu et al. [4] proposed a plastic waste classification method based on FV-DCNN. They extracted classification features from original spectral images of plastic waste and constructed a deep CNN classification model. Their experiments showed that the model could recognize and classify five categories of polymers. Chen et al. [5] proposed a lightweight feature extraction network based on MobileNetv2 and used it to achieve image classification of waste. Their experiments showed that the average accuracy of the classification with their dataset was 94.6%. Liu et al. [6] proposed a lightweight neural network based

**Citation:** Yang, W.; Xie, Y.; Gao, P. Improved Method for Oriented Waste Detection. *Axioms* **2023**, *12*, 18. https://doi.org/10.3390/ axioms12010018

Academic Editor: Oscar Humberto Montiel Ross

Received: 14 November 2022 Revised: 11 December 2022 Accepted: 21 December 2022 Published: 24 December 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

on MobileNet that can reduce the cost of industrial processes. Kang et al. [7] proposed an automated waste classification system based on the ResNet-34 algorithm. The experimental results showed that the classification had high accuracy, and the classification speed of the system was as quick as 0.95 s.

These models can recognize categories of waste intelligently based on convolutional neural networks, but they do not generate location boxes for the waste targets. In addition, when there are multiple categories of waste in an image, such models cannot achieve effective identification. Therefore, they cannot be applied directly to tasks such as automated waste sorting.

Cui et al. [8] used the YOLOv3 algorithm to detect domestic waste, decoration waste and large waste on the street. Xu et al. [9] proposed a five-category waste detection model based on the YOLOv3 algorithm and achieved 90.5% average detection accuracy with a self-made dataset. The dataset included paper waste, plastic products, glass products, metal products and fabrics. Chen et al. [10] proposed a deep learning detection network for the classification of scattered waste regions and achieved good detection results. Majchrowska et al. [11] proposed a deep learning-based waste detection method for natural and urban environments. Meng et al. [12] proposed a MobileNet-SSD with an FPN for waste detection.

These methods can achieve image-based waste detection, but they do not provide the grasping angle information for the target object. For a target object placed at any angle, these methods only provide a horizontal identification box. Therefore, the robotic arm cannot determine the optimal grasp mode for the shape and placement angle of a waste object, which may easily lead to the object falling or to grabbing failure, especially in cases involving a large aspect ratio, as a small angle deviation can lead to a large deviation in the intersection over union (IoU).

In addition to object identification in general image analysis, a waste-sorting robotic arm not only needs to identify a target object but also needs to accurately judge its placement angle so that the robotic arm can determine the appropriate grasping angle. YOLOv5 has a strong feature extraction structure and feature aggregation network, allowing it to achieve higher detection recall and accuracy. It also provides a series of methods that can be used to achieve data enhancement. YOLOv5 is a good choice for many common identification and classification problems due to its fast detection speed, high detection accuracy and easy deployment, making it popular in many practical engineering applications. Li et al. [13] and Chen et al. [14] proposed improved algorithms for vegetable disease and plant disease detection based on YOLOv5. Their experiments showed that the detection rates reached 93.1% and 70%, respectively, which were better than other methods. Ling et al. [15] and Wang et al. [16] proposed gesture recognition and smoke detection models, respectively, based on YOLOv5. Gao et al. [17] proposed a beehive detection model based on YOLOv5. However, the original YOLOv5 does not provide the grasping angle information required for a target object. For a target object placed at any angle, it only provides a horizontal identification box, as shown in Figure 1a. Therefore, the robotic arm cannot determine the optimal grasp mode for the shape and placement angle of a target object, which may easily lead to the object falling or to grabbing failure, especially in cases involving a large aspect ratio.

**Figure 1.** Grabbing with horizontal and oriented detection box. (**a**) Horizontal detection box; (**b**) oriented detection box.

In this work, we made two modifications to YOLOv5 to improve its suitability for automated waste sorting application scenarios. First, we added an angular prediction network in the detection head to provide grasping angle information for the waste object and developed a dynamic smoothing label for angle loss to enhance the angular prediction ability of the model. Second, we optimized the structure of the feature extraction and aggregation by enhancing the multi-scale feature fusion.

The contributions of this work are threefold:

(1) An optimized waste detection approach was designed based on YOLOv5 that provides higher detection accuracy for both general-sized waste and waste with a large aspect ratio;

(2) An angular prediction method is proposed for YOLOv5 that enables the rotation detection box to obtain the actual position of oriented waste;

(3) New optimization schemes are introduced for YOLOv5, including a loss function, feature extraction and aggregation.

### **2. Detection Method for Oriented Waste**

### *2.1. Detection Scheme*

As shown in Figure 2, the framework of the proposed waste detection scheme consists of five parts: the input layer, feature extraction backbone network, feature aggregation network, detection head and dynamic smoothing module. In this study, the backbone network mainly consisted of the focus module, the convolution module and an optimized HDBottleneckCSP module based on BottleneckCSP. The focus module reduces the number of computations and improves the speed in accordance with the slicing operation. The BottleneckCSP module is a convolution structure that demonstrates good performance in model learning. The backbone was used to extract the features from waste images and generate feature maps with three different sizes. The feature aggregation network converges and fuses multi-scale features generated from the backbone network to improve the representation learning ability for rotating waste angle features. The detection head generates the category, location and rotation angle for waste based on the multi-scale feature maps. Finally, the dynamic smoothing module partially densifies the "one-hot label encoding" of the angle labels for model training.

**Figure 2.** Rotation angle waste detection scheme.

### *2.2. Improvement of Detection Model*

The original YOLOv5 model has the following limitations: (i) It can only generate a target detection box with a horizontal angle and not a rotation angle. (ii) The stack of bottleneck modules in BottleneckCSP is serial, which causes the middle-layer features to be lost. (iii) The feature aggregation network lacks end-to-end connection between the input and output feature maps.

To solve these problems, we optimized three aspects of YOLOv5: (i) We added an angular prediction network and loss function, as well as a dynamic angle smoothing algorithm for angular classification, to improve the angular prediction ability. (ii) We optimized the BottleneckCSP module of the backbone network to enhance the model's ability to extract the features of oriented waste. (iii) We optimized the feature aggregation network to improve the effect of multi-scale feature fusion.

### 2.2.1. Improvement of the Detection Head Network

The original YOLOv5 detector lacks a network structure for angular prediction and cannot provide the grasping angle information for waste objects. Therefore, the robotic arm cannot set the optimal grasp mode according to the placement angle of the waste, which easily leads to the object falling or to grabbing failure. Thus, we optimized the structure of the detection head.

Angular prediction can be realized as regression or classification. The regression mode produces a continuous prediction value for the angle, but there is a periodic boundary problem, which leads to a sudden increase in the value of the loss function at the boundary of periodic changes, increasing the difficulty of learning [18]. For example, in the 180° long-side definition method, the defined label range is (−90°, 90°). When the true angle of the waste is 89° and the prediction is −90°, the error learned by the model is 179°, but the actual error should be 1°, which affects the learning of the model.

Therefore, we added convolution network branches in the detection head and defined the angle label with 180 categories obtained by rotating the long side of the target box clockwise around the center. The angle convolution network generates the angle prediction using information extracted from the multi-scale features obtained by the feature aggregation network.

In the detection head, the angle convolution network and the original network share the output of the feature aggregation network as the input feature graph. The output of the angle prediction network and the original network are merged as follows:

$$V = \big(\hat{c}, \hat{x}, \hat{y}, \hat{l}, \hat{s}, \hat{\theta}\big) \tag{1}$$

where *ĉ* is the predicted category of the waste, *x̂* and *ŷ* are the predicted central coordinates of the object box, *l̂* and *ŝ* are the predicted lengths of the longer and shorter sides of the object box, and *θ̂* is the predicted angle of the oriented waste.

### 2.2.2. Angle Smoothing and Loss Function

The realization of angular prediction as classification can avoid the periodic boundary problem caused by regression, but there are still some limitations. The loss function of traditional category tasks is calculated as cross-entropy loss, and the form of the labels is "one-hot label encoding", as shown in Equations (2) and (3):

$$y_{ic} = \begin{cases} 1, & c = \theta \\ 0, & c \neq \theta \ \text{and} \ c \in \{0, 1, \dots, 179\} \end{cases} \tag{2}$$

$$L = -\frac{1}{N} \sum_{i} \sum_{c=0}^{179} y_{ic} \log(p_{ic}) \tag{3}$$

where *yic* is the "one-hot label encoding" for the angle of sample *i*, *θ* is the angle of the oriented waste and *pic* is the prediction of the detection model.

Equations (2) and (3) show that, for different incorrect predictions of the angle, the same loss value is obtained, and the distance of the mistake cannot be quantified, which makes it difficult for model training to determine the angle of the oriented waste.

To solve this problem, we propose a dynamic smoothing label algorithm based on the circular smooth label (CSL) algorithm [18] to optimize the "one-hot label encoding" label of the angle.

The circular smooth label algorithm is shown in Equation (4):

$$\text{CSL}(x) = \begin{cases} g(x), & \theta - r < x < \theta + r \ \text{and} \ x \in \{0, 1, \dots, 179\} \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

where *θ* is the rotation angle value, *r* is the range of smoothness and *g*(*x*) is the smoothing function. The angle label vector manifests as a "dense" distribution because *g*(*x*) is within the range of smoothness.

The value of the smoothing function is shown in Equation (5):

$$0 < g(\theta - \varepsilon) = g(\theta + \varepsilon) \le 1, \quad |\varepsilon| \le r \tag{5}$$

where the function attains its maximum value of 1 at *ε* = 0 and falls to 0 at *ε* = *r*.

The CSL algorithm partially densifies the "one-hot label encoding". When the angular prediction of the model is in the range of smoothness, different loss values for different predicted degrees are obtained; thus, it can quantify the mistake in the angle category prediction. However, the performance of CSL is sensitive to the range of smoothness. If the range of smoothness is too small, the smoothing label will degenerate into "one-hot label encoding" and lose its effect, and it will be difficult to learn the information from the angle. If the range is too large, the deviation in the angle prediction will be large, which will lead to it missing the object, especially for waste with a large aspect ratio.

Therefore, we propose a dynamic smoothing function for the angle label to adjust the smoothing amplitude and range.

The dynamic smoothing function uses the dynamic Gaussian function to smooth the angle labels. It can be seen from Figure 3 that the smoothing amplitude and the range of the Gaussian function are controlled by the root mean square (RMS) value: the larger the RMS, the flatter the curve; the smaller the RMS, the steeper the curve and the smaller the smoothing range. Therefore, the RMS of the Gaussian function is gradually shrunk to achieve dynamic smoothing, as shown in Equation (6).

$$\text{DSM}(x) = \exp\left(-\frac{d^2(x, \theta)}{2b^2}\right), \quad x \in \{0, 1, \dots, 179\} \tag{6}$$

We provide two efficient annealing functions, cosine annealing and linear annealing, to adjust the RMS, as follows:

$$b = c + e \times \cos\left(\frac{0.5\,\pi \times epoch}{epochs}\right) \quad \text{(cosine annealing)}$$

$$b = c - e \times \frac{epoch}{epochs} \quad \text{(linear annealing)}$$

where *θ* is the value of the rotation angle for the waste, which corresponds to the peak position of the function; *x* is the encoding range of the waste angle; *b* is the value of the RMS; and *d*(*x*, *θ*) is the circular distance between the encoding position and the angle values. For example, if *θ* is 179, *d*(*x*, *θ*) is 1 when *x* is 0; *epoch* and *epochs* represent the current number of training rounds and the maximum number of rounds of the model, respectively, and *c* and *e* are hyper-parameters.

**Figure 3.** Gaussian function curves with different RMS.

It can be seen from Equation (6) that the DSM dynamically densifies the angle label according to the distance between the encoding position and the angle value. In the early stage of model training, *b* takes large values because the epoch is small. At this time, the smoothing range is large, and the model's learning of angles is spread over the window area. With this "loose" smoothing range, the model approaches the neighborhood of the optimal point, which reduces the difficulty of angle learning and improves the recall rate in image waste detection. The range of angle smoothing decreases as *epoch* increases, and the objective of the model shifts from the optimal region to learning the optimal point, so the deviation in the angular prediction becomes smaller. The higher accuracy of the angle prediction improves the recall rate for oriented waste, especially in cases with a large aspect ratio.
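A minimal NumPy sketch of the label of Equation (6) with cosine annealing of the RMS; the values of the hyper-parameters *c* and *e* here are placeholders, not the paper's settings:

```python
import numpy as np

def dsm_label(theta, epoch, epochs, c=2.0, e=1.5, n_bins=180):
    """Dynamically smoothed angle label (Eq. (6)) for theta in {0, ..., 179}."""
    x = np.arange(n_bins)
    # circular distance d(x, theta) between encoding positions and the true angle
    d = np.minimum(np.abs(x - theta), n_bins - np.abs(x - theta))
    # cosine annealing of the RMS: b shrinks as training progresses
    b = c + e * np.cos(0.5 * np.pi * epoch / epochs)
    return np.exp(-d**2 / (2.0 * b**2))
```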

The angular loss of waste was calculated using the cross-entropy loss function based on the dynamic smoothing algorithm:

$$\text{loss}(a) = -\sum_{i=0}^{s^2} I_{ij}^{obj} \sum_{t=0}^{179} \Big\{ \hat{p}_i(t) \log[p_i(t)] + [1 - \hat{p}_i(t)] \log[1 - p_i(t)] \Big\} \tag{7}$$

where *p̂<sub>i</sub>*(*t*) = DSM(*t*), *p<sub>i</sub>*(*t*) is the predicted angle probability, and *s*<sup>2</sup> is the number of subdomains into which the picture is quantized; the model provides a target prediction for each subdomain. *I<sub>ij</sub><sup>obj</sup>* is 0 or 1, indicating whether there is a target. When the prediction is close to the true value, the cross-entropy takes a smaller value.

In addition, the GIoU loss function [19] was used to calculate the regression loss of the detection boundary box. In Figure 4, A and B are the real box and the prediction box of the detection target, respectively. C is the smallest rectangle surrounding A and B. The green area is |C| − |A ∪ B|.

The specific calculation is shown in Equations (8)–(10). GIoU not only pays attention to the overlap of the real box and the prediction box but also to the non-overlapping area, which allows it to solve the problem of the gradient not being calculated caused by A and B not intersecting.

**Figure 4.** Illustration of GIoU.

$$\text{IoU}(\text{A}, \text{B}) = \frac{|\text{A} \cap \text{B}|}{|\text{A} \cup \text{B}|} \tag{8}$$

$$\text{GIoU}(\text{A}, \text{B}) = \text{IoU}(\text{A}, \text{B}) - \frac{|\text{C}| - |\text{A} \cup \text{B}|}{|\text{C}|} \tag{9}$$

$$loss(r) = 1 - \text{GIoU}(\text{A}, \text{B}) \tag{10}$$

In the equations, A and B are the real box and the prediction box of the detection target, respectively, and C is the smallest rectangle surrounding A and B.
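As an illustration, the following sketch implements Equations (8)–(10) for axis-aligned boxes in (x1, y1, x2, y2) format; the rotated case additionally requires a polygon intersection, which is omitted here:

```python
def giou_loss(box_a, box_b):
    """GIoU loss of Eqs. (8)-(10) for axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union  # Eq. (8)
    # C: the smallest rectangle enclosing both A and B
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = iou - (c_area - union) / c_area  # Eq. (9)
    return 1.0 - giou  # Eq. (10)
```

The confidence loss function and category loss function are shown in Equations (11) and (12):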

$$\begin{aligned} \text{loss}(o) = &-\sum_{i=0}^{s^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ \hat{c}_i \log(c_i) + (1 - \hat{c}_i) \log(1 - c_i) \right] \\ &- \lambda_{noobj} \sum_{i=0}^{s^2} \sum_{j=0}^{B} I_{ij}^{noobj} \left[ \hat{c}_i \log(c_i) + (1 - \hat{c}_i) \log(1 - c_i) \right] \end{aligned} \tag{11}$$

$$\text{loss}(c) = -\sum_{i=0}^{s^2} I_{ij}^{obj} \sum_{c \in class} \Big\{ \beta_i(c) \log[p_i(c)] + [1 - \beta_i(c)] \log[1 - p_i(c)] \Big\} \tag{12}$$

where *I<sub>ij</sub><sup>obj</sup>* and *I<sub>ij</sub><sup>noobj</sup>* indicate whether the prediction box *j* of grid *i* is the target box, and *λ<sub>noobj</sub>* denotes the weight coefficient.

The overall loss function of the improved model is a weighted combination of the above loss functions, as shown in Equation (13):

$$Loss = loss(r) + loss(o) + loss(c) + loss(a)\tag{13}$$

### 2.2.3. Improvement of Feature Extraction Backbone Network

The feature extraction backbone network was used to extract the features of the waste in the image. Due to the addition of angular prediction in the detection of oriented waste, there is a higher demand on the feature extraction to realize effective recognition, especially in cases involving a large aspect ratio due to a narrow area.

BottleneckCSP is the main module in the backbone of YOLOv5. The BottleneckCSP module is stacked using a bottleneck architecture. As shown in Figure 5a, the stacking of the bottleneck modules is serial. With the deepening of the network, the feature abstraction capability is gradually enhanced, but shallow features are generally lost [20]. Shallow features have lower semantics and can be more detailed due to the fewer convolution operations. Utilizing multi-level features in CNNs through skip connections has been found to be effective for various vision tasks [21–23]. The bypassing paths are presumed to be the key factor for easing the training of deep networks. Concatenating feature maps learned by different layers can increase the variation in the input of subsequent layers and improve efficiency [24,25]. In addition, attention mechanisms, which are methods used to assign different weights to different features according to their importance, have been found to be effective for the recognition of an image [26,27]. The coordinate attention mechanism (CA) [28] is one such mechanism that shows good performance. Therefore, as shown in Equation (14), we concentrated and merged the middle features of BottleneckCSP and added the CA module to enhance the feature extraction capability. The attention mechanism is optional in the module at different levels.

$$Z^{\text{out}} = g\left(Z^{c}_{h \times w \times (c \times t)}\right) \tag{14}$$

where

$$\begin{cases} Z^1 = f_1(x) \\ Z^t = f_t\big(Z^{t-1}\big) \\ Z^c = \big[ Z^1, Z^2, \dots, Z^t \big]_{h \times w \times (c \times t)} \end{cases}$$

**Figure 5.** Comparison of BottleneckCSP before and after improvement. (**a**) BottleneckCSP module. (**b**) HDBottleneckCSP module.

*Z* is the feature map, *<sup>x</sup>* is the input of the BottleneckCSP module, *<sup>f</sup>* is the function mapping of the bottleneck module, and *g* represents the CA attention operation.

Due to the "residual block" connection in the bottleneck architecture, excessive feature merging between bottlenecks leads to feature redundancy, which is not suitable for model training, and the increased number of parameters means that more resources are consumed. Therefore, the characteristic layers were connected using "interlayer merging", as shown in Figure 5b. The optimized module was named HDBottleneckCSP.

The CA module structure in HDBottleneckCSP is shown in Figure 6. The input feature maps are coded along the horizontal and vertical coordinates to obtain the global field and to encode position information, respectively, which helps the network to detect the locations of targets more accurately.

**Figure 6.** CA attention module.

As shown in Equation (15), the CA module generates vertical and horizontal feature maps for the input feature map and then transforms them through a 1 × 1 convolution. The generated *A* ∈ ℝ<sup>*C*/*r*×(*H*+*W*)</sup> is the intermediate feature map for the spatial information in the horizontal and vertical directions, *r* is the downsampling scale and *F*<sub>1</sub> represents the convolution operation.

$$A = \delta\big(F_1\big(\big[Z^h, Z^w\big]\big)\big) \tag{15}$$

where *A* is divided into *A<sup>h</sup>* ∈ ℝ<sup>*C*/*r*×*H*</sup> and *A<sup>w</sup>* ∈ ℝ<sup>*C*/*r*×*W*</sup> along the spatial dimension. As shown in Equations (16) and (17), each part is transformed to the same number of channels as the input feature map through a convolution operation, and *g<sup>h</sup>* and *g<sup>w</sup>* are used as attention weights that participate in the feature map operation. The output of the CA module is shown in Equation (18).

$$g^h = \delta\big(F_h\big(\big[A^h\big]\big)\big) \tag{16}$$

$$g^w = \delta\big(F_w\big(\big[A^w\big]\big)\big) \tag{17}$$

$$y_c(i,j) = x_c(i,j) \times g_c^h(i) \times g_c^w(j) \tag{18}$$
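A compact PyTorch sketch of a CA module matching Equations (15)–(18); the pooling, activation, and normalization choices here are simplified relative to the original CA design:

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Coordinate attention (Eqs. (15)-(18)); c channels, reduction ratio r."""
    def __init__(self, c, r=16):
        super().__init__()
        mid = max(8, c // r)
        self.conv1 = nn.Conv2d(c, mid, 1)   # F1 of Eq. (15)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, c, 1)  # F_h of Eq. (16)
        self.conv_w = nn.Conv2d(mid, c, 1)  # F_w of Eq. (17)

    def forward(self, x):
        n, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                      # (n, c, h, 1): pool along W
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (n, c, w, 1): pool along H
        a = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))
        a_h, a_w = torch.split(a, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(a_h))                      # (n, c, h, 1)
        g_w = torch.sigmoid(self.conv_w(a_w.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        return x * g_h * g_w                                       # Eq. (18)
```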

The optimized feature extraction backbone network structure is shown in Figure 7. It extracts features through the convolution module and the HDBottleneckCSP module and generates feature maps with three sizes by downsampling (1/8, 1/16 and 1/32).

**Figure 7.** Backbone network structure for feature extraction.

### 2.2.4. Improvement of Feature Aggregation Network

The YOLOv5 feature aggregation network consists of feature pyramid networks [29] (FPNs) and path aggregation networks [30] (PANets). The structure of a PANet is shown in Figure 8a. The PANet aggregates features along two paths: top-down and bottom-up. However, the aggregated features are deep features with high semantics, and the shallow features with high resolution are not fused. In order to make use of the input features more effectively, we replaced the PANet with P2P-PANet, which is based on BiFPN [31], as shown in Figure 8b.

**Figure 8.** Network structures of PANet and P2P-PANet. (**a**) PANet network structure. (**b**) P2P-PANet network structure.

Compared to PANet, P2P-PANet adds end-to-end connection for the input-feature and output-feature maps, which establishes a "point-to-point" horizontal connection path from the low level to the high level, and it can realize the fusion of high-resolution and complex semantic features in an image without adding much cost. Through the extraction and induction of semantic information for the high-resolution and low-resolution feature maps, the angular feature information of rotating waste is further enhanced, and the detection ability of the model is improved.

The method for oriented waste detection after all the optimizations was named YOLOv5m-DSM and is shown in Figure 9. When a picture is input into the model, YOLOv5m-DSM extracts features using the backbone and generates downsampling feature maps with three different sizes for the detection of waste. The feature aggregation network undertakes feature aggregation and fusion to enhance the model's ability to learn features. The detection head generates the prediction information for waste targets based on the multi-scale features. In the model's training stage, the label of the training set is smoothed using the dynamic smoothing module, and the loss in the prediction, including class, angle and position, is calculated using the loss calculation module for iterative learning.

**Figure 9.** Method diagram for YOLOv5m-DSM.

### **3. Experimental Results and Analysis**

### *3.1. Datasets*

The dataset for the experiment contained eleven kinds of domestic waste, including a cotton swab, a stick, paper, a plastic bottle, a tube, vegetables, peels, a shower gel bottle, a coat hanger, clothes pegs and an eggshell. The vector of the label contained the category, the center *x* coordinate of the target box, the center *y* coordinate of the target box, the long-side value, the short-side value and the angle value. The angle was the angle between the long side of the target frame and the horizontal axis in the clockwise direction, with a range of (0°, 180°).

### *3.2. Evaluation Index*

In order to evaluate the performance of YOLOv5m-DSM, it was compared and analyzed using the recall (*R*), mean average precision (*mAP*) and other indicators. The recall is as follows:

$$R = \frac{TP}{TP + FN} \times 100\% \tag{19}$$

*TP* represents a "true positive" sample, and *FN* represents a "false negative" sample. The mean average precision formula is shown in Equation (20).

$$mAP = \frac{1}{m} \sum_{i=1}^{m} AP_i \tag{20}$$

The mean average precision refers to the average precision (*AP*) for each category of samples, which is calculated from the recall rate and precision (*P*) as follows:

$$P = \frac{TP}{TP + FP} \times 100\% \tag{21}$$

$$AP = \int_0^1 P(R)\,dR \tag{22}$$
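For reference, a sketch of the numerical integration behind Equations (20) and (22), assuming `recall` and `precision` are arrays sorted by ascending recall:

```python
import numpy as np

def average_precision(recall, precision):
    """AP of Eq. (22): area under the P(R) curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]  # enforce a monotone precision envelope
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP of Eq. (20): mean of the per-class AP values."""
    return float(np.mean(ap_per_class))
```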

### *3.3. Experimental Results and Analysis*

In order to better show the advantages of the method described in this paper, YOLOv5m-DSM was compared with mainstream horizontal box detection methods and rotating box detection methods in the experiments.

Table 1 shows a comparison of the detection effects for YOLOv5m-DSM and horizontal rectangular box detection methods, such as SSD-OBB, YOLOv3-OBB, YOLOv5s-OBB and YOLOv5m-OBB, which are angle classification network structures commonly added in detection heads based on their original models [32,33].

**Table 1.** Comparison between YOLOv5m-DSM and horizontal frame detection methods.


Table 1 shows that, compared with SSD-OBB, YOLOv3-OBB and YOLOv5s-OBB, the recall rate and average precision of the YOLOv5m-based models were better. Compared with the original network, YOLOv5m-DSM showed improvements of 7.6% and 11.6% in the recall rate and the average precision, respectively. This proves that the modified waste detection algorithm offers obvious improvements. Furthermore, YOLOv5m-DSM showed a good detection effect for oriented waste with a large aspect ratio, demonstrating an obvious improvement over the original model. The good performance of DSM (Cos) and DSM (Linear) proves that the dynamic smoothing label is effective and robust.

The detection effects of YOLOv5m, YOLOv5m-OBB and YOLOv5m-DSM are shown in Figure 10. It can be seen from Figure 10a that the YOLOv5m network only generated a horizontal detection box. It did not provide the grasping angle information for the waste object. Therefore, the robotic arm could not set the optimal grasp mode according to the inherent shape and placement angle of the target object, which could easily lead to the object falling and to grabbing failure, especially in cases involving a large aspect ratio.

**Figure 10.** Comparison of detection effects of the methods. (**a**) Detection using YOLOv5m. (**b**) Detection using YOLOv5m-OBB. (**c**) Detection using YOLOv5m-DSM.

Figure 10b shows that, when the angular classification network was added to the detection head, YOLOv5m-OBB could generate a waste object detection box at any angle, but the angle of the generated detection box was not accurate enough, especially in cases involving a large aspect ratio. Due to the large aspect ratio, a slight deviation in the prediction box resulted in a smaller IoU between the prediction box and the true box, which caused difficulties in the model training. Therefore, a large aspect ratio makes effective learning difficult.

Figure 10c shows the detection results for YOLOv5m-DSM. It can be seen that YOLOv5m-DSM could generate a waste object detection box at any angle and could accurately detect objects with a large aspect ratio. With the optimization of the feature extraction backbone network and the feature aggregation network, and with the loss function optimized through the dynamic smoothing algorithm, YOLOv5m-DSM achieved better precision and performance in the detection of oriented waste.

Table 2 shows a comparison of the detection effects of YOLOv5m-DSM and the mainstream rotating rectangular box detection methods.


**Table 2.** Comparison of YOLOv5m-DSM and rotating frame detection methods.

It can be seen that, when compared to RoI Trans [34], the average recall rate and average precision of detection increased by 6.2% and 6.6%, respectively. Compared to Gliding-Vertex [35], they increased by 2.4% and 4.3%, respectively. Compared to R3Det [36], they increased by 3.4% and 3.8%, respectively. The recall rate and average precision of the YOLOv5m-DSM model were also better than those of S2A-Net [37], and our method had fewer parameters and a detection rate twice as high. In addition, we extended the FLOPs counter tool to calculate the floating-point operations (FLOPs) of the methods; the computational load of YOLOv5m-DSM was lower than that of the comparison algorithms, making the model more suitable for deployment and application in embedded devices.

### *3.4. Network Model Ablation Experiment*

The network model ablation experiment was used to evaluate the optimization effect of each improvement scheme. The comparison results are shown in Table 3. Optimization 1 used the dynamic smoothing algorithm to densify the angle label conversion and to compute the corresponding loss function (scheme 1a uses linear annealing of the smoothing range, and scheme 1b uses cosine annealing). Optimization 2 improved the feature extraction backbone network with the proposed HDBottleneckCSP module. Optimization 3 improved the feature aggregation network of YOLOv5 with the P2P-PANet structure.
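Schemes 1a and 1b differ only in how the smoothing range is annealed over training. A minimal sketch of the two schedules follows (our reconstruction with illustrative constants; the paper's exact values are not reproduced here):

```python
import math

def smoothing_radius(epoch, total_epochs, r0=6.0, r_min=0.0, mode="linear"):
    """Anneal the angle-smoothing radius (degrees) over training.

    'linear' corresponds to scheme 1a and 'cosine' to scheme 1b in the
    ablation; r0 and r_min are illustrative constants, not the paper's.
    """
    t = epoch / max(total_epochs - 1, 1)  # training progress in [0, 1]
    if mode == "linear":
        return r_min + (r0 - r_min) * (1.0 - t)
    if mode == "cosine":
        return r_min + (r0 - r_min) * 0.5 * (1.0 + math.cos(math.pi * t))
    raise ValueError(f"unknown mode: {mode}")

for e in (0, 50, 99):  # radius shrinks as training progresses
    print(e, round(smoothing_radius(e, 100, mode="linear"), 2),
          round(smoothing_radius(e, 100, mode="cosine"), 2))
```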


**Table 3.** Ablation experiment.

It can be seen from Table 3 that, after adding the linear dynamic smoothing algorithm and the corresponding loss function, the recall rate and average precision of optimization model 1 increased by 5.3% and 8.2%, respectively. After adding the linear dynamic smoothing algorithm and the HDBottleneckCSP module, these values increased by 6.4% and 9.4%, respectively, in optimization model 2. After adding the linear dynamic smoothing algorithm and the P2P-PANet module, they increased by 6.2% and 9.2%, respectively, in optimization model 3. For the YOLOv5m-DSM (Linear) model, which combines all of the above optimizations, the detection recall rate and average precision increased by 7.6% and 11.6%, respectively.

In order to analyze more clearly how replacing the original module structures with the HDBottleneckCSP structure and the P2P-PANet network affects the image waste detection algorithm, and why, the intermediate feature maps of YOLOv5 and YOLOv5m-DSM were extracted for comparison, as shown in Figure 11.
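Such intermediate maps can be captured with standard PyTorch forward hooks. The sketch below is generic: which named layers correspond to the 1/8, 1/16 and 1/32 stages depends on the concrete model definition, so the layer names are an assumption supplied by the caller.

```python
import torch

def capture_feature_maps(model, image, layer_names):
    """Record the outputs of the named layers during one forward pass."""
    feats, handles = {}, []
    for name, module in model.named_modules():
        if name in layer_names:
            handles.append(module.register_forward_hook(
                # Bind `name` now; hooks fire later during the forward pass.
                lambda mod, inp, out, key=name: feats.__setitem__(key, out.detach())))
    with torch.no_grad():
        model(image)
    for h in handles:
        h.remove()  # leave the model unmodified afterwards
    return feats
```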

Figure 11a,b show the input image and the label image, and Figure 11c–f and Figure 11g–j show the down-sampling feature maps (1/8, 1/16 and 1/32) of YOLOv5 and YOLOv5m-DSM, respectively, in the backbone network. It can be seen from Figure 11c,d,g,h that shallower feature information was extracted after introducing the HDBottleneckCSP network, with the edge information and feature details of the waste captured more clearly. As Figure 11e,i show, both network structures obtained high-level semantic features through multi-layer convolution operations. Finally, the comparison of Figure 11f,j shows that the YOLOv5m-DSM network generated a clearer edge for the target object, which improved the recall and accuracy of the waste detection.

**Figure 11.** Intermediate characteristic diagrams of YOLOv5 and YOLOv5m-DSM. (**a**) Input image. (**b**) Label image. (**c**–**f**) Down sampling feature maps of YOLOv5. (**g**–**j**) Down sampling feature maps of YOLOv5m-DSM.

Table 4 shows a comparison of the detection effects of "interlayer merging" and "layer by layer merging" on the characteristic layer of the HDBottleneckCSP network.

**Table 4.** Effect comparison of "interlayer merging" and "layer by layer merging".


It can be seen from Table 4 that, compared with "layer by layer merging", the "interlayer merging" used for feature map aggregation required fewer training parameters and achieved a better detection effect. This is mainly because "layer by layer merging" duplicates the use of feature maps excessively, which easily causes feature redundancy and increases the difficulty of learning for the model. In addition, overly dense feature map aggregation increases the number of channels in the merged feature map, thus increasing the number of parameters and consuming more computing resources.
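The parameter difference follows directly from how many stage outputs enter the concatenation. The toy module below (our reading of the description, not the actual HDBottleneckCSP code) contrasts the two merging policies: with interlayer=True, only every second stage output is concatenated, so the 1x1 fusion convolution sees half as many input channels and needs correspondingly fewer parameters.

```python
import torch
import torch.nn as nn

class StackedConvMerge(nn.Module):
    """Toy contrast of 'layer by layer' vs 'interlayer' feature merging."""

    def __init__(self, channels=64, depth=4, interlayer=True):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(depth))
        self.step = 2 if interlayer else 1  # keep every 2nd map vs all maps
        merged = channels * len(range(0, depth, self.step))
        self.fuse = nn.Conv2d(merged, channels, 1)  # 1x1 conv to fuse

    def forward(self, x):
        outs = []
        for stage in self.stages:
            x = stage(x)
            outs.append(x)
        return self.fuse(torch.cat(outs[::self.step], dim=1))

dense = StackedConvMerge(interlayer=False)   # 'layer by layer' merging
sparse = StackedConvMerge(interlayer=True)   # 'interlayer' merging
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense), count(sparse))  # interlayer merging has fewer parameters
```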

Table 5 shows the effects of the proposed method with different backbones. It can be seen that, compared with other backbones, such as VGG19, ResNet50, and CSPDarknet, the backbone proposed in this paper achieved a better detection effect.

**Table 5.** Effect comparison with different backbones.


To analyze further how the HDBottleneckCSP and P2P-PANet structures affect image waste detection, and why, Figure 12 shows the feature aggregation network maps and the detection results for YOLOv5 and YOLOv5m-DSM.

Figure 12a–h show the multi-scale feature maps and detection results produced by the feature aggregation stage of the original YOLOv5 and of YOLOv5m-DSM. The comparison shows that, although the YOLOv5 model aggregated the feature maps, the contours of the detected objects in the resulting multi-scale maps were not sharp, the object features were poorly differentiated from the background, and in places they mixed with the background features. The YOLOv5m-DSM algorithm uses the P2P-PANet structure and the smoothed angle labels, which makes the learned image features more distinct and the feature contours of the detected objects clearer and better separated from the background, and thus makes the final detection more accurate.

**Figure 12.** Feature aggregation network maps of YOLOv5 and YOLOv5m-DSM. (**a**–**d**) Multi-scale feature maps and detection result of YOLOv5m. (**e**–**h**) Multi-scale feature maps and detection result of YOLOv5m-DSM.

Table 6 shows a comparison of the effects of dynamic smoothing and the circular smooth label with different hyper-parameters.


**Table 6.** Comparison of dynamic smoothing and the circular smooth label.

It can be seen from Table 6 that different detection effects were obtained by adjusting the smoothing range of the circular smooth label. However, both variants of dynamic smoothing performed better than the best circular-smooth-label result, which shows that dynamic smoothing is robust. Dynamic smoothing controls angle learning by gradually shrinking the smoothing range: in the initial stage of training, a larger smoothing range reduces the difficulty of model learning and improves the recall rate for waste detection; as training iterates, the range is gradually reduced through the attenuation function, which reduces the angle deviation in target detection and thus improves the detection accuracy. More accurate angle prediction in turn improves the recall rate for oriented waste, especially for objects with a large aspect ratio.
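A minimal sketch of such a dynamically smoothed angle label follows (our reconstruction of the idea, not the authors' code); the window radius would be supplied by an annealing schedule such as the one sketched in the ablation discussion above, so labels start soft and end nearly one-hot.

```python
import numpy as np

def dynamic_smooth_label(angle_deg, radius, num_bins=180):
    """Soft label over angle bins with a shrinkable smoothing window.

    The true bin gets weight 1; neighbours within `radius` get linearly
    decaying weights. Annealing `radius` towards 0 recovers a one-hot
    label late in training. All constants here are illustrative.
    """
    label = np.zeros(num_bins, dtype=np.float32)
    centre = int(round(angle_deg)) % num_bins
    r = max(int(round(radius)), 0)
    for offset in range(-r, r + 1):
        # Angle bins are circular: bin 179 is adjacent to bin 0.
        label[(centre + offset) % num_bins] = 1.0 - abs(offset) / (r + 1)
    return label

print(dynamic_smooth_label(10, radius=3)[6:15])  # soft window around bin 10
print(dynamic_smooth_label(10, radius=0)[6:15])  # late training: one-hot
```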

### *3.5. Detection Application Results*

In order to demonstrate the waste detection performance of the improved method proposed in this paper, the method was tested in real scenarios with different levels of illumination, including a waste station, a garage, a corridor, and a lawn. The results are shown in Figure 13: the method detected the waste objects effectively in all of these scenarios, confirming that it can carry out oriented waste detection effectively in practice.


**Figure 13.** Waste detection with YOLOv5m-DSM in different scenarios.

### **4. Conclusions**

This paper focused on waste detection for a robotic arm based on YOLOv5. Unlike general image analysis, a waste-sorting robotic arm must not only identify a target object but also accurately judge its placement angle so that the arm can set an appropriate grasping angle. To address this need, we added an angular prediction network to the detection head to provide the grasping angle information for waste objects and proposed a dynamic smoothing algorithm for the angle loss to enhance the model's angular prediction ability. In addition, we improved the method's feature extraction and aggregation abilities by optimizing the backbone and the feature aggregation network of the model. The experimental results showed that the improved method outperformed the comparison methods in oriented waste detection: the average precision and recall rate were 93.9% and 94.8%, respectively, 11.6% and 7.6% higher than those of the original network.

**Author Contributions:** Conceptualization, W.Y., Y.X. and P.G.; methodology, W.Y. and Y.X.; software: W.Y. and P.G.; validation: W.Y.; writing—original draft preparation, W.Y.; writing—review and editing, W.Y., Y.X. and P.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was partially supported by Guangdong Key Discipline Scientific Research Capability Improvement Project: no. 2021ZDJS144; Projects of Young Innovative Talents in Universities of Guangdong Province: no. 2020KQNCX126; Key Subject Projects of Guangzhou Xinhua University: no. 2020XZD02; Scientific Projects of Guangzhou Xinhua University: no. 2020KYQN04; and the Plan of Doctoral Guidance: no. 2020 and no. 2021. Xie's work was supported by the Natural Science Foundation of China (no. 61972431).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
