**Data Analysis and Mining**

Editors

**Stefanos Ougiaroglou Dionisis Margaris**

*Editors*: Stefanos Ougiaroglou, Department of Information and Electronic Engineering, International Hellenic University, Sindos, Thessaloniki, Greece

Dionisis Margaris, Department of Digital Systems, University of the Peloponnese, Kladas, Sparta, Greece

*Editorial Office*: MDPI, St. Alban-Anlage 66, 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Applied Sciences* (ISSN 2076-3417) (available at: www.mdpi.com/journal/applsci/special_issues/data_analysis_mining).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

Lastname, A.A.; Lastname, B.B. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-9503-0 (Hbk) ISBN 978-3-0365-9502-3 (PDF) doi.org/10.3390/books978-3-0365-9502-3**

© 2023 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license. The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons Attribution-NonCommercial-NoDerivs (CC BY-NC-ND) license.

## **Contents**



## **About the Editors**

#### **Stefanos Ougiaroglou**

Stefanos Ougiaroglou is an assistant professor in the Department of Information and Electronic Engineering at the International Hellenic University, where he teaches courses on programming, databases, web application development, and data mining. He received a B.Sc. in Computer Science (2004) from the Department of Informatics at the Alexander TEI in Thessaloniki, Greece; an M.Sc. in Computer Science (2006) from the Department of Computer Science at Aristotle University in Thessaloniki, Greece; and a Ph.D. in Computer Science (2014) from the Department of Applied Informatics at the University of Macedonia, Greece. His research interests include data mining algorithms, data reduction, data streams, data management for mobile computing, databases, algorithms and data structures, educational technology, and web application development. He has published several papers in peer-reviewed international journals and conference proceedings.

#### **Dionisis Margaris**

Dionisis Margaris is an assistant professor in the Department of Digital Systems, University of the Peloponnese, Greece, where he teaches courses on programming, operating systems, software technology, and information systems. He received his B.Sc., M.Sc., and Ph.D. from the Department of Informatics and Telecommunications, University of Athens, Greece, in 2007, 2010, and 2014, respectively. He has published more than 70 papers in peer-reviewed international journals, books, and conference proceedings. His research interests include information systems, personalization, recommender systems, business processes, web services, and data mining.

## **Preface**

Today, data analysis and mining are used in numerous everyday tasks to solve practical problems. This research field has attracted the interest of both academia and industry. The research community has developed algorithms, techniques, and tools for the prediction of future situations, the discovery of clusters of similar data, association rule mining, and pattern recognition, among others, all of which have found applications in many domains, such as medicine, finance, business, biology, marketing, and education. In this reprint, 17 papers are published on different topics within the broad research field of data analysis and mining. The included papers present new data mining algorithms and techniques, as well as applications of data analysis and mining in real-world domains. These papers were carefully selected based on a rigorous peer-review process involving several respected reviewers organized by *Applied Sciences*. It is our sincere hope that these papers will provide new inspiration for the development and application of data analysis and mining. We would like to thank all the authors and reviewers who contributed to this reprint.

> **Stefanos Ougiaroglou and Dionisis Margaris** *Editors*

### *Article* **A Data-Science Approach for Creation of a Comprehensive Model to Assess the Impact of Mobile Technologies on Humans**

**Magdalena Garvanova 1,\*, Ivan Garvanov 1, Vladimir Jotsov 1,2, Abdul Razaque 2,\*, Bandar Alotaibi 3, Munif Alotaibi 4 and Daniela Borissova 1,5**


**Abstract:** Mobile technologies are an essential part of people's everyday lives since they are utilized for a variety of purposes, such as communication, entertainment, commerce, and education. However, when these gadgets are misused, the human body is exposed to continuous radiation from the electromagnetic field created by them. The communication services available are improving as mobile technologies advance; however, the problem is becoming more severe as the frequency range of mobile devices expands. To solve this complex case, it is necessary to propose a comprehensive approach that combines and processes data obtained from different types of research and sources of information, such as thermal imaging, electroencephalograms, computer models, and surveys. In the present article, a complex model for the processing and analysis of heterogeneous data is proposed based on mathematical and statistical methods in order to study the problem of electromagnetic radiation from mobile devices in-depth. Data science selection/preprocessing is one of the most important aspects of data and knowledge processing aiming at successful and effective analysis and data fusion from many sources. Special types of logic-based binding and pointing constraints are considered for data/knowledge selection applications. The proposed logic-based statistical modeling method provides both algorithmic as well as data-driven realizations that can be evolutionary. As a result, non-anticipated and collateral data/features can be processed if their role in the selected/constrained area is significant. In this research, the data-driven part does not use artificial neural networks; however, this combination was successfully applied in the past. It is an independent subsystem maintaining control of both the statistical and machine-learning parts. The proposed modeling applies to a wide range of reasoning/smart systems.

**Keywords:** signal processing; smart device; electromagnetic field; non-ionizing radiation protection; SAR; ANOVA; data science; selection; constraint satisfaction; preprocessing; mobile technology; machine learning; statistics

**Citation:** Garvanova, M.; Garvanov, I.; Jotsov, V.; Razaque, A.; Alotaibi, B.; Alotaibi, M.; Borissova, D. A Data-Science Approach for Creation of a Comprehensive Model to Assess the Impact of Mobile Technologies on Humans. *Appl. Sci.* **2023**, *13*, 3600. https://doi.org/10.3390/app13063600

Academic Editors: Stefanos Ougiaroglou and Dionisis Margaris

Received: 4 January 2023; Revised: 28 February 2023; Accepted: 2 March 2023; Published: 11 March 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

The methods used in data science show ways to find solutions to a specific problem [1]. With the development of technology, the types of data to be analyzed are diverse and heterogeneous [2]. Having many and varied sensors to register an event or phenomenon is a great advantage in data processing and decision making but also a great challenge for data analysts [3]. The processing, aggregation, and analysis of disparate data is a complex process that can be facilitated by the use of intelligent solutions [4].

The study of the effects of smart devices on humans is a very recent scientific task that requires the processing and analysis of diverse data sets obtained from numerous measurements [5]. A single type of study cannot give an unambiguous answer to this question. Smart technologies are all around us and, in the near future, their number will increase many times over [6]. One of the most popular and currently used smart devices is the smartphone, which is used for both work and entertainment [7].

In recent years, this type of device has become increasingly widespread, and even children and young teenagers now own and use smartphones. These devices are used for communication, work, games, entertainment (watching movies and listening to music), visiting social networks, and more. At the same time, there are studies and analyses of the impact of these technologies on humans, and these effects are both psychological and biophysical in nature [8]. The smartphone is close to its owner, and the amount of time spent with this device is constantly increasing [9]. This process is difficult to interrupt or limit; however, if the consequences of overuse are studied and properly analyzed, the question of how to reduce the harmful effects on humans can be answered [10].

The psychological effects of smart technologies are the result of their long-term use and the merging of the real and virtual worlds, which leads to social alienation, psychological loneliness, personal anxiety, low self-esteem, and hence to depressive states. More specific questions concern what internet addiction is and which physical and mental symptoms characterize this condition, the types of addiction, how widespread they are in society and which areas are most affected, which population group is most at risk, what the consequences are, and, last but not least, the mechanisms of therapy and prevention.

Among the most commonly used methods of analysis are statistics from consulting agencies, content analysis of sites and blogs, and data from empirical studies and psychodiagnostic tests. The data analytics processes can be successfully combined with logic-based modeling instruments with the aim of producing more medically applicable, versatile, and universal decisions.

Some experts find that dependence on smart technologies, and in particular on the services they offer, is not a separate behavioral disorder but a syndrome of a serious socio-psychological problem. The majority of researchers believe that the combination of addiction to cyberspace, together with electromagnetic radiation from smart devices, is a risk factor with dangerous consequences for the mental and physical health of an individual. Most smart devices communicate with each other using electromagnetic signals, which are a serious threat to human life and pollute the environment with invisible "electrosmog".

This article discusses a data-science approach to creating a comprehensive model for assessing the impact of mobile technologies on humans. The aim is to propose a data-science concept for the preprocessing, processing, and postprocessing of disparate data obtained from various sensors, measuring devices, and computer models for assessing the impact of mobile technologies on humans.

#### *1.1. Paper Organization*

The remainder of this paper is organized as follows. Section 2 presents the salient features of existing works. Section 3 presents the main statistical data processing methods. Section 4 shows the results from measuring the electromagnetic field, which reveal that, under certain conditions, mobile devices emit high-frequency electromagnetic waves and can cause various negative effects on humans. Section 5 discusses the most popular dosimetric values that estimate the levels of absorption of electromagnetic fields by the human body.

Modeling for SAR is used to mimic and illustrate the process of electromagnetic field absorption by the human head in Section 6. The collection and processing of thermal images are shown in Section 7. In Section 8, the experimental results and discussion are presented related to the change of brain activity of a mobile phone user. Section 9 proposes the use of complex data preprocessing, postprocessing, deep modeling, and analysis models by using intelligent methods. Finally, in Section 10, the paper is concluded.

#### *1.2. Research Methodology*

Extensive investigation, familiarization, and evaluation are crucial in laying the groundwork for our suggested strategy. To handle and analyze heterogeneous data, we developed a sophisticated model based on mathematical and statistical techniques and then compared it to current state-of-the-art algorithms. To obtain these results, libraries implementing existing algorithms were employed. We reviewed the literature for a variety of study subjects and datasets and published the findings. These findings show that some of the outcomes are comparable to our suggested methodology.

The purpose of our study is to understand how mobile phones affect people in order to advance the efforts described in this paper. In conclusion, because of the nature of the problem and the datasets that the algorithms are intended for, a true comparison is fairly challenging. In other words, one method may do better than another in some situations, while the opposite may occur in others. This article's scope does not allow for a thorough analysis and experimental investigation of each; instead, an overall evaluation of the different methods is provided. Furthermore, a more complex approach is required to solve the research issue, one that involves performing various experimental measurements, compiling statistical data, and using a computer model to explain some physiological effects brought on by electromagnetic wave exposure of the human head.

#### **2. Related Work**

This section discusses the main contributions of the current works. An assessment of the environmental and human health implications of base station and mobile phone radiation is provided [11]. A key invention that has changed people's lifestyles is the cell phone. With the widespread use of mobile phones in everyday life, the standard of living has significantly improved around the world. There have always been concerns about the effects of radio frequency radiation on humans, plants, and animals. Furthermore, it is alleged that the radiation emitted by mobile phones damages human health and jeopardizes the enjoyment and convenience derived from using the devices. The authors in [12] analyzed the changes that these smartphone technologies can bring to human–nature interactions while focusing on the outdoor behaviors of experienced outdoor users.

The authors in [13] showed that the exposure of the human body to electromagnetic fields (EMF) of different frequencies can cause different biophysical effects. Thermal effects are typically minimal at frequencies below 100 kHz; however, they become more pronounced as the frequency increases. Smartphones communicate via high-frequency signals, and extended exposure to the generated electromagnetic field affects the skull. Additionally, irritability, memory impairment, weariness, anxiety, headaches, and disrupted sleep are primary indicators of changes in the body. It is believed that the changes caused by EMF can accumulate in the body under conditions of prolonged exposure.

As a result, pathologies, such as leukemia, brain tumors, and hormonal diseases, can develop. Research has investigated memory loss, Parkinson's and Alzheimer's disease, amyotrophic sclerosis, AIDS, and an increase in suicides in relation to EMF exposure [14]. Another consequence of exposure to EMF in people is the syndrome of premature aging of the body. Despite extensive investigations, there remain a variety of unknown and undiagnosed addictions in people induced by EMF. All of these impacts have been recorded using various research approaches.

These include the processing of thermal images to analyze thermal effects and the processing of EEG signals to assess brain activity [15]. To obtain a unified, thorough evaluation of the impact of smart technology on humans, an intelligent approach for assessing heterogeneous data is presented. To that end, this study proposes a framework for combining disparate data sets in order to assess the impact of smart technology on humans. Additionally, new methods for acquiring and evaluating empirical and experimental data are required to overcome the problem. A paradigm for unification and intelligent solutions is proposed and evaluated in this research for addressing these difficulties.

#### **3. Statistical Data Processing**

To measure the impact of active usage of smart technology, data can be collected on the effects of mobile devices on the psychological and physical health of the users. Correlation analysis was used to establish the relationships and the degree of dependence between individual variables. The most commonly used correlation coefficient is the Pearson coefficient (*r*) for linear correlation, which is calculated by the formula [16]:

$$r = \frac{P}{S_X S_Y} \tag{1}$$

where *P*—moment of the products; *SX*—standard deviation of the variable *X*; and *SY*—standard deviation of the variable *Y*. The moment of the products (*P*) is calculated as:

$$P = \frac{\sum XY}{n - 1} - \frac{\sum X \sum Y}{n(n - 1)} \tag{2}$$

where ∑ *X*—sum of *X* values; ∑ *Y*—sum of *Y* values; ∑ *XY*—sum of products of *X* and *Y*; and *n*—sample size.
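To make the two formulas concrete, the short sketch below computes *r* from Equations (1) and (2) and checks it against a library routine; the variable names and sample values are illustrative assumptions, not data from this study.

```python
# Minimal sketch (not from the paper): Pearson's r via Equations (1)-(2).
import numpy as np

def pearson_r(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    # Equation (2): product moment P
    p = np.sum(x * y) / (n - 1) - (np.sum(x) * np.sum(y)) / (n * (n - 1))
    # Equation (1): r = P / (S_X * S_Y), with sample standard deviations
    s_x, s_y = np.std(x, ddof=1), np.std(y, ddof=1)
    return p / (s_x * s_y)

x = [2, 4, 6, 8, 10]   # e.g., daily phone-use hours (hypothetical)
y = [1, 3, 4, 6, 9]    # e.g., reported fatigue score (hypothetical)
print(pearson_r(x, y))  # matches np.corrcoef(x, y)[0, 1]
```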

Another statistical criterion that is successfully used to determine changes in the responses and/or conditions of subjects as a result of an experimental intervention is Student's *t*-test for related samples, which is applied in a "before-and-after" research design. It can work with small volumes of data: a measurement is taken "before the intervention", another "after the intervention", and statistically significant differences in the values of the tested variables are recorded. The empirical value of the *t*-test is calculated by the formula [16]:

$$t_E = \frac{|\bar{d}|}{\sqrt{\frac{\sum d^2 - n\bar{d}^2}{n^2 - n}}} \tag{3}$$

where *d* = *X*2 − *X*1 is the difference between the two measured values of each object, *d̄* is the mean of these differences, *n* is the number of observed objects, and the degrees of freedom are *df* = *n* − 1.
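As a worked illustration of Equation (3), the sketch below computes the empirical *t*-value for a hypothetical before/after sample and cross-checks it against `scipy.stats.ttest_rel`; the input values are invented for demonstration only.

```python
# Minimal sketch (assumed data): paired t-test of Equation (3).
import numpy as np
from scipy import stats

def paired_t(before, after):
    d = np.asarray(after, float) - np.asarray(before, float)   # d = X2 - X1
    n = len(d)
    d_mean = d.mean()
    # Equation (3): t_E = |d_mean| / sqrt((sum(d^2) - n*d_mean^2) / (n^2 - n))
    t_e = abs(d_mean) / np.sqrt((np.sum(d**2) - n * d_mean**2) / (n**2 - n))
    return t_e, n - 1                                           # df = n - 1

before = [34.1, 33.8, 34.0, 33.9, 34.2, 34.0]   # hypothetical "before" values
after  = [34.9, 34.5, 34.6, 34.8, 35.1, 34.7]   # hypothetical "after" values
t_e, df = paired_t(before, after)
t_ref, p_ref = stats.ttest_rel(after, before)
print(t_e, df, abs(t_ref), p_ref)                # t_e equals |t_ref|
```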

Among the most powerful statistical techniques for studying causal relationships is ANOVA (Analysis of Variance). One-way ANOVA provides analysis of the variation of a quantitatively dependent variable—for example, the degree of internet addiction caused by an independent qualitative or quantitative variable—for example, the age group. According to the null hypothesis, ANOVA is used to test the assumption of whether several means are equal, allowing determination of not only the differences between them but also which exact mean values differ from the others.

Table 1 shows the formulas for calculating the one-way ANOVA. The notations are as follows: *SSb*—sum of between-group squares; *SSw*—sum of within-group squares; *SST*—total sum of squares; *K*—degrees of freedom; *nj*—size (number of measurements) of each of the *k* samples (groups); *x*¯*j*—sample mean of the *j*-th group; *x*¯—total mean; *xij*—value of the *i*-th individual from the *j*-th group; and *n*—total sample size. From Table 1, it is clear that the *F*-ratio is obtained by dividing the between-group mean square *MSb* by the within-group mean square *MSw*:

$$F = \frac{MS_b}{MS_w} \tag{4}$$

Therefore, the logic of ANOVA is based on the decomposition of the total variance of the variable into two key components: the between-group variance (deviations of the group means from the total arithmetic mean) and within-group variance (individual deviations of the values from the mean within a category (group)).


**Table 1.** One-way ANOVA [16].

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square |
|---|---|---|---|
| Between groups | $SS_b = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2$ | $K_1 = k - 1$ | $MS_b = SS_b/(k-1)$ |
| Within groups | $SS_w = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2$ | $K_2 = n - k$ | $MS_w = SS_w/(n-k)$ |
| Total | $SS_T = SS_b + SS_w$ | $n - 1$ | |

Fisher's *F*-test is checked according to the significance level *α* (usually equal to or less than 0.05) and the degrees of freedom *K*, as follows: for the between-group variance, *K*1 = *k* − 1, where *k* is the number of groups compared (degrees of freedom of the numerator); and for the within-group variance, *K*2 = *n* − *k*, where *n* is the sample size (degrees of freedom of the denominator).
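The following sketch illustrates the one-way ANOVA computation summarized in Table 1 on made-up groups and verifies the *F*-ratio against `scipy.stats.f_oneway`; the group data are purely hypothetical.

```python
# Minimal sketch (hypothetical groups): one-way ANOVA following Table 1.
import numpy as np
from scipy import stats

def one_way_anova(groups):
    groups = [np.asarray(g, float) for g in groups]
    all_x = np.concatenate(groups)
    grand_mean = all_x.mean()
    k, n = len(groups), len(all_x)
    ss_b = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # between-group SS
    ss_w = sum(np.sum((g - g.mean()) ** 2) for g in groups)            # within-group SS
    ms_b, ms_w = ss_b / (k - 1), ss_w / (n - k)                        # mean squares
    return ms_b / ms_w, k - 1, n - k                                   # F, K1, K2

# e.g., internet-addiction scores for three age groups (made-up numbers)
g1, g2, g3 = [12, 15, 14, 13], [18, 20, 19, 21], [25, 23, 27, 24]
f, k1, k2 = one_way_anova([g1, g2, g3])
print(f, k1, k2, stats.f_oneway(g1, g2, g3))   # the F values agree
```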

#### **4. Measuring the Electromagnetic Field from a Smartphone**

These measurements show the presence of electromagnetic fields generated by different models of GSM devices in different operating modes. The results differ depending on whether the measurements are made outdoors or indoors. The obtained values additionally vary depending on how far the smartphone is from the measuring equipment, the distance to the base station, and what additional radio sources are nearby. For this purpose, experimental measurements were performed in which the measuring equipment was positioned at distances of 1 and 10 cm from the smartphone. The average measurement results are shown in Table 2. Measurements were obtained with a Gigahertz HFE35C analyzer.


**Table 2.** EMF values generated by a GSM device.

The data in Table 2 show that, in search and talk mode, the emission levels are many times higher than in passive mode. When ringing, the signal level is high for the first 20 s and then decreases significantly. Depending on the location of the smartphone, the signal levels are different, and when indoors, the signal level may be above the normal values. The measurements were performed both indoors and in underground rooms, where the signal source from the base station was extremely weak; in order to achieve successful communication, the GSM device transmitted at its maximum radiation level to compensate for the attenuation and keep the conversation from failing.

Studies have shown that the level of the electromagnetic field in urban conditions is many times higher than in rural areas. This confirms the assumption that the presence of different types of electrical devices and transmitters will lead to a significant increase in the background electromagnetic field. With the development of technology, this problem will deepen and become more relevant. One of the most serious generators of electromagnetic fields is the smartphone. The level of the electromagnetic field generated by these devices strongly depends on the mode of operation and the environment in which the phone is located. In some cases, the levels of electromagnetic fields exceed the regulated permissible levels. The proximity of these devices to the human head requires in-depth study of the influence of electromagnetic fields on the possible effects on the human body.

#### **5. Specific Absorption Rate**

The specific absorption rate (*SAR*) shows how much radiation is absorbed by human tissues when irradiated by an electromagnetic field. The *SAR* is a measure of the rate at which the radio-frequency energy from a mobile phone is absorbed by the human body [17].

The local *SAR* is calculated as the power loss *dP*<sup>1</sup> absorbed in an infinitesimal mass *dm* in the following way:

$$SAR = \frac{dP_1}{dm} = \frac{\sigma_{eff} E_{rms}^2}{\rho} = \frac{J_{rms}^2}{\rho\,\sigma_{eff}} \tag{5}$$

where *Erms* is the root mean square value of the electric field, *Jrms* is the current density, *σeff* is the effective conductivity of human brain tissue, and *ρ* is the tissue density. Therefore, the *SAR* unit of measurement is W/kg. Energy from electromagnetic fields is absorbed by human tissues and warms them. This leads to another definition of *SAR*, namely:

$$SAR = c_p \frac{\Delta T}{\Delta t} \tag{6}$$

where *cp* is the specific heat capacity of the tissue and Δ*T* is the change in temperature over a period of time Δ*t*.
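A minimal sketch of the two *SAR* definitions in Equations (5) and (6) is given below; the tissue parameters (*σeff*, *ρ*, *cp*) and field values are rough illustrative assumptions, not measurements from this study.

```python
# Minimal sketch (illustrative values only): local SAR from Equations (5) and (6).

def sar_from_field(e_rms, sigma_eff, rho):
    """Equation (5): SAR = sigma_eff * E_rms**2 / rho, in W/kg."""
    return sigma_eff * e_rms ** 2 / rho

def sar_from_heating(c_p, delta_temp, delta_time):
    """Equation (6): SAR = c_p * dT/dt, in W/kg."""
    return c_p * delta_temp / delta_time

# Rough brain-tissue-like numbers (assumptions, not data from the paper):
print(sar_from_field(e_rms=30.0, sigma_eff=1.0, rho=1050.0))            # ~0.86 W/kg
print(sar_from_heating(c_p=3600.0, delta_temp=0.001, delta_time=6.0))   # 0.6 W/kg
```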

A distinction should be made between the instantaneous *SAR* and the permissible SARs, where an average value is measured for a given mass of tissue and a specified period of time. It is best to use a computer model to study the *SAR* and the thermal effects. The benefit of the model is that it visualizes the processes in depth.

#### **6. Modeling for SAR Simulation**

With the use of a computer model, it is feasible to thoroughly analyze *SAR* in the human body. This model may be used to research the effects of mobile phones on the human head since it can be used to visualize the processes of tissue absorption and heat deep within the human body. The characteristics of human tissues can be altered to imitate various age groups. The numerous tissues and bones that make up the human head model each have unique electromagnetic energy-absorption properties.

A mobile phone's characteristics can be changed to emulate various GSM device models. Numerous factors, including the antenna radiation pattern, transmitted signal strength, and signal frequency, are modifiable. The computer model's most important feature is its ability to look deep inside the human head, which can be utilized to analyze the absorption and warming processes in detail.

For the proper functioning of the model, it is essential to know the biological characteristics of human tissues. The human body is composed of many organs and characterized by specific biological parameters that must be taken into account correctly in the model. The electromagnetic characteristics (dielectric constant, magnetic permeability, and conductivity) [18] should be properly defined for each modeled organ. The created computer model uses parameters that characterize the biological tissues of an adult. The shape of the human head was taken from the IEEE library and loaded into the COMSOL Multiphysics software. The model was imported from a file named sar\_in\_human\_head.mphbin [19,20].

The source of electromagnetic radiation was a model of a smartphone that was added manually. In the considered model, the location of the device was chosen to be on the left side of the head in order to facilitate the comparison of the obtained results with thermal images from our previous experimental studies. The mobile device was modeled at a distance of 1 cm from the head, as shown in Figure 1.

The electromagnetic parameters of the biological tissues of the human head were modeled through an interpolation function that uses the characteristics of the tissue inside the human head. The data for this function are taken directly from a file named sar\_in\_human\_head\_interp.txt. After simulating the model, it is possible to estimate the *SAR* in any shape and tissue of the human body. When designing mobile devices, it is important to determine the amount of radiation that can be absorbed by the human body. The use of COMSOL Multiphysics and its radio frequency module allows a faster and more efficient approach in the design of wireless devices that meet certain safety requirements. The local *SAR* value in the human head, calculated for a carrier frequency of 900 MHz, is shown in Figure 2.

**Figure 2.** *SAR* visualization.

When talking on a mobile phone, the human head is very close to the phone, and the power of the emitted electromagnetic field is very high. Penetrating into a person's head, the electromagnetic field releases some of its energy, and the tissues in the head absorb this energy. Electromagnetic energy affects the particles in the tissue due to the electrical and magnetic components of the electromagnetic field. The effects of penetration of the electromagnetic wave into the human head can be visualized by means of cross-sections of the head at certain levels (Figure 3).

**Figure 3.** *SAR* visualization at different levels in the human head.

The strongest influence of the electromagnetic waves is in the head area, located in the immediate vicinity of the mobile device [17]. The greatest amount of energy is absorbed in this area, and the penetration into the human head is the greatest. The effects of exposure of the human body to radio frequency radiation mainly depend on the exposure time and the strength of electromagnetic fields. The penetration of the electromagnetic field into the depths of the human head depends on the frequency of the carrier signal. The higher this frequency, the faster it attenuates in space and the less it penetrates the human head. The highest values of absorption are observed on the surface of the human head. The developed model calculated only the local values of the *SAR* parameter. The maximum local *SAR* value is always higher than the maximum mean *SAR* value.

The amount of energy absorbed by the human head affects the temperature to which the tissues of the head are heated. The study of the processes of temperature distribution in the human head and on its surface is possible with the help of the created computer model using the COMSOL Multiphysics software. The frequency of the signal of the mobile device was selected to be 900 MHz. The transformation of the absorbed energy into heat was modeled with a bioheat equation. The change in temperature is a function of the physiological properties of biological tissues and blood circulation in the human body [21]. The thermal effects on and in the human head are shown in Figure 4.

Thanks to the created computer model, it is possible to study the processes of penetration of the electromagnetic field into the human head and the effects this causes, thus thoroughly simulating different situations with different characteristics of the head model and of the mobile phone. The visualization capabilities of the COMSOL Multiphysics software are impressive and allow a detailed view of the simulation results.

**Figure 4.** Visualization of the thermal effects in the human head for several horizontal layers.

#### **7. Collection and Processing of Thermal Images**

The impact of the use of GSM devices on the physical condition of a person can be assessed by the thermal effects caused by the electromagnetic waves emitted by these devices. In the experimental scenario, a participant spoke on a GSM device for 20 min while a FLIR P640 thermal camera captured their head in profile and full face. During the conversation, the GSM device was located about 1 cm from the head of the participant, and the thermal camera was about 2 m away, focused on their head. The average room temperature was around 22 °C.

As a result of the 20 min conversation and the irradiation with electromagnetic waves from the GSM device, the head of the participant in the experiment warmed up by about 1–2 °C, as can be seen in Figure 5. The increase in head temperature depends on the duration of the conversations. When talking for up to 30 s, no change in intracranial temperature is observed; however, when talking for more than 2 min, first the ear begins to warm up and then the soft tissues around the ear. The increase in temperature is a result of prolonged irradiation of the human body with high-frequency radio signals from the GSM device.

The obtained results show that the temperature on the surface of the head is the highest and decreases with depth. The temperature change near the mobile phone is on the order of 0.6 °C and decreases rapidly inside the head. The thermal effects obtained from the model largely coincide with the results of a real experiment conducted with a thermal camera (Figure 5).

Averaging the temperature of the head before and after the experiment resulted in a temperature difference of about 1.3 °C. After counting the number of pixels exceeding a temperature of 34 °C before and after the experiment, we found that, after the experiment, the area heated above this value was three times larger than before. The thermal images were processed using MATLAB. The study found that the side of the head next to the GSM device heated up much more than the other side. The areas around the ear, forehead, and neck heated up much more than the rest of the head.
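The pixel-level comparison described above was done in MATLAB; the sketch below shows an equivalent, purely illustrative way to count pixels above the 34 °C threshold and compare mean temperatures, assuming the thermal images are available as per-pixel temperature arrays (the random arrays stand in for real camera exports).

```python
# Minimal sketch (not the authors' MATLAB code): heated-area comparison of two
# thermal images via a 34 °C threshold.
import numpy as np

def heated_area(temps_c, threshold=34.0):
    """Number of pixels whose temperature exceeds the threshold."""
    return int(np.count_nonzero(np.asarray(temps_c) > threshold))

def mean_temperature(temps_c):
    return float(np.mean(temps_c))

# "before"/"after" would be 2-D per-pixel temperature arrays from the camera;
# the random data below is only a stand-in.
rng = np.random.default_rng(0)
before = rng.normal(33.0, 0.8, size=(480, 640))
after = rng.normal(34.3, 0.8, size=(480, 640))
print(mean_temperature(after) - mean_temperature(before))   # mean difference, °C
print(heated_area(after) / max(heated_area(before), 1))     # ratio of heated areas
```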

The study of the processes of penetration of the electromagnetic field into the human head and the effects caused by this is an extremely important scientific task. The interaction between the human head and the electromagnetic radiation caused by cell phones can cause electric currents and electric fields in the human head, which can lead to negative health effects.

**Figure 5.** (**a**) Visualization of thermal effects in a human head before use of a GSM device; (**b**) visualization of thermal effects in a human head during use of a GSM device; and (**c**) visualization of thermal effects in a human head after use of a GSM device.

#### **8. Experimental Results and Discussion**

Electroencephalographs (EEGs), which measure electrical signals generated by the brain (brain waves), are used to study a person's brain activity. EEG signals are obtained as a result of the work of neurons in the human brain and can be intercepted using electrodes attached to the surface of the scalp [22–24].

A series of experiments were conducted to analyze the possible effects of the electromagnetic fields generated by smartphones on the activity of the human brain. Thirty volunteers (16 men (53.3%) and 14 women (46.7%) with an average age of 45.2 years) participated in the studies; however, we plan to increase these numbers in future studies among adults and children. The participants in the study stated that they were physically and mentally healthy, that they had not taken any medication before the tests, and that they were voluntarily undergoing these tests. The experiments were conducted in two stages.

The first stage involved studying the EEG signals of the subjects without using a mobile phone. The second stage of the experiment was performed while the subjects used a mobile phone (Figure 6). The EEG recordings from the two experiments were processed in the MATLAB environment in the time and frequency domain of the signal. The aim was to make a comparative analysis of the signal spectra from the two experiments.

Measures were taken to reduce any other brain activity in order to assess the effect of cell phone electromagnetic radiation on a person's brain activity. The experiments were conducted in a quiet and cozy room, with the test subjects placed in comfortable armchairs with their eyes closed to reduce side stimuli. During the experiments, participants held the phone at a distance of about 1 cm from the head, listening to a quiet countdown from one to one hundred, which was started by a researcher in another room. The aim of the experiment was to be as close as possible to a real conversation, as shown in Figure 6.

A mobile phone with a SAR of 0.36 W/kg was used during the experiment. The experiment lasted about an hour, with the first 30 min without a phone and the second 30 min with a phone. The obtained signals were filtered and divided into the following frequency ranges: delta *δ* (1–4 Hz), theta *θ* (4–8 Hz), alpha *α* (8–13 Hz), and beta *β* (13–32 Hz). With the help of the Pwelch function of MATLAB, the spectra of the signals before and after a call with a mobile device were obtained. The spectra of the two experiments for all electrodes were compared, and differences in the spectra were found at several measurement points (Figure 7).
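The spectral analysis was performed with MATLAB's Pwelch function; the sketch below shows an analogous band-power computation with `scipy.signal.welch`, assuming a 256 Hz sampling rate and a synthetic alpha-dominated signal purely for illustration.

```python
# Minimal sketch (assumed sampling rate, synthetic signal): Welch power spectra
# per EEG band, mirroring MATLAB's pwelch with scipy.signal.welch.
import numpy as np
from scipy.signal import welch

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 32)}

def band_powers(eeg, fs):
    freqs, psd = welch(eeg, fs=fs, nperseg=fs * 2)          # 2-s segments
    powers = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        powers[name] = np.trapz(psd[mask], freqs[mask])     # integrate PSD over band
    return powers

fs = 256                                                    # assumed sampling rate, Hz
t = np.arange(0, 30, 1 / fs)
eeg = np.sin(2 * np.pi * 10 * t) + 0.3 * np.random.randn(t.size)  # alpha-dominated toy signal
print(band_powers(eeg, fs))   # compare "without phone" vs "with phone" recordings per electrode
```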

**Figure 6.** Participant during the experiment.

At the points with the numbers T3, T5, and F7, which are the closest to the mobile phone, a significant change in the spectral activity of the brain was found. The largest change in the spectrum was found in T3, where the changes were in the theta, alpha, and beta frequency ranges. The changes in the spectra at points T5 and F7 were only in the theta and alpha ranges. Interestingly, this dependence was found in all participants in the experiments but to varying degrees.

**Figure 7.** (**a**) Differences in the spectra by range and (**b**) different GSM ranges for delta, theta, alpha, and beta.

To compare the average spectral exposure with and without GSM for the ranges of delta *δ* (1–4 Hz), theta *θ* (4–8 Hz), alpha *α* (8–13 Hz), and beta *β* (13–32 Hz), Student's *t*-test for related samples was used (Howard, 2008). Statistically significant differences, where *p* < 0.05, are visualized in Figure 7.

The change in brain activity in a person's head on the side of a mobile phone has a short-term effect that is shown to be due to the operation of a mobile phone. If the *SAR* is studied and analyzed in more detail using a computer model, we expect that the changes in brain activity will be closely related to the location and amount of absorbed electromagnetic energy. This relationship has not been studied in sufficient depth and requires further research into the body's biophysical responses; thus, this is of interest for future research.

#### *Accuracy*

The degree of similarity between a measurement and its real value is referred to as accuracy. A limited number of EEG channels recorded concurrently can improve the accuracy. Two distinct types of tests, each lasting 30 min, were conducted to evaluate brain activity. The first experiment was performed without a cell phone (GSM), whereas the second experiment included a mobile phone while brain activity was being recorded. Interesting discoveries were made, and it was revealed that, while utilizing a cell phone, the accuracy was somewhat reduced.

Figure 8 demonstrates the accuracy and compares the average spectrum exposure with and without GSM for the ranges of delta (1–4 Hz), theta (4–8 Hz), alpha (8–13 Hz), and beta β (13–32 Hz). Figure 8a indicates that the accuracy without a mobile phone was 99.95%, whereas the accuracy with a mobile phone was 99.78% utilizing a delta range of (1–4 Hz). Figure 8b shows that, with the theta (4–8 Hz) frequency range, 99.88% accuracy was obtained without a mobile phone, whereas 99.69% accuracy was obtained with a mobile phone.

**Figure 8.** (**a**) The accuracy of the average spectrum exposure with and without GSM using the frequency range of delta δ (1–4 Hz). (**b**) The accuracy of the average spectrum exposure with and without GSM using the frequency range of theta θ (4–8 Hz). (**c**) The accuracy of the average spectrum exposure with and without GSM using the frequency range of alpha α (8–13 Hz). (**d**) The accuracy of the average spectrum exposure with and without GSM using the frequency range of beta β (13–32 Hz).

Figure 8c depicts the 99.81% accuracy with the alpha (8–13 Hz) range in the absence of a mobile phone. A cell phone, on the other hand, achieved 99.58% accuracy in the same frequency band. Figure 8d exhibits 99.80% accuracy without a mobile phone utilizing a beta (13–32 Hz) frequency range and 99.59% accuracy with a mobile phone. It was shown that the GSM had a negligible impact on the signal accuracy while monitoring brain activity.

#### **9. Processing of Complex Data Analysis Models Using Intelligent Methods**

Intelligent methods are the processes of gathering, modeling, and analyzing data in order to derive insights that may be used to make decisions.

#### *9.1. Deep Knowledge Modeling and Constraint-Based Fast Preprocessing*

The preprocessing of data is effectively utilized in different domains, such as number theory, cryptography, intelligent measurement, education, and bioinformatics. The chances of making an improvement will be quite slim without a comprehensive description of the surroundings. It is clear from working with description logic that logic-based modeling is challenging to control, that it is challenging to merge the logical and statistical phases of the data science cycle, and that their algorithmic complexity is often considerable.

On the other hand, it is possible to reason using the body of knowledge, and this capability enables the development of knowledge- and data-driven open systems. It is simple to combine the suggested study with the mentioned non-classical logical methodologies. More information is available in [25]. When discussing data-driven methodologies, artificial neural networks (ANNs), such as deep learning, are frequently used. In order to enhance the quality of human-like reasoning, this paper investigated a novel data-driven methodology that makes use of modeling and constraint fulfillment characteristics.

Constraint satisfaction methods are frequently used in data-preprocessing issues. Data selection was applied with the aim of completing preprocessing more efficiently, and the considered deep-modeling constraint satisfaction methods significantly improved this process. The following groups of novel, logically inspired constraints have been investigated in [25–27]. The proposed control method is named Puzzle, but it differs significantly from the methods constructed for solving puzzles, such as [26–29].

The latter are ineffective because of their use of random-number strategies. The proposed Puzzle approach is easily combined with these and other methods [30] with the aim of increasing their efficiency. Initially, classical constraint satisfaction methods can be applied to form a closed focus area where certain data may be logically connected.

Let us focus on two objects among the many enclosed in the area in Figure 9. The closed 'focus/selection area' helps to reveal new knowledge concerning the enclosed objects. The data analysis in this case can reveal new knowledge, for example, that *M* implies *N* or that *M* has some relationship to *N*. The second case is the result of classical link analysis and/or corresponding data mining applications. The classical approach has some general disadvantages: it works with a priori given data and is not intended for the elaboration of open systems.
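A minimal sketch of such a closed focus area is given below: three linear constraints bound a region, and only the objects falling inside it are selected for further analysis; the constraint coefficients and object coordinates are invented for illustration.

```python
# Minimal sketch (illustrative only): a closed "focus/selection area" formed by
# three linear constraints, as in Figure 9; objects inside are selected.

# Each constraint is a*x + b*y <= c; together the three bound a triangular area.
CONSTRAINTS = [(-1.0, 0.0, 0.0),   # x >= 0
               (0.0, -1.0, 0.0),   # y >= 0
               (1.0, 1.0, 10.0)]   # x + y <= 10

def in_focus_area(point):
    x, y = point
    return all(a * x + b * y <= c for a, b, c in CONSTRAINTS)

objects = {"M": (2.0, 3.0), "N": (4.0, 1.0), "P": (9.0, 8.0)}
selected = {name: p for name, p in objects.items() if in_focus_area(p)}
print(selected)   # M and N fall inside the area; P is filtered out
```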

In practice, the constraints should be dynamically changed depending on the current knowledge/data. This becomes possible after the introduction of new types of non-classical constraints. Generally speaking, new types of constraints should be introduced. Constraint violation is impossible in the classical case, but this should be reconsidered: every rule/constraint could be defeated depending on the conditions, and the new types of constraints make this possible. Somebody who exposes their body to radiation generally does not think about cancer that could occur 20 years later. Many other application problems arise in medical practice where the problems accumulate gradually.

**Figure 9.** A set of three linear constraints constructs a closed area focusing on objects *M* and *N*.

With new conditions, additional questions arise: **why** the constraints are imposed, **what** violates them and when, **where** they could be defeated, and other use cases.

Binding, pointing, and crossword are the new groups of constraints. There exist many binding situations, and some of them have been researched in this article. The first case is when the maximum binding possibility is concentrated in the center of the area, and the binding strength diminishes with the distance from it. The proposed research revealed that the binding may depend on certain conditions, and it may also influence the features of all objects contained in its area: the linearity, the range, the region of usage, and/or other properties could be defeated. The general case is depicted in Figure 10, where curves *B* and *D* are the bindings concerning the searched goal *G*1.

**Figure 10.** A set of nonlinear constraints in combination with the proposed three groups of logically-based constraints.

Association methods are generalized in binding constraint theories [31–36]. Associations aim at finding rules in the classical implicative form, while binding constraints explore any form of causal relationship. We investigated only a few binding constraint groups, but they proved very useful as a data-science modeling tool.

1. Denote that *A* is bound to *B* if there is proven evidence of any form of causal relationship between them. Implication is also included in this case. The type of the causal form should be described in special forms of metaknowledge attached to the corresponding binding case. For example, some people are very sensitive to long phone calls. The personal binding constraint 'phone call > 10 min' → 'tired' or 'noise in the head' should be added to the modeling case. The metaknowledge should include the complex of disease history, nervous state, history of complaints, and corresponding factors. Semantic reasoners are very helpful in binding information processing.


In Figure 11, an example is shown where the linearity of a classical constraint satisfaction case is fuzzified/defeated in the considered binding constraint area.

**Figure 11.** Type-6 binding constraint and its influence on classical linear constraints.

Agents, especially in health-oriented systems, cannot behave effectively using a fixed set of algorithms. Under certain conditions, every solution may be modified or changed. One of the frequent forms of change is defeasible reasoning. In practice, the software agent should defeat its goals aiming at better performance. Every rule should be gradually improved and modified and/or suddenly changed depending on the situation in a data-driven way. The defeasible scheme controls the usage of many unified exclusions and other defeasible knowledge forms.

Let a Horn clause describe Rule (7).

$$B \leftarrow \bigwedge_{i \in I} A_i \tag{7}$$

where the form of the rule is suitable for backward chaining. This rule is changed when an exclusion *E*(*C*, *Ak*) is attached to Equation (7): if *C* is true, the corresponding *Ak* is defeated, which means that its truth value is reverted (Equation (8)) or it disappears from the antecedent (Equation (9)) because its significance for '*B* is true' is defeated to zero. Furthermore, the variant in Equation (10) is researched, where the defeated value is replaced by another formula.

$$\frac{B \leftarrow \bigwedge_{i=1}^{z} A_i,\; C,\; E(C, A_k),\; \neg A_k \leftarrow C}{B \leftarrow A_1 \wedge A_2 \wedge \cdots \wedge A_{k-1} \wedge \neg A_k \wedge \cdots \wedge A_z} \tag{8}$$

$$\frac{C,\; B \leftarrow \bigwedge_{i=1}^{z} A_i,\; E(C, A_k)}{B \leftarrow A_1 \wedge \cdots \wedge A_{k-1} \wedge A_{k+1} \wedge \cdots \wedge A_z} \tag{9}$$

$$\frac{C,\; B \leftarrow \bigwedge_{i=1}^{z} A_i,\; E(C, A_k)}{B \leftarrow A_1 \wedge \cdots \wedge A_{k-1} \wedge (A_k \vee C) \wedge A_{k+1} \wedge \cdots \wedge A_z} \tag{10}$$

One of the frequently used health cases is where both the antecedent and the consequent are changed in the defeated rule. The other frequent case produces a fact from the defeated rule, and this fact contains a non-implicative relation. The defeasible process explores non-classical rule forms, one of them being '*Ak* is defeated if *C* is true in *E*(*C*, *Ak*)'. Detailed information concerning this topic is given in the book chapter [25].
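As a rough illustration of how an exclusion defeats part of a rule antecedent (the variant of Equation (9)), the sketch below evaluates a hypothetical Horn rule under an exclusion; the rule, facts, and atom names are invented examples, not part of the authors' system.

```python
# Minimal sketch (hypothetical rule base): a Horn rule B <- A1 ∧ ... ∧ Az with an
# exclusion E(C, Ak); when C holds, Ak is dropped from the antecedent (Equation (9)).
def fires(rule_antecedent, facts, exclusions):
    """Return True if the (possibly defeated) rule antecedent holds in `facts`."""
    active = set(rule_antecedent)
    for condition, defeated_atom in exclusions:
        if condition in facts:
            active.discard(defeated_atom)      # Ak's significance is defeated to zero
    return active.issubset(facts)

# 'tired' <- 'phone_call_gt_10min' ∧ 'poor_sleep'
antecedent = ["phone_call_gt_10min", "poor_sleep"]
# exclusion: if a hands-free kit was used, the call-length condition is defeated
exclusions = [("hands_free_used", "phone_call_gt_10min")]

facts = {"poor_sleep", "hands_free_used"}
print(fires(antecedent, facts, exclusions))    # True: the defeated rule still fires
```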

Defeasible reasoning is applied to test the strength of the investigated process and of its significant features. The quoted binding and pointing constraints significantly improve the defeasible processes. The pointing (indicating) constraints are applied in order to determine both the distance to the goal and the direction of the research. Furthermore, the history of the research process can influence the pointing direction.

The group of pointing constraints can be considered a generalization of the classical systems of goal/target or fitness functions. In contrast to the classical cases, pointing constraints are data-driven by nature and become direction-driven as data accumulate. For example, if there is information that there was a pain, the data on its coordinates are probably no longer valid. In this case, the exact conclusion is in doubt until enough proof is accumulated.

The third researched constraint is named 'crossword'. It is depicted by the triple *A*, *C*, *E* in Figure 10. This type models the process of reasoning about unknown things based on accumulated knowledge. In such a way, parts of the searched goal have been found, and, by using the original evolutionary Puzzle method, an attempt is made to complete the whole solution. The internal links between the elements of *A*, *C*, *E* are studied using different combinations of the quoted binding and pointing modeling.

Many algorithms and data-driven approaches were investigated with the aim of finding any binding or pointing solution that enlarges the known part of *G*1. Generally speaking, every pointing constraint was used with the aim of diminishing the set of selected/processed data and knowledge: the 'selection focus' should be narrowed. Pointing to a certain part of the binding area improves the reasoning process.
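One simple way to read a pointing constraint is as a data-driven filter that narrows the selection focus toward the center of a binding area; the sketch below expresses this idea with invented coordinates and an assumed radius.

```python
# Minimal sketch (made-up feature space): a pointing constraint as a filter that
# narrows the "selection focus" toward the center of a binding area.
import numpy as np

def narrow_focus(candidates, binding_center, radius):
    """Keep only candidates within `radius` of the binding-area center."""
    center = np.asarray(binding_center, float)
    return [c for c in candidates
            if np.linalg.norm(np.asarray(c, float) - center) <= radius]

candidates = [(0.2, 0.1), (0.9, 0.8), (0.35, 0.25), (0.95, 0.1)]
focused = narrow_focus(candidates, binding_center=(0.3, 0.2), radius=0.15)
print(focused)   # only the candidates near the binding center remain
```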


On the other hand, in mobile signal processing applications, the pointing constraints point to the center of the corresponding binding area. The binding cases frequently concern best signal processing practice, good medical practice, and many analogous examples. In many cases, small but important signal changes may be traced in this way, and neuro-fuzzy deep-learning methods can be successfully combined with binding approaches. For example, the pointing constraints have been used for the description and maximization of the effectiveness of specific brain–computer interface features and communications to other medical procedures and devices.

Please note that the good-practice examples were researched in coordination with possible bad-practice situations, which are described by using a combination of pointing and binding constraints. The considered novel constraint-based modeling does not change the considered biomedical research schemes in any way but improves their effectiveness.

#### *9.2. Analysis on Preprocessing and Postprocessing Features*

The pointing constraints in this article concern the research on the interconnection of electromagnetic and thermal influence, the influence of thermally located radiation on skin/brain aging, the exploration of possible overreactions to mobile radiation in small groups of people, and the influence of the emotional state of the studied people on their reactions. Noise in the head, forgetfulness, loss of concentration, bad mood, and slight disorientation symptoms after a call are not factors of the binding process, but they should be further investigated one by one or in a group with other medical data.

The binding constraints aim to model how the size of the overheated area is connected to the damage effect, why the spots T3, T5, and F7 are the most promising places to estimate the aging effects, and the possible influence of radiation and fields on brain tissue processes, some of them presumably unknown. Other binding modeling options concern the research on mobile radiation in smoky, wet, and dirty air environments, and the binding of high-level SAR signals to the spectral power of the EEG/electromagnetic field absorption.

Human health problems are the result of long-term dynamic processes in nonlinear and dispersive tissues. Potentially important outcomes may be obtained from long-term research on the accumulated effects on humans after 20, 30, and even 50 years of mobile phone usage. This type of work cannot be executed manually; Deep Learning (DL) should be applied instead. The proposed deep modeling approaches should help to trace and process tiny changes and gradually re-evaluate their significance.

Significant ANOVA variables should be reinforced by the proposed types of pointing and binding constraints, aiming to improve the possibilities for constant knowledge elicitation, accumulation, and processing. In this way, the principles of open systems have been applied to statistical research, making it more data-driven. The good-practice database contains the fact that many people use hands-free kits, microphones, Bluetooth, and other options where the mobile device is far from the head. This does not mean that the problem has an easy solution: the same sort of radiation still remains just near the human body.

Aiming at a long-term search, the binding constraints are set to look for allergic reactions, oncology-like or blood problems, stomach infections, pain patterns, influenza history, and Alzheimer's symptoms. The history before and after the active start of the mobile era should be compared. The earlier history of PC usage is also included in the research. As a result, human-like reasoning is inspired: the radiation implies slightly higher temperatures in skin and tissue, and the question is how large this difference must be to be physically noticed by the patient in a cold or very cold environment.

This is an experimental attempt to bind the mobile radiation to any noticeable influence on the human body. In the positive case, the binding/pointing areas may be enlarged by using other types of constraints. If a patient has any specific problems just after long calls, they should be analyzed in the lab with the aim of binding the problem to mobile radiation as a research hypothesis.

Aging and other human body features depend on many personal factors and on personal history. This is frequently used to contest much of the medical research data in the field. On the whole, it is rather easy to prove or refute any hypothesis within the scope of a narrow study concentrated on only one group of facts and features. The proposed deep modeling options helped us avoid this situation.

In the future, the complex research involving DL will be carried out by software agents. In such a way, the types of constraints should be changed depending on the situation. As a result, powerful data science will analyze the symptoms in each case by using the accumulated medical data and knowledge. The early eradication of irrelevant medical facts and hypotheses should be maintained through the use of software agents.

The use of the proposed data-science techniques opens up the prospect of greatly reducing manual effort and paving the way for intelligent medical research that focuses on combinations made up of minor and ancillary aspects. Sometimes little details might alter the overall course of the investigation.

#### **10. Conclusions**

We developed a novel data-science technique to identify the detrimental impacts of electromagnetic radiation from mobile devices on the human body. The proposed method for analyzing heterogeneous data (thermal imaging and electroencephalograms) is based on mathematical and statistical methodologies. The proposed solution combines the ANOVA statistical method with deep modeling and rapid preprocessing approaches, such as binding/pointing/crossword constraints. Several tests were conducted utilizing the Pwelch function of MATLAB software, both with and without a mobile device. Each experiment was 30 min long.

The resulting signals were filtered and classified into four frequency ranges: delta (1–4 Hz), theta (4–8 Hz), alpha (8–13 Hz), and beta (13–32 Hz). The accuracy was determined for each frequency range with and without a mobile device based on the collected signals. The findings demonstrate that the emission of electromagnetic radiation from mobile devices has an effect on the signal frequency range accuracy. The presence of irradiation leads to an increase in the amplitude of brain signals in different frequency ranges. Furthermore, the results show that improved accuracy was reached without the use of a mobile device for each frequency band.

The proposed approach has limitations because it increases the computational complexity due to the use of heterogeneous data. However, this issue can be resolved using a data-mining approach. In the future, different Quality-of-Service parameters (e.g., energy consumption, time complexity, and reliability) will be examined. Furthermore, the proposed approach will be compared to state-of-the-art approaches: an IoT-based mobile monitoring framework for hyper-local PM2.5 [1] and a cognitive emotion pre-occupation method [38].

**Author Contributions:** M.G., conceptualization, writing, idea proposal, methodology, and results; V.J. and I.G., data curation, software development, and preparation; A.R., Writing, results, software development, preparation, submission, review and editing; M.A. and B.A., review, manuscript preparation, and visualization; D.B. review and editing. All authors have read and agreed to this version of the manuscript.

**Funding:** This work is supported by the Bulgarian National Science Fund, project title "Synthesis of a dynamic model for assessing the psychological and physical impacts of excessive use of smart technologies", KP-06-N 32/4/07.12.2019, led by Magdalena Garvanova.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data that support the findings of this research are publicly available, as indicated in the references.

**Acknowledgments:** This work was supported by the Bulgarian National Science Fund, project title "Synthesis of a dynamic model for assessing the psychological and physical impacts of excessive use of smart technologies", KP-06-N 32/4/07.12.2019, led by Magdalena Garvanova.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

### *Article* **A Flexible Session-Based Recommender System for e-Commerce**

**Michail Salampasis 1,\*, Alkiviadis Katsalis 1, Theodosios Siomos 1, Marina Delianidi 1, Dimitrios Tektonidis 1, Konstantinos Christantonis 1, Pantelis Kaplanoglou 1, Ifigeneia Karaveli 1, Chrysostomos Bourlis 2 and Konstantinos Diamantaras 1**


**Abstract:** Research into session-based recommender systems (SBRS) has attracted a lot of attention, but each study focuses on a specific class of methods. This work examines and evaluates a large range of methods, from simpler statistical co-occurrence methods to embeddings and SotA deep learning methods. This paper analyzes theoretical and practical issues in developing and evaluating methods for SBRS in e-commerce applications, where user profiles and purchase data do not exist. The major tasks of SBRS are reviewed and studied, namely: prediction of next-item, next-basket and purchase intent. For physical retail shopping where no information about the current session exists, we treat the previous baskets purchased by the user as previous sessions drawn from a loyalty system. Mobile application scenarios such as push notifications and calling tune recommendations are also presented. Recommender models using graphs, embeddings and deep learning methods are studied and evaluated in all SBRS tasks using different datasets. Our work contributes a number of very interesting findings. Among all tested models, LSTMs consistently outperform other methods of SBRS in all tasks. They can be applied directly because they do not need significant fine-tuning. Additionally, they naturally model the dynamic browsing that happens in e-commerce web applications. On the other hand, another important finding of our work is that graph-based methods can be a good compromise between effectiveness and efficiency. Another important conclusion is that a "temporal locality principle" holds, implying that more recent behavior is better suited for prediction. In order to evaluate these systems further in realistic environments, several session-based recommender methods were integrated into an e-shop and an A/B testing method was applied. The results of this A/B testing are in line with the experimental results, which represents another important contribution of this paper. Finally, important parameters such as efficiency, application of business rules, re-ranking issues, and the utilization of hybrid methods are also considered and tested, providing comprehensive useful insights into SBRS and facilitating the transferability of this research work to other domains and recommendation scenarios.

**Keywords:** next-item and next-basket recommendations; graph-based recommendations; purchase intent; e-commerce; LSTM-RNN

### **1. Introduction**

A pleasant online shopping experience depends on factors such as convenience, comfort, and product findability. Some of these constituents of success rely on researchers conducting usage analysis of an e-commerce application in order to improve its design [1], with these factors ranging from good typography, product photography and elegant and clean checkout forms, to personalized website structure [2]. Others, such as managing information overload and finding interesting, related or alternative products in e-commerce sites, depend on good retrieval and recommendation methods [3].

**Citation:** Salampasis, M.; Katsalis, A.; Siomos, T.; Delianidi, M.; Tektonidis, D.; Christantonis, K.; Kaplanoglou, P.; Karaveli, I.; Bourlis, C.; Diamantaras, K. A Flexible Session-Based Recommender System for e-Commerce. *Appl. Sci.* **2023**, *13*, 3347. https://doi.org/10.3390/ app13053347

Academic Editor: Andrea Prati

Received: 11 January 2023 Revised: 17 February 2023 Accepted: 23 February 2023 Published: 6 March 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

As a response to this last need, recommender systems (RS) have become fundamental tools for conducting effective e-commerce. They provide customers with personalized recommendations in searching for additional products. RS collect and model past user transactions, and potentially other features such as location, demographic profiles and other people's preferences. Several models for RS of that conventional type have been proposed and proved their efficacy. Some examples are content-based [4] and collaborative filtering [5] systems. These research methods make use of the long-term user profiles that are logged every time customers visit an e-commerce site.

However, these conventional RS methods have some important limitations. Firstly, in many e-commerce applications, long-term user models are simply not viable for several reasons: new users visiting for the first time, users not being required to have user IDs, or users choosing not to log in for privacy or transaction speediness reasons will disrupt the functioning of these models. However, there are more drawbacks. Focusing on a community's long-term preferences ignores short-term transactional patterns, interests, and temporal shifts. This generally degrades the ability to understand the intrinsic nature of a user's behavior in her/his current ongoing session.

To address these issues, session-based recommender systems (SBRS) have emerged. In the context of e-commerce, a session can be seen as a single episode that includes visiting several web pages and viewing items, ending potentially with multiple purchased items in one shopping transaction. The same idea can equally serve other domains such as in linear TV programming [6], next point of interest assessments (POI), or movie and next song recommendations. SBRS solely rely on session-specific information and the user's most recent activities. The most recent interactions the user has had with the web application, or other sorts of information that may be acquired or inferred during a session, should thus be the basis for successful suggestions. These details may include, for example, how a user arrived at the website, how long they stayed on previous pages, short-term community popularity patterns [7], browsing patterns [8] or the ability to predict a customer's intent in real time [9]. A simpler SBRS strategy is to merely utilize the currently available item and community-observed patterns, i.e., suggestions in the form of "people who viewed/bought this item also viewed/bought this item". However, more advanced session-based methods should consider all of the user's previous session activities in addition to the most recent item they have viewed. A thorough study of a current session may be performed by considering the possibility that additional action categories, such as searching, clicking, and cart viewing, were also included in these earlier acts.

The context we described above defines how to recommend in these scenarios. Another important consideration is to inquire what the main tasks are within these scenarios, in other words what can be recommended. The next-item or the next-basket recommendations are candidate tasks, depending on whether the recommended items are for the current running session or for the next one, respectively, if session boundaries are defined. If there are no obvious session boundaries, as in music or video streaming apps, the following events or actions should be recommended (e.g., the next song to be played or the next movie to be watched).

Another important task in e-commerce is to establish a model of how customers behave, and to predict during a session whether the user has real purchase intent, or to determine the cart abandonment probability in real time. If these events could be predicted effectively, conversion rates could be improved by applying marketing actions and offering incentives. Examples of such stimuli are coupons, price discounts that are valid for a short period, and other incentives. In fact, various recommendation tasks can run complementarily to preference assessments.

Flexible recommendation systems (FRES; https://www.fres-project.gr/, accessed on 20 February 2023) is a three-year research project that was funded to investigate the effectiveness of several methods and algorithms for SBRS in e-commerce, retail and web services. Several datasets and settings were used to test the effectiveness and the robustness of various SBRS methods in various tasks, namely in establishing next-item, next-basket and purchase intent [10]. Additionally, a testing component was integrated into

an e-commerce site to allow evaluation in a realistic environment, and several recommenders were tested in mobile applications. Another aim was to study the efficiency and other practical parameters, such as the training parameters and the processing and maintenance costs of different SBRS methods.

At the beginning of the project, we studied the concept of modeling anonymous sequential interactions in e-commerce and reviewed relevant prior work. Afterwards, we implemented and tested various forms of SBRS methods. Several types of recurrent neural networks (RNN/LSTM) were created, and their models were evaluated alongside those of other session-based recommenders that utilize various embedding techniques to represent items (Item2Vec, Doc2Vec). Furthermore, we proposed a framework to enable the hybrid application of text and product views sequences. Additionally, the core LSTM model was extended by adding an embedding layer before the LSTM layer. Finally, we used various reranking methods to improve the results of the basic recommenders using item categories.

To investigate reports claiming that recommendation methods using relatively simple statistical co-occurrence analysis are quite effective, we also developed a graph-based model for item recommendations. This method exemplifies a balance between the data processing and management requirements and the effectiveness of the recommendations produced.

Another challenge we examined during the project was the prediction of the shopping intent of e-commerce users using only the short-term browsing pattern of a user. LSTMs have recently been used in the e-commerce domain to improve recommendations; however, they have barely been used to predict a user's buying intention. In that regard, our study contributes to a better understanding of the LSTM approach for predicting purchase intent. More precisely, we examined the e-commerce scenarios in which RNN-LSTM can provide better results in comparison to more conventional ML techniques, which have been considered the SotA for the purchase intent task.

This paper presents the main results of the FRES project. The methodology of our approach is outlined in Figure 1. The major challenges and problems of SBRS that are addressed by our research work are the following:

**Figure 1.** The methodology of a flexible recommendation system. It involves the use cases of "e-commerce", "web services" and "physical retail store" (yellow boxes). The relevant tasks for each case are shown in blue boxes. They are the prediction of the next item in an online session, the prediction of the intent to purchase, and the prediction of the next basket. The arrows indicate the relationship between the tasks and the corresponding use cases. The methods used to solve these tasks are shown in the green boxes. These include recurrent neural networks and graph-based methods using either statistical co-occurrence analysis or node similarity assessments. The methods are in turn based on the representation of the data. The representation can be used to make additional determinations.

- The development of a flexible SBRS system that is based on a common set of principles and methods to address the variety of problems in session-based recommendation systems and physical retail shops. These problems/tasks are next-item recommendation, purchase intent prediction and next-basket recommendation.

- A comparative study of SotA methods from different domains, including neural networks and graph methods.

- The identification of a set of efficient and general methods for representing the history of user activity, such as session data and basket data history.

The paper is structured as follows: Section 2 reviews earlier work and presents the tasks and the methods we propose to solve them. The datasets we produced to assess our techniques are described in Section 3. We provide a description of the experiments, as well as a report and discussion of the findings, in Section 4. In Section 5, we summarize the results, point out challenges and set out a plan for the advancement of SBRS in the future.

#### **2. Methods and Literature Review**

In this section, we present the tasks that are addressed in this paper and the tools and methodologies that were used to tackle problems and challenges related to these tasks. For all these methods, we discuss the major relevant literature that shows how each method has been developed and what the current state of the art is. In particular, Section 2.1 describes the task of predicting the next item and the last item in a session using the prior user behavior within this session. Section 2.2 discusses graph representation methods for recommendation systems. Section 2.3 discusses the task of purchase intent, which seeks to explicitly determine whether the intent of the user in the current session is to purchase some product or not. Section 2.4 describes the evaluation metrics used in the subsequent experiments. Section 2.5 presents the SBRS methods that we employed in various practical scenarios.

#### *2.1. Next-Item, Last-Item Tasks*

Early recommendation methods used simple pattern mining techniques. These techniques are easy to implement and lead to interpretable models. However, the mining process is usually computationally demanding. Furthermore, several parameters of an algorithm should be fine-tuned, and this might be difficult. Moreover, in some application domains, frequent item sequences do not lead to better recommendations than when using much computationally simpler co-occurrence patterns [11].

After these first experiments, more complex approaches based on context trees [12,13], reinforcement learning [14], and Markov decision processes [15] were developed. The number of prior interactions (i.e., history window) that should be taken into account while estimating the following interaction was a parameter used in these recommender models.

Word2Vec/Doc2Vec methods were developed for use in linguistic tasks, but they can also be applied in recommender methods for CF [16]. Word2Vec is a two-layer neural network which is trained to represent words as vectors in such a way that words that share common contexts in the training corpus are located in close proximity to one another in the space. These representation vectors are known as embeddings. There are two varieties of Word2Vec called CBOW and skip-gram, with the second one being the most common approach. Skip-gram predicts the context of a word, *w*, given *w* as the input. Doc2Vec is based on Word2Vec with the aim of creating vector representations of documents rather than single words. Doc2Vec creates paragraph vectors by training a neural network to predict the probability distribution of words in a paragraph given a randomly selected word from it.

Word2Vec can be generalized to represent items with vectors based on their context (i.e., other items in the same session or basket) in a very similar fashion to its means of assigning vector representations to words. It can infer item–item relationships even in the absence of user ID information. The item-to-item recommender system (Item2Vec) is initially trained using the item sequences from prior anonymous user sessions. Then, when the system is actually applied, it accepts the currently selected item as input and produces a group of related items based on the input. In fact, when compared to SVD and other sequence-based CF approaches, this method yields results that are competitive [17,18].
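
As a rough illustration of this idea, the sketch below trains skip-gram embeddings on item sequences with gensim's Word2Vec; the session data, vector size and training settings are illustrative assumptions, not the configuration used in our experiments.

```python
# A minimal Item2Vec sketch using gensim's Word2Vec (skip-gram).
# `sessions` is assumed to be a list of item-ID sequences, one per anonymous user session.
from gensim.models import Word2Vec

sessions = [
    ["item_101", "item_202", "item_303"],
    ["item_202", "item_404", "item_101"],
    # ... one list of viewed item IDs per recorded session
]

# sg=1 selects the skip-gram variant; the remaining hyperparameters are illustrative.
model = Word2Vec(sentences=sessions, vector_size=30, window=5, sg=1,
                 min_count=1, epochs=20)

# Given the currently viewed item, retrieve the most similar items (cosine similarity).
print(model.wv.most_similar("item_202", topn=5))
```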

Deep neural networks have recently been suggested for recommender systems. In particular, recurrent neural networks (RNN) have been very effective models for session data of user interactions. Recurrent neural networks are extensions of feedforward networks with additional internal memory. They are created by adding a feedback loop from the output back to the input of the network. As a result, the current output depends on both the input and the previous output. The fundamental benefit of RNNs over other approaches for recommendation is that they can naturally and incrementally model series of user interactions. After creating a predictive model, RNNs offer more effective recommendations than other sequence-based conventional techniques [19,20].

#### *2.2. Graph-Based Methods*

The use of graph databases (GDBs) is a new approach for data modeling [21]. A graph database represents data entities as nodes and their relationships as directed connections between nodes. Neo4j is an open source graph database tool that supports semi-structured, hierarchically organized data [22,23]. In this approach, the graph is used to represent the sequences of items in a session through node relationships. Thus, it becomes another way to represent item sequences.

Neo4j has been used to create various recommendation systems, making recommendations of friends, movies, and objects, and has also been applied in the field of e-commerce. Konno [24] developed a recommendation system based on data-driven rules. They applied a two-layer approach to retail business transaction data for business information query and reasoning. Another graph-based and rule-based recommendation system approach was described by [25]. Delianidi [26] presented another graph-based recommender using Neo4j in which emphasis was given to efficiency. In this work, nodes and relationships between the nodes were defined using session training data. The system finds all pairs of co-occurring items in the current session by running Cypher queries. Then, the similarity between the items of a pair can be calculated using these co-occurrence frequencies.
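
As a hedged sketch of how such a pair-popularity lookup might be issued from Python with the official neo4j driver, the snippet below ranks candidate items by their co-occurrence count with the current item; the node label, relationship type and property names are hypothetical and do not reproduce the schema used in [26].

```python
# Hypothetical pair-popularity lookup via the neo4j Python driver.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CO_OCCURRENCE_QUERY = """
MATCH (a:Item {id: $item_id})-[r:CO_OCCURS]-(b:Item)
RETURN b.id AS item, r.count AS freq
ORDER BY freq DESC
LIMIT $k
"""

def recommend_next(item_id, k=10):
    # Rank items by how often they co-appeared with `item_id` in the training sessions.
    with driver.session() as session:
        result = session.run(CO_OCCURRENCE_QUERY, item_id=item_id, k=k)
        return [(record["item"], record["freq"]) for record in result]

print(recommend_next("item_202"))
```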

#### *2.3. Purchase Intent Task*

In all likelihood, the first techniques tested to determine whether or not a user session in an e-commerce application is likely to end with a purchase were multilayer perceptron classifiers and simple Bayes classifiers [27]. Suchacka [28] used data from an online bookstore to evaluate SVM using a set of 23 factors, with a similar goal of classifying user sessions as either browsing-only or purchasing-related. The most effective SVM classifier showed high performance. It achieved a likelihood of predicting a purchasing session of about 95% and an overall prediction accuracy of 99%.

Association rules and a k-nearest neighbor (k-NN) classifier were used by [29] to enhance their study and estimate the likelihood of a purchase. To predict purchase likelihood for two client groups, traditional consumers (accuracy 90%) and more diverse, novel customers (accuracy 88%), they employed basic association rule mining and other behavioral knowledge.

The hidden Markov model (HMM) is another method that was tested. In fact, many web usage mining research efforts have considered HMM to predict user behavior in several settings and for different tasks. Examples of its uses include deciding whether a web search session was successful or not, establishing recommender systems [13], or making suggestions for the next point of interest in tourism websites. Ding [2] presented a research study that was more closely related to our work; however, it primarily made use of HMM to understand customer intent in real time in order to perform web page adaptation.

Using user activity data, participants in the RecSys2015 competition attempted to estimate the assortment of goods that would be purchased during a session. The state-of-the-art (SotA) solution for this issue remains the two-stage classifier provided by the competition winners [30]. While the second classifier predicts the items that will be purchased, the first classifier predicts whether at least one item will be purchased during the session or not. In this work, the session and click dates, click counts for individual items, and other category features of the sessions and objects were employed.

Recurrent neural networks (RNN) were utilized by [31] to capture event dependencies and connections for user sessions of arbitrary length, both within and across datasets. Results from the RecSys15 challenge indicate that their solution performed admirably. The key benefit of their approach is that it requires less domain- or dataset-specific feature engineering.

Another RNN-LSTM-based system for analyzing online shopping behavior was presented by [9]. It had two parallel-operating components. The first predicted consumers' tendency to shop; however, this module employed machine learning classifiers such as random forest, support vector machines, and multilayer perceptron. RNN-LSTM is only used in the second module, which predicts the likelihood that users will leave a website without buying anything. In relation to the SotA, their purchasing intent module performed much worse. The accuracy of the second module, which predicted website desertion after a short window of three user actions, was almost 75%.

A fully connected long short-term memory network (FC-LSTM) for modeling the interactions between customers was tested by [32]. The same network models the nonlinear sequence correlations and cumulative effects in customers' browsing behavior. However, to attain better predictions, they use more features from user profiles, including purchase history and demographics.

#### *2.4. Evaluation Metrics*

One of the most common evaluation metrics used was the mean reciprocal rank (MRR). Its calculation formula is:

$$MRR = \frac{1}{Q} \sum_{x=1}^{Q} \frac{1}{\mathrm{rank}(x)} \tag{1}$$

where *Q* is the number of queries we are considering, and rank(*x*) is the position of the correct answer *x* among the returned values, with rank(*x*) = 1 if *x* is the first item in the recommendation list. When there is no correct answer within the recommendation list, we set rank(*x*) = ∞ and the reciprocal rank 1/rank(*x*) equals zero. In our case, the MRR varies between 0.017 and 0.03 depending on the number of responses we return.

Another evaluation metric used in our experiments is the F1 score @ k, which equals the F1 score of the recommendation list containing k items. The F1 score is the harmonic mean between the precision and the recall, defined as

$$F1 = \frac{precision \cdot recall}{(precision + recall)/2} \tag{2}$$

where $precision = \frac{tp}{tp+fp}$, $recall = \frac{tp}{tp+fn}$, and *tp* = true positives, *fp* = false positives and *fn* = false negatives.
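
For concreteness, a minimal sketch of both metrics is given below; it assumes each test query has a single relevant item for the MRR and a set of relevant items for F1@k, which matches the definitions above but not necessarily the exact evaluation pipeline used in our experiments.

```python
# Minimal implementations of MRR and F1@k, following Equations (1) and (2).
def mean_reciprocal_rank(recommendation_lists, relevant_items):
    """MRR over Q queries; a missing relevant item contributes a reciprocal rank of 0."""
    rr_sum = 0.0
    for recs, target in zip(recommendation_lists, relevant_items):
        if target in recs:
            rr_sum += 1.0 / (recs.index(target) + 1)  # ranks are 1-based
    return rr_sum / len(recommendation_lists)

def f1_at_k(recommended, relevant, k):
    """F1 score of the top-k recommendation list against the set of relevant items."""
    top_k = recommended[:k]
    tp = len(set(top_k) & set(relevant))
    precision = tp / k
    recall = tp / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(mean_reciprocal_rank([["a", "b", "c"]], ["b"]))  # 0.5
print(f1_at_k(["a", "b", "c"], {"b", "d"}, k=2))       # 0.5
```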

#### *2.5. Tested SBRS Methods and Practical Scenarios*

#### 2.5.1. Content-Based SBRS Using Doc2Vec Embeddings

This approach is similar to traditional content-based systems in that it suggests products that are pertinent to items previously "liked" by the user [4]. The similarity is calculated on the basis of the text content or other attributes of the items liked by the user. Note that the term "liked" can take several interpretations depending on the domain, application, or other context. In the context of SBRS, the items visited in the current ongoing session provide the input content to the recommender.

In our project, we created a vector for each product item using the title, color, and extended description. A fixed-dimension vector for every item was produced using the Doc2Vec model. We created an n-dimensional vector for each item after training the Doc2Vec model using the textual descriptions of the products. Following multiple experiments, vector sizes of 500 for the next-item dataset and 100 for the last-basket dataset were chosen. In this method, to suggest the next item, the similarities to all other items are computed for each viewed item during an ongoing session. The cosine similarity measure gave slightly better results in all the datasets we tested.
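
The sketch below illustrates this content-based step with gensim's Doc2Vec, tagging each document with its item ID so that item similarities can be queried directly; the product texts and training settings are illustrative, and only the 500-dimensional vector size follows the choice reported above for the next-item dataset.

```python
# A hedged Doc2Vec sketch for the content-based recommender.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Illustrative product texts built from title, color and extended description.
products = {
    "item_101": "black leather jacket slim fit zipped extended description",
    "item_202": "brown suede jacket classic cut extended description",
}

corpus = [TaggedDocument(words=text.split(), tags=[item_id])
          for item_id, text in products.items()]

model = Doc2Vec(corpus, vector_size=500, min_count=1, epochs=40)

# Most similar products to the currently viewed item (cosine similarity).
print(model.dv.most_similar("item_101", topn=5))
```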

#### 2.5.2. Item-Based SBRS Using Item2Vec Embeddings

This approach transfers the Word2Vec technique to the SBRS problem. The Word2Vec approach was first used in natural language processing to learn distributed representations of words.

Item2Vec provides embeddings for items in a manner that is very similar to providing embeddings for words [16]. The underlying idea is that it can be used in the sense of collaborative filtering to provide recommendations, despite user IDs not being available. The Item2Vec method processes sequences of item views in the same way that Word2Vec processes sequences of words; in this process, the users' sessions while they browse an online store play the role of sentences. Our technique utilizes the skip-gram variety of Word2Vec. As such, the underlying assumption is that, given a sequence of previously visited items in an ongoing session, the task is to predict the next item(s).

We tested several embeddings by increasing the sizes of the vector dimensions, but the best sizes were 30 for the next-item task/dataset and 100 for the next-basket task/dataset. The L2 norm achieved better results when measuring the similarity between items in the first dataset, whereas cosine similarity produced better results in the second dataset.

#### 2.5.3. SBRS Methodology Using Embeddings

When implementing recommendation methods relying solely on the ongoing session, there are several decisions to be made on how to execute this operation. For example, how many previous actions of the ongoing session should be accounted for in the prediction? Or, how should the previous actions of the ongoing session be summarized to represent the behavior so far? The following list outlines all these steps and the parameters that are typically involved.


We conducted several experiments in our project to assess how these criteria might affect the results. In order to obtain the vector u, we averaged the embeddings of either the last n items the user viewed during a current session or all of the products they purchased in the previous n baskets. We conducted a number of tests and found that the "temporal locality principle" typically holds. Results are improved when only the most recent items are taken into account; in other words, recent behavior greatly outperforms activity from the session's very beginning in terms of predicting the next action. In general, all SBRS approaches perform better when a smaller memory of the user activities is taken, as we shall demonstrate later.
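
A minimal sketch of this summarization step is shown below: the embeddings of the last n items are averaged into a session vector u, and candidates are ranked by cosine similarity to it. The `item_vectors` mapping is assumed to come from Item2Vec or Doc2Vec; all names are illustrative.

```python
# Summarize the ongoing session and rank candidate items against it.
import numpy as np

def session_vector(session_items, item_vectors, n=3):
    """Average the embeddings of the n most recent items (temporal locality)."""
    recent = session_items[-n:]
    return np.mean([item_vectors[i] for i in recent], axis=0)

def rank_candidates(u, candidates, item_vectors):
    """Rank candidate items by cosine similarity to the session vector u."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {c: cosine(u, item_vectors[c]) for c in candidates}
    return sorted(scores, key=scores.get, reverse=True)
```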

#### 2.5.4. Hybrid Methods

Systems aggregating both collaborative filtering and content-based methods are called hybrid recommenders [33]. In our research, we explored if the two embedding techniques we created could be used together to operate a hybrid usage of text and item sequences. In particular, our system combines the browsing patterns of all sessions, together with

the model that is developed based on the items' text, using the Item2Vec and the Doc2Vec methods, respectively.

The two methods produce their predictions inside their own vector spaces. The vectors generated by the Item2Vec method are located in one vector space, whereas those generated by the Doc2Vec method are found in the other. Both techniques estimate item similarity using the cosine similarity, whose range is [0, 1]; the closer the value is to 1, the stronger the similarity and the prediction. In our hybrid approach, we compute the combined prediction using v = [1 - CosineSimilarity]. The values' range is still [0, 1] in this instance, but the prediction is now stronger the closer a value is to 0.

For each of the two methods, we multiplied the value of v by the rank of the item in the recommendations list. The product indicates the confidence each method gives to the recommendation. Finally, we combined the Item2Vec and Doc2Vec confidences to calculate the final prediction value.
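
The sketch below mirrors this combination step under stated assumptions: each method contributes a confidence v = (1 - cosine similarity) multiplied by the item's rank in that method's list, and the two confidences are summed (the summation is our assumption; the text above only says the confidences are combined).

```python
# A hedged sketch of the hybrid Item2Vec/Doc2Vec scoring.
def hybrid_ranking(item2vec_list, doc2vec_list):
    """Each input is a ranked list of (item_id, cosine_similarity) pairs."""
    def confidences(ranked):
        # Smaller values indicate stronger predictions.
        return {item: (1.0 - sim) * rank
                for rank, (item, sim) in enumerate(ranked, start=1)}

    c1, c2 = confidences(item2vec_list), confidences(doc2vec_list)
    combined = {item: c1.get(item, float("inf")) + c2.get(item, float("inf"))
                for item in set(c1) | set(c2)}
    # Lower combined score first.
    return sorted(combined, key=combined.get)

print(hybrid_ranking([("a", 0.9), ("b", 0.6)], [("b", 0.8), ("a", 0.5)]))
```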

#### 2.5.5. Graph-Based SBRS

In our project we implemented two SBRS methods using Neo4j. The first method, called pair popularity, models in the graph all item co-occurrences found anywhere in the same session. Having recorded the item P viewed by the user at session step t, the method recommends a list of items for step t+1 based on how often the items co-appeared with P in the training sessions. Later, we proposed the hierarchical sequence popularity (HSP) recommendation method. This method uses a hierarchical representation of item sequences in user sessions and produces significantly better results. To recommend an item, we looked at its frequency of appearance as well as the history of sequences of length 1 or 2 in which this item participated during the training sessions. In the absence of history (i.e., in the first step of a session), items were recommended based on their popularity. Both recommendation approaches can be applied to a wide variety of e-shops regardless of the type of items. The item recommendations that the methods offer can be used as an essential component to automate and improve the identification of related items for the online customer.

The main advantage of graph-based recommenders is that they are quite effective, but at the same time they are efficient when considering the complete operation cycle of SBRS (data gathering, modeling, processing, analysis, filtering). Graph-based SBRS, due to their underlying architecture, can incrementally collect data from an e-commerce website from all ongoing user sessions. Moreover, these data are immediately available to the recommendation algorithm because no training phase is required. Finally, it is easy to integrate new business rules and constraints on demand, something that is more difficult to implement inherently using ML methods.

Generally, scalability is a critical issue that should be very carefully considered, especially when building SBRS for big data. If a method is very effective, but it requires a substantial amount of training time exceeding the periodic time in which the recommender prediction model should be updated (every day or every week), it is not applicable.

#### 2.5.6. SBRS Methodology Using LSTMs

In a typical architecture of an LSTM recommender, the recently visited item(s) of the ongoing user's session are the input. The output is the next item to be recommended, using one-hot encoding. Compared to all the other methods, this one actually produced strong results. At the conclusion of the model's 20-epoch training process, we obtained the best score by utilizing 200 hidden units for the LSTM layer and a Softmax layer with a size of 1097 units trained with the categorical cross-entropy loss. Figure 2 shows the architecture of this recommender.

**Figure 2.** LSTM design in the next-item task.
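
A hedged Keras sketch of the architecture in Figure 2 is given below. The 200 LSTM units, the 1097-unit softmax output and the categorical cross-entropy loss follow the text; the sequence length and the one-hot input representation are assumptions made to keep the example self-contained.

```python
# A minimal Keras sketch of the next-item LSTM recommender.
import tensorflow as tf

NUM_ITEMS = 1097   # size of the item catalogue (softmax output)
SEQ_LEN = 5        # assumed length of the recent-item history window

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN, NUM_ITEMS)),  # one-hot encoded item sequence
    tf.keras.layers.LSTM(200),
    tf.keras.layers.Dense(NUM_ITEMS, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
# model.fit(X_train, y_train, epochs=20)  # 20-epoch training as described above
```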

For the last-basket task, the user's previous baskets provided the input. The average Doc2Vec vector of the items inside each basket was used to represent each basket. The result we received was a sorted list of the probabilities of all items. The best F1@2 score for the model's hyperparameters was obtained using 300 LSTM units, 100-dimensional Doc2Vec vectors, and 100 epochs.

An additional embedding layer was added to the LSTM recommenders to improve their performance (Figure 3). In particular, we expanded the approach by using the vectors we trained with Doc2Vec as the initializing weights of the LSTM. As compared to our basic LSTM approach and our hybrid technique (which combines Item2Vec and Doc2Vec), this expansion provided an overall improvement in MRR of up to 10% and up to 100%, respectively.

**Figure 3.** LSTM architecture used in the last-basket task.

2.5.7. SBRS and the Purchase Intent Task

Our method for predicting purchase intent modeled all user actions as a sequence, as in our recommenders. However, it differed in that we applied extra features. Specifically, one such feature was the time (in seconds) the user spent in each action. We also used four other features (season, day, working hours, origin). These extra features are important for an e-commerce application, as other studies have shown that buying behavior changes over time [30]. The feature "Season" specifies whether it is high season (autumn/winter for leather apparel) or not. We also know that a purchase is more likely to be made on the weekend, and that visiting an e-commerce application in midday hours usually results in more purchases than in the night hours.

Nevertheless, the main difference in the purchase intent task is that we retrieved and modeled all user actions, and not only the View Item actions, as we did in our recommenders. All user actions belong to one of the twelve action types. These twelve action types include all potential actions that customers perform in e-commerce web applications. As

such, they may be seen as a "standard" set of action types for other studies in the e-commerce domain. Table 1 shows all user action types modeled and their frequency in the dataset and in each of the two session types we wanted to predict. Most of them are self-explanatory. The action type "Concerned" means that a user has visited a web page reflecting a customer concern about privacy policy, payment security or product shipping and returns. The "Recommend Product" action signals the recommendation of a product to another person by sending an email message.


**Table 1.** User action types.

In this task, each event is represented with a vector and each session is modeled as a sequence of events. A sliding window, starting from the first session action, designates the context of each user movement, and it is exactly this context window that is used as a sample for training (Figure 4). The length of the window (N) is a parameter of our method. If the window ends within the first N-1 events, the previous navigation steps are fewer than the window size. In this case, the empty slots are taken as zeroed events.

Similarly, for each input instance, the output is computed using the remaining events. Specifically, for every occurring event, Ei, our method calculates the outcome after considering all the remaining events until the end of the session. The target is modeled as a 2-digit enumeration structure. Each binary digit independently represents one of the two actions of interest (i.e., add cart, make order). Thus, in total there are four possible outputs. Two of them, [1, 0] and [0, 1], signify that at least one add-cart or one purchase event occurred, respectively. The existence of both events is the output [1, 1]. If neither of these two events occurs in the remaining segment of a session, the output is coded as [0, 0].
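
A minimal sketch of this sample construction is given below: a zero-padded sliding window of N event vectors forms the input, and the 2-digit target encodes whether an add-cart and/or a purchase event occurs in the remainder of the session. The event vectorization function and the action-type names are assumptions.

```python
# Build (window, multilabel target) training samples from one session.
import numpy as np

def build_samples(session_events, vectorize, event_dim, N=10):
    """session_events: ordered list of (action_type, features) tuples."""
    samples, targets = [], []
    for i in range(len(session_events)):
        window = session_events[max(0, i - N + 1): i + 1]
        X = np.zeros((N, event_dim))              # empty slots stay as zeroed events
        for j, event in enumerate(window, start=N - len(window)):
            X[j] = vectorize(event)
        remaining = [action for action, _ in session_events[i + 1:]]
        y = np.array([int("add_cart" in remaining), int("make_order" in remaining)])
        samples.append(X)
        targets.append(y)
    return np.stack(samples), np.stack(targets)
```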

To summarize, our purchase intent system is modeled as a multilabel classification problem that determines, for each session action, the outcome of the rest of the session. The following four scenarios outline the four different outputs:

- [1, 0]: at least one add-cart event, but no purchase, occurs in the remaining session;
- [0, 1]: at least one purchase event, but no add-cart event, occurs in the remaining session;
- [1, 1]: both an add-cart and a purchase event occur in the remaining session;
- [0, 0]: neither event occurs in the remaining session.

2.5.8. Last-Item (Calling Tune Recommendation) Task Using the Node2Vec Method

In this task, we used a different method called Node2Vec, inspired by bioinformatics. The Node2Vec framework learns low-dimensional representations for the nodes of a graph using random graph walks starting at a target node. This method requires the creation of a graph where the associations between the various entities—in our application, the main entities were users, songs, artists, genres—are represented as arcs. After creating the graph, by applying 2nd-order random walks, numerical representations for each node within the graph can be produced. These representations are finally used as input to the classic Word2Vec algorithm (skip-gram model with negative sampling) to derive the final embeddings of each node. In essence, the resulting embeddings preserve the structure of the original network in the sense that related nodes have similar representations.

The key characteristic of 2nd-order random walks is that each transition to a neighboring node is accompanied by a probability, defined as a hyper-parameter, different from that of returning to the previous node. This particular methodology requires the user to define a series of hyper-parameters that control the process, mainly in terms of complexity. These basic hyper-parameters are the following: the number of walks, the walk length, the return value to the previous node (return hyper-parameter, *p*), and the transition value to a new node (in-out hyper-parameter, *q*). The last two parameters concern the transitions between nodes during the random walks phase.
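
As a hedged illustration, the sketch below builds a small heterogeneous graph with NetworkX and runs the reference `node2vec` package on it (our implementation later switched to PecanPy for scalability). The edge list is illustrative; the walk count, walk length, p and q values match the tuned values reported later in Section 4.4.

```python
# A minimal Node2Vec sketch on an illustrative user/song/artist/genre graph.
import networkx as nx
from node2vec import Node2Vec

G = nx.Graph()
G.add_edges_from([
    ("item_1", "user_7"), ("item_1", "artist_3"), ("item_1", "genre_rock"),
    ("item_2", "user_7"), ("item_2", "artist_5"), ("item_2", "genre_rock"),
])

# p controls the tendency to return to the previous node, q the tendency to move outward.
n2v = Node2Vec(G, dimensions=64, num_walks=10, walk_length=100, p=0.5, q=1, workers=1)
model = n2v.fit(window=5, min_count=1)  # skip-gram with negative sampling via gensim

# Related nodes obtain similar embeddings; query the neighbours of a tune node.
print(model.wv.most_similar("item_1", topn=5))
```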

When implementing this method, we had to solve several efficiency and scalability challenges. Initially we used the python library NetworkX to create the graph. NetworkX is the most common open source package for creating and editing graphs in python. Nevertheless, the specific method in its basic implementation, although highly effective, is not scalable enough due to memory (RAM) problems. This is because 2nd-order transitions increase the number of total transition probabilities quadratically with respect to the number of edges within the graph. For this reason, another implementation of this method called PecanPy was used which largely solves memory sufficiency problems. In essence,

PecanPy parallelizes the two processes that can run in parallel, i.e., the preprocessing of the transition probabilities from each node and the application of the random walks.

#### **3. Datasets**

In terms of data, the problem of SBRS becomes complex. There are several data items, and a certain amount of feature selection, session engineering, and feature engineering will be needed. The item (product, movie, tune, etc.) is the central concept; however, the presence of the session concept and the potentially different domains bring extra complexity. For example, selected past sessions may be characterized as irrelevant to the current ongoing session. The item concept in e-commerce will normally be a tangible product. However, an item in the music domain will be a song. Additionally, tasks like purchase intent prediction or the requirement to address the temporal locality principle (i.e., make recommendations that predominantly reflect recent behavior) bring even more complexity.

Furthermore, a dataset in SBRS research should have a clear session structure. The session structure may also present a clear ordering of the events that occur within a session. If the sequence of events is explicit in the dataset, then this dataset is best suited for next-item tasks. Sometimes, session boundaries may be clear; however, the sequence of events within the session is not specified. For example, in a shopping cart dataset, each shopping basket has clear session boundaries which distinguish it from all the other baskets, and so it naturally represents a session. The recommendation tasks on such a dataset can be next-item(s) or next-basket recommendations. However, it should be noted that the basket session should be considered in its entirety, because we do not know the sequence in which a user has put the items into the basket. For all these reasons, such a dataset is better suited for the next-basket task and is totally unsuitable for the purchase intent task.

Our experimental work in the project was driven in part by the views we discussed above, but mainly by the funding requirements to deploy, apply, and evaluate our developed methods in real e-commerce environments for a long duration.

Many experiments reported in this paper used data that were extracted from the web server logs of an apparel e-commerce website over a relatively long period of time (a few months). The log data were analyzed to identify sessions, session length, user actions in each session, actions' related items, item categories, and time spent on each action. As a result of this preprocessing, the first dataset we produced (Dataset A) consisted of 24,111 sessions that altogether counted 312,912 user actions. Twelve different action types exist overall, but all the event types were utilized only in the purchase intent task. The 728 sessions ending in purchases represent a 3% conversion rate. In 22,008 sessions, users did not have any items in their shopping cart when they exited, meaning that 91.2% of sessions were browse-only. The rest of the sessions had items in their shopping cart when they finished, but these never turned into purchases. In the next-item experiments reported in Section 4, we included only the "View Product" user actions and only the sessions holding at least two different item views, finally obtaining 12,128 applicable sessions consisting of 67,101 "View Product" user actions.

The above dataset was further processed to create another variation (Dataset B) suitable for the purchase intent task. In this second dataset, we used all the event types listed in Table 1, but we kept only the sessions that had at least 3 behavior sequences. This preprocessing led to a final dataset containing 21,896 sessions, including 258,101 user actions (the size of each session was 11.7 actions on average and the median was 8). The average sizes of the Browsing, Cart Abandonment and Purchase sessions are 11, 18.8, and 19.5, respectively. The 689 purchase sessions make a conversion rate of 3.14%. In 90.9% of sessions (19,902), users did not add any items to their shopping cart. In 1305 sessions, users added items to their shopping cart, albeit without completing a purchase. Table 1 shows all user action types and their frequency in the complete dataset.


**Table 2.** Results of all recommender methods for next-item and next-basket task.

\* all (column 5) denotes a session "history" of 35 items max.

The third dataset we created (Dataset C) reproduces the next-basket scenario and it is also built on a real application. It contains purchased items (in baskets) from a pet-shop store. The dataset has 40,203 transactions (baskets) belonging to 1493 users, of which 1408 have two or more baskets, i.e., they can be included in the experiment. The dataset contains 6626 items in total. The average length of all baskets is 2.26; therefore, we calculate F1@2 in our results. Another parameter that we tested was the number of previous baskets to consider in predicting the last-basket content.

In later phases of the project, we installed a logging component into an e-commerce web application, and we collected data, although this time not from the web server logs but directly by logging specific user actions. For the experiments we report in the next section, this dataset (Dataset D) was processed to obtain only the sessions that contain the "View Item" and "Add to Cart" actions, resulting in 102,024 records. A typical split into train (80%) and test (20%) datasets was performed. Thus, the train sessions are 81,651 and the test sessions 20,373. The total number of unique sessions is 19,236, which corresponds to 15,388 sessions in the train set and 3848 unique sessions in the test set. The dataset contains 1448 unique items, of which 1429 appear in the train set and 1296 in the test set.

One last dataset we used was the one for the calling tune recommendation task using the Node2Vec method (Dataset E). The data for this task fall into one of these entities: user, song, artist, and genre of music (user\_id, item\_id, artist\_id, genre\_id). Initially, the construction of the graph required in Node2Vec was carried out by connecting all the correlations of the above fields with the central entity, which is that of the song/tune (item\_id). Specifically, the following associations were added to the graph: item\_id->user\_id, item\_id- >artist\_id, item\_id->genre\_id. It is worth noting that a graph created in this way is neither directed nor weighted. The initial dataset contained 662,698 instances. Many of the users appeared more than once and were the customers of interest to our method because users with a single transaction cannot be tested. For this reason, all records having users with just one transaction (tune) were removed and the final size of the dataset was reduced to 570,533 transactions. Then, to configure the train and test set, the following procedure was followed. The train set comprised all user tune records except for the last one, which was put in the test set. For example, if a user has 4 transactions, the first 3 are included in the train set while the fourth ends up in the test set. The size of the two resulting sets was 478,572 for the train set and 91,961 for the test set. The total training time of the model was in the order of minutes (5–10), which gives more value to this method.

#### **4. Results and Discussion**

Before discussing the results of each technique, we need to make a few clarifications regarding the experimental setting. Section 3 provides a description of the datasets. In the next-item task, we executed all methods for each item in each test session to create a list of recommended items.

If a method needs the sequence of the item(s) viewed hitherto, then this sequence is provided to the method. We estimate the reciprocal rank (RR) for each list that each method returns, which is a ranked list of n recommended item(s). In this manner, the RR for each of the test session's actions is computed.

The mean reciprocal rank (MRR) of the entire session may then be calculated. Finally, the MRR of the approach can be calculated by averaging the MRR of all sessions that have been tested. The reason we choose MRR as the evaluation measure is because it expresses the effectiveness of a method to recommend the next-item as highly as possible in the recommendation list. If a method achieves an MRR of 0.25, then a recommender would require showing 4 recommended items to effectively include an item that could be selected with high probability.

The experiments described below were executed on a computer with an AMD Ryzen-9 CPU @ 4.9 GHz, 32 GB of memory, and an NVIDIA Titan Xp GPU card. The neural network models were executed on the tensorflow-2 platform, while Neo4j was used for the graph-based algorithms.

#### *4.1. Results of the Next-Item and Next-Basket Tasks*

In these experiments, 90% of dataset A was used for training and the remaining 10% for testing. A random split was repeated five times. The results reported here are the average of the results produced from each split. In the last-basket task, all available baskets were used both for training and testing. Specifically, all baskets, except for the last one, were used to train the last-basket prediction model. Then, the last basket of each user's baskets was predicted and compared to the actual last basket the user purchased.

The post-prediction reranking was inspired by the knowledge that several product categories dominate each customer's purchases. To apply this reranking method, we define the dominant category as the one with the most "hits" from the session's start. Once the prevailing category is determined after each user action, all the recommended items in the dominant category are top-ranked. In the table summarizing the results, this method is indicated as "with reranking".
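
A minimal sketch of this reranking rule, under the assumption that a simple item-to-category mapping is available, is shown below; recommended items in the dominant category are promoted while the relative order of the rest is preserved.

```python
# Category-driven post-prediction reranking.
from collections import Counter

def rerank_by_dominant_category(recommendations, session_items, item_category):
    """recommendations/session_items: item IDs; item_category: item ID -> category."""
    hits = Counter(item_category[i] for i in session_items)
    dominant = hits.most_common(1)[0][0] if hits else None
    promoted = [r for r in recommendations if item_category.get(r) == dominant]
    rest = [r for r in recommendations if item_category.get(r) != dominant]
    return promoted + rest
```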

The results of each method for the next-item (MRR) and the last-basket (F1@2) tasks are presented in Table 2. The column "last" shows the results if only the last item was taken as the context. The column indicated as "all" presents the results when the entire user behavior sequence, from the session start until the current item, is taken into account.

The LSTM methods produced considerably better results. The MRR was 0.265 when we initialized the LSTM recommender using random weights. The embedding layer that we added improved the performance only marginally. The Doc2Vec method was the best of the embedding methods that we tested, attaining an MRR of 0.101. The Item2Vec method performed less effectively, having an MRR of 0.087. When the content-based and the item sequence embeddings were combined, they produced better results. Finally, our experiments confirmed the intuitive assumption that category-driven reranking improves the results, except in the LSTM method. We believe this occurs because LSTMs already capture the focus on specific product categories.

Table 2 also illustrates the results of the last-basket prediction task. Column 4 presents the F1@2 results, considering seven previous baskets as the purchase history. Column 5 illustrates the results when all the purchases are considered. In these experiments, the results are quite different from the next-item task. In contrast to what happens in the next-item task, the Item2Vec method performs much better than Doc2Vec in this task. This is due to the smaller amount of textual information available in the items of this dataset. Quite surprisingly, Item2Vec outperforms the LSTM method when only seven baskets are considered; however, the performance order is marginally reversed with a larger purchase history. Generally, all methods produce better results when a shorter purchase history is used. It seems that a principle of locality consistently holds, which states that the next user action is influenced

more intensively by its immediate surroundings. Another interesting result is that category-based reranking negatively affects all methods. We believe this outcome relates to the organization of the dataset, which has very few categories.


**Table 3.** Accuracy Results (Window size = 10).

#### *4.2. Results of the Purchase Intent Task*

Dataset B was used in this task. A 10-fold stratified cross-validation procedure was applied. We found that the optimal settings are a 0.2 dropout rate, the Adam optimizer and 500 LSTM units. Table 3 shows the results using window size N = 10. When we combined LSTM with a GRU layer, the performance was better in comparison to the result obtained using one LSTM layer. This finding is in accordance with other studies on sequence modeling (Chung et al., 2014), in which combinations of LSTM and GRU variants outperform standard RNNs. At the end of the architecture, we used a dense layer with a sigmoid function to deliver the final probabilities for multilabel prediction.

The results produced by our fine-tuned method are better than those from other similar research reported in the literature using RNN-LSTM. Additionally, the results are comparable to the accuracy results that SotA methods have achieved (Section 2.3).

The window size, i.e., the number of previous items considered as "history", is an important parameter for our method. To that end, we performed further experiments to test our model using different window sizes. The cart abandonment sessions are of particular economic interest for e-commerce applications because the user adds item(s) into the cart, but s/he does not complete a purchase in the end. An e-commerce application will benefit remarkably if these sessions are predicted as early as possible during a session. Table 4 presents the accuracy results as a function of multiple window sizes in two different conditions. The first includes all sessions, and the second considers only the "Cart Abandonment" sessions. The prediction of the cart abandonment sessions is less effective. Similar findings have been observed in other studies as a result of datasets containing many sessions that did not conclude with a purchased basket, and very few cart abandonment and purchase sessions.


**Table 4.** Accuracy results as a function of Window size.

#### *4.3. Results of Our Graph-Based SBRS*

Table 5 presents the MRR results of the proposed hierarchical sequence popularity recommendation method in comparison to the "simple" pair popularity graph-based recommendation approach and other recommendation methods using machine learning models (Item2Vec, Doc2vec and LSTM) in Dataset D.


**Table 5.** Results of the next-item task using the graph-based methods for SBRS vs. other ML/DL methods.

Between the two graph-based methods, the new HSP approach outperforms the pair popularity approach in both next-item recommendation scenarios by 3–4%. In terms of comparison with the machine learning methods tested, in the case of the Item2Vec and Doc2vec models, the HSP method prevails by a significant margin.

According to the experimental results, the method that produces the best results is the LSTM machine learning model. The HSP makes recommendations based on limited history, using a quite small memory window extending up to only two previous items. Nevertheless, the performance of the HSP method is very close to the LSTM performance. This is surprising because the LSTM is one of the most powerful models and can exploit the entire history of a session to produce good recommendations.

#### *4.4. Results for the Calling Tune Recommendation Task using Node2Vec*

In the experiments using Dataset E, after fine-tuning, the final values of the hyper-parameters discussed in Section 2.5.8 were the following: number of walks = 10, walk length = 100, *p* = 0.5, *q* = 1.

A disadvantage of the way we implemented the Node2Vec method in this problem was that most nodes represented users. As a consequence, recommendations with the most "relevant" nodes always returned a large number of users. For this reason, there was a risk that, if the list of recommendations we requested from the graph was relatively small (of the order of hundreds), we would not receive recommendations concerning calling tunes. This was also the reason why MRR attained very low values. Table 6 shows how the MRR is shaped in relation to the number of recommendations and how many queries end up not returning any music as an answer.


**Table 6.** MRR results for the calling tune recommendation task.

In summary, the Node2Vec recommendation method solves the problem of tune recommendation while not requiring high computational costs. Although its evaluation scores do not seem very impressive, in reality they should be considered highly satisfactory for this type of problem. Its qualitative evaluation showed that

there is indeed a great relevance between the suggested tunes and the history of the user, even though s/he had not selected them up to that point in time.

#### *4.5. Push (Offers) Notifications Task*

The widespread use of mobile applications developed specifically for use on small, wireless computing devices, such as smartphones and tablets, has created the need for recommender systems that can execute the core task of sending push notifications. A push notification is a message that is "pushed" from a back-end server or application to a mobile application. However, although some aspects are similar, push notifications cannot be managed in the same way that next-item or next-basket recommendations are created in e-commerce. This is mainly because push notifications have to take into account several other factors that will determine the type of the message or the items that may be recommended inside the delivered message. Such parameters may include the need not to disturb the customer receiving the notification, but also other marketing strategies like promoting specific brands and marketing goals (Figure 5). Additionally, the situation for managing push notification messages is quite different, e.g., push messages may be ignored completely by the user, without even being read. The final aim, and the criterion by which a push notification strategy is evaluated, is the capacity to create push notifications that will not be rejected, will be read by users, and, most importantly, will activate links or recommended actions. In other scenarios, the aim is to lower customer churn, i.e., the rate at which customers stop using an app or purchasing items from a company's e-shop over a given period of time.

**Figure 5.** Parameters determining a push notification policy.

In our research project, we implemented a push notification subsystem for a pet shop (Almapet, which refers to Dataset C in Section 4) and for a mobile lottery application. In the Almapet case study, the first objective was the exploratory analysis of the purchasing behavior of the pet-shop customers based on the history of their purchases from a physical store. The second objective was the prediction of the next purchases in order to recommend products that are relevant to a customer and can be the subject of personalized offers sent via notification messages. The training set included 1408 customers having more than one basket, a total of 40,203 baskets, and approximately 6400 products. Two methods were used to categorize products: the first, codenamed cat01, used the 32 existing product categories; the second, named cat02, was based on the producer. From the purchase history of each user, a vector representing the user's profile was created. Additionally, clustering revealed 10 customer groups and their preferences according to the animals they own, the items they buy most often, and the brands they prefer. The output of the clustering algorithm was used to produce a personalized prediction of the product categories each user buys (category relevance, see next paragraph) and the producers/brands they prefer (brand relevance, see next paragraph again).

The method overall takes as input a user id and calculates: category relevance, brand relevance, the items composing the next basket, associated items (items purchased together), recent sales trends, and finally the push policy, i.e., items and/or categories that are selected for promotion. Using a weighted scheme that assigns a weight to each selection parameter, the management can set its marketing strategy, as sketched below. For example, increasing the weights for category relevance, brand relevance, and the items composing the next basket places the emphasis on a customer's purchase needs. On the other hand, increasing the weights of the parameters associated items, sales trends, and push policy increases the diversity, novelty, and serendipity of the recommendations.
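A minimal sketch of such a weighted scheme follows; the parameter names, weight values and normalisation are illustrative assumptions, not the project's actual configuration.

```python
def push_score(candidate, weights):
    """Combine the per-item selection parameters into one promotion score.

    `candidate` holds normalised scores in [0, 1] for one product;
    `weights` is set by the marketing team to steer the push policy.
    """
    return sum(weights[name] * candidate[name] for name in weights)

# A purchase-need oriented strategy vs. a diversity/novelty oriented one (illustrative values)
need_weights = {"category_relevance": 0.30, "brand_relevance": 0.30, "next_basket": 0.25,
                "associated_items": 0.05, "sales_trend": 0.05, "push_policy": 0.05}
diversity_weights = {"category_relevance": 0.05, "brand_relevance": 0.05, "next_basket": 0.10,
                     "associated_items": 0.30, "sales_trend": 0.25, "push_policy": 0.25}

candidate = {"category_relevance": 0.8, "brand_relevance": 0.6, "next_basket": 0.7,
             "associated_items": 0.2, "sales_trend": 0.4, "push_policy": 0.1}
print(push_score(candidate, need_weights), push_score(candidate, diversity_weights))
```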

The second scenario regarding this push notification task builds around a mobile lottery application. The basic aim is to send push-promotion messages to increase the use of the lottery application. The strategy we used was to suggest to mobile application users new lottery games that were considered most likely to engage the recipient, with the following objectives:


This notification message component takes as its input the historical sequence of games that a user has played in the past and produces a recommendation for a game that the user has not played before. For this type of recommendation, we used the Word2Vec method. The training sessions, with each sequence of games played by a user being regarded as a session, are used to create and assign embedding vectors to each game in the lottery application. Games that are played jointly by users (i.e., co-exist in the history of multiple users) are close in the vector space. Thus, by calculating the mean of the vectors of a user's games, we effectively vectorize the user's preference in the space of lottery games. Then, the Euclidean distance of the user's preference vector from the vectors of all the games s/he has not played is calculated. The game with the shortest distance from the user's preference vector becomes the recommended game, which is sent to the user through a push notification message.
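The following sketch illustrates this procedure with gensim's Word2Vec; the game identifiers and the toy training sequences are invented for illustration only.

```python
import numpy as np
from gensim.models import Word2Vec

# Each user's sequence of played games is one "sentence" (hypothetical game ids)
sessions = [["keno", "lotto", "scratch3"], ["lotto", "joker"], ["keno", "joker", "lotto"]]
model = Word2Vec(sessions, vector_size=32, window=5, min_count=1, sg=1)

def recommend_unplayed_game(played, model):
    """Return the unplayed game closest (Euclidean) to the user's mean preference vector."""
    pref = np.mean([model.wv[g] for g in played], axis=0)
    candidates = [g for g in model.wv.index_to_key if g not in played]
    return min(candidates, key=lambda g: np.linalg.norm(model.wv[g] - pref))

print(recommend_unplayed_game(["keno", "lotto"], model))
```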

#### *4.6. A/B Testing Results*

One of our research objectives and a deliverable of our project was to test session-based recommendations in the real environment of an online shop. For this purpose, we integrated all the SBRS methods analyzed in this paper into a demonstrator that enables evaluations using different recommenders in a real environment and with real users. One target was to measure the effectiveness of SBRS, but we also wanted to consider other practical issues such as efficiency and adaptation to business rules.

We utilized A/B testing to conduct the study. A/B testing is a quantitative evaluation technique that compares two or more real versions of a web page, or of a component of a web application in general, to examine which version performs better. These variations, known as A, B, etc., are assigned randomly to each new user entering the website: some users are directed to the first version, others to the second, and so on. Hence, product recommendations are made to each group of users by a different SBRS method. The division of users into groups is performed with a rotation function. Once a user connects to the website, a unique alphanumeric identifier is issued which identifies the session id of that user. The session id remains the same for each user until they leave the site. Thus, the user's session id can be fed into the rotation function, and a different recommendation method is selected for the entire session of a user.
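One simple way to implement such a rotation function is to hash the session id into one of the available recommenders, which keeps the assignment stable for the whole session and spreads new sessions roughly evenly across variants. This is our sketch, not necessarily the exact function used in the demonstrator.

```python
import hashlib

RECOMMENDERS = ["rnn", "graph_neo4j", "item2vec", "random_category"]

def assign_recommender(session_id: str) -> str:
    """Deterministically map a session id to one A/B variant."""
    digest = hashlib.md5(session_id.encode("utf-8")).hexdigest()
    return RECOMMENDERS[int(digest, 16) % len(RECOMMENDERS)]

# Every request carrying the same session id gets the same method
print(assign_recommender("a1b2c3d4-session"))
```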

Using A/B testing, we were able to test and verify in a real e-shop which recommender method produces more clicks for recommended products, creates more purchases, and so on. Three SBRS methods were applied, specifically, RNN-based, graph-based, and Item2Vec. Additionally, we used an extra "random" algorithm which made suggestions using only the product category that the visitor was currently viewing.

The recommender component was integrated into an e-shop selling leather apparel. When a product is viewed, the component shows additional recommended products to the visitor, aiming to provide them with a better experience. If this aim is realized, then longer user sessions and higher customer satisfaction should be expected, hopefully leading to better conversion rates. Web usage data were recorded to enable the calculation of several evaluation metrics (all data and statistics of our A/B testing are accessible at https://fresanalytics.cntcloud.eu/, accessed on 20 February 2023). All methods were evaluated for their success based on the number of clicks on recommended products.

Figure 6 shows the number of clicks made by users over an equal number of user sessions, using each algorithm for a period of time. Based on the results, the RNN method produced the most clicks on recommended products (4768), while the Graph DB-Neo4j algorithm was the second most effective method (4451 clicks). The Item2Vec algorithm was the third best performing method. It is also important to note that the clicks produced by the "random" algorithm were almost half those of the other algorithms. From these results, one can easily infer that the three algorithms were more effective than recommendations made solely based on category. In other words, the success of the recommendation engine, regardless of the algorithm used, is far greater than that of the most widely used practice, which is to recommend products from the same category as the currently viewed product.

**Figure 6.** Number of clicks produced by each recommender method.

The most important metric that highlights the effectiveness of a recommendation engine, not only scientifically but also commercially, is the click-through rate, i.e., the number of clicks made by the e-shop users as a proportion of the total number of items recommended (Figure 7).

The findings presented in Figure 7 are very encouraging, as the click-through rate of the proposals that came from the recommendation engine exceeded 3%. This percentage would be much higher if the panel of recommended products were placed higher on the product view web page; however, this was not possible under the agreement made with the company owning the e-shop. Another interesting finding was the significant improvement in the bounce rate recorded in the Google Analytics of the e-shop after we installed the recommendation component, further indicating the success of the recommendation engine. Of course, several parameters have been identified that can be improved and would potentially lead to better results. One further experiment we intend to perform will test different layouts of the recommended products in order to increase the utility of the recommendation process and ultimately increase session time and user satisfaction.

**Figure 7.** Click-through rate achieved by our recommender method.

#### **5. Conclusions and Future Directions**

Nowadays, many retail sales come from e-commerce web applications. Furthermore, e-commerce exploded during the pandemic, a business trend which is expected to last. Consequently, the effectiveness of e-commerce solutions has become an important challenge for successful e-businesses. For large e-businesses that have tens of thousands or millions of users making repeat visits and purchases, conventional recommender systems that rely on user IDs and product view information are viable and very effective. However, these conventional systems cannot sufficiently address other widespread scenarios such as:

• Smaller e-businesses, in which it is difficult to collect user ID information for privacy and other reasons.

• More dynamic shopping environments that need to capture the user's intent and preferences at a certain point in time, without these being obscured by long-term historical shopping behavior.

• Short temporal preference shifts in user buying behavior that are represented only in the intrinsic nature of each specific session.

The scenarios outlined above show the need for SBRS. In other words, in many e-commerce applications and environments, it is necessary to learn user behavior patterns using sessions as the main transaction unit. Having identified the main requirements and tasks of SBRS, our work in the FRES project was driven by the need to investigate, develop and evaluate several methods for SBRS. After many lab experiments using our SBRS methods, but also after testing their operation in real e-commerce applications, our main conclusions related to the next-item and next-basket recommender tasks are as follows:


Additionally, LSTM approaches exceeded Word2Vec performance, albeit not by the same margin as in the next-item task.


We believe these conclusions make it clear that designing an SBRS is a complex problem with practical constraints. Recommendations must be effective, i.e., propose relevant items and place the true next item as highly as possible in the list of suggestions. At the same time, the operation of a recommender (i.e., data gathering, data curation and modeling, processing, analysis, filtering) should be executed efficiently and in a way that allows operators to frequently update the recommender models. This is one direction that we plan to explore in our future research.

Besides the work we conducted on next-item and next-basket recommendations, we also worked on the purchase intent problem. This study was mainly motivated by the idea that e-commerce applications should have components for continuously monitoring users during their navigation. We believe that the key feature such components can deliver to e-commerce applications is proactive stimulus actions, offering buying incentives to the user.

In conclusion, we believe that by completing the FRES project we have demonstrated the importance of SBRS for e-commerce applications. Furthermore, we have developed and tested many methods to effectively execute all the main tasks and predictions of an SBRS: next-item and next-basket recommendations, as well as purchase intent. The most effective method in all these tasks was RNN-LSTM. This method could become the cornerstone for developing more complex frameworks for e-commerce applications that will aim at higher conversion rates and better profitability.

**Author Contributions:** Conceptualization, M.S. and K.D.; methodology, M.S., A.K., T.S., M.D., P.K. and K.D.; software, A.K., T.S., D.T., P.K., K.C. and M.D.; validation, A.K., T.S., P.K., C.B. and M.D.; formal analysis, A.K., T.S., K.C., M.D., D.T. and P.K.; investigation, M.D. and I.K.; resources, C.B. and K.D.; data curation, M.S., P.K. and D.T.; writing—original draft preparation, M.S. and I.K.; writing—review and editing, M.S. and I.K.; visualization, D.T.; supervision, M.S. and K.D.; project administration, K.D.; funding acquisition, K.D. and C.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research work received funding from the European Regional Development Program of the European Union and also from the Greek state through the Program RESEARCH–CREATE–INNOVATE (project code: T1EDK-01776).

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Restrictions apply to the availability of these data. The data were obtained from a leather apparel e-shop and are available from the authors with the permission of the leather apparel e-shop.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Notations and Abbreviations**


#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

### *Article* **Session-Based Recommendations for e-Commerce with Graph-Based Data Modeling**

**Marina Delianidi, Konstantinos Diamantaras \*, Dimitrios Tektonidis and Michail Salampasis**

Department of Information and Electronic Engineering, International Hellenic University, 57400 Thessaloniki, Greece

**\*** Correspondence: kdiamant@ihu.gr

**Abstract:** Conventional recommendation methods such as collaborative filtering cannot be applied when long-term user models are not available. In this paper, we propose two session-based recommendation methods for anonymous browsing in a generic e-commerce framework. We represent the data using a graph where items are connected to sessions and to each other based on their order of appearance or their co-occurrence. In the first approach, called Hierarchical Sequence Probability (HSP), recommendations are produced using the probabilities of items' appearances in certain structures in the graph. Specifically, given a current item during a session, to create a list of recommended next items, we first compute the probabilities of all possible sequential triplets ending in each candidate next item, then of all candidate item pairs, and finally of the proposed item alone. In our second method, called Recurrent Item Co-occurrence (RIC), we generate the recommendation list based on a weighted score produced by a linear recurrent mechanism using the co-occurrence probabilities between the current item and all items. We compared our approaches with three state-of-the-art Graph Neural Network (GNN) models using four session-based datasets, one of which contains data collected by us from a leather apparel e-shop. In terms of recommendation effectiveness, our methods compete favorably on a number of datasets, while the time to generate the graph and produce the recommendations is significantly lower.

**Keywords:** recommender systems; session-based recommendations; e-commerce; data and web mining; item co-occurrence; graph data model

#### **1. Introduction**

Intelligent recommendations and their application in e-business systems are increasingly attracting the interest of researchers and companies. In particular, the use of recommendation systems in e-commerce aims at increasing the conversion rate, profit, and customer engagement and satisfaction. Today, online sales are often made by non-registered users, and therefore there are no historical user data recorded by the e-shop platforms. In these cases, the only data that can be stored are information concerning the duration of a session, the actions performed in each session step, the items (i.e., products) viewed, and other activities of the online customers. This information is recorded and later used to generate recommendations in real time for other users visiting similar or related items. These are session data [1] and can be collected while users navigate the e-shop platform.

There is a variety of methods used in recommendation systems, such as association rules [2], matrix factorization [3] or machine learning techniques [4], etc. Other methods recently used in recommendation systems are based on graphs [5]. Graphs can efficiently model user–item interactions within sessions, enabling the easy generation of new session data in near-real-time. The most frequent items that users visit are easily found using the current complete graph, where nodes represent items and edges represent the "next-item-in-session" relationship between the nodes. Additionally, the combination of consecutive item appearances during user navigation in the online store is easily identified through the

**Citation:** Delianidi, M.; Diamantaras, K.; Tektonidis, D.; Salampasis, M. Session-Based Recommendations for e-Commerce with Graph-Based Data Modeling. *Appl. Sci.* **2023**, *13*, 394. https://doi.org/10.3390/ app13010394

Academic Editor: Keun Ho Ryu

Received: 20 November 2022 Revised: 13 December 2022 Accepted: 24 December 2022 Published: 28 December 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

graph structure. Implementing a standard co-occurrence method using graphs, a session-based recommendation method called Pair Popularity, based on item co-appearances anywhere in the same session, was presented in [6]. Having recorded the item *x* viewed by the user at session step *t*, the method recommended a list of items for step *t* + 1 based on the number of times the items co-appear with *x* in the training sessions. Session-based recommendation models based on Graph Neural Networks, such as SR-GNN [7], GCE-GNN [8] and IC-GAR [9], have become very popular recently because of their very good performance, which often represents the state-of-the-art. However, there are also a number of drawbacks regarding GNN approaches:


The motivation behind this work is to address these problems by introducing improved graph-based recommendation models which are simple and therefore computationally efficient. Moreover, they should be able to exploit the cold-start probabilities of items when there is no available co-occurrence with other items in the session, and should be able to offer novel and diverse recommendations. To that end, we propose two new session-based recommendation methods: the Hierarchical Sequence Probability (HSP) and the Recurrent Item Co-occurrence (RIC). The HSP method extends the Pair Popularity graph-based approach, improving the results by using item sequences in user sessions to produce a hierarchical recommendation list, while the RIC recommendation approach focuses on the co-occurrence of products, giving weight to the count of item appearances in the corresponding sessions.

The goal is to produce the optimal item recommendation list at time step *t* + 1 according to the items observed by the user up until the current time step *t*. The contributions of our work are:


The rest of the paper is organized as follows. In Section 2, we present an overview of the existing research related to the session-based recommendation problem. The data model analysis for the proposed recommendation methods and the algorithmic details are presented in Section 3. Section 4 presents the datasets used in our experiments. In Section 5, we describe the experimental procedure and discuss the results of the proposed recommendation methods. Section 6 concludes the paper.

#### **2. Literature Review**

There is a large number of recommendation methods used for different purposes, such as recommending friends, destinations, movies, products, etc. [10]. These systems use, in addition to previous user transactions, features such as location, demographic profile, and user preferences to identify items that are similar to one another. The role of recommendation systems (RS) has become increasingly crucial, especially in e-commerce, due to the availability of a large set of items from which the user can choose. Users of e-commerce sites are given tailored recommendations for products that they might find interesting. After applying any applicable business criteria, RSs finally offer a list of the top *n* recommended products for each targeted user action. Long-term user profiles, if they exist, are prominently used in RS techniques. Such long-term user models, however, are unavailable in many applications for privacy-related reasons [11].

Session-based recommendation approaches (SBR) are recommendation techniques that only consider the user's in-session behavior and other session-specific information as well as the sequential order of items in sessions [12]. They adjust their recommendations to the user's most recent actions, and their main objective is to predict and suggest the next item(s) during every active user session [1].

A general method for developing recommendation systems is matrix factorization [13,14]. A user–item rating matrix is factorized into two low-rank matrices, each of which reflects the latent factors of users or items. In [3], the authors propose a matrix factorization approach for session-based recommendations which is based on solving a least squares optimization problem involving item–item similarities and session–item weights. The method achieves results comparable to the state-of-the-art; however, its complexity increases quickly with the size of the itemset. The item-based neighborhood approaches [15], in which item similarities are determined by co-occurrence within the same session, can be a rational solution when they take into account the sequential order of the items instead of generating predictions relying only on the most recent click. Sequential Markov chain approaches have been suggested for predicting users' future actions based on their past actions [16,17]. The weakness of Markov-chain-based models is that they recombine the previous components independently; such a strong independence assumption affects the prediction accuracy.

Recommendation systems using graphs have also been quite actively studied recently. In fact, graph databases (GDBs) are one of the latest approaches in data modeling [5]. In a graph model, the data entities are represented as nodes and their relationships as directed or undirected connections between the nodes; thus, any data relationship can be represented on a corresponding graph [18]. Neo4j [19] is a popular graph database tool used for creating various recommendation systems for friends, movies and items, as well as in e-commerce and loyalty-based retail businesses [20,21]. It uses the Cypher declarative graph query language, which is similar to SQL, allowing efficient creation, reading, updating and querying of the graph data [22].

A session-based recommendation solution developed using the Neo4j graph database is presented in [6]. In this paper, the authors demonstrate an efficient method for session-based next-item recommendations. This recommendation system was developed for an e-commerce retail store. With appropriate data modeling, by defining nodes and relationships between the nodes and executing Cypher queries, the system identifies the co-occurring item pairs appearing anywhere in the same session. The frequency of co-occurring item pairs determines the degree of similarity between these items. In practice, the next-item recommendation method uses these similarities for building the model.
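To make the idea concrete, the following sketch shows how such a pair-popularity query might be issued from Python through the neo4j driver. The node labels, relationship names, direction and properties are assumptions about the graph schema made for illustration, not the exact model of [6].

```python
from neo4j import GraphDatabase

# Count how often other items co-occur with a given item in the same session
# (schema names are hypothetical).
QUERY = """
MATCH (a:Item {id: $itemId})-[:ItemInSession]->(s:Session)<-[:ItemInSession]-(b:Item)
RETURN b.id AS other, count(s) AS freq
ORDER BY freq DESC
LIMIT $topK
"""

def co_occurring_items(driver, item_id, top_k=10):
    with driver.session() as session:
        result = session.run(QUERY, itemId=item_id, topK=top_k)
        return [(record["other"], record["freq"]) for record in result]

# driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
# print(co_occurring_items(driver, "SKU-123"))
```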

Deep Learning (DL) models based on Recurrent Neural Networks (RNN) have recently been proposed for session-based recommendation solutions. The work in [23] proposes a Recurrent Neural Network approach for session-based recommendations, called GRU4REC, which employs multiple layers of the GRU model and uses only item sequences. In [24], the authors propose a hierarchical Recurrent Neural Network, again based on the GRU model, for session-based recommendations using user information. The work presented in [25] extends the GRU4REC method by introducing data augmentation, and [26] proposes NARM, which integrates a stacked GRU encoder with an attention mechanism to capture more representative item transition information for SBR. In [27], the authors mix sequential patterns and co-occurrence signals by combining the recurrent method and the neighborhood-based method to enhance the performance of the GRU4REC recurrent model. One more DL recommendation method is based on mixture-channel purpose routing networks (MCPRNs) [28]. To handle multi-purpose sessions, the authors suggest a mixture-channel model. To model the dependencies between items within each channel for a specified purpose, they create a purpose-specific recurrent network (PSRN), a variation of the GRU RNN model. The authors of [29] introduce an RNN model named Hierarchical Attentive Transaction Embedding (HATE), which exploits the attention mechanism to predict the next item by modeling dependencies in transactional data. The HATE model consists of two parts, the Inter-transaction Context Embedding part for the item representation and the Intra-transaction Context Embedding part for the representation of multiple chosen items in the current transaction, and it integrates these embeddings using intra-transaction attention.

Graph Neural Network (GNN) models implement recommendation systems for various scenarios such as Social Recommendation, Sequential Recommendation, Session-based Recommendation, Bundle Recommendation, Cross-Domain Recommendation or Multi-behavior Recommendation [30]. Additionally, GNN models adopt machine learning and deep learning techniques, such as Convolutional Networks, Attention Mechanisms or Embedding representations, to create recommendation systems in different domains [5]. Several next-item Graph Neural recommendation models have been proposed recently for e-commerce scenarios using session-based datasets. One Graph Neural Network next-item recommendation approach, named the Heterogeneous Mixed Graph Learning (HMGL) framework [31], was constructed to learn the complex local and global dependencies for next-item recommendations. HMGL encodes both session information and item attribute information into one unified graph, modeling both local and global dependencies to better prepare for next-item recommendations. In SR-GNN (https://github.com/CRIPAC-DIG/SR-GNN, accessed on 15 March 2022) [7], the session sequences are modeled as graph-structured data. Each session is represented as the composition of the global preference and the current interest of the session. An attention network is used to learn item embeddings on the session graph, and then obtain a representative session embedding which is calculated according to the relevance of each item to the last one. The GCE-GNN (https://github.com/CCIIPLab/GCE-GNN, accessed on 10 May 2022) [8] extends the previous approach by employing a session-aware attention mechanism to recursively incorporate the neighbors' embeddings of each node on the global graph. First, the session sequences are converted into session graphs to construct a global graph. GCE-GNN learns two levels of item embeddings: from the session graph, by modeling pairwise item transitions within the current session, and from the global graph, which learns the global-level item embedding by modeling pairwise item transitions over all sessions. Another recent GNN model, called IC-GAR (https://github.com/Taj-Gwadabe/IC-GAR, accessed on 10 October 2022) [9], models current session representations with session co-occurrence patterns, using a modified variant of the Graph Convolutional Network (GCN). The Prediction Module of IC-GAR separates global preference, local preference, and session co-occurrence in order to estimate the probability scores of candidate items. The global and local preferences model user interest in the current session, whereas the session co-occurrence representation aggregates the higher-order transition patterns of all the items in the training sessions. IC-GAR generates a single undirected graph for every training session. SR-GNN, GCE-GNN and IC-GAR are the most recent state-of-the-art GNN RS models for SBR, where each enhances the previous one with additional modules in order to more accurately predict the next item. A summary of the reviewed recommendation methods is presented in Table 1.

In the present work, we focus on graph-based recommendation systems in the e-commerce domain by proposing recommendation models that compete with recent state-of-the-art GNN-based recommendation models. We propose two different methods, called Hierarchical Sequence Probability (HSP) and Recurrent Item Co-occurrence (RIC), which create the recommendation list using the item–item relationships "next" and "in-same-session", respectively. Related to these two methods, the aim of this paper is to answer the following research questions:


**Table 1.** Summary of reviewed methods.


#### **3. Recommendation Methods**

In this section, we describe two session-based recommendation methods called Hierarchical Sequence Probability (HSP) and Recurrent Item Co-occurrence (RIC), respectively. For both methods, we use graphs to represent data. These graph methods can be applied in e-shop platforms that allow anonymous access from non-registered users. The difference between the two recommendation methods is that HSP exclusively uses the sequence of items appearing in the session, while in the case of RIC, the recommendation list is based primarily on the items' co-occurrences extracted from session data. In detail, the two methods are described below.

#### *3.1. The Graph Models*

In both proposed methods, the graphs consist of two types of nodes, which represent sessions and items. In the HSP graph, the connections between the nodes represent the following relationships:


Thus, the graph data provide sequence information about items in sessions. Figure 1a shows part of the data graph including the *ItemInSession* and the *Next* relationships between the items and the sessions.

In the RIC method, similar to HSP, the graph has two types of nodes, representing items and sessions, and the following two types of connections:


In this case, the graph data model does not provide the sequence information about items in sessions. Figure 1b shows part of the data graph including the *ItemInSession* and the *InSameSession* relationships between the items and the sessions.

**Figure 1.** Representation of items (light blue nodes), sessions (pink nodes) and the relationships *ItemInSession* (blue edges), *Next* (red edges), *InSameSession* (green edges). (**a**) The HSP Graph Model, (**b**) The RIC Graph Model.

#### *3.2. Hierarchical Sequence Probability Method—HSP*

The Hierarchical Sequence Probability approach is an item–item collaborative filtering recommendation method where the list of recommended items arises from the items' sequential appearances during session navigation. We denote by *t* the current time instance and by *item<sub>t</sub>* the item viewed by the user at time *t* during session *s*. To recommend the next item at time instance *t* during *s*, we introduce the concept of the *"item sequence probability"*. This term derives from the frequency with which users visit the item during sessions on the e-shop platform. For all items we define the *single item probability*, *pair sequence probability* and *triplet sequence probability* as follows:

• *P*<sub>0</sub>(*A*)—*single item probability*, or *cold-start probability*, is the number of appearances of item *A* in all sessions divided by the total number of appearances of all items:

$$P_0(A) = \frac{\text{number of appearances of } A}{\text{number of appearances of all items}} \tag{1}$$

The *single item probability* is derived from the relationship *ItemInSession* of the graph;

• *P*<sub>1</sub>(*A*, *B*)—*pair sequence probability*, the number of appearances of the item pair (*A*, *B*) in successive instances in all sessions, i.e., *item<sub>t−1</sub>* = *A*, *item<sub>t</sub>* = *B*, divided by the number of appearances of item *A*:

$$P_1(A,B) = \frac{\text{number of appearances of consecutive pair } A, B}{\text{number of appearances of item } A} = P(item_t = B \mid item_{t-1} = A) \tag{2}$$

This function is created for all pairs of items using the *Next* relationship;

• *P*<sub>2</sub>(*A*, *B*, *C*)—*triplet sequence probability*, the number of appearances of the item triplet (*A*, *B*, *C*) in successive steps in all sessions, i.e., *item<sub>t−2</sub>* = *A*, *item<sub>t−1</sub>* = *B*, *item<sub>t</sub>* = *C*, divided by the number of appearances of the consecutive pair *A*, *B*:

$$P_2(A, B, C) = \frac{\text{number of appearances of consecutive triplet } A, B, C}{\text{number of appearances of consecutive pair } A, B} = P(item_t = C \mid item_{t-2} = A, item_{t-1} = B) \tag{3}$$

This function is created for all triplets of items connected by the *Next* relationship in a chain *A* → *B* → *C*.

The probabilities *P*<sub>1</sub> and *P*<sub>2</sub> are closely related to the confidences of the association rules (*A* ⇒ *B*) and (*A*, *B* ⇒ *C*) under the additional constraint that *A*, *B*, and *C* must be consecutive items [32].
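For readers who prefer code to notation, the following Python sketch estimates the three probability tables of Equations (1)-(3) directly from training sessions given as plain lists of item ids; in the actual system the same counts are obtained with graph queries over the *ItemInSession* and *Next* relationships.

```python
from collections import Counter

def sequence_statistics(sessions):
    """Estimate the single, pair and triplet sequence probabilities of Eqs. (1)-(3)."""
    singles, pairs, triplets = Counter(), Counter(), Counter()
    for s in sessions:
        singles.update(s)                    # item appearances
        pairs.update(zip(s, s[1:]))          # consecutive pairs
        triplets.update(zip(s, s[1:], s[2:]))  # consecutive triplets
    total = sum(singles.values())
    P0 = {a: n / total for a, n in singles.items()}
    P1 = {(a, b): n / singles[a] for (a, b), n in pairs.items()}
    P2 = {(a, b, c): n / pairs[(a, b)] for (a, b, c), n in triplets.items()}
    return P0, P1, P2

# P0, P1, P2 = sequence_statistics([["A", "B", "C"], ["A", "B", "D"], ["B", "C"]])
```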

#### The Algorithm

The Hierarchical Sequence Probability (HSP) recommendation method is based on item sequences observed through the users' actions in the sessions. To recommend an item, we look at its probability of appearance as well as the history of sequences of length 1 or 2 in which this item has participated during the training sessions. In the absence of a history (i.e., in the first step of a session), items are recommended based on their single item (cold-start) probability. Thus, the recommendation of the next item is based on the following cases:


Figure 2 summarizes the flowchart of the proposed algorithm. In general, each recommended item appears only once in the final list, following the hierarchy *"triplet sequence probability"* (2-length history), then *"pair sequence probability"* (1-length history), and then *"single item probability"* (0-length history).

**Figure 2.** The flowchart of the proposed Hierarchical Sequence Probability (HSP) algorithm. Whenever appending an item to the recommendation list, we make sure it is not already included, to avoid duplicate recommendations.
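The hierarchical construction itself can be summarized in a few lines. The sketch below is a simplified, non-authoritative rendering of the flow in Figure 2, assuming the probability tables P0, P1 and P2 computed above.

```python
def hsp_recommend(history, P0, P1, P2, k=10):
    """Build the top-k list hierarchically: triplet, then pair, then single-item
    probabilities, skipping items already in the list (cf. Figure 2 and Eqs. (1)-(3))."""
    recs = []

    def extend(candidates):
        for item, _ in sorted(candidates, key=lambda kv: kv[1], reverse=True):
            if item not in recs and len(recs) < k:
                recs.append(item)

    if len(history) >= 2:
        a, b = history[-2], history[-1]
        extend([(c, p) for (x, y, c), p in P2.items() if (x, y) == (a, b)])
    if history:
        b = history[-1]
        extend([(c, p) for (x, c), p in P1.items() if x == b])
    extend(list(P0.items()))          # cold-start fallback
    return recs[:k]

# hsp_recommend(["A", "B"], P0, P1, P2, k=5)
```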

#### *3.3. Recurrent Item Co-Occurrence Algorithm*

Whereas HSP is based primarily on the *Next* relationship to determine the list of recommended items, our second proposed method is based primarily on the *InSameSession* relationship, which neglects the relative position of the items in the session. Let I = {*o*<sub>1</sub>, ... , *o<sub>N</sub>*} be the set of all items. Given a current item *x*, we define the confidence weight *γ*(*o<sub>i</sub>*|*x*) of any other item *o<sub>i</sub>* as the number of *InSameSession* relations involving both *x* and *o<sub>i</sub>* divided by the number of *ItemInSession* relations involving *x* and any session *s*:

$$\gamma(o_i|x) = \frac{\text{count}(\mathit{InSameSession}(x, o_i))}{\text{count}(\mathit{ItemInSession}(x, s))}. \tag{4}$$

This is equivalent to the confidence value conf(*x* ⇒ *o<sub>i</sub>*) of the association rule *x* ⇒ *o<sub>i</sub>* [32]:

$$\text{conf}(x \Rightarrow o_i) = P(s \ni o_i \mid s \ni x) = \frac{P(\{s \ni x\} \cap \{s \ni o_i\})}{P(s \ni x)} = \frac{\text{number of sessions containing both } x \text{ and } o_i}{\text{number of sessions containing } x}. \tag{5}$$

A naive approach would be to build the recommendation list by simply sorting items by decreasing confidence *γ*(*o<sub>i</sub>*|*x*). This approach, however, has two major drawbacks:


To alleviate these problems, we define a new confidence value *c<sub>i</sub>* for item *o<sub>i</sub>* which is equal to *γ*(*o<sub>i</sub>*|*x*) if *x* and *o<sub>i</sub>* co-occur in at least one session; otherwise, *c<sub>i</sub>* is equal to the cold-start probability *P*<sub>0</sub>(*o<sub>i</sub>*) defined in Equation (1):

$$c_i = \begin{cases} \gamma(o_i|x) & \text{if } \gamma(o_i|x) > 0 \\ P_0(o_i) & \text{otherwise} \end{cases} \tag{6}$$

With this approach, items with no history of co-occurrence with *x* are ranked by decreasing cold-start probability.

In order to introduce memory to the system, we further propose a simple, first-order recurrent model that generates the item weights which are used to build the recommendation list. Let *x* = *item<sub>t</sub>* be the item viewed at step *t* in some session *s* and *c<sub>i</sub>*(*t*) be the confidence value of any item *o<sub>i</sub>* based on *item<sub>t</sub>* as described in Equation (6). Then, the weight *w<sub>i</sub>*(*t*) of this item at time *t* is defined by the recurrent model:

$$w\_i(t) = \alpha w\_i(t-1) + (1-\alpha)c\_i(t). \tag{7}$$

The initial condition is again the cold-start probability *w<sub>i</sub>*(0) = *P*<sub>0</sub>(*o<sub>i</sub>*). The recommendation list at any time step *t* is built using the top *n* items with the largest weights *w<sub>i</sub>*(*t*). Since the weights are computed independently for different items, the process can be easily parallelized using the vectors **w**<sub>*t*</sub> = [*w*<sub>1</sub>(*t*), ... , *w<sub>N</sub>*(*t*)] and **c**<sub>*t*</sub> = [*c*<sub>1</sub>(*t*), ... , *c<sub>N</sub>*(*t*)], where now **w**<sub>*t*</sub> = *α***w**<sub>*t*−1</sub> + (1 − *α*)**c**<sub>*t*</sub>.

The parameter 1 − *α* (0 ≤ *α* ≤ 1) is the "forgetting factor", which determines the memory length of the system. For 1 − *α* = 0 the system has infinite memory: the confidence *c<sub>i</sub>*(*t*) based on *item<sub>t</sub>* is ignored and *w<sub>i</sub>*(*t*) maintains its initial value throughout the session. If 1 − *α* = 1 the model becomes memoryless and *w<sub>i</sub>*(*t*) = *c<sub>i</sub>*(*t*). The parameter *α* is set by the user and determines the effect that the previous items *item<sub>t−1</sub>*, *item<sub>t−2</sub>*, ... have on the current decision. Figure 3 depicts the schematic diagram of the proposed recurrent system.

**Figure 3.** In the RIC method, the item weight vector **w***t* at any time step *t* is generated by a first order linear recurrent model. The input to the model is the current confidence vector **c***t*.
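A minimal sketch of this recurrence, vectorised over all items as described above, is given below; the variable names are ours, and the gamma values are assumed to be precomputed from the graph.

```python
import numpy as np

def ric_update(w_prev, gamma_row, p0, alpha):
    """One step of the RIC recurrence (Eqs. (6)-(7)), vectorised over all N items.

    w_prev    : weight vector w_{t-1}
    gamma_row : gamma(o_i | item_t) for every item o_i w.r.t. the current item
    p0        : cold-start probabilities P_0(o_i)
    alpha     : memory parameter, 0 <= alpha <= 1
    """
    c = np.where(gamma_row > 0, gamma_row, p0)   # Eq. (6): fall back to cold-start
    return alpha * w_prev + (1 - alpha) * c      # Eq. (7)

# Session loop: w starts at the cold-start probabilities; the top-n weights give the list.
# w = p0.copy()
# for item_t in session:
#     w = ric_update(w, gamma[item_t], p0, alpha=0.5)
#     recommendations = np.argsort(-w)[:10]
```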

Depending on how the sessions are recorded, it is possible to have repeated consecutive entries of the same item; for example, the item sequence could be *A*, *B*, *B*, *C*. In some datasets this is a frequent situation, whereas in others it never appears. We offer two flavors of the RIC algorithm:


#### **4. The Datasets**

We applied the experimentation on four session-based datasets: Leather (https:// github.com/delmarin35/Graph-Probability-Rec-Sys/tree/main/data, accessed on 19 November 2022), Yoochoose1/64 (http://2015.recsyschallenge.com/challege.html, accessed on 3 April 2022), Diginetica (http://cikm2016.cs.iupui.edu/cikm-cup, accessed on 3 April 2022), eElectronics (https://www.kaggle.com/datasets/mkechinov/ecommerceevents-history-in-electronics-store, accessed on 18 June 2022).

**Leather:** The Leather dataset was obtained by processing the web server log records of an e-shop selling leather apparel, jackets, furs and accessories. These are real data that emerged from the log files recorded implicitly during the users' navigation in the e-shop platform over a period of six months, from March to August 2021. The log data were preprocessed by identifying sessions, session length, user actions in each session, and the items targeted by the actions. We consider every user action during the session as a session step, for example, viewing an item or adding an item to the basket. The dataset was processed to keep only the sessions that contain at least two behavior sequences, *"view item"* and *"add to cart"*, resulting in 102,024 records. The dataset was split into train (80%) and test (20%) sets. Thus, the numbers of records of the train and test sets are 81,651 and 20,373, respectively. The total number of unique sessions is 19,236, which corresponds to 15,388 unique sessions in the train set and 3848 unique sessions in the test set. In addition, the dataset contains 1448 unique items, of which 1429 appear in the train set and 1296 in the test set. The same item never appears in two consecutive steps in any session.

**Yoochoose1/64:** This is a public benchmark session-based dataset that is commonly used to evaluate recommendation system performance. It has 17,740 unique items, of which 17,371 appear in the train set and 6745 in the test set. In addition, 369 items of the test set do not appear in the train set, while 10,995 train set items do not exist in the test set. Moreover, 16.183% of the item pairs in the train sessions and 14.938% of the item pairs in the test sessions contain the same item twice.

**Diginetica:** Similarly to Yoochoose1/64, the Diginetica dataset is often used as a benchmark for testing the performance of recommendation systems. This dataset contains 43,097 items. All the items appear in the train set, but only 21,129 appear in the test set. The percentage of consecutive item pairs with a repeated item is lower than in Yoochoose1/64, being approximately 9% in both train and test sets.

**eElectronics:** This dataset contains user behavior data recorded over a period of 5 months (October 2019–February 2020) from a large electronics online store. After removing the sessions with only one item and splitting the total of 68,973 sessions into train (80%) and test (20%) sets, 55,089 sessions were used for training and 13,884 sessions were used for testing. Additionally, there are 33,130 unique items, 29,917 of which appear in the train set and 14,482 in the test set. Furthermore, 3213 test set items are not in the train set and 1848 train set items do not appear in the test set.

The description of statistical information of the datasets is presented in Table 2. All the datasets have items that exist in the test set and do not exist in the train set or vice versa. No item was removed from either the train or the test sets.


**Table 2.** Datasets statistics.

#### **5. The Experimentation Procedure and Results**

In this section, we first describe the evaluation metric for performance evaluation. We then intend to answer the research questions posed in Section 2.

#### *5.1. Evaluation Metrics*

The metrics that we used to evaluate the methods were the Mean Reciprocal Rank (MRR)@*K* and the Recall@*K*. The MRR is an appropriate metric for measuring the performance of recommendation algorithms on a session-based dataset as well as a good measure of the effectiveness of next-item recommendation [33]. It evaluates the accuracy of the recommended top-*k* list and is defined as:

$$\text{MRR@}k = \frac{1}{N} \sum_{x} \frac{1}{\text{rank}(x)},$$

where *x* is the next item to be predicted and rank(*x*) is the position of *x* in the recommendation list, starting from position 1 for the first item. If *x* is not in the recommendation list, we set rank(*x*) = ∞. The value of MRR is between 0 and 1, and the higher the value, the better the quality of the recommendations. Assuming, as is often the case, that at least five recommended items appear on the user's screen, an MRR ≥ 0.2 indicates that the method is quite successful, since the next item chosen by the user is—on average—among the top five recommended.

The Recall@*k* is defined as the percentage of the target items that are actually included in the top-*k* recommendation list. Specifically, given a sequence of *N* top-*k* recommendation lists *L<sub>i</sub>* with corresponding target items *x<sub>i</sub>*, the Recall@*k* is defined as [29]:

$$\text{Rec@}k = \frac{1}{N} \sum_{i=1}^{N} |L_i \cap \{x_i\}|,$$

where |*L<sub>i</sub>* ∩ {*x<sub>i</sub>*}| denotes the cardinality of the intersection between *L<sub>i</sub>* and {*x<sub>i</sub>*}, which, in this case, can either take the value 0, if the intersection is empty (i.e., *x<sub>i</sub>* ∉ *L<sub>i</sub>*), or 1, if *x<sub>i</sub>* ∈ *L<sub>i</sub>*.
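For reference, both metrics can be computed with a few lines of Python; this is a generic sketch, not the authors' evaluation code.

```python
def mrr_at_k(recommendation_lists, targets, k):
    """Mean Reciprocal Rank: 1/rank of the target inside the top-k list, else 0."""
    total = 0.0
    for recs, target in zip(recommendation_lists, targets):
        topk = recs[:k]
        if target in topk:
            total += 1.0 / (topk.index(target) + 1)   # ranks start at 1
    return total / len(targets)

def recall_at_k(recommendation_lists, targets, k):
    """Fraction of cases where the target item appears anywhere in the top-k list."""
    hits = sum(1 for recs, target in zip(recommendation_lists, targets) if target in recs[:k])
    return hits / len(targets)

# mrr_at_k([["A", "B", "C"]], ["B"], k=10)  -> 0.5
```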

In all experiments, we used the train sets to construct the data representation graphs or to train the neural models. The evaluation of the algorithms was performed on the test sets.

#### *5.2. The Experiments*

The same session-based datasets were used for all the experimental implementations, and the results were recorded for MRR@*k* and Rec@*k* for the top *k* = {10, 20, 30} items. More specifically,



**Table 3.** Optimal values of the *α* parameter in the RIC method.

#### *5.3. Results and Discussion*

The experimentation results show that the effectiveness of a model is affected by the dataset. Table 4 shows that the RIC-CiL recommendation method has a better MRR@*k* performance for any *k* in the case of the Leather and eElectronics datasets. In these datasets, there is no session with repeated items in consecutive steps. The RIC-CiL variant achieves these results by transferring the current item to the end of the recommendation list, thus essentially excluding it from the top-*k* recommendations. This technique does not bring the desired results in the Yoochoose1/64 and Diginetica datasets, due to the existence of repeated consecutive items in the train and test sets. In these cases, the plain RIC variant works better and, especially in the Diginetica case, outperforms SR-GNN and IC-GAR with respect to the MRR@*k* metric. HSP also has a very good MRR performance, outperforming the GNN models on the Leather dataset (falling only behind RIC-CiL).

On the other hand, the GCE-GNN and SR-GNN models achieve a better MRR performance on the Diginetica dataset, while all three GNN models have better MRR performance on the Yoochoose1/64 dataset. In both of these datasets, repeated consecutive items appear in many sessions. As shown in Table 2, the Yoochoose dataset has a very large percentage of repeated item pairs (15–16%), even higher than the Diginetica dataset (∼9%). This indicates that the GNN models have difficulty predicting the next item in a high ranking position in the recommendation list unless the next item is identical to the current one. In other words, they are not as efficient in identifying novelty. We conjecture that this phenomenon is due to overfitting, considering that these models have many parameters that allow them to achieve a very good fit on the training data but may not generalize as efficiently to the test data. Here, it is worth noting that online store users may not appreciate getting recommendations that include the same item they are currently visiting. It seems more natural to exclude the current item from the recommendation list. However, in our experiments, we still use the Diginetica and Yoochoose1/64 datasets because they are common benchmarks studied in many papers in the field.

**Table 4.** Comparison results of our methods against three state-of-the-art Graph Neural Network recommendation models. The MRR@{10,20,30} and Recall@{10,20,30} are used as evaluation metrics. The best performance for each dataset and corresponding metric is marked by bold-face numbers.


Additionally, our experiments show that the GCE-GNN model achieves the best Recall@*k* for any *k* in all the datasets. In combination with the previous observations, we conclude that the GCE-GNN model is the most efficient one in finding the next item somewhere in the top-*k* list. However, it still has some difficulty in placing the next item in a high ranking position unless it is the same item as the current one. This is more obvious when studying the top 10 recommendations in the Leather and eElectronics datasets. In this case, we note that, although Rec@10 is almost identical for GCE-GNN and RIC-CiL, GCE-GNN has a significantly lower MRR@10 (by 2.5–3%).

Comparing HSP and RIC with each other, we find that HSP is inferior to RIC-CiL in the case of the non-repeating datasets (Leather and eElectronics) and inferior or very close to the performance of RIC-plain in the item-repeating datasets (Diginetica and Yoochoose1/64). Especially in the case of Yoochoose1/64, HSP and RIC-plain are almost equivalent in terms of MRR performance, although HSP is inferior in terms of the Rec@*k* metric. In the case of the Diginetica dataset, the performance of HSP is inferior to both versions of RIC. The difference between our two proposed methods is that HSP relies on the *Next* relationship, taking into account the sequence of items, while RIC relies on the *InSameSession* relationship, taking into account the items' co-occurrence. The performance superiority of RIC over HSP indicates that focusing on item co-occurrence is more beneficial than looking strictly at the recent item sequence.

Based on these findings, we can claim that the structure of a dataset affects the performance of the recommendation models regardless of the way the recommendation model is constructed, i.e., with or without the use of neural networks. Assuming that we do not recommend the current item to the online user, the RIC-CiL variation achieves the best MRR performance compared against state-of-the-art GNN models.

Regarding execution time, the proposed HSP and RIC methods are significantly faster in producing the recommendation list, from the initial stage onward. In addition, simple CPU execution is sufficient for the studied datasets to quickly carry out the training process and the calculation of the candidate recommended items. The time the entire experimental process took per recommendation method and dataset, from the beginning to the appearance of the results, is shown in Table 5 and schematically illustrated in Figure 4. Whereas the state-of-the-art GNN methods require a GPU to complete the experiments in the time indicated in Table 5, our HSP and RIC methods consumed significantly less time without the use of a GPU.


**Table 5.** The approximate experimentation execution time in minutes.

During the training process on the eElectronics dataset, we reduced the batch size to eight to avoid the out-of-memory problem. For the other datasets, we kept the batch size at 100, as defined in the methods' GitHub repositories.

Based on the above findings, the proposed HSP and RIC methods are sufficiently competitive against more complex, state-of-the-art methods, and can be applied in real e-commerce environments without requiring special equipment for their production operation.

**Figure 4.** The execution time in minutes for all methods and each dataset. Our proposed methods are between 10 and 180 times faster than the GNN models.

#### **6. Conclusions**

We have presented two graph-based methods for session-based recommendations in a generic e-commerce environment that do not employ user history, i.e., they are suitable for anonymous browsing. The methods are called Hierarchical Sequence Probability (HSP) and Recurrent Item Co-occurrence (RIC). HSP is based on the statistics of the *Next* relationship between items, computing the probabilities of triplets, pairs and single items, which are then used, in that order, to determine the position of each item in the recommendation list. The RIC method is primarily based on the *InSameSession* relationship, which captures the co-occurrence of pairs of items in the same session. We introduce memory to RIC by incorporating a simple recurrent formula to determine the weight of each item, which is subsequently used to place the item in its proper position in the recommendation list. Setting the value of the forgetting factor of this recurrent formula allows us to balance the effect of previously visited items in the session on the current decision.

Both proposed methods have been compared to state-of-the-art Graph Neural Network models. Our experiments, which involve four diverse datasets, show that RIC can outperform the GNN models in two cases and achieve a performance quite close to that of the winner model in the other two. The HSP approach is typically inferior to RIC, indicating that the *Next* relationship is not so important compared to the *InSameSession* relationship when building the recommendation list.

Additionally, both HSP and RIC methods are very fast compared to the GNN models. This happens despite the fact that the execution time of the GNN models is reduced thanks to the presence of a GPU accelerator, whereas the times recorded for our models are measured on a simple CPU-based machine.

In future work, we plan to investigate the improvement of the RIC method by automatically determining the optimal parameter *α* and also to determine whether the current item should be first or last in the recommendation list. Another important aspect of the algorithm which is worth investigating is the graph update as new data are collected in such a way that the computational cost of updating the new confidence vectors, cold-start probabilities and weight vectors is minimized.

**Author Contributions:** Conceptualization, M.D. and K.D.; methodology, M.D. and K.D.; software, K.D.; validation, M.D. and K.D.; formal analysis, M.D. and K.D.; investigation, M.D. and K.D.; resources, D.T. and M.S.; data curation, D.T., M.D., K.D. and M.S.; writing—original draft preparation, M.D. and K.D.; writing—review and editing, K.D., M.D. and M.S.; visualization, K.D.; supervision, K.D.; project administration, K.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the RESEARCH–CREATE–INNOVATE call (project code: T1EDK-01776).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data for this study are available on request from the corresponding authors.

**Acknowledgments:** This research has been co-financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH–CREATE–INNOVATE (project code:T1EDK-01776).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

### *Article* **Exploiting Domain Knowledge to Address Class Imbalance in Meteorological Data Mining**

**Evangelos Tsagalidis <sup>1</sup> and Georgios Evangelidis <sup>2,</sup>\***


**Abstract:** We deal with the problem of class imbalance in data mining and machine learning classification algorithms. This is the case where some of the class labels are represented by a small number of examples in the training dataset compared to the rest of the class labels. Usually, those minority class labels are the most important ones, implying that classifiers should primarily perform well on predicting those labels. This is a well-studied problem and various strategies that use sampling methods are used to balance the representation of the labels in the training dataset and improve classifier performance. We explore whether expert knowledge in the field of Meteorology can enhance the quality of the training dataset when treated by pre-processing sampling strategies. We propose four new sampling strategies based on our expertise on the data domain and we compare their effectiveness against the established sampling strategies used in the literature. It turns out that our sampling strategies, which take advantage of expert knowledge from the data domain, achieve class balancing that improves the performance of most classifiers.

**Keywords:** meteorological data mining and machine learning; class imbalance; classification; randomized undersampling; SMOTE oversampling; undersampling using temporal distances

#### **1. Introduction**

Imbalanced or skewed training datasets make predictive modeling challenging, since most classifiers are designed assuming a uniform distribution of class labels among the examples. There are classification problems that must deal with various degrees of imbalance. The goal is to improve the quality of the training dataset, i.e., make it more balanced, in order for the classifiers to achieve better predictive performance, specifically for the minority class. Usually, the minority class is more important and, hence, the classifier should be more sensitive to classification errors for the minority class than for the majority class [1]. A typical approach in the literature is the application of techniques for transforming the training dataset to balance the class distribution, including data oversampling for the minority examples, data undersampling for the majority examples, and combinations of these techniques [1,2].
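For illustration, the snippet below applies two of the standard literature strategies mentioned above, SMOTE oversampling and randomized undersampling, to a synthetic dataset with a minority rate similar to ours. It uses the imbalanced-learn package and is only a sketch of the baseline pre-processing step, not of the expert-knowledge strategies proposed in this paper.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic stand-in for the real data: X holds predictor fields, y the precipitation class
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (rng.random(1000) < 0.16).astype(int)      # roughly 16% minority class

X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)                 # oversample minority
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)  # undersample majority

print(np.bincount(y), np.bincount(y_over), np.bincount(y_under))
```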

We attempt to enhance existing pre-processing sampling strategies by exploiting expert knowledge from the domain of Meteorology. We use the European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis 40-years dataset (see https://www.ecmwf.int/en/forecasts/dataset/ecmwf-reanalysis-40-years, accessed on 29 November 2022, for details), also known as ERA-40, and a dataset with the historical observations of the meteorological station of Micra, Thessaloniki, Greece, and we attempt to predict the occurrence of precipitation on the ground at the meteorological station. We use various data pre-processing strategies (based on oversampling and undersampling) for the selection of the appropriate training dataset, and we test their effectiveness on various classifiers.

**Citation:** Tsagalidis, E.; Evangelidis, G. Exploiting Domain Knowledge to Address Class Imbalance in Meteorological Data Mining. *Appl. Sci.* **2022**, *12*, 12402. https://doi.org/ 10.3390/app122312402

Academic Editor: Yosoon Choi

Received: 27 October 2022 Accepted: 2 December 2022 Published: 4 December 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

The input dataset consists of imbalanced data regarding the precipitation class variable, where the minority class accounts for only 16.1% of the cases. It is known that such situations degrade the performance of data mining or machine learning classifiers. In [3], we determined the minimum training dataset size that can ensure effective application of data mining techniques specifically on meteorological data. The performance of various classifiers did not increase significantly for training dataset sizes of more than 9 years. Also, the results were not affected by the way we chose the training dataset examples, i.e., randomly isolated examples totalling nine years versus nine entire yearly sets of examples randomly selected. In this paper, we take advantage of the above finding by choosing appropriately large training datasets for the tested classifiers.

The contribution of this study is the proposal of effective sampling strategies on meteorological training datasets that are based on our expertise on the data domain. In our experimental study, we compare common sampling strategies from the literature and the proposed new strategies and show that the newly proposed strategies improve the performance of most classifiers.

The remainder of the paper is organized as follows. Section 2 discusses the problem of class imbalance, reviews recent works that address it using domain knowledge, and describes the sampling strategies used in the literature as well as the novel sampling strategies we propose. Section 3 describes the datasets we used for applying the sampling strategies on the training dataset and the classifiers that we compared. Section 4 discusses the methodology used in the experiments. In Section 5, we present the analysis and the results, and finally, we conclude in Section 6.

#### **2. The Problem of Class Imbalance**

A very good introduction to the problem of class imbalance and the related research efforts is given in [4,5]. Ref. [4] provides a comprehensive review of the subject and discusses the initial solutions that were proposed to deal with the problem of class imbalance. Ref. [5] discusses the role that rare classes and rare cases play in data mining, the problems that they can cause and the methods that have been proposed to address these problems.

Over the years, the problem of class imbalance has been studied extensively. There exist numerous papers that use standard data-agnostic oversampling and undersampling techniques to create balanced training datasets. Regarding meteorological data, ref. [6] first applies oversampling to increase thunderstorm examples in the training dataset and then uses deep neural networks to predict thunderstorms. Similarly, ref. [7] applies standard oversampling techniques on radar image data to improve rainfall prediction, while [8] presents a framework for predicting floods, in which it embeds re-sampling to address class imbalance. Finally, ref. [9] does not apply any sampling strategies but experiments with various classifiers and concludes that Self-Growing Neural Networks perform better when predicting fog events using data with class imbalance.

Various research works attempt to exploit domain knowledge to address the class imbalance problem, but not in the meteorological domain. Ref. [10] addresses the problem of noisy and borderline examples when using oversampling methods, while [11] deals simultaneously with the problems of class imbalance and class overlap. Ref. [12] uses domain specific knowledge to address the problem of class imbalance in text sentiment classification. Finally, ref. [13] exploits domain knowledge to address multi-class imbalance in classification tasks for manufacturing data.

In our study, we use the most common sampling strategies found in the literature to address the class imbalance problem, namely, the randomized undersampling and the SMOTE oversampling methods and their combination. SMOTE stands for Synthetic Minority Oversampling Technique [14]. Besides the natural distribution, we employ the commonly used 30% and 50% (or balanced) distributions regarding the minority class [1]. We also examine the within-class distribution in addition to the between-class distribution [15], using a combination of the randomized undersampling and the SMOTE oversampling methods in both minority and majority examples.

In an effort to take into account the peculiarities of the data domain when sampling the training datasets and to examine how these could affect the performance of the classifiers, we applied two novel strategies when constructing balanced datasets, i.e., datasets where the number of majority and minority examples is equal. In the first strategy, we applied the k-Means clustering algorithm using "classes to clusters" evaluation to select only the most homogeneous majority examples. In the second strategy, we rejected the majority examples that were closer to the minority examples with respect to their temporal distance in days using three different values for the distance. Then, we further reduced the number of majority examples to achieve a balanced distribution using the randomized undersampling method. We are not aware of any other attempt that uses large meteorological databases and at the same time domain specific sampling techniques to address the class imbalance problem.

We used five different classifiers to build models for predicting our class variable. The training/test set method was used to evaluate the models and to reveal the best sampling strategy for meteorological data. As an evaluation metric, we used the Area Under the ROC (Receiver Operating Characteristics) Curve (AUC) [5,16].

#### **3. Datasets**

#### *3.1. ERA-40 Dataset*

The European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis 40 years dataset (ERA-40) is a global atmospheric analysis of many conventional observations and satellite data streams for the period of September 1957 to August 2002. Reanalysis products are used increasingly in many fields that require an observational record of the state of either the atmosphere or its underlying land and ocean surfaces. There are numerous data products that are separated into dataset series based on resolution, vertical coordinate reference, and likely research applications. In this study, we used the ERA-40 2.5 degree latitude-longitude gridded upper air analysis on pressure surfaces. This dataset contains 11 variables on 23 pressure surfaces on an equally spaced global 2.5 degree latitude-longitude grid. All variables are reported four times a day at 00, 06, 12 and 18UTC for the entire period [17].

We created our initial dataset choosing the values of 10 variables on 7 pressure surfaces on one node. We used only the data from the node with geographical coordinates 40° N latitude and 22.5° E longitude, which is the closest node to the Meteorological Station of Micra, Thessaloniki, Greece located at 40.52° N, 22.97° E and an altitude of 4 m. We omitted the 11th variable of the Ozone mass mixing ratio. The 1000 hPa, 925 hPa, 850 hPa, 700 hPa, 500 hPa, 300 hPa and 200 hPa are the 7 pressure surfaces we chose, because these are the ones that are mainly used by the meteorology forecasters operationally. In addition, the values of the barometric pressure on mean sea level in Pa supplement the initial dataset that consists of 71 variables.

Furthermore, the initial values of most of the variables for each pressure surface and the pressure on mean sea level were transformed to make them easier to understand or to express them in the same metric units as used operationally by the meteorologists. More specifically, specific humidity, initially expressed in kg·kg<sup>−1</sup>, was converted to g·kg<sup>−1</sup>, and vertical velocity in Pa·s<sup>−1</sup> to hPa·h<sup>−1</sup>. The relatively small values of both relative vorticity in s<sup>−1</sup> and divergence, also in s<sup>−1</sup>, were multiplied by 10<sup>6</sup>, and the value of potential vorticity in K·m<sup>2</sup>·kg<sup>−1</sup>·s<sup>−1</sup> by 10<sup>8</sup>. Regarding the wind, wind direction in azimuth degrees and wind speed in knots were calculated from the U and V velocities in m·s<sup>−1</sup>. Also, the azimuth degrees for the wind direction were assigned to the eight discrete values of north (N), northeast (NE), etc., used in meteorology. The geopotential in m<sup>2</sup>·s<sup>−2</sup> was divided by the World Meteorological Organization (WMO) defined gravity constant of 9.80665 m·s<sup>−2</sup> and thus transformed to geopotential height in gpm. Finally, the values of barometric pressure on mean sea level were expressed in hPa, and only the values of temperature in K and relative humidity as a percentage (%) on pressure surfaces remained unchanged.

#### *3.2. Class Variable*

The 6-hourly main synoptic surface observation data of the Meteorological Station of Micra, Thessaloniki, Greece completed our initial dataset. More specifically, we collected the recorded precipitation data of the period 1 January 1960 00UTC–31 December 2001 18UTC. We assigned the value 'yes' to the 6-hourly records of rain, drizzle, sleet, snow, shower at the station or the records of thunderstorm at the station or around it, and the value 'no' to the rest of the records, thus creating the class variable of our study. Our purpose is to use the ERA-40 atmospheric analysis data at node 40° N, 22.5° E to predict the precipitation at the station. We mention that the determination of the recorded precipitation takes into account both the present and past weather of the synoptic observation, and that snow or thunder have priority over rain. Tables 1 and 2 depict the distribution of the precipitation types that had been recorded in the Meteorological Station according to the defined sub-clusters.

**Table 1.** Natural distribution of values within the minority class variable (precipitation 'yes').


**Table 2.** Natural distribution of values within the majority class variable (precipitation 'no').


#### *3.3. Predictor Variables*

In the pre-processing phase we applied data reduction using the Principal Component Analysis (PCA) extraction method to remove highly correlated variables from the ERA-40 dataset. We used the SPSS statistical software package to process the entire ERA-40 dataset and to produce a new one that consisted of a reduced number of uncorrelated variables.

After applying PCA and examining the component matrix of loadings and the variable communalities, we deleted a total of 36 variables from our initial dataset that consisted of 71 variables. The component model was re-specified six times with a final outcome of 35 variables and 9 components with eigenvalues greater than 1. This is exactly the same methodology we used in a previous work of ours [18]. The analysis revealed the findings of Table 3.

Table 4 displays the variance explained by the rotated components and additionally the corresponding nine most highly correlated variables. The Total column gives the eigenvalue, or amount of variance in the original variables accounted for by each component. The % of Variance column gives the ratio of the variance accounted for by each component to the total variance in all of the variables (expressed as a percentage). The % Cumulative column gives the percentage of variance accounted for by the first 9 components.

The first nine rotated components explain nearly 85.2% of the variability in the original variables and it is possible to considerably reduce the complexity of the data set by using these components, with a 14.8% loss of information. As a result, we can reduce the size of the ERA-40 dataset by selecting the 9 most highly correlated variables with the 9 principal components [18,19]. These meteorological parameters could express the state of the troposphere where precipitation is created and reaches the ground. The reduced ERA-40 dataset with the 9 chosen variables, as predictors, and the precipitation, as class variable, comprised our experimental dataset with 61,364 examples. The size of the dataset is explained by the fact that we have four daily examples (one every 6 h) for a period of 42 years (42 × 365 × 4 = 61,320 examples plus 11 × 4 = 44 examples for the 11 extra leap year days of that period).
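The authors performed this step in SPSS with component rotation and iterative re-specification; the scikit-learn sketch below only illustrates the general idea of keeping, for each retained component, the single most highly loaded variable (no rotation, and the variable names are assumed to be the columns of a DataFrame `X`).

```python
# A rough, unrotated analogue of the variable-selection step (not the SPSS procedure).
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def representative_variables(X: pd.DataFrame, n_components: int = 9) -> list:
    Z = StandardScaler().fit_transform(X)                    # standardize the 71 variables
    pca = PCA(n_components=n_components).fit(Z)
    # loadings[f, c]: correlation-like weight of variable f on component c
    loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
    return [X.columns[np.abs(loadings[:, c]).argmax()] for c in range(n_components)]
```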


**Table 3.** Most highly correlated variables to the rotated components.

**Table 4.** Variance explained by rotated components and the representative variables.


#### **4. Methodology**

Since the focus of our study was to address the class imbalance problem, we used a number of sampling strategies in order to balance the training datasets used in the classification task.

Besides the training dataset with the natural distribution of the precipitation values that are shown in Tables 1 and 2 (Strategy 1), we created nine more balanced training datasets following different strategies (Strategies 2–10) (Table 5). Two of them followed the 30% distribution regarding the minority class variable and the other seven the balanced distribution (50%). In the following we describe Strategies 2 through 10.

In the second and third strategies, we used the randomized undersampling method to remove examples producing two datasets with a 30% (U30) and a 50% (U50) distribution of the minority class, respectively [5].

Likewise, in the fourth and fifth strategies, we used a combination of the SMOTE oversampling method to create new examples of the minority class and the randomized undersampling method to remove examples from the majority class, achieving a 30% (SU30) and a 50% (SU50) distribution of the minority class, respectively. We ran the SMOTE oversampling method in the WEKA environment, using 3 nearest neighbors [14,20,21].
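The authors ran these strategies in WEKA; as a rough illustration, an imbalanced-learn sketch of the combined SMOTE/undersampling idea could look as follows. The intermediate amount of oversampling is an arbitrary choice made for the sketch, not a value taken from the paper.

```python
# Illustrative sketch of an SU-style strategy with imbalanced-learn (not the WEKA runs).
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

def su_resample(X, y, minority_fraction=0.5, random_state=0):
    # target minority/majority ratio after resampling (0.5 -> 1.0, 0.3 -> ~0.43)
    ratio = minority_fraction / (1.0 - minority_fraction)
    # oversample the minority part of the way with SMOTE (k = 3 neighbours as in the paper) ...
    X_mid, y_mid = SMOTE(k_neighbors=3, sampling_strategy=0.5 * ratio,
                         random_state=random_state).fit_resample(X, y)
    # ... then randomly undersample the majority down to the target ratio
    return RandomUnderSampler(sampling_strategy=ratio,
                              random_state=random_state).fit_resample(X_mid, y_mid)
```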

In the sixth strategy (BW), we formed balanced datasets not only between-classes but also within-classes [15]. Thus, we employed the randomized undersampling method to reduce the number of the examples for the large clusters of 'Rain/Drizzle' and 'Fair/Cloudy' and the SMOTE oversampling method to increase the number of the examples for the small clusters of 'Snow/Sleet', 'Thunder' and 'Fog'. Thus, the sum of the 'Rain/Drizzle', 'Snow/Sleet' and 'Thunder' examples that belong to the minority class became equal to the sum of the 'Fair/Cloudy' and 'Fog' examples of the majority class, achieving the between-class balance. Moreover, the number of the 'Rain/Drizzle', 'Snow/Sleet' and 'Thunder' examples became equal to each other, and, similarly, the number of 'Fair/Cloudy' and 'Fog' examples became equal to each other, achieving the within-class balance.

**Table 5.** Description of used sampling strategies.


Strategies 7 through 10 are newly proposed sampling strategies that take into consideration the nature of the data at hand. More specifically, in the seventh strategy (CU), we applied the k-means clustering algorithm to the entire dataset using WEKA. We set the number of clusters equal to five and chose the "classes to clusters" evaluation in WEKA to evaluate each cluster according to the five classes of precipitation (Tables 1 and 2). In the first step, we selected only the majority examples of the 'Fair/Cloudy' and 'Fog' labeled clusters. In this manner, we rejected all the majority examples that clustered in the three clusters that corresponded to the three minority classes. The idea is that these examples are not good majority representatives since they cluster with minority examples and the classifiers would struggle to distinguish between them. Then, we employed the randomized undersampling method to further reduce the number of majority examples in order to achieve a balanced distribution.
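A Python sketch of the CU idea is given below, assuming the five precipitation sub-classes of Tables 1 and 2 are available per example as a NumPy array; the authors used WEKA's "classes to clusters" evaluation, so this is only an approximation of that procedure.

```python
# Approximate sketch of the CU strategy: keep only majority examples that fall in
# clusters whose dominant sub-class is a majority sub-class ('Fair/Cloudy' or 'Fog').
import numpy as np
from sklearn.cluster import KMeans

def cu_majority_mask(X, subclass, majority_subclasses=("Fair/Cloudy", "Fog"), k=5, seed=0):
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    is_majority = np.isin(subclass, majority_subclasses)
    keep = np.zeros(len(X), dtype=bool)
    for c in range(k):
        members = labels == c
        values, counts = np.unique(subclass[members], return_counts=True)
        if values[counts.argmax()] in majority_subclasses:   # "classes to clusters" label
            keep |= members & is_majority
    return keep   # boolean mask of the majority examples to retain before undersampling
```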

Finally, we introduced three more strategies to reduce the excessive number of examples that comprise the majority class. For each majority example, we added a new attribute that expressed its temporal distance to the closest minority example. Then, we selected only the majority examples that had a temporal distance greater than one day (D1U), two days (D2U), or four days (D4U). Finally, similarly to strategy CU, in the D1U and D2U strategies we employed the randomized undersampling method to further reduce the number of majority examples and achieve a balanced distribution. In the case of the D4U strategy, the number of majority examples after the reduction was already very close to the number of minority examples. The idea of the temporal distance arose from the fact that during precipitation episodes there may be intervals without precipitation on the ground, while the meteorological conditions that cause precipitation still exist. It is possible that the classifiers cannot distinguish these majority cases from minority ones, leading to a degradation of their performance.
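A sketch of the temporal-distance filter behind D1U/D2U/D4U is shown below; the timestamp handling (pandas/NumPy) and the variable names are assumptions made for illustration, not the authors' code.

```python
# Keep only majority ('no') examples whose nearest minority ('yes') example is more
# than `min_days` away in time; the result is then further undersampled at random.
import numpy as np
import pandas as pd

def far_majority_mask(times: pd.DatetimeIndex, y: np.ndarray, min_days: float) -> np.ndarray:
    t = times.asi8 // 10**9                         # observation timestamps in seconds
    t_min = np.sort(t[y == "yes"])                  # minority timestamps
    idx = np.searchsorted(t_min, t)
    big = np.iinfo(np.int64).max
    left = np.where(idx > 0, t - t_min[np.clip(idx - 1, 0, len(t_min) - 1)], big)
    right = np.where(idx < len(t_min), t_min[np.clip(idx, 0, len(t_min) - 1)] - t, big)
    nearest = np.minimum(np.abs(left), np.abs(right))
    return (y == "no") & (nearest > min_days * 86400)
```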

In Section 5, we provide the corresponding number of examples for each strategy and the details regarding the sub-clusters of the precipitation class variable. The training datasets were the input to five classifiers, namely, the Decision tree C4.5, the k-Nearest Neighbor, the Multi-layer Perceptron with back-propagation, the Naïve Bayesian and the RIPPER (Repeated Incremental Pruning to Produce Error Reduction) [21].

We evaluated the resulting models on separate test datasets that followed the natural distribution regarding the clusters of precipitation (Tables 1 and 2). The Area Under the ROC Curve, or simply AUC, was the evaluation metric we used. The AUC measures the performance of the classifiers as a single scalar. ROC graphs are two-dimensional graphs in which the True Positive Rate (the percentage of minority cases correctly classified as belonging to the minority class) is plotted on the Y axis and the False Positive Rate (the percentage of majority cases misclassified as belonging to the minority class) is plotted on the X axis. An ROC graph depicts relative trade-offs between benefits (true positives) and costs (false positives). The AUC is a reliable measure especially for imbalanced datasets to get a score for the general performance of a classifier and to compare it to that of another classifier [5,16].
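As a minimal example of the metric, scikit-learn's `roc_auc_score` computes the same quantity from the true labels and the predicted minority-class scores; the toy values below are illustrative only.

```python
# Toy AUC computation; in the experiments the scores would come from a trained classifier.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array(["yes", "no", "no", "yes", "no"])          # 'yes' is the minority class
scores = np.array([0.9, 0.2, 0.4, 0.7, 0.1])                 # predicted probability of 'yes'
print(roc_auc_score((y_true == "yes").astype(int), scores))  # 1.0 for this toy data
```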

#### **5. Experiments and Results**

#### *5.1. Training/Test Datasets*

The training/test set method was used to build and evaluate the data mining models. The initial dataset of 61,364 examples was divided into 10 non-overlapping folds. By taking each one of the 10 folds as a test set and the remaining 9 as a pool of examples for choosing the training datasets, we formed 10 groups with 55,228 training examples and 6136 test examples. Every fold was chosen randomly, but it followed the natural distribution according to the clusters within the precipitation class variable, as shown in Tables 1 and 2. Thus, we produced 10 test datasets with 6136 examples following the natural distribution that covered the entire initial dataset. In our experiments we always used the above test datasets without introducing any synthetic examples.

We created 100 training datasets by randomly taking 10 samples with replacement consisting of 17,788 examples from the training examples of each one of the 10 groups. Furthermore, we joined the same test dataset 10 times to the corresponding 10 training datasets of each group and formed 100 training/test datasets with 23,924 examples (17,788 training and 6136 test examples, 74.35–25.65%). It is noted that in the strategy of D4U, where we used the four days restriction and reduced the number of majority examples close to the number of minority examples, we formed only a total of 10 training datasets, one for each group.
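A rough sketch of this construction is given below; the sub-class proportions are illustrative (only the 16.1% minority total is taken from the text), and the exact sampling-with-replacement scheme may differ from the authors'.

```python
# Illustrative sketch: stratified 10-fold split on the precipitation sub-classes, then
# 10 random training samples of 17,788 examples drawn from each group's pool.
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(61_364, 9))                               # stand-in for the 9 predictors
subclass = rng.choice(["Rain/Drizzle", "Snow/Sleet", "Thunder", "Fair/Cloudy", "Fog"],
                      size=len(X), p=[0.13, 0.02, 0.011, 0.80, 0.039])   # illustrative only

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for pool_idx, test_idx in skf.split(X, subclass):
    for _ in range(10):
        train_idx = rng.choice(pool_idx, size=17_788, replace=True)
        # ... resample X[train_idx] with one of the strategies, train, evaluate on test_idx
```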

The different methodologies used to generate a training dataset characterize the different strategies that we followed to address the class imbalance problem. We employed nine new training datasets according to the strategies that we described in Section 4.

Table 6 shows the number of examples of each of the five different types of precipitation for: (a) the initial file (Initial), (b) the 10 groups (Groups), (c) the 10 folds or test sets (Folds), and (d) the sampled training datasets produced by the nine strategies. Notice that for all strategies, we generated 10 samples per Group for a total of 100 samples of 17,788 examples. The exception was D4U, where the generated training datasets had an almost balanced distribution of the majority and minority classes; hence, we generated a single sample per Group for a total of 10 samples of 17,625 examples.

In Table 6, we observe that the total number of minority examples in the original training datasets (Groups of 9 folds) was 8894. Hence, in order to produce a 50% balanced training dataset, one needs to choose the same number of majority examples out of the 46,334 available ones. This is the reason we chose 17,788 as the size of the sampled training dataset. These examples correspond to about 12 years of data that is an acceptable amount of data for classification purposes according to our previous research [3], as we explained in Section 1.

#### *5.2. Algorithm Runs*

To recap, we tested each one of the first nine strategies with 100 training/test datasets (UN, U30, U50, SU30, SU50, BW, CU, D1U and D2U) and the tenth strategy with 10 training/test datasets (D4U), for a total of 910 training/test datasets.

These datasets comprised the input to the five classifiers that were run and evaluated using WEKA. The classifiers were the decision tree C4.5 without pruning and Laplace estimate (DT), the k-Nearest Neighbors with k = 5 and Euclidean distance (kNN), the RIPPER (RIP), the Naïve Bayesian (NB), and the Multilayer Perceptron neural network with back-propagation (MP).


**Table 6.** The natural distribution and the number of examples within the precipitation class variable of the training datasets generated by the various sampling strategies.

The last three classifiers were run using the default settings of WEKA. Thus, we performed 4550 runs in the WEKA environment and we present the results in Table 7 and in Figures 1 and 2. Table 7 shows the mean value and the standard deviation of AUC of the 100 or 10 (for D4U) runs for each strategy and classifier.

**Figure 1.** Box-plots of AUC values for strategies UN, U30, U50, SU30, SU50, BW and all classifiers.

Since it is impossible to plot all the box plots for all strategies and classifiers in a single figure, we decided to use two figures. In the first figure, we compare the strategies commonly used in the literature (2 through 6) against UN (strategy 1 that simply uses the initial unbalanced dataset). In the second figure, we compare the newly proposed strategies (7 through 10) against UN and the best strategy of the first figure.

Thus, Figure 1 depicts the box-plots of the corresponding AUC values for the first six strategies. The white box-plots correspond to the UN strategy, the light gray box-plots to the U30 strategy, the light gray box-plots with a pattern of black dots to the U50 strategy, the dark gray box-plots to the SU30 strategy, the dark gray box-plots with a pattern of black dots to the SU50 strategy and the white box-plots with a pattern of black dots to the BW strategy.

**Figure 2.** Box-plots of AUC values for strategies UN, U50, CU, D1U, D2U, D4U and all classifiers.

**Table 7.** Mean value and standard deviation (SD) of AUC. The top three strategies per classifier are shown in red text.


We notice that the best strategy for each classifier, with the exception of Naïve Bayesian, is the Randomized Undersampling with the balanced distribution (U50). Also, the classifier with the highest AUC value is the Multilayer Perceptron with back-propagation Neural Network. Regarding the Naïve Bayesian classifier, all strategies perform about equally and it seems that only the combination of the SMOTE Oversampling and Randomized Undersampling strategies (SU30, SU50) slightly improves the AUC metric. For the k-Nearest Neighbor and RIPPER classifiers, the U30, U50, SU30 and SU50 strategies significantly improve the performance on AUC, especially the U50 strategy. For the Decision Tree C4.5, only the U50 strategy performs slightly better than the Natural one (UN), and, for the Multilayer Perceptron, the U50 strategy performs better and the U30 strategy slightly better than the Natural one (UN). The balanced distribution in both the between- and within-classes (BW) strategy gave the worst results on AUC with the exception of the RIPPER classifier.

Likewise, Figure 2 depicts the box-plots of the corresponding AUC values for the proposed four strategies (CU, D1U, D2U, D4U), and, additionally, the UN and U50 strategies for comparison. The U50 strategy was chosen because of its performance shown in Table 7 and Figure 1. The white box-plots correspond to the UN strategy, the light gray box-plots to the U50 strategy, the dark gray box-plots to the CU strategy, the white box-plots with a pattern of black dots to the D1U strategy, the light gray box-plots with a pattern of black dots to the D2U strategy, and the dark gray box-plots with a pattern of black dots to the D4U strategy.

In both Figure 2 and Table 7, which highlights the top three performing strategies per classifier, we notice that the strategies with the temporal distance restriction of each majority example from the closest minority one (D1U, D2U and D4U) perform better than the UN strategy for all classifiers with the exception of the Naïve Bayesian classifier. In addition, they perform better than the U50 strategy in the case of the Decision Tree C4.5 and the k-Nearest Neighbor classifiers. Regarding the Multi-layer Perceptron, Naïve Bayesian and RIPPER classifiers, the D1U strategy performs about equally to or slightly better than the U50 strategy, while it performs better than the D4U strategy. Finally, the CU strategy gave very poor results on AUC, and only for the RIPPER classifier did it outperform the UN strategy.

#### **6. Conclusions**

We applied Principal Component Analysis to reduce the 71 initial chosen variables of the ERA-40 dataset to 9 variables that were uncorrelated to each other, which explain nearly 85.2% of the variability in the original variables. The reduced ERA-40 dataset and the historical precipitation records of the Meteorological Station of Micra, Thessaloniki, Greece were then input into five data mining and machine learning classifiers we used to build models that predict the occurrence of precipitation at the station.

The Multilayer Perceptron with back-propagation neural network classifier outperforms all other classifiers on AUC, proving to be the most effective classifier in this meteorological domain.

Moreover, the proposed new strategy D1U with the balanced distribution, resulting from the combination of the one-day restriction and the Randomized Undersampling method, is the recommended strategy to address the class imbalance problem for the Multilayer Perceptron with back-propagation neural network, Decision Tree C4.5, k-Nearest Neighbor and RIPPER classifiers. Alternatively, the Randomized Undersampling with the balanced distribution strategy (U50) could also be used for the Multilayer Perceptron with back-propagation neural network and RIPPER classifiers. Finally, regarding the Naïve Bayesian classifier, the proposed sampling strategies did not improve its performance when compared to the natural distribution. We observe that in the class imbalance problem, the application of sampling strategies based on expertise in the data domain can improve the effectiveness of some classifiers.

**Author Contributions:** Conceptualization, E.T. and G.E.; Methodology, E.T. and G.E.; Software, E.T.; Validation, E.T. and G.E.; Formal analysis, E.T. and G.E.; Investigation, E.T.; Resources, E.T.; Data curation, E.T.; Writing—original draft, E.T.; Writing—review & editing, E.T. and G.E.; Visualization, E.T. and G.E.; Supervision, G.E. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We wish to thank the European Centre for Medium-Range Weather Forecasts and the Greek National Meteorological Service for providing us with the meteorological data. We would also like to thank our colleagues Demetrios Papanastasiou and Leonidas Karamitopoulos for their valuable suggestions and comments.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Similarity Calculation via Passage-Level Event Connection Graph**

**Ming Liu 1,2,3,4, Lei Chen 4,\* and Zihao Zheng <sup>2</sup>**


**Abstract:** Recently, many information processing applications have appeared on the web to meet user demand. Since text is one of the most popular data formats across the web, measuring text similarity has become a key challenge for many web applications. Web text is often used to record events, especially in news. One text often mentions multiple events, while only the core event determines its main topic. This core event should take the most important position when measuring text similarity. For this reason, this paper constructs a passage-level event connection graph to model the relations among the events mentioned in one text. This graph is composed of many subgraphs formed by triggers and arguments extracted sentence by sentence. The subgraphs are connected via the overlapping arguments. In terms of centrality measurement, the core event can be revealed from the graph and utilized to measure text similarity. Moreover, two improvements based on vector tuning are provided to better model the relations among events. One is to find the triggers that are semantically similar. By linking them in the event connection graph, the graph covers the relations among events more comprehensively. The other is to apply graph embedding to integrate the global information carried by the entire event connection graph into the core event, so that text similarity is partially guided by the full-text content. As shown by the experimental results, after measuring text similarity from a passage-level event representation perspective, our calculation achieves better results than unsupervised methods and even comparable results to some supervised neuron-based methods. In addition, our calculation is unsupervised and can be applied in many domains without the preparation of training data.

**Keywords:** text similarity calculation; passage-level event connection graph; vector tuning; graph embedding

### **1. Introduction**

The fast advance of web technology has caused an explosive increase of web data. Text is one of the most prevailing data formats across the web, and many text-based analysis tools are provided to help users process texts. Text similarity calculation is one of the fundamental text processing tasks and is also the bottleneck of many web applications, such as news recommendation and Q&A systems. Traditional text similarity calculations can be roughly divided into two classes. One is supervised, which maps two texts into a high-dimensional space so that two similar texts end up close in their vector representations. The other is unsupervised, which often treats text as a sequence of word pieces and scores the similarity between two texts in terms of word concurrence, either taking the order of the sequence into account or ignoring it. Besides word concurrence, some other statistical features are also utilized, such as TF/IDF or Mutual Information. Between the two kinds of calculations, supervised ones usually achieve higher performance, since they can accurately draw the boundary between similar texts and

**Citation:** Liu, M.; Chen, L.; Zheng, Z. Similarity Calculation via Passage-Level Event Connection Graph. *Appl. Sci.* **2022**, *12*, 9887. https://doi.org/10.3390/ app12199887

Academic Editors: Dionisis Margaris and Stefanos Ougiaroglou

Received: 18 May 2022 Accepted: 26 September 2022 Published: 1 October 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

dissimilar texts with the help of training data. The emergence of neural-based algorithms has raised the performance of supervised methods to a higher level, giving them a huge advantage over unsupervised ones. However, their high-quality results are overly dependent on training data. When the domain changes, the performance of supervised methods degrades sharply. Unsupervised ones do not suffer from this limitation, since they do not refer to any transcendental knowledge and are free from training data. Thus, they do not fear domain transfer. The types of texts across the web are countless. We cannot collect all types of texts as training data for supervised methods in advance. Therefore, it is reasonable to design an unsupervised similarity calculation that can be applied in any domain.

Text is often used to record events. Reading a text, we know what is happening and what the outcome is. An event indicates that something happens in some place at some time. Traditional event extraction tasks, like ACE [1] and FrameNet [2], treat events as occurring at sentence level (which means events are fine-grained). The events stated by different sentences are independent. Events extracted at sentence level cannot be directly utilized to deal with passage-level tasks. Text similarity calculation is a typical passage-level task. The similarity between two texts depends on their overall content similarity. We should take a high-level view over all the events mentioned in one text. Just as one text has one main topic, at passage level, though there are many events stated by the sentences in one text, the text has only one core event. The other events serve the core event, explaining it or completing its details. That indicates that the core event mostly decides the similarity between two texts, and the other events play an auxiliary role. For example, the similarity between the two following articles (the two articles are, respectively, https://www.reuters.com/article/us-asia-storm/super-typhoon-slams-into-chinaafter-pummeling-philippines-idUSKCN1LW00F and https://www.wunderground.com/cat6/Typhoon-Mangkhut-Causes-Heavy-Damage-Hong-Kong-China-and-Macau, accessed on 17 May 2022; these two articles can be accessed until the pages are deleted) is high, since both take the damage caused by typhoon "Mangkhut" as the core event, though one details the degree of the damage and the other does not.

As text similarity calculation is a passage-level task that concerns whether two texts stress the same core event or not, we should take the entire text into consideration to locate the core event. However, most events cannot be fully stated by only one sentence. They may cross several sentences, even nonadjacent sentences. For financial events, e.g., investment and debt, the arguments of those events spread over the entire text. For this reason, traditional sentence-level event extraction methods are not appropriate for extracting the core event at passage level. This paper therefore constructs a graph, namely the event connection graph, to cover the multiple relations among the events mentioned in one text. This graph is composed of a set of polygons. Each polygon is formed by one trigger as its center and some arguments surrounding this center. The trigger and the arguments are extracted by a sentence-level event extraction method. To value the nodes in the graph, PageRank is adopted. The nodes with the largest values are treated as the core event, and text similarity is calculated according to the correlation between the two core events, respectively extracted from the two texts. Moreover, two improvements based on vector tuning are proposed to better model the event connection graph. One is to detect semantically similar triggers and link them to fully cover the relations among events. The other is to embed the global content carried by the entire event connection graph into the core event so that text similarity is partially guided by the full-text content.

To sum up, the contributions of this paper can be summarized as follows:

1. This paper proposes a novel event connection graph to model the events and their mutual relations mentioned in one text. This graph is composed of some polygons, and each polygon represents a sentence-level event. Via PageRank, the core event can be extracted to represent the main content of the text, and further utilized to calculate text similarity.

2. Two improvements are provided to enhance the completeness and effectiveness of the constructed event connection graph. One is to tune the vector representation of the trigger to find and link more related events, which enables the generation of a more comprehensive event connection graph. The other is to embed the information carried by the entire event connection graph into the core event to make similarity result more rational.

3. As shown by experimental results, our similarity calculation obtains results superior to unsupervised methods by a large margin, and even comparable to supervised neuron-based methods. Notably, our calculation is unsupervised. It can be applied in any domain without the dilemma of domain transferring.

Although our similarity calculation combines the merits of supervised and unsupervised similarity calculations, it has a time issue that needs to be further addressed. In particular, our calculation needs to form a passage-level event representation, and this operation requires extra time. Thus, although our calculation has higher accuracy, it is not fit for online, time-sensitive applications.

Our paper has six sections. Section 1 is the introduction, which briefly presents the motivation of our work and summarizes its contributions. Section 2 reviews related research. Section 3 first gives a brief overview of our work and then details the process used to construct the event connection graph and the approach used to value the nodes in the graph. Section 4 describes two improvements to our event connection graph based on vector tuning. Section 5 presents experiments that illustrate the high quality of our similarity calculation. Section 6 presents the conclusions and some future work.

#### **2. Related Work**

The rapid advance of internet technology has brought an explosive increase of web data. Facing this massive amount of data, internet users require automatic data analysis and processing tools. Text is one of the most prevailing data formats on the web; thus, many web applications are designed to process textual data. Almost all text-related applications treat text similarity calculation as a fundamental module. Applications such as text clustering [3], machine dialogue [4], product recommendation [5], and Q&A [6] take text similarity calculation as a key component. In general, the methods for text similarity calculation can be partitioned into two categories. One is supervised and guided by annotated training samples. The other is unsupervised and free from annotations.

Supervised methods often treat texts as points mapped into a high-dimensional space. A classification function is trained to separate the points into two groups, similar and dissimilar. Some other methods turn classification into a ranking problem and learn score functions to discriminate similar points from dissimilar ones. The advantage of the supervised type comes from the guidance of training data. Owing to the training process, supervised methods often achieve high performance. Text is encoded as a vector for calculation convenience. Before the appearance of deep neural networks, one-hot vectors were widely used, in which only one entry has a non-zero value. This kind of encoding generates high-dimensional and sparse vectors, which degrades the quality of many text-oriented applications [7]. The proposal of word embedding changed this dilemma. Word embedding compresses the one-hot vector into a densely distributed vector with low dimension. Skip-gram [8], CBOW [9], GloVe [10], and ELMo [11] are typical exemplars. The neuron-based models, such as CNN [12], GRU [13], and LSTM [14], or the pre-trained language models such as Transformer [15], GPT [16], BERT [17], XL-NET [18], and RoBERTa [19], can produce more reasonable text representations on the basis of word embedding. The overlapping degree decides the similarity between texts; however, depending only on word concurrence or word alignment cannot fully express the semantic similarity between texts. To better model the interaction between texts, attention mechanisms are used, which consider the relevance of non-aligned parts across the input sequences. The widely applied attentions are the multiple-layer attention in [20] and the co-attention in [21]. Basically, supervised text similarity calculations achieve high performance, especially after the application of neuron-based models. However, they are easily distorted by training data. They have to make a hypothesis about the distribution of the input data in terms of the transcendental knowledge implicitly provided by the training data. There is no way to collect enough training data to prepare supervised calculations in advance, especially for the neural-based methods, since their enormous numbers of parameters require massive data for full training. For this reason, supervised calculations are appropriate for dealing with domain data and can hardly be transferred. In our paper, we hope to design a text similarity calculation that fits texts of any type and from any domain. Therefore, we design an unsupervised text similarity calculation.

Unsupervised similarity calculations are free from training data. They model input data purely by its natural distribution. Some untrained score functions are used to measure text similarity based on distribution similarity. Euclidean distance [22], KL divergence [23], and entropy [24] are some widely used score functions. Joint functions have also been proposed to integrate the previous scores [25,26]. Because training data are missing, the features used by the score functions are statistical values computed from the raw texts after word segmentation and stemming, such as TF/IDF [27], TextRank [28], and LDA [29]. Some recent works try to turn unsupervised similarity calculation into a supervised task. An iterative process is adopted that takes the output cases as training data in turn [30]. This kind of calculation suffers from a cold-start issue: initial similarity values need to be set beforehand, and the final results drop a lot under inappropriate initialization.

Table 1 summarizes the differences between supervised and unsupervised text similarity calculation.


**Table 1.** Comparison of supervised and unsupervised similarity calculation.

As indicated by the previous table, these two kinds of similarity calculations both have corresponding merits and drawbacks. Supervised methods achieve higher performance due to their use of training data. However, methods that rely on training data have difficulty changing domains, which causes the performance of supervised methods to drop sharply when the domain changes. On the contrary, unsupervised methods have lower performance, but their performance does not drop with domain alteration. In this situation, we try to propose an unsupervised similarity calculation that combines the merits of both supervised and unsupervised methods.

The features used by previous calculations are words or word spans that contribute most to the score functions (in supervised methods) or show some prominent distribution compared with other features (in unsupervised methods). Although some supervised algorithms may learn a semantic embedding at word level or text level from training data to help model the semantics of the input text [31–33], they all ignore the fact that most web texts are used to record events. One text should tell one core event. The other mentioned events either help explain the core event or provide some details about it (such as time, place, or related events). In fact, the core event mostly decides the similarity between two texts. In other words, if the same event is stressed by two texts, these two texts are similar with high probability. Thus, the task of calculating text similarity can be fulfilled by comparing the discrepancy between the core events respectively extracted from the two texts. The core event represents the main content of one text. It should be extracted from a passage-level viewpoint.

Event extraction and representation have been studied for a long time. As the most famous event extraction tasks, MUC (Message Understanding Conference) [34] and ACE (Automatic Content Extraction) [35] have been held for about 30 years. The definition of event in MUC and ACE is sentence-level, with the trigger as the key element and the arguments as supplementary details. Traditional event extraction tasks assume that an event can be fully expressed by a single sentence and can be extracted without taking other sentences into consideration. Since an event can be formatted as a trigger and arguments, traditional sentence-level event extraction methods can be separated into two successive steps. The former step is called event detection (or trigger extraction), which aims to detect events and classify the event type. The latter step is called argument extraction, which aims to acquire the arguments related to the trigger, such as time, location, subject, and object. The algorithms designed for sentence-level event extraction are not appropriate for extracting passage-level events, since they aim to learn a better representation of single sentences [36,37] and do not model the relations among events across sentences.

As mentioned before, traditional event extraction methods treat sentences independently and extract events from a single sentence. Cross-sentence event extraction methods have been proposed, but their objective is still to extract events from a single sentence; their highlight is to take the adjacent sentences within a sliding window into consideration during the extraction process [38,39]. Obviously, cross-sentence event extraction methods are not suitable for extracting core events, since they also miss the operation of modeling the relations among events from a passage angle. Therefore, this paper designs an event connection graph to cover the relations among all the events mentioned in one text. Via graph centrality measurements, the core event can be extracted and used to calculate text similarity.

#### **3. Model Details**

#### *3.1. Task Overview*

The objective of text similarity calculation is easy to define. Given two texts *s*<sup>1</sup> and *s*2, we want to obtain the similarity value between *s*<sup>1</sup> and *s*2. Different from traditional calculations, this paper calculates text similarity by measuring whether *s*<sup>1</sup> and *s*<sup>2</sup> mention the same core event or not. A graph, noted as *G*(*V*, *E*), is constructed to model the relations among events, where *V* is the node set and *E* is the arc set. It is called the event connection graph. The nodes in *V* are triggers and arguments. Those triggers and arguments are extracted sentence by sentence to represent a series of sentence-level events [40], while the arcs in *G* represent the relations among events. Since the core event represents the main content of one text, it should be surrounded by the other events. When events are turned into nodes and relations among events into arcs to form a graph, the nodes that represent the core event should be located at the graph's center. Via some graph centrality measurement, such as PageRank, we can easily locate the core event. It is worth noting that, compared with the nodes (extracted by sentence-level event extraction methods), the arcs play an important role in deciding the quality of the similarity results. As shown in the experiments, we adopt several popular sentence-level event extraction methods, but we can hardly see a difference in accuracy. Therefore, we provide two improvements based on vector tuning to complete the constructed event connection graph and involve more arcs.

#### *3.2. Graph Construction*

This section details the approach used to construct the event connection graph. Here we borrow the method shown in [40] to extract fine-grained sentence-level events. Each sentence-level event is formed as a polygon with the trigger as its center and the arguments as its surrounding nodes. The arguments are listed in the order in which they appear in the sentence. The trigger and the arguments are connected by arcs. Figure 1 is an example of one polygon formed from the sentence "The President of USA communicates with Chinese Leader on the phone about North Korea issue". It is straightforward that the trigger is the key element, since the event type and the argument template are both decided by the trigger. Therefore, we put the trigger at the center of the polygon to represent its pivotal position and put the arguments around it. As shown in Figure 1, the trigger "communicate" (after stemming) is put at the center of the polygon and the arguments related to this trigger are placed around the center in the order in which they appear in the sentence.

**Figure 1.** The polygon formed from the example sentence.

To model the relations among events, we just simply connect the polygons via the overlapping arguments to form an event connection graph. Figure 2 shows an example event connection graph formed from the following four sentences. In Section 4, we further propose a way to find semantically similar triggers to reveal deeper relations among events.

**Figure 2.** The event connection graph formed from given sentences.

"South Korea and North Korea have a military dispute on the border." "The President of USA visits South Korea with his wife."

"The President of USA communicates with Chinese Leader on the phone about North Korea issue."

"Chinese television and US BBC reported the meeting between the US president and Chinese leader respectively."

#### *3.3. Node Evaluation*

Typically, "If one author emphases a topic (or a clue), everything in his article is related to this topic (or clue)" [41,42]. It is straightforward to make an assumption that the core event in one text should be supported by the other events. If we project all the events mentioned in one text to a plane, the core event will be the point surrounded by the other event. In our paper, we just project events to a plane, while treating the event as a polygon which includes several nodes (i.e., trigger and arguments). Trigger is the key element in the event and decides event type. Thus, we put trigger at the center of the polygon. These two situations ensure that the center of the graph should be the core event. The remaining job becomes how to locate the center of the graph. PageRank, a popular centrality measurement, is chosen to fulfil this task. PageRank is proposed by Google and used to rank web pages in searching engine. The principle behind PageRank is random walk [43]. When one surfer randomly surfs on a graph, the node visited most frequently should be the central node (owning the largest PageRank value).

We follow the traditional PageRank measurement. The only difference is that we use the transition matrix (noted as *A*) formed from our event connection graph. The size of *A* is *v* ∗ *v*, where *v* denotes the number of nodes in the graph. Each entry in *A* is the transition probability from one node in the row to another node in the column. For example, given a surfer who travels on the event connection graph, if (*i*,*j*) is an arc, *Aij* denotes the transition probability that this surfer visits *j* by jumping from *i*, and can be set as the reciprocal of the out-degree of node *i*. Otherwise, if (*i*,*j*) is not an arc, this probability is 0.
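A compact NumPy sketch of this random walk is shown below, assuming the event connection graph is given as an adjacency matrix (1 where an arc exists) and using a standard damping factor, which the paper does not specify.

```python
# PageRank over the transition matrix A, where A[i, j] = 1 / out-degree(i) for arcs (i, j).
import numpy as np

def pagerank(adj: np.ndarray, d: float = 0.85, iters: int = 100) -> np.ndarray:
    out_deg = adj.sum(axis=1, keepdims=True)
    A = np.divide(adj, out_deg, out=np.zeros_like(adj, dtype=float), where=out_deg > 0)
    v = adj.shape[0]
    r = np.full(v, 1.0 / v)                      # start from the uniform distribution
    for _ in range(iters):
        r = (1.0 - d) / v + d * (A.T @ r)        # one step of the damped random walk
    return r                                     # r[i] is the PageRank value of node i
```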

Via PageRank, each node in the event connection graph receives a value. This value indicates the importance of the node in the graph, which can be utilized to locate the core event. There are two kinds of nodes in the graph, i.e., triggers and arguments. If the node with the largest PageRank value is a trigger, we extract this trigger and the arguments belonging to it as the core event; that is, we treat as the core event the nodes of the polygon whose center is the trigger with the largest value. Otherwise, if the node with the largest value is an argument, we output the nodes of all the polygons that take this argument as their intersection node. Figure 3 gives the PageRank values of the nodes in Figure 2. In this figure, the node "President of USA" owns the largest value (marked in red). Since "President of USA" is an argument, we output the nodes of the polygons that take "President of USA" as their intersection node. The chosen nodes are marked in yellow in Figure 3. These nodes indicate the core event expressed by the previous paragraph with four sentences. However, the chosen event is not accurate, since the main meaning of this paragraph is the meeting of the two leaders of USA and China.

Let *Si* and *Sj*, respectively, denote the two sets that include the chosen nodes in the event connection graphs formed from the given texts, *texti* and *textj*. To calculate the similarity between *texti* and *textj*, we form a matrix, noted as *TSij*. Each element in this matrix denotes the similarity between two chosen nodes in *Si* and *Sj*, respectively. Since each node in the graph is either a trigger word or an argument word, it can be represented as a vector via GloVe [10], and the similarity between two nodes can be measured by the cosine similarity of their vectors. Some triggers or arguments may be phrases (composed of several words); we then average the vectors of the words in the phrase as its vector representation. We take the mean of all the elements in *TSij* as the similarity between *texti* and *textj*. The formula is shown as follows:

$$\mathrm{sim}(\mathit{text}_{i}, \mathit{text}_{j}) = \sum_{k=1}^{n} TS_{ij}(k)/n \tag{1}$$

where *n* denotes the number of all the elements in *TSij*, and *TSij*(*k*) denotes one element in *TSij*.
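A small sketch of Equation (1) is given below, assuming a word-vector lookup table (e.g., loaded GloVe vectors) is available as a Python dict; phrase nodes are averaged over their words as described above.

```python
# Mean pairwise cosine similarity between the chosen nodes of two texts (Equation (1)).
import numpy as np

def node_vector(node, emb):
    words = [w for w in node.lower().split() if w in emb]
    return np.mean([emb[w] for w in words], axis=0)           # phrase = average of its words

def text_similarity(nodes_i, nodes_j, emb):
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = [cos(node_vector(a, emb), node_vector(b, emb)) for a in nodes_i for b in nodes_j]
    return float(np.mean(sims))                               # mean over all entries of TS_ij
```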

**Figure 3.** PageRank values of the nodes in the event connection graph.

The previously constructed event connection graph has two flaws. One is that it only uses the overlapping arguments shared by the polygons to model the relations among events. This kind of relation is too vague and not sufficient, since the relations among events are mainly caused by triggers. We should provide a way to detect deeper relations among events. The other is that only the node with the largest value and its adjacent nodes are chosen as the text representation. These nodes cover the information expressed by the core event and some other events highly related to it, but the rest of the events mentioned in the text also add supplementary details. These details also need to be considered in similarity calculation. For this reason, two improvements are made. One is to fine-tune the trigger vectors to detect and connect more related events. The other is to tune the vectors of the nodes in the core event to let them integrate the information carried by the entire graph.

#### **4. Two Improvements Made on Our Event Connection Graph**

#### *4.1. Tuning Trigger Words*

The relations among events are mainly conveyed by trigger words. We therefore need a way to find semantically similar triggers and link them in the event connection graph so that more relations among events are covered. It has been reported that about ninety percent of trigger words are nouns and verbs (or noun and verb phrases) [44,45]. Pre-trained word embeddings inject semantics into vector representations, and these representations are learned from whether two words share similar contexts. However, in event-related tasks, we cannot rely solely on pre-trained word embeddings to reveal the semantic similarity between triggers. A trigger and its arguments form commonly used collocations, e.g., "kick football" and "play basketball". As a result, two trigger words with high semantic similarity may still appear in different contexts.

To find semantically similar trigger words, we can draw on synonym dictionaries such as VerbNet and WordNet, two manually built synonym dictionaries. These dictionaries group semantically similar words into synsets, which are organized hierarchically. Unfortunately, no dictionary can cover all possible semantically similar trigger pairs, and it is time-consuming and laborious to construct such a dictionary manually. Therefore, we need a way to find semantically similar triggers that does not depend on a manual dictionary. In this paper, we fine-tune the vector representations of the trigger words so that semantically similar triggers obtain close vectors. Two triggers whose cosine similarity exceeds the threshold (0.8) are connected through an arc, which introduces more rational relations among events into the graph. Figure 4 shows the event connection graph after connecting similar triggers, i.e., visit and communicate, and report and communicate. The threshold (0.8) is set based on the experimental results reported in the experimental section. GloVe [10] trained on wiki data is used as the basic trigger embedding.

**Figure 4.** The event connection graph by linking semantically similar triggers.

As shown in Figure 4, with the inserted arc (the green dotted line), the node with the largest value changes to "communicate". This indicates that, after the novel relation is involved, the core event can be revealed more correctly.
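A minimal sketch of the linking step could look as follows, assuming the graph is held in a networkx structure and each trigger already has a (tuned) vector; the function and variable names are ours, and 0.8 is the threshold used in the paper.

```python
import numpy as np
from itertools import combinations

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def link_similar_triggers(graph, trigger_vecs, threshold=0.8):
    """Connect trigger nodes whose (tuned) vectors have cosine similarity
    above the threshold; `graph` is assumed to be a networkx.Graph and
    `trigger_vecs` a dict mapping trigger text to its vector."""
    for (t1, v1), (t2, v2) in combinations(trigger_vecs.items(), 2):
        if cosine(v1, v2) >= threshold:
            graph.add_edge(t1, t2)
```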

Let *Bc* denote the set of synonymous trigger pairs sampled from VerbNet and WordNet. We tune the vector representations of the triggers according to the following formulas:

$$\min O(B_c) = \min \left( O_c(B_c) + R(B_c) \right) \tag{2}$$

$$O_c(B_c) = \sum_{(x_l, x_r) \in B_c} \left[ \tau(att + x_l \cdot t_l - x_l \cdot x_r) + \tau(att + x_l \cdot t_r - x_l \cdot x_r) \right] \tag{3}$$

$$R(B_c) = \sum_{x_i \in B_c} \lambda \parallel x_i(int) - x_i \parallel_2 \tag{4}$$

where (*xl*, *xr*) denotes a synonymous word pair in *Bc*; *tl* is a word randomly sampled from a synset that contains neither *xl* nor *xr*, and likewise for *tr*; *att* denotes the predefined margin between the semantically similar word pair and the dissimilar one and is set to 0.6; *τ* denotes the max-margin (hinge) loss, i.e., *τ*(*x*) = max(0, *x*); *xi*(*int*) denotes the pre-trained GloVe vector; and *λ* is a predefined regularization parameter set to 0.0006. The predefined parameters are set according to [45].

As shown in Equation (2), the tuning objective has two parts. The former part (denoted *Oc*(*Bc*)), given in Equation (3), makes similar triggers obtain similar vector representations. The latter part (denoted *R*(*Bc*)), given in Equation (4), keeps the tuned vectors from drifting far from their pre-trained values. Since the pre-trained vectors are acquired from a large-scale corpus, we do not want the tuned vectors to deviate from them. If a trigger in the event connection graph is tuned, we replace its original vector representation with the tuned one. In our tuning method, we only tune the vectors of the words included in VerbNet and WordNet and do not extend the tuning beyond these dictionaries. The reason is that the pre-trained vector representations are acquired from a large-scale corpus and are thus credible unless we have enough evidence that they cannot measure word similarity accurately. If a trigger is a phrase, we simply take the mean of the vectors of all the words in the phrase as its representation.
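The sketch below expresses the tuning objective of Equations (2)–(4) for a single synonymous pair in PyTorch; the dot-product similarity, the exact pairing of negative samples, and the helper names follow our reading of Equation (3) and are not taken verbatim from the authors' implementation.

```python
import torch

def tuning_loss(x_l, x_r, t_l, t_r, x_l_init, x_r_init, att=0.6, lam=0.0006):
    """Max-margin tuning objective of Equations (2)-(4) for one pair (x_l, x_r).

    x_l, x_r  : learnable trigger vectors (initialized from GloVe)
    t_l, t_r  : vectors of negative words sampled from other synsets
    *_init    : frozen pre-trained GloVe vectors used by the regularizer R
    """
    hinge = lambda z: torch.clamp(z, min=0.0)            # tau(x) = max(0, x)
    sim = lambda a, b: torch.dot(a, b)                    # similarity assumed as dot product
    o_c = hinge(att + sim(x_l, t_l) - sim(x_l, x_r)) \
        + hinge(att + sim(x_l, t_r) - sim(x_l, x_r))      # Equation (3)
    reg = lam * (torch.norm(x_l - x_l_init, p=2)
                 + torch.norm(x_r - x_r_init, p=2))       # Equation (4), per pair
    return o_c + reg                                       # Equation (2)
```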

In English, the synonymous word pairs in VerbNet and WordNet can be used as training data. For other languages, such dictionaries are hard to find, so we use these two dictionaries as a bridge to construct training data. We take VerbNet and WordNet as pivot dictionaries and use Google Translate to translate their words into the target language. However, one English word can be translated into many words in another language. Taking the synonymous pair "undo" and "untie" in VerbNet as an example, "undo" can be translated into five Chinese words, "打开 (open)", "解开 (untie)", "拆开 (open)", "消除 (remove)", and "取消 (cancel)", while "untie" can be translated into "解开 (untie)" and "松开 (loosen)". This yields 9 possible translated word pairs (excluding the duplicated pair "解开 (untie)" and "解开 (untie)"). Among them, only the pair "解开 (untie)" and "松开 (loosen)" is a rational synonym. To avoid incorrect translations, we introduce back translation, which is widely used in unsupervised translation to avoid semantic drift [46]. Following the idea of back translation, we only retain the translated word pairs that can be translated back to exactly the same English words. Taking the pair (undo, untie) as an example again, when we translate it into Chinese, we only retain the pair (解开, 松开), since these two words can be translated back to undo and untie, respectively.
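A hedged sketch of the back-translation filter is given below; `translate(word, src, tgt)` is a purely hypothetical helper standing in for a translation service, and the filtering rule implements only what is described above (keep a translated pair if and only if both words translate back to the original English words).

```python
from itertools import product

def build_synonym_pairs(en_pair, translate):
    """Back-translation filter for one English synonym pair.

    `translate(word, src, tgt)` is a hypothetical helper returning the list of
    candidate translations of `word` from language `src` to language `tgt`;
    only target-language pairs whose back translation recovers the original
    English words are kept.
    """
    en_l, en_r = en_pair
    kept = []
    for zh_l, zh_r in product(translate(en_l, "en", "zh"),
                              translate(en_r, "en", "zh")):
        if zh_l == zh_r:
            continue  # skip duplicated translations such as (解开, 解开)
        if (en_l in translate(zh_l, "zh", "en")
                and en_r in translate(zh_r, "zh", "en")):
            kept.append((zh_l, zh_r))
    return kept
```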

#### *4.2. Node Representation via Graph Embedding*

In Section 3, we only choose the nodes that represent the core event as the text representation for similarity calculation. On the one hand, besides the core event, the other events (we call them supplementary events) also provide useful information, and we should not simply discard them. On the other hand, the information carried by the supplementary events is minor compared with the core event, so there is no need to choose additional nodes from the event connection graph to represent them. For this reason, we apply graph embedding to integrate the information carried by the supplementary events into the chosen nodes.

Graph embedding embeds the graph structure into the node representations [47], which lets one node in the graph integrate the information carried by the entire graph. Graph embedding usually has a clear target to achieve and learns the node vectors from a set of training data, while in our setting we have neither a clear target for an objective function (integrating information into the chosen nodes is not a well-defined objective) nor any training data. In this situation, we follow the self-training approach used in word2vec, as shown in [48]. We use random walks to generate a set of paths and take these paths as contexts to adjust the vector representations of the nodes. The graph embedding process used to acquire the node representations is as follows:

1. Taking *Gi*(*V*, *E*), the event connection graph formed from *texti*, as an example, we treat one node in *V* as the starting point and choose the next node by randomly jumping to one of its adjacent nodes. This jump is repeated *l* times, yielding a path of length *l*.


Let *v*1, *v*2, ... , *vi*−1, *vi*, *vi*+1, ... , *vl*−1, *vl* denote one path. The values of *m* and *l* are set according to [48]. Following the self-training setting, we learn a vector representation for *vi* that predicts the context of *vi*. The loss function is:

$$\max P(v_{i-2}, v_{i-1}, v_{i+1}, v_{i+2} \mid v_i) = \max \prod_{j=1}^{2} P(v_{i \pm j} \mid v_i) \tag{5}$$

where *vi*−2, *vi*−1, *vi*+1, *vi*+2 denote the context of *vi*. Two fully-connected layers are used to train the node representation, with softmax as the output layer. During training, in the first iteration, node *vi* integrates the representations of its adjacent nodes, i.e., *vi*−2, *vi*−1, *vi*+1, *vi*+2. In the second iteration, *vi* integrates the representations of the nodes that can reach *vi* through paths of length less than 4. As training continues, the information carried by the entire graph is integrated into *vi*. After graph embedding, we replace the original vector representations of the nodes in the core event with the tuned ones, letting the core event integrate the information carried by the entire text, and recalculate the similarity between two texts using Equation (1).
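The following sketch illustrates the random-walk-based embedding step; it uses gensim's skip-gram Word2Vec with a window of 2 as a convenient stand-in for the two-fully-connected-layer softmax predictor described above, and the walk length and number of walks per node are illustrative values, not the settings of [48].

```python
import random
from gensim.models import Word2Vec  # skip-gram trainer used as a stand-in

def random_walks(graph, walk_len=10, walks_per_node=5):
    """Generate node sequences by repeatedly jumping to a random neighbour."""
    walks = []
    for _ in range(walks_per_node):
        for start in graph.nodes:
            walk, node = [str(start)], start
            for _ in range(walk_len - 1):
                neighbours = list(graph.neighbors(node))
                if not neighbours:
                    break
                node = random.choice(neighbours)
                walk.append(str(node))
            walks.append(walk)
    return walks

def embed_nodes(graph, dim=100):
    """Train node vectors on the walks with a window of 2, mirroring Equation (5)."""
    walks = random_walks(graph)
    model = Word2Vec(sentences=walks, vector_size=dim, window=2,
                     min_count=1, sg=1, epochs=20)
    return {node: model.wv[str(node)] for node in graph.nodes}
```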

#### **5. Experiments and Analyses**

#### *5.1. Experimental Setting*

Our similarity calculation obtains the similarity between two texts from the viewpoint of a passage-level event representation. One text may mention several related events; an event connection graph is constructed to model the relations among them. In addition, two improvements based on vector tuning help to better construct the event connection graph. Finally, the nodes indicating the core event mentioned in one text are chosen to represent the graph. It is worth noting that our calculation is unsupervised and is not limited to any particular language or domain. To test its compatibility across languages and tasks, we build testing corpora in three languages, i.e., English, Chinese, and Spanish. For English, there are many open tasks concerning text similarity measurement, such as paraphrase detection and query match in NLU (natural language understanding) [49,50], and we choose these two tasks to test the performance of our similarity calculation. Ten thousand text pairs are sampled from the corpora for these two tasks; one half consists of similar text pairs and the other half of dissimilar text pairs. The corpora for these two tasks contain only short sentences, and most short sentences mention only one event. Our similarity calculation is designed from a passage-level representation perspective and chooses the core event to accurately measure the similarity between two texts, so it is better suited to long texts that mention several events. The former two tasks therefore cannot fully demonstrate the advantage of our calculation on long texts. For this reason, we manually annotate a testing corpus of two thousand long text pairs from daily news published in the most recent month. For Chinese, we build two testing corpora, one for short texts and one for long texts. The short-text corpus is provided by the Alibaba company for the query match task. The long-text corpus is manually annotated and contains two thousand long text pairs from Tencent news, also published in the most recent month. For Spanish, there is no suitable open corpus for testing, so we manually annotate a corpus of two thousand long text pairs from a kaggle contest. In all the manually annotated corpora, one half consists of similar text pairs and the other half of dissimilar text pairs.

The criterion used for evaluation is *F*1. The formulas are shown as follows:

$$P = \frac{r(n)}{t(n)}\tag{6}$$

$$R = \frac{r(n)}{a(n)}\tag{7}$$

$$F1 = 2 \ast \frac{P \ast R}{P + R} \tag{8}$$

where *P* is precision, measured as the correctly identified similar text pairs (denoted *r*(*n*)) divided by all the text pairs the algorithm labels as similar (denoted *t*(*n*)); *R* is recall, measured as the correctly identified similar text pairs divided by all the annotated similar text pairs (denoted *a*(*n*)); and *F*1 combines precision and recall.
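A direct transcription of Equations (6)–(8), using the notation of the text:

```python
def f1_score(r_n, t_n, a_n):
    """Equations (6)-(8): r_n = correctly identified similar pairs,
    t_n = pairs labelled as similar by the system, a_n = annotated similar pairs."""
    precision = r_n / t_n
    recall = r_n / a_n
    return 2 * precision * recall / (precision + recall)
```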

There are two kinds of corpora in the experiments. One kind is collected from open tasks, such as paraphrase and query match, and is large; these annotated texts allow us to compare our calculation with some supervised similarity calculations. The other kind consists of the manually collected long texts, which demonstrate the advantage of our calculation on long texts in particular. The large-scale corpora are split 8:1:1 into training, development, and test sets. Three neural algorithms are adopted as baselines in the following experiments: textCNN (one convolutional layer, one max-pooling layer, and one softmax output layer), Bi-LSTM (using Bi-LSTM to encode the text and softmax to output the similarity value), and Bi-LSTM+Bidirectional attention (using Bi-LSTM to encode the text and adding a bidirectional attention layer to model the interaction between the two input texts). In each case, we encode the text via textCNN, Bi-LSTM, or Bi-LSTM+Bidirectional attention, respectively, and a softmax layer outputs a value indicating the similarity between the two texts. The pre-trained model BERT base is also taken as a baseline (as in machine reading, one of the fine-tuning tasks of BERT, we input two texts into BERT separated by the segmentation tag [SEP] and add a softmax layer on [CLS] to output the similarity value).

Our calculation is unsupervised. Therefore, we also bring in some unsupervised baselines. We represent input text as vector via the following methods and apply Cosine as similarity function to calculate vector similarity as text similarity.

The applied unsupervised vector representations are Average (the mean of all the word vectors in the text), TextRank+Average (the mean of the vectors of the keywords chosen by TextRank), and TextRank+Concatenation (the concatenation of the vectors of the keywords chosen by TextRank).

All the word vectors are set via GloVe.

We also bring in two recent, high-performance text similarity calculation methods, Con-SIM [51] and RCMD [52]. Both are based on powerful pretrained language models, and both adopt multi-head attention and cross-attention to model the deep interaction between two texts. These two algorithms have demonstrated high accuracy across several text-related tasks. Between them, the first takes context into consideration to model the training gap among different calculations, and the second models the distance between sentences as the weighted sum of contextualized token distances.

#### *5.2. Experimental Results*

The following experiments cover six aspects. Section 5.2.1 tests the rationality of the threshold setting in our calculation. Section 5.2.2 reports the performance of our calculation when the event connection graph is constructed by different event extraction methods. Section 5.2.3 shows sampled examples that explicitly demonstrate the ability of our calculation. Section 5.2.4 compares our calculation with the supervised and unsupervised baseline algorithms on the testing corpora. Section 5.2.5 presents ablation results showing the enhancement brought by the two improvements provided in Section 4. In Section 5.2.6, we add an experiment to show that our calculation is not limited to any particular language or domain.

#### 5.2.1. Testing on Threshold

In Section 4, we provide a vector tuning-based method to find semantically similar triggers and link them to form a comprehensive event connection graph. Two triggers whose semantic similarity exceeds the threshold (set to 0.8) are connected in the graph. The following figure demonstrates the rationality of this threshold setting by showing the *F*1 values as the threshold changes from 0.1 to 1.0.

As shown in Figure 5, the calculation results change with the value of the threshold. All the curves show a similar trend across the different corpora; they all reach their peak at a threshold of 0.8 (or close to it). The reason can be explained by the principle behind word embedding. For most word pairs, if two words are semantically similar, their pre-trained vector representations are close. The vector representation of a trigger is initialized via GloVe, a kind of word embedding; thus, we can use the vector distance between two triggers to decide whether they are semantically similar. The trigger decides the event type, so two similar triggers indicate two related or similar events. However, as indicated in Section 4.1, because of commonly used collocations, this conclusion (similar triggers have close vectors) is not always true. Thus, in Section 4.1, we tune the pre-trained vector representations of the triggers with training data sampled from synonym dictionaries so that semantically similar triggers obtain close vectors. With this tuning, finding a threshold to decide whether two triggers are similar becomes feasible. As shown in Figure 5, 0.8 is a reasonable choice, where the performance curves reach their peak across all the testing corpora. When the threshold exceeds 0.8, the performance curves even drop, because a larger threshold causes some similar triggers to remain unconnected. The relations among events, especially among semantically similar events, are then not fully covered by the event connection graph, so the extracted event may not be the core event. Furthermore, with the connections between similar events missing, the extracted event cannot integrate the information carried by the entire event connection graph. These two situations lead to the drop of the performance curves.

**Figure 5.** *F*1 values when we change the threshold from 0.1 to 1.0. The threshold is utilized to decide whether to connect two triggers in the event connection graph or not.

#### 5.2.2. Comparison of Different Event Extraction Methods

Our similarity calculation constructs an event connection graph to reveal the core event used for calculating text similarity. In our paper, this graph is constructed by the sentence-level event extraction method of [40], which we denote OneIE as in [40]. This raises the question of whether different event extraction methods affect the final results. We therefore design an experiment that measures the similarity calculation results when the event connection graph is constructed by different event extraction methods across all the testing corpora. The following table reports the performance of our calculation when the event connection graph is constructed via several popular sentence-level event extraction methods. Since the event connection graph includes both triggers and arguments, the chosen event extraction methods must jointly extract triggers and arguments. We choose BeemSearch [53], JointTransition [54], and ODEE-FER [55] as baselines. BeemSearch is a classic joint event extraction method, which encodes text via one-hot features and applies local and global features to label triggers and arguments simultaneously. JointTransition and ODEE-FER are both neural models; the significant difference between them is that JointTransition applies a transition model to characterize the relation between trigger and argument, while ODEE-FER integrates latent variables into a neural model to extract open-domain events free from event schema predefinition. A multi-task learning framework is utilized in ODEE-FER to identify triggers and arguments concurrently. The corpora for paraphrase, query match, and manual annotation are abbreviated as Para, Q&Q, and MA, respectively, in this and the following tables.

As shown in Table 2, different event extraction methods do not affect the performance of our calculation much, for the following two reasons. First of all, the construction of the event connection graph is only the first step of our calculation. The extracted triggers and arguments are subsequently weighted by the centrality measurement, during which incorrectly extracted triggers and arguments can be eliminated. In Table 2, we also add an extreme case, denoted Extreme and listed in the last row, where we treat every verb in a sentence as a trigger and every noun as an argument; if there is more than one trigger in a sentence, we construct a polygon for each trigger following the approach of Section 3. It can be observed that the result obtained by Extreme differs only a little from the results of the other methods. This indicates that we do not need precise event extraction results, as long as the extracted results contain enough triggers and arguments. Furthermore, the two improvements proposed in Section 4 also help remove the adverse effect of incorrectly extracted triggers and arguments. In detail, one improvement tunes the trigger vectors and links semantically similar triggers in the event connection graph. Triggers extracted incorrectly are barely related to the core event mentioned in the text, so they do not lie at the center of the event connection graph; after the centrality measurement, these triggers receive small weights and are not chosen as the core event when measuring text similarity. The other improvement integrates the information carried by the entire text into the chosen nodes via graph embedding. Even if incorrectly extracted triggers and arguments are chosen as the core event, after graph embedding the information carried by the entire text is integrated into them, which also alleviates the adverse effect of the incorrect extraction.

**Table 2.** The results of our calculation obtained on the condition that the event connection graph is constructed via several popular event extraction methods (highest values in bold).


To clearly see the effects of different event extraction methods, we also draw a histogram to illustrate the results obtained by the different sentence-level event extraction methods. In addition, to compare their results on the two tasks of text similarity calculation and event extraction, we show the results of each method on both tasks in two colors. In this figure, each algorithm corresponds to two columns. The blue column indicates the text similarity results, averaged across all the corpora, including the open tasks and the manual annotations. The orange column indicates the event extraction results, averaged across the corpora from ACE and FrameNet, two popular event extraction tasks.

As shown in Figure 6, there is hardly any performance gap among the event extraction methods on the text similarity calculation task. However, they perform differently on the event extraction task: ODEE-FER performs much better than the other methods. In this experiment, we also include the simple method Extreme, which treats every verb in a sentence as a trigger and every noun as an argument. One would expect Extreme to perform poorly on the event extraction task, and the results confirm this. Surprisingly, Extreme obtains almost the same result as the other event extraction methods on the text similarity calculation task. This again indicates that we do not need precise event extraction results, as long as the extracted results contain enough triggers and arguments.

**Figure 6.** The histogram to see the results obtained by different event extraction methods (blue color indicates text similarity calculation task and orange color indicates event extraction task).

#### 5.2.3. Case Study

In our similarity calculation, we extract nodes from the event connection graph to represent the core event mentioned in the text. In the following table, we show the extracted nodes (i.e., triggers and arguments) from some sampled texts. We sample 6 texts from our English testing corpora, three long texts (denoted A1, A2, A3) and three short texts (denoted S1, S2, S3). The short texts are news captions. The contents of the chosen texts can be found at (A1: https://www.bbc.com/news/technology-52391759; A2: http://news.bbc.co.uk/sport2/hi/football/europe/8591081.stm; A3: https://www.bbc.com/news/business-52467965; S1: https://www.bbc.com/news/uk-51259479; S2: https://www.bbc.com/news/business-44789823; S3: https://www.bbc.com/news/business-52466864 (accessed on 17 May 2022)); these texts remain accessible until the pages are deleted. Since keywords can also be treated as a content representation of the text, we also show the keyword extraction results (the top five keywords) obtained by LDA and TextRank, two popular unsupervised keyword extraction algorithms.

As shown in Table 3, among the chosen words, some are the same across the different extraction methods while others are distinct. Taking the contents of the sampled texts into consideration, the words extracted by our calculation exactly cover the core event mentioned in the sampled texts, even though the number of extracted words is often smaller than for the other two keyword extraction methods. On the contrary, the words extracted by TextRank and LDA are not always related to the core event, especially for long texts. For example, the core event of A1 is "Apple iPhone has a software leak on email app". The keywords extracted from this text via TextRank and LDA both include "ZecOps". This word repeats many times in A1 and is thus chosen as a keyword, yet it only indicates the source where the news is published and is not part of the core event. The reason for this is that traditional keyword extraction methods often use shallow statistics, such as frequency or distribution, to measure word importance, which is what causes "ZecOps" to be chosen incorrectly. In our calculation, when measuring the importance of a word, we consider the effect of the event that includes this word: only if the event is emphasized by a text can the words of this event be treated as the representation of that text. A long text may mention several events, and its frequently occurring words may not belong to the core event, such as "ZecOps" and "Rooney" in A1 and A2; such words may therefore be extracted incorrectly. Short texts contain only one or two sentences and mention few events, so their frequently occurring words mostly belong to the core event. Thus, for the short texts, the words extracted by the three methods are almost the same.

**Table 3.** The extracted words from the sampled texts (\* marks that less than five words can be extracted).

| Methods | TextRank | LDA | Ours |
|---|---|---|---|
| Long Texts A1 | ZecOps, Apple, mail, leak, hacker | Apple, mobile, ZecOps, bug, hacker | Apple, mail, software, leak, \* |
| A2 | Rooney, soccer, Bayern Munich, injury, champion | soccer, Bayern Munich, beat, Man Utd, Rooney | Bayern Munich, beats, Man Utd, \* |
| A3 | Barclay, bank, economic, coronavirus, profit | Barclay, coronavirus, bank, pandemic, work | coronavirus, pandemic, costs, £2.1bn, \* |
| Short Texts S1 | Carmaker, Tesla, build, factory, Shanghai | Carmaker, Tesla, build, factory, Shanghai | Tesla, build, factory, Shanghai, \* |
| S2 | Kobe Bryant, death, BBC, TV news, mistake | BBC, apologize, footage, mistake, \* | BBC, apologize, footage, mistake, \* |
| S3 | Coronavirus, economy, sink, pandemic, shutdown | Coronavirus, economy, sink, shutdown, \* | Economy, sink, pandemic, shutdown, \* |

#### 5.2.4. Comparison of Different Algorithms

The following table compares our similarity calculation with the supervised and unsupervised baseline algorithms. The supervised baselines are textCNN, Bi-LSTM, Bi-LSTM+Bidirectional attention (abbreviated as LSTM+BIA), and BERT-base. They are evaluated only on the large-scale testing corpora of short sentences, since only these corpora have enough data to form a training set. The unsupervised baselines are Average (abbreviated as AVE), TextRank+Average (abbreviated as TR+AVE), and TextRank+Concatenation (abbreviated as TR+CON). The details of the baseline algorithms are given at the beginning of Section 5.

For the unsupervised similarity calculations, we first represent the input text as a vector, either averaged over all the word vectors in the input text or over the word vectors chosen by the TextRank algorithm, and then use cosine similarity to decide whether two texts are similar. For the supervised similarity calculations, we also represent the input text as a vector, formed by an encoder such as LSTM, Bi-LSTM, or a pretrained model (BERT or RoBERTa); the two vectors obtained by the encoder are then fed into an MLP layer that outputs a softmax probability as the similarity result. All the tested algorithms output a value measuring the similarity between two texts. We record the similarity values of all the text pairs in the testing corpora for a given algorithm and take the mean of all the values as the threshold for deciding whether two texts are similar. To make the results more persuasive, we add significance tests: we separate each testing corpus into ten parts, record the results on each part, and apply a two-tailed paired t-test to determine whether the results obtained by different algorithms over the ten runs are significantly different. We set three significance levels at 0.01, 0.05, and 0.1 (labelled \*\*\*, \*\*, and \*).
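A small sketch of the decision rule and the significance test described above; the mean of all similarity scores serves as the threshold, and scipy's paired t-test is used for the two-tailed comparison over the ten per-part results. The function names are ours.

```python
import numpy as np
from scipy.stats import ttest_rel

def decide_similar(scores):
    """Label each pair as similar iff its score exceeds the corpus-wide mean."""
    scores = np.asarray(scores)
    return scores > scores.mean()

def significance(results_a, results_b):
    """Two-tailed paired t-test over the ten per-part results of two systems."""
    t_stat, p_value = ttest_rel(results_a, results_b)
    return p_value  # compare against the 0.01 / 0.05 / 0.1 levels
```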

As shown in Table 4, we list the results obtained for the different languages and tasks. Supervised algorithms outperform unsupervised algorithms by a large margin on all the testing corpora. The reason is straightforward: supervised algorithms can use the training data to learn a reasonable hyperplane separating similar text pairs from dissimilar ones, whereas unsupervised algorithms have no prior guidance to model the discrimination between similar and dissimilar pairs and depend only on the natural distribution of the data, which leads to lower performance. Moreover, unsupervised algorithms either choose some words with prominent distribution or aggregate all the words in the text to generate the text representation, while a long text contains many words that are barely related to its main content. This is why unsupervised algorithms obtain extremely low performance on the manually annotated corpora, which include only long texts. Our calculation obtains results comparable to the supervised baselines and performs much better than the unsupervised baselines, especially on the corpora of long texts. The reason lies in our event connection graph. Based on this graph, the nodes (or words) that represent the core event mentioned in a text can be extracted, and unrelated noisy words are ignored when calculating text similarity. For this reason, we acquire accurate similarity results on both long and short texts. Besides, to further improve performance, graph embedding encodes the information carried by the entire graph into the chosen nodes, so that they carry the global information expressed by the entire text. The significance test results also confirm the reliability of the high performance of our calculation.


**Table 4.** The comparison of our calculation with the baseline algorithms (highest values in bold; \*\*\*, \*\*, and \* indicate significance levels of 0.01, 0.05, and 0.1).

The experiments also include two SOTA text similarity calculations, Con-SIM and RCMD, which both obtain extraordinary results; RCMD even obtains an F1 value above 90% across all the testing corpora. These two methods both have two layers. The lower layer is the text encoder, where Con-SIM uses BERT and RCMD uses the more powerful RoBERTa; RoBERTa has more parameters, so RCMD obtains higher performance. The upper layer is the interaction layer, where Con-SIM uses hierarchical interactive attention and RCMD uses two matrices to model local (intra-sentence) and global (inter-sentence) interaction to obtain the sentence matching results. As indicated in the experiments, our proposed method obtains lower performance than these SOTA methods, which is easy to explain: the SOTA methods encode the input texts with pretrained models holding massive numbers of parameters. However, they need training data to adapt the model to domain-specific data, and when the domain changes, they are easily distorted, as shown in the experiments. Compared with these methods, our proposed method is unsupervised, so it keeps its performance across domains. Moreover, with the help of the passage-level document representation, our proposed method still obtains high performance: though lower than the methods with pretrained models, it is much higher than the unsupervised baseline methods.

#### 5.2.5. Ablation Results

In Section 4, we provide two improvements to our calculation. One detects and links similar triggers so that the relations among similar or related events are involved in the event connection graph. The other is node representation via graph embedding, which lets the representations of the chosen nodes integrate the information carried by the entire event connection graph. In the following table, we report the results obtained by our calculation in two ablation settings: with or without linking similar triggers, and with or without node representation.

As shown in Table 5, both improvements enhance the performance of our calculation, with node representation bringing the larger boost. The gains from these two improvements are easy to explain. Regarding the linkage of similar triggers, since we find similar triggers by tuning their vector representations and then link them, our event connection graph covers more relations among events, and with this more comprehensive graph we can locate the core event more accurately. Regarding node representation, the pre-trained vector representations of the chosen nodes only express their inherent, local information. After we tune the node representations via graph embedding, the vectors of the chosen nodes integrate the information carried by the entire event connection graph, so text similarity can be calculated more accurately. Because the representations of the chosen nodes after graph embedding cover both their inherent information and the information carried by the global graph, this improvement brings a larger performance boost than the linkage of similar triggers.


**Table 5.** The ablation results of our calculation (highest values in bold).

#### 5.2.6. Task Transferring

Text similarity calculation is a fundamental component of many artificial intelligence applications, and we cannot predefine the domain and the task in which these applications are applied. We therefore add a test to compare the capacities of different algorithms in a transferring scenario across tasks and languages. The testing corpora are collected from different domains and different languages. For the supervised algorithms, we train them on one corpus and test them on another. There are three corpora in English, two in Chinese, and one in Spanish. For English, we combine two corpora as the training set and test the algorithms on the remaining corpus. For Chinese, we train the algorithms on one corpus and test them on the other. For Spanish, since we only have one corpus, we do not test the algorithms in this language.

To conduct the experiments in the task transferring scenario, we treat supervised and unsupervised algorithms differently. Since the unsupervised algorithms have no training stage, we run them directly on each corpus and record the results. In addition, since both English and Chinese have a corpus for the query match task, we test all the algorithms on this task while training on one language and testing on the other (denoted E-C and C-E for English to Chinese and Chinese to English). In this test, all the algorithms are given cross-lingual word embeddings trained on a corpus formed via sentence alignment [56]. We also add significance tests to assess the credibility of the obtained results.

As shown in Table 6, the supervised algorithms degrade considerably compared with the results in Table 4, which were obtained when training and testing were performed on the same corpus. This phenomenon indicates that task transferring (or corpus changing) deeply affects the performance of supervised algorithms. The reason is obvious: supervised algorithms rely on the prior knowledge (i.e., the data distribution assumption) derived from the training corpus to deal with new data, so they are easily distorted by another corpus with a different distribution. On the contrary, unsupervised algorithms have no training corpus and make no assumptions about the data distribution, so they are not affected by task transferring (or corpus changing). Our calculation is an unsupervised algorithm, and it keeps high quality across all the corpora. The significance test results confirm the reliability of this high quality.


**Table 6.** The results of all the algorithms in the transferring scenario (highest values in bold; \*\*\*, \*\*, and \* indicate significance levels of 0.01, 0.05, and 0.1).

#### **6. Conclusions and Future Work**

Text similarity calculation is a fundamental task for many high-level artificial intelligence applications, such as text clustering, text summarization, and Q&A. Traditional similarity calculations either make two similar texts close in a high-dimensional space (supervised methods) or measure the number of common words shared by two texts (unsupervised methods). They ignore the fact that, in many scenarios, text is used to record events, and text similarity is mostly decided by whether two texts mention the same core event. This paper proposes a novel text similarity calculation that constructs an event connection graph to disclose the core event mentioned in a text. To better model the relations among events, we tune the trigger vectors to detect related events and link them in the event connection graph, which locates the core event more accurately. The nodes representing the core event are chosen and utilized to measure text similarity. Moreover, we adopt graph embedding to tune the vectors of the chosen nodes so that they integrate the global information carried by the entire text, which further boosts the performance of our calculation. Experimental results demonstrate the high performance of our similarity calculation.

Although our method combines the merits of supervised and unsupervised similarity calculations and can be applied in many text-related downstream applications that need text similarity as a component, it has an efficiency issue that remains to be solved. In particular, our calculation needs to form a passage-level event representation, which requires extra time. Thus, despite its higher accuracy, it is not well suited to online applications, especially time-sensitive ones.

One issue worth mentioning is that, to link semantically similar triggers so that our event connection graph covers more relations among events, we need to predefine a threshold deciding whether two triggers are similar. As shown in the experiments, this parameter setting is not optimal for every corpus; it is chosen by balancing the results across all the testing corpora. Therefore, in future work we hope to set it dynamically. The other work we hope to carry out is to improve efficiency. The process of graph construction is time-consuming, so we hope to construct some template graphs in advance. During the calculation stage, we would choose the corresponding template graph via a matching score and complete the matched template graph with specific words chosen from the input text.

**Author Contributions:** M.L. and Z.Z. conceived and designed the study, interpreted the data, and drafted the manuscript. L.C. guided the study and revised the manuscript. All authors critically revised the manuscript, agree to be fully accountable for ensuring the integrity and accuracy of the work, and read and approved the final manuscript before submission. All authors have read and agreed to the published version of the manuscript.

**Funding:** The research in this article is supported by the National Key Research and Development Project (2021YFF0901600), the Project of State Key Laboratory of Communication Content Cognition (A02101), the National Science Foundation of China (61976073, 62276083), and Shenzhen Foundational Research Funding (JCYJ20200109113441941).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** No new data were created or analyzed in this study. Data sharing is not applicable to this article.

**Acknowledgments:** I would like to express my gratitude to all those who have helped me during the writing of this paper, including all the co-authors. I also appreciate the reviewers' guidance.

**Conflicts of Interest:** This submission is an extended version of a conference paper by the same authors in NLPCC; however, we confirm that more than 60% of the content, including methods and experiments, is new, and the results are improved by the newly proposed method. All authors are aware of the submission and agree to its review. There is no conflict of interest with the Associate Editors. The coverage of related work is appropriate and up-to-date.

#### **References**


### *Article* **Neural Networks for Early Diagnosis of Postpartum PTSD in Women after Cesarean Section**

**Christos Orovas 1,\*, Eirini Orovou 2,3, Maria Dagla 3, Alexandros Daponte 4, Nikolaos Rigas 3, Stefanos Ougiaroglou 5, Georgios Iatrakis 3 and Evangelia Antoniou 3**


### **Featured Application: Early diagnosis and warning mechanisms are essential in every health condition. The research described in this paper can provide the means for the development of medical assistance applications.**

**Abstract:** The correlation between the kind of cesarean section and post-traumatic stress disorder (PTSD) in Greek women after a traumatic birth experience has been recognized in previous studies, along with other risk factors such as perinatal conditions and traumatic life events. Data from early studies have suggested possible links between certain vulnerability factors and the potential development of postpartum PTSD. The classification of each case into three possible states (PTSD, profile PTSD, and free of symptoms) is typically performed using the guidelines and metrics of version V of the Diagnostic and Statistical Manual of Mental Disorders (DSM-V), which requires the completion of several questionnaires during the postpartum period. The motivation for the present work is the need for a model that can detect possible PTSD cases using a minimal amount of information and produce an early diagnosis. Early PTSD diagnosis is critical, since it allows the medical personnel to take the proper measures as soon as possible. Our sample consists of 469 women who underwent emergency or elective cesarean delivery in a university hospital in Greece. The methodology followed is the application of random decision forests (RDF) to detect the most suitable and easily accessible information, which is then used by an artificial neural network (ANN) for the classification. As demonstrated by the results, the derived decision model can reach high levels of accuracy even when only partial and quickly available information is provided.

**Keywords:** artificial neural networks; random decision forests; posttraumatic stress disorder; DSM-V; emergency cesarean section; elective cesarean section; postpartum period

### **1. Introduction**

Post-traumatic stress disorder (PTSD) is a mental health problem that can develop after a person goes through a life-threatening event. The disorder can develop even when the person witnesses such an event, is exposed to it through information, or experiences extreme repeated exposure in the workplace [1]. Regardless of the type of exposure to trauma, the disorder causes symptoms of re-experiencing, avoidance, negative cognitions and mood, and arousal. The symptoms last more than a month, are not due to the action of any substance or physical condition, and cause a significant reduction in the individual's social life [2]. Anyone can develop PTSD at any age. Women, however, are twice as likely to develop PTSD as men, showing how strongly they are affected by traumatic childbirth experiences, hormonal disorders, stressful life events, and domestic violence [3].

On the other hand, the PTSD profile, or partial PTSD, originally used in relation to Vietnam veterans, has recently been extended to trauma victims. The PTSD profile includes the most important symptoms of PTSD, but the people exposed to trauma do not meet all the diagnostic criteria of the disorder. PTSD profiles have also been correlated with increased rates of suicidal ideation, alcoholism, overuse of health services, and frequent absences from the work environment, as well as a reduction of the person's social life [4,5].

For several years, scientists viewed childbirth as a positive experience, regardless of the presence of traumatic events. In recent years, however, birth trauma has attracted growing research interest, as it has been shown that it can develop into PTSD or a PTSD profile. In fact, more than 1/3 of mothers experience their delivery as a traumatic event, while 1/4 of them will experience postpartum PTSD [6]. Some factors can increase the chance that a postpartum mother will develop PTSD, such as pathology of gestation, complicated vaginal delivery, personal history of mental disorders, tokophobia, low social support, past PTSD, and cesarean section (CS) [7–10]. Postpartum PTSD symptoms are debilitating and affect the social, professional, psychological, and communication function of the mother–infant bond and her family as well [10]. Moreover, many previous and current surveys highlight the effect of CS on maternal mental health; emergency cesarean section (EMCS), in particular, shows a strong correlation with postpartum PTSD compared to other types of birth [11–16].

Due to the nature of the current diagnostic procedure, which follows the DSM-V, it is necessary to wait for a period of six weeks before completing the necessary questionnaires regarding any symptoms in order to reach a conclusion. However, early detection of the possibility of developing PTSD could offer medical personnel significant information, allowing them to take increased precautionary measures and alleviate symptoms in advance.

This observation is the motivation of the present work. More specifically, we examine whether machine learning, and especially artificial neural network (ANN) models, can be applied to predict possible PTSD cases. Our contribution is the development of an ANN model that can detect PTSD cases using a minimal amount of information and produce an early PTSD diagnosis as soon as possible.

The rest of the paper is organized as follows: Section 2 presents the related work. In Section 3, the dataset and the proposed methodology for early diagnosis of PTSD cases are described in detail. Section 4 presents the experimental study which is based on a dataset with 469 cases. Section 5 discusses the results while Section 6 concludes the paper and gives directions for future work.

#### **2. Related Work**

An early investigation of the application of ANNs as a clinical diagnostic and modeling tool, especially for psychiatric disorders, is presented in [17]. Although many successful cases of diagnosis in general medicine, contemporary at the time of that review, are presented, the lack of evaluation of the impact of the nature of psychiatric data, where most variables derive from dimensional rating scales, is also mentioned. A more detailed consideration of the application of ANN models to clinical decision-making is given in [18], where some issues of psychological assessment using ANNs are discussed as well. The use of ANNs in psychology-related applications, such as personality trait analysis, has also been reviewed in [19]. In general, machine learning can provide a powerful diagnostic toolset, as demonstrated in [20].

In a manner similar to the work presented in this paper, ANNs have been successfully employed in [21] to classify and predict symptom severity in obsessive–compulsive disorder (OCD). The importance of timely treatment of OCD before it leads to chronic disability is also stressed, and several significant factors related to this disorder are pointed out with confirmatory factor analysis (CFA).

The potential of machine learning approaches with multidimensional data sets to pathologically redefine mental illnesses and improve therapeutic outcomes, in relation to the Diagnostic and Statistical Manual of Mental Disorders (DSM) and the International Classification of Diseases (ICD), is examined in [22]. An extended related review also exists in [23,24], where open issues for AI in psychiatry are discussed as well.

#### **3. Materials and Methods**

This study took place from July to November 2019 to August 2020, at the Midwifery Department of the General University Hospital of Larisa in Greece. It was approved by the University Hospital of Larisa Ethics Commission. Approval: 18838/08-05-2019. To answer the research question, the study was designed as a prospective study between 2 groups of postpartum women (EMCS and Elective Cesarean Section (ELCS)).

#### *3.1. Participants*

The participants were all postpartum women who gave birth by either of the 2 types of CS and gave their written consent for participation. A total of 469 postpartum women were examined in this research. For each case, several demographic, prenatal health, and mental health variables were collected through questionnaires completed during interviews during their hospitalization in the departments and again 6 weeks later. The exclusion criteria of the research were difficulties at a cognitive level, languages other than Greek, and underage mothers.

#### *3.2. Data and Measures*

The data were collected in 2 stages: the first stage was the 2nd day after CS, and the second stage was the 6th week after CS. During the first stage, we collected from the 469 women medical and demographic data from the socio-demographic questionnaire, past traumatic life events from the Life Events Checklist-5 (LEC-5) of DSM-V, and Criterion A from the adapted first criterion of PTSD. At the second stage, the PTSD symptoms were collected with the Post-Traumatic Stress Checklist (PCL-5) of DSM-V (the dataset used can be found at: https://users.uowm.gr/chorovas/appsci/nn\_ptsd.html (accessed on 20 June 2022)).

The life events checklist (LEC) is the only measure with which individuals can determine different levels of exposure to a traumatic event in their lives [25]. For a PTSD diagnosis, 8 criteria must be met. For the first criterion (Criterion A), the individual must have been exposed to death, threatened death, serious injury, or sexual violence in one of the following ways: (a) direct exposure, (b) witnessing the event, (c) learning of the event, and (d) exposure in the working space [26]. For this study, Criterion A was adjusted accordingly. The post-traumatic stress checklist (PCL-5) is a self-report scale developed to measure and evaluate PTSD and PTSD profile symptoms [1,27]. In the present study, the postpartum women replied via telephone to 20 questions during the 6th postpartum week, corresponding to 20 symptoms of criteria B (re-experiencing), C (avoidance), D (negative thoughts and feelings), and E (arousal and reactivity). All replies are scored on 5-point scales (range zero to four). A score of one or more on the items of criteria B and C and two or more on the items of criteria D and E is considered a PTSD symptom. Depending on the symptoms, the postpartum women were diagnosed with (a) a provisional diagnosis of PTSD or (b) a PTSD profile [27,28].

The demographics, prenatal health, and mental health variables that were collected are presented in Tables 1–3 (statistical tests with IBM SPSS Statistics v.20).



\* *p*-values refer to Pearson chi-square.

#### **Table 2.** Prenatal health variables. Counts and percentages in corresponding diagnosis.


**Table 2.** *Cont.*


\* *p*-values refer to Pearson chi-square.


**Table 3.** Mental health variables. Counts and percentages in corresponding diagnosis.

\* *p*-values refer to Pearson chi-square.

In total, for each case there were 70 data fields available as it is shown in Table 4.

As mentioned in Section 1, a diagnostic model that could indicate a possible PTSD case early, using a minimal amount of information, would be very useful for preparing the health personnel so that appropriate measures could be taken in advance. With this in mind, we initially trained an artificial neural network (ANN) [18,23] with all the available information to check whether the traditionally confirmed diagnosis could be replicated. Since this was easily achieved by a two-layered feed-forward ANN (Table 5), the focus then moved to selecting a proper subset of the data that could still achieve high classification accuracy. Random forest classification [29] was performed on the initial set of 70 data fields (variables) in order to derive Gini importance values [30], which assisted in selecting the proper subset of variables. The criterion for selecting these variables was how directly available they are, i.e., how few questions are needed to obtain them. This procedure yielded the data sets that we used to train the ANN models. A schematic diagram of the above processing is depicted in Figure 1.

**Table 4.** The total of 70 available data fields.


**Table 4.** *Cont.*


**Table 5.** The averaged confusion matrix of the initial classification results for the training phase using the complete set of the 70 variables. The accuracy is 99.6%.


The corresponding results and additional details from the above methodology are presented to the following section.

**Figure 1.** A schematic diagram of the methodology used.

#### **4. Results**

#### *4.1. Initial Classification Using the ANN*

As mentioned above, the complete set of data was used initially to examine the feasibility of reproducing the original classification according to the DSM-V.

Of the 469 collected cases, 379 (80.81%) were manually diagnosed as free of symptoms, 34 (7.24%) had traces of symptoms and were characterized as PTSD profile, and 56 (11.94%) were diagnosed as PTSD cases. For the training and testing phases, a stratified ten-fold cross-validation scheme was employed.

The ANN was created using the PyTorch (v1.9.0 + cu11) library in Python and had a structure of seventy input units (in the case of the complete set of data fields shown in Table 4), six hidden units, and three output units, with the output encoded in three bits of which only one was set to "1" to indicate the diagnosis (one-hot coding). The connections were feed-forward from one layer to the next, the sigmoid function (with α = 1.0) was used for activation, and the mean squared error (MSE) loss was minimized with the stochastic gradient descent (SGD) optimization algorithm during training. The learning rate was set to 1.0 and the momentum to 0.9. The hyperparameters were tuned on a trial-and-error basis after several initial experiments.
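As an illustration only, a minimal PyTorch sketch of this configuration (70–6–3 feed-forward network, sigmoid activations, MSE loss, SGD with learning rate 1.0 and momentum 0.9) could look as follows; the tensor names and training-step details are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

# Sketch of the reported architecture: 70 inputs, 6 hidden units,
# 3 one-hot outputs, sigmoid activations throughout.
model = nn.Sequential(
    nn.Linear(70, 6),
    nn.Sigmoid(),
    nn.Linear(6, 3),
    nn.Sigmoid(),
)
criterion = nn.MSELoss()  # MSE loss, as reported
optimizer = torch.optim.SGD(model.parameters(), lr=1.0, momentum=0.9)

def train_step(x, y_onehot):
    """One SGD step; x is an (N, 70) float tensor, y_onehot an (N, 3) target tensor."""
    optimizer.zero_grad()
    loss = criterion(model(x), y_onehot)
    loss.backward()
    optimizer.step()
    return loss.item()
```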

Initially, we estimated the precision, recall, specificity, and accuracy for the complete set of the 70 variables from the confusion matrices; these are presented in Tables 5 and 6. Precision estimates how many positive predictions were correct. Recall estimates how many positives are correctly predicted, while specificity estimates how many negatives are correctly predicted. Precision is calculated as TP/(TP + FP), recall (sensitivity) as TP/(TP + FN), specificity as TN/(TN + FP), and the total accuracy as (TP1 + TP2 + TP3)/(P1 + P2 + P3), where TP, FP, TN, and FN are the true and false positives and the true and false negatives, respectively.
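For reference, these per-class quantities can be computed directly from a 3 × 3 confusion matrix, as in the short sketch below (the convention that rows are true classes and columns are predicted classes is an assumption).

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall (sensitivity), specificity per class and overall accuracy
    from a square confusion matrix cm, where cm[i, j] counts true class i
    predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp      # predicted as the class, actually another class
    fn = cm.sum(axis=1) - tp      # actually the class, predicted as another class
    tn = cm.sum() - (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = tp.sum() / cm.sum()  # (TP1 + TP2 + TP3) / (P1 + P2 + P3)
    return precision, recall, specificity, accuracy
```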

**Table 6.** The averaged confusion matrix of the initial classification results for the testing phase using the complete set of the 70 variables. The accuracy is 92.9%.


The results for both phases are averaged over ten sessions of the experiments, each with a different initialization of the weights of the ANN. The averaged learning curve for the training process is depicted in Figure 2.

From Tables 5 and 6 and Figure 2, we can see that the ANN manages to easily learn the classification procedure of the DSM-V. However, we need to perform the same classification with as few variables as possible. Therefore, we employ the RDF importance values.

**Figure 2.** The convergence graph for the training with all the initial data (70 fields).

#### *4.2. Importance Values Using Random Decision Forests*

All the data from the initial set (469 × 70) were used for the random decision forest classification, which was performed using the function *randomForest* from the library *randomForest* version 4.6-14 in RStudio (v1.3.1093). The number of trees was 500 and the number of variables tried at each split (*mtry*) was 20. These parameters were also selected on a trial-and-error basis. As RDF classification is stochastic in its operation, ten sessions were run, and the average estimated error rate was 1.13%. The average confusion matrix is shown in Table 7.
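The authors used the R *randomForest* package; purely as an illustrative equivalent, the same setup could be sketched in Python with scikit-learn, where `max_features` plays the role of *mtry* and `X`, `y`, and `feature_names` (the 469 × 70 data matrix, the three-class diagnosis labels, and the variable labels) are assumed inputs.

```python
from sklearn.ensemble import RandomForestClassifier

def gini_importances(X, y, feature_names, n_trees=500, mtry=20):
    """Fit a random forest (500 trees, 20 candidate variables per split) and return
    the Gini importances (mean decrease in impurity) sorted from highest to lowest."""
    rdf = RandomForestClassifier(n_estimators=n_trees, max_features=mtry, random_state=0)
    rdf.fit(X, y)
    return sorted(zip(feature_names, rdf.feature_importances_),
                  key=lambda pair: pair[1], reverse=True)
```

Sorting the variables by these values gives an ordering analogous to the one reported in Table 8.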

**Table 7.** The averaged confusion matrix and the classification errors from the RDF.


A powerful feature of RDF classification is that an importance vector is also returned, containing the Gini importance values (mean decrease in impurity, MDI) [30] of the variables used. This is very useful for gaining an idea of which variables contribute most to the classification process, as the higher the Gini value, the higher the importance of the variable. This is crucial in our research, as our aim was to reach a competitive level of classification using as few, and as directly acquired, variables as possible.

The Gini values for the 70 variables sorted from highest to lowest can be seen in Figure 3 and in Table 8 for more precision.

**Figure 3.** The averaged Gini importance values of the 70 variables in descending order.

**Table 8.** The averaged Gini importance values of the 70 variables in descending order <sup>1</sup>. Bolded variables are only available six weeks after birth.



**Table 8.** *Cont.*

<sup>1</sup> The variable coding scheme is mentioned in Table 4.

#### *4.3. Classification Using a Subset of the Available Data*

The values in Table 8 show an expectedly high level of importance for the variables that are used directly in the typical DSM-V diagnosis procedure (indicated by bold variable labels). As these are only available after six weeks, our effort is to avoid them and concentrate on what is quickly and easily acquired with as few questions as possible. This gives us the list of candidate variables in Table 9.



All twenty-four variables presented in Table 9 were used to construct eight datasets (called D1–D8) in increments of three variables. The variables in each dataset and the corresponding sum of the Gini values of these variables can be seen in Table 10.
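A possible way to assemble these nested datasets is sketched below, assuming that each Dk simply contains the top 3·k candidate variables of Table 9 ordered by their Gini importance; this reading is implied by the text but not stated explicitly.

```python
def nested_datasets(candidate_vars, gini, step=3, n_sets=8):
    """Build D1..D8: Dk holds the step*k highest-importance candidate variables.
    `gini` maps a variable name to its Gini importance value."""
    ordered = sorted(candidate_vars, key=lambda v: gini[v], reverse=True)
    datasets = {}
    for k in range(1, n_sets + 1):
        subset = ordered[:step * k]
        datasets[f"D{k}"] = (subset, sum(gini[v] for v in subset))
    return datasets
```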


**Table 10.** The eight datasets that were created from the variables in Table 9 and the corresponding sum of their Gini values. "1" means the variable is included in the dataset.

The results concerning the precision, recall (sensitivity), specificity, and accuracy during the training and testing phases in a stratified ten-fold cross-validation scheme can be seen in Tables 11 and 12 and Figures 4 and 5.

**Table 11.** The results during the training phase for the eight partial datasets (D1–D8) and for the complete set of the 70 variables (bolded values). Stratified ten-fold cross validation is applied.


**Table 12.** The results during the testing phase for the eight partial datasets (D1–D8) and for the complete set of the 70 variables (bolded values). Stratified ten-fold cross validation is applied.


**Figure 4.** The precision, recall (sensitivity) and specificity for each class and dataset and the accuracy for the training phase.

**Figure 5.** The precision, recall (sensitivity), and specificity for each class and dataset and the accuracy for the testing phase.

In order to gauge the best level of classification that could be achieved with RDF using only those variables of the complete set which are not related to the DSM-V (i.e., excluding v41–v60 and v36–v39), ten sessions were run using the complete dataset for training. Comparing the classification errors in Table 13 (which equal one minus recall) with the best recall values in Table 12, we can observe a slightly better performance from the ANN using datasets D6 and D7, with only 18 and 21 variables, respectively. This is an indication of the validity of the variable selection method that was performed based on Table 8.

**Table 13.** The averaged confusion matrix and the classification errors from the RDF using the 46 variables remaining after removing the (20 + 4) ones directly related to the DSM-V. The complete dataset is used for the training.


#### **5. Discussion**

The aim of the present study was to present a model that can produce an early diagnosis, detecting and flagging a possible case so that proper measures can be taken as soon as possible. According to our findings, emergency cesarean section, pathology of gestation, preterm birth, admission of the neonate to the NICU, absence of breastfeeding, psychiatric history, expectations of childbirth, and support from the partner are included in the set of important decision factors.

Additionally, as can be seen from the results (graphs in Figures 4 and 5, Tables 11 and 12), the ability of the ANN model to arrive at a correct conclusion is demonstrated at a very satisfactory level (around 97% in training and 94% in testing) for the cases which are free of symptoms. For the cases diagnosed with PTSD, the recognition level reaches 83% in training and 66% in testing. The intermediate category, which collects the PTSD profile cases, has a low percentage of recognition. As can be observed from the results, the PTSD profile cases are the only ones that really need the late questionnaire data (after 6 weeks). Accordingly, a policy that could be followed in order to arrive at a conclusion as soon as possible is to treat any case that is not classified as free of symptoms as a possible PTSD case. If such a case is indeed later classified as PTSD, this would probably denote an increased potential for the appearance of PTSD symptoms after six weeks, when the second part of the data is collected. More focused treatment could then be applied, starting six weeks in advance and providing a beneficial period of medical care.

The use of random decision forests for associating an importance value with each data field is very useful as well. The ordering of the early-accessible variables according to their Gini values in Table 9 is the result of that process, and it can be noted that this ordering is indeed meaningful. Criterion A, which constitutes a basic decision factor in the typical DSM diagnosis as well, is ranked first, and its related parts (A1 and A2) come just after it. Although there is one more data field related to Criterion A (v34, number of similar stressful experiences), we decided not to use it, as it requires extra effort on the part of the woman to determine. The rest of the data fields that are used for the datasets are all important, and this is shown by the gradual increase in PTSD sensitivity noticed in the training phase (Figure 4). This is expected, and it denotes the usefulness of the extra information which is added to each dataset. This increase in information is also depicted by the sums of the Gini values of the datasets in Figure 6.

**Figure 6.** The graph of the sums of Gini importance values in the eight datasets (D1–D8) of Table 10.

#### **6. Conclusions**

Our aim in this research was to examine whether ANN modeling of the classification process of postpartum PTSD could provide a diagnostic model for the early detection of possible cases. The high accuracy obtained using as little and as readily available information as possible demonstrates that this is feasible, and it marks a successful scenario for the application of ANNs in psychological data modeling. Future research could incorporate additional machine learning tools for the classification to obtain even more precise classification percentages. The development of mobile device applications to make the process faster would also be desirable. The benefit for the persons that would finally be diagnosed positively is important as well, since the extra period gained could be used in favor of their preliminary treatment.

**Author Contributions:** Conceptualization, C.O.; methodology, C.O. and S.O.; software, C.O.; validation, M.D., A.D., G.I. and E.A.; formal analysis, C.O.; investigation, E.O., N.R. and E.A.; resources, E.O., N.R. and E.A.; data curation, C.O.; writing—original draft preparation, C.O.; writing—review and editing, M.D., A.D., S.O., G.I. and E.A.; visualization, C.O. and E.O.; supervision, E.A.; project administration, C.O. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** This study was approved by the University Hospital of Larisa Ethics Commission. Approval: 18838/08-05-2019.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Chicken Swarm-Based Feature Subset Selection with Optimal Machine Learning Enabled Data Mining Approach**

**Monia Hamdi 1, Inès Hilali-Jaghdam 2, Manal M. Khayyat 3, Bushra M. E. Elnaim 4, Sayed Abdel-Khalek 5,6 and Romany F. Mansour 7,\***


**Abstract:** Data mining (DM) involves the process of identifying patterns, correlations, and anomalies existing in massive datasets. The applicability of DM includes several areas such as education, healthcare, business, and finance. Educational Data Mining (EDM) is an interdisciplinary domain which focuses on the applicability of DM, machine learning (ML), and statistical approaches for pattern recognition in massive quantities of educational data. This type of data suffers from the curse of dimensionality problem. Thus, feature selection (FS) approaches become essential. This study designs a Feature Subset Selection with an optimal machine learning model for Educational Data Mining (FSSML-EDM). The proposed method involves three major processes. At the initial stage, the presented FSSML-EDM model uses the Chicken Swarm Optimization-based Feature Selection (CSO-FS) technique for electing feature subsets. Next, an extreme learning machine (ELM) classifier is employed for the classification of educational data. Finally, the Artificial Hummingbird (AHB) algorithm is utilized for adjusting the parameters involved in the ELM model. The performance study revealed that the FSSML-EDM model achieves better results compared with other models under several dimensions.

**Keywords:** feature subset selection; data mining; educational data mining; artificial intelligence; machine learning; metaheuristics

#### **1. Introduction**

Data mining (DM) is the procedure of understanding data through cleaning raw data, discovering patterns, producing models, and testing these models. It draws on several fields such as statistics, machine learning (ML), and database systems. Educational data mining (EDM) is an emergent field with an evolving strategy for investigating the various types of data obtained from educational settings [1]. It is an interdisciplinary area which combines data mining (DM), artificial intelligence, and statistical modeling on the data produced by an academic organization [2]. EDM applies computational methods to analyze and explain educational data with the ultimate aim of answering educational questions. For a nation to stand out among other nations across the globe, education systems should undergo essential advancement by re-designing their structure. The hidden information and patterns in various data sources can be extracted

**Citation:** Hamdi, M.; Hilali-Jaghdam, I.; Khayyat, M.M.; Elnaim, B.M.E.; Abdel-Khalek, S.; Mansour, R.F. Chicken Swarm-Based Feature Subset Selection with Optimal Machine Learning Enabled Data Mining Approach. *Appl. Sci.* **2022**, *12*, 6787. https://doi.org/10.3390/ app12136787

Academic Editors: Stefanos Ougiaroglou and Dionisis Margaris

Received: 9 April 2022 Accepted: 29 June 2022 Published: 4 July 2022



by applying DM strategies. To relate the outcomes of students to their qualifications, researchers investigate the use of DM in academic fields. Raw data can be substantially transformed through DM models. The data obtained from an educational organization are examined with various DM strategies [3]. These strategies identify the conditions in which students can strive to have a positive impact [4].

Student performance prediction (SPP) has different definitions depending on the perspective adopted; nevertheless, quantitative assessment plays a significant role in current educational establishments. SPP can effectively aid all stakeholders in the educational process. For students, SPP can help them pick reasonable courses or activities and make plans for their academic careers [5]. For educators, SPP can assist in adapting learning material, presenting programs compatible with the students' capacity, and identifying struggling students. For education administrators, SPP can help monitor the education program and enhance the course framework. Generally, stakeholders in educational development intend to further improve educational outcomes. Moreover, data-driven SPP studies provide an objective point of reference for the education framework. Weka, a compelling DM tool, was utilized to produce the outcome [6].

The growth of educational information from distinct sources has created a pressing need for EDM research [7]. This can help to further the objectives and characterize specific goals of education, and it motivates feature subset selection by discarding components that are redundant or unimportant. The set of components chosen should follow the Occam's razor principle to provide the best outcome in light of the goal [8]. The size of the data to be handled has expanded in the past five years; hence, selection is becoming a necessity before any sort of classification takes place. It is not the same as feature extraction, because the selection method does not change the original representation of the data [9]. The simplest method involves selection, in which the amount of detail in an analysis is reduced by choosing only the main view of the circumstances, such as a higher level of activities [10].

Since high dimensionality raises computational costs, it is essential to define a way to reduce the number of considered features. Feature selection (FS) makes it possible to reduce a high-dimensionality problem and to select a suitable number of features. This study designed a Feature Subset Selection with optimal machine learning for the Educational Data Mining (FSSML-EDM) model. The proposed FSSML-EDM model involves the Chicken Swarm Optimization-based Feature Selection (CSO-FS) technique for electing feature subsets. Next, the extreme learning machine (ELM) classifier was employed for the classification of educational data. Finally, the Artificial Hummingbird (AHB) algorithm was utilized for adjusting the parameters involved in the ELM model. The performance study revealed the effectual outcomes of the FSSML-EDM model over the compared models under several dimensions. Our contributions are summarized as follows:


#### **2. Literature Review**

Injadat et al. [11] explored and analyzed two distinct datasets at two distinct phases of course delivery (20% and 50%) utilizing several graphical, statistical, and quantitative approaches. The feature analysis offers an understanding of the nature of the distinct features considered and their use in the selection of ML techniques and their parameters. Moreover, this work presents a systematic model dependent upon the Gini index and *p*-value for selecting an appropriate ensemble learner from a group of six potential ML techniques. Ashraf et al. [12]

proposed an accurate predictive pedagogical method, considering the pronounced nature and novelty of the presented approach in educational data mining. The base classifiers, comprising RT, J48, kNN, and naïve Bayes (NB), were evaluated under a 10-fold cross-validation model. In addition, filter procedures such as over-sampling (SMOTE) and under-sampling (spread subsampling) were exploited to examine important differences in outcomes between meta and base classifiers.

Dabhade et al. [13] forecasted student academic performance in a technical institution in India. The data were pre-processed and factor analysis was executed on the obtained dataset in order to remove anomalies from the data, decrease its dimensionality, and obtain the most correlated features. Nahar et al. [14] generated two datasets concentrating on two distinct angles. In the first dataset, the type of student (bad, medium, or good) in a particular course was classified and predicted based on the student's performance in its prerequisite course. This was carried out for the artificial intelligence (AI) course. The second dataset classified and forecasted the final grade (A, B, C) of an arbitrary subject; the data were arranged so that they concentrated only on performance in the midterm exam.

Despite all the studies performed on the FS process, to the best of our knowledge, only a few works have applied an FS-based classification model to EDM. Earlier works have used ML models for EDM without attributing much significance to the FS process. At the same time, the parameters involved in the ML models (i.e., ELM) considerably affect the overall classification performance. Since the trial-and-error method for parameter tuning is a tedious and error-prone process, metaheuristic algorithms can be applied. Therefore, metaheuristic optimization algorithms can be designed to optimally tune the parameters related to the ML models to improve the overall classification performance.

#### **3. The Proposed Model**

In this study, a new FSSML-EDM technique was developed for mining educational data. The proposed FSSML-EDM model involves data preprocessing at the initial stage to transform the input data into a compatible format. Then, the preprocessed data are passed into the CSO-FS technique for electing feature subsets. Next, the ELM classifier can be employed for the effective identification and classification of educational data. Finally, the AHB algorithm is utilized for effectively adjusting the parameters involved in the ELM model. The outcome of the ELM model is the classification output. Figure 1 depicts the block diagram of FSSML-EDM technique.

**Figure 1.** Block diagram of FSSML-EDM technique.

#### *3.1. Process Involved in CSO-FS Technique*

At the initial stage, the presented FSSML-EDM model incorporates the design of the CSO-FS technique for electing feature subsets. The CSO algorithm is chosen over other optimization algorithms due to its simplicity and high parallelism. The CSO simulates the movement and social behavior of a chicken swarm and is explained as follows: the swarm is divided into several groups, and each group has a dominant rooster, some hens, and chicks [15]. The roles of rooster, hen, and chick within a group are assigned according to fitness value. The rooster (group head) is the chicken with the best fitness value, whereas the chicks are the chickens with the worst fitness values. The majority of the chickens are hens, and they are assigned to a group arbitrarily. The dominance and mother–child relationships within a group remain unaltered and are only updated every G time steps. The movement of the chickens is expressed by the following equations; the rooster position update is provided by Equation (1):

$$X\_{i,j}^{t+1} = X\_{i,j}^t \times \left(1 + randn\left(0, \sigma^2\right)\right) \tag{1}$$

where:

$$
\sigma^2 = \begin{cases} 1 & \text{if } f\_i \le f\_k \\ \exp\left(\frac{f\_k - f\_i}{\left|f\_i\right| + \varepsilon}\right) & \text{otherwise} \end{cases}
$$

In which *k* ∈ [1, *Nr*], *k* ≠ *i*, and *Nr* refers to the number of roosters. *X<sup>t</sup><sub>i,j</sub>* signifies the position of rooster *i* in the *j*th dimension at iterations *t* and *t* + 1; *randn*(0, *σ*<sup>2</sup>) is used to generate Gaussian random numbers with mean 0 and variance *σ*<sup>2</sup>; *ε* is a constant with a very small value; and *fi* is the fitness value of the corresponding rooster *i*. The hen position update is provided by Equations (2)–(4):

$$X\_{i,j}^{t+1} = X\_{i,j}^t + S\_1 \cdot randn \cdot \left(X\_{r1,j}^t - X\_{i,j}^t\right) + S\_2 \cdot randn \cdot \left(X\_{r2,j}^t - X\_{i,j}^t\right) \tag{2}$$

In which:

$$S\_1 = \exp\left(\frac{f\_i - f\_{r1}}{|f\_i| + \varepsilon}\right) \tag{3}$$

and:

$$S\_2 = \exp\left(f\_{r2} - f\_i\right) \tag{4}$$

where *r*1, *r*2 ∈ [1, . . . , *N*], *r*1 ≠ *r*2; *r*1 refers to the index of the rooster of the group to which hen *i* belongs, whereas *r*2 refers to a chicken (rooster or hen) chosen randomly from the swarm; a uniform random number is generated by *randn*. Finally, the chick position update is provided by Equation (5):

$$X\_{i,j}^{t+1} = X\_{i,j}^t + FL\left(X\_{m,j}^t - X\_{i,j}^t\right) \; \; FL \in \left[0, 2\right] \tag{5}$$

where *X<sup>t</sup><sub>m,j</sub>* signifies the position of the mother of the *i*th chick.
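A minimal NumPy sketch of these position updates is given below. How the continuous positions are mapped to a feature subset (e.g., thresholding each coordinate to obtain a binary mask) and how fitness is computed from the classifier's error are not detailed in this section, so they are left out; everything here is illustrative rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def rooster_update(x_i, f_i, f_k, eps=1e-12):
    """Equation (1): perturb the rooster position with Gaussian noise whose
    variance depends on another randomly chosen rooster k."""
    sigma2 = 1.0 if f_i <= f_k else np.exp((f_k - f_i) / (abs(f_i) + eps))
    return x_i * (1.0 + rng.normal(0.0, np.sqrt(sigma2), size=x_i.shape))

def hen_update(x_i, x_r1, x_r2, f_i, f_r1, f_r2, eps=1e-12):
    """Equations (2)-(4): a hen follows its group rooster r1 and a random chicken r2."""
    s1 = np.exp((f_i - f_r1) / (abs(f_i) + eps))
    s2 = np.exp(f_r2 - f_i)
    return (x_i + s1 * rng.random(x_i.shape) * (x_r1 - x_i)
                + s2 * rng.random(x_i.shape) * (x_r2 - x_i))

def chick_update(x_i, x_mother):
    """Equation (5): a chick follows its mother with FL drawn from [0, 2]."""
    return x_i + rng.uniform(0.0, 2.0) * (x_mother - x_i)
```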

#### *3.2. ELM Based Classification*

At this stage, the ELM classifier can be employed for the effective identification and classification of educational data. The ELM model has *n* input nodes, *l* hidden nodes, and *m* output nodes. Initially, consider the training set {*X*, *Y*} = {*xi*, *yi*} (*i* = 1, 2, . . . , *Q*), which comprises the input features and targets of the *Q* training instances, where the matrices *X* and *Y* are expressed by [16]:

$$X = \begin{bmatrix} \mathfrak{x}\_{11} & \mathfrak{x}\_{12} & \cdots & \mathfrak{x}\_{1Q} \\ \mathfrak{x}\_{21} & \mathfrak{x}\_{22} & \cdots & \mathfrak{x}\_{2Q} \\ \vdots & \vdots & \ddots & \vdots \\ \mathfrak{x}\_{n1} & \mathfrak{x}\_{n2} & \cdots & \mathfrak{x}\_{nQ} \end{bmatrix} \tag{6}$$

$$\mathcal{Y} = \begin{bmatrix} \mathcal{Y}\_{11} & \mathcal{Y}\_{12} & \cdots & \mathcal{Y}\_{1Q} \\ \mathcal{Y}\_{21} & \mathcal{Y}\_{22} & \cdots & \mathcal{Y}\_{2Q} \\ \vdots & \vdots & \ddots & \vdots \\ \mathcal{Y}\_{m1} & \mathcal{Y}\_{m2} & \cdots & \mathcal{Y}\_{mQ} \end{bmatrix}^{\prime}$$

where the parameters *n* and *m* denote the dimensions of the input and output matrices. Next, ELM sets the weights between the input and hidden layers randomly:

$$w = \begin{bmatrix} w\_{11} & w\_{12} & \cdots & w\_{1n} \\ w\_{21} & w\_{22} & \cdots & w\_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w\_{l1} & w\_{l2} & \cdots & w\_{ln} \end{bmatrix} \tag{7}$$

where *wij* denotes the weight between the *i*th hidden and *j*th input nodes. Figure 2 showcases the framework of the ELM. Then, ELM considers the weights between the hidden and output layers, which can be written as follows:

$$
\boldsymbol{\beta} = \begin{bmatrix}
\beta\_{11} & \beta\_{12} & \cdots & \beta\_{1m} \\
\beta\_{21} & \beta\_{22} & \cdots & \beta\_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\beta\_{l1} & \beta\_{l2} & \cdots & \beta\_{lm}
\end{bmatrix} \prime \tag{8}
$$

where *βjk* indicates the weight between the *j*th hidden and *k*th output nodes. Next, ELM sets the biases of the hidden layer randomly:

$$B = \begin{bmatrix} b\_1 \ b\_2 \ \cdots \ \ b\_l \end{bmatrix}^T. \tag{9}$$

After that, ELM chooses the network activation function *g*(·). Accordingly, the output matrix *T* is characterized by:

$$T = [t\_1 \ \ t\_2 \ \dots \ \ \ \ \ \ t\_Q]\_{m \times Q}. \tag{10}$$

**Figure 2.** Framework of ELM.

The column vector of output matrix *T* is shown in the following:

$$t\_j = \begin{bmatrix} t\_{1j} \\ t\_{2j} \\ \vdots \\ t\_{mj} \end{bmatrix} = \begin{bmatrix} \sum\_{i=1}^{l} \beta\_{i1} \text{g} \left( w\_i \mathbf{x}\_j + b\_i \right) \\ \sum\_{i=1}^{l} \beta\_{i2} \text{g} \left( w\_i \mathbf{x}\_j + b\_i \right) \\ \vdots \\ \vdots \\ \sum\_{i=1}^{l} \beta\_{im} \text{g} \left( w\_i \mathbf{x}\_j + b\_i \right) \end{bmatrix} (j = 1, 2, 3, \dots, Q). \tag{11}$$

Moreover, combining Equations (10) and (11), the model can be written as follows:

$$H\beta = T',\tag{12}$$

where *T* ′ indicates the transpose of *T* and *H* represents the output matrix of the hidden neurons. To achieve a better solution with less error, the least-squares method is employed for determining the weight matrix *β* [17,18]:

$$
\beta = H^{+}T'.\tag{13}
$$

To enhance the generalization ability of the system and provide stable outcomes, a regularization parameter *λ* is introduced. When the number of hidden nodes is smaller than the number of training samples, *β* is given as follows [19]:

$$\beta = \left(\frac{I}{\lambda} + H^T H\right)^{-1} H^T T'. \tag{14}$$

When the number of hidden nodes is larger than the number of training samples, *β* is denoted as follows [20]:

$$\beta = H^T \left(\frac{I}{\lambda} + HH^T\right)^{-1} T'.\tag{15}$$
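The whole training procedure therefore reduces to one random projection plus one linear solve. The following NumPy sketch uses the regularized solution of Equation (14); the sigmoid activation and the value of λ are assumptions for illustration, not the authors' settings.

```python
import numpy as np

def train_elm(X, Y, n_hidden, lam=1e-3, seed=0):
    """Minimal ELM sketch: random input weights and biases, sigmoid hidden layer,
    output weights beta from regularized least squares (Equation (14))."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_hidden, X.shape[1]))   # input-to-hidden weights
    b = rng.normal(size=n_hidden)                  # hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))       # hidden-layer output matrix
    beta = np.linalg.solve(np.eye(n_hidden) / lam + H.T @ H, H.T @ Y)
    return W, b, beta

def predict_elm(X, W, b, beta):
    """Forward pass through the trained ELM."""
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta
```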

#### *3.3. AHB Based Parameter Optimization*

Finally, the AHB algorithm is utilized for effectively adjusting the parameters involved in the ELM model with the goal of attaining maximum classification performance [21]. The AHB algorithm is an optimization approach inspired by the foraging and flight behaviors of hummingbirds. Its three major models are as follows. In the guided foraging model, three flight behaviors are utilized in foraging (axial, diagonal, and omnidirectional flight); it can be defined as follows:

$$v\_i(t+1) = \mathbf{x}\_{i,ta}(t) + h \cdot b \cdot \left(\mathbf{x}\_i(t) - \mathbf{x}\_{i,ta}(t)\right), \quad h \sim N(0, 1) \tag{16}$$

where *xi*,*ta*(*t*) characterizes the location of the targeted food source, *h* signifies the guiding factor, and *xi*(*t*) represents the location of the *i*th food source at time *t*. The location update of the *i*th food source is provided by:

$$\mathbf{x}\_i(t+1) = \begin{cases} \mathbf{x}\_i(t) & f(\mathbf{x}\_i(t)) \le f(v\_i(t+1)) \\ v\_i(t+1) & f(\mathbf{x}\_i(t)) > f(v\_i(t+1)) \end{cases} \tag{17}$$

where *f*(*xi*(*t*)) and *f*(*vi*(*t* + 1)) denote the fitness values of *xi*(*t*) and *vi*(*t* + 1), respectively. The local search of hummingbirds in the territorial foraging strategy is provided as follows:

$$
v\_i(t+1) = \mathbf{x}\_i(t) + g \cdot b \cdot \mathbf{x}\_i(t), \quad g \sim N(0, 1) \tag{18}
$$

where *g* represents the territorial factor. The mathematical formula for the migration foraging of hummingbirds is provided by:

$$
\mathbf{x}\_{wor}(t+1) = lb + r \cdot (ub - lb) \tag{19}
$$

where *xwor* indicates the food source with the worst rate of nectar refilling in the population, *r* represents a random factor, and *ub* and *lb* denote the upper and lower limits, respectively.
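Purely as an illustration of these update rules, the three foraging moves could be sketched as below; the direction-switch vector *b* and the fitness-based acceptance of Equation (17) are omitted for brevity, and coupling the updates to the ELM hyperparameters is an assumption rather than the authors' stated procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def guided_foraging(x_i, x_target):
    """Equation (16): move relative to a targeted food source, guiding factor h ~ N(0, 1)."""
    h = rng.normal()
    return x_target + h * (x_i - x_target)

def territorial_foraging(x_i):
    """Equation (18): local search around the current food source, g ~ N(0, 1)."""
    g = rng.normal()
    return x_i + g * x_i

def migration_foraging(lb, ub):
    """Equation (19): re-initialize the worst food source uniformly within [lb, ub]."""
    lb, ub = np.asarray(lb, dtype=float), np.asarray(ub, dtype=float)
    return lb + rng.random(lb.shape) * (ub - lb)
```

In the tuning loop, each food source would encode a candidate ELM configuration (for example, the number of hidden neurons and the regularization parameter λ), and, following Equation (17), a candidate would replace the current one only if it lowers the validation error of the trained ELM.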

#### **4. Experimental Validation**

The proposed FSSML-EDM model was simulated using a benchmark dataset from the UCI repository, which comprises 649 samples with 32 features and 2 class labels, as illustrated in Table 1. The parameter settings are as follows: learning rate, 0.01; dropout, 0.5; batch size, 5; and number of epochs, 50. For experimental validation, the dataset is split into 70% training (TR) data and 30% testing (TS) data.



Figure 3 highlights the set of confusion matrices produced by the FSSML-EDM model on the test data. The figure indicates that the FSSML-EDM model resulted in effectual outcomes. On the entire dataset, the FSSML-EDM model classified 545 samples as pass and 86 samples as fail. In addition, on the 70% training dataset, the FSSML-EDM model classified 390 samples as pass and 53 samples as fail. Moreover, on the 30% testing dataset, the FSSML-EDM model classified 155 samples as pass and 33 samples as fail.

Table 2 offers a comprehensive EDM outcome of the FSSML-EDM model on test dataset. The experimental values indicated that the FSSML-EDM model accomplished maximum outcomes on all datasets. Figure 4 provides brief classification results of the FSSML-EDM model on entire dataset. It can be inferred from the figure that the FSSML-EDM model classified pass instances for *accuy*, *precn*, *recal*, *Fscore*, *MCC*, and kappa of 97.23%, 97.50%, 99.27%, 98.38%, and 89.08% respectively. Moreover, the figure shows that the FSSML-EDM model classified fail instances for *accuy*, *precn*, *recal*, *Fscore*, *MCC*, and kappa of 97.23%, 95.56%, 86%, 90.53%, and 89.08%, respectively.

**Table 2.** Result analysis of FSSML-EDM technique with distinct measures and datasets.


Figure 5 provides detailed classification results of the FSSML-EDM model on 70% of the training dataset. The figure reveals that the FSSML-EDM model classified pass instances for *accuy*, *precn*, *recal*, *Fscore*, *MCC*, and kappa of 97.58%, 97.50%, 99.74%, 98.61%, and 89.57%, respectively. In addition, the figure shows that the FSSML-EDM model classified fail instances for *accuy*, *precn*, *recal*, *Fscore*, *MCC*, and kappa of 97.58%, 98.15%, 84.13%, 90.60%, and 89.57% respectively.

**Figure 3.** Confusion matrix of FSSML-EDM technique on test data. (**a**) Entire dataset, (**b**) 70% of training dataset, and (**c**) 30% of testing dataset.

**Figure 4.** Result analysis of FSSML-EDM technique on entire dataset.

**Figure 5.** Result analysis of FSSML-EDM technique on 70% of training dataset.

Figure 6 offers brief classification results of the FSSML-EDM approach on 30% of testing dataset. The figure exposes that the FSSML-EDM algorithm classified pass instances for *accuy*, *precn*, *recal*, *Fscore*, *MCC*, and kappa of 96.41%, 97.48%, 98.1%, 97.798%, and 88.22%, respectively. Moreover, the figure shows that the FSSML-EDM approach classified fail instances for *accuy*, *precn*, *recal*, *Fscore*, *MCC*, and kappa of 96.41%, 91.67%, 89.19%, 90.41%, and 88.22%, respectively.

Figure 7 illustrates the training and validation accuracy inspection of the FSSML-EDM technique on the applied dataset. The figure of the FSSML-EDM approach offers maximum training/validation accuracy on the classification process.

**Figure 6.** Result analysis of FSSML-EDM technique on 30% of testing dataset.

**Figure 7.** Accuracy graph analysis of FSSML-EDM technique.

Next, Figure 8 reveals the training and validation loss inspection of the FSSML-EDM approach on the applied dataset. The figure shows that the FSSML-EDM algorithm offers reduced training/validation loss in the classification process of the test data.

A brief precision-recall examination of the FSSML-EDM model on the test dataset is portrayed in Figure 9. By observing the figure, it is noticed that the FSSML-EDM model accomplished maximum precision-recall performance under all classes.

**Figure 9.** Precision-recall curve analysis of FSSML-EDM technique.

A detailed ROC investigation of the FSSML-EDM approach on the distinct datasets is portrayed in Figure 10. The results indicate that the FSSML-EDM technique exhibited its ability in categorizing two different classes such as pass and fail on the test datasets.

Table 3 reveals an extensive comparative study of the FSSML-EDM model with existing models such as the improved evolutionary algorithm-based feature subset selection with neuro-fuzzy classification (IEAFSS-NFC) [22], neuro-fuzzy classification (NFC) [21], neural network (NN), support vector machine (SVM), decision tree (DT), and random forest (RF) models. Figure 11 inspects the comparative *precn*, *recal*, and *accuy* investigation of the FSSML-EDM model with recent methods. The figure reveals that the NN and SVM models showed poor performance with lower values of *precn*, *recal*, and *accuy*. The NFC, DT, and RF models showed slightly improved values of *precn*, *recal*, and *accuy*. Moreover, the IEAFSS-NFC model resulted in reasonable *precn*, *recal*, and *accuy* of 93.81%, 92.39%, and 90.33%, respectively. Furthermore, the FSSML-EDM model accomplished effectual outcomes with maximum *precn*, *recal*, and *accuy* of 94.58%, 93.65%, and 96.41%, respectively.

**Table 3.** Comparative analysis of FSSML-EDM approach with existing methods [21].


Figure 12 inspects the comparative *Fscore*, *MCC*, and *kappa* analysis of the FSSML-EDM method with existing algorithms. The figure reveals that the NN and SVM methods showed poor performance with lower values of *Fscore*, *MCC*, and *kappa*. Similarly, the NFC, DT, and RF approaches showed slightly improved values of *Fscore*, *MCC*, and *kappa*.

**Figure 11.** *Precn*, *recal*, and *accy* analysis of FSSML-EDM technique.

**Figure 12.** *Fscore*, *MCC*, and *kappa* analysis of FSSML-EDM technique.

Moreover, the IEAFSS-NFC model resulted in a reasonable *Fscore*, *MCC*, and *kappa* of 93.01%, 73.78%, and 73.37%, respectively. Additionally, the FSSML-EDM methodology accomplished effectual outcomes with maximum *Fscore*, *MCC*, and *kappa* of 94.10%, 82.22%, and 88.20%, respectively. Therefore, the FSSML-EDM model has the capability of assessing student performance in real time.

#### **5. Conclusions**

In this study, a new FSSML-EDM technique was developed for mining educational data. The proposed FSSML-EDM model involves three major processes. At the initial stage, the presented FSSML-EDM model incorporates the design of the CSO-FS technique for electing feature subsets. Next, the ELM classifier is employed for the effective identification and classification of educational data. Finally, the AHB algorithm is utilized for effectively adjusting the parameters involved in the ELM model. The performance study revealed the effectual outcomes of the FSSML-EDM model over the compared models under several dimensions. Therefore, the FSSML-EDM model can be used as an effectual tool for EDM. In the future, feature reduction and outlier removal models can be employed to improve performance. In addition, the proposed model has presently been tested only on a small-scale dataset. As part of the future scope, the performance of the proposed model will be evaluated on a large-scale real-time dataset.

**Author Contributions:** Conceptualization, M.H.; Data curation, I.H.-J.; Formal analysis, I.H.-J.; Investigation, M.M.K.; Methodology, M.H. and R.F.M.; Project administration, M.M.K.; Resources, B.M.E.E.; Software, B.M.E.E.; Supervision, S.A.-K.; Validation, S.A.-K.; Visualization, S.A.-K.; Writing original draft, M.H.; Writing—review & editing, R.F.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R125), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia; Also, the authors would like to thank the Deanship of Scientific Research at Umm Al-Qura University for supporting this work by Grant Code: (22UQU4400271DSR07).

**Data Availability Statement:** Data sharing not applicable to this article as no datasets were generated during the current study.

**Conflicts of Interest:** The authors declare that they have no conflict of interest. The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

#### **References**


### *Article* **A High-Level Representation of the Navigation Behavior of Website Visitors**

**Alicia Huidobro 1, Raúl Monroy 1,\* and Bárbara Cervantes 2,†**


**Abstract:** Knowing how visitors navigate a website can lead to different applications. For example, providing a personalized navigation experience or identifying website failures. In this paper, we present a method for representing the navigation behavior of an entire class of website visitors in a moderately small graph, aiming to ease the task of web analysis, especially in marketing areas. Current solutions are mainly oriented to a detailed page-by-page analysis. Thus, obtaining a high-level abstraction of an entire class of visitors may involve the analysis of large amounts of data and become an overwhelming task. Our approach extracts the navigation behavior that is common among a certain class of visitors to create a graph that summarizes class navigation behavior and enables a contrast of classes. The method works by representing website sessions as the sequence of visited pages. Sub-sequences of visited pages of common occurrence are identified as "rules". Then, we replace those rules with a symbol that is given a representative name and use it to obtain a shrunken representation of a session. Finally, this shrunken representation is used to create a graph of the navigation behavior of a visitor class (a group of visitors relevant to the desired analysis). Our results show that a few rules are enough to capture a visitor class. Since each class is associated with a conversion, a marketing expert can easily find out what makes classes different.

**Keywords:** web analytics; web log mining; clickstream analysis; sequence mining; sequitur; graph techniques

#### **1. Introduction**

The more knowledge a company has about visitors, the more effective its marketing strategies will be [1–3]. Therefore, it is valuable to know how visitors navigate the website [4–6]. This knowledge has to be obtained from the huge amount of data that are stored on a website [6–8]. Web analytics solutions (WAS) are widely used and provide useful metrics [9–17]. However, they have some limitations for describing the navigation behavior of visitors. Current web analytics software provides a page-by-page report [11,12]. This level of detail produces huge graphs that are difficult to analyze and compare. Numerous literature approaches analyze the sequence of visited pages [18–22], but they do not provide a high-level description of the navigation behavior. They are limited to clustering visitors based on different criteria; for example, the longest common subsequence of visited pages [20,23–27].

The objective of our research is to find out the navigation behavior that is common in a whole class of visitors. Each class of visitors should provide valuable knowledge in terms of business goals, specifically for marketing experts. Therefore, the segmentation of visitors is important. We used conversions as classes of visitors, as proposed by A. Huidobro et al. [28]. This approach eases the interpretation of results because conversions are specific visitor actions that contribute to business objectives [3,5,29], and it is a concept with which marketing experts

**Citation:** Huidobro, A.; Monroy, R.; Cervantes, B. A High-Level Representation of the Navigation Behavior of Website Visitors. *Appl. Sci.* **2022**, *12*, 6711. https:// doi.org/10.3390/app12136711

Academic Editors: Dionisis Margaris and Stefanos Ougiaroglou

Received: 8 June 2022 Accepted: 30 June 2022 Published: 2 July 2022



are familiar. Examples of website conversions are: to pay for a product or a service, to fill in a form with contact details, or to post a positive product review.

To describe the navigation behavior of a whole class of visitors, we started by representing sessions as a sequence of visited pages. From the sequences representing sessions of a given class, we extracted the most frequent sub-sequences of pages. We called those sequences "rules". Each rule is formed by different pages and represents visitor actions; for example, making a payment or searching for the availability of a product. We named the rules and used them to obtain a reduced representation of each session; session reduction is the result of replacing sub-sequences (of length greater than or equal to two) with the name of the corresponding rule. The representation of sessions with rules drastically reduces the length of sessions (for example, from forty pages to two rules). Various types of analyses can be performed with the reduced representation of sequences. For example, comparing the frequency of a given rule in two different classes of visitors. We show the result of our analysis in a graph to facilitate the explanation oriented to marketing experts. Using the described method, we reduce hundreds of nodes and edges into a simplified graph that captures the navigation behavior of a whole class of visitors. The graph also assists in the comparison of the navigation behavior of different classes of visitors. It spares the marketing expert from analyzing huge graphs to understand the navigation behavior of visitors. Our four-step methodology is shown in Figure 1.

**Figure 1.** The four-step methodology for characterizing website visitors based on their navigation behavior.
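As a sketch of the replacement step described above, a session can be shrunk by substituting named rules for their page sub-sequences. The exact matching order used by the authors is not specified, so the greedy longest-rule-first scan below is an assumption.

```python
def shrink_session(pages, rules):
    """Replace rule sub-sequences in a session (list of page symbols) with rule names,
    trying longer rules first. `rules` maps a rule name to its page sub-sequence."""
    ordered = sorted(rules.items(), key=lambda item: len(item[1]), reverse=True)
    out, i = [], 0
    while i < len(pages):
        for name, sub in ordered:
            if pages[i:i + len(sub)] == list(sub):
                out.append(name)
                i += len(sub)
                break
        else:
            out.append(pages[i])
            i += 1
    return out
```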

#### *1.1. Related Work*

In this subsection, we explain the limitations of both popular commercial software and literature approaches for describing the navigation behavior of a whole class of visitors. Concerning commercial software, we focus on Google Analytics and Matomo, which have a similar functionality. Google Analytics is the most popular web analytics software [30–32] and Matomo is an alternative to overcome some limitations of Google Analytics [12].

#### 1.1.1. Commercial Software

Google Analytics and Matomo provide a similar functionality for tracking the navigation behavior of visitors. In Google Analytics, it is called a "Behavior flow" report. In Matomo, it is the section "Goal conversion tracking", but it is only available in the premium version. Both consist of showing the sequence of the most visited pages in a period. They aim to measure the engagement page to page. Therefore, they are useful for finding pages where the traffic is lost, but it is difficult to follow a path with numerous pages. It is also difficult to visualize the path of 100% of visitors if they are numerous and behave differently. It is possible to track events instead of pages. Nevertheless, those events have to be previously configured. Therefore, events do not represent the natural navigation behavior of visitors. In Figure 2, we show an example of the behavior flow chart in Google Analytics. The page-by-page detail does not provide a high-level description of the navigation behavior [11,12]. Tens of pages would have to be reviewed to understand the navigation behavior of a whole class of visitors.

**Figure 2.** Example of the "Behavior flow" report in Google Analytics. It shows the sequence of most visited pages, from left to right. The thick red lines indicate traffic drop-offs. Each column corresponds to a web page or event. Therefore, this type of visualization does not show all the navigation in a single view but continues to the right, according to the number of pages or events.

#### 1.1.2. Other Non-Commercial Approaches

There are diverse web log mining approaches in the literature. However, they are centered around identifying clusters of visitors, and not on obtaining a high-level description of their navigation behavior. The analysis of visited pages, called sequence mining, is commonly applied for discovering patterns with a frequency support measure [33]. Clickstream analysis is the most popular sequence mining approach used for clustering visitors [21]. Clickstream is the sequence of pages visited by a user in a given website and period [27]. In our approach, a rule is a sequence of pages frequently visited by visitors of the same class. Therefore, we reviewed clickstream approaches; below, we describe some of them.

S. Tiwari et al. [20] use previously visited pages to forecast online navigational patterns (finding the next page expectation). They apply agglomerative clustering to group visitors according to the previous web data accessed. They obtain the set of frequently visited pages in each group of visitors. This information is used to put in the cache pages with higher frequency in order to reduce the search time.

A. Banerjee et al. [27] propose finding the longest common sub-sequence of clickstreams using a dynamic programming algorithm. Then, they identify similar users by computing a similarity value that considers the time spent on each page. With the similarity values, they construct a weighted similarity graph. Finally, they find clusters on that graph. They found that, in some cases, there are no exact matches. As a solution, they propose to first group data into categories.

There are other clickstream pattern mining approaches, but they are focused on improving the runtime or memory consumption for clustering visitors [21,22]. Visualization tools have also been proposed to analyze the navigation behavior of visitors [34–39]. However, they provide a detailed analysis of web pages; for example, to find the percentage of visitors on each web page.

Our contribution is a high-level description of the navigation behavior of visitors. A distinguishing characteristic of our approach is that we extracted the natural navigation behavior of visitors instead of checking whether visitors perform previously known actions. Another distinctive aspect is that we represented business functions (conversions) in a single node (rule); this data reduction is relevant because representing all of the sessions of an ecommerce website usually involves thousands of visitors and hundreds of pages. In Figure 3, we show an example of sessions represented with rules. Considering that each rule groups N web pages, the proposed representation reduces the information a business expert needs to analyze while keeping interpretability. The representation obtained by commercial software would involve many more nodes (see Figure 2). This would make it difficult, for example, to identify loops that are worth analyzing and to compare different entry points to the website. We describe this in more detail in Sections 4 and 5.

**Figure 3.** An example of sessions represented with rules. Each node represents a rule (business functions) that groups N web pages. We can see, for example, that visitors who arrive at the website by the login page have a greater chance of paying than those who enter through the control panel. We also identify loops. The loop in the "Consult availability" rule may be expected because visitors usually review the availability of N products. However, the loop in the "Make invoice" is worth investigating since it could be a cause of dropout.

#### *1.2. Methodology*

Web log mining is the use of data mining techniques to obtain information about the navigation behavior of visitors [40]. The main difficulties in web log mining are the huge amount of traffic on websites and the wide variety of paths that visitors could follow [41]. Understanding the navigation behavior of numerous visitors can be overwhelming. Therefore, we aim to describe the navigation behavior of visitors with a simplified representation. With a sequence mining approach, we captured the milestones of different classes of visitors and presented them as a graph. That reduced the amount of data that a marketing expert would have to analyze in order to understand the navigation behavior of visitors and contrast different classes of visitors. To achieve this goal, we represented the navigation behavior of the website visitors as the sequence of visited pages in each session. That representation allows us to find out the representative navigation milestones in each class of visitors. To this aim, we used a compression algorithm that allowed us to identify sequences of pages that are common among visitors from the same class. We called those common sequences "rules". The set of rules obtained in each class of visitors describes the navigation behavior of most of the visitors in that class. Having identified the rules for each class of visitors, it is possible to replace the session pages with rules. This results in a reduced representation of sessions that allows us to carry out different kinds of analyses. For example, sessions could be represented exclusively with rules or, conversely, the behavior that is not represented by rules could be analyzed further. Statistics of the sessions represented as rules would help a marketing expert to establish questions of interest. We analyzed those statistics and, due to our simplification purpose, we found it relevant to represent the navigation behavior of visitors only with the most frequent rules. For our target audience, who are marketing experts, a graph visualization of results would be more friendly. Therefore, we summarized all of the sessions of a given class in a graph. The representation of sessions with rules significantly reduced the amount of data that a marketing expert would have to analyze. The graph depiction assists in the understanding of the navigation behavior of visitors, even when our work was not focused on visualization techniques.

In Section 2, we explain the sequence mining process for identifying rules in each class of visitors. Then, in Section 3, we describe how we used rules for representing sessions. In Section 4, we present the graph that describes the navigation behavior of visitors and insights obtained from it. In Section 5, we summarize our contributions and compare them

with previous work. Finally, in Section 6, we mention the advantages and limitations of the proposed methodology.

#### **2. Identification of Rules in Each Class of Visitors**

To find out the most common sub-sequences of visited pages for different classes of visitors, in the first place, one needs to represent the navigation behavior of website visitors. With that representation, it is then possible to find sequences of web pages that are commonly visited among visitors from a given class. Below we describe: (1) how we represented sessions with a sequence of symbols that contains the navigation behavior of interest, (2) the compression algorithm that we used to find common sub-sequences of pages, and (3) how we used that compression algorithm to find out the most common sub-sequences of visited pages, which are the milestones for each class of visitors.

#### *2.1. Representation of Each Class of Visitors as a Sequence of Symbols*

The input data consist of 50,820 sessions represented as the list of visited pages. There were thousands of different pages, but only a small subset was relevant for the proposed analysis. Once we identified relevant pages on sessions, we represented them as symbols to ease the sequence mining process.

#### 2.1.1. Identification of Relevant Pages

We are interested in the navigation behavior that could be meaningful for marketing experts. Therefore, we identified relevant pages, which are described as follows:


The process to keep only relevant pages entailed some information loss. That information could be useful for some traffic analytics; for example, to measure the number of pages sent to the visitor, the amount of data transmitted, or the frequency of clicks. Nevertheless, that loss does not affect the objective of describing the navigation behavior of visitors. We only needed web pages intentionally visited.

#### 2.1.2. Representation of Sessions as a Sequence of Symbols

To ease the sequence mining process, we represented each session as a sequence of symbols. Since there are 298 pages of interest, we assigned a two-letter identifier to each of them. Then, in each session, we replaced the name of each page with its identifier; for example, the session Home → Login → Control panel → Logout became *AaAzBkBb*.
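A small sketch of this encoding step is shown below; the actual identifiers assigned by the authors are not listed (the example in the text uses Aa, Az, Bk, Bb), so the assignment order here is an illustrative assumption.

```python
from itertools import product
from string import ascii_uppercase, ascii_lowercase

def build_symbol_table(pages):
    """Assign a two-letter identifier (Aa, Ab, ..., Ba, ...) to each relevant page;
    26 x 26 combinations comfortably cover the 298 pages of interest."""
    ids = ("".join(pair) for pair in product(ascii_uppercase, ascii_lowercase))
    return {page: next(ids) for page in pages}

def encode_session(session, symbols):
    """Turn a session (list of visited pages) into a string of two-letter symbols,
    keeping only pages present in the table of relevant pages."""
    return "".join(symbols[page] for page in session if page in symbols)

symbols = build_symbol_table(["Home", "Login", "Control panel", "Logout"])
print(encode_session(["Home", "Login", "Control panel", "Logout"], symbols))
```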

#### 2.1.3. Segmentation of Data in Different Classes of Visitors

We are interested in describing the navigation behavior of different classes of visitors and contrasting them. Therefore, it is necessary to segment data. The input dataset was already labelled. We classified 100% of sessions into four disjoint classes:


We will refer to the previous classes of visitors as "Made payment", "Started payment", "Other conversions", and "No conversion". We obtained a dataset for each class of visitors.

With sessions represented as a sequence of symbols and segmented into different classes of visitors, it is possible to identify the representative navigation milestones in each class of visitors. To this aim, we used a compression algorithm, which is described next.

#### *2.2. Selection and Implementation of the Compression Algorithm*

An objective of our research was to reduce the amount of data that have to be analyzed in order to understand the navigation behavior of visitors. Our strategy was to find recurrent sub-sequences of visited pages. Therefore, we used a sequence mining approach. In this subsection, we explain how we selected the sequence mining algorithm, how it operates, and the implementation that we used.

#### 2.2.1. Selection of the Sequence Mining Algorithm

We discarded algorithms that find the longest common sub-sequence, such as MAXLEN [27,42], because we are interested in all of the sub-sequences that are repeated, regardless of their length. We evaluated compression algorithms such as Sequitur [42–44], Repair [45,46], and Bisection [42,47]. We selected the Sequitur algorithm because it is the most efficient; it runs in linear time. Below, we explain this algorithm.

#### 2.2.2. Sequitur Algorithm

Sequitur finds repetitive sub-sequences in a sequence by identifying rules. It creates a grammar based on repeated sub-sequences. Then, each repeated sub-sequence becomes a rule in the grammar. To produce a concise representation of the sequence, two properties must be met [48,49]:


To clarify the operation of Sequitur, we will use the following definitions:


We will use the sequence aghdfghmadfgh as an example to describe the operation of the Sequitur algorithm. For each symbol in the sequence, Sequitur verifies the properties of digram uniqueness and rule utility. In Table 1, each row shows the resulting grammar and the expanded rules as each new symbol is reviewed. In the column "Resulting grammar", 1 to n are the found rules, and 0 is the result of using those rules in the original string. Grammar 0 is not expanded in the last column because it is not a rule; however, if we expand Grammar 0, we obtain the original string. We can see that Sequitur does not find any rule from rows 1 to 7, because no pair of symbols appears twice or more in the string. In row 8, the pair of symbols "gh" appears twice, so it is added to the grammar as rule 1. In row 11, the pair of symbols "aa" appears twice and is added to the grammar as a new rule. In row 13, the pair of symbols "df" appears twice and is added to the grammar as rule 3. In row 15, the rule "df" becomes a nested rule because "dfgh" is found twice, but the pair "gh" is already rule 2, so rule 3 changes from "df" to "df 2". All of the rules added to the grammar meet properties *p*1 and *p*2.
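To make the digram-replacement idea concrete, the following sketch builds a small grammar offline by repeatedly replacing the most frequent repeated digram with a new rule. It illustrates the digram uniqueness property, but it is not the incremental, linear-time Sequitur algorithm, and it does not enforce the rule utility property; it is only an illustration.

```python
# Simplified, offline illustration of grammar-based digram replacement (the idea
# behind Sequitur). The real Sequitur algorithm is incremental and linear time,
# and it also enforces rule utility (every rule must be used more than once).
from collections import Counter

def build_grammar(sequence):
    """Return (compressed sequence, rules), where rules map rule ids to digrams."""
    seq, rules, next_rule = list(sequence), {}, 1
    while True:
        digrams = Counter(zip(seq, seq[1:]))
        digram, count = digrams.most_common(1)[0] if digrams else (None, 0)
        if count < 2:            # digram uniqueness: stop when no digram repeats
            break
        rule_id = f"R{next_rule}"
        rules[rule_id] = list(digram)
        next_rule += 1
        new_seq, i = [], 0       # replace non-overlapping occurrences of the digram
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == digram:
                new_seq.append(rule_id)
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        seq = new_seq
    return seq, rules

compressed, grammar = build_grammar("aghdfghmadfgh")
print(compressed)   # top-level sequence with rule ids
print(grammar)      # nested rules, e.g. a "dfgh" rule built from the "gh" rule
```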

#### 2.2.3. Implementation of the Sequitur Algorithm

We used a publicly available implementation of Sequitur [50]. We adapted this implementation in order to use it with the two-letter identifier of each web page. That was necessary because the original implementation identifies each symbol as a different element in the sequence.


**Table 1.** Operation of the Sequitur algorithm. NRF = No rules found.

Next, we explain how we used this implementation of the Sequitur algorithm for finding recurrent sub-sequences of visited pages.

#### *2.3. Rule Extraction*

We used the Sequitur algorithm to find recurrent sub-sequences of visited pages (rules) in each class of visitors. In this subsection, we explain how we extracted, analyzed, and selected those rules.

#### 2.3.1. Rule Finding

Sequitur identifies the sub-sequences that appear twice or more in a string as rules. Nevertheless, for our analysis, it was necessary to find all sub-sequences that are common among different sessions. Some of those sub-sequences may appear only once in each session. To this aim, we concatenated sessions of each class of visitors. Below, we explain this methodology.


In Table 2, we show the percentage of sessions and the number of rules found in each class of visitors. We can see that the classes of visitors "Made payment", "Started payment", and "Other conversions" have a much higher number of rules than the "No conversion" class, even though "No conversion" has the highest percentage of visitors. This could indicate a more homogeneous behavior among visitors from the first three classes.

**Table 2.** Rules obtained in each class of visitors. Columns 2 to 5 indicate the class of visitor. Seven percent of the sessions are visitors from the class Made payment, where we found 764 rules. Conversely, fifty-two percent of the sessions are visitors from the class No conversion, where we found only 92 rules.


Rules should allow us to contrast classes. Therefore, we made an inter-class analysis to find out if the set of rules is different in each class of visitors.

#### 2.3.2. Inter-Class Analysis

The objective of the inter-class analysis is to find out (1) if the rules are different for each class of visitors, and (2) if those rules are relevant. To this aim, we computed two metrics:


As an example of the inter-class analysis, in Table 3, we show the metrics of the rules found in visitors from the class "Made payment". Below, we summarize the interpretation of this table.



**Table 3.** Inter-class analysis metrics for the rules found in the class of visitors "Made payment".

We performed the inter-class analysis for the other three classes of visitors. Both metrics are higher when the set of rules and the sessions belong to the same class of visitors. Nevertheless, it was remarkable that the highest inverse frequency of No conversion rules was only 14% in the sessions of the same class. This indicates that the behavior of the visitors that belong to the class No conversion is less homogeneous.

The inter-class analysis confirmed that there are relevant and specific rules for each class of visitors. The next step was to select the best rules for describing the navigation behavior of visitors.

#### 2.3.3. Rule Selection

A selection of rules is necessary because the rules obtained until now include base rules and nested rules. This is redundant because base rules are contained in nested rules. There could also be rules with too low an inverse frequency (e.g., rules that are found in just one session). These rules are not representative. Therefore, we applied two criteria for selecting rules:


In Table 4, we show the number of rules obtained after applying these selection criteria. The inverse frequency of all of the nested rules extracted from the class of visitors No conversion was <5%. We concluded that the navigation behavior of this class of visitors is non-homogeneous. Thus, it could not be simplified using a small set of rules. In the next steps of the process, we only used the classes of visitors Made Payment, Started payment, and Other conversions. From now on, when we use the term "rule(s)", we refer to the set of rules presented in Table 4.

**Table 4.** Rules selected for each class of visitors: the nested rules with inverse frequency ≥5%.


We assigned a name to each rule. In Table 5, we list that name and the number of pages that form each rule. We also mention the class of visitors in which the rule was found. The rules listed in Table 5 are the navigation milestones for each class of visitors. Now, those rules can be used to simplify the representation of sessions. That process is explained next in Section 3.


**Table 5.** Name of the rules found in each class of visitors. The rule length indicates the number of pages that form each rule. An "X" indicates that the rule was found in that class of visitors.

#### **3. Representation of Sessions with Rules**

At this point, we have already identified the rules for each class of visitors. These rules can be used for representing sessions. This reduced representation allows us to carry out different kinds of analyses. In Section 3.1, we explain how we select rules to create a reduced graph, and we provide statistics of the sessions represented with rules. These statistics provide information that would help marketing or information technology experts to establish questions of interest. Then, in Section 3.2, we describe how we select the data to be shown based on the questions of interest. For our particular case study, it was relevant to represent the navigation behavior of visitors only with the most frequent rules.

#### *3.1. Selection of Rules to Visualize*

We used the rules identified in Section 2 to represent sessions. In each session or group of sessions, we only used the rules that belong to the same class of visitors, according to Table 5. For example, a session that belongs to the class of visitors "Made payment" is represented only with the nine rules found in the sessions of the same class of visitors. In Figure 4, we show an example of three representations of a session: (1) the original session, (2) the session obtained by replacing frequently occurring sub-sequences in the original session with rules (we call it a shrinked session), and (3) the session obtained by stripping off every symbol that is not a rule from a shrinked session (we call it a stripped session).

The rules represent the behavior that is common among visitors from the same class. Conversely, pages that do not form rules represent uncommon behavior. To determine which analysis is worth conducting, we obtained statistics about sessions represented with rules. These statistics would help a marketing or information technology expert to determine questions of interest. We computed statistics on the three representations exemplified in Figure 4: original session, shrinked session, and stripped session. We computed the length of each representation and the reduction rate with respect to the length of the original session. Using the example in Figure 4, the length of the original session is 12, the length of the shrinked session is 5, and the length of the stripped session is 3.
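The following sketch illustrates, under simplified assumptions, how a session could be shrinked and stripped and how the reduction rate is computed. The rule names, page symbols, and greedy left-to-right matching strategy are illustrative and not taken from the paper; the reduction rate is assumed to be one minus the ratio of the new length to the original length.

```python
# Sketch of the three session representations of Figure 4, assuming each rule is
# given as a list of page symbols. All names and values below are hypothetical.
rules = {
    "Start payment": ["Bk", "Bl"],
    "Pay via control panel": ["Bm", "Bn", "Bo"],
}

def shrink(session, rules):
    """Replace every occurrence of a rule's page sub-sequence with the rule name."""
    out, i = [], 0
    while i < len(session):
        for name, pages in rules.items():
            if session[i:i + len(pages)] == pages:
                out.append(name)
                i += len(pages)
                break
        else:
            out.append(session[i])
            i += 1
    return out

def strip(shrinked, rules):
    """Keep only the rules of a shrinked session (stripped session)."""
    return [symbol for symbol in shrinked if symbol in rules]

original = ["Aa", "Bk", "Bl", "Cc", "Bm", "Bn", "Bo", "Cd"]
shrinked = shrink(original, rules)
stripped = strip(shrinked, rules)
reduction = 1 - len(shrinked) / len(original)   # reduction rate w.r.t. the original session
print(shrinked, stripped, round(reduction, 2))
```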


For each class of visitors, we computed the length of the three representations and the reduction rate. As an example, in Table 6, we show the results for the class of visitors "Made payment". We can see that the average reduction rate is 0.54 in shrinked sessions. For stripped sessions, the average reduction rate is 0.95. To better understand how the reduction rate behaves, we obtained a histogram of the reduction rate. In Figure 5, we show the histogram for shrinked and stripped sessions.

**Figure 4.** Example of a session represented with rules. (**A**) shows the original session. The green (respectively, red) arrow indicates the entry (respectively, exit) page. Circles with a blue border are pages that are part of a rule. This session belongs to visitors from the class "Other conversions". Therefore, we only used the rules identified in that class of visitors. We obtained (**B**) by representing the session with rules. (**C**) is the result of removing all pages that do not form a rule.

**Table 6.** Statistics of the length of sessions represented as rules. Metrics in rows 1 to 3 refer to the length of sessions in each representation. Metrics in rows 4 and 5 refer to the reduction rate with respect to the original session. In the last row, we indicate the percentage of sessions that we used in calculations. A total of 30% of sessions do not include any rule. Thus, in the last column, the percentage is reduced to 70%.


**Figure 5.** Histogram of the reduction rate of visitors from the class "Made payment". The reduction rate is calculated with respect to the length of the original session. The blue histogram corresponds to the shrinked sessions; their reduction rate varies from 0 to almost 1. The green histogram corresponds to the stripped sessions; their reduction rate varies from 0.8 to almost 1.

#### *3.2. Selection of the Session Representation to Visualize*

Previous statistics would help marketing or information technology experts to determine questions of interest; for example: What is the common navigation behavior in each class of visitors? What is different in the navigation behavior of each class of visitors? What navigation behavior is common among all classes of visitors? What are the relevant entry and exit milestones (rules) in each class of visitors? A different session representation is useful for each question.

In our case, the objective is to capture the milestones of different classes of visitors in a reduced representation of their navigation behavior. Therefore, for further analysis, we used the shrinked sessions. These allowed us to reduce the amount of data to analyze and contrast different classes of visitors. The elimination of pages that do not form rules does not affect our objective. On the contrary, including them would introduce information about the individual navigation behavior of visitors. Nevertheless, the analysis of pages that do not form rules could be relevant for other purposes; for example, to find out what distinguishes visitors from the same class.

Using shrinked sessions, we created a graph visualization that allows us to summarize the navigation behavior of each class of visitors. That visualization is presented next in Section 4.

#### **4. Results**

Shrinked sessions contain the milestones of the navigation behavior of visitors. In this section, we present those shrinked sessions in a graph visualization aimed at our target audience, marketing experts. Our work was not focused on visualization techniques, but the use of a graph is user friendly. It also assists in analyzing and comparing different rules or different classes of visitors. In Section 4.1, we explain how we built the graph. In Section 4.2, we describe the visualization of a whole class of visitors. Then, in Section 4.3, we exemplify the analysis of a single rule. Finally, in Section 4.4, we contrast different classes of visitors.

#### *4.1. Graph Creation*

In this subsection, we describe the concepts and calculations that we used to build a graph that describes the navigation behavior of visitors.

#### 4.1.1. Definitions

Consider *i* and *j* rules in the class of visitors *A*:


#### 4.1.2. Calculation Example

Consider the rules *a* and *b* found in the class of visitors "A". This class has 10 sessions. In addition, consider the following information:


In order to construct the graph, it is necessary to calculate the entry rates, the exit rates, and the weights.


**Table 7.** Example of weight calculation. The out-degree frequency *Oi* is the sum of frequencies *fij* of edges with the same source rule. Thus, *Oa* = 5 + 3 + 2 = 10 and *Ob* = 6 + 8 = 14.
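A minimal sketch of this calculation is given below. It assumes that the weight of an edge is its frequency divided by the out-degree frequency of its source rule (*wij* = *fij*/*Oi*); the edge targets in the example are hypothetical, and only the frequency values match the example of Table 7.

```python
# Sketch of the weight calculation, assuming w_ij = f_ij / O_i.
from collections import defaultdict

# f_ij: number of times rule j immediately follows rule i in the shrinked sessions.
# The targets ("b", "exit", "a") are hypothetical; the frequencies match Table 7.
frequencies = {("a", "b"): 5, ("a", "exit"): 3, ("a", "a"): 2,
               ("b", "a"): 6, ("b", "exit"): 8}

out_degree = defaultdict(int)
for (source, _), f in frequencies.items():
    out_degree[source] += f          # O_a = 5 + 3 + 2 = 10, O_b = 6 + 8 = 14

weights = {(i, j): f / out_degree[i] for (i, j), f in frequencies.items()}
print(weights[("a", "b")])           # 5 / 10 = 0.5
```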


#### 4.1.3. Graph Example

In Figure 6, we show the graph obtained in our example.

**Figure 6.** (**a**,**b**) Graph example. Yellow nodes are the rules where conversion occurs. The arrow thickness corresponds to the edge weight. The green (respectively, red) arrows indicate the entry (respectively, exit) rate. The values in the middle of the arrows indicate the weight of the edge (*wij*). If there were edges with a weight <0.05, they would be in a lighter grey, and their weight would not be shown.

#### *4.2. Visualization of a Whole Class of Visitors*

We created the graph for each class of visitors as described in Section 4.1. In Figure 7, we show the graph of visitors that belong to the class "Made payment". The marketing expert could determine if the observed behavior is expected or if there is suspicious or interesting behavior that is worth investigating. The interpretation of the graph depends on the business questions in which the marketing expert is interested. Below, we present our interpretation.

**Figure 7.** Graph of shrinked sessions for visitors from the class "Made payment". Yellow nodes are the rules where the payment is confirmed.

#### 4.2.1. Relevant Entry and Exit Rules

In Figure 7, we can see that the rule "Go to control panel" has the highest entry rate (0.32). A total of 32% of the sessions have this rule as the entry point. The rule "Pay via control panel" has the highest exit rate (0.17). A total of 17% of the sessions have this rule as the exit point.

Based on the weight of the edges, there are three rules with a weight >0.90 on their red arrow. Those rules are "Pay and modify product", "Pay for a service", and "Modify product and pay". This means that almost all visitors who follow those rules leave the website after that. Contrarily, only 25% of visitors who follow the rule "Consult availability and pay" leave the website.

#### 4.2.2. Most Frequent Path

If we follow the path of the highest entry rate and highest weights, we can see that 32% of visitors enter by the rule "Go to control panel". From there, 41% of visitors follow the rule "Start payment". Then, 52% of visitors follow the rule "Pay via control panel". In this rule, the payment is confirmed. After that, 83% of visitors leave the website. The marketing expert could evaluate if this path was expected. For example, is that sequence of 15 pages adequate? Could it be shorter? Was it expected that visitors leave the website immediately after the payment?

#### 4.2.3. Rules in Which Conversion Occurs

From all of the rules in which the payment occurs, most visitors leave the website. The marketing expert could determine feasible strategies to retain visitors after a purchase; for example, a flash discount on the purchase of additional service. There is an exception in the rule "Consult availability and pay". From this rule, 73% of visitors continue with the rule "Pay for a service". Those visitors made two payments because, in both rules, the payment is confirmed.

Besides observing the "big picture" in the graph, it is also possible to analyze specific rules in more detail. Next, we exemplify it.

#### *4.3. Analysis of Specific Rules*

From the rules in which the payment does not occur, the rule "Make invoice" has the highest exit rate. Therefore, we will analyze this rule further. Visitors who follow this rule mainly come from the rules "Start payment" and "Go to control panel". In those rules, the payment has not been confirmed. Additional analysis from a marketing expert is needed to determine the reasons for the described behavior. For example, is the process to make the invoice clear? Is it a long process? Does it have an annoying bug? Does it require redundant information? Is it used by clients or competing companies to inquire about the prices of products or services?

After knowing the reasons for losing visitors in the rule "Make invoice", marketing experts could design strategies for retaining them; for example, proactive online help. If the marketing team does not have enough information to determine why visitors are leaving after this rule, different actions could be implemented; for example, a pop-up window to rate the process to make the invoice. Even users who confirm the payment may provide useful information about this process.

Besides reviewing rules in detail, the graph representation also allows us to compare different classes of visitors. This is exemplified next.

#### *4.4. Contrasting Different Classes of Visitors*

The comparison of different classes of visitors depends on the behavior in which the marketing expert is interested. Below, we present how we contrasted the classes of visitors "Made payment" (shown in Figure 7) and "Started payment" (shown in Figure 8).

**Figure 8.** Graph of shrinked sessions for visitors from the class "Started payment". Yellow nodes are the rules where the payment is started.

#### 4.4.1. Contrasting the Exit Rule

Most visitors from the class "Made payment" leave the website after following the rule "Pay via control panel". This is a rule in which the payment is confirmed. Most visitors from the class "Started payment" leave the website after following the rule "Consult payment details". This is a rule in which visitors start the payment. This indicates that most visitors who start a payment but do not confirm it leave the website immediately after consulting the payment details instead of navigating further or requesting online help.

In both classes, the highest exit rate is in the rules in which a conversion is performed, even though the conversion is different in each class of visitors.

#### 4.4.2. Contrasting a Common Rule

In both classes of visitors, the rule "Go to control panel" has the highest entry rate. Nevertheless, the exit rate is double in visitors from the class "Started payment". After following the rule "Go to control panel", most visitors from the class "Made payment" start the payment process, whereas most visitors from the class "Started payment" modify the product information. This could be useful for encouraging the purchase in the pages where the product information is modified.

#### 4.4.3. Contrasting the Most Frequent Path

The path with the highest entry rate and weights, in visitors from the class "Made payment", is "Go to control panel" (32%) → "Start payment" (41%) → "Pay via control panel" (52%) → Exit (83%). In visitors from the class "Started payment", the path with the highest entry rate and weights is "Go to control panel" (44%) → "Modify product information" (41%) → Exit (49%). This confirms the relevance of the rule "Modify product information" as an exit point.

#### **5. Discussion**

We presented the graph of shrinked sessions for different classes of visitors. This visualization reduces the amount of data that marketing experts would have to analyze for understanding the navigation behavior of visitors. It also assists in contrasting different classes of visitors. Our main contributions were: (1) to obtain results that are easy to interpret and could be meaningful for marketing experts. We achieved this by using classes of visitors associated with conversions of the sales funnel; and (2) to ease the analysis of results with a simplified description of the navigation behavior of visitors. This was created by using rules, which are the sequences of pages of common occurrence.

In Section 1.1 "Previous work", we describe previous commercial and non-commercial approaches related to our work. In Table 8, we summarize their differences with our method.

**Table 8.** Differences between our method and previous approaches.


Web analytics software is essential for measuring website traffic and following up on marketing campaigns. Nevertheless, its standard configuration and reporting options make it hard to extract high-level knowledge about the navigation behavior of different classes of visitors, especially due to the high amount of website traffic and the diversity of paths that visitors could follow. Most non-commercial approaches focus on finding clusters of visitors or improving runtime performance, and the proposed visualizations also provide a page-by-page detail that may lead to enormous graphs that are difficult to analyze. To highlight the advantages of the method that we propose, we compared the resultant visualization (Figure 8) with the visualization of Google Analytics (Figure 2) and a non-commercial approach (Figure 9). We will refer to the last two as CSW (commercial software) and NCA (non-commercial approach), respectively. Next, we list the most relevant differences:


Our method helps to identify points of interest whose interpretation is enriched by the opinion of a business expert. This approach assists in answering business questions in the context of the navigation behavior of all visitors in a given class, which is opposite to existing solutions that mainly aim to analyze the web page performance in detail. Next, we mention some examples of findings that would be difficult to obtain in a graph with page-by-page detail:


have a loop. A call to action in the web pages of rules 1 and 2 could decrease their dropout rate.


**Figure 9.** An example of the visualization proposed by B. Cervantes et al. [35]. Blue nodes are web pages, stars are objective pages, and nodes with country flags are visits of users from that country.

#### **6. Conclusions**

To describe the navigation behavior of visitors, we proposed a clickstream analysis. It is based on identifying actions that are repeated by users of the same class, considering an action as a sequence of visited pages. To assist marketing experts in the interpretation of results, we created a graph representation. Our proposal is a starting point to further simplify the analysis of the navigation behavior of visitors or the extraction of knowledge for a marketing audience. Next, we summarize the contributions of our method, its limitations, and future work.

#### *6.1. Contributions*

There are three main advantages of our methodology over other existing solutions. The first advantage is that it reduces the amount of data to analyze for understanding the navigation behavior of visitors. The second advantage is that it extracts the natural navigation behavior of visitors. The third advantage is the use of web logs as entry data. Below, we explain them in more detail.

The increasing amount of data generated on websites makes it difficult to find relevant knowledge. With our method, we replace tens of pages with a single graph. Besides summarizing the navigation behavior of a whole class of visitors, our approach also allows us to compare different classes of visitors. This knowledge could be used to improve the effectiveness of marketing campaigns or website design. For example, an action (which is composed of a sequence of pages) could be especially successful at attracting new visitors, but unsuccessful at making clients purchase. Marketing experts could design strategies for visitors to leap from the interest stage to the purchase stage (e.g., add proactive online help, provide more information about the benefits of the product, or run a retargeting campaign for the visitors who performed visits of that sequence of pages).

With our methodology, we extracted the natural navigation behavior of visitors. This distinguishes our work from other solutions. Previous approaches group the web pages in tasks identified by the business expert. Therefore, they reflect the expected behavior, not the natural paths followed by visitors. Our approach, on the contrary, obtains the common sub-sequences of visited pages that visitors follow. This approach allowed us to find useful information; for example, that the making of the invoice is a relevant exit point. It also enabled us to contrast relevant entry and exit rules for different classes of visitors.

The use of web logs as entry data allows for the performance of a retrospective analysis. This is not possible in commercial software, where the configuration of conversions and market segments usually applies only for future traffic of the website. The use of web logs also allows us to prepare data according to different objectives. For example, we could compare different periods of a given class of visitors or different website versions.

#### *6.2. Limitations*

Our method focuses on a high-level understanding of navigation behavior. However, some business questions necessarily require a detailed web-page-level review. For those cases, commercial software is already effective; for example, if we want to know where traffic flows to after the visit of a specific web page.

Our approach can extract facts; for example: out of N possible ways to make a payment, which one is the shortest or the most frequent. However, in other cases, the findings are only the beginning of the discussion and require the intervention of a business expert. For example, a loop is not necessarily bad, but the interpretation of a human expert is needed to identify which ones are worth analyzing and correcting.

The effectiveness of our method relies on the existence of sequences of pages that are common among visitors to a website. However, there is a possibility that the navigation is too sparse for a particular class of visitors. In those cases, there will be no rules with a relevant frequency. While this would in itself be a finding, it would not be possible to construct a graph for further analysis by a business expert.

#### *6.3. Future Work*

There is a latent need for creative ways to help business experts evaluate the performance of a website. Next, we mention a few examples of how our method could be improved or extended:


behavior of visitors. On the contrary, the company may find it useful to associate each page of interest with a conversion.


Our approach responds to the need for a high-level description of the navigation behavior of website visitors. It does not replace the functionality of existing web analytics software; on the contrary, it can complement it. Our method can also be used with existing classification techniques. This work is a starting point for business questions that require understanding the navigation behavior of website visitors in a wide context. We believe that this is a very promising area of research.

**Author Contributions:** Conceptualization, R.M. and B.C.; methodology, A.H.; software, A.H.; validation, A.H.; formal analysis, A.H.; investigation, A.H.; resources, R.M. and B.C.; data curation, A.H.; writing—original draft preparation, A.H.; writing—review and editing, R.M. and B.C.; visualization, A.H.; supervision, R.M. and B.C.; project administration, A.H. and R.M.; funding acquisition, R.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** The research reported here was supported by Consejo Nacional de Ciencia y Tecnología (CONACYT) studentship 957562 to the first author.

**Institutional Review Board Statement:** Not applicable.

**Data Availability Statement:** We did not use publicly available datasets. We thank NIC Mexico for providing the data used in this research.

**Acknowledgments:** The authors acknowledge the technical support of Tecnologico de Monterrey, Mexico. We also thank NIC Mexico for providing the data used in this research.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


**A Deep Neural Network Technique for Detecting Real-Time Drifted Twitter Spam**

**Amira Abdelwahab 1,2,\* and Mohamed Mostafa 2**


**Abstract:** Social networks are considered a part of most users' lives, as they contain more than a billion users, which makes them a target for spammers to spread their harmful activities. Most recent research focuses on detecting spammers using statistical features. However, such statistical features change over time, and spammers can defeat detection systems by changing their behavior and using text paraphrasing. Therefore, we propose a novel technique for spam detection using a deep neural network. We combine tweet-level detection with statistical feature detection and combine their results in a meta-classifier to build a robust technique. Moreover, we augment our technique with text paraphrasing of each detected spam tweet. We train our model using different datasets: random, continuous, balanced, and imbalanced. The obtained experimental results show that our model achieves promising accuracy, precision, and execution time, which makes it applicable to social networks.

**Keywords:** spam detection; deep learning; semantic similarity; social network security

**1. Introduction**

Currently, many internet users can share information and collaborate inside online social networks (OSNs). Twitter is viewed as the most well-known social network, offering free blogging services that let clients publish their news and thoughts within 280 characters. Clients can follow others through various platforms [1]. Every day, a huge number of Twitter clients share their status and news about their discoveries [2]. However, the Twitter platform also attracts criminal accounts (spammers) that tweet spam content, which may incorporate destructive URLs. These could divert clients to malevolent or phishing sites to bring in money illegitimately [3,4] by attacking the client's profile. As Twitter caps the length of tweets, spammers deceive clients by embedding misleading content or malicious URLs that divert them to external sites [5]. In an investigation studying the correlation between email and social spam, the click-through rate of Twitter spam was found to reach 0.13%, whereas that of email spam is only 0.0003–0.0006% [6]. Moreover, social spam is viewed as increasingly perilous and cheats a lot of clients [7].

To tackle this problem, many researchers have focused on detecting spammers by discovering statistical features of spammers at both the message and account levels. Message-level detection approaches focus on checking tweet content to find keyword patterns, hashtags, and URLs. These approaches are shown to be effective, but real-time detection is needed to cope with the huge number of messages posted per hour. Account-level approaches focus on extracting statistics and information about the behavior of each account to classify whether it is a spam account or a legitimate user. However, an experimental study was conducted to examine whether the statistical features change over time, and its results proved that they do.

**Citation:** Abdelwahab, A.; Mostafa, M. A Deep Neural Network Technique for Detecting Real-Time Drifted Twitter Spam. *Appl. Sci.* **2022**, *12*, 6407. https://doi.org/10.3390/ app12136407

Academic Editors: Dionisis Margaris and Stefanos Ougiaroglou

Received: 15 May 2022 Accepted: 20 June 2022 Published: 23 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Most researchers focus on collecting these features and profiling spammers' behavior, ignoring that these features drift over time and that spammers will try to evade them. In this paper, an effective technique is proposed to tackle the aforementioned limitations. Our proposed technique considers the content of each tweet in addition to the statistical features. Moreover, it has an auto-learning capability to find the features that make it able to classify each tweet as spam or not with high accuracy in a reasonable time.

Accordingly, these challenges inspire us to investigate this problem and contribute to spam detection approaches. To cope with this problem, we propose a framework that contains three stages to detect spammers:


The rest of this manuscript is organized as follows. Section 2 briefly discusses the literature review on Twitter spam detection. Section 3 clarifies the problem statement of spam drift in detail. Section 4 explains our proposed detection framework. Section 5 discusses our experiments and results. Finally, conclusions are presented in Section 6.

#### **2. Literature Review**

Many studies have been performed to address spam detection challenges. These studies can be organized into three categories [8]: syntax analysis, feature analysis, and blacklist techniques, as shown in Figure 1.

Most of the research applied blacklisting techniques based on URLs in the tweets using third-party tools, such as Trend Micro or Google Safe Browsing. However, S. Savage [9] created a lightweight technique for spam detection, while [10] filtered tweets based on checking URLs in tweets, username patterns, and hashtags.

**Figure 1.** Twitter spam detection taxonomy.

Consequently, a lot of researchers have applied machine learning (ML) techniques in their works [11–14] and extracted user features, such as number of followings, username pattern, and account creation date, in addition to content features, such as length of tweets, number of hashtags, and hashtag patterns. The authors in [11] employed honeypots to collect spammers' profiles and extract statistical features using different ML algorithms, such as Decorate and Random Subspace. Benevenuto et al. attempted to detect spammers by using a support vector machine (SVM) algorithm [12]. These features can be easily fabricated, as spammers can buy fake followers and followings. Thus, some studies [15] depend on a social graph to tackle the problem of fabrication by calculating the distance and connectivity between the sender and receiver of each tweet to examine whether it is spam. Yang et al. [16] built more robust features using a bidirectional link ratio between centrality and local cluster coefficient, achieving a 99% true positive rate, while [17] provides a solution that can detect most campaigns and classify each of them as spam or not spam using deep learning techniques and semantic similarity methods.

Most of the described methods focus on detecting spam tweets based on statistical features. Some studies employ syntax analysis: a spam dataset based on hashtags was created by [18], in which the authors collected 14 million tweets and classified them using five different techniques. Sedhai and Sun [19] utilized a package of four lightweight techniques to detect spam at the tweet level using part-of-speech tags, content-based, sentiment, and user-based features, using a word vector as the universal feature of their task. Le and Mikolov [20] deployed a deep learning method by constructing a tweet vector that combines the word vector with the document vector for classification with a neural network.

In [21], the authors employ the horse herd optimization algorithm (HOA), inspired by nature-based optimization algorithms. This algorithm emulates the social behavior of horses at various ages. The study reports strong results on complex, high-dimensional problems, solving problems with up to 10,000 dimensions at low cost in terms of time, performance, and complexity. The researchers attempt to find the best solution by employing a multiobjective opposition-based binary variant, which gave good results compared with similar approaches. However, it still depends on statistical features, which can drift over time as explained.

The study by Abayomi-Alli [22] used an ensemble approach to detect SMS spam. This approach depends on two pipelines: a BiLSTM (Bidirectional Long Short-Term Memory) network, which produces accurate results in text classification tasks, and classical machine learning methods. However, this approach does not employ any attention mechanism in the BiLSTM network, which causes it to suffer on long sentences of more than 8 words.

Many different extraction methods have been used for representing tweets, such as [23]. In this reference, the authors analyzed people's sentiments collected through tweets. They employed three different feature extraction methods, domain-agnostic, fastText-based, and domain-specific, for tweet representation. Then, an ensemble approach was proposed for sentiment analysis by employing three CNN models and traditional ML models, such as random forest (RF) and SVM, using the Nepali Twitter sentiment dataset called NepCOV19Tweets. Their models achieve 72.1% accuracy by employing a smaller feature size (300-D). However, these models have two limitations. First, they are complex and need high computational resources for implementation. Second, their methods are based on only semantic features.

In addition, the authors in [24] analyzed people's sentiments using three feature extractions, term frequency–inverse document frequency (TF-IDF), fastText, and a combination of these two methods as hybrid features, for representing COVID-19 tweets. Then, they validated their methods against different ML techniques. Their SVM model obtained the highest accuracy on both TF-IDF (65.1%) and hybrid features (72.1%). The major limitation of this model is its high computational complexity.

TF-IDF [25] may be used to vectorize text into a format that is more suitable for machine learning and natural language processing approaches. It is a statistical measure that we can apply to terms in a text and then use to generate a vector, whereas other methods, such as word2vec [26], will provide a vector for a term and then extra effort may be required to transform that group of vectors into a single vector or other format. Another approach is Bidirectional Encoder Representations from Transformers (BERT), which converts phrases, words, and other objects into vectors using a transformer-based ML model [27]. However, BERT's design also includes deep neural networks, which means it can be significantly more computationally expensive than TF-IDF.

Because our proposed framework will be used with highly intensive data applications, we had to choose a high-performance and quick feature extraction method. TF-IDF produces high accuracy relative to our framework, so we decided to build our model with it.

Most of the mentioned studies focus on extracting the features that can help find spammers, but they ignore a very important problem, which is "spam drift", meaning that these features change over time. Egele et al. [28] built a history-based model, which does not suffer from this problem. The authors in [29] built a fuzzy model that attempts to adapt the features over time, but its accuracy decreases. So, we will focus on this problem and then try to build a robust framework to cope with most of the challenges of detecting Twitter spam.

#### **3. Problem Statement**

The problem addressed in this paper is detecting and classifying each tweet as spam or not. The core difficulty is "spam drift", which arises because most researchers determine spam tweets based on statistical features; most of them focus on the selection of features, as shown in Table 1. In the real world, these features change in an unpredictable way over time. Therefore, we attempted to build a framework that is robust against these changes.


**Table 1.** Comparative study of ten consecutive days between spam and non-spam using KL-Divergence.

First, we demonstrate this problem, as in [29]. We crawled tweets from the Twitter Stream API for 10 consecutive days. We had to check a lot of tweets to determine which are spam. In this stage, we found that most spam tweets contain a URL, which most spammers use to spread their malicious content by sending the victim to mine or farm sites. Therefore, we used Trend Micro's Web Reputation Technology (WRT) to detect whether a tweet is spam based on its URL [22]. This WRT system helps users to identify malicious sites in real time with high reliability, with an accuracy rate of 100% as reported in [30]. Moreover, we made hundreds of manual inspections to ensure the reliability of this system.

As described previously, we found that the statistical features change from day to day with a considerable effect, as shown in Table 1. For example, we found that the average number of account followings changes from the 1st day (500–900) to the 9th day (950–1350). This means that the spammers try to collect followings, so the average number of followings becomes an ambiguous indicator of whether an account is spam or not.

Therefore, to justify the problem of changing statistical features, the distribution of the data should be modeled. There are two types of approaches: parametric and non-parametric. Parametric approaches are used when the distribution of the data is known (e.g., a normal distribution), but the distributions of Twitter's statistical features are unknown [31,32]. So, we used non-parametric approaches. One of the most common non-parametric approaches is the statistical test, whose calculation is based on computing the distance between two distributions to quantify the change between them. The distance is calculated using the Kullback–Leibler (KL) divergence [31], which is also known as relative entropy, shown in Equation (1):

$$D\_{kl}(P|Q) = \sum\_{i} P(i) \log \frac{P(i)}{Q(i)} \tag{1}$$

This formula is used to measure the difference between two probability distributions, as reported in [33]. Let *s* = {*x*1, *x*2,..., *xn*} be a multi-set from a finite set *F* containing numerical feature values, and let *N*(*x*|*s*) be the number of appearances of *x* ∈ *s*. Thus, the relative proportion of each *x* is shown in Equation (2):

$$P\_s(\mathbf{x}) = \frac{N(\mathbf{x}|s)}{n} \tag{2}$$

The ratio *P*/*Q* is undefined if *Q*(*i*) = 0. Therefore, the estimation of *P*s(*x*) is changed to Equation (3):

$$P\_{\mathbf{s}}(\mathbf{x}) = \frac{N(\mathbf{x}|\mathbf{s}) + 0.5}{n + |\mathbf{F}|/2} \tag{3}$$

where |*F*| is defined as the number of elements in the finite set *F*. The distance between two days' tweets, *D*1 and *D*2, is defined as shown in Equation (4):

$$D(D1|D2) = \sum\_{\mathbf{x} \in F} P\_{D1}(\mathbf{x}) \log \frac{P\_{D1}(\mathbf{x})}{P\_{D2}(\mathbf{x})} \tag{4}$$

We calculate the KL divergence of spam and legitimate tweets for each feature in two adjacent days, as shown in Table 1. The larger the distance, the greater the dissimilarity between the two distributions. According to the results in Table 1, the distance is large for most features in the case of spam data. However, for non-spam data, the distance is very small for most of the features. For example, examining the Number\_of\_tweet (f-6) feature in Table 1, we notice that the KL divergence of spam tweets for Day 1 and Day 2 is 0.99, whereas for non-spam tweets it is 0.36, which means that the distribution of this feature changed from Day 1 to Day 2 compared to non-spam tweets. As shown in Table 1, most features change unpredictably from one day to another, although the training data is fixed and is not affected by any changes. Therefore, the performance of the classifiers will become inaccurate if the decision boundary is not updated.
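As an illustration, the following sketch computes the smoothed distributions of Equation (3) and the day-to-day distance of Equation (4) for a single feature; the feature values are hypothetical and do not come from the crawled dataset.

```python
# Sketch of the smoothed KL divergence between two days of one feature.
from collections import Counter
from math import log

def smoothed_distribution(values, feature_domain):
    """Relative proportion of each value with the +0.5 smoothing of Equation (3)."""
    counts = Counter(values)
    n, f = len(values), len(feature_domain)
    return {x: (counts.get(x, 0) + 0.5) / (n + f / 2) for x in feature_domain}

def kl_divergence(day1_values, day2_values):
    """Distance between two days' feature distributions, Equation (4)."""
    domain = set(day1_values) | set(day2_values)
    p = smoothed_distribution(day1_values, domain)
    q = smoothed_distribution(day2_values, domain)
    return sum(p[x] * log(p[x] / q[x]) for x in domain)

# Hypothetical example: a feature observed for spam accounts on two adjacent days.
day1 = [3, 5, 5, 8, 13, 3, 5]
day2 = [40, 55, 60, 40, 75, 80, 55]
print(kl_divergence(day1, day2))   # a large value indicates the feature drifted
```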

#### **4. The Proposed Model**

The process of classifying tweets as spam or not has three challenges. First, the classification process cannot depend only on statistical features because they drift over time, as described; so, our classifier considers the tweet content. Second, our proposed framework must be resilient against spammers, because they change the tweet content to evade any monitoring system [34]; therefore, new spam tweets must be rephrased from the detected spam. Third, a robust framework must be able to detect spam tweets in a short execution time to cope with Twitter's big data challenges. These three challenges motivated us to build the proposed framework, which consists of three layers, as shown in Figure 2.

#### *4.1. Learning from Detected Spam Tweets*

This layer is used to filter Twitter as an initial step for fast detection of spam tweets. As described in Figure 2, our proposed framework is interested in spam tweets so that the next layer can regenerate new semantic variants of the same tweets. In this way, new information or words can be obtained that the spammer might use to paraphrase the tweet content and spread their spam again. In this step, an SVM classifier is utilized. First, this classifier is trained with bi-grams (pairs of consecutive words of each tweet) and transforms the tweets with TF-IDF. Then, new unlabeled tweets are entered into this classifier to classify them (Spam, notSpam). This layer focuses on the spam tweets, which will be the input for the next layer.
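A minimal sketch of this filtering layer is shown below, assuming scikit-learn. The training texts are toy examples, and the C value is the one reported for the SVM in Section 6.1; everything else is a placeholder.

```python
# Sketch: bigram TF-IDF representation fed into a linear SVM, as in the first layer.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ["win a free iphone click here",          # toy spam example
               "meeting moved to 3pm see you there"]     # toy legitimate example
train_labels = ["Spam", "notSpam"]

filter_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(2, 2)),   # bi-grams of consecutive words
    LinearSVC(C=0.1),                      # linear-kernel SVM, C as in Section 6.1
)
filter_clf.fit(train_texts, train_labels)
print(filter_clf.predict(["click here to win free followers"]))  # likely "Spam"
```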

**Figure 2.** The proposed framework.

#### *4.2. Generate New Tweets*

In the real world, researchers try to build robust systems; however, smarter spammers try to defeat these solutions. Therefore, a system for tweet paraphrasing should be built using a method that generates text while preserving the same meaning and semantics, not only focusing on correct grammar. Therefore, we used the encoder–decoder framework [35], embedded with an attention network. The spam tweet paraphrasing model is shown in Figure 3.

**Figure 3.** An overview of spam tweet paraphrasing model.

Given the source spam messages as input from the classifier layer, the encoder packs the source into dense representation vectors called the context vector *ct*, which captures the context information for this message. Then, the decoder tries to generate the paraphrased messages from the encoded vectors according to Equations (5) and (6).

$$c\_t = \sum\_{i=1}^{N} \alpha\_{ti} h\_i \tag{5}$$

$$\alpha\_{ti} = \frac{e^{g(s\_t, h\_i)}}{\sum\_{j=1}^{N} e^{g(s\_t, h\_j)}} \tag{6}$$

where *g*(*st*, *hi*) is an attention score between the encoder state *hi* and the decoder state *st*. Then, the dense representations are fed into an attention layer. For predicting words, the decoder utilizes the combination of the source and target context vectors as the query *qt* shown in Equation (7) to get the word embeddings:

$$q\_t = \tanh(\mathcal{W}\_\mathbf{c}[\mathbf{s}\_t; \mathbf{c}\_t]) \tag{7}$$

The candidate words *Wi* and their corresponding embedding vectors *ei* are stored as key-value pairs {*ei*, *Wi*}. Therefore, our model uses *qt* to query these key-value pairs by scoring all the candidate words against the query *qt*, as shown in Equation (8):

$$f(q\_t, e\_i) = \begin{cases} q\_t^T e\_i \\ q\_t^T W\_d e\_i \\ v^T \tanh\left(W\_q q\_t + W\_e e\_i\right) \end{cases} \tag{8}$$

where *Wq* and *We* are two trainable parameter matrices, and *vT* is a trainable parameter vector. The word with the highest matching score is then returned. The chosen word is emitted as the generated token, and its embedding is then utilized as the input of the long short-term memory (LSTM) at the next step. The word embedding is affected by three sources: the input of the encoder, the input of the decoder, and the query of the output layer. In the training stage, we used the Adam optimizer with the hyper-parameters β1 = 0.9, β2 = 0.999, α = 0.001, and ε = 1 × 10−8.
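The following numpy sketch illustrates one decoder step of Equations (5)–(7), using the dot-product form of the score *g*; the dimensions and random values are placeholders, and the matrices would be trainable in a real model.

```python
# Illustrative attention step: context vector and query for one decoder state.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Return the context vector c_t for one decoder step (Equations (5) and (6))."""
    scores = encoder_states @ decoder_state   # g(s_t, h_i) as a dot product
    alphas = softmax(scores)                  # Equation (6)
    return alphas @ encoder_states            # Equation (5)

d = 8                                     # hidden size (illustrative)
h = np.random.randn(5, d)                 # encoder states h_1 .. h_5
s_t = np.random.randn(d)                  # decoder state s_t
c_t = attention_context(s_t, h)
W_c = np.random.randn(d, 2 * d)           # trainable matrix in a real model
q_t = np.tanh(W_c @ np.concatenate([s_t, c_t]))   # query q_t, Equation (7)
print(q_t.shape)
```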

#### *4.3. Ensemble Method*

In this layer, we propose a novel technique to classify tweets as spam or non-spam, as shown in Figure 4. We combine three deep neural network classifiers for tweet content, which are based on two different architectures, with one classifier for user-based features. First, we will explain the methodology of each component and then explain the whole technique as an ensemble classifier.

**Figure 4.** Ensemble neural network architecture.

#### 4.3.1. Convolution Neural Network

In this section, the convolutional neural network (CNN) is discussed. This network was originally designed for computer vision problems; however, it has been shown that it can also be used in natural language processing (NLP) tasks, as [36] proposed a neural architecture used in many NLP tasks, such as part-of-speech tagging, chunking, and named entity recognition. Our model is inspired by [36]; the layers of this architecture are divided into five parts, input layer, embedding layer, convolution layer, pooling layer, and output layer, as shown in Figure 5. The input layer receives tweet messages as words or embedded words using word2vec or GloVe [37]. Each tweet is split into words with a max\_length value of 50, because the maximum tweet length is 280 characters, which rarely exceeds this number of words; shorter tweets are padded with the value 0. Thereafter, these words are turned into features by performing kernel multiplication and are fed into the next layer, the convolution layer. ReLU, sigmoid, and tanh activation functions are used to obtain the convolution feature map. Then, max pooling is used to select the maximum activation value; max pooling is used in NLP tasks, whereas min and mean pooling are used in computer vision tasks. A fully connected hidden dense layer with a sigmoid activation function is applied to classify the tweets. L2 regularization is used to avoid overfitting. To build this architecture, we used binary cross-entropy as the loss function along with the corresponding optimizer parameters.

**Figure 5.** Neural network architecture with four conv. layers.
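As a rough illustration of this branch, the following Keras sketch assembles an embedding layer, one convolution layer, max pooling, and a sigmoid output with L2 regularization. The max length of 50 follows the text and the 200-dimensional embeddings match the GloVe dimension reported in Section 6.3; the vocabulary size, filter count, and remaining hyper-parameters are placeholders.

```python
# Minimal Keras sketch of a CNN text classifier for spam / not-spam.
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_length = 20000, 200, 50    # placeholders except 200-D / 50

model = models.Sequential([
    layers.Input(shape=(max_length,)),
    layers.Embedding(vocab_size, embed_dim),           # embedding layer (e.g., GloVe)
    layers.Conv1D(128, 5, activation="relu"),           # convolution layer
    layers.GlobalMaxPooling1D(),                        # max pooling over the sequence
    layers.Dense(64, activation="relu",
                 kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # L2 regularization
    layers.Dense(1, activation="sigmoid"),               # spam probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```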

#### 4.3.2. Recurrent Neural Networks

A recurrent neural network (RNN) is a network with directed connections between nodes. The main feature of this network is the hidden state (memory), which can capture the sequential dependence in data. We utilized LSTM networks in our work rather than the gated recurrent unit (GRU) [38], which has a problem with remembering long sequences. As shown in Figure 6, we used the same architecture as the CNN, but we replaced the convolution layer with an LSTM layer, which contains three main gates: the forget gate controls what information should be thrown away from memory; the input gate controls what new information from the current input should be added to the hidden state; and the output gate decides what information to output from the memory. The output of this layer is then fed into a fully connected dense layer to produce the output.

#### 4.3.3. Feature-Based Model

Statistical features give good results in spam detection classifiers [8]. Apart from using word embeddings as described in the previous two sections, we also consider user-based features in our classifier.

A dataset with 6 million tweets is used to extract these features, especially the user-based features [29]. The extracted features that can differentiate between spam and legitimate users are presented in Table 2. To represent the behavior of spam and legitimate accounts, a comparative study was carried out for each extracted feature using the empirical cumulative distribution function (ECDF), as shown in Figure 7.

**Figure 6.** Recurrent neural network architecture.

**Table 2.** Extracted Features with the Corresponding Description.


The experimental study found that more than 53% of spam users have an account age of less than 500 days, whereas only 38% of non-spammers do. This means that spammers keep creating new accounts to spread their attacks after being blocked by spam detection techniques. Regarding the number of user mentions, most spammers include more than one user mention to spread their data. Regarding the number of capital words, most spammers use capital words to attract users: more than 70% of spammers use capital words in their tweets compared to only 30% of non-spammer users. In addition, we have also defined a new attribute called the reputation of users, which is calculated as shown in Equation (9):

$$\text{Reputation} = \frac{\text{number of followers}}{\text{number of followers} + \text{number of following}} \tag{9}$$

However, we found that the reputation of spammers is always small. They always have more followings than followers, because they try to acquire fake followers or followings to make the account look real.
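As a small illustration, the sketch below computes the reputation of Equation (9) alongside a few of the user-based features in Table 2; the field names of the account record and the example values are hypothetical.

```python
# Illustrative user-based feature extraction, including the reputation of Equation (9).
def user_features(account):
    followers = account["followers"]
    followings = account["followings"]
    total = followers + followings
    return {
        "account_age_days": account["account_age_days"],
        "number_of_followers": followers,
        "number_of_followings": followings,
        "reputation": followers / total if total else 0.0,   # Equation (9)
    }

spam_like = {"account_age_days": 120, "followers": 30, "followings": 900}
print(user_features(spam_like)["reputation"])   # small reputation, typical of spammers
```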


**Figure 7.** (ECDF) User-based features comparison: (**a**) account age; (**b**) number of followers; (**c**) number of digits; (**d**) reputation; (**e**) number of URLs; (**f**) number of user favorites; (**g**) number of retweets; (**h**) number of tweets; (**i**) number of characters; (**j**) number of followings; (**k**) number of user mentions; (**l**) number of lists.

#### 4.3.4. Proposed Ensemble Approach

As shown in Figure 4, this architecture contains three different neural networks combined with one classifier for user-based features, and is described as follows:


Furthermore, a neural network meta-classifier is trained on the newly created data. It consists of three layers: four input nodes, eight hidden nodes with a bias and the ReLU activation function, and a single output node with the sigmoid activation function that generates a value from 0 to 1.
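A minimal sketch of such a meta-classifier is shown below, assuming TensorFlow/Keras; the interpretation of the four inputs as the scores of the four base classifiers is our reading of Figure 4.

```python
# Sketch of the meta-classifier: 4 inputs, 8 hidden ReLU nodes with bias, 1 sigmoid output.
from tensorflow.keras import layers, models

meta = models.Sequential([
    layers.Input(shape=(4,)),                        # scores of the four base classifiers
    layers.Dense(8, activation="relu", use_bias=True),
    layers.Dense(1, activation="sigmoid"),           # spam probability in [0, 1]
])
meta.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
meta.summary()
```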

#### **5. Experiments and Results**

In this section, we will present our experiments for each approach with different datasets for detecting the spam tweets in the Twitter platform. Firstly, we will give a brief description of our datasets and the evaluation metrics used in this study, then we will discuss our results of each approach.

#### *5.1. Dataset*

A ground truth dataset called HSpam is used [18]. It contains 14 million tweets collected over two months and labeled using several methods, such as manual annotation, KNN-based annotation, user-based annotation, domain-based annotation, and reliable ham tweet detection. Due to the privacy policy of the Twitter platform, we had to retrieve the tweets using tweet\_id, and some tweets had been deleted or were missing, so we focus only on the returned tweets. To evaluate our approaches over many datasets, we split our dataset into 4 samples, as shown in Table 3. We made two balanced samples, one with random selection and another with continuous selection. Then, we selected another two samples with a spam to non-spam ratio of 1 to 20, since in real life only about 5% of tweets are spam [6]; these two samples simulate real-life data. For testing our approaches, we selected a random sample of 0.5 million tweets to make a fair comparison between all dataset samples and all approaches.



#### *5.2. Evaluation Metrics*

To evaluate our approach, we used the metrics of recall, precision, and F1-score, which are shown in Equations (10)–(12), respectively. We consider spam tweets as positive and non-spam tweets as negative. Then, we constructed the confusion matrix accordingly, as shown in Table 4, where TP (true positive) refers to all spam tweets that are predicted correctly as spam, FN (false negative) denotes all spam tweets that are predicted wrongly as non-spam, TN (true negative) denotes all non-spam tweets that are predicted correctly as non-spam, and FP (false positive) refers to all non-spam tweets predicted wrongly as spam.

$$\text{Recall} = \frac{\text{TP}}{(\text{TP} + \text{FN})} \tag{10}$$

$$\text{Precision} = \frac{\text{TP}}{(\text{TP} + \text{FP})} \tag{11}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{12}$$
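For illustration, these quantities can also be computed with scikit-learn, assuming binary labels in which 1 denotes spam and 0 denotes non-spam; the labels below are placeholders.

```python
# Sketch: computing recall, precision and F1-score with scikit-learn,
# assuming label 1 = spam (positive) and label 0 = non-spam (negative).
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FN, TN, FP:", tp, fn, tn, fp)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("F1-score :", f1_score(y_true, y_pred))
```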

#### **Table 4.** Confusion Matrix.

|                     | Predicted Spam | Predicted Non-Spam |
|---------------------|----------------|--------------------|
| **Actual spam**     | TP             | FN                 |
| **Actual non-spam** | FP             | TN                 |


#### *5.3. Experiments Settings*

We ran our experiments on Linux Ubuntu 18.04 LTS with an Intel(R) Core(TM) i7 CPU and 16 GB of RAM. For each run over each dataset with every model, we divided the data into 80% for training and 20% for testing. The basic parameters used in each model (embedding layer, dropout, number of filters, and dense network) are reported in the corresponding figures in the previous section.

#### **6. Results and Discussion**

In this section, we discuss the results of each model in our proposed framework and compare them with the latest frameworks.

#### *6.1. Primary Twitter Filter*

In this module, the MaxEntropy, random forest, and SVM classifiers are implemented and compared. As shown in Table 5 and Figure 8, SVM achieved the best results in terms of recall, precision, and F1-score for most datasets, so it was selected for our framework with parameters C = 0.1, a linear kernel, and an l2 penalty.
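A sketch of such a filter built with scikit-learn follows; the TF-IDF representation and the toy tweets and labels are assumptions for illustration, not the paper's exact preprocessing.

```python
# Sketch: linear SVM as a primary tweet filter with the parameters reported above
# (C = 0.1, linear kernel, l2 penalty). The TF-IDF representation is an assumption.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

tweets = ["win a free iphone now!!!", "meeting friends for coffee today",
          "CLICK this link to claim your prize", "great talk at the conference"]
labels = [1, 0, 1, 0]   # hypothetical labels: 1 = spam, 0 = non-spam

X = TfidfVectorizer().fit_transform(tweets)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=42)

svm = LinearSVC(C=0.1, penalty="l2")   # linear kernel with l2 penalty
svm.fit(X_tr, y_tr)
print(classification_report(y_te, svm.predict(X_te), zero_division=0))
```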

**Figure 8.** ROC curves for the comparative study of the SVM, MaxEntropy, and Random Forest algorithms for each dataset as the first module for filtering the tweets.


**Table 5.** Evaluation Results for Dataset 1.

#### *6.2. User-Based Features*

As discussed earlier, statistical features change over time, but this does not prevent them from detecting spammers' actions with high accuracy and precision. Therefore, we attempted to find new user-based features. SVM and random forest were compared to select the best algorithm for our detection framework. As shown in Tables 5 and 6, random forest achieves the best results in terms of precision and recall when trained on the 6-million-tweet dataset [40] to obtain the user-based statistical features.


**Table 6.** Evaluation Results for Dataset 4.

#### *6.3. Ensemble Method*

This is the main module and consists of three main algorithms, as discussed previously. They are trained with the Twitter GloVe word embeddings [37] for all dimensions: 25, 50, 100, and 200. The results for each dimension are compared over our four datasets for each model, as shown in Figures 9–11. We found that the 200-dimensional embeddings give better results for all three models: CNN, LSTM, and CNN with SVM.

**Figure 9.** ROC curve for the CNN model results for each dataset as the first component in our ensemble method.

**Figure 10.** ROC curve for the LSTM model results for each dataset as the third component in our ensemble method.

**Figure 11.** ROC curve for CNN features with SVM model results for each dataset as a second component in our ensemble method.

The CNN model is very good at finding patterns: each convolution fires when a learned pattern is detected, but it struggles with long patterns or long tweets, which lowers its precision and F1-measure. Therefore, we added the LSTM model, which is built on an RNN and handles long sequences much better than a CNN. Some studies [43] have replaced the final softmax function with an SVM model, which determines the optimal hyperplane separating the two classes in the dataset; the multinomial case is set aside. When an SVM is used for multinomial classification, the problem becomes one-versus-all, in which the class with the highest score is taken as positive while the rest are treated as negative.
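As an illustration of this softmax-to-SVM substitution, a common implementation (an assumption here, not necessarily the exact approach of [43] or of our framework) trains a linear output layer with a hinge loss on labels encoded as ±1; the architecture sizes below are placeholders.

```python
# Sketch: replacing the softmax output of a text CNN with an SVM-style layer
# by training a linear output with a hinge loss (labels encoded as -1 / +1).
# Vocabulary size, sequence length and filter count are assumptions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB, MAXLEN, EMB_DIM = 10000, 50, 200   # hypothetical sizes

cnn_svm = keras.Sequential([
    layers.Input(shape=(MAXLEN,)),
    layers.Embedding(VOCAB, EMB_DIM),
    layers.Conv1D(128, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="linear",
                 kernel_regularizer=keras.regularizers.l2(0.01)),  # SVM-style margin
])
cnn_svm.compile(optimizer="adam", loss="hinge")

X = np.random.randint(0, VOCAB, size=(256, MAXLEN))   # hypothetical tokenized tweets
y = np.random.choice([-1, 1], size=(256, 1))          # hinge loss expects -1 / +1
cnn_svm.fit(X, y, epochs=2, batch_size=32, verbose=0)
```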

#### *6.4. Meta-Classifier*

To obtain the final results of our proposed framework, we build a sequential neural network that assembles the outputs of the methods used: LSTM, CNN, CNN features with SVM, and user-based features, as presented in Figure 12. As shown in Tables 5 and 6, the proposed model achieves the best results in terms of accuracy, precision, and recall compared to the latest research in this field. Although [41] has the lowest execution time, at 0.002 per tweet, our proposed method takes longer, approximately 2 ms per tweet, because of the number of features used to detect spam and the combination of models each tweet must pass through to obtain the final result. However, this time can be reduced by using clusters of nodes.

We also found that the meta-classifier does not boost the results very much, as they are close to those of the ensemble model, but it helps preserve the performance across this dataset. We can therefore offer a robust framework that can be retrained with the new words and hashtags spammers may use, since Twitter constantly sees new subjects and user interests.

#### *6.5. Performance of Learned Model*

Twitter is considered a real-time platform. Therefore, it is extremely important to block spam tweets before they spread, to preserve the safety of users and prevent any potential damage. The proposed framework is therefore designed with the execution time of the detection process in mind. The processing time is calculated for the whole framework for each tweet; we found that each tweet takes 1–2 ms to be classified as spam or not. This value is acceptable for real-time applications, and it can be decreased further by using clusters of these models, which allow the detection process to be parallelized. However, most spammers are always thinking outside the box.

They try to deceive detection strategies by changing keywords and content and by trying new features that can slip past detection methods while still attracting users. On the other hand, legitimate users post about new trending topics and events as they happen. We therefore need to retrain the detection framework periodically to preserve its accuracy and performance, a need we took into account when designing our framework, whereas systems that depend only on statistical features become useless over time. Our framework combines the statistical features with the deep learning features, so it is very difficult for a spammer to fool our detection system. Furthermore, we conducted four experiments with different datasets to test our framework and concluded that it gives good results on both balanced and imbalanced data: the imbalanced dataset 4 contains 1 million tweets and the balanced dataset 1 contains 0.4 million tweets, and they gave the same precision and F1-measure, which demonstrates the robustness of our detection framework, as shown in Tables 5 and 6.


**Figure 12.** ROC curve for the results of our proposed framework for different datasets 1, 2, 3, and 4.

#### **7. Conclusions**

In this paper, we have proposed an ensemble learning framework based on deep learning techniques for detecting spam tweets using two methods: first, working at the tweet level by building three robust models; second, working with user-based features to combine user information with the words in each tweet. We also tried to stay a step ahead by generating new spam tweets to train our models to catch any spam paraphrasing that spammers may attempt in order to deceive users. The proposed model has been trained on four datasets totaling more than 7 million tweets to build a robust framework. The experiments show that our proposed model gives excellent results compared to other methods in an acceptable time.

In future work, we will conduct more experiments on online social networks other than Twitter. We will also consider other data formats, such as images and videos, that can affect OSN platforms. In addition, we need to test our model on new real data to study whether our framework is affected by changes in the data.

**Author Contributions:** Conceptualization, A.A. and M.M.; methodology, A.A.; software, M.M.; validation, A.A. and M.M.; formal analysis, M.M.; investigation, A.A.; resources, A.A.; data curation, M.M.; writing—original draft preparation, M.M.; writing—review and editing, A.A.; visualization, M.M.; supervision, A.A.; project administration, A.A.; funding acquisition, A.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported through the Annual Funding track by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Project No. AN000417].

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** The dataset is available on http://nsclab.org/nsclab/resources/ ?fbclid=IwAR2SkJQ9hN-0LCTb54UYdBCm7CS10zZqgywrh4lOtJo7M4JxjCr2D184QYk, (accessed on 20 April 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Natural Time Series Parameters Forecasting: Validation of the Pattern-Sequence-Based Forecasting (PSF) Algorithm; A New Python Package**

**Mayur Kishor Shende 1, Sinan Q. Salih 2,3, Neeraj Dhanraj Bokde 4,\*, Miklas Scholz 5,6,7, Atheer Y. Oudah 8,9 and Zaher Mundher Yaseen 10,11,12**


**Abstract:** Climate change has contributed substantially to weather and land characteristic phenomena, and accurate time series forecasting of climate and land parameters is highly essential for modern climatologists. The pattern-sequence-based forecasting (PSF) algorithm aims to forecast future values of a univariate time series. This paper provides a brief introduction to the algorithm and its implementation in Python. The algorithm is divided into two major processes: the clustering of data and prediction. The clustering part includes the selection of an optimum value for the number of clusters and labeling the time series data. The prediction part consists of the selection of a window size and the prediction of future values with reference to past patterns. The package aims to ease the use and implementation of PSF for Python users, and it provides results similar to the PSF package available in R. Finally, the results of the proposed Python package are compared with the results of the PSF and ARIMA methods in R. One of the issues with PSF is that its forecasting performance degrades if the time series has positive or negative trends; to overcome this problem, difference pattern-sequence-based forecasting (DPSF) was proposed. The Python package also implements the DPSF method. In this method, the time series data are first differenced; then, the PSF algorithm is applied to the differenced time series; finally, the original scale of the values is restored by reversing the differencing process. The proposed methodology is tested on several complex climate and land processes and its potential is evidenced.

**Keywords:** forecasting; univariate; time series; Python; PSF

**Citation:** Shende, M.K.; Salih, S.Q.; Bokde, N.D.; Scholz, M.; Oudah, A.Y.; Yaseen, Z.M. Natural Time Series Parameters Forecasting: Validation of the Pattern-Sequence-Based Forecasting (PSF) Algorithm; A New Python Package. *Appl. Sci.* **2022**, *12*, 6194. https://doi.org/10.3390/app12126194

Academic Editor: Stefanos Ougiaroglou

Received: 20 April 2022 Accepted: 15 June 2022 Published: 17 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

#### **1. Introduction**

Time series forecasting is a field of interest in many research and societal fields, such as energy [1–3], economics [4,5], health [6,7], agriculture [8,9], education [10,11], infrastructure [12,13], defense [14], technology [15], hydrology [16,17], and many others. Time series are generally addressed in terms of stochastic processes in which values are placed at consecutive points in time [18]. Time series forecasting is the process of predicting values of a historical data sequence [19]. With ongoing digitization and the increase in extensive historical data, more powerful and cross-platform-compatible forecasting methods are highly desirable [20,21].

Pattern-sequence-based forecasting (PSF) is a univariate time series forecasting method which was proposed in 2011 [22]. It was developed to predict a discrete time series and proposed to use clustering methods to transform a time series into a sequence of labels. To date, several researchers have proposed modifications for its improvement [23–26] and recently, its implementation in the form of an R package was also proposed [27,28]. PSF has been successfully used in various domains including wind speed [29], solar power [26], water demand [13], electricity prices [30], CO2 emissions [31], and cognitive radio [32].

The PSF algorithm consists of various processes. These processes are broadly categorized into two steps: clustering the data and, based on the clustered data, performing forecasting. The predicted values are appended to the end of the original data, and these new data are used to forecast further values. This makes PSF a closed-loop algorithm, which allows it to predict values for a longer duration. PSF is able to forecast more than one value at a time, i.e., it deals with arbitrary lengths for the prediction horizon. It must be noted that this algorithm was particularly developed to forecast data which contain some patterns. Figure 1 shows the steps involved in the PSF method.

**Figure 1.** Block diagram of the PSF method (Source: [33]).

The goal of the clustering step is to discover clusters and label them accordingly in the data. It consists of the normalization of the data, the selection of the optimal number of clusters, and applying k-means clustering using the optimum number of clusters. Normalization is an important part of any data processing technique. The formula used to normalize the data is:

$$X\_j' = \frac{X\_j}{\frac{1}{N} \sum\_{i=1}^{N} X\_i} \tag{1}$$

where *Xj* is a value of the input time series, *X'j* denotes the normalized value of *Xj*, and *i* = 1, ... , *N*. The k-means clustering technique is used to cluster and label the data. However, k-means requires the number of clusters (*k*) to be provided as an input. To calculate the optimum value of *k*, the silhouette index was used. The clustering step outputs the time series as a series of labels which are used for forecasting.

Then, the last "*w*" labels are selected from the series of labels outputted by the clustering step. This sequence of *w* labels is searched for in the series of labels. If the sequence is not found, then the search is repeated with the last (*w* − 1) labels. The selection of the optimum value of *w* is crucial in order to get accurate prediction results. Formally, the size of the window for which the error in forecasting is minimum during the training process is called the optimum window size. The error function used is shown in (2).

$$\sum\_{t \in TS} \left| \hat{X}(t) - X(t) \right| \tag{2}$$

where *X̂*(*t*) is the predicted value at time *t*, *X*(*t*) is the measured value at the same time instant, and *TS* represents the time series under study.

After the selection of the optimum window size (*w*), the sequence of the last *w* labels is searched for in the series of labels, and the labels immediately following each discovered occurrence are stored in a new vector called *ES*. The data corresponding to these labels are then retrieved from the original time series. The future time series value is predicted by averaging the retrieved data using expression (3).

$$\hat{X}(t) = \frac{1}{size(ES)} \times \sum\_{j=1}^{size(ES)} ES(j) \tag{3}$$

This predicted value is appended to the original time series and the process is repeated for predicting the next value as shown in Figure 2. This allows PSF to make long-term predictions.
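A compact sketch of this search-and-average loop is given below; it is an illustrative simplification under assumed parameters (univariate values clustered with k-means, a fixed random seed), not the PSF\_Py implementation itself.

```python
# Sketch of the PSF prediction loop: search for the last window of cluster labels,
# collect the values that follow each earlier match, and average them (Equation (3)).
# This is an illustrative simplification, not the PSF_Py implementation itself.
import numpy as np
from sklearn.cluster import KMeans

def psf_predict(series, k=3, w=3, n_ahead=5):
    series = list(series)
    for _ in range(n_ahead):
        data = np.array(series).reshape(-1, 1)
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
        win = w
        while win > 0:
            pattern = list(labels[-win:])
            # values that followed every earlier occurrence of the pattern (ES vector)
            es = [series[i + win] for i in range(len(labels) - win)
                  if list(labels[i:i + win]) == pattern]
            if es:
                series.append(float(np.mean(es)))   # Equation (3)
                break
            win -= 1                                # shrink the window if no match
        else:
            series.append(series[-1])               # fallback: repeat the last value
    return series[-n_ahead:]

print(psf_predict([1, 5, 9, 1, 5, 9, 1, 5, 9, 1, 5], k=3, w=2, n_ahead=3))
```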

**Figure 2.** Prediction with PSF algorithm (Source: [27]).

The main intention of the current investigation was to develop a new Python package for modeling univariate time series data that are characterized by natural stochasticity. This can contribute remarkably to monitoring, assessment, and decision support for those interested in such time-series-related problems. Among several time series engineering problems, hydrological time series forecasting is one of the most attractive recent topics [34–36]. Hydrological time series processes are very complex and stochastic problems that require robust technologies to tackle their complicated mechanisms. Hence, in this research, several hydrological time series examples were tested to validate the proposed methodology.

#### **2. Difference Pattern-Sequence-Based Forecasting (DPSF) Method**

The PSF algorithm was particularly developed to forecast time series which contain patterns or are seasonal, and the prediction error is thus very small for such series. However, if the time series follows a trend or is not seasonal, the error increases. This can be observed in the illustrative examples provided in the later sections. The "nottem" dataset is very seasonal, so the predictions of PSF are observed to be better than those of ARIMA. However, on the "CO2" dataset, the result of PSF is not as good as that of ARIMA, because the "CO2" dataset follows an upward trend. The forecasting results of the PSF method are degraded by positive or negative trends. To tackle this problem, the DPSF model was proposed [3].

The DPSF method is a modification of the PSF algorithm. The time series is differenced once, and the differenced data are then used for prediction with the PSF algorithm. The predicted values are appended to the differenced series, and finally the original scale is restored by reversing the first-order differencing process.
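The following sketch illustrates this difference-and-restore wrapper. The pluggable forecaster argument and the naive stand-in used for the demonstration are assumptions for illustration; in the actual method, PSF is the forecaster applied to the differenced series.

```python
# Sketch of the DPSF wrapper: difference the series, forecast the differences,
# then restore the original scale by cumulative summation (reverse differencing).
# Any univariate forecaster can be plugged in; PSF is used in the actual method.
import numpy as np

def dpsf_predict(series, forecaster, n_ahead=5):
    series = np.asarray(series, dtype=float)
    diffed = np.diff(series)                       # first-order differencing
    pred_diff = forecaster(diffed, n_ahead)        # forecast the differenced series
    # reverse differencing: cumulative sum anchored at the last observed value
    return series[-1] + np.cumsum(pred_diff)

# Stand-in forecaster for the demo (the real method would apply PSF here).
naive_mean = lambda d, n_ahead: [float(np.mean(d))] * n_ahead

trend_series = np.arange(1.0, 25.0)                # simple upward trend
print(dpsf_predict(trend_series, naive_mean, n_ahead=3))   # continues the trend
```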

The DPSF method gives better results for data in which positive or negative trends are present, whereas the PSF method does not work well with such datasets and is better suited to seasonal ones. This can also be observed in the examples shown in Sections 4.1 and 4.2. The example in Section 4.1 uses a seasonal dataset (*nottem*), where the PSF results are better than the DPSF results. In Section 4.2, the *CO2* dataset is used, which shows a positive trend; here, the results of DPSF are significantly better than those of PSF.

#### **3. Description of the Python Package for PSF (PSF\_Py)**

The proposed Python package for PSF (PSF\_Py) is available at the Python package repository, which lists its license, version, and required dependencies [37]. The package can be installed using the command in Listing 1.

**Listing 1.** Command to install PSF\_Py package.
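Assuming the package is published on PyPI under the name PSF\_Py, a typical installation command would be:

```
pip install PSF_Py
```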


The package makes use of the "pandas", "numpy", "matplotlib", and "sklearn" packages. The various tasks of the processes are accomplished using several functions, such as psf(), predict(), psf\_predict(), optimum\_k(), optimum\_w(), cluster\_labels(), neighbour(), psf\_model(), and psf\_plot(). The code for all the functions is available on GitHub [38]. All these functions are private and not directly accessible by the user. The user needs to create an object of the class Psf, which takes as inputs the time series, the cycle, a value for the window size (*w*), and the number of clusters (*k*) to be formed. The values of *k* and *w* are optional; if not specified by the user, they are calculated internally using the optimum\_k() and optimum\_w() functions. Once the PSF model has been created, predictions can be made using the predict() method, which takes as its input the number of predictions to make (n\_ahead). For the DPSF model, the user makes use of the class Dpsf; the remaining process is the same as that of Psf.

After the predictions are made using the predict() method of class Psf, the model can be viewed using the model\_print() method. The original time series and predicted values are plotted using the psf\_plot() or dpsf\_plot() methods. Alternatively, the user can use "matplotlib" functions to plot the time series.
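A usage sketch based on the interface described above follows; the module path and the exact argument names are assumptions and should be checked against the package documentation.

```python
# Usage sketch for the Psf / Dpsf classes based on the interface described above.
# The module path and exact argument names are assumptions, not the verified API.
from psf_py import Psf, Dpsf, get_ts

ts = get_ts('nottem')                  # built-in univariate series (1D array)

model = Psf(ts, cycle=12)              # k and w omitted -> computed internally
preds = model.predict(n_ahead=12)      # forecast the next 12 values
model.model_print()                    # show series, k, w, cycle and predictions

dmodel = Dpsf(ts, cycle=12)            # DPSF variant for trending series
dpreds = dmodel.predict(n_ahead=12)
```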

#### *3.1.* optimum\_k()

The optimum\_k function is used to calculate the optimum number of clusters for forecasting. PSF uses the k-means algorithm to cluster the data, but the algorithm requires the number of clusters as an input. The function takes as inputs the time series and a tuple consisting of the desired values for *k*. It performs k-means clustering using KMeans() from the sklearn package, calculates the silhouette score using the silhouette\_score() function, and returns the value of *k* for which the score is maximum.
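A sketch of this selection procedure using the scikit-learn functions mentioned above is shown below; it is an illustrative re-implementation, not the package's private function.

```python
# Sketch: choose the number of clusters k by maximizing the silhouette score,
# mirroring the behaviour described for optimum_k() (illustrative only).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def optimum_k(series, k_values=(2, 3, 4, 5)):
    data = np.asarray(series, dtype=float).reshape(-1, 1)
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
        scores[k] = silhouette_score(data, labels)
    return max(scores, key=scores.get)   # k with the highest silhouette score

print(optimum_k([20, 21, 35, 36, 50, 51, 20, 22, 34, 37, 49, 52]))
```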

#### *3.2.* optimum\_w()

The optimum\_w function is used to calculate the optimum window size, a critical parameter for obtaining accurate predictions. A cross-validation is performed to find the optimum value: the time series is divided into a training set and a test set, where the test set consists of the last cycle values of the time series and the training set consists of the remaining values. PSF is performed on the training set and cycle values are predicted. Then, the error between the predicted values and the test set is calculated using the mean absolute error (MAE). The function returns the value of *w* for which the error is minimum.

The functions for calculating the optimum window size and clustering the data may yield different results in R and Python. Therefore, the predictions made in R and Python can vary in some cases. Furthermore, the default window values in optimum\_w() range from 5 to 20 in Python, whereas in R they range from 1 to 10. In some cases, it was observed that the optimum number of clusters was calculated more accurately in Python. Overall, the predicted values were very similar to those obtained in R.

#### *3.3.* get\_ts()

In the Python package for PSF, some time series are included, namely, "nottem", "AirPassengers", "Nile", "morley", "penguin", "sunspots", and "wineind". It should be noted that the package does not provide the entire data frames (datasets); it only provides a 1D array consisting of the data for the time series. These can be accessed using the get\_ts() function, which takes as an input the name of the time series.

#### *3.4.* predict()

The predict() method is used to perform the forecasting. This method returns a numpy array of values predicted according to the PSF algorithm (or DPSF algorithm, if the DPSF model is used). The actual calculations take place in the psf\_predict() function, which was made private and not intended to be directly used by the user. The predict() method also calculates the optimum values of *k* and *w*, in case no values are given by the user, using the optimum\_w() and optimum\_k() functions described above. If a tuple of values is passed instead of an integer, then the optimum *k* and *w* are calculated from those values. Furthermore, the normalization of the data is done in this method.

Some other functions are available to the users. The model\_print() function prints the actual time series, predicted values, values of *k* and *w* used for predictions, and value of cycle for the time series. This function does not return anything; it only prints the data and parameters. The functions psf\_plot() and dpsf\_plot() take the PSF model and predicted values as inputs and plot them. The functions make use of the "matplotlib" package.

#### **4. Demonstration**

Following several established research works from the literature, a newly proposed soft-computing methodology must be validated on real time series datasets [39–42]. The proposed package was examined on six different time series datasets. The performance of the forecasting methods was compared using the root-mean-square error (*RMSE*), mean absolute error (*MAE*), mean absolute percentage error (*MAPE*), and Nash–Sutcliffe efficiency (*NSE*) [43,44]. These error metrics are defined in Equations (4)–(7), respectively.

$$RMSE = \sqrt{\frac{1}{N} \sum\_{i=1}^{N} \left| X\_i - \hat{X}\_i \right|^2} \tag{4}$$

$$MAE = \frac{1}{N} \sum\_{i=1}^{N} |X\_i - \hat{X}\_i| \tag{5}$$

$$MAPE = \frac{1}{N} \sum\_{i=1}^{N} \frac{|X\_i - \hat{X}\_i|}{X\_i} \times 100\% \tag{6}$$

$$NSE = 1 - \frac{\sum\_{i=1}^{N} (X\_i - \hat{X}\_i)^2}{\sum\_{i=1}^{N} (X\_i - X\_{mean})^2} \tag{7}$$

where *Xi* and *X̂i* are the measured and predicted values at time step *i*, *Xmean* is the mean of the measured data, and *N* is the number of predicted values.
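For reference, these metrics can be computed directly with NumPy; the sketch below mirrors Equations (4)–(7), and the sample values are illustrative.

```python
# Sketch: computing RMSE, MAE, MAPE and NSE (Equations (4)-(7)) with NumPy.
import numpy as np

def error_metrics(measured, predicted):
    x, x_hat = np.asarray(measured, float), np.asarray(predicted, float)
    rmse = np.sqrt(np.mean((x - x_hat) ** 2))
    mae = np.mean(np.abs(x - x_hat))
    mape = np.mean(np.abs(x - x_hat) / x) * 100.0          # assumes no zero values
    nse = 1.0 - np.sum((x - x_hat) ** 2) / np.sum((x - x.mean()) ** 2)
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "NSE": nse}

print(error_metrics([49.0, 52.1, 55.3], [48.2, 53.0, 54.7]))   # illustrative values
```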

For each of the examples demonstrated, the original dataset was divided into training and test data. The number of observation values used for the test data is mentioned in each example. Once the forecasted values were calculated, they were compared against the test data.

#### *4.1. Example 1: Nottem Dataset*

In the below example, the "nottem" time series was used for the model training, forecasting, and plotting. It contains the average air temperatures at Nottingham Castle in degrees Fahrenheit over 20 years [45]. The procedure is the same for other univariate time series. Table 1 reveals the statistical characteristics of the time series.


**Table 1.** Statistical characteristics of the "nottem" dataset.

The package contains the get\_ts() function, which can be used to access some univariate time series included in the package, using the command in Listing 2.

**Listing 2.** get\_ts() function to access univariate time series included in the package.

```