# Plot the model and predicted values
>>> dpsf_plot(a, b)
```
**Table 4.** Comparison of forecast methods with different error metrics for the "CO2" dataset.


**Figure 6.** Plot showing the test data and values forecasted using various methods for the "CO2" dataset.

#### *4.3. Example 3: Water Demand Dataset*

The PSF and DPSF algorithms were applied to forecast water demand on the given dataset. Investigating such highly complex time series data is important for water management [47]. Table 5 contains the statistical characteristics of the time series.

**Table 5.** Statistical characteristics of the "Water Demand" dataset.


A comparison of the different error metrics is listed in Table 6. The error for the Psf() function in Python was found to be lower than that of the other algorithms adopted in this study. The dataset and the predictions obtained with Psf() and Dpsf() are plotted in Figures 7 and 8. Further, the comparison of the values forecasted using various methods is shown in Figure 9.
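For readers who want to reproduce this workflow, the sketch below arranges the calls in the order described above; only the function names Psf(), Dpsf(), psf\_plot(), and dpsf\_plot() come from the text, while the import path, the constructor arguments, and the predict() call are assumptions made for illustration and should be checked against the PSF\_Py documentation.

```python
# Hedged sketch of the workflow described above. Only the function names
# Psf, Dpsf, psf_plot, and dpsf_plot come from the text; the import path,
# the constructor arguments, and the predict() call are assumptions.
import pandas as pd
from PSF_Py import Psf, Dpsf, psf_plot, dpsf_plot  # import path assumed

# Hypothetical file and column names for the water demand series.
demand = pd.read_csv("water_demand.csv")["demand"]
train, test = demand[:-12], demand[-12:]      # hold out the last cycle

a = Psf(data=train, cycle=12)                 # PSF model; cycle length assumed
pred_psf = a.predict(n_ahead=len(test))       # assumed prediction call

b = Dpsf(data=train, cycle=12)                # DPSF on the differenced series
pred_dpsf = b.predict(n_ahead=len(test))

psf_plot(a, pred_psf)                         # plot like Figure 7
dpsf_plot(b, pred_dpsf)                       # plot like Figure 8
```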

**Figure 7.** Result of psf\_plot() for the "Water Demand" dataset.

**Figure 8.** Result of dpsf\_plot() for the "Water Demand" dataset.

**Table 6.** Comparison of forecast methods with different error metrics for the "Water Demand" dataset.


**Figure 9.** Plot showing the test data and values forecasted using various methods for the "Water Demand" dataset.

#### *4.4. Example 4: Total Solar Radiation Dataset*

Among several climatological time series, total solar radiation is one of the essential climatological processes. Providing a robust soft-computing methodology for solar radiation can contribute remarkably to clean and environmentally friendly sources of energy [48]. The dataset consisted of daily solar radiation readings from 2010 to 2018 at the Baker station in North Dakota. Before applying the algorithms, the dataset was reduced by taking the mean of the values for each month. Table 7 presents the statistical characteristics of the time series, and the performance of the various methods is tabulated in Table 8. The results of psf\_plot() and dpsf\_plot() for the "Total Solar Radiation" dataset are shown in Figures 10 and 11. Further, the comparison of the values forecasted with different methods is shown in Figure 12.

**Table 7.** Statistical characteristics of the "Total Solar Radiation" dataset.


The error for Psf() was significantly lower than that for auto.arima() and Dpsf(). The errors are listed in Table 8.
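The error metrics used for such comparisons can be computed directly from the held-out test data and the forecasts; a minimal NumPy sketch of three commonly used metrics is given below (the exact set reported in the tables may differ).

```python
# Error metrics used to compare forecasts against the held-out test data.
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (assumes no zero values)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))

# Example usage with the forecasts from the sketch above:
# print(rmse(test, pred_psf), mae(test, pred_psf), mape(test, pred_psf))
```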

**Table 8.** Comparison of forecast methods with different error metrics for "Total Solar Radiation" dataset.


**Figure 10.** Result of psf\_plot() for the "Total Solar Radiation" dataset.

**Figure 11.** Result of dpsf\_plot() for the "Total Solar Radiation" dataset.

**Figure 12.** Plot showing the test data and values forecasted using various methods for the "Total Solar Radiation" dataset.

#### *4.5. Example 5: Average Bare Soil Temperature*

Soil temperature is an important process related to geoscience engineering [49]. Owing to the underlying physical mechanisms, soil temperature exhibits highly nonstationary features due to the influence of soil morphology, climate, and hydrology [50,51]. Hence, forecasting soil temperature as a time series is highly useful for multiple geoscience engineering applications [52]. The data were obtained from the same region as in Example 4 ("Baker station") and cover the same data span, 2010–2018. A similar modeling procedure was implemented as in Section 4.4. Table 9 reports the statistical characteristics of the soil temperature time series, and the performance of the various methods is tabulated in Table 10. Results of psf\_plot() and dpsf\_plot() for the "Average Bare Soil Temperature" dataset are shown in Figures 13 and 14. Further, the comparison of the values forecasted with different methods is shown in Figure 15.

**Table 9.** Statistical characteristics of the "Average Bare Soil Temperature" dataset.

**Figure 13.** Result of psf\_plot() for the "Average Bare Soil Temperature" dataset.

**Figure 14.** Result of dpsf\_plot() for the "Average Bare Soil Temperature" dataset.

**Table 10.** Comparison of forecast methods with different error metrics for the "Average Bare Soil Temperature" dataset.


**Figure 15.** Plot showing the test data and values forecasted using various methods for the "Average Bare Soil Temperature" dataset.

#### *4.6. Example 6: Average Temperature*

The final example reported in this research is the modeling of air temperature. Having a reliable and robust technique for air temperature is essential for diverse water resources and hydrological processes [53,54], for instance, in agriculture, water-body evaporation, crop production, etc. [55]. Similar to Examples 4 and 5, the air temperature data were from the same station, region, and data span, and the same procedure was followed. Table 11 indicates the statistical characteristics of the time series, and the performance of the various methods is tabulated in Table 12. The result of psf\_plot() for the "Average Temperature" dataset is shown in Figure 16. Further, the comparison of the values forecasted with different methods is shown in Figure 17.

**Table 11.** Statistical characteristics of the "Average Temperature" dataset.


**Table 12.** Comparison of forecast methods with different error metrics for the "Average Temperature" dataset.


**Figure 16.** Plot showing result of psf\_plot() for the "Average Temperature" dataset.

**Figure 17.** Plot showing the test data and values forecasted using various methods for the "Average Temperature" dataset.

#### **5. Discussion and Conclusions**

This paper described the PSF\_Py package in detail and demonstrated its use for applying the PSF and DPSF algorithms to real time series datasets from diverse applications. The package makes it straightforward to generate predictions with the PSF algorithm, and its syntax closely follows that of the R implementation, making it easy to understand. The examples shown above suggest that the results from the Python package are comparable to those from R, although the selected values of the window size and the number of clusters may differ between the two packages. The algorithm worked exceptionally well for time series containing periodic patterns. The forecasting error of the DPSF method, implemented in the proposed package, was much smaller than that of the benchmark ARIMA model. The complexity of a model is another critical aspect besides its accuracy. We compared the time and space complexities of the models in the case studies with the GuessCompx tool. The GuessCompx tool [56,57] empirically estimates the computational complexity of a function in terms of Big-O notation. It draws multiple samples of increasing size from the given dataset and estimates the best-fit complexity according to the "leave-one-out mean squared error (LOO-MSE)" approach. The "nottem" dataset was used to calculate the complexities. The results of the tool are summarized in Table 13, which shows that both the PSF and DPSF models are computationally efficient and consume a reasonable amount of memory while achieving better accuracy.
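GuessCompx itself is an R tool, so the sketch below is not its API; it only illustrates, in Python, the idea described above: time a function on samples of increasing size, fit candidate Big-O curves, and select the best fit by leave-one-out mean squared error.

```python
# Illustrative sketch of empirical complexity estimation (not the GuessCompx
# API): fit runtime measurements to candidate Big-O curves and pick the one
# with the smallest leave-one-out mean squared error.
import time
import numpy as np

CANDIDATES = {
    "O(1)":       lambda n: np.ones_like(n, dtype=float),
    "O(log n)":   lambda n: np.log(n),
    "O(n)":       lambda n: n.astype(float),
    "O(n log n)": lambda n: n * np.log(n),
    "O(n^2)":     lambda n: n.astype(float) ** 2,
}

def loo_mse(x, y):
    """Leave-one-out MSE of the least-squares fit y ~ a * x."""
    errors = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        a = np.dot(x[mask], y[mask]) / np.dot(x[mask], x[mask])
        errors.append((y[i] - a * x[i]) ** 2)
    return float(np.mean(errors))

def estimate_complexity(func, data, sizes):
    """Time `func` on prefixes of `data` and return the best-fitting class."""
    sizes = np.asarray(sizes)
    runtimes = []
    for n in sizes:
        start = time.perf_counter()
        func(data[:n])
        runtimes.append(time.perf_counter() - start)
    runtimes = np.asarray(runtimes)
    return min(CANDIDATES, key=lambda k: loo_mse(CANDIDATES[k](sizes), runtimes))

# Example: time the built-in sort on growing slices of a random series.
# sorted() should come out close to O(n log n); with timing noise it may
# also be classified as O(n), which is hard to separate at these sizes.
rng = np.random.default_rng(0)
values = rng.normal(size=32000)
print(estimate_complexity(sorted, values, [1000, 2000, 4000, 8000, 16000, 32000]))
```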

**Table 13.** Estimated complexities according to the GuessCompx tool [56].


It is worth mentioning that this research proposed a reliable and robust computational technique that can be implemented for online and offline forecasting in diverse hydrological and climatological applications [58,59]. In future work, other hybrid models [33,60] of the PSF method can be incorporated into the proposed Python package to target further improvements in forecasting accuracy. In addition, the application of the proposed package could be extended to several new data-driven research domains. Further, datasets of other hydrological processes or other engineering time series could be investigated to assess how well the proposed package generalizes.

**Author Contributions:** Conceptualization, M.K.S., S.Q.S., M.S. and N.D.B.; methodology, M.K.S., S.Q.S., M.S. and N.D.B.; software, M.K.S., A.Y.O., Z.M.Y. and N.D.B.; validation, Z.M.Y., M.S., A.Y.O. and N.D.B.; formal analysis, M.K.S., S.Q.S., Z.M.Y., M.S., A.Y.O. and N.D.B.; investigation, M.K.S., S.Q.S., Z.M.Y., M.S., A.Y.O. and N.D.B.; resources, S.Q.S., Z.M.Y., M.S., A.Y.O. and N.D.B.; data curation, M.K.S., S.Q.S., Z.M.Y., A.Y.O. and N.D.B.; writing—original draft preparation, M.K.S., S.Q.S., Z.M.Y., M.S., A.Y.O. and N.D.B.; writing—review and editing, M.K.S., S.Q.S., Z.M.Y., M.S., A.Y.O. and N.D.B.; visualization, M.K.S. and N.D.B.; supervision, Z.M.Y., M.S., A.Y.O. and N.D.B.; project administration, Z.M.Y., M.S., A.Y.O. and N.D.B.; funding acquisition, Z.M.Y., N.D.B. and M.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was financially supported predominantly by WATERAGRI (European Union Horizon 2020 research and innovation program under Grant Agreement Number 858375).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


### *Article* **Identification of Mobility Patterns of Clusters of City Visitors: An Application of Artificial Intelligence Techniques to Social Media Data**

**Jonathan Ayebakuro Orama 1,\*, Assumpció Huertas 2, Joan Borràs 1, Antonio Moreno 3 and Salvador Anton Clavé 1,4**


**Abstract:** In order to enhance tourists' experiences, Destination Management Organizations need to know who their tourists are, their travel preferences, and their flows around the destination. The study develops a methodology that, through the application of Artificial Intelligence techniques to social media data, creates clusters of tourists according to their mobility and visiting preferences at the destination. The applied method improves the knowledge about the different mobility patterns of tourists (the most visited points and the main flows between them within a destination) depending on who they are and what their preferences are. Clustering tourists by their travel mobility permits uncovering much more information about them and their preferences than previous studies. This knowledge will allow DMOs and tourism service providers to offer personalized services and information, to attract specific types of tourists to certain points of interest, to create new routes, or to enhance public transport services.

**Keywords:** mobility patterns; social media data; artificial intelligence; tourist clusters; tourist flows

#### **1. Introduction**

Technological evolution has brought changes in tourists and their behavior [1–3], and it has catalyzed new information challenges for destinations. In this evolution, some destinations have started to become smart by integrating technological infrastructures and end-user devices [4–6]. However, the fulfillment of two of the main objectives of smart destinations, namely the enhancement of tourists' experiences and the improvement of their management, is still far from complete, and results from the destinations' efforts remain largely unreported [7]. Progress in this area requires Destination Management Organizations (DMOs) to know who their tourists are, what needs and travel preferences they have, what they visit the most, and which are their mobility patterns and flows around the destination [8].

Technological evolution has allowed some Destination Management Organizations to start to maintain personalized and real-time exchange of information with tourists [9] and, at the same time, to collect vast amounts of information from them (big data). The analysis of these data permits offering them even more personalized services [10,11] at the moment and place they need it [12,13].

**Citation:** Orama, J.A.; Huertas, A.; Borràs, J.; Moreno, A.; Anton Clavé, S. Identification of Mobility Patterns of Clusters of City Visitors: An Application of Artificial Intelligence Techniques to Social Media Data. *Appl. Sci.* **2022**, *12*, 5834. https:// doi.org/10.3390/app12125834

Academic Editors: Dionisis Margaris and Stefanos Ougiaroglou

Received: 6 May 2022 Accepted: 6 June 2022 Published: 8 June 2022


This two-way communication between tourists and DMOs helps to generate more satisfactory tourist experiences [5,14,15] and to improve the destination's management [16].

Many studies have focused on knowing who the visitors of a destination are [17]. These studies have classified tourists into clusters or profiles depending on their travel preferences and behaviors [18,19]. Some studies have analyzed the acceptance of technology by tourists [20], while others have focused on their degree of connectivity during the trip [21,22]. To find out their preferences and behaviors, some authors have analyzed their information searches during the trip [23–25] and others their movements at the destination (flow analysis) through GPS or other smart technologies [26,27]. Knowing the mobility patterns of tourists is one of the most important issues for the development of tourism planning and destination management [28,29].

Social media has brought a great transformation in tourism research [8,30,31]. Social media analytics use natural language processing and machine learning techniques to analyze social media content [32]. Geotagged Social Media Data (GSMD) offer information about tourist behavior, travel route choices, emotions, and satisfaction level [33]. Nowadays, technology and analytical tools are evolving, and social media analytics allow us to know the movements or flows of tourists [34]. Twitter is a platform with global coverage that is very useful for tourist mobility analysis [35] and even allows us to determine the most visited POIs (Points of Interest) at destinations [36].

Existing research in spatial data or mobility/flow analysis through GSMD still offers little information about the mobility patterns of different sub-groups or clusters of tourists [37]. Nevertheless, the more DMOs know about the mobility, preferences, and behavior of tourists, the more they will be able to offer personalized services, packages, and information [11,38]. Additionally, the more they can automate processes to obtain information from open social media data, the more efficient they will be when providing personalized information through travel recommender systems [39].

Therefore, the aim of the study is to develop a methodology that, through the application of Artificial Intelligence techniques to geotagged social media data, creates clusters of tourists according to their mobility patterns and visiting preferences at the destination. The analytical and managerial goal is to know the most visited points of interest and the main flows between them within a destination for each tourist cluster. This will allow DMOs to uncover their tourists' profiles, their preferences, and their mobility patterns at the destination. This will also let them enhance the visitors' experience through the personalization of tourist packages, services, public transport, and the information offered on each attraction or spot. Finally, this will also enhance the usefulness and efficiency of DMOs in mitigating the array of problems caused by the mobility of visitors towards and around the main points of interest, through better experience design management and interaction with the environment [40].

#### **2. Theoretical Framework**

#### *2.1. Tourist Flow Analysis and Social Media Analytics*

Prior to the technological developments, studies on tourist mobility were based on tourists' surveys [35], and they were rather limited. Tourist flow analysis grew enormously with the development of tracking mechanisms like GPS, cell-tower identification or Wi-Fi positioning [41], which have made it possible to obtain big data from the movement of tourists at destinations. Several studies have used GPS [27,42,43] to find out which were the most visited places in a particular destination and when they were visited. In destinations, the proliferation of sensor networks and portable devices like smartphones has also made it possible to obtain big data from tourists and to know their movements or flows [26]. In general, the increasing effectiveness and reliability of GPS data and the mobile positioning data have increased the possibilities of analyzing spatial-temporal behaviors, widening the research objectives beyond the initial aim to know where and when visitors went. In this vein, they have been used to identify seasonal demand patterns by Ahas et al. [44] or to improve the management of destination marketing by Kuusik et al. [45]. Other authors have considered data obtained from mobility services, such as the subway smart card [46] or bike sharing systems [47].

Social media also has very useful platforms for knowing the movements of tourists. Social media analytics (SMA) is a research field that has advanced heavily since 2014, but it is still in an early stage of development [32]. The potential of social media as sources for big data research in the field of tourism has increased in the last decade [8,30], and studies on big data, social media, UGC (User-Generated Content), and online reviews have proliferated in hospitality and tourism [30,32,34,48,49]. Moreover, the innumerable footprints that millions of tourists leave online using the technological platforms constitute an interesting source for knowing the tourists' movements and flows [49,50], although big data-based theoretical studies still remain limited [34].

Text analytics and data mining studies try to find out tourists' interests and to predict their decisions and behaviors [51]. Trend analysis studies also try to predict tourists' behaviors [52] or future trends in tourist behavior at destinations [53]. Nevertheless, one of the best ways to know the behavior of tourists during the trip is through spatial data analysis.

Spatial data analysis is a stream of studies within SMA that, through the analysis of GSMD [8], aim to ascertain the spatial distribution of tourists on a place or a destination [54] and even to know tourists' movements or flows [26,41,55]. Flow is the collective movement of people [26], and flow analysis shows the movement of tourists in a location [38]. GSMD are key sources of information to analyze tourists flows [56] in order to uncover their travel preferences and behaviors [26] or the tourists' experiences through a spatial analysis [33]. Provenzano et al. [50] compared GSMD results with the UNWTO record-based network demonstrating the usefulness of GSMD to discover tourist flows. However, there are still few publications that analyze Geotagged Big Data to examine the spatial distribution and flows of different types of tourists in the destinations [35].

GSMD analysis allows for knowing the dispersion of tourists and also the routes and activities they carry out in the destination [57–62], their density of movements [63], their flows [26,38,64,65], and the most popular resorts, attractions, or points of interest in the destinations [36,62,66,67].

Most studies that analyze GSMD have focused on Twitter because it is a platform that has global coverage and its data are available for free on the Internet from the moment the tweets are published [35]. Twitter is, along with Tripadvisor, one of the two most used platforms by SMA. It also allows the analysis of multi-modal data such as User Generated Content (including text, images, and even videos), and geotagged information [32,68]. However, studies based on geotagged photos and other social media like Flickr [38,69], Foursquare [60], or Instagram [70] have also proliferated showing tourists' flows, movements, and behaviors in destinations.

It has also been shown that analyzing different social media sources or platforms is useful because they provide complementary information and enrich the knowledge of different tourists' movements [35]. In this line, Dietz et al. [71] analyzed tourists' movements at destinations through three social media (Twitter, Foursquare, and Flickr) and identified different types of trips according to the origin of the tourists. A study by Sugimoto et al. [72] combined tracking technologies and surveys to study the relationship between visitor mobility and urban spatial structures. Salas-Olmedo et al. [35] analyzed the digital footprint of urban tourists through photos, check-ins, and tweets from three social media (Panoramio, Foursquare, and Twitter). In addition, they used a clustering methodology to identify certain areas of the destinations according to the tourist activities that visitors carried out in them. However, they did not cluster or segment tourists to know their different preferences, movements, and behaviors.

Many studies on GSMD have focused on tourists' mobility patterns [59,73–75]. However, the difference in mobility patterns between sub-groups or clusters of tourists has not been fully researched [37].

#### *2.2. Uncovering the Mobility Patterns of Clusters of Tourists*

Previous studies have shown that different types, sub-groups, or clusters of tourists may present different travel behaviors [76–79]. Domènech et al. [80], for instance, identified that cruise passengers with different expenditure levels have different mobility patterns in port destination cities. However, such studies usually apply an ad-hoc combination of analytic techniques that is not easy to generalize.

From a complementary perspective, many researchers have followed the digital footprint of tourists [35] to know their mobility in destinations, but the current research shows that it is difficult to analyze all these data by segmenting tourists. In fact, many studies have analyzed tourists' mobility patterns [73,74] without taking into account the diversity of tourists [37] because of the difficulty of obtaining their socio-demographic data. This aspect can be considered only in those cases in which user information is available, such as the one described by Massimo and Ricci [81]. In that case, they use information produced by a recommender system to define patterns of visitors depending on the similarities in their observed visit trajectories.

Previous GSMD-based studies have focused on the clustering of tourists according to diverse factors. Following Liu et al. [37], studies that segment visitors according to their mobility patterns can be based on non-spatial factors (socio-economic status, gender, age, income, education, race) or on spatial factors. For instance, Manca et al. [82] focused on spatial data and Jin et al. [41] on spatial and temporal data. Nevertheless, very few studies have focused on the analysis of mobility patterns according to the socio-demographic data in order to segment visitors in a destination because, with the currently applied methods of analysis, very little socio-demographic data can be obtained from users. In this vein, several SMA studies based on GSMD analysis have claimed to obtain demographic data from tourists to better understand who they are and to be able to classify them [26,69,83]. However, the available information is still very limited [60] and, in some cases, it is even reduced to the country of origin [56].

To name some of those studies, Chua et al. [26] focused on spatial, temporal, and also demographic data from Twitter to discover the tourist flows in a destination, creating tourist profiles and segmenting them by country of origin. Similarly, Vu et al. [77] analyzed the different mobility patterns, popular locations, and routes of Western and Asian tourists in Hong Kong. In the same line, Paldino et al. [84] analyzed geo-tagged picture data from Flickr, segmenting domestic and foreign tourists, and Ma et al. [18] also analyzed the mobility of tourists in destinations and their most visited attractions by classifying tourists into foreign tourists and domestic tourists. Van der Zee and Bertocchi [85] analyzed the spatial behavior of visitors at a destination through a relational approach and Trip Advisor data, classifying visitors as local, national, European, and non-European. Vu et al. [86] analyzed the activities carried out in a destination by different groups of tourists, also segmenting them by their country of origin. Xu et al. [87] analyzed mobility patterns of tourists in a country, South Korea, and its diverse destinations, according to their nationality or country of origin. Liu et al. [37] analyzed mobility via GSMD from Twitter by considering homogeneous segments of users (state visitors, national visitors, and international visitors), created according to their past visits.

However, despite the difficulty of obtaining other socio-demographic data than the origin of users from the GSMD, some mobility studies have tried to take a step further in segmenting visitors. Huang and Wong [88], for example, analyzed their mobility segmenting them by their socio-economic status through Twitter's GSMD. They identified this status from the home and work location of the users. In addition, they showed that socio-economic status and urban spatial structure are the factors that have a stronger influence on the mobility of visitors. On the other hand, Han et al. [89] analyzed the mobility patterns of visitors from the analysis of social media check-in. They used a deep learning method to try to classify tourists by the purpose of their travel.

Studies of mobility patterns that employ Artificial Intelligence techniques are still emerging, and very few of them try to identify meaningful clusters of tourists. Liao [90] obtained trajectory data from different location-based services and tourism applications, and then they applied cluster analysis to identify the most popular tourist attractions. DBSCAN clustering was used to identify spatial clusters of trajectories at the points of greatest interest, but not to identify clusters of tourists. Xu et al. [87] analyzed a mobile positioning data set in order to know the nationality and movement patterns of foreign tourists in South Korea. They used network analysis to identify the structure of tourism destinations based on patterns of travel flow, and clustering analysis to identify similar patterns. They identified areas of destinations with different visit patterns of tourists according to their nationality, but not clusters of tourists. Instead, Giglio et al. [91] used cluster analysis to identify automatically clusters of tourists around points of interest at destinations. They studied the relationship between human mobility and tourist attractions through geo-located images of Italian destinations provided by Flickr users. The results showed that social media data are a valuable source to understand the behavior of tourists in a destination. However, the study did not define or specify the different clusters of tourists and their different mobility flows between the most popular attractions.

Considering that, despite the difficulties and limitations, GSMD can be analyzed to produce segmentations more precise than those based only on the country of origin, and that mobility patterns are a key issue for DMOs, this study aims to make a contribution to the current challenges of analyzing tourism flows in destinations. Hence, its goal is to provide and test a methodology to create tourist clusters (groups of visitors with similar characteristics, preferences, and patterns of travel) taking into account not only their personal and cultural interests, leisure activities, and the context, but also their mobility patterns at the destination. The authors' perception is that, from an academic perspective, this is a fundamental issue in order to better understand tourists' spatial and temporal behavior. Additionally, from the point of view of the developers of tourists' experiences, this methodology is a new tool that can help to offer services to tourists in a more personalized way, enhancing the communication to the appropriate targets or the attraction of new ones [37]. Finally, it can help to improve the management of critical flows in certain points of interest of tourism destinations, helping to minimize social stresses, built environment management difficulties, transportation issues, and frictions between tourists and the local population in situations of congestion and overcrowding [40].

#### **3. Methodology**

#### *3.1. Data Collection and Processing*

Tourist mobility patterns can be analyzed by building user profiles, which group individuals with common visits to some points of interest or with similar travel behavior. These profiles can be discovered via a clustering process, which uncovers the common interests and habits of the visitors. The data required to cluster individuals based on their visited POIs for tourist mobility analysis can be obtained from various sources, including social media and mobile carriers (GPS). Social media data have become popular for this purpose in recent years.

The data used in this research were collected from the popular micro-blogging platform Twitter, using its application programming interface (API). The API allows developers to stream live tweets published at a specific location; each tweet is provided as a JavaScript Object Notation (JSON) file that includes timestamps, texts, tweet IDs, user details, tweet language, coordinates, and place details. We streamed geo-located tweets published in the city of Barcelona in the year 2019, which amounted to over 1.5 million tweets from more than 100,000 users.

After the data collection, a cleaning process was applied. Users that had sent fewer than three tweets were discarded, along with their tweets. It was also necessary to distinguish residents from visitors, as we were interested only in detecting visitor profiles. A tweet was considered to be from a tourist if Barcelona was not among his/her home locations. These home locations are specified explicitly in the individual's Twitter profile, or they are inferred from his/her tweets. Concretely, a place is considered to be a user's home location if he/she has posted daily tweets in a 20-day period (i.e., it is assumed that someone residing in a location for more than 20 consecutive days is not a tourist). Afterwards, tweets from residents were discarded.
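The sketch below illustrates these filtering rules; the column names and the tweets DataFrame layout are assumptions for illustration, and it only covers the inference of home locations from tweet activity, not the case where they are declared in the user's profile.

```python
# Hedged sketch of the filtering rules above (column names are assumptions):
# the tweets DataFrame has columns user_id, tweet_id, place, timestamp (datetime).
import pandas as pd

def home_locations(user_tweets, min_consecutive_days=20):
    """Places where the user tweeted on at least 20 consecutive days."""
    homes = set()
    for place, group in user_tweets.groupby("place"):
        days = sorted(group["timestamp"].dt.normalize().unique())
        run = 1
        for prev, cur in zip(days, days[1:]):
            run = run + 1 if (cur - prev) == pd.Timedelta(days=1) else 1
            if run >= min_consecutive_days:
                homes.add(place)
                break
    return homes

def keep_visitor_tweets(tweets, destination="Barcelona"):
    # Discard users with fewer than three tweets.
    counts = tweets.groupby("user_id")["tweet_id"].transform("count")
    tweets = tweets[counts >= 3]
    # Keep only users whose inferred home locations do not include the destination.
    visitors = [uid for uid, grp in tweets.groupby("user_id")
                if destination not in home_locations(grp)]
    return tweets[tweets["user_id"].isin(visitors)]
```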

In order to build the user profiles, it was crucial to identify the points of interest experienced by the tourists from their published tweets. The geographical coordinates of each tweet were sent to Overpass, a query engine for requesting specific features on the Open Street Map (OSM) server, in order to obtain points of interest in their proximity. The POIs returned from Overpass were then assigned to the tweets. OSM classifies physical features (buildings, parks, attractions, etc.) under certain tags that describe their type (e.g., nature: beach). We used these tags to build an activity tree, which has a hierarchical structure that categorizes activities. The activity tree contained 175 OSM tags, classified under nine main categories (Routes, Sports, Gastronomy, Leisure, Accommodation, Transportation, Nature, Events, and Culture) and 32 subcategories.

The activity tree allows us to categorize tweets based on the POI assigned to each tweet. This POI is chosen from the POIs returned by Overpass, giving priority to POIs explicitly mentioned in the tweet and to POIs with OSM tags that belong to categories of higher interest to a tourist. For instance, we consider a museum to be more interesting to a tourist than a coffee shop, so the museum is prioritized over the coffee shop. The tweets that could not be assigned to any POI and activity were removed from the analysis. After the filtering process, the dataset had 37,302 tweets from 6066 individuals. A summary of the dataset is shown in Table 1.
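A minimal sketch of this POI assignment rule is shown below; the category priorities and field names are illustrative assumptions rather than the exact values used in the study.

```python
# Hedged sketch of the POI assignment rule (priorities and field names are
# illustrative assumptions, not the exact values used in the study).
CATEGORY_PRIORITY = {        # smaller value = higher tourist interest (assumed)
    "Culture": 0, "Nature": 1, "Leisure": 2, "Routes": 3, "Sports": 4,
    "Events": 5, "Gastronomy": 6, "Accommodation": 7, "Transportation": 8,
}

def assign_poi(tweet_text, candidate_pois, tag_to_category):
    """candidate_pois: dicts like {'name': ..., 'osm_tag': ...} returned by an
    Overpass query around the tweet coordinates."""
    def rank(poi):
        mentioned = poi["name"].lower() in tweet_text.lower()
        category = tag_to_category.get(poi["osm_tag"])
        priority = CATEGORY_PRIORITY.get(category, len(CATEGORY_PRIORITY))
        # Explicit mentions first, then higher-interest categories.
        return (0 if mentioned else 1, priority)
    return min(candidate_pois, key=rank) if candidate_pois else None
```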

**Table 1.** Dataset summary.


Source: Authors, from a previous work [39].

#### *3.2. Feature Engineering and Data Clustering*

In order to cluster the users, each of them was represented by a numerical vector of features, which represents their interests and travel habits. Four types of features were considered:


Thus, each of the 6066 users was represented by a vector of 37 numbers, which codifies his/her leisure interests, moving ability, interest in popular places, and preferred period of the day for visiting points of interest. All features were standardized using the *Z-score* scaler, and the k-means clustering algorithm was used to partition this dataset into 25 clusters (this number was decided empirically after some experiments). Five of the 25 clusters had fewer than 50 users, so they were not considered to be very relevant and were dismissed from the subsequent mobility analysis. Thus, at the end of the clustering process, there were 20 clusters with a minimum of 50 visitors. By averaging the values of the 37 features for the users in each cluster, we obtained a general description of the preferences of the users in that group.
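A minimal sketch of this clustering step, using scikit-learn, is shown below; the feature matrix X is assumed to be the 6066 x 37 user-feature table described above, and the helper name is hypothetical.

```python
# Minimal sketch of the clustering step described above, using scikit-learn.
# X is assumed to be a NumPy array of shape (6066, 37); 25 and 50 are the
# values reported in the text.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_users(X, n_clusters=25, min_cluster_size=50, seed=0):
    X_std = StandardScaler().fit_transform(X)            # Z-score scaling
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(X_std)
    # Dismiss clusters with fewer than 50 users from the mobility analysis.
    sizes = np.bincount(labels, minlength=n_clusters)
    kept = {c for c in range(n_clusters) if sizes[c] >= min_cluster_size}
    # Average the 37 features per kept cluster to describe each profile.
    profiles = {c: X[labels == c].mean(axis=0) for c in kept}
    return labels, profiles
```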

A more detailed technical explanation of the data collection, filtering, activity identification, and clustering steps can be found in a previous work [39].

#### **4. Results**

#### *4.1. Characterization of the Visitors in Each Cluster*

The analysis identified different clusters with a wide range of visit preferences. Figure 1 highlights the main characteristics shared by tourists in each of the 20 clusters, including their origin, the kind of leisure activities they visit, their preferred time of the day, and their interest in different kinds of activities. All clusters are associated with one or more types of activities, which match the activity features used in the clustering process.

As can be seen in Figure 1, the historic and religious activities (columns A and B) are the ones associated with most clusters (0, 2, 5, 8, 11, 16, 17, and 19). In most of these clusters, users enjoy visiting the most popular POIs, revealing that the historic and religious attractions are among the most popular in the city of Barcelona. This also indicates a direct correlation between tourists' interests and POI popularity, as could be expected.

It can also be seen that the individuals in clusters 2 and 17 visit a larger number of popular POIs and have a wider spread of different kinds of activities. These clusters contain mostly non-Spanish tourists, who visit and experience many varieties of POIs that the city offers. On the other hand, clusters characterized mainly by Spanish nationals (clusters 1, 3, 9, 10, 15, 21, 23, and 24) focus on a small set of features and on unpopular POIs, indicating a higher specificity in their favorite POIs and their visits. People in cluster 1, for example, prefer to do scenic routes within the city, whereas those in cluster 3 prefer to go to the beach, places near the beach, and some historic sites. Tourists in cluster 9 visit Barcelona mainly for health and care, although they also visit religious sites and cafes, whereas those in cluster 15 prefer food and enotourism. Those in cluster 23 are mainly interested in visiting Camp Nou, the football stadium, and people in cluster 24 mainly visit museums.

**Figure 1.** Cluster characteristics highlighting tourists' origin, preferred tour time, orientation towards popular POIs, and interests in different activities. \* Note: A—Religious, B—Historic, C—Routes, D— Nature, E—Art Gallery, F— Recreational Facilities, G—Beach, H—Cultural Amenities, I—Shopping, J—Nightlife, K—Statues, L—Health & Care, M—Viewpoints, N—Sculptures, O—Accommodation, P— Food, Q—Enotourism, R—Amusement Parks, S—Museums, T—Architecture, U—Other Artworks, V—Art Centers.

The cluster-based mobility analysis that was carried out also allows for knowing the specific POIs visited by the tourists of each cluster. Figure 2 shows the percentage of tourists from each cluster who visit the top 20 most visited POIs of the destination, highlighting in each row the values above the average. For example, Camp Nou, the Barcelona football stadium, is visited mostly by tourists from cluster 23; religious architecture (such as the Sagrada Família) and historic monuments (such as the Palau de la Generalitat or the Casa Batlló) are mainly visited by tourists of clusters 0, 2, or 17; and beaches (like Barceloneta) or places near them by tourists of cluster 3. This figure also clearly identifies those clusters that are focused on the most popular POIs, and to what degree. Clusters 0 and 2 are heavily focused on popular POIs, as they have an above-average representation of 55% and 80% (respectively) in the popular POIs. In contrast, clusters 7, 12, 13, 15, 21, and 23 have very little representation in the popular POIs. This shows that the 20 clusters do not favor only popular or unpopular POIs.


**Figure 2.** Top 20 most visited attractions and the percentage of their tweets per cluster, highlighting values above average in green.

In summary, the clusters represent distinct groups of tourists with different characteristics and preferences derived from the clustering features, and they allow identifying who visits the most popular tourist POIs. This information can be very useful for the marketing managers of destinations and tourist attractions because it allows them to know the visiting preferences of their visitors. Moreover, these data can also be exploited for mobility analysis, as described in the next section.

#### *4.2. Tourist Mobility/Flow Analysis*

To analyze the mobility of tourists within each cluster, bigrams (*A*, *B*) were extracted from their sequences of visits. These bigrams are *n*-grams of size two that represent the movement of a tourist from A to B (i.e., the user sent a tweet from A and the next one from B). Given a sequence of items, *n*-grams are unique sets of *n* directly adjacent items. For example, a sequence *S* = {*a*, *b*, *c*, *d*} has the following 2-grams (popularly known as bigrams) 2*grams* = {(*a*, *b*),(*b*, *c*),(*c*, *d*)}. *N*-grams must maintain the order in the original sequence and must be unique.

Bigrams were extracted from the sequences of visited places of each tourist in a certain cluster, and we counted their frequency of occurrence (i.e., the number of times a bigram appeared in the cluster). The clusters 0, 2, 3, 7, and 16 were selected to illustrate this analysis because of their diversity. The top 20 bigrams with a higher frequency in each cluster were taken as the most relevant. Figure 3 shows heat map plots of the tourist mobility within clusters, where the color intensity represents the movement between POIs measured by the frequency of occurrence.
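A short sketch of this bigram extraction and counting step is given below; the function names are hypothetical and the input is assumed to be one list of visited POIs per tourist.

```python
# Sketch of the bigram extraction described above: unique ordered pairs of
# consecutive visits per tourist, counted across all tourists of a cluster.
from collections import Counter

def bigrams(visit_sequence):
    """Unique ordered pairs of directly adjacent visited POIs."""
    pairs = zip(visit_sequence, visit_sequence[1:])
    return set(pairs)                   # e.g. [a, b, c, d] -> {(a,b), (b,c), (c,d)}

def cluster_bigram_frequencies(sequences, top=20):
    """sequences: list of per-tourist POI visit sequences in one cluster."""
    counts = Counter()
    for seq in sequences:
        counts.update(bigrams(seq))
    return counts.most_common(top)      # the 20 most frequent movements
```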

As it can be seen from Figure 3, *Basílica de la Sagrada Família* acts as a hub in cluster 0, as all other POIs are directly connected to it. In most cases, tourists are visiting the other POIs after visiting *Basílica de la Sagrada Família* because its outflows exceed its inflows from other locations, except *Casa Batlló*, *Catedral de la Santa Creu i Santa Eulàlia*, and *Al actor Iscle Soler*, which might be a result of route preference. The strongest connection can be seen between *Basílica de la Sagrada Família* and *Park Güell* with almost equivalent inflow and outflow between them. In the case of cluster 2, *Basílica de la Sagrada Família* is once again the most interconnected attraction but with more inflows than in cluster 0, and the strongest connection is between *Basílica de la Sagrada Família* and *Camp Nou*. The other selected clusters (3, 7, and 16) show connections between various attractions with no clear hub.

Figures 4 and 5 help to further understand tourist mobility with network graphs plotted on the Barcelona city map. Nodes represent POIs, and edges are the bigrams that connect them (the wider the edge, the more tourists travel between those two locations). It can be seen that proximity plays a role in why *Basílica de la Sagrada Família* acts as a hub in clusters 0 and 2. In cluster 3, the focus is the Mediterranean Sea as most POIs are near the beach, except in the case of *Basílica de la Sagrada Família*, *Park Güell*, and *Camp Nou*, as tourists are willing to travel out of the way to see these POIs. Clusters 7 and 16 show two different kinds of tourists. The former is focused on bars near the city center, whereas the latter visits many different kinds of places all around the city, including amusement parks, shopping malls, and the beach, but also the most popular venues.

In summary, the analyzed clusters showed inter-connected POIs which are of interest to certain groups of tourists. In some cases, they focus on the popular attractions, but in others they also visit places off the beaten track. The graphic representation in the map of the mobility of different clusters of tourists in a destination provides new knowledge to DMOs, who could take them into account to define new tourist routes, to create targeted marketing campaigns or to optimize transport routes between heavily connected POIs for different types of tourists.


**Figure 3.** Mobility heat-maps for clusters 0, 2, 3, 16, and 7, where each cell represents movement from the POI on the left axis to the POI on the bottom axis.

**Figure 4.** Tourist mobility pattern between attractions in Barcelona for clusters 0, 2, 3, and 7.

**Figure 5.** Tourist mobility pattern between attractions in Barcelona for cluster 16.

#### **5. Discussion, Conclusions, and Implications**

The main contribution of the study is the introduction of a method for analyzing social media data that creates visitor profiles according to their travel preferences and mobility. This study corroborates that social media are very useful platforms as sources for big data research in the field of tourism [8,30], and they can also be used to study tourist mobility [26,35,41,55]. It also improves the knowledge about the different mobility patterns of tourists depending on who they are and what their preferences are [56]. Therefore, the study has shown that clustering visitors by their travel mobility permits uncovering much more information about visitors and their preferences than previous studies did. It can also provide complementary information to DMOs, attraction operators, and developers of contextual and next-POI recommender systems [81].

Additionally, the study also reveals the most popular or the most visited points of interest at destinations. Previous studies had also found the most popular tourist spots or the most visited routes by analyzing geo-tagged social media data [62,66], but their analyses did not allow for knowing which POIs were the most visited by the different types of tourists. Thus, interestingly, the applied method allows for knowing the percentage of tourists in each cluster who visit each attraction the most. This is crucial for the managers of the different tourist attractions who want to know who their main visitors are as well as their interests; in that way, they will be able to adapt their services and information in an almost personalized way.

Another contribution of the study is to show the mobility of each cluster of tourists between POIs and to see graphically the movement that they make in the map of the destination. Previous studies have shown mobility with place maps and most visited points [26]. Many studies on GSMD only focused on tourists' mobility patterns [48,59,73]. However, the difference in mobility patterns between sub-groups or clusters of tourists has not been fully researched [37]. Therefore, GSMD analysis allows for knowing the dispersion and tourists' movements, the routes they follow, the activities they carry out in a territory [57–62], and their density of movements [63] and flows [38,64,65]. Hence, the resulting information is particularly valuable for city management, since it provides a better knowledge of the connections between points of interest related to different clusters of tourists according to their preferences and behaviors. This information can help DMOs to define new tourist routes, to create targeted marketing campaigns, to optimize transport routes between heavily connected POIs for different types of tourists, or to improve the management of congestion or overcrowding situations.

To sum up, by applying Artificial Intelligence techniques to social media data to create clusters of tourists, it has been possible to segment them according to their visiting behavior and preferences. This information is key for DMOs and the different service providers of the destinations. The interest of DMOs in analyzing big data and knowing as much as possible about their tourists lies in being able to anticipate their interests and preferences [38]. This is precisely the information that this study provides. Therefore, the study can have a major impact on the marketing and flow management of tourist destinations. The exploration of the relationship between tourists' profiles, points of interest, and tourist mobility allows for gaining further insight into pressing debates on tourism pressure in specific locations and destination carrying capacity. Accordingly, it can be used by local and regional authorities, as well as by planners and urban designers, to deal with urban complexity, especially in successful tourist cities with contradictions and conflicts generated by overtourism [92].

From this perspective, the managerial implications of the study are diverse. Using these kinds of analytical tools, DMOs and tourism service providers would be able to offer more personalized services and information, to attract specific types of tourists to certain points of interest [37], to propose visits to new under-visited attractions to certain market segments, to create new routes or to optimize the existing ones, to enhance public transport services, to develop new POIs or tourist services for the busiest routes [93], or to create tourism development plans for the least visited areas [60].

In addition, it will allow destinations to encourage smart development, overcoming some of the existing gaps in the level of achievement of their objectives [7]. This includes the improvement of smart tourism developments such as the creation of differentiated attractive travel packages [86], the adaptation of the marketing and communication tactics to the preferences of visitors [10,11], the improvement of the satisfaction of tourists [5,14,15], and the co-creation of a more positive tourist destination image [94]. It will even have a major impact on travel recommendation systems, as shown in a previous study [39]. Finally, from the place management perspective, the detection of visitor flow patterns would help to regulate the carrying capacity of the visitors' points of interest avoiding overcrowding, improving allocation of visitor services and reducing tensions produced by the different tourist and residential uses of the city areas, infrastructures, and services around the points of interest.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app12125834/s1.

**Author Contributions:** Conceptualization, A.H. and S.A.C.; methodology, J.A.O., J.B. and A.M.; software, J.A.O.; validation, J.B. and A.M.; formal analysis, J.A.O.; investigation, J.A.O.; resources, J.B. and A.M.; data curation, J.A.O. and J.B.; writing—original draft preparation, J.A.O. and A.H.; writing—review and editing, A.H., A.M. and S.A.C.; visualization, J.A.O.; supervision, J.B. and A.M.; project administration, J.B. and A.M.; funding acquisition, J.B. and A.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not Applicable.

**Informed Consent Statement:** Not Applicable.

**Data Availability Statement:** 3rd Party Data Restrictions apply to the availability of these data. Data were obtained from Twitter and are available at https://twitter.com (accessed on 1 January 2019) with the permission of Twitter. Research is allowed to provide Tweet IDs to other researchers to download using Twitter's API. The Tweet IDs of data presented in this paper are included in the supplementary material.

**Acknowledgments:** Jonathan Ayebakuro Orama is a fellow of Eurecat's "Vicente López" PhD grant program.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


### *Article* **Multivariate Time Series Deep Spatiotemporal Forecasting with Graph Neural Network**

**Zichao He, Chunna Zhao \* and Yaqun Huang**

School of Information Science and Engineering, Yunnan University, Kunming 650504, China; hzc@mail.ynu.edu.cn (Z.H.); huangyq@ynu.edu.cn (Y.H.)

**\*** Correspondence: zhaochunna@ynu.edu.cn

**Abstract:** Multivariate time series forecasting has long been a subject of great concern. For example, there are many valuable applications in forecasting electricity consumption, solar power generation, traffic congestion, finance, and so on. Accurately forecasting periodic data such as electricity can greatly improve the reliability of forecasting tasks in engineering applications. Time series forecasting problems are often modeled using deep learning methods. However, the deep information of sequences and dependencies among multiple variables are not fully utilized in existing methods. Therefore, a multivariate time series deep spatiotemporal forecasting model with a graph neural network (MDST-GNN) is proposed to solve the existing shortcomings and improve the accuracy of periodic data prediction in this paper. This model integrates a graph neural network and deep spatiotemporal information. It comprises four modules: graph learning, temporal convolution, graph convolution, and down-sampling convolution. The graph learning module extracts dependencies between variables. The temporal convolution module abstracts the time information of each variable sequence. The graph convolution is used for the fusion of the graph structure and the information of the temporal convolution module. An attention mechanism is presented to filter information in the graph convolution module. The down-sampling convolution module extracts deep spatiotemporal information with different sparsities. To verify the effectiveness of the model, experiments are carried out on four datasets. Experimental results show that the proposed model outperforms the current state-of-the-art baseline methods. The effectiveness of the module for solving the problem of dependencies and deep information is verified by ablation experiments.

**Keywords:** multivariate time series; deep spatiotemporal information; down-sampling convolution; attention; graph neural network

#### **1. Introduction**

With the development of the Internet, various sensors and data-storage devices have appeared in modern society. As a result, a large amount of time series data is generated by recording temperature, traffic, power consumption, and financial data. Multivariate time series data consists of time series data generated by multiple sensors or data storage devices, and there are dependencies among multiple time series. For example, a household's daily electricity usage and hourly electricity production from solar panels can be considered time series data. In most cases, the data usually come from the electricity consumed by multiple households or solar data generated at different locations. A multivariate time series is constructed from these data. There may be complex dynamic dependencies between multivariate time series data. Therefore, each time series in a multivariate time series is helpful for forecasting tasks. At the same time, multivariate time series data such as electricity and transportation have periodic characteristics. Multivariate time series forecasting has been studied for a long time in capturing the periodic characteristics and dynamic dependencies of variables [1–4]. For example, studies using the Harmonic regression method enhanced the prediction performance of PM2.5 by capturing periodic information [5].

**Citation:** He, Z.; Zhao, C.; Huang, Y. Multivariate Time Series Deep Spatiotemporal Forecasting with Graph Neural Network. *Appl. Sci.* **2022**, *12*, 5731. https://doi.org/ 10.3390/app12115731

Academic Editors: Dionisis Margaris and Stefanos Ougiaroglou

Received: 16 May 2022 Accepted: 2 June 2022 Published: 5 June 2022


In recent years, some researchers have made efforts in time series analysis and forecasting. Traditional forecasting methods are represented by statistical methods, most of which rely on mathematical equations to describe the evolution of a time series, such as the Autoregressive Integrated Moving Average (ARIMA) model [6]. Various variants based on the ARIMA model have emerged. Models based on traditional methods are more computationally efficient. However, these models are mostly limited to linear models and univariate predictions, and they are difficult to extend to multivariate time series problems.

In the era of big data, with the development of sensor technology and computing power, most time series data come from data collected by various devices. However, the information in large-scale data is difficult to extract with traditional methods. Deep learning is the currently popular information extraction method, and features of large-scale data can be obtained through deep learning techniques. In time series forecasting, deep learning techniques are employed to achieve better forecasting accuracy than traditional methods. At the same time, deep learning technology has made great progress in the research and application of image processing, audio processing, and natural language processing [7–9]. Although traditional machine learning and statistical methods are often employed in time series forecasting tasks, deep learning techniques are gaining attention from researchers. With further development, there have been studies applying graph neural networks and attention methods to time series forecasting [10–12], capturing spatial information by building a graph structure over the multivariate time series. These methods have achieved certain results. A graph neural network allows each node in the graph to acquire information about the surrounding nodes. Multivariate time series can be considered from the perspective of graph nodes, and spatial information can be obtained between multivariate time series by constructing graphs. For periodic data, the information from adjacent nodes is particularly informative. Compared with vanilla neural networks, graph neural networks can capture the dependency information between different sequences, which is more suitable for the prediction of multivariate time series. Therefore, graph neural networks can achieve more accurate prediction results than vanilla neural networks by aggregating information from multiple sequences.

In this paper, a multivariate time series deep spatiotemporal forecasting model with a graph neural network (MDST-GNN) is proposed. The model consists of four core components: graph learning, temporal convolution, graph convolution, and down-sampling convolution. The local information and deep spatiotemporal information of multivariate time series data can be learned by the model. Dependencies between sequences can be captured by a graph learning module. The temporal information is captured by the 1D convolution of the temporal convolution module. In the graph convolution module, the spatial dependencies between variables are extracted through relational graphs. The spatiotemporal information in sequences with different sparsity is captured by downsampling convolution modules. The experiments in this paper show that the model has good prediction performance and generalization ability.

The main contributions of this paper are as follows:


The remainder of this paper is organized as follows: Section 2 presents the related work and research status. Section 3 introduces the definition of the problem and presents the overall framework and algorithmic flow of the model. Section 4 describes the experiments, reports the results of the different methods on the datasets, analyzes those results, and presents ablation experiments. Finally, Section 5 concludes the paper.

#### **2. Related Works**

#### *2.1. Time Series Forecasting with Traditional Methods*

Time series forecasting has been studied for a long time. Most existing work can be divided into statistical methods on the one hand and machine learning and deep learning methods on the other. Representative statistical approaches include the Autoregressive Integrated Moving Average (ARIMA) model [6] and the hybrid ARIMA and multilayer perceptron model (VARMLP) [13]. ARIMA obtains future forecasts by constructing polynomials of historical information and adding noise. VARMLP models linear and nonlinear components by combining ARIMA with an artificial neural network (ANN) [14]; neural networks are often used to capture nonlinear relationships in time series forecasting [15,16]. However, the main limitations of ARIMA models are their pre-assumed linear pattern and their requirement for highly stationary data, which make ARIMA unsuitable for modeling multivariate time series. Furthermore, Linear Support Vector Regression (SVR) [17] treats the prediction problem as a typical regression problem with parameters changing over time, and the vector autoregressive model (VAR) [18] extends the AR model and is a commonly used econometric model. However, these models are difficult to extend to multivariate time series forecasting problems.

#### *2.2. Time Series Forecasting with Deep Learning*

In recent years, with the improvement of data availability and computing power, deep learning techniques have been adopted to obtain better prediction accuracy than traditional methods: nonlinear patterns in the data can be captured by deep learning models, which then outperform traditional approaches. The time series forecasting models LSTNet [19] and TPA-LSTM [20] excel at capturing nonlinear information. LSTNet uses Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) to extract the short-term local dependence patterns and long-term patterns of time series: the former encodes the short-term local information of multivariate sequences into low-dimensional vectors, and the latter decodes these vectors to capture long-term dependency information. Long Short-Term Memory (LSTM) [21] is a variant of the RNN that is often used for time series forecasting and also shows excellent performance in natural language processing. TPA-LSTM extracts temporal patterns through an LSTM network, with a convolutional network and attention mechanism used to compute attention scores. However, LSTNet and TPA-LSTM cannot model the dependencies between multiple variables, and RNNs and LSTMs cannot memorize such dependencies over long horizons. Sample Convolution and Interaction Networks (SCINet) [22] use sequence sampling and convolutional neural networks to capture temporal information, but SCINet does not consider the dependencies among multiple variables, and stacking multiple layers results in large time and space costs. Recently, Temporal Convolutional Networks (TCN) [23] have been applied to time series forecasting; in the TCN architecture, dilated convolutions can flexibly adjust the receptive field, and its space complexity is lower than that of an RNN, although the causal convolution of a TCN is not well suited to sliding-window inputs. With the development of attention methods, the Transformer [24] has replaced the RNN in many sequence-modeling tasks, and various Transformer-based time series forecasting methods have emerged [25] that are quite effective for long series. However, although models that reduce the spatial cost of the Transformer have been proposed, the space cost is still large.

With further research, graph neural networks have shown great advantages in processing graph data [26–28]. In a graph neural network, each node is connected with its neighbors and information is transmitted between nodes. The variables of a multivariate time series can be transformed into a graph-like structure in which each variable is a node and nodes are interrelated through their hidden dependencies, so graphs can better represent the interdependencies between time series when modeling multivariate data. Many studies have used graph neural networks to deal with time series problems [29,30]. Multivariate Time Series Forecasting with Graph Neural Networks (MTGNN) [31] builds a graph with the variables as nodes, and the spatiotemporal information of the data is captured by dilated convolutional networks [32] and relational graphs. However, the dilated convolutional structure of MTGNN loses locally continuous and deep-level information. Although the above methods have been applied successfully to many forecasting problems, most of them have corresponding shortcomings.

Building on this comparison of traditional methods and classical deep learning models, a multivariate time series deep spatiotemporal forecasting model with a graph neural network (MDST-GNN) is proposed in this paper. The model learns relational graphs from multivariate time series data, and deep spatiotemporal information is extracted and filtered through down-sampling convolutional networks and attention methods. Our model can capture the deep nonlinear spatiotemporal features in the sequences and alleviates the loss of local and global information.

#### **3. Methods**

Firstly, this section presents the definition of the prediction problem. The MDST-GNN model consists of a graph-learning module, a temporal convolution module, a graph convolution module and a down-sampling convolution module. In this paper, an attention mechanism is introduced into the graph convolution module to filter spatiotemporal information. A down-sampling convolution module is proposed to extract deep spatiotemporal information to improve model performance. Finally, the overall frame diagram and algorithm steps of the MDST-GNN model are given.

#### *3.1. Problem Definition*

**Definition 1.** *A multivariate time series with sequence length m and n variables is given as the set* $X = \{s_{m,1}, s_{m,2}, s_{m,3}, \dots, s_{m,n}\}$*, where* $s_{m,1} = \{t_{1,1}, t_{2,1}, t_{3,1}, \dots, t_{m,1}\}$ *is the sequence of the first variable and* $t_{n,i} \in \mathbb{R}$ *is the value of the i-th variable at time step n.*

**Definition 2.** *A graph can be defined as* $G = \{V, E\}$*, where V is the set of nodes and E is the set of edges formed by the dependencies between variables. An adjacency matrix* $A \in \mathbb{R}^{N \times N}$ *is a mathematical form that facilitates graph computation: assuming* $v_i, v_j \in V$*, if* $(v_i, v_j) \in E$ *then* $A_{ij} = c > 0$*, and if* $(v_i, v_j) \notin E$ *then* $A_{ij} = 0$*.*

**Definition 3.** *The goal is to learn a model* $F(\cdot)$ *that maps X to Y by minimizing the L2 regularization loss function. Given a prediction step size h, the model predicts the future value* $Y = \{s_{m+h,n}\}$ *of X after h steps.*
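As a concrete illustration of Definitions 1–3, the sketch below turns a multivariate series into (window, horizon) training pairs; the window length of 168 and the horizon values follow the experimental setup in Section 4.4, and names such as `make_windows` are illustrative assumptions rather than the paper's released code.

```python
import numpy as np

def make_windows(series: np.ndarray, window: int = 168, horizon: int = 3):
    """series: array of shape (m, n) = (time steps, variables).
    Returns X of shape (samples, window, n) and Y of shape (samples, n),
    where Y is the value `horizon` steps after the end of each window."""
    m, n = series.shape
    xs, ys = [], []
    for start in range(0, m - window - horizon + 1):
        xs.append(series[start:start + window])
        ys.append(series[start + window + horizon - 1])
    return np.stack(xs), np.stack(ys)

# Example: 1000 time steps of 8 variables.
data = np.random.rand(1000, 8)
X, Y = make_windows(data, window=168, horizon=3)
print(X.shape, Y.shape)  # (830, 168, 8) (830, 8)
```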

#### *3.2. Graph Learning Module*

During the training of the model, the spatial representation of the input data and the implicit relationships among the multiple sequences are obtained by the graph learning module. In existing research, graphs are mostly constructed from node similarity or distance, and such graphs are usually bidirectional or symmetric. For the prediction task in this paper, a change in one node may affect other nodes: for example, if the roads in a certain area become congested, other roads in the area will also change, and when solar power generation at one site in a region rises, other sites in the region will also change. Therefore, a one-way (directed) relational graph structure is more suitable, constructed as follows:

$$A\_1 = \tanh(\alpha \times E\_1 \times \mathcal{W}\_1) \tag{1}$$

$$A\_2 = \tanh(\alpha \times E\_2 \times \mathcal{W}\_2) \tag{2}$$

$$B = A\_1 \times A\_2^{T} \tag{3}$$

$$A = \operatorname{Relu}(\tanh(\alpha \times (B - B^T)))\tag{4}$$

According to the above formulae, the initial adjacency matrix *A* can be obtained. *E*1 and *E*2 are learnable node-embedding matrices, and *W*1 and *W*2 are learnable model parameters. The hyperparameter *α* controls the saturation rate of the activation function and prevents the vanishing-gradient problem it can cause. The adjacency matrix is converted into a one-way matrix by the matrix subtraction and the ReLU activation: when *Aij* is positive, its symmetric counterpart *Aji* is zero. The matrix is then sparsified row by row:

$$A[i, \neg \operatorname{argtopk}(A[i,:],\, k)] = 0 \tag{5}$$

Argtopk(·) returns the indices of the top *k* largest entries of a vector. To reduce the amount of computation, only the *k* most relevant nodes (those with the largest values) are retained for each node during graph convolution, and the entries of the unselected nodes are set to 0. This method is more flexible in constructing graphs, and the internal structure of the graph can be adjusted according to the data during training.
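A hedged PyTorch sketch of the graph-learning step in Equations (1)–(5) is given below. The class name `GraphLearner`, the tensor shapes, and the use of `nn.Linear` for the projections are assumptions for illustration; the authors' released code may organize this differently.

```python
import torch
import torch.nn as nn

class GraphLearner(nn.Module):
    def __init__(self, num_nodes: int, emb_dim: int = 40, alpha: float = 3.0, k: int = 20):
        super().__init__()
        self.E1 = nn.Parameter(torch.randn(num_nodes, emb_dim))  # learnable node embeddings
        self.E2 = nn.Parameter(torch.randn(num_nodes, emb_dim))
        self.W1 = nn.Linear(emb_dim, emb_dim, bias=False)        # learnable projections
        self.W2 = nn.Linear(emb_dim, emb_dim, bias=False)
        self.alpha, self.k = alpha, k

    def forward(self) -> torch.Tensor:
        a1 = torch.tanh(self.alpha * self.W1(self.E1))            # Eq. (1)
        a2 = torch.tanh(self.alpha * self.W2(self.E2))            # Eq. (2)
        b = a1 @ a2.t()                                            # Eq. (3)
        adj = torch.relu(torch.tanh(self.alpha * (b - b.t())))    # Eq. (4): one-way graph
        # Eq. (5): keep only the top-k neighbours per row, zero out the rest.
        mask = torch.zeros_like(adj)
        _, idx = adj.topk(min(self.k, adj.size(1)), dim=1)
        mask.scatter_(1, idx, 1.0)
        return adj * mask

adj = GraphLearner(num_nodes=8, k=4)()   # 8 x 8 sparse one-way adjacency matrix
print(adj.shape)
```

The one-way property follows from the subtraction in Equation (4): whenever an entry survives the ReLU, its transposed counterpart is non-positive and is therefore zeroed.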

#### *3.3. Temporal Convolution Module*

The temporal convolution module consists of two Dilated Inception layers, followed by Tanh and Sigmoid activation functions, respectively. The Tanh branch acts as a filter, while the Sigmoid branch acts as a gate unit that controls how much of the filtered information is propagated to the next module.

The Dilated Inception layer combines the advantages of the Inception network and dilated convolution: information at different scales can be captured by the Inception structure, while the dilated convolution ensures that long sequences can be processed. The receptive field of a traditional convolutional network is limited by the depth of the network and the size of the convolution kernels, so processing long sequences requires increasing the kernel size or the network depth, which increases the computational cost. For example, for a network with *L* stacked one-dimensional convolutional layers with kernel size *K*, the receptive field is:

$$R = (K - 1) \times L + 1\tag{6}$$

In WaveNet [33], stacked dilated convolutions were used to make the network have a very large receptive field. Computational efficiency and input data integrity are guaranteed with only a few layers. In this paper, the dilated convolution method is used to reduce the cost of model computing resources. The dilation factor is doubled for each layer. With the increase in depth, the dilation factor can make the receptive field grow exponentially. For long time series, the advantage of dilated convolution is that the internal data structure can be preserved without reducing the length of the data input. However, models designed based on dilated convolutions also have some problems. Due to the discontinuity of the convolution kernel, all elements cannot be covered in the dilated convolution, so the continuity of information will be lost. Therefore, the Inception network is introduced into the model. Multiple convolution kernels of different sizes are used to cover the receptive field to retain more information.
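A small worked example of Equation (6) compared with the exponentially growing receptive field of stacked dilated convolutions (dilation doubling per layer, as described above); the numbers are purely illustrative arithmetic.

```python
def receptive_field_standard(kernel: int, layers: int) -> int:
    # Eq. (6): stacked standard convolutions grow the receptive field linearly.
    return (kernel - 1) * layers + 1

def receptive_field_dilated(kernel: int, layers: int) -> int:
    # Dilation doubles each layer (1, 2, 4, ...), so the field grows exponentially.
    return 1 + (kernel - 1) * sum(2 ** i for i in range(layers))

print(receptive_field_standard(kernel=2, layers=5))  # 6
print(receptive_field_dilated(kernel=2, layers=5))   # 32
```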

In the Inception network, features at multiple scales are obtained by convolution kernels of different sizes, and the sparse matrices output by the parallel convolutions are aggregated into denser submatrices. The kernel sizes need to be selected according to the characteristics of the data. Based on the periodicities of the time series (2, 3, 6, 7), four convolution kernels are used by the Inception network in this paper; combinations of these kernels can cover time periods of many different lengths. The input data *X* is processed with the four convolution kernels of different scales, and the outputs are fused to obtain complete temporal information.

$$X = \text{concat}(X \times I\_{1 \times 2}, X \times I\_{1 \times 3}, X \times I\_{1 \times 6}, X \times I\_{1 \times 7})\tag{7}$$

Finally, the convolution outputs of different lengths are cropped to the same length and, as in Formula (7), concatenated along the channel dimension. Each branch of the network learns features at a different scale, and the channel-wise concatenation increases the network's adaptability to information at different scales.
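A minimal PyTorch sketch of a Dilated Inception layer as described above and in Formula (7) is given below: four 1-D convolutions with kernel sizes 2, 3, 6, and 7, cropped to a common length and concatenated along the channel dimension. The class name, equal channel split, and cropping rule are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DilatedInception(nn.Module):
    def __init__(self, c_in: int, c_out: int, dilation: int = 2):
        super().__init__()
        self.kernels = (2, 3, 6, 7)
        assert c_out % len(self.kernels) == 0
        self.convs = nn.ModuleList(
            nn.Conv2d(c_in, c_out // len(self.kernels),
                      kernel_size=(1, k), dilation=(1, dilation))
            for k in self.kernels
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, nodes, time)
        outs = [conv(x) for conv in self.convs]
        min_len = min(o.size(-1) for o in outs)        # crop to the shortest output
        outs = [o[..., -min_len:] for o in outs]
        return torch.cat(outs, dim=1)                  # channel-wise concat, Formula (7)

x = torch.randn(4, 16, 8, 168)                         # batch, channels, nodes, window
layer = DilatedInception(c_in=16, c_out=16)
print(layer(x).shape)                                  # torch.Size([4, 16, 8, 156])
```

In the full temporal convolution module, two such layers run in parallel: one output is passed through Tanh (the filter) and the other through Sigmoid (the gate), and their element-wise product is passed to the next module.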

#### *3.4. Graph Convolution Module*

In the graph convolution module, dependency information is extracted from the input data through the adjacency matrix produced by the graph learning module. Each node is fused with the information of the nodes it depends on most strongly, so that the dependency information of each node and its related nodes is obtained.

The graph convolution module consists of two Attention Mixed Propagation layers that process the inflow and outflow information of each node. The structure of the graph convolution module is shown on the left of Figure 1, where A is the adjacency matrix of node relationships obtained by the graph learning layer. The inflow information of each node is processed by one Attention Mixed Propagation layer, while *A<sup>T</sup>* is used by the other to process the outflow information; the final node features are obtained by summing the inflow and outflow information. The structure of the Attention Mixed Propagation layer is shown on the right of Figure 1, where A is the dependency matrix between nodes, *Din* is the output of the temporal convolution layer, and *D*(*k*) is the information obtained after *Din* has been propagated *k* times over the nodes of A. After the Concat connection, C is the channel dimension of the feature information, N is the number of variables, and T is the sequence length. The node information is weighted and filtered along the different dimensions of the feature map, and the weighted sum is used as the output of the Attention Mixed Propagation layer. The graph obtained by the graph learning layer drives the information flow between the relevant nodes. The Attention Mixed Propagation layer consists of two parts: information propagation and information selection.

**Figure 1.** Graph Convolution Module and Attention Mixed Propagation.

The information propagation steps are as follows:

$$D^{(k)} = \beta D\_{in} + (1 - \beta) \frac{(A + I)}{\sum\_{j=0}^{n} A\_{ij}} D^{(k-1)} \tag{8}$$

$$D = \operatorname{Concat}(D^{(0)}, D^{(1)}, \dots, D^{(k)}) \times W^{(k)} \tag{9}$$

In Formula (8), *β* is a hyperparameter that controls the retention rate of the initial information, A is the graph adjacency matrix, K is the depth of information propagation, and *Din* is the output of the previous module. As Formula (9) shows, the propagated representations *D*(*k*) are concatenated along the channel dimension, after which the channel dimension is compressed back to its original size by a convolutional network. D therefore contains the spatial-dependence features of each node after information propagation.
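A hedged PyTorch sketch of the propagation step in Formulae (8) and (9) is shown below; it covers only the information-propagation part of the Attention Mixed Propagation layer (the information-selection part is sketched separately after Formula (15)), and all names and tensor layouts are assumptions.

```python
import torch
import torch.nn as nn

class MixedPropagation(nn.Module):
    def __init__(self, c_in: int, depth: int = 2, beta: float = 0.05):
        super().__init__()
        self.depth, self.beta = depth, beta
        # Formula (9): compress the concatenated hops back to c_in channels.
        self.proj = nn.Conv2d(c_in * (depth + 1), c_in, kernel_size=(1, 1))

    def forward(self, d_in: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # d_in: (batch, channels, nodes, time); adj: (nodes, nodes)
        n = adj.size(0)
        a_hat = adj + torch.eye(n, device=adj.device)
        a_hat = a_hat / a_hat.sum(dim=1, keepdim=True)    # row-normalised (A + I)
        d = d_in
        hops = [d]
        for _ in range(self.depth):
            # Formula (8): propagate over the node dimension, keep beta of the input.
            d = self.beta * d_in + (1 - self.beta) * torch.einsum("vw,bcwt->bcvt", a_hat, d)
            hops.append(d)
        return self.proj(torch.cat(hops, dim=1))          # Formula (9)

adj = torch.rand(8, 8)
x = torch.randn(4, 16, 8, 156)
print(MixedPropagation(c_in=16)(x, adj).shape)            # torch.Size([4, 16, 8, 156])
```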

However, initial and new information are continuously fused during graph convolution, and the initial information of a node is gradually lost once a certain propagation depth is reached, which leads to unreliable representations. Therefore, the *β* parameter is introduced to preserve part of the initial information, ensuring that both the local and the deep-level information of the nodes is retained. At the same time, the case in which the spatial dependence between the data is weak or absent also needs to be considered; in that case, the important information of each node would instead be disturbed by the graph convolution operation.

Therefore, Spatial and Channel Squeeze and Excitation (scSE) [34] is used as an information-selection method to filter out unnecessary information in this paper. The scSE attention method consists of two parts, namely Channel Squeeze and Excitation (cSE) and Spatial Squeeze and Excitation (sSE). The attention mechanism adjusts the weights of features in the network: by weighting important feature maps or feature channels, the influence of unimportant features is reduced, thereby improving prediction results. The core idea of Squeeze and Excitation (SE) is to learn feature weights dynamically during training, amplifying effective feature weights and reducing invalid or weak ones so that the model achieves better results.

In cSE, the spatial dimensions are compressed and the channel-dimension weights are adjusted: after the data pass through the fully connected layer, the weight of each channel is multiplied with the original input along the channel dimension to obtain the adjusted feature map.

Channel attention is defined as follows:

$$z\_k = \frac{1}{H \times W} \sum\_{i}^{H} \sum\_{j}^{W} d\_k(i, j) \tag{10}$$

$$\hat{z} = \operatorname{sigmoid}(w_1 z), \quad w_1 \in \mathbb{R}^{C \times C} \tag{11}$$

$$\hat{D}_{cSE} = \left[ \sigma(\hat{z}_1) d_1, \dots, \sigma(\hat{z}_C) d_C \right] \tag{12}$$

The input feature map is $D = [d_1, d_2, \dots, d_C]$, where each channel $d_i \in \mathbb{R}^{H \times W}$. $D$ is converted to $z \in \mathbb{R}^{1 \times 1 \times C}$ by a global average pooling (GAP) layer. In Formula (10), $z_k$ is the global spatial information of channel $k$ obtained through the pooling layer. In Formula (11), this global spatial information is transformed into channel weights $\hat{z}$ (between 0 and 1) by the fully connected layer and the sigmoid. As the network trains, the input feature map is adaptively adjusted to emphasize important channels.

The sSE part compresses the channel information and adjusts the weights of the spatial dimension: after the data pass through the convolutional layer, the weight of each spatial position is multiplied with the original input along the spatial dimension to obtain the adjusted feature map.

Spatial attention is defined as follows:

$$q = \text{sigmoid}(D \times w\_1), q \in R^{H \times W} \tag{13}$$

$$\hat{D}_{sSE} = \left[ \sigma(q_{1,1}) d_{1,1}, \dots, \sigma(q_{h,w}) d_{h,w} \right] \tag{14}$$

$$D\_{\rm out} = \hat{D}\_{cSE} + \hat{D}\_{sSE} \tag{15}$$

The input feature map is $D = [d_{1,1}, d_{1,2}, \dots, d_{h,w}]$, where each $d_{i,j} \in \mathbb{R}^{1 \times 1 \times C}$. In Formula (13), $D$ is converted into spatial weights $q$ (between 0 and 1) by the convolutional network and the sigmoid activation function. Finally, the channel-attention feature map and the spatial-attention feature map are added to obtain the weighted feature map $D_{out}$, which is the output after information filtering.
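A minimal PyTorch sketch of the scSE information-selection step in Formulae (10)–(15) follows: channel squeeze-and-excitation (cSE) plus spatial squeeze-and-excitation (sSE), with the two re-weighted feature maps summed. The reduction ratio and layer names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SCSE(nn.Module):
    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        # cSE: squeeze the spatial dims (Formula 10), learn per-channel weights (Formula 11).
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # sSE: squeeze the channel dim, learn per-position weights (Formula 13).
        self.sse = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, d: torch.Tensor) -> torch.Tensor:
        # d: (batch, channels, nodes, time)
        d_cse = d * self.cse(d)   # Formula (12): channel-weighted feature map
        d_sse = d * self.sse(d)   # Formula (14): spatially weighted feature map
        return d_cse + d_sse      # Formula (15)

x = torch.randn(4, 16, 8, 156)
print(SCSE(channels=16)(x).shape)  # torch.Size([4, 16, 8, 156])
```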

#### *3.5. Down-Sampling Convolution Module*

The down-sampling convolution module in this paper is a multilayer module in which temporal features at multiple resolutions are captured by down-sampling convolutions to improve model performance. The module consists of basic down-sampling blocks (D-s Blocks), each with the structure shown on the right of Figure 2; the last layer is connected via Concat, and FC is a fully connected layer that maps the output to the specified dimension, as shown on the left of Figure 2. The output of the previous module is used as the input of a binary-tree structure, and after passing through L layers, multiple short spatiotemporal subsequences are obtained. These short subsequences are rearranged into a new deep spatiotemporal sequence, and through a residual connection, the original input is added to this newly extracted sequence to preserve the integrity of the information. A basic down-sampling block in the binary-tree structure consists of sequence splitting, convolutional filtering, and information crossing. As shown on the right of Figure 2, the output of the previous module is divided into two subsequences in the down-sampling component: the output *Dout* of the graph convolution module is the input of the first layer, and each subsequent layer takes the output of the layer above as input. *Deven* and *Dodd* denote the subsequences with even and odd indices, respectively. Conv1, Conv2, Conv3, and Conv4 are four 1D convolutional networks with identical structure, the exp function is used to highlight peaks in the information, and the Hadamard product ⊙ denotes element-wise multiplication. The output of the current layer is obtained by subtracting the two sequences. Different convolutional filters extract new sequence information from each subsequence, and to avoid the information loss caused by splitting the sequence repeatedly, the important information of the subsequences is preserved through information crossing.

**Figure 2.** Down-sampling Convolution Module and Down-sampling block (D-s Block).

In the down-sampling block of Figure 2, the input sequence is decomposed into odd and even subsequences: the odd subsequence takes the data at positions 1, 3, 5, etc., and the even subsequence takes the data at positions 0, 2, 4, etc. The original sequence *Dout* is thus decomposed into the two subsequences *Deven* and *Dodd*, which reduce the amount of data while retaining most of the information of the original sequence. Afterwards, different convolutional networks extract feature information from the two subsequences, and the fused feature information obtained by these different convolutions has stronger representation ability:

$$D_{even}, D_{odd} = \operatorname{split}(D_{out}) \tag{16}$$

$$\boldsymbol{D}'\_{\text{even}} = \boldsymbol{D}\_{\text{even}} \odot \exp(\text{conv}\_1(\text{D}\_{\text{odd}})), \quad \boldsymbol{D}'\_{\text{odd}} = \boldsymbol{D}\_{\text{odd}} \odot \exp(\text{conv}\_2(\text{D}\_{\text{even}})) \tag{17}$$

$$D_{even}^{\prime\prime} = D_{even}^{\prime} - \operatorname{conv}_{3}(D_{odd}^{\prime}), \quad D_{odd}^{\prime\prime} = D_{odd}^{\prime} - \operatorname{conv}_{4}(D_{even}^{\prime}) \tag{18}$$

In Formula (16), *Dout* is split into *Deven* and *Dodd*. In Formula (17), each subsequence is mapped to a hidden state through a convolutional network and the exp function and used to scale the other subsequence element-wise. In Formula (18), two further convolutional networks map *D'even* and *D'odd* to new hidden states, which are subtracted to obtain the final result. The mapping of the hidden states can be viewed as a scaling transformation of the two subsequences, with the scaling factors learned during network training.
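A hedged PyTorch sketch of one down-sampling block (Formulae (16)–(18)) is given below: the sequence is split into even- and odd-indexed subsequences, each is scaled by the exponential of a convolution of the other, and the results interact by subtraction. The kernel size and padding of the 1-D convolutions are assumptions for illustration.

```python
import torch
import torch.nn as nn

def conv1d_same(channels: int) -> nn.Module:
    # A simple length-preserving 1-D convolution over the time axis.
    return nn.Conv1d(channels, channels, kernel_size=3, padding=1)

class DownSamplingBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1, self.conv2 = conv1d_same(channels), conv1d_same(channels)
        self.conv3, self.conv4 = conv1d_same(channels), conv1d_same(channels)

    def forward(self, d: torch.Tensor):
        # d: (batch, channels, time)
        d_even, d_odd = d[..., ::2], d[..., 1::2]                 # Formula (16)
        d_even_s = d_even * torch.exp(self.conv1(d_odd))          # Formula (17)
        d_odd_s = d_odd * torch.exp(self.conv2(d_even))
        d_even_out = d_even_s - self.conv3(d_odd_s)               # Formula (18)
        d_odd_out = d_odd_s - self.conv4(d_even_s)
        return d_even_out, d_odd_out

x = torch.randn(4, 16, 156)
even, odd = DownSamplingBlock(channels=16)(x)
print(even.shape, odd.shape)   # torch.Size([4, 16, 78]) torch.Size([4, 16, 78])
```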

Compared with the dilated convolution used in the WaveNet architecture, the down-sampling convolution module enlarges the receptive field. More importantly, after the input is divided into two subsequences, the temporal information in *Deven* and *Dodd* is fused by different one-dimensional convolutions; in this process, the integrity of the temporal information is preserved and the representation ability is enhanced. The down-sampling convolution module consists of several basic blocks arranged as a tree, and the information in the blocks is accumulated layer by layer, so the deep feature information contains the small-scale temporal information transmitted from the shallow layers. In this way, both short-term and long-term dependencies of the time series can be captured. In the last layer of the module, all the subsequences are recombined by reversing the odd/even splitting to obtain a new sequence representation, which is fused with the original sequence through a residual connection to preserve the integrity of the sequence information. Finally, a fully connected layer maps the result to the specified output dimension.

The output part consists of skip connections and two 1 × 1 convolutional networks in this paper. Skip connections normalize the output of the down-sampling convolution module to have the desired predicted sequence length. A 1 × 1 standard convolutional network is used to convert the channel dimension to the desired dimension.

#### *3.6. Experiment Model*

The model structure of MDST-GNN is shown in Figure 3. It consists of four parts: a graph learning module, K temporal convolution modules, K graph convolution modules, and a down-sampling convolution module. The graph learning module constructs the spatial dependency graph of the data, the temporal convolution module captures the temporal information, the spatiotemporal information is obtained by fusing the dependency graph and the temporal information in the graph convolution module, and the down-sampling convolution module extracts deep spatiotemporal information. Residual connections are added to the model to avoid vanishing gradients during training, and a skip connection is added after each temporal convolution module. To obtain the final output, the hidden features are projected onto the desired output dimension by convolution. Algorithm 1 shows the algorithm steps of the model.



The above process can be divided into four parts. Lines 1 to 2 are the data-preprocessing part, lines 4 to 15 are the MDST-GNN model training and fitting process, lines 16 to 18 are the loss-function calculation, and line 21 gives the final prediction result. The source code of the algorithm is available at https://github.com/yiminghzc/MDST-GNN (accessed on 28 May 2022).

**Figure 3.** Model structure diagram of MDST-GNN.

#### **4. Experiment Analysis**

*4.1. Experiment Dataset Analysis*

In order to evaluate the effectiveness and generalization ability of the model, classical multivariate time series datasets are adopted in this paper. The datasets come from the experimental data used by LSTNet [19] and cover four different domains:


Table 1 presents the statistical information of the four experimental datasets used in this paper. Time length is the number of time steps in a sequence, Variable is the number of sequences in the multivariate series, and Sample rate is the time interval of data recording. For example, the Traffic dataset contains 862 sequences of length 17,544 recorded at hourly intervals.

**Table 1.** Dataset Statistics.


The datasets contain both linear and nonlinear interdependencies. The characteristics of the datasets are illustrated by selecting two variables from each dataset. The power consumption of two users at different times is shown in Figure 4, and the road occupancy of two roads at different times is shown in Figure 5. The horizontal axis represents the different time periods of each day, and the vertical axis represents electricity consumption and road occupancy during those periods.

**Figure 4.** Consumer electricity consumption per hour.

**Figure 5.** Road occupancy rate per hour.

The power generation values recorded every 10 min at both power stations are shown in Figure 6. Daily exchange rate values for the two countries are shown in Figure 7. The horizontal axis represents the time span in different units. The vertical axis represents power generation and exchange rate values during this period.

**Figure 6.** Solar power generation per 10 min.

**Figure 7.** Daily exchange rates for two countries.

The above data analysis reveals the characteristics of the four datasets. As can be seen from the figures, the Traffic, Electricity, and Solar Energy datasets have strong periodic patterns; the periods of the individual variables are not exactly the same, but their values in different periods are very similar. These observations motivate this paper: the proposed multivariate time series forecasting model MDST-GNN can fully utilize the similarity information in the data. Information filtering is strengthened by incorporating attention methods, and the down-sampling convolution module is added to improve deep information extraction. The relationships between the variables are modeled by a graph neural network to capture the dependency information between them, and this dependency information enhances the forecasting of periodic time series.

#### *4.2. Methods for Comparison*

In order to evaluate the experimental model effectively and show the performance differences between methods, the current state-of-the-art methods are compared with the model proposed in this paper. The performance of the model on the different datasets is shown through this diverse comparison.

The methods in our comparative evaluation are as follows:


#### *4.3. Metrics*

To evaluate the performance of our proposed MDST-GNN model, the same evaluation metrics are used for the comparison methods in this paper. The pros and cons of different methods can be clearly displayed by the same evaluation metrics. The evaluation metrics are the relative root mean square error (RRMSE) and the empirical correlation coefficient (CORR):

$$\text{RRMSE} = \frac{\sqrt{\sum\limits_{t}\sum\limits_{i=1}^{n}\left(Y_{ti} - \hat{Y}_{ti}\right)^{2}}}{\sqrt{\sum\limits_{t}\sum\limits_{i=1}^{n}\left(Y_{ti} - \overline{Y}\right)^{2}}},\; t \in \Omega_{\text{Test}} \tag{19}$$

$$\text{CORR} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum\limits_{t}\left(Y_{ti} - \overline{Y}_{i}\right)\left(\hat{Y}_{ti} - \overline{\hat{Y}}_{i}\right)}{\sqrt{\sum\limits_{t}\left(Y_{ti} - \overline{Y}_{i}\right)^{2}\left(\hat{Y}_{ti} - \overline{\hat{Y}}_{i}\right)^{2}}},\; t \in \Omega_{\text{Test}} \tag{20}$$

In Formulae (19) and (20), *Y* and *Ŷ* are the ground-truth and predicted values, respectively, and *Ȳ* and the corresponding barred *Ŷ* denote their means. Ω*Test* indicates that the computation uses the test set. For RRMSE, lower values are better; for CORR, higher values are better.
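A hedged NumPy sketch of the two metrics follows, using the LSTNet-style convention implied by Formulae (19) and (20): RRMSE normalizes the prediction error by the deviation of the test data from its mean, and CORR averages the per-variable correlation coefficients. Function names are illustrative.

```python
import numpy as np

def rrmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # y_true, y_pred: (test steps, variables); Formula (19)
    num = np.sqrt(np.sum((y_true - y_pred) ** 2))
    den = np.sqrt(np.sum((y_true - y_true.mean()) ** 2))
    return float(num / den)

def corr(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Formula (20): mean of the per-variable correlation coefficients.
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    num = (yt * yp).sum(axis=0)
    den = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    den[den == 0] = 1e-12          # guard against constant series
    return float(np.mean(num / den))

y_true = np.random.rand(200, 8)
y_pred = y_true + 0.05 * np.random.randn(200, 8)
print(rrmse(y_true, y_pred), corr(y_true, y_pred))
```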

#### *4.4. Experimental Setup*

This section describes the hardware and software environment of the experiments and the configuration of related parameters. Our models are built with the PyTorch open-source deep learning library. The experiments use an Intel Core i5-10400F 2.9 GHz CPU, an NVIDIA GeForce RTX 3060 12 GB GPU, and 16 GB of memory.

The four datasets are divided by time into training set (60%), validation set (20%), and test set (20%) in this paper. According to the performance difference of the model among different hyperparameters, the optimal hyperparameters are selected to validate the performance of the model on the test set.
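A simple sketch of the chronological 60/20/20 split described above; indices are taken in time order rather than shuffled, so the test set always follows the validation set in time.

```python
import numpy as np

def chronological_split(series: np.ndarray, train: float = 0.6, valid: float = 0.2):
    m = series.shape[0]
    i_train, i_valid = int(m * train), int(m * (train + valid))
    return series[:i_train], series[i_train:i_valid], series[i_valid:]

data = np.random.rand(1000, 8)
train_set, valid_set, test_set = chronological_split(data)
print(len(train_set), len(valid_set), len(test_set))   # 600 200 200
```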

The key hyperparameter settings of MDST-GNN are shown in Table 2. The model adopts four graph convolution modules, four temporal convolution modules, and one down-sampling convolution module. Adam is used as the optimizer, and the gradient clipping threshold is 5. The network is driven toward the optimal solution by learning rate decay. The input sequence window length is 168, and the output sequence length is 1. The model is trained to predict future target intervals (Horizon) of 3, 6, 12, and 24. The initial 1 × 1 convolution has 1 input channel and 16 output channels. The activation-function saturation rate of the graph learning module is 3, and the node-embedding dimension is 40. For the Electricity, Traffic, and Solar Energy datasets, the number of neighbors (k in Formula (5)) of each node is set to 20; neighboring node information is not considered for Exchange Rate. The dilation factor of the convolutional network in the temporal convolution module is 2, and a dropout of 0.3 is employed after each temporal convolution module. Layer normalization is used after each graph convolution module. The propagation depth of the Attention Mixed Propagation layer is 2, and its information-retention rate is 0.05. Both the graph convolution module and the temporal convolution module have 16 output channels. The number of layers (Num layers) of the down-sampling convolution module is shown in Table 2; its input channel count is 16 and its dropout is 0.3. The two skip-connection layers have 32 output channels. In the output part of the model, two standard 1 × 1 convolutional networks with 64 and 1 output channels are used. The number of training epochs is 30. More detailed hyperparameters can be found in Table 2.


**Table 2.** Hyperparameter Settings.

The optimal hyperparameters are obtained by varying one parameter while fixing the others. Each experiment was repeated 5 times with 30 epochs per run, and the hyperparameter value that minimized the RRMSE across the 5 runs was chosen. We investigated the hyperparameters that most affect the results of the MDST-GNN model, selecting the number of neighbors and the number of layers; these are analyzed using the results of 5 experiments on the Exchange Rate dataset, shown in Figure 8. Figure 8a shows that fewer related nodes give better results, which indicates that the spatial dependence among the exchange rates is weak and that adding neighbor information only increases noise. Figure 8b shows that increasing the number of layers improves the results, but the RRMSE flattens once the number of layers exceeds 4, because overfitting occurs when too many layers are used.

**Figure 8.** Number of neighbors and Layers parameter analysis in Exchange Rate. Left (**a**) is the RRMSE result of the parameter Number of neighbors taking 0, 2, 4, and 8 respectively, and right (**b**) is the RRMSE result of the parameter Layers from 0 to 6.

#### *4.5. Results*

The experiments are carried out on the four test sets to validate the MDST-GNN model, and the evaluation results of all methods are summarized in Table 3. The horizon is set to 3, 6, 12, and 24 to predict the value after the specified time period in the future; for the Traffic and Electricity datasets this corresponds to forecasting 3 to 24 h ahead, for Solar Energy to forecasting 30 to 240 min ahead, and for Exchange Rate to forecasting 3 to 24 days ahead. The larger the horizon, the harder the prediction task. The optimal results are highlighted in the table in bold black, and the suboptimal results are shown in red. Finally, the comparison between the MDST-GNN model and the other related models is shown in Table 3.


**Table 3.** Comparison of forecasting methods for multivariate time series methods.

Note: The predicted optimal results are in bold black, and the suboptimal results are shown in red.

We plot the RRMSE results of all methods on Electricity and Traffic in Figure 9. The figure shows that the neural-network-based methods significantly outperform the VARMLP model on these datasets; clearly, neural networks have an advantage in time series forecasting, and the successive neural models improve the prediction accuracy further. Among them, MTGNN, which incorporates spatial information, outperforms the LSTNet and TPA-LSTM models, and the SCINet model also performs well on some datasets. Figure 9a shows that the RRMSE of the different models is relatively stable and that each successive model improves the prediction accuracy. Figure 9b shows that the MTGNN model fluctuates when the horizon is 6, which indicates that the spatial dependence of the dataset is weak at this horizon and interferes with the model. Compared with MTGNN, the MDST-GNN model fluctuates less in this region, which suggests that the attention method filters information and reduces the interference of irrelevant information; at the other horizons the RRMSE is stable.

**Figure 9.** RRMSE comparison of all methods in Electricity and Traffic. Left (**a**) is the RRMSE result of all methods in Electricity, right (**b**) is the RRMSE result of all methods in Traffic, and horizon takes 3, 6, 12, and 24 respectively.

The RRMSE metric of our model improves by 1%, 1.4%, 3.5%, and 3.3% on the Electricity dataset compared to the state-of-the-art method. Furthermore, on the Traffic dataset with a horizon of 12, the RRMSE is 1.9% better than the best baseline, which demonstrates the effectiveness of the model. More importantly, the forecasts also improve for the noncyclical exchange rate data. This is because the exchange rate trend is relatively stable, and the temporal information of the exchange rate is well captured by the model's deep information extraction. This demonstrates that the framework can capture deep spatiotemporal information well, even for data with weak periodicity.

To illustrate the effectiveness of MDST-GNN in modeling spatiotemporal features in time series data, Figures 10–12 show the performance of MDST-GNN on a specific time series (one of the output variables). Our model obtains results that are consistent with the ground truth in the periodic patterns of the data. The comparison between predictions and ground truth on the Electricity test set is shown in Figure 10, on the Traffic test set in Figure 11, and on the Solar Energy test set in Figure 12. Our model achieves excellent performance on these prediction tasks, with only a small error between the predicted results and the ground truth.

**Figure 10.** Electricity (Horizon = 24) forecast results.

**Figure 11.** Traffic (Horizon = 24) forecast results.

**Figure 12.** Solar Energy (Horizon = 24) forecast results.

However, the model performs slightly worse on the Solar Energy prediction task. As can be seen from Figure 12, the periodic pattern is captured well, but there is an error between the results and the ground truth within each period. This is because fluctuations in the data at certain moments disturb the periodic information: Figure 6 shows that the data remain periodic but fluctuate strongly during certain intervals. At the same time, the power generation of solar power plants is also affected by many other factors, such as local weather and temperature.

Overall, the model dominates on most tasks, achieving excellent performance on the exchange rate, traffic, and electricity datasets.

#### *4.6. Ablation Study*

A model combining an attention mechanism and time series segmentation is proposed in this paper. In this section, ablation experiments are performed on the model to verify the effectiveness of these methods. The models containing different modules are named as follows:


The effectiveness of the modules is verified by removing the attention method and the down-sampling convolution module in turn. The results of the ablation experiments on the test sets are shown in Table 4, which verifies the effectiveness of the modules on the four datasets; the best results are shown in bold black. The full model outperforms the ablated models. In Table 4, the attention method has an advantage on Electricity and Traffic, while the down-sampling convolution method is more effective on Exchange Rate and Solar Energy, which confirms the benefit of integrating both parts into the model. The attention method filters the data so that more of the important information is retained, enhancing robustness, while the down-sampling convolution captures the deep spatiotemporal information of the sequence and enhances the ability to predict cyclical trends. Overall, the accuracy of the model's predictions is improved. To illustrate the effectiveness of the modules, we plot a one-dimensional time series (one variable) of the Electricity test set in Figure 13, where the green curve is the real data and the yellow dashed line is the predicted value. The full model captures the spikes in the data better, whereas the w/o S model does not capture this change.


**Table 4.** Ablation study (Horizon = 24).

Note: The optimal results are in bold black.

**Figure 13.** Time series (one variable) predicted and true values for the Electricity test set (Horizon = 24). Left (**a**) is the w/o S model result, right (**b**) is the result of the full model.

In summary, time series forecasting involves temporal, spatial, and periodic information. Introducing the modules proposed in this paper incurs additional computational cost, but taking this information into account is very helpful, as the improvements demonstrated by the ablation experiments show.

#### **5. Conclusions**

Multivariate time series contain complex spatiotemporal information, there are dependencies among the individual series, and the series exhibit periodic patterns. According to these characteristics, a multivariate time series deep spatiotemporal forecasting model with a graph neural network (MDST-GNN) is proposed in this paper. It comprises four modules: graph learning, temporal convolution, graph convolution, and down-sampling convolution. To handle complex information better, the model adopts a graph neural network to transform it into processable graph information. The model uses a temporal convolutional network to extract the temporal information of the predicted sequences and graph convolutional networks to extract their spatiotemporal information, with an attention method added to improve information filtering, and down-sampling convolutional networks are used to extract deep spatiotemporal information. The MDST-GNN model can filter out irrelevant dependency information between graph nodes, which makes it better suited to multivariate time series forecasting, and capturing deep spatiotemporal information enhances the prediction accuracy at data peaks. In the experiments, the proposed model predicts future values at different intervals on four public datasets; the results verify the advantages of MDST-GNN, and the attention method and down-sampling convolutional network proposed in this paper improve the model's prediction accuracy.

The MDST-GNN model achieves significant improvements on multiple datasets, but it still has difficulty predicting data with large short-term fluctuations, such as Solar Energy. This is mainly caused by the insensitivity of convolutional networks to local fluctuations and the lack of long-term memory information. At the same time, the model has many parameters, and how to reduce the number of parameters while maintaining good prediction performance is a problem worth pondering. Our next research will focus on making the model lightweight and addressing its insensitivity to local fluctuations.

**Author Contributions:** Conceptualization, Z.H., C.Z. and Y.H.; data curation, Z.H.; methodology, Z.H. and C.Z.; software, Z.H.; supervision, C.Z. and Y.H.; validation, Z.H.; writing—original draft, Z.H. and C.Z.; writing—review and editing, Z.H., C.Z. and Y.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (Nos. 61862062, 61104035).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are openly available in Github (https: //github.com/laiguokun/multivariate-time-series-data, accessed on 2 February 2022).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Amira Abdelwahab 1,2,\* and Nesma Youssef 3**


**Abstract:** Data mining techniques are useful for discovering hidden knowledge in large databases. One of the common techniques is sequential rule mining. Sequential rule (SR) mining finds all sequential rules that satisfy support and confidence thresholds to assist in prediction; it is an alternative to sequential pattern mining in that it takes the probability of the following patterns into account. In this paper, we address the preferable utilization of sequential rule mining algorithms by applying them to databases with different features to improve efficiency in different fields of application. The three compared algorithms are TRuleGrowth, an extension of the RuleGrowth sequential rule algorithm; the top-k non-redundant sequential rules algorithm (TNS); and the non-redundant dynamic bit vector algorithm (NRD-DBV). The analysis compares the three algorithms regarding run time, the number of produced rules, and memory usage to determine which is best suited for prediction, and it explores the most suitable applications for each algorithm. The experimental results show that the performance of the algorithms is related to the dataset characteristics. It is demonstrated that altering the window size constraint, determining the number of created rules, or changing the minSup threshold can reduce execution time and control the number of valid rules generated.

**Keywords:** sequential rule mining; non-redundant sequential rules; TRuleGrowth; top-k non-redundant rules; closed sequential patterns

#### **1. Introduction**

Extracting useful information from the temporal relations in large sequence datasets is a fundamental problem; it helps users acquire the knowledge needed to make predictions, and many methods have emerged for finding such temporal relations. One of the most popular is sequential pattern mining, which discovers frequently occurring patterns in sequence datasets [1,2]. Sequential pattern mining relies on a single measure, the support measure, i.e., the number of occurrences of items in a dataset, which can be misleading and inadequate for making predictions. Sequential rule mining is an extension of sequential pattern mining that addresses this problem by considering additional measures [3,4].

Sequential rule mining takes another measure besides support into consideration, the confidence measure, which captures the probability that the following pattern occurs. There are many challenges in mining sequential rules, such as similar rules being classified differently; additionally, some discovered rules lose their importance when they appear separately and therefore lose their usefulness for prediction. All of these issues lead to the generation of a large number of redundant rules, which affects the efficiency of the mining process. Numerous researchers have proposed improved sequential rule mining (SRM) methods to decrease redundant rules and

improve the efficiency of these algorithms. SRM is divided into two categories: partially ordered and standard sequential rules. The proposed enhanced algorithms concentrate on two directions:

The first type, standard sequential rules, is the common type for mining sequential rules. It improves efficiency through the mining process, which consists of two procedures: the first mines sequential patterns that appear frequently, and the second produces sequential rules based on the first. Consequently, many researchers focus on the first procedure, improving performance by discarding sequences that do not affect the final result. Mining closed sequential patterns is an example of this idea, as it produces rules from more compact data. In this study, we represent this type with the non-redundant dynamic bit-vector algorithm.

The second type, partially ordered rules, is the newer type for mining sequential rules. It improves efficiency by extending sequential rule mining algorithms with constraints or by finding a specified number of the most frequent rules in a database. In partially ordered mining, there is no ordering between the items on the antecedent and consequent sides, and the pattern-growth approach is applied to discover all rules incrementally. Improved algorithms accept extra constraints to enhance overall performance: the TRuleGrowth algorithm adds a window size parameter, which reduces the number of rules produced, decreases the runtime, and lowers the disk space required to store the generated rules, so that users can analyze the rules more easily. Another approach to producing non-redundant rules is to find the k most frequent rules (the top-k) that satisfy the minConf threshold; in that case there are only two parameters, k and minConf. Like the TNS algorithm, this approach does not take minSup into consideration, because of the difficulty of determining a valid value that suits each dataset's features.

In this paper, we present a broad study of the two types of SRM: partially ordered and standard sequential rules. We study three algorithms, TRuleGrowth, TNS, and the non-redundant dynamic bit-vector algorithm. All of these algorithms generate non-redundant rules, and we compare the results of their implementations to determine the most suitable application areas for each of them. We assess the results according to three criteria: runtime, number of produced rules, and memory utilization.

#### **2. Literature Review**

Much research has been devoted to enhancing the mining of sequential patterns. The primary obstacle in mining sequential patterns is the production of irrelevant patterns when the support measure is set to a very low value. Mining sequential rules is an alternative to sequential patterns that helps users gain knowledge about sequence items for making predictions. It has been applied in numerous areas such as electronic learning [5], manufacturing simulation [6], the analysis of customer behavior [7], and decision systems [8].

The first algorithm for mining sequential rules, proposed by Mannila and Verkamo, studies sequence behavior for prediction and finds all items that occur frequently in a sequence dataset [9]. Subsequent research focused on producing frequently occurring sequential patterns and removing repetitive rules in the following stage of the mining process; these approaches had to scan the database numerous times to find the support of each itemset, which induces additional complexity and cost, as in the RuleGen algorithm [10]. Later, researchers proposed finding rules without considering the ordering of their items, referred to as partially ordered sequential rules (POSR), in which the items on the antecedent and consequent sides are unordered. The two primary algorithms for POSR are CMRules and CMDeo [11,12]. CMRules is the baseline algorithm that removes the temporal information and generates rules that satisfy the support threshold; it depends on the number of produced sequential rules, which causes inefficient performance [13]. The second baseline algorithm, CMDeo, is more efficient than CMRules: it discovers all valid rules of size 1 ∗ 1 and then expands both sides of the rules. To address these obstacles, RuleGrowth was proposed, which can expand rules while generating only the rules needed from the sequence dataset [14–16].

Various algorithms have been developed based on the prefix tree to achieve superior efficiency. The CNR, IMSR, and MNSR algorithms improve the rule set by eliminating redundant rules [17–19]. They sort the sequences in ascending order of support value before producing rules, which reduces the number of scans and brings the complexity down to O(n²).

TRuleGrowth is an extension of RuleGrowth that controls the maximum number of consecutive items by applying an additional constraint called window size. This constraint produces far fewer rules, which enhances performance by reducing the space needed to store the generated rules [20–23].

Top-k sequential rule mining was proposed to overcome the difficulty of determining a valid minSup value that suits each database's features [24,25]. It depends on two parameters: k, the number of rules that users want to generate, and minConf, which is easy to determine based on user satisfaction. TopSeqRules is the first algorithm to address top-k sequential rule mining; it uses the same strategy as the RuleGrowth algorithm and integrates the two procedures for expanding the left and right sides of a rule with the general procedure for mining top-k patterns. Researchers have since refined these algorithms to reduce disk space and generate more interesting sequential rules; the top-k non-redundant sequential rule algorithm (TNS) uses TopSeqRules for mining the top-k rules while eliminating redundancy in the generated rules [26,27].

An efficient algorithm, non-redundant mining with dynamic bit vectors (NRD-DBV), has also been suggested. It removes unnecessary rules early by using dynamic bit vectors together with pruning techniques, improving performance by reducing execution time and memory utilization [28–30].

#### **3. Methods of Applying Comparative Algorithms**

There are two categories for mining sequential rules. The first category, standard sequential rules, discovers relations between two patterns that occur sequentially. It depends on two user-selected measures, support and confidence, and is used in many algorithms such as RuleGen, IMSR (Improved Mining Sequential Rules), and NRD-DBV (Non-Redundant Dynamic Bit Vector). It produces results only in integer format and has proved its efficiency for making important choices or predictions. The second category, POSR, is the newer type of SRM; it discovers relations between two unordered itemsets and does not consider the ordering between the antecedent and consequent of a rule. POSR is based on the pattern-growth approach, which starts with two items and expands the rules one element at a time recursively. The mining process can be extended with extra constraints to improve the performance of the general rules, as in the TRuleGrowth algorithm; this controls the number of generated rules, saving space and making the rules more specific. POSR can also satisfy user requirements by producing a predefined number of rules, as in the top-k algorithms. Each type has its benefits for generating non-redundant rules. In this paper, we present the most recent popular algorithms for discovering non-redundant rules and apply them to datasets with different features to determine where each is most efficient; a worked example of the two measures is given after this paragraph.
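To make the two measures concrete, the sketch below computes the support and confidence of a partially ordered sequential rule X→Y over a toy sequence database. It is an illustrative simplification (not taken from any of the compared implementations), in which the rule counts for a sequence when all items of Y appear after the earliest point at which all items of X have appeared.

```python
from typing import List, Set

Sequence = List[Set[str]]  # a sequence is a list of itemsets

def rule_support_confidence(db: List[Sequence], x: Set[str], y: Set[str]):
    count_x, count_rule = 0, 0
    for seq in db:
        # Find the position just after the last itemset needed to cover X.
        pos, covered = 0, set()
        for i, itemset in enumerate(seq):
            covered |= itemset & x
            if covered == x:
                pos = i + 1
                break
        if covered != x:
            continue                      # X does not appear in this sequence
        count_x += 1
        remaining = set().union(*seq[pos:]) if pos < len(seq) else set()
        if y <= remaining:                # all items of Y occur after X
            count_rule += 1
    support = count_rule / len(db)
    confidence = count_rule / count_x if count_x else 0.0
    return support, confidence

db = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a", "b"}, {"c"}],
    [{"b"}, {"a"}, {"c", "d"}],
]
print(rule_support_confidence(db, {"a", "b"}, {"c"}))  # (0.666..., 0.666...)
```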

#### *3.1. The Fundamental Operations of the TRuleGrowth Algorithm*

The TRuleGrowth algorithm is a type of POSR, so it does not need to arrange its items. It is an extension of the RuleGrowth algorithm and has proven its efficiency compared with it. Like RuleGrowth, it is based on the pattern-growth approach: it performs incrementally, beginning with two elements and growing one item at a time by expanding the right and the left sides of the rule. TRuleGrowth utilizes an extra constraint called the window size to control the number of generated rules, which helps in producing more accurate results and in saving space by generating far fewer sequential rules. The window size constraint can become an obstacle when its value increases; in that case, the best solution is to use the RuleGrowth algorithm instead of TRuleGrowth to avoid the heavy computations caused by the large number of emerging rules.

The TRuleGrowth algorithm is constructed in three stages. The first stage transforms the database into a sequential list and sets the support value. The second stage creates a rule of size 1 ∗ 1 and executes two procedures to extend the rule on the right and the left sides. The final stage determines the window size and the confidence value, and then tests the validity of the rules to produce the non-redundant sequential rules, as shown in Figure 1.

**Figure 1.** The RuleGrowth algorithm framework based on [31].

The TRuleGrowth Application Algorithm

This algorithm begins by scanning the dataset once to discover each item on each side. It then identifies every item reaching the support threshold in order to produce all valid rules of size 1 ∗ 1. To compute sides(x→z) and sides(z→x), it scans the dataset to build the first sequence and evaluates each pair of items individually. The support value is then computed as |sides(x→z)|/|S|. If the support of the item pair is not lower than minSup, two procedures are executed to grow each side of the rule [14]. The sliding-window value is enforced for all rules: the positions of all occurrences of each item are stored so that the itemsets containing that item can be located. For example, the occurrences of item c in the sequence ⟨{a,c}, {b,c}, {c}, {a,b,e}⟩ are at positions 1, 2, and 3. A hash table is then used to search for the items found before item z and the items found before item x in a sequence, which can be done within the window size parameter, as shown in Appendix A Algorithm A1 [22].
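The paper does not include source code for this step, so the following is a minimal Python sketch of how the valid rules of size 1 ∗ 1 and their supports could be collected under a window-size constraint. The function name `rules_1x1`, the sequence representation (a list of itemsets), and the exact window semantics are illustrative assumptions, not the authors' implementation.

```
from collections import defaultdict

def rules_1x1(sequences, min_sup, window_size=None):
    # Sketch only: sides[(x, y)] collects the ids of sequences in which some
    # occurrence of x appears in an itemset strictly before an itemset
    # containing y, optionally within `window_size` consecutive itemsets.
    sides = defaultdict(set)
    for sid, seq in enumerate(sequences):
        positions = defaultdict(list)          # item -> itemset positions
        for pos, itemset in enumerate(seq):
            for item in itemset:
                positions[item].append(pos)
        for x, xs in positions.items():
            for y, ys in positions.items():
                if x == y:
                    continue
                if any(py > px and (window_size is None or py - px < window_size)
                       for px in xs for py in ys):
                    sides[(x, y)].add(sid)
    n = len(sequences)
    # support(x -> y) = |sides(x -> y)| / |S|
    return {rule: len(s) / n for rule, s in sides.items() if len(s) / n >= min_sup}

# The occurrences of item 'c' in <{a,c},{b,c},{c},{a,b,e}> are at itemsets 0, 1, 2
# (the text counts them 1-based as 1, 2 and 3).
seqs = [[{"a", "c"}, {"b", "c"}, {"c"}, {"a", "b", "e"}],
        [{"a"}, {"c"}, {"b"}]]
print(rules_1x1(seqs, min_sup=0.5, window_size=3))
```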

The operation of the two procedures is accomplished in the following steps: begin by establishing a hash table and assigning it a null value. Then, check the itemset to discard all items that do not satisfy the predetermined sliding-window value. If the size of 'hash x' equals the size of 'x', then include each item z in y ∩ n in the hash table. If the size of the hash table is smaller than 'x', then include each item with the position of n as DB x ∩ n. Finally, if the size of 'hash y' equals the size of 'y' and the size of 'hash x' equals the size of 'x', add the side to the available side (y ∪ {z}→x) for each item z ∉ y ∪ x whose ID begins with an item of 'x' inside the window size.

An extra parameter called sliding window can be included which has the following characteristics:


3. It is particularly significant in practice for temporal transactions, such as analyzing stock market data.

#### *3.2. The Operation of the Top-k Non-Redundant Sequential Rule (TNS)*

The TNS algorithm adopts TopSeqRules to mine the top-k rules after eliminating redundancy among the sequential rules. It depends on the same search procedure as the RuleGrowth algorithm. TopSeqRules has two parameters: K, the number of rules determined by the user's needs, and the minConf threshold, which also reflects the user's requirements. It also utilizes three main variables: the first is minSup, which is initialized to 0 and rises as rules are generated; the second is L, which maintains the top-k rules by keeping each rule that achieves the threshold value; the third is R, which holds the candidate rules so that the rule with the maximum support can be chosen as the most promising candidate for expansion.

#### The TNS Algorithm Implementation

The TNS algorithm first scans the dataset only once to identify each item, as seen in Figure 2. It takes an integer parameter K and a minConf value, and initializes minSup to zero. Then it generates rules recursively by growing each valid rule of size 1 ∗ 1. After that, it performs two procedures to extend the right and the left sides of the rule to generate the top-k sequential rules. This is followed by removing all redundant rules, which means that if ra is equal to rb with the same support, ra is discarded. Finally, it verifies the result by setting the parameter Δ to a value higher than the number of removed rules, so that the top-k non-redundant rules are generated, as shown in Algorithm A2 [24].
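The internal minSup-raising step (the SAVE procedure of Algorithm A2) can be sketched in Python as follows. This is only an illustrative reading of the pseudocode, with L kept as a plain list of (rule, support) pairs; it is not the authors' code.

```
def save(rule, support, L, k, minsup):
    # Sketch of SAVE(r, L, k, minsup) from Algorithm A2: keep the running
    # top-k rules in L and raise the internal minsup threshold so that
    # weaker candidate rules can be pruned from the search space early.
    L.append((rule, support))
    if len(L) >= k:
        if support > minsup:
            # drop surplus rules whose support equals the old threshold
            while len(L) > k and any(s == minsup for _, s in L):
                idx = next(i for i, (_, s) in enumerate(L) if s == minsup)
                del L[idx]
        minsup = min(s for _, s in L)      # lowest support of the rules kept
    return minsup

# usage sketch:
# minsup, L = 0, []
# for each candidate rule r with sup(r) >= minsup and conf(r) >= minconf:
#     minsup = save(r, sup_r, L, k, minsup)
```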

**Figure 2.** Framework of TNS algorithm.

Applying the TNS algorithm solves the problem of generating either a very large number of sequential rules or too few, which may lose valuable information. It also addresses the challenge of generating redundant sequential rules. It has further advantages, such as:


#### *3.3. The Operation of the Non-Redundant with Dynamic Bit Vector Algorithm*

The main idea of the NRD-DBV algorithm is to mine more compact data without losing any information or distorting the final result. It is based on mining closed patterns using a vertical data format. It performs efficiently on large datasets since it generates a smaller number of rules. Additionally, it utilizes pruning techniques to enhance overall performance.
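As an illustration of the vertical, bit-vector idea behind NRD-DBV, the sketch below stores the set of sequence ids that contain a pattern as bits of a Python integer; support is the number of set bits, and joining or extending patterns becomes a bitwise AND. A real dynamic bit vector also records an offset to skip leading zero bytes and keeps itemset positions per sequence, which this simplified, assumed representation omits.

```
class BitVector:
    # Simplified stand-in for a dynamic bit vector: bit i is set when the
    # pattern occurs in the sequence with id i.
    def __init__(self, sids):
        self.bits = 0
        for sid in sids:
            self.bits |= 1 << sid

    def support(self):
        # support = number of sequences containing the pattern
        return bin(self.bits).count("1")

    def intersect(self, other):
        # Bitwise AND gives the sequence ids shared by two patterns,
        # e.g., when joining the antecedent and consequent of a rule.
        out = BitVector([])
        out.bits = self.bits & other.bits
        return out

# pattern <{a}> occurs in sequences {0, 2, 3}; pattern <{b}> in {2, 3, 5}
a, b = BitVector({0, 2, 3}), BitVector({2, 3, 5})
print(a.support(), b.support(), a.intersect(b).support())   # 3 3 2
```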

The NRD-DBV Application Algorithm

The implementation of the NRD-DBV is performed in the following steps, as shown in Figure 3:


**Figure 3.** The NRD-DBV algorithm framework based on [31].

The NRD-DBV algorithm helps to compact the dataset by eliminating any super-sequence with the same support value as its root pattern. Therefore, it is more proficient, as:


#### **4. Experiments and Results**

Experiments were performed to assess the runtime, memory usage, and number of generated rules for each algorithm. Additionally, the influence of minSup on both the NRD-DBV and TRuleGrowth algorithms was measured, and the TNS algorithm was compared with the other two algorithms after producing the same number of rules for each of them. The three algorithms were run on a PC with an Intel Core i5 2.3 GHz processor and 6.58 GB of free RAM. The algorithms were implemented in Python using the JetBrains PyCharm IDE.
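The paper does not state exactly how runtime and memory were collected; one straightforward way to obtain both metrics for a single mining run in Python is sketched below, using the standard `time` and `tracemalloc` modules. The `mining_fn` argument stands for any of the compared algorithms and is an assumption of this sketch, not the measurement harness actually used.

```
import time
import tracemalloc

def measure(mining_fn, *args, **kwargs):
    # Returns (rules, runtime in seconds, peak memory in MB) for one run.
    tracemalloc.start()
    t0 = time.perf_counter()
    rules = mining_fn(*args, **kwargs)
    runtime = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return rules, runtime, peak / (1024 * 1024)

# usage sketch:
# rules, t, mem = measure(trulegrowth, sequences, minsup, minconf, window_size)
```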

Four real datasets with diverse characteristics, obtained from SPMF, were used to assess the results. The first dataset, BMSwebview1 (Gazelle) [32], consists of 59,601 click-stream sequences from an e-commerce website. It contains 497 distinct items, and its most important distinguishing feature is that its items are rarely repeated. The second is a sign language utterance database consisting of 800 sequences transcribed from video recordings [33]; it includes 267 distinct itemsets. The third database, Korsarak, is regarded as one of the biggest sequential databases: it contains 990,000 click-stream sequences from a Hungarian news portal and includes 41,270 items. Because TRuleGrowth exceeded the overhead limit on this database, we used a subset of the full Korsarak database containing only 25,000 sequences, which includes non-repetition of positions in the news. The fourth database, BMSWebView2 (Gazelle), uses the dataset from the KDD CUP 2000; it includes 77,512 e-commerce click-stream sequences with 3340 distinct items and an average length of 4.62. All experiments were analyzed with regard to runtime, number of generated rules, and memory usage, and the impact of minSup on these criteria was studied.

The utilized datasets have different features that are useful for our comparison. For the first dataset (BMSwebview1), because of the variance of its sequences, we selected a lower minSup value, since items are not repeated frequently in the dataset; no rules were generated during the mining process when minSup was set as in the other experiments. In all experiments, the minConf threshold was set to 0.5. The parameter values were determined after several preliminary experiments to achieve the most preferable results.

#### *4.1. Comparing the TRuleGrowth with the NRD-DBV Algorithm*

For the first dataset (BMSwebview1), it is clear that the runtime increased with decreasing minSup; there is an inverse relationship between minSup and runtime. This relationship becomes clearer with increasing values of the window size and in the NRD-DBV algorithm. When the window size constraint value decreases, we note that less time is taken and fewer rules are generated, because the complex computations over the produced rules are eliminated; the outcome for the generated rules is similar. There is also an inverse relationship between minSup and the number of generated rules. With a high window size constraint value in the TRuleGrowth algorithm, the number of produced rules increases further. As the window size value decreases and the minSup value increases simultaneously, the number of rules generated by NRD-DBV approaches the number generated by the TRuleGrowth algorithm. Figures 4 and 5 show the runtime and the number of rules with minConf = 0.5 and minSup set from 0.09% to 0.06%, and Figure 6 shows the memory usage.

Additionally, it is found that when the window size value is minimized, the time required and the number of sequential rules decrease, because producing fewer sequential rules requires less computation.

On the other hand, for the sign dataset, we were not restricted to a low value of minSup. As before, we observed that the runtime increased with decreasing minSup. The gap between them increased, especially for the TRuleGrowth algorithm when lower values of the window size constraint were assigned. This is a result of generating a large number of rules at smaller minSup values and a higher window size value; the relation between minSup and the number of generated rules is again inverse. Figures 7 and 8 show the runtime and the number of rules with minConf = 0.5 and minSup set from 0.2% to 0.8%, and Figure 9 presents the memory usage.

**Figure 5.** Comparison between the No. of rules generated for the TRuleGrowth and the NRD-DBV for (BMS webview1) dataset.

**Figure 6.** Comparison of memory usage for the TRuleGrowth with the NRD-DBV algorithm for (BMS webview1) dataset.

**Figure 7.** Comparison of the runtime of the TRuleGrowth with the NRD-DBV for (Sign) dataset.

**Figure 8.** Comparison between the No. of rules generated for the TRuleGrowth and the NRD-DBV for (Sign) dataset.

**Figure 9.** Comparison of memory usage for the TRuleGrowth with the NRD-DBV algorithm for (Sign) dataset.

The TRuleGrowth algorithm with a lower window size constraint still required the least time to generate the sequential rules, because the smallest number of rules requires the fewest computations. NRD-DBV takes more time than TRuleGrowth, as it generates more sequential rules because it takes item ordering into account.

For the Korsarak database, the NRD-DBV proved its efficiency in producing the sequential rules: it generated approximately 21 rules in 200 s, whereas the TRuleGrowth stopped generating any sequential rules because it was restrained by the overhead limit. We managed this issue by employing a subset of the Korsarak database with 25,000 sequences.

Using this alternative solution to the problem of TRuleGrowth ceasing to generate rules, the subset of the Korsarak database with only 25,000 sequences was used. The results demonstrated the viability of the TRuleGrowth algorithm, especially when assigning a low value to the minSup threshold: it generates rules faster than the NRD-DBV algorithm. The speed advantage of the TRuleGrowth algorithm is evident as the minSup value decreases, where the NRD-DBV takes three times longer than the TRuleGrowth. The runtime and the number of generated rules are not affected by different values of the window size, as shown in Figures 10 and 11. The NRD-DBV achieved the best utilization of memory, as shown in Figure 12.

**Figure 10.** Comparison of the runtime of the TRuleGrowth with the NRD-DBV for (Korsarak) dataset.

**Figure 11.** Comparison between the No. of rules generated for the TRuleGrowth and the NRD-DBV for (Korsarak) dataset.

The last database, BMSwebview2, has different features from the first database, BMSwebview1. It has 358,278 item instances overall, whereas BMSwebview1 has only 149,639, which forced us to assign lower values to minSup, as mentioned before. The higher the value assigned to the window size parameter, the more time is taken to generate the rules, particularly when a very low minSup value is assigned at the same time. However, the TRuleGrowth algorithm still achieved the best execution time for producing the sequential rules, as shown in Figure 13.

**Figure 13.** Comparison of the runtime of the TRuleGrowth with the NRD-DBV for (BMS webview2) dataset.

Concerning the number of rules, the TRuleGrowth algorithm generated the lowest number of rules with a low window size constraint value. With a larger window size or a lower minSup value, the number of rules generated by TRuleGrowth grew beyond that of the NRD-DBV algorithm, and the computations took longer on the larger number of rules, as shown in Figure 14; the NRD-DBV algorithm still achieved the lowest memory usage, as shown in Figure 15.

**Figure 15.** Memory usage for the BMS webview2 dataset with different minSup values and (minConf = 0.5).

Likewise, when using the TRuleGrowth algorithm with a small window size constraint value or high minSup values, the smallest number of rules was generated. When the minSup value was reduced or the window size constraint value was increased, the number of rules generated by TRuleGrowth increased and the computations took longer.

#### *4.2. Comparing the TNS Algorithm with the Other Two Algorithms*

We compared the TNS algorithm with the other two algorithms by setting the K parameter equal to the number of rules generated from each of the other algorithms, either with different window size constraints or different minSup values.

The TNS algorithm did not generate any rules for the BMSwebview1 dataset. That is because of the nature of this database's features: its items are rarely repeated.

In the sign database, the TRuleGrowth algorithm generated sequential rules in less time than the TNS algorithm, as shown in Figure 16. The TNS algorithm generated sequential rules faster than the NRD-DBV algorithm (see Figure 17), and NRD-DBV achieved good memory-usage performance at the maximum value of minSup, as presented in Figure 18.

**Figure 16.** Comparison of the runtime of the TNS with the TRuleGrowth for (Sign) dataset.

**Figure 17.** Comparison of the runtime of the TNS with the NRD-DBV for (Sign) dataset.

**Figure 18.** Comparison of memory usage for the TNS with the TRuleGrowth algorithm for (Sign) dataset.

As seen in Figures 19 and 20 for the Korsarak dataset, the TNS algorithm generated the sequential rules in almost a quarter of the time required by the other two algorithms.

**Figure 19.** Comparison of the runtime of the TNS with the TRuleGrowth for (Korsarak) dataset.

In the BMSwebview2 dataset, the TNS algorithm generated rules in half the time taken by the NRD-DBV algorithm, and in less than or approximately the same time as the TRuleGrowth algorithm, as shown in Figures 21 and 22.

**Figure 21.** Comparison of the runtime of the TNS with the TRuleGrowth for (BMS webview2) dataset.

**Figure 22.** Comparison of the runtime of the TNS with the NRD-DBV for (BMS webview2) dataset.

Regarding the memory usage of each algorithm, it makes sense that the amount of memory demanded increases as the minSup value is lowered, since the number of sequential rules grows. However, in all experiments, the NRD-DBV method was demonstrated to be more memory-efficient than the TRuleGrowth and TNS algorithms. This is because it benefits from the DBV structure and prunes prefix child nodes to remove unnecessary rules, as shown before in Figures 6, 9, 12, 15 and 18, and in the following Figures 23–27.

**Figure 24.** Comparison of memory usage for the TNS with the TRuleGrowth algorithm for (Korsarak) dataset.

**Figure 25.** Comparison of memory usage for the TNS with the NRD-DBV algorithm for (Korsarak) dataset.

**Figure 26.** Comparison of memory usage for the TNS with the TRuleGrowth algorithm for (BMS webview2) dataset.

Computational complexity was measured for each algorithm; we found that the TRuleGrowth and NRD-DBV algorithms are the less complex ones. The complexity of the TRuleGrowth algorithm is linear in the number of sequential rules in the dataset; either one or two recursive calls are applied to expand the right and left sides.

For the NRD-DBV, producing non-redundant rules has a complexity of O(n·c), where n is the number of nodes and c is the average number of child nodes. We must run (n−1) procedures to validate and produce the sequential rules, with k ≪ n for each sequence; as a result, the complexity of NRD-DBV is approximately O(n). The TNS algorithm is efficient when the parameter k is set to at most 2000 rules; otherwise, it becomes more complex because of the high computational expense of mining the sequential rules.

**Figure 27.** Comparison of memory usage for the TNS with the NRD-DBV algorithm for (BMS webview2) dataset.

#### **5. Conclusions**

This paper presents a meaningful comparison of three algorithms designed for the task of sequential rule mining across several sequence databases. Each algorithm was analyzed using four real datasets with different features. For example, one dataset had long sequential patterns with low item diversity, which forced us to use a low minSup value to generate the sequential rules; another had short sequential patterns with a high diversity of items; and another was a huge dataset that caused the overhead limit to be exceeded, which made us resort to using a subset of the original dataset.

Our experiments indicated that the performance of the algorithms is associated with the features of the datasets. Moreover, the experiments show that the execution time can be reduced and the number of valid generated rules controlled through several kinds of restrictions, such as adjusting the window size constraint, specifying the number of generated rules, or changing the value of the minSup threshold.

With a suitable value of the window size constraint, the TRuleGrowth algorithm can run faster, and the correctness of the discovered sequential rules, which are not constrained by item arrangement, can be increased.

The NRD-DBV algorithm can be used to reduce the number of sequential rules and the memory requirements. It relies on a DBV structure with a prefix tree, which leads to early pruning of child nodes and reduces the search space. In addition, the NRD-DBV method generates more rules than the TRuleGrowth algorithm because it takes item arrangement into account.

The researchers conclude that each algorithm has its own use in the application fields of sequential rule mining to achieve the highest possible efficiency. The NRD-DBV algorithm has many applications in error detection, intervention, and bug finding. It is useful in domains where the arrangement of items matters, such as the medical area (for example, if a patient suffers from a fever, followed by a decrease in the coagulation level, followed by the appearance of red marks on the body, it is reasonable to expect that the patient will need to be treated for dengue fever; this order of events is important for predicting an appropriate type of treatment). Additionally, it can be used in marketing to design highly personalized strategies, and in software engineering, where ordering is mostly important for accomplishing its tasks.

The TRuleGrowth algorithm implementation allows optional parameters to be set, such as the maximum number of items that appear in the antecedent and consequent of a rule. It can be useful for making product recommendations and for fast decision making.

For future work, we intend to improve the NRD-DBV algorithm by using another concise representation, such as maximal or generator patterns, to improve performance in the large sequences database.

**Author Contributions:** Conceptualization, A.A. and N.Y.; methodology, A.A.; software, N.Y.; validation, A.A. and N.Y.; formal analysis, N.Y.; investigation, A.A.; resources, A.A.; data curation, N.Y.; writing—original draft preparation, N.Y.; writing—review and editing, A.A.; visualization, N.Y.; supervision, A.A.; project administration, A.A.; funding acquisition, A.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported through the Annual Funding track by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Project No. AN000519].

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** The dataset is available on https://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php, accessed on 20 April 2021.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Algorithm A1:** The TRuleGrowth algorithm

```
Input
Sequence database
Minimum support
Minimum confidence
Output
Set of sequential rules
The executed time and memory usage of the generated rules
Algorithm
Begin
Scan the dataset
FOR (each item m in DS) do {
	Store each side that contains item m
	IF sup of each pair of items x,y ≥ minSup {
		Set sides(x→y) equal to zero }}
	Calculate sides for generating rule (x→y)
	ELSE IF (sup of rule (x→y) ≥ minSup) THEN
		IF (size of left side of rule ≤ maxleft && size of right side of rule ≤ maxright) THEN
		ELSE IF (rule ≤ window size constraint) THEN
			Expand left & Expand right
		END IF
	IF confidence of rule (x→y) ≥ minConf THEN
		Generate rule {x}→{y} with its confidence and support
	END IF
END FOR
End
```

```
Algorithm A2: TNS algorithm
Input
Sequence database
K= specified numbers of rules
Minimum support
Minimum confidence
Output
Set of sequential rules
The executed time and memory usage of the generated rules
Algorithm
Begin
R = ø. L= ø. minSup = 0.
Scan DB once & Record each item in variable(s)
FOR each pair of items i, j such that |sids(i)| ≥ minsup and |sids(j)| ≥ minsup:
sids(i⇒j) := Ø. sids(j⇒i) := Ø.
FOR each sid s ∈ sids(i) ∩ sids(j):
IF i occurs before j in s THEN sids(i⇒j) := sids(i⇒j) ∪ {s}.
IF j occurs before i in s THEN sids(j⇒i) := sids(j⇒i) ∪ {s}.
END FOR
IF |sids(i⇒j)| / |S| ≥ minsup THEN conf({i}⇒{j}) := |sids(i⇒j)| / |sids(i)|.
IF conf({i}⇒{j}) ≥ minconf THEN SAVE({i}⇒{j}, L, k, minsup).
Set flag expandLR of {i}⇒{j}to true.
R := R∪{{i}⇒{j}}.
END IF
. . . [lines 11 to 17 are repeated here with i and j swapped] . . .
END FOR
WHILE ∃r ∈ R AND sup(r) ≥ minsup DO
Select the rule rule having the highest support in R
IF rule.expandLR = true THEN
EXPAND-L(rule, L, R, k, minsup, minconf).
EXPAND-R(rule, L, R, k, minsup, minconf).
ELSE EXPAND-R(rule, L, R, k, minsup, minconf).
REMOVE rule from R. REMOVE from R all rules r ∈ R | sup(r) <minsup.
END WHILE
SAVE(r, R, k, minsup)
L := L∪{r}.
IF |L| ≥ k THEN
IF sup(r) > minsup THEN
WHILE |L| > k AND ∃s ∈ L | sup(s) = minsup
REMOVE s from L.
END IF
Set minsup to the lowest support of rules in L.
END IF
FOR (NRD = 1, NRD < Δ, NRD + +) Do
The result is exact (generated TNS)
ELSE Return with higher Δ value
END FOR
End
```
#### **Algorithm A3:** NRD-DBV algorithm

#### **Input**


**Algorithm A4:** Closed Pattern-Extension method

```
Input
Frequent sequential patterns
Minimum support
Minimum confidence
Output
Set of sequential rules
The executed time and memory usage of the generated rules
Algorithm
Begin
Set listNode → child nodes of root
FOR (each prefix sequence in listNode) do
	IF sequential patterns not pruned, THEN
		FOR (each prefix sequential pattern in listNode) do
			IF (sup(prefixSP → Sequence-extension) ≥ minSup) THEN
				Add prefixSP as a new itemset after the last itemset of the sequence
			ELSE IF (sup(prefixSP → Itemset-extension) ≥ minSup) THEN
				Add prefixSP as a new item in the last itemset of the sequence
		END FOR
		Call Closed Pattern-Extension(prefixSP, minSup);
	END IF
	Check & set the attribute of SP: closed pattern, prefixed generator, or NULL;
END FOR
End
```

#### **References**


### *Article* **User Trust Inference in Online Social Networks: A Message Passing Perspective**

**Yu Liu \* and Bai Wang \***

Beijing Key Laboratory of Intelligence Telecommunication Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China

**\*** Correspondence: imyuliu@outlook.com (Y.L.); wangbai@bupt.edu.cn (B.W.)

**Abstract:** Online social networks are vital environments for information sharing and user interactivity. To help users of online social services to build, expand, and maintain their friend networks or webs of trust, trust management systems have been deployed and trust inference (or more generally, friend recommendation) techniques have been studied in many online social networks. However, there are some challenging issues obstructing the real-world trust inference tasks. Using only explicit yet sparse trust relationships to predict user trust is inefficient in large online social networks. In the age of privacy-respecting Internet, certain types of user data may be unavailable, and thus existing models for trust inference may be less accurate or even defunct. Although some less interpretable models may achieve better performance in trust prediction, the interpretability of the models may prevent them from being adopted or improved for making relevant informed decisions. To tackle these problems, we propose a probabilistic graphical model for trust inference in online social networks in this paper. The proposed model is built upon the skeleton of explicit trust relationships (the web of trust) and embeds various types of available user data as comprehensively-designed trust-aware features. A message passing algorithm, loop belief propagation, is applied to the model inference, which greatly improves the interpretability of the proposed model. The performance of the proposed model is demonstrated by experiments on a real-world online social network dataset. Experimental results show the proposed model achieves acceptable accuracy with both fully and partially available data. Comparison experiments were conducted, and the results show the proposed model's promise for trust inference in some circumstances.

**Keywords:** trust inference; trust propagation; online social network; social network analysis; probabilistic graphical model; message passing; belief propagation; model interpretability

#### **1. Introduction**

Trust exists in many different forms in various disciplines. For instance, the trust on the World Wide Web can be trust in content, trust in services, and trust in people [1]. Although different disciplines take different definitions and forms of trust, they commonly aim to solve the problem of accurately evaluating trust between two entities, to help complex systems make informed decisions. For example, some peer-to-peer system may take advantage of trust to curb malicious attacks and maintain network robustness [2]. A complex model selection system could evaluate the trustworthiness of a cloud-based machine learning model-as-a-service (MaaS) for industrial Internet of Things and smart city services [3]. Some online social service could use trust to improve the quality of recommendations [4]. Some researchers leveraged a trust network to study the relational antecedents of members' influence in organizations [5].

Without loss of generality, we focus on user trust in online social networks (OSNs) in this paper, which is one of the most common types of trust. We followed [6] to define user trust in OSNs: a subjective expectation an OSN user has about another user's future behavior. Social trust is a basic social construct [7], and it enables people to collectively live and work in groups. In social networks, trust helps people find whom to trust, to build beneficial or even reciprocal social relationships, so that the quality of interindividual interactions could be improved and the risks of social activities reduced.

Many online social services embrace trust management systems to help their users to build and expand their webs of trust so that users can keep being engaged in their services [8]. For example, Twitter has deployed a user recommendation service called "Who to Follow" to use a user's "circle of trust" for recommending new connections to the user [9]. Being a key contributing factor in many complex systems, trust has been elaborately explored and researched, and it has been proven to be helpful in securing social commerce [10], recommending trustworthy users [11], providing accurate and personalized recommendations [4,12–14], filtering trustworthy authorities or users [15], finding opinion leaders or trolls [16], maximizing influence diffusion [17], and decision making [18–20]. However, data of trust information, such as trust relationships, do not always explicitly or abundantly exist for the above-mentioned tasks to use. Therefore, the study of trust inference is necessary and practical for social network analysis and relevant decision making tasks.

The process of inferring an unknown trust relationship in OSNs, which is often referred to as trust inference, involves exploitation of social construct elements. The sources of relevant trust information that assembles social constructs in OSNs include but are not limited to existing trust relationships and data created by the users involved in the relationships, such as their activities, including posting reviews and casting votes, and other content generated by them (user-generated contents, UGC).

Various algorithms and models have been proposed to infer trust in OSNs [21]. Some of them make use of topological data, i.e., the web of trust relationships or the trust network, to predict trust relationships. However, due to the ever-present issue that observable trust relationships in OSNs are often sparse, some vanilla algorithms for predicting trust are prevented from achieving more accurate predicted results in large real OSNs. With the aim of tackling the issue, other methods use both the web of trust and UGC data to achieve more accurate results in inferring trust.

Nevertheless, there are three major problems holding back existing trust inference models. Firstly, if additional UGC data are available, some of those data and various types of interplay among them are discarded when the trust inference framework integrates them with the trust network, making the model prone to generating less accurate results. Secondly, fewer types of data from OSN users are permitted to be used, given that the privacy and data protection of online users are much more respected nowadays. Online service users may opt out of using some or all of their data for certain data analysis tasks conducted by online services, especially with regulations and laws being implemented and enforced, such as the General Data Protection Regulation (EU) (GDPR) [22], the California Consumer Privacy Act (CCPA) [23], and the Personal Information Protection Law of the People's Republic of China (PIPL) [24]. Thus, some crucial data may be unobtainable for trust inference. Last but not least, most trust inference models are poorly interpretable. The interpretability of a trust inference model should be improved alongside achieving better performance, not only for inferring trust itself but for making other relevant informed decisions.

Bearing the aforementioned problems in mind, we propose creating a probabilistic graphical model in which various types of UGC data are built and then integrated as features in such a way that not only are most of their characteristics preserved, but that the interaction among features can also be captured and embedded. The contributions of this paper are as follows.

• The proposed model takes advantage of the integration of the trust network and user-generated contents in the network; the latter is embedded into a probabilistic graphical model built upon the former. The model permits the directionality of trust relationships and preserves various facets and properties of trust. The way of both building features from UGC data and embedding them into the probabilistic graph preserves as much information as the data may contain.


Although this paper only focuses on inferring user trust in online social networks, the proposed model may also be adopted to fulfill other inference tasks to better assist decision making in other complex systems. The least work required for other similar tasks is to properly define a concrete graphical model and a set of reasonably-built features, if additional data exist.

The rest of the paper is organized as follows. In Section 2, we briefly review the literature of related works on trust inference. In Section 3, we elaborate the prerequisites, define the problem, and then propose the model. In Section 4, we present experiments conducted with a dataset derived from a real-world online social network, and then analyze the performance of the proposed model. Finally, we conclude our work in Section 5.

#### **2. Related Work**

The problem of trust inference or trust prediction between two users in online social networks or two nodes in general networks bears some similarity with the link prediction problem, and thus it can be modeled as a link prediction task. However, using a link prediction method to predict trust links requires the method to work with directional links or even signed links, which complicates the link prediction method itself.

For example, Schall [25] leverages micro-structural patterns and the resulting node similarity to retrieve the probability of a missing directional link existing between two users in a social network. The method's drawback is that the micro-structural pattern is limited to a triad, and consequently it may fail in finding links between any two arbitrary nodes that do not have a common neighbor.

The work by Barbieri et al. [26] also employed link prediction techniques to infer friend or trust relationships. In this work, they proposed a generative model, one of the stochastic topic models, to generate social links for users with the consideration of user's interests in "topical" and "social" resources, e.g., whether a targeted user is an authority on a topic or he/she is recognized by an acquaintance in the real world.

Trust inference models based on the trust propagation theory are often called "walkbased" methods. In [27], Mao et al. developed a trust inference framework which obtains the trust rate between any pair of users by aggregating a set of strong trust paths generated with the knowledge of their weighted similarity about commonly interesting topics and their trust propagation ability in the social network. If the trust rate is above a user-defined threshold, the framework determines that there should be a trust link between the users.

The work by Oh et al. [28] included a unified model combining both explicit and implicit trust, and infers trust links by using different trust propagation strategies. The three primary trust propagation strategies include direct propagation, transposed trust, and global trust propagation. Other complex strategies, such as co-citations and trust coupling, are combinations of the primary propagation strategies. Other walk-based methods include ModelTrust [29], TidalTrust [30], AssessTrust [31], OpinionWalk [32], etc.

Trust prediction can also be achieved and improved by using collaborative filtering techniques, particularly the matrix factorization (MF)-based methods. It is also quite convenient for MF-based methods to integrate other types of data that carry trust information. hTrust [33] incorporates low-rank matrix factorization and homophily regularization to infer trust links. The homophily regularization controls how user rating similarity affects predicting user trust relationships. MATRI [34] extracts user trust tendency using matrix factorization from the user trust rating matrix and incorporates trust propagation with trust tendency into a collective matrix factorization framework to infer trust. Another MF-based trust inference model [35] takes advantage of not only matrix-factorized trust tendency and propagated trust, but also similarities of users' rating habits, and achieves good performance.

Neural networks are also employed for trust evaluation. NeuralWalk [36] employs a neural network named WalkNet to model single-hop trust propagation, and then iteratively assesses unknown multi-hop trust relationships. Although NeuralWalk can achieve good accuracy in trust prediction, it is inefficient due to the massive matrix operations involved in training and test set selection. Besides, the interpretability for such methods based on neural networks is an obvious drawback.

#### **3. The Proposed Model**

In this section, we propose our model for user trust inference in OSNs. Firstly, we state the relevant assumptions and prerequisites on which the proposed model is based; secondly, we describe our modeling approach in detail; and finally, we briefly discuss the implementation of the model.

#### *3.1. Prerequisites*

A variety of studies and literature [6,21,37,38] have found that trust has many unique features and characteristics, and they are embedded in user profiles and other UGC data in OSNs. Based on relevant findings of trust and OSNs, we give some key principles and assumptions on which our proposed model stands in this section, without formal proofs.

**Assumption 1.** *Through the activities of users in online social networks, user-generated content, such as reviews, posts, votes, issued trust, and so on, represents and bears the user's attitude, credibility, and trustworthiness.*

A survey [37] suggests that numerous factors, including logical trust attributes (e.g., experience, frequency, stability, and rationality), emotional trust attributes (e.g., hope, fear, and regret) and relational trust attributes (e.g., similarity), contribute to construct individual trust through social capital and social activities. Meanwhile, in online social networks, the above factors are expressed through various types of user-generated contents and user activities. Therefore, trust can be harvested through a variety of UGC data that exists in OSNs, such as their posted reviews or articles, cast ratings and votes, and issued trust relationships.

#### **Assumption 2.** *For a trust relationship between a trustor and a trustee, not only the trustee but also the trustor contributes to the relationship formation.*

The formation of a trust relationship involves the trustor who evaluates and issues trust, and the trustee who presents and receives it, which makes trust asymmetric and subjective. The issuance of a trust relationship is determined by the trustor through their perception of the trustee's trustworthiness and how others perceive it. However, due to the difficulty of modeling trust's subjectivity, many studies only focus on making use of UGC data that represent trustees' credibility [39–41], leaving trust an objective concept and measure.

Based on this point, we bear in mind in this paper that although the same types of user data are used for collecting trust information, they can serve different purposes for the trustor and the trustee involved in a trust relationship. For example, the same characteristics extracted from one of a user's reviews may serve as different features: attitude features that suggest how the user trusts others, or trustworthiness features that indicate how the user is trusted by others.

#### **Assumption 3.** *Users bearing similarities in their profiles or activities have a higher likelihood to establish trust relationships.*

That is suggested by the homophily effect, which is one of the most important theories that attempt to explain why people establish trust relations with each other [42]. For example, in the situation of product reviewing, people with similar tastes about items are more likely to trust each other.

Taking the previous two assumptions into consideration, we could further state that a group of similar users may build trust relationships with similar users from another group, if there exists a trust relationship between a pair of users from each group. This also implies trust's incomplete transitivity property.

**Assumption 4.** *In datasets from online social networks, the absence of an observable or explicit trust relationship between two users does not guarantee that there will not be any trust between them.*

For example, we may infer a trust relationship issued by Alice to Bob, provided that (1) there are many other users who have certain similar characteristics as Alice has, and they also trust Bob, or (2) Alice trusts many other users who have traits in common with Bob. However, it is worth noting that there are quite a few reasons that the relationship of Alice and Bob is not present in the network. It can be explained from three perspectives:


This uniqueness of trust in OSNs makes trust inference in OSNs quite different from general link prediction problems and general binary classification tasks. Since we only focus on inferring trust relationships in online social networks without any distrust information, as a binary classification task, the goal of our proposed model is to find as many trust relations as possible, and in the meantime, maintain considerable overall accuracy. Further discussion on this assumption is beyond the scope of this paper, and it will be left for future work.

**Assumption 5.** *Trust is propagative but not fully transitive, so it will be beneficial for us to learn under what criteria trust can be transferable from one user to another.*

For example, if Alice trusts Bob and Bob trusts Chris, Alice can derive some amount of trust in Chris based on how much she trusts Bob and how much Bob trusts Chris, but Alice may not trust Chris, or she may not even know Chris. However, under some particular circumstances, which are what we need to learn, Alice will trust Chris. Based on this assumption, we propose two primary trust propagation strategies below so that our model can learn from the more complex trust relationship topology that commonly exists in real-world datasets:


The proposed trust propagation strategies are demonstrated in Figure 1. Some more complex propagation schemes can be derived from the above two primary ones. For example, if Alice trusts Bob and Chris, and Daniel also trusts Bob and Chris, there might be an increase in the trust between Alice and Daniel, which is a cascaded result of two transposed trust propagation instances. In the field of citation network analysis, this very same scheme is called *co-citation* and has been greatly studied. It is worth noting that the co-citation propagation conforms to Assumption 3, but, differently from it, we derived the "co-citation"-equivalent trust propagation from the topology perspective. The same above-mentioned trust propagation strategies of direct trust propagation and transposed trust propagation were also leveraged in [28], which helped the researchers build a better trust prediction model.

**Figure 1.** Two primary trust propagation strategies. (**a**) Direct trust propagation. (**b**) Transposed trust propagation.

#### *3.2. Model Construction*

Many modern online social services have deployed a feature which allows their users to build friend networks. It is common understanding that trust has a dedicated role in forming friendships between two individuals, and thus many trust-related studies also use friend networks in OSNs as trust networks. In particular, there are several online services that have explicitly implemented trust networks as web of trust—such online services include Epinion (http://www.epinions.com/ (accessed on 28 February 2018)) and Ciao (http://www.ciao.co.uk/ (accessed on 19 June 2021)).

We modeled our learning and inference tasks with a conditional random field (CRF), one of the variations of probabilistic graphical models. The structure of the probabilistic graphical model was built upon the trust network of an OSN, and the features added to the model were extracted and built from the UGC data of the OSN. Owing to the way the UGC and topological features are embedded into the trust network, the CRF model not only uses both the explicit trust information (the trust network) and the user-generated contents from the social network, but also has the capability of capturing the interplay among the various types of features extracted from the UGC and the network topology.

Differently from the conventional link prediction problem in which edges between nodes represent trust relationships, our method represents relationships as nodes. In other words, we model a real-world directional trust link as a trust relationship node in the model. Therefore, the random variables in the CRF model are the predefined states or labels of users and trust relationships. (In this paper, we use "state" and "label" interchangeably.) In the CRF model, the two types of random variable nodes in the probabilistic graph are:


Figure 2 demonstrates the difference between a trust relationship commonly represented in real social networks and one modeled by our approach.

**Figure 2.** Demonstration of one user trusting another in the real world and in the proposed model. (**a**) Real-world trust relationship representation: userA trusts userB. (**b**) userA trusting userB rendered in the proposed model.

For a user-user pair whose trust relationship is un-observed in the OSN, the objective of the model in this paper is to infer the probabilities of labels of the corresponding *trustRelation* node whose state is unknown in the probabilistic graph, so that the trust relationship can be predicted.

It is worth looking at how we handle edges in the model. Edges in a CRF are usually homogeneous in type; that is, edges in the graph do not have to be type-specific. However, without breaking any conventional rule for model inference, we may mark edges with dedicated types so that different edges can carry different types of meaning and thus serve different purposes. In our model, the type of an edge between a *trustor* node and a *trustRelation* node is different from the type of an edge between a *trustRelation* node and a *trustee* node. It is also worth noting that involving different edge types grants the model the possibility of recognizing directional trust between a pair of users. An example of modeling the mutual relationship between two users is demonstrated in Figure 3.
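A minimal sketch of this construction is given below. It assumes the `networkx` package purely for graph bookkeeping; the node kinds, edge type names, and the function itself are illustrative and are not taken from the authors' implementation.

```
import networkx as nx

def build_trust_graph(observed_links):
    # Every observed directional link (a, b) becomes a trustRelation node
    # connected to its trustor and trustee user nodes by two edges of
    # different types, so the model can tell the two roles apart.
    g = nx.Graph()
    for a, b in observed_links:
        t = ("T", a, b)                                   # trustRelation node
        g.add_node(("U", a), kind="user")
        g.add_node(("U", b), kind="user")
        g.add_node(t, kind="trustRelation", label=1)      # observed trust
        g.add_edge(("U", a), t, etype="trustor-trustRelation")
        g.add_edge(t, ("U", b), etype="trustRelation-trustee")
    return g

g = build_trust_graph([("Alice", "Bob"), ("Bob", "Alice")])
print(g.number_of_nodes(), g.number_of_edges())           # 4 4
```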

**Figure 3.** A demonstration of modeling two users' mutual relationship in the proposed model.

#### 3.2.1. Notation and Problem Definition

Let U*i* denote the user random variable at node *i*, and let T*j* or T*ab* (or T*a*→*b*) denote the trustRelation random variable at node *j*, i.e., the node representing the trust relationship from user U*a* to user U*b*. The notation of nodes and edges is detailed in Table 1.


**Table 1.** Notation of (random variable) nodes and edges in the proposed model.

Now, we examine two users, for example, Alice and Bob, in a trust network and use a binary variable TAlice→Bob to represent the possible trust relationship from Alice to Bob. If a trust relationship from Alice to Bob is present in the network, it means that Alice trusts Bob, and we label the variable with value 1. If no trust relationship from Alice to Bob is observed in the dataset, we label the variable with value 0. As trust is directional and asymmetric, a different variable TBob→Alice also exists, and its value, which is also binary, represents the observational result of the trust relationship from Bob to Alice. Furthermore, we use probabilities to indicate trust relationships. For any two users A and B:

$$\begin{cases} P(\mathbf{T}_{\mathbf{A}\rightarrow\mathbf{B}}=1) > P(\mathbf{T}_{\mathbf{A}\rightarrow\mathbf{B}}=0) & \text{if A trusts B;} \\ P(\mathbf{T}_{\mathbf{A}\rightarrow\mathbf{B}}=1) \leq P(\mathbf{T}_{\mathbf{A}\rightarrow\mathbf{B}}=0) & \text{otherwise, i.e., if A does not express trust in B.} \end{cases} \tag{1}$$

Additionally, particularly for trust relationships being observed in the dataset:


The notation can be easily extended to support distrust, a concept that differs from either trust or no trust being observed in the dataset. However, distrust is beyond the scope of this paper, hence we would like to leave it for future research.

With the notation, the trust inference problem can be stated as follows. Given all user nodes U and a set of observed existing and nonexistent trustRelation nodes (their probability representations are either *P*(T = 1) = 1 or *P*(T = 1) = 0), the method predicts a set of un-observed trust relationships T∗ by comparing their probability representations *P*(T∗ = 1) and *P*(T∗ = 0) that are calculated during the model inference.
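Stated as code, this decision rule is a direct comparison of the two inferred probabilities. The marginal probabilities are assumed to be given here; in the full model they come from the message passing inference, and the dictionary keyed by (trustor, trustee) pairs is an illustrative assumption.

```
def predict_trust(marginals):
    # `marginals` maps an un-observed trustRelation node T* (here keyed by a
    # (trustor, trustee) pair) to its inferred probability P(T* = 1).
    # A trust link is predicted when P(T* = 1) > P(T* = 0) = 1 - P(T* = 1).
    return {pair: p1 > 0.5 for pair, p1 in marginals.items()}

print(predict_trust({("Alice", "Bob"): 0.73, ("Bob", "Alice"): 0.41}))
# {('Alice', 'Bob'): True, ('Bob', 'Alice'): False}
```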

#### 3.2.2. Features

As stated in Assumption 1, trust information can be obtained, and trust can be harvested and evaluated through a user's generated contents in online social networks, including reviews, posts, connections, etc., as they are the representation and bearer of the user's attitude, credibility and trustworthiness, i.e., the trust constructs. Therefore, we extract various types of features from the dataset and embed them to the probabilistic graphical model to aid trust inference.

In our conditional random field, an arbitrary number of features of any arbitrary type can be attached to any node or any edge. All the features used in the proposed model are discrete, and they can be *label-observation features* for nodes, *label-label-observation features* for edges or *label-label features* for edges. In other words, the feature function of each feature is non-zero for a single state per node or a single state per edge (the state of an edge is determined by the state-pair of the two nodes connected by the edge), and the type and value of the feature are derived from observations in the UGC data from the OSN dataset. For example, we observe in the OSN dataset that a user has 57 trustors and the corresponding user node will be associated with a feature, whose type is "nTrustors" and value is 57. Table 2 lists typical sets of features used in this paper and we briefly describe them below.

**Table 2.** Sets of features used in the proposed model.


Statistical features for user profiles (User profile features)

A user's profile is the most direct depiction of the user's social capital; it reflects their identity and status, which in turn reflect their attitude, credibility, and trustworthiness. A set of typical statistical features for user profiles built from UGC data is used in this paper. As [43] suggests, an Internet celebrity, who is in fact an active and vigorous source of disseminated information, usually has a great many followers or trustors. Thus, the *number of trustors* of a user is an obvious indicator of how much the user is trusted by others. In the meantime, the *number of trustees* of a user also shows their engagement and importance in the online social network. The *number of reviews* posted by a user and the *number of ratings* cast by a user are reflections of their experience, frequency, and involvement in the online social network. Some online social services also have a rank system for the helpfulness of user reviews, and the *numbers of a user's reviews rated* as *exceptionally helpful*, *very helpful*, *helpful*, *somewhat helpful*, or *not helpful* are intuitive hints of the user's experience, expertise, and credibility.

In the proposed model, for each user, we build the statistical features from the user's profile data and attach them to the user node; for each trust relationship, we build the statistical features from the two involving users' profiles as edge features and attach them to the corresponding edge between the user node and the trustRelation node. As explained previously, an edge between a user node and a trustRelation node may have two different types. Herein, for either type of such edges, each user profile feature has a designated type, either as a feature denoted by "u2TrU" for edges between a trustor node and a trustRelation node, or as a feature denoted by "Tr2uU" for edges between a trustRelation node and a trustee node. In this way, the model is guaranteed to distinguish the features for a user as a trustor in a trust relationship from those features for the same user acting as a trustee in another trust relationship, which conforms to our Assumption 2.
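A small sketch of this type-prefixing step is shown below; the feature names follow Table 3 only loosely, and the dictionary representation and the sample values other than the 57 trustors mentioned earlier are assumptions made for illustration.

```
def profile_edge_features(user_profile, edge_type):
    # The same profile statistics become different edge features depending on
    # whether the user acts as the trustor ("u2TrU" edges) or as the trustee
    # ("Tr2uU" edges) in the trust relationship.
    return {f"{edge_type}:{name}": value for name, value in user_profile.items()}

alice = {"nTrustors": 57, "nTrustees": 12, "nReviews": 140}   # illustrative values
print(profile_edge_features(alice, "u2TrU"))
print(profile_edge_features(alice, "Tr2uU"))
```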

**Feature vector construction.** Using features listed in Table 3:


**Table 3.** Statistical features for user profiles used in the proposed model.


Linguistic and stylistic features for reviews (UGC features)

As previous studies [6,39,40] suggest, the linguistic characteristics and stylistic features of a review deliver the attitude, emotional status, and part of the expertise of the author; the quality of the review implies whether or not the author is objective and unbiased; and the textual content of the review conveys the author's experience and expertise. According to Assumption 1, features extracted from reviews contribute to each user's attitude and trustworthiness, and thus affect their probability of trusting others and being trusted by others. Furthermore, according to Assumption 3, investigating this type of feature also helps in finding similar users and suggesting trust relations to similar users. The linguistic and stylistic features for posts used in this paper include:


The *Affective words* (http://wndomains.fbk.eu/wnaffect.html (accessed on 1 June 2021)) and *Sentimental words* [45], which express an author's emotions, traits, sensations, attitudes, or behaviors, can also serve as features for reviews. Using them might slightly increase the model's performance; however, for the sake of simplicity, we do not leverage them as features in this work.

All UGC features are attached to edges between user nodes and trustRelation nodes. Similarly as how we did with user profile features, we also mark UGC features with either type "u2TrR" or type "Tr2uR" to respect the two distinct types of edges to which they are attached.

**Feature vector construction.** Using features listed in Table 4:

• For each trustor ↔ trustRelation edge or trustRelation ↔ trustee edge, we create a feature vector *F*u2TrR UGC(E*u*↔*t*) or *F*Tr2uR UGC(E*t*↔*u*), respectively. Each feature of this type is a *label-label-observation feature*.


**Table 4.** Linguistic and stylistic features for reviews used in the proposed model.

Propagative features for trust propagation (TP features)

According to the trust propagation strategies described in Assumption 5, we propose two types of propagative features for trust propagation. Still taking Alice, Bob, and Chris as an example, the proposed features are as follows.


For any arbitrary trustRelation node T1, we observe the state occurrence combinations of all motifs where each motif consists of three trustRelation nodes satisfying the following criteria:

	- **–** either a trustee node (for direct trust for trustRelation node T1) while the trustee node of T2 is the trustor node of T3,
	- **–** or a trustor node (for transposed trust for trustRelation node T1) while the trustee node of T2 is also the trustee node of T3.

In other words, for each motif, the observed structure consists of a trustor node, a trustee node, their corresponding trustRelation node, a third user who shares either the trustor/trustee pair or the trustor/trustor pair with them in their own trust relationships, and the corresponding trustRelation nodes. Figure 4 illustrates the topological diagram for direct trust and transposed trust in the proposed model.

**Figure 4.** Illustration of direct trust and transposed trust for a trustRelation node in a 3-trustRelation-nodes motif. (T2 and T3 contribute direct trust to T1 in one case; T2 and T3 contribute transposed trust to T1 in the other.)

We build a trustRelation node's trust propagation features based on the numbers of state sequence occurrences of the three trustRelation nodes in each motif the node participates in. Using the notation from the above criteria, for a trustRelation node T1 and all its possible 3-trustRelation-nodes motifs that meet the criteria, a total of 16 trust propagation features are built from the eight possible state sequences of the three trustRelation nodes in a motif, taken in the specific order T1, T2, T3. A feature value is 1 if at least one motif with the corresponding state sequence exists, and 0 otherwise. Note that only trustRelation nodes representing trust or no-trust relationships that exist in the observed dataset are accounted for when generating trust propagation feature values. As listed in Table 5, eight of the 16 features of this type are for direct trust propagation and the other eight are for transposed trust propagation.
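As an illustration only (not the authors' code), the sketch below builds the 16 binary features for one trustRelation node. It assumes the direct-trust motif has the shape T1 = (u, v), T2 = (u, w), T3 = (w, v), and the transposed-trust motif the shape T1 = (u, v), T2 = (u, w), T3 = (v, w); since these are label-observation features, the values depend on a candidate state s1 of T1 and on the observed states of T2 and T3.

```python
from itertools import product

def trust_propagation_features(t1, s1, observed):
    """Sketch: 16 binary trust-propagation features for trustRelation node t1 = (u, v)
    under candidate state s1 ('Y' or 'N').  `observed` maps (trustor, trustee) pairs
    that exist in the dataset to their state 'Y' or 'N'; pairs not in `observed`
    have no trustRelation node and are ignored, as in the paper."""
    u, v = t1
    users = {a for pair in observed for a in pair}
    feats = {}
    for kind in ("direct", "transposed"):
        seqs = set()                      # state sequences (s1, s2, s3) that occur
        for w in users - {u, v}:
            t2 = (u, w)
            t3 = (w, v) if kind == "direct" else (v, w)
            if t2 in observed and t3 in observed:
                seqs.add((s1, observed[t2], observed[t3]))
        # One feature per possible state sequence: 2^3 = 8 per propagation kind.
        for seq in product("YN", repeat=3):
            feats[f"{kind}_{''.join(seq)}"] = 1 if seq in seqs else 0
    return feats
```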

**Feature vector construction.** Trust propagation features of each type will be generated as node features and get attached to trustRelation nodes. Using features listed in Table 5:

• For each trustRelation node T*t*, we check whether any instance of the 16 state sequence configurations (eight per propagation type) exists, by applying the above criteria to all 3-trustRelation-nodes motifs in which T*t* acts as T1, and then create a feature vector *F*<sup>T</sup><sub>TP</sub>(T*t*) to include these features. Each of them is a *label-observation feature*.


**Table 5.** Trust propagation features used in the proposed model.

Refer to Figure 4 for T*A*→*B*, T*A*→*C*, T*C*→*B*, and T*B*→*D*. For a trustRelation node: "Y" indicates an existing trust and "N" a nonexistent one.

#### **Auxiliary features**

To better model real-world correlations among users and trust relationships, and for the proposed method to work properly in the model's probabilistic inference, certain auxiliary edge features are built and attached to the model. They are called auxiliary in this paper because the proposed model and the trust inference approach will still work without them, though inefficiently.

We build the auxiliary edge features inspired by the *Ising model* [46] (or, more generally, the *Potts model*) in statistical mechanics. These models imply that two directly connected nodes of the same type tend to be in the same state. This suggests that the state–state pair of two directly connected nodes of different types might also follow certain statistical rules. In this paper, two categories of auxiliary edge features are proposed as follows.

1. One category of auxiliary edge features is attached to each edge between a user node and a trustRelation node. Their label names are prefixed with "u2TrT" for features on a *trustor–trustRelation edge* and with "Tr2uT" for features on a *trustRelation–trustee edge*. This setting matches the construction of our probabilistic graphical model, where edges between user nodes and trustRelation nodes have different types. Such a setting allows the model to distinguish how differently a trustor or a trustee affects a trust relationship's formation.

**Feature vector construction.** For each trustor ↔ trustRelation edge or trustRelation ↔ trustee edge, we create a feature vector *F*<sup>u2TrT</sup><sub>TAUX</sub>(E*u*↔*t*) or *F*<sup>Tr2uT</sup><sub>TAUX</sub>(E*t*↔*u*), respectively. Each feature in this category is a *label-label feature*.

2. The other category of auxiliary edge features is attached to edges between trustRelation nodes that are involved in the motif structure explained previously. Similarly to the trust propagation features, features in this category follow the concept of propagative trust, i.e., direct trust propagation and transposed trust propagation, and each of them is assigned the value 0 for direct trust or 1 for transposed trust. However, unlike the trust propagation features, which are node features, they are edge features that try to "filter out similarly behaving trustRelation nodes".

**Feature vector construction.** For each trustRelation ↔ trustRelation edge, we create a feature vector *F*<sup>T</sup><sub>TPAUX</sub>(E*t*↔*t*). Each feature in this category is a *label-label-observation feature*.

The current implementation of the proposed model does not yet support inference on cliques, which will be discussed below. This hinders the employment of the second category of auxiliary features (*label-label-observation features*) on edges between trustRelation nodes, which presuppose the existence of trustRelation–trustRelation edges. However, we apply the same inference method used for pairwise graph structures approximately to cliques, and we leave inference on cliques as future work.

#### 3.2.3. Model Formulation

The random variables in our model are the states, or labels, of the corresponding nodes. For a user node, its state is a predefined measurement for the user from the original online social network. It can be a direct statistic of this user or a carefully handcrafted measurement calculated from information obtained from their OSN data. For demonstration purposes and brevity, we use the user categories defined by the online social service as user node states in this paper. For a trustRelation node, which represents the trust relationship between the trustor user and the trustee user, the random variable's state is the trust relationship's existence in the OSN dataset. As each node has a state, each edge has a state–state pair (or a transition state) that is determined by the states of the two nodes it connects. Table 6 summarizes the state configurations used in this paper.


**Table 6.** Configurations of node state and edge state–state pair.

Due to the way we model trust relationships as random variable nodes in the probabilistic graph, the smallest significant structure in the resulting graph is pairwise. (Although the smallest cliques consist of three trustRelation nodes if the second category of auxiliary features is leveraged, we still model the conditional random field and run probabilistic inference on it in a pairwise manner.) Different pairwise structures are connected via common user nodes or trustRelation nodes. It is worth noting that each node, if connected to a dummy node, can also be viewed as a pairwise structure, and thus we call each node a unitary structure.

Each unitary structure has a set of associated *node feature functions* and each pairwise structure has a set of *edge feature functions*. In our problem setting, we attach features to each node and edge. The features used in this paper are the user profile features, UGC features, and trust propagation features.

We use T and T∗ to denote the set of all known trust relationships and the set of trust links whose states are unknown, respectively. Let Ψ denote the set of feature weights, and the proposed model that computes a conditional distribution can be defined as

$$P(\mathbf{T}^\* | \mathbf{U}, \mathbf{T}; \Psi) = \frac{1}{Z(\mathbf{U}, \mathbf{T})} \prod\_i \phi\_i(\mathbf{U}, \mathbf{T}; \Psi), \tag{4}$$

where *Z*(·) is the normalization constant, and *ϕi* is the *i*th potential function for either a unitary structure or a pairwise structure.

For any node V*<sup>i</sup>* ∈ (U, T, T∗) or any edge E*j*:V*s*,V*t* connecting nodes V*<sup>s</sup>* and V*t*, the potential functions for either unitary structures or pairwise structures used in this paper are, respectively, defined in the log-linear form as follows.

$$\phi\_i(\mathbf{V}\_i; \Psi\_{\mathbf{V}}) = \exp\left(\sum\_k \psi\_k f\_{ik}(\mathbf{V}\_i)\right), \tag{5}$$

$$\phi\_j(\mathbf{E}\_{j:\langle \mathbf{V}\_s, \mathbf{V}\_t \rangle}; \Psi\_{\mathbf{E}}) = \exp\left(\sum\_k \psi\_k f\_{jk}(\mathbf{E}\_{j:\langle \mathbf{V}\_s, \mathbf{V}\_t \rangle})\right), \tag{6}$$

where *fik*(·) is the *k*th feature function in the *i*th structure. In the way that features are weighted and then linearly aggregated, the interplay among features of the same node or edge is collectively numericalized.
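A minimal sketch of the log-linear potentials in Equations (5) and (6) follows; the function name and the matrix layout are our own assumptions for illustration.

```python
import numpy as np

def potential(feature_matrix, weights):
    """Log-linear potential of a unitary or pairwise structure.
    `feature_matrix` holds one row per candidate state (or state-state pair)
    and one column per feature function; `weights` holds the psi_k.
    The weighted feature sums are exponentiated, as in Equations (5) and (6)."""
    scores = feature_matrix @ weights   # sum_k psi_k * f_k for every candidate state
    return np.exp(scores)               # unnormalized potential values
```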

#### *3.3. Probabilistic Model Inference and Interpretation*

From the above-mentioned model construction, we know that there will be loops in the probabilistic graph if bidirectional trust relations exist. That means if Alice and Bob trust each other, which is common in the real world, then the graph containing only the two users and their trust relationships is no longer a linear-chain or a tree, as shown in Figure 3.

For the probabilistic model inference, collectively predicting all needed trust relationships from the online social network involves iterating through an exponential number of possible label combinations, and thus requires exponential time. Furthermore, as the probabilistic graph contains loops, and exact inference on such a general graphical model is thus intractable, approximations are employed. We propose to solve the inference problem using the *loopy belief propagation* (LBP) technique. As a *belief propagation* (BP) method variant, the LBP is a *message passing* algorithm but requires a slightly different schedule of message updating rules from the vanilla BP method.

It is worth noting that each pairwise structure in the probabilistic graph only connects two random variable nodes, and a unitary structure can be viewed as if there is a dummy node linked to it so that a pseudo-pairwise structure exists. Therefore, we can safely skip the factor graph framework and send messages directly between each pair of nodes connected by an edge. This way of handling message passing is equivalent to the original BP algorithm.

We chose to update messages synchronously, i.e., in each time epoch, each pair of nodes exchange messages, if they are connected by an edge. In the iterative message updates, each node's belief is normalized so that the normalized belief approximates the node's marginal probability and is further considered as the new local evidence (also called compatibility or potential) at this node. Similarly, for an edge and the two nodes connected by it, we use the normalized belief of the pairwise structure as its local evidence. In the model inference, passing messages, calculating node/edge beliefs, and updating messages are repeated until message convergence or an allowed maximum number of iterations is reached.

When the LBP is done, we normalize each node's belief so that it approximates the node's marginal distribution. Through the marginal probabilities of all possible labels for a selected trustRelation node, the trust relationship indicated by this node can be determined by Equation (1). Such probabilities for the trustRelation node can be further interpreted from the message passing perspective.

Let *mji* denote a message sent to node *i* from its neighboring node *j*, *xi* be the random variable at node *i*, and *ϕi*(*xi*) and *ϕij*(*xi*, *xj*) be the local evidence at node *i* (unitary structure) and at edge *i*, *j* together with its connected nodes *i* and *j* (pairwise structure). According to BP, the belief *bi*(*xi*) at a node *i* before normalization, which will then approximate the variable's marginal probability, is proportional to the product of the local evidence at this node and all the messages coming to it:

$$b\_i(x\_i) \propto \phi\_i(x\_i) \prod\_{j \in N(i)} m\_{ji}(x\_i), \tag{7}$$

where *N*(*i*) denotes the set of nodes directly neighboring node *i*. The messages are determined by message update rules as follows,

$$m\_{ji}(x\_i) \leftarrow \sum\_{x\_j} \phi\_j(x\_j)\, \phi\_{ji}(x\_j, x\_i) \prod\_{k \in N(j) \backslash i} m\_{kj}(x\_j). \tag{8}$$

Analogously to Equation (7), the pairwise belief *bij*(*xi*, *xj*) at a pairwise structure will be

$$b\_{ij}(x\_i, x\_j) \propto \phi\_{ij}(x\_i, x\_j)\, \phi\_i(x\_i)\, \phi\_j(x\_j) \prod\_{k \in N(i) \backslash j} m\_{ki}(x\_i) \prod\_{k \in N(j) \backslash i} m\_{kj}(x\_j). \tag{9}$$
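The synchronous message-passing scheme of Equations (7)–(9) can be sketched as follows; this is an illustrative pairwise LBP loop under our own naming conventions, not the authors' implementation, and it assumes `edge_pot` is supplied for both orderings of every edge (with `edge_pot[(i, j)]` the transpose of `edge_pot[(j, i)]`).

```python
import numpy as np

def loopy_bp(node_pot, edge_pot, neighbors, n_iter=10):
    """Minimal synchronous loopy BP on a pairwise graph.
    node_pot[i]      : local evidence vector of node i
    edge_pot[(j, i)] : pairwise potential, rows over states of j, columns over states of i
    neighbors[i]     : list of nodes adjacent to i"""
    msgs = {(j, i): np.ones_like(node_pot[i])
            for i in neighbors for j in neighbors[i]}
    for _ in range(n_iter):
        new_msgs = {}
        for (j, i) in msgs:
            # Product of local evidence at j and all messages into j except the one from i.
            prod = node_pot[j].copy()
            for k in neighbors[j]:
                if k != i:
                    prod = prod * msgs[(k, j)]
            m = edge_pot[(j, i)].T @ prod      # marginalize over states of j (Equation (8))
            new_msgs[(j, i)] = m / m.sum()     # normalize to keep messages from underflowing
        msgs = new_msgs
    beliefs = {}
    for i in neighbors:
        b = node_pot[i].copy()
        for j in neighbors[i]:
            b = b * msgs[(j, i)]
        beliefs[i] = b / b.sum()               # approximate marginal (Equation (7))
    return beliefs
```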

Without loss of generality, we take Figure 5 as an example for the following discussion. UA, UB, TAB are the random variables of the trustor node A, the trustee node B and the trustRelation node linked to them, respectively. Since the trustRelation node is linked to a trustor node and a trustee node by only two edges, with Equation (7), the belief at it is proportional to

$$b\_{\mathrm{T}\_{AB}}(\mathrm{T}\_{AB}) \propto \phi\_{\mathrm{T}\_{AB}}(\mathrm{T}\_{AB}) \prod\_{i \in N(\mathrm{T}\_{AB})} m\_{i \to \mathrm{T}\_{AB}}(\mathrm{T}\_{AB}), \tag{10}$$

and the two messages (*MA* and *MB* in the figure) sent to the trustRelation node are, respectively

$$m\_{\mathrm{U}\_{A} \to \mathrm{T}\_{AB}}(\mathrm{T}\_{AB}) \leftarrow \sum\_{\mathrm{U}\_{A}} \phi\_{\mathrm{U}\_{A}}(\mathrm{U}\_{A})\, \phi\_{\mathrm{U}\_{A}, \mathrm{T}\_{AB}}(\mathrm{U}\_{A}, \mathrm{T}\_{AB}) \prod\_{k \in N(\mathrm{U}\_{A}) \backslash \mathrm{T}\_{AB}} m\_{k \to \mathrm{U}\_{A}}(\mathrm{U}\_{A}), \tag{11}$$

$$m\_{\mathrm{U}\_{B} \to \mathrm{T}\_{AB}}(\mathrm{T}\_{AB}) \leftarrow \sum\_{\mathrm{U}\_{B}} \phi\_{\mathrm{U}\_{B}}(\mathrm{U}\_{B})\, \phi\_{\mathrm{U}\_{B}, \mathrm{T}\_{AB}}(\mathrm{U}\_{B}, \mathrm{T}\_{AB}) \prod\_{k \in N(\mathrm{U}\_{B}) \backslash \mathrm{T}\_{AB}} m\_{k \to \mathrm{U}\_{B}}(\mathrm{U}\_{B}). \tag{12}$$

**Figure 5.** An illustration of passing messages through a trustor node UA and a trustee node UB to a trustRelation node TAB.

Let Λ*x*(*y*) be the product of the local evidence at node *x* and at edge *x*, *y*, and substitute Equations (11) and (12) and the messages shown in the figure into Equation (10). The belief at the trustRelation node amounts to

$$b\_{\mathrm{T}\_{AB}}(\mathrm{T}\_{AB}) = \kappa\, \phi\_{\mathrm{T}\_{AB}}(\mathrm{T}\_{AB})\, M\_A M\_B \tag{13}$$

$$= \kappa\, \phi\_{\mathrm{T}\_{AB}}(\mathrm{T}\_{AB}) \left[\sum\_{\mathrm{U}\_{A}} \Lambda\_{\mathrm{U}\_{A}}(\mathrm{T}\_{AB})\, (M\_1 M\_2)\right] \left[\sum\_{\mathrm{U}\_{B}} \Lambda\_{\mathrm{U}\_{B}}(\mathrm{T}\_{AB})\, M\_3\right], \tag{14}$$

where *κ* is a normalization constant.

It follows that the trustRelation node's belief is proportional to all the local evidence at this node and the two messages sent by the trustor node and the trustee node. The local evidence is the potential calculated through the compatibility function (energy function; refer to Equations (5) and (6)). Based on the trustor's attitude, experience, awareness of other users' expertise and trustworthiness, etc., collected by Equation (11), the message *MA* sent by the trustor node tells the trustRelation node how much the trustor believes they will or will not issue a trust relationship to the trustee. Likewise, through the trustee's expertise, experience, and evidential suggestions from other users' trust, etc., constituted by Equation (12), the message *MB* sent from the trustee node tells the trustRelation node how much the trustee thinks their behaviors should be recognized by the potential trustor so that the trustor will trust them or not.

One can conclude that, as in a real-world social network, both the trustor and the trustee are involved in forming a trust relationship, and that they contribute differently to the relationship's formation, which is in accordance with Assumption 2 made previously.

#### *3.4. Parameter Estimation*

For the model to be able to infer trust, we need to know each parameter in the model, i.e., the weights of all features. In this section, we embrace the simple but effective gradient descent method to minimize the model's negative log-likelihood, so that the training data achieve a locally maximal probability under the model.

From Equations (4)–(6), the conditional log-likelihood with respect to the set of feature weights Ψ to be maximized can be obtained as follows.

$$l(\Psi) = -\log Z(\cdot) + \sum\_i \log\left[\phi\_i(\mathbf{U}, \mathbf{T}; \Psi)\right] \tag{15}$$

$$= -\log Z(\cdot) + \sum\_i \left(\sum\_k \psi\_k f\_{ik}(\mathbf{V}\_i)\right) + \sum\_j \left(\sum\_k \psi\_k f\_{jk}(\mathbf{E}\_{j:\langle \mathbf{V}\_s, \mathbf{V}\_t \rangle})\right), \tag{16}$$

and with the LBP algorithm, the approximated gradient of the likelihood (the partial derivative with respect to feature weight *ψk*) is

$$\frac{\partial l}{\partial \psi\_k} = \sum\_i f\_{ik}(X\_i) - \sum\_i \sum\_{X\_i} f\_{ik}(X\_i)\, q(X\_i) - \frac{\psi\_k}{\sigma^2}, \tag{17}$$

where *Xi* can be either a unitary or a pairwise structure; *q*(·) is the approximated marginal, i.e., the belief of the unitary or pairwise structure, which can be determined by running a pass of the LBP algorithm on the graph; and the last term is the regularization term to prevent over-fitting.

As was discussed in the previous section, calculating the logarithm of the normalizing constant log *Z*(·) is intractable, and thus we use Bethe Energy [47] to further approximate it:

$$\begin{array}{rcl} \log Z(\cdot) & \approx & l\_{\mathrm{BETHE}}(\psi, q) \\ & = & -\sum\_{s,t} \sum\_{\mathbf{V}\_s, \mathbf{V}\_t} q(\mathbf{V}\_s, \mathbf{V}\_t)\, [\log q(\mathbf{V}\_s, \mathbf{V}\_t) - \log P(\mathbf{V}\_s, \mathbf{V}\_t)] \\ & & +\ \sum\_s \sum\_{\mathbf{V}\_s} [d(\mathbf{V}\_s) - 1]\, q(\mathbf{V}\_s)\, [\log q(\mathbf{V}\_s) - \log P(\mathbf{V}\_s)], \end{array} \tag{18}$$

where *d*(V*s*) is the degree of node V*s*; *P*(V*s*) and *P*(V*s*, V*t*) are, respectively, the initial potentials of their unitary and pairwise structures; and *q*(·) and *q*(·, ·) are corresponding optimal marginal distributions of approximated beliefs through LBP.
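A sketch of the Bethe approximation in Equation (18) computed from LBP beliefs follows; the dictionary-based interface is our own assumption, and strictly positive beliefs and potentials are assumed so the logarithms are defined.

```python
import numpy as np

def bethe_log_z(node_beliefs, edge_beliefs, node_pot, edge_pot, degree):
    """Approximate log Z(.) following Equation (18).
    node_beliefs[s], edge_beliefs[(s, t)] : LBP beliefs q(.) and q(., .)
    node_pot[s], edge_pot[(s, t)]         : initial potentials P(.) and P(., .)
    degree[s]                             : degree d(V_s) of node s"""
    log_z = 0.0
    for e, q_st in edge_beliefs.items():
        log_z -= np.sum(q_st * (np.log(q_st) - np.log(edge_pot[e])))
    for s, q_s in node_beliefs.items():
        log_z += (degree[s] - 1) * np.sum(q_s * (np.log(q_s) - np.log(node_pot[s])))
    return log_z
```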

Between gradient descent epochs, the new feature weight vector *ψ*(*m*) is computed from the old vector *ψ*(*m*−1) by

$$\psi^{(m)} = \psi^{(m-1)} + \alpha\_m \nabla l(\psi^{(m-1)}), \tag{19}$$

where ∇*l*(·) denotes the partial derivatives calculated through Equation (17), and *α<sup>m</sup>* > 0 is a step size, also called the learning rate, controlling how far the parameters move in the direction of the gradient.

#### *3.5. Implementation*

Our implementation of the proposed model, including the model construction, inference, and relevant training and predicting procedures, was based on the framework introduced in our previous work [48]. For the sake of brevity, we only discuss some concerns on the model implementation in this section which may greatly affect the model's performance.

**Numerical overflows and underflows.** These are very common in BP and other message passing algorithms. For example, some terms in the potential functions of unitary or pairwise structures (e.g., Equations (5) or (6)) may overflow due to the exponential calculations; in message passing (e.g., Equation (8)), if the messages passed via each edge are not constrained, some messages will underflow after a certain number of message update iterations; and calculating beliefs (e.g., Equations (7) and (9)) may cause overflow, underflow, or both. The trick to tackle the overflow problem here is to shift every exponent in the calculated potential by subtracting the largest exponent. For the underflow issue during message passing, it is necessary to normalize messages frequently.
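Both tricks can be captured in a few lines; the helper names below are hypothetical and only illustrate the exponent-shifting and message-normalization ideas described above.

```python
import numpy as np

def stable_potential(scores):
    """Exponentiate weighted feature sums without overflow: shifting every exponent
    by the largest one only rescales the potential, and the constant cancels after
    normalization."""
    return np.exp(scores - np.max(scores))

def normalize_message(message, eps=1e-12):
    """Rescale a message so that repeated products do not underflow to zero."""
    total = message.sum()
    return message / total if total > eps else np.full_like(message, 1.0 / message.size)
```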

**Parameter learning rate in model training.** According to Equation (19), if the learning rate *α<sup>m</sup>* at step *m* is too large, the new parameters will move too far in the direction of the gradient; if it is too small, the training procedure will be very slow. Thus, it is essential to properly schedule learning rates over the whole model training process. A simple common approach is to let *α<sup>m</sup>* decrease slowly to 0 as step *m* grows, that is,

$$\alpha\_m = \frac{1}{\sigma^2 (\alpha\_0 + m - 1)}, \tag{20}$$

where *α*<sup>0</sup> is the initial learning rate and *σ*<sup>2</sup> is the L2 regularization parameter. The initial learning rate *α*<sup>0</sup> can be set manually or obtained by running a few passes of gradient descent over the graph [49].
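For concreteness, here is a sketch of the schedule in Equation (20) together with the update of Equation (19); the default values of `alpha0` and `sigma2` are illustrative, not taken from the paper.

```python
def learning_rate(m, alpha0, sigma2):
    """Decaying learning rate of Equation (20) for gradient step m (m >= 1)."""
    return 1.0 / (sigma2 * (alpha0 + m - 1))

def gradient_step(weights, grad, m, alpha0=2.0, sigma2=10.0):
    """One parameter update of Equation (19); `weights` and `grad` can be NumPy arrays,
    with `grad` obtained from the approximated gradient of Equation (17)."""
    return weights + learning_rate(m, alpha0, sigma2) * grad
```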

**Implementation of bi-directional message passing.** For bi-directionally passing messages via each edge, the implementation of a non-directional edge in the model is built as two mutually opposite directional edges. Consequently, the number of *real functioning edges* in the implementation will be doubled and the number of calculations in the probabilistic inference procedure will greatly increase. In order for the model to perform LBP efficiently, parallelism could be deployed in synchronous message passing. It is also beneficial to offload the LBP algorithm to *graphics processing unit* (GPU) devices to further accelerate the model inference and training.

#### **4. Experiments**

We performed experiments with data (the dataset is publicly accessible via https://www.cse.msu.edu/~tangjili/trust.html (accessed on 10 June 2017)) from a typical online social network, Ciao, where users are allowed to build trust networks, cast ratings, post reviews on a variety of item types, and rate others' reviews. We chose Ciao because of the availability of a relatively full web of trust and of abundant user-generated content carrying explicit and inexplicit trust information for trust inference.

#### *4.1. Data*

For fast model evaluation and comparison with other binary classifiers, we extracted a portion of the full data as the dataset used for our experiments. As user links in social networks are often sparse, the observed trust relationships take up only a very small fraction, whereas unobserved ones are common, which means the data are unbalanced. To deal with the imbalance in the dataset and for the sake of simplicity, we undersampled unobserved relationships. The statistics for the dataset used in the experiments are listed in Table 7.

**Table 7.** Dataset specification.


For a trust relationship: "Y" indicates an existing trust and "N" a nonexistent one.

Features introduced in Section 3.2.2, including contextual features and relational features, were constructed with information extracted from the dataset. For reviews, we deployed an annotator by CoreNLP (https://stanfordnlp.github.io/CoreNLP/ (accessed on 20 June 2017)) to extract words and phrases from texts and build stylistic and linguistic features. Detailed feature statistics are listed in Supplementary Materials.

Although we carried out the experiments on only one dataset, the proposed method is universally applicable to various online social networks. With only a set of users and the set of trust relationships from other data sources, the model can still be built successfully and then infer trust, though without adequate relevant features the inference may perform unsatisfactorily. If additional features are available, the proposed method will prove its efficiency in trust relationship prediction.

#### *4.2. Experimental Settings*

#### 4.2.1. Comparison Methods

We chose several easy-to-implement state-of-the-art binary classifiers as baseline comparison methods, including a *support vector machine* (SVM) with a radial basis function (RBF) kernel, a *decision tree* (DT), and a *random forest* (RF). Generally speaking, linear SVMs are interpretable but less efficient than SVMs with an RBF kernel, which are partially interpretable; DTs provide interpretable results; and RFs deliver better results than DTs do but decrease interpretability.

We did not compare our model with the methods proposed in [33–35], as ours is a supervised learning method. We also did not conduct comparison experiments against models based on neural networks: although such models probably could, and most likely would, achieve better performance than ours, interpreting them remains a problem.

#### 4.2.2. Evaluation Metrics

As we cast the task of trust relationship inference as a binary classification task in this paper, we used common metrics for classification evaluation. In detail, we used *Accuracy*, *Precision*, *Recall*, and the F1 *score* for evaluations.

For the set of user trust relationships that are to be inferred by a model or a method, we denote the number of all elements in the set by C; the number of observed trust relationships that were also predicted to be existent (true positive data points) by tp; and the numbers of false negative, false positive, and true negative ones by fn, fp, and tn, respectively. Then, the Accuracy is defined as

$$\text{Accuracy} = \frac{\text{tp} + \text{tn}}{\text{C}},\tag{21}$$

and Precision, Recall, and *F*<sup>1</sup> score are, respectively, defined as

$$\text{Precision} = \frac{\text{tp}}{\text{tp} + \text{fp}}, \tag{22}$$

$$\text{Recall} = \frac{\text{tp}}{\text{tp} + \text{fn}}, \tag{23}$$

$$\mathbf{F}\_1 = \frac{2\mathbf{tp}}{2\mathbf{tp} + \mathbf{fp} + \mathbf{fn}}.\tag{24}$$

From the metric definitions, the higher a metric is, the better a model or method performs.
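For reference, the four metrics of Equations (21)–(24) computed from the confusion-matrix counts; the function name is our own.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, Precision, Recall, and F1 score as in Equations (21)-(24);
    C = tp + fp + fn + tn is the number of relationships to be inferred."""
    c = tp + fp + fn + tn
    accuracy = (tp + tn) / c
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, precision, recall, f1
```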

#### 4.2.3. Experiment Setup

In this paper, two sets of experiments were conducted: the first set used all possibly usable feature data, and the second set used reduced feature data in which one or more types of features were randomly removed from some users (see Sections 4.3.1 and 4.3.2).


As was discussed in Section 3.2.2, the proposed model will still work without auxiliary features. Nevertheless, the auxiliary feature types are a coexistent by-product of how the proposed model was built, and they also reflect the real-world mechanism of trust relationship formation. Therefore, these features were used in all experiments for the proposed model, acting as part of the foundation of the model. Note that these features only work with the proposed model.

In all experiments for model validation and comparison, we used different combinations of feature sets to show each method's ability in trust inference. Table 8 lists the different feature sets for experiments. (Refer to Table 2 for the types of features used in this paper.) For the comparison methods, the feature set combinations available for them are similar but without any auxiliary features from FTAUX or FTPAUX.

For model learning and predicting, the set of user trust relationships was randomly split into a training set and a test set, with size ratios of 50–50%, 60–40%, 70–30%, 80–20%, and 90–10%. (A better way would be to split the data according to trust relationship creation time. However, in the chosen dataset, such creation time information was missing, and thus we resorted to splitting the data randomly.) All experiments in the second set used the 90–10% split for the training and test datasets.

For the second set of experiments, the reduced feature data were generated by randomly removing one type of features from a certain percentage of users. The percentages for feature removal were 20%, 40%, 60%, and 80%. If two types of features are used at the same time in an experiment, the random removal for each type of features is independent.
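A sketch of this reduction step, assuming each user's features are stored in a dictionary whose keys are prefixed by the feature type; the names and data layout are ours.

```python
import random

def remove_feature_type(users, feature_type, ratio, seed=None):
    """Drop one feature type (e.g., a "UGC" prefix) from a randomly chosen `ratio`
    of users (0.2, 0.4, 0.6, or 0.8 in the experiments).  When several feature types
    are reduced, the function is called once per type so the removals stay independent."""
    rng = random.Random(seed)
    affected = set(rng.sample(list(users), k=int(ratio * len(users))))
    return {uid: {k: v for k, v in feats.items()
                  if uid not in affected or not k.startswith(feature_type)}
            for uid, feats in users.items()}
```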

Other experiment setup included: in all experiments for the proposed model, we set the maximum number of gradient descent epochs to 20 and the maximum number of LBP iterations to 10.


**Table 8.** Sets of features used for experiments.

#### *4.3. Results and Discussion*

#### 4.3.1. The First Set of Experiments with All Possibly Usable Feature Data

#### Performance of the proposed method with different feature sets

Firstly, we report the trust inference performance of the proposed method. The experiments were conducted with eight different feature set combinations on the five training and test dataset splits, and the accuracy and F1 score are reported for each of them. As shown in Figure 6, the results are organized into three groups by how the feature sets were composed: the first group contains results from experiments 1, 2, 5, 6, and 8, which illustrates how integrating the user profile feature set FPRF into the model affects the model's performance; similarly, the second group contains results from experiments 1, 3, 5, 7, and 8 to show the power of using the UGC feature set FUGC; the third group of experiments 1, 4, 6, 7, and 8 depicts the effect of the trust propagation feature set FTP + FTPAUX.

With either a single feature set or a combination of feature sets, the proposed method outperformed three naive classifiers, *uniformly random guess*, *random selected class*, and *majority class*, which, respectively, achieved accuracies of 0.5000, 0.5014, and 0.5269 according to the dataset specification (refer to Table 7). From the reported results evaluated by the accuracy metric, one can conclude that the proposed method is efficient and capable of inferring trust relationships.

As was previously discussed, with only the first category of auxiliary features, the proposed method still worked, and it also achieved above-average performance. These are special features that only work with our model and not with the other comparison methods. This makes our model with this set of features promising as a supervised learning model when abundant data, such as user-generated content, are available as input.

A quick glance over the performance results regarding both accuracy and F1 score shows that adding extra features to the model does improve the model's performance. However, the results differ somewhat depending on which combination of additional feature sets is added. The observations are as follows.


performance of the model with the UGC feature set FUGC with or without other feature sets.

**Figure 6.** Performance results of the proposed model evaluated by Accuracy (Acc) and F1 score for the first set of experiments. (**a**) Experiments 1, 2, 5, 6, 8 (Acc). (**b**) Experiments 1, 3, 5, 7, 8 (Acc). (**c**) Experiments 1, 4, 6, 7, 8 (Acc). (**d**) Experiments 1, 2, 5, 6, 8 (F1). (**e**) Experiments 1, 3, 5, 7, 8 (F1). (**f**) Experiments 1, 4, 6, 7, 8 (F1).

It is expected that the model obtains better results when additional features are added to it. It is also foreseeable that using UGC features (FUGC) should work better than using other types of features, as UGC data are usually more abundant than other types of features, both in a dataset and in the real world. This expected result in turn validates Assumption 1, that a user's generated contents hold a representation of the user's trust information. All the good performance results came from a rich amount of UGC data, so a legitimate question arises: "What if some of the UGC data are missing, possibly due to privacy-related constraints?" Or, more precisely, "What would the model performance become if some of the UGC data were not available?" The second set of experiments, conducted with reduced feature data, will shed light on the answers to these questions.

As for the trust propagation features, apart from the data imbalance, there is another possible reason that they do not always improve the model's performance: the probabilistic inference is performed on pairwise structures rather than cliques. Nevertheless, the trust propagation features do improve the model's performance when working with the statistical user profile features.

#### Performance Comparisons

Secondly, we report the comparative performance results achieved by using our model and other methods discussed in the previous section. For these experiments, all available feature datasets and their combinations are free to use for experimenting. The performances for each method evaluated by the accuracy and F1 score metrics, and experiments in which the best performances were achieved are reported in Tables 9 and 10. Full results, including precision and recall for each experiment, are reported in the Supplementary Materials. As has been already stated, the auxiliary features only fit in the proposed model, and all the other types of features can be used for all methods.


**Table 9.** First experiment set: best performance comparison for different methods by Accuracy.

**Table 10.** First experiment set: best performance comparison for different methods by F1 score.


From the experimental results listed in the tables and the Supplementary Materials, the first observation is that our proposed method outperforms other comparison methods in terms of accuracy and F1 score, if all types of features are free to use.

The second observation is that UGC features (FUGC) were always involved when any method achieved its best or second-best performance. The fact that a trust construct is embedded in UGC data is confirmed again, this time through the results of the comparison methods. Consequently, one can harvest a user's trust information from the UGC data of the user and their "friends", including trustors and trustees, which bears witness to Assumption 1 made in Section 3.1.

Last but not least, in contrast to the results achieved by our proposed model, SVM and decision tree generated worse results when adding UGC features than when adding user profile features, and all three comparison methods improved their results by using trust propagation features. One possible cause of this performance divergence is the very large number and dimensionality of the UGC features employed in the three comparison methods; meanwhile, a few yet powerful trust propagation features greatly improved their performance when working with either UGC features or user profile features. Another reason our model behaved differently in these experiments is that it can balance a massive number of features that may counter each other, and can also "capture and numericalize" interplaying features.

#### A peek at Recall: higher Precision or higher Recall for trust relationship prediction in OSNs?

For a classification task, the intended use of its outcome determines what to expect of the precision and recall of its results. For the task of trust relationship prediction, or more generally friend recommendation, in an online social network run by an online social service, there are some considerations that might help with making the decision.

In OSNs, friend relationship or trust relationship recommendation is an efficient way for users to get started and become more engaged in the online social service's activities, so that the online social service's revenue can grow.

From the user's perspective, the recommendation should provide users with accurate sets of users based on their homophily and existing friend/trust networks, and the expected precision of the recommendation or the prediction is to be high ideally.

On the other hand, it is probable that the number of candidates being recommended to a user will be few, or nearly none, due to the user's strict criteria, which would eventually make the recommendations less effective. Thus, from the online service's operating and managing perspective, in addition to suggested users who satisfy the user's criteria, it is essential to recommend extra possible candidates to the user, although these candidates may not strictly match the user's criteria fed to the inference algorithm. Therefore, a higher recall for such a recommender or predictor will benefit an online social network. Additionally, of course, the model should maintain acceptable precision.

Another reason for a higher recall comes from the fact that the dataset used for friend recommendation or trust relationship prediction tasks is usually incomplete. As previously discussed, part of an "ideally complete" dataset would be missing due to various actual reasons. Consequently, even though a model generates good prediction results with high precision, the results may still be far away from a "real ideal." Thus, as a trade-off, a prediction result with a higher recall is acceptable.

Taking the above considerations into account, one of the best models that fit in the trust relationship prediction tasks in this paper would achieve a higher recall while still maintaining an acceptable precision. From the results for the first experiment set, our proposed model had an average recall of 0.9327 in all 40 experiments; and for SVM, decision tree, and random forest in their respective 35 experiments, the average recalls were 0.8239, 0.7950, and 0.8382.

#### 4.3.2. The Second Set of Experiments with Reduced Feature Data

At a quick glance at the experiment setup, the second set of privacy-aware experiments conducted with reduced feature data shares some similarities with the first set of experiments in the training set and test set split configuration. Indeed, if we were to conduct another group of experiments in which users chose to opt out of the usage of all their profile data and UGC data, and some users even denied the usage of their trust relationships in trust inference tasks, the experiments would be exactly the same as the ones in the first set of experiments using only the first category of auxiliary features (FTAUX) with specific training and test datasets.

However, the two are not the same from the management perspective. The split training set and test set configuration is for model verification. On the one hand, the aforementioned experiments are only particular cases of trust relationship inference in this paper. On the other hand, real-world privacy settings vary a lot from service to service and from OSN to OSN, and it is worth exploring how a model performs in close-to-real conditions.

#### Overall performance results

Figures 7 and 8 show the results of the second experiment set with reduced feature data evaluated by the Accuracy and the F1 score metrics, respectively. In general, for different feature set combinations, the performance of the proposed model and the comparison methods decreases when the ratio of removed features increases.

Similarly to the results of the experiments with full feature data, the proposed model worked well with UGC features and greatly outperformed the other methods in these experiments. This answers the earlier question about the model's performance with partially available UGC data.

#### A second peek at Recall with reduced feature data

This set of experiments was a simulation of the real-world scenario in which obtainable online social network datasets are incomplete for various reasons. In particular, one whole category or certain types of data from a certain number of users are unavailable when the users choose to limit the usage of their data by online social services through privacy settings. Although such an experimental setup does not cover all possible scenarios, it may shed some light on how a model might work when some of the privacy-restricted data are unavailable.

As previously discussed, a higher recall with acceptable precision achieved by a method to infer trust in OSNs is one of the ideal objectives for online social services. In all 30 experiments of the second set, our proposed model achieved an average recall of 0.9409, and 27 of the 30 recalls were ≥0.90. The results for SVM, decision tree, and random forest were, respectively, 0.7682 (two recalls ≥0.90), 0.7444 (none ≥0.90), and 0.7844 (two recalls ≥0.90). Refer to the Supplementary Materials for the full results.

**Figure 7.** Performance results of the proposed model and comparison methods evaluated by accuracy for the second set of experiments. (**a**) Experiment 2s (Acc). (**b**) Experiment 3s (Acc). (**c**) Experiment 4s (Acc). (**d**) Experiment 5s (Acc). (**e**) Experiment 6s (Acc). (**f**) Experiment 7s (Acc).

**Figure 8.** Performance results of the proposed model and comparison methods evaluated by F1 score for the second set of experiments. (**a**) Experiment 2s (F1). (**b**) Experiment 3s (F1). (**c**) Experiment 4s (F1). (**d**) Experiment 5s (F1). (**e**) Experiment 6s (F1). (**f**) Experiment 7s (F1).

#### **5. Conclusions**

In this paper, we explored the problem of collecting trust information and exploiting trust's various properties to infer trust in online social networks. Both explicitly presented trust relationships and information inexplicitly embedded in user-generated contents that bear users' attitude, experience, expertise, credibility, trustworthiness, etc., are harvested as trust information. A probabilistic graphical model based on a conditional random field for trust inference was proposed, which can effectively take advantage of trust's asymmetric, propagative, non-transitive, and subjective properties. With loopy belief propagation, a message passing algorithm, the model inference was presented and well interpreted. Experiments were conducted to evaluate the proposed model on a real-world online social trust dataset, and the experimental results demonstrated the effectiveness of the proposed model for trust inference.

Further improvements to the proposed model can be achieved. Implementing the model inference on cliques rather than pairwise structures may help the model capture the interplay among trust relationships that share the same trustors or trustees more accurately, making it possible to integrate more types of features that convey users' beliefs through complex interactions between users. Our handling of class imbalance for classification, which ultimately stems from the sparsity of user relationships in online social networks, is quite simple, and it may be further addressed by introducing penalties.

The study presented in this paper is primitive, but the proposed model is promising. Distrust is also a trust relationship type that can be supported by adding an extra label to the trustRelation node in the proposed model. Then, with a proper dataset, a model that can infer relationships of both trust and distrust can be trained. The proposed model also supports quantitative and context-specific trust evaluation, which could be an interesting future study with a proper dataset. By using each user's probability of trusting another user, an individual-oriented personalized trust management system can be built, and many social recommendation tasks will benefit from it.

With respect to trust's subjectivity and asymmetric properties, the concept of distinguishing the trustor's attitude, experience and belief from the trustee's expertise, trustworthiness, and credibility when forming a trust relationship—which was shown to be helpful for supervised trust inference in this paper—may help to improve unsupervised methods for trust inference. Finally, it will also be crucial to study the model's sensitivity against attacks.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app12105186/s1. Detailed feature statistics used in this paper and full experimental results, including performances evaluated by accuracy, precision, recall and F1 score for each experiment conducted in this paper, are enclosed in the Supplementary Materials.

**Author Contributions:** Conceptualization, Y.L. and B.W.; data curation, Y.L.; methodology, Y.L.; software, Y.L.; validation, Y.L.; writing, Y.L.; funding acquisition, B.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was partly funded by the National Key Research and Development Program of China under grant 2018YFC0831500, the National Natural Science Foundation of China under grant 61972047, and the NSFC-General Technology Basic Research Joint Funds under grant U1936220.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The full data of the results used to support the findings of this study are enclosed in the Supplementary Materials. The source code for our model implementation and raw data for the experiments conducted in this paper are available from the corresponding author upon request.

**Acknowledgments:** Yu Liu would like to thank Sergey Kosov for the inspiration for CRF network implementations and inference algorithms. The authors also would like to thank Yunlei Zhang and Jinna Lv for their kind and supportive suggestions for model implementation with accelerated computing using GPUs, and Bin Wu for providing a free test environment for GPU computing.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this paper:


#### **References**


**Parallel Frequent Subtrees Mining Method by an Effective Edge Division Strategy**

**Jing Wang <sup>1,</sup>\* and Xiongfei Li <sup>2</sup>**


**\*** Correspondence: wangj755@nenu.edu.cn

**Abstract:** Most data with a complicated structure can be represented by a tree structure. Parallel processing is essential to mining frequent subtrees from massive data in a timely manner. However, only a few algorithms can be transplanted to a parallel framework. A new parallel algorithm is proposed to mine frequent subtrees by a grouping strategy (GS) and an edge division strategy (EDS). The main idea of GS is dividing edges according to different intervals and then assigning subtrees consisting of the edges in different intervals to their corresponding groups. In addition, the compression stage in mining is optimized by avoiding the generation of all candidate subtrees of a compression tree, which reduces the mining time on the computing nodes. Load balancing can improve the performance of parallel computing. An effective EDS is proposed to achieve load balancing. EDS divides the edges with different frequencies into different intervals reasonably, which directly affects the task amount on each computing node. Experiments demonstrate that the proposed algorithm can implement parallel mining, and it outperforms the compared methods in load balancing and speedup.

**Keywords:** frequent subtree; parallel algorithms; data partitioning; load balancing

#### **1. Introduction**

The era of big data has arrived with the advent of massive data. Semi-structured data [1,2], with its non-strict structure, plays a crucial role in massive data. Most data with a complicated structure, including semi-structured data, can be represented by a tree structure. Data mining methods are used to find hidden relationships among massive data [3,4]. Frequent subtree mining has become an important field of data mining research [5–7]. It is the process of mining, from a given data set, the set of subtrees that satisfy a user-specified threshold (support or frequent degree). Frequent subtree mining can be applied in many fields. For example, an RNA molecule's structure can be represented by a tree structure; in order to obtain information about a new RNA molecule, it must be compared to the known RNA structures, and the function information of the new RNA can be obtained by looking for the same topology [8].

CFMIS (compressed frequent maximal induced subtrees) [9] is an efficient method for frequent subtree mining that we proposed earlier. The CFMIS algorithm can find all frequent induced subtrees, without discarding solutions, in less time. Parallel frequent subtree mining is essential for mining massive volumes of data in a timely manner. MapReduce is an ideal software framework to support distributed computing on large data sets on clusters of computers [10,11]. However, not all algorithms can be transplanted to the MapReduce framework; in fact, only a few can. Assigning data to appropriate blocks is crucial for parallelizing algorithms in MapReduce [12]. In this paper, three parallel CFMIS (PCFMIS) algorithms, PCFMIS1, PCFMIS2 and PCFMIS3, are proposed. PCFMIS1 parallelizes CFMIS, transplanting CFMIS to the MapReduce framework by GS. Furthermore, PCFMIS2 is proposed by optimizing the compression to reduce the running time on each slave node. Based on PCFMIS1 and PCFMIS2, PCFMIS3 is proposed to achieve load balancing by using an effective EDS.


In summary, the contributions of our work are as follows:


The rest of the paper is organized as follows: in Section 2, related work is reviewed; the proposed PCFMIS1, PCFMIS2 and PCFMIS3 are presented in Section 3; in Section 4, experimental results are displayed and discussed; conclusions are made in Section 5.

#### **2. Related Work**

With the extensive application of semi-structured data, the research priority of frequent pattern mining has expanded from frequent item set mining [13,14] to frequent subtree mining [15]. L. Wang et al. proposed a novel framework for mining temporal association rules, which mainly represent the temporal relation among numerical attributes [16]. A new structure called a frequent itemsets tree is proposed to avoid generating candidate item sets when mining rules. Building the tree and mining the temporal relation between the frequent itemsets proceed simultaneously. V. Huynh et al. proposed an improved version of the IPPC tree, called IPPC+, to increase the performance of the tree construction [17]. IPPC+ improves the poor performance of the IPPC tree on datasets comprising a large number of distinguishing items but only a small percentage of frequent items. W. Pascal et al. proposed an algorithm that mines probabilistic frequent subtrees with polynomial delay by replacing each graph with a forest formed by an exponentially large implicit subset of its spanning trees [18]. The algorithm overcomes the drawback that the number of sampled spanning trees must be bounded by a polynomial of the size of the transaction graphs, which results in less impressive recall even for slightly more complex structures beyond molecular graphs. J. Wang et al. proposed a compression tree sequence (CTS) to construct a compression tree model and saved the information of the original tree in the compression tree. CFMIS [9] was proposed based on CTS to mine frequent maximal induced subtrees. For each iteration, compression could reduce the size of the data set; thus, the traversal speed was faster than that of other algorithms.

Limited memory and computing resources make massive data mining difficult. Parallel data mining can be an effective solution to this problem [19–21]. In recent years, researchers have made some achievements in frequent item mining. S. Shaik et al. presented a scalable parallel algorithm for big data frequent pattern mining [22]. Three key challenges are identified for parallel algorithmic design: load balancing, work partitioning, and memory scalability. D. Yan et al. proposed a general-purpose framework, PrefixFPM, for FPM that is able to fully utilize the CPU cores in a multicore machine [23]. PrefixFPM follows the idea of prefix projection to partition the workloads of FPM into independent tasks by divide and conquer. The state-of-the-art serial algorithms are adapted for mining frequent patterns, including subsequences, subtrees, and subgraphs, on top of PrefixFPM. D. Yan et al. extended PrefixFPM to provide complete parallel algorithms by adopting four new algorithms so that a richer set of pattern types is covered, including closed patterns, ordered and unordered tree patterns, and a mixture of path, free tree, and graph patterns [24]. PrefixFPM exposes a unified programming interface to users, who can readily customize it to mine their desired patterns. Xun, Y. et al. proposed FiDoop to achieve compressed storage and avoid building conditional pattern bases [25]. Hong, T. P. et al. proposed a parallel genetic-fuzzy mining algorithm [26] based on the master–slave architecture to extract both association rules and membership functions from quantitative transactions. C. FB et al. proposed a frequent itemset mining method using sliding windows capable of extracting tendencies from continuous data flows [27]. They developed this method using Big Data technologies, in particular the Spark Streaming framework, enabling distribution of the computation along several clusters and thus improving the algorithm's speed. Sicard, N. et al. proposed a parallel fuzzy tree mining method (PaFuTM) [28]. In some of their approaches, the level-wise architecture is preserved but tasks are parallelized within each level. All candidates from each frequent subtree are assigned to specific tasks where they are generated and tested against the database.

#### **3. PCFMIS Algorithm**

In this section, we provide the definitions for some concepts that will be used in the remainder of the paper. The proposed parallel algorithm will also be explained in detail.

#### *3.1. Prepared Knowledge*

#### 3.1.1. Definitions for Concepts

The CFMIS algorithm deals with labeled trees in which sibling nodes are unordered and the sibling nodes of the same parent node have no repeated labels. The 'tree' mentioned below is the same kind of tree that CFMIS processes, assuming that there is no repeated label within a same tree. CFMIS focuses on frequent maximal induced subtree mining.

**Definition 1.** *Induced subtree. A tree T′ = (V′, E′, r′) is an induced subtree of T = (V, E, r), denoted as T′ ⊂ T, if V′ ⊂ V and E′ ⊂ E, where V is the set of nodes; E is the set of edges, in which (x, y) ∈ E represents that x is the parent of y; and r is the root node.*

**Definition 2.** *Frequent subtree. D = {T1, T2, ... , Tn} is a tree set and ε is the frequency threshold. Occ(Ti, T′) indicates whether T′ occurs in Ti: if T′ ⊂ Ti, then Occ(Ti, T′) = 1; else Occ(Ti, T′) = 0. Frq(T′) = ∑<sup>n</sup><sub>i=1</sub> Occ(Ti, T′). T′ is a frequent subtree if Frq(T′) ≥ ε.*

**Definition 3.** *Maximal subtree. F is the frequent subtree set of D. Let T and T′ be two frequent subtrees in F; T maximizes T′ (denoted as T <sup>m</sup>−→ T′) if and only if T′ ⊂ T. If there is no T″ in F such that T″ <sup>m</sup>−→ T, then T is called a maximal subtree.*

**Definition 4.** *Compression tree sequence (CTS). A CTS is an ordered sequence composed of elements node<sup>pd</sup>: CTS = (node<sub>0</sub><sup>pd<sub>0</sub></sup>, node<sub>1</sub><sup>pd<sub>1</sub></sup>, ... , node<sub>n</sub><sup>pd<sub>n</sub></sup>), where pd is the index pointing to the parent node, defined for each node except the root node. Let the parent of node<sub>i</sub><sup>pd<sub>i</sub></sup> be node<sub>j</sub><sup>pd<sub>j</sub></sup>; then pd<sub>i</sub> = |i − j|, and node<sub>0</sub><sup>pd<sub>0</sub></sup> is the root node.*

For example, the CTS of the example tree in Figure 1 can be *a*<sup>0</sup>*b*<sup>1</sup>*c*<sup>2</sup>*d*<sup>1</sup>*e*<sup>2</sup>*f*<sup>3</sup>. *a*<sup>0</sup> is the root element. *a* is the parent element of *b*, because the difference between the positions of *a* and *b* in the CTS is 1, which equals the index superscript of *b*. *c* is the parent element of *f*, because the difference between the positions of *c* and *f* in the CTS is 3, which equals the index superscript of *f*. The same rule applies to the other nodes, so the tree structure and all the edges in the tree can be recovered from the CTS.
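A small sketch of this encoding follows; the tree shape used below (a has children b and c, and c has children d, e, and f) is inferred from the sequence a<sup>0</sup>b<sup>1</sup>c<sup>2</sup>d<sup>1</sup>e<sup>2</sup>f<sup>3</sup> given in the text rather than taken from Figure 1 directly.

```python
def tree_to_cts(root, children):
    """Encode a labeled tree as a compression tree sequence (Definition 4): a preorder
    list of (label, pd) pairs, where pd is the distance in the sequence from a node
    back to its parent (pd = 0 for the root)."""
    cts, pos = [], {}

    def visit(label, parent):
        pos[label] = len(cts)
        pd = 0 if parent is None else pos[label] - pos[parent]
        cts.append((label, pd))
        for child in children.get(label, []):
            visit(child, label)

    visit(root, None)
    return cts

# Tree implied by the example sequence: a -> {b, c}, c -> {d, e, f}.
children = {"a": ["b", "c"], "c": ["d", "e", "f"]}
print(tree_to_cts("a", children))
# [('a', 0), ('b', 1), ('c', 2), ('d', 1), ('e', 2), ('f', 3)]
```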

**Definition 5.** *Tree list length. The number of node<sup>pd</sup> elements in a CTS is the length of the CTS.*

#### 3.1.2. CFMIS Algorithm

A simple review of the CFMIS algorithm is described below (more details in [9]), and it is primarily performed via four stages:

Stage 1. Construct the compression tree model: the original data set is constructed as a compression tree model using CTS;

Stage 2. Cutting edges: this stage is divided into two subprocesses, trimming edges and cleaning up edges. First, trim the edges whose edge frequent degree is less than the threshold. Then, delete the resulting isolated single nodes.

**Figure 1.** An example tree.

Stage 3. Find frequent subtrees: compress the trees according to descending edge frequent degree to obtain CTSs, and sort the CTSs by tree list length from shortest to longest. Match each *CTSi* with the CTSs following it; if it matches (i.e., *CTSi* is contained in another *CTSj*), the frequent degree of the subtree *T* represented by *CTSi* is incremented by 1.

Stage 4. Maximal Stage: Run the frequent subtree sets maximal processing. The frequent maximal induced subtree set of the original data set are obtained.

#### *3.2. Grouping Strategy*

Data partitioning is the premise of a parallel algorithm. An effective data partitioning method can greatly reduce the data communication between slave nodes, thereby reducing the parallel computing time. Based on the above findings, the grouping strategy (GS) of PCFMIS, which divides all the trees in the original data set *D* into different groups, is described below:

(1) Given *m* slave nodes, divide all the trees in *D* into *m* groups; the group number is denoted as *Gnum*. Each edge (*x*, *y*) in *D* is divided into one of the sets *A* = {*Ak*}, *k* ∈ [1, *m*].

(2) For *Ti* ∈ *D*, the edges in *Ti* may belong to several *Ak* according to (1). Take the minimum of these *k* as the *Gnum* of *Ti*, i.e., *Gnum* = min *k*, so *Ti* is put into the group for which *Gnum* = min *k*. Then, for *Ti*, cut the edges that belong to *Amink*. Some new trees *Ti*′ will appear after cutting these edges.

(3) For these new trees *Ti*′, repeat (2) until no new trees are produced.

(4) For *Tj* ∈ *D*, repeat (2) and (3).

(5) All the trees in original data set *D* are divided into different groups, and the different groups will be put in different slave nodes.

**Definition 6.** *Related Tree. Given a tree T* = (*V*, *E*, *r*)*, for every* (*x*, *y*) ∈ *E there is some Ak with* (*x*, *y*) ∈ *Ak; let* min *k be the minimum such k. When* (*x*, *y*) ∈ *Amink, T is a related tree of* (*x*, *y*)*, denoted as* (*x*, *y*) → *T. The set of trees T with* (*x*, *y*) → *T is denoted as* Γ(*x*, *y*)*.*

**Property 1.** *According to grouping strategy, a tree in original data set only belongs to one group.*

**Property 2.** *According to the grouping strategy, if* (*x*, *y*) ∈ *Ak, all trees in* Γ(*x*, *y*) *will be put into the group for which Gnum* = *k.*

According to Property 1, the trees in the original data set are grouped into different groups, which can reduce the scale of the data each slave node needs to process. Although in (2), new trees may be generated during the grouping process, the new trees are trimmed and less complex than the original tree.

We can conclude from Property 2, for (*x*, *y*) ∈ *Ak*, all the induced subtrees of Γ(*x*, *y*) containing (*x*, *y*) can be found in the group in which *Gnum* = *k*. So frequent induced subtrees of Γ(*x*, *y*) containing (*x*, *y*) can be found only in group *k* instead of the whole original data set. This avoids communication between groups and realizes real parallelism.

**Definition 7.** *Frequent Edge Degree. D* = {*T*1, *T*2, ..., *Tn*}*, the frequent edge degree of* (*x*, *y*) *is the number of times* (*x*, *y*) *appears in D, denoted as EFrq*(*x*, *y*)*.*

How to divide the edge (*x*, *y*) into set {*Ak*} in (1) directly affects the *Gnum* of Γ(*x*, *y*). One feasible method is described below:

Given an evenly divided sequence of thresholds *Adiv* = (*a*1, ..., *ai*, ..., *am*) with *a*1 > ... > *ai* > ... > *am*, spanning the range from *EFrq*max(*x*, *y*) down to *EFrq*min(*x*, *y*): if *ai*−1 > *EFrq*(*x*, *y*) ≥ *ai*, the edge (*x*, *y*) is divided into *Ai* (edges with *EFrq*(*x*, *y*) ≥ *a*1 are divided into *A*1); then Γ(*x*, *y*) will be put into the group for which *Gnum* = *i*. For example, in Figure 2, suppose that the number of slave nodes is 3. Divide the trees *T*0 and *T*00 into 3 groups (*G*1, *G*2, *G*3). Given the dividing thresholds *Adiv* = (*a*1, *a*2, *a*3) with *a*1 > *a*2 > *a*3: *EFrq*(*a*, *b*), *EFrq*(*c*, *d*), *EFrq*(*d*, *f*) ≥ *a*1; *a*1 > *EFrq*(*h*, *d*), *EFrq*(*d*, *e*) ≥ *a*2; and *a*2 > *EFrq*(*a*, *c*) ≥ *a*3. Therefore, (*a*, *b*), (*c*, *d*), (*d*, *f*) are divided into *A*1; (*h*, *d*), (*d*, *e*) are divided into *A*2; and (*a*, *c*) is divided into *A*3. Take *T*0 in Figure 2 as an example. For *A*3, Γ(*a*, *c*) is divided into *G*3; then delete (*a*, *c*) from *A*3 and cut (*a*, *c*). For *A*2, Γ(*d*, *e*) is divided into *G*2; then delete (*d*, *e*) from *A*2 and cut (*d*, *e*). For *A*1, Γ(*a*, *b*) (*T*3 in Figure 2), Γ(*c*, *d*) (*T*4 in Figure 2), and Γ(*d*, *f*) (*T*4 in Figure 2) are divided into *G*1. *T*01 and *T*02 are obtained after applying GS to *T*00.

By using the above grouping strategy on the original data set and applying CFMIS algorithm in each group, the CFMIS algorithm is implemented in parallel. The new parallel algorithm is called PCFMIS1. However, the experiment results shown in Section 4.2 indicate that the parallel computing time and speedup did not achieve the desired results. Further improvements on PCFMIS1 will be discussed below.

**Figure 2.** An example of weight tree dividing.

#### *3.3. Improvements on Parallel Algorithm*

3.3.1. Optimized Compression in CFMIS

Stage 3 is the central step of the CFMIS algorithm, and it also takes most of the time in CFMIS. The processing in Stage 3 must find all subtrees of a compression tree to determine whether each of them is frequent; in fact, this consumes up to 70% of the execution time of the stage. Optimizing Stage 3 can therefore greatly reduce the running time of the algorithm, particularly when the data size is large. In this section, CFMIS is refined by optimizing Stage 3.

According to Property 2, for any edge (*x*, *y*) ∈ *Ak*, all the induced subtrees of Γ(*x*, *y*) containing (*x*, *y*) can be found in group *k*, so the frequent induced subtrees of Γ(*x*, *y*) can also be found there. For any other edge (*p*, *q*) ∉ *Ak* with (*p*, *q*) ∈ *Af*, all the induced subtrees of Γ(*p*, *q*) containing (*p*, *q*) can be found in group *f*. For these reasons, we only need to find the frequent induced subtrees of Γ(*x*, *y*) that contain (*x*, *y*), instead of all the frequent induced subtrees in group *k*; the other frequent induced subtrees can be found in other groups. If a CTS in group *k* does not contain (*x*, *y*), this CTS does not have to be matched with the CTSs following it. That is, this improvement in Stage 3 avoids finding all subtrees of a compression tree to determine whether each of them is frequent. The improved compression within groups makes the running time shorter. The improved parallel algorithm is called PCFMIS2. Although the time has been reduced, the load on the slave nodes is not balanced.

#### 3.3.2. Load Balancing

An effective edge division strategy (EDS) is proposed in this section. If (*x*, *y*) ∈ *Ak*, Γ(*x*, *y*) will be put into group *k*, so the division of edges in the original data set affects the load of the slave nodes. We take the frequent edge degree as the basis for the edge division strategy. Abstracting the edge division strategy as a mathematical problem, it can be described as follows: suppose that there are *w* different edges in the original data set, and their frequent edge degrees are denoted as *EFrq*(*i*), 1 ≤ *i* ≤ *w*. Record *EFrq*(*i*) in the array *x*[*i*]. Divide the different edges of the original data set into the sets {*Ak*}, ensuring that the sum of the frequent edge degrees in each *Ak* is approximately equal. The improved parallel algorithm is called PCFMIS3.

The steps of EDS are described below:


An example of the EDS method is given here to show how it works. Suppose that the frequent edge degrees in the original data set are 50, 60, 80, 100, 150, 200, 400, 1000, and *k* = 3. The mean value is *u* = (50 + 60 + 80 + 100 + 150 + 200 + 400 + 1000)/3 = 680. As 1000 > 680, 1000 is assigned separately to *A*1. Now the problem is reduced to the following: divide the remaining edges into two sets, ensuring that the sum of the frequent edge degrees in each set is approximately equal. The new mean value is *u* = (50 + 60 + 80 + 100 + 150 + 200 + 400)/2 = 520. The problem is translated into a 0–1 knapsack problem in order to solve it: the weights of the items are 50, 60, 80, 100, 150, 200, 400 and the knapsack capacity is 520. The answer is that 50, 60, 400 make the total value in the knapsack largest; 50, 60, 400 are assigned to *A*2, and the rest are assigned to *A*3.
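
A minimal Python sketch of this procedure (ours; names are illustrative, and knapsack ties may yield a different but equally balanced division):

```python
def knapsack_subset(weights, capacity):
    """0-1 knapsack: pick a subset with total weight <= capacity, as large as possible."""
    best = {0: []}                        # reachable sum -> chosen item indices
    for idx, w in enumerate(weights):
        for s, items in list(best.items()):
            t = s + w
            if t <= capacity and t not in best:
                best[t] = items + [idx]
    target = max(best)                    # largest reachable sum <= capacity
    return [weights[i] for i in best[target]]

def eds(degrees, k):
    """Divide frequent edge degrees into k sets with approximately equal sums."""
    remaining = sorted(degrees, reverse=True)
    sets = []
    while k > 1:
        u = sum(remaining) / k
        if remaining[0] > u:              # a single very large degree gets its own set
            sets.append([remaining.pop(0)])
        else:                             # otherwise fill one set up to the mean value u
            chosen = knapsack_subset(remaining, u)
            sets.append(chosen)
            for c in chosen:
                remaining.remove(c)
        k -= 1
    sets.append(remaining)                # the rest forms the last set
    return sets

# The worked example above; ties between equally good subsets are possible.
print(eds([50, 60, 80, 100, 150, 200, 400, 1000], 3))
```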

PCFMIS3 solves the load balancing problem and incorporates the optimized compression. The flowchart of PCFMIS3 is shown in Figure 3. Three steps, including two MapReduce operations, are completed during the parallel computation.

**Figure 3.** Parallel algorithm PCFMIS3 flowchart.

Step 1: Calculate the edge frequent degree of each edge in the original tree set. In Map 1 (Algorithm 1), record each occurrence of every edge in a tree set split; this is the input of Reduce 1 (Algorithm 2). In Reduce 1, the edge frequent degree of each edge is counted.
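
As an illustration (our sketch, not Algorithms 1 and 2 themselves), Step 1 behaves like a word count over edges:

```python
from collections import defaultdict

def map1(tree_split):
    """Map 1: emit (edge, 1) for every occurrence of an edge in the split."""
    for tree in tree_split:               # each tree is modelled as a set of (x, y) edges
        for edge in tree:
            yield edge, 1

def reduce1(pairs):
    """Reduce 1: sum the emitted counts into EFrq(x, y) for every edge."""
    efrq = defaultdict(int)
    for edge, count in pairs:
        efrq[edge] += count
    return dict(efrq)

split = [{("a", "b"), ("a", "c")}, {("a", "b"), ("c", "d")}]
print(reduce1(map1(split)))               # EFrq: ('a','b') -> 2, ('a','c') -> 1, ('c','d') -> 1
```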

Step 2: Cut edges and find frequent subtrees. In Map 2 (Algorithm 3), trim the edges whose edge frequent degree is less than the threshold according to the list; divide the edges into different sets according to EDS; and divide the subtrees into different groups according to GS. The group to which each subtree is assigned is the input of Reduce 2 (Algorithm 4). In Reduce 2, find the frequent subtrees in each group; the maximal frequent subtrees of each group can then be obtained.

Step 3: Find the maximal subtrees of original tree set. In this step, run the frequent subtree sets maximal processing to obtain the final maximal frequent subtrees.




**Algorithm 3:** Map 2 (*key*, *value*)

**Input:** // *split*: subtree set split

// *list*: the list of the edges

**Output:** *key* // key is the *Gnum*

*value* // value is the subtree in the corresponding *Gnum*

**for** *each edge in list* **do**

*subtree\_group* = *cut*(*edge*) // cut edges that do not meet the threshold requirements, *subtree\_group* is the remaining subtree set

**end**

**for** *each edge in subtree\_group* **do**

*A*(*edge*) = *EDS*(*edge*) // EDS is the edge division strategy; *A*(*edge*) is the division set

**end**

**for** *each subtree in subtree\_group* **do**

*Gnum*(*subtree*) = *GS*(*subtree*) // GS is the grouping strategy

*value* = *subtree*

*key* = *Gnum*

Emit(*key*, *value*)

**end**

**Algorithm 4:** Reduce 2 (*key*, *local*\_*set*)

**Input:** // *Gnum*: the group number

// *Gnum*\_*subtrees*: all the subtrees whose group number is *Gnum*

**Output:** *key* // null

*local*\_*set* // the frequent subtree set which belongs to this group

*frequent*\_*list* = *get*\_*edges*(*Gnum*)

// get\_edges(*Gnum*) is used to get the list of the frequent edges of this group


#### **4. Experiments and Results**

Parallelizing the CFMIS algorithm only affects the execution time; it has no effect on the accuracy of the calculation results. For the validity of the algorithm, please refer to [9]. In this paper, the number of groups in the GS equals the number of slave nodes. In order to show that EDS is an effective method for load balancing, load balancing tests are compared between PCFMIS1 and PCFMIS3. Experiments on computational time and speedup are conducted for PCFMIS1, PCFMIS2, PCFMIS3, and PaFuTM [28].

#### *4.1. Experimental Environment*

All of the experiments were conducted on a PC cluster connected with 100 Mbps Ethernet. Each PC was equipped with a 3.20 GHz Intel Core i5 and 4 GB of main memory, running the CentOS 6.6 operating system. The platform version is Hadoop 2.6.2. The system configuration is shown in Figure 4. The synthetic data set used in this paper was generated by a tree generator using the method in [29]. The parameters of the synthetic data set are set as follows: f = 10 (f represents the fan-out), d = 10 (d represents the depth of the tree), n = 100 (n represents the number of labels), m = 100 (m represents the number of tree nodes), and t represents the number of trees. When t = 50,000, the data set is denoted as D5; t = 100,000 is denoted as D10; t = 200,000 as D20; t = 500,000 as D50; t = 1,000,000 as D100; and t = 2,000,000 as D200. The real data set was obtained from the CSLOGS data set, a one-month log of the web site of Rensselaer Polytechnic Institute. CSLOGS10 contains 100,000 trees and CSLOGS100 contains 1,000,000 trees. The support thresholds for D5, D10, D20, and CSLOGS10 are 0.01 and 0.05. For D50, D100, D200, and CSLOGS100, the support thresholds are 0.01 and 0.001.

**Figure 4.** System configuration diagram.

#### *4.2. Experiments and Analysis*

During the experiments, it was found that for data sets with a large data volume, the time cost increases dramatically when the number of nodes is too small. The reason is that the computation over the tree structure needs to operate the stack constantly; if the JVM stack is insufficient, stack contents keep being swapped. In this paper, experiments were done on a small number of nodes for the small data sets and on a large number of nodes for the large data sets.

#### 4.2.1. Comparison of Load Balancing

We performed the load balancing evaluation on both real data set CSLOGS10 and synthetic data set D50. For CSLOGS10, the number of nodes is set to 1–5; for D50, the number is set to 1–8.

Table 1 shows the amount of subtrees and computing time in different groups of CSLOGS10 in Reduce 2 while the total number of nodes are 3, 4, 5 and the support threshold is 0.01. The subtree amounts in each group divided by PCFMIS1 and PCFMIS3 are shown in Table 1. For PCFMIS1, when the number of nodes is 3, the amount of subtrees in the most loaded group (*Gnum* = 3) is 3.66 times that of the least loaded group (*Gnum* = 1); when the number of nodes is 4 and 5, the ratio is 4.54 and 6.96. For PCFMIS3, when the number of nodes is 3, the amount of subtrees in the most loaded group (*Gnum* = 1) is 1.32 times that of the least loaded group (*Gnum* = 2); when the number of nodes is 4 and 5, the ratio is 1.30 and 1.31. For the computing time, when the number of nodes is 3 in PCFMIS1, the longest computing time (*Gnum* = 3) is 3.31 times that of the shortest computing time (*Gnum* = 1). The corresponding amount ratio and time ratio are shown in Table 2. The closer the ratio is to 1, the better the effect of the load balancing. The same load balancing test is done on D50 (the support threshold is 0.01) which is shown in Table 3, and the corresponding amount ratio and time ratio are shown in Table 4. The experiments indicate that the efficient division of edges in the original data set affects the load of the slave nodes. The proposed EDS can achieve load balancing.


**Table 1.** The amount of subtrees and computing time in different *Gnum* groups of CSLOGS10.

**Table 2.** The amount ratio and time ratio in different *Gnum* groups of CSLOGS10.



**Table 3.** The amount of subtrees and computing time in different *Gnum* groups of D50.


**Table 4.** The amount ratio and time ratio in different *Gnum* groups of D50.

#### 4.2.2. Comparison of Running Time and Speedup

The speedup is used to measure the parallelization performance of parallel systems or programs. It is the ratio of the time that the same task runs in uniprocessor and parallel processor systems. The speedup is defined by the following formula [30]:

$$S\_p = \frac{T\_1}{T\_p} \tag{1}$$

where *p* is the number of nodes, *T*<sup>1</sup> is the execution time on a single node, and *Tp* is the execution time on *p* nodes.

When *Sp* = *p*, the speedup is a linear speedup. It is the ideal speedup that the parallel algorithm tries to achieve. However, the linear speedup is difficult to achieve because the communication cost increases with the increasing number of cores.

Figure 5 shows the computational time of four algorithms for four data sets on a small number of nodes. Figure 6 shows the computational time of four algorithms for four data sets on large number of nodes. As the number of the nodes increases, the computational time of the four parallel methods becomes shorter. Due to the use of optimized compression, PCFMIS2 has less computational time than PCFMIS1. PCFMIS3 performs best, saving half the time compared to PCFMIS1. It indicates that the improvements on optimized compression and load balancing are effective. In particular, the EDS strategy proposed by PCFMIS3 divides the edges more rationally to provide a basis for grouping. As PaFuTM has to find all the candidate subtrees and duplicate redundant subtrees exist among nodes, PaFuTM has longer computational time than PCFMIS3.

**Figure 5.** Computational time of four algorithms for four data sets on a small number of nodes.

**Figure 6.** Computational time of four algorithms for four data sets on a large number of nodes.

Figure 7 shows the speedups on four data sets with a small number of nodes, and Figure 8 shows the speedups on another four data sets with a large number of nodes. As the results show, the speedup of PCFMIS1 is the worst on all data sets, and PCFMIS2 performs better than PCFMIS1. The speedups of PCFMIS3 and PaFuTM are the closest to the linear one, with PCFMIS3 performing better. As the number of nodes increases, the speedup keeps growing. However, in Figure 7e,f, when the number of nodes increases from 5 to 6, the speedups of PCFMIS1 and PCFMIS2 grow slowly and there is a clear inflection point; this also appears in Figure 8a,b,f. PCFMIS3 maintains steady growth.

**Figure 7.** The speedups of four algorithms for four data sets on a small number of nodes.

**Figure 8.** The speedups of four algorithms for four data sets on a large number of nodes.

#### **5. Conclusions**

A parallel algorithm, PCFMIS3, with GS and EDS is proposed in the MapReduce framework to parallelize CFMIS and mine frequent subtrees efficiently. The GS method divides the subtrees into different groups, which solves the data division problem in parallel computing; it avoids inter-group communication while mining frequent subtrees in each group, reducing the parallel computing time. In PCFMIS3, load balancing is achieved by using EDS. Additionally, the compression stage in mining frequent subtrees is optimized by avoiding the enumeration of all candidate subtrees of a compression tree, which reduces the calculation time on the nodes. Experiments demonstrate that the PCFMIS3 algorithm performs best in the comparisons of load balancing and running time on both the real and the synthetic data sets. The maximum load is 1.3 times the minimum load, whereas it is up to 7 times without EDS. The time ratio of PCFMIS3 is only about 1.1, while the time ratio exceeds 2.0 without EDS. PCFMIS3 also performs best for different support values, saving half the computational time compared with PCFMIS1, and achieves the speedup closest to the linear one on both small and large numbers of nodes.

Serial computing can hardly meet the needs of massive data processing, whereas parallel computing can take advantage of multi-node computing resources to reduce the problem resolution time. The proposed GS and EDS methods address two important issues of parallel computing: data partitioning and load balancing. Applying the GS and EDS methods to the mining of other frequent item sets is our future work.

**Author Contributions:** Conceptualization, J.W. and X.L.; methodology, J.W. and X.L.; software, J.W.; validation, J.W.; formal analysis, X.L.; writing—original draft preparation, J.W.; visualization, J.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Technology Development Plan of Jilin Province (Jing Wang 2022) and the Fundamental Research Funds for the Central Universities.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Guangyi Man \* and Xiaoyan Sun \***

School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China

**\*** Correspondence: mgy\_paper@163.com (G.M.); xysun78@126.com (X.S.)

**Abstract:** Keyframe recognition in video is very important for extracting pivotal information from videos. Numerous studies have successfully identified frames with moving objects as keyframes, but the definition of "keyframe" can be quite different for different requirements. In the field of E-commerce, the keyframes of product videos should be those a customer is interested in and should help the customer make correct and quick decisions, which is greatly different from the existing studies. Accordingly, we first define the key interested frame of a commodity video from the viewpoint of user demand. As there are no annotations of the interested frames, we develop a fast and adaptive clustering strategy to cluster the preprocessed videos into several clusters according to the definition and to annotate them. These annotated samples are utilized to train a deep neural network to obtain the features of key interested frames and achieve the goal of recognition. The performance of the proposed algorithm in effectively recognizing the key interested frames is demonstrated by applying it to commodity videos fetched from an E-commerce platform.

**Keywords:** key interested frame; commodity video; clustering; deep neural network

#### **1. Introduction**

Videos have been successfully used in an increasing number of fields due to the development of video technology, including video retrieval [1], recommending videos of interest to users in personalized recommendation [2], recognizing and tracking moving targets in surveillance videos for pattern recognition [3], and so on. The greatest advantage of applying videos in many scenarios is that they continuously record what is happening and provide important information when required. However, it is also quite difficult to directly and efficiently find valuable frames because of this continuous recording. For example, when searching for a hit-and-run vehicle based on a traffic video, the police may spend several days watching every frame to find the target and may still fail. If the keyframes of interest can be recognized by removing redundant video frames, the workload of the users can be greatly reduced and the success rate of finding key information can be improved. Therefore, extracting the key and valuable frames from video has become one of the research hotspots in video processing [4,5].

In the existing relevant work on keyframe extraction, the keyframe of a video is generally defined as the frame containing the key action changing of the object [6]. The purpose of keyframe extraction is to find a set of images from the original video to represent the main action changes. Through the keyframe, users can understand the behavioral features of the main character or object in a relatively short time [7], and it can provide important information for users to make decisions. There are four kinds of keyframe extraction algorithms—that is, based on target, clustering algorithm, dictionary, and image features. Target-based keyframe extraction transforms the problem into the detection of important objects or people and extracts frames containing important elements from the video as the keyframe. Lee et al. [8] proposed a crucial person or target detection method
based on the photographer's perspective and predicted the importance of new targets in a video according to the salient region. This kind of algorithm is mainly used to detect important targets in the video. Another kind of keyframe extraction is based on clustering, which gathers the frames into different categories by designing an appropriate clustering method and selects one or more images from each category as keyframes [9,10]. This kind of algorithm is simple, intuitive, and easy to implement but often needs parameters such as the number of clusters or the clustering radius to be specified, which limits its practical applicability. Dictionary-based keyframe extraction adopts a dictionary to reconstruct the video, assuming that the video keyframe sequence is the best dictionary [11]. This kind of algorithm turns keyframe selection into dictionary learning. Mademils et al. [12] proposed a keyframe extraction algorithm for human motion video based on a saliency dictionary, reconstructed the whole video using the benchmark of human activity, and then extracted keyframes according to the dictionary. This sort of frame extraction algorithm pays more attention to the characteristics of the whole video but ignores the uniqueness of individual frames. Feature-based keyframe extraction generally uses color, texture, or motion features to recognize motion information. Zhang et al. [13] put forward a keyframe extraction strategy based on the image color histogram. Yeung and Liu [14] came up with a fast keyframe recognition method based on the maximum distance in feature space. Meanwhile, Lai and Yi [15] extracted movement, color, and texture characteristics of frames and built dynamic and static saliency maps, raising the priority of movement information, which enhanced the extraction of motion information. Li et al. [11] mapped the video frame features into an abstract space and then extracted keyframes in it. Existing keyframe extraction algorithms have achieved good results in motion target detection, key figure detection, creating video abstracts, and other fields. However, these algorithms either lack static feature extraction, focus on the overall features of the video while ignoring the uniqueness of a single frame, or require prior parameters, which greatly limits the application value of keyframe extraction. In addition, these studies mainly focus on the movement information in the video and do not consider individual demands.

In fact, different users have different concerns when browsing a video, so extracting keyframes based only on motion information can hardly meet users' demands. For instance, on an e-commerce platform [16], the commodity video provides vital information for users searching for what they need, and the key is how the user can obtain the interesting information from the video in a very short time. Such information goes beyond movement changes, as representing the global and local content of products is more crucial. Besides, commodity videos are commonly short and the movement of the object is not obvious, so traditional keyframe extraction algorithms based on motion changes no longer apply. At the same time, the definition of a keyframe for a commodity video varies from person to person. Therefore, how to extract video keyframes according to the needs of different users has become a challenge for the traditional algorithms.

In recent years, many keyframe extraction algorithms have focused on adding an attention mechanism. Shih [17] designed an attention model based on semantic and visual information to mark frames, and selected frames with high scores as keyframes. Although the attention mechanism is integrated, there is no solution to deal with different users' interests.

To address the above problems, we propose a commodity video keyframe recognition algorithm based on adaptive clustering annotation to solve commodity video keyframe extraction. First of all, from the perspective of users' demands, a keyframe is defined as a frame containing the global and local information that users are interested in, and such frames must not include noise or blur. Then, the frame-to-frame difference method is used to obtain the differential frame set by looking for the maximum difference between frames, where a differential frame is defined as a frame that has a greater difference and more object movement compared with its adjacent frames. For this set, an efficient and adaptive clustering strategy with few parameters, based on the frame difference degree and category scale, is designed to cluster the different frames, and the differential frame sets are divided and annotated based on the keyframe definition derived from users' requirements. Furthermore, a small number of labeled commodity keyframe samples are utilized to train a deep neural network by means of transfer learning [18] to realize accurate keyframe recognition and extraction.

The contributions of this paper mainly include the following four parts: (1) From the perspective of users' attention and interest in commodities, we propose a definition of the commodity video keyframe. (2) An adaptive image clustering strategy based on the frame-to-frame difference and a keyframe labeling method are proposed. (3) A deep neural network model is presented that fuses the frame-to-frame difference with the transfer mechanism to realize the extraction of commodity video keyframes. (4) The proposed algorithm is applied to a self-built video library of clothing commodities, and the results show its effectiveness in meeting the personalized needs of users. Figure 1 shows the framework of our algorithm.

**Figure 1.** Proposed algorithm framework.

The paper is organized as follows: In Section 2, we provide a review of related work about keyframe extraction, image clustering, and deep neural network algorithm. In Section 3, we introduce algorithms proposed in this paper in detail, including the adaptive clustering strategy (Adaptive Cluster Based on Distance, ACBD), commodity video keyframe labeling algorithm, and keyframe extraction strategy. In Section 4, we evaluate the algorithm on the commodity dataset and analyze the results. In the Discussion section, we discuss the advantages and significance of our algorithm. In the Conclusion section, we conclude the paper.

#### **2. Related Work**

Keyframe extraction has become a hotspot in video processing technology nowadays and mainly uses clustering strategies, such as K-means clustering [19], mean-shift clustering, density clustering [20], fuzzy C-means clustering [21], etc., to generate video summaries and retrieve information. Generally, in the field of video retrieval, the video is usually transformed into keyframes so as to improve efficiency. Sze et al. [22] proposed a keyframe extraction algorithm based on pixel features, which improved the retrieval performance compared with the traditional algorithm based on histogram features. Pan et al. [21] proposed a keyframe extraction algorithm based on improved fuzzy C-means clustering, extracting frame images from each class as keyframes according to the maximum entropy. Xiao et al. [23] dynamically divided frames into different classes in accordance with the captured content and selected the frame closest to the class center as the keyframe. Wang and Zhu [24] extracted a moving target from the original video as keyframes through shot boundary detection. Chen et al. [25] used posture information to identify and select keyframes with abrupt posture changes on the basis of human targets, in order to make the network adaptive to pose changes and, thus, extract keyframes containing motion information.

Normally, in the field of generating video summary, the extracted keyframe is used as the video summary. Ren et al. [26] divided the video into different segments according to lenses; then, they clustered different video segments, and finally, selected several frames from every category as keyframes of the video. In general, image features are usually fused into the keyframe extraction algorithm based on clustering. For instance, Gharbi et al. [10] extracted SURF features from video frames and then clustered features to extract keyframes. Likewise, Mahmoud et al. [27] extracted and clustered color features, and sequentially selected the center of each category as keyframes of the video. Liu et al. [28] designed feature description windows to extract the objects and reduced the number of windows through prior knowledge to reduce the possibility of overfitting. Gygli et al. [29] considered multiple target detection; they extracted the features of different targets through supervised learning and fused multiple features to extract the keyframe. Li et al. [11] extracted a frame per second to greatly shorten the length of the video, thus improving the efficiency of the algorithm.

With the rapid development of deep learning in the field of video processing, some scholars have applied it to extract and recognize keyframe features. Zhao et al. [30] used Recurrent Neural Networks (RNN) to extract video keyframes, which makes the keyframe sequence better represent the semantic information of the original video. Agyeman et al. [31] extracted video features with a 3D-CNN and ResNet and then recognized keyframes using a Long Short-Term Memory (LSTM) network trained on these features. RNN and LSTM are generally used to process time series data. Although video is a kind of time series data, for commodity video keyframe extraction, users do not focus on the temporal characteristics between two frames and pay more attention to the content of each image. Therefore, we should consider other neural network models to extract the image features that users are interested in so as to improve the accuracy of keyframe extraction.

At present, the convolutional neural network is mainly used to extract image features. Since AlexNet [32] made a breakthrough, the architecture of convolutional neural network (CNN) has been getting deeper and deeper. For example, VGG [33] network and GoogleNet [34] have reached 19 and 22 layers, respectively. However, with the increase in network depth, the problem of ineffective learning caused by gradient vanishing will lead to the saturation or even degradation of the performance of the deep convolutional neural network. Hence, He K. et al. proposed a deep residual network (ResNet) [35], adding the identity mapping and relying on the residual module to overcome the gradient disappearance and enhance the convergence of the algorithm. There are five models of ResNet network, including 18 layers, 34 layers, 50 layers, 101 layers, and 152 layers. With the network layers being deepened, the fitting ability of models is gradually improving but the training complexity is increasing sharply. From the perspective of reducing the complexity of deep convolutional neural network, Howard et al. proposed the structure of MobileNet [36], which greatly reduced the number of training parameters through the use of deep separable convolution. Further, on the basis of this network, Sandler et al. integrated the residual module and proposed MobileNet-V2 [37] with higher accuracy. On
this basis, Gavai et al. [38] used MobileNet to classify flower images, and Yuan et al. [39] used Mobilenet-V2 to detect surface defects of galvanized sheet. Fu et al. [40] extracted text information from natural scenes by combining Mobilenet-V2 and U-NET. On account of the high efficiency and accuracy of Mobilenet-V2 in image feature extraction, we apply MobileNet-V2 to extract image features of commodity video keyframes in order to obtain frame features reflecting users' interest.

#### **3. Keyframe Extraction by Adaptive Clustering Annotation Facing User's Interestingness**

#### *3.1. Keyframe Definition for User's Interest*

In commodity videos, users will concentrate on images that reflect the global and local information of commodities. In addition, the clarity and quality of images will also greatly affect user experience and decision-making. Therefore, we first provide quantitative descriptions of image quality and the information mentioned previously and then define the keyframe based on the descriptions.

Figure 2 illustrates the global and local information of two frames. Visibly, the image containing global commodity information has many contour features while, by contrast, the other image lacks such features. Thus, we adopt the Laplace operator, which embodies the contour features of images, to define the global and local features that express the user's interest.

**Figure 2.** (**a**) Local feature, (**b**) Global feature. Local features reflect commodity details including texture, material, workmanship, etc. and global features reflect the overall effect of the commodity on the model.

The pixel matrix of a commodity video frame is denoted as *f*(*u*, *v*), where *u* and *v* represent the row and column of the image, and its discrete Laplacian is shown in Equation (1) [41].

$$\begin{aligned} \nabla^2 f(u, v) &= \left[ f(u+1, v) + f(u-1, v) + f(u, v+1) + f(u, v-1) \right] - 4f(u, v), \\ u &\in \{1, 2, \dots, h\}, v \in \{1, 2, \dots, w\} \end{aligned} \tag{1}$$

In Equation (1), *h* and *w* represent the number of rows and columns, respectively. Further, we binarize the image after the Laplacian filtering and then count the number of white points in the binary image to calculate the ratio *η* = *Nw*/*Nt*, which reflects the richness of the contour information, where *Nw* and *Nt* represent the number of white points and the total number of pixels, respectively. When *η* is large, it indicates that the frame contains more contour features and likely reflects the global feature of the commodity. On the contrary, the frame may only contain the local features of the goods. For example, for the two frames in Figure 2, *η* is 0.0029 and 0.028, respectively, showing a large difference.
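
A minimal sketch (ours) of this computation; the binarization threshold is an assumption, as the paper does not report it:

```python
import numpy as np

def contour_ratio(gray, thresh=20):
    """gray: 2-D uint8 array (grayscaled frame).  Returns eta = Nw / Nt."""
    f = gray.astype(np.float64)
    lap = np.zeros_like(f)
    # Equation (1): 4-neighbour discrete Laplacian on the interior pixels
    lap[1:-1, 1:-1] = (f[2:, 1:-1] + f[:-2, 1:-1] +
                       f[1:-1, 2:] + f[1:-1, :-2] - 4 * f[1:-1, 1:-1])
    binary = np.abs(lap) > thresh        # white points of the binary image (assumed threshold)
    return binary.sum() / binary.size    # eta = Nw / Nt
```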

The mean value, standard deviation, and mean gradient are commonly used to measure image quality, describing the brightness, color saturation, and clarity of an image, respectively. In commodity video frames, poor image quality is caused by blur and by the absence of the main object due to shot transitions, so the mean gradient can be used. Meanwhile, the absence of the main object results in a large blank area whose average gradient is 0, resulting in a small mean gradient for the whole image. For this reason, we adopt the mean gradient to evaluate the quality of commodity video frames, as in Equation (2). In the equation, *w* and *h* represent the width and height of the picture, and ∂*f*(*u*,*v*)/∂*x* and ∂*f*(*u*,*v*)/∂*y* represent the horizontal and vertical gradients, respectively. When the mean gradient of a frame is greater than the mean of all frames' gradients, the image quality is considered higher.

$$G = \frac{1}{w \times h} \sum\_{u=1}^{h} \sum\_{v=1}^{w} \sqrt{\frac{(\frac{\partial f(u,v)}{\partial x})^2 + (\frac{\partial f(u,v)}{\partial y})^2}{2}} \tag{2}$$
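
Similarly, a minimal sketch (ours) of the mean gradient of Equation (2), where forward differences are used as the horizontal and vertical gradients (an assumption, since the paper does not specify the gradient operator):

```python
import numpy as np

def mean_gradient(gray):
    """gray: 2-D uint8 array.  Returns G as in Equation (2)."""
    f = gray.astype(np.float64)
    gx = np.diff(f, axis=1)[:-1, :]      # horizontal gradient, cropped to a common shape
    gy = np.diff(f, axis=0)[:, :-1]      # vertical gradient
    return np.sqrt((gx ** 2 + gy ** 2) / 2).mean()
```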

Nonetheless, we encounter two problems when using the above quantitative descriptions in practical applications: (1) When a commodity has many designs, its image has abundant local and global contour features. In this case, the *η* values of the frames are similar to each other, so the global and local features cannot be completely separated by *η*. (2) When the mean gradient is used to measure image quality, the mean gradient of frames without complicated designs is similar to that of frames lacking the main object, and these frames may be misclassified as low quality. Therefore, we introduce the user's interest to make up for the deficiency of the quantitative descriptions; that is, on the basis of these descriptions, each user participates in the judgment of the global information, local information, and image quality. In general, the commodity video keyframe is determined by *η*, *G*, and the correction operations of the user.

In order to accurately identify keyframes in commodity video, our algorithm is divided into the following steps: (1) We extract a part of frames according to frame-to-frame difference and design an efficient automatic clustering strategy to cluster these frames. (2) We calculate *η* and *G* based on categories to make the clustering result more accurate by means of introducing local features, global features, and quality evaluation and then submit them to users for correction. Finally, we obtain reliable keyframe labels. (3) Aiming to extract keyframe features, we utilize a small number of keyframes containing labels to train Mobilenet-V2 by transfer learning and, finally, we can obtain keyframes using the network trained before.

#### *3.2. Personalized Keyframe Adaptive Clustering Based on Frame-to-Frame Difference*

Before extracting keyframes reflecting personalized demands, frames should be annotated. Obviously, it is time-consuming and difficult for users to annotate each frame. Thus, we expect to reduce this burden with the method of frame-to-frame difference, which can be used to identify the frames of shot transition and model's movements as the important information of the video. Therefore, we use it to measure the difference between two adjacent frames, and then design the Adaptive Cluster Based on Distance (ACBD) to shorten the labeling process.

#### 3.2.1. Differential Frames Extraction

We define the set of all video frames as Ψ = {*X*1, *X*2, ... , *Xn*}; after grayscaling Ψ, it is expressed as Ψ′ = {*X*′1, *X*′2, ... , *X*′*n*}. Then, the frame-to-frame difference between frame *i* and frame *i* − 1, denoted as *D*(*i*), is given by Equation (3).

$$D(i) = \frac{\sum\_{m=1}^{w} \sum\_{j=1}^{h} ||X\_i'(j, m) - X\_{i-1}'(j, m)||}{w \times h}, i \in \{2, 3, \dots, n\} \tag{3}$$

Frame-to-frame difference is often used to detect object movement; a frame whose difference value is larger than a threshold is selected as a keyframe. In our work, we adopt a differential frame selection strategy based on the maximum value. For example, *D*(*i*) ≥ [*D*(*i*−1) + *D*(*i*+1)]/2 reveals that the frame *Xi* differs greatly from the other frames in {*Xi*−2, *Xi*−1, *Xi*, *Xi*+1}, so we consider *Xi* to be a differential frame. The set of differential frames is denoted as Ψ*<sup>d</sup>* = {*X*1, *X*2, ... , *XL*}. In order to filter redundant information further, we set a threshold *α*: when *L* ≤ *α*, the algorithm outputs the result; otherwise, the above differential frame extraction is repeated on Ψ*d*.
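
A minimal sketch (ours) of this differential-frame selection, with *α* = 200 as reported in Section 4.1:

```python
import numpy as np

def frame_difference(prev_gray, cur_gray):
    """D(i) of Equation (3) for two equal-sized grayscale frames."""
    return np.abs(cur_gray.astype(np.float64) -
                  prev_gray.astype(np.float64)).mean()

def differential_frames(frames, alpha=200):
    """frames: list of 2-D grayscale arrays.  Returns the differential frame set."""
    while True:
        if len(frames) < 4:
            return frames
        d = [frame_difference(frames[i - 1], frames[i]) for i in range(1, len(frames))]
        keep = [frames[i] for i in range(2, len(frames) - 1)
                if d[i - 1] >= (d[i - 2] + d[i]) / 2]   # local-maximum selection rule
        if len(keep) <= alpha:
            return keep
        frames = keep                    # repeat the extraction on the differential set
```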

#### 3.2.2. Adaptive Clustering of Differential Frames

In order to improve annotation efficiency, we propose an Adaptive Cluster Based on Distance (ACBD) to cluster the set of differential frames and select the class center for user annotation. The ACBD algorithm uses Euclidean distance to measure the similarity between images and divides the categories through dynamic threshold to achieve accurate clustering of images. Assume that the set to be clustered is a differential frame set, as Ψ*<sup>d</sup>* = {*X*1, *X*2, ... , *XL*}. The ACBD algorithm is completed in four steps: (1) Obtain the initial classes—that is, calculate the similarity of the images and group *L* images into *L* − 1 classes. (2) Remove noise. (3) Combine related categories and delete abnormal images by intraclass difference. (4) According to the number of samples of every class, dynamically adjust the cluster threshold to optimize the clustering results. The details are shown as follows.

(1) Obtain the initial classes. We transform the image *Xi* to row vector, as *xi* = [*xi*,1, *xi*,2, ... , *xi*,(*w*×*<sup>h</sup>*)], and the Euclidean distance between any two frames is shown in Formula (4).

$$dis(X_m, X_l) = \sqrt{\sum_{i=1}^{w \times h} (x_{m,i} - x_{l,i})^2}, \quad m \in \{1, 2, \dots, L\},\; l \in \{1, 2, \dots, L\} \tag{4}$$

Thus, we can obtain the image distance matrix of Ψ*<sup>d</sup>* represented by *Dis*(Ψ*d*).

$$Dis(\Psi_d) = \begin{bmatrix} 0 & dis(X_1, X_2) & \cdots & dis(X_1, X_L) \\ dis(X_2, X_1) & 0 & \cdots & dis(X_2, X_L) \\ \vdots & \vdots & \ddots & \vdots \\ dis(X_L, X_1) & dis(X_L, X_2) & \cdots & 0 \end{bmatrix}$$

Afterwards, we calculate the mean of *Dis*(Ψ*d*) by Formula (5).

$$E = \frac{\sum_{i=1}^{L} \sum_{j=i+1}^{L} dis(X_i, X_j)}{C_L^2} \tag{5}$$

$$dis(X_p, X_q) < K \tag{6}$$

For each image, we gather its similar images into a group. According to Formula (6) in which *K* equals *E* for the first time, we group Ψ*<sup>d</sup>* into *L* − 1 classes denoted as *A* = {*A*1, *A*2,..., *AL*−1}. For example, the clustering result of image *i* is represented as *Ai* = {*Ai*,1, *Ai*,2, ... , *Ai*,*p*}, *i* = 1, 2, ... , *L* − 1. *Ai*,*<sup>p</sup>* represents that the image *p* is similar to the image *i*.

(2) Remove noise. Firstly, we calculate the number of elements in *Ai* denoted as *ai* and set the threshold *β* to divide noise. If *ai* < *β*, it means that images in *Ai* are quite different from most other images and the number of elements in it is not enough to be considered a separate cluster, so we determine *Ai* as a noise and delete it. Besides, for *Ai*,*q*, *q* = 1, 2, ... , *p* we seek similar elements in *Ai* and count them denoted as *Countq* to delete *Ai*,*<sup>q</sup>* if *Countq* is less than *β*.

(3) Merge categories based on associated images: we assume that sets sharing common elements describe the same class and therefore merge intersecting sets. For example, if a sample belongs to both *Ai* and *Aj*, we merge them into one category. The result is denoted as *A*′ = {*A*′1, *A*′2, *A*′3, ..., *A*′*m*}.

(4) Optimize the clusters based on a dynamic clustering threshold. In Formula (6), *K* is usually unable to accurately separate different clusters, so we update *K* to improve accuracy. To this end, we count the number of elements in *A*′*t*, *t* = 1, 2, ... , *m*, denoted as *Numt*, and we set a threshold *μ* to change Formula (6) into Formula (7) so as to recalculate *A* and redo steps (2) and (3) if *Numt*/*L* > *θ*, *t* = 1, 2, ... , *m*. *θ* is a decimal representing the percentage of each category in the video.

$$\text{dis}(A\_p, A\_q) < \mu \text{K} \tag{7}$$

Finally, we obtain the result when *K* stops changing or the iteration limit *ε* is reached. After clustering, we calculate *η* and *G* to separate the images into those containing local features, those containing global features, and those containing distorted information. Then, these images are sent to users to correct the labels, so this process integrates the user's preferences to achieve personalized keyframe extraction. The ACBD algorithm framework is shown in Algorithm 1.

#### **Algorithm 1** The process of ACBD.

**Input:** The image set Ψ*<sup>d</sup>* = {*X*1, *X*2, ..., *XL*}

**Output:** Clustered images

**Step 1:** Transform each image *Xi* to a row vector *xi* and obtain *Dis*(Ψ*d*) and *E*. Obtain the initial classes *A* = {*A*1, *A*2, ..., *AL*−1} according to Formula (6).

**Step 2:** **for** *i* from 1 to *L* − 1 **do**: count *ai* of *Ai* = {*Ai*,1, *Ai*,2, ..., *Ai*,*p*}; **if** *ai* < *β* **then** delete *Ai*; **for** *q* from 1 to *p* **do**: count *Countq*; **if** *Countq* < *β* **then** delete *Ai*,*q*.

**Step 3:** Merge categories that have the same elements and obtain the result *A*′ = {*A*′1, *A*′2, *A*′3, ..., *A*′*m*}.

**Step 4:** **for** *t* from 1 to *m* **do**: count *Numt*; **if** *Numt*/*L* > *θ* **then**, according to Formula (7), change *K* to *μK*, recalculate *A*, and redo Step 2 and Step 3; **else return** *A*′ = {*A*′1, *A*′2, *A*′3, ..., *A*′*m*}.
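
A compact Python sketch of ACBD (ours, with the inner-element noise removal of Step 2 omitted and a single-pass merge for brevity); the parameter values follow Section 4.1:

```python
import numpy as np

def acbd(X, beta=3, mu=0.95, theta=1/3, eps=5):
    """X: (L, d) array of flattened differential frames.  Returns a list of index sets."""
    L = len(X)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # Formula (4)
    K = dist[np.triu_indices(L, 1)].mean()                           # E of Formula (5)
    merged = []
    for _ in range(eps):
        groups = [set(np.where(dist[i] < K)[0]) for i in range(L)]   # Formula (6)
        groups = [g for g in groups if len(g) >= beta]               # remove noise
        merged = []
        for g in groups:                                             # merge intersecting sets
            for m in merged:
                if m & g:
                    m |= g
                    break
            else:
                merged.append(set(g))
        if merged and max(len(m) for m in merged) / L > theta:
            K *= mu                                                  # Formula (7): tighten the threshold
        else:
            return merged
    return merged

rng = np.random.default_rng(0)
frames = rng.random((30, 64))            # placeholder for 30 flattened differential frames
clusters = acbd(frames)
```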

#### *3.3. The Extraction of Keyframes of Interest Combined with Frame-to-Frame Difference and Deep Learning*

After obtaining the labeled images, it is time to train the network used for keyframe extraction. Considering the practicability of the network on mobile devices, the MobileNet-V2 network model is used in this paper because, while maintaining high accuracy, it runs faster and has fewer parameters than ResNet-50.

However, the network may not be adequately trained using only a self-built dataset. Therefore, inspired by transfer learning [42], we use the model pretrained on ImageNet and retrain and fine-tune the network parameters on our dataset. After differential frame extraction and neural network classification, the keyframes containing the user's information of interest are finally obtained.
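
As an illustration of this step (our sketch; the paper does not name a framework, so a PyTorch/torchvision setup is assumed), the pretrained backbone is loaded and its head is replaced by a 4-class classifier before fine-tuning:

```python
# Assumed PyTorch/torchvision environment; not the authors' implementation.
import torch.nn as nn
from torchvision import models

# Load the ImageNet-pretrained MobileNet-V2 (transfer learning).
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)

# Replace the classification head with a new 4-class layer
# (commodity detail / overall effect / unrelated / distorted).
model.classifier[1] = nn.Linear(model.last_channel, 4)

# Fine-tune on the labeled frames; Section 4.4 reports batch size 32,
# 400 epochs, and a learning rate of 0.0001.
```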

#### **4. Experiment**

In this part, we verify the effectiveness and rationality of the proposed algorithm through experiments. First, we design experiments to verify the effectiveness of the differential frame extraction algorithm and the adaptive clustering algorithm, and we compare MobileNet-V2 and ResNet-50 for commodity video keyframe recognition. Finally, we give the overall output of the proposed algorithm. The experiments are run on the Clothes Video Dataset constructed in this paper.

#### *4.1. The Experiment Background*

In recent years, the volume of commodity video data has grown rapidly, but there is no publicly available commodity video dataset. Therefore, we downloaded 30 videos each of jackets, pants, shoes, and hats from Taobao (www.taobao.com) to construct a new Clothes Video Dataset, which contains 120 videos with lengths ranging from ten seconds to one minute. The basic parameters of the dataset are shown in Table 1.

**Table 1.** Basic parameters of the Clothes Video Dataset.


We consider that for the product, the details are the most important part, and the overall effect is second. Therefore, commodity video frames are divided into four categories: the first category shows commodity details reflecting texture, material, workmanship, etc.; the second category shows the overall effect of the commodity on the model; the third category contains a lot of information unrelated to commodities, such as scenes and models' faces; the fourth category is the distorted image. Thus, the first and second categories are defined as the keyframe of the video. We randomly selected 3 videos from each category in the dataset, a total of 12 videos, and then obtained the initial clustering by ACBD algorithm. Next, we corrected and divided the clustering results into the four categories mentioned before to obtain 790 images of the first category, 247 images of the second category, 58 images of the third category, and 45 images of the fourth category.

We used an Intel Core i7-8700K CPU with 64 GB of RAM and an NVIDIA GeForce GTX 1080Ti graphics card. For the software environment, we used Python version 3.7. The parameters *α*, *β*, *μ*, *θ*, *ε* used in the experiments were set to 200, 3, 0.95, 1/3, and 5, respectively, determined through many experiments.

#### *4.2. Differential Frame Extraction*

We compared the algorithm proposed by Li et al. [11] with our algorithm for differential frame extraction to analyze the advantages and disadvantages. The algorithm proposed by Li et al. extracts one frame of image per second. The essence of the algorithm is to extract video frames at the same time interval. In this experiment, we set the extraction interval as 4 frames; the duration of the experiment video is 60 s and contains 1501 frames.

As can be seen in Figure 3, the video of 1501 frames is reduced to 176 frames, which greatly reduces the amount of data for subsequent processing and retains the main information of the video. By comparing Figures 4 and 5, a large number of frames remain after equal interval extraction, so it can be seen that the interval size of 4 frames is not reasonable, and the extraction interval should be expanded to improve the result. In practical application, the video length is often unknown, so it is impossible to determine the size of the extracted interval. Therefore, the differential frame extraction algorithm is more applicable.

**Figure 3.** Extraction results of differential frames.

**Figure 4.** Interframe difference after differential frame extraction.

**Figure 5.** Interframe difference after equal interval extraction.

#### *4.3. ACBD Algorithm*

We designed three experiments to verify the effectiveness of our proposed ACBD algorithm. First, we conducted experiments about the effect of ACBD, then compared the effect of the ACBD algorithm with the DBSCAN algorithm on the commodity dataset constructed in this paper and tested its applicability with DBSCAN and K-Means on the UCI dataset.

#### 4.3.1. Effect of ACBD

We used the above video for experiment and used the ACBD algorithm to cluster the obtained differential frames.

From the distribution of Figure 6b, it can be clearly seen that compared with Figure 6a, the ACBD algorithm deletes some differential frames and leaves 106 frames. The deleted frames are mainly the shot transition frames shown in Figure 7, and the transition frames contain a lot of useless and even misleading information, which cannot be used for subsequent algorithm processing. The removal of transition frames can improve the accuracy and efficiency; thus, the ACBD algorithm has the function of removing noise. Comparing the clustering results shown in Figure 8 with the difference frames in Figure 3, the ACBD algorithm removes the 128th to 136th pictures in Figure 3. According to statistics, the proportion of this scene in the video is 5.1%. For commodity videos, it can be considered that the scene with a low proportion is less descriptive for the commodity, so it can be deleted. From the clustering results, the ACBD algorithm can distinguish each cluster and achieve accurate clustering of differential frames.

**Figure 6.** (**a**) Differential frame distribution. (**b**) Distribution after clustering.

**Figure 7.** Transition frames.

**Figure 8.** ACBD clusters the differential frames into 8 clusters (**a**–**h**).

4.3.2. Compared with DBSCAN and K-Means

In the commodity dataset, we compared ACBD with the common clustering algorithms K-means and DBSCAN to judge whether the ACBD algorithm is superior. We randomly selected a video from the dataset for the experiment, and the clustering radius of DBSCAN was determined according to the mean distance of all images. The radius in this experiment was 20,000, and the minimum number of sample points was set to 3 after repeated debugging. The number of K-means clusters was set to 4.
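
For reference, these baseline settings can be reproduced with scikit-learn; a minimal sketch (ours), where `X` is only a placeholder for the flattened differential frames of the selected video:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

X = np.random.rand(176, 256) * 255                                 # placeholder feature matrix
dbscan_labels = DBSCAN(eps=20000, min_samples=3).fit_predict(X)    # radius 20,000, min 3 points
kmeans_labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)     # 4 clusters
```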

From Figure 9a,b, we can see that, compared with the original video, the DBSCAN algorithm abandoned many differential frames, resulting in the loss of some important information. In addition, the DBSCAN algorithm needs its clustering radius and density to be adjusted manually, and the parameters must be readjusted for different videos, which greatly increases the workload. Comparing (c) and (d), the category represented by "+" in the K-means clustering results is far larger than the other categories, so the results are not accurate enough, and the categories represented by "·" and "×" may be misclassified due to the long intra-class distances. We also cannot determine the number of clusters for different videos in advance. It can be seen from the result presented in (d) that ACBD solves the problems existing in K-means well. Regarding the outlier in the upper right corner of Figure 9, the image has a small number of transition regions due to the scene transition and becomes an outlier after dimensionality reduction; however, after a one-by-one comparison, this image should be classified into the green "·" category in Figure 9d.

**Figure 9.** Comparison of clustering effects between the ACBD algorithm and DBSCAN algorithm. (**a**) Differential frame distribution. (**b**) DBSCAN algorithm clustering results. (**c**) K-means algorithm clustering results. (**d**) ACBD algorithm clustering results.

#### 4.3.3. Effects on the UCI Dataset

In order to test the applicability of the ACBD algorithm, we verified its effect on the IRIS, WINE, and HAPT datasets and compared it with the K-means and DBSCAN algorithms. We used the Adjusted Rand Index (ARI) [43], Fowlkes–Mallows Index (FMI) [44], and Adjusted Mutual Information (AMI) [45] to evaluate the clustering results; these criteria are defined as follows. For a given set $S$ of $n$ instances, assume that $U = \{u_1, u_2, \dots, u_R\}$ represents the ground-truth classes of $S$ and $V = \{v_1, v_2, \dots, v_C\}$ represents the result of the clustering algorithm. $n_{ij}$ denotes the number of instances belonging to both $u_i$ and $v_j$, while $n_{i.}$ and $n_{.j}$ denote the number of instances in $u_i$ and $v_j$, respectively. The Rand index (RI) is calculated as $\sum_{i,j}\binom{n_{ij}}{2}$, and thus ARI can be expressed as below.

$$ARI = \frac{RI - E(RI)}{\max(RI) - E(RI)}\tag{8}$$

In Formula (8), $E(RI) = \left[\sum_i \binom{n_{i.}}{2} \sum_j \binom{n_{.j}}{2}\right] \big/ \binom{n}{2}$ and $\max(RI) = \frac{1}{2}\left[\sum_i \binom{n_{i.}}{2} + \sum_j \binom{n_{.j}}{2}\right]$. ARI measures the degree of consistency between the two data distributions, and its range is $[-1, 1]$. In order to measure the effect of the clustering algorithm more comprehensively, FMI and AMI are introduced for evaluation through different methods. Their value range is $[0, 1]$, and the larger the value, the more similar the clustering result is to the ground truth. FMI and AMI can be expressed by Formulas (9) and (10), respectively.

$$FMI = \frac{\sum\_{i,j} \binom{n\_{ij}}{2}}{\sqrt{\sum\_{i} \binom{n\_{i.}}{2} \sum\_{j} \binom{n\_{.j}}{2}}} \tag{9}$$

$$AMI = \frac{I(U, V) - E[I(U, V)]}{\max\{H(U), H(V)\} - E[I(U, V)]}\tag{10}$$

In Formula (10), $I(U, V) = \sum_{i=1}^{R}\sum_{j=1}^{C}\frac{n_{ij}}{n}\log\frac{n_{ij}/n}{n_{i.}\,n_{.j}/n^{2}}$ and $H(U) = -\sum_{i=1}^{R}\frac{n_{i.}}{n}\log\frac{n_{i.}}{n}$. The results are shown in Tables 2–4.
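
Before turning to the results, note that all three criteria are implemented in scikit-learn, so the evaluation can be reproduced along the following lines; the label arrays below are hypothetical stand-ins for the ground-truth classes and the clustering output.

```python
from sklearn.metrics import (adjusted_rand_score,
                             fowlkes_mallows_score,
                             adjusted_mutual_info_score)

# Hypothetical ground-truth classes and predicted cluster labels.
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred_labels = [0, 0, 1, 1, 1, 1, 2, 2, 0]

print("ARI:", adjusted_rand_score(true_labels, pred_labels))
print("FMI:", fowlkes_mallows_score(true_labels, pred_labels))
print("AMI:", adjusted_mutual_info_score(true_labels, pred_labels))
```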

It can be seen from Tables 2–4 that the K-means algorithm performs best on the IRIS, WINE, and HAPT datasets, and that ACBD is competitive with K-means on the HAPT dataset. The DBSCAN algorithm did not produce usable clustering results on the WINE and HAPT datasets even after repeated parameter adjustment, so its evaluation indices are reported as 0 for these two datasets. The experimental results also show that the proposed ACBD algorithm performs far better on a large dataset such as HAPT than on small datasets such as IRIS and WINE.

**Table 2.** Comparison of IRIS experimental results.


**Table 3.** Comparison of Wine experimental results.


**Table 4.** Comparison of HAPT experimental results.


#### *4.4. Neural Network Comparison and Overall Algorithm Effects*

After ACBD clustering, user-assisted annotation was carried out for each cluster to divide the images into the four categories described above, which greatly reduced the workload of manual annotation. The labeled images of the four categories were split into training and testing sets at a ratio of 7:3 in order to train the deep neural network. The hyperparameters were set as follows: batch size, 32; epochs, 400; learning rate, 0.0001. Finally, the trained network was used to classify the video frames; the first category (commodity details) and the second category (overall effect of the commodity) constitute the video keyframes. A deeper network such as ResNet can reach higher accuracy, but if the network is too deep it becomes difficult to deploy on resource-limited machines. We therefore verified whether MobileNet-V2 meets practical requirements while maintaining accuracy comparable to ResNet-50. In this comparison experiment, the batch size was set to 64.
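
A minimal fine-tuning sketch in PyTorch (assuming a recent torchvision), with the annotated frames stored in an `ImageFolder`-style directory containing four class subfolders; the dataset path and the transform are placeholders, and the hyperparameters follow those listed above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, models, transforms

# Placeholder path to the user-annotated frames (four class subfolders).
data_dir = "annotated_frames/"

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder(data_dir, transform=transform)

# 7:3 split into training and testing sets.
n_train = int(0.7 * len(dataset))
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

# MobileNet-V2 pretrained on ImageNet, with the classifier head replaced
# for the four categories (transfer learning).
model = models.mobilenet_v2(weights="IMAGENET1K_V1")
model.classifier[1] = nn.Linear(model.last_channel, 4)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(400):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```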

As can be seen in Figures 10 and 11, the accuracy rates of ResNet-50 and MobileNet-V2 were similar, both at around 90%. Therefore, in consideration of applicability, we compared the training time and storage space of the two models.

**Figure 10.** ResNet-50 training accuracy and loss (**a**,**b**).

**Figure 11.** MobileNet-V2 training accuracy and loss (**a**,**b**).

As shown in Table 5, compared with ResNet-50, the training speed of MobileNet-V2 increased by 14.3% and the number of parameters decreased by 66.9%. MobileNet-V2 is therefore better suited to deployment on mobile terminals, so we adopted the MobileNet-V2 network.

**Table 5.** Comparison of ResNet-50 and MobileNet-V2.


Finally, we randomly selected one video from the remaining 108 videos in the database and input it into the trained MobileNet-V2. Figure 12 shows the final classification results; (a), (b), (c), and (d) correspond to the first, second, third, and fourth categories, respectively, as demonstrated in the "Supplementary Materials". The first and second categories are the keyframes defined in this paper. The experimental results show that the proposed algorithm achieves the goal of extracting keyframes, that user preferences are captured in the auxiliary annotation stage, and that the ACBD clustering algorithm greatly improves the efficiency of annotation. The overall framework of the algorithm is therefore reasonable and performs well.

**Figure 12.** Final result of algorithm. (**a**–**d**) correspond to the first, second, third, and fourth categories, respectively.

#### **5. Discussion**

Since different users have different preferences, existing keyframe extraction algorithms do not provide solutions tailored to different interests. For this reason, we propose a commodity-video keyframe extraction algorithm based on adaptive clustering annotation. Compared with existing keyframe extraction algorithms, our algorithm reflects the preferences of different users through the user annotation process and achieves accurate, personalized keyframe extraction. It provides a feasible way for users to find products faster and more accurately.

#### **6. Conclusions**

In order to solve the problem that different users have different definitions of keyframes in commodity videos, this paper proposes a keyframe recognition algorithm for commodity video that integrates transfer learning and interest information and extracts the keyframes preferred by different users. First, differential frames were extracted from commodity videos by frame-to-frame differencing; then, the ACBD algorithm was used to cluster these frames to simplify the user annotation process. The data annotated by the user were used to train the MobileNet-V2 network, and finally, the keyframes for each individual were extracted. The experimental results on the commodity video dataset constructed in this paper show that the proposed algorithm is effective and extracts keyframes accurately. However, user preferences tend to change over time, which is not considered in this paper, so how to cater to users' changing interests is the focus of our next research.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10 .3390/app12031502/s1, Video S1: Demonstration.

**Author Contributions:** Conceptualization, G.M. and X.S.; methodology, G.M.; software, G.M.; validation, G.M. and X.S.; writing—original draft preparation, G.M.; funding acquisition, X.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China under grant No. 61876184.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviation is used in this manuscript:

ACBD Adaptive Cluster Based on Distance

#### **References**


### *Article* **Optimal Tests for Combining** *p***-Values**

**Zhongxue Chen**

Department of Epidemiology and Biostatistics, School of Public Health, Indiana University Bloomington, 1025 E. 7th Street, Bloomington, IN 47405, USA; zc3@indiana.edu; Tel.: +1-812-855-1163

**Abstract:** Combining information (*p*-values) obtained from individual studies to test whether there is an overall effect is an important task in statistical data analysis. Many classical statistical tests, such as chi-square tests, can be viewed as being a *p*-value combination approach. It remains challenging to find powerful methods to combine *p*-values obtained from various sources. In this paper, we study a class of *p*-value combination methods based on gamma distribution. We show that this class of tests is optimal under certain conditions and several existing popular methods are equivalent to its special cases. An asymptotically and uniformly most powerful *p*-value combination test based on constrained likelihood ratio test is then studied. Numeric results from simulation study and real data examples demonstrate that the proposed tests are robust and powerful under many conditions. They have potential broad applications in statistical inference.

**Keywords:** chi-square test; constrained likelihood ratio test; Fisher test; gamma distribution; uniformly most powerful test

### **1. Introduction**

In statistical inference and decision making, it is critical but challenging to appropriately aggregate information from different sources. *p*-value combination approaches provide possible solutions. A *p*-value combination method usually combines transformed statistics of the original individual *p*-values, and an overall *p*-value is then obtained. The development of *p*-value combination methods has a long history. Many pioneering statisticians, including R. A. Fisher [1] and K. Pearson [2], made important contributions in this area. Their methods (e.g., the Fisher test), along with others such as the z-test [3] and the minimal *p*-value test [4], are still widely used in today's statistical practice. Many studies have compared the performances of these *p*-value combination tests [5–7]; it turned out that no test is uniformly most powerful, although some methods may perform better than others under certain conditions. Combining dependent *p*-values is another research direction, and many robust and powerful methods have been proposed in the literature, including a recently proposed test based on the Cauchy distribution (CCT) [8]. Although the CCT can be applied to both independent and dependent *p*-values, it has been shown that this test can never obtain a *p*-value less than the smallest *p*-value to be combined and is therefore not recommended for combining independent *p*-values [9]. In this paper, we focus on the situation where the *p*-values to be combined are independent.

It is well known that *p*-value combination methods have important applications in meta-analysis [10,11]. However, it is less recognized that they are also frequently, although implicitly, used in statistical testing. For instance, the commonly used chi-square tests, including the likelihood ratio test, the score test, and the Wald test, with degrees of freedom (df) greater than one, can be viewed as *p*-value combination methods, which are special cases of our proposed gamma distribution-based tests (see Section 2). In other words, the popular chi-square tests only provide particular ways to combine *p*-values that are not necessarily optimal; more powerful methods for combining *p*-values may be found and used instead.


With recent technical developments, large volumes of data, such as genome-wide genomic data, are generated more easily and rapidly. Consequently, advanced statistical methods, including *p*-value combination tests, are highly desirable [12–20]. For instance, meta-analyses, which combine information from different genome-wide association studies (GWASs), have identified many associated genetic variants that could not be identified from a single GWAS [21,22]. It is expected that, with more powerful *p*-value combination tests being developed and available, more and more significant associated genetic variants will be discovered in cancer genomics.

Unfortunately, as Birnbaum [23] already noticed, there exists no uniformly most powerful (UMP) *p*-value combination test for all alternative hypotheses. However, it is possible that a method is UMP under a certain condition. Moreover, if the true condition is unknown, it is desirable to choose a robust *p*-value combination method in the sense that it has reasonable detection power under many conditions.

In this paper, we first propose a class of *p*-value combination methods based on gamma distribution. We show that several existing popular methods are equivalent to certain special cases of this class of tests. Then, we show that the proposed tests are UMP when the *p*-values to be combined are from certain distributions. When the *p*-values to be combined are from certain type of distributions whose parameters are partially or fully unknown, asymptotically UMP tests based on constrained likelihood ratio test (CLRT) are proposed and studied.

The rest of the manuscript is organized as follows: In Section 2, we first introduce some existing popular *p*-value combination tests, describe our proposed tests based on gamma distributions, and then study their connections to existing popular methods and their properties as of UMP tests. Finally, some asymptotically UMP tests based on CLRT are proposed and studied. In Section 3, we compare the performances of the proposed tests with some existing popular methods through a simulation study. In Section 4, two examples of real data applications are demonstrated to illustrate the desired performance of the proposed tests. This paper concludes in Section 5 with discussion and conclusions.

#### **2. Methods**

Suppose we have *n* independent *p*-values, denoted as $P_i$ ($i = 1, \cdots, n$), obtained from testing the individual null hypothesis $H_{i0}$ versus the alternative hypothesis $H_{i1}$, respectively. In addition, throughout this paper, we assume $P_i \sim U(0, 1)$ under $H_{i0}$, where $U(0, 1)$ stands for the uniform distribution between 0 and 1. For a *p*-value combination test, we consider testing the global null hypothesis $H_0 = \cap H_{i0}$ versus the global alternative hypothesis $H_1 = \cup H_{i1}$. In the statistical literature, several *p*-value combination tests were proposed a long time ago but are still widely used today. We introduce some of the most popular ones as follows.

#### *2.1. Some Existing Popular Tests*

#### 2.1.1. The Minimal *p*-Value (Min *p*) Test

This test is denoted as *Tp*, with the test statistic defined as [4]:

$$\min(P\_1, P\_2, \dots, P\_n) \tag{1}$$

whose null distribution is the beta distribution $Beta(1, n)$, and its *p*-value is given by $1 - \left(1 - P_{(1)}\right)^{n}$, where $P_{(1)} = \min(P_1, P_2, \cdots, P_n)$. Tippett's Min *p* test in (1) is closely related to the Bonferroni method [24]. When the minimal *p*-value $P_{(1)}$ is small, the two tests give similar results, and both are close to $nP_{(1)}$.

#### 2.1.2. The Chi-Square Test with *n* Degrees of Freedom

Denoted as $\chi^2_n$, it has the test statistic [25,26]:

$$\sum\_{i=1}^{n} \left(\Phi^{-1}(1-P\_i)\right)^2\tag{2}$$

where $\Phi^{-1}(\cdot)$ is the inverse of the cumulative distribution function (CDF) of the standard normal distribution $N(0, 1)$. The null distribution of the test $\chi^2_n$ in (2) is the chi-square distribution with *n* df.

#### 2.1.3. The Fisher Test

Denoted as *Fp*, it has the following test statistic [1]:

$$-2\sum\_{i=1}^{n} \ln(P\_i) \tag{3}$$

whose null distribution is $\chi^2_{2n}$, the chi-square distribution with 2*n* df.

#### 2.1.4. The z Test

Denoted as *Zp*, it has the test statistic [3]:

$$\sum\_{i=1}^{n} \Phi^{-1}(1 - P\_i) / \sqrt{n} \tag{4}$$

whose null distribution is *N*(0, 1).

Note that for all of the above tests, their overall one-sided *p*-values are calculated based on the right-tails, i.e., the areas beyond the test statistics from the right sides of their null distributions, to reflect the fact that smaller individual *p*-values provide stronger evidence to support the global alternative hypothesis.

#### *2.2. New Tests Based on Gamma Distribution*

We use *Gamma*(*α*, *β*) to denote a random variable that has a gamma distribution with shape parameter *α* and rate parameter *β*, where both parameters are positive. The probability density function (PDF) of *Gamma(α*, *β*) is:

$$f_{G(\alpha,\beta)}(x) = \beta^{\alpha} x^{\alpha-1} \exp(-\beta x)/\Gamma(\alpha) \tag{5}$$

for $x > 0$, where the gamma function is $\Gamma(z) = \int_0^{\infty} x^{z-1}e^{-x}\,dx$. Denote the corresponding CDF as $F_{G(\alpha,\beta)}(x)$. We can combine *n* independent *p*-values using $Gamma(\alpha, \beta)$ and obtain an overall *p*-value accordingly.

#### 2.2.1. Gamma Distribution-Based Test $T_{G(\alpha,\beta)}$

Define the following test statistic:

$$T_{G(\alpha,\beta)}(P_1, P_2, \dots, P_n) = \sum_{i=1}^{n} F_{G(\alpha,\beta)}^{-1}(1 - P_i), \tag{6}$$

where $F_{G(\alpha,\beta)}^{-1}(y)$ is the inverse function of the CDF $F_{G(\alpha,\beta)}(x)$.

We also define the following right-tailed *p*-value for *TG*(*α*,*β*):

$$P = \Pr(Gamma(n\alpha, \beta) > t) = 1 - F_{G(n\alpha,\beta)}(t) = S_{G(n\alpha,\beta)}(t) \tag{7}$$

where *t* is the observed value of the test statistic $T_{G(\alpha,\beta)}$, and $F_{G(n\alpha,\beta)}(\cdot)$ and $S_{G(n\alpha,\beta)}(\cdot)$ are the CDF and the survival function of $Gamma(n\alpha, \beta)$, respectively.

The test determined by the statistic $T_{G(\alpha,\beta)}$ in (6) and the *p*-value $P$ in (7) is called $T_{G(\alpha,\beta)}$. The *p*-value for a specific test $T$ is also denoted as $P_T$; therefore, the *p*-value in (7) can also be written as $P_{T_{G(\alpha,\beta)}}$. Based on the properties of gamma distributions, we have the following result for the test $T_{G(\alpha,\beta)}$:

**Proposition 1.** *The test TG*(*α*,*β*) *with statistic defined in (6) and p-value defined in (7) controls type I error rate exactly at given significance level*.
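
The statistic in (6) and the *p*-value in (7) can be computed directly from standard gamma-distribution routines. The following is a minimal Python sketch using SciPy, given purely as an illustration rather than the author's implementation; the individual *p*-values are hypothetical.

```python
import numpy as np
from scipy.stats import gamma

def t_gamma(pvalues, alpha):
    """Gamma distribution-based combination statistic in (6) with beta = 1."""
    pvalues = np.asarray(pvalues, dtype=float)
    # F^{-1}_{G(alpha)}(1 - P_i) for each individual p-value.
    return gamma.ppf(1.0 - pvalues, a=alpha).sum()

def combined_pvalue(pvalues, alpha):
    """Right-tailed overall p-value in (7): survival function of Gamma(n*alpha, 1)."""
    n = len(pvalues)
    t = t_gamma(pvalues, alpha)
    return gamma.sf(t, a=n * alpha)

# Hypothetical individual p-values, used only to illustrate the calls.
p = [0.03, 0.20, 0.45, 0.01]
print(combined_pvalue(p, alpha=1.0))  # alpha = 1 is equivalent to the Fisher test (Section 2.2.2)
```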

To study the properties of the new tests, we use the following definitions:

**Definition 1.** *Two tests $T_1$ and $T_2$ are called equivalent, denoted as $T_1 \equiv T_2$, if $p_{T_1}(x) = p_{T_2}(x)$ for any observed data $x$, where $p_{T_i}(x)$ is the p-value obtained by test $T_i$ ($i = 1, 2$) from the given data $x$.*

Based on the properties of the gamma distributions, we can easily verify the following result:

**Proposition 2.** *For any $\alpha > 0$ and $\beta > 0$, $T_{G(\alpha,\beta)} \equiv T_{G(\alpha,1)}$.*

Therefore, the rate parameter of the gamma distribution has no effect on the test *TG*(*α*,*β*). For convenience, in this paper, we set *β* = 1 and use *TG*(*α*) to denote *TG*(*α*,1) hereafter unless otherwise specified.

**Definition 2.** *A positive function $c(\theta)$ is called the exact slope of the sequence of tests $T_n$ if $\lim_{n\to\infty}\left[(-2/n)\ln(1 - F_n(T_n))\right] = c(\theta)$ with probability 1, where $F_n(T_n)$ is the CDF of $T_n$. A test is called asymptotically Bahadur optimal (ABO) if its exact slope is maximal at every $\theta \in \Theta - \Theta_0$, where $\Theta$ is the parameter space and $\Theta_0$ is the parameter space under the null hypothesis. The exact slope measures the rate at which the attained p-value of a test statistic tends to 0 and is a measure of asymptotic efficiency.*

It has been proven that $T_{G(\alpha,1/2)}$ is ABO for any $\alpha \in (0, \infty)$ [27]. Hence, from Proposition 2, we have the following property:

**Proposition 3.** *The test TG*(*α*) *is ABO for any α* ∈ (0, ∞).

Previously, we proposed a different gamma distribution-based *p*-value combination test with statistic $T = \sum_{i=1}^{n} F_{G(1/P_i,1)}^{-1}(1 - P_i)$, which uses the random shape parameter $a_i = 1/P_i$ for *p*-value $P_i$; its null distribution is intractable, so a resampling method is used to estimate the *p*-value. In contrast, the tests proposed here use the same shape parameter for all individual *p*-values.

#### 2.2.2. Connections between *TG*(*α*) and Existing Popular Tests

Although the existing popular tests described in Section 2.1 were proposed a long time ago, and their theoretical properties and empirical performances have been extensively studied and compared [6,7,25,28,29], surprisingly, their relationships have not been fully investigated, and the theoretical explanation on the differences of their performances is lacking. In this subsection, we show that they are connected to the aforementioned gamma distribution-based tests *TG*(*α*) with different *α* values. In fact, we have the following results:

**Theorem 1.** *The class of gamma distribution-based tests $T_{G(\alpha)}$ includes special cases that are equivalent to the aforementioned existing popular methods. More specifically,*

$T_{G(0)} \equiv \lim_{\alpha\to 0^+} T_{G(\alpha)} \equiv T_p$; $T_{G(0.5)} \equiv \chi^2_n$; $T_{G(1)} \equiv F_p$; and $T_{G(\infty)} \equiv \lim_{\alpha\to\infty} T_{G(\alpha)} \equiv Z_p$.

The proof of Theorem 1 is given in the Appendix A.

#### 2.2.3. $T_{G(\alpha)}$ as the Uniformly Most Powerful Test

Besides the ABO property, there is another attractive property of the gamma distribution-based test $T_{G(\alpha)}$: under certain conditions, it is UMP. We thus have the following theorem:

**Theorem 2.** *Suppose P*1, *P*2, ··· , *Pn are iid from the following common density function with parameters α* > 0 *and* 0 < *c* < 1

$$f_{\alpha,c}(p) = (1 - c)^{\alpha} \exp\left[c\, F_{G(\alpha)}^{-1}(1 - p)\right] \text{ for } p \in (0, 1), \tag{8}$$

If both *α* and *c* are known, then the test $T_{G(\alpha)}$ is UMP. The proof is given in Appendix A.

**Remark 1.** *(i) For $p \in (0, 1)$, when $c = 0$, $f_{\alpha,0}(p) = 1$; therefore, $f_{\alpha,0}(p)$ corresponds to the global null hypothesis. (ii) Under the condition $f_{\alpha,c}(p) = (1 - c)^{\alpha}\exp\left[cF_{G(\alpha)}^{-1}(1 - p)\right]$ with $0 < c < 1$, from Theorem 1, the existing tests $T_p$, $\chi^2_n$, $F_p$, and $Z_p$ defined in (1)–(4) are UMP for given $\alpha = 0, 0.5, 1$, and $\infty$, respectively. (iii) Together with Theorem 1, Theorem 2 provides insightful explanations of when and why an existing popular test described in Section 2.1 is preferred. For instance, when the p-values are extremely heterogeneous (e.g., a very small α in (8)), the Min p test (i.e., $T_{G(0)}$) is more powerful than the other gamma distribution-based tests. On the other hand, when the p-values are more homogeneous (e.g., a large α in (8)), the z test (i.e., $T_{G(\infty)}$) is preferred. (iv) The function $f_{\alpha,c}(p) = (1 - c)^{\alpha}e^{cF_{G(\alpha)}^{-1}(1-p)}$ with two parameters α ($\alpha > 0$) and c ($0 < c < 1$) represents a large class of densities and can be used to approximate the true density functions under the alternative hypotheses. Based on a simulation study, we found that under many situations the true density functions under the alternative hypotheses can be closely approximated by $f_{\alpha,c}(p)$ with the two parameters estimated from the data (see Figures S1–S19 in the Supplementary File).*

#### *2.3. Constrained Likelihood Ratio Tests*

In Section 2.2, we have shown that *TG*(*α*) is UMP if the *p*-values to be combined are from the common density as described in (8) with known constants *α* and *c*. When *c* is unknown, or both *α* and *c* are unknown, constrained likelihood ratio tests (CLRTs) can be constructed accordingly. In this subsection, we study those CLRT-based tests when parameter *c* is unknown with *α* being known or unknown.

#### 2.3.1. CLRT-Based Tests with Known *α* Values

Under (8), with known $\alpha = \alpha_0$, a CLRT-based test can be constructed from the constrained MLE of *c* obtained by maximizing $l(\alpha_0, c) = l(c) = n\alpha_0\ln(1 - c) + c\sum_{i=1}^{n} F_{G(\alpha_0)}^{-1}(1 - P_i)$. More specifically, we define the following test statistic:

$$T_{CLRT,\alpha_0}(P_1,\cdots,P_n) = 2\alpha_0 n\ln\left(1 - \hat{c}_{CLRT,\alpha_0}\right) + 2\,\hat{c}_{CLRT,\alpha_0}\sum_{i=1}^{n} F_{G(\alpha_0)}^{-1}(1 - P_i),\tag{9}$$

where $\hat{c}_{CLRT,\alpha_0}$ is the constrained MLE of *c* obtained by maximizing the log-likelihood $l(\alpha_0, c)$ under the constraint $0 < c < 1$. We reject the overall null hypothesis if the test statistic is large; in other words, a one-sided *p*-value is calculated from the test.

For the above CLRT-based test *TCLRT*,*α*<sup>0</sup> in (9), we have the following result [30]:

**Proposition 4.** *The asymptotic distribution of the test $T_{CLRT,\alpha_0}$ is a mixture of chi-square distributions $\sum_{i=0}^{1} w_i\chi^2_i$, where $\chi^2_i$ is the chi-square distribution with $df = i$, $\chi^2_0$ is the random variable equal to 0 with probability 1, and the weights $w_0, w_1$ are determined by the null and the alternative hypotheses.*

In practice, the *p*-values of the test *TCLRT*,*α*<sup>0</sup> can be estimated through resampling methods (e.g., see Section 2.3.2 for an example). However, the following result shows that this test is tightly connected with the gamma distribution-based test *TG*(*α*0), whose *p*-value can be calculated directly.

**Theorem 3.** *Let $t_{CLRT,\alpha_0}$ be the observed statistic of the test $T_{CLRT,\alpha_0}$; the p-value of $T_{CLRT,\alpha_0}$ is determined by the gamma distribution-based test $T_{G(\alpha_0)} = \sum_{i=1}^{n} F_{G(\alpha_0)}^{-1}(1 - P_i)$ as follows:*

$$\Pr\left[T_{CLRT,\alpha_0} > t_{CLRT,\alpha_0}\right] = \begin{cases} P_{T_{G(\alpha_0)}} & \text{if } t_{CLRT,\alpha_0} > 0\\ \Pr\left[T_{G(\alpha_0)} < n\alpha_0\right] & \text{if } t_{CLRT,\alpha_0} = 0 \end{cases} \tag{10}$$

The proof is given in the Appendix A.

**Proposition 5.** *Under the conditions specified in (8), if the parameter $\alpha = \alpha_0$ is known, then asymptotically $T_{G(\alpha_0)} \equiv T_{CLRT,\alpha_0}$.*

**Proof.** From the proof of Theorem 3, we know that under (8), when $0 \le c < 1$, $\Pr[T_{G(\alpha_0)} < n\alpha_0] = \Pr[\hat{c}_{\alpha_0} \le 0] \to 0$ as $n \to \infty$; hence, the two *p*-values from the tests $T_{G(\alpha_0)}$ and $T_{CLRT,\alpha_0}$ are asymptotically equal with probability 1.

**Theorem 4.** *Under the conditions specified in (8), if the parameter α* = *α*<sup>0</sup> *is known, then the gamma distribution-based test TG*(*α*0) *and the CLRT-based test TCLRT*,*α*<sup>0</sup> *are both asymptotically UMP*.

**Proof.** When *α* = *α*<sup>0</sup> is known, it can be shown that the constrained MLEs for *c* is consistent (see, for instance, Theorem 1 of Self and Liang [30]). Hence, from Theorem 2, *TG*(*α*0) is asymptotically UMP. From Proposition 5, *TCLRT*,*α*<sup>0</sup> is asymptotically UMP.

#### 2.3.2. The Optimal CLRT-Based Test When *α* Is Unknown

When both parameters *α* and *c* in (8) are unknown, they need to be estimated via the constrained MLEs, from which the following constrained likelihood ratio test is defined:

$$T_{CLRT}(P_1, \dots, P_n) = 2\,\hat{\alpha}_{CLRT}\, n\ln(1 - \hat{c}_{CLRT}) + 2\,\hat{c}_{CLRT}\sum_{i=1}^{n} F_{G(\hat{\alpha}_{CLRT})}^{-1}(1 - P_i) \tag{11}$$

where $\hat{\alpha}_{CLRT}$ and $\hat{c}_{CLRT}$ are the constrained MLEs of the parameters *α* and *c*, respectively, obtained by maximizing the log-likelihood function $l(\alpha, c) = n\alpha\ln(1 - c) + c\sum_{i=1}^{n} F_{G(\alpha)}^{-1}(1 - P_i)$ under the constraints $0 < c < 1$ and $\alpha > 0$. The R function "nlminb" can be applied to find the constrained MLEs and the corresponding test statistic. The proposed test was implemented in R; the R package "opt" (optimal *p*-value combination test) can be freely downloaded from https://github.com/zchen2020/opt (accessed on 15 November 2021).

For the above CLRT-based test *TCLRT*, similar to Proposition 4, we have the following result [30]:

**Proposition 6.** *The asymptotic distribution of the test $T_{CLRT}$ is a mixture of chi-square distributions $\sum_{i=0}^{2} w_i\chi^2_i$, where $\chi^2_i$ is the chi-square distribution with $df = i$, $\chi^2_0$ is the random variable equal to 0 with probability 1, and the weights $w_0, w_1, w_2$ are determined by the null and the alternative hypotheses.*

The above asymptotic result may not be directly applicable for estimating the *p*-value of this test, as the number of *p*-values *n* is usually small and, more seriously, the weights $w_i$ are difficult to obtain. Instead, a simple resampling method can be used to approximate the null distribution and estimate the *p*-value of $T_{CLRT}$. More specifically, for a given sample size *n*, randomly sample *n* null *p*-values from the uniform distribution on (0, 1) and calculate the test statistic in (11). Repeating this process many times (e.g., $10^5$), the empirical distribution of the test statistic approximates the null distribution and hence yields the *p*-value of $T_{CLRT}$.
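
A minimal sketch of this procedure in Python, with SciPy's `minimize` standing in for the R routine nlminb mentioned above; the starting values, bounds, number of resamples, and input *p*-values are illustrative assumptions rather than the settings of the released package.

```python
import numpy as np
from scipy.stats import gamma
from scipy.optimize import minimize

def t_clrt(pvalues):
    """CLRT statistic in (11): maximize l(alpha, c) over alpha > 0, 0 < c < 1."""
    p = np.asarray(pvalues, dtype=float)
    n = len(p)

    def neg_loglik(theta):
        alpha, c = theta
        return -(n * alpha * np.log(1.0 - c)
                 + c * gamma.ppf(1.0 - p, a=alpha).sum())

    res = minimize(neg_loglik, x0=[1.0, 0.5],
                   bounds=[(1e-6, None), (1e-6, 1.0 - 1e-6)])
    # 2 * [l(alpha_hat, c_hat) - l(alpha, 0)], since l(alpha, 0) = 0.
    return 2.0 * (-res.fun)

def t_clrt_pvalue(pvalues, n_resample=10_000, seed=0):
    """Estimate the p-value of T_CLRT by resampling uniform null p-values."""
    rng = np.random.default_rng(seed)
    observed = t_clrt(pvalues)
    null_stats = np.array([t_clrt(rng.uniform(size=len(pvalues)))
                           for _ in range(n_resample)])
    return np.mean(null_stats >= observed)

# Hypothetical independent p-values, used only to illustrate the calls.
print(t_clrt_pvalue([0.03, 0.20, 0.45, 0.01], n_resample=2000))
```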

Similar to Theorem 4, we have the following result for *TCLRT*:

**Theorem 5.** *Under the conditions specified in (8), the CLRT-based test TCLRT is asymptotically UMP*.

**Proof.** Under conditions (8), it can be shown that the constrained MLEs for *α* and *c* are consistent (see, for instance, Theorem 1 of Self and Liang [30]). Hence, *TCLRT* is asymptotically equivalent to *TCLRT*,*α*<sup>0</sup> for known *α* = *α*<sup>0</sup> in (8). Then from Theorem 4, *TCLRT* is asymptotically UMP.

**Remark 2.** *(i) When $\alpha = \alpha_0$ is known, the test $T_{G(\alpha_0)}$ is preferred over $T_{CLRT,\alpha_0}$ because (a) its test statistic and p-value are easier to obtain and (b) the two tests generally have very similar performance. (ii) When both α and c are unknown, the test $T_{CLRT}$ is asymptotically UMP, while, in general, neither $T_{CLRT,\alpha_0}$ nor $T_{G(\alpha_0)}$ for a preset $\alpha = \alpha_0$ is UMP or asymptotically UMP. Therefore, $T_{CLRT}$ is expected to be more robust and, overall, to perform better than any individual gamma distribution-based test $T_{G(\alpha)}$, including the existing popular ones described in Sections 2.1.1–2.1.4.*

#### **3. Numeric Studies**

In this section, we assess the performances of the proposed tests through a simulation study. In the simulation, we compare the optimal CLRT-based test *TCLRT* with the popular and representative gamma distribution-based tests, *TG*(0), *TG*(1), and *TG*(∞) (i.e., the Min *p*, Fisher, and z test, respectively).

In the simulation study, fifty (*n* = 50) independent *p*-values are simulated and combined. Among these 50 *p*-values, *m* (*m* = 0, 10, 20, 40, 50) are assumed to come from true individual alternative hypotheses, and the remaining (*n* − *m*) come from true individual null hypotheses. When *m* = 0, all 50 individual null hypotheses are true, and the empirical power obtained under this condition is the empirical type I error rate. The *p*-values from the true null hypotheses are sampled from the uniform distribution on (0, 1). For a true individual alternative hypothesis $H_{i1}$ ($i = 1, \cdots, m$), we assume the *p*-value $p_i$ is obtained via a random variable $z_i \sim N(\mu_i, 1)$. We randomly set *k* of the *m* values $\mu_i$ to be positive (alternative hypotheses with the same direction of effect are called concordant alternatives) and the remaining *m* − *k* to have the opposite sign. A two-sided *p*-value for each true individual alternative hypothesis is obtained via the standard z test by comparing the test statistic with the standard normal distribution *N*(0, 1).

For the true alternative hypotheses, we consider three scenarios for the effects $\mu_i$. Scenario 1: $|\mu_i| = \mu v_i / \sum_{i=1}^{m} v_i$, where $v_i = 10^{r_i}$, $r_i \sim N(0.3, 1)$, and $\mu = 0.8, 0.6, 0.4, 0.3$ when there are 10, 20, 40, and 50 true individual alternatives, respectively. Scenario 2: $|\mu_i| = \mu v_i / \sum_{i=1}^{m} v_i$, where $v_i \sim U(1, 100)$ and $\mu = 1.2, 1.0, 0.6, 0.5$ when the number of true individual alternatives is $m = 10, 20, 40$, and 50, respectively. Scenario 3: $|\mu_i| = \mu/m$, and $\mu = 1.5, 1.2, 0.8$, and 0.6 when $m = 10, 20, 40$, and 50, respectively. Note that (i) the constants (e.g., $\mu$ and the parameters in the normal distribution for $r_i$ and the uniform distribution for $v_i$) are chosen so that the empirical powers are appreciable for comparison; (ii) for all three scenarios, the sum of the absolute effect sizes equals $\mu$; and (iii) for a given *m*, the degree of heterogeneity of the effect sizes among the true individual alternatives decreases from Scenario 1 to Scenario 3. More specifically, in Scenario 1 the effect sizes are extremely heterogeneous when the number of true individual alternatives is small, in Scenario 3 they are more homogeneous, and Scenario 2 lies between Scenarios 1 and 3. By considering these different conditions, we aimed to conduct a reasonable and realistic simulation study that fairly compares our proposed tests with the others.

The empirical power values of the tests are estimated as the rejection proportions over 1000 replicates at the significance level of 0.05. For the new test $T_{CLRT}$, $10^5$ resamples are used to estimate its *p*-values with the resampling method described in Section 2.3.2. Under the overall null hypothesis (i.e., all individual null hypotheses are true), the empirical type I error rates of all methods were obtained at different significance levels. From the simulation study, all methods controlled the type I error rate quite well (see Table S4 in the Supplementary File).
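
As an illustration of this setup, the sketch below generates *p*-values under Scenario 3 and estimates the empirical power of the Fisher combination ($T_{G(1)}$) only; the replicate count and the restriction to a single test are simplifications, so the output is indicative rather than a reproduction of the reported figures.

```python
import numpy as np
from scipy.stats import norm, gamma

rng = np.random.default_rng(1)

def simulate_pvalues(n=50, m=20, mu_total=1.2):
    """Scenario 3: |mu_i| = mu/m with random signs; nulls are U(0, 1)."""
    p = rng.uniform(size=n)                      # p-values under the individual nulls
    mu = np.full(m, mu_total / m) * rng.choice([-1, 1], size=m)
    z = rng.normal(loc=mu, scale=1.0)            # test statistics under the alternatives
    p[:m] = 2 * norm.sf(np.abs(z))               # two-sided p-values from the z test
    return p

def fisher_pvalue(p):
    """T_G(1): equivalent to the Fisher combination (Theorem 1)."""
    t = gamma.ppf(1.0 - p, a=1.0).sum()          # equals -sum(log p_i)
    return gamma.sf(t, a=len(p))

reps = 1000
power = np.mean([fisher_pvalue(simulate_pvalues()) < 0.05 for _ in range(reps)])
print("Empirical power (Scenario 3, m = 20):", power)
```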

Figures 1–3 plot the empirical power values of the Min *p* (Min), Fisher (Fisher), z test (Z), and the proposed CLRT-based test $T_{CLRT}$ (LRT\_CS) when *p*-values are combined under Scenarios 1 to 3, respectively. We make the following observations. First, for Scenario 1 (Figure 1), where the effect sizes of the true individual alternative hypotheses are extremely heterogeneous, the Min *p* test (i.e., $T_{G(0)}$) usually performs better than the Fisher test ($T_{G(1)}$), which in turn performs better than the z test ($T_{G(\infty)}$). Second, when the degree of heterogeneity of the effect sizes among the individual alternative hypotheses is less extreme, as in Scenarios 2 and 3 (Figures 2 and 3), the Fisher test and the z test usually perform better, or much better, than the Min *p* test. Third, each of the Min *p*, Fisher, and z tests may perform very well under some conditions but very poorly under others. For instance, under Scenario 1 (extremely heterogeneous effect sizes among the alternatives), the Min *p* test is more powerful than the other two, while it is much less powerful under Scenario 3 (homogeneous effect sizes among the alternatives); the opposite pattern is observed for the z test. Fourth, under all conditions considered, the new test $T_{CLRT}$ is either the best or very comparable to the best one. When the number of *p*-values to be combined is small, we observed similar patterns (see the simulation results for *n* = 10 in Tables S1–S3 in the Supplementary Materials). This demonstrates that, as expected, $T_{CLRT}$ is robust in the sense that it has reasonable detection power under many conditions compared with the other tests. We would also like to point out that, as expected, the empirical power values of the CLRT-based test $T_{CLRT,\alpha_0}$ are very close to those of the corresponding gamma distribution-based tests $T_{G(\alpha)}$ with $\alpha = 0, 1, \infty$ fixed (data not shown).

**Figure 1.** Empirical power values of the tests based on two-sided *p*-values under Scenario 1: $|\mu_i| = \mu v_i / \sum_{i=1}^{m} v_i$, where $v_i = 10^{r_i}$, $r_i \sim N(0.3, 1)$, and $\mu = 0.8, 0.6, 0.4, 0.3$ when there are 10, 20, 40, and 50 true individual alternatives, respectively.

**Figure 2.** Empirical power values of the tests based on two-sided *p*-values under Scenario 2: $|\mu_i| = \mu v_i / \sum_{i=1}^{m} v_i$, where $v_i \sim U(1, 100)$, and $\mu = 1.2, 1.0, 0.6, 0.5$ when the number of true individual alternatives is $m = 10, 20, 40$, and 50, respectively.

**Figure 3.** Empirical power values of the tests based on two-sided *p*-values under Scenario 3: $|\mu_i| = \mu/m$, and $\mu = 1.5, 1.2, 0.8$, and 0.6 when $m = 10, 20, 40$, and 50, respectively.

#### **4. Real Data Examples**

In this section, we apply the proposed tests along with others to two real-world problems to demonstrate the usefulness of the proposed test.

#### *4.1. Example 1: A Meta-Analysis*

In a meta-analysis, 12 randomized trials examining the effect of patient rehabilitation designed for geriatric patients versus usual care on improving functional outcome at 3–12 month follow-up were used [31,32]. The estimated odds ratios (ORs) from the 12 trials are listed in Table 1.

The *p*-value from Cochran's test for homogeneity was 0.021, indicating that the commonly used fixed-effect model of meta-analysis was inadequate for this data set. Therefore, the authors ran the meta-analysis with a random-effects model and estimated the overall OR as 1.36 with 95% CI (1.08, 1.71) [32]. However, the goodness-of-fit test for the random-effects model gave a *p*-value of 0.025 [33], indicating lack of fit of the random-effects model as well. Therefore, instead of using the problematic fixed- or random-effects models to combine information from the 12 trials, we use *p*-value combination methods to test whether there is an overall effect.

**Table 1.** Estimated odds ratio and its 95% CI from each study in a meta-analysis with 12 trials. Data were taken from Bachmann et al. and Riley et al.


In order to use the *p*-value combination methods, for each trial we calculate its *p*-value from its reported 95% CI. Denoting by *U* and *L* the upper and lower limits of the 95% CI, the test statistic can be approximated as $t = \frac{\ln(U \times L)/2}{\ln(U/L)/3.92}$, whose asymptotic null distribution is *N*(0, 1). The sample sizes of these 12 trials were relatively large, ranging from 108 to 1388; therefore, we can reasonably estimate their *p*-values using the asymptotic null distribution. We calculate the two-sided *p*-value for each trial and apply the gamma distribution-based tests. The *p*-values from the Min *p* (i.e., $T_{G(0)}$), Fisher ($T_{G(1)}$), z test ($T_{G(\infty)}$), and $T_{CLRT}$ for combining those two-sided *p*-values are 0.013, 0.0068, 0.075, and 0.0047, respectively. Except for the z test, which is known to be less powerful in this heterogeneous situation, all methods obtained *p*-values less than 0.05, and the proposed test $T_{CLRT}$ had the smallest *p*-value, indicating that it is more powerful than the other tests in this specific situation.
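
As a concrete illustration of this conversion, the sketch below turns hypothetical 95% CIs for odds ratios into two-sided *p*-values using the normal approximation above and combines them with the Fisher-type gamma test; the CI values are made up, since Table 1 is not reproduced here.

```python
import numpy as np
from scipy.stats import norm, gamma

def two_sided_p_from_ci(lower, upper):
    """Normal-approximation z statistic from a 95% CI for an odds ratio."""
    z = (np.log(upper * lower) / 2) / (np.log(upper / lower) / 3.92)
    return 2 * norm.sf(abs(z))

# Hypothetical 95% CIs for illustration only (Table 1 is not reproduced here).
cis = [(0.95, 2.10), (1.05, 3.40), (0.70, 1.60)]
pvals = np.array([two_sided_p_from_ci(lo, up) for lo, up in cis])

# Combine with T_G(1), i.e., the Fisher test (Theorem 1).
t = gamma.ppf(1.0 - pvals, a=1.0).sum()
print("Combined p-value:", gamma.sf(t, a=len(pvals)))
```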

#### *4.2. Example 2: A Survival Analysis from a Clinical Trial*

The second data set is from the randomized, double-blinded Digoxin Intervention Trial [34]. In this trial, patients with left ventricular ejection fractions of 0.45 or less were randomly assigned to digoxin (3397 patients) or placebo (3403 patients) groups. A primary outcome was the mortality due to worsening heart failure (see Figure S20 in the Supplementary Materials). In the original study, the authors used the log-rank (LR) test and obtained a *p*-value of 0.061, indicating that the evidence of the effectiveness of digoxin, in terms of reducing the mortality due to worsening heart failure, is at most marginal.

However, it is well known that the LR test may fail to detect the difference between two survival functions if their hazard rate functions cross [35,36]. We applied the two-stage approach [35,36] to this data set and obtained two *p*-values, 0.06 and 0.04, for the two stages. Since, under the null hypothesis, the two *p*-values from the two stages are asymptotically independent [35,36], we can combine them using the proposed test $T_{CLRT}$ and the gamma distribution-based tests. The resulting *p*-values are 0.078, 0.017, 0.0099, and 0.011 from the Min *p*, Fisher, z test, and $T_{CLRT}$, respectively. The proposed test obtained the second smallest *p*-value, slightly larger than the smallest one obtained by the z test.
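
For the three classical tests, these combined values can be checked directly against their closed-form null distributions; a short SciPy sketch is given below ($T_{CLRT}$ is omitted because it requires the resampling procedure of Section 2.3.2).

```python
import numpy as np
from scipy.stats import chi2, norm

p = np.array([0.06, 0.04])   # the two-stage p-values reported above
n = len(p)

min_p = 1 - (1 - p.min()) ** n                        # Tippett's Min p test, Equation (1)
fisher = chi2.sf(-2 * np.log(p).sum(), df=2 * n)      # Fisher test, Equation (3)
z_test = norm.sf(norm.ppf(1 - p).sum() / np.sqrt(n))  # z test, Equation (4)

print(min_p, fisher, z_test)  # compare with 0.078, 0.017, and 0.0099 reported above
```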

In addition, the drug effect may differ between males and females. To investigate the possible interaction between sex and treatment, we divide the data into four groups based on the combinations of sex and treatment: Male–Placebo (MP), Male–Drug (MD), Female–Placebo (FP), and Female–Drug (FD). The sample sizes for the four groups are 2639, 2642, 764, and 755, respectively. We then compare the survival functions of the following three pairs of groups: MP vs. MD, (MP + MD) vs. FP, and (MP + MD + FP) vs. FD, where (MP + MD) is a new group pooling the data from groups MP and MD, and (MP + MD + FP) includes all subjects from groups MP, MD, and FP (see Figure S21 in the Supplementary Materials). For each comparison, the two-stage approach is applied, yielding the following six *p*-values: 0.019, 0.026, 0.504, 0.092, 0.975, and 0.050. It has been shown that, under the null hypothesis, the six *p*-values obtained from the above approach are asymptotically independent [36]. Applying the gamma distribution-based tests, along with $T_{CLRT}$, to the six asymptotically independent *p*-values, we obtain *p*-values of 0.11, 0.0067, 0.020, and 0.013 from the Min *p*, Fisher, z test, and $T_{CLRT}$, respectively. It is noticeable that the proposed test obtained the second smallest *p*-value, while the Fisher test obtained the smallest one among these methods. These results show that, except for the Min *p* test, all other tests obtain *p*-values less than 0.05. In addition, the *p*-values from $T_{CLRT}$ are in general close to the smallest ones, while the popular Fisher and z tests may give quite different *p*-values under different situations (two groups vs. four groups).

Compared with the original analysis, the results of the proposed optimal test *TCLRT* using *p*-values from the two-stage approach applied to either two groups (placebo vs. drug) or four groups (combinations of sex and drug) provide stronger evidence against the null.

#### **5. Discussion and Conclusions**

In this paper, we studied a class of gamma distribution-based *p*-value combination methods, which includes special cases that are equivalent to some existing popular methods. This class of tests provides unlimited choices for combining independent *p*-values. However, under a given situation, some of them may perform very poorly; therefore, arbitrarily picking one of them may result in failing to detect true alternatives. On the other hand, if we try many different methods and report the smallest *p*-value, we need to adjust this *p*-value for multiple comparisons; otherwise, we will have more false findings than expected due to an inflated type I error rate. Therefore, it is desirable to develop methods that adaptively find the optimal approach among candidate tests. Our proposed CLRT-based test $T_{CLRT}$ is one such method. We have shown that if the *p*-values to be combined are from the common density function $f_{\alpha,c}(p) = (1 - c)^{\alpha}e^{cF_{G(\alpha)}^{-1}(1-p)}$ for $p \in (0, 1)$, the gamma distribution-based test $T_{G(\alpha)}$ is UMP when both parameters *α* and *c* are known. When $\alpha = \alpha_0$ is known but *c* is unknown, both $T_{G(\alpha_0)}$ and the CLRT-based test $T_{CLRT,\alpha_0}$ are asymptotically UMP. Furthermore, when both *α* and *c* are unknown, the proposed CLRT-based test $T_{CLRT}$ is asymptotically UMP.

In a meta-analysis, it is natural to assign different weights to individual studies [5,7,37,38]. For instance, a larger weight can be assigned to a study with more subjects; hence, in the z test, a *p*-value from a larger study may receive a greater weight. Weights can also be assigned based on other quantities, such as the variances of the estimated effect sizes. However, there is no consensus on weight assignment. For our proposed tests, weights assigned to the individual *p*-values can easily be incorporated. For instance, the weighted gamma distribution-based tests can be constructed as $T^{w}_{G(\alpha)} = \sum_{i=1}^{n} F_{G(w_i\alpha)}^{-1}(1 - P_i)$, where $w_i$ is the weight assigned to study *i* ($i = 1, \cdots, n$). Based on the properties of gamma distributions, it is not difficult to show that $\lim_{\alpha\to\infty} T^{w}_{G(\alpha)} \equiv \sum_{i=1}^{n} w_i\Phi^{-1}(1 - P_i)\big/\sqrt{\sum_{i=1}^{n} w_i^2}$, the weighted z test. Therefore, the class of weighted gamma distribution-based tests $T^{w}_{G(\alpha)}$ generalizes the weighted z test. Likewise, the weighted log-likelihood function with given weights becomes $l^{w}(\alpha, c) = \alpha\ln(1 - c)\sum_{i=1}^{n} w_i + c\sum_{i=1}^{n} F_{G(w_i\alpha)}^{-1}(1 - p_i)$, from which the corresponding weighted CLRT-based optimal test $T^{W}_{CLRT}$ can be constructed accordingly.
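
A short sketch of the weighted combination: under the global null, $F_{G(w_i\alpha)}^{-1}(1 - P_i) \sim Gamma(w_i\alpha, 1)$, so the statistic $T^{w}_{G(\alpha)}$ follows $Gamma(\alpha\sum_i w_i, 1)$ and its *p*-value is available in closed form (SciPy is assumed; the weights and *p*-values are illustrative).

```python
import numpy as np
from scipy.stats import gamma

def weighted_gamma_pvalue(pvalues, weights, alpha):
    """Weighted gamma combination T^w_G(alpha) and its right-tailed p-value."""
    p = np.asarray(pvalues, dtype=float)
    w = np.asarray(weights, dtype=float)
    t = gamma.ppf(1.0 - p, a=w * alpha).sum()
    # Under H0 the statistic is a sum of independent Gamma(w_i*alpha, 1) variables,
    # i.e., Gamma(alpha * sum(w_i), 1).
    return gamma.sf(t, a=alpha * w.sum())

# Illustrative weights (e.g., proportional to study size) and p-values.
print(weighted_gamma_pvalue([0.03, 0.20, 0.45], weights=[2.0, 1.0, 1.5], alpha=1.0))
```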

Our proposed tests have much broader applications beyond meta-analysis. In fact, they can be applied to almost all statistical testing problems for which (asymptotically) independent *p*-values from individual components are available. For instance, in model selection, a typical step is to test whether a set of variables (or a single categorical variable with multiple levels) should be included in the final model. Oftentimes, the parameters and the covariances of their estimates are estimated simultaneously through maximum likelihood estimation. Then the LRT, comparing the log-likelihood values of the two models with and without the candidate variables, or the Wald chi-square test of the weighted sum of the squared estimated effect sizes, can be applied. For both the LRT and the Wald test, a set of asymptotically independent *p*-values can be obtained through their asymptotically independent components (see, e.g., Chapter 16 of [39]). Hence, our proposed *p*-value combination methods, such as $T_{CLRT}$, can be applied and may result in a better final model.

Another example is the association test for two categorical variables in a two-way contingency table, to which the Pearson chi-square test is usually applied. It is known that the Pearson chi-square test statistic with *k* df can be partitioned into *k* asymptotically independent components whose null distributions are asymptotically *iid* chi-square with 1 df [40]; for instance, the partition can be done through the Lancaster approach [41]. Hence, we can calculate a set of asymptotically independent *p*-values to which our proposed CLRT-based test is applicable.

The performance of the proposed approaches can be improved if the *p*-values to be combined are obtained from the individual studies using more powerful tests. For instance, if we already know the direction of the effect (positive or negative) when comparing two group means, we can use a one-sided rather than a two-sided test to obtain the individual *p*-value. It should be pointed out, however, that one-sided tests may not always be applicable to individual studies; nevertheless, our proposed approaches can still be used.

In this paper, we focused on using the gamma distribution to combine independent *p*-values. Our future work will develop gamma distribution-based methods to combine dependent *p*-values. The difficulty in this direction is how to choose the "optimal" shape parameter so that the resulting test has good detection power and controls the type I error rate for an arbitrary dependency structure of the *p*-values to be combined. Our preliminary results indicate that this direction is promising, and a follow-up paper will be published.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10.3390/app12010322/s1, Figure S1: some densities of $f_{\alpha,c}(t)$; Figure S2: Histogram and the estimated density from simulated data when $\mu_i = 0$; Figures S3–S21: Histograms and the estimated densities from simulated data; Table S1: Empirical power from simulation under Scenario 1 using *n* = 10 and *α* = 0.05; Table S2: Empirical power from simulation under Scenario 2 using *n* = 10 and *α* = 0.05; Table S3: Empirical power from simulation under Scenario 3 using *n* = 10 and *α* = 0.05; Table S4: Empirical type I error rates from the simulation study with 10,000 replicates using different significance levels.

**Funding:** This work was partially supported by the National Institutes of Health grants 1R03DE030259, UL1TR002529, and the Indiana University Open Access Article Publishing Fund.

**Data Availability Statement:** Data is contained within the article or supplementary material.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **Appendix A Proof of Theorems**

To prove Theorem 1, we need the following results:

**Lemma A1** (Theorem 1 of Liu, Martin, and Syring 2017)**.** *If $Y_\alpha \sim Gamma(\alpha, 1)$, then $\lim_{\alpha\to 0^+} -\alpha\ln(Y_\alpha) \sim Exp(1)$ in distribution [42].*

**Corollary A1.** $\Pr\left(\lim_{\alpha\to 0^+} Y_\alpha > 1\right) = 0$.

**Proof of Corollary A1.** From Lemma A1, $\Pr\left(\lim_{\alpha\to 0^+} Y_\alpha > 1\right) = \Pr\left(\lim_{\alpha\to 0^+} \ln(Y_\alpha) > 0\right) = \Pr\left(\lim_{\alpha\to 0^+} -\alpha\ln(Y_\alpha) < 0\right) = \Pr(Exp(1) < 0) = 0$.

**Corollary A2.** *Let $Y = \lim_{\alpha\to 0^+} Y_\alpha^{-\alpha}$, where $Y_\alpha \sim Gamma(\alpha, 1)$; then the PDF of $Y$ is $f_Y(y) = 1/y^2$ for $y \in (1, \infty)$.*

**Proof of Corollary A2.** Notice that $Y = \lim_{\alpha\to 0^+} Y_\alpha^{-\alpha} = \exp\left[\ln\left(\lim_{\alpha\to 0^+} Y_\alpha^{-\alpha}\right)\right] = \exp\left[\lim_{\alpha\to 0^+} -\alpha\ln(Y_\alpha)\right]$. But from Lemma A1, $\lim_{\alpha\to 0^+} -\alpha\ln(Y_\alpha) \sim Exp(1)$. Hence, the PDF of $Y$ is $f_Y(y) = \exp(-\ln(y))/y = 1/y^2$ for $y \in [1, \infty)$.

**Lemma A2.** *Let $0 < p_1 < p_2 < 1$ and $q_i^{(\alpha)} = F_{G(\alpha)}^{-1}(p_i)$ ($i = 1, 2$). Denote $r_\alpha = q_2^{(\alpha)}/q_1^{(\alpha)}$; then $\lim_{\alpha\to 0^+} r_\alpha = \infty$.*

**Proof of Lemma A2.** Suppose $\lim_{\alpha\to 0^+} r_\alpha = \infty$ does not hold; then there exists a constant $R$ such that $r_\alpha < R$ for any $\alpha > 0$. However, $0 < p_2 - p_1 = \Pr\left(q_1^{(\alpha)} < Y_\alpha < q_2^{(\alpha)}\right) = \Pr\left[-\alpha\ln q_2^{(\alpha)} < -\alpha\ln(Y_\alpha) < -\alpha\ln q_1^{(\alpha)}\right] = \Pr\left[-\alpha\ln q_1^{(\alpha)} - \alpha\ln(r_\alpha) < -\alpha\ln(Y_\alpha) < -\alpha\ln q_1^{(\alpha)}\right] < \Pr\left[-\alpha\ln q_1^{(\alpha)} - \alpha\ln(R) < -\alpha\ln(Y_\alpha) < -\alpha\ln q_1^{(\alpha)}\right] \to \Pr\left[-\alpha\ln q_1^{(\alpha)} < -\alpha\ln(Y_\alpha) < -\alpha\ln q_1^{(\alpha)}\right] = 0$ (as $\alpha \to 0^+$), a contradiction.

**Corollary A3.** $\lim_{\alpha\to 0^+} T_{G(\alpha)}(P_1, P_2, \cdots, P_n) = \lim_{\alpha\to 0^+} F_{G(\alpha)}^{-1}\left(1 - P_{(1)}\right)$, *where $P_{(1)}$ is the smallest value among $P_1, P_2, \cdots, P_n$.*

**Proof of Corollary A3.** This is a direct consequence of Lemma A2.

**Proof of Theorem 1.** Now, we prove Theorem 1:


#### **Proof of Theorem 2.**

First, we show that $f_{\alpha,c}(p) = (1 - c)^{\alpha}\exp\left[cF_{G(\alpha)}^{-1}(1 - p)\right]$ is a PDF. Let $y = F_{G(\alpha)}^{-1}(1 - x)$; then $x = 1 - F_{G(\alpha)}(y) = \int_y^{\infty} t^{\alpha-1}\exp(-t)/\Gamma(\alpha)\,dt$ and $dx = -y^{\alpha-1}\exp(-y)/\Gamma(\alpha)\,dy$. Hence, $\int_0^1 f_{\alpha,c}(x)\,dx = \int_0^1 (1 - c)^{\alpha}\exp\left[cF_{G(\alpha)}^{-1}(1 - x)\right]dx = \int_0^{\infty}(1 - c)^{\alpha}\exp(cy)\,y^{\alpha-1}\exp(-y)/\Gamma(\alpha)\,dy = 1$, since $(1 - c)^{\alpha}\exp(cy)\,y^{\alpha-1}\exp(-y)/\Gamma(\alpha) = (1 - c)^{\alpha}y^{\alpha-1}\exp[-(1 - c)y]/\Gamma(\alpha)$ is the PDF of $Gamma(\alpha, 1 - c)$. Under the global null hypothesis, $P_i \sim U(0, 1)$, so the log-likelihood ratio between the global alternative and null hypotheses is $\sum_{i=1}^{n}\ln(f_{\alpha,c}(P_i)) = a(\alpha) + c\sum_{i=1}^{n} F_{G(\alpha)}^{-1}(1 - P_i)$, where $a(\alpha) = n\alpha\ln(1 - c)$ is a constant. Therefore, by the Neyman–Pearson lemma [43], $T_{G(\alpha)}$ is UMP under the specified condition.

**Proof of Theorem 3.** Since the unconstrained MLE for $c$ is $\hat{c}_{\alpha_0} = 1 - n\alpha_0\big/\sum_{i=1}^{n} F_{G(\alpha_0)}^{-1}(1 - p_i)$, when $\hat{c}_{\alpha_0} \le 0$, $T_{CLRT,\alpha_0} = 0$. On the other hand, when $\hat{c}_{\alpha_0} > 0$, i.e., $T_{G(\alpha_0)} = \sum_{i=1}^{n} F_{G(\alpha_0)}^{-1}(1 - p_i) > n\alpha_0$, we have $\hat{c}_{CLRT,\alpha_0} = \hat{c}_{\alpha_0}$ and $T_{CLRT,\alpha_0} = 2n\alpha_0\ln\left(n\alpha_0\big/\sum_{i=1}^{n} F_{G(\alpha_0)}^{-1}(1 - p_i)\right) + 2\sum_{i=1}^{n} F_{G(\alpha_0)}^{-1}(1 - p_i) - 2n\alpha_0 = 2n\alpha_0\ln\left(n\alpha_0/T_{G(\alpha_0)}\right) + 2T_{G(\alpha_0)} - 2n\alpha_0 = 2n\alpha_0\ln(n\alpha_0) - 2n\alpha_0\ln T_{G(\alpha_0)} + 2T_{G(\alpha_0)} - 2n\alpha_0$.

For any $t > 0$, let $A = \{t < T_{CLRT,\alpha_0}\}$; it is easy to show that $A = \{2n\alpha_0\ln(n\alpha_0) - 2n\alpha_0\ln T_{G(\alpha_0)} + 2T_{G(\alpha_0)} - 2n\alpha_0 > t\} = \{2\left(T_{G(\alpha_0)} - n\alpha_0\right) > t + 2n\alpha_0\ln\left(T_{G(\alpha_0)}/n\alpha_0\right)\}$. Let $f(x) = x - n\alpha_0\ln(x) - n\alpha_0 - t/2 + n\alpha_0\ln(n\alpha_0)$; then for $x > n\alpha_0$, $f'(x) = 1 - n\alpha_0/x > 0$, so $f(x)$ is increasing in $x$, but $\lim_{x\to(n\alpha_0)^+} f(x) = -t/2 < 0$ and $f(kn\alpha_0) = (k - 1 - \ln(k))n\alpha_0 - t/2 > 0$ for large $k$. Hence, there must exist a unique $x_0 \in (n\alpha_0, \infty)$ such that $f(x_0) = 0$. Accordingly, $\Pr\left[T_{CLRT,\alpha_0} > t\right] = \Pr\left[T_{CLRT,\alpha_0} > t, \hat{c}_{CLRT,\alpha_0} > 0\right] = \Pr\left[2n\alpha_0\ln(n\alpha_0) - 2n\alpha_0\ln T_{G(\alpha_0)} + 2T_{G(\alpha_0)} - 2n\alpha_0 > t \text{ and } T_{G(\alpha_0)} > n\alpha_0\right] = \Pr\left[T_{G(\alpha_0)} - n\alpha_0\ln T_{G(\alpha_0)} - n\alpha_0 - t/2 + n\alpha_0\ln(n\alpha_0) > 0,\; T_{G(\alpha_0)} > n\alpha_0\right] = \Pr\left[T_{G(\alpha_0)} > t_0\right]$, where $t_0$ is the root of $f(x)$, i.e., $f(t_0) = 0$. This shows that when $T_{G(\alpha_0)} > n\alpha_0$, $T_{CLRT,\alpha_0}$ and $T_{G(\alpha_0)}$ have the same *p*-value. On the other hand, when $T_{G(\alpha_0)} \le n\alpha_0$, $T_{CLRT,\alpha_0} = 0$; hence, $\Pr\left[T_{CLRT,\alpha_0} = 0\right] = \Pr\left[T_{G(\alpha_0)} < n\alpha_0\right]$.

#### **References**

