*Article* **Social Capital on Social Media—Concepts, Measurement Techniques and Trends in Operationalization**

#### **Flora Poecze <sup>1</sup> and Christine Strauss <sup>2,\*</sup>**


Received: 27 September 2020; Accepted: 2 November 2020; Published: 4 November 2020

**Abstract:** The introduction of the Web 2.0 era and the associated emergence of social media platforms opened an interdisciplinary research domain, wherein a growing number of studies are focusing on the interrelationship of social media usage and perceived individual social capital. The primary aim of the present study is to introduce the existing measurement techniques of social capital in this domain, explore trends, and offer promising directions and implications for future research. Applying the method of a scoping review, a set of 80 systematically identified scientific publications were analyzed, categorized, grouped and discussed. Focus was placed on the employed viewpoints and measurement techniques necessary to tap into the possible consistencies and/or heterogeneity in this domain in terms of operationalization. The results reveal that multiple views and measurement techniques are present in this research area, which might raise a challenge in future synthesis approaches, especially in the case of future meta-analytical contributions.

**Keywords:** social capital; social media; operationalization; measurement; scoping review

#### **1. Introduction**

The launch of Web 2.0 at the turn of the 21st century enabled a communication revolution. It was followed by the rapid emergence of diverse social media platforms, of which Friendster was one of the first to become globally known; in turn, growing scientific interest has come to characterize the present era [1,2]. This concentrated attention has produced a heterogeneous set of terminological approaches to the novel phenomenon termed "social media" (SM) [3].

Scientific publications in this domain commonly highlight the interactive function of the platforms in question, as well as their services offering instant communication, extended with possibilities of user-generated content (UGC) such as liking, sharing, and commenting. For these reasons, one of the most widely used definitions in this area of research is offered by Kaplan and Haenlein (2010), according to whom social media platforms are "internet-based applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user generated content" [4]. VanMeter et al. (2015) stated that such a platform is an interactive one, which "allows social actors to create and share in multi-way, immediate and contingent communications" [3]. The core purpose of social media platforms is the enhancement and maintenance of individual user relationships; therefore, SM use can be considered an investment in social relationships [5]. The exponential growth of the user base on a plethora of social media platforms has a multitude of individual, underlying reasons; this paper focuses on the human drive for social interactions and engagement, as well as on evolutionary phenomena (e.g., survival, reproduction) facilitated by cooperation and trust, which shape the perception of an individual's reputation.

The human need to belong is a widely discussed psychological phenomenon [6]. There is empirical evidence for the influence of this drive on college students' social media use, manifesting in interaction and social engagement [7]. The need for belongingness appeared in Maslow's [8] theory of human motivation and manifests itself on today's social media platforms. Drawing on the thoughts of Baumeister and Leary (1995), this motivation leads individuals to intensify their efforts to broaden and strengthen their social connections. As discussed in the previous social media literature, belongingness and self-representation [9] are considered among the primary reasons for social media use [10]. Taking into account Lin's (2001) definition, social capital is the "investment in social relations with expected returns in the marketplace" [11]. Under these circumstances, individual social media presence is indeed influenced by the drive for belongingness, paired with expected returns, which in this context means enhanced social ties and the possibility of leveraging social support [12].

Considering the exponential growth of the global user base and the diversity of social media platforms, which have become part of everyday life, the measurement of individual social capital on social media has become a crucial area of investigation. The expected and perceived returns that characterize this phenomenon make it essential for scientists to empirically examine and quantify its potential impacts on the lives of individuals on a global scale. Therefore, growing scientific attention has turned to operationalizing and validating social capital scales, the evaluation of which can describe this phenomenon in detail.

There are indications, however, of possible inconsistencies in the measurement techniques of perceived individual social capital. Williams (2006) pointed out, based on the theories of Putnam (2000), that the bridging and bonding dimensions of the cultural view of social capital are not orthogonal (cf. ibid., pp. 596–597), which can lead to measurement discrepancies. Further concerns suggest that the distinction between bridging and bonding social capital is rather ad hoc [13]; therefore, treating them as distinct constructs can have harmful consequences [14]. Additional critics [14] highlighted that Granovetter's seminal works on the implications of weak and strong social ties [15,16], which indicated that weak ties are possibly more important than strong ones, dominate this research field [17].

Accordingly, this investigation aims to scope the existing measurement techniques and trends in this research area. No hypotheses were set in this study, reflecting its exploratory nature. The rest of this article is structured as follows. In the next section, we present a terminological outline of the existing social capital theories. This leads to a differentiation between online and offline social capital. The birth and theoretical development of this phenomenon is then presented. This includes a discussion of the cultural and multidimensional views of social capital, as the subjects of the vast majority of publications operationalizing social capital. The items of measurement are detailed. This is followed by a presentation of the research methodology, which aims to explore past scientific research measuring social capital, paying attention to possible inconsistencies regarding operationalization techniques. The results are presented next, followed by the conclusions, limitations of the present paper, and suggestions for future research.

#### **2. Literature Review**

Social capital terminology has undergone a substantial transformation since its birth. As a possible reason for such diversity, Fine (2010) [18] highlighted that previous research practices applied an appropriate definition matching the particular application in question [19–22]. According to Bourdieu, social capital is the "aggregate of the actual or potential resources which are linked to possession of a durable network of more or less institutionalized relationships of mutual acquaintance and recognition" [23], emphasizing that it is a form of capital, measurable on an individual or group level, characterizing embedded relationships between individuals. Furthermore, Coleman defines social capital as an accumulation of resources stemming from various individual relationships [24]. Portes highlighted the importance of the structure of such relationships, wherein the actors of this phenomenon are located [22], while Fukuyama underlined the importance of co-operation among individuals, which promotes social capital [25].

According to Putnam, social capital consists of social networks and the associated norms of reciprocity, indicating that the phenomenon jointly describes these networks and their effects on participating individuals [26]. However, to offer a brief outlook on the present inconsistencies regarding social capital discussed in detail by Fine [18], whether social capital is the cause, the effect, or the process itself remains a matter of ongoing scientific debate [27].

The emergence of Web 2.0 brought further developments in the theory of social capital and in the parallel development of its applied measurement techniques, distinguishing between online and offline contexts, with research indicating that the use of the Internet is associated with enhanced trust and community involvement [28]. The impact of the Internet as a surrogate and a supplement of human communication has been discussed widely in scientific research; in the early stages of Internet studies within social capital research, the focus was on e-mail usage [29] or on the functions of chat rooms in idea sharing and political participation [30].

With the emergence of computer-mediated social networks, the discussion of the associations between social capital and individual tie strength, investigating both strong and weak ties in an online context based on Granovetter's social tie theories [15,16], opened a new research field. The measurement of online social capital became one of the core aspects of empirical research, a milestone of which was the development and validation of the first comprehensive online social capital scale by Williams [27]. To better understand the importance and details of this contribution, it is essential to discuss the most widely used views of social capital theory. This is crucial to grasp the context of this scale, which remains one of the most influential instruments in present empirical social capital research on social media.

The birth of social capital as a scientific phenomenon is unclear [18]. According to Hofer and Aubert (2013), Lyda Hanifan's article from 1916 [31] can be seen as a possible theoretical root [32]. Hanifan's rediscovery at the beginning of the 21st century can potentially be attributed to an article by Putnam and Goss (2002), in which the authors stated that her definition encompassed all of the crucial elements identified in contemporary science [33].

The beginning of the 1980s marked the first concentrated attention to the concept through the works of French radical sociologist Pierre Bourdieu, joined by American rational choice sociologist James Coleman, who started elaborating on this topic in the late 1980s and early 1990s [34]. Robert Putnam's investigations at the turn of the century [26] took an important scientific step forward through the definition and conceptualization of bridging and bonding social capital. This was based on Granovetter's works on social tie strength [15,16], wherein he proposed that strong social ties (e.g., family or friends) are of limited value for an individual in the process of acquiring a new job, whereas weak social ties (i.e., the vast network of acquaintances) are beneficial for the individual in question [15]. Building on Granovetter's seminal works, Putnam (2000) proposed that a person's bridging social capital (i.e., weak social ties) is valuable for the acquisition of previously unknown, new information, while the function of bonding social capital, which refers to an individual's strongest ties, is the provision of social and emotional support (cf. [35]).

In parallel to the previously described cultural view of social capital [23,24,26] and to the theorists of the structural view [3,8,18], the turn of the century was marked by Nahapiet and Ghoshal's seminal work (1998). This work elaborated on the multidimensional view of social capital, segmenting it into structural (i.e., social interaction ties), relational (i.e., trust, norms, obligations, identification), and cognitive (i.e., shared language, cultural understanding) dimensions [36].

Social capital has been analyzed in relation to enormously diverse phenomena; Fine (2010) offered several curious examples [18] (e.g., the prevention of deforestation [37], skin color as a factor in marriage prospects [38], or pets as social capital conduits [39]). Among the most promising research platforms in social capital studies are social media platforms, which build upon belongingness as a human drive and upon social engagement. The shared question of such studies is whether or not the use of social media affects the individual's perception of their own social capital and the perceived social support (see the meta-analysis of Domahidi, 2018 [40]). As a result, the number of papers in which perceived individual social capital is analyzed on social media platforms is growing rapidly [41].

The different aspects, viewpoints, and theoretical considerations regarding social capital raise the question of how it is operationalized for further empirical investigation and evidence. The development of measurement techniques was already urged by Quan-Haase and Wellman in 2004, who argued for its necessity based on the accelerated emergence of the Internet in parallel to the development of social capital [42]. Following this call, Williams (2006) created the Internet Social Capital Scale (ISCS), consisting of two scales proposed to measure bridging and bonding social capital, respectively, with 10 measurement items each, based on Putnam's (2000) conceptualization [27]. These scales were extended and modified in [1], wherein the bridging and bonding social capital of Michigan State University (MSU) students was measured. The aim was the analysis of student social capital, the intensity of Facebook use, and further control variables. In this article, the authors also introduced the definition and measurement of maintained social capital, which refers to students' prior high school social connections that were maintained during their time in higher education.

The following table aims to introduce the measurement items of the ISCS [27], in comparison to those items that were included in the seminal work of Ellison et al. (2007) (cf. Table 1).

Table 1 illustrates that Ellison et al. (2007) adapted five statements from the ISCS, slightly adjusting the statements to the MSU context, while they operationalized and validated a scale for maintained social capital as well. Therefore, Table 1 presents the five distinct measurement items of this social capital construct as well.

The dichotomous handling of bridging and bonding social capital has raised concerns. As Williams (2006) pointed out, these constructs "are not mutually exclusive, [ ... ], they are oblique rather than orthogonal to one another" [27]. Their treatment as distinct constructs can result in harmful consequences; therefore, they should be handled as oblique ones [14]. Further critics noted that the distinction between bridging and bonding social capital measurement instruments is rather ad hoc [13]. Additionally, following the wide recognition of Granovetter (1973), who highlighted the importance of weak social ties, academic research has tended to emphasize this phenomenon and to seek evidence underpinning it [17]. This has also generated concerns in recent studies [14].

The development of the measurement constructs in the multidimensional view of social capital followed a different path in operationalization (see [4], pp. 140–157, for a critical summary) from the end of the 1990s until 2006, the year of publication of the article by Chiu et al. (2006), which created and validated a comprehensive set of items for all three studied social capital dimensions (i.e., structural, relational, cognitive) (Table 2) [43].

The history of measurement development in the multidimensional view of social capital reached a milestone with the seminal work of Chiu et al. (2006), which built upon the work of Tsai and Ghoshal (1998). These authors defined the structural dimension as social interaction ties, the relational one as trustworthiness and trust, and the cognitive one as shared vision (in an enterprise setting). They created a standardized betweenness index for the evaluation of social interaction ties, while standardized in-degree centrality was calculated for the measurement of trust and trustworthiness. The cognitive dimension was measured through two Likert-scale items [44].

**Table 1.** Measurement items of the Internet Social Capital Scale (ISCS) [27] and of Ellison et al. (2007) [1]. SC = social capital; MSU = Michigan State University.

| Online/Offline Bridging SC [27] | Online/Offline Bonding SC [27] | Bridging SC [1] | Bonding SC [1] | Maintained SC [1] |
| --- | --- | --- | --- | --- |
| Interacting with people online/offline makes me interested in things that happen outside my town | There are several people online/offline I trust to help solve my problems | I feel I am part of the MSU community | There are several people at MSU I trust to solve my problems | I'd be able to find out about events in another town from a high school acquaintance living there |
| Interacting with people online/offline makes me want to try new things | There is someone online/offline I can turn to for advice about making very important decisions | I am interested in what goes on at MSU | If I needed an emergency loan of \$100, I know someone at MSU I can turn to | If I needed to, I could ask a high school acquaintance to do a small favor for me |
| Interacting with people online/offline makes me interested in what people unlike me are thinking | There is no one online/offline that I feel comfortable talking to about intimate personal problems (reversed) | MSU is a good place to be | There is someone at MSU I can turn to for advice about making very important decisions | I'd be able to stay with a high school acquaintance if traveling to a different city |
| Talking with people online/offline makes me curious about other places in the world | When I feel lonely, there are several people online/offline I can talk to | I would be willing to contribute money to MSU after graduation | The people I interact with at MSU would be good job references for me | I would be able to find information about a job or internship from a high school acquaintance |
| Interacting with people online/offline makes me feel like part of a larger community | If I needed an emergency loan of \$500, I know someone online/offline I can turn to | Interacting with people at MSU makes me want to try new things | I do not know people at MSU well enough to get them to do anything important (reversed) | It would be easy to find people to invite to my high school reunion |
| Interacting with people online/offline makes me feel connected to the bigger picture | The people I interact with online/offline would put their reputation on the line for me | Interacting with people at MSU makes me feel like a part of a larger community | | |
| Interacting with people online/offline reminds me that everyone in the world is connected | The people I interact with online/offline would be good job references for me | I am willing to spend time to support general MSU activities | | |
| I am willing to spend time to support general online/offline community activities | The people I interact with online/offline would share their last dollar with me | At MSU, I come into contact with new people all the time | | |
| Interacting with people online/offline gives me new people to talk to | I do not know people online/offline well enough to get them to do anything important (reversed) | Interacting with people at MSU reminds me that everyone in the world is connected | | |
| Online/Offline, I come in contact with new people all the time | The people I interact with online/offline would help me fight an injustice | | | |



In the next stage of measurement development, Yli-Renko et al. (2001) selected items for the structural and relational dimensions from the existing ones of Tsai and Ghoshal (1998), while developing new items for the cognitive one. In their paper, the structural dimension was termed social interaction, the relational one relationship quality, and the cognitive dimension was defined as customer network ties [45]. Wasko and Faraj (2005) proposed a self-rating scale for the cognitive dimension and applied the technique of Tsai and Ghoshal (1998) for the operationalization of the structural dimension. They also defined two subscales for the relational one (i.e., commitment and reciprocity), adapting previously operationalized scales from past literature [46–48]. These approaches were synthesized and validated in the aforementioned study by Chiu et al. (2006), which set a virtual, professional, IT-related community (i.e., BlueShop) in Taiwan as the subject of the analysis [43].

Although previous research indicates that the two previously discussed viewpoints constitute the majority of the empirical measurement approaches to perceived individual social capital on social media, the paper at hand intends to explore unique, emerging measurement techniques as well, to offer a broad and detailed scope for future studies. The present article aims to scope out the practical characteristics of the empirical studies evaluating social capital constructs on social media, i.e., measuring individual social capital. Based on the previous studies, social capital measurement techniques will be evaluated through a scoping review of 80 published studies to determine the measurement approaches used in past research. Papers employing bridging, bonding, and/or maintained social capital will be explored, followed by those of the multidimensional view, along with a discussion of unique social capital measurement approaches. The goal of the present paper is to: (i) span a broad and detailed scope, (ii) evaluate these techniques, and (iii) identify possible similarities or differences, to provide a more transparent view of the state of this research area and its possible empirical performance and explanatory power.

#### **3. Search and Filtering Method**

The scoping review methodology [49,50] was applied to map the current state of scientific knowledge and to identify possible research gaps. A scoping review is appropriate here, as it allows for a broader research question and does not require a formal bias assessment.

A multi-keyword search was employed in ProQuest and Google Scholar (i.e., "social capital" AND ("social media" OR "social network" OR "SNS" OR "SM")). The collection of scientific literature followed a funnel approach [51]. Only peer-reviewed articles, peer-reviewed conference proceedings, and peer-reviewed book chapters were included in the search criteria. The search process identified 2478 records.

Four manual filtering steps were performed on the 2478 records: (a) abstracts and reference lists were checked (139 records remaining), (b) only quantitative studies were kept (65 records remaining), (c) studies that did not measure social capital were eliminated (53 records remaining), and (d) the citations of the remaining 53 papers were reviewed backwards and forwards. After these four steps, the final set of *n* = 80 records remained.

Additional inclusion criteria for the final set of publications were as follows. Each manuscript had to: (a) be a peer-reviewed article, conference proceedings paper, or book chapter, (b) be written in the English language, (c) set individual, perceived social capital as its focus, (d) investigate this phenomenon on one or more social media platforms, and (e) empirically measure perceived social capital in a quantitative manner.

The categorization for comparison and the coding were performed using tables in Excel, recording whether each study applied the cultural view, the multidimensional one, or a unique approach. The elements of the operationalized constructs were collected for evaluation based on consistency, together with the authors whom the analyzed publications cite in this regard. Furthermore, the dimension names were collected, with particular attention to papers empirically investigating the multidimensional view of social capital or using unique measurement approaches. This process involved the authors and two additional, independent reviewers from the respective research areas.

#### **4. Results**

The analysis is based on the observations and trends extracted from the systematically collected literature. As previously indicated, two distinct operationalization techniques emerged from the analyzed *n* = 80 records: the majority, i.e., two thirds (66%; 53 records), of the analyzed publications investigated bridging, bonding, and/or maintained social capital constructs. These studies followed the theoretical considerations of Putnam (2000). The multidimensional view was explored by close to one-fifth (18%; 15 records) of the papers. These two viewpoints did not intersect in their direction of operationalization; moreover, none of the analyzed articles empirically compared the two concepts.

Bridging social capital was present in all but one of the empirical studies that operationalized social capital constructs according to the cultural view. Bonding social capital appeared in all studies apart from six, indicating its importance. Only six studies quantified maintained social capital. The description of individual measurement items was also examined, as it is crucial for future replication. The review concluded that replication was not possible in 18% (nine records) of the studies interpreting bridging, bonding, and/or maintained social capital measurement on social media, due to a lack of measurement item descriptions.

Through the analysis of individual measurement items, it became evident that there is considerable diversity in how many, and what kind of, items the studies employed. Among the underlying reasons for this difference is that the performed principal component analyses (PCA) and confirmatory factor analyses (CFA) delivered different results in individual studies, resulting in the exclusion of one or more measurement items. In other respects, measurement consistency in the cultural view of social capital was clearly visible for all three measurement constructs (i.e., bridging, bonding, and maintained); however, considerable heterogeneity was found in the operationalization techniques of the multidimensional view.
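To illustrate how such item drops can arise during scale evaluation, the following is a minimal, hypothetical sketch (not taken from any of the reviewed studies): simulated Likert-scale responses to five bridging-type items are screened with a PCA, and an item whose loading on the first component falls below an illustrative cut-off of 0.40 would be a candidate for exclusion.

```python
# Minimal sketch: PCA-based screening of hypothetical Likert items.
# The simulated data, the five items, and the 0.40 cut-off are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_respondents = 300
latent = rng.normal(size=n_respondents)                 # latent "bridging" factor

# Four items driven by the latent factor, one unrelated "noise" item.
items = np.column_stack(
    [np.clip(np.round(3 + latent + rng.normal(0, 0.8, n_respondents)), 1, 5) for _ in range(4)]
    + [rng.integers(1, 6, n_respondents)]
)

z = StandardScaler().fit_transform(items)
pca = PCA(n_components=1).fit(z)
loadings = pca.components_[0] * np.sqrt(pca.explained_variance_[0])

for i, loading in enumerate(loadings, start=1):
    decision = "retain" if abs(loading) >= 0.40 else "candidate for exclusion"
    print(f"item {i}: loading = {loading:+.2f} -> {decision}")
```

In such a sketch, the unrelated fifth item would typically show a weak loading and be dropped, mirroring how different samples can lead different studies to retain different subsets of the same scale.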

Table 3 offers a summary of the measurement constructs of each social capital dimension according to the multidimensional view in the 15 analyzed papers, with the exception of Chiu and colleagues (2006).

As Table 3 illustrates, there are distinct differences in terms of sources for measurement operationalization and in the construct names for all three dimensions. More specifically, Chiu and colleagues (2006) analyzed the structural dimension by employing one construct (social interaction ties). However, Table 3 shows a variety of construct names (e.g., social networking, instrumental network ties, expressive network ties) in this regard, combined with diverse operationalization techniques. This trend is visible for the relational and cognitive dimensions as well. It is, however, necessary to note that all studies analyzed in this view offered clear sources for the applied measurements, combined with the availability of the measurement items, which can greatly enhance the possibility of replication and the ability to generalize results in a cumulative manner.

The majority of the sampled records employed either the cultural or the multidimensional view of social capital, along with their matching measurement techniques. Unique approaches are summarized in Table 4.

Table 4 reveals a high degree of consistency in the wording for social capital; however, quite distinct differences regarding operationalization techniques are observable as well. While all studies mentioned in Table 4 aimed to analyze the same theoretical concept (i.e., social capital), with a clear majority evaluating bridging, bonding, and maintained constructs, the previously mentioned heterogeneity in operationalization in the multidimensional view, extended with these unique approaches, further indicates that there is no particular measurement in this research area that can be considered a common starting point. Quite the contrary, these results underline the uncertainty regarding the construct measurement of social capital. Although the hypotheses relating various constructs to social capital were verified in the individual papers, the results were obtained through a plethora of operationalization techniques.

**Table 3.** Evaluation of the 15 records (19%) of the final set of publications operationalizing social capital measurement constructs according to the multidimensional view. Abbreviations used: struct. = structural, dim. = dimension, constr. = construct, meas. = measurement(s), rel. = relational, cogn. = cognitive.


**Table 4.** List and basis of comparison in the cases of 12 publications employing unique measurement approaches.




#### **5. Discussion and Conclusions**

The present paper aimed to discover and evaluate prior empirical social capital research conducted in the realm of social media. The primary objective of the study was to tap into the measurement operationalization techniques used for evaluating social capital, concentrating on the cultural and multidimensional view approaches and offering an extension into unique measurement approaches. Our analysis involved several tasks to provide a more transparent view of the state of perceived individual social capital measurement on social media, and of its possible empirical performance and explanatory power: (i) span a broad and detailed scope, (ii) evaluate the techniques, and (iii) identify possible similarities or differences. The paper intended to contribute to approaches such as the meta-analysis of Liu et al. (2016), who observed the relationship of bridging and bonding social capital with global social media use and site activity. Such contributions can offer an opportunity for comparison and jointly reveal effect sizes across multiple records to answer the core question of whether the reported effects exist, are statistically significant, or are the result of selective reporting [77–80].

From the viewpoint of interdisciplinary research, it seems necessary to discuss the consistency of operationalization techniques and to offer a synthesis, because the possibility of future meta-analyses strongly depends on the comparability and coherence of measurement techniques, both to maintain validity in effect size measurement and to avoid system-inherent bias.

By means of a scoping review, the present study assessed 80 articles to evaluate the standing of social capital research on social media, concentrating on their operationalization techniques. While there is a generally observable trend regarding the interpretation of individual measurement items and constructs, studies in the multidimensional view exhibited great heterogeneity in terms of operationalization and proposed measurement techniques, which indicates challenging conditions for future meta-analytical approaches in this domain. On the other hand, studies employing the cultural view of social capital, along with the validated measurement techniques proposed by Williams (2006) or Ellison et al. (2007), show a high degree of consistency. It should be noted, however, that there is heterogeneity across the individual studies in terms of the items employed from these scales, based on the results of the performed PCA and CFA analyses and the resulting item drops. Furthermore, unique social capital measurement techniques on social media are also present in this research domain, adding to the complexity of a possible empirical synthesis.

Social media platforms offer to fulfill the human drive to belong and have an exponentially growing user-base. The underlying motivations for the usage of such platforms, along with the expected and perceived benefits as a result of being present and active on them, are especially crucial to better understanding human behavior.

The present article aimed to provide a detailed view of individual, perceived social capital research on social media, and limited its discussion to articles exploring this phenomenon on at least one SM platform. However, empirical social capital measurement is present in a plethora of further research fields, in both online and offline contexts, and is investigated not merely as a perceived notion, nor solely on an individual level.

Social capital seems to be attached to a diverse set of behavioral phenomena [18]; therefore, its analysis and possible synthesis is an ever more pressing issue, since the concept of social capital is indeed a "buzzword" in science [89]. The wide array of measurement approaches discussed in this article, however, raises questions about the measurement: do they all measure the same concept, or, at the opposite extreme, perhaps none of them do?

Social capital research on social media has possible individual benefits, for instance in terms of student learning outcomes, based on the discussed benefits of weak ties. Further benefits span a diverse set of research areas, including the challenge of cultural barriers to women's economic independence and autonomy [90], highlighting the importance of investigations aiming to reduce inequalities. This relevance also manifests itself in labor market studies, wherein individual social capital can be considered an enabler of successful labor market integration [91]. It likewise manifests itself in healthcare research, since online conversations can strengthen patient–caregiver connections, leading to successful online health communities and, ultimately, effective policy interventions [92]. Although these examples are far from comprehensive, they do indicate the relevance of social media for the benefit of humanity and the diversity of areas in which social media can provide benefits for individuals by enabling social capital.

Machine-learning-based methods (e.g., sentiment analysis [93–97]) can further enhance the results of such empirical investigations; for example, they could be employed as an extension to reveal the underlying sentiment in student communication on forums and class discussion boards. The usage of big data in data science, especially in the research area of digital marketing, indicates the crucial importance of such investigations across numerous industrial areas, as detailed recently by Saura (2020). While companies aim to leverage such methods, among which the author distinguished nine individual core topics [98], including social media listening, the empirical research of individual, perceived social capital might offer crucial insights for corporations aiming to achieve effective digital marketing strategies. This implication is also supported by the relevant publications on the importance of social media marketing, wherein electronic word of mouth (e-WOM) is facilitated by user-generated content, which empowers customers to share their experiences about brands, products, or firms, and in which trust plays a key role [99]. Trust is an essential part of perceived, individual social capital according to the presently discussed views.

It is recommended that future research determines, in detail, how and in what manner levels of individual online social capital on social media can enable corporate profit enhancement through the mediating role of electronic word of mouth, possibly leading to more refined customer relationship management accompanied by a positive brand perception.

**Author Contributions:** Conceptualization, F.P. and C.S.; data curation, F.P.; investigation, F.P.; methodology, F.P.; resources, F.P.; software, F.P.; supervision, C.S.; writing—original draft, F.P.; writing—review & editing, C.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** Open Access Funding by the University of Vienna.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

1. Ellison, N.B.; Steinfield, C.; Lampe, C. The benefits of facebook "friends:" Social capital and college students' use of online social network sites. *J. Comput. Commun.* **2007**, *12*, 1143–1168. [CrossRef]


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Towards Context-Aware Opinion Summarization for Monitoring Social Impact of News**

#### **Alejandro Ramón-Hernández <sup>1</sup>, Alfredo Simón-Cuevas <sup>2,\*</sup>, María Matilde García Lorenzo <sup>1</sup>, Leticia Arco <sup>3</sup> and Jesús Serrano-Guerrero <sup>4</sup>**


Received: 10 October 2020; Accepted: 13 November 2020; Published: 18 November 2020

**Abstract:** Opinion mining and summarization of the increasing user-generated content on different digital platforms (e.g., news platforms) play significant roles in the success of government programs and initiatives in digital governance, by extracting and analyzing citizens' sentiments for decision-making. Opinion mining provides the sentiment of contents, whereas summarization aims to condense the most relevant information. However, most of the reported opinion summarization methods are conceived to obtain generic summaries, and the context that originates the opinions (e.g., the news) has not usually been considered. In this paper, we present a context-aware opinion summarization model for monitoring the opinions generated from news. In this approach, topic modeling and the news content are combined to determine the "importance" of opinionated sentences. The effectiveness of different settings of our model was evaluated through several experiments carried out on Spanish news and opinions collected from a real news platform. The obtained results show that our model can generate opinion summaries focused on essential aspects of the news, as well as cover the main topics in the opinionated texts well. The integration of term clustering, word embeddings, and similarity-based sentence-to-news scoring turned out to be the most promising and effective setting of our model.

**Keywords:** opinion mining; opinion summarization; topic modeling; semantic similarity measures; word embeddings

#### **1. Introduction**

The globalization of the use of the Internet and the development of technologies such as Cloud Computing, the Internet of Things, social networks, Mobile Computing, and others have favored the increase of user-generated content on the web. Nowadays, a surprisingly high quantity of news, messages, and reviews of products or services is generated in online social media, news portals, e-commerce sites, etc. The data and information produced by users have proven useful in many domains (e.g., marketing studies, business intelligence, health, governance, and others) [1]. The processing of user-generated content on digital platforms (e.g., news platforms) plays a significant role in the success of government programs and initiatives in digital governance, by extracting and analyzing citizens' sentiments for decision-making [2]. Several efforts have been dedicated to extracting knowledge from, and efficiently processing, this unstructured information produced by users [3], resulting in increasing research interest in tasks within Natural Language Processing (NLP) such as sentiment analysis, also called opinion mining [4].

Opinion mining is the field of study that analyzes people's opinions, sentiments, appraisals, attitudes, and emotions towards entities and their attributes expressed in written texts [3]. Opinion mining (or sentiment analysis) is a broad area that includes many tasks, such as sentiment classification, aspect-based sentiment analysis, lexicon construction, opinion summarization, and others [5]. Opinion summarization is the task of automatically generating summaries for a set of opinions that are related to the same topic or specific target [6]. Aspect-based opinion summarization is one of the main approaches [7], but it is not very appropriate in contexts where the opinions are not about products or services (e.g., opinions about news). Although the summaries generated by several of the reported approaches are focused on specific topics [1,8,9], these topics are generally identified by looking only at the content of the opinionated texts, whereas the context that originates the opinions (e.g., the news) is usually not taken into account, which is a weakness. A comprehensive summary of the users' reactions concerning a news article can be crucial for various reasons, such as (1) understanding the sensitivity/importance of the news, (2) obtaining insights about the diverse opinions of the readers regarding the news, and (3) understanding the key aspects that draw the interest of the readers [10]. On the other hand, integrating both topic-opinion analysis and semantic information can yield satisfactory results in opinion summarization [1]. Nevertheless, the use of WordNet [11], as well as of deep-learning-based word embeddings [12,13] (e.g., word2vec [14]), to represent and analyze the semantics of words when dealing with opinion summarization problems has been limited. Our work addresses the application of these models and resources to cope with opinion summarization challenges.

In this paper, a news-focused opinion summarization model is presented, which is conceived according to the extractive and topic-based text summarization approach. Our model combines topic modeling, sentiment analysis, and news-focused relevance scoring in seven phases: preprocessing, topic detection, sentiment scoring, topic-sentence mapping, topic contextualization, sentence ranking, and summary construction. The integration of these techniques allows us to deal with the problem in which the relevance focus comes not only from the texts of the opinions, but also from the news articles as the context that originates them. Semantic analysis is included in several phases to improve text processing. The semantic characteristics of words are captured through the word2vec representation model [14] and through WordNet [11]. In addition, semantic similarity measures are used to assess the semantic relatedness between pairs of sentences and between sentences and the news.

The model was evaluated on two datasets containing Spanish news and opinions collected from a real digital news platform. The selected news and opinions are related to telecommunication services and the COVID-19 pandemic. The performance of our proposal was measured using the Silhouette [15] and Jensen–Shannon divergence (JSD) [16] measures. The first is used to measure the quality of the clustering process, and thus to estimate the prospective quality of the topic detection phase. The second is used to measure the quality of the obtained summaries. Several experiments were carried out to provide a deeper grounding for the contribution of our approach. Different settings of the proposed model were evaluated and compared, to analyze the behavior of the different techniques integrated into the model and to identify the best solution for the news-focused opinion summarization process. The analysis of the experimental results and the conclusions obtained were substantiated through the well-known Wilcoxon statistical test.
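As a rough, self-contained illustration of how these two measures can be computed (on toy data only, not the datasets described above), scikit-learn provides the Silhouette coefficient and SciPy the Jensen–Shannon distance, whose square is the divergence:

```python
# Toy illustration of the two evaluation measures; the vectors and
# distributions below are invented placeholders, not the paper's data.
import numpy as np
from sklearn.metrics import silhouette_score
from scipy.spatial.distance import jensenshannon

# Cluster quality: term/sentence vectors and their cluster labels (topic detection output).
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
labels = np.array([0, 0, 1, 1])
print("silhouette:", silhouette_score(X, labels))

# Summary quality: word distribution of the source opinions vs. the summary.
p_source = np.array([0.40, 0.30, 0.20, 0.10])
p_summary = np.array([0.35, 0.35, 0.20, 0.10])
print("JSD:", jensenshannon(p_source, p_summary) ** 2)  # squared distance = divergence
```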

The rest of the paper is organized as follows: Section 2 summarizes the analysis of related works; Section 3 describes the proposed opinion summarization model; and Section 4 presents the datasets, metric description, and the experimental results and discussion. Conclusions and future work are pointed out in Section 5.

#### **2. Related Works**

Automatic text summarization is the task of producing a concise and fluent summary, condensing the most relevant and essential information contained in one or several textual documents, while preserving the key information content and overall meaning of the information source [17]. Summarizing texts is still an active research field and needs further development due to the huge increase in data on the web [18] (e.g., user-generated content). These methods and techniques have been applied to processing user-generated opinionated content on social networks and digital platforms, emerging as a new challenge [6]. Summaries can be automatically obtained through extractive methods (i.e., selecting the most important sentences from documents) or abstractive methods (i.e., generating new cohesive text that may not be present in the original information) [6,19]. Most of the opinion summarization models follow extractive methods [7,20]. Unlike traditional text summarization, opinion-oriented summaries have to take into consideration the sentiment a person has towards a topic, product, place, or service [1]. While text summarization aims to generate a concise version of factual information, sentiment summarization summarizes sentiments from a large number of reviewers or multiple reviews [21]. Opinion mining provides the sentiment associated with a document at different levels through the polarity detection task, whereas text summarization techniques identify the most relevant parts of one or more documents and build a coherent fragment of text (the summary) from them [1].

One of the main approaches to generating opinion summaries is aspect-based opinion summarization [7,22], which summarizes opinions according to different aspects or features (attributes or components) of an entity (objects, organizations, services, and products). In contexts in which aspects or features do not stand out, topic detection turns out to be critical for dismissing non-relevant sentences. However, achieving high effectiveness in this process constitutes a challenging task in contexts with a great diversity of opinions. Identifying topics is of great importance to determine the issues on which users are giving their opinions [23], which is one of the reasons that some opinion summarization approaches detect topics in their textual analysis [1,8,9,24,25]. Although the resulting summaries are generally focused on aspects or topics, these are mainly identified by taking into account only the content of the opinionated texts, and they do not focus on specific information-context interests. Nevertheless, there are approaches in which the relevance focus does not come only from the texts of the opinions, such as query-based opinion summarization, which aims to extract and summarize the opinionated sentences related to the user's query [6,26,27]. In these systems, classical summarization techniques are applied, and the context (query) is used as a relevance focus, to generate a coherent and useful summary for the user [28]. Other challenges are implicit in these opinion summarization methods, such as the following: how to retrieve query-relevant sentences, how to cover the main topics in the opinionated text set, and how to balance these two requests [29]. Our proposal addresses a similar problem, where news articles are used as the relevance focus instead of users' queries, although few approaches dealing with this problem have been identified [10]. For instance, Chakraborty et al. reported a method for summarizing tweets about news articles that initially captures the diverse opinions from the tweets by creating a unique tweet similarity graph, followed by a community detection technique to identify the tweets representing these diverse opinions [10]. Representative keywords of the news articles are extracted to identify related tweets. The similarity scoring between news and tweets and between pairs of tweets is based on the overlapping keywords (content similarity) and the word vectors' similarity (context similarity), respectively.

According to the results reported in Reference [1], integrating both topic-opinion analysis and semantic information can yield satisfactory results in opinion summarization. In this sense, for the analysis of opinions, which are generally short texts, it is particularly useful to represent terms and to capture semantic information about them. Two fundamental approaches collect the semantic characteristics of terms: one depends on the context, and the other depends on the meaning. Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are the most commonly used methods for topic modeling in opinions and for capturing semantic information from the context, as reported in [1,8,24,25]. However, some researchers consider that LDA- and LSA-based approaches do not properly model the aspects of the reviews made on the web [3]; instead, text segment clustering approaches have the advantage of keeping the document structure through segments, to capture the semantics of texts [30]. On the other hand, word embedding models [12] (e.g., word2vec [14], GloVe [31], and FastText) have been less applied; only a few approaches have been identified [8,10]. A word embedding is a learned representation for text in which words that have the same meaning have a similar representation. This kind of representation has been successful in extractive summarization [32]. WordNet [11] is the most commonly used resource for capturing and processing the semantic meaning of terms; however, it has not been used as much when summarizing opinions. In this context, the use of WordNet is mainly limited to capturing synonyms, and few approaches have been identified [26,33,34]. Nevertheless, the use of WordNet in our proposal goes further.

#### **3. News-Focused Opinion Summarization Model**

The conception of the proposed model is based on the extractive and topic-based text summarization approach, where the relevance scoring of sentences not only requires processing the information content to be summarized (e.g., the set of opinions), but also requires carrying out an alignment process with external or contextual information of interest (in our case, the news content). An overview of the proposed model is shown in Figure 1. The proposed model combines topic modeling (phase 2) and the news content to determine the "importance" of opinionated sentences; it also includes a sentiment analysis process (phase 3) to determine the polarity strength of sentences and avoid the inclusion of non-opinionated sentences in the automatic summary. The topic-sentence mapping (phase 4) and topic contextualization (phase 5) allow us to align the sentences to the corresponding identified opinion topics and to determine the most relevant topics concerning the news. The least relevant topics are discarded, and this is followed by the sentence ranking (phase 6) and summary construction (phase 7) processes. A hypothetical code skeleton of this pipeline is sketched after Figure 1.

**Figure 1.** Workflow overview of the proposed model.
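The following skeleton sketches how the seven phases could be chained in code; every helper function name and signature here is a hypothetical placeholder for the corresponding component described in the subsections below, not an implementation taken from the paper.

```python
# Hypothetical seven-phase pipeline skeleton; all helpers are placeholders.
def summarize_opinions(news_text: str, opinions: list[str], summary_size: int = 5) -> list[str]:
    sentences, news_keywords = preprocess(news_text, opinions)        # phase 1: preprocessing
    topics = detect_topics(sentences)                                 # phase 2: topic detection
    opinionated = score_sentiment(sentences)                          # phase 3: drop non-opinionated sentences
    topic_map = map_topics_to_sentences(topics, opinionated)          # phase 4: topic-sentence mapping
    relevant = contextualize_topics(topic_map, news_keywords)         # phase 5: keep topics close to the news
    ranked = rank_sentences(relevant, news_keywords)                  # phase 6: sentence ranking
    return ranked[:summary_size]                                      # phase 7: summary construction
```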

Several model settings and techniques were developed and evaluated, centered on three important problems in the proposed model: (1) granularity in the topic modeling, (2) semantic processing of words and sentences, and (3) sentence relevance scoring. All of these developed alternatives are explained in the following subsections.

#### *3.1. Preprocessing and Feature Extraction*

In this phase, several Natural Language Processing tasks are performed for structuring the text (news and opinions) and extracting features, according to the preprocessing steps commonly reported in the opinion mining solutions [4]. Initially, the texts are split into sentences, and the tokenization task is applied to each sentence, for obtaining words or phrases. Some stop words, such as "la", "de", "y" and "o" (experiments were developed using Spanish text), are removed, considering that these words provide little useful information. Besides this, the lemmatization process of all words is carried out. Subsequently, the Part-of-Speech (POS) tagging is performed to determine the POS tag corresponding to each word belonging to sentences that make up opinions and news. The spaCy library of Python was used to support these tasks.
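A minimal sketch of this preprocessing step with spaCy could look as follows; the model name es_core_news_md is the one mentioned later in this section, while the exact stop-word handling and the output structure are simplifying assumptions.

```python
# Minimal preprocessing sketch with spaCy's Spanish pipeline (simplified assumptions).
import spacy

nlp = spacy.load("es_core_news_md")

def preprocess(text: str):
    doc = nlp(text)
    processed = []
    for sent in doc.sents:                               # sentence splitting
        tokens = [
            (tok.lemma_.lower(), tok.pos_)               # lemmatization + POS tagging
            for tok in sent
            if not tok.is_stop and not tok.is_punct      # stop-word and punctuation removal
        ]
        processed.append(tokens)
    return processed

print(preprocess("La cobertura del servicio es excelente. El precio no me gusta."))
```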

A crucial phase in opinion summarization is the feature-extraction phase, which simplifies the complexity of the involved tasks (e.g., topic modeling, sentiment classification, and semantic processing) by reducing the feature space. POS tags, such as adjective and noun, are quite helpful because the opinion words are usually adjectives and opinion targets (e.g., entities, aspects, or topics) are nouns or combinations of nouns [4]. Consequently, opinion features are constituted by noun phrases, adjectives, and adverbs. In the case of news texts, noun phrases play an important role as keywords in the content; therefore, they are used to construct the news keyword vector.

The vector space model was adopted for representing words and sentences (features). Two semantic representation approaches to reinforce the semantic processing were developed and evaluated, which are conceived through the use of (1) WordNet [11] and (2) word embeddings [12]. WordNet groups nouns, verbs, adjectives, and adverbs into sets of cognitive synonyms (*synsets*), each expressing a distinct concept meaning. Synsets are interlinked by means of conceptual–semantic and lexical relations. In the first case, the semantic characteristics of words are captured depending on their meaning. The feature vector is constructed with the *synset* of each word included in the sentence; in the case of ambiguous words (more than one *synset* in WordNet), the first *synset* that appears is selected. In the second case, the semantic characteristics of words are captured depending on their context. Word embedding vectors are obtained by applying the automatic learning model word2vec [14] on the sentences and news texts. Specifically, those vectors are generated by using the word2vec pre-trained model included in the es\_core\_news\_md model of the spaCy library, which includes 300-dimensional vectors trained using FastText CBOW on Wikipedia and OSCAR (Common Crawl) containing 20 k unique words in Spanish.
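Building on the same spaCy pipeline, a sketch of the feature extraction and of the embedding-based representation might look like the following; the selection rules (noun chunks as news keywords, nouns/adjectives/adverbs as opinion features) follow the description above, while the concrete filtering details are assumptions.

```python
# Sketch of feature extraction and word-embedding representation with spaCy.
# Selection rules follow the text above; filtering details are simplified assumptions.
import spacy

nlp = spacy.load("es_core_news_md")

def news_keywords(news_text: str):
    """Noun phrases of the news form its keyword vector."""
    doc = nlp(news_text)
    return [chunk.lemma_.lower() for chunk in doc.noun_chunks]

def opinion_features(sentence: str):
    """Opinion features (nouns, adjectives, adverbs) mapped to their pre-trained vectors."""
    doc = nlp(sentence)
    return {tok.lemma_.lower(): tok.vector               # 300-dimensional word2vec-style vector
            for tok in doc
            if tok.pos_ in {"NOUN", "PROPN", "ADJ", "ADV"} and tok.has_vector}

print(news_keywords("El operador amplía la cobertura de internet en la provincia."))
print(list(opinion_features("El servicio es lento pero el personal es amable.").keys()))
```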

#### *3.2. Topic Detection*

Topic detection is a way of monitoring and summarizing information generated from social sources, about which the participants discuss, argue, or express their opinions. Therefore, identifying topics is of great importance to determine the relevant sentences of the opinion source to be included in the automatic summary. A topic can be analyzed and represented by considering different textual unit granularities, such as a group of terms, keywords, or sentences [30]. Term-based and sentence-based topic modeling approaches were applied and evaluated, with the first one finally adopted in our proposal as a consequence of the experimental results.

In our proposal, topic detection from all opinions is based on a clustering process, specifically of the terms extracted in the preprocessing task. In this sense, the clusters of terms represent the topics that have been addressed in the opinions. The objective of clustering algorithms is to create groups that are internally coherent. In brief, cluster analysis groups data objects into clusters such that objects belonging to the same cluster are similar, while those belonging to different clusters are dissimilar [35]. Both term and sentence clustering are carried out by applying a Hierarchical Agglomerative Clustering (HAC) algorithm [35]. The HAC algorithm builds hierarchies until a single cluster including all objects is obtained. However, we need to obtain a certain number of groups that represent the topics addressed in the opinions, so it is necessary to cut the hierarchy at some level to obtain a partition. Although some variants to obtain a partition from a dendrogram are reported in Reference [35], we adopted the definition of a threshold to achieve a standard cut-point for the hierarchies, which allows us to compare the similarity values of the clusters with this threshold during the cluster-construction process. Thus, terms are clustered while their highest similarities exceed the specified threshold; otherwise, the clustering process is stopped. To obtain the threshold value, the mean of the maximum values of the similarities among any pair of objects is used. A minimal sketch of this step is given below.
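The sketch below assumes a precomputed, symmetric term-to-term similarity matrix; the average-linkage choice and the conversion from similarity to distance are illustrative assumptions, while the threshold follows the rule described above (mean of the per-term maximum similarities).

```python
# Sketch: agglomerative clustering of terms, cut at a similarity threshold.
# `sim` is an assumed precomputed term-to-term similarity matrix in [0, 1];
# average linkage is an illustrative choice, not taken from the paper.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

sim = np.array([[1.0, 0.9, 0.2, 0.1],
                [0.9, 1.0, 0.3, 0.2],
                [0.2, 0.3, 1.0, 0.8],
                [0.1, 0.2, 0.8, 1.0]])

# Threshold: mean of each term's maximum similarity to any other term.
threshold = (sim - np.eye(len(sim))).max(axis=1).mean()

dist = 1.0 - sim                                           # similarity -> distance
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=1.0 - threshold, criterion="distance")
print("threshold:", round(threshold, 2), "cluster labels:", labels)
```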

Two semantic processing approaches for measuring the similarity between text units in the clustering process were evaluated: (1) WordNet-based and (2) word-embedding-based, with the latter being the most promising. The Wu and Palmer measure included in WordNet::Similarity [36] is applied to compute term similarity when the WordNet-based semantic processing is used, whereas the cosine similarity measure is applied over the word-embedding-based term representation. The similarity between two sentences *S*<sub>1</sub> and *S*<sub>2</sub> is determined by the following sentence-to-sentence similarity function [37], expressed in Equation (1):

$$\text{sem\\_sim}(S\_1, S\_2) = \frac{1}{2} \left( \frac{\sum\_{w \in S\_1} \text{maxSim}(w, S\_2) \cdot \text{idf}(w)}{\sum\_{w \in S\_1} \text{idf}(w)} + \frac{\sum\_{w \in S\_2} \text{maxSim}(w, S\_1) \cdot \text{idf}(w)}{\sum\_{w \in S\_2} \text{idf}(w)} \right) \tag{1}$$

In this function, given two sentences *S*<sub>1</sub> and *S*<sub>2</sub>, for each word *w* in *S*<sub>1</sub>, the word in *S*<sub>2</sub> with the highest semantic similarity to *w*, maxSim(*w*, *S*<sub>2</sub>), is identified according to one of the word-to-word similarity measures (in our proposal, the Wu and Palmer or the cosine measure).
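A possible Python rendering of Equation (1) is sketched below for the embedding-based case, where cosine similarity plays the role of the word-to-word measure; in the WordNet-based variant, the Wu and Palmer measure would be plugged in instead. Input names and the idf weights are illustrative assumptions.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sem_sim(s1_vecs, s1_idf, s2_vecs, s2_idf):
    """Sentence-to-sentence similarity in the spirit of Equation (1).
    s*_vecs are lists of word vectors, s*_idf the corresponding idf weights."""
    def directed(src_vecs, src_idf, tgt_vecs):
        num = den = 0.0
        for vec, idf in zip(src_vecs, src_idf):
            max_sim = max(cosine(vec, t) for t in tgt_vecs)   # maxSim(w, S)
            num += max_sim * idf
            den += idf
        return num / den if den else 0.0

    return 0.5 * (directed(s1_vecs, s1_idf, s2_vecs) + directed(s2_vecs, s2_idf, s1_vecs))
```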

#### *3.3. Sentiment Scoring*

Different from traditional extractive text summarization, whose fundamental goal is extracting "important" sentences from single or multi-documents according to some features, the opinion-oriented summaries have to take into consideration the sentiment a person has towards a topic, product, place, service, etc. Opinion mining provides the sentiment associated with a document at different levels and through the polarity detection task, whereas text summarization techniques identify the most relevant parts of a document and build from them a coherent fragment of text (the summary) [1].

In this step, sentiment analysis is performed with a lexicon-based method, using SpanishSentiWordNet (a Spanish adaptation of SentiWordNet [38]) to extract sentiment-related words from the texts. The SpanishSentiWordNet [39] lexicon is the result of the automatic annotation of all *synsets* of the Spanish WordNet according to the notions of "positivity" and "negativity". In this process, each WordNet *synset* is associated with two numerical scores, which indicate the degrees of positivity and negativity of the terms (noun, verb, adjective, and adverb) contained in the *synset* [39]. Sentences that do not include sentiment content, or whose sentiment scores are lower than a threshold value, are filtered out. Words with a positive or negative SpanishSentiWordNet score greater than 0.4 are considered when computing the sentiment scores. The polarity scoring of a sentence is calculated as shown in Equations (2) and (3) [30]:

$$PosSentenceScore(j) = \sum\_{t\_i \in Opinion(j)} PosValue(t\_i) \tag{2}$$

$$\text{NegSentenceScore}(j) = \sum\_{t\_i \in \text{Opinion}(j)} \text{NegValue}(t\_i) \tag{3}$$

where *PosValue*(*ti*) and *NegValue*(*ti*) are the polarity values in SpanishSentiWordNet of the identified sentiment word *ti* in the opinion *j*. The opinion polarity is determined by the highest obtained polarity score. According to Reference [30], the sum operator achieved the best accuracy in the experimental results among four compared classical compensatory operators. The topic polarity scores are measured as the sum of the polarity scores *PosSentenceScore*(*Sj*) and *NegSentenceScore*(*Sj*) of each sentence *Sj* included in each cluster, according to Equations (4) and (5).

$$PosTopicScore(i) = \sum\_{S\_j \in Cluster(i)} PosSentenceScore\left(S\_j\right) \tag{4}$$

$$\text{NegTopicScore}(i) = \sum\_{S\_j \in \text{Cluster}(i)} \text{NegSentenceScore}(S\_j) \tag{5}$$

The highest obtained cluster polarity score (*TopicScore*(*i*)) is used to determine which judgment (positive or negative) about the detected topics is the most representative in the processed opinions.
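The scoring of Equations (2)–(5) can be sketched as follows; the lexicon entries are hypothetical SentiWordNet-style (positivity, negativity) pairs, and the 0.4 filter mirrors the threshold mentioned above.

```python
def sentence_polarity(tokens, lexicon, min_score=0.4):
    """Sum of positive/negative lexicon scores of the sentiment words in a sentence
    (Equations (2) and (3)); only scores above min_score are taken into account."""
    pos = sum(lexicon[t][0] for t in tokens if t in lexicon and lexicon[t][0] > min_score)
    neg = sum(lexicon[t][1] for t in tokens if t in lexicon and lexicon[t][1] > min_score)
    return pos, neg

def topic_polarity(cluster_sentences, lexicon):
    """Aggregation of sentence scores over a topic cluster (Equations (4) and (5))."""
    pos = neg = 0.0
    for tokens in cluster_sentences:
        p, n = sentence_polarity(tokens, lexicon)
        pos, neg = pos + p, neg + n
    return ("positive", pos) if pos >= neg else ("negative", neg)

# Hypothetical lexicon: term -> (positivity, negativity).
lexicon = {"excelente": (0.75, 0.0), "pésimo": (0.0, 0.875)}
print(sentence_polarity(["el", "servicio", "es", "excelente"], lexicon))  # (0.75, 0.0)
```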

#### *3.4. Topic-Sentence Mapping*

Topic-based opinion-summarization systems, such as our proposal, should be able not only to detect sentences that express a sentiment but, more importantly, to detect sentences that contain sentiment expressions towards the topic under consideration [1]. Once the opinion topics are identified and the sentences are classified as positive or negative, a mapping process between topics and sentences is performed. This process avoids the introduction of irrelevant sentences into the automatic summary. Mapping is carried out by computing the semantic similarity between the vocabulary that describes the topic and the sentences. For each sentence, Equation (1) is applied to compute sentence-to-topic similarity scores with respect to all identified topics. Finally, the sentence is mapped onto the topic with the highest similarity score.

#### *3.5. Topic Contextualization*

Topic contextualization is one of the distinguishing tasks of our methodological proposal with respect to the generic opinion summarization systems that have been reported. In those systems, the generated summaries are generally focused on aspects or topics identified while taking into account only the content of the opinionated texts. However, the purpose of our model is to provide automatic summaries focused on contexts of interest. In our model, these contexts are news articles, due to the fact that they are the generators of the opinion comments.

In this phase, the news-based topic-ranking process is performed by computing the topic salience with respect to the news content, obtaining a salience score for each topic. The topic salience is obtained by measuring the semantic similarity between the vocabulary associated with the topic and the news content. Topics with the lowest scores (smaller than or equal to a predefined threshold, which was empirically fixed at 0.5) are eliminated from the next steps of the summary-construction process. This procedure means that the automatic summary will be built by extracting sentences from topics that are relevant to the news.

Similar to previous phases, Equation (1) and the word-to-word semantic similarity conception are also applied here. Topics are represented through term vectors, while the news is represented through the previously generated news feature vector. Formally, the salience score of a topic *Ti* for a piece of news *nj* is defined according to Equation (6). In the case of sentence-based topic modeling (another developed and evaluated approach), topic salience is computed by averaging the semantic similarity between each sentence *Sk* ∈ *Ti* and the news keyword vector, as shown in Equation (7).

$$\text{salience\\_score}\_1(T\_i, n\_j) = \text{sem\\_sim}(T\_i, n\_j) \tag{6}$$

$$\text{salience\\_score}\_2(T\_i, n\_j) = \frac{\sum\_{S\_k \in T\_i} \text{sem\\_sim}(S\_k, n\_j)}{|T\_i|} \tag{7}$$
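The contextualization step of Equation (6) essentially filters topics by their salience with respect to the news; a minimal sketch, assuming any semantic similarity function (e.g., Equation (1)) and vector-based topic/news representations, could look as follows.

```python
import numpy as np

def contextualize_topics(topic_vectors, news_vector, sem_sim, threshold=0.5):
    """Keep only the topics whose salience with respect to the news exceeds the threshold."""
    salient = {}
    for topic_id, topic_vec in topic_vectors.items():
        score = sem_sim(topic_vec, news_vector)     # salience_score_1 in Equation (6)
        if score > threshold:
            salient[topic_id] = score
    return salient

# Illustrative usage with plain vectors and cosine similarity as the semantic measure.
cos = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
topics = {0: np.array([1.0, 0.2]), 1: np.array([0.1, 1.0])}
news = np.array([0.9, 0.3])
print(contextualize_topics(topics, news, cos))      # topic 1 is filtered out
```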

#### *3.6. Sentences Ranking*

In this phase, the relevance assessment process applied to each opinionated sentence is carried out for generating the sentence ranking, according to a relevance score. Three approaches were developed and evaluated for measuring the relevance score:

	- *Sentence length*: A longer sentence is very likely to be more explanatory than a shorter one, since a longer sentence, in general, conveys more information.
	- *Popularity and representativeness*: A sentence is very likely to be more explanatory if it contains more terms that occur frequently in all sentences.
	- *Discriminativeness relative to background*: A sentence containing more discriminative terms that can distinguish opinionated sentences from background information is more likely explanatory.

In our proposal's setting, for each sentence *Sk*, the content of the cluster corresponding to the contextualized topic to which *Sk* belongs is used as the reference for computing the *representativeness*. In addition, the sentences from all opinions are used as the background for computing the *discriminativeness*. It is important to point out that contextualized topics are the most important opinion topics for the news; therefore, this setting allows us to indirectly align the sentence relevance scoring process with the news context.
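A rough sketch of how such a relevance score could combine these criteria is given below; the frequency-based notions of representativeness and discriminativeness, the equal weighting, and all variable names are assumptions made for illustration only.

```python
from collections import Counter

def relevance_score(sentence_tokens, topic_tokens, background_tokens, alpha=0.5):
    """Representativeness against the contextualized topic plus discriminativeness
    against the background formed by all opinion sentences (illustrative combination)."""
    topic_freq = Counter(topic_tokens)
    background_freq = Counter(background_tokens)

    representativeness = sum(topic_freq[t] for t in sentence_tokens) / max(len(sentence_tokens), 1)
    discriminative_terms = [
        t for t in sentence_tokens
        if topic_freq[t] / len(topic_tokens) > background_freq[t] / len(background_tokens)
    ]
    discriminativeness = len(discriminative_terms) / max(len(sentence_tokens), 1)

    return alpha * representativeness + (1 - alpha) * discriminativeness
```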


#### *3.7. Summary Construction*

Once the relevance of the sentences has been computed in the previous phase, the summary-construction process is carried out by selecting, from each contextualized relevant topic, the *N* opinionated sentences with the highest relevance scores. The value of *N* depends on the predefined compression rate (summary size); we set *N* = 3 when evaluating our proposal.

#### **4. Experimental Results**

#### *4.1. Description of Datasets*

To evaluate the effectiveness of our proposed model, two datasets with real information in the Spanish language, covering two different domains, namely telecommunications services (TelecomServ dataset) and the COVID-19 pandemic (COVID-19), were created. These datasets were manually constructed by retrieving information (news and opinions) from Cubadebate (www.cubadebate.cu), which is one of the most important and most visited digital news platforms in Cuba. For both datasets, the news-selection task was carried out while considering two fundamental requirements:


The TelecomServ dataset consists of 80 news articles and their associated opinions. The selected news items are related to the Cuban Telecommunication Enterprise S.A. (ETECSA) and were published in the last three years. The gathered information is one of the sources that the enterprise may consider for measuring customer satisfaction regarding its services. On the other hand, the COVID-19 dataset consists of 85 news articles, along with their associated opinions, related to the battle against the SARS-CoV-2 coronavirus pandemic in Cuba. This dataset mostly gathers news related to information issued by government authorities and published during six months of the pandemic (March–August 2020). In this case, the gathered information and its processing/summarizing could be of great value for monitoring the social impact of the government actions taken to curb the pandemic's growth and of the events that emerge in this difficult situation. The characterization of these datasets is shown in Table 1.

**Table 1.** Dataset characterization.


#### *4.2. Evaluation Metrics*

Evaluation in text summarization can be extrinsic or intrinsic. In an extrinsic evaluation, summaries are assessed in the context of a specific task a human or machine has to carry out. In an intrinsic evaluation, summaries are evaluated against some ideal model. Intrinsic evaluation has been the most widely adopted paradigm, and the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures [43] are the most widely used metrics for evaluating automatic summaries. However, these content-based evaluation metrics require comparing the automatic summary with a human summary model, which is a problem when such a human summary is not available.

The effectiveness of our proposal was evaluated in a real context where a human summary model is not available; therefore, the ROUGE measures were discarded. To address this problem, we used the Jensen–Shannon divergence [16] as the quality metric for assessing our automatic summaries from different perspectives. The adoption of this metric is mainly motivated by two reasons: (1) good summaries are expected to be characterized by a low divergence between the probability distributions of words in the input and in the summary [44], and (2) several reported studies demonstrate a strong correlation between measures that use human models (e.g., ROUGE, Pyramids, and others) and the Jensen–Shannon metric [44,45]. These studies and their experiments were developed in the context of generic multi-document summarization, topic-based multi-document summarization [44], and opinion summarization tasks [45].

The Jensen–Shannon divergence (*JSD*) is an information-theoretic measure of the divergence between two probability distributions and is defined as shown in Equations (8)–(10) [45]:

$$JSD(P \, \| \, Q) = \frac{1}{2} \sum\_{w} \left( P\_w \log\_2 \frac{2P\_w}{P\_w + Q\_w} + Q\_w \log\_2 \frac{2Q\_w}{P\_w + Q\_w} \right) \tag{8}$$

$$P\_w = \frac{C\_w^T}{N} \tag{9}$$

$$Q\_w = \begin{cases} \frac{C\_w^S}{N\_S} & \text{if } w \in S\\ \frac{\frac{C\_w^T}{N} + \delta}{N + \delta \cdot B} & \text{otherwise} \end{cases} \tag{10}$$

where *P* is the probability distribution of a word, *w*, in the text, *T*, and *Q* is the probability distribution of a word, *w*, in a summary, *S*; *N*, defined as *N* = *NT* + *NS*, is the number of words in the text (*NT*) and the summary (*NS*); *B* is equal to 1.5 |*V*|, where *V* is the vocabulary extracted from the text and the summary; *C<sup>T</sup><sub>w</sub>* is the number of occurrences of word *w* in the text; and *C<sup>S</sup><sub>w</sub>* is the number of occurrences of word *w* in the summary. For smoothing the summary's probabilities, we used δ = 0.005. The *JSD* values lie in the range [0, 1], where a lower value indicates a low divergence between the two compared probability distributions, and thus a better quality of the automatic summary in our context. This measure can be applied to the distribution of units in system summaries (*P*) and reference summaries (*Q*), and the obtained value would then be used as a score for the system summary [45]. Nevertheless, in our evaluation framework, this measure was applied according to Reference [44], using the input (news text and opinion set) as the reference, by comparing the distribution of words in the full input documents with the distribution of words in the automatic summaries.
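Following the definitions above, the metric can be implemented in a few lines; the sketch below mirrors Equations (8)–(10) as reconstructed here, with the input text used as the reference distribution.

```python
import math
from collections import Counter

def jsd(text_tokens, summary_tokens, delta=0.005):
    """Jensen-Shannon divergence between the word distributions of the input text
    and of the summary, with the smoothing of Equation (10)."""
    vocab = set(text_tokens) | set(summary_tokens)
    b = 1.5 * len(vocab)
    n_t, n_s = len(text_tokens), len(summary_tokens)
    n = n_t + n_s
    c_t, c_s = Counter(text_tokens), Counter(summary_tokens)

    divergence = 0.0
    for w in vocab:
        p = c_t[w] / n
        q = c_s[w] / n_s if w in c_s else (c_t[w] / n + delta) / (n + delta * b)
        if p > 0:
            divergence += p * math.log2(2 * p / (p + q))
        if q > 0:
            divergence += q * math.log2(2 * q / (p + q))
    return 0.5 * divergence

print(jsd("la red móvil falla en la ciudad".split(), "la red falla".split()))
```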

Topic detection constitutes another key piece of our summarization framework; therefore, its evaluation is also very important. The proposed topic-detection process was conceived as a clustering approach applying a HAC algorithm, which suggests that the higher the quality of the clustering process, the higher the quality of the topic detection. Under this assumption, we decided to apply the Silhouette measure [15]. Silhouette, a clustering validity measure, was conceived to select the optimal number of clusters with ratio-scale data (as in the case of Euclidean distances) that are suitable for separated clusters. It is important to point out that Silhouette values range from −1 to +1, where a high value indicates that an object is well matched to its own cluster and poorly matched to neighboring clusters, therefore indicating a better quality of the clustering process.
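In practice, the averaged Silhouette can be obtained directly from the clustered term vectors, for instance with scikit-learn; the random data below are only stand-ins for the vectors and labels produced in the topic-detection step.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
term_vectors = rng.normal(size=(40, 300))   # stand-in for embedding-based term vectors
labels = rng.integers(0, 4, size=40)        # stand-in for HAC cluster assignments

# Cosine distance is an illustrative choice, consistent with the embedding-based similarity.
score = silhouette_score(term_vectors, labels, metric="cosine")
print(f"Average Silhouette: {score:.3f}")
```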

#### *4.3. Experimental Setup*

In this section, we describe the experimental setup that was considered for both datasets and used to evaluate the effectiveness of the proposed news-focused opinion summarization model. In our experiments, several solutions based on our model were developed and evaluated in order to identify the best alternatives. The characterization of the evaluated approaches is shown in Table 2. For each piece of processed news and each automatically generated summary produced by these solutions, we computed the averaged Silhouette and *JSD* measures. The *JSD* measure was computed from two perspectives:



**Table 2.** Characterization and identification of the evaluated solutions.

The following experimental tasks were performed:

1. Evaluating two topic-detection approaches by using both term- and sentence-based granularities in the clustering process and comparing them by applying both the WordNet- and word-embedding-based semantic-processing approaches; then selecting the clustering and semantic-processing approaches that provide the best results for topic detection.


Wilcoxon's statistical test was performed to validate the obtained results and to find significant differences between the evaluated solutions. From each dataset, 100% of the news and opinions were selected to constitute the sample group. In each test, the confidence level was 95%, which means that the null hypothesis (*H0*) is rejected when the *p*-value ≤ 0.05.
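Such a paired comparison can be reproduced, for example, with SciPy; the JSD values below are hypothetical and only illustrate how the test is applied to per-news scores of two solutions.

```python
from scipy.stats import wilcoxon

# Hypothetical paired JSD_News values per news item for two solutions (e.g., OS4-we vs. a baseline).
os4_we   = [0.31, 0.28, 0.35, 0.30, 0.27, 0.33, 0.29, 0.32]
baseline = [0.36, 0.31, 0.38, 0.33, 0.30, 0.37, 0.31, 0.35]

stat, p_value = wilcoxon(os4_we, baseline)
print(f"W = {stat:.2f}, p-value = {p_value:.4f}")
if p_value <= 0.05:
    print("Reject H0: the paired differences are statistically significant.")
```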

#### *4.4. Results and Discussion*

Figures 2 and 3 show detailed results of the first experimental task, where the evaluated solutions are grouped by the clustering approaches (term and sentence clustering), and the semantic processing (WordNet or word embeddings). This experimental task is focused on the Silhouette measure. Figures 4 and 5 show a comparative summary of the averaged Silhouette values for both datasets.

**Figure 2.** Results of the Silhouette measure for the two clustering approaches in the topic detection on the TelecomServ dataset by applying (**a**) WordNet and (**b**) word embeddings based semantic processing approaches.

**Figure 3.** Results of the Silhouette measure for the two clustering approaches in the topic detection on COVID-19 dataset by applying (**a**) WordNet and (**b**) word embeddings based semantic processing approaches.

As shown in Figures 2 and 3, Silhouette values are generally better when terms are clustered, regardless of the semantic processing technique used. Only in the case of the COVID-19 dataset, when WordNet is used (Figure 3a), do Silhouette values show better performance when sentences are clustered. It is important to point out that the Silhouette values associated with each news item show less dispersion when term clustering is applied, which is very positive behavior, because it means the approach is less sensitive to the diversity of news lengths and the number of associated opinions. In addition, term clustering exhibits a more stable clustering quality. According to Figures 4 and 5, the word embedding representation reaches the best averaged Silhouette values, which are significantly higher when terms are clustered. These results allow us to conclude that term clustering, combined with word embeddings, is the most promising and effective setting for topic modeling in our model. This combination guarantees good quality in the clustering-based topic detection, under the assumption that the quality of the detected topics is proportional to the clustering quality.

**Figure 5.** Average Silhouette values of compared topic detection approaches applied to the COVID-19 dataset.

Figures 6–9 show the detailed results associated with the second experimental task, which is based on the *JSD* measure. The evaluated and compared solutions are grouped according to the *JSD* scope, focused either on the news or on all opinions, as well as according to term or sentence clustering. The semantic processing approach is specified in the identification of each solution (according to Table 2), which allows for an integral analysis of all developed model instances. As shown in Figures 6–9, OS4-WN and OS4-we are the solutions that obtained the best results for *JSDNews* in both datasets, concerning the use of WordNet (OS4-WN) or word embeddings (OS4-we). These results indicate that combining topic modeling based on term clustering with the proposed *Sentence-to-news\_scoring* for sentence ranking is the setting of our model that allows us to generate automatic summaries most aligned with the main topics of the news, regardless of the semantic processing approach adopted.

**Figure 6.** Results of *JSDNews* (Jensen–Shannon divergence focused on the news) applying (**a**) term and (**b**) sentence clustering, using WordNet and word embeddings on the TelecomServ dataset.

**Figure 7.** Results of *JSDOpinions* (Jensen–Shannon divergence focused on the opinions) applying (**a**) term and (**b**) sentence clustering, using WordNet and word embeddings on the TelecomServ dataset.

On the other hand, OS1-WN and OS1-we are the solutions that reach the best results for *JSDOpinions* in both datasets, which means that *Explanatoriness\_scoring* is more effective at summarizing the most important ideas of all opinions. These solutions do not ensure that the generated summaries have a higher alignment with the news compared with the other solutions. Nevertheless, the news-focused *JSD* obtained by these solutions, and their comparison with the rest of the solutions (see Tables 3 and 4), suggests that the inclusion of the topic-contextualization phase in the proposed model improves news-focused opinion summarization. Unlike the results shown in the first experiment, sentence clustering shows less sensitivity to the diversity of news lengths and the number of associated opinions.

**Figure 8.** Results of *JSDNews* applying (**a**) term and (**b**) sentence clustering, using WordNet and word embeddings on the COVID-19 dataset.

**Figure 9.** Results of *JSDOpinions* applying (**a**) term and (**b**) sentence clustering, using WordNet and word embeddings on the COVID-19 dataset.

The results shown in Tables 3 and 4, as well as in Figures 6–9, show that the combination of term clustering and the word embedding representation model is also the most promising and effective setting of our model for obtaining news-focused automatic summaries. Tables 3 and 4 show the averaged results of the *JSDNews* and *JSDOpinions* metrics, completing the objective of the third task. The results of the WordNet-based semantic processing approaches are shown in Table 3, where OS3-WN was adopted as baseline 1. The results of the word-embedding-based semantic processing approaches are shown in Table 4, where OS3-we was adopted as baseline 2. These baselines were selected because the previous evaluation task concluded that term clustering is the most promising and effective setting for topic modeling in our proposal. Thus, it allows us to evaluate the performance of the different approaches of our model and to compare them with notable summarizers such as TextRank [41] (a similar decision is adopted in References [46,47]).

All solutions are compared according to the *JSD* scope for both datasets, and the best results are highlighted in bold. This comparison allows a better understanding of the behavior of each approach. In general, the obtained results also show that OS4-we is the best setting of our proposed model according to *JSDNews* in both datasets. Furthermore, OS4-we is one of the solutions with the best results for *JSDOpinions* when the word embedding representation is applied. This result allows us to conclude that the integration of term clustering, word embeddings, and the similarity-based sentence-to-news scoring turned out to be the most promising and effective setting of our model. The automatic summaries obtained with OS4-we are more focused on the news content; they also cover the main topics in the opinion set, reaching an appropriate balance between these targets.


**Table 3.** Summary of averaged results of the *JSDNews* and *JSDOpinions* metrics considering WordNet-based semantic processing.

**Table 4.** Summary of averaged results of the *JSDNews* and *JSDOpinions* metrics considering word-embedding-based semantic processing.


The previous results were validated through statistical tests. Wilcoxon's test was applied to find significant differences between the OS4-we results and those obtained by the rest of the evaluated solutions, using *JSDNews* as the quality metric, as shown in Table 5. The statistical results show that there are significant differences between OS4-we and the compared solutions, since the obtained *p*-values are less than 0.05; thus, the null hypothesis is rejected in all compared cases. On the other hand, according to the #items-best values, OS4-we obtains the best results for 87% of the news (on average) in the TelecomServ dataset and for 85% of the news (on average) in the COVID-19 dataset. Therefore, OS4-we is the best configuration of our proposed model for news-focused opinion summarization.

**Table 5.** Statistical results of Wilcoxon's test: OS4-we vs. the evaluated solutions.


#### *4.5. Illustrative Examples*

Examples 1 and 2 were selected to illustrate the summaries generated by applying OS4-we on opinions about two news articles related to COVID-19, which facilitates a better understanding of how our proposal works.

**Example 1.** Excerpt from the summary generated regarding opinions related to the news "VALIENTES: Cuatro heroínas en la batalla contra la COVID-19" by applying OS4-we.


**Example 2.** Excerpt from the summary generated regarding opinions related to the news "Cuba frente a la COVID-19, día 100: Últimas noticias" by applying OS4-we.


In these examples, only some fragments of the news and of the generated summaries are included to avoid excessive length. These examples show summaries constituted by negative and positive sentences, as well as the terms related to the most relevant opinion topics. The terms that contribute most to the polarity ratings (according to the SpanishSentiWordNet lexicon) are highlighted. The selected examples illustrate that the generated summaries are strongly related to the general meaning of the news content, even when the terminology used in the two information units differs. The semantic relatedness with the most relevant identified topics can also be appreciated. These results are achieved thanks to the semantic processing conceived in our model, which is carried out by integrating a semantic representation model (word2vec [14]) and two semantic similarity measures (Wu and Palmer [36] and the sentence-to-sentence similarity measure reported in Reference [37]).

Some sentences in the generated summaries are somewhat long, which is fundamentally because opinion length is not restricted on the news platform used as the opinion source; this poses an additional challenge for determining sentence relevance effectively. Longer sentences are more likely to obtain higher relevance scores, since they can contain a larger number of terms semantically related to the news content. Therefore, this suggests considering other sentence features, such as tf-idf and sentence length, and integrating them into the sentence relevance assessment [48].

#### **5. Conclusions and Future Works**

In this paper, we have presented a news-focused opinion summarization approach that was designed according to the conception of extractive and topic-based text summarization methods. The proposed model can retrieve relevant sentences for the essential aspects of the news (context of interest), as well as cover the main topics of the opinionated texts in the generated summary. Our proposal integrates topic modeling, sentiment analysis, news-focused relevance scoring, and semantic analysis techniques. Several techniques and settings of our model were developed and evaluated with Spanish news and opinions regarding two different domains. The selected texts come from a real digital news platform.

The proposed model outperforms both adopted baselines, which are based on the classical text summarization method TextRank, obtaining automatic summaries that are more relevant to the news content while also covering the main topics of the opinionated texts well. The integration of term clustering, word embeddings, and the similarity-based sentence-to-news scoring turned out to be the most promising and effective setting of our model, since it reaches the best Jensen–Shannon divergence values with respect to the news and very good values with respect to all opinions. The use of semantic word representations when applying similarity metrics was especially effective, with the word embedding representation being the best option. Filtering out the topics not related to the news was a crucial step for generating automatic summaries aligned with the news, as was the calculation of the semantic similarities between the sentences and the news to extract relevant sentences. The application of the explanatoriness-scoring technique in the sentence-ranking phase produced summaries that best cover the main topics in the opinionated texts. Nevertheless, it is necessary to point out that an important factor in achieving these good results was the integration of the topic-contextualization process, in which the news is used to refine the topics identified from the opinions. These results suggest that, in general, the topics treated in opinions are, in fact, closely related to the context that originates them (e.g., the news).

Despite the promising results, several tasks could be considered as future work. Studying the effects of applying other clustering algorithms and similarity measures could contribute to obtaining better results. In cases where sentences are too short, exploring opinion and sentence augmentation could improve the opinion summarization process. Besides, it would be necessary to address the problem of inverted polarity caused by negation and to integrate several sentiment lexicons into the sentiment analysis process. The use of other sentence features and the aggregation of their results for improving the relevance scoring should also be studied.

**Author Contributions:** Conceptualization, A.S.-C., A.R.-H. and M.M.G.L.; methodology, A.S.-C. and M.M.G.L.; software, A.R.-H.; validation, A.R.-H., A.S.-C. and M.M.G.L.; formal analysis, A.S.-C., A.R.-H. and M.M.G.L.; investigation, A.R.-H., A.S.-C., M.M.G.L., L.A. and J.S.-G.; resources, A.R.-H.; data curation, A.S.-C. and A.R.-H.; writing—original draft preparation, A.S.-C. and A.R.-H.; writing—review and editing, A.S.-C., A.R.-H., M.M.G.L., L.A. and J.S.-G.; visualization, A.S.-C.; supervision, A.S.-C. and M.M.G.L.; project administration, A.S.-C.; funding acquisition, A.S.-C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** The work within the projects SAFER—PID2019-104735RB-C42 (AEI/FEDER, UE) and MERINET—TIN2016-76843-C4-2-R (AEI/FEDER, UE) supported by the Spanish Government.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Sentiment Analysis and Text Mining of Questionnaires to Support Telemonitoring Programs**

#### **Chiara Zucco 1, Clarissa Paglia 2, Sonia Graziano 3, Sergio Bella <sup>2</sup> and Mario Cannataro 1,4,\***


Received: 15 October 2020; Accepted: 24 November 2020; Published: 26 November 2020

**Abstract:** While several studies have shown how telemedicine and, in particular, home telemonitoring programs lead to an improvement in the patient's quality of life, a reduction in hospitalizations, and lower healthcare costs, different variables may affect telemonitoring effectiveness and purposes. In the present paper, an integrated software system, based on Sentiment Analysis and Text Mining, to deliver, collect, and analyze questionnaire responses in telemonitoring programs is presented. The system was designed to be a complement to home telemonitoring programs with the objective of investigating the paired relationship between opinions and the adherence scores of patients and their changes through time. The novel contributions of the system are: (i) the design and software prototype for the management of online questionnaires over time; and (ii) an analysis pipeline that leverages a sentiment polarity score by using it as a numerical feature for the integration and the evaluation of open-ended questions in clinical questionnaires. The software pipeline was initially validated with a case-study application to discuss the plausibility of the existence of a directed relationship between a score representing the opinion polarity of patients about telemedicine, and their adherence score, which measures how well patients follow the telehomecare program. In this case-study, 169 online surveys sent by 38 patients enrolled in a home telemonitoring program provided by the Cystic Fibrosis Unit at the "Bambino Gesù" Children's Hospital in Rome, Italy, were collected and analyzed. The experimental results show that, under a Granger-causality perspective, a predictive relationship may exist between the considered variables. If supported, these preliminary results may have many possible implications of practical relevance, for instance the early detection of poor adherence in patients to enable the application of personalized and targeted actions.

**Keywords:** text mining; sentiment analysis; Web-based questionnaire; telemedicine; telemonitoring; telehomecare

#### **1. Introduction**

Telemedicine can be defined as the set of health services providing medical care in patients' daily living environment, which is possible thanks to the support of information and telecommunication technologies [1].

Common goals of telemedicine programs are substantially threefold [2–4]:

• to increase self-management skills for patients, whether they have a chronic condition or, for instance, during recovery or a rehabilitation phase after surgery, in the follow-up after a long hospitalization, and also during the treatment of depression and other mental health conditions;


A subfield of telemedicine is telehomecare, or home telemonitoring, which enables the rapid exchange of information between health systems and patients. Patients enrolled in a telehomecare program are provided with bio-monitoring devices and Internet reporting systems installed in their daily living environment. The devices can be used to autonomously measure vital signals, specifically related to the specific condition of the patient. These measurements are then transmitted and evaluated by health professionals (physicians and nurses) who will subsequently re-contact patients via phone call or message to check their symptoms and, eventually, provide an early medical response.

Telemedicine and, specifically, telehomecare systems have shown themselves to be cost-effective [5] and to improve patients' quality of life, in terms of a significant reduction in both mortality and length of stay in progressive care units [6], a significant improvement in glycemic control for patients with diabetes [7], etc.

However, two variables that may affect a telemonitoring program's effectiveness are adherence levels and the degree of drop-out. Adherence levels may be measured in different suitable ways. Here, adherence is intended as the rate of performed monitoring events with respect to the ideal number of events suggested by the telemonitoring protocol, while the degree of drop-out refers to the percentage of patients who abandon the telemedicine or telehomecare program they were enrolled in, generally due to poor adherence. In particular, in [8], a systematic review of 37 healthcare programs for Heart Failure and Chronic Obstructive Pulmonary Disease was carried out. In that study, refusal rates of almost one-third of patients were reported. Moreover, among patients who took part in the telehomecare programs, one-fifth abandoned the program after enrollment.

Another interesting study is related to cystic fibrosis patients' follow-up at home [9].

Cystic fibrosis is the most common life-threatening genetic disease in the Caucasian population [10]. It is characterized by recurrent episodes of respiratory infection that cause progressive lung deterioration, with a long-term decline in lung function. The spirometry test is a simple test used for lung function monitoring. It is known that continuous monitoring of lung function during the follow-up of patients with cystic fibrosis can reduce lung damage by preventing bronchopulmonary exacerbations and, consequently, prevent the patient's exitus due to lung insufficiency [11].

The authors reported that, of 39 enrolled patients, 15 dropped out of the program (38.46%). The percentage decreases to 31.4% if considering voluntary drop-out. Eighty-one percent of drop-out was due to poor program adherence [9].

The most frequently used approaches for conducting research studies in the social sciences make use of surveys [12]. Among the methods for collecting survey data, questionnaires represent a widely used tool. Thanks to the availability of tools and systems that facilitate the development and administration phases through online platforms, their popularity has grown significantly. Compared to face-to-face or telephone interviews and to questionnaires-on-paper, online questionnaires provide several advantages: (i) a cost reduction; (ii) the collection of a greater number of data in a shorter time; (iii) the possibility for the individual to manage the place and time to take the questionnaire; and (iv) responses are already digitized and exportable in formats which are suitable for a subsequent analysis [13].

Thus far, closed-ended questions in questionnaires have dominated the scene in the social sciences and, consequently, in the psychological and health sciences. This choice is justified by the ease of data collection, the reliability and simplicity of the analysis, and the possibility of standardizing the collection to compare results between different populations [14]. On the other hand, the possibility of using open-ended questions would allow fine-grained analyses, offering new and interesting insights capable of detecting slight differences, especially, for example, in the context of patient monitoring.

The investigations carried out to verify if a significant benefit may be obtained by introducing open-ended questions in questionnaires led to different results. In [15,16], it is shown that answers which received a high response rate in closed-ended questionnaires were not mentioned when the same question was formulated in an open-ended form, whereas the study conducted in [17] showed no benefit in introducing open-ended questions.

With the availability of large amounts of textual data coming from social platforms, noteworthy developments concerning the automated analysis of texts have been registered during the last decade. Above all, an increasing interest in the field of sentiment analysis, which aims at the automatic extraction of emotions and opinions, mainly from text [18], has been witnessed.

This work aims to present an integrated software architecture for the online provision and collection of questionnaires or surveys, which exploits a sentiment analysis-based approach to monitor patients' adherence to telehomecare programs. The idea is that the sentiment, i.e., the degrees of positiveness/negativeness, expressed by patients through their responses to questionnaires, may be related to their adherence and used to predict drop-out.

The present architecture proposal is intended as a contribution that can help the context of home telemonitoring programs. The basic idea is to integrate within a telehomecare system an online survey instrument to investigate the polarities of patients' opinions in relation to their experience.

The proposed system also encompasses a novel analysis approach that leverages lexicon-based sentiment analysis techniques and exploits the inferred polarity as a numerical feature to enhance further statistical or machine learning analysis.

To the best of our knowledge, no specific research has been published, nor has a system architecture been proposed, that explicitly monitors changes in patients' opinions across time through the repeated administration of a questionnaire in a telehomecare system, using the polarity associated with answers to open-ended questions as a numerical feature.

Additionally, the paper presents a case study application of the system architecture to discuss whether a predictive relationship, in terms of Granger-causality test modeling, may be assessed between patient adherence in a cystic fibrosis telehomecare program and their opinion about the program they are enrolled in.

The rest of the paper is organized as follows. Section 2 describes the methodology behind the proposed approach and the case-study application. Section 3 provides insights related to collected data and presents the Granger-causality hypothesis tests results and discusses it. Finally, Section 4 concludes the paper and outlines future works.

#### **2. Materials and Methods**

In this section, some preliminary information related to the case study, a description of the experimental protocol used, and the proposed analysis pipeline are presented.

#### *2.1. Preliminary Information*

Since 2001, a home telemonitoring program has been provided by the Cystic Fibrosis Center of the "Bambino Gesù" Pediatric Hospital for cystic fibrosis patients' follow-up. Patients are provided with Spirotel instrumentation from MIR (Medical International Research), which remotely transmits data from the spirometry test and overnight pulse oximetry, following the clinical workflow detailed in [11]. Patients are advised to send spirometry transmissions at least twice a week.

After the transmission, physicians contact patients through a telephone interview involving questions about pulmonary symptoms and more general health conditions. Patients included in the telemonitoring program are treated with standard follow-up protocols, similar to those not enrolled in the program. A detailed description can be found in [19]. Despite the promising results, a significant percentage of abandonment due to poor adherence has been consistently registered.

Table 1 reports some statistics on patient enrollment over a nine-year period (2010–2018). As shown in Table 1, drop-out patients represent 41% of the total number of patients enrolled in the telehomecare program. Table 2 further illustrates the composition of the patients who left the program. In particular, 81.25% of patient drop-out is voluntary: 50% of patient abandonment is related to poor adherence, while 31.25% of intentional abandonment is related to other reasons.

**Table 1.** Balance of enrolment during the period 2010–2018.


**Table 2.** Proportion of patients drop-out during the period 2010–2018, grouped by abandonment causes.


#### *2.2. Experimental Protocol and Dataset Description*

The data analyzed in the present case study application were collected from the Cystic Fibrosis Unit, Bambino Gesù Children's Hospital, Rome, Italy. In this study, 169 online surveys sent by 38 cystic fibrosis patients (F/M = 20/18, age = 28.7 ± 9.91, age range = 14–49) recruited among patients already enrolled in a telemedicine program (years of enrollment = 5.9 ± 3.9) were collected and analyzed at five different survey epochs.

The enrollment criteria included patients more than 12 years old with cystic fibrosis who access the Cystic Fibrosis Unit in ordinary, daytime, or outpatient hospitalization. All patients who have undergone a transplant (liver/lung) were excluded from the study.

The study was formally approved by the local Medical Research Ethics Committee.

#### *2.3. Administration of Questionnaire*

From June 2019, 38 enrolled patients were asked to complete, every three months, an online questionnaire designed ad hoc by the clinical team. In the following, each set of survey submissions is indicated as an epoch.

The Telemedicine Drop-Out (TDO) questionnaire consists of 15 blocks of closed, mixed, and open-ended items with yes/no constraints, and it was administered through a self-hosted web-based survey instrument built on top of LimeSurvey. The TDO survey was designed as an online, structured version of the interview led by the medical team within the telemedicine program, extended with a series of open-ended questions, whose objective was to infer polarity or, in perspective, to extract emotions from the relative answers [20]. The TDO questionnaire is reported in Appendix A.

To administer the surveys to patients, LimeSurvey (https://www.LimeSurvey.org/) [21], a highly customizable, free, and responsive online survey tool, was set up. It also provides various API functions through the LimeSurvey RemoteControl 2 (LSRC2) interface. The survey structure and the participants are created through the user interface provided by LimeSurvey. The collection of survey answers is automated using the Python library Limepy, which provides a Python wrapper for the LSRC2 API, and the Python library Schedule, which periodically updates the responses. The DBMS server is MySQL.
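A minimal sketch of such an automated collection step is shown below; it calls the LSRC2 JSON-RPC methods get\_session\_key, export\_responses, and release\_session\_key directly, while the URL, credentials, and survey id are placeholders, and the described system relies on the Limepy wrapper rather than raw requests.

```python
import base64
import json
import time

import requests
import schedule

LSRC2_URL = "https://example.org/limesurvey/admin/remotecontrol"   # placeholder endpoint
SURVEY_ID = 123456                                                  # placeholder survey id

def rpc(method, params):
    """Minimal JSON-RPC helper for the LimeSurvey RemoteControl 2 API."""
    payload = {"method": method, "params": params, "id": 1}
    response = requests.post(LSRC2_URL, json=payload,
                             headers={"content-type": "application/json"})
    return response.json()["result"]

def fetch_responses():
    session_key = rpc("get_session_key", ["admin", "password"])      # placeholder credentials
    exported = rpc("export_responses", [session_key, SURVEY_ID, "json"])
    rpc("release_session_key", [session_key])
    return json.loads(base64.b64decode(exported))

# Periodically pull new answers, mimicking the scheduled collection step.
schedule.every().day.at("02:00").do(fetch_responses)
while True:
    schedule.run_pending()
    time.sleep(60)
```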

As already stated, adherent patients need to transmit the results of the spirometry test at least twice a week. For each survey administration, i.e., survey epoch, the patient's adherence score (Adh-score) to the telemonitoring program was computed as the total number of spirometry transmissions sent during a three-month window, from the month before to the month after the survey administration, divided by twice the total number of weeks in that window. In more detail, suppose that a survey was carried out in month *t*; then:

$$\text{Adh-score}\_{t} = \frac{nS\_{t-1} + nS\_{t} + nS\_{t+1}}{2(w\_{t-1} + w\_{t} + w\_{t+1})}$$

where *nSt*−1, *nSt*, and *nSt*+<sup>1</sup> refer to the number of spirometry transmissions sent in month *t* − 1, *t*, and *t* + 1, respectively, while *wt*−1, *wt*, and *wt*+<sup>1</sup> refer to the number of weeks in months *t* − 1, *t*, and *t* + 1, respectively.

For instance, to calculate the Adh-score related to the first epoch submission, *t* = June 2019. Therefore, each patient's total number of spirometry transmissions from May 2019 to July 2019 was considered. Moreover, since this three-month window encompasses 13 weeks, the total number of spirometry transmissions was divided by twenty-six.

By definition, patients who strictly follow the medical advice have an Adh-score ≥ 1. In the following, the percentage form of the Adh-score, i.e., Adh-score (%), is considered. Therefore,

$$\text{Adh-score} \, (\%) = \text{Adh-score} \times 100$$

and Adh-score (%) > 100 for patients with high rates of adherence. The clinical team provided the number of transmissions per month.
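The computation of the adherence score can be sketched as follows; the transmission counts are hypothetical, the week count is approximated as days divided by seven, and the function assumes a window within a single calendar year.

```python
from calendar import monthrange

def adh_score_percent(transmissions, year, months):
    """Adh-score (%) over a three-month window (months = [t-1, t, t+1]);
    transmissions maps (year, month) -> number of spirometry transmissions."""
    total_sent = sum(transmissions.get((year, m), 0) for m in months)
    total_weeks = sum(monthrange(year, m)[1] / 7 for m in months)   # days / 7 as an approximation
    return total_sent / (2 * total_weeks) * 100

# Hypothetical patient: 10, 8, and 9 transmissions in May, June, and July 2019.
transmissions = {(2019, 5): 10, (2019, 6): 8, (2019, 7): 9}
print(f"Adh-score (%) = {adh_score_percent(transmissions, 2019, [5, 6, 7]):.1f}")
```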

#### *2.4. System Architecture*

The system architecture encompasses three independent modules, connected in a cascade fashion. In future work, the modules will be integrated under a single user interface. Figure 1 shows the overall architecture of the system, which is organized into three logical levels:


#### 2.4.1. Data Analysis Pipeline

The general pipeline for the analysis of textual data, i.e., answers to open-ended questions, involves:


**Figure 1.** The modules of the system architecture, implemented as three independent levels connected to each other in cascade. The architecture is designed to be cyclical, as the system is used for each scheduled administration of the survey.

#### 2.4.2. Sentiment Polarity Extraction

Valence Aware Dictionary for sEntiment Reasoning (VADER) [23] is a lexicon-based sentiment analysis engine that combines lexicon-based methods with a rule-based modeling consisting of five human validated rules.

The benefits of VADER's approach are the following: it does not require a training phase and, consequently, it is applicable even in low-resource data domains; it works well on short texts; it is fast and, therefore, suited to near real-time applications; being based on general "parsimonious" rules, it is essentially domain-agnostic; and it constructs a white-box model, and is thus highly interpretable and adaptable to different languages.

The starting point of the VADER system is a generalizable, valence-based, human-curated gold-standard sentiment lexicon, built on top of three well-established lexicons, i.e., LIWC [26], General Inquirer, and ANEW [27], and expanded with a set of lexical features commonly used in social media, including emoji, for a total of 9000 English terms, subsequently annotated in a [−4, 4] range through Amazon Mechanical Turk's crowd-sourcing service.

The VADER engine's second core step is the identification of some general grammatical and syntactic heuristics to identify semantic shifters, i.e., words that increase, decrease, or change the polarity orientation of another word. In particular, five heuristics for sentiment polarity shifters have been identified:


To extend VADER to the Italian language, Sentix [28], a lexicon that automatically extends the SentiWordNet annotation to the Italian synsets provided in MultiWordNet [29], was considered.

Among the five heuristics designed in VADER, only three needed to be adapted to the Italian language, since capitalization of words and exclamation marks act as intensifiers in both languages. The words belonging to VADER's set of negation words were translated into Italian, and the set was then extended by retrieving the MultiWordNet synset terms for each word, while the contrastive particle "but" was simply translated into Italian.

Among the intensifier sets, VADER also considers a few idioms; however, due to discrepancies across the two languages, idioms were not considered in the adaptation.
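One possible way to approximate the lexicon part of this adaptation is sketched below with the vaderSentiment package: the English lexicon of the analyzer is extended with Italian entries (here two hypothetical Sentix-derived scores in the [−4, 4] range). Note that this sketch does not reproduce the translation of the negation set or the adaptation of the heuristics described above.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Hypothetical Italian entries on VADER's [-4, 4] valence scale (illustrative values only).
italian_lexicon = {"ottimo": 3.1, "utile": 1.8, "pessimo": -3.0}

analyzer = SentimentIntensityAnalyzer()
analyzer.lexicon.update(italian_lexicon)   # extend the lexicon with Italian terms

# The compound score lies in [-1, 1], as used for the polarity feature in this work.
scores = analyzer.polarity_scores("Il servizio di telemedicina è ottimo e utile")
print(scores["compound"])
```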

#### 2.4.3. Granger-Causality Testing

Granger-causality is a statistical hypothesis testing model to determine if there is a directed relationship between two time series [25]. A time series X is said to Granger-cause Y if it can be shown that there is a statistically significant improvement in predicting future values of Y by using past values of X (i.e., lagged values of X) and Y, compared to predictions based only on past values of Y.

Past values of X are related to current values of Y by means of a lag factor. Here, the Granger-causality test was computed for lagged values of X. All lags ranging from one to four were tested, where four is the number of considered submission epochs minus one.

Here, the considered alternative hypothesis is that the polarity-score time series associated with each considered open-ended question Granger-causes the time series of adherence. The level of significance was set at 5%, i.e., *p*-value < 0.05. The Granger-causality test assumes that the investigated time series are stationary; therefore, the augmented Dickey–Fuller method was used to check the stationarity conditions [24].
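The two checks can be run, for instance, with statsmodels; the series below are hypothetical stand-ins for the pooled adherence and polarity values, and grangercausalitytests expects the candidate cause in the second column.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, grangercausalitytests

# Hypothetical aligned observations of adherence (%) and Q1 polarity scores.
adh    = np.array([52.0, 48.5, 40.2, 61.3, 55.0, 47.8, 43.1, 58.6,
                   50.2, 45.9, 39.7, 63.0, 54.4, 49.1, 42.5, 57.3])
q1_pol = np.array([0.42, 0.35, 0.10, 0.66, 0.51, 0.22, 0.05, 0.60,
                   0.40, 0.18, 0.08, 0.70, 0.49, 0.25, 0.07, 0.58])

# 1) Augmented Dickey-Fuller test for stationarity of each series.
for name, series in [("Adh-score", adh), ("Q1 polarity", q1_pol)]:
    adf_stat, p_value, *_ = adfuller(series, maxlag=2)
    print(f"{name}: ADF p-value = {p_value:.4f}")

# 2) Does Q1 polarity Granger-cause the adherence score? Columns: [effect, candidate cause].
data = np.column_stack([adh, q1_pol])
results = grangercausalitytests(data, maxlag=4)
```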

#### **3. Results and Discussion**

In this section, we present the results related to the Granger-causality testing model to assess the plausibility of the existence of a predictive relationship between a score representing the opinion polarity of patients about telemedicine and their Adh-score. Moreover, to gain useful insights about the collected data, a preliminary exploratory data analysis was performed by following the pipeline discussed in the previous section and by summarizing data through suitable visualization.

#### *3.1. Exploratory Data Analysis*

In this study, 169 answers to the TDO survey were collected and analyzed following the system architecture described in the previous Section.

The present exploratory data analysis aims to provide useful insights into the data collection and integration processes.

In particular, the collected data were sent by 38 cystic fibrosis patients through five subsequent submissions, scheduled every three months on average.

Figure 2 shows a violin plot describing the distribution of the Adh-score in percentage associated with each submission epoch, while Table 3 provides the same information in tabular form. Although the mean values of Adh-score (%) are in the range [38.67%, 51.45%] for each epoch, the standard deviation and the minimum and maximum values of Adh-score (%) show a considerable variation, with patients who sent zero spirometries and patients who transmitted three times more often than the medical advice.

**Figure 2.** Distribution of Adh-score in percentage across the five subsequent submission epochs.


**Table 3.** Adh-score (%): descriptive statistics across five subsequent epochs.

A comprehensive analysis of the responses to the TDO survey is beyond the scope of this paper. Instead, only the answers to two open-ended questions collected from the TDO survey are discussed:

• Q1: "What do you think about telemedicine?"
• Q2: "Since you joined the telemonitoring program, what has improved the quality of your life?"


A polarity score ranging in [−1, 1] was inferred by adapting the VADER framework to the Italian language and considered as a numerical feature for each set of answers. In Figure 3, the sentiment polarity with respect to the TDO survey Question Q1 is shown through time. In particular, the polarity intensities for the five different epochs are shown in different colors. The results show an overall positive opinion about telemedicine. In Figure 4, the sentiment polarity with respect to the TDO survey Question Q2 is shown through time. Answers related to this question show a more negative polarity score with respect to Question Q1.

**Figure 3.** Sentiment polarity associated with Question Q1, visualized through time. Responses are represented by the relative patient id and the questionnaire session epoch, i.e., E1, E2, E3, etc. The y-axis reports the compound polarity score related to the patient's answer at epoch *Ej*, inferred by the adaptation of the VADER framework to the Italian language.

**Figure 4.** Sentiment polarity associated with Question Q2, visualized through time. Responses are represented by the relative patient id and the questionnaire session epoch, i.e., E1, E2, E3, etc. The y-axis reports the compound polarity score related to the patient's answer at epoch *Ej*, inferred by the adaptation of the VADER framework to the Italian language.

To further provide some insight into the latent aspects most frequently mentioned by patients, Figure 5 shows the 50 words most frequently used by patients. The set of free-text answers was pre-processed with standard NLP techniques, i.e., tokenization, stop-word removal, and lemmatization.

**Figure 5.** Word cloud showing the most frequent tokens with respect to the answers to question "What do you think about telemedicine?". Tokens with the largest font size are the most frequent.

The results show that "excellent", "useful", "tool", "health", and "patient" are the most common words in the patients' responses through time.
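The word cloud of Figure 5 can be reproduced from the pre-processed answers with the wordcloud package; the token lists below are hypothetical stand-ins for the lemmatized, stop-word-filtered responses.

```python
from collections import Counter
from wordcloud import WordCloud

# Hypothetical pre-processed (tokenized, stop-word-filtered, lemmatized) answers to Q1.
answers = [
    ["telemedicina", "strumento", "utile", "salute"],
    ["ottimo", "strumento", "paziente", "utile"],
    ["telemedicina", "salute", "ottimo"],
]
frequencies = Counter(token for answer in answers for token in answer)

# Keep the 50 most frequent tokens, as in Figure 5, and export the image.
cloud = WordCloud(max_words=50, background_color="white").generate_from_frequencies(frequencies)
cloud.to_file("wordcloud_q1.png")
```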

#### *3.2. Testing Granger-Causality*

Three time series were considered, i.e., the polarity scores related to Q1 and Q2 and the time series of Adh-scores. The augmented Dickey–Fuller test showed that the stationarity condition holds for all three considered time series (*p*-value = 4.6124 × 10<sup>−18</sup>, *p*-value = 3.2185 × 10<sup>−7</sup>, and *p*-value = 0.0035, respectively).

Two Granger-causality tests were performed to check whether Q1 Granger-causes the Adh-score and whether Q2 Granger-causes the Adh-score. Moreover, since all three series are considered contemporaneously, we also needed to check whether the Adh-score Granger-causes Q1 and whether the Adh-score Granger-causes Q2. Three different test statistics, i.e., *F*-test, chi2, and likelihood-ratio, were considered, with the number of lags varying from one to four. Tables 4 and 5 show the results in terms of *p*-values. It can be seen that both Q1 and Q2 Granger-cause the Adh-score for lag = 1. On the other hand, the Adh-score does not appear to Granger-cause Q1 or Q2.

Therefore, the results suggest the existence of a predictive relationship of both the polarity score series associated with Q1 and the polarity score series associated with Q2 with respect to the Adh-score.

**Table 4.** Q1 and Adh-score: *p*-value of Granger-causality test performed with three different statistics and four different lags.



**Table 5.** Q2 and Adh-score: *p*-value of Granger-causality test performed with three different statistics and four different lags.

#### *3.3. Discussion*

The survey instrument and the analysis pipeline were applied to a real case study related to the remote follow-up of patients with cystic fibrosis, held in collaboration with the Cystic Fibrosis Unit, at Children's Hospital "Bambino Gesù", Rome, Italy.

In particular, 169 online surveys sent by 38 patients enrolled in a home telemonitoring program provided by the Cystic Fibrosis Unit at the "Bambino Gesù" Children's Hospital in Rome, Italy, were collected and analyzed through five subsequent questionnaire submissions.

Only answers to two open-ended questions were considered, i.e., Q1 "What do you think about telemedicine?" and Q2 "Since you joined the telemonitoring program, what has improved the quality of your life?".

The time series of polarity scores inferred through the adaptation of VADER to the Italian language were used as numerical features in the Granger-causality testing model, in order to investigate whether a predictive relationship between the polarity scores of the open-ended questions and the Adh-score may exist.

The experimental results reported in Tables 4 and 5 therefore suggest, under a Granger-causality perspective, the existence of a predictive relationship between the polarity score series associated with Q1 and the Adh-score (lag = 1, *p*-value = 0.0339, statistic = Chi2 test) and between the polarity score series associated with Q2 and the Adh-score (lag = 1, *p*-value = 0.0016, statistic = Chi2 test).

The results are consistent with the hypothesis that the polarities extracted from patients' opinions on telemedicine may help predict their average adherence one epoch after the survey administration.

If further supported, these results may enable early, targeted, and individualized interventions to avoid drop-out and to keep patients in the home telemonitoring program, which represents a valuable component of care for patients with cystic fibrosis.

Moreover, recognizing early the reasons that lead a patient to drop out, and intervening immediately, may result in:


#### **4. Conclusions**

In the present paper, a system architecture for the extraction of emotional states from textual content, designed to support the monitoring of patients with chronic diseases, is presented. The main goal of the proposed system is to capture the underlying opinions that chronic patients have about the program in which they are enrolled and to investigate whether these features may help in the early prediction of patient drop-out from the telemedicine program.

The proposed system is designed in an end-to-end fashion to provide support throughout the whole process, including the implementation of the questionnaire, the survey administration at scheduled intervals, and the analysis of the responses. Specific contributions are:


In particular, in the present study, we focused on three variables modeled as time series: the polarity scores extracted from the responses to Question Q1, the polarity scores extracted from the responses to Question Q2, and an adherence score (Adh-score) defined for each epoch on the basis of the number of spirometries performed over a three-month window, as provided by the medical team.
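As the exact formula of the Adh-score is not reproduced here, the following sketch only illustrates one plausible construction, normalizing the number of spirometries observed in a three-month window by an assumed expected count; both the expected count and the window length are assumptions.

```python
# Illustrative sketch only: one plausible way to derive an adherence score per epoch
# from spirometry dates. The normalization against an expected count is an assumption.
import pandas as pd


def adh_score(spirometry_dates, epoch_end, expected_per_window=12, window_days=90):
    """Share of expected spirometries actually performed in the window ending at epoch_end."""
    dates = pd.to_datetime(pd.Series(spirometry_dates))
    window_start = pd.Timestamp(epoch_end) - pd.Timedelta(days=window_days)
    n_performed = ((dates > window_start) & (dates <= pd.Timestamp(epoch_end))).sum()
    return min(n_performed / expected_per_window, 1.0)


print(adh_score(["2020-01-05", "2020-02-10", "2020-03-01"], "2020-03-31"))
```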

Granger-causality tests were performed to assess whether a predictive relationship exists, suggesting one between the polarity score series associated with Q1 and the Adh-score (lag = 1, *p*-value = 0.0339, chi2 test) as well as between the polarity score series associated with Q2 and the Adh-score (lag = 1, *p*-value = 0.0016, chi2 test).

The present analysis is limited by the small amount of data collected to date, which does not allow the investigation of changes for a single patient over time.

Moreover, not every patient answered each survey session, which may affect the number of lags that can be reliably used in the Granger-causality tests.

Nevertheless, the promising results encourage us to further investigate the potential of the proposed architecture and analysis pipeline, with the aim of developing, as future work, a predictive system for the early detection of poorly adherent patients that may also alert doctors to contact patients and, where appropriate, update/personalize their telemedicine program (e.g., in terms of timing, technological equipment, or psychological counseling).

**Author Contributions:** C.Z. and M.C. conceived the main idea of the algorithm and designed the tests; S.B. and M.C. supervised the design of the system; C.Z. designed the system and ran the experiments; C.Z. and C.P. collected the data; investigation, S.G. All authors contributed to the preparation of the original draft. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**


**The Telemedicine Drop-Out (TDO) survey** 

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
