2.1. Social Media, Topics, and Trends
Social media (or social networking sites, SNS) are web-based services that allow individuals to create a profile, articulate a list of connections to other users, view and traverse those connections, and share content [10]. Although social media are today the primary source of information for many, the spread of information remains irregular, hard to predict, and largely incidental. Methods that aim to track and retrieve particular conversations must therefore be flexible and dynamic enough to follow them with minimal information loss.
Accurately tracking opinion over time has long been one of the main concerns of analysts [11,12,13,14]. With the advent of Twitter, public opinion can be tracked continuously and in real time. In fact, Twitter continuously publishes real-time lists of the most popular topics under discussion, either locally or globally (known as trending topics). Many users consider something to be news when it becomes a Twitter trending topic, even though the way Twitter produces these lists is unknown, as the algorithm used remains unpublished. It is relevant, however, that although Twitter is assumed to compute trends from every tweet published during a period, it only provides (via its API) a reduced sample of the entire stream for free. This limitation hinders anyone trying to extract trending topics from the whole data stream. Our work, however, does not aspire to replicate the way Twitter produces its trending topic lists, and assumes that working on a subset of the complete data is reasonable.
Thus, Twitter produces trending topic lists from the tweets of all users. However, the way information spreads in online social networks is more reminiscent of a complex contagion model, where information diffusion is affected not only by the number of exposures to a piece of information but also by exposure to multiple sources and their social influence [15]. Hence, users tend to follow other users on topics of their interest in order to acquire information on those topics. These latter users, opinion leaders or “authorities”, are sources of information for many individuals in this media environment and, consequently, become more influential, and the information they diffuse more “viral” [1].
As mentioned, our work aims not at detecting global trending topics but at obtaining the relevant topics of discussion among a set of opinion leaders that we call “authorities”, and adapting the tracked keywords accordingly to retrieve the most relevant information throughout the duration of the monitoring. As a result, we will be able to: (a) observe how the topics of interest addressed by a set of authorities evolve; and (b) extract the conversations linked to those topics, discarding any others, which will tell us how the authorities’ public discourse has changed over time.
2.2. Topic Tracking
Previous work has pointed out the difficulties of implementing reliable, precise, and fast trend detection [16]. The high volume of information, the myriad topics under discussion at any given time, and the significant variations in the time and volume scales of social datasets stand in the way of direct tracking. Many authors have addressed the problem of detecting new events from a stream [17,18]. The work by Petrovic et al. [19] described a method applicable to a stream of Twitter posts; although their system outperformed previous approaches, the results invariably included spurious information rather than only news events, as one would expect. One way to overcome the limitations related to the analysis and classification of corpora is to look at Twitter hashtags (keywords or terms starting with “#”), which are the most common feature users employ to connect and relate within a larger networked discourse [20]. D’heer et al. have shown that messages including such hashtags have, in general, more informational value than tweets without them; these labeled messages are also often longer, connected with other topics, and enriched with hyperlinks [21]. Additionally, as Enli and Simonsen [22] note, using hashtags is not an individual action: many actors, such as politicians, use them to reach beyond their own connections and contacts. This is an example of how the use of social media by many sectors is closely related to their professional practice.
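As a minimal illustration of why hashtags are a convenient tracking feature, the following sketch extracts them from raw tweet text. The helper name and the regular expression are our own illustrative choices; the pattern only approximates Twitter’s actual hashtag tokenization rules (which also handle Unicode word boundaries and exclusions such as all-numeric tags).

```python
import re

# Hypothetical helper: extract "#"-prefixed terms from tweet text.
# \w matches letters, digits, and underscores; this is a simplification
# of Twitter's real hashtag parsing rules.
HASHTAG_RE = re.compile(r"#(\w+)", re.UNICODE)

def extract_hashtags(text: str) -> list[str]:
    """Return the lowercased hashtags found in a tweet."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]

# Example: variants in several languages are captured uniformly.
tags = extract_hashtags("Happy #WorldEnvironmentDay! #DiaMundialDelMedioAmbiente")
# tags == ["worldenvironmentday", "diamundialdelmedioambiente"]
```

Lowercasing the extracted tags makes frequency counts case-insensitive, which matters because users capitalize hashtags inconsistently.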
Accurately tracking hashtags is thus critical to analyzing the discourse on Twitter. However, difficulties arise when the researcher needs to predict which hashtag or word the public will use in a particular situation, and whether it will change along the way, alternating with modified or customized hashtags; a flexible approach to tracking therefore becomes a requirement. In spite of this, many previous works follow a static tracking approach. For instance, Fano and Slanzi [23] used Twitter to monitor the discussion around a constitutional referendum held in Italy for five weeks. Even for such a potentially contentious debate, they opted for a manual selection of five hashtags. Similarly, Reyes-Menendez et al. [24] tracked a single hashtag, #WorldEnvironmentDay, to assess public opinion around that event. However, this captured only users who wrote hashtags in English, and therefore discarded variants or hashtags in other languages that could, individually or altogether, be even more widely used than the official tag. Takahashi et al. [25] likewise used four static hashtags to analyze Twitter messages during a natural disaster, even though the typhoon lasted five days. In their work on the EU 2014 election trends, Tsakalidis et al. [26] admit that when the selection of hashtags is static, missing data and losing track of the conversation are inevitable, even though they “aggregated tweets written in the respective language that contained a party’s name, its abbreviation, its Twitter account name and some possible misspells” and “excluded several ambiguous keywords in an attempt to reduce the noise”. These examples show that when only the expected generic terms or mainstream hashtags are used, relevant information is lost, as they leave little room for unexpected events, short-lived relevant topics, or unusual wordings. Along the same lines, it is worth noting how different languages, abbreviations, misspellings, and ambiguous keywords can be problematic under a static approach.
Within the general analysis of trending topics on Twitter, some authors have tried to analyze what makes a topic trend [27], what characterizes these topics [28], what different types of emerging topics exist [29], and even what characteristics are shared by the users who start or most influence the dissemination of trending topics, which has been referred to as Twitter trend demographics [30]. Nevertheless, only a limited number of works have tackled real-time topic detection on Twitter. Choi and Park [31] proposed a method to detect emerging topics on Twitter using high utility pattern mining (HUPM), which takes both the frequency of appearance and the utility of words into account. Although their approach gives good results when detecting topics in known datasets, it is not designed to dynamically use the resulting topics for extraction. With a different approach, the work of Adedoyin-Olowe et al. [32] aims to detect relevant events from a set of Twitter posts, but it displays a similar problem, as the analysis is conducted in postprocessing instead of detecting events in near real time and tracking them to extract all the relevant tweets produced by the users. Such adaptation is found in Gaglio et al. [33], who proposed a system able to progressively refine its query to include new relevant terms, reflecting the emergence of new topics or trends. In their conclusions, they also noted how “other systems were unable to capture the social aspects of the observed events [...] every time the users left the main topic and started to talk about unexpected events”. Their work, however, presents a couple of limitations that could have a significant impact in many contexts: an initial set of hashtags must be provided in advance, and the subsequent adaptation is based solely on the extracted tweets, without considering the relevance (or authority) of their authors. This could easily become an issue, as irrelevant hashtags can enter the system, gradually drifting toward diverging topics. Our proposed solution aims to overcome these shortcomings by proposing a method that works in real time, provides precise outputs, and can be adjusted to discard spurious emerging topics.
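The query-refinement idea discussed above can be sketched as follows. This is an illustrative simplification under our own assumptions, not the algorithm of Gaglio et al. nor our proposed method: it updates the set of tracked hashtags from each batch of tweets published by the monitored accounts, adding tags that become frequent and dropping those that fall silent. The thresholds are arbitrary placeholders.

```python
from collections import Counter

# Illustrative sketch (not the actual algorithm of any cited work):
# adapt the tracked keyword set from each new batch of hashtags observed
# in the monitored accounts' tweets.
ADD_THRESHOLD = 3   # occurrences in a batch needed to start tracking a tag
DROP_THRESHOLD = 0  # occurrences at or below which a tracked tag is dropped

def refine_tracked_tags(tracked: set[str], batch_hashtags: list[str]) -> set[str]:
    """Return the updated set of tracked hashtags after one batch."""
    counts = Counter(batch_hashtags)
    added = {tag for tag, n in counts.items() if n >= ADD_THRESHOLD}
    kept = {tag for tag in tracked if counts[tag] > DROP_THRESHOLD}
    return kept | added

# Example: "#debate" emerges (3 occurrences) while "#election" stays active.
updated = refine_tracked_tags(
    {"#election"},
    ["#election", "#debate", "#debate", "#debate"],
)
# updated == {"#election", "#debate"}
```

A refinement rule of this kind is exactly where the limitation noted above bites: because it weighs every tweet equally, a burst of irrelevant hashtags can enter the tracked set unless the authority of each tweet’s author is also taken into account.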
It is important to clarify that the problem our model solves is a simplification of the general problem of trend detection and monitoring. As previously mentioned, other works monitor hashtags or topics in the tweets published by any user, whereas we analyze only the tweets produced by a known set of users. This set of users is not necessarily fixed, as it can be customized, but it is quite stable and constrained. In any case, the size of the corpus is still enormous, and the variations in the time and volume scales of our datasets remain. Controlling these two aspects, and doing so dynamically in a single pass, is still challenging. As we will see in the next section, our solution provides excellent results without the need for any preprocessing or the creation/use of additional corpora.