Article

Exploring Reddit Community Structure: Bridges, Gateways and Highways

Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2024, 13(10), 1935; https://doi.org/10.3390/electronics13101935
Submission received: 14 April 2024 / Revised: 10 May 2024 / Accepted: 12 May 2024 / Published: 15 May 2024
(This article belongs to the Special Issue Advances in Graph-Based Data Mining)

Abstract

Multiple research directions have been proposed to study the information structure of Reddit. One of them is to model inter-subreddit relations by modeling user interactions in the form of a graph. Building upon prior work centered on political subreddits using pre-2020 data, we expand this investigation to include a more extensive dataset spanning 2022 and encompassing diverse topic areas. Employing NLP techniques such as text embeddings, we model subreddit content directly and construct a subreddit graph network based on cosine similarity. Community detection using the Louvain method reveals distinct subreddit communities and allows the analysis of inter-community connections via previous works’ concepts of “bridges” and “gateways”. Surprisingly, our findings indicate redundancy between bridges and gateways in the utilized dataset. Therefore, we introduce a new concept, “highways”. Highways, representing the most traversed paths between subreddits, unveil insights not captured by previous analyses, underscoring the significance of novel conceptual frameworks in uncovering latent knowledge within Reddit’s online community structures.

1. Introduction

Reddit serves as a widely accessible, large platform with distinct topic-specific discussion areas, known as subreddits [1]. These subreddits bring together users who share common interests, forming topical forums. Any user possessing an account that is at least 30 days old and a non-negative “karma score” (a metric denoting reputation) is allowed to establish a subreddit (https://www.reddit.com/r/help/comments/2yob6r/creating_a_subreddit accessed on 11 May 2024). Reddit’s administration imposes minimal restrictions on the subject matter of subreddits, allowing users meeting certain criteria to create them. This latitude fosters the organic development of “communities”, with minimal oversight.
Subreddits cover a wide range of topics [2]. The independence and large scale of subreddits facilitate scientific exploration, including studying popular subjects [2], finding thematically similar subreddits [3], grouping subreddits into clusters, and investigating possible transitions between these clusters [4]. The cluster detection and analysis of online communities has been widely studied with different applications related to echo chambers [5], political studies [6], community conflict modeling [7,8], conspiracy theories analysis [9], social role, and expert detection [10,11].
The specific focus of this study is the thematic transitions between communities. In this context, this work applies natural language processing and graph network methodologies, leveraging a comprehensive Reddit dataset spanning the whole year 2022 and over 3000 subreddits, to further knowledge about the information structure of subreddits. This research is heavily inspired by, and draws its methodology from, a study of bridges and gateways between subreddit communities (see [12]). However, it should be stressed that this is neither a reconstruction nor a verification of the original study; rather, it is an extension of the original methods with different approaches to relation modeling, network creation, and community detection.
The remaining parts of this work are organized as follows. Related works are presented in Section 2. The dataset is introduced in Section 3. Then, Section 4 elaborates on the methods used in this contribution, with Section 4.1 explaining the specifics of the Reddit (subreddit) structure. Further, Section 5 presents the findings resulting from the application of the proposed method. Moreover, Section 5.5 introduces the novel concept of “highways” and illustrates its usefulness. Finally, Section 6 presents concluding remarks.

2. Related Works

This work focuses on exploring the information structure of Reddit using graph networks. Specifically, it follows the avenue explored in [12], which was based on an earlier study of YouTube [13]. The methodology introduced in [12] is core to this contribution, so we recall it in what follows.
The research in [12] aimed to examine user traffic within Reddit communities through a network-oriented analysis. Its core methods combined user attention flow, used to represent user interactions, with community detection and graph-based modeling of inter-subreddit relations. The original work [12] modeled the subreddit–subreddit relation with “attention flow” (also called “user flow”), which defines how users interact with subreddits. This method captured “how much each subreddit contributes to the user base of another subreddit” in time. In other words, two subreddits were considered to be close if a large number of readers were subscribed to both. It also modeled the changes in these interactions over time. By first representing the interactions in a temporal network, and then aggregating it at each time step, a single network of the subreddit–subreddit relation was built. Further, this network was processed with a Stochastic Block Model (SBM; [14]) to uncover communities of nodes. Note that, here, a community is not a single subreddit but a group of subreddits, i.e., a group of nodes in the network. Having detected the communities, the authors describe an original method for subreddit community analysis. Their approach is as follows. Edges (transitions) between nodes can be traveled, for example by random walks. A random walk on a graph is a stochastic process in which a walker moves between vertices according to transition probabilities derived from the edge weights. Their theory, built on random walks, rests on two core concepts.
Definition 1.
Bridge node
A subreddit s, belonging to a community Y, is a bridge from the community X (for X ≠ Y) if it has a high probability of being reached by a random walk with a uniform probability of restart in all s ∈ X [12].
Definition 2.
Gateway node
A subreddit s ∈ X is a gateway for community X if it has a high probability of being reached by a random walk, with a uniform probability of restart, in all s ∉ X [12].
Figure 1 shows examples of bridges, gateways, and highways (See Section 5.5) given 4 communities: A, B, C and D.
Intuitively speaking, the gateways answer the question of how one “enters” a particular community or a set of nodes (subreddits). The bridges are the “transitional” nodes between two particular communities. Note that both concepts use the term “high probability”, which is not precisely defined, but rather indicates that “higher is better”.
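To make the two definitions concrete, the following minimal sketch approximates the visit probabilities of a random walk with restart on a toy graph. It is an illustration only, not the implementation used in [12] or in this work: the node names, the helper functions, the restart probability of 0.15, and the number of iterations are all assumptions made for the example.

```python
def build_adjacency(edges):
    """Build a symmetric weighted adjacency map from (u, v, weight) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def random_walk_with_restart(adj, restart_nodes, restart_prob=0.15, iterations=200):
    """Approximate visit probabilities of a walk restarting uniformly in restart_nodes.

    Assumes every node has at least one neighbor (true for build_adjacency output).
    restart_prob and iterations are illustrative values, not taken from the paper.
    """
    restart = {n: (1.0 / len(restart_nodes) if n in restart_nodes else 0.0) for n in adj}
    prob = dict(restart)
    for _ in range(iterations):
        # With probability restart_prob the walker jumps back to a restart node.
        nxt = {n: restart_prob * restart[n] for n in adj}
        for u, neighbors in adj.items():
            total = sum(neighbors.values())
            for v, w in neighbors.items():
                # Otherwise it follows an edge with probability proportional to its weight.
                nxt[v] += (1.0 - restart_prob) * prob[u] * w / total
        prob = nxt
    return prob


# Hypothetical toy graph: two triangles (communities X and Y) joined by one edge.
community_x = {"x1", "x2", "x3"}
community_y = {"y1", "y2", "y3"}
toy_graph = build_adjacency([
    ("x1", "x2", 1.0), ("x2", "x3", 1.0), ("x1", "x3", 1.0),
    ("x3", "y1", 1.0),
    ("y1", "y2", 1.0), ("y1", "y3", 1.0), ("y2", "y3", 1.0),
])

# Restart uniformly in X and look for the node of Y that is reached most often.
visit_prob = random_walk_with_restart(toy_graph, restart_nodes=community_x)
bridge = max(community_y, key=visit_prob.get)
print(bridge, round(visit_prob[bridge], 3))
```

In this toy graph the only other community is X, so the walk used to find the gateway of Y restarts in exactly the same nodes, and both definitions select the same node of Y.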
There are major differences between the original study and this contribution.
Firstly, the originally used dataset is focused only on political subreddits. Hence, only four communities have been detected: Liberal, Radical left, Alt-right, and Esoterism. This highly restricts the available thematic space, which has been shown to cover a far wider area than just politics [2]. This contribution will cover a wider range of topics; i.e., there is no restriction to only political subreddits. Moreover, we change the time windows from before 2020 to the year 2022.
Secondly, the graph creation method, as described in Section 4.3, is based on content that is directly modeled with NLP methods, not on user interactions. This decision follows the findings of our three previous studies [2,3,4], which yielded interesting results.
Thirdly, the communities within the graph networks will be unveiled using the Louvain method, which has recently been acclaimed as “the method of choice in social networks” [15] following the paradigm shift towards modularity-based methodologies.
Fourthly, this contribution is not a recreation or validation, nor does it in any other way aim to directly address the findings of the original work [12]. The original work is an inspiration. The theory of bridges and gateways provides the baseline for this analysis, which additionally introduces the “highways”.

3. Data for the Study

The initial dataset used in this research spans the whole year of 2022 and contains all subreddits with over 100,000 subscribers (totaling 5380). Here, recall that the original work covered only political subreddits (i.e., 492 selected subreddits).
An important step is filtering out the “Not Safe For Work” (“NSFW”) subreddits. These are subreddits with “18+” content. As found in previous works [4], NSFW subreddits are very popular on Reddit. However, they contribute little to this work and cloud the results. For example, the vertex (node) communities found with the Louvain method become difficult to characterize unambiguously if NSFW subreddits are included. Tests have been conducted on a dataset including NSFW subreddits. The communities detected with the Louvain method were much more ambiguous and were polluted with various NSFW subreddits. Furthermore, NSFW subreddits contain mostly pornography [4], and in practically all cases the posts have extremely short titles (sometimes just the performer name) and no further textual content, delivering just an image or a video. This also makes it extremely difficult to embed them meaningfully with language models. The post data contain an explicit field that says whether a post is NSFW. Therefore, subreddits with more than 80% of posts being NSFW are removed (a total of 1620; subreddits remaining for further analysis: 3760).
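The filtering rule can be sketched as follows. This is a minimal illustration under stated assumptions: the field name `is_nsfw`, the data layout, and the example subreddit names are hypothetical, and only the 80% threshold comes from the text.

```python
NSFW_THRESHOLD = 0.80  # from the text: remove subreddits with more than 80% NSFW posts


def nsfw_share(posts):
    """Return the fraction of posts flagged as NSFW (0.0 for an empty list)."""
    if not posts:
        return 0.0
    return sum(1 for post in posts if post["is_nsfw"]) / len(posts)


def filter_subreddits(posts_by_subreddit, threshold=NSFW_THRESHOLD):
    """Keep only subreddits whose NSFW share does not exceed the threshold."""
    return {
        name: posts
        for name, posts in posts_by_subreddit.items()
        if nsfw_share(posts) <= threshold
    }


# Hypothetical toy data; "is_nsfw" is an assumed name for the per-post NSFW flag.
toy_posts = {
    "r/example_sfw": [{"is_nsfw": False}] * 9 + [{"is_nsfw": True}],
    "r/example_nsfw": [{"is_nsfw": True}] * 9 + [{"is_nsfw": False}],
}
kept = filter_subreddits(toy_posts)
print(sorted(kept))
```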

Dataset Comparison with Original Work

The set of available subreddits contains only 69% of the subreddits from the original work [12]. After the elimination in preprocessing (as described in Section 4.2), this number drops to 45%.
The most likely reason for the missing subreddits in the current dataset is subreddit banning. Recall that the original dataset contained subreddits from the years 2017–2020, a time before the “2020 purge”. The “2020 purge” is an important event in Reddit history (https://www.reddit.com/r/announcements/comments/hi3oht/update_to_our_content_policy/ accessed on 11 May 2024), when Reddit administration banned 2000 subreddits, all at once, including large subreddits such as r/The_Donald and r/ChapoTrapHouse (both included in the original work). The official reasons were inactivity or violation of Reddit’s content policy (https://www.redditinc.com/policies/content-policy accessed on 11 May 2024). The information about a subreddit being banned/quarantined/private is available when opening the subreddit page, e.g., https://www.reddit.com/r/The_Donald/ (accessed on 11 May 2024). Of the remaining 31% of subreddits from the original study not covered in the 2022 dataset, 27% have been banned, 7% are private, and 2% are quarantined. Moreover, 35% have fewer than 1000 subscribers. This provides a logical justification for the dataset inconsistency between the current dataset (2022) and the original work dataset (pre-2020).

4. Methods

Let us now introduce the methods for this work.

4.1. Subreddit Structure

First, we describe the structure under study: the information and community structure of Reddit.
Reddit is split into subreddits, which are thematic communities defined by users. These subreddits serve as distinct realms for discussions, each dedicated to a specific topic [2,16]. Noteworthy examples include r/worldnews and r/news, focusing on global events and information dissemination; the gaming-oriented r/gaming, r/leagueoflegends, and r/Overwatch; the technology-focused r/gadgets, r/programming, and r/technology; the sports-related r/sports, r/nba, and r/soccer; and subreddits containing memes and comedy content, such as r/funny and r/memes. The diversity extends to subreddits with contrasting themes, as evident in r/Conservative and r/Libertarian. This decentralized structure allows for organic community growth, as any user meeting basic criteria can create a subreddit. The end result is a diverse mix of topics on Reddit, creating many conversations and connections across the platform.

4.2. Modeling Relations with Natural Language Processing

Following the findings from three previous works [2,3,4], the method for converting subreddit content to networks starts with the embedding of textual content.
To accomplish this task, we utilize a language model to process posts from all relevant subreddits. A general-purpose model that is both efficient and accurate is necessary for this purpose. Among the options available, BERT [17], which was introduced in 2018, stands out as the most popular choice in this context. Additionally, in 2019, a scaled-down version of BERT known as DistilBERT was introduced. This variant maintains 97% of BERT’s original performance on downstream tasks while being 40% smaller and 60% faster [18]. The speed of the model is especially important due to large volumes of data and the limited time for this research. By employing DistilBERT, the textual content of subreddit posts can be converted into meaningful numerical vectors.
Then, the relation between subreddits can be found by comparing the similarity of the created vectors. Here, typically, cosine similarity is used. This metric has previously been shown to have great potential in text embedding applications on Reddit [3,19]. Since each pair of vectors has some similarity (lower or higher), only the top vector pairs are considered (here, those above the 97th percentile). This is consistent with the approach in the original work [12], where the same similarity measure has been applied. The embedding process is performed on the top 1000 posts from each subreddit by post score. A post score is a user appreciation mechanism on Reddit. Hence, the higher the score, the higher the appreciation. After pruning the least important relations (the lowest-scoring similarities) in this way, 2901 subreddits remain.
These selected relations are used to build graphs.

4.3. Graphs Creation Process

Once the subreddit content from posts is available in its vector (embedded) form, the vectors are used to build graphs.
A graph is a set of vertices (nodes) V and a set of edges E, where each edge connects a pair of vertices (nodes), thereby establishing relationships or associations between the elements of the vertex (node) set. Transforming subreddits to nodes results in a simple graph without edges. The relations between subreddits can be modeled with user interactions [20] and crossposts [3], but also with the content directly [4]. In this contribution, the subreddit relations are modeled based directly on the subreddit content. This approach not only showed good results in subreddit clustering but also skips the ‘middle-man’: it models the information itself, not the information carriers, who are mostly anonymous on Reddit anyway (https://www.reddit.com/r/privacy/comments/86e3o4/is_reddit_anonymous/ accessed on 11 May 2024). This relation is modeled using advanced natural language processing and text content embeddings as described in Section 4.2.
So subreddits become nodes (a total of 2901). Each subreddit’s content is embedded into a vector. Each subreddit’s vector is compared with every other subreddit’s vector, and their similarity is measured with cosine similarity. Cosine similarity is a standard measure for textual embedding vector comparison [7,21]. The subreddits with the highest cosine similarity are joined with an edge (a total of 228,206 edges). This creates an undirected graph with neither loops nor multi-edges.
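The construction can be illustrated with a minimal sketch on toy vectors. The three-dimensional embeddings and subreddit names below are hypothetical, and the nearest-rank percentile cutoff is an assumption about how “the top pairs” are selected; the text only states that cosine similarity and the 97th percentile are used.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_similarity_graph(embeddings, percentile=97.0):
    """Return undirected edges (a, b, similarity) at or above the percentile cutoff.

    The cutoff uses the nearest-rank percentile, which is an assumption of this sketch.
    """
    pairs = [
        (a, b, cosine_similarity(embeddings[a], embeddings[b]))
        for a, b in combinations(sorted(embeddings), 2)
    ]
    sims = sorted(s for _, _, s in pairs)
    rank = max(1, math.ceil(percentile / 100.0 * len(sims)))
    cutoff = sims[rank - 1]
    return [(a, b, s) for a, b, s in pairs if s >= cutoff]


# Hypothetical toy embeddings standing in for per-subreddit content vectors.
toy_embeddings = {
    "r/a": [1.0, 0.0, 0.0],
    "r/b": [0.9, 0.1, 0.0],
    "r/c": [0.0, 1.0, 0.0],
    "r/d": [0.0, 0.1, 1.0],
}
edges = build_similarity_graph(toy_embeddings, percentile=97.0)
print([(a, b) for a, b, _ in edges])
```

Because each unordered pair is generated once and a node is never paired with itself, the resulting edge list has no loops or multi-edges, matching the graph described above.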

4.4. Establishing Graph Communities

Now, the nodes can be grouped into communities. As mentioned, “the method of choice in social networks” [15] is the Louvain method.
The Louvain method, originally proposed in 2008 [22], is a community detection algorithm designed for identifying modular structures in complex networks. It aims to maximize the modularity of a network by iteratively optimizing the assignment of nodes to communities. The algorithm consists of two phases: a local optimization step at the level of individual nodes and a global optimization step to merge communities. The Louvain method is known for its efficiency in handling large-scale networks and its ability to uncover hierarchical community structures [23]. Therefore, in this work, using a community detection algorithm (Louvain method), the subreddit nodes are aggregated into node communities.
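As a small illustration of the quantity being maximized, the sketch below computes the modularity of a given partition of an unweighted, undirected toy graph. It does not implement the Louvain method itself, only its objective; the toy graph and function name are assumptions made for the example.

```python
def modularity(edges, communities):
    """Modularity of a partition of an unweighted, undirected graph.

    Uses Q = sum over communities c of [ L_c / m - (d_c / (2 m))^2 ], where m is
    the number of edges, L_c the number of edges inside c, and d_c the sum of
    the degrees of the nodes in c.
    """
    m = len(edges)
    community_of = {
        node: index for index, members in enumerate(communities) for node in members
    }
    intra_edges = [0] * len(communities)
    degree_sum = [0] * len(communities)
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            intra_edges[community_of[u]] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(intra_edges, degree_sum))


# Hypothetical toy graph: two triangles joined by a single edge.
toy_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("c", "d"),
    ("d", "e"), ("e", "f"), ("d", "f"),
]
q_split = modularity(toy_edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_single = modularity(toy_edges, [{"a", "b", "c", "d", "e", "f"}])
print(round(q_split, 4), round(q_single, 4))
```

Splitting the graph into its two triangles scores higher than placing all nodes in one community, which is the kind of improvement the local and merging phases search for.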
Then, the graph is analyzed with the original paper’s [12] methodology. The analysis leads to the following findings.

5. Findings

Having introduced the dataset and methods, we move on to the findings.

5.1. Communities

We have been able to identify 170 communities. The largest communities (of size at least three) are summarized in Table 1. For large communities (e.g., one containing 484 subreddits), a random sample is shown.
The largest detected community in the subreddit network is related to a vague topic of lifestyle. It contains subreddits about relationship and dating (r/datingoverthirty), parenthood (r/Mommit, r/pregnant), and home improvements (r/HomeDecorating) but also craftsmanship (r/woodworking) and fashion (r/HairDye, r/Skincare_Addiction). Next, there are several very large communities concerning comedy, including memes (e.g., r/memes), funny videos (e.g., r/MemeVideos), funny pictures (e.g., r/surrealmemes), and other media aiming to amuse, amaze or generally spark interest in the viewer (e.g., r/blackmagicfuckery). Notably, there are also multiple large communities of gaming subreddits. They are either related to a particular game (e.g., r/LeagueOfLegends), console (e.g., r/xboxone), or general gaming topics (e.g., r/games, r/gaming). One of the communities is strongly dedicated to sports teams (r/falcons, r/buffalobills, etc.) There are also location-based communities with different cities (r/london), states (r/massachusetts), or countries (r/ireland). The communities also have a separate place for politics (r/conservatives, r/democrats) and news (r/usanews). There is one distinct community about celebrities (r/GalGadot, r/ScarlettJohansson). There are also several more topic-oriented communities, as summarized in Table 1.
The general categories and topics of the communities are similar to previous studies [3,4,24,25,26], but there are some differences. Previous studies’ [4] top community (cluster) categories were pornography, pictures, games, memes, mixed, tech, social chatting, TV series, animals, politics, music, and sports. Here, we detect major categories to focus on: lifestyle, games, location-based subreddits, comedy (memes), animals, politics, pop culture, technology, and finance.
Most of the communities’ compositions are self-explanatory, but there are a few unusual ones. The lifestyle-related community contains subreddits related to the games r/Sims4 (a life simulation game) and r/StardewValley (a gardening simulation game). This shows a text-embedding similarity between real-life phenomena and simulated gameplay. This particular observation is consistent with previous research [4]. One of the gaming-related communities contains war and shooting games such as r/Blackops4 and r/Warframe. It also contains a World War II subreddit (r/WWII) and a machine gun subreddit (r/ar15). The programming cluster with several different programming technologies contains two job-seeking subreddits (r/jobs and r/careeradvice), showing that Reddit users looking for job-related topics are particularly close to technology and programming topics. There is also a small but distinct community of military subreddits (r/JustBootThings, r/MilitaryStories, r/army), which also contains the r/HiTMAN—a subreddit about a video game where the player becomes a paid assassin. This shows the (relatively easy-to-explain) topical closeness of these subreddits.
More observations focused on clusters themselves can be mined. They could also be researched from different perspectives, e.g., sociology and/or political science. However, this contribution aims to cover the intersubreddit relations with data science methods. Hence, this topic will be addressed next, starting with gateway nodes and their respective communities in Section 5.2, then moving to bridges and the communities they connect in Section 5.3.
Note that the definitions in Section 2 do not prevent a gateway node from also being a bridge node, or vice versa. A noticeable overlap between gateways and bridges was detected and will be discussed in Section 5.4.

5.2. Gateways

This section elaborates on the gateways findings and addresses the challenge of presenting all gateways of all 170 communities, the comprehensive inclusion of which would unduly lengthen this paper. To address this, gateways are selected and categorized based on the overarching themes of the communities.
Firstly, an illustrative top-down example is provided, wherein the gateway serves as a representative entry point to the entire community.
For instance, gateways to different political and news communities, namely r/uspolitics, r/AnythingGoesNews, r/PoliticalCompassMemes, and r/seculartalks, embody the ‘general-purpose’ nature within the politics and news sphere. Another instance is r/funnyvideos. It is a gateway to the animal community, encapsulating subreddits like r/FunnyAnimals and r/MadeMeSmile, which collectively comprise a community dedicated to amusing and heartening content.
Let us now discuss examples of interesting gateways. The primary gateways to the gaming community typically involve r/starcraft, r/runescape, and r/MonsterHunterWorld. These subreddits do not represent the general gaming theme, but topic-wise, they are the ‘middle-man’ if one wants to enter the gaming content area. Next, the sport-related community is often entered through the subreddit r/MaddenUltimateTeam, a subreddit focusing on the Madden Ultimate Team game mode. Additionally, entry into the Asian gaming community is through the subreddits r/Kirby, r/Guiltygear, and r/JonTron. The first two are games produced by Asian companies. The latter subreddit is dedicated to the YouTuber, comedian, and Internet content creator JonTron. This reveals a noteworthy intersection between the creator’s content and the Asian video games. Further examples include gateways to the pop culture community, encompassing series, movies, and music, such as r/DeathStranding (a video game), r/futurama (a cartoon series), and r/TopGear (a TV series).
In smaller communities, i.e., those with fewer than 100 subreddits, intuitive examples include r/FL_Studio (a music software) serving as the gateway to the musician and music production community, as well as r/iWallpaper (a subreddit aggregating wallpapers), which has been identified as the entry point to the celebrity-themed community. Additionally, the communities focused on food and cooking can be most easily accessed through r/grilling, r/spicy, and r/Breadit (all of those are cooking-related subreddits).
There are countless other examples, but introducing them and putting them in the necessary context might require expert knowledge. However, to further highlight the uniqueness of gateways, an automated graph structure method was used. Node embeddings, specifically Node2Vec [27], were employed to assess the structural similarity between gateway and non-gateway nodes. The results indicate that it is possible to capture the dissimilarity between these two classes of subreddits. Specifically, the average similarity within the gateway nodes set is at 0.56. It is 500% larger than the similarity observed among non-gateway nodes (0.09). This underscores the structural distinctiveness of gateway nodes.

5.3. Bridges

There are only a couple of examples of notable bridges that are not gateways (this phenomenon is explained in Section 5.4). The community of music subreddits (including r/musicproduction and r/WeAreTheMusicMakers) is bridged with a community of technology and finance subreddits (including r/javascript, r/finance, r/smallstreetbets, and r/excel) by r/thepiratebay (a platform for torrent sharing). The same technology- and finance-related community is joined with a new community by the subreddit r/Daytrading (a subreddit about trading on a stock market). The sport and sports teams community (e.g., r/falcons, r/NYKnicks, r/FIFA, and r/fantasybball) is connected with the “Ask” subreddits (r/AskEurope, r/AskNYC, and r/TooAfraidToAsk) by r/formuladank (a subreddit described as “Formula1 shitposts”).
Another news community (e.g., r/AnythingGoesNews, r/neutralnews, r/news) is connected to the previously mentioned sport and sports teams community by r/PublicFreakout (a subreddit containing videos of manifestations of public agitation or disturbance).
There is also an interesting case of an alien- and UFO-related community (e.g., r/UFOs, r/aliens, r/ufo). Even though r/UnexplainedPhotos is not its gateway, it is a bridge between this community and two other communities: a community containing animal and other amusing content (r/MadeMeSmile, r/cats, r/Thisismylifemeow) and a general lifestyle community (r/HomeDecorating, r/wedding, r/dating, r/houseplants, and r/naturalbodybuilding).
Furthermore, two subreddits that appear as bridges in this work were also mentioned as bridges in the original work [12]. Subreddit r/WikiLeaks bridges a community of Western-world location-based subreddits (r/boston, r/london, r/LosAngeles, r/Utah) and news subreddits (e.g., r/AnythingGoesNews, r/neutralnews, r/news). Subreddit r/MarchAgainstNazis bridges a political community (e.g., r/conservatives, r/DemocraticSocialism, r/SandersForPresident, r/ENLIGHTENEDCENTRISM, and r/MURICA) with several other subreddit groups: the news community and the “Ask” subreddits, both mentioned above.

5.4. The Main Observation: Bridges Are Gateways

Having discussed the most interesting gateways and bridges, let us now move to the core meta-finding of this contribution, i.e., that gateways and bridges overlap. This phenomenon was not accounted for in the original work [12]. Figure 2 and Figure 3 show the number of bridges that are also gateways, grouped by probability score (the percentage of random walks that pass through them).
It is evident that nodes most likely to be bridges (i.e., with a likelihood exceeding 20%) demonstrate a noteworthy correlation with gateways, with the majority of such nodes falling within the range of 75% to 100% likelihood. In other words, it is extremely likely that a bridge node is also a gateway node. This pivotal observation suggests a redundancy between bridges and gateways theory in this dataset. Consequently, this renders the analysis of bridges on this dataset practically unnecessary. Hence, in Section 5.3, we only cover the most interesting bridges that were not gateways.
Moreover, this inference aligns with logical reasoning when examining the definitions. If node G serves as a gateway to community C₁, it implies that a substantial number of random walks originating from nodes outside community C₁ pass through the node G to access community C₁. In contrast, if bridge B connects two communities, say C₁ and C₂, it means that numerous random walks, predominantly from community C₂, traverse node B to reach community C₁. The differentiating factor between bridges and gateways lies in their starting points. However, since community C₂ is distinct from community C₁, given our consideration of solely disjoint communities, nodes from C₂ may appear as starting nodes for gateway-related random walks. Consequently, it is plausible for a bridge to function as a gateway. Notably, in this dataset, nearly all bridges are observed to also operate as gateways.
This may indicate that a simple ‘single node’ analysis is not sufficient in this work. Therefore, we introduce a new concept, ‘highways’, with the aim to analyze sets of nodes (paths) to develop a deeper understanding of Reddit’s network.

5.5. Highways

Let us now introduce a new concept—highways. Highways are based on the shortest paths in a graph. The shortest path between two vertices (nodes) in a weighted undirected graph is defined as the path (sequence of vertices connected with edges) with the minimum sum of weights among all possible paths connecting those two vertices. The highways aim to examine global transition routes between all graph communities (information areas).
Definition 3.
Highway
A path of length N is a highway if it appears in the highest number of shortest paths in the graph.
In other words, to determine a highway of length N, consider all the shortest paths. Find all subpaths of length N in them. Count how many times each subpath occurs. The highway is the subpath of a given length with the highest number of occurrences.
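The counting procedure can be sketched on a small unweighted graph. This is a minimal illustration under stated assumptions: shortest paths are found with breadth-first search (so all edge weights are treated as equal), path length is counted in nodes, each undirected path is counted once, and the toy graph and function names are hypothetical.

```python
from collections import Counter, deque


def all_shortest_paths(adj, source):
    """Return every shortest path (tuple of nodes) from source to each other node."""
    dist = {source: 0}
    preds = {source: []}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                preds[v] = [u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                preds[v].append(u)  # another equally short way to reach v

    def expand(node):
        if node == source:
            return [(source,)]
        return [path + (node,) for p in preds[node] for path in expand(p)]

    return [path for target in dist if target != source for path in expand(target)]


def find_highway(adj, n):
    """Return (subpath, count) for the n-node subpath occurring in most shortest paths."""
    counts = Counter()
    for source in adj:
        for path in all_shortest_paths(adj, source):
            if path[0] > path[-1]:
                continue  # the reversed copy is counted from the other endpoint
            for i in range(len(path) - n + 1):
                sub = path[i:i + n]
                counts[min(sub, sub[::-1])] += 1  # direction-independent key
    return counts.most_common(1)[0] if counts else None


# Hypothetical toy graph: a simple chain a - b - c - d - e.
chain = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
highway = find_highway(chain, 3)
print(highway)
```

In the chain, the middle segment lies on more shortest paths than either end segment, so it is reported as the highway of three nodes.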
As part of this work, highway theory is tested to see whether it is redundant or adds new information relative to the previous gateway and bridge theory. Table 2 presents the top highways by the highest percentage of communities they connect. Specifically, we find all the shortest paths of at least a given length (‘length’ column in Table 2) between each community pair. Then, we measure how many of these shortest paths contain the highway of that length. For example, the percentage value for a highway of length 3 considers only community pairs for which the shortest path has a length of at least 3. This calculation is performed because the average shortest path length of this graph is 2.7, and 83% of community pairs have a shortest path of length 2 (counting the starting and ending nodes). So most of the communities are ‘neighbors’ in terms of node distances. This is logical since the Louvain method assigns each node of a graph to a (single) community, so all communities have ‘community neighbors’ (in a connected graph).
Moving on to the results, the top of the ranking presented in Table 2 contains only highways of length 3 and 4. Two larger groups of subreddits can be identified here. The most traversed highway consists of r/DigitalArt, r/SpecArt, and r/ImaginaryMindscapes. All of these are artistic subreddits containing mostly pictures of drawings and paintings, both analog and digital. These subreddits also appear in the second most traversed highway, where they are accompanied by r/CurrentlyTripping, which contains abstract art and “pictures worth seeing after drug usage” (a formal paraphrase of the subreddit description). The third place in the ranking contains subreddits r/ActualPublicFreakouts, r/iamatotalpieceofshit, and r/instantkarma, which contain mostly pictures or videos of everyday situations where a person or people are doing something that the viewer may consider unethical or publicly unacceptable. A mid-ranking highway worth mentioning is the musical highway r/electronicmusic, r/dreampop, r/LofiHipHop, and r/EDM. The bottom of the ranking mostly contains highways consisting of the “ask” subreddits, such as r/AskMenOver30, r/AskWomenOver30, r/ask, and r/ASKUK. These appear on highways in tandem with introvert subreddits (r/introvert, r/introverts), but also with adult subreddits such as r/nonmonogamy and r/RedditForGrownups.
The findings presented herein reveal new subreddits that play a crucial role in traversing the content-based Reddit network. Furthermore, highways provide information on subreddit transitions and consider not singular subreddits but sets (sequences) of them. Notably, experimentation conducted on the Reddit network demonstrates that highways can expose outcomes that do not manifest as either gateways or bridges. This discovery underscores the theoretical significance of highways in social network exploration. Subsequent research endeavors will aim to develop the broader impact of highways.

6. Concluding Remarks

In conclusion, this work contributes to the ongoing analysis of Reddit’s information structure, leveraging natural language processing and graph network modeling. By embedding textual content from various subreddits using NLP models, it was possible to construct a graph representation of the relations between subreddits.
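As a rough illustration of how a subreddit graph can be derived from text embeddings with cosine similarity, consider the sketch below. The embedding vectors, the similarity threshold of 0.8, and the function names are assumptions made for this example only; they are not taken from the reported experiments.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero embedding as dissimilar to everything
    return dot / (norm_u * norm_v)


def build_similarity_graph(embeddings, threshold):
    """Return weighted edges {(a, b): similarity} for pairs at or above threshold."""
    edges = {}
    for a, b in combinations(sorted(embeddings), 2):
        sim = cosine_similarity(embeddings[a], embeddings[b])
        if sim >= threshold:
            edges[(a, b)] = sim
    return edges


# Hypothetical 3-dimensional embeddings; real text embeddings have far more dimensions.
embeddings = {
    "DigitalArt": [0.9, 0.1, 0.0],
    "SpecArt": [0.8, 0.2, 0.1],
    "EDM": [0.0, 0.1, 0.9],
}

edges = build_similarity_graph(embeddings, threshold=0.8)
print(edges)
```

With these toy vectors, only the two art subreddits are similar enough to be connected, while the music subreddit remains isolated. How the threshold is chosen determines the density of the graph, so any value used in practice would need to be justified against the data.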
Conceptually following and extending earlier studies, the reported research focused on recognizing the existence of “bridges” and “gateways”. To do so, a dataset comprising all Reddit posts from 2022 was used, and the Louvain method was applied to recognize “topical communities”. The following general findings were uncovered.
The results point to clearly distinguishable groups of subreddits devoted to lifestyle, comedy, politics, games, location-specific topics, technology, finance, and more. Here, the results go beyond the earlier work, as the dataset used included all subreddits. Interestingly, it was found that the proposed approach spreads NSFW (Not Safe For Work) subreddits across multiple communities. However, this observation would require further in-depth study, which was outside the scope of this work.
Next, the performed analysis revealed that bridges and gateways, which were pointed to as essential components of Reddit’s informational structure, became redundant when the “complete dataset” was considered. More precisely, it has been observed that utilizing only gateways is “sufficient”. Upon further reflection, this seems quite natural, considering the definitions of these terms and the way that Reddit content is created.
In this context, the concept of “highways” has been introduced. Highways offer additional insights into the informational structure of Reddit. They serve as pathways that capture and represent “connections” and “information flow” between communities. In this way, they enrich the understanding of Reddit when it is modeled as a complex graph network. The highways uncovered new subreddit relations that were absent from the bridge and gateway analysis.
As a by-product, it has been discovered that a significant number of subreddits “disappeared” between this contribution and the original work [12]. The reason for this can be attributed, primarily, to subreddit bans, enacted in 2020. These missing communities underscore the dynamic nature of Reddit and the impact of external factors on its content landscape. However, what is also worth pointing out is that, when comparing the results, the main communities found in [12] still exist. This, on the other hand, illustrates the long-term stability of the informational structure of Reddit.
Moving forward, the reported research opens new avenues for further exploration into online community dynamics and can have broader implications for social network analysis.

Author Contributions

Conceptualization, J.S.; methodology, J.S.; software, J.S.; validation, M.G.; formal analysis, J.S.; investigation, J.S.; resources, M.G.; data curation, J.S.; writing—original draft preparation, J.S.; writing—review and editing, J.S. and M.G.; visualization, J.S.; supervision, M.G.; project administration, M.G.; funding acquisition, M.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Acknowledgments

This research was carried out with the support of the Laboratory of Bioinformatics and Computational Genomics and the High Performance Computing Center of the Faculty of Mathematics and Information Science, Warsaw University of Technology.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Proferes, N.; Jones, N.; Gilbert, S.; Fiesler, C.; Zimmer, M. Studying reddit: A systematic overview of disciplines, approaches, methods, and ethics. Soc. Media+ Soc. 2021, 7, 20563051211019004.
  2. Sawicki, J.; Ganzha, M.; Paprzycki, M.; Bădică, A. Exploring usability of Reddit in data science and knowledge processing. arXiv 2021, arXiv:2110.02158.
  3. Sawicki, J.; Ganzha, M.; Paprzycki, M.; Watanobe, Y. Reddit CrosspostNet—Studying Reddit Communities with Large-Scale Crosspost Graph Networks. Algorithms 2023, 16, 424.
  4. Sawicki, J. Text embeddings and clustering for characterizing online communities on Reddit. In Proceedings of the 18th Conference on Computer Science and Intelligence Systems (FedCSIS), Warsaw, Poland, 17–20 September 2023; pp. 17–20.
  5. Cinelli, M.; De Francisci Morales, G.; Galeazzi, A.; Quattrociocchi, W.; Starnini, M. The echo chamber effect on social media. Proc. Natl. Acad. Sci. USA 2021, 118, e2023301118.
  6. Enli, G. Twitter as arena for the authentic outsider: Exploring the social media campaigns of Trump and Clinton in the 2016 US presidential election. Eur. J. Commun. 2017, 32, 50–61.
  7. Curiskis, S.A.; Drake, B.; Osborn, T.R.; Kennedy, P.J. An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit. Inf. Process. Manag. 2020, 57, 102034.
  8. Habib, H.; Musa, M.B.; Zaffar, M.F.; Nithyanand, R. Are Proactive Interventions for Reddit Communities Feasible? In Proceedings of the International AAAI Conference on Web and Social Media, Atlanta, GA, USA, 6–9 June 2022; Volume 16, pp. 264–274.
  9. Monti, C.; Cinelli, M.; Valensise, C.; Quattrociocchi, W.; Starnini, M. Online conspiracy communities are more resilient to deplatforming. PNAS Nexus 2023, 2, pgad324.
  10. Buntain, C.; Golbeck, J. Identifying social roles in reddit using network structure. In Proceedings of the 23rd International Conference on World Wide Web, Seoul, Republic of Korea, 7–11 April 2014; pp. 615–620.
  11. Strukova, S.; Ruipérez-Valiente, J.A.; Gómez Mármol, F. Computational approaches to detect experts in distributed online communities: A case study on Reddit. Clust. Comput. 2023, 27, 2181–2201.
  12. Rollo, C.; De Francisci Morales, G.; Monti, C.; Panisson, A. Communities, gateways, and bridges: Measuring attention flow in the reddit political sphere. In Proceedings of the International Conference on Social Informatics, Glasgow, UK, 19–21 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 3–19.
  13. Fabbri, F.; Wang, Y.; Bonchi, F.; Castillo, C.; Mathioudakis, M. Rewiring what-to-watch-next recommendations to reduce radicalization pathways. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 2719–2728.
  14. Abbe, E. Community detection and stochastic block models: Recent developments. J. Mach. Learn. Res. 2018, 18, 1–86.
  15. Cohen-Addad, V.; Kosowski, A.; Mallmann-Trenn, F.; Saulpic, D. On the power of louvain in the stochastic block model. Adv. Neural Inf. Process. Syst. 2020, 33, 4055–4066.
  16. Li, S.; Xie, Z.; Chiu, D.K.; Ho, K.K. Sentiment analysis and topic modeling regarding online classes on the Reddit Platform: Educators versus learners. Appl. Sci. 2023, 13, 2250.
  17. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
  18. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108.
  19. Wang, Z.; Rastorgueva, E.; Lin, W.; Wu, X. No, you’re not alone: A better way to find people with similar experiences on Reddit. In Proceedings of the 5th Workshop on Noisy User-Generated Text (W-NUT 2019), Hong Kong, China, 4 November 2019; pp. 307–315.
  20. Olson, R.S.; Neal, Z.P. Navigating the massive world of reddit: Using backbone networks to map user interests in social media. PeerJ Comput. Sci. 2015, 1, e4.
  21. Neelakantan, A.; Xu, T.; Puri, R.; Radford, A.; Han, J.M.; Tworek, J.; Yuan, Q.; Tezak, N.; Kim, J.W.; Hallacy, C.; et al. Text and code embeddings by contrastive pre-training. arXiv 2022, arXiv:2201.10005.
  22. Blondel, V.D.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008, 2008, P10008.
  23. Lancichinetti, A.; Fortunato, S. Community detection algorithms: A comparative analysis. Phys. Rev. E 2009, 80, 056117.
  24. Singer, P.; Flöck, F.; Meinhart, C.; Zeitfogel, E.; Strohmaier, M. Evolution of reddit: From the front page of the internet to a self-referential community? In Proceedings of the 23rd International Conference on World Wide Web, Seoul, Republic of Korea, 7–11 April 2014; pp. 517–522.
  25. Soliman, A.; Hafer, J.; Lemmerich, F. A characterization of political communities on reddit. In Proceedings of the 30th ACM Conference on Hypertext and Social Media, Hof, Germany, 17–20 September 2019; pp. 259–263.
  26. Barnes, K.; Riesenmy, T.; Trinh, M.D.; Lleshi, E.; Balogh, N.; Molontay, R. Dank or not? Analyzing and predicting the popularity of memes on Reddit. Appl. Netw. Sci. 2021, 6, 1–24.
  27. Grover, A.; Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 855–864.
Figure 1. An example network with assumed communities A, B, C, and D, with an example of a bridge between communities A and D, a gateway of community B, and a highway of length 2.
Figure 2. Bridges that are gateways according to probability score (below probability score 0.2).
Figure 3. Bridges which are gateways by probability score (probability score above 0.2).
Table 1. Selected subreddits in detected communities.
| Size | Subreddits | Description |
|---|---|---|
| 484 | interiordecorating, HomeDecorating, Skincare_Addiction, SkincareAddicts, woodworking, HairDye, Parenting, Tools, succulents, crafts, gardening, Adulting, Mommit, pregnant, Carpentry, … | lifestyle |
| 365 | SWGalaxyOfHeroes, Eve, xboxone, xbox, Rainbow6, classicwow, starcraft, MagicArena, farcry, GrandTheftAutoV, lostarkgame, MonsterHunterWorld, indiegames, … | Games |
| 230 | boston, raleigh, Dallas, Austin, Sacramento, Portland, wisconsin, Ohio, pittsburgh, philadelphia, Charlotte, orlando, rva, baltimore, nova, NewOrleans, longbeach, Connecticut, Denver, NorthCarolina, southcarolina, … | location-based subreddits |
| 196 | MemeVideos, cursedmemes, AvatarMemes, marvelmemes, dankmemes, TheRealJoke, suspiciouslyspecific, AccidentalComedy, beetlejuicing, Dank, SequelMemes, starwarsmemes, engrish, HolUp, Funnymemes, … | comedy |
| 189 | catpics, catpictures, germanshepherds, goldenretrievers, dogpictures, lookatmydog, PuppySmiles, BirdsBeingDicks, AnimalsBeingDerps, Dachshund, rarepuppers, DOG, blackcats, cats, reptiles, snakes, … | animals |
| 174 | libertarianmeme, LateStageImperialism, conservatives, Fuckthealtright, tucker_carlson, Anarcho_Capitalism, Trumpvirus, MarchAgainstNazis, PoliticalHumor, DankLeft, WayOfTheBern, DemocraticSocialism, … | politics |
| 158 | 90DayFiance, futurama, DeathStranding, ArcherFX, breakingbad, howyoudoin, Wicca, StrangerThings, lucifer, adventuretime, Supernatural, BobsBurgers, Scrubs, SchittsCreek, boburnham, stephenking, … | popculture |
| 155 | smallstreetbets, web_design, webdev, golang, java, realestateinvesting, learnjavascript, csharp, stocks, investing, financialindependence, androiddev, software, WorkOnline, AusFinance, Bogleheads, ValueInvesting, technews, technology, tech, devops, Frontend, MacOS, linuxadmin, … | technology and finance |
| 145 | ShingekiNoKyojin, Guiltygear, Kirby, fromsoftware, DragonballLegends, Behzinga, bleach, Deltarune, Persona5, PERSoNA, Dragonballsuper, FrankOcean, blackdesertonline, castlevania, MortalKombat, yakuzagames, … | gaming and music |
| 131 | yesyesyesyesno, IdiotsNearlyDying, nonononoyes, combinedgifs, perfectlycutscreams, ATBGE, GTBAE, whatcouldgoright, BrutalBeatdowns, bullybackfire, NotTimAndEric, tooktoomuch, unexpectedjihad, TikTokCringe, … | comedy |
| 90 | falcons, ravens, washingtonwizards, AtlantaHawks, torontoraptors, MLBTheShow, MaddenUltimateTeam, cowboys, CHIBears, steelers, DenverBroncos, minnesotavikings, miamidolphins, pacers, eagles, … | sports |
| 55 | VeganFoodPorn, easyrecipes, cookingvideos, budgetfood, recipegifs, GifRecipes, slowcooking, instantpot, spicy, Breadit, HealthyFood, Paleo, EatCheapAndHealthy, Chefit, PlantBasedDiet, vegetarian, grilling, Keto_Food, … | food and cooking |
| 42 | GalGadot, milanavayntrub, alexandradaddario, victoriajustice, SydneySweeney, MeganFox, kateupton, AlisonBrie, RachelCook, MargotRobbie, ScarlettJohansson, … | celebrities |
| 40 | Mustang, Porsche, Audi, mercedes_benz, Volkswagen, ForzaHorizon, tacticalgear, subaru, Honda, BMW, Cartalk, projectcar, vandwellers, 4x4, Jeep, Miata, motogp, INDYCAR, mountainbiking, … | cars and bicycles |
| 30 | AnythingGoesNews, inthenews, nottheonion, Liberal, NewsOfTheStupid, moderatepolitics, politics, uspolitics, progressive, neutralnews, skeptic, usanews, news, hillaryclinton, Libertarian, … | politics and news |
| 15 | Chennai, Nepal, indianews, IndiaSpeaks, Kanye, bollywood, pakistan, mumbai, delhi, bangalore, india, Kerala | location-based |
| 15 | PrequelMemes, OTMemes, Animemes, lotrmemes, wowthanksimcured, funnysigns, simpsonsshitposting, TargetedShirts, raimimemes | comedy (memes) |
| 14 | ImaginaryLeviathans, ImaginaryBehemoths, ImaginaryMonsters, ImaginaryLandscapes, ReasonableFantasy, ImaginaryCharacters, armoredwomen, ImaginaryTechnology, futureporn, ImaginaryArchitecture, ImaginaryCityscapes, ImaginaryMindscapes, ImaginaryHorrors, … | imaginary pictures |
| 14 | poppunkers, Metalcore, futurebeats, trap, trapmuzik, hiphop, electronicmusic, hiphopheads, indieheads, gamemusic, amv, TaylorSwift, popheads, Music | music |
| 13 | gamernews, Games, PS5, nintendo, pcgaming, Gaming4Gamers, macgaming, Vive, iosgaming, JRPG, PSVR, PlayStationPlus, GamingLeaksAndRumours | gaming (general gaming) |
| 13 | musicproduction, WeAreTheMusicMakers, makinghiphop, misfits, FL_Studio, trapproduction, edmproduction, Bass, drums, guitarlessons, piano, singing | music |
| 10 | AskNYC, askTO, TooAfraidToAsk, NoStupidQuestions, AskUK, ask, answers, AskAnAmerican, AskEurope, legaladviceofftopic | questions and answers |
| 8 | UFOs, aliens, ufo, HighStrangeness, AlternativeHistory, UnexplainedPhotos, Missing411, truecreepy | aliens |
| 8 | hbo, netflix, Fantasy, bestofnetflix, horror, Hulu, DisneyPlus, startrek | video streaming |
| 6 | UnearthedArcana, DnDHomebrew, DnDBehindTheScreen, Roll20, dndmaps, magicTCG | games |
| 5 | PathOfExileBuilds, raidsecrets, woweconomy, CompetitiveTFT, CompetitiveWoW | competitive gaming |
| 5 | holdmycosmo, holdmyfries, holdmyjuicebox, holdmybeer, holdmyredbull | “hold my” subreddits |
| 4 | HiTMAN, JustBootThings, MilitaryStories, army | military |
| 3 | Design, DesignPorn, architecture | design |
Table 2. Top highways with their statistics selected based on the highest fraction of all shortest paths of at least the highway length.
| Highway | Length | % of shortest paths of at least highway length containing the highway | % of community pairs with a shortest path of at least highway length |
|---|---|---|---|
| (DigitalArt, SpecArt, ImaginaryMindscapes) | 3 | 19.0% | 17.0% |
| (Currentlytripping, DigitalArt, SpecArt, ImaginaryMindscapes) | 4 | 16.0% | 13.0% |
| (ActualPublicFreakouts, iamatotalpieceofshit, instantkarma) | 3 | 14.0% | 17.0% |
| (introvert, introverts, AskMenOver30) | 3 | 13.0% | 17.0% |
| (Currentlytripping, DigitalArt, SpecArt) | 3 | 12.0% | 17.0% |
| (PublicFreakout, BadChoicesGoodStories, JoeRogan) | 3 | 10.0% | 17.0% |
| (Roll20, OnePiece, Megaten) | 3 | 9.0% | 17.0% |
| (introvert, introverts, AskMenOver30, ask) | 4 | 9.0% | 13.0% |
| (introvert, introverts, AskMenOver30, AskUK) | 4 | 9.0% | 13.0% |
| (HighQualityGifs, RedditForGrownups, findapath, AskMenOver30) | 4 | 9.0% | 13.0% |
| (nonmonogamy, AskWomenOver30, AskMenOver30) | 3 | 8.0% | 17.0% |
| (electronicmusic, dreampop, LofiHipHop, EDM) | 4 | 8.0% | 13.0% |
| (RedditForGrownups, findapath, AskMenOver30) | 3 | 8.0% | 17.0% |
| (PublicFreakout, BadChoicesGoodStories, JoeRogan, seculartalk) | 4 | 7.0% | 13.0% |
| (AskWomenOver30, AskMenOver30, ask) | 3 | 7.0% | 17.0% |
| (AskWomenOver30, AskMenOver30, AskUK) | 3 | 7.0% | 17.0% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sawicki, J.; Ganzha, M. Exploring Reddit Community Structure: Bridges, Gateways and Highways. Electronics 2024, 13, 1935. https://doi.org/10.3390/electronics13101935
