1. Introduction
There are more than 11 million lesbian, gay, bisexual, and transgender (LGBT) adults in the United States (U.S.) [1]. LGBT populations face social stigma and additional discrimination-imposed challenges, such as higher rates of HIV and depression than their heterosexual and cisgender peers [2]. To address LGBT issues and provide better service to this community, the first step is identifying those issues. However, traditional research approaches such as surveys and focus groups are expensive and time-consuming, address a limited set of issues, and yield small-scale data.
Social media has become a mainstream channel of communication and has grown in popularity. Social media facilitates people’s belonging to and exchanging information within LGBT communities by allowing users to transcend geographic barriers in online spaces with the limited risk of being “outed” [3]. Compared to heterosexual respondents, LGBT users are more likely to have accounts on social media websites, access social media daily, and make frequent use of the internet [4].
According to a survey, 80% of LGBT Americans use social networking websites, and about four in ten LGBT adults have revealed their sexual orientation or gender identity on social networking sites [5]. These statistics show that identifying and exploring millions of LGBT users and their discussions on social media could lead to revolutionary new ways of collecting and analyzing data about LGBT people, impacting a range of disciplines, including health, informatics, sociology, and psychology.
Privacy and stigma pose significant barriers to LGBT people sharing information related to these identities [6]. Therefore, social media data may provide unique perspectives on LGBT issues that are not shared in other settings, such as surveys and focus groups. Among different social media platforms, Twitter offers Application Programming Interfaces (APIs) to collect large-scale datasets [7,8]. Due to its APIs and millions of users, many studies have used Twitter data to examine phenomena of interest in different applications such as health [9,10,11,12,13,14], politics [15,16,17], social issues such as sexual harassment [18], and disaster analysis [19,20,21].
Prior research has examined the LGBT community on social media, including defining online identity [22,23], exploring the societal perception of the LGBT community [24], investigating transgender adolescents’ uses of social media for social support [25], analyzing how LGBT parents navigate their online environments for advocacy, privacy, and disclosure [26], comparing characteristics of research participants recruited via in-person intercept interviews in LGBT social venues and targeted social media ads [27], and identifying health issues of LGBT users on social media [28,29,30]. Studies have also explored the visibility and participation of LGBT users online [31], the correlation between psychological wellbeing and social media dependency [32,33], LGBT hospital care evaluation based on social media posts [34], gender transition sentiment (mental health) [35], the health and social needs of transgender people [36], sexual health promotion [37,38,39,40,41,42,43,44], and the intervention and recruitment of research participants [45,46,47,48,49,50,51,52,53,54].
The above studies have utilized three approaches to identify LGBT users on social media. The first one places calls for LGBT participants, such as [45]. The second approach manually finds profiles related to self-identified individuals, such as [30]. The last approach is to identify profiles containing LGBT-related words, such as [35]. These approaches have limitations. The first and second approaches are time-consuming and labor-intensive. The limitation of the third approach is that users who utilize LGBT-related words in their profile are not necessarily LGBT individuals and can belong to other types of users, such as organizations. Similar to our research, some studies have developed classifiers to identify the gender or age of users in text data [55] and on social media platforms such as Twitter [56,57], Sina Weibo [58], Facebook [59], and Netlog [60]. These studies have utilized different features, such as the number of pronouns in the social media profile and bio, to develop binary classifiers, but there is no research on identifying the category of LGBT users.
In sum, social media has provided a great opportunity both for LGBT users to overcome the relevant privacy and stigma issues and for researchers to study the LGBT population. However, there is a need to develop efficient and effective automated methods to categorize LGBT users using machine learning. To address these limitations, this research first develops a robust codebook to characterize users in community-informed ways beyond just searching by keywords, to provide the most accurate data to train a machine learning model. Then, this study offers a classifier to automatically categorize LGBT users to facilitate future relevant studies. This paper has multiple contributions and implications, as follows:
This paper offers a codebook to manually categorize LGBT users.
The prediction approach is an important step toward categorizing LGBT users by developing a machine learning classifier.
Methodologically, our approach can be reused in predicting not only LGBT users but also other minorities.
While this research uses Twitter data, the proposed approach and features can be adopted for other possible social media platforms.
The approach of this paper can be used to identify and filter out adult content.
This research can be used by researchers to understand social media activities and concerns (e.g., health issues) of LGBT individuals.
This study can also be utilized by researchers to explore the social media strategies of LGBT organizations and identify best practices to promote social good for the LGBT population.
2. Materials and Methods
The methodology of this paper has five components: data acquisition, data annotation, classification, evaluation, and statistical analysis (Figure 1).
2.1. Data Acquisition
Twitter data were chosen for this project due to the hesitations many LGBT community members have about reporting their identities in official studies or in medical settings. Choosing Twitter data also allows for a broader reach within the community than is possible in a survey approach. Survey or focus-group based research on queer issues may also be heavily siloed, while Twitter data offer a broader view available at scale.
Twitter, a massively popular American microblogging and social networking platform launched in 2006, allows users to post short messages or “tweets” and interact with other users’ tweets by liking or retweeting. Users choose to “follow” other users whose content they wish to view and can choose to only allow certain other users to follow their account. Twitter is a social media platform that provides us with a large-scale dataset to classify LGBT users.
This paper categorizes Twitter users who use LGBT-related words in their profiles. Profiles were identified using the followerwonk platform (https://followerwonk.com/bio (accessed on 15 June 2019)), which we searched for Twitter profiles in the U.S. and in each state whose bios contained “lesbian”, “gay”, “bisexual”, “bi”, “transgender”, “trans man”, or “trans woman”. To focus on active users, we kept only profiles that had at least 50 followers and 50 tweets. This process yielded 42,644 profiles. After removing duplicate profiles, 38,978 unique profiles remained.
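The activity filter and duplicate removal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Profile` record, its field names, and the sample rows are hypothetical. Only the two thresholds (at least 50 followers and 50 tweets) and the de-duplication step come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    # Hypothetical record layout; the field names are assumptions for illustration.
    user_id: str
    bio: str
    followers: int
    tweets: int


def select_active_unique(profiles, min_followers=50, min_tweets=50):
    """Keep profiles that meet both activity thresholds, dropping repeated user_ids."""
    seen = set()
    kept = []
    for p in profiles:
        if p.followers < min_followers or p.tweets < min_tweets:
            continue  # not active enough
        if p.user_id in seen:
            continue  # same account returned by more than one search
        seen.add(p.user_id)
        kept.append(p)
    return kept


# Made-up sample rows, for illustration only.
sample = [
    Profile("a1", "gay dad, he/him", 120, 900),
    Profile("a1", "gay dad, he/him", 120, 900),    # duplicate of the first row
    Profile("b2", "bi and proud", 12, 4000),       # too few followers
    Profile("c3", "trans woman, artist", 50, 50),  # exactly at both thresholds
]
active = select_active_unique(sample)
print([p.user_id for p in active])
```

Keying the duplicate check on an account identifier is an assumption; the text does not say which field was used to detect duplicates.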
We recognize that the topic of this paper is a sensitive area presenting ethical challenges. To address these challenges, we include a self-reflexivity statement. First, because we use publicly accessible Twitter data without any interaction with the users, our work is exempt from institutional review board (IRB) review. Nevertheless, we took great care in data collection, analysis, and the presentation of results by not disclosing personally identifiable information. Second, to incorporate sensitivity in this paper, some of the coauthors belong to the LGBT community.
2.2. Data Annotation
In order to accurately categorize LGBT users, high-quality data from users who self-identify as LGBT in the United States are needed. Previous work in the field has taken a simpler approach, gathering profiles by including all profiles that mention LGBT terms. This results in low-quality data because it includes accounts professing support as allies and automated accounts that post primarily pornographic material. This research could be used to automate future classification and could serve as a repository for future academic studies into many other aspects of LGBT social media activities.
The annotation approach and codebook were developed iteratively, and responsively to both community rhetoric and the intricacies, twists, and unexpected challenges of mining social media data. Using a human-centered approach, a codebook was developed to reflect the most complexity possible when labeling the accounts of users, while still creating disjoint sets. The final codebook was then applied to all collected user accounts by two coders independently for intercoder reliability.
The two authors independently coded and discussed 500 randomly selected profiles from the 38,978 unique profiles. Due to the nature of the internet and social media at large, searching for profiles with LGBT-related words in Twitter bios returns a fairly high percentage of results with primarily pornographic material, which may or may not be posted by “bots”. Organizations were also classified separately, as they do not reflect individual experiences. Discrepancies were addressed by a third coder. The initial coding process offered three categories, including individual, porn/sex worker, and organization accounts. Coders needed to answer the following two questions for each account:
Q1: Is the account useable for this research? This yes/no question excluded the following accounts:
Non-U.S. accounts where their bio information does not show a location in the U.S.;
Non-English accounts that posted mostly non-English tweets;
Inactive accounts that have not been active since 2017;
Private and suspended accounts;
Automated accounts that posted an unusual number of tweets, retweets, and likes, had a very low ratio of followers to followings, and did not have an image. We also used Botometer (https://botometer.osome.iu.edu/ (accessed on 15 June 2019)) to identify automated accounts [61].
Q2: What is the category of the account? To address this question, coders used the following definition to assign one of the categories:
Individual accounts are controlled by a single person.
Sex Worker/Porn accounts belong to people involved in the production of professional pornography, both on and off screen; people engaged in prostitution and escort services; erotic dancers; fetish models; and amateur individuals using webcam sites, amateur porn sites, or pay-gated platforms to profit from self-made content. This category also includes accounts that primarily retweet pornographic material and/or post their own nude photographs or moving images.
Organization accounts are managed by a group or an organization representing more than one person.
After completing the coding, we applied Cohen’s κ to determine the agreement between the two coders. There was substantial agreement for both Q1 (κ = 0.7862) and Q2 (κ = 0.7544).
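For readers unfamiliar with Cohen’s κ, the statistic compares the observed agreement between two coders with the agreement expected by chance from each coder’s label frequencies. The sketch below uses the standard definition; the label lists are made up for illustration and are not the study’s annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the coders' marginal label proportions.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for six accounts (Ind, SWP, Org), for illustration only.
coder_a = ["Ind", "Ind", "SWP", "Org", "Ind", "SWP"]
coder_b = ["Ind", "Ind", "SWP", "Ind", "Ind", "Org"]
kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 4))
```

In this toy example the coders agree on four of six accounts, but chance agreement is already 15/36, so κ is well below the raw agreement rate.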
2.3. Classification
Our next goal centers around inferring the category of the collected Twitter users automatically. We draw on Twitter account information to build a machine learning classifier. This paper follows the automated framework in Figure 2 to categorize LGBT users on Twitter.
This step includes developing algorithms to assign a set of users $U = \{u_1, u_2, \ldots, u_k\}$ to known classes. The classification can be described as the prediction of the category of each user ($c \in C = \{c_1, c_2, \ldots, c_m\}$). The following classifier algorithm (a) assigns a class (c) to each user in Equation (1):
$$a: U \rightarrow C, \qquad a(u_i) = c \qquad (1)$$
In this research, there are three classes (m = 3), including individual, sex worker/porn, and organization. To classify each user, the input of each classifier is a set of n features, $F = \{f_1, f_2, \ldots, f_n\}$. This research examines the following two types of features: bio and profile features. To predict the category of each user, we use the following two main approaches [
62]: traditional methods, including NaiveBayes, BayesNet, Random Forest, J48, and Support Vector Machines (SVM), and deep learning using a Convolutional Neural Network (CNN). These methods are among the high-performance classification algorithms [63,64,65,66,67,68]. CNN is one of the popular deep learning methods and has been used for different classification tasks [62,69,70,71]. The rest of the classifiers are traditional machine learning methods used for a wide range of applications such as spam detection [72,73] and document classification [74]. We transform the information of Twitter accounts into a set of features. The focus of this study is on the features displayed on Twitter accounts. These features illustrate information about users and their activities.
Table 1 shows the definition of Twitter terms.
This paper uses features in the LGBT Twitter accounts and builds a feature vector for each account, which are briefly described below.
1288 bio features
- Frequency of each word in bio
- The number of words in bio
81 profile features
- The age of each Twitter account (account’s age)
- Total number of tweets (#tweets)
- The number of tweets per year (#tweets/year)
- The number of likes per year (#likes/year)
- The number of followers (#followers)
- The number of followings (#followings)
- The ratio of followers to followings (#followers/#followings)
- Frequency of letters (A–Z) and numbers (0–9) in the username
- Frequency of letters (A–Z) and numbers (0–9) in the screen name
- The username’s length
- The screen name’s length
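The feature list above can be turned into a vector along the following lines. This is a sketch under assumptions rather than the authors' implementation: the dictionary keys, the whitespace tokenizer, the case-insensitive character counts, and measuring account age in whole years are all illustrative choices. The layout does reproduce the stated count of 81 profile features (7 numeric values, 36 character counts for each of the two names, and 2 lengths).

```python
import string
from collections import Counter

ALPHANUM = string.ascii_lowercase + string.digits  # 26 letters + 10 digits = 36 symbols


def char_counts(text):
    """Frequency of each letter (case-insensitive, an assumption) and digit in text."""
    counts = Counter(text.lower())
    return [counts[ch] for ch in ALPHANUM]


def profile_features(account, current_year):
    """Build the 81 profile features from a hypothetical account dictionary."""
    age = max(current_year - account["created_year"], 1)  # at least 1 to avoid dividing by zero
    followings = account["followings"]
    ratio = account["followers"] / followings if followings else 0.0
    numeric = [
        age,
        account["tweets"],
        account["tweets"] / age,  # #tweets/year
        account["likes"] / age,   # #likes/year
        account["followers"],
        followings,
        ratio,                    # #followers/#followings
    ]
    return (
        numeric
        + char_counts(account["username"])
        + char_counts(account["screen_name"])
        + [len(account["username"]), len(account["screen_name"])]
    )


def bio_features(bio, vocabulary):
    """Word frequencies over a fixed vocabulary, followed by the bio's word count."""
    words = bio.lower().split()  # whitespace tokenization is a simplification
    counts = Counter(words)
    return [counts[w] for w in vocabulary] + [len(words)]


# Made-up account, for illustration only.
account = {
    "created_year": 2012, "tweets": 700, "likes": 1400,
    "followers": 300, "followings": 150,
    "username": "Sam G", "screen_name": "sam_g99",
}
vec = profile_features(account, current_year=2019)
bio_vec = bio_features("Proud bisexual artist, nsfw", ["bisexual", "nsfw", "porn"])
print(len(vec), bio_vec)
```

With the study's full vocabulary, `bio_features` would return one frequency per bio word plus the word count, matching the 1288 bio features listed above.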
This study uses the χ² (chi-square) value, which is one of the effective feature selection methods [75], to measure the discriminative power of features and rank them, so that we can assess the impact of the number of features on the performance of the classification methods. The χ² value assists in identifying the number of features that yields the best classification performance.
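As an illustration of ranking features by a chi-square statistic, the sketch below scores a binary feature (for example, whether a word appears in the bio) against the class labels using the standard contingency-table formula, then sorts features by score. Treating features as binary is a simplifying assumption, and the function names and toy data are hypothetical; the study's tooling may discretize numeric features differently.

```python
from collections import Counter


def chi_square(feature_values, labels):
    """Chi-square statistic of the feature-by-class contingency table."""
    n = len(labels)
    row_totals = Counter(feature_values)
    col_totals = Counter(labels)
    observed = Counter(zip(feature_values, labels))
    stat = 0.0
    for value, row_total in row_totals.items():
        for label, col_total in col_totals.items():
            expected = row_total * col_total / n  # count expected under independence
            stat += (observed[(value, label)] - expected) ** 2 / expected
    return stat


def rank_features(columns, labels):
    """Feature names sorted from most to least discriminative."""
    scores = {name: chi_square(values, labels) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Made-up data: six accounts and two word-presence features.
labels = ["SWP", "SWP", "Ind", "Ind", "Org", "Org"]
columns = {
    "the": [1, 0, 1, 0, 1, 0],   # spread evenly across classes
    "nsfw": [1, 1, 0, 0, 0, 0],  # appears only in SWP bios
}
ranking = rank_features(columns, labels)
print(ranking)
```

Keeping the top-ranked features and re-running the classifier for several cutoffs is one way to search for the best number of features.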
2.4. Evaluation
We examine the performance of the six algorithms to find which classifier performs better with the bio and profile features. To evaluate the performance of classifiers, we use some measures based on the confusion matrix. The following confusion matrix represents a binary classification example that can be extended to more than two categories:
| | Predicted Category 1 | Predicted Category 2 |
| Actual Category 1 | True Positive (TP) | False Negative (FN) |
| Actual Category 2 | False Positive (FP) | True Negative (TN) |
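TP and TN are accounts that are correctly assigned to and correctly excluded from a category, respectively, while FP and FN are accounts that are incorrectly assigned to and incorrectly excluded from a category, respectively. We utilized precision (P), recall (R), the area under the receiver operating characteristic (ROC) curve (AUC), and accuracy (ACC) based on the following definitions:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad ACC = \frac{TP + TN}{TP + TN + FP + FN}$$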
ROC shows the tradeoff between FP and TP by plotting the FP rate on the X-axis and the TP rate on the Y-axis; a curve closer to the upper left corner indicates better performance. Then, we computed the χ² value to rank the features and find the top ones. In order to determine the category of each user, we adopted the six classification algorithms using 5-fold cross-validation, in which the data are broken into five subsets, and the holdout method is repeated five times. Each time, one of the five subsets is used as the test set, and the other four subsets are used as the training set.
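The measures above extend to the three-class setting by treating each class in turn as the positive category. The sketch below assumes a confusion matrix whose rows are actual classes and whose columns are predicted classes; the counts are made up for illustration and are not results from this study.

```python
def per_class_metrics(matrix, classes):
    """Precision and recall per class, plus overall accuracy.

    matrix[i][j] is the number of accounts whose actual class is classes[i]
    and whose predicted class is classes[j].
    """
    size = len(classes)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(size))
    report = {}
    for k, name in enumerate(classes):
        tp = matrix[k][k]
        fp = sum(matrix[i][k] for i in range(size)) - tp  # predicted k, actually another class
        fn = sum(matrix[k]) - tp                          # actually k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[name] = (precision, recall)
    return report, correct / total


# Made-up counts, for illustration only.
classes = ["Ind", "SWP", "Org"]
matrix = [
    [90, 6, 4],   # actual Ind
    [10, 35, 5],  # actual SWP
    [5, 0, 45],   # actual Org
]
report, accuracy = per_class_metrics(matrix, classes)
print(report, accuracy)
```

AUC is not covered here because it requires the classifier's ranked scores rather than a single confusion matrix.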
2.5. Statistical Analysis
To compare individual, sex worker/porn, and organization accounts based on the mean value of the top features identified in the previous step, we utilized an analysis of variance (ANOVA), which tests whether the mean value of each feature differs across the three account types. We used the value of each top feature as the dependent variable. When we found a significant difference (p-value ≤ 0.05), we used Tukey’s multiple comparison test [76] to find which of the means differ significantly from the others. To adjust for multiple comparisons, we used the false discovery rate (FDR) method [77], which reduces not only false positives but also false negatives [78]. We also utilized the absolute effect size using Cohen’s d to identify the magnitude of the differences. We used the following classification index to interpret effect sizes: very small (d = 0.01), small (d = 0.2), medium (d = 0.5), large (d = 0.8), very large (d = 1.2), and huge (d = 2.0) [79].
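The effect-size step can be illustrated with the usual pooled-standard-deviation form of Cohen’s d. Two points in this sketch are assumptions rather than details from the text: the index values are treated as lower bounds of each band, and values below 0.01 are given the made-up label "negligible". The sample values are invented.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Absolute Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return abs(mean(group_a) - mean(group_b)) / pooled_var ** 0.5


def effect_size_label(d):
    """Map d to the index in the text, reading each value as a lower bound (assumption)."""
    bands = [
        (2.0, "huge"), (1.2, "very large"), (0.8, "large"),
        (0.5, "medium"), (0.2, "small"), (0.01, "very small"),
    ]
    for cutoff, label in bands:
        if d >= cutoff:
            return label
    return "negligible"  # label not in the source; used here for d < 0.01


# Made-up feature values for two account types, for illustration only.
ind_values = [2.0, 4.0, 6.0, 8.0]
org_values = [1.0, 3.0, 5.0, 7.0]
d = cohens_d(ind_values, org_values)
print(round(d, 4), effect_size_label(d))
```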
3. Results
The manual coding process offered 16,241 users, including 12,488 (76.89%) individual, 2282 (14.05%) sex worker/porn, and 1471 (9.06%) organization accounts. In total, we obtained 1369 features. We tested the performance of the six classifiers developed in Weka (https://www.cs.waikato.ac.nz/ml/weka/ (accessed on 15 April 2021)) with 5-fold cross-validation. To ensure comparability between the classifiers, we used the standard parameters. Out of the six classifiers, BayesNet produced higher accuracy and AUC than the rest of the algorithms (Figure 3). The BayesNet algorithm performed significantly better than the baseline accuracy of 0.7689, which was obtained with the ZeroR algorithm, which relies only on the target and ignores all predictors.
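The baseline figure follows directly from the class counts reported above: a majority-class predictor is correct exactly as often as the largest class occurs. The short sketch below reproduces that arithmetic; the function name is hypothetical and this is not Weka's implementation.

```python
from collections import Counter


def zero_r_accuracy(labels):
    """Majority class and the accuracy of always predicting it."""
    majority_class, majority_count = Counter(labels).most_common(1)[0]
    return majority_class, majority_count / len(labels)


# Class counts taken from the manual coding results reported in the text.
labels = ["Ind"] * 12488 + ["SWP"] * 2282 + ["Org"] * 1471
majority, baseline = zero_r_accuracy(labels)
print(majority, round(baseline, 4))
```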
We found that selecting the optimum number of features can improve classification performance and yields a time-saving and cost-efficient system. Therefore, we examined different numbers of features. The optimum number of features was 399 (Figure 4).
Table 2 shows the accuracy of the BayesNet algorithm using the profile features, the bio features, the combined profile and bio features, and the top 399 profile and bio features. This table shows three outcomes. First, the profile or bio features alone could identify the three classes with more than 80% accuracy. Second, the combination of profile and bio features improved the performance of the classifier. Third, reducing the number of features enhanced the accuracy of BayesNet.
Table 3 summarizes the performance metrics of BayesNet with 399 features, where we found that the classifier was reasonably stable (SD ≤ 0.006 and CV ≤ 0.01). CV represents the coefficient of variation, measured using $CV = \frac{SD}{\text{mean}}$.
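As a small worked example of the stability measures, the coefficient of variation is conventionally the standard deviation divided by the mean. The sketch below applies it to per-fold accuracies. The fold values are hypothetical, and using the sample (rather than population) standard deviation is an assumption, since the text does not specify which was used.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


# Hypothetical accuracies from five cross-validation folds, for illustration only.
fold_accuracy = [0.88, 0.87, 0.89, 0.88, 0.88]
cv = coefficient_of_variation(fold_accuracy)
print(round(cv, 4))
```

A CV close to zero means the metric barely changes from fold to fold, which is the sense in which a classifier is described as stable.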
Among the top 399 features, the number of bio features is more than the number of profile features (Figure 5). We found that 321 (24.9%) out of the total 1288 bio features and 78 (96.3%) out of the total 81 profile features are among the top 399 features. This means that while the profile features represent about 20% of the top features, most profile features are among the top features. Among the top 50 features, the number of profile features is more than the number of bio features.
Table 4 shows the top 20 features that assisted in classifying users. Out of the top 20 features, 9 (45%) and 11 (55%) features were related to profile and bio categories, respectively. The bio features include the #words in the bio and the frequency of words in the bio of Twitter accounts, including bisexual, transgender, community, nsfw, porn, organization, LGBT, allies, event, and men. Among these words, nsfw and porn are used more by SWP accounts, and the rest of words are utilized more by organization accounts. The rest of the top 20 features are related to the profile category, including the #likes/year, the #followers/#followings, the account’s age, the #tweets/year, the #tweets, the letter g in the screen name, the #followers, the username’s length, and the #followings.
Our statistical analysis shows that there were 46 (out of 60) significant differences. For example, the number of tweets is higher for individual (Ind) accounts than for sex worker/porn (SWP) and organization (Org) accounts (Table 4). Our findings are as follows:
There was no significant difference between SWP and Org accounts across three features, including #followers/followings, #followers, and #followings. Compared to Org accounts, SWP accounts had a higher #likes/year, #tweets/year, and #tweets and used nsfw (not safe for work) and pornographic words in their bio more. The value of the rest of the features was higher for Org accounts than SWP ones. In sum, we found three NS, five SWP > Org, and twelve SWP < Org comparisons.
There was no significant difference between Ind and Org accounts across two features, including #followings and nsfw. Compared to Org accounts, Ind accounts had a higher #likes/year, #tweets/year, and #tweets. The value of the rest of the features was higher for Org accounts than Ind ones. In total, this research identified two NS, three Ind > Org, and fifteen Ind < Org comparisons.
There was no significant difference between Ind and SWP accounts across nine features, including #likes/year, #tweets/year, and the length of the username, and using the words bisexual, transgender, community, organization, and allies in their bio. Compared to SWP accounts, Ind accounts had a higher account age, #tweets, and used the acronym LGBT more. The value of the rest of the features was higher for SWP accounts than Ind ones. In total, this research identified nine NS, three Ind > SWP, and eight Ind < SWP comparisons.
There is a significant difference between the three categories based on the following features: the account’s age; the number of tweets; using porn, LGBT, and men words in the bio; using G in screen name; and the number of words in the bio.
The effect size analysis illustrates that the 46 significant differences were not trivial, including 6 very small, 18 small, 14 medium, 6 large, and 2 very large effect sizes (Table 5). The maximum difference was between individual and organization accounts, with 18 (90%) significant differences, and the minimum difference was between individual and sex worker/porn accounts, with 11 (55%) significant differences out of 20 comparisons. The effect size analysis also confirmed that the magnitude of the significant differences is considerable.
4. Discussion
This research is unique in that it provides a prediction framework including an automatic classifier, a feature selection approach, and evaluation measures. Our experiments were designed to categorize LGBT users based on different sets of features and categories and identify features that may contribute to improving the efficiency and effectiveness of the prediction. Our proposed model uses BayesNet to learn from feature vectors and the χ² value to identify the optimal subset of features. Our proposed model outperformed the baseline on classifying LGBT accounts. We are now able to identify individual, sex worker/porn, and organization accounts with around 88% accuracy. The evaluation shows that the performance of our classifier is better than the baseline accuracy (76.89%) using ZeroR, which assigns every user to the largest class (individual users in this study). While even a small improvement over the baseline could be meaningful, our classifier improves on the baseline by more than 10 percentage points.
While using profile and bio features independently can provide a significant change over the baseline performance, the combination of profile and bio features and reducing the number of features can be more helpful in classifying LGBT accounts. The accuracy of our classifier is improved when both profile and bio features are used. While the number of profile features (81) is less than the number of bio features (1288), most of the top 50 features are related to profile information, indicating that profile information containing structured features plays an important role in classifying LGBT accounts. In addition, words in the bio of Ind, SWP, and Org accounts can be a good indicator to categorize LGBT users.
Our results suggest that profile information, words in the bio, and characters in the username and screen name can help to predict the category of LGBT users. For example, it is not surprising to see that the number of followers of Org and SWP accounts is higher than that of Ind accounts, likely because they have more fans than Ind ones. However, it is interesting to find that Org accounts used the like icon and tweeted less than Ind and SWP accounts, which suggests that Org accounts are cautious in posting social comments and showing their interests. The reason behind this strategy could be that a single unfortunate post can have a significant negative impact on organizations [80]. However, Ind and SWP accounts do not have this limitation and can be more active than Org accounts.
Org accounts use the word community in the bio more than Ind and SWP accounts, which suggests that Org accounts are more interested in emphasizing their role for the community. The age of Org accounts is higher than that of the other two account types, which indicates that organizations have been active on social media for more years. The characteristics of SWP accounts are similar to those of Org accounts for some features. For example, the number of followers and followings of Org and SWP accounts is higher than that of Ind accounts. Org and SWP accounts use more words in their bio than Ind accounts to introduce their services and provide more information for customers.
The comparisons of Ind vs. SWP, Ind vs. Org, and SWP vs. Org illustrate that the minimum difference is between Ind and SWP accounts, indicating that SWP accounts are more similar to Ind accounts than Org ones. It seems that the strategy of SWP accounts is to behave similarly to Ind accounts. Therefore, it is a complicated task to distinguish between Ind and SWP accounts. However, identifying Org accounts is less complicated than SWP and Ind accounts. While it is a difficult task to identify Ind and SWP accounts, there are features (e.g., using nsfw in bio) that assist in finding SWP accounts.
This research provides significant contributions. First, while other research developed binary classifiers (for example, one study distinguishes individual and organization users [81]), this paper offers a multi-class classifier to categorize users. Second, this study illustrates that the traditional machine learning methods used in this research offer better performance than deep learning using CNN for categorizing LGBT users using bio and profile features. Our data size is not very large. Therefore, this finding is in line with the current literature, which indicates that deep learning methods do not provide a significant performance gain over traditional methods if the size of a dataset is small or medium [82]. Third, the proposed approach is effective in utilizing bio and profile features to identify Ind, SWP, and Org accounts. Fourth, this paper identifies and uses features that can be used for similar purposes. Fifth, the proposed approach is flexible enough to incorporate not only bio and profile features but also other features (e.g., semantics of tweets), use other machine learning methods, and be applied on other social media platforms. Sixth, this research is beneficial for researchers who are interested in categorizing LGBT users for social media analysis purposes. For instance, our work can be used by public health experts to identify LGBT individuals to study their information behavior on social media, by social media and marketing companies and application developers to filter out adult content, and by social science and business experts to study LGBT organizations. We believe our work bears the potential to help understand the needs of LGBT individuals on social media and develop interventions to address the needs of LGBT people.
While our study contributes to LGBT studies in social media and opens a new direction for future research, this study bears certain limitations. First, we limit our features to bio and profile features. Second, this study is limited to LGBT users who live in the U.S. and post tweets in English. Third, our data collection was limited to lesbian, gay, bisexual, and transgender users, indicating that we might miss other possible relevant data.
Despite the limitations, our findings can provide new insights into types of LGBT users and their social media activities. Future research will need to consider n-grams (e.g., bigrams), linguistic features (e.g., verbs), the semantic meanings of words (e.g., themes), and global or local weighting methods. We aim to go beyond unigrams and incorporate these features in our prediction framework to achieve higher prediction performance.