Article

Using Dynamic Pruned N-Gram Model for Identifying the Gender of the User

1
Department of Information Systems and Technology, Port Said University, Port Fouad, Port Said 42526, Egypt
2
Department of Computer Science and Artificial Intelligence, College of Computer Science and Engineering, University of Jeddah, Jeddah 21493, Saudi Arabia
3
Department of Software Engineering, College of Computer Science and Engineering, University of Jeddah, Jeddah 21493, Saudi Arabia
4
Department of Informatics, National Research University Higher School of Economics, 3A Kantemirovskaya St., 194100 Saint Petersburg, Russia
*
Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(13), 6378; https://doi.org/10.3390/app12136378
Submission received: 20 May 2022 / Revised: 11 June 2022 / Accepted: 21 June 2022 / Published: 23 June 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract
Organizations analyze customers’ personal data to understand and model their behavior. Identifying customers’ gender is a significant factor in market analysis that helps plan promotional campaigns, determine target customers, and provide relevant offers. Several techniques have been developed to analyze different types of data, including text, image, speech, and biometrics, to identify the gender of the user. The way a profile name is composed differs from one customer to another. Using numerical substitutions of specific letters, known as Leet language, impedes the gender identification task. Moreover, acronyms, misspellings, and adjacent (concatenated) names impose additional challenges. To address these challenges, this work uses the customers’ profile names associated with submitted reviews to recognize the customers’ gender. First, we create datasets of profile names extracted from the customers’ reviews. Second, we introduce a dynamic pruned n-gram model for identifying the gender of the user. It starts with data segmentation to handle adjacent parts, followed by data conversion and cleaning to resolve the use of Leet language. Feature selection through a dynamic pruned n-gram model is the next step, combined with recurrent misspelling correction using fuzzy matching. We evaluate the proposed approach on real data collected from active web resources. The obtained results demonstrate its validity and reliability.

1. Introduction

Recently, many changes have occurred in the world due to the difficult times we are going through. The pandemic has significantly affected many areas; e-commerce and e-marketing have acquired the largest share. Both consumers’ and sellers’ behavior has been affected and differed significantly from previously, which prompted many companies to handle this matter by making more efforts to analyze customer interests and desires [1,2]. The primary motivation is to enhance their marketing capabilities and increase the commercial share, taking into account the increased degree of competition between companies to realize customer satisfaction and increase their loyalty [3,4]. To this end, companies collect and analyze a lot of essential data to provide marketing and promotional offers that match customers’ requirements [5,6].
Generally, the term personalization refers to allocating the system’s output based on its users’ personal information. Web personalization is one of the most significant applications based on analyzing the customers’ behavior. Personalization’s primary goal is represented in two main objects, realizing a satisfying interaction for customers and vendors and predicting the consumer’s needs [1,6,7]. Personalization techniques are widely utilized by many websites, including search engines and online social networks, to enhance their capabilities of providing users with the most relevant information to search queries and preferences [8].
Most online shopping websites allow customers to submit their feedback about offered products and services. The analysis of customers’ data, besides their feedback, could help companies greatly while developing their marketing strategies, product redesign, etc., [9,10]. Customer gender is considered a significant aspect of personal data that can help understand and predict customer behavior and enhance organizations’ capabilities to provide users with the most relevant information to search queries and preferences [11,12].
Commonly, gender classification is considered a binary task, “Male and Female”. However, when dealing with the username or profile name, it may be extended to be a ternary task where some people, for several reasons, prefer to be registered anonymously; up to the end of this paper, the term username and profile name will be used interchangeably. Furthermore, digital platforms may allow users to be registered using virtual genders that enhance physical anonymity [13].
Gender classification represents a challenging task attracting researchers from various fields, including computer vision, data science, pattern recognition, etc. Automatic gender recognition has a wide range of applications in many fields that justify its importance, including commercial development (i.e., advertising), law enforcement (i.e., legal investigations), human–computer interaction (i.e., intelligent systems), and demographic research (i.e., social classification).
Several techniques were proposed for gender classification that depend on the type of data in which the technique was applied (e.g., image, speech, text, biometrics) [14,15,16,17,18,19]. Our goal is to recognize the gender from text, more specifically, from the username. Most customers prefer to write short feedback, making it insufficient to use text-based approaches for the Gender Identification (GI) task.
The proposed approach encounters several challenges that make GI a difficult task. First, a username may contain not only alphabetical letters but also numbers, symbols, and special characters, so classification is not straightforward. Second, users widely employ the Leet language, which replaces letters with predefined numbers. Simply removing numbers is therefore risky; a backward substitution is required to retrieve the original name. Third, users may write a short form of a name by omitting one or more middle letters, typically vowels.

1.1. Research Objectives

In this work, we strive to accomplish two primary objectives. First, we aim to collect review data from various online websites and extract profile data to build a diversified dataset suitable for validating the automatic gender identification model. Second, we investigate the potential of the dynamic pruned n-gram model for handling gender identification problems from profile data.

1.2. Contributions

In order to fulfill these objectives, we build four datasets that contain the profile data of customers. The created datasets are collected from real and active websites.
We have developed a gender classification model capable of addressing the shortcomings of existing models. The proposed multi-layered preprocessing framework handles the problems of using numerical substitutions of letters known as Leet speak; reverse substitutions are performed, considering all possible cases. Further, the use of short forms of names and acronyms is considered. We introduce the dynamic pruned n-gram model for feature extraction. Misspellings corrections were performed using the fuzzy-matching technique. We evaluate the performance of the proposed model; the obtained results demonstrate its validity and reliability.
The rest of this paper is organized as follows. Section 2 briefly presents related concepts and discusses related studies. Section 3 introduces the proposed model, describes the gender identification framework, and details the basic components of our model. Section 4 provides details of the conducted experiments, including a description of the dataset used in this study, the implementation details, the testing mechanism, the evaluation metrics and criteria, and an exploration of the obtained results. Section 5 presents a detailed discussion that highlights our work’s theoretical and practical implications. Finally, the conclusion summarizes the results and outlines future work.

2. Background

This section gives an overview of the related concepts, including web personalization and tokenization, followed by a discussion of gender classification techniques. We consider only the text-based approaches because of their relevance to our work.

2.1. Web Personalization

Web mining refers to applying data mining techniques to web resources to extract hidden knowledge. One of the most attractive web mining areas is web personalization, which aims to optimize service and data provided to users by monitoring their interaction history, users’ profiles, preferences, geographic location, interests, etc. Therefore, the general definition of the personalization technique is the customization of system outputs by the gathered information about system users [20,21].
In other words, the term customization refers to improvements in the outputs presented to the user by increasing the relevance degree of the results. Recommendations and suggestions are different output forms of applying the personalization techniques to collected data (e.g., Facebook’s suggestions, Google’s search results, etc.) [22,23].
Traditionally, personalization systems have relied initially on explicit user feedback and ratings. Numerous studies have been introduced to improve the quality and relevance of results that employ additional information such as dwell time, user clicks, and others through a rating scheme. Further, other techniques are developed based on collaborative filtering (CF) for building personalization and recommendation systems. Other techniques incorporate soft computing techniques (e.g., fuzzy logic, neural networks, and genetic algorithms) as a helpful tool to handle this issue [7,24].

2.2. Tokenization

Undoubtedly, the web represents the most important source of texts. The raw content involves many useless and uninteresting details that may be optionally thrown away, such as line breaks, whitespace, blank lines, punctuation, and others. Tokenization is one of the most significant key concepts in Natural Language Processing. It could be defined as the process of chopping a sequence of characters up into pieces, and each one is called a token. Tokens represent identifiable linguistic units that constitute a piece of language data.
The primary purpose of tokenization is to convert text into features for further analysis tasks, and it is a significant task to find the correct tokens for use. No standard method appears to be valid and works well for all applications; it is a domain-specific task. Several techniques were introduced in this context; the most straightforward way for tokenization is to split text on whitespaces [25].
Another approach is based on splitting text on all non-alphanumeric characters, which implies throwing the punctuation characters away. However, this approach comprises problems that, in some situations, punctuation marks represent a part of the word [26]. Additionally, hyphens and single apostrophes may reflect the meaning of the sentence in the case of the presence of contractions such as “couldn’t”; tokenization using this approach will create “couldn t”. Therefore, tokenization turns out to be a far more difficult task and still needs more effort to refine the performance significantly [27].
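The difference between the two tokenization strategies described above can be illustrated with a minimal sketch. The sample sentence is an assumption chosen for illustration; only Python's standard `re` module is used.

```python
import re

# Illustrative sentence (not taken from the paper's data).
text = "She couldn't find the well-known shop."

# Simplest approach: split on whitespace; punctuation stays attached to words.
whitespace_tokens = text.split()

# Alternative: split on every non-alphanumeric character, discarding punctuation.
# The contraction and the hyphenated word are broken into separate pieces.
alnum_tokens = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

print(whitespace_tokens)
print(alnum_tokens)
```

The second list splits the contraction into "couldn" and "t", which is the loss of meaning the paragraph above warns about.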

2.3. Gender Identification Techniques

Recently, the development of gender classification techniques has become more interactive, draws increasing interest and attracts researchers from various scientific disciplines. The application of classification techniques strives to extract characteristics that differentiate between masculinity and femininity. The primary goal is to automatically identify a person’s gender according to the available data. Based on the adopted data type, we can categorize the proposed approaches into four classes: (1) Text Analysis-Based Approaches, (2) Face Recognition-Based Approaches, (3) Speech Recognition-Based Approaches, and (4) Biometrics-Based Approaches. A brief review of gender identification approaches that are based on text analysis follows:
Since most information on the web is in textual form, it is reasonable for it to receive the highest interest and carry the largest share of research proposals. Textual data can be found in several forms: authors’ names, tweets, Facebook posts, messages, blogs, etc. Many researchers have proposed solutions to recognize the gender of text authors; most of them focus on the text available on online social networks. In 2015, the authors of [28] introduced combined approaches for gender classification with text mining techniques. They employ sociolinguistic-inspired text features to enhance the performance of text mining methods. Moreover, the authors of [29] have proposed integrating the text mining approach to examine author preferences for gender classification tasks. Searching for linguistic information in the author’s preferences could improve the performance of the gender classification. Therefore, integrating the self-assigned gender into the binary GI classifier could improve the accuracy of the results. The authors of [30] have introduced a supervised learning approach called the “Cascading Transformer” to identify the gender of named entities in text. Their solution infers the gender through sequence labeling that considers the context in which the name appears; “Female”, “Male”, “Ambiguous”, and “Other” represent the possible labels. The proposed model has been tested and evaluated on four open-source datasets created by the authors.
Although most researchers tend to develop solutions to the English text, several research works have been proposed to handle other languages such as Arabic [31,32,33], Russian [34,35], and Portuguese [36]. The authors of [32] have introduced a language-specific algorithm based on N-Gram Feature Vector that deals with Egyptian dialect. They achieved the text classification task using a Mixed Feature Vector, which was applied to the authors’ annotated dataset of Egyptian dialects. Feature weights were applied using the Random Forest with Mixed Feature Vector and Logistic Regression with N-Gram Feature Vector. Further, in 2020 [33], the authors employed deep learning techniques for handling the gender identification task. They have investigated several Neural Network varieties (e.g., Convolutional Neural Networks, Convolutional Bidirectional Long-Short Term Memory, Long-Short Term Memory) on Arabic text on Twitter, more specifically, Egyptian dialect. The best-suited model was Convolutional Bidirectional Gated Recurrent Units, which achieved the highest accuracy. Turning to the Russian language, the authors of [34] have investigated the application of conventional machine learning models (e.g., Support Vector Machine, Gradient Boosting, Decision Tree) and varieties of Neural Networks (e.g., Convolutional Neural Network, Long-Short Term Memory) on the Russian text with gender deception to handle the task of GI. They examined several combinations and proposed the best four models that achieved high accuracy. Concerning the Portuguese language, the authors of [36] have proposed applying machine learning techniques to Portuguese on Twitter. They incorporate the extracted Meta-Attributes from the text with the ML algorithms for recognizing the linguistic expression related to gender. Several factors were considered (e.g., multi-genre, characters, content-free texts, structure and morphology, syntax) through the classification process. 
Furthermore, the significance of the gender identification task encouraged the authors of [37] to develop a set of software tools in the R programming language; the package is called genderizeR. Several tools are also available for other programming languages such as Python; Python packages for gender classification include chicksexer, gender-guesser, etc.

3. The Proposed Model

This work’s primary goal is to recognize the gender of the customer through the username. The proposed model aims to collect, process, and analyze the consumers’ data from the online markets. In this section, we present the dynamic pruned n-gram model for identifying the gender of the customer using the registered username. First, we will discuss the workflow of our model and how it operates. Afterward, we will discuss the primary sub-procedures with different variants of specific tasks for performance tuning.

3.1. The Framework Description

Generally, the proposed model incorporates several tasks starting from collecting data, followed by a series of analysis processes ending with assigning the proper labels to users, as shown in Figure 1. Collecting review data has been performed by scraping customers’ feedback available on the online markets (e.g., Amazon, BestBuy, Lenovo, Dell). The gathered data include the review text, date and time of writing the review, ratings, and information of the person who wrote this comment. Therefore, the next step is to extract the username field from the collected data. Commonly, websites do not allow spaces to be used when creating a username during the registration or submission of the review. Therefore, customers may find no alternative to writing concatenated names, making gender identification tasks more difficult. Therefore, word segmentation appears to be a suitable solution to handle this task.
Typically, a username may contain letters, numbers, and special characters. Additionally, many people prefer to use common substitutions of specific letters in the word with specific numbers, known as the Leet language. Recognizing these numbers and substituting them with the correct letters is the next task to be achieved. Afterward, the remaining numbers and/or symbols need to be removed so that after finishing this step, only alphabetical letters are being maintained.
People’s names of the same gender cluster into groups, and each group shares morphological features. The forms of affinity vary from similarity in word structure to shared parts of words such as prefixes, suffixes, root words, and stems. Searching for matching names in the dataset starts with the input pattern and creates the n-gram combination, where n represents the number of characters in the input word. If no matching words are found, n is decreased by one to (n − 1), and the search repeats until a match is found or a certain threshold is reached; Dynamic Pruned N-Gram refers to this process.
Commonly, websites limit the number of characters and symbols that can make up the username, and some users omit letters, usually vowels, from the middle of the word. As Figure 1 shows, the search for matching names is combined with another task: handling misspelled words. Therefore, while searching for matches, we consider the possibility of missing characters, whether intentional or not.
Therefore, the previous tasks’ results are a set of nominations for the gender of the extracted parts of the names associated with specific probabilities. By considering these results and frequencies of matchings, a final decision on the gender of the user will be made. Algorithm 1 presents the proposed model’s mechanism with more details and clarifies how the different sub-procedures can interact and cooperate to output the final results.

3.2. The Basic Procedures

Generally, the proposed model incorporates several sub-tasks, as mentioned in Figure 1 and Algorithm 1. The primary procedures include data segmentation, data conversion, data cleaning, dynamic pruned n-gram feature, and misspelling correction, illustrated as follows.
Algorithm 1 Dynamic Pruned N-Gram Feature Selection for User Gender Identification
Input: UserName
Output: Identified Gender of the User
Initialization:
name_segments ← Segment(UserName: string)
for each nsegment in name_segments do
    converted_name ← LeetTranslate(nsegment)
    cleaned_name ← Clean(converted_name)
    name_parts ← Tokenizer(cleaned_name)
    for each name_part in name_parts do
        stringlen ← length of name_part
        ngram = stringlen
        while ngram >= 3 do
            char_groups ← Ngram(name_part, ngram)
            for each group in char_groups do
                cand_name ← Concatenate(group)
                male_sim ← FuzzyMatch(cand_name, males)
                female_sim ← FuzzyMatch(cand_name, female)
            end for
            masculinity = Σ male_sim
            femininity = Σ female_sim
            cand_gender ← GenderRelevance(masculinity, femininity)
            if (cand_gender ∈ [M, F]) then
                break
            end if
            ngram = ngram − 1
        end while
        male_part = P(cand_gender), cand_gender ∈ [M]
        female_part = P(cand_gender), cand_gender ∈ [F]
        if (male_part > female_part) then
            part_gender = [M]
        else
            part_gender = [F]
        end if
    end for
    gender ← Count(part_gender)
end for
if (gender ∉ [M, F]) then
    gender = [A]
end if
return gender

3.2.1. Data Segmentation

Data segmentation in NLP refers to the process of distinguishing the distinct pieces of text that have been accidentally merged or concatenated by determining where the word boundaries are. In other words, it represents the root task of tokenization. Generally, segmentation could be performed on two levels; sentence level and word level. Accordingly, sentence segmentation applies to the text by dividing a sequence of words into sentences. On the other hand, word segmentation applies to the text by dividing a sequence of letters into words.
Concerning our work objectives, our dataset comprises a list of usernames, each consisting of a sequence of characters that may include letters, symbols, or special characters and that together form a single unit. Therefore, in our implementation, we have employed word-level segmentation. Many approaches have recently been introduced for this task, ranging from linguistic-based techniques to machine and/or deep learning techniques. Generally, we can categorize the word segmentation techniques into three categories [38]: (1) Rule-based approaches, (2) Machine learning approaches, and (3) Dictionary-based approaches. In this work, we have employed word segmentation using a dictionary-based algorithm built on the released version of Google’s web trillion-word corpus developed by Peter Norvig [39]. The process starts by scanning characters from the left, adding one at a time to the search pattern, and looking up the dictionary for matching words. Maximal matching algorithms were utilized to avoid stopping at the nearest match and to continue searching for the longest possible pattern. Several lookup dictionaries have been built from the web’s text, usually containing trillions of words. Such dictionaries usually contain the words and their frequencies. Further, the number of words that appear together was considered using the n-gram technique; i.e., a unigram represents one word (n = 1), a bigram consists of two words (n = 2), etc.
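The maximal matching idea described above can be sketched as a greedy longest-match scan. This is a toy illustration, not the authors' implementation: the function name `max_match` and the tiny dictionary are hypothetical, and a real system would use a large frequency dictionary such as the one behind the wordsegment package mentioned in Section 4.1.

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation that always takes the longest match."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit a single character.
            words.append(text[i])
            i += 1
    return words


# Hypothetical toy dictionary; shorter entries such as "mar" and "an" show
# why the longest match is preferred over the nearest one.
TOY_DICTIONARY = {"john", "smith", "mary", "ann", "mar", "an"}

print(max_match("johnsmith", TOY_DICTIONARY))
print(max_match("maryann", TOY_DICTIONARY))
```

Greedy longest matching is simple but can commit to a wrong split; frequency-based scoring over unigrams and bigrams, as described in the text, is one way to reduce such errors.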

3.2.2. Data Conversion and Cleaning

Typically, most websites share the same rules for creating usernames. Website developers allow the customers to compose their usernames using alphabetical letters and numbers. Further, websites may allow underscores or other special characters and symbols. Human names contain only letters, so for gender identification purposes, it is reasonable to keep only letters and remove all other symbols and numbers. Unfortunately, users often add numbers (e.g., a birthdate or marriage date) to their username to distinguish themselves from other users who share the same name. Moreover, the use of numerical substitutions of letters, known as “Leet language”, is increasing. Therefore, removing numbers becomes unsafe and may lead to erroneous results; thus, we must be careful when deciding which numbers need to be removed and which do not.
Leet or “leet speak” is an informal language that substitutes one or more English letters in a word with special characters or numbers that resemble the letters in appearance. In this regard, we make a backward substitution by replacing these numbers with the original letters. The replacement relation is not one-to-one but one-to-many, which means each letter has several possible substitutions [40,41]. Originally, Leet was designed for obscure communication between the users of computer bulletin boards, but it has become a phenomenon that dominates social media, chatting, and comics [42]. Table 1 shows the common substitutions that we have considered in our work.
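A minimal sketch of the backward substitution and cleaning steps follows. The mapping `LEET_TO_LETTERS` is a hypothetical subset chosen for illustration and may differ from Table 1; the function names are also assumptions. Because one symbol can stand for several letters, the sketch enumerates every possible reading.

```python
import re
from itertools import product

# Hypothetical substitution table (assumption; the paper's Table 1 may differ).
LEET_TO_LETTERS = {
    "0": ["o"],
    "1": ["i", "l"],
    "3": ["e"],
    "4": ["a"],
    "5": ["s"],
    "7": ["t"],
    "@": ["a"],
    "$": ["s"],
}


def leet_candidates(token):
    """Return every reading of `token` obtained by reversing Leet substitutions."""
    options = [LEET_TO_LETTERS.get(ch, [ch]) for ch in token.lower()]
    return ["".join(combo) for combo in product(*options)]


def clean(token):
    """Drop everything that is not an ASCII letter."""
    return re.sub(r"[^a-z]", "", token.lower())


print(leet_candidates("j0hn"))
print(leet_candidates("m1ke"))
print(clean("John_82!"))
```

This sketch translates every mapped symbol. Deciding which digits are substitutions and which are, for example, part of a birth year is the harder problem the text describes, and it is not addressed here.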

3.2.3. Dynamic Pruned N-Gram Features Selection

Selecting features dynamically by applying a dynamic pruned n-gram model represents the main purpose of this task. In this context, we refer to the term features as the name or name part potentially extracted from the input pattern. Searching for matchings uses the help of the fuzzy matching technique to handle misspelling characters.
Consequently, after completing the possible replacements of predefined numbers with corresponding letters and cleaning the data by removing the remaining numbers or any other symbols, the selection of features starts. Inputs for this stage comprise a sequence of letters only. After creating the search pattern with the n-gram model, the algorithm searches the labeled dataset for matches using the entire input pattern, where n refers to the number of letters in the input pattern. The labeled dataset consists of names pre-classified into two groups according to the name’s gender, male and female. Next, it computes the similarity between the search pattern and the results from the two groups. The next step is selecting the top results that exceed a certain threshold and computing the aggregated similarity separately for each group. After that, we measure the relevance of the input pattern to the search results. If we find considerable matches, we can directly decide which group the current pattern belongs to.
On the other hand, if there are no matches or the results do not help recognize the gender of the current pattern, n-gram pruning is applied dynamically. Pruning the search pattern means that the search pattern’s length is reduced by one, so that n becomes n − 1. Figure 2 shows an illustration of using this mechanism to improve the performance of the search process. Therefore, creating the search pattern always starts with the highest possible n-gram order, which equals the input pattern length. If there is no match, the algorithm prunes the pattern by decreasing the n-gram order by one at a time until it finds matches or reaches a certain threshold, “the minimum order”, which we have set to n = 3. Since some users prefer to register anonymously, reaching the end of the search process without any matches means that the username gender will be classified as Anonymous.
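The pruning loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the name lists are toy data, the 0.8 threshold is arbitrary, and `difflib.SequenceMatcher` from the standard library stands in for the fuzzy matcher. All function names are hypothetical.

```python
from difflib import SequenceMatcher

# Toy labeled data (assumption; the paper uses a labeled name dataset).
MALE_NAMES = ["john", "michael", "david"]
FEMALE_NAMES = ["mary", "sarah", "emily"]


def char_ngrams(word, n):
    """All contiguous character n-grams of `word`."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def best_similarity(candidate, names):
    """Highest similarity ratio between `candidate` and any name in `names`."""
    return max(SequenceMatcher(None, candidate, name).ratio() for name in names)


def identify_gender(name_part, threshold=0.8, min_order=3):
    """Start at n = len(name_part) and prune n by one until a decision is made."""
    n = len(name_part)
    while n >= min_order:
        male = female = 0.0
        for gram in char_ngrams(name_part, n):
            m = best_similarity(gram, MALE_NAMES)
            f = best_similarity(gram, FEMALE_NAMES)
            # Only matches above the threshold contribute to the aggregate.
            if m >= threshold:
                male += m
            if f >= threshold:
                female += f
        if male > female:
            return "M"
        if female > male:
            return "F"
        n -= 1  # no decision at this order: prune the n-gram
    return "A"  # no match down to the minimum order: Anonymous


print(identify_gender("john"), identify_gender("jhn"), identify_gender("zzzz"))
```

Starting from the full-length pattern favors exact or near-exact matches, and shorter n-grams are consulted only when the longer ones are inconclusive.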

3.2.4. Misspelling Correction

Clients are not obliged to write their names correctly when composing the username string. For example, users may prefer to write names in short forms, like “Jhn” instead of “John”. It is also possible to accidentally miss one or more letters or write wrong letters within the string. Such a problem, regardless of its cause, leads to mismatches during gender identification. Therefore, the search for matches depends greatly on handling misspellings.
Several techniques have been introduced in this context; fuzzy matching, also known as approximate string matching or fuzzy string searching, is widely used to address this problem. It allows words that do not exactly match the input pattern to be recognized with high accuracy. Moreover, many applications such as search engines rely on fuzzy matching to provide relevant results when a user query contains a typo. Fuzzy searching on text can be implemented using many algorithms, such as the Damerau–Levenshtein distance, the Hamming distance, the BK tree (Burkhard–Keller tree), the Bitap algorithm, and the Levenshtein distance, which we have used as the base algorithm for our implementation.
The Levenshtein distance, also known as edit distance, is a string metric that gives the minimum number of edits needed to convert the input word “Str1” to the target word “Str2”; this number represents the difference between the two words. Here, the term edit refers exclusively to the substitution, insertion, or deletion of a single character, one at a time. For example, “Fat” and “Hat” have an edit distance of one, where “F” is substituted with “H”, while “Fate” and “Hat” have a distance of two, where “F” is substituted with “H” and “e” is deleted.
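The edit distance defined above can be computed with the standard dynamic programming recurrence. The following is a minimal sketch for illustration; the function name `levenshtein` is an assumption and this is not the library code used in the experiments.

```python
def levenshtein(source, target):
    """Minimum number of single-character substitutions, insertions,
    or deletions needed to turn `source` into `target`."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        current = [i]
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            current.append(min(
                previous[j] + 1,         # delete s_ch
                current[j - 1] + 1,      # insert t_ch
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Fat", "Hat"))
print(levenshtein("Fate", "Hat"))
print(levenshtein("Jhn", "John"))
```

Keeping only two rows of the table keeps memory proportional to the length of the target word, which is convenient when one username is compared against many candidate names.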

4. Experiments and Results

In order to examine the efficiency of our work, we have tested and evaluated our model on a real dataset collected from online markets’ websites that allow customers to register and write feedback about products or services. The customer review data and users’ information are publicly accessible. The following is a brief description of the developed framework, its components, and the implementation tools used. Next is a discussion of the dataset used in the experiment and an exploration of the achieved results.

4.1. Implementation Details

In this part, the proposed model’s implementation details are presented because it is necessary to verify our work’s viability for implementation and assure its quality and efficiency. To this end, the proposed method has been implemented on a laptop with Intel(R) Core(TM) i7-8750H CPU @2.2 GHz and 8 GB of RAM on Windows 10 x64. We have developed a spider program for crawling and scraping the required data using Scrapy (version 2.2.0) and Selenium (version 3.141.0). Scrapy was used for collecting data from websites that use static URLs to present reviews for readers. On the other hand, Selenium was used to collect data from websites that load content dynamically and require clicking buttons to explore the content. Concerning the application development, we have chosen the Python programming language (version 3.7.3) due to its capabilities and richness of libraries. To conduct the experiments, writing, compiling, and building the code were performed through Geany IDE (version 1.35). The Python development environment offers many packages that are useful for data science applications. We have performed processing tasks such as data conversion and cleaning using Python tools for text processing such as regular expressions. Further, we have used wordsegment (version 1.3.1) for data segmentation and fuzzywuzzy (version 0.18.0) for misspelling correction. The n-gram feature extraction and other text analysis tasks were implemented using the Natural Language Toolkit “NLTK” (version 3.5). The Matplotlib (version 3.3.2) and NumPy (version 1.19.3) libraries have been used for data analysis and visualization purposes.

4.2. Data

Although multiple datasets are available for gender identification purposes, to the best of our knowledge none of them contains actual usernames as used in real practice, so no benchmark dataset can serve our purpose. Therefore, we gathered real data from active websites. The proposed work for gender identification targets the data of customers registered on online market websites that allow customers to write feedback about specific products or services. To this end, we developed a web scraper to collect data from these websites.
Despite the large volume of data available on the Internet and the ability of the developed program to collect it easily, experimenting with this full volume is impractical. The restriction stems from the nature of the testing process, in which classification must be performed twice to validate the results of the proposed model: the raw data is first classified manually and then passed to the model for automatic classification, and the two sets of labels are compared. Therefore, we set criteria that data must meet to be included in the experiment. The primary criterion is that the selected data should cover all cases that the proposed model can handle. We created four datasets derived exclusively from the following websites: Amazon, Bestbuy, Lenovo, and Dell. Table 2 shows the main characteristics of the collected datasets.
The original data contains customers’ reviews and metadata as well. For example, Figure 3 shows a sample of data from BestBuy from June 2020, which was parsed and stored in JSON file format. The review data includes the following fields: review ID, reviewer profile name, customer ratings for the mentioned product, date and time of submitting the review, title and text of the review, ratings for selected characteristics, readers’ votes, the manufacturer’s response data, and a list of mentions that represent features that the customer likes in this product.
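As a minimal sketch of how profile names might be pulled from such stored records, the snippet below parses a small JSON array and collects the reviewer names. The key names (`review_id`, `profile_name`, `rating`, `title`) and the sample records are hypothetical placeholders; the text lists the fields but does not give the actual keys used in the stored files.

```python
import json

# Hypothetical records: the key names below are assumptions for illustration,
# not the actual keys of the stored review files.
SAMPLE = """
[
  {"review_id": "r1", "profile_name": "KingHenry", "rating": 5, "title": "Great laptop"},
  {"review_id": "r2", "profile_name": "Anonymous", "rating": 3, "title": "Average"},
  {"review_id": "r3", "rating": 4, "title": "No name stored"}
]
"""


def extract_profile_names(raw_json, key="profile_name"):
    """Return the non-empty profile names found in a JSON array of reviews."""
    records = json.loads(raw_json)
    # Records without the key (or with an empty value) are skipped.
    return [rec[key] for rec in records if rec.get(key)]


names = extract_profile_names(SAMPLE)
print(names)
```

Only the profile name is needed for the later steps; the remaining review fields are left untouched.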

4.3. Experiment

The first step toward achieving the work objectives is to extract the username list from the created datasets; the output comprises four groups of usernames. Next, the necessary preliminary cleaning is performed, which eliminates list items that contain numbers only, items that explicitly state that the username is “Anonymous”, and items whose number of alphabetical characters is below a specific threshold. Data segmentation is the next step: for each username, the task is to identify the distinct name parts for customers who registered with a full name (e.g., name and surname). The subsequent gender identification processes are applied to each name segment; after that, all name parts are evaluated together to make the final decision on the username’s gender. Table 3 shows real examples from our dataset that illustrate the impact of the word segmentation process on the generated output.
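The preliminary cleaning rules can be sketched as a simple filter. The threshold value `MIN_LETTERS = 3` is an assumption, since the text does not state it. The `split_on_boundaries` helper is only a simplified stand-in for the segmentation step: it splits on case changes and letter/digit boundaries, and does not perform the dictionary-based splitting that wordsegment provides (so it would not split a name such as “Jangthang”).

```python
import re

MIN_LETTERS = 3  # assumed threshold; the actual value is not stated in the text


def keep_username(username, min_letters=MIN_LETTERS):
    """Preliminary cleaning: drop digit-only, 'Anonymous', and too-short items."""
    name = username.strip()
    if name.isdigit():
        return False
    if name.lower() == "anonymous":
        return False
    letters = sum(ch.isalpha() for ch in name)
    return letters >= min_letters


def split_on_boundaries(username):
    """Simplified segmentation on case changes and letter/digit boundaries."""
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", username)


raw = ["AppleGadgetGuy", "12345", "Anonymous", "ab", "Sam2019", "KingHenry"]
kept = [u for u in raw if keep_username(u)]
segments = {u: split_on_boundaries(u) for u in kept}
print(segments)
```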
The gender identification process begins by receiving a name segment as an input pattern. It first searches for numbers and, where a suitable substitution exists, replaces each number with the corresponding letter. Table 4 presents examples of data conversion operations that replace recognized numbers used in the Leet language with the correct letters, considering possible variations. After all possible replacements are complete, all non-alphabetic characters are removed, so the cleaned segment contains only letters. At this point, the n-gram model is applied to generate possible sequences. Starting with n equal to the segment length, it searches for matches, incorporating the fuzzy matching technique to handle misspellings. Query results contain candidate names with their similarities, and the model selects only those most similar to the input pattern.
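The data conversion step can be illustrated with a short sketch that reverses the substitutions of Table 1 and enumerates every possible variation. The function name `candidate_names` is hypothetical, and two simplifications are assumed: two-digit codes (12 and 13) are ignored, and every digit in the segment is treated as a substitution.

```python
from itertools import product

# Letter -> numerical substitutions, copied from Table 1.
LEET = {
    "a": ["4"], "b": ["6", "8", "13"], "e": ["3"], "g": ["6", "9"],
    "i": ["1"], "l": ["1", "7"], "o": ["0"], "p": ["9"], "q": ["9"],
    "r": ["2", "12"], "s": ["5"], "t": ["1", "7"], "z": ["2"],
}

# Reverse map: digit -> candidate letters. Two-digit codes are skipped here
# to keep the enumeration simple (an assumption of this sketch).
REVERSE = {}
for letter, codes in LEET.items():
    for code in codes:
        if len(code) == 1:
            REVERSE.setdefault(code, []).append(letter)


def candidate_names(segment):
    """Enumerate lower-case candidates with each digit replaced by a letter."""
    options = []
    for ch in segment.lower():
        if ch in REVERSE:        # digit with at least one known substitution
            options.append(REVERSE[ch])
        elif ch.isalpha():       # ordinary letter, kept as-is
            options.append([ch])
        # any other character (symbol, space) is dropped
    return sorted({"".join(combo) for combo in product(*options)})


print(candidate_names("J0hn"))
print(candidate_names("7ony"))
```

An ambiguous digit such as 7 yields more than one candidate; the later matching step against the names corpus decides which candidate, if any, is retained.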
Subsequently, the aggregated similarity is computed for each gender, and the segment is assigned to the gender with the highest relevance with respect to the list of possible sequences. If no matches are found, the search is repeated with a lower n-gram order, n = n − 1, until matches are found or n reaches three characters; the probability of each gender is then computed for the part, and the choice with the highest probability is assigned. Finally, the genders assigned to the parts are counted, and the label with the highest frequency is chosen; if the frequencies are equal, the label with the highest relevance is selected. If no matches are found, the username’s gender is assigned as Anonymous.
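The search loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the standard-library `difflib.SequenceMatcher` stands in for fuzzywuzzy, the two short name lists stand in for the NLTK names corpus, and the similarity cut-off `THRESHOLD = 0.85` is an assumed value.

```python
from difflib import SequenceMatcher

# Tiny illustrative name lists (assumption); the experiments use the NLTK
# names corpus as the baseline instead.
MALE = ["john", "andrew", "frank", "george", "tony", "henry", "bruce", "sam"]
FEMALE = ["ellie", "silvia", "ruby", "samantha", "georgia"]

THRESHOLD = 0.85  # assumed similarity cut-off for accepting a fuzzy match
MIN_N = 3         # the search stops once the n-gram order drops below three


def similarity(a, b):
    """Similarity in [0, 1]; a stand-in for the fuzzywuzzy ratio."""
    return SequenceMatcher(None, a, b).ratio()


def ngrams(text, n):
    """All contiguous character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def segment_scores(segment):
    """Aggregated similarity per gender, or None if nothing matches."""
    n = len(segment)
    while n >= MIN_N:
        scores = {"male": 0.0, "female": 0.0}
        for gram in ngrams(segment, n):
            for gender, names in (("male", MALE), ("female", FEMALE)):
                # Keep only the most similar name for this n-gram.
                best = max(similarity(gram, name) for name in names)
                if best >= THRESHOLD:
                    scores[gender] += best
        if any(scores.values()):
            return scores
        n -= 1  # no match at this order: retry with shorter n-grams
    return None


def classify_username(segments):
    """Vote over segments; ties are broken by aggregated relevance."""
    votes, relevance = {}, {}
    for seg in segments:
        scores = segment_scores(seg.lower())
        if scores is None:
            continue
        gender = max(scores, key=scores.get)
        votes[gender] = votes.get(gender, 0) + 1
        relevance[gender] = relevance.get(gender, 0.0) + scores[gender]
    if not votes:
        return "anonymous"
    return max(votes, key=lambda g: (votes[g], relevance[g]))


print(classify_username(["king", "henry"]))
print(classify_username(["andrw"]))
```

Starting from the full segment length means that an exact or near-exact name match is found first, and shorter n-grams are tried only when the longer ones fail.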

4.4. Testing and Evaluation

To verify the proposed model, the following procedure was conducted. Once the preliminary cleaning task had concluded, we randomly selected a sample of data from each dataset separately. We then classified the data manually and assigned the corresponding label to each username in the sample. After that, we applied our model to these samples and obtained the results; the manually assigned labels were hidden during this step. Next, we compared the manually assigned labels with those assigned by the model to compute the accuracy. As a baseline dataset for matching names, we employed the names corpus, which is publicly available and can be downloaded with the Natural Language Toolkit package “NLTK”; it contains 2943 male names and 5001 female names.
Figure 4 shows the gender distribution of the samples selected from our datasets, on which we applied our model blindly. The experimental results are presented in Table 5, which shows the confusion matrix of the proposed algorithm for gender identification. The diagonal entries (i.e., cells [i, i]) show the correct classification results for each label, whereas the off-diagonal entries indicate wrong classification results.
We can observe that, in DS1, DS2, and DS3, the rate of correct female name classifications is higher than that of male names; we suggest that the reason lies in the higher number of female names in the baseline dataset. Further, it is notable that most false predictions for male and female names are assigned to Anonymous, which is logically reasonable. Accordingly, the classification accuracy obtained supports the capabilities and validity of the proposed model.
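The comparison between manual and model labels can be sketched as below. It assumes that each row of the confusion matrix is normalized by the number of manually assigned labels of that class (consistent with the rows of Table 5 summing to roughly 100) and that overall accuracy is the share of usernames whose two labels agree; the toy label lists are invented for illustration and are not taken from the datasets.

```python
from collections import Counter

LABELS = ["male", "female", "anonymous"]


def confusion_percentages(actual, predicted):
    """Row-normalized confusion matrix in percent (rows = manual labels)."""
    counts = Counter(zip(actual, predicted))
    matrix = {}
    for a in LABELS:
        row_total = sum(counts[(a, p)] for p in LABELS)
        matrix[a] = {
            p: (100.0 * counts[(a, p)] / row_total if row_total else 0.0)
            for p in LABELS
        }
    return matrix


def overall_accuracy(actual, predicted):
    """Percentage of items whose manual and model labels agree."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)


# Toy labels for illustration only.
actual = ["male"] * 4 + ["female"] * 4 + ["anonymous"] * 2
predicted = ["male", "male", "male", "anonymous",
             "female", "female", "female", "male",
             "anonymous", "anonymous"]

matrix = confusion_percentages(actual, predicted)
accuracy = overall_accuracy(actual, predicted)
print(matrix["male"], accuracy)
```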

5. Discussion

Various methods have been proposed in the web personalization literature to analyze customers’ data, preferences, and sentiments in order to understand their behavior properly. Personalization has two main objectives: realizing a satisfying interaction for customers and vendors, and predicting the consumer’s needs. Its techniques are widely used by many websites, including search engines and online social networks, to improve their ability to provide users with the information most relevant to their search queries and preferences.
Companies seek to collect and analyze customers’ data for several commercial purposes. One of the most critical pieces of personal data is the gender of the customer; it helps in modeling the customers’ behavior. Such information can support companies in many areas, including product redesign, planning promotional campaigns, advertising, providing customized offers, etc. Although online feedback has been recognized as a significant source for analyzing the market, studies that endeavor to develop techniques to identify the gender of customers from online reviews automatically are rarely seen in the literature.
Several approaches have been introduced to infer the gender of the author; most of them examine written text, such as tweets, comments, and blogs, and rely mainly on extracting features through machine learning techniques and deep learning models. However, these techniques are not suitable for inferring the gender of the author from their profile data, more specifically, the username. Users have been greatly influenced by the linguistic developments related to the expanding use of online social media, chat rooms, etc. This influence appears in several forms, such as the use of Leet language (Leet speak) and name acronyms, which makes the mentioned techniques inappropriate for this gender identification task. Moreover, a username contains too little text for these techniques to be applied.
The study has both theoretical and practical implications. Theoretically, to the best of our knowledge, it is the first study to identify the gender of the author through the username associated with online customer feedback. Our study contributes to the literature on pattern recognition, personalization, and information extraction by introducing an unconventional approach that can automatically infer the gender of the author from their username. Further, the concept of Leet language is still new and is not widely used, and incorporating reverse substitutions represents a novel approach that could be incorporated into many other text analytics domains.
We built diverse datasets to serve as an experimental environment for the proposed model. The created datasets were collected from real and active websites, which comprise customers’ reviews of products and services. Then, we extracted the profile data of customers who wrote the reviews. We also proposed a multi-layered preprocessing framework to handle the gender identification problems. The first layer involves the removal of irrelevant symbols and special characters from the input pattern. Regarding the use of Leet language, reverse substitutions were performed, taking into account all possible cases.
The second layer concerns removing the numbers that remain after all possible number-to-letter replacements are complete. Additionally, misspellings and short-form cases were handled using a dynamic pruned n-gram model and fuzzy matching. We conducted experiments to evaluate the performance of the proposed model and examined its predictive power to validate our approach. The presented results demonstrate its validity and reliability.
In addition to identifying the gender of the customer from online review data, the study has practical implications as well. The proposed model has several potential future business applications; it can be applied to other online networks, such as Twitter, Instagram, etc. Further, it could be incorporated with other systems such as recommender systems and user profiling.

6. Conclusions

Gender classification represents a significant task in modeling user behavior. In this work, we presented a dynamic pruned n-gram model for recognizing the gender of customers from their usernames. We used the review data available on online websites to extract the username datasets. Identifying the gender of the customer through their profile name encounters several challenges, which we have stated clearly. Therefore, our model comprises several subtasks that cooperate to operate efficiently, including segmentation, numerical substitution, and fuzzy matching. The results show our work’s validity and efficiency for predicting the correct gender using short data values. As a future direction, we are considering applying our model to datasets in other languages, specifically Arabic, and measuring its validity and effectiveness.

Author Contributions

Conceptualization, N.M.A.; methodology, N.M.A., A.A., A.M.A. and B.N.; software, N.M.A., A.A., A.M.A. and B.N.; validation, N.M.A., A.A., A.M.A. and B.N.; formal analysis, N.M.A., A.A. and A.M.A.; investigation, N.M.A., A.A., A.M.A. and B.N.; resources, N.M.A., A.A., A.M.A. and B.N.; data curation, N.M.A., A.A., A.M.A. and B.N.; writing—original draft preparation, N.M.A. and A.A.; writing—review and editing, N.M.A., A.M.A. and B.N.; visualization, N.M.A., A.A. and B.N.; supervision, B.N.; project administration, N.M.A., A.A., A.M.A. and B.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The researcher (N.M.A.) is funded by a scholarship (EGY-6428/17) under the Joint Executive Program between the Arab Republic of Egypt and the Russian Federation.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kauffmanna, E.; Peralb, J.; Gilc, D.; Ferrándezb, A.; Sellersd, R.; Mora, H. A Framework for Big Data Analytics in Commercial Social Networks: A Case Study on Sentiment Analysis and Fake Review Detection for Marketing Decision-Making. Ind. Mark. Manag. 2020, 90, 523–537. [Google Scholar] [CrossRef]
  2. Wang, H.; Xu, Z.; Fujita, H.; Liu, S. Towards Felicitous Decision Making: An Overview on Challenges and Trends of Big Data. Inf. Sci. 2016, 367–368, 747–765. [Google Scholar] [CrossRef]
  3. Ali, N.M. Aspect-Oriented Analytics of Big Data. In Proceedings of the 14th International Baltic Conference on Databases and Information Systems (Baltic DB&IS 2020), Tallinn, Estonia, 16–19 June 2020; Volume 2620, pp. 41–48. [Google Scholar]
  4. Amplayo, R.K.; Lee, S.; Song, M. Incorporating Product Description to Sentiment Topic Models for Improved Aspect-based Sentiment Analysis. Inf. Sci. 2018, 454–455, 200–215. [Google Scholar] [CrossRef]
  5. Thelwall, M.; Stuart, E. She’s Reddit: A Source of Statistically Significant Gendered Interest Information? Inf. Process. Manag. 2019, 56, 1543–1558. [Google Scholar] [CrossRef] [Green Version]
  6. Ali, N.M.; Gadallah, A.M.; Hefny, H.A.; Novikov, B. An Integrated Framework for Web Data Preprocessing Towards Modeling User Behavior. In Proceedings of the 2020 International Multi-Conference on Industrial Engineering and Modern Technologies (FarEastCon), Vladivostok, Russia, 6–9 October 2020; pp. 1–8. [Google Scholar] [CrossRef]
  7. Al-Yazeed, N.M.A.; Gadallah, A.M.; Hefny, H.A. A Hybrid Recommendation Model for Web Navigation. In Proceedings of the The Seventh IEEE International Conference on Intelligent Computing and Information Systems (ICICIS), Cairo, Egypt, 12–14 December 2015; pp. 552–560. [Google Scholar] [CrossRef]
  8. Lopes, C.; Cabral, B.; Bernardino, J. Personalization Using Big Data Analytics Platforms. In Proceedings of the Ninth International C* Conference on Computer Science & Software Engineering (C3S2E’16), Porto, Portugal, 20–22 July 2016; pp. 131–132. [Google Scholar] [CrossRef]
  9. Chen, M.J.; Farn, C.K. Examining the Influence of Emotional Expressions in Online Consumer Reviews on Perceived Helpfulness. Inf. Process. Manag. 2020, 57, 102266. [Google Scholar] [CrossRef]
  10. Ali, N.M.; Novikov, B. A Multi-Source Big Data Framework for Capturing and Analyzing Customer Feedback. In Proceedings of the 2021 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (ElConRus), St. Petersburg and Moscow, Russia, 26–29 January 2021; pp. 185–190. [Google Scholar] [CrossRef]
  11. Goel, S.; Kumar, R. Collaboratively Augmented UIP—Filtered RIP with Relevancy Mapping for Personalization of Web Search. Inf. Sci. 2021, 547, 163–186. [Google Scholar] [CrossRef]
  12. Chen, Y.; Dai, Y.; Han, X.; Ge, Y.; Yin, H.; Li, P. Dig Users’ Intentions via Attention Flow Network for Personalized Recommendation. Inf. Sci. 2021, 547, 1122–1135. [Google Scholar] [CrossRef]
  13. Fosch-Villaronga, E.; Poulsen, A.; Søraa, R.A.; Custers, B.H.M. A Little Bird Told Me Your Gender: Gender Inferences in Social Media. Inf. Process. Manag. 2021, 58, 102541. [Google Scholar] [CrossRef]
  14. Kim, Y.; Kim, J.H. Using Computer Vision Techniques on Instagram to Link Users’ Personalities and Genders to the Features of their Photos: An Exploratory Study. Inf. Process. Manag. 2018, 54, 1101–1114. [Google Scholar] [CrossRef]
  15. Livieris, I.E.; Pintelas, E.; Pintelas, P. Gender Recognition by Voice Using an Improved Self-Labeled Algorithm. Mach. Learn. Knowl. Extr. 2019, 1, 492–503. [Google Scholar] [CrossRef] [Green Version]
  16. Cascone, L.; Medaglia, C.; Nappi, M.; Narducci, F. Pupil Size as A Soft Biometrics for Age and Gender Classification. Pattern Recognit. Lett. 2020, 140, 238–244. [Google Scholar] [CrossRef]
  17. Rim, B.; Kim, J.; Hong, M. Gender Classification from Fingerprint-images using Deep Learning Approach. In Proceedings of the International Conference on Research in Adaptive and Convergent Systems, Gwangju, Korea, 13–16 October 2020; pp. 7–12. [Google Scholar] [CrossRef]
  18. Nayak, J.S.; Indiramma, M. An Approach to Enhance Age Invariant Face Recognition Performance Based on Gender Classification. J. King Saud Univ.-Comput. Inf. Sci. 2021. [Google Scholar] [CrossRef]
  19. Rwigema, J.; Mfitumukiza, J.; Tae-Yong, K. A Hybrid Approach of Neural Networks for Age and Gender Classification through Decision Fusion. Biomed. Signal Process. Control 2021, 66, 102459. [Google Scholar] [CrossRef]
  20. Ali, N.M.; Gadallah, A.M.; Hefny, H.A.; Novikov, B. Online Web Navigation Assistant. Vestn. Udmurt. Univ. Mat. Mekhanika. Komp’Yuternye Nauk. 2021, 31, 116–131. [Google Scholar] [CrossRef]
  21. Díez, J.; Pérez-Núñez, P.; Luaces, O.; Remeseiro, B.; Bahamonde, A. Towards Explainable Personalized Recommendations by Learning from Users’ Photos. Inf. Sci. 2020, 520, 416–430. [Google Scholar] [CrossRef] [Green Version]
  22. Lyu, Y.; Chow, C.Y.; Wang, R.; Lee, V.C.S. iMCRec: A Multi-Criteria Framework for Personalized Point-of-Interest Recommendations. Inf. Sci. 2019, 483, 294–312. [Google Scholar] [CrossRef]
  23. Da’u, A.; Salim, N.; Rabiu, I.; Osman, A. Recommendation System Exploiting Aspect-based Opinion Mining With Deep Learning Method. Inf. Sci. 2020, 512, 1279–1292. [Google Scholar] [CrossRef]
  24. Renjith, S.; Sreekumar, A.; Jathavedan, M. An Extensive Study on the Evolution of Context-Aware Personalized Travel Recommender Systems. Inf. Process. Manag. 2020, 57, 102078. [Google Scholar] [CrossRef]
  25. Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python, 1st ed.; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2009. [Google Scholar]
  26. Manning, C.D.; Raghavan, P.; Schütze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008. [Google Scholar]
  27. Sun, S.; Luo, C.; Chen, J. A Review of Natural Language Processing Techniques for Opinion Mining Systems. Inf. Fusion 2017, 36, 10–25. [Google Scholar] [CrossRef]
  28. Simaki, V.; Aravantinou, C.; Mporas, I.; Megalooikonomou, V. Using Sociolinguistic Inspired Features for Gender Classification of Web Authors. In Proceedings of the International Conference on Text, Speech, and Dialogue TSD 2015: Text, Speech, and Dialogue, Pilsen, Czech Republic, 14–17 September 2015; Springer: Cham, Switzerland, 2015; Volume 9302, pp. 587–594. [Google Scholar] [CrossRef]
  29. Kucukyilmaz, T.; Deniz, A.; Kiziloz, H.E. Boosting Gender Identification Using Author Preference. Pattern Recognit. Lett. 2020, 140, 245–251. [Google Scholar] [CrossRef]
  30. Das, S.; Paik, J.H. Context-Sensitive Gender Inference of Named Entities in Text. Inf. Process. Manag. 2021, 58, 102423. [Google Scholar] [CrossRef]
  31. Alsmearat, K.; Al-Ayyoub, M.; Al-Shalabi, R.; Kanaan, G. Author Gender Identification from Arabic Text. J. Inf. Secur. Appl. 2017, 35, 85–95. [Google Scholar] [CrossRef]
  32. Hussein, S.; Farouk, M.; Hemayed, E. Gender Identification of Egyptian Dialect in Twitter. Egypt. Inform. J. 2019, 20, 109–116. [Google Scholar] [CrossRef]
  33. ElSayed, S.; Farouk, M. Gender Identification for Egyptian Arabic Dialect in Twitter Using Deep Learning Models. Egypt. Inform. J. 2020, 21, 159–167. [Google Scholar] [CrossRef]
  34. Sboev, A.; Moloshnikov, I.; Gudovskikh, D.; Selivanov, A.; Rybka, R.; Litvinova, T. Automatic Gender Identification of Author of Russian Text by Machine Learning and Neural Net Algorithms in Case of Gender Deception. Procedia Comput. Sci. 2018, 123, 417–423. [Google Scholar] [CrossRef]
  35. Sboev, A.; Moloshnikov, I.; Gudovskikh, D.; Selivanov, A.; Rybka, R.; Litvinova, T. Deep Learning Neural Nets Versus Traditional Machine Learning in Gender Identification of Authors of RusProfiling Texts. Procedia Comput. Sci. 2018, 123, 424–431. [Google Scholar] [CrossRef]
  36. Filho, J.A.B.L.; Pasti, R.; Castro, L.N.D. Gender Classification of Twitter Data Based on Textual Meta-Attributes Extraction. In New Advances in Information Systems and Technologies; Rocha, Á., Correia, A.M., Adeli, H., Reis, L.P., Teixeira, M.M., Eds.; Springer: Cham, Switzerland, 2016; Volume 444, pp. 1025–1034. [Google Scholar] [CrossRef]
  37. Wais, K. Gender Prediction Methods Based on First Names with genderizeR. R J. 2016, 8, 17–37. [Google Scholar] [CrossRef] [Green Version]
  38. Venkataraman, A. Word Segmentation for Classification of Text. Master’s Thesis, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Uppsala University, Department of Information Technology, Uppsala, Sweden, 2019. [Google Scholar]
  39. Norvig, P. Natural Language Corpus Data. In Beautiful Data: The Stories Behind Elegant Data Solutions, 1st ed.; Book Section 14; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2009; pp. 219–242. [Google Scholar]
  40. Sharpened Productions. The Slangit Leet Sheet; Sharpened Productions: Minneapolis, MN, USA, 2021; Available online: https://slangit.com/leet_sheet (accessed on 20 March 2021).
  41. Christensson, P. Leet Definition. 2019. Available online: https://techterms.com/definition/leet (accessed on 20 March 2021).
  42. Mitchell, A. A Leet Primer. 2005. Available online: https://www.technewsworld.com/story/47607.html (accessed on 20 March 2021).
Figure 1. The Experimental Framework of the Gender Identification Model.
Figure 2. An illustration of using the n-gram model for synthesizing the search patterns.
Figure 3. Sample of the original review data.
Figure 4. The gender distribution for each dataset in terms of percentage.
Table 1. The most common letters that have numerical substitutions.

| Alphabetical Character | Numerical Substitutions |
|---|---|
| A | 4 |
| B | 6, 8, 13 |
| E | 3 |
| G | 6, 9 |
| I | 1 |
| L | 1, 7 |
| O | 0 |
| P | 9 |
| Q | 9 |
| R | 2, 12 |
| S | 5 |
| T | 1, 7 |
| Z | 2 |
Table 2. Characteristics of the collected datasets.

| Dataset | Website | Product Category | Count | Start Date | End Date | Review Count |
|---|---|---|---|---|---|---|
| DS1 | Bestbuy | Laptop | 3 | Apr.-2019 | Sept.-2020 | 4355 |
| DS2 | Amazon | Laptop | 4 | May-2018 | Sept.-2020 | 2500 |
| DS3 | Dell | Laptop | 1 | Nov.-2019 | Sept.-2020 | 2325 |
| DS4 | Lenovo | Laptop | 2 | Oct.-2018 | Sept.-2020 | 8163 |
Table 3. A sample of usernames and the produced list of names after segmentation.

| Username (Raw Data) | Name Segments (after Segmentation) |
|---|---|
| AppleGadgetGuy | [Apple, Gadget, Guy] |
| Jangthang | [Jang, thang] |
| Floridamarlin63 | [Florida, marlin, 63] |
| KingHenry | [King, Henry] |
| Jimmyk | [Jimmy, k] |
| BruceLee | [Bruce, Lee] |
| Rubysss | [Ruby, sss] |
| packerboy82 | [packer, boy, 82] |
| Sam2019 | [Sam, 2019] |
Table 4. A sample of usernames that have numerical substitution and the equivalent correct names.

| Username | Substituted Letter | Correct Name |
|---|---|---|
| J0hn | O (written as 0, zero) | John |
| 4ndrew | A (written as 4) | Andrew |
| Fr4nk | A (written as 4) | Frank |
| 9eorge | G (written as 9) | George |
| 3llie | E (written as 3) | Ellie |
| 5ilvia | S (written as 5) | Silvia |
| 7ommy | T (written as 7) | Tommy |
| 7ony | T (written as 7) | Tony |
| Enri9ue | Q (written as 9) | Enrique |
Table 5. Confusion matrix of the proposed gender classification method on the dataset in terms of percentage.

| Dataset | Gender | Male | Female | Anonymous | Overall Accuracy |
|---|---|---|---|---|---|
| DS1 | Male | 80.0 | 5.5 | 14.5 | 88.6 |
| | Female | 2.7 | 87.3 | 10.0 | |
| | Anonymous | 0.5 | 1.1 | 98.4 | |
| DS2 | Male | 78.5 | 7.9 | 13.6 | 83.8 |
| | Female | 4.8 | 89.1 | 6.1 | |
| | Anonymous | 2.9 | 5.7 | 91.4 | |
| DS3 | Male | 85.6 | 4.6 | 9.7 | 89.2 |
| | Female | 2.4 | 90.7 | 6.8 | |
| | Anonymous | 3.0 | 4.0 | 93.0 | |
| DS4 | Male | 84.2 | 7.3 | 8.5 | 84.2 |
| | Female | 4.8 | 83.4 | 11.7 | |
| | Anonymous | 3.2 | 11.6 | 85.3 | |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
