1. Introduction
Nowadays, a multitude of short texts are generated through various communication channels, such as short message services (SMS), microblogging, instant messaging, and e-commerce services. For instance, Twitter receives approximately 6000 posts per second [1]. Short texts serve as a convenient and cost-effective means to connect with individuals. Research indicates the high reliability of SMS, with 99% of all SMS messages being read by their recipients [2]. For this reason, spammers take advantage of short texts to spread unwanted advertisements and malicious messages.
Spam detection for short texts presents unique difficulties because of their characteristics [3]. First, the limited length may not carry sufficient semantic information. Second, the wide range of topics results in a high degree of sparsity in the short text representation matrix [4]. Third, short texts are rapidly and constantly generated, necessitating real-time, high-throughput spam filtering. Lastly, informal writing is prevalent in short texts; that is, short texts are frequently composed in a casual, idiosyncratic, and occasionally misspelled manner [5]. For example, people commonly substitute “thx” for “thanks” and “im here” for “I am here”. Some short texts are even intentionally and maliciously crafted to evade spam filters [6]. For instance, spammers may write “kredit kard” for “credit card” and “banc acct” for “bank account”. This informal writing style introduces numerous new words into short texts, thereby complicating the identification of spam messages.
Researchers have invested significant effort in short text spam filtering. Traditional learning methods, including statistical techniques such as naïve Bayes (NB) [7], the vector space model (VSM) [8], the support vector machine (SVM) [9], and k-nearest neighbor (KNN) [10], often treat a text as a collection of independent words and disregard word order. These approaches rely on statistical feature extraction methods such as TF-IDF (term frequency–inverse document frequency) [11].
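To make the bag-of-words view concrete, the following minimal sketch extracts TF-IDF features with scikit-learn from a hypothetical three-message corpus and reports the sparsity of the resulting matrix, echoing the sparsity issue noted above; the corpus and settings are illustrative, not taken from this study.

```python
# Minimal sketch of TF-IDF feature extraction for short texts (illustrative;
# the corpus here is hypothetical, not from the paper's datasets).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "free entry win a prize now",   # spam-like
    "im here see you at lunch",     # ham-like
    "thx for the ride",             # ham-like
]

vectorizer = TfidfVectorizer()        # bag-of-words: word order is discarded
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix

# Short, topically diverse texts yield a highly sparse matrix: each row has
# only a handful of nonzero TF-IDF weights.
print(X.shape, f"{X.nnz / (X.shape[0] * X.shape[1]):.2%} nonzero")
```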
To further improve accuracy, deep learning models have recently been deployed to address these issues. Several approaches target the challenge of limited text length, often referred to as the “shortness issue”. Gao et al. [4] addressed this issue by implementing a convolutional neural network (CNN) and a bidirectional gated recurrent unit (Bi-GRU) model, integrated with the TF-IDF algorithm. Zhu et al. [12] harnessed bidirectional encoder representations from transformers (BERT) to extract more relevant features from the user’s sentiment context. Machicao et al. [13] proposed a novel approach that combines network structure and dynamics based on cellular automata theory to capture patterns in text networks, thereby enhancing the text representation. The latest research, building upon the advancements of the generative pre-trained transformer 3 (GPT-3), achieves even higher accuracy [14]. To address the issue of sparsity, researchers have adopted strategies to expand and enrich the feature space. Liao et al. [15] treated each category as a subtask and utilized the robustly optimized BERT pre-training approach, based on the deep bidirectional Transformer, to extract features from both the text and category tokens. Wang et al. [16] addressed sparsity by semantically expanding short texts: they incorporated an attention mechanism into their neural network model to identify and include words related to the short text. Cai et al. [17] used the attention mechanism to further strengthen the extraction of sentiment features. However, deep learning models achieve their high accuracy through complex architectures that demand heavy computation; they prioritize accuracy and may overlook training and filtering speed. Most of these algorithms are still in their early stages of development [18], because balancing the objectives of high accuracy and high-throughput spam filtering is a formidable challenge, one that is particularly critical in the filtering industry. This study treats the limited words in a short text as a sequence of dependent features. Making use of word order, we apply a hidden Markov model (HMM) for short text filtering, which achieves both high accuracy and high throughput.
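For intuition about sequence-based filtering, the sketch below scores a message under two class-conditional first-order Markov chains and labels it by the higher likelihood. This is a deliberate simplification, not the paper's HMM, whose formulation is presented in Section 2; the toy data and names are hypothetical.

```python
# Illustrative sketch: word-order-aware spam scoring with class-conditional
# first-order Markov chains (a simplified stand-in for the paper's HMM; the
# actual formulation is given in Section 2). All data here are toy examples.
from collections import defaultdict
from math import log

def train_chain(sequences):
    """Collect bigram transition counts for one class, with a start token."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for words in sequences:
        for prev, cur in zip(["<s>"] + words, words):
            counts[prev][cur] += 1
            totals[prev] += 1
    return counts, totals

def log_likelihood(words, model, vocab_size):
    """Add-one-smoothed log-likelihood of a word sequence under one chain."""
    counts, totals = model
    return sum(
        log((counts[prev][cur] + 1) / (totals[prev] + vocab_size))
        for prev, cur in zip(["<s>"] + words, words)
    )

spam = [["win", "free", "prize"], ["free", "credit", "card"]]
ham = [["see", "you", "at", "lunch"], ["thanks", "for", "the", "ride"]]
vocab_size = len({w for seq in spam + ham for w in seq})

spam_model, ham_model = train_chain(spam), train_chain(ham)
msg = ["free", "prize"]
ll_spam = log_likelihood(msg, spam_model, vocab_size)
ll_ham = log_likelihood(msg, ham_model, vocab_size)
print("spam" if ll_spam > ll_ham else "ham")  # -> spam
```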
Furthermore, because of the shortness issue, it is crucial to identify frequently occurring unknown strings. However, research on new word weighting, particularly on mitigating the informal writing issue under the specific challenges of short text filtering, is limited. Several methods have been devised to identify new words in long texts; these methods extract independent features from the remaining portions of the text. Qian et al. [19] used word embeddings and frequent n-gram string mining to identify new words for Chinese word segmentation. For a similar purpose, Duan et al. [20] used a bidirectional long short-term memory (LSTM) model and conditional random fields to process manually chosen features, including word length, part of speech, contextual entropy, and degree of word coagulation.
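As a rough illustration of frequency-based new word detection, the sketch below counts tokens that fall outside a known vocabulary and keeps those that recur. The vocabulary, messages, and threshold are hypothetical, and the methods in [19,20] are considerably more sophisticated.

```python
# Minimal sketch of frequency-based mining of unknown strings, in the spirit
# of the frequent-string mining in [19]; vocabulary, messages, and the
# threshold are hypothetical.
from collections import Counter

known_vocab = {"a", "get", "now", "open", "offer", "inside", "today",
               "credit", "card", "account"}
messages = [
    "get a kredit kard now",
    "kredit kard offer inside",
    "open a banc acct today",
]

# Count tokens absent from the known vocabulary.
unknown_counts = Counter(
    tok for msg in messages for tok in msg.split() if tok not in known_vocab
)

# Tokens recurring at least twice are candidate new words worth weighting.
print([tok for tok, n in unknown_counts.items() if n >= 2])  # -> ['kredit', 'kard']
```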
More importantly, quantifying the extent to which a new word signals divergence is particularly significant in short text filtering, given the constraints of limited text length, feature sparsity, computational resources, and performance considerations [21]. The weight assigned to a new word is linked to the presence of other known words in all the short texts where it occurs. This study introduces a novel approach that calculates a new word weight from both the weights of known words in those short texts and the probabilities, predicted by an artificial neural network (ANN), of the texts being ham or spam. The novel weighting method improves accuracy further without compromising processing speed, thus achieving superior performance and filtering speed concurrently.
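Since the exact weighting formula is deferred to Section 2, the following sketch shows one plausible reading of the idea: a new word's weight averages, over the texts containing it, the mean weight of co-occurring known words blended with the ANN's predicted spam probability. The blending rule, function names, and data are illustrative assumptions, not the paper's formulation.

```python
# Hedged sketch of the new-word weighting idea: combine the weights of known
# neighbor words with an ANN's spam probability for each text containing the
# new word. The equal-weight blend is an illustrative assumption; the exact
# formulation is given in Section 2.
def new_word_weight(new_word, texts, known_weights, ann_spam_prob):
    """Average, over texts containing `new_word`, the mean known-word weight
    blended with the ANN's predicted spam probability for that text."""
    scores = []
    for text in texts:
        words = text.split()
        if new_word not in words:
            continue
        neighbors = [known_weights[w] for w in words if w in known_weights]
        if not neighbors:
            continue
        neighbor_mean = sum(neighbors) / len(neighbors)
        # Blend corpus evidence (neighbor weights) with model evidence
        # (ANN probability); equal blending is an illustrative choice.
        scores.append(0.5 * neighbor_mean + 0.5 * ann_spam_prob(text))
    return sum(scores) / len(scores) if scores else 0.5  # 0.5 = uninformative

# Toy usage with a stub in place of a trained ANN.
known_weights = {"credit": 0.9, "card": 0.85, "offer": 0.7}  # spam-leaning
texts = ["kredit kard credit offer", "free kredit card offer"]
print(new_word_weight("kredit", texts, known_weights, lambda t: 0.8))
```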
In summary, this study proposes a hybrid model to tackle the challenges posed by informal writing in the fast filtering of short texts. It combines an ANN for new word weighting with an HMM for filtering at a high processing speed without imposing a heavy computational burden. Extensive experiments are conducted on the SMS Spam Collection hosted at the University of California, Irvine (UCI) and four other datasets to illustrate the effectiveness of the proposed hybrid model. Performance is evaluated on key criteria, including accuracy, training time, training speed, and filtering throughput, enabling a thorough comparison of the proposed hybrid model’s capabilities.
The contributions of the paper are summarized as follows:
A novel new word weighting method based on the ANN model is developed. The weight of a word measures its likelihood of being densely distributed in one category. The weight of a new word in a short text is computed from the weights of its neighboring words and the probabilities yielded by the ANN.
When all words are properly weighted, a hybrid model that combines the ANN and an HMM is proposed for accurate and fast short text filtering. The HMM is used to predict the likelihood of a short text being spam.
The hybrid model represents pioneering research in the specialized domain of short text filtering, addressing unique challenges like limited length and feature sparsity with novel approaches.
The rest of the paper is organized as follows. The proposed hybrid model is presented in Section 2. The evaluation metrics, experimental results, and discussion are given in Section 3. The conclusions and future work are drawn in Section 4.