#### *3.4. Methods*

The solutions discussed in this paper are based on supervised learning, since they rely on training sets in which the target labels are annotated. The generated dictionary is a collection of labeled words, each assigned to one of the two target categories: spam or ham. The models make their predictions based on the dictionary's content. One can imagine posing a question to the program: if this e-mail consists of these words, is it spam or ham? The model answers this previously unseen question by comparing it to the similar questions and answers (labels) it was given at the starting point.

The process of labeling (generating a dictionary in the case of the described application) is carried out with the use of a training set. A test set is used to measure the program's performance during the last step of the experiment.

Classification, the task of interest in this paper, is one of the prevailing supervised machine-learning tasks. Its goal is to predict discrete values (categories, classes, or labels) for new examples (not seen by the program before) from one or more features. The set of classes is finite, and there are several types of learning; spam filtering is a two-class learning problem (also referred to as binary classification) [41]. The program (or the part of it) performing a classification task is called a classifier. In this paper, the classifiers were implemented with scikit-learn (sklearn), a free machine-learning library for the Python programming language.

The training phase is aimed at minimizing errors, but it is important to remember that no model is perfect. Here, we use a set of standard measures defined in the context of a confusion matrix: true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP). Of the four, the most undesirable outcome in the case of spam filtering is a false positive, as it may result in the loss of critical information. Several parameters that allow evaluation of the classifiers are built from the values that make up the confusion matrix: accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Accuracy was the main indicator of classifier performance in the tests carried out in this research. In the most interesting cases, all five parameters were calculated for each tested classifier.
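For reference, the five parameters follow directly from the confusion-matrix entries. The sketch below assumes spam is the positive class (so a false positive is a legitimate message lost); the toy labels are illustrative, not the paper's data.

```python
# A minimal sketch of the five evaluation metrics, assuming spam is the
# positive class (1 = spam, 0 = ham); the toy labels are illustrative.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true-positive rate (recall)
specificity = tn / (tn + fp)   # true-negative rate
ppv = tp / (tp + fp)           # positive predictive value (precision)
npv = tn / (tn + fn)           # negative predictive value

print(accuracy, sensitivity, specificity, ppv, npv)
```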

Below, we present the three machine-learning algorithms that we compare on the task of spam detection.

Despite its simplicity, *k*-nearest neighbours (*k*-NN) has proved successful in a great number of supervised machine-learning tasks [42]. *k*-NN classifies a new point (in the multidimensional space, where each point is a vector representing a sample, here a single e-mail) based on the *k* elements nearest to it. *k*-NN is sometimes called a "lazy learner", meaning that it does not learn up front but postpones the work until the very last moment. Gathering and labeling the data can be regarded as the training phase: once the data are ready, the training stage is complete. However, this leads to a time-consuming testing phase, during which the pairwise distances are calculated and compared.

Supervised neighbour-based learning methods are provided by the sklearn.neighbors module. *k*-NN may be implemented with KNeighborsClassifier, and the specific line of code responsible for the model definition is (when *k* = 5):

```python
model = KNeighborsClassifier(n_neighbors=5)
```

When a new query point is given, KNeighborsClassifier predicts its label based on the *k* nearest training points (*n_neighbors*). The distance function applied by us is simply the standard Euclidean distance.
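A minimal, self-contained sketch of the above follows; the toy word-count vectors and labels are illustrative, not the paper's data.

```python
# A minimal k-NN sketch on toy word-count vectors (rows = e-mails,
# columns = dictionary words); labels: 1 = spam, 0 = ham.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[3, 0, 1], [0, 4, 2], [2, 1, 0],
                    [0, 3, 3], [4, 1, 0], [1, 5, 2]])
y_train = np.array([0, 1, 0, 1, 0, 1])

model = KNeighborsClassifier(n_neighbors=5)  # Euclidean distance by default
model.fit(X_train, y_train)                  # "lazy": fit only stores the data
print(model.predict([[1, 0, 1]]))            # distances computed at query time
```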

When the corpus we are working with is large, there may be hundreds of thousands of features in the dictionary. If we convert the text documents (for instance, e-mails) into feature vectors, each of them will have hundreds of thousands of components, and most of these will be zero. Such vectors are referred to as sparse. High-dimensional data are problematic for all machine-learning tasks due to the well-known curse of dimensionality [43]: they demand more memory and computation than low-dimensional vectors. This difficulty may be overcome with the scipy Python library, whose data types store only the nonzero elements of sparse vectors. The second aspect is that high dimensionality brings the threat of an insufficient number of documents in the training set. It is necessary to make sure that there are enough training instances to cover all features. Otherwise, the algorithm may overfit: the results are satisfactory for the training set of samples, but not for the testing set (and subsequent usage).
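For illustration, scikit-learn's CountVectorizer already returns such scipy sparse matrices; the toy corpus below is ours, not the paper's data.

```python
# A minimal sketch showing that vectorized text is stored sparsely.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["win money now", "meeting schedule attached",
          "win a free prize now"]
X = CountVectorizer().fit_transform(corpus)

print(type(X))         # a scipy.sparse matrix, not a dense array
print(X.shape, X.nnz)  # only the nonzero word counts are stored
print(X.toarray())     # dense view, feasible only for tiny examples
```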

Support vector machines (SVM) [44] are most typically used in classification applications, although their usefulness is broader (e.g., outlier detection). Given a labeled dataset, an SVM finds a classification (separation) hyperplane by maximizing the distance between data points (vectors representing samples) belonging to different classes. There are two types of SVM models: hard-margin (each point must be classified correctly) and soft-margin (some misclassification is tolerated). Contrary to the *k*-NN classifier, it is beneficial for the SVM to operate in high dimensions [45]: as the number of features increases, data points tend to become more easily separable. The points closest to the classification hyperplane are called support vectors. The hyperplane is also referred to as a decision boundary and separates elements belonging to different categories. The gap between the two parallel hyperplanes drawn through the support vectors is called a margin; the bigger the margin, the better.

In the application built for the purposes of this paper, two support vector classification (SVC)-based models, NuSVC() and LinearSVC(), were implemented with sklearn, with all parameters taking their default values.
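A minimal sketch of the two models with default parameters is given below; the toy word-count vectors and labels are illustrative.

```python
# A minimal sketch of the two SVM variants with default parameters.
import numpy as np
from sklearn.svm import LinearSVC, NuSVC

# Toy word-count vectors and labels (1 = spam, 0 = ham), for illustration.
X_train = np.array([[3, 0, 1], [0, 4, 2], [2, 1, 0], [0, 3, 3]])
y_train = np.array([0, 1, 0, 1])
X_test = np.array([[1, 3, 2]])

for Model in (NuSVC, LinearSVC):
    clf = Model()  # all parameters left at their default values
    clf.fit(X_train, y_train)
    print(Model.__name__, clf.predict(X_test))
```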

The family of naïve Bayes (NB) classifiers is based on the Bayes theorem, which relates absolute and conditional probabilities. In the case of machine learning and spam recognition, the probabilities can be associated with the relative frequencies of word appearance in messages (i.e., relative word-frequency counts). The second concept is the so-called naïve assumption that all features are independent of each other given the output (the class to which they belong). Although this assumption of independence rarely holds true, the naïve Bayes classifier can perform very successful classification, even if the training data do not provide many examples. Moreover, the classifiers belonging to the NB family are known to be fast and simple.
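Formally, under the naïve independence assumption the posterior probability of a class factorizes over the word features (this is the standard formulation; the notation below is illustrative):

$$
P(c \mid w_1, \dots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c), \qquad c \in \{\text{spam}, \text{ham}\},
$$

where $w_i$ are the word features (here, word counts) and the predicted class is the one maximizing the right-hand side.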

The variant tested for the purpose of this paper is provided by sklearn. The multinomial naïve Bayes classifier MultinomialNB() applies the NB algorithm to multinomially distributed data [46]. It is also the most common option used in text classification. The data are represented in the form of word-count vectors.
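A minimal sketch, with an illustrative toy corpus standing in for the real training set, follows.

```python
# A minimal sketch of multinomial naive Bayes on word-count vectors;
# the toy corpus and labels (1 = spam, 0 = ham) are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_mails = ["win money now", "meeting schedule attached",
               "free prize win", "project deal report"]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_mails)  # multinomial word-count features

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["win a free deal"])))
```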

#### **4. Numerical Results with Validation**

The results were obtained with our proprietary software solution developed in Python 3.7.3.

#### *4.1. Datasets Structure*

The classifiers were tested on four datasets of various sizes. Three of them are extracts of the Enron Corpus [35], drawn from its parts enron1, enron2, enron4, and enron5. In this phase, we propose to introduce cross-validation between different datasets (enron1 and enron4, as well as enron2 and enron5) in the training and test phases. The structure of these datasets is described in detail in Tables 4–6. The fourth dataset (Table 7) is an exact copy of the part of the Lingspam corpus used by Gregory Piatetsky-Shapiro and Matthew Mayo as the foundation for the work described in [29]. The variety of the datasets provides the opportunity to carry out broad research.
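The proposed cross-dataset validation can be sketched as follows. Here load_corpus() is a hypothetical helper returning the texts and labels of one extract (the paper's loading code is not shown), and the dictionary size of 3000 features matches one of the tested configurations.

```python
# A minimal sketch of cross-dataset validation: train on one Enron extract,
# test on another. load_corpus() is a hypothetical helper, not the paper's code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

train_texts, y_train = load_corpus("enron1")  # hypothetical loader
test_texts, y_test = load_corpus("enron4")    # a different extract for testing

vectorizer = CountVectorizer(max_features=3000)   # dictionary of 3000 features
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)         # reuse the training dictionary

clf = MultinomialNB().fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```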

**Table 4.** Dataset 1 structure.


**Table 5.** Dataset 2 structure.



**Table 6.** Dataset 3 structure.

**Table 7.** Dataset 4 structure.


#### *4.2. Text-Preprocessing Impact on the Dictionary*

Although the purpose of using the basic text-preprocessing methods (tokenization, etc.; see Section 3.3) is straightforward and easy to explain, things become complicated regarding stemming and lemmatization. This section shows the differences in the ten most common words in the dictionary when neither of the two methods is applied and when stemming or lemmatization is implemented. The test was repeated for each dataset and the results are shown in Tables 8–11.
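The effect can be illustrated with a minimal sketch. NLTK's PorterStemmer and WordNetLemmatizer are assumed here (the paper's own preprocessing is described in Section 3.3; the library choice is ours), and the token list is illustrative.

```python
# A minimal sketch of counting the most common dictionary entries with and
# without normalization; requires nltk.download("wordnet") beforehand.
from collections import Counter
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = ["deal", "deals", "schedule", "schedules",
          "scheduling", "scheduled", "volume"]

stem = PorterStemmer().stem
lemmatize = WordNetLemmatizer().lemmatize

print(Counter(tokens).most_common(3))                  # raw forms counted separately
print(Counter(map(stem, tokens)).most_common(3))       # "schedul" merges four forms
print(Counter(map(lemmatize, tokens)).most_common(3))  # noun lemmas by default
```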


**Table 8.** Ten most common features for dataset 1.

For dataset 1, both stemming and lemmatization caused the number of occurrences of the word **deal** to increase by almost 70. Moreover, the words **need** and **volum(e)** appeared in the table, pushing the words **height** and **width** out (Table 8). For dataset 2, implementing either stemming or lemmatization increased the number of occurrences of the word **deal** by almost 700. Furthermore, the word **volum(e)** appeared in the top 10, pushing the word **forwarded** out (Table 9). Table 10 presents the results for dataset 3, which is the biggest one (it includes 16,675 e-mails). Adding the function responsible for stemming or lemmatization changed the number of occurrences of the word **deal**, which increased by approximately 800. When neither method was present in the program, the word **statements** was the last one in the top-10 list. Once either method (stemming or lemmatization) was defined, the word **schedul(e)** emerged with a significant number of occurrences (3852 for stemming and 2591 for lemmatization). For dataset 4, the differences were less visible (Table 11). What stands out is the increased number of occurrences of the word **order**, which grew by almost 100 after implementing each of the two methods. With stemming, **linguist** appears in the top 10, pushing out the word **free**.


**Table 9.** Ten most common features for dataset 2.

**Table 10.** Ten most common features for dataset 3.


Above, significant differences were shown for only the ten most common words. If we consider all 200, 1500, or 3000 words, there is even more dissimilarity in the numbers of word occurrences, which sometimes decides whether a word is included in the dictionary or not. All designed models (*k*-NN, SVM, and NB) take e-mails as input. The e-mails are represented as vectors whose elements are the word counts, based on the content of the dictionary. Let us assume that **schedul(e)** is a word strongly indicating that an e-mail is not spam. For dataset 3, when the function responsible for building the dictionary does not apply stemming or lemmatization, **schedul(e)** is not included in the small dictionary of ten features (Table 10) and, because of that, it would not be taken as a valid portion of information by the model. This arises from the fact that the word takes many forms, such as "schedule", "schedules", "scheduling", and "scheduled", which are all counted as separate words. Using stemming or lemmatization may prevent such situations.


**Table 11.** Ten most common features for dataset 4.

#### *4.3. Spam Detection*

Here, we discuss the results related to the five substages of our meta-algorithm. The substages are introduced in Table 12.



The exact results related to various substages are summarized in Appendix A given at the end of the paper. Here, we give only the main findings. Based on the Substage 1 results, the following facts may be observed:

• For all tests, the maximum accuracies were achieved by the MNB classifier.

• For each classifier, its maximum accuracy was obtained when stemming was implemented.



The results based on confusion matrices are presented in Table 13. The MNB classifier provides the highest probability that an e-mail classified as ham is actually a desired message (PPV = 0.887), while NuSVC performs best at predicting whether a spam e-mail is in fact spam (NPV = 0.986).


**Table 13.** Evaluation of the chosen classifiers' performance in Substage 1.

Substage 2 aimed to find the parameters (*k* and the number of features in the dictionary) for which *k*-NN classifies the e-mails most efficiently. Because of *k*-NN's computational complexity, dataset 1 (the smallest one) was chosen for the experiment. The three highest accuracy values were obtained for the following parameters:


The results of Substage 2 are interpreted with the help of graphs. Figure 3 shows the maximum accuracy obtained for each *k* across all Substage 2 results. The maximum is obtained for *k* = 11. For values of *k* bigger than 11, the accuracy rapidly declines. This is because the greater *k* is, the simpler the classifier becomes. Finally, if *k* is too big, most of the test points are assigned to the same (prevailing) class.

**Figure 3.** *k*-NN accuracy vs. *k*.
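For reference, the underlying sweep can be sketched as a simple grid over *k* and the dictionary size. Here build_vectors() is a hypothetical helper returning a fixed train/test split for a given dictionary size, and the grid of odd *k* values is an assumption; only the dictionary sizes 200, 1500, and 3000 are taken from the paper.

```python
# A minimal sketch of the Substage 2 parameter sweep; build_vectors() is a
# hypothetical helper and the k grid is an assumption.
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

results = {}
for dict_size in (200, 1500, 3000):
    X_train, y_train, X_test, y_test = build_vectors(dict_size)  # hypothetical
    for k in range(1, 26, 2):
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        results[(dict_size, k)] = accuracy_score(y_test, clf.predict(X_test))

best = max(results, key=results.get)
print(best, results[best])  # best (dictionary size, k) pair and its accuracy
```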

Figure 4 presents the average test accuracy for each dictionary size. The higher the data dimensionality, the worse *k*-NN's accuracy. The difference between *k*-NN with dict = 200 and with dict = 1500 or dict = 3000 is significant (≈0.2). To show the tendency, a power trend line was added to the graph; as we can see, the accuracy tends to change in a similar way. Interestingly, the power trend line resembles an exponential curve, the only difference being that the arc of the former is more symmetrical [47]. Hence, it may be concluded that in this case the accuracy changes approximately exponentially.

**Figure 4.** Average test accuracy vs. dictionary length.

The three *k*-NN models with the highest accuracy were chosen for testing in Substage 3. Table 14 presents the five indicator values, which allows a more thorough evaluation.

**Table 14.** Evaluation of the chosen classifiers' performance in Substage 2.


Substage 3 consisted of ten tests. The first six were chosen as the top results of Substage 1 and Substage 2. The remaining four were included because of their promising performance in the previous experiments. The top accuracy values were obtained for the following parameters (these four models were designated for testing in Substages 4 and 5):


• *k*-NN, *k* = 11: accuracy = 0.828, dict = 200, lemmatization.

Table 15 includes the quality metrics related to the four models that will be tested in Substages 4 and 5. When compared to MNB, NuSVC and *k*-NN have lower accuracy, sensitivity, and NPV. However, both obtained better specificity and PPV parameters. On the other hand, MNB was better at predicting the negative class.



In Substage 4, once again, MNB models achieved the highest values of accuracy: 0.919 and 0.909. Surprisingly, *k*-NN with *k* = 11 performed slightly better than NuSVC. Except for NuSVC, all models obtained higher accuracy than in Substage 3.

A collection of values that facilitate assessing the performance of the classifiers in Substage 4 is presented in Table 16. A very low specificity was noted for NuSVC. The model made a considerable mistake by classifying 933 ham e-mails as spam. The number was approximately two times higher than in the case of the other classifiers.


**Table 16.** Evaluation of the classifiers' performance in Substage 4.

Substage 5 aimed at testing the classifiers that had performed best in Substage 3, but on a dataset not related to Enron. The sizes of dataset 1 and the one used in this substage (dataset 4, extracted from the Lingspam corpus) were the same, which is why the accuracies are compared with those obtained in Substage 1. The training set consisted of 702 e-mails; the test set contained 260 messages. In both cases, MNB achieved the highest accuracy when classifying the messages. For MNB (1), there were 3000 features in the dictionary; for MNB (2), 1500. In each case, lemmatization was added to the program. *k*-NN fared the worst, much worse than in Substage 1, where its accuracy was 0.915 for the same parameters. This may be a result of the source dataset content (Enron vs. Lingspam). NuSVC improved its accuracy by 0.157.

Table 17 summarizes the metrics for the four models tested in Substage 5, together with the results obtained by G. Piatetsky-Shapiro and M. Mayo in a similar experiment on the same dataset [29]. The probability that *k*-NN classified a harmful message as spam is only 0.608, the lowest value among all results. This fact has a direct impact on the accuracy of *k*-NN, which was the lowest in this substage. Both MNB models obtained specificity and PPV equal to 1, meaning that not a single non-spam e-mail was misclassified as spam. Moreover, the total number of misclassified e-mails was only 10 (spam classified as ham). In Substage 5, for dataset 4, the MNB classifier turned out to be nearly perfect. The results are slightly better than those achieved by G. Piatetsky-Shapiro and M. Mayo [29], possibly because of the more complex text-preprocessing methods that were implemented.


**Table 17.** Evaluation of the classifiers' performance in Substage 5.

#### *4.4. Method Validation and Discussion of Results*

First, we note that text preprocessing has a significant impact on the behavior of the classifiers. There is no doubt that it is always beneficial to apply the basic methods, such as conversion to lowercase (or uppercase, as the effect is the same), removal of stop words, digits, and punctuation marks, and the other techniques described in Figure 2. Implementing the advanced text-preprocessing methods (stemming or lemmatization) allows higher classification accuracy to be achieved.

Second, the selected size of the dictionary (the number of features) matters. For the support vector machines and naïve Bayes classification, the results were better when the number of features was larger. On the contrary, *k*-NN's accuracy tends to decrease rapidly for higher data dimensionality; *k*-NN obtained its highest accuracy for the smallest dictionary size and performs well when the data dimensionality is low. Its efficiency is also highly dependent on the *k* parameter. It might be assumed that if *k*-NN achieves its maximum accuracy for a given *k*max, the performance will drop sharply for *k* > *k*max. Testing the support vector classification methods showed that *LinearSVC* is relatively efficient when the dataset is small, whereas for large datasets *NuSVC* classification is more accurate.

Third, among all designed classifiers, MNB turned out to be the leader. In the relevant stages, the maximum accuracy across all results was obtained by MNB. Naïve Bayes classification is efficient in all cases and returns the best outcomes when the dictionary consists of many features and the lemmatization technique is included in the application.

Fourth, the classifiers that achieved the best results when tested on the extracts from the Enron Corpus classified the e-mails even more accurately on the dataset extracted from the Lingspam corpus. This indicates that the content (words) and the structure of the data directly impact model performance.

Fifth, the most important aspect of validating our work concerns the quality of the obtained results. One of the key points of our proposal is summarized in Table 18, which shows significant progress in comparison with the results reported in the referenced literature (the highest values are marked in red). One can see that especially the specificity provided by our approach is attractive. This is important in the case of unbalanced datasets and in applications related to anomaly detection (to which spam detection also belongs).


**Table 18.** Comparison of the validation results with various performance metrics.
