To assess the effectiveness of the proposed approach in detecting depression, we designed an experimental study. The experiment is also intended to reveal which techniques perform better in detecting the intensity of depressive signs across different textual corpora. To this end, we frame the problem as a multiclass classification task in which the goal is to detect the intensity of depressive signs in a text. The depression detection process operates at the post level, where posts are classified according to the severity of depressive signs.
4.2. Results
To begin with, we evaluate the lexicon-based model to explore which specific features lead to better performance. An initial experiment was conducted using all the features of each set. Features exhibiting a strong correlation (greater than 0.95) with other features were removed, and all values were scaled using Scikit-learn's MinMaxScaler. The results of this experiment are shown in Table 2.
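This preprocessing step can be sketched in a few lines; the following is an illustrative reproduction rather than the authors' exact code, with only the 0.95 correlation threshold taken from the text:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def drop_correlated_and_scale(features: pd.DataFrame, threshold: float = 0.95):
    """Remove features strongly correlated with an earlier feature, then scale all values to [0, 1]."""
    corr = features.corr().abs()
    # Inspect only the upper triangle so each feature pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    reduced = features.drop(columns=to_drop)
    scaled = MinMaxScaler().fit_transform(reduced)
    return pd.DataFrame(scaled, columns=reduced.columns), to_drop
```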
We can see that the best performance for both datasets was achieved using the combination of all feature sets, resulting in F1-scores of 71.12% and 60.80% for the DsD and DepSign datasets, respectively. This indicates that incorporating diverse lexicon-based features allows the model to capture various aspects of depression-related linguistic patterns, leading to more informed and accurate depression detection.
However, it is reasonable to assume that not all extracted features are equally valuable for depression detection and that some may even harm classification accuracy. A feature selection process was therefore conducted to investigate and improve our model: we used recursive feature elimination (RFE) with cross-validation to identify the most relevant features across all feature sets. The results improved slightly with this technique, reaching F1-scores of 71.89% and 61.29% for each dataset. This performance improvement is accompanied by a substantial reduction in feature dimensionality: of the 376 features in the proposed framework, only 61 and 114 were selected for the DsD and DepSign datasets, respectively. The selection of a larger number of features in the DepSign dataset is consistent with the conclusions drawn from the information gain analysis described in Section 3.2, which showed a larger number of features with high information gain in this dataset.
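A minimal sketch of this feature selection step is shown below, assuming a logistic regression estimator, 5-fold cross-validation, and weighted F1 as the selection criterion; none of these choices are specified in the text and they are used here only for illustration:

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def select_features(X, y):
    """Recursive feature elimination with cross-validation over the scaled lexicon-based features."""
    selector = RFECV(
        estimator=LogisticRegression(max_iter=1000),  # assumed estimator
        step=1,                                       # drop one feature per iteration
        cv=StratifiedKFold(n_splits=5),
        scoring="f1_weighted",                        # assumed scoring function
    )
    selector.fit(X, y)
    return selector.support_, selector.transform(X)
```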
Upon examining the performance of the different feature types, we can observe that affective and syntactic features yield the best results (RQ2). In contrast, social features exhibited the lowest performance in both datasets. Further information on this matter is depicted in Figure 4, which illustrates the percentage of features selected from each category (Figure 4a) and the distribution of the feature sets among the selected features (Figure 4b) for each dataset.
We can see that, for both datasets, affective features are consistently among the top-selected categories. Figure 4a shows that approximately 22% to 24% of the features in this set are selected for each dataset, while Figure 4b shows that it is the most represented set among the selected features in the DsD dataset and the second most represented in the DepSign dataset. This suggests that emotional expressions and sentiment-related content are crucial for detecting depression. Furthermore, syntactic features are consistently represented in both datasets. In contrast, social features are less represented among the selected features, although it is worth noting that this is the smallest set.
After obtaining the results for the lexicon-based model, our focus shifts to evaluating the distributional representation model. To conduct this evaluation, we experimented with various word embedding techniques, including FastText and transformer-based embeddings. For the latter, we specifically used two pre-trained models through the SentenceTransformers Python framework [68]: all-mpnet-base-v2 and all-distilroberta-v1 [71]. This framework uses siamese and triplet network architectures to generate an embedding for each sentence.
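Obtaining these sentence embeddings requires only a few calls to the SentenceTransformers API; the batch size below is an arbitrary choice, not a value reported in the paper:

```python
from sentence_transformers import SentenceTransformer

def embed_posts(posts, model_name="all-distilroberta-v1"):
    """Encode each post into a fixed-size sentence embedding (768 dimensions for both models compared)."""
    model = SentenceTransformer(model_name)
    return model.encode(posts, batch_size=32, show_progress_bar=True)

# Example: embeddings for the two pre-trained models compared in the text.
# mpnet_vectors = embed_posts(posts, "all-mpnet-base-v2")
# roberta_vectors = embed_posts(posts, "all-distilroberta-v1")
```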
According to the results obtained, shown in Table 3, the distributional representation model performs slightly better than the lexicon-based model. Regarding the comparison of word embedding techniques, we can observe that transformer-based embeddings yield better performance. Specifically, the best results are obtained with all-mpnet-base-v2 embeddings in the DsD dataset and all-distilroberta-v1 in the DepSign dataset. As described in Section 2, this can be explained by the fact that transformer-based embeddings capture intricate relationships between words, contextual information, and semantic meanings, resulting in more comprehensive and contextually rich representations of textual data.
After separately analyzing the results obtained from the distributional representation model and the lexicon-based model, our focus shifted to examining their combination. To conduct this experiment, we selected the features that demonstrated the most promising performance for each dataset and integrated them with word embeddings. The better results obtained with context-aware embeddings motivated us to adopt them as the word embedding technique in the combined model. Specifically, we selected all-distilroberta-v1, as it obtained a higher result than all-mpnet-base-v2 in the DepSign dataset, performs very similarly in DsD, and has a smaller size, which reduces its computational expense. Thus, we constructed the combined model by joining the best-performing lexicon-based features with all-distilroberta-v1 embeddings.
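A sketch of how the combined representation can be assembled is shown below; it assumes the selected lexicon-based features are already scaled, and the linear SVM is merely a placeholder classifier, since the text does not specify the learning algorithm:

```python
import numpy as np
from sklearn.svm import LinearSVC

def build_combined_inputs(selected_lexicon_features, sentence_embeddings):
    """Concatenate the RFE-selected lexicon-based features with the all-distilroberta-v1 embeddings."""
    return np.hstack([selected_lexicon_features, sentence_embeddings])

# Hypothetical usage with pre-computed arrays:
# X_train = build_combined_inputs(lex_train, emb_train)
# clf = LinearSVC().fit(X_train, y_train)
```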
In addition, we aim to compare the results obtained with the methods proposed in this paper against those obtained using transformer-based models. For this purpose, we utilized two widely used pre-trained transformer models: BERT and MentalBERT. BERT is a powerful language representation model that has proven effective in numerous NLP applications. MentalBERT is a specialized adaptation of BERT trained on mental health-related text; it incorporates domain-specific knowledge and context, making it particularly suitable for detecting depression in textual data.
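These baselines can be fine-tuned for post-level classification with the Hugging Face transformers library; the checkpoint name, number of labels, and training hyperparameters below are illustrative assumptions and not the configuration used in the cited works [20,32]:

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Assumed checkpoint names; "bert-base-uncased" would be used for the BERT baseline.
checkpoint = "mental/mental-bert-base-uncased"
num_labels = 4  # set to the number of severity classes in the target dataset (assumed)

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

# train_ds and eval_ds are assumed to be Hugging Face Datasets with "text" and "label" columns.
# trainer = Trainer(
#     model=model,
#     args=TrainingArguments(output_dir="out", num_train_epochs=3,
#                            per_device_train_batch_size=16),
#     train_dataset=train_ds.map(tokenize, batched=True),
#     eval_dataset=eval_ds.map(tokenize, batched=True),
# )
# trainer.train()
```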
Table 4 presents the best results achieved with the three proposed models, namely the lexicon-based, distributional representation, and combined models, along with the results obtained with BERT and MentalBERT [20,32].
The results show that the combined model achieved the highest F1-score in both datasets: 73.35% for DsD and 63.81% for DepSign. These results suggest that combining the linguistic cues and emotional expressions captured by lexicon-based features with the contextual information and semantic relationships captured by word embeddings leads to improved performance and more informative depression detection models. Furthermore, such results confirm that the proposed approach for detecting depression severity signs in textual data performs strongly across different datasets (RQ1).
To delve deeper into the effectiveness and influence of the proposed models, we performed a Friedman statistical test [70]. This test was employed to determine whether there are significant differences between classification methods based on the sample results. The Friedman test ranks the methods according to their performance on different datasets: a lower rank indicates that a specific method outperforms the others, demonstrating its superior effectiveness in the classification task. The Friedman test evaluates the average ranks of the methods, denoted as

$$R_j = \frac{1}{n}\sum_{i=1}^{n} r_i^j,$$

where $r_i^j$ represents the rank of the $j$th algorithm on the $i$th dataset, $k$ is the number of methods, and $n$ is the number of datasets. It tests the null hypothesis that all algorithms are equivalent and therefore have equal average ranks. The Friedman statistic under this null hypothesis, with $k-1$ degrees of freedom, is given by

$$\chi_F^2 = \frac{12n}{k(k+1)}\left[\sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\right].$$
Additionally, Iman and Davenport [72] proposed a more robust statistic based on the F-distribution. This statistic, denoted as $F_F$, is calculated with $k-1$ and $(k-1)(n-1)$ degrees of freedom:

$$F_F = \frac{(n-1)\,\chi_F^2}{n(k-1) - \chi_F^2}.$$
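As a concrete illustration, both statistics can be computed directly from a matrix of per-dataset scores by implementing the formulas above; this is a generic sketch, not the authors' evaluation script:

```python
import numpy as np
from scipy.stats import rankdata

def friedman_iman_davenport(scores):
    """scores: (n datasets) x (k methods) matrix of performance values, higher is better."""
    n, k = scores.shape
    # Rank methods within each dataset (rank 1 = best), averaging ties.
    ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, scores)
    avg_ranks = ranks.mean(axis=0)                                        # R_j
    chi2_f = 12 * n / (k * (k + 1)) * (np.sum(avg_ranks ** 2) - k * (k + 1) ** 2 / 4)
    f_f = (n - 1) * chi2_f / (n * (k - 1) - chi2_f)                       # Iman-Davenport statistic
    return avg_ranks, chi2_f, f_f
```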
The Friedman statistical test was performed with an $\alpha$ value of 0.1, $k$ equal to the number of methods analyzed, and $n$ equal to the number of datasets. Since the computed $F_F$ statistic exceeded the critical value of the F-distribution, the null hypothesis of the Friedman test was rejected, indicating the statistical significance of the results. To simplify the presentation of the best-performing methods based on their ranks from the Friedman test,
Table 5 displays the best five approaches.
The results of Friedman’s test support the effectiveness of combining lexicon-based features with distributional representations as the most successful approach (RQ3). Additionally, the results show that the proposed method outperforms more complex transformer-based models, which often require a higher computational overhead.
After identifying the best-performing model, we conducted additional experiments to gain deeper insights into the performance of the approach. Specifically, our goal was to understand how the class imbalance within the dataset influences the effectiveness of our system. To achieve this, we employed random undersampling techniques [
73] on both datasets, creating a more balanced representation of classes by randomly removing instances from the majority classes. This strategic sampling approach helps address the issue of imbalanced data distribution, ensuring that the model does not become overly biased toward predicting the majority class.
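Random undersampling can be implemented with a short helper (or with imbalanced-learn's RandomUnderSampler); the sketch below downsamples every class to the size of the smallest one, which is an assumption about the exact balancing strategy used:

```python
import numpy as np

def random_undersample(X, y, seed=42):
    """Randomly drop instances from the majority classes so every class matches the smallest class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], size=target, replace=False) for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]
```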
The confusion matrices obtained with the combined model for each dataset, both with and without undersampling, are presented in Figure 5. On close examination of these matrices, some notable insights emerge. Without undersampling, the class imbalance significantly affects the model's predictions: it tends to favor the majority class, which results in higher accuracy for that class but considerably lower accuracy for the minority classes. When random undersampling is applied, however, we observe a shift in the model's behavior: it becomes less biased toward the majority class and shows improved performance in recognizing minority classes.
Additionally, Table 6 presents a comprehensive analysis of model performance under the two sampling strategies (with and without undersampling). It reports F-scores, precision, and recall, along with their standard deviations. It can be seen that undersampling leads to a decrease in average performance, which can be attributed to the loss of valuable information caused by the reduction in dataset size. This decrease is larger in the DsD dataset, where the information loss is greater given the limited number of samples in its minority class. However, the standard deviation of the per-class performances also decreases, suggesting more consistent behavior across classes. These findings emphasize the importance of addressing class imbalance when working with the considered datasets.
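The per-class figures summarized in Table 6 can be reproduced with standard scikit-learn metrics; this is a generic sketch rather than the authors' evaluation code:

```python
from sklearn.metrics import precision_recall_fscore_support

def per_class_summary(y_true, y_pred):
    """Per-class precision, recall, and F1, plus their mean and standard deviation across classes."""
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, zero_division=0)
    stats = {"precision": precision, "recall": recall, "f1": f1}
    return {name: (vals.mean(), vals.std()) for name, vals in stats.items()}, stats
```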
To conclude, we compare the results of our proposed approach with the best-performing methods found in the literature for each dataset.
Table 7 shows the outcomes of this comparison. Regarding the DsD dataset, one of the most prominent approaches is the one proposed by Ilias et al. [
20]. In this work, the authors inject extra-linguistic information into transformer-based models and employ a multimodal adaptation gate to create the combined embeddings, which are then fed to a MentalBERT model. Applying label smoothing, they achieve a maximum weighted F1-score of 73.16%, slightly lower than the result obtained by our approach. For the DepSign dataset, Poswiata et al. [32] proposed an ensemble of fine-tuned transformer-based models, achieving a macro F1-score of 63.70%. For comparison, we computed the macro-averaged F1-score of our approach using the splits described in their article. The results reveal that our method does not surpass the performance of the transformer ensemble. A closer examination of these results, in conjunction with the class distribution of the data and Figure 5, suggests that our approach is more sensitive to the class imbalance of the dataset. However, a compelling advantage of our method lies in its simplicity, which makes it less susceptible to reduced data availability. Thus, there is potential to enhance the performance of our approach by employing specialized techniques to address class imbalance, even if they result in data reduction.
Furthermore, in Table 7 we have incorporated two additional metrics, precision and recall, to provide a more comprehensive evaluation of the correctness and sensitivity of our approach. A key observation is that the proposed method obtains strong results for both metrics, implying that our model correctly identifies the intensity of depression signs while maintaining good sensitivity. In addition, the proposed method surpasses or remains competitive with existing state-of-the-art methods. For the DepSign dataset, the transformer-based ensemble model achieves higher precision than ours, although the recall values of the two methods are comparable. This suggests that our approach identifies depression signs slightly less accurately when they are present but maintains a similar level of sensitivity in detecting them. However, our approach offers advantages in simplicity, interpretability, and reduced computational overhead. These results reinforce the potential of traditional machine learning techniques for effectively detecting the intensity of depression signs in textual data.