1. Introduction
Nowadays, the detection of depressive states is made through psychiatric examinations requiring the administration of psychiatric tests (such as the well-known Beck Depression Inventory (BDI) [1]) and semi-structured interviews (such as the Structured Clinical Interview-II (SCID-II) (https://www.appi.org/products/structured-clinical-interview-for-dsm-5-scid-5, last accessed on 31 August 2021) [2]). These procedures are extremely time-consuming, require a high level of expertise to reliably interpret the interviews' outputs, and can be affected by clinicians' theoretical orientations and an overestimation of patients' progress. In recent years, however, ever-growing interest has turned toward the possibility of detecting depression through automatic methods, making use of the interaction between human beings and computer-aided devices via a suitable definition of measurable behavioural depression features [3,4,5,6,7,8,9,10]. Some authors analyse speaking rates and silences in read and spontaneous speech (see [3,5,8,9,10] for further details), whereas others consider handwriting, drawing, and other behavioural features (see [4,6,8], also for more complete references). Nevertheless, speech remains the easiest source of behavioural data to collect. Thus, identifying additional speech features that can more accurately discriminate between typical and depressed subjects while requiring low computational costs is a goal worth pursuing. Notwithstanding the promise of Artificial Intelligence (AI) and Machine Learning (ML), technologies exploiting behavioural features can be used unethically, enabling discriminatory usages and compromising the privacy of citizens. To protect citizens from these threats, the law on data protection and privacy in the European Union and the European Economic Area (the General Data Protection Regulation (GDPR), https://ec.europa.eu/info/law/law-topic/data-protection_en (accessed on 31 August 2021)) requires that the AI and ML methods and architectures developed by the scientific and industrial communities be endowed with advanced privacy-preserving procedures.
Hence, the main objective of this research is to provide a theoretical and practical framework for the detection of depressive signs based on two main pillars, privacy protection and accurate performance, to be executed on mobile devices. Accordingly, this paper proposes three additional markers (described in Section 5), whose effectiveness is demonstrated here.
The main differences between this paper and other approaches are: (1) the use of a lightweight approach, based on the careful choice of speech features rather than on ML techniques; (2) the definition of an architecture for the continuous improvement of the classification method. Both the features and the architecture are privacy-oriented, due to their capability of providing new data and computing on them without exposing personal information.
This research work is framed within the AutoNomous DiscoveRy Of depressIve Disorder Signs (ANDROIDS) project, whose objective is to investigate AI-based tools and methodologies for the early multimodal detection of depressive signs.
The paper is structured as follows: Section 2 frames the paper within the technological and societal concern of privacy preservation in AI. Section 3 reports background information on the ANDROIDS project and briefly reviews the state of the art on the topic of the paper. Section 4 presents the approach followed in this paper, as well as a hardware/software architecture supporting it. Section 5 gives details on the data sampling and the detection of the used features. Section 6 describes the results of the data analysis task and the classification model. Section 7 discusses the results, and Section 8 concludes the paper and outlines future research tasks.
2. Motivation
Cloud-centric architectures and Deep Neural Network (DNN) applications have raised the necessity to move from the massive exploitation of acquired data to more sustainable computational paradigms, able to improve both performance and privacy in accordance with legislative regulations such as the European Commission's GDPR [11] and the Consumer Privacy Bill of Rights in the US [12].
Some privacy issues are solved by the introduction of blockchain technologies, since they can regulate data exchange during transactions among multiple stakeholders. However, blockchain introduces a non-negligible computational overhead in actual applications. As an example, in [13,14], a blockchain-based service infrastructure is proposed that addresses the European GDPR and privacy-related issues in the Internet of Vehicles (IoV) context.
On the other hand, DNNs perform better when fed with a large amount of data, which raises privacy issues: the resulting trade-off between performance and privacy prevents DNNs from scaling. Moreover, users' privacy concerns and their reluctance to share personal data (e.g., recordings of driving behaviours) represent obstacles to using these valuable data.
Mobile Edge Computing (MEC) [15] has provided effective mitigation of scalability and performance degradation by implementing collaborative schemas for the cooperative distributed training of ML models. Nonetheless, data privacy is still an issue in MEC systems, since processing at edge servers involves the transmission of potentially sensitive personal data. To date, privacy-sensitive users are still discouraged from exposing their data and contributing to model training, since they fear being monitored by servers or violating increasingly stringent privacy laws.
3. Background and Related Works
This section is devoted to framing the present work within the ANDROIDS project (Section 3.1) and then within the international scientific panorama (Section 3.2).
3.1. The ANDROIDS Project
ANDROIDS investigates the features of human interactions in order to model cognitive and emotional processes. The aim is to study the fundamentals and the means to develop a Clinical Decision Support System (CDSS) in the field of depression detection and care. The pillars of the project are: (1) combining diverse sources of information (e.g., audio, handwriting, video); (2) defining detection methods and supporting software architectures able to guarantee the privacy of the patients in sharing their data with computer-based tools; (3) implementing multi-sensor data fusion [16]; (4) pursuing detector performance adequate to realize scalable tools. The final aim is to train ML algorithms to generate a tool that can support physicians and psychologists in the detection of signs of depressive disorders and in deepening further investigations.
The presented work focuses on the second and third pillars. Starting from recordings of both clinically diagnosed depressed patients and non-depressed people reading a pre-defined tale, this research concentrates on analysing non-verbal features of non-spontaneous speech.
3.2. Related Works
There are several papers in the scientific literature focused on the prediction of depressive signs. The main approaches are summarized according to the nature of the used data:
Speech: para-verbal features (e.g., speed, silences, pauses) [3,5,8,9,10], as well as non-verbal features [17,18,19,20] in read and spontaneous speech;
Handwriting and drawing [4,6,8], mainly focusing on the shape of the drawn lines;
Video analysis: facial expressions [21], eye movements [22];
Content of written and spoken words [23,24,25,26];
Multimodality: more than one source of data is used and combined to improve detection performance [32,33].
Focusing on speech, it has traditionally been used to detect physical injuries and diseases. As an example, the Vox4Health project investigated the definition and design of an m-health system able to detect physical throat issues [34,35]. In [36], a smartphone records the voice signal of a client and sends it to a cloud server, where the signal is analyzed with DNNs. In [37], a wrapper-based feature selection method is used to combine single features ranked by the Fisher discrimination ratio, and voice pathology detection experiments were carried out using Random Forest.
While the detection of some of these physical injuries is based on protocols accepted by the medical scientific community [38], the study of psychological pathological and pre-pathological states has not yet produced such protocols. Hence, the final selection of which features should be included in automatic classification tasks is still an open research topic.
Some research papers frame the privacy concern arising when speech is analyzed for clinical diagnosis [39].
Other highly cited papers dealing with the problem of privacy in smart healthcare are [40,41,42]. In particular, ref. [40] focuses on the usage of mobile phones participating in distributed healthcare data collection and analysis systems. Another important work in the scientific literature is [43], where the authors also surveyed the main approaches and architectures to cope with this problem. The approach followed in [43] is based on data anonymization (i.e., the subjects' features do not allow for their identification). The GDPR describes two different ways to identify individuals:
Directly, from their name, address, postcode, telephone number, photograph or image, or some other unique personal characteristic;
Indirectly, from certain information linked together with other sources of information, including their place of work, job title, salary, their postcode, or even the fact that they have a particular diagnosis or condition.
The combination of Fully Homomorphic Encryption (FHE) and Neural Networks (NNs) is becoming a research trend attracting growing interest in the scientific and industrial community. Some research papers dealing with this promising and not yet widely researched topic are: [44], which focuses on the application of FHE to speech analysis; [45], which proposes a test bed for evaluating the effectiveness of FHE on specific algorithms and computing functions; and [46], where the authors assess different de-identification methods in speech emotion analysis.
All things considered, many of the presented approaches suffer from performance issues. To implement a system able to run on users' mobile terminals, some authors propose the usage of typing pattern analysis [47]. Other speech-based analysis methods, which involve the automatic transcription of speech using cloud-based resources, present a possible loss of privacy [48].
4. Methodology and Supporting Architecture
This section is devoted to describing the approach and presenting a supporting hardware and software architecture.
Figure 1 depicts the overall architecture supporting the proposed approach.
Under the privacy aspect, there are three actors: the final user, the privacy authority, and the medical structure. In the first domain, patients use the detection tool (i.e., the early detector) that runs on their computing device. The application can record their voice and apply the criteria reported in Section 6 on the device itself. Patients can receive feedback from the application and, in case of the detection of depressive signs, they may contact their personal doctor in order to obtain a deeper analysis. In our vision, the feedback should be given by a message appearing on the screen or by sending a secure link via email.
After giving informed consent, the patient can contact a privacy authority that collects voice samples, anonymizes them, and stores them in a repository (patients participate in this study on a voluntary basis, in accordance with the GDPR).
This repository can be queried by different medical structures in order to create the detection model, using an Audio Processing Tool (APT) and the learning tool. The detection model contains the ranges of the H, D, and M areas, which respectively represent healthy subjects, depressed subjects, and an overlapping zone of uncertainty.
The updated detection model can then be sent to the final users, who use it to classify themselves by applying the method described in Section 5. The final users can also considerably improve the detection mechanism by sharing speech data. This cyclic mechanism guarantees the capability of improving the detection model on the basis of fresh data continuously added to the repository.
Figure 2 provides an overview of the method on which the detection model is based.
The speech of a set of subjects, both healthy and affected by depression, is acquired and split to increase the number of considered samples. Marker values are extracted from these samples to define the regions of values related to the different classifications. Finally, an a posteriori analysis is used to quantify the uncertainty of classification for each region in order to estimate its accuracy.
This method is also applicable to data acquired to refine the detection model, as described in this section. In fact, when new patients are added to the dataset, they are examined by medical personnel who certify their clinical state.
In order to guarantee that the two objectives of this work, privacy and performance, are reached, the feature definition phase is critical. In fact, the markers used for the classification should be as few and as easy to compute as possible in order to improve performance (especially on mobile devices). On the other hand, a preprocessing stage that hinders the indirect identification of users becomes necessary.
Section 5 presents the details of how the detection model is computed.
It is important to underline that privacy is preserved in this architecture, since sensitive data are not sent to medical structures, which may be run by private parties and regulated by different data privacy acts. Instead, data are only sent to the privacy authority, which could be a public organization and could expose its data-handling services according to a trust chain supported by adherence to international privacy regulations and standards.
Furthermore, the markers on which the algorithm is based are required not to reveal the identity of patients. As a last observation, it is important to highlight that the reports produced by the automated recognition software (based on the detection model) are presented to the final users, suggesting that they consult a medical professional in the case of suspected depression. In this case too, privacy is preserved.
5. Feature Selection
To identify features able to characterize the speech of depressed and typical subjects (at least in sufficiently standardized conditions), a database consisting of 11 depressed subjects and 11 healthy subjects was used. The 11 Italian depressed patients were recruited with the help of psychiatrists at the Department of Mental Health and the General Hospital in Caserta, the Institute for Mental Health and the General Hospital in Santa Maria Capua Vetere, the Centre for Psychological Listening in Aversa (Italy), and in a private psychiatric office. They were diagnosed as depressed and were under treatment (treatments differ and are personalized on the basis of specific medical and other needs).
Both the control and the depressed participants were administered the Beck Depressive Inventory Second Edition (BDI-II). The data collection was conducted under the approval of the ethical committee of the Department of Psychology, Università della Campania “Luigi Vanvitelli”, code n. 25/2017.
Although the size of the dataset appears too small to lead to sufficiently general conclusions, it was chosen for two main reasons: from a purely methodological viewpoint, before widening the scope of our study, a pilot field of investigation was needed as a guide toward a plausible hypothesis about the features that could be considered relevant to discriminating the speech of depressed people from that of healthy people; from a practical viewpoint, the selected items were the ones for whom the information (with special regard to their BDI classes) was most complete. On the other hand, the two samples are identically distributed with respect to gender (three men and eight women in each group), and at least similarly distributed with respect to age.
Both healthy people and depressed patients were subjected to two experiments. The first experiment, called "Diary", was essentially an interview about their daily life in the last week; the other, called "Tale", was the reading of a short tale by Aesop, The North Wind and the Sun. However, only the sound files of the readings (i.e., the "Tale" experiment) were taken into consideration, as they allowed for standard speech conditions, not affected by momentary moods depending on possible experiences lived by the subjects.
Data were extracted from the files using the well-known Praat tool [49]. Each file was split into time intervals of 2 s and, for each interval, the average pitch and the average intensity of the voice were computed.
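As an illustration, the binning step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: it presumes that the pitch track (frame timestamps and f0 values, with 0.0 marking unvoiced frames, as in Praat's output) has already been extracted; the helper name `interval_average_pitch` and the handling of empty bins are our own choices, not part of the original pipeline.

```python
from math import floor

def interval_average_pitch(times, f0, width=2.0):
    """Average a sampled pitch track into fixed-width time bins.

    `times` are frame timestamps in seconds, `f0` the corresponding
    pitch estimates in Hz (0.0 marking unvoiced frames, as in Praat).
    Unvoiced frames are skipped when averaging; empty bins yield 0.0.
    """
    n_bins = floor(times[-1] / width) + 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, p in zip(times, f0):
        if p > 0.0:  # keep voiced frames only
            b = min(int(t // width), n_bins - 1)
            sums[b] += p
            counts[b] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]
```

For instance, four frames at 0.5, 1.5, 2.5, and 3.5 s (the one at 2.5 s unvoiced) collapse into two 2 s bins, each holding the mean of its voiced frames.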
The average pitch was preferred to intensity, since the latter can depend strongly (much more strongly than pitch) on the device used and on the conditions of the experiment. It is worth noting that the analysis of pitch, like that of intensity, also captures at least part of the information delivered by the analysis of empty pauses [3], as, in the time intervals where pauses occur, the average pitches present sudden cut-offs of values.
Then, denoting by S the combination of the two samples and labelling each element of S with its ID number i, two classes of data were obtained: the number n_i of intervals associated with subject i, defined by Equation (1), where L_i is the length of their speech (in seconds), for each i in S,

n_i = ⌊L_i/2⌋, (1)

and the sequence (p_{i,j}), j = 1, …, n_i, of the average pitches on the different intervals, where j is the index of the interval and ⌊·⌋ denotes, as usual, the integer part.
Next, as the first good marker candidate, the total variation of pitch over the speech of the i-th subject (for any i in S) was chosen (see Equation (2) below):

T_i = |p_{i,2} − p_{i,1}| + |p_{i,3} − p_{i,2}| + … + |p_{i,n_i} − p_{i,n_i−1}|. (2)

This marker can give some information about the uniformity and the "smoothness" of speech: the greater T_i is, the less smooth and uniform the speech.
The measure above is affected by the speed of the speech of each subject; in fact, although it is quite plausible that a subject who is speaking quickly also has a less smooth and uniform voice, two subjects with the same smoothness sound different if one of them speaks faster than the other. Thus, the second marker, used to correct and sharpen the results given by T_i, is the average variation of pitch over the whole reading (see Equation (3)):

A_i = T_i/(n_i − 1). (3)
On the other hand, a pitch variation between two subsequent intervals can be due to two different causes: the reading expression, dictated by the meaning of what one is reading, and the psychological stability (or instability) of the reader. This remark suggested considering, as the third marker (see Equation (4)), the percentage O_i of inversions (also called oscillations) of the sign of two subsequent differences p_{i,j+1} − p_{i,j} and p_{i,j+2} − p_{i,j+1} over the whole reading, for any i in S. The percentage was preferred to the raw number of inversions, since the latter also depends on the length (hence, on the speed) of the reading. This last marker can also give some information on the stability and uniformity of speech.
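The three markers can be computed directly from the sequence of interval-average pitches of a subject. The sketch below is a hypothetical implementation: the exact normalizations used in Equations (3) and (4) are assumptions here (the number of differences for A, the number of consecutive difference pairs for O).

```python
def markers(p):
    """Compute (T, A, O) from the interval-average pitches p of one subject.

    T: total variation of pitch (sum of absolute successive differences);
    A: average variation (T normalized by the number of differences);
    O: percentage of sign inversions between subsequent differences
       (normalized here by the number of consecutive difference pairs).
    """
    diffs = [b - a for a, b in zip(p, p[1:])]
    T = sum(abs(d) for d in diffs)
    A = T / len(diffs)
    inversions = sum(1 for d1, d2 in zip(diffs, diffs[1:]) if d1 * d2 < 0)
    O = 100.0 * inversions / (len(diffs) - 1)
    return T, A, O
```

For example, the pitch sequence 100, 110, 105, 115 Hz yields differences +10, −5, +10, hence T = 25, A = 25/3, and O = 100% (both consecutive difference pairs change sign).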
Therefore, we transformed our speech dataset into a set of 22 values T_i, 22 values A_i, and 22 values O_i, recorded in suitable tables. Next, we computed the position and dispersion indexes for each variable and assigned a suitable confidence interval to each of them. For each subject i, the position of the triplet (T_i, A_i, O_i), inside or outside the Cartesian product of these confidence intervals, is the main tool to discriminate whether the subject is typical or depressed.
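The resulting decision rule (made explicit in Section 7: all three markers inside the confidence intervals, all three outside, or a mixed case) reduces to a few comparisons. A minimal sketch, where the interval bounds are placeholder values rather than those computed from the study's tables:

```python
def classify(triplet, intervals):
    """Classify a subject from the triplet (T_i, A_i, O_i) against the
    confidence intervals [(lo, hi), ...] computed for T, A, and O.

    All three markers inside  -> 'H' (healthy region),
    all three markers outside -> 'D' (depressed region),
    one or two outside        -> 'M' (uncertain overlap zone).
    """
    inside = sum(lo <= v <= hi for v, (lo, hi) in zip(triplet, intervals))
    return {3: "H", 0: "D"}.get(inside, "M")
```

The same function serves the on-device early detector: the medical structure only needs to ship the six interval bounds, never the training data.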
6. Results
6.1. The Total Variation
The first step of the analysis was to evaluate, for any i in S, the absolute values of the pitch differences |p_{i,j+1} − p_{i,j}| and their sum T_i. The results of this first step are listed in Table 1 for healthy subjects and in Table 2 for depressed subjects.
The mean value and the related standard deviation of the data in Table 1 are reported in Equation (5), yielding a reference interval for the classification of data. One finds that the values of T_i for seven items in Table 1, namely, items ID_39, ID_32, ID_10, ID_33, ID_36, ID_40, and ID_30, belong to this interval, so that it turns out to be the confidence interval for T at the corresponding confidence level.
As observed in Table 2, nine of its items fall outside the interval, five on one side of it and four on the other. Accordingly, 82% of the sample of depressed subjects are correctly classified outside the interval.
6.2. The Average Variation
The results of the average variations of pitch for the subjects in the sample S are collected in Table 3 and Table 4, the former for healthy subjects and the latter for depressed ones.
The mean value and standard deviation of A_i, as computed from Table 3, are reported in Equation (6), yielding a reference interval for the classification of data. One finds that the values of A_i for eight items in Table 3 belong to this interval, so that it turns out to be the confidence interval for A at the corresponding confidence level.
As observed in Table 4, the most relevant factor to note is that nine of its items turn out to be outside the interval, five on one side of it and four on the other. Again, 82% of the sample of depressed subjects are correctly classified outside the interval.
6.3. The Inversion Percentage
The last marker is the inversion percentage. The values O_i of this marker are listed in Table 5 and Table 6 for healthy and depressed subjects, respectively.
The average value and standard deviation of O_i, as computed from Table 5, are reported in Equation (7), yielding a reference interval for the classification of data. One finds that seven items in Table 5 belong to this interval, so that it is, in turn, the confidence interval for O at the corresponding confidence level.
Finally, observing the values O_i in Table 6, only three items fall outside the interval (one on one side of it and two on the other). With this marker, only 27% of depressed subjects are correctly classified. However, the important contribution of the inversion percentage is that it correctly places subject ID_9, who had instead escaped the classification provided by the total and average variations.
7. Discussion
The main contribution of this work is a quick and easy-to-compute method for classifying depressed and non-depressed subjects from the non-verbal analysis of speech. Three markers have been found to accomplish this goal.
The choice of the markers was mainly due to perception. As a matter of fact, human ears seem able to perceive the psychological conditions of a subject who reads a tale: at least some suggestions about these conditions can be obtained from the empty pauses [3], from the reading speed [50], from the voice tones, and from their variations [51]. This suggested placing further attention on pitch and, above all, on its variations during the reading, which is why pitch-based measures were chosen as the main markers in this study.
On the other hand, all of this information is mainly qualitative and depends on numerous interfering psychological states, so that one must take into account several "modes" of depression, which could produce (in principle with the same probability) a too fast or a too slow reading, too high or too low tones, some "flatness" of the pitch, or some high-frequency oscillation of pitch (on a purely perceptive ground, all of these features have actually been detected in the depressed subjects of the sample) [50,51]. None of the above-described markers would have been able, by themselves, to give a sharp discrimination between typical and depressed subjects; they were instead used as "filters" selecting, one after another, the classes of subjects. As a matter of fact, the joint use of these three markers has led to the correct discrimination of the large majority of depressed subjects (and, of course, of healthy subjects).
In fact, the main conclusion that seems to be drawn from the results reported in the previous Section can be expressed in terms of probability as follows. Let us first consider the four (random) variables B, T, A, and O, where B is the BDI score assigned to each subject (the BDI score spans from 0 to 63, according to its definition [1]). We may now envisage S as a probability space. We denote by N the set of subjects classified as healthy according to the BDI test, and by its complement in S the set of subjects classified as depressed, i.e., the events "healthy subject according to the BDI test" and "depressed subject according to the BDI test", respectively (with no information on the values of T, A, and O). Thus, the results described in the previous Section can be epitomized by Relations (10) and (11), which bound the probabilities of the marker values conditional on the subject's class (Figure 3 gives a visual synthetic picture of the situation described by Relations (10) and (11)).
Two remarks now spontaneously arise in connection with these results. First, the limits of the confidence intervals for T, A, and O depend on the subset N of the sample S. This means that, if we change our sample, we find different limiting values to discriminate between healthy and depressed subjects. In addition, the sets excluded by Relations (10) and (11) are not complementary, i.e., their union is not the whole of S, so that there is an overlapping zone M where, in principle, both depressed and typical subjects are somehow mixed. In connection with the first remark, it is to be expected that, taking ever larger samples, the averages of the variables T, A, and O will each tend to fluctuate in a sufficiently small neighborhood of a stable value, and also that their standard deviations will describe sets of values whose upper bounds cluster around a stable value.
One can observe, on the one hand, that no medical diagnosis is free of some degree of uncertainty (i.e., in most cases, a diagnosis requires further examinations to be validated or rejected); on the other hand, the fact that the proposed markers seem to work as filters suggests that it could be possible to find a suitable function leading to a sharp distinction between a properly defined healthy zone and a depressed zone, with no overlap.
From a diagnostic viewpoint, however, it will be of particular interest to give the explicit expressions of the "inverse" probabilities:
the probability that a subject whose three measured markers all lie inside the respective confidence intervals for T, A, and O is a typical subject;
the probability that a subject whose three measured markers all lie outside the respective confidence intervals is depressed;
the probability that a subject for whom one or two measured markers, but not all three, lie outside the respective confidence intervals is depressed.
In fact, these probabilities express the plausibility of a diagnosis given the values of the markers. Without these probabilities, our analysis would have been severely incomplete.
Now, since the two classes are equally represented in S, their prior probabilities coincide. Applying the Bayes formula to the conditional probabilities summarized in Relations (10) and (11), and checking the measures listed in the Tables reported in the previous Sections, one obtains the "inverse" probabilities reported in Relations (12) and (13).
In conclusion, when the triplet of the values of the markers is in the H region, we can be sure that the subject examined is healthy, and when it belongs to the D region, we have the same certainty that the subject is depressed. When the triplet of the values of the markers is in the M region, as often happens in medical investigations, the results are not conclusive, but the probability that the subject examined is depressed is perceptibly higher than the probability that they are healthy. As a matter of fact, according to Relations (12) and (13), the probability that a subject is healthy, given that the triplet of marker values is in the region M, is 44%, whereas the probability that a subject is depressed, given the same condition, is 56%.
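The Bayes inversion for the M region can be reproduced with equal priors, since the two groups have 11 subjects each. In the sketch below, the per-group counts of subjects falling in M are illustrative assumptions, chosen only because they are consistent with the reported 44%/56% split; the paper's Tables would supply the actual counts.

```python
def posteriors_in_M(healthy_in_M, n_healthy, depressed_in_M, n_depressed):
    """P(healthy | M) and P(depressed | M) via the Bayes formula,
    assuming equal prior probabilities for the two classes."""
    p_M_given_N = healthy_in_M / n_healthy      # P(M | healthy)
    p_M_given_D = depressed_in_M / n_depressed  # P(M | depressed)
    p_N_given_M = p_M_given_N / (p_M_given_N + p_M_given_D)
    return p_N_given_M, 1.0 - p_N_given_M

# hypothetical counts: 4 of 11 healthy and 5 of 11 depressed subjects in M
p_healthy, p_depressed = posteriors_in_M(4, 11, 5, 11)
```

With equal priors, the priors cancel out of the Bayes formula, so only the two conditional probabilities P(M | healthy) and P(M | depressed) matter.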
These mathematical considerations provide a method to assess the depressive or healthy status of newly acquired speech samples. Specifically, by locating the speech samples of users in the space depicted in Figure 3, the proposed tool can determine their classification in the healthy/depressed category.
These results are limited to the rather small sample we have considered. Nevertheless, it still seems plausible that a significantly larger set of samples could lead not only to a better definition of the H and D regions, but also to more significant probability differences in the M region.
8. Conclusions and Future Works
According to the above discussion, it is clear that the present work is just a first step in a promising direction, and is a preliminary study toward an automatic procedure to discriminate between typical and depressed subjects through speech analysis while preserving their anonymity and privacy.
The preliminary nature of the study is due to the limited number of samples analyzed in this paper. On these samples, the results show the potential of the proposed algorithm; hence, the authors aim to run this approach on larger datasets (e.g., the DAIC-WOZ database [52]) in the near future. In addition, ML approaches tend to over-perform in laboratory settings compared to real-life applications; to overcome this, the architecture presented in this paper incorporates the continuous improvement of classification accuracy.
Regardless, the results obtained and presented here seem to point out an interesting and promising pathway to deepen and extend the research on the markers proposed in the present work. They deserve to be examined more deeply, as they seem to work effectively for the desired discrimination, actually allowing for the identification of a "healthy region" and a "depressed region", though, at this preliminary stage, the definition of these regions depends on the selected sample.
Needless to say, a crucial step in the future development of the research about the markers is a Bayesian analysis of the conditional probabilities inverse to the ones considered in Relations (10) and (11). Such an analysis will be the first step of future research on the effectiveness of the variables T, A, and O as markers of depression.
Another important future study will be conducted on the impact of the quality of speech on recognition performance. Since mobile devices are not likely to ensure very high-quality recordings, and since background noise can be frequent in such recordings, this study will contribute firmly toward defining how to adopt the approach in mobile settings. To this aim, the samples considered in this paper will be distorted with growing levels of noise and with typical background noises, such as cafeterias, streets, etc.
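The planned noise-robustness study could be prototyped by mixing noise into the clean recordings at a prescribed signal-to-noise ratio. The sketch below is only an assumption about how such distortion might be implemented: the function name is ours, and white Gaussian noise stands in for the recorded cafeteria/street noise mentioned above.

```python
import math
import random

def add_noise_at_snr(signal, snr_db, seed=0):
    """Return `signal` with white Gaussian noise mixed in so that the
    signal-to-noise ratio is approximately `snr_db` decibels."""
    rng = random.Random(seed)
    p_signal = sum(s * s for s in signal) / len(signal)   # mean signal power
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))        # target noise power
    sigma = math.sqrt(p_noise)
    return [s + rng.gauss(0.0, sigma) for s in signal]
```

Running the marker extraction on progressively noisier copies (e.g., 20, 10, 5, 0 dB SNR) of the same reading would then reveal at which noise level the classification starts to degrade.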
Additionally, other ML approaches able to generate new detectors from the dataset generated by the APT will be explored. The performance obtained using ML techniques, such as Random Forests, Decision Trees, or Support Vector Machines, will be evaluated and compared with the results achieved by the proposed approach. Finally, to preserve the privacy of subjects and to improve the protection of their sensitive data, anonymization methods will be explored.