PHQ-V/GAD-V: Assessments to Identify Signals of Depression and Anxiety from Patient Video Responses
Round 1
Reviewer 1 Report
- Recently, many works have been carried out on depression detection. What are the limitations of those works that motivated the current research?
- What are the key contributions of this work?
- Summarize the findings from recent works in a table.
- The authors can discuss some of the recent works, such as the following:
  https://www.mdpi.com/2079-9292/11/5/676
  https://ieeexplore.ieee.org/abstract/document/9743908/
- How are the hyperparameters chosen in this study?
- What are the threats to the validity of the proposed approach?
- Analyze the computational cost of the proposed model.
- Discuss the limitations and future scope of this study.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
The paper is interesting to read and well-written. Another strong point of the paper is the nicely prepared charts and visualizations. The dataset is relatively large; 1149 participants/observations is considerable and very hard to achieve in purely clinical scenarios.
As a downside, the dataset consists of noisy self-reported data. It is not clinical data but only "signals of" some potential issues. This makes the data and experiments less relevant.
Yet, the biggest problem is that it is not possible to replicate this study, as neither the dataset nor the source code is available. Moreover, the description of the methodology is severely lacking. Please see the examples below.
Section 2.3 - "the current study postulates that combining modalities will provide extra value over using a single modality by itself". This is a direct contradiction to results presented in Table 1., where Language is the single, superior modality that outperforms All Combined modalities. And also in direct contradiction to this fragment in section 4.1: "Combining multiple modalities did not improve the performance of the model". Why does the paper contain such self-contradictory statements?
Section 3.2.1 - what exactly is the "multi-layer perceptron (MLP)"? Is it a single Dense layer, or many Dense layers, as the name implies? How many? What is the size of each layer, and what is the activation function? Is there a dropout layer in between? In the case of the "joint loss technique", what is the contribution of each individual loss type?
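For concreteness, here is a minimal sketch of the level of detail I would expect the paper to provide; all layer sizes, the dropout rate, the number of questionnaire items, and the loss weights below are hypothetical and not taken from the manuscript:

```python
import torch
import torch.nn as nn

class MLPHead(nn.Module):
    """Hypothetical classification head: two Dense layers with dropout."""
    def __init__(self, in_dim=768, hidden_dim=256, n_items=9, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, n_items),  # one output per questionnaire item
        )

    def forward(self, x):
        return self.net(x)

# Hypothetical "joint loss": a weighted sum of a per-item regression loss and
# a binary screening loss; the weights 0.7 / 0.3 are invented for illustration.
def joint_loss(item_pred, item_true, screen_logit, screen_true,
               w_item=0.7, w_screen=0.3):
    item_loss = nn.functional.mse_loss(item_pred, item_true)
    screen_loss = nn.functional.binary_cross_entropy_with_logits(
        screen_logit, screen_true)
    return w_item * item_loss + w_screen * screen_loss
```

Reporting the actual values of these design choices (and the loss weights) would make the architecture reproducible.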
Section 3.2.2 - I do not understand the role of combining features using the bi-modal outer product approach. It is not clear what the gain is compared to embedding concatenation. The last sentence is also not clear: what scores are unchanged? If this is about model scores (results), then should the goal not be to improve them? It would be advisable to report the results of the concatenation approach.
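To illustrate the comparison I am asking for, here is a minimal sketch of the two fusion variants; the embedding sizes are hypothetical and not taken from the paper:

```python
import torch

batch = 4
text_emb = torch.randn(batch, 128)   # hypothetical text embedding size
audio_emb = torch.randn(batch, 64)   # hypothetical audio embedding size

# Baseline: concatenation keeps dimensionality additive.
concat = torch.cat([text_emb, audio_emb], dim=-1)        # shape: (4, 192)

# Bi-modal outer product: every pairwise interaction between the two
# embeddings, so dimensionality is multiplicative after flattening.
outer = torch.einsum("bi,bj->bij", text_emb, audio_emb)  # shape: (4, 128, 64)
outer_flat = outer.flatten(start_dim=1)                  # shape: (4, 8192)

print(concat.shape, outer_flat.shape)
```

Reporting results for both variants (together with the resulting tensor sizes) would make the claimed benefit of the outer product explicit.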
Section 3.2.3 - the first sentence indicates that the multi-response method is used for video. Why did you not also use it for text and audio?
In this section, there is no description of the transformer block that is present in Figure 3(c). The text only says "multi-response block uses the same classification head as the both of above blocks." What is the exact network that you use?
Basically, Sections 3.2.1-3.2.3 do not contain enough description to replicate the described approach. We know nothing about the exact transformer types that were used. Hugging Face links and paper references should be provided, along with details about fine-tuning such as learning rates, number of epochs, etc., since you mention that the transformers are trainable. The only hint appears in Figure 3(b), e.g., Affectiva and HuBERT, but this is by far not enough. What transformer is used for multimodal processing as the module that combines (outer products of) Transcript, Audio and Facial Features? In Section 3.2.3 you only state that "It takes the embeddings produced by each of the upstream blocks and combines them using a transformer". Was it pre-trained? Was it jointly pre-trained using speech, text and video data, since this is the input type that you use?
"Previous research attempted to use BERT to detect depression, but did not show promising results [18]." <- This is not true. In Arabic, El-Ramly et al. (2021) reported the accuracy of 96.92 using a BERT variant on a dataset of 7k posts
https://ieeexplore.ieee.org/document/9694178
I would certainly call that "promising".
Caption for Figure 3(b) - the Audio/HuBERT path is not an input / not connected to the MLP. Is that really the case?
As an appendix to Figure 3, it would be useful to print the sizes of the tensors at the main points of the diagram. This would allow the reader to understand the effect of the outer product, for example.
"negatively affecting scores slightly." - slightly negatively affecting scores?
A final remark.
According to a similar study, it seems possible to benefit from multiple modalities when detecting depression. Makiuchi et al. (2019) described a model based on feature fusion with a fully-connected layer to combine speech and linguistic features:
https://www.researchgate.net/publication/336782624_Multimodal_Fusion_of_BERT-CNN_and_Gated_CNN_Representations_for_Depression_Detection
Unfortunately, you did not re-implement the Makiuchi et al. method for comparison; the paper was only mentioned in the context of datasets. I suggest applying this method to your dataset and reporting the results. In general, you should investigate the discrepancy between your results and those of Makiuchi et al. (2019). The only remark I found in your paper is that "Combining multiple modalities did not improve the performance of the model [..] This is counter intuitive to expectations and further research and analysis is needed to determine the exact cause of this discrepancy.". I agree with this assessment, but it is by far not enough. It is an indication that there is either something very wrong in your methodology or bugs in your code.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
1. Related works (research lines) in similar or related applications in the same area are not comprehensively discussed in the paper. Are there any related applications that have been explored by other researchers? What gaps / issues / weaknesses have been encountered so far? Please discuss these in the literature review section.
2. Figure 3. The details of the process for each step in (a), (b) and (c) are missing. Please explain them thoroughly in the methodology section.
3. Why is a transformer used in this project? What is the contribution of the transformer and BERT to the system? Explain their function and how they work in the proposed method.
4. How do you interpret your results / observations using AUC and Pearson correlation? The data interpretation is not clear in this paper.
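For clarity, these are the two metrics in question; a minimal example of how they are typically computed, on hypothetical predictions and labels that are not from the paper:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from scipy.stats import pearsonr

# Hypothetical values, purely for illustration.
y_true_binary = np.array([0, 1, 1, 0, 1])      # e.g., above/below screening threshold
y_score = np.array([0.2, 0.8, 0.6, 0.3, 0.9])  # predicted probability

y_true_score = np.array([3, 14, 10, 5, 18])    # e.g., questionnaire total score
y_pred_score = np.array([4, 12, 11, 7, 15])    # predicted total score

auc = roc_auc_score(y_true_binary, y_score)    # ranking quality of the classifier
r, p = pearsonr(y_true_score, y_pred_score)    # linear agreement of predicted scores

print(f"AUC = {auc:.2f}, Pearson r = {r:.2f} (p = {p:.3f})")
```

The paper should state which quantity each metric is computed on and what range of values the authors consider clinically meaningful.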
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Thank you for addressing the suggestions. The article is now significantly improved over the original submission.