VAD-CLVA: Integrating CLIP with LLaVA for Voice Activity Detection
Abstract
1. Introduction
- Can VLMs, in particular CLIP, enhance VAD by incorporating textual descriptions generated by LMMs such as LLaVA alongside visual features?
- Can a model trained on visual data, together with text obtained through prompt engineering, achieve performance comparable to SOTA methods?
- We introduce a novel VAD method that effectively utilizes vision-language pretraining. To the best of our knowledge, this is the first work to adopt CLIP [32] for VAD. We demonstrate that our model, VAD-CLVA, surpasses SOTA visual VAD methods, confirming the benefit of jointly learning from text descriptions and visual features.
- This is also the first attempt to employ LMMs (i.e., LLaVA [30,31]) with prompt engineering both to perform VAD directly and to generate text descriptions of individuals’ speaking activity from upper-body images. While the standalone LMM does not match the effectiveness of our VAD-CLVA on the VAD task, its text descriptions enhance the exploitation of spatio-temporal upper-body features, thereby improving VAD-CLVA’s performance.
- Through extensive experimentation, we demonstrate that our approach outperforms all visual methods as well as a standalone LMM. Moreover, VAD-CLVA consistently achieves results comparable to or surpassing the audio-visual SOTA, even though it employs a simpler pipeline and does not require pretraining on audio-visual data. A minimal, illustrative code sketch of this pipeline is given below.
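To make the pipeline concrete, the following sketch illustrates the general idea in Python: an LMM-style text description of an upper-body crop is embedded with CLIP's text encoder, the crop itself with CLIP's image encoder, and a small classification head predicts speaking versus not speaking. This is an illustration rather than the authors' implementation: the checkpoint name `openai/clip-vit-base-patch16`, the stubbed `describe_speaking_activity` function (which stands in for prompting LLaVA), and the concatenation-plus-MLP head are assumptions, and the temporal modelling and training procedure of VAD-CLVA are omitted.

```python
# Minimal, illustrative sketch (not the authors' code): fuse a CLIP image
# embedding of an upper-body crop with a CLIP text embedding of an
# LMM-generated description, then classify speaking vs. not speaking.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the paper's exact CLIP backbone/configuration may differ.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")


def describe_speaking_activity(upper_body: Image.Image) -> str:
    """Stand-in for the LLaVA prompt-engineering step.

    In the paper, an LMM (LLaVA) is prompted with the upper-body crop and a
    question about the person's speaking activity; here a canned description
    is returned so the sketch stays self-contained and runnable.
    """
    return "a person whose mouth and hand gestures suggest they are speaking"


class VadHead(nn.Module):
    """Toy fusion head: concatenate the two CLIP embeddings, predict 2 classes."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 2),  # logits for {not speaking, speaking}
        )

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([image_emb, text_emb], dim=-1))


@torch.no_grad()
def vad_logits(upper_body: Image.Image) -> torch.Tensor:
    caption = describe_speaking_activity(upper_body)
    inputs = clip_proc(text=[caption], images=upper_body,
                       return_tensors="pt", padding=True)
    img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    head = VadHead(embed_dim=img_emb.shape[-1])  # untrained, for illustration only
    return head(img_emb, txt_emb)


if __name__ == "__main__":
    dummy_crop = Image.new("RGB", (224, 224))  # placeholder upper-body crop
    print(vad_logits(dummy_crop))
```

In practice, the description would be obtained by prompting LLaVA on each crop and the head would be trained on speaking/not-speaking labels; the sketch only shows how the image and text embeddings can be fused for the binary decision.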
2. Related Work
2.1. Voice Activity Detection
2.2. CLIP and LLaVA
3. Proposed Method: VAD-CLVA
3.1. Preliminaries
3.2. Formal Description
3.3. Implementation Details
4. Experimental Analysis and Results
4.1. Ablation Study
4.2. Comparisons with the SOTA
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Skantze, G. Turn-taking in conversational systems and human–robot interaction: A review. Comput. Speech Lang. 2021, 67, 101178.
- Xu, E.Z.; Song, Z.; Tsutsui, S.; Feng, C.; Ye, M.; Shou, M.Z. AVA-AVD: Audio-visual speaker diarization in the wild. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 3838–3847.
- Wang, Q.; Downey, C.; Wan, L.; Mansfield, P.A.; Moreno, I.L. Speaker diarization with LSTM. In Proceedings of the 2018 IEEE ICASSP, Calgary, AB, Canada, 15–20 April 2018; pp. 5239–5243.
- Chung, J.S.; Huh, J.; Nagrani, A.; Afouras, T.; Zisserman, A. Spot the conversation: Speaker diarisation in the wild. arXiv 2020, arXiv:2007.01216.
- Hung, H.; Ba, S.O. Speech/Non-Speech Detection in Meetings from Automatically Extracted Low Resolution Visual Features. 2009. Available online: https://infoscience.epfl.ch/entities/publication/0659b34f-3f4d-44e6-86a8-898c01b6b857 (accessed on 11 January 2025).
- Beyan, C.; Katsageorgiou, V.M.; Murino, V. A sequential data analysis approach to detect emergent leaders in small groups. IEEE Trans. Multimed. 2019, 21, 2107–2116.
- Górriz, J.M.; Ramírez, J.; Lang, E.W.; Puntonet, C.G.; Turias, I. Improved likelihood ratio test based voice activity detector applied to speech recognition. Speech Commun. 2010, 52, 664–677.
- Michelsanti, D.; Tan, Z.H.; Zhang, S.X.; Xu, Y.; Yu, M.; Yu, D.; Jensen, J. An overview of deep-learning-based audio-visual speech enhancement and separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1368–1396.
- Moine, C.L.; Obin, N.; Roebel, A. Speaker attentive speech emotion recognition. arXiv 2021, arXiv:2104.07288.
- Moattar, M.H.; Homayounpour, M.M. A simple but efficient real-time voice activity detection algorithm. In Proceedings of the 2009 17th European Signal Processing Conference, Glasgow, Scotland, UK, 24–28 August 2009; IEEE: New York, NY, USA, 2009; pp. 2549–2553.
- Minotto, V.P.; Jung, C.R.; Lee, B. Simultaneous-speaker voice activity detection and localization using mid-fusion of SVM and HMMs. IEEE Trans. Multimed. 2014, 16, 1032–1044.
- Patrona, F.; Iosifidis, A.; Tefas, A.; Nikolaidis, N.; Pitas, I. Visual voice activity detection in the wild. IEEE Trans. Multimed. 2016, 18, 967–977.
- Tao, R.; Pan, Z.; Das, R.K.; Qian, X.; Shou, M.Z.; Li, H. Is someone speaking? Exploring long-term temporal features for audio-visual active speaker detection. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 3927–3935.
- Köpüklü, O.; Taseska, M.; Rigoll, G. How to design a three-stage architecture for audio-visual active speaker detection in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1193–1203.
- Xiong, J.; Zhou, Y.; Zhang, P.; Xie, L.; Huang, W.; Zha, Y. Look&Listen: Multi-modal correlation learning for active speaker detection and speech enhancement. IEEE Trans. Multimed. 2022, 25, 5800–5812.
- Shahid, M.; Beyan, C.; Murino, V. S-VVAD: Visual voice activity detection by motion segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 2332–2341.
- Beyan, C.; Shahid, M.; Murino, V. RealVAD: A real-world dataset and a method for voice activity detection by body motion analysis. IEEE Trans. Multimed. 2020, 23, 2071–2085.
- Shahid, M.; Beyan, C.; Murino, V. Voice activity detection by upper body motion analysis and unsupervised domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–29 October 2019.
- Zhang, Y.; Liang, S.; Yang, S.; Liu, X.; Wu, Z.; Shan, S.; Chen, X. UniCon: Unified context network for robust active speaker detection. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 3964–3972.
- Chung, J.S.; Zisserman, A. Learning to lip read words by watching videos. Comput. Vis. Image Underst. 2018, 173, 76–85.
- Liu, Q.; Wang, W.; Jackson, P. A visual voice activity detection method with adaboosting. In Proceedings of the Sensor Signal Processing for Defence (SSPD 2011), London, UK, 27–29 September 2011.
- Sodoyer, D.; Rivet, B.; Girin, L.; Savariaux, C.; Schwartz, J.L.; Jutten, C. A study of lip movements during spontaneous dialog and its application to voice activity detection. J. Acoust. Soc. Am. 2009, 125, 1184–1196.
- Chung, J.S.; Zisserman, A. Out of time: Automated lip sync in the wild. In Proceedings of the ACCV 2016 Workshops, Taipei, Taiwan, 20–24 November 2016; Springer: Berlin/Heidelberg, Germany, 2017; pp. 251–263.
- Geeroms, W.; Allebosch, G.; Kindt, S.; Kadri, L.; Veelaert, P.; Madhu, N. Audio-Visual Active Speaker Identification: A comparison of dense image-based features and sparse facial landmark-based features. In Proceedings of the 2022 Sensor Data Fusion: Trends, Solutions, Applications (SDF), Bonn, Germany, 12–14 October 2022; pp. 1–6.
- Huang, C.; Koishida, K. Improved active speaker detection based on optical flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 17–18 June 2020; pp. 950–951.
- Cristani, M.; Pesarin, A.; Vinciarelli, A.; Crocco, M.; Murino, V. Look at who’s talking: Voice activity detection by automated gesture analysis. In Proceedings of the Constructing Ambient Intelligence: AmI 2011 Workshops, Amsterdam, The Netherlands, 16–18 November 2011; Springer: Berlin/Heidelberg, Germany, 2012; pp. 72–80.
- Gebre, B.G.; Wittenburg, P.; Heskes, T. The gesturer is the speaker. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; IEEE: New York, NY, USA, 2013; pp. 3751–3755.
- Shahid, M.; Beyan, C.; Murino, V. Comparisons of Visual Activity Primitives for Voice Activity Detection. In Proceedings of the Image Analysis and Processing—ICIAP 2019: 20th International Conference, Trento, Italy, 9–13 September 2019; Proceedings, Part I. Springer: Berlin/Heidelberg, Germany, 2019; pp. 48–59.
- Xenos, A.; Foteinopoulou, N.M.; Ntinou, I.; Patras, I.; et al. VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning. arXiv 2024, arXiv:2404.07078.
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. arXiv 2023, arXiv:2304.08485.
- Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved baselines with visual instruction tuning. arXiv 2023, arXiv:2310.03744.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: Birmingham, UK, 2021; pp. 8748–8763.
- Auty, D.; Mikolajczyk, K. Learning to Prompt CLIP for Monocular Depth Estimation: Exploring the Limits of Human Language. In Proceedings of the IEEE/CVF ICCV, Paris, France, 1–6 October 2023; pp. 2039–2047.
- Bondielli, A.; Passaro, L.C. Leveraging CLIP for Image Emotion Recognition. In Proceedings of the CEUR Workshop Proceedings, Virtual, 4–5 October 2021; Volume 3015.
- Chen, D.; Gou, G. Unleash the Capabilities of the Vision-Language Pre-training Model in Gaze Object Prediction. In Proceedings of the Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 453–466.
- Tao, F.; Busso, C. Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection. In Proceedings of the INTERSPEECH, Stockholm, Sweden, 20–24 August 2017; pp. 1938–1942.
- Tao, F.; Busso, C. End-to-end audiovisual speech activity detection with bimodal recurrent neural models. Speech Commun. 2019, 113, 25–35.
- Roth, J.; Chaudhuri, S.; Klejch, O.; Marvin, R.; Gallagher, A.; Kaver, L.; Ramaswamy, S.; Stopczynski, A.; Schmid, C.; Xi, Z.; et al. AVA active speaker: An audio-visual dataset for active speaker detection. In Proceedings of the IEEE ICASSP, Barcelona, Spain, 4–8 May 2020; pp. 4492–4496.
- Sharma, R.; Somandepalli, K.; Narayanan, S. Crossmodal learning for audio-visual speech event localization. arXiv 2020, arXiv:2003.04358.
- Shvets, M.; Liu, W.; Berg, A.C. Leveraging long-range temporal relationships between proposals for video object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9756–9764.
- Gebru, I.D.; Ba, S.; Li, X.; Horaud, R. Audio-visual speaker diarization based on spatiotemporal Bayesian fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1086–1099.
- Chakravarty, P.; Tuytelaars, T. Cross-modal supervision for learning active speaker detection in video. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part V 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 285–301.
- Joosten, B.; Postma, E.; Krahmer, E. Voice activity detection based on facial movement. J. Multimodal User Interfaces 2015, 9, 183–193.
- Haider, F.; Campbell, N.; Luz, S. Active speaker detection in human machine multiparty dialogue using visual prosody information. In Proceedings of the 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Washington, DC, USA, 7–9 December 2016; IEEE: New York, NY, USA, 2016; pp. 1207–1211.
- Stefanov, K.; Beskow, J.; Salvi, G. Vision-based active speaker detection in multiparty interaction. In Proceedings of the Grounding Language Understanding (GLU2017), Stockholm, Sweden, 25 August 2017.
- Stefanov, K.; Beskow, J.; Salvi, G. Self-supervised vision-based detection of the active speaker as support for socially aware language acquisition. IEEE Trans. Cogn. Dev. Syst. 2019, 12, 250–259.
- Wortsman, M.; Ilharco, G.; Kim, J.W.; Li, M.; Kornblith, S.; Roelofs, R.; Lopes, R.G.; Hajishirzi, H.; Farhadi, A.; Namkoong, H.; et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF CVPR, New Orleans, LA, USA, 18–24 June 2022; pp. 7959–7971.
- Shen, S.; Li, L.H.; Tan, H.; Bansal, M.; Rohrbach, A.; Chang, K.W.; Yao, Z.; Keutzer, K. How much can CLIP benefit vision-and-language tasks? arXiv 2021, arXiv:2107.06383.
- Yuan, M.; Lv, N.; Xie, Y.; Lu, F.; Zhan, K. CLIP-FG: Selecting Discriminative Image Patches by Contrastive Language-Image Pre-Training for Fine-Grained Image Classification. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 560–564.
- Srivastava, M.M. RetailKLIP: Finetuning OpenCLIP backbone using metric learning on a single GPU for zero-shot retail product image classification. arXiv 2023, arXiv:2312.10282.
- Wang, Z.; Lu, Y.; Li, Q.; Tao, X.; Guo, Y.; Gong, M.; Liu, T. CRIS: CLIP-driven referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11686–11695.
- Liu, J.; Zhang, Y.; Chen, J.N.; Xiao, J.; Lu, Y.; Landman, B.A.; Yuan, Y.; Yuille, A.; Tang, Y.; Zhou, Z. CLIP-driven universal model for organ segmentation and tumor detection. In Proceedings of the IEEE/CVF ICCV, Paris, France, 1–6 October 2023; pp. 21152–21164.
- Liang, Z.; Li, C.; Zhou, S.; Feng, R.; Loy, C.C. Iterative prompt learning for unsupervised backlit image enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 8094–8103.
- Sanghi, A.; Chu, H.; Lambourne, J.G.; Wang, Y.; Cheng, C.Y.; Fumero, M.; Malekshan, K.R. CLIP-Forge: Towards zero-shot text-to-shape generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18603–18613.
- Kim, G.; Kwon, T.; Ye, J.C. DiffusionCLIP: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2426–2435.
- Wang, H.; Li, Y.; Yao, H.; Li, X. CLIPN for zero-shot OOD detection: Teaching CLIP to say no. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 1802–1812.
- Wang, M.; Yang, N. EmoAsst: Emotion recognition assistant via text-guided transfer learning on pre-trained visual and acoustic models. Front. Comput. Sci. 2024, 6, 1304687.
- Garg, B.; Kim, K.; Ranjan, S. From Video to Images: Contrastive Pretraining for Emotion Recognition from Single Image. In Proceedings of the AAAI Conference on Artificial Intelligence, Pomona, CA, USA, 24–28 October 2022; Volume 36, pp. 12951–12952.
- Afouras, T.; Owens, A.; Chung, J.S.; Zisserman, A. Self-supervised learning of audio-visual objects from video. In Proceedings of the ECCV, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 208–224.
- Truong, T.D.; Duong, C.N.; Pham, H.A.; Raj, B.; Le, N.; Luu, K. The right to talk: An audio-visual transformer approach. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1105–1114.
- Sharma, R.; Narayanan, S. Audio-visual activity guided cross-modal identity association for active speaker detection. IEEE Open J. Signal Process. 2023, 4, 225–232.
| Exp. | Model | Modality | Visual Encoder | Text Prompt | Bell | Sick | Long | Boll. | Lie. | Avg. | Std. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LLaVA-13B [30,31] | Vis. & Text | X | prompt 1 | 82.06 | 54.89 | 41.41 | 76.66 | 76.97 | 66.40 | 17.46 |
| 2 | ViT-B/16 + MLP | Vis. | pretrained | X | 66.53 | 73.88 | 44.94 | 83.60 | 94.96 | 72.78 | 18.87 |
| 3 | ResNet101 + MLP | Vis. | pretrained | X | 66.77 | 84.99 | 78.76 | 85.25 | 90.77 | 81.31 | 9.17 |
| 4 | ResNet101 + MLP | Vis. | fine-tuned | X | 71.29 | 89.98 | 81.42 | 83.08 | 88.88 | 82.93 | 7.46 |
| 5 |  | Vis. & Text | pretrained | fixed | 75.31 | 91.59 | 81.61 | 86.73 | 83.95 | 83.84 | 6.04 |
| 6 |  | Vis. & Text | pretrained | fixed | 88.00 | 89.04 | 81.42 | 71.23 | 94.19 | 84.78 | 8.83 |
| 7 |  | Vis. & Text | pretrained | variable | 83.21 | 92.29 | 84.03 | 83.69 | 83.56 | 85.36 | 3.89 |
| 8 |  | Vis. & Text | pretrained | variable | 76.38 | 96.10 | 84.03 | 76.87 | 94.85 | 85.65 | 9.48 |
| 9 |  | Vis. & Text | fine-tuned | fixed | 79.12 | 85.04 | 85.25 | 79.88 | 96.84 | 85.23 | 7.08 |
| 10 |  | Vis. & Text | fine-tuned | fixed | 90.75 | 96.64 | 78.76 | 73.97 | 97.13 | 87.45 | 10.56 |
| 11 |  | Vis. & Text | fine-tuned | variable | 85.52 | 92.09 | 86.50 | 83.27 | 96.52 | 88.78 | 5.41 |
| 12 |  | Vis. & Text | fine-tuned | variable | 96.37 | 98.60 | 85.25 | 78.52 | 94.02 | 90.55 | 8.42 |
| Dataset | Exp. | Avg. | Std. |
|---|---|---|---|
| Columbia [42] | 12 | 93.8 | 3.7 |
| Columbia [42] | 11 | 95.2 | 4.9 |
| RealVAD [17] | 12 | 86.4 | 6.3 |
| RealVAD [17] | 11 | 88.2 | 5.3 |
| Method | Venue | Modality | Bell | Boll | Lieb | Long | Sick | AVG | STD | MED |
|---|---|---|---|---|---|---|---|---|---|---|
| [42] | ECCV 2016 | V | 82.9 | 65.8 | 73.6 | 86.9 | 81.8 | 78.2 | 8.5 | 81.8 |
| SyncNet [23] | ACCV 2017 | AV | 93.7 | 83.4 | 86.8 | 97.7 | 86.1 | 89.5 | 5.9 | 86.8 |
| [18] | ICCV 2019 | V | 89.2 | 88.8 | 85.8 | 81.4 | 86.0 | 86.2 | 3.1 | 86.0 |
| RGB-DI [18] | ICCV 2019 | V | 86.3 | 93.8 | 92.3 | 76.1 | 86.3 | 86.9 | 7.0 | 86.3 |
| LWTNet [59] | ECCV 2020 | AV | 92.6 | 82.4 | 88.7 | 94.4 | 95.9 | 90.8 | 5.4 | 92.6 |
| RealVAD [17] | IEEE TMM 2020 | V | 92.0 | 98.9 | 94.1 | 89.1 | 92.8 | 93.4 | 3.6 | 92.8 |
| S-VVAD [16] | WACV 2021 | V | 92.4 | 97.2 | 92.3 | 95.5 | 92.5 | 94.0 | 2.2 | 92.5 |
| [60] | CVPR 2021 | AV | 95.8 | 88.5 | 91.6 | 96.4 | 97.2 | 93.9 | 3.7 | 95.8 |
| TalkNet [13] | ACM MM 2021 | AV | 97.1 | 90.0 | 99.1 | 96.6 | 98.1 | 96.2 | 3.6 | 97.1 |
| UNICON [19] | ACM MM 2021 | AV | 93.6 | 81.3 | 93.8 | 93.5 | 92.1 | 90.9 | 5.4 | 93.5 |
| ACLNet [15] | IEEE TMM 2022 | AV | 97.4 | 88.1 | 97.5 | 98.5 | 98.0 | 95.9 | 4.4 | 97.5 |
| GSCMIA [61] | IEEE JSTSP 2023 | AV | 96.3 | 89.4 | 98.7 | 98.7 | 96.8 | 96.0 | 3.8 | 96.8 |
| VAD-CLVA (Ours) |  | VT | 96.9 | 86.7 | 96.0 | 97.8 | 98.8 | 95.2 | 4.9 | 96.9 |
| max-AV |  |  | 98.1 | 89.4 | 99.1 | 99.3 | 98.1 |  |  |  |
| max-V |  |  | 92.4 | 98.9 | 94.1 | 95.5 | 92.8 |  |  |  |
| Method | Bell | Boll | Lieb | Long | Sick | AVG | STD |
|---|---|---|---|---|---|---|---|
| S-VVAD [16] | 86.1 | 87.7 | 96.7 | 84.0 | 75.1 | 85.9 | 7.8 |
| VAD-CLVA (Ours) | 96.4 | 78.5 | 94.0 | 85.2 | 98.6 | 90.6 | 8.4 |
| Method | Modality | Pretraining Data | Training Data | Testing Data | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | AVG | STD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [17] | V | Columbia [42] | - | RealVAD [17] | 53.6 | 51.1 | 41.1 | 50.2 | 37.3 | 50.3 | 56.8 | 53.6 | 69.8 | 51.5 | 9.3 |
| UNICON [19] | V | AVA-ActiveSpeaker [38] | - | RealVAD [17] | 86.7 | 78.1 | 70.5 | 73.1 | 68.9 | 84.9 | 93.0 | 80.4 | 87.0 | 80.3 | 7.8 |
| UNICON [19] | AV | AVA-ActiveSpeaker [38] | - | RealVAD [17] | 94.3 | 74.0 | 89.9 | 76.7 | 80.6 | 93.6 | 98.8 | 83.5 | 93.5 | 87.2 | 8.3 |
| VAD-CLVA (Ours) | VT | Columbia [42] | - | RealVAD [17] | 89.0 | 81.6 | 81.4 | 83.4 | 79.3 | 93.6 | 97.2 | 85.8 | 93.4 | 87.2 | 6.4 |
| [17] | V | - | RealVAD [17] | RealVAD [17] | 51.6 | 53.5 | 42.9 | 51.7 | 44.4 | 50.5 | 58.7 | 67.9 | 55.8 | 53.0 | 7.1 |
| UNICON [19] | V | AVA-ActiveSpeaker [38] | RealVAD [17] | RealVAD [17] | 86.9 | 76.5 | 81.6 | 87.0 | 79.6 | 88.9 | 97.0 | 84.5 | 88.9 | 85.6 | 5.7 |
| UNICON [19] | AV | AVA-ActiveSpeaker [38] | RealVAD [17] | RealVAD [17] | 96.5 | 81.1 | 86.9 | 84.4 | 89.9 | 85.6 | 94.9 | 88.1 | 90.9 | 88.7 | 4.7 |
| VAD-CLVA (Ours) | VT | - | RealVAD [17] | RealVAD [17] | 91.7 | 78.8 | 86.2 | 87.7 | 84.4 | 87.8 | 98.5 | 88.9 | 89.5 | 88.2 | 5.3 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation: Appiani, A.; Beyan, C. VAD-CLVA: Integrating CLIP with LLaVA for Voice Activity Detection. Information 2025, 16, 233. https://doi.org/10.3390/info16030233