Article
Peer-Review Record

Automated Event Detection and Classification in Soccer: The Potential of Using Multiple Modalities

Mach. Learn. Knowl. Extr. 2021, 3(4), 1030-1054; https://doi.org/10.3390/make3040051
by Olav Andre Nergård Rongved 1,2, Markus Stige 3, Steven Alexander Hicks 1,2, Vajira Lasantha Thambawita 1,2, Cise Midoglu 2,*, Evi Zouganeli 1, Dag Johansen 4, Michael Alexander Riegler 2,4 and Pål Halvorsen 2,5,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 2 November 2021 / Revised: 8 December 2021 / Accepted: 10 December 2021 / Published: 16 December 2021
(This article belongs to the Topic Applied Computer Vision and Pattern Recognition)

Round 1

Reviewer 1 Report

The paper's aim is to develop an intelligent soccer event detection and classification system using machine learning.

Two visual approaches based on ResNet are tested and combined with an approach based on audio processing. None of the exploited models is an original contribution of this work.

The combination of audio and video is also already in use.

So, what I would have expected was a comparison with existing approaches on the same dataset to demonstrate an improvement in classification performance. Unfortunately, the paper reports only a comparison of implemented models. 

Missing references concerning video-soccer analysis:

[1] Mazzeo, Pier Luigi, et al. "Visual players detection and tracking in soccer matches." 2008 IEEE Fifth International Conference on Advanced Video and Signal Based Surveillance. IEEE, 2008.

concerning audio soccer analysis: 

[2] Islam, Muhammad Rafiqul, et al. "Sports Highlights Generation using Decomposed Audio Information." 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2019.

Concerning both audio-video modalities:

[3] Sanabria, Melissa, Frédéric Precioso, and Thomas Menguy. "A deep architecture for multimodal summarization of soccer games." Proceedings of the 2nd International Workshop on Multimedia Content Analysis in Sports. 2019.

[4] Gao, Xin, et al. "Automatic Key Moment Extraction and Highlights Generation Based on Comprehensive Soccer Video Understanding." 2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2020.

Author Response

Dear Reviewer,
Thank you very much for your comments. We have revised our manuscript as much as possible within the 6-day revision timeframe, according to your suggestions. Please find our responses below. 

Comment #1:
"The paper's aim is to develop an intelligent soccer event detection and classification system using machine learning. 
Two visual approaches based on ResNet are tested and combined with an approach based on audio processing. None of the exploited models is an original contribution of this work."

Author response:
We experiment with three visual models, namely CALF (Cioppa et al. [4]), 3D-CNN (Rongved et al. [2]), and 2D-CNN (using pre-extracted features from SoccerNet) as a baseline, as well as one audio model. It is true that all of these models can be found individually in the literature. However, our focus is on the joint use of multiple modalities, and the investigation of various model configurations and trade-offs with respect to detection performance. We investigate different model fusion strategies, as well as tuning parameters such as window size, position, and tolerance (Eq. 5). 

Overall, our main aim is not to present a completely novel model that surpasses the state-of-the-art, but rather to evaluate different model and modality combinations from the state-of-the-art under different configurations, in order to derive insights. Note also that some of the configurations we explore deliver considerable performance (e.g., Zhou et al. [64] achieve good results in the SoccerNet-v2 spotting challenge with an average-mAP of about 75%, which is lower than the AP of 84% for the CALF-120-40 model with combined inputs on goal events, as given in Table 9).

In the revised manuscript, we added a dedicated section for the discussion of our results, including some comparisons to related work and relevant benchmarks (Section 6).

Comment #2:
"The combination of audio and video is also already in use."

Author response:
We agree that the use of multiple modalities for the event detection task is not novel, as we have mentioned in Section 2.3. However, our contribution in this work is not the idea of the combination of modalities itself, but rather the evaluation of the performance benefits of multimodality. Our main insight, after attempting to quantify these per event class, is that the benefits of multimodality might actually be context-dependent, which we believe is an interesting result for the scientific community to investigate further and build upon. In the revised manuscript, we included a discussion about multimodality in the new Section 6. 

Comment #3:
"So, what I would have expected was a comparison with existing approaches on the same dataset to demonstrate an improvement in classification performance. Unfortunately, the paper reports only a comparison of implemented models."

Author response:
In the revised manuscript, we added a dedicated paragraph for the discussion of our results and relevant benchmarks (Section 6).

Comment #4:
"Missing references concerning video-soccer analysis:
[1] Mazzeo, Pier Luigi, et al. "Visual players detection and tracking in soccer matches." 2008 IEEE Fifth International Conference on Advanced Video and Signal Based Surveillance. IEEE, 2008."

Author response:
In this study, we focus on the detection of soccer events (such as goals, cards, and substitutions). The detection and tracking of players in sports (e.g., for athlete training or team strategy building purposes) is in itself a comprehensive field of study, involving many different computer vision technologies, ranging from motion detection and object recognition to segmentation; however, it is beyond the scope of our work. The above reference also employs a stochastic approach, which does not directly relate to the ML-based models we investigate in our study. We therefore chose not to include a reference to this publication: citing a single work from this adjacent field without further elaboration would dilute the focus of our related work section and do injustice to the many other works in that field, while a proper treatment of the topic is impractical within our space constraints and contextually inappropriate. 

Comment #5:
"concerning audio soccer analysis: 
[2] Islam, Muhammad Rafiqul, et al. "Sports Highlights Generation using Decomposed Audio Information." 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2019."

Author response:
In the revised manuscript, we added a reference to this publication under Section 6, as part of the discussion on multimodality. 

Comment #6:
"Concerning both audio-video modalities:
[3] Sanabria, Melissa, Frédéric Precioso, and Thomas Menguy. "A deep architecture for multimodal summarization of soccer games." Proceedings of the 2nd International Workshop on Multimedia Content Analysis in Sports. 2019."

Author response:
Unfortunately, the manuscript for this publication is not available (see: https://dl.acm.org/doi/10.1145/3347318.3355524). Therefore, despite the title and the abstract suggesting relevance, we were unable to add a reference to this publication in our revised manuscript.

Comment #7:
"[4] Gao, Xin, et al. "Automatic Key Moment Extraction and Highlights Generation Based on Comprehensive Soccer Video Understanding." 2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2020."

Author response: 
In the revised manuscript, we added a reference to this publication under Section 2, with a brief discussion of its content.

Other notes:
In the revised manuscript, we have also gone over all sections to improve the writing quality, adjusted figure and table sizes to improve readability, added a discussion of real-time detection requirements, and updated the title slightly to reflect our focus on soccer. 

We thank the reviewer for their feedback. 

Reviewer 2 Report

In this paper the authors present and evaluate different approaches combining visual and audio features to detect and classify events in soccer videos. They claim that using multiple modalities improves event detection performance for certain types of events.

I think that the research is interesting, it is well explained and the results are promising. As the window sizes are several seconds long, it would be interesting to extend the discussion about the best approach if near real-time detection is desired (window size of two seconds).

I believe that the title could be a bit misleading, as you are focused on soccer videos, not general sports. Maybe that should be reflected in the title.

Lines 91-96 should be rewritten to better fit an abstract of Karpathy et al. [8] work. 

Lines 98-99: what is "exceeded the previous attempts"? Improved the results? What do you mean by "using deep nets by a large margin"?

Line 136: "where it reaches 98.2%" of accuracy, I guess; same in line 139.

Line 141-142: do you mean to return a detected action along with its time tags? Please rewrite that sentence.

Line 205: "A traditional SVM classifier group the features" -> "groups"; same in the following line.

Line 214: Please say in an explicit way that your research concerns soccer videos.

In ref. 51 in the bibliography, "support vector Machine" capitalization should be changed to "support vector machine" or "Support Vector Machine".

It may be interesting to add this reference:
Zhou, Xin, Le Kang, Zhiyu Cheng, Bo He, and Jingyu Xin. "Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection." arXiv preprint arXiv:2106.14447 (2021).

Author Response

Dear Reviewer,
Thank you very much for your comments. We have revised our manuscript as much as possible within the 6-day revision timeframe, according to your suggestions. Please find our responses below. 

Comment #1:
"In this paper the authors present and evaluate different approaches combining visual and audio features to detect and classify events in soccer videos. They claim that using multiple modalities improves event detection performance for certain types of events.
I think that the research is interesting, it is well explained and the results are promising. As the window sizes are several seconds long, it would be interesting to extend the discussion about the best approach if near real-time detection is desired (window size of two seconds)."

Author response:
Thank you for your comment. In the revised manuscript, we added a new section (Section 6) dedicated to the discussion of various aspects, including real-time detection requirements. 

Comment #2: 
"I believe that the title could be a bit misleading, as you are focused on soccer videos, not general sports. Maybe that should be reflected in the title."

Author response:
Thank you for the suggestion. We believe that some of our conclusions can be generalized to a number of sports other than soccer as well. However, it is true that in this work, we only ran experiments on soccer videos. Therefore, we replaced the word "Sports" with "Soccer" in the title of our revised manuscript.

Comment #3: 
"Lines 91-96 should be rewritten to better fit an abstract of Karpathy et al. [8] work."

Author response:
In Section 2 of the revised manuscript, we added some quantitative measures of performance and dataset details from this study, as mentioned in their abstract.

Comment #4: 
"Lines 98-99: what is "exceeded the previous attempts"? Improved the results? What do you mean by "using deep nets by a large margin"?"

Author response: 
In Section 2 of the revised manuscript, we rewrote this sentence to clarify our meaning: the approach by Simonyan and Zisserman greatly improved the performance compared to other works using deep neural networks. 

Comment #5:
"Line 136: "where it reaches 98.2%" of accuracy, I guess; same in line 139."

Author response:
Thank you for pointing out the problematic wording. In the revised manuscript, we added the missing metric in both lines.

Comment #6:
"Line 141-142: do you mean to return a detected action along with its time tags? Please rewrite that sentence."

Author response:
Thank you for pointing out the unclear expression. In the revised manuscript, we rewrote the sentence and updated Section 2.1 to better explain temporal action localization in the context of action detection.

Comment #7:
"Line 205: "A traditional SVM classifier group the features" -> "groups"; same in the following line."

Author response:
Thank you for pointing out these grammar errors. We corrected both in the revised manuscript.

Comment #8:
"Line 214: Please say in an explicit way that your research concerns soccer videos."

Author response:
In the revised manuscript, we rewrote the sentence to make it clear that the context of our own research is the detection and classification of events in soccer videos. 

Comment #9:
"In ref. 51 in the bibliography, "support vector Machine" capitalization should be changed to "support vector machine" or "Support Vector Machine"."

Author response:
Thank you for pointing out the case inconsistency. We updated this entry in the revised manuscript.

Comment #10:
"It may be interesting to add this reference:
"It may be interesting to add this reference:
Zhou, Xin, Le Kang, Zhiyu Cheng, Bo He, and Jingyu Xin. "Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection." arXiv preprint arXiv:2106.14447 (2021)."

Author response:
Thank you for pointing out this publication. The authors use a video-only unimodal approach, but we believe that their work is very interesting. In the revised manuscript, we added a reference to this publication under Section 2, with a brief summary of its content, as well as Section 6, with a discussion of its results. 

Other notes:
In the revised manuscript, we have also gone over all sections to improve the writing quality, adjusted figure and table sizes to improve readability, added further references, and a discussion of our results in the context of relevant benchmarks. 

We thank the reviewer for their feedback. 

Round 2

Reviewer 1 Report

I'm not totally convinced by the authors' reply. In particular, the originality of the work is, in my opinion, marginal. Anyway, the paper could be accepted, since it is certainly of interest to journal readers. The authors decided not to include references on player tracking. Following this reasonable pathway, I suggest including at least the following works on event detection in soccer:

[1] Spagnolo, Paolo, et al. "Non-invasive soccer goal-line technology: a real case study." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2013.

[2] Thamaraimanalan, T., et al. "Prediction and Classification of Fouls in Soccer Game using Deep Learning." Irish Interdisciplinary Journal of Science & Research 4.3 (2020): 66-78.

Author Response

Dear Reviewer,

Thank you again for your effort in reviewing our paper.

We have added your suggested papers in the first paragraph of Section 2.2.
