Peer-Review Record

SemConvTree: Semantic Convolutional Quadtrees for Multi-Scale Event Detection in Smart City

Smart Cities 2024, 7(5), 2763-2780; https://doi.org/10.3390/smartcities7050107

by Mikhail Andeevich Kovalchuk *, Anastasiia Filatova, Aleksei Korneev, Mariia Koreneva, Denis Nasonov *, Aleksandr Voskresenskii and Alexander Boukhanovsky

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Submission received: 4 July 2024 / Revised: 10 September 2024 / Accepted: 26 September 2024 / Published: 28 September 2024

(This article belongs to the Section Smart Data)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper presents an interesting approach to event detection in urban environments using an enhanced version of the ConvTree algorithm. The proposed model, SemConvTree, integrates advanced topic modelling and semantic analysis techniques to improve event detection accuracy, particularly for low-scale events. The study demonstrates a significant increase in the detection rate and accuracy of urban events, offering valuable insights for smart city management.

Strengths:
- The introduction of SemConvTree is a notable improvement over traditional ConvTree, particularly in handling multi-scale events. The incorporation of semantic analysis via advanced NLP models like BERTopic, TSB-ARTM, and SBERT-Zero-Shot is well-executed and justified.
- The experimental results are thorough.
- The adaptation of the grid based on post density is a practical approach that allows for finer detection in densely populated areas.
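The density-adaptive grid mentioned in the last point can be illustrated with a minimal quadtree sketch. This is not the authors' implementation: the `Cell` type, the `max_points` and `max_depth` thresholds, and the half-open cell bounds are all assumptions made for illustration. A cell splits into four children whenever it holds more posts than the threshold, so dense areas end up covered by smaller cells.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y), e.g. longitude and latitude of a post


@dataclass
class Cell:
    """A square grid cell covering [x, x + size) by [y, y + size)."""
    x: float
    y: float
    size: float
    depth: int = 0
    points: List[Point] = field(default_factory=list)
    children: List["Cell"] = field(default_factory=list)


def build(cell: Cell, max_points: int = 4, max_depth: int = 6) -> Cell:
    """Recursively split `cell` while it holds more than `max_points` posts.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if len(cell.points) <= max_points or cell.depth >= max_depth:
        return cell  # sparse enough (or deep enough): keep as a leaf
    half = cell.size / 2
    for dx in (0, 1):
        for dy in (0, 1):
            cx, cy = cell.x + dx * half, cell.y + dy * half
            child = Cell(cx, cy, half, cell.depth + 1)
            # Assign each post to the quadrant whose half-open bounds contain it.
            child.points = [
                (px, py) for px, py in cell.points
                if cx <= px < cx + half and cy <= py < cy + half
            ]
            cell.children.append(build(child, max_points, max_depth))
    cell.points = []  # the posts now live in the children
    return cell


def leaves(cell: Cell) -> List[Cell]:
    """Return all leaf cells, i.e. the cells of the final adaptive grid."""
    if not cell.children:
        return [cell]
    return [leaf for child in cell.children for leaf in leaves(child)]
```

Calling `build(Cell(0.0, 0.0, 1.0, points=posts))` on posts that lie inside the root cell yields small leaves around clusters and large leaves elsewhere.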

Weaknesses:
- The proposed model, while effective, is computationally intensive due to the combination of multiple NLP models and the use of convolutional quadtrees; this might be detrimental to real-life applications.
- While the paper demonstrates the effectiveness of the method in New York City, scalability to other cities, especially those with different social media usage patterns or lower data availability, remains unclear. This is not necessarily a weakness, but more of a point to consider in future work.

Also, the paper briefly mentions ethical concerns and data anonymization. However, the discussion is somewhat superficial. Given the potential sensitivity of the data involved (especially with low-scale events that might be personal), a more detailed exploration of the ethical implications, including potential biases in the dataset and how they are mitigated, is necessary.

Author Response

Dear Reviewer,

We would like to express our sincere gratitude for your thoughtful and constructive feedback on our manuscript. Your comments and suggestions have been invaluable in improving the quality and clarity of our work. We have carefully considered each point and made the necessary revisions to address your concerns.

In response to your suggestions, we have made the following changes in the revised manuscript:

Comment 1: The proposed model, while effective, is computationally intensive due to the combination of multiple NLP models and the use of convolutional quadtrees; this might be detrimental to real-life applications.
Response 1:
We appreciate your comment regarding the computational complexity of the proposed model. We would like to note that, despite its apparent complexity, our system is capable of processing data in real time for the following reasons:

- We have implemented a multi-threaded crawler that collects posts from a single city with minimal delay.
- All major services and components of the system, except for the re-ranking module, are implemented in the compiled language Go, which ensures high performance. More detailed information, including execution time, is described in our previous publication [1].
- PostgreSQL with TimescaleDB and PostGIS modules is used for storing posts, providing high-speed processing of geo-distributed data.
- Lightweight models have been chosen for the re-ranking module. For example, Sentence BERT is noted for its excellent balance of accuracy and speed, and the same is true for BigARTM.
In the future, we plan to optimize the re-ranking module so that all models can be run simultaneously on different nodes, which will increase the processing speed of individual posts.

Thus, despite the complexity of the algorithm, we have paid special attention to optimizing its performance to ensure real-time operation.
[1] A. A. Visheratin, K. D. Mukhina, A. K. Visheratina, D. Nasonov, and A. V. Boukhanovsky, "Multiscale event detection using convolutional quadtrees and adaptive geogrids," in Proceedings of the 2nd ACM SIGSPATIAL Workshop on Analytics for Local Events and News, Nov. 2018, doi: 10.1145/3282866.3282867.
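The plan to run all re-ranking models simultaneously can be sketched as follows. This is a minimal illustration under stated assumptions, not the system described in the response: threads on one machine stand in for separate nodes, `keyword_scorer` and `length_scorer` are hypothetical placeholders for the real models, and averaging the scores is an assumed combination rule that the response does not specify.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict

Scorer = Callable[[str], float]  # maps post text to a score in [0, 1]


def keyword_scorer(text: str) -> float:
    """Hypothetical stand-in for a topic model: share of event-like words."""
    event_words = {"concert", "festival", "parade", "match"}
    words = text.lower().split()
    return sum(w in event_words for w in words) / len(words) if words else 0.0


def length_scorer(text: str) -> float:
    """Hypothetical stand-in for a second model: longer posts score higher."""
    return min(len(text) / 100.0, 1.0)


def score_post(text: str, scorers: Dict[str, Scorer]) -> Dict[str, float]:
    """Run every scorer on the same post concurrently and collect the results.

    `scorers` must be non-empty, because the pool needs at least one worker.
    """
    with ThreadPoolExecutor(max_workers=len(scorers)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in scorers.items()}
        return {name: fut.result() for name, fut in futures.items()}


def combined_score(text: str, scorers: Dict[str, Scorer]) -> float:
    """Average the per-model scores (an assumed rule, chosen for simplicity)."""
    return mean(list(score_post(text, scorers).values()))
```

With this layout the latency for one post is bounded by the slowest model instead of the sum of all models, which is the benefit the response anticipates from distributing them.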

Comment 2: While the paper demonstrates the effectiveness of the method in New York City, scalability to other cities, especially those with different social media usage patterns or lower data availability, remains unclear. This is not necessarily a weakness, but more of a point to consider in future work.
Response 2:
We sincerely thank you for your valuable comment regarding the use of data only from New York City in our study. Your comment raises an important point about the geographical coverage and generalizability of our research results.
We would like to note that the choice of New York City as the primary data source was influenced by several factors:

1. Most studies in the field of event detection traditionally use Twitter data from around the world or individual countries [1-17]. However, this approach is difficult to apply to Instagram data collection, which has a different structure and data access limitations.
2. Among studies focusing on specific cities, New York is mentioned frequently. It appears in publications [18-21,24,25], while not being used in [22,23]. The second most frequently mentioned city, Los Angeles, appears in [18, 19, 25], which is fewer publications than New York. This prevalence of New York in the literature is likely due to the high activity of social media users in this city, which provides a rich and diverse dataset for analysis.
3. The choice of New York is also motivated by the desire to ensure some comparability of results with other studies. In the absence of publicly available code and datasets in this field, using data from the same city allows for at least an approximate comparison of the effectiveness of different algorithms.
4. Thus, New York effectively serves as a benchmark for the task of event detection in social media.

We agree that expanding the geography of the study could increase the generalizability of the results. In future work, we plan to validate our model on data from other cities and regions with different patterns of social media usage.
We thank you for this valuable feedback, which will undoubtedly help improve the quality of our research and guide further work.
[1] M. Osborne et al., "Real-Time Detection, Tracking, and Monitoring of Automatically Discovered Events in Social Media," in Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014.
[2] P. Giridhar et al., "Social Fusion: Integrating Twitter and Instagram for Event Monitoring," in 2017 IEEE International Conference on Autonomic Computing (ICAC), 2017.
[3] C. Zhang et al., "TrioVecEvent," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017.
[4] W. Feng et al., "STREAMCUBE: Hierarchical spatio-temporal hashtag clustering for event exploration over the Twitter stream," in 2015 IEEE 31st International Conference on Data Engineering, 2015.
[5] F. U. Rehman et al., "Understanding the Spatio-Temporal Scope of Multi-scale Social Events," in Proceedings of the 1st ACM SIGSPATIAL Workshop on Analytics for Local Events and News, 2017.
[6] M. Sokolova et al., "Topic Modelling and Event Identification from Twitter Textual Data," ArXiv, vol. abs/1608.02519, 2016.
[7] S. Cresci et al., "A Linguistically-Driven Approach to Cross-Event Damage Assessment of Natural Disasters from Social Media Messages," in Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 1195–1200.
[8] H. Abdelhaq et al., "EvenTweet: Online localized event detection from twitter," Proceedings of the VLDB Endowment, vol. 6, pp. 1326-1329, 2013.
[9] V. Lampos and N. Cristianini, "Nowcasting Events from the Social Web with Statistical Learning," ACM Trans. Intell. Syst. Technol., vol. 3, no. 4, Article 72, 2012.
[10] A. Weiler et al., "Event Identification and Tracking in Social Media Streaming Data," CEUR Workshop Proceedings, vol. 1133, 2014.
[11] D. Zhou et al., "An Unsupervised Framework of Exploring Events on Twitter: Filtering, Extraction and Categorization," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015, pp. 2468–2474.
[12] M. Khodabakhsh et al., "Detecting Life Events From Twitter based on Temporal Semantic Features," Knowledge-Based Systems, vol. 148, 2018.
[13] Y. Zhang et al., "A General Method for Event Detection on Social Media," in Symposium on Advances in Databases and Information Systems, 2021.
[14] H. Hettiarachchi et al., "Embed2Detect: temporally clustered embedded words for event detection in social media," Machine Learning, vol. 111, no. 1, pp. 49-87, 2021.
[15] G. A. Neruda and E. Winarko, "Traffic Event Detection from Twitter Using a Combination of CNN and BERT," in 2021 International Conference on Advanced Computer Science and Information Systems (ICACSIS), 2021, pp. 1-7.
[16] L. Huang et al., "Similarity-based emergency event detection in social media," Journal of Safety Science and Resilience, vol. 2, no. 1, pp. 11-19, 2021.
[17] L. Huang et al., "Early detection of emergency events from social media: a new text clustering approach," Natural Hazards, vol. 111, pp. 1-25, 2022.
[18] C. Zhang et al., "GeoBurst+: Effective and Real-Time Local Event Detection in Geo-Tagged Tweet Streams," ACM Trans. Intell. Syst. Technol., vol. 9, no. 3, pp. 1-24, 2018.
[19] C. Zhang et al., "GeoBurst: Real-Time Local Event Detection in Geo-Tagged Tweet Streams," in Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016, pp. 513–522.
[20] H. Wei et al., "DeLLe," in Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Analytics for Local Events and News, 2019.
[21] A. A. Visheratin et al., "Multiscale event detection using convolutional quadtrees and adaptive geogrids," in Proceedings of the 2nd ACM SIGSPATIAL Workshop on Analytics for Local Events and News, 2018.

[22] H. Becker et al., "Beyond Trending Topics: Real-World Event Identification on Twitter," in Proceedings of the International AAAI Conference on Web and Social Media, vol. 5, no. 1, pp. 438-441, 2021.
[23] J. Weng et al., "Event Detection in Twitter," in Proceedings of the International AAAI Conference on Web and Social Media, pp. 1-21, 2011.
[24] T. Cheng and T. Wicks, "Event Detection using Twitter: A Spatio-Temporal Approach," PloS one, vol. 9, p. e97807, 2014.
[25] F. Ali et al., "Traffic accident detection and condition analysis based on social networking data," Accident Analysis & Prevention, vol. 151, p. 105973, 2021.

Comment 3: Also, the paper briefly mentions ethical concerns and data anonymization. However, the discussion is somewhat superficial. Given the potential sensitivity of the data involved (especially with low-scale events that might be personal), a more detailed exploration of the ethical implications, including potential biases in the dataset and how they are mitigated, is necessary.
Response 3: Thank you for your insightful comment regarding the ethical concerns and data anonymization. We fully agree that these are critical aspects, especially when dealing with potentially sensitive data. While our current paper briefly touches upon these issues, we acknowledge the need for a more in-depth exploration of the ethical implications, including potential biases and the strategies for mitigating them.
Although we have not provided a detailed discussion in this manuscript, we consider your suggestion extremely valuable, and we will certainly incorporate a more comprehensive analysis of these ethical concerns in our future research. Addressing these issues thoughtfully is essential to ensure both the integrity and social responsibility of our work.

We have provided detailed responses to each of your comments. Our aim is to ensure that our manuscript is as clear, accurate, and comprehensive as possible, and we believe that your insights have greatly contributed to this goal. We hope the revised version of the manuscript now meets the standards for publication, and we would be grateful for any further feedback you might have.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

1. Some images in the article (e.g., Figure 1) are not clear enough.

2. Although the paper has discussed the limitations of existing methods, it could further elaborate on the theoretical background, particularly the latest advancements in semantic analysis techniques related to the SemConvTree model. This would help better highlight the model's technological innovations and advantages.

3. The paper mentions the improvements in detection precision and recall achieved by the SemConvTree model, but it could provide more performance comparison analysis, including detailed comparisons with other state-of-the-art methods. This section could be enhanced with additional charts and statistical data to more intuitively demonstrate the superiority of the model.

4. The paper primarily conducted experimental research using social media data from New York City. To enhance the generalizability of the research findings, it would be beneficial to validate the model on data from different cities or regions, especially those with varying social media usage habits. This would help demonstrate the model's applicability and robustness across different cultural contexts.

5. Some formatting and grammatical improvements could be made in certain areas, such as in lines 426, 431, and 443.

Comments on the Quality of English Language

The grammar is mostly correct, but there are occasional instances where sentence structure could be simplified for better readability.

Author Response

Dear Reviewer,

In response to your suggestions, we have made the following changes in the revised manuscript:

Comment 1: Some images in the article (e.g., Figure 1) are not clear enough.
Response 1:
Thank you for your careful consideration of the visual materials in our article. We have taken note of your comment regarding the lack of clarity in some images, particularly Figure 1. In response, we have completely revised the diagram, making it more comprehensible, higher quality, and consistent with the other diagrams in the article.
The new version of Figure 1 now provides a clearer and more illustrative representation of the described process. We have endeavored to improve both the visual quality and the informativeness of the diagram.
We would appreciate any additional comments or suggestions for further improvement of this or other images in our article. Our goal is to ensure maximum clarity and usefulness of all visual elements for readers.

Comment 2: Although the paper has discussed the limitations of existing methods, it could further elaborate on the theoretical background, particularly the latest advancements in semantic analysis techniques related to the SemConvTree model. This would help better highlight the model's technological innovations and advantages.
Response 2:
Thank you very much for your comment. We appreciate your suggestion to further elaborate on the theoretical background and include the latest advancements in semantic analysis techniques. However, we would like to emphasize that the SemConvTree model is specifically designed for event detection tasks, where many of the current state-of-the-art semantic analysis methods cannot be directly applied, aside from those already mentioned in the paper.
For instance, while there are valuable datasets for Sentiment Analysis, such as the one available on Kaggle [1], and insightful articles like [2], these resources do not fully align with the specific requirements of our task. Event detection necessitates a unique combination of parameters, including geolocation data and other contextual attributes, which are critical to our approach but are typically not addressed in the available sentiment analysis datasets or methodologies.
That said, we fully agree that integrating newer technologies and approaches could enhance our model in future iterations. We are committed to exploring and incorporating these advancements in our future research to continuously improve the effectiveness and innovation of our approach.

[1] Sentiment Analysis Dataset (https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset)
[2] A. Petrescu et al., "EDSA-Ensemble: an Event Detection Sentiment Analysis Ensemble Architecture," IEEE Transactions on Affective Computing, doi: 10.1109/TAFFC.2024.3434355.

Comment 3: The paper mentions the improvements in detection precision and recall achieved by the SemConvTree model, but it could provide more performance comparison analysis, including detailed comparisons with other state-of-the-art methods. This section could be enhanced with additional charts and statistical data to more intuitively demonstrate the superiority of the model.
Response 3:
Thank you for your comment regarding the comparison with other models and methods. We acknowledge that most researchers in the field of online event detection using social media data do not share their code or models, even for significant publications. This practice hinders the reproducibility of results and makes it nearly impossible to conduct fair comparisons between different approaches.

A comprehensive survey by Afyouni et al. [1] on multi-feature, multi-modal, and multi-source social event detection highlights this issue. Another study by Korneev et al. [2] shows that out of numerous research papers, very few provide public source code or datasets (see Table 1 in [2]).

Moreover, there is a lack of openly available datasets in this area. The few datasets that are publicly accessible cover very short time periods (less than one year), which is insufficient for our algorithm to function effectively. This issue further complicates the objective comparison of algorithms from different authors.

To address these problems, one of our future research directions aims to create synthetic open datasets that imitate streaming social media data over longer time periods. These datasets will enable researchers to openly compare various approaches and contribute to the overall progress in the field.

We have made an effort to compare our model with the available baselines where data and code were accessible. Despite the limited opportunities for comparison, our model demonstrates promising results on the available data.

[1] I. Afyouni, Z. A. Aghbari, and R. A. Razack, "Multi-feature, multi-modal, and multi-source social event detection: A comprehensive survey," Information Fusion, vol. 79, pp. 279–308, 2022, doi: 10.1016/j.inffus.2021.10.013. URL: https://www.sciencedirect.com/science/article/pii/S1566253521002220
[2] A. Korneev, M. Kovalchuk, A. Filatova, and S. Tereshkin, "Towards comparable event detection approaches development in social media," Procedia Computer Science, vol. 212, pp. 312–321, 2022 (11th International Young Scientist Conference on Computational Science), doi: 10.1016/j.procs.2022.11.015. URL: https://www.sciencedirect.com/science/article/pii/S1877050922017069

Comment 4: The paper primarily conducted experimental research using social media data from New York City. To enhance the generalizability of the research findings, it would be beneficial to validate the model on data from different cities or regions, especially those with varying social media usage habits. This would help demonstrate the model's applicability and robustness across different cultural contexts.
Response 4:
We sincerely thank you for your valuable comment regarding the use of data only from New York City in our study. Your comment raises an important point about the geographical coverage and generalizability of our research results.
We would like to note that the choice of New York City as the primary data source was influenced by several factors:

1. Most studies in the field of event detection traditionally use Twitter data from around the world or individual countries [1-17]. However, this approach is difficult to apply to Instagram data collection, which has a different structure and data access limitations.
2. Among studies focusing on specific cities, New York is mentioned frequently. It appears in publications [18-21,24,25], while not being used in [22,23]. The second most frequently mentioned city, Los Angeles, appears in [18, 19, 25], which is fewer publications than New York. This prevalence of New York in the literature is likely due to the high activity of social media users in this city, which provides a rich and diverse dataset for analysis.
3. The choice of New York is also motivated by the desire to ensure some comparability of results with other studies. In the absence of publicly available code and datasets in this field, using data from the same city allows for at least an approximate comparison of the effectiveness of different algorithms.
4. Thus, New York effectively serves as a benchmark for the task of event detection in social media.

[15] G. A. Neruda and E. Winarko, "Traffic Event Detection from Twitter Using a Combination of CNN and BERT," in 2021 International Conference on Advanced Computer Science and Information Systems (ICACSIS), 2021, pp. 1-7.
[16] L. Huang et al., "Similarity-based emergency event detection in social media," Journal of Safety Science and Resilience, vol. 2, no. 1, pp. 11-19, 2021.
[17] L. Huang et al., "Early detection of emergency events from social media: a new text clustering approach," Natural Hazards, vol. 111, pp. 1-25, 2022.
[18] C. Zhang et al., "GeoBurst+: Effective and Real-Time Local Event Detection in Geo-Tagged Tweet Streams," ACM Trans. Intell. Syst. Technol., vol. 9, no. 3, pp. 1-24, 2018.
[19] C. Zhang et al., "GeoBurst: Real-Time Local Event Detection in Geo-Tagged Tweet Streams," in Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016, pp. 513–522.
[20] H. Wei et al., "DeLLe," in Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Analytics for Local Events and News, 2019.
[21] A. A. Visheratin et al., "Multiscale event detection using convolutional quadtrees and adaptive geogrids," in Proceedings of the 2nd ACM SIGSPATIAL Workshop on Analytics for Local Events and News, 2018.
[22] H. Becker et al., "Beyond Trending Topics: Real-World Event Identification on Twitter," in Proceedings of the International AAAI Conference on Web and Social Media, vol. 5, no. 1, pp. 438-441, 2021.
[23] J. Weng et al., "Event Detection in Twitter," in Proceedings of the International AAAI Conference on Web and Social Media, pp. 1-21, 2011.
[24] T. Cheng and T. Wicks, "Event Detection using Twitter: A Spatio-Temporal Approach," PloS one, vol. 9, p. e97807, 2014.
[25] F. Ali et al., "Traffic accident detection and condition analysis based on social networking data," Accident Analysis & Prevention, vol. 151, p. 105973, 2021.

Comment 5: Some formatting and grammatical improvements could be made in certain areas, such as in lines 426, 431, and 443.
Response 5: Thank you for your comment! We have reviewed the entire block of formulas, improved the formatting, made the necessary adjustments, and removed conflicting notations. Additionally, we have worked to make the descriptions of all introduced notations and formulas clearer. If you have any further comments on this section, we will be sure to take them into account.

Comment 6: The grammar is mostly correct, but there are occasional instances where sentence structure could be simplified for better readability.
Response 6:
Thank you very much for your feedback! To improve the readability and clarity of the article, we have revised the entire text and simplified the sentence structures. Unfortunately, we are not native speakers, but if you have any additional suggestions or comments regarding the grammar and language, please let us know, and we will do our best to address them.

Thank you very much for the comments! Please find the revised version of our article attached.
Our aim is to ensure that our manuscript is as clear, accurate, and comprehensive as possible, and we believe that your insights have greatly contributed to this goal. We hope the revised version of the manuscript now meets the standards for publication, and we would be grateful for any further feedback you might have.

Author Response File: Author Response.pdf
