Peer-Review Record

An Automatic Near-Duplicate Video Data Cleaning Method Based on a Consistent Feature Hash Ring

Electronics 2024, 13(8), 1522; https://doi.org/10.3390/electronics13081522
by Yi Qin *, Ou Ye * and Yan Fu
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 26 March 2024 / Revised: 12 April 2024 / Accepted: 15 April 2024 / Published: 17 April 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

1 - The Introduction should focus on the background of the problem. You start talking about the state of the art and the methodologies regarding near-duplicate video detection, which you discuss again in the "Related work" section. It is confusing why you use two sections to cover the same topics.

2 - Throughout the paper you give the name "RCLA" to your method. What does this name stand for? Is this an upgrade of a previous method, or is it completely new?

3 - Regarding the "Related work" section, you should try to improve the quality of the chosen references. You cite many papers from proceedings of certain conferences that are not in fact international but regional, whose quality is difficult to assess.

4 - Why did you choose only 63 videos out of a public dataset that contains 13,129 videos in total? Also, you used a larger quantity of videos from a private dataset. You should always give preference to public datasets, and you should explain why only these numbers of videos were selected.

5 - In line 546 you state that the number of hidden layers has a great impact on the performance of the proposed method, as does the attention size. Why is that? You should explain it.

6 - In the results section you show some comparisons with other methods. Did you implement these methods? (If so, did you have to choose some parameters that were not explained in the papers?) Or are you referring to the results that these methods present in each paper? I didn't find this explanation within the paper.

7 - One thing I don't see is the time each of the methods consumes to train and to process each video. It would also be interesting to analyze these results.

Comments on the Quality of English Language

You should improve the clarity of the English throughout the text. Some phrases are difficult to fully understand.

Author Response

Original Manuscript ID: 2959345       

Original Article Title: “An automatic near-duplicate video data cleaning method based on a consistent feature hash ring”

 

To: Electronics’ Editor

Re: Response to reviewers

 

 

Dear Editor,

 

Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments.

We are uploading (a) our point-by-point response to the comments (below) (response to reviewers), (b) an updated manuscript with yellow highlighting indicating changes, and (c) a clean updated manuscript without highlights.

 

 

Best regards,

Yi Qin et al.
Reviewer#1, Concern # 1:

The Introduction should focus on the background of the problem. You start talking about the state of the art and the methodologies regarding near-duplicate video detection, which you discuss again in the "Related work" section. It is confusing why you use two sections to cover the same topics.

 

Author response: We thank the reviewer for the comments. We have revised the Introduction section according to the reviewer's comment, removed some repetitive descriptions, and elaborated on the upstream-downstream task relationship between detecting and cleaning near-duplicate video data in this paper.

Author action:

  1. We have removed the repetitive descriptions in paragraph 5 and merged the original paragraphs 3, 4, and 6 in the Introduction section.

From the perspective of data quality [6], video data quality concerns the overall quality of a video dataset and stresses the degree to which data consistency, data correctness, data completeness, and data minimization are satisfied in information systems. The emergence of near-duplicate videos reduces the degree of data consistency and minimization in video datasets. These near-duplicate videos can be regarded as a kind of dirty data with wide coverage and rich, diverse forms. Concretely, near-duplicate videos may be generated at any of the stages of video collection, video integration, and video processing. For instance, in the video collection stage, videos can be collected from different angles within the same scene; in the video integration stage, near-duplicate videos with different formats and lengths may arrive from different data sources; in the video processing stage, video copying, video editing, and other operations produce masses of near-duplicate video data. Studies on near-duplicate video detection can help us discover hidden near-duplicate videos in video datasets. Various methodologies have been proposed in the literature, whose implementation mainly comprises feature extraction, feature signature, and signature indexing. In all of these methodologies, feature extraction can be regarded as a key component of near-duplicate video detection. From the perspective of video feature representation, near-duplicate video detection methodologies can be categorized into hand-crafted-feature-based and high-level-feature-based methodologies [7-9]. Nevertheless, near-duplicate video detection methodologies can only identify the near-duplicate videos in a video dataset [10, 11]; they lack a process of feature sorting and automatic merging for video data represented by high-dimensional features. Therefore, it is very challenging for them to automatically clean up redundant near-duplicate videos so as to reduce the video copyright infringement and related issues caused by video copying, video editing, and other manual operations. (Page 2)

  2. We have added the following content to elaborate on the upstream-downstream task relationship between detecting and cleaning up near-duplicate video data in the “Introduction” and “Related Work” sections.

At present, data cleaning modeling techniques are an important technical means of effectively reducing near-duplicate data and improving data quality. (Page 2)

Data duplication may arise for the following reasons: data maintenance, manual input, device errors, and so on [26]. Data cleaning modeling techniques are effective ways to automatically clean and reduce near-duplicate data, which can effectively address the shortcomings of near-duplicate video detection methodologies. (Page 4)

 

Reviewer#1, Concern # 2:

Throughout the paper you give the name "RCLA" to your method. What does this name stand for? Is this an upgrade of a previous method, or is it completely new?

 

Author response: We thank the reviewer for the comments. We have added the full name of the RCLA model.

 

Author action: We added the full name, “ResNet-CBAM-LSTM-Multi-headed Attention”, of the RCLA model on Page 7.

A residual network with convolutional block attention modules is first adopted to extract spatial features of video data; the channel attention and spatial attention modules in each convolutional block attention module can effectively improve the representation of spatial features in the salient regions of video data. Then, this network is integrated with a long short-term memory model to extract the spatiotemporal features of video data. Finally, to highlight the role of key information in video data in video semantic representations, an attention model based on the above networks is introduced, so that a video spatiotemporal feature extraction model with a multi-head attention mechanism is constructed along the three independent dimensions of channel, space, and time series; this model is named the RCLA (ResNet-CBAM-LSTM-Multi-headed Attention) model in this paper. (Page 7)

The RCLA model is new; it combines the ResNet, CBAM, LSTM, and multi-head attention models.
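For illustration, the following is a minimal PyTorch sketch of this composition; the layer sizes, the simplified CBAM internals, and the frame handling are illustrative assumptions for readability, not the exact configuration of the RCLA model:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Simplified convolutional block attention: channel then spatial attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention from average- and max-pooled descriptors.
        gate = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        x = x * gate.view(b, c, 1, 1)
        # Spatial attention from channel-wise mean and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

class RCLASketch(nn.Module):
    """Per-frame CNN with CBAM -> LSTM over frames -> multi-head attention."""
    def __init__(self, feat_dim=256, hidden=256, heads=4):
        super().__init__()
        self.backbone = nn.Sequential(            # stand-in for a residual CNN
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            CBAM(64),
            nn.AdaptiveAvgPool2d(1))
        self.proj = nn.Linear(64, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, frames):                    # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        f = self.backbone(frames.flatten(0, 1)).flatten(1)  # (B*T, 64)
        f = self.proj(f).view(b, t, -1)                     # (B, T, feat_dim)
        h, _ = self.lstm(f)                                 # temporal features
        z, _ = self.attn(h, h, h)                           # multi-head attention
        return z.mean(dim=1)                                # clip-level embedding
```

A clip tensor of shape (batch, frames, 3, H, W) then yields one embedding per video, which the downstream hashing and clustering stages can consume.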

 

Reviewer#1, Concern # 3:

Regarding the "Related work" section, you should try to improve the quality of the chosen references. You cite many papers from proceedings of certain conferences that are not in fact international but regional, whose quality is difficult to assess.

 

Author response: We thank the reviewer for the comments. We hereby provide an explanation of the above issue.

 

Author action: We explained the above issue as follows:

Currently, with the rapid emergence of video data, a significant number of studies have focused on video retrieval, video anomaly detection, video content generation, and so on, with less attention paid to the quality of video datasets. Consequently, there are relatively few papers related to the cleaning of near-duplicate video data, and the quality of the available literature in this field is relatively limited. Nevertheless, we have endeavored to cite as many papers related to near-duplicate video cleaning methods as possible in the “Related Work” section.

 

Reviewer#1, Concern # 4:

Why did you choose only 63 videos out of a public dataset that contains 13,129 videos in total? Also, you used a larger quantity of videos from a private dataset. You should always give preference to public datasets, and you should explain why only these numbers of videos were selected.

 

Author response: We thank the reviewer for the comments. We hereby provide an explanation of the above issue.

 

Author action: We explained the above issue as follows:

Considering the hardware environment of the experiments, we randomly selected video data from 5 scenarios with a relatively small data scale out of the 24 scenes in the CC-WEB-VIDEO dataset to validate the proposed method. On this basis, since public datasets are manually curated, they can hardly reflect the complex scenarios present in real-life environments. Therefore, this paper additionally uses a coal mine video dataset to further verify the performance of the proposed method in real-world environments.

 

Reviewer#1, Concern # 5:

In line 546 you state that the number of hidden layers has a great impact on the performance of the proposed method, as does the attention size. Why is that? You should explain it.

 

Author response: We thank the reviewer for the comments. We hereby provide an explanation of the above issue.

 

Author action: We explained the above issue as follows:

The number of hidden layers is closely correlated with the complexity of the LSTM network model, and this complexity affects the effectiveness of video feature representation learning; consequently, the settings of the number of hidden layers and of the attention size have a significant impact on the performance of the proposed method.
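As a small illustration of this correlation (our own sketch, not taken from the manuscript), the capacity of a stacked LSTM grows roughly linearly with the number of hidden layers and quadratically with the hidden size, which directly governs how rich a video feature representation the model can learn:

```python
import torch.nn as nn

def lstm_param_count(input_size, hidden_size, num_layers):
    """Count the trainable parameters of a stacked LSTM."""
    lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers)
    return sum(p.numel() for p in lstm.parameters())

print(lstm_param_count(256, 256, 1))  # one hidden layer
print(lstm_param_count(256, 256, 3))  # three layers: roughly 3x the parameters
print(lstm_param_count(256, 512, 3))  # wider hidden state: larger still
```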

 

Reviewer#1, Concern # 6:

In the results section, you show some comparisons with other methods. Did you implement these methods? (If so, did you have to choose some parameters that were not explained in the papers?) Or are you referring to the results that these methods present in each paper? I didn't find this explanation within the paper.

 

Author response: We thank the reviewer for the comments. We have added an explanation in this paper.

 

Author action: The explanation we added on Page 17 is as follows:

The performance of the proposed method (RCLA-HAOPFDMC) is verified in comparison with existing studies on near-duplicate video data cleaning. All comparison methods were reproduced in our experiments; the key parameters were set to the values reported in the corresponding papers, and all other parameters were set to their default values. The experimental results are shown in Table 5.

 

Reviewer#1, Concern # 7:

One thing I don't see is the time each of the methods consumes to train and to process each video. It would also be interesting to analyze these results.

 

Author response: We thank the reviewer for the comments. We have added the analyses of the time consumption for each method.

 

Author action: The analyses of the time consumption for each method we added on Page 18 are as follows:

This paper compares the proposed method with the existing studies and with different clustering-based cleaning models, as shown in Table 5. First, Table 5 shows that the performance indicators of the near-duplicate video cleaning methods based on hand-crafted feature extraction, such as the spatiotemporal key-point and LBoW models, are relatively low, owing to the limited ability of hand-crafted features to represent video content. However, both of these methods require less time, especially the spatiotemporal key-point model, which solely extracts key-point features from video frames and thus consumes less time than the LBoW model. Second, the BS-VGG16 model only extracts the spatial features of the video data, while the CBAM-Resnet model introduces channel and spatial attention mechanisms into spatial feature extraction. Since the depth of the BS-VGG16 model is only 16 layers and the residual network in the CBAM-Resnet model is 31 layers deep, the BS-VGG16 model consumes less time than the MLE-MRD model, which in turn consumes less time than the CBAM-Resnet model. The 3D-CNN model, which lacks an attention mechanism, consumes an amount of time between those of BS-VGG16 and CBAM-Resnet. On this basis, the ACNNBN-LSTM model can extract the spatiotemporal features of the video data, and the RCLA-HAOPFDMC method builds on spatiotemporal feature extraction by introducing a standard attention mechanism, which depicts the features of near-duplicate video data more accurately and thereby helps clean near-duplicate video data accurately and automatically. In addition, comparing the experimental results of the K-Means, FD-Means, and peak-function-fused FD-Means clustering algorithms shows that the performance indicators after near-duplicate video cleaning with the K-Means algorithm are low, which is caused by the randomness of the initial cluster center setting. When the FD-Means algorithm is used for near-duplicate video cleaning, the influence of the K value in the K-Means algorithm on the experimental results is reduced; thus, the performance indicators are relatively high. This paper constructs a consistent feature hash ring to decrease the impact of data ordering on near-duplicate data cleaning. On this basis, fusing the FD-Means algorithm with the peak function further reduces the influence of random initial cluster center settings on near-duplicate video cleaning. Therefore, the performance indicators of the proposed method (RCLA-HAOPFDMC) are higher than those of the existing methods, although it requires more computing resources and a longer computation time. Finally, the results of near-duplicate video data cleaning are shown in Fig. 8.
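For readers unfamiliar with peak-function-based center selection, the sketch below shows the classical mountain (peak) method, one plausible reading of how a peak function can seed FD-Means-style clustering; the kernel widths and the suppression rule are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def mountain_peak_centers(X, k, sigma=1.0, beta=1.5):
    """Select k initial cluster centers with the mountain (peak) method:
    each point's potential is a sum of Gaussian kernels over the data; the
    highest peak becomes a center, and nearby potential is then suppressed."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # pairwise sq. distances
    potential = np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=1)
    centers = []
    for _ in range(k):
        i = int(np.argmax(potential))
        centers.append(X[i])
        # Subtract a peak around the chosen center so later centers spread out.
        potential -= potential[i] * np.exp(-d2[i] / (2.0 * beta ** 2))
    return np.stack(centers)

X = np.random.randn(200, 16)                 # toy feature vectors
init_centers = mountain_peak_centers(X, k=5)
```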

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The primary research framework requires a diagram depicting each step of the proposed framework. In addition, the technique seems to be a muddle.
The researchers should provide a short overview of the methodology's substance. Figures are used to illustrate the majority of the findings.
More statistics in the form of tables would help readers examine the findings.
It is recommended that authors provide all of the parameters and settings that were utilized in the simulation process.
The whole document should be checked for missing spaces between words.
The work would be better compared with other, more recent studies in the literature in order to emphasize the study's unique qualities.
What are the limitations of the present work?
What are the practical implications of this research?

Author Response

Original Manuscript ID: 2959345       

Original Article Title: “An automatic near-duplicate video data cleaning method based on a consistent feature hash ring”

 

To: Electronics Editor

Re: Response to reviewers

 

 

Dear Editor,

 

Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments.

We are uploading (a) our point-by-point response to the comments (below) (response to reviewers), (b) an updated manuscript with yellow highlighting indicating changes, and (c) a clean updated manuscript without highlights.

 

 

Best regards,

Yi Qin et al.
Reviewer#2, Concern # 1:

The primary research framework requires a diagram depicting each step of the proposed framework. In addition, the technique seems to be a muddle.

Author response: We thank the reviewer for the comments. We have modified the framework and described each step of the proposed method.

Author action: We have revised the unclear process in Figure 1 and clarified each step of the proposed method on Page 6. Here, Figure 1 shows the overall framework of our proposed method.

Figure 1. The overall framework of our proposed method.

Fig. 1 outlines our approach, which consists of three parts: high-dimensional feature extraction of videos, construction of a consistent feature hash ring, and cleaning of near-duplicate videos based on the consistent feature hash ring. Each of these parts is explained next. (Page 6)

 

Reviewer#2, Concern # 2:

The researchers should provide a short overview of the methodology's substance. Figures are used to illustrate the majority of the findings.

 

Author response: We thank the reviewer for the comments. We provided a short overview of our proposed method after Figure 1.

 

Author action: We have added a brief overview of the proposed method shown after Figure 1 on Page 7, as follows:

In general, the proposed method is based on high-dimensional feature extraction from video data and draws on the ideas of load balancing and high scalability in big data storage to construct a novel consistent feature hash ring. On this basis, the FD-Means clustering algorithm is optimized to achieve automatic cleaning of near-duplicate video data. (Page 7)
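To make this concrete, here is a minimal Python sketch of one plausible reading of a consistent feature hash ring, in which bucket nodes and LSH signatures of video features share a single hash space; the random-projection signature and the bucket names are illustrative assumptions, not the paper's exact construction:

```python
import bisect
import hashlib
import numpy as np

class ConsistentFeatureHashRing:
    """Illustrative sketch: bucket nodes and video features share one hash space."""

    def __init__(self, nodes, feat_dim, n_planes=32, seed=0):
        rng = np.random.default_rng(seed)
        # Random hyperplanes give an LSH signature: near-duplicate features
        # receive identical signatures with high probability (our assumption,
        # not necessarily the paper's exact scheme).
        self.planes = rng.standard_normal((n_planes, feat_dim))
        self.ring = sorted((self._hash(n.encode()), n) for n in nodes)
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _hash(data):
        # 32-bit position on the ring derived from an MD5 digest.
        return int.from_bytes(hashlib.md5(data).digest()[:4], "big")

    def place_feature(self, feat):
        """Map a feature vector to the first bucket clockwise on the ring."""
        signature = (self.planes @ feat > 0).astype(np.uint8).tobytes()
        pos = self._hash(signature)
        i = bisect.bisect(self.keys, pos) % len(self.keys)
        return self.ring[i][1]

ring = ConsistentFeatureHashRing(["bucket-0", "bucket-1", "bucket-2"], feat_dim=256)
print(ring.place_feature(np.random.randn(256)))
```

The consistent-hashing property, that adding or removing a bucket remaps only a fraction of the ring, is what provides the load balancing and scalability referred to above.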

 

Reviewer#2, Concern # 3:

More statistics in the form of tables would help readers examine the findings.

 

Author response: We thank the reviewer for the comments. We have added Table 4.

 

Author action: We added the experiment in Table 4 on Page 17 to supplement the performance evaluation results of the parameter settings on the two datasets.

Table 4. The experimental results of different parameter settings for attention size on the coal mine video dataset.

Models                         Precision   Recall   F1-score   Accuracy
Spatiotemporal Keypoint [3]    0.57        0.85     0.68       0.61
BS-VGG16 [36]                  0.72        0.83     0.77       0.79
LBoW [43]                      0.60        0.92     0.73       0.72
MLE-MRD [44]                   0.85        0.84     0.84       0.87
CBAM-Resnet [42]               0.79        0.86     0.82       0.84
3D-CNN [24]                    0.91        0.89     0.90       0.90
RCLA                           0.93        0.94     0.93       0.92

 

Reviewer#2, Concern # 4:

It is recommended that authors provide all of the parameters and settings that were utilized in the simulation process.

 

Author response: We thank the reviewer for the comments. We hereby provide an explanation of the above issue.

 

Author action: We explained the above issue as follows:

First, when describing each model in Section 3, we provide the different parameter settings. Then, to reflect the changes in parameters during the training and validation process, we adopt the TensorBoard tool to trace the values of selected parameters in Figures 4 and 5. Finally, the ablation experiments in Tables 1 to 4 and Figure 6 reflect the influence of the important parameters on the experimental results.
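For context, scalar curves like those in Figures 4 and 5 are typically produced with a logging pattern such as the following sketch, assuming tensorboardX (the TensorBoard binding commonly paired with PyTorch 0.4; newer PyTorch releases ship torch.utils.tensorboard with the same SummaryWriter API); the tag names and dummy values are illustrative:

```python
from tensorboardX import SummaryWriter
import math

writer = SummaryWriter(log_dir="runs/rcla")
for epoch in range(50):
    # Dummy values standing in for real training/validation metrics.
    train_loss = math.exp(-0.1 * epoch)
    val_acc = 1.0 - 0.5 * math.exp(-0.08 * epoch)
    writer.add_scalar("loss/train", train_loss, epoch)
    writer.add_scalar("accuracy/val", val_acc, epoch)
writer.close()
```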

 

Reviewer#2, Concern # 5:

The whole document should be checked for missing spaces between words.

 

Author response: We thank the reviewer for the comments. We have checked the whole document for missing spaces between words.

 

Author action: We have fixed the missing spaces between words throughout the paper.

 

Reviewer#2, Concern # 6:

The work would be better compared with other, more recent studies in the literature in order to emphasize the study's unique qualities.

 

Author response: We thank the reviewer for the comments. We hereby provide an explanation of the above issue.

 

Author action: We explained the above issue as follows:

Currently, with the rapid emergence of video data, a significant number of studies have focused on video retrieval, video anomaly detection, video content generation, and so on, with less attention paid to the quality of video datasets. Consequently, there are relatively few papers related to the cleaning of near-duplicate video data, and the quality of the available literature in this field is relatively limited. Nevertheless, we have endeavored to cite as many papers related to near-duplicate video cleaning methods as possible in the “Related Work” section.

 

Reviewer#2, Concern # 7:

What are the limitations of the present work?

 

Author response: We thank the reviewer for the comments. We have clarified the limitations of the proposed method in the Conclusions section.

 

Author action: The limitations of the proposed method described in the Conclusions section are as follows:

However, the method proposed in this paper is not an end-to-end deep neural network model; the feature extraction and clustering stages must be trained separately. In addition, the computational cost of cleaning on the consistent feature hash ring is large. (Page 19)

 

Reviewer#2, Concern # 8:

What are the practical implications of this research?

 

Author response: We thank the reviewer for the comments. We hereby provide an explanation of the above issue.

 

Author action: We explained the above issue as follows:

By automatically cleaning near-duplicate video data, this study can not only alleviate copyright infringement issues such as unauthorized short-video editing but also reduce the storage space occupied by redundant videos and save data storage costs for enterprises.

 

 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you for your answers. It is fine for publishing.

Author Response

Original Manuscript ID: 2959345       

Original Article Title: “An automatic near-duplicate video data cleaning method based on a consistent feature hash ring”

 

To: Electronics Editor

Re: Response to reviewers

 

 

Dear Editor,

 

Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments.

We are uploading (a) our point-by-point response to the comments (below) (response to reviewers), (b) an updated manuscript with yellow highlighting indicating changes, and (c) a clean updated manuscript without highlights.

 

 

Best regards,

Yi Qin et al.
Reviewer#1, Concern # 1:

Please include the following changes:

(1) Deep Learning as Keyword

 

Author response: We thank the reviewer for this comment. We have revised the Keywords in the Abstract.

 

Author action: We have inserted the “deep learning” keyword.

Keywords: video cleaning; deep learning; consistent feature hash ring; feature distance means; mountain peak function; multi-head attention mechanism; near-duplicate videos.

 

Reviewer#1, Concern # 2:

Please include the following changes:

(2) Line 120 “Finally, the paper is summarized in Section 5.”

 

Author response: We thank the reviewer for this comment. We have revised this sentence.

 

Author action: We have modified the sentence “Finally, the paper is summarized.” to the sentence “Finally, the paper is summarized in Section 5.”.

The remainder of this article is organized as follows. Section 2 presents a brief review of related work on near-duplicate video detection and data cleaning; Section 3 proposes an automatic near-duplicate video cleaning method based on a consistent feature hash ring; Section 4 reports the experimental results that validate the performance of the method. Finally, the paper is summarized in Section 5.

 

Reviewer#1, Concern # 3:

Please include the following changes:

(3) The bibliographic citation style is not consistent throughout the text. For example, compare citation 40 versus citation 41 on lines 327 and 337, respectively.

 

Author response: We thank the reviewer for this comment. We have revised the above issue.

 

Author action: The modifications we made are as follows:

This part is composed of 4 blocks, which contain 3, 4, 6, and 3 residual blocks, respectively, and each residual block contains a convolutional block attention module (CBAM) [41]. (Page 8)

 

Extensive experiments on the commonly used CC_WEB_VIDEO dataset and on a coal mine video dataset are conducted to evaluate the performance of RCLA-HAOPFDMC (the method proposed in this paper) and to compare it with other representative advanced methods, such as CBAM-Resnet [42] and BS-VGG16 [36]. All experiments were conducted on the same machine, which had 8 Intel Xeon processors at 2.10 GHz and an NVIDIA GP102 graphics card with 16 GB of memory; the programs are implemented in Python 3.6.5 and PyTorch 0.4.0. Next, we provide a detailed explanation of the experiments and results. (Page 12)

Author Response File: Author Response.docx
