Article
Peer-Review Record

A Semantic-Preserving Deep Hashing Model for Multi-Label Remote Sensing Image Retrieval

Remote Sens. 2021, 13(24), 4965; https://doi.org/10.3390/rs13244965
by Qimin Cheng 1,*, Haiyan Huang 1, Lan Ye 2, Peng Fu 3, Deqiao Gan 1 and Yuzhuo Zhou 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 3 November 2021 / Revised: 2 December 2021 / Accepted: 3 December 2021 / Published: 7 December 2021

Round 1

Reviewer 1 Report

Review of the paper 'A Semantic-preserving Deep Hashing Model for Multi-label Remote Sensing Image Retrieval' by Qimin Cheng, Haiyan Huang, Lan Ye, Peng Fu, Deqiao Gan, Yuzhuo Zhou.

 

The paper proposes a method based on deep learning and the hashing function to retrieve multi-label remote sensing data. The author presented the technique nicely and convincingly. However, the literature review is not complete. Moreover, the authors should consider applying the approach to a second dataset to show the method's effectiveness and robustness.

 

Major comments:

The literature review is not up to date. Therefore, I suggest the authors read the papers of Dr. Begüm Demir of 2021 that were published on applying deep learning to retrieve multi-label remote sensing images. In addition, other articles (not cited in the literature review) propose to use hashing function and deep learning for multi-label data retrieval, e.g., 'Z. Zhang, Q. Zou, Y. Lin, L. Chen, and S. Wang, "Improved deep hashing with soft pairwise similarity for multi-label image retrieval," IEEE Trans. Multimedia, vol. 22, no. 2, pp. 540–553, 2020.'

The proposed method is tested only on one dataset, showing promising results. However, this is not enough to prove the robustness and effectiveness of the technique. Considering this and the literature on multi-label remote sensing image retrieval, I suggest the authors apply their method to another dataset and, for example, compare the results with those on the BigEarth dataset.

 

 

Minor comments:

Fig 1. The caption contains acronyms that are not described.

Line 82, I suggest moving 'In this paper we propose' to a new line.

Line 106, multi-Label should be capital, i.e., Multi-label.

Line 176-179, I suggest that the authors rewrite the sentence to emphasize the literature gap and avoid presenting the paper's scope.

Eq (1) misses the closing >.

Line 196, I guess that 'CC' should be removed.

Line 203-209, this paragraph can be replaced or introduced by a sentence like 'the similarity value of a given couple ?? and ?? is inversely proportional to the hamming distance.' to increase the readability.

Line 223-229, I suggest splitting the sentence into many shorter sentences to increase the readability.

Fig 2. is a bit hard to understand.

Line 291-294, this sentence doesn't go at the end of the proposed method description but at the beginning of the methodological section. At this point of the paper, this should be known.

Author Response

The paper proposes a method based on deep learning and the hashing function to retrieve multi-label remote sensing data. The author presented the technique nicely and convincingly. However, the literature review is not complete. Moreover, the authors should consider applying the approach to a second dataset to show the method's effectiveness and robustness.

 

We gratefully thank the reviewer for his/her insightful comments and suggestions. Based on the comments, we have revised the manuscript.

  1. We have supplemented the latest literature on multi-label remote sensing retrieval and classification published since 2021.
  2. We have made supplementary experiments on another dataset (UCMerced), which is a popular dataset in multi-label retrieval tasks. The experimental results show the effectiveness and advantages of introducing hash learning into the multi-label remote sensing image retrieval task.

The item-by-item responses to the helpful comments raised by the reviewer are as follows.

 

 

Major comments:

 

Point 1: The literature review is not up to date. Therefore, I suggest the authors read the papers of Dr. Begüm Demir of 2021 that were published on applying deep learning to retrieve multi-label remote sensing images. In addition, other articles (not cited in the literature review) propose to use hashing function and deep learning for multi-label data retrieval, e.g., 'Z. Zhang, Q. Zou, Y. Lin, L. Chen, and S. Wang, "Improved deep hashing with soft pairwise similarity for multi-label image retrieval," IEEE Trans. Multimedia, vol. 22, no. 2, pp. 540–553, 2020. 

 

Response 1: We gratefully thank the reviewer for his/her valuable suggestion. We have read and cited the references above in the revised manuscript. We have supplemented the latest literature on multi-label remote sensing retrieval and classification published since 2021, adding five references [21][32][33][34][35] on multi-label image retrieval.

See Page 3, lines 129-131; Page 4, lines 155-162.

 

 

[21] Zhang, Z.; Zou, Q.; Lin, Y.; Chen, L.; Wang, S. Improved deep hashing with soft pairwise similarity for multi-label image retrieval. IEEE Transactions on Multimedia 2019, 22, 540-553.

[32] Qi, X., Zhu, P., Wang, Y., Zhang, L., Peng, J., Wu, M., Chen, J., Zhao, X., Zeng, N., Mathiopoulos, P. T. MLRSNet: A multi-label high spatial resolution remote sensing dataset for semantic scene understanding. ISPRS Journal of Photogrammetry and Remote Sensing 2020, 169, 337-350.

[33] Sumbul, G., de Wall, A., Kreuziger, T., Marcelino, F., Costa, H., Benevides, P., Markl, V. BigEarthNet-MM: A Large Scale Multi-Modal Multi-Label Benchmark Archive for Remote Sensing Image Classification and Retrieval 2021, arXiv preprint arXiv:2105.07921.

[34] Sumbul, G., Ravanbakhsh, M., Demir, B. Informative and Representative Triplet Selection for Multi-Label Remote Sensing Image Retrieval. IEEE Transactions on Geoscience and Remote Sensing, doi: 10.1109/TGRS.2021.3124326.

[35] Sumbul, G.; and Demir, B. A Novel Graph-Theoretic Deep Representation Learning Method for Multi-Label Remote Sensing Image Retrieval, IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, July, 2021; pp 266-269.

 

Point 2: The proposed method is tested only on one dataset, showing promising results. However, this is not enough to prove the robustness and effectiveness of the technique. Considering this and the literature on multi-label remote sensing image retrieval, I suggest the authors apply their method to another dataset and, for example, compare the results with those on the BigEarth dataset.

 

Response 2: Thanks a lot for the comments. We have made supplementary experiments on the UCMerced dataset to verify the effectiveness of our model. The UCMerced dataset is widely used in multi-label retrieval tasks. We compare our results with the latest method proposed by Sumbul et al. in [44]. The experimental results show the effectiveness and advantages of introducing hash learning into the multi-label remote sensing image retrieval task. See Pages 13-14, Section 4.4.2(2).

 

The BigEarthNet dataset contains 590,326 remote sensing images. Such a considerable quantity of labeled data is beneficial for training models, but training on data of this scale imposes high hardware requirements and time costs (training multiple models would take far more than ten days). We will consider applying this dataset in future multi-label retrieval work.

 

Minor comments:

 

Point 1: Fig 1. The caption contains acronyms that are not described.

 

Response 1: Thanks again for the valuable suggestion. We have explained the acronyms in the revised manuscript. As suggested, we provide an introduction in the paragraph as follows: “we select two images from the single label dataset UCMD and the multi-label dataset MLRSD as an example”. See Page 2, lines 55-56.

 

Point 2: Line 82, I suggest moving 'In this paper we propose' to a new line.

 

Response 2: Thanks again for the valuable suggestion. We have corrected it in the revised manuscript.

 

Point 3: Line 106, multi-Label should be capital, i.e., Multi-label.

 

Response 3: Thanks very much for pointing out this mistake. We have corrected it in the revised manuscript. See Page 3, line 119.

 

 

Point 4: Line 176-179, I suggest that the authors rewrite the sentence to emphasize the literature gap and avoid presenting the paper's scope.

 

Response 4: Thanks a lot for the valuable comment. We have rewritten the sentence in the revised manuscript. See Page 5, lines 197-200.

 

Point 5: Eq (1) misses the closing >.

 

Response 5: Thanks again for the valuable suggestion. We have modified it in the revised manuscript. See Page 5, line 215.

 

Point 6: Line 196, I guess that 'CC' should be removed.

 

Response 6: Thanks very much for pointing out this mistake. We have corrected this error in the revised manuscript.

 

 

Point 7: Line 203-209, this paragraph can be replaced or introduced by a sentence like 'the similarity value of a given couple  and  is inversely proportional to the hamming distance.' to increase the readability.

 

Response 7: Thanks again for the valuable suggestion. We summarize the original content as suggested in the revised manuscript.  See Page 6, line 222-223.

 

Point 8: Line 223-229, I suggest splitting the sentence into many shorter sentences to increase the readability.

 

Response 8: We gratefully thank the reviewer for his/her valuable comment. We have split the sentence into several shorter sentences. See Page 6, lines 243-247.

 

Point 13: Fig 2. is a bit hard to understand.

 

Response 13: In Section 3.2, we introduce the system architecture of the deep hashing model, which includes the construction of the model and its retrieval process. See Page 6, Section 3.2.

The introduction is as follows:

The whole system architecture of our model is shown in Figure 2 and is mainly composed of two parts, namely the deep feature extraction module and the hash learning module. While the deep feature extraction module is responsible for generating a high-level, abstract image representation through a multi-level architecture, the hash learning module is responsible for mapping each image to a binary sequence in the binary (Hamming) space. In order to preserve the similarity information of multi-label remote sensing image pairs and to control hashing quality, we improve the pair-wise similarity loss and use a quantization loss to limit the output value range of the hash network.

The retrieval process of our model is as follows: the paired image group is fed into the model; the high-dimensional abstract features of the multi-label remote sensing images are extracted through multi-layer convolution and two fully connected layers; and the output is fed to a hash layer that connects the fully connected layers FC1 and FC2 to generate a hash code of length q. The image similarity is then used as supervision information to train the model in an end-to-end manner. In the retrieval stage, the multi-label remote sensing image is encoded into a binary code, the distance between the binary code of the query image and those of the other images is calculated, and finally the ranked search results are returned.
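To make the retrieval stage described above concrete, here is a minimal NumPy sketch of ranking database images by Hamming distance on binarized hash codes. The array names (`hash_outputs`, `db`, `query`) and the 64-bit code length are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def binarize(hash_outputs: np.ndarray) -> np.ndarray:
    """Quantize real-valued hash-layer outputs to {0, 1} binary codes."""
    return (hash_outputs > 0).astype(np.uint8)

def hamming_rank(query_code: np.ndarray, database_codes: np.ndarray) -> np.ndarray:
    """Return database indices sorted by ascending Hamming distance to the query."""
    distances = np.count_nonzero(database_codes != query_code, axis=1)
    return np.argsort(distances, kind="stable")

# Example with random codes: 1000 database images, 64-bit hash codes.
rng = np.random.default_rng(0)
db = binarize(rng.standard_normal((1000, 64)))
query = binarize(rng.standard_normal(64))
ranking = hamming_rank(query, db)   # ranking[0] is the most similar image
```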

 

Point 14: Line 291-294, this sentence doesn't go at the end of the proposed method description but at the beginning of the methodological section. At this point of the paper, this should be known.

 

Response 14: We have corrected it in the revised manuscript. See Page 7, Line 262-267.

Author Response File: Author Response.docx

Reviewer 2 Report

Dear authors,

 

I have finished the review of your paper. In my opinion, there are a few improvements that are required to consider your paper for publication. 

  1. In table 2, consider adding the number of samples per class.
  2. Add the definition of the performance metrics, at least of the Hamming Loss. 
  3. In tables 6 and 7, you repeat the name "proposed" please, to ease the reading of your paper, modify that text with a name that identifies that result uniquely since it seems that it is a repeated result. 

I hope these recommendations help you to improve your paper. 

Author Response

Point 1: In table 2, consider adding the number of samples per class.

 

Response 1: We gratefully thank the reviewer for his/her valuable comment. We have added the number of samples per class in the revised manuscript, as follows:

Table 2. Annotation description of DLRSD image dataset.

Label ID | Category   | Number of images | Label ID | Category    | Number of images | Label ID | Category | Number of images
1        | airplane   | 100              | 7        | dock        | 100              | 13       | sea      | 101
2        | bare soil  | 754              | 8        | field       | 103              | 14       | ship     | 103
3        | building   | 713              | 9        | grass       | 977              | 15       | tank     | 100
4        | car        | 897              | 10       | mobile home | 102              | 16       | tree     | 1021
5        | chaparral  | 116              | 11       | pavement    | 1331             | 17       | water    | 208
6        | court      | 105              | 12       | sand        | 291              |          |          |

 

 

 

 

 

Point 2: Add the definition of the performance metrics, at least of the Hamming Loss.

 

Response 2: Thanks a lot for the valuable comment. Hamming loss (HL) directly counts the number of incorrectly predicted labels, i.e., cases where a relevant label does not appear in the predicted label set or an irrelevant label does appear in it (the number of label errors divided by the total number of labels in the prediction result). The smaller the value of this evaluation metric, the better the performance of the model. The other evaluation metrics are defined as follows:

 

Precision refers to the ratio of the number of similar (related) images returned by the system to the number of all returned images in a retrieval (query) process, reflecting the accuracy of the retrieval results.

 

Recall refers to the ratio, in a retrieval (query) process, of the number of similar (relevant) images returned by the system to the number of all similar (relevant) images in the image library, reflecting the comprehensiveness of the retrieval results.

 

Accuracy refers to the mean, over all queries, of the average precision calculated for each query.

 

F1 comprehensively considers both precision and recall. It is calculated as F1 = 2PR/(P + R), where P is precision and R is recall.
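As a concrete illustration of these definitions, the following Python sketch computes Hamming loss on label vectors and precision/recall/F1 for a single query. The variable names and numbers are hypothetical examples, not results from the paper.

```python
import numpy as np

def hamming_loss(true_labels: np.ndarray, pred_labels: np.ndarray) -> float:
    """Fraction of label positions predicted incorrectly (lower is better)."""
    return float(np.mean(true_labels != pred_labels))

def retrieval_prf(returned_relevant: int, returned_total: int, relevant_total: int):
    """Precision, recall, and F1 for a single retrieval (query)."""
    precision = returned_relevant / returned_total if returned_total else 0.0
    recall = returned_relevant / relevant_total if relevant_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 17 possible labels (as in Table 2), one query returning 20 images.
y_true = np.array([1, 0, 1, 0] * 4 + [1])   # ground-truth label vector
y_pred = np.array([1, 0, 0, 0] * 4 + [1])   # predicted label vector
print(hamming_loss(y_true, y_pred))          # 4 wrong labels out of 17
print(retrieval_prf(returned_relevant=15, returned_total=20, relevant_total=60))
```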

 

Point 3: In tables 6 and 7, you repeat the name "proposed" please, to ease the reading of your paper, modify that text with a name that identifies that result uniquely since it seems that it is a repeated result.

 

Response 3: Thanks again for the valuable suggestion. We have replaced "proposed" with our model's name, DHMR, in the revised manuscript. In addition, to distinguish it from the other comparison models, we name the variant that only changes the loss function, without the jump (skip) connection, DHMR/fc.

 

Author Response File: Author Response.docx

Reviewer 3 Report

This article proposes a new semantic preserving deep hash model for multi-label remote sensing image retrieval and experimentally demonstrates the effectiveness of the proposed method. This is an interesting research paper. There are some suggestions for revision.

  1. The motivation is not clear. Please specify the importance of the proposed solution.
  2. Two contributions listed in introduction are a little bit weak. Please highlight the innovations of the proposed solution. The advantages of the proposed solution should be specified, especially deep hashing.
  3. The pros and cons of existing solutions should be compared.
  4. Most of references are a little bit out of date. Please discuss more recently published solutions, especially the solutions published in 2021.
  5. This paper does not discuss how to process fog/haze in remote sensing image retrieval. Please discuss the following two solutions. "Atmospheric Light Estimation Based Remote Sensing Image Dehazing", Remote Sensing 13 (13), 2432, 2021, and "Remote Sensing Image Defogging Networks Based on Dual Self-Attention Boost Residual Octave Convolution", Remote Sensing 13 (16), 3104, 2021.
  6. As mentioned in this paper, this is the first application of hashing to multi-label information for remote sensing image retrieval, which should be treated strictly. Moreover, in the supervised hash learning method for image retrieval discussed in related work, it is pointed out that the above method does not really use multi-label information. It is better to integrate the related contents with the multiple tags in introduction to establish a relationship. Then, the integrated interpretation is more convenient for readers to read and understand.
  7. More detailed descriptions of encoding remote sensing images using deep hash networks should be given. The design of the loss function can be contacted and explained further. How does the proposed solution maintain multi-label semantic information? The detailed explanations are necessary.
  8. Please pay attention to the contents and format of the article, including writing errors, content typesetting, chart content, etc., such as the "CC" in line 196, which appears abruptly in the paragraph; Figure 2 and Figure 4 are blurred, and the displayed information is difficult to see clearly, which is inconvenient for readers, etc. Please review the content of the article carefully. Maybe some key methods or techniques should be explained further. For example, "inspired by the squeeze excitation network, we regard channel attention as the process of adaptively selecting the convolutional layer", which will make it easier for readers to understand and establish a connection with the context.
  9. In the content of model comparison, for example, the performance of the DMSH model is the worst. This may be because the model uses the L2 paradigm as the distance metric. Are there any further experiments to verify this conclusion? In the comparative experiments, maybe you can list the comparison of model performance separately.
  10. More technical details of the proposed solution should be given.
  11. The experimental results are not convincing. Please compare the proposed solution with more recently published solutions.

Author Response

Point 1: The motivation is not clear. Please specify the importance of the proposed solution.

 

Response 1: We gratefully thank the reviewer for his/her insightful comments. We have clarified our motivation in detail in the revised manuscript. See Page 3, lines 83-94.

Thanks to the high storage efficiency of binarized hash codes and the computational efficiency of the Hamming distance, deep hashing is largely favourable to the image retrieval of massive remote sensing imagery. However, almost all existing deep hashing networks are based on single-label learning, which cannot adapt to complex remote sensing scenes. More specifically, remote sensing images are usually composed of multiple land categories, so a single label describing the most significant semantic content might ignore the complex and abundant information contained in the image. To tackle this dilemma, we attempt to introduce multi-label supervision into the deep hashing framework. Further, we propose the pair-wise label similarity loss to fully exploit the multi-label semantic information, including a hard similarity loss represented by cross entropy and a soft similarity loss measured by mean square error. Specifically, the hard similarity loss accounts for completely similar or dissimilar sample pairs, while the soft similarity loss considers partially similar sample pairs; together, they encourage the deep hashing model to preserve the semantic consistency between the input paired image samples.

 

Point 2: Two contributions listed in introduction are a little bit weak. Please highlight the innovations of the proposed solution. The advantages of the proposed solution should be specified, especially deep hashing.

 

Response 2: We gratefully thank the respected referee for this constructive comment. We make a detailed introduction to our work. See Page 3, line 83-106.

 

Due to the high storage efficiency of the binary hash code and the calculation efficiency of the Hamming distance, deep hashing is beneficial to the image retrieval of massive remote sensing images to a large extent. However, in the field of remote sensing, almost all existing deep hash networks are based on single-label learning, which cannot adapt to complex remote sensing scenarios. Remote sensing images are usually composed of multiple land categories, so a single label describing the most important semantic content may ignore the complex and rich information contained in the image. In order to solve this problem, we try to introduce multi-label supervision in the deep hashing framework. 

The main contributions of this paper can be summarized as follows:

(1) We propose a semantic-preserving deep hashing model for the multi-label remote sensing image retrieval. As far as we know, this is the first attempt to introduce hashing methods in the multi-label remote sensing image retrieval.

(2) We propose a paired label similarity loss to make full use of multi-label semantic information, including a hard similarity loss represented by cross-entropy and a soft similarity loss measured by mean square error. Specifically, the hard similarity loss considers completely similar or dissimilar sample pairs, while the soft similarity loss considers partially similar sample pairs. Together, they encourage the deep hash model to maintain semantic consistency between the input paired image samples.

 

 

Point 3: The pros and cons of existing solutions should be compared.

 

Response 3: We sincerely thank the reviewer for his/her constructive comment. We have surveyed current research in the field of multi-label retrieval of remote sensing images. All of this work has shown that multi-label retrieval can achieve better retrieval performance than single-label retrieval, and that deep learning can further improve retrieval performance. However, in the era of massive remote sensing data, how to increase retrieval efficiency and reduce feature storage while preserving semantic information remains unsolved. Considering the powerful capability of hash learning in overcoming the curse of dimensionality caused by high-dimensional image representations in Approximate Nearest Neighbor (ANN) search problems, we propose a new semantic-preserving deep hashing model for multi-label remote sensing image retrieval.

 

Point 4: Most of references are a little bit out of date. Please discuss more recently published solutions, especially the solutions published in 2021.

 

Response 4: We gratefully thank the reviewer for his/her valuable suggestion. We have read and cited the references above in the revised manuscript. Among them, we added five references [21][32][33][34][35] on multi-label image retrieval.

See Page 3, lines 129-131; Page 4, lines 155-162.

 

 

[21] Zhang, Z.; Zou, Q.; Lin, Y.; Chen, L.; Wang, S. Improved deep hashing with soft pairwise similarity for multi-label image retrieval. IEEE Transactions on Multimedia 2019, 22, 540-553.

[32] Qi, X., Zhu, P., Wang, Y., Zhang, L., Peng, J., Wu, M., Chen, J., Zhao, X., Zeng, N., Mathiopoulos, P. T. MLRSNet: A multi-label high spatial resolution remote sensing dataset for semantic scene understanding. ISPRS Journal of Photogrammetry and Remote Sensing 2020, 169, 337-350.

[33] Sumbul, G., de Wall, A., Kreuziger, T., Marcelino, F., Costa, H., Benevides, P., Markl, V. BigEarthNet-MM: A Large Scale Multi-Modal Multi-Label Benchmark Archive for Remote Sensing Image Classification and Retrieval 2021, arXiv preprint arXiv:2105.07921.

[34] Sumbul, G., Ravanbakhsh, M., Demir, B. Informative and Representative Triplet Selection for Multi-Label Remote Sensing Image Retrieval. IEEE Transactions on Geoscience and Remote Sensing, doi: 10.1109/TGRS.2021.3124326.

[35] Sumbul, G.; and Demir, B. A Novel Graph-Theoretic Deep Representation Learning Method for Multi-Label Remote Sensing Image Retrieval, IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, July, 2021; pp 266-269.

 

Point 5: This paper does not discuss how to process fog/haze in remote sensing image retrieval. Please discuss the following two solutions. "Atmospheric Light Estimation Based Remote Sensing Image Dehazing", Remote Sensing 13 (13), 2432, 2021, and "Remote Sensing Image Defogging Networks Based on Dual Self-Attention Boost Residual Octave Convolution", Remote Sensing 13 (16), 3104, 2021.

 

Response 5: Thanks again for the valuable suggestion. This is an interesting and meaningful research direction in the remote sensing field. We have added these references in the revised manuscript.

 

[4] Zhu, Z.; Luo, Y.; Wei, H.; Li, Y.; Qi, G.; Mazur, N.; Li, Y. Atmospheric Light Estimation Based Remote Sensing Image Dehazing. Remote Sensing 2021, 13, 2432.

[5] Zhu ,Z.; Luo, Y.; Qi, G.; Meng, J.; li, Y.; Mazur, N. Remote sensing image defogging networks based on dual self-attention boost residual octave convolution. Remote Sensing 2021, 13, 3104.

 

Point 6: As mentioned in this paper, this is the first application of hashing to multi-label information for remote sensing image retrieval, which should be treated strictly. Moreover, in the supervised hash learning method for image retrieval discussed in related work, it is pointed out that the above method does not really use multi-label information. It is better to integrate the related contents with the multiple tags in introduction to establish a relationship. Then, the integrated interpretation is more convenient for readers to read and understand.

 

Response 6: Thanks again for the valuable suggestion. In the currently published literature, we have not found similar reports. Deep hashing is largely favourable to the image retrieval of massive remote sensing imagery because of the high storage efficiency of binarized hash codes and the computational efficiency of the Hamming distance. However, almost all existing deep hashing networks are based on single-label learning [46,49]. Multi-label retrieval has recently become a popular research topic. In particular, Shao et al. [17] proposed a new FCN-based multi-label method, [18][35] proposed a new graph relational network (GRN) for multi-label retrieval, and [34] proposed a new triplet learning method. These methods demonstrate the advantages of deep learning for multi-label retrieval. We combine hash learning with deep learning and propose a paired label similarity loss to make full use of multi-label semantic information and to encourage the deep hashing model to keep semantic consistency between the input paired image samples.

In Section 2.3, we surveyed the methods based on deep hashing and introduced them by category, dividing them into unsupervised hashing and supervised hashing. In addition, we introduced the development history of deep hashing in the field of natural images in the form of a timeline, in which single-label hash retrieval does not utilize the multi-label semantic information of the image. Remote sensing images are usually composed of multiple land categories, so a single label describing the most significant semantic content might ignore the complex and abundant information contained in the image. To tackle this dilemma, in this article we apply the deep hashing method to multi-label retrieval of remote sensing images to realize more refined retrieval in the era of massive remote sensing data. See Page 5, lines 197-199.

 

Point 7: More detailed descriptions of encoding remote sensing images using deep hash networks should be given. The design of the loss function can be contacted and explained further. How does the proposed solution maintain multi-label semantic information? The detailed explanations are necessary.

 

Response 7: Thanks a lot for the valuable comment. In recent years, most deep hashing methods have defined pairwise similarity in a hard-allocation manner: if two images share at least one class label, their pairwise similarity is "1", and if they share no class label, it is "0". However, this definition of similarity does not reflect the similarity ranking of paired images containing multiple labels. This paper proposes an improved deep hashing method to improve multi-label image retrieval. To express the hard similarity and the soft similarity uniformly, we introduce parameters  to balance the ranges of the two values in designing the loss function;  = 1 refers to the cross entropy loss and  = 0 refers to the mean square error loss. Since the value range of  is 2q times the value range of , we introduce hyperparameters  to balance their ranges. In addition, we use the similarity  of the paired images to weight the paired similarity loss, expressed as , where  is the adjustment parameter. See Page 7, lines 279-292.
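For readers who find a code view easier, below is a loose PyTorch sketch of the hard/soft pair-wise similarity idea described above (cross entropy for fully similar or fully dissimilar pairs, mean squared error for partially similar pairs). It is written from the textual description only; the cosine label similarity, the scaling of the inner product, and the `beta` weight are my own assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def label_similarity(labels_i: torch.Tensor, labels_j: torch.Tensor) -> torch.Tensor:
    """Soft similarity in [0, 1] between multi-label 0/1 vectors (cosine similarity)."""
    return F.cosine_similarity(labels_i.float(), labels_j.float(), dim=1)

def pairwise_similarity_loss(u_i, u_j, labels_i, labels_j, beta=1.0):
    """Hard (cross-entropy) + soft (MSE) similarity loss on hash-layer outputs u."""
    q = u_i.size(1)                                   # hash code length
    s = label_similarity(labels_i, labels_j)          # target similarity in [0, 1]
    inner = (u_i * u_j).sum(dim=1) / q                # predicted similarity, roughly in [-1, 1]
    shares_label = (labels_i.bool() & labels_j.bool()).any(dim=1)
    identical = (labels_i == labels_j).all(dim=1)
    hard = identical | ~shares_label                  # fully similar or fully dissimilar pairs
    soft = ~hard                                      # partially similar pairs
    hard_loss = (F.binary_cross_entropy_with_logits(inner[hard], s[hard])
                 if hard.any() else inner.new_tensor(0.0))
    soft_loss = (F.mse_loss((inner[soft] + 1) / 2, s[soft])
                 if soft.any() else inner.new_tensor(0.0))
    return hard_loss + beta * soft_loss

# Example with random 17-dimensional multi-label vectors and 48-bit hash outputs.
u_i, u_j = torch.tanh(torch.randn(8, 48)), torch.tanh(torch.randn(8, 48))
y_i = (torch.rand(8, 17) > 0.7).float()
y_j = (torch.rand(8, 17) > 0.7).float()
loss = pairwise_similarity_loss(u_i, u_j, y_i, y_j)
```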

 

Point 8: Please pay attention to the contents and format of the article, including writing errors, content typesetting, chart content, etc., such as the "CC" in line 196, which appears abruptly in the paragraph; Figure 2 and Figure 4 are blurred, and the displayed information is difficult to see clearly, which is inconvenient for readers, etc. Please review the content of the article carefully. Maybe some key methods or techniques should be explained further. For example, "inspired by the squeeze excitation network, we regard channel attention as the process of adaptively selecting the convolutional layer", which will make it easier for readers to understand and establish a connection with the context.

 

Response 8: We sincerely thank the reviewer for his/her careful reading. We have checked the manuscript carefully and corrected errors in the revised manuscript; please refer to the manuscript for more details. "CC" has been deleted, and the original (higher-resolution) images have been inserted for Figure 2 and Figure 4. For the key technique, we have made a detailed introduction in the abstract: "Considering the powerful capability of hashing learning in overcoming the curse of dimensionality caused by high-dimension image representation in Approximate Nearest Neighbor (ANN) search problems, we propose a new semantic-preserving deep hashing model for the multi-label remote sensing image retrieval." See Page 1, lines 21-22.

 

Point 9: In the content of model comparison, for example, the performance of the DMSH model is the worst. This may be because the model uses the L2 paradigm as the distance metric. Are there any further experiments to verify this conclusion? In the comparative experiments, maybe you can list the comparison of model performance separately.

 

Response 9: Thanks a lot for your comments. In the comparison experiments, we use the same network model (AlexNet) and the same training parameters, but different loss functions are applied. The experimental results show that our proposed model achieves the best performance, which proves the effectiveness of the loss function design. Considering that the DMSH model uses the L2 norm as the distance metric, we estimated that its poor performance might result from the use of the L2 norm.

 

Point 10: More technical details of the proposed solution should be given.

 

Response 10: Thanks again for the valuable suggestion. We have made more supplementary description in the revised manuscript. 

This paper proposes an improved deep hashing method to improve the ability of multi-label image retrieval.

  1. In terms of the design of the loss function, see Page 7, lines 279-292. To express the hard similarity and the soft similarity uniformly, we introduce parameters  to balance the ranges of the two values in designing the loss function;  = 1 refers to the cross entropy loss and  = 0 refers to the mean square error loss. Since the value range of  is 2q times the value range of , we introduce hyperparameters  to balance their ranges. In addition, we use the similarity  of the paired images to weight the paired similarity loss, expressed as , where  is the adjustment parameter.
  2. For the detailed process of our model, see Section 3.2.

The whole system architecture of our model is shown in Figure 2 and is mainly composed of two parts, namely the deep feature extraction module and the hash learning module. While the deep feature extraction module is responsible for generating a high-level, abstract image representation through a multi-level architecture, the hash learning module is responsible for mapping each image to a binary sequence in the binary (Hamming) space. In order to preserve the similarity information of multi-label remote sensing image pairs and to control hashing quality, we improve the pair-wise similarity loss and use a quantization loss to limit the output value range of the hash network.

The retrieval process of our model is as follows: the paired image group is fed into the model; the high-dimensional abstract features of the multi-label remote sensing images are extracted through multi-layer convolution and two fully connected layers; and the output is fed to a hash layer that connects the fully connected layers FC1 and FC2 to generate a hash code of length q. The image similarity obtained from the label vectors is then used as the supervision information to train the model in an end-to-end manner. In the retrieval stage, the multi-label remote sensing image is encoded by the deep hashing model into a binary code, the Hamming distance between the binary code of the query image and the binary codes of the other images is calculated, and finally the ranked search results are returned.
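The quantization loss mentioned above can be illustrated with a short sketch that penalizes hash-layer outputs for straying from ±1, so that the final sign() binarization loses little information. This is a generic formulation assumed for illustration, not necessarily the paper's exact definition.

```python
import torch

def quantization_loss(u: torch.Tensor) -> torch.Tensor:
    """Penalize hash-layer outputs u (batch x q) for deviating from {-1, +1}."""
    return torch.mean((u.abs() - 1.0) ** 2)

u = torch.tanh(torch.randn(8, 64))   # illustrative hash-layer outputs in (-1, 1)
codes = torch.sign(u)                # binary codes used at retrieval time
loss_q = quantization_loss(u)        # small when outputs are already near +/-1
```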

 

 

Point 11: The experimental results are not convincing. Please compare the proposed solution with more recently published solutions.

 

Response 11: Thanks for the comments. We conducted supplementary experiments on the UCMerced dataset and compared our model both with hashing methods from the natural image field and with the latest multi-label retrieval method in the remote sensing field proposed by Sumbul et al. in 2021. The experimental results show the effectiveness and advantages of introducing hash learning into the multi-label remote sensing image retrieval task.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

The authors addressed all my comments.

I have a few small comments:

  1. First, the caption of Fig 2 may be enriched by describing the scheme.
  2. It would be better to give more space to the second dataset. The authors may consider adding a table like Table 2 for describing the UC Merced dataset.
  3. In the conclusion and the rest of the paper, the authors state that the proposed method shows better results than the state-of-the-art method. However, this is not true for all the metrics and both datasets. I suggest the authors be less sharp.

Author Response

Point 1: First, the caption of Fig 2 may be enriched by describing the scheme.

 

Response 1: We sincerely thank the reviewer for his/her constructive comment. We have described it in detail in the revised manuscript. See Page 6, line 252-255.

An overview of the DMHR model for multi-label remote sensing image retrieval. On one hand, batch images are input into the network to extract high-dimensional features, which are then fed into a hash layer to generate hash codes. On the other hand, the image similarity calculated through label vectors is used as the supervision information to train the model.

 

Point 2: It would be better to give more space to the second dataset. The authors may consider adding a table like Table 2 for describing the UC Merced dataset.

 

Response 2: Thanks a lot for the valuable comment. We have added it in the revised manuscript. The difference between DLRSD and MLRSD datasets lies in the label form and content of the datasets. See Page 10, line 340.

 

Label ID | Category   | Number of images | Label ID | Category    | Number of images | Label ID | Category | Number of images
1        | airplane   | 100              | 7        | dock        | 100              | 13       | sea      | 100
2        | bare soil  | 633              | 8        | field       | 106              | 14       | ship     | 102
3        | building   | 696              | 9        | grass       | 977              | 15       | tank     | 100
4        | car        | 884              | 10       | mobile home | 102              | 16       | tree     | 1015
5        | chaparral  | 119              | 11       | pavement    | 1305             | 17       | water    | 203
6        | court      | 105              | 12       | sand        | 389              |          |          |
 

 

 

 

Point 3: In the conclusion and the rest of the paper, the authors state that the proposed method shows better results than the state-of-the-art method. However, this is not true for all the metrics and both datasets. I suggest the authors be less sharp.

 

 

Response 3: Thanks again for the valuable suggestion. We have corrected the corresponding descriptions in our manuscript. See Page 1, line 28-30, Page 17, line 498-499.

Author Response File: Author Response.docx

Reviewer 2 Report

Dear authors,

I have finished the review of this revised version of your paper. In my opinion, you have addressed al my initial concerns and suggestions, therefore I have no more recommendations about your work.

Author Response

Thanks a lot for your precious time and valuable comments on our work.

Reviewer 3 Report

All my concerns have been addressed. I recommend this paper for publication.

Author Response

Thanks a lot for your precious time and valuable comments on our work.