Interactive Instance Search: User-Centered Enhanced Image Retrieval with Learned Perceptual Image Patch Similarity
Abstract
1. Introduction
- We introduce a novel framework for interactive, multi-region collaborative instance search. It improves human–computer interaction when users select the instances they are interested in, and it efficiently gathers feedback that matches user requirements.
- We measure similarity with Learned Perceptual Image Patch Similarity (LPIPS), a deep learning-based method that captures the perceptual similarity between images. Because this measure aligns closely with human visual perception, it improves the accuracy of query results.
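As an illustration of the second contribution, the following is a minimal sketch of an LPIPS-style distance in the spirit of Zhang et al. It assumes that deep feature maps have already been extracted for both images by some backbone network. The function names, the nested-list layout `feat[channel][row][col]`, and the toy weights are assumptions made for this example; they are not taken from the system described in this paper.

```python
import math


def unit_normalize(feat, eps=1e-10):
    """Scale the channel vector at every spatial position to unit L2 norm.

    feat is a nested list indexed as feat[channel][row][col] (assumed layout).
    """
    channels, height, width = len(feat), len(feat[0]), len(feat[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for h in range(height):
        for w in range(width):
            norm = math.sqrt(sum(feat[c][h][w] ** 2 for c in range(channels)))
            for c in range(channels):
                out[c][h][w] = feat[c][h][w] / (norm + eps)
    return out


def lpips_style_distance(feats_a, feats_b, layer_weights):
    """Sum over layers of the spatially averaged, channel-weighted squared difference.

    feats_a and feats_b hold one feature map per layer; layer_weights holds one
    list of non-negative per-channel weights per layer (hypothetical values here,
    learned values in a real LPIPS model).
    """
    total = 0.0
    for fa, fb, weights in zip(feats_a, feats_b, layer_weights):
        na, nb = unit_normalize(fa), unit_normalize(fb)
        channels, height, width = len(na), len(na[0]), len(na[0][0])
        layer_sum = 0.0
        for c in range(channels):
            for h in range(height):
                for w in range(width):
                    layer_sum += weights[c] * (na[c][h][w] - nb[c][h][w]) ** 2
        total += layer_sum / (height * width)
    return total


# Toy example: one layer, two channels, a single spatial position.
feats_query = [[[[1.0]], [[0.0]]]]
feats_candidate = [[[[0.0]], [[1.0]]]]
toy_weights = [[0.5, 0.5]]
print(lpips_style_distance(feats_query, feats_query, toy_weights))  # identical inputs give 0.0
print(lpips_style_distance(feats_query, feats_candidate, toy_weights))
```

A smaller distance means the two instances are perceptually closer, so candidates would be ranked in ascending order of this value.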
2. Related Work
2.1. User-Centered Interactive Image Retrieval
2.2. Instance Search
2.3. Image Similarity Comparison
3. Method
3.1. Instance Search
3.1.1. Instance Detection
3.1.2. LPIPS-Based Instance Similarity Comparison
3.2. Ranking
4. Experimental Results
4.1. Experiments on Standard Instance Search Datasets
4.2. Multiple Instance Search Accuracy
4.3. Practicability of System Demonstration
4.4. Interactive Experiment
5. Discussion
5.1. The Robustness of the System
5.2. LPIPS
5.3. Similarity Score Calculation
5.4. Obstacles in Practical Application
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Dagan, A.; Guy, I.; Novgorodov, S. Shop by image: Characterizing visual search in e-commerce. Inf. Retr. J. 2023, 26, 2.
2. Hou, S.; Zhao, C.; Chen, Z.; Wu, J.; Wei, Z.; Miao, D. Improved instance discrimination and feature compactness for end-to-end person search. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 2079–2090.
3. Parikh, V.; Keskar, M.; Dharia, D.; Gotmare, P. A tourist place recommendation and recognition system. In Proceedings of the 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, India, 20–21 April 2018; pp. 218–222.
4. Wu, Y.; Dong, X.; Shi, G.; Zhang, X.; Chen, C. Crime scene shoeprint image retrieval: A review. Electronics 2022, 11, 2487.
5. Hegde, N.; Hipp, J.D.; Liu, Y.; Emmert-Buck, M.; Reif, E.; Smilkov, D.; Terry, M.; Cai, C.J.; Amin, M.B.; Mermel, C.H.; et al. Similar image search for histopathology: SMILY. NPJ Digit. Med. 2019, 2, 56.
6. Rui, Y.; Huang, T.S.; Chang, S.F. Image retrieval: Current techniques, promising directions, and open issues. J. Vis. Commun. Image Represent. 1999, 10, 39–62.
7. Alzu’bi, A.; Amira, A.; Ramzan, N. Semantic content-based image retrieval: A comprehensive study. J. Vis. Commun. Image Represent. 2015, 32, 20–54.
8. Wang, L.; Zhang, Y.; Feng, J. On the Euclidean distance of images. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1334–1339.
9. Steck, H.; Ekanadham, C.; Kallus, N. Is cosine-similarity of embeddings really about similarity? In Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; pp. 887–890.
10. Sheikh, H.R.; Sabir, M.F.; Bovik, A.C. A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Trans. Image Process. 2006, 15, 3440–3451.
11. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
12. Zhan, Y.; Zhao, W.L. Instance search via instance level segmentation and feature representation. J. Vis. Commun. Image Represent. 2021, 79, 103253.
13. Zhang, Y.; Liu, C.; Chen, W.; Xu, X.; Wang, F.; Li, H.; Hu, S.; Zhao, X. Revisiting instance search: A new benchmark using cycle self-training. Neurocomputing 2022, 501, 270–284.
14. Wong, K.M.; Cheung, K.W.; Po, L.M. MIRROR: An interactive content based image retrieval system. In Proceedings of the 2005 IEEE International Symposium on Circuits and Systems (ISCAS), Kobe, Japan, 23–26 May 2005; pp. 1541–1544.
15. Lai, C.C.; Chen, Y.C. A user-oriented image retrieval system based on interactive genetic algorithm. IEEE Trans. Instrum. Meas. 2011, 60, 3318–3325.
16. Anwaar, M.U.; Labintcev, E.; Kleinsteuber, M. Compositional learning of image-text query for image retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1140–1149.
17. Wang, S.; Han, J. Automated detection of exterior cladding material in urban area from street view images using deep learning. J. Build. Eng. 2024, 96, 110466.
18. Salvador, A.; Giró-i-Nieto, X.; Marqués, F.; Satoh, S.I. Faster R-CNN features for instance search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 27–30 June 2016; pp. 9–16.
19. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595.
20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
21. Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
22. Philbin, J.; Chum, O.; Isard, M.; Sivic, J.; Zisserman, A. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8.
23. Philbin, J.; Chum, O.; Isard, M.; Sivic, J.; Zisserman, A. Lost in quantization: Improving particular object retrieval in large scale image databases. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8.
24. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
25. Kalantidis, Y.; Mellina, C.; Osindero, S. Cross-dimensional weighting for aggregated deep convolutional features. In Proceedings of the Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; Proceedings, Part I 14. Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 685–701.
26. Pang, S.; Zhu, J.; Wang, J.; Ordonez, V.; Xue, J. Building discriminative CNN image representations for object retrieval using the replicator equation. Pattern Recognit. 2018, 83, 150–160.
27. Liu, G.H.; Yang, J.Y. Exploiting deep textures for image retrieval. Int. J. Mach. Learn. Cybern. 2023, 14, 483–494.
28. Xu, J.; Wang, C.; Qi, C.; Shi, C.; Xiao, B. Unsupervised semantic-based aggregation of deep convolutional features. IEEE Trans. Image Process. 2018, 28, 601–611.
29. Gkelios, S.; Boutalis, Y.; Chatzichristofis, S.A. Investigating the vision transformer model for image retrieval tasks. In Proceedings of the 2021 17th International Conference on Distributed Computing in Sensor Systems (DCOSS), Pafos, Cyprus, 14–16 July 2021; pp. 367–373.
30. Yan, L.; Cui, Y.; Chen, Y.; Liu, D. Hierarchical attention fusion for geo-localization. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2220–2224.
31. Xu, Y.; Shamsolmoali, P.; Granger, E.; Nicodeme, C.; Gardes, L.; Yang, J. TransVLAD: Multi-scale attention-based global descriptors for visual geo-localization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 2840–2849.
32. Lu, F.; Liu, G.H. Image retrieval using contrastive weight aggregation histograms. Digit. Signal Process. 2022, 123, 103457.
33. Grinberg, M. Flask Web Development; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2018.
34. Li, Y.; Košecká, J. Uncertainty aware proposal segmentation for unknown object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 241–250.
Method of Similarity | Oxford5k | Paris6k |
---|---|---|
LPIPS | 0.782 | 0.824 |
Cosine similarity | 0.667 | 0.749 |
L2 | 0.645 | 0.723 |
Network in LPIPS | Oxford5k | Paris6k |
---|---|---|
VGG | 0.782 | 0.824 |
AlexNet | 0.712 | 0.764 |
SqueezeNet | 0.765 | 0.803 |
Instance Detection Method | Oxford5k | Paris6k |
---|---|---|
Faster R-CNN | 0.778 | 0.834 |
YOLOv11 | 0.769 | 0.816 |
DETR | 0.782 | 0.824 |
Method | Oxford5k | Paris6k |
---|---|---|
CroW [25] | 0.708 | 0.797 |
ReSW [26] | 0.726 | 0.824 |
DTFH [27] | 0.703 | 0.809 |
SBA [28] | 0.720 | 0.823 |
ViT-16B [29] | 0.647 | 0.878 |
HAF [30] | 0.695 | 0.783 |
TransVLAD [31] | 0.764 | 0.828 |
CWAH [32] | 0.753 | 0.854 |
Ours | 0.782 | 0.834 |
Number of Instances | Method | mAP |
---|---|---|
One | LPIPS | 0.856 |
One | Cosine Similarity | 0.798 |
One | L2 | 0.743 |
Two | LPIPS | 0.748 |
Two | Cosine Similarity | 0.684 |
Two | L2 | 0.646 |
Three | LPIPS | 0.706 |
Three | Cosine Similarity | 0.624 |
Three | L2 | 0.598 |
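The tables report mean average precision (mAP). For readers unfamiliar with the metric, the sketch below shows one common way to compute it from ranked result lists. The function names and the toy identifiers are assumptions made for this example, and benchmark evaluation scripts may differ in details (for instance, in how ambiguous or "junk" images are treated), so this should not be read as the exact protocol behind the numbers above.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values measured at the rank of each relevant item."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Relevant items that were never retrieved contribute zero precision.
    return precision_sum / len(relevant)


def mean_average_precision(rankings, ground_truth):
    """Mean of the per-query average precision.

    rankings maps a query id to its ranked result list; ground_truth maps the
    same query id to the collection of relevant ids.
    """
    scores = [average_precision(rankings[q], ground_truth[q]) for q in rankings]
    return sum(scores) / len(scores)


# Toy example with two queries (identifiers are made up).
toy_rankings = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
toy_truth = {"q1": {"a", "c"}, "q2": {"y"}}
print(mean_average_precision(toy_rankings, toy_truth))
```

In the toy example, query `q1` retrieves its relevant items at ranks 1 and 3, giving an average precision of (1/1 + 2/3) / 2, and query `q2` retrieves its single relevant item at rank 2, giving 1/2; the mAP is the mean of the two.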
Category | mAP |
---|---|
Car | 0.745 |
Bottle | 0.665 |
Book | 0.647 |
Car logo | 0.784 |
Person | 0.452 |
User Frame Selection | Box Type | mAP |
---|---|---|
Accurate | IoU > 0.95 | 0.764 |
Large | Area 20% larger than GT frame | 0.679 |
Small | Area 20% smaller than GT frame | 0.475 |
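The box types in the table above are defined relative to the ground-truth (GT) frame through intersection over union (IoU) and box area. The following minimal sketch shows how these two quantities can be computed for axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the function names are assumptions made for this example, not details taken from the experiment.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2); empty boxes give 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    inter_box = (
        max(box_a[0], box_b[0]),
        max(box_a[1], box_b[1]),
        min(box_a[2], box_b[2]),
        min(box_a[3], box_b[3]),
    )
    inter = box_area(inter_box)
    union = box_area(box_a) + box_area(box_b) - inter
    return inter / union if union > 0 else 0.0


def area_ratio(user_box, gt_box):
    """Ratio of the user-drawn box area to the ground-truth box area."""
    return box_area(user_box) / box_area(gt_box)


# Toy example: a user box shifted halfway across the ground-truth box.
gt = (0.0, 0.0, 10.0, 10.0)
user = (5.0, 0.0, 15.0, 10.0)
print(iou(gt, user))
print(area_ratio(user, gt))
```

With these helpers, a user selection could be labelled "Accurate" when `iou` exceeds 0.95, and otherwise compared against the GT frame by `area_ratio`.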
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Li, Z.; Lu, S.; Yuan, Z.; Hou, B.; Bian, J. Interactive Instance Search: User-Centered Enhanced Image Retrieval with Learned Perceptual Image Patch Similarity. Electronics 2025, 14, 1766. https://doi.org/10.3390/electronics14091766