Foundational Models for Pathology and Endoscopy Images: Application for Gastric Inflammation
Abstract
1. Introduction
1.1. AI in Upper GI Cancer: Transforming Detection and Surveillance
1.2. Expanding Horizons: Foundation Models in Endoscopy and Pathology Imaging
- We conduct a thorough review of FMs applied to pathology and endoscopy imaging, beginning with their architecture types, training objectives, and large-scale training. We then classify these models by prompting type into visually and textually prompted models and discuss their subsequent applications.
- We also discuss the open challenges and unresolved issues associated with FMs in pathology and endoscopy imaging.
2. Inclusion and Search Criteria
3. Foundation Models in Computer Vision
3.1. Architecture Type
3.2. Training Objectives
3.2.1. Contrastive Objectives
3.2.2. Generative Objectives
3.3. Large-Scale Training
4. Pathology Foundation Models
4.1. Visually Prompted Models
4.1.1. Pathology Image Segmentation
4.1.2. Pathology Image Classification
4.2. Textually Prompted Models
Pathology Image Classification
5. Endoscopy Foundation Models
Visually Prompted Models
6. Challenges and Future Work
7. Conclusions
Author Contributions
Funding
Conflicts of Interest
Abbreviations
AI | Artificial Intelligence |
GI | Gastrointestinal |
GC | Gastric Cancer |
IM | Intestinal Metaplasia |
SOTA | State-Of-The-Art |
FM | Foundation Model |
PRISMA | Preferred Reporting Items for Systematic Reviews and Meta-Analyses |
LLM | Large Language Model |
SAM | Segment Anything Model |
WSI | Whole-Slide Imaging |
TCGA | The Cancer Genome Atlas |
MeSH | Medical Subject Headings |
ITC | Image–Text Contrastive |
ITM | Image–Text Matching |
SimCLR | Simple Contrastive Learning of Representations |
MIM | Masked Image Modeling |
MLM | Masked Language Modeling |
NLP | Natural Language Processing |
MMM | Masked Multi-Modal Modeling |
IMLM | Image-conditioned Masked Language Modeling |
ITG | Image-grounded Text Generation |
CapPa | Captioning with Parallel Prediction |
HIPT | Hierarchical Image Pyramid Transformer |
ViT | Vision Transformer |
PAIP | Pathology AI Platform |
MAE | Masked AutoEncoder |
CITE | Connect Image and Text Embeddings |
LoRA | Low-Rank Adaptation |
Appendix A
References
- Yoon, H.; Kim, N. Diagnosis and management of high risk group for gastric cancer. Gut Liver 2015, 9, 5. [Google Scholar] [CrossRef]
- Pimentel-Nunes, P.; Libânio, D.; Marcos-Pinto, R.; Areia, M.; Leja, M.; Esposito, G.; Garrido, M.; Kikuste, I.; Megraud, F.; Matysiak-Budnik, T.; et al. Management of epithelial precancerous conditions and lesions in the stomach (MAPS II): European Society of Gastrointestinal Endoscopy (ESGE), European Helicobacter and Microbiota Study Group (EHMSG), European Society of Pathology (ESP), and Sociedade Portuguesa de Endoscopia Digestiva (SPED) guideline update 2019. Endoscopy 2019, 51, 365–388. [Google Scholar]
- Matysiak-Budnik, T.; Camargo, M.; Piazuelo, M.; Leja, M. Recent guidelines on the management of patients with gastric atrophy: Common points and controversies. Dig. Dis. Sci. 2020, 65, 1899–1903. [Google Scholar] [CrossRef]
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
- Radford, A.; Kim, J.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 4904–4916. [Google Scholar]
- Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; Hoi, S.C.H. Align before fuse: Vision and language representation learning with momentum distillation. Adv. Neural Inf. Process. Syst. 2021, 34, 9694–9705. [Google Scholar]
- Yao, L.; Huang, R.; Hou, L.; Lu, G.; Niu, M.; Xu, H.; Liang, X.; Li, Z.; Jiang, X.; Xu, C. Filip: Fine-grained interactive language-image pre-training. arXiv 2021, arXiv:2111.07783. [Google Scholar]
- Wang, Z.; Lu, Y.; Li, Q.; Tao, X.; Guo, Y.; Gong, M.; Liu, T. Cris: Clip-driven referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11686–11695. [Google Scholar]
- Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.-N.; et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10965–10975. [Google Scholar]
- Xu, J.; De Mello, S.; Liu, S.; Byeon, W.; Breuel, T.; Kautz, J.; Wang, X. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Yang, J.; Li, C.; Zhang, P.; Xiao, B.; Liu, C.; Yuan, L.; Gao, J. Unified contrastive learning in image-text-label space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 19163–19173. [Google Scholar]
- Zhang, H.; Zhang, P.; Hu, X.; Chen, Y.-C.; Li, L.; Dai, X.; Wang, L.; Yuan, L.; Hwang, J.-N.; Gao, J. Glipv2: Unifying localization and vision-language understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36067–36080. [Google Scholar]
- Bao, H.; Dong, L.; Wei, F. Beit: Bert pre-training of image transformers. arXiv 2021, arXiv:2106.08254. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Zhang, X.; Zeng, Y.; Zhang, J.; Li, H. Toward building general foundation models for language, vision, and vision-language understanding tasks. arXiv 2023, arXiv:2301.05065. [Google Scholar]
- Singh, A.; Hu, R.; Goswami, V.; Couairon, G.; Galuba, W.; Rohrbach, M.; Kiela, D. Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 15638–15650. [Google Scholar]
- Hao, Y.; Song, H.; Dong, L.; Huang, S.; Chi, Z.; Wang, W.; Ma, S.; Wei, F. Language models are general-purpose interfaces. arXiv 2022, arXiv:2206.06336. [Google Scholar]
- Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv 2023, arXiv:2301.12597. [Google Scholar]
- Tschannen, M.; Kumar, M.; Steiner, A.; Zhai, X.; Houlsby, N.; Beyer, L. Image captioners are scalable vision learners too. arXiv 2023, arXiv:2306.07915. [Google Scholar]
- Chen, Y.-C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. Uniter: Universal image-text representation learning. In Computer Vision—ECCV 2020: 16th European Conference, Part XXX; Springer: Cham, Switzerland, 2020; pp. 104–120. [Google Scholar]
- Tsimpoukelli, M.; Menick, J.L.; Cabi, S.; Eslami, S.M.; Vinyals, O.; Hill, F. Multimodal few-shot learning with frozen language models. Adv. Neural Inf. Process. Syst. 2021, 34, 200–212. [Google Scholar]
- Xu, H.; Zhang, J.; Cai, J.; Rezatofighi, H.; Yu, F.; Tao, D.; Geiger, A. Unifying flow, stereo and depth estimation. arXiv 2022, arXiv:2211.05783. [Google Scholar] [CrossRef]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4015–4026. [Google Scholar]
- Deng, R.; Cui, C.; Liu, Q.; Yao, T.; Remedios, L.W.; Bao, S.; Landman, B.A.; Wheless, L.E.; Coburn, L.A.; Wilson, K.T.; et al. Segment anything model (sam) for digital pathology: Assess zero-shot segmentation on whole slide imaging. arXiv 2023, arXiv:2304.04155. [Google Scholar]
- Cui, C.; Deng, R.; Liu, Q.; Yao, T.; Bao, S.; Remedios, L.W.; Tang, Y.; Huo, Y. All-in-sam: From weak annotation to pixel-wise nuclei segmentation with prompt-based finetuning. arXiv 2023, arXiv:2307.00290. [Google Scholar] [CrossRef]
- Zhang, J.; Ma, K.; Kapse, S.; Saltz, J.; Vakalopoulou, M.; Prasanna, P.; Samaras, D. Sam-path: A segment anything model for semantic segmentation in digital pathology. arXiv 2023, arXiv:2307.09570. [Google Scholar]
- Israel, U.; Marks, M.; Dilip, R.; Li, Q.; Yu, C.; Laubscher, E.; Li, S.; Schwartz, M.; Pradhan, E.; Ates, A.; et al. A foundation model for cell segmentation. bioRxiv 2023, preprint. [Google Scholar]
- Archit, A.; Nair, S.; Khalid, N.; Hilt, P.; Rajashekar, V.; Freitag, M.; Gupta, S.; Dengel, A.; Ahmed, S.; Pape, C. Segment anything for microscopy. bioRxiv 2023, preprint. [Google Scholar]
- Li, X.; Deng, R.; Tang, Y.; Bao, S.; Yang, H.; Huo, Y. Leverage Weakly Annotation to Pixel-wise Annotation via Zero-shot Segment Anything Model for Molecular-empowered Learning. arXiv 2023, arXiv:2308.05785. [Google Scholar]
- Chen, R.J.; Chen, C.; Li, Y.; Chen, T.Y.; Trister, A.D.; Krishnan, R.G.; Mahmood, F. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16144–16155. [Google Scholar]
- Wang, X.; Yang, S.; Zhang, J.; Wang, M.; Zhang, J.; Yang, W.; Huang, J.; Han, X. Transformer-based unsupervised contrastive learning for histopathological image classification. Med. Image Anal. 2022, 81, 102559. [Google Scholar] [CrossRef]
- Ciga, O.; Xu, T.; Martel, A.L. Self supervised contrastive learning for digital histopathology. Mach. Learn. Appl. 2022, 7, 100198. [Google Scholar] [CrossRef]
- Azizi, S.; Culp, L.; Freyberg, J.; Mustafa, B.; Baur, S.; Kornblith, S.; Chen, T.; Tomasev, N.; Mitrović, J.; Strachan, P.; et al. Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nat. Biomed. Eng. 2023, 7, 756–779. [Google Scholar] [CrossRef]
- Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. Dinov2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
- Vorontsov, E.; Bozkurt, A.; Casson, A.; Shaikovski, G.; Zelechowski, M.; Liu, S.; Severson, K.; Zimmermann, E.; Hall, J.; Tenenholtz, N.; et al. Virchow: A million-slide digital pathology foundation model. arXiv 2023, arXiv:2309.07778. [Google Scholar]
- Roth, B.; Koch, V.; Wagner, S.J.; Schnabel, J.A.; Marr, C.; Peng, T. Low-resource finetuning of foundation models beats state-of-the-art in histopathology. arXiv 2024, arXiv:2401.04720. [Google Scholar]
- Chen, R.J.; Ding, T.; Lu, M.Y.; Williamson, D.F.K.; Jaume, G.; Chen, B.; Zhang, A.; Shao, D.; Song, A.H.; Shaban, M.; et al. A general-purpose self-supervised model for computational pathology. arXiv 2023, arXiv:2308.15474. [Google Scholar]
- Filiot, A.; Ghermi, R.; Olivier, A.; Jacob, P.; Fidon, L.; Mac Kain, A.; Saillard, C.; Schiratti, J.-B. Scaling self-supervised learning for histopathology with masked image modeling. medRxiv 2023. 2023-07. [Google Scholar] [CrossRef]
- Campanella, G.; Kwan, R.; Fluder, E.; Zeng, J.; Stock, A.; Veremis, B.; Polydorides, A.D.; Hedvat, C.; Schoenfeld, A.; Vanderbilt, C.; et al. Computational pathology at health system scale–self-supervised foundation models from three billion images. arXiv 2023, arXiv:2310.07033. [Google Scholar]
- Dippel, J.; Feulner, B.; Winterhoff, T.; Schallenberg, S.; Dernbach, G.; Kunft, A.; Tietz, S.; Jurmeister, P.; Horst, D.; Ruff, L.; et al. RudolfV: A Foundation Model by Pathologists for Pathologists. arXiv 2024, arXiv:2401.04079. [Google Scholar]
- Xu, H.; Usuyama, N.; Bagga, J.; Zhang, S.; Rao, R.; Naumann, T.; Wong, C.; Gero, Z.; González, J.; Gu, Y.; et al. A whole-slide foundation model for digital pathology from real-world data. Nature 2024, 630, 181–188. [Google Scholar] [CrossRef] [PubMed]
- Naseem, U.; Khushi, M.; Kim, J. Vision-language transformer for interpretable pathology visual question answering. IEEE J. Biomed. Health Inform. 2022, 27, 1681–1690. [Google Scholar] [CrossRef]
- He, X.; Zhang, Y.; Mou, L.; Xing, E.; Xie, P. Pathvqa: 30,000+ questions for medical visual question answering. arXiv 2020, arXiv:2003.10286. [Google Scholar]
- Huang, Z.; Bianchi, F.; Yuksekgonul, M.; Montine, T.; Zou, J. Leveraging medical twitter to build a visual–language foundation model for pathology ai. bioRxiv 2023. 2023-03. [Google Scholar] [CrossRef]
- Sun, Y.; Zhu, C.; Zheng, S.; Zhang, K.; Shui, Z.; Yu, X.; Zhao, Y.; Li, H.; Zhang, Y.; Zhao, R.; et al. Pathasst: Redefining pathology through generative foundation ai assistant for pathology. arXiv 2023, arXiv:2305.15072. [Google Scholar]
- Lu, M.Y.; Chen, B.; Zhang, A.; Williamson, D.F.; Chen, R.J.; Ding, T.; Le, L.P.; Chuang, Y.S.; Mahmood, F. Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Lu, M.Y.; Chen, B.; Williamson, D.F.; Chen, R.J.; Liang, I.; Ding, T.; Jaume, G.; Odintsov, I.; Zhang, A.; Le, L.P.; et al. Towards a visual-language foundation model for computational pathology. arXiv 2023, arXiv:2307.12914. [Google Scholar]
- Zhang, Y.; Gao, J.; Zhou, M.; Wang, X.; Qiao, Y.; Zhang, S.; Wang, D. Text-guided foundation model adaptation for pathological image classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2023; pp. 272–282. [Google Scholar]
- Weinstein, J.N.; Collisson, E.A.; Mills, G.B.; Shaw, K.R.; Ozenberger, B.A.; Ellrott, K.; Shmulevich, I.; Sander, C.; Stuart, J.M. The cancer genome atlas pan-cancer analysis project. Nat. Genet. 2013, 45, 1113–1120. [Google Scholar] [CrossRef]
- Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9650–9660. [Google Scholar]
- Kim, Y.J.; Jang, H.; Lee, K.; Park, S.; Min, S.-G.; Hong, C.; Park, J.H.; Lee, K.; Kim, J.; Hong, W.; et al. Paip 2019: Liver cancer segmentation challenge. Med. Image Anal. 2021, 67, 101854. [Google Scholar] [CrossRef]
- Zhou, J.; Wei, C.; Wang, H.; Shen, W.; Xie, C.; Yuille, A.; Kong, T. ibot: Image bert pre-training with online tokenizer. arXiv 2021, arXiv:2111.07832. [Google Scholar]
- Wang, X.; Du, Y.; Yang, S.; Zhang, J.; Wang, M.; Zhang, J.; Yang, W.; Huang, J.; Han, X. Retccl: Clustering-guided contrastive learning for whole-slide image retrieval. Med. Image Anal. 2023, 83, 102645. [Google Scholar] [CrossRef]
- Yu, J.; Wang, Z.; Vasudevan, V.; Yeung, L.; Seyedhosseini, M.; Wu, Y. Coca: Contrastive captioners are image-text foundation models. arXiv 2022, arXiv:2205.01917. [Google Scholar]
- Wang, Z.; Liu, C.; Zhang, S.; Dou, Q. Foundation model for endoscopy video analysis via large-scale self-supervised pre-train. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2023; pp. 101–111. [Google Scholar]
- Cui, B.; Islam, M.; Bai, L.; Ren, H. Surgical-DINO: Adapter Learning of Foundation Model for Depth Estimation in Endoscopic Surgery. arXiv 2024, arXiv:2401.06013. [Google Scholar] [CrossRef] [PubMed]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
- Cheng, Y.; Li, L.; Xu, Y.; Li, X.; Yang, Z.; Wang, W.; Yang, Y. Segment and track anything. arXiv 2023, arXiv:2305.06558. [Google Scholar]
- Song, Y.; Yang, M.; Wu, W.; He, D.; Li, F.; Wang, J. It takes two: Masked appearance-motion modeling for self-supervised video transformer pre-training. arXiv 2022. [Google Scholar]
- Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 248. [Google Scholar] [CrossRef]
- Hoelscher-Obermaier, J.; Persson, J.; Kran, E.; Konstas, I.; Barez, F. Detecting edit failures in large language models: An improved specificity benchmark. arXiv 2023, arXiv:2305.17553. [Google Scholar]
- Lekadir, K.; Feragen, A.; Fofanah, A.J.; Frangi, A.F.; Buyx, A.; Emelie, A.; Lara, A.; Porras, A.R.; Chan, A.; Navarro, A.; et al. FUTURE-AI: International consensus guideline for trustworthy and deployable artificial intelligence in healthcare. arXiv 2023, arXiv:2309.12325. [Google Scholar]
Model | Training Data | Key Features | Outcomes/Performance |
---|---|---|---|
(A) Visually Prompted Image Segmentation | |||
SAM [24] | [25]: Cell nuclei segmentation on WSI | Prompting with a single positive point, 20 point prompts, and comprehensive points or bounding boxes (see the code sketch after this table) | Exceptional performance in segmenting large, connected objects
SAM | [26]: Generate pixel-level annotations from points and bounding boxes | Label-efficient finetuning of SAM, with no requirement for annotation prompts during inference | Minimizes annotation efforts without compromising segmentation accuracy
SAM | [27]: SAM-Path for semantic segmentation of pathology images | Extend SAM by incorporating trainable class prompts, augmented further with a pathology-specific encoder | Improves SAM’s capability for performing pathology semantic segmentation |
SAM | [28]: Cell-SAM for cell segmentation | Extend SAM by introducing CellFinder as a novel prompt engineering technique that cues SAM to produce segmentation | Achieves SOTA performance in segmenting images of mammalian cells |
SAM | [29]: Segmenting objects in multi-dimensional microscopy data | Extend SAM by training specialized models for microscopy data | Significantly improve segmentation quality for a wide range of imaging conditions |
SAM | [30]: Cell segmentation in digital pathology | Use SAM’s ability to produce pixel-level annotations from box annotations to train a segmentation model | Diminish the labeling efforts for lay annotators by only requiring weak box annotations |
(B) Visually Prompted Image Classification | |||
HIPT [31] | 100 M patches from 11,000 WSIs (TCGA) | Hierarchical Image Pyramid Transformer, student–teacher knowledge distillation | Outperforms SOTA for cancer subtyping and survival prediction
CTransPath [32] | 15 M patches from 30,000 WSIs (TCGA & PAIP) | Swin transformer with a convolutional backbone | Potential to be a universal model for various histopathological image applications |
ResNet [33] | 25,000 WSIs, 39,000 patches (TCGA and others) | Trained on extensive patch dataset | Combining multiple multi-organ datasets with various types of staining and resolution improves the quality of the learned features
REMEDIS [34] | 29,000 WSIs (TCGA) | SimCLR framework | Improves diagnostic accuracies when compared to supervised baseline models |
DINOv2 [35] | Histopathology data | Vision transformer; [36]: a large model trained on an extensive TCGA dataset; [37]: benchmarking against CTransPath and RetCCL as feature extractors | Surpasses SOTA in pan-cancer detection and subtyping [36]; comparable or better performance with less training [37]
UNI [38], ViT-Base [39] | 100 K proprietary slides [38], 43 M patches from 6000 WSIs (TCGA) [39] | [38]: ViT-Large model, 100 million tissue patches, 20 major organ types; [39]: iBOT framework, surpasses HIPT and CTransPath in TCGA tasks | Generalizable across 33 pathology tasks [38]; superior performance in TCGA evaluation tasks [39]
Largest Academic Model [40] | 3 billion patches from over 400,000 slides | MAE and DINO pre-training, comparison on pathology data vs. natural images | Superior downstream performance with DINO |
RudolfV [41] | Less extensive dataset than competitors | Integrating pathological domain knowledge | Superior performance with fewer parameters
Prov-GigaPath [42] | 1.3 billion pathology image tiles in 171,189 whole slides from Providence | Vision transformer with dilated self-attention | SOTA performance on various pathology tasks, demonstrating the importance of real-world data and whole-slide modeling |
(C) Textually Prompted Image Classification | |||
TraP-VQA [43] | PathVQA [44] | A vision–language transformer for processing pathology images | Outperformed the SOTA comparative methods on the public PathVQA dataset
Huang et al. [45] | Image–text paired pathology data sourced from public platforms, including Twitter | A contrastive language–image pre-training model (see the code sketch after this table) | Promising zero-shot capabilities in classifying new pathological images
PathAsst [46] | Over 180,000 samples produced with ChatGPT/GPT-4 | Vicuna-13B language model in conjunction with the CLIP vision encoder | Underscores the capability of leveraging AI-powered generative foundation models to enhance pathology diagnoses
MI-Zero [47] | 550,000 pathology reports along with other available in-domain text corpora | Reimagines zero-shot transfer within the context of multiple instance learning | Potential usefulness for semi-supervised learning workflows in histopathology |
CONCH [48] | 1.17 million image–caption pairs derived from educational materials and PubMed articles | ViT backbone and iBOT self-supervised framework, integrates a vision–language joint embedding space | Potential to directly facilitate machine learning-based workflows requiring minimal or no further supervised fine-tuning
CITE [49] | PatchGastric stomach tumor pathological image dataset | Language models pre-trained on a wide array of biomedical texts to enhance foundation models for better understanding of pathological images | Achieves leading performance compared to various baselines, especially when training data are scarce
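To make the two prompting families in the table concrete, the following minimal sketch illustrates the visually prompted setting of part (A): point-prompted segmentation of a single pathology patch with SAM. It assumes the open-source segment-anything package and a downloaded ViT-H checkpoint; the checkpoint path, image path, and point coordinates are illustrative placeholders rather than settings from the cited studies.

```python
# Minimal sketch: point-prompted segmentation of a pathology patch with SAM.
# Assumes `pip install segment-anything opencv-python` and a downloaded ViT-H
# checkpoint; all paths and coordinates below are placeholders.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

# Load one whole-slide image patch as an RGB array (placeholder file name).
image = cv2.cvtColor(cv2.imread("wsi_patch.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single positive point prompt placed on a structure of interest (illustrative coordinates).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[120, 85]]),
    point_labels=np.array([1]),   # 1 = foreground point, 0 = background point
    multimask_output=True,        # return several candidate masks ranked by predicted IoU
)
best_mask = masks[scores.argmax()]  # boolean mask of the prompted object
```

For the textually prompted models in part (C), the sketch below performs CLIP-style zero-shot classification of a patch by comparing its embedding with natural-language class prompts. The general-purpose openai/clip-vit-base-patch32 checkpoint and the gastric class prompts are hypothetical stand-ins; the cited pathology models rely on domain-specific encoders and curated prompt ensembles.

```python
# Minimal sketch: CLIP-style zero-shot classification of a pathology patch.
# Checkpoint, file path, and prompts are illustrative, not those of the cited works.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"  # general-purpose stand-in checkpoint
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

prompts = [                                # hypothetical class prompts
    "an H&E image of normal gastric mucosa",
    "an H&E image of chronic gastritis",
    "an H&E image of intestinal metaplasia",
    "an H&E image of gastric adenocarcinoma",
]
image = Image.open("wsi_patch.png")        # placeholder patch path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)  # image-text similarity -> probabilities
for prompt, p in zip(prompts, probs[0].tolist()):
    print(f"{p:.3f}  {prompt}")
```

In this zero-shot setting, the choice of label set and prompt wording strongly influences accuracy, which is one reason the textually prompted pathology models summarized above invest in domain-specific pre-training and prompt design.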
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).