A Data Ingestion Procedure towards a Medical Images Repository
Abstract
1. Introduction
2. ALPACS: Current Status
2.1. Data Providers
2.2. Repository Users
2.3. Key Systems/Actors
2.3.1. HAPI/FHIR
2.3.2. PACS Orthanc
2.3.3. Mirth Connect Integration Engine
2.3.4. Anonymizing Service
2.3.5. VMs and Docker
2.3.6. CE-Net and XCeption Deep Neural Network
2.4. The COVID-19 ALPACS Repository
2.5. Empirical Evaluation of PROXIMITY 1.0
- Test 1: PROXIMITY displayed two images with condensation. The remaining eight (8) were cuts from typical cases or findings unrelated to the search. Five images corresponded to the same exam. An improvement for PROXIMITY 2.0 is the filtering of repeated images. In addition, the displayed cuts differ in brightness and contrast from the original.
- Test 2: PROXIMITY was configured to display exams from different patients. Three (3) images had similar findings or some relationship with the search, and seven (7) had unrelated findings, i.e., they were normal or bore no similarity to the search. The brightness and contrast problem remained, except for one image with adequate brightness.
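The two tests can be summarized numerically. The following is a minimal sketch, assuming each test returned ten images and that the images described as related to the search count as relevant; the function names and the `(exam_id, image_id)` identifiers are hypothetical and are not part of PROXIMITY. It computes the fraction of relevant results and illustrates one possible way to filter repeated images from the same exam, as proposed for PROXIMITY 2.0.

```python
def precision_at_k(relevance: list[bool]) -> float:
    """Fraction of the returned images judged relevant to the query."""
    if not relevance:
        raise ValueError("relevance list must not be empty")
    return sum(relevance) / len(relevance)


def filter_repeated_exams(ranked: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the best-ranked image of each exam, preserving rank order."""
    seen_exams: set[str] = set()
    unique: list[tuple[str, str]] = []
    for exam_id, image_id in ranked:
        if exam_id not in seen_exams:
            seen_exams.add(exam_id)
            unique.append((exam_id, image_id))
    return unique


# Relevance judgments as reported: 2 of 10 in Test 1, 3 of 10 in Test 2.
test1 = [True] * 2 + [False] * 8
test2 = [True] * 3 + [False] * 7
print(precision_at_k(test1))  # 0.2
print(precision_at_k(test2))  # 0.3

# Hypothetical ranked list in which five images come from the same exam.
ranked = [("exam-A", f"img-{i}") for i in range(5)] + [("exam-B", "img-5")]
print(filter_repeated_exams(ranked))  # [('exam-A', 'img-0'), ('exam-B', 'img-5')]
```

Filtering by exam before display would free result slots for images from other exams, which may be one way to address the repetition observed in Test 1.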
2.6. Cloud Computing for ALPACS
2.6.1. Replicating On-Premise Production Environment
2.6.2. Automation of Data Transfer
- Inbound Load: The first data flow goes from the user to the NFS storage (Figure 1). It starts when the User instantiates an SCU to transmit DICOM files from its local storage to the intermediary server (IBHIS Server), which accepts TCP requests on a specific port and balances the load over TCP/DICOM channels. We chose HAProxy as the load balancer because of its popularity, but any commercial or open-source alternative, such as NGINX, can be used. The objective is to be able to scale up the transfer if needed. The files are written to memory and then relocated to shared storage (S3 Bucket Storage). The IBHIS server must provide channels to transport files from a source to a destination. Here MirthConnect acts as the SCP, which allows us to cache inbound files.
- Curated Load: The second data flow goes from the shared storage to the PACS (Figure 2). It starts when the broker (IBHIS Server) detects that DICOM files have been uploaded to the shared storage through MirthConnect. Thus, it is ready to re-send them via TCP/DICOM. The PACS Server accepts TCP requests on a specific port and balances the load on SCU instances, which register the files in PACS (SCP) through DICOMweb. The Orthanc service stores DICOM files and provides DIMSE services and web services.
- Archived Load: The third data stream is used to back up the archived data, which are formatted according to the PACS functions.
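The detection and balancing steps of the Curated Load can be illustrated with a short sketch. It is not the ALPACS implementation, where MirthConnect and the load balancer perform these roles; it only shows the idea using the Python standard library. The `.dcm` file extension, the function names, and the SCU instance names are assumptions made for illustration.

```python
from itertools import cycle
from pathlib import Path


def find_new_dicom_files(shared_dir: Path, already_sent: set[str]) -> list[Path]:
    """Return files in shared storage that have not been forwarded yet.

    Assumes DICOM files carry a .dcm extension (real archives may not).
    """
    return sorted(
        p for p in shared_dir.rglob("*.dcm") if str(p) not in already_sent
    )


def assign_round_robin(files: list[Path], scu_instances: list[str]) -> dict[str, list[Path]]:
    """Spread the files over the SCU instances in round-robin order."""
    plan: dict[str, list[Path]] = {name: [] for name in scu_instances}
    for path, name in zip(files, cycle(scu_instances)):
        plan[name].append(path)
    return plan
```

A broker loop would call `find_new_dicom_files` periodically, hand each batch to `assign_round_robin`, and record the forwarded paths in `already_sent` once the PACS confirms storage. Round-robin is only one possible balancing policy.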
2.6.3. Implementation of ALPACS in the Cloud
2.6.4. HAPI/FHIR Filling
3. Data Ingestion Procedure
3.1. Ingestion Roadmap
- Opportunity Identification: The procedure begins with identifying common objectives between the repository and the health provider. The opportunity/benefit that the provider would obtain from providing the data to an external repository must be agreed upon, i.e., the services they would access by providing their data. This benefit can be clinical, scientific, technological, pedagogical, etc. The focus is on the repository, as it has to evaluate the feasibility of ingestion and the data mining services that can be generated from the available data. The result is a document detailing the objectives and scope of the ingestion.
- General Conditions: Each health provider has different realities, such as policies, budgets, and governance. In this step, it is essential to agree upon the general transfer conditions that guarantee ingestion, such as necessary authorizations. The focus is on the provider, as it must approve this procedure internally. The result is a formal document supported by an ethics committee that authorizes collaboration between the provider and repository while overseeing data transfer and use.
- Technical Study: When use of the data has been authorized, the repository must evaluate the technical feasibility of the transfer, including sizing the volume of the dataset, its formats, necessary transformations, its completeness, level of curation, the condition of the data, and the telematic networks and interoperable services used for automated transfer. The focus is on the repository, as its technical staff must complete the respective queries. The result is a document on the dataset profile that allows for analysis of the possible services to be offered and the specific data required for the transfer.
- Transfer Constraints: Any data transfer uses health provider resources and must comply with policies, contracts, and legal constraints regarding data usage. Here, the data must be made available, i.e., resources must be assigned to carry out the transfer within the constraints mentioned above. The focus is on the provider, as its constraints shape this stage. The result is evidenced by a formal agreement/contract that enables the transfer.
- Service Development: The repository must develop the necessary modifications to its interoperable services in order to receive/ingest the data correctly. This includes allocating the required storage, establishing communication channels, and defining data access services in a way that is interoperable with the data models. The focus is on the repository, as it must modify its software components to support the new dataset. The (internal) result is a new architecture and data model for the dataset.
- Automated Transfer: When ingestion services are available, the provider transfers data automatically through the agreed-upon channels. The repository monitors this process, and the analysis process begins for the next curation step. The provider must assign someone responsible for restarting the transfer in case of failure. The process may need to take place over several weeks so as not to impact the provider’s networks. The focus is on the provider, who sends the data and on whom the transfer rates depend. The result is a transfer report that shows the dataset transferred to the repository.
- Curation and Transformation: This step is the repository’s internal processing of the data: the data must be reviewed, selected, imputed, organized, and transformed to constitute curated and coherent datasets. The process specifications must respect the conditions of Step 2, and the provider does not need to take part in this process. The focus is on the repository, as its staff and systems must perform this process semi-automatically. In addition to the curated data, the result is documentation of the modifications and adaptations of the data, which is delivered to the provider’s users so that they can use the datasets without uncertainty.
- Service Generation: All of the services developed in Step 5 are deployed for the provider’s use and enjoyment. The focus is on the provider, as its medical departments must be integrated into the repository’s computer system, from enabling access to remote repository services to integrating the services into the provider’s RIS/PACS and HIS. The result is a service deployment architecture that integrates the repository and provider while clarifying responsibilities between the parties for maintenance, updating, and incorporation of new services.
- Operational Services: The data access services generated by this ingestion procedure must be communicated concisely to the target audience determined in Step 2, whether they are part of the provider, the repository, or external. For this, a service catalog must be generated based on the inputs of all the previous results. This catalog must specify the form of use, its constraints, and the description of each generated dataset.
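The nine steps above can be captured as a small data structure, which makes the alternation of focus between repository and provider explicit. This is a minimal sketch: the `Step` class, the `ROADMAP` list, and the `steps_for` helper are hypothetical names, and the `result` strings are shortened paraphrases of the results described in each step. Step 9 is recorded without a focus because the text does not state one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    number: int
    name: str
    focus: Optional[str]  # "repository", "provider", or None if not stated
    result: str


ROADMAP = [
    Step(1, "Opportunity Identification", "repository", "objectives and scope document"),
    Step(2, "General Conditions", "provider", "authorization supported by an ethics committee"),
    Step(3, "Technical Study", "repository", "dataset profile document"),
    Step(4, "Transfer Constraints", "provider", "formal agreement/contract"),
    Step(5, "Service Development", "repository", "new architecture and data model"),
    Step(6, "Automated Transfer", "provider", "transfer report"),
    Step(7, "Curation and Transformation", "repository", "curated data and documentation"),
    Step(8, "Service Generation", "provider", "service deployment architecture"),
    Step(9, "Operational Services", None, "service catalog"),
]


def steps_for(focus: str) -> list[str]:
    """Names of the steps whose focus is the given party."""
    return [step.name for step in ROADMAP if step.focus == focus]


print(steps_for("provider"))
```

Such a structure could also serve as a checklist when applying the procedure to a new provider, with each step marked complete once its result exists.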
3.2. Implementation with HCUCH
- Opportunity Identification: Chest radiology was identified as a suitable domain for ALPACS objectives. A COVID-19 data-driven collaboration with HCUCH defined the goal of using ALPACS and PROXIMITY results to perform Differential Diagnostic Studies (DDS).
- General Conditions: A project was drafted for approval by the HCUCH ethics committee detailing the medical objectives, data to be used (non-contrast chest CT scans from 2010 to 2020), and methodology to be used.
- Technical Study: We sized the data along with the approximate volume (initially, 42,000 CTs were projected), individual volume, PACS scalability, available storage, etc. This dataset was named HCUCH DS1.
- Transfer Constraints: The DS1 volume to be transmitted was technically approved by the HCUCH PACS administration. Negotiation with AGFA, the RIS/PACS provider company that carried out the transmission, ensured operational continuity of the RIS/PACS while establishing ratios, low transmission rates, and an expected transmission time of several weeks.
- Service Development: Services were adapted for the new DS1 data scale based on the ALPACS deployment. The main changes were that the provider carried out the anonymization and that a hybrid deployment was chosen, that is, while the original and curated data are hosted in ALPACS, the services usable by the provider are configured in the AWS cloud.
- Automated Transfer: Transfer tests have been completed and the transfer has begun.
- Curation and Transformation: The reconstructions with which the study will be carried out were agreed upon, as well as how to identify studies of the same anonymized patient. In addition, work was performed on the metadata collection systems from DICOM and radiological reports.
- Generation of Services: Strategies were defined for deployment in the HCUCH.
- Services Catalog: The services catalog of ALPACS.
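A back-of-the-envelope estimate shows why a low transmission rate leads to a transfer lasting several weeks. In the sketch below, only the figure of 42,000 projected CTs comes from the Technical Study; the average study size, the sustained rate, and the daily transfer window are assumed values chosen for illustration and were not reported for HCUCH.

```python
def transfer_days(n_studies: int, avg_study_mb: float,
                  rate_mbit_s: float, hours_per_day: float) -> float:
    """Estimated calendar days needed to move a dataset.

    Uses decimal units: 1 MB = 8 Mbit.
    """
    total_megabits = n_studies * avg_study_mb * 8
    seconds = total_megabits / rate_mbit_s
    return seconds / 3600 / hours_per_day


# 42,000 CTs (projected in the Technical Study); other values are assumptions.
days = transfer_days(n_studies=42_000, avg_study_mb=300,
                     rate_mbit_s=50, hours_per_day=8)
print(days)      # 70.0
print(days / 7)  # 10.0
```

Under these assumed values the transfer takes about ten weeks, which is consistent with the order of magnitude of "several weeks" agreed in the Transfer Constraints step. Changing any of the three assumed parameters scales the estimate proportionally.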
3.3. New Opportunity: Radiological Reports
- Opportunity Identification: The DS2 dataset complements the CTs of DS1.
- General Conditions: The data fit within the same authorization from the ethics committee.
- Technical Study: The data and approximate volume were already sized (33,000 reports).
- Transfer Constraints: DS2 transfer was coordinated via AGFA.
- Service Development: Services for this dataset are available, and storage usage is lower.
- Automated Transfer: A transfer is planned via interoperable protocols, although the volume of the DS2 dataset is smaller.
- Curation and Transformation: Those of ALPACS.
- Generation of Services: Those of ALPACS.
- Services Catalog: Those of ALPACS.
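For DS2 to complement DS1, each report must be matched to the CT studies of the same anonymized patient. The sketch below shows one common way to do this, a keyed hash that maps the same patient identifier to the same pseudonym. It is an assumption for illustration, not the scheme agreed with HCUCH; the function names, the secret key, and the record layout are hypothetical.

```python
import hashlib
import hmac
from collections import defaultdict


def pseudonym(patient_id: str, secret: bytes) -> str:
    """Deterministic pseudonym: the same ID and key always give the same value."""
    digest = hmac.new(secret, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def link_by_pseudonym(ct_studies: list[tuple[str, str]],
                      reports: list[tuple[str, str]]) -> dict[str, dict[str, list[str]]]:
    """Group CT studies and reports that share a pseudonym.

    Each input item is a (pseudonym, record_id) pair.
    """
    linked: dict[str, dict[str, list[str]]] = defaultdict(
        lambda: {"cts": [], "reports": []}
    )
    for pid, study_id in ct_studies:
        linked[pid]["cts"].append(study_id)
    for pid, report_id in reports:
        linked[pid]["reports"].append(report_id)
    return dict(linked)
```

Because the provider carried out the anonymization in this deployment, a function like `pseudonym` would run on the provider side with a key that never leaves the provider, and the repository would only see the resulting pseudonyms.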
4. Analysis of the Data Ingestion Procedure
4.1. Data Transfer Statistics
4.2. Developing PROXIMITY 2.0
4.2.1. Study Level Recommendation
4.2.2. Diversity and Novelty of Results
4.2.3. Automatic Annotation Extraction via NLP
4.2.4. PROXIMITY Optimization
4.2.5. New Training Mechanisms
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Solar, M.; Castañeda, V.; Ñanculef, R.; Dombrovskaia, L.; Araya, M. A Data Ingestion Procedure towards a Medical Images Repository. Sensors 2024, 24, 4985. https://doi.org/10.3390/s24154985