Article

The Automated Generation of Medical Reports from Polydactyly X-ray Images Using CNNs and Transformers

by Pablo de Abreu Vieira 1,2, Mano Joseph Mathew 2, Pedro de Alcantara dos Santos Neto 1 and Romuere Rodrigues Veloso e Silva 1,*
1 Department of Computing, Federal University of Piauí—UFPI, Teresina 64049-550, PI, Brazil
2 EFREI Research Laboratory, École d’Ingénieurs Généraliste du Numérique—EFREI, 75003 Paris, France
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(15), 6566; https://doi.org/10.3390/app14156566
Submission received: 20 June 2024 / Revised: 19 July 2024 / Accepted: 24 July 2024 / Published: 27 July 2024

Abstract:
Pododactyl radiography is a non-invasive procedure that enables the detection of foot pathologies, as it provides detailed images of structures such as the metatarsus and phalanges. This examination holds potential for use in computer-aided diagnosis (CAD) systems. Our proposed methodology employs generative artificial intelligence to analyze pododactyl radiographs and generate automatic medical reports. We used a dataset comprising 16,710 exams, each including images and the corresponding medical report. We implemented preprocessing of the images and text, as well as data augmentation techniques, to improve the representativeness of the dataset. The proposed CAD system integrates pre-trained CNNs for feature extraction from the images and Transformers for report interpretation and generation. Our objective is to provide reports describing pododactyl pathologies, such as plantar fasciitis, bunions, heel spurs, flat feet, and lesions, among others, offering a second opinion to the specialist. The results are promising, with BLEU scores (1 to 4) of 0.612, 0.552, 0.507, and 0.470, respectively, a METEOR score of 0.471, and a ROUGE-L score of 0.633, demonstrating the model’s ability to generate reports with quality close to that produced by specialists. We demonstrate that generative AI trained on pododactyl radiographs has the potential to assist in diagnoses from these examinations.

1. Introduction

The anatomy of human Polydactyly comprises a set of bones and joints responsible for the structure and mobility of the feet. This structure not only supports body weight and facilitates movement, but also plays a crucial role in force distribution during motion [1,2]. The integrity of this structure is maintained by a complex network of muscles, ligaments, and tendons, ensuring stability and flexibility. Divided into anatomically distinct regions such as the tarsus, metatarsus, phalanges, and others, Polydactylys play a fundamental role in posture and locomotion [3,4,5]. In addition to their biomechanical function, it is important to consider the clinical implications associated with Polydactylys. Conditions such as plantar fasciitis [6], heel spurs [7], flat feet [8], osteoarthritis [9], hallux valgus [10], decalcification [11], and traumatic injuries [12] can affect the health and proper functioning of the feet, significantly impacting individuals’ quality of life.
The analysis of issues related to Polydactylys often requires the use of medical imaging exams, with radiography being an essential choice. This method is commonly employed due to its favorable cost-effectiveness compared to other modalities, as well as being minimally invasive and quick [13]. Radiography enables the detection of degenerative conditions, fractures, foreign bodies, and other irregularities associated with Polydactylys [14]. Its basis lies in the principle of the variation in X-ray absorption by tissues, resulting in two-dimensional projections with diverse contrasts, highlighting denser bone structures. Less dense components, such as muscles, organs, and gases, are also discernible due to lower radiation absorption, exhibiting distinct characteristics from denser structures [15].
Although the use of radiographs is beneficial for disease identification and medical report creation, interpreting these images requires specific expertise. The repetitive and detailed task of analyzing these images can lead to fatigue, resulting in potential errors [16,17]. In this context, computer systems that employ machine learning to assist in detection are being researched. The use of these systems aims to simplify the work of specialists by providing a second opinion. This can not only improve efficiency, but also reduce the occurrence of errors, making the process more effective [13,18,19,20].
The interpretation of radiographs, especially in detecting subtle or rare anomalies, presents a significant challenge due to the complexity and variability of medical images. Addressing this problem, our study aims to develop a robust methodology based on generative AI to automate the generation of medical reports from Polydactyly radiographs. The primary research objectives are to enhance the accuracy and reliability of automatic medical report generation, provide a detailed second opinion for specialists, and improve the efficiency of diagnosing foot pathologies. By leveraging advanced deep learning techniques, our research seeks to bridge the gap between the high demand for accurate medical imaging interpretations and the limited availability of specialized radiologists.
Based on the above, this study investigates the automatic generation of medical reports in Polydactyly radiographs using generative artificial intelligence techniques. Our goal is to automatically generate medical reports on radiographs that can provide descriptions of various pathologies in Polydactylys, such as plantar fasciitis, bunions, heel spurs, flat feet, and injuries, among others. This approach aims to provide a second opinion to the specialist. To achieve this goal, we propose a computer-aided diagnosis (CAD) system with the following contributions:
  • The use of a new dataset composed of Polydactyly radiographs and medical reports;
  • The automated generation of medical reports using both frontal and lateral images of both feet from the same examination. This process involves developing a generative artificial intelligence model employing deep learning to extract features from the images through pre-trained convolutional neural networks (CNNs) and Transformers to interpret and automatically generate the reports, describing the examination and indicating the presence or absence of pathologies.

2. Related Works

Image-to-text is a field of study within machine learning and computer vision, also known as automatic caption generation from input images. The central concept is to develop models capable of receiving an image, processing it, and generating a corresponding descriptive caption [21]. These models are trained on extensive datasets, such as Common Objects in Context (COCO), which includes over 200,000 annotated images, each accompanied by descriptive captions provided by humans [22]. Through these data, machine learning models learn the correlation between visual elements in the images and the words in the captions.
Based on the capabilities of image-to-text, researchers have decided to explore the feasibility of applying this methodology to the interpretation of medical images, aiming to automatically generate medical reports to assist professionals in diagnosis [23]. These initiatives have prompted the development of several studies employing natural language processing (NLP), computer vision, and machine learning methods. These studies aim to create generative artificial intelligence systems capable of producing medical reports from medical images through computational processes.
For example, the following studies have applied these methods to generate medical reports from different imaging modalities:
  • Ref. [24] (2017): bladder cancer exams, using a multimodal mapping approach;
  • Ref. [25] (2018): chest X-ray exams, using a CNN plus long short-term memory approach;
  • Ref. [26] (2023): GI endoscopy, using the Multi-Modal Memory Transformer Network;
  • Ref. [27] (2019): chest X-ray exams, using multi-attention and incorporating background information;
  • Ref. [28] (2023): chest X-ray exams, using a CNN-Transformer approach;
  • Ref. [29] (2023): chest X-ray exams, using a CNN-Transformer approach;
  • Ref. [30] (2024): chest X-ray exams, using CNNs and Transformers;
  • Ref. [31] (2024): chest X-ray exams, using a CNN-Transformer approach;
  • Ref. [32] (2024): retinal images, using a CNN-Transformer approach;
  • Ref. [33] (2024): brain computed tomography exams, using a CNN plus GPT-2 approach.
In this section, we introduce several relevant studies in the field of automatic medical report generation. Below, we present the strengths and weaknesses of these works.
The strengths are as follows:
  • Improved performance: The study [31] demonstrates superior performance in the validation metrics, showing the effectiveness of the proposed model in generating medical reports from radiographs. Other works, such as [32], also show promising results in terms of performance, highlighting the feasibility of their approaches;
  • Integration of Medical Knowledge: The work [28] integrates historical data and general information to enhance the generation of reports, resulting in more accurate and informative reports. This approach is complemented by the study [33];
  • Comprehensive experiments: Studies like [32,33] conduct extensive experiments on various datasets, demonstrating the effectiveness of their approaches compared to advanced methods, which strengthens the validity of the results;
  • Use of pre-trained models: In the study [33], various pre-trained CNN models are explored to evaluate the best results.
The weaknesses are as follows:
  • Lack of dataset details: Some studies, such as [25,26,29], do not provide detailed information about the datasets used, which can compromise the generalization and robustness of the results. The study by [26] uses a relatively small dataset (3069 exams accompanied by reports), while [25,31] use datasets with 3643 and 3973 exams, respectively;
  • Generalization limitations: The work [28] may face challenges in generalizing to different pathologies or clinical contexts due to biases in the patient histories;
  • Absence of data augmentation techniques: Most studies, including [26,31,32], do not explicitly mention the use of data augmentation techniques, which can compromise the models’ ability to handle complex or rare cases;
  • Model interpretability: The study [33] faces challenges in terms of interpretability due to the complexity of the architectures used and the nature of computed tomography scans, making it difficult to understand the model’s decisions;
  • Lack of details on data preprocessing: Several studies, including [31,33], do not provide detailed information on the preprocessing techniques used, which can impact the quality of the input data and the model’s performance.
These studies demonstrate the emerging field of applying image-to-text for automatic medical report generation from medical images, showcasing a wide range of possibilities. However, most existing works focus primarily on modalities such as chest X-rays, endoscopy, and retinal images, opening up the possibility of applying this technology to other areas, such as pododactyl radiographs.
Exploring this possibility, our study investigates the automatic generation of medical reports specifically for pododactyl radiographs. This research is differentiated from the state of the art by the following:
  • Utilizing a novel dataset composed of pododactyl radiographs and corresponding medical reports;
  • Automating the generation of medical reports using more than one image (frontal and lateral) of both feet from the same examination;
  • Developing a generative AI model that employs deep learning techniques to extract image features through pre-trained CNNs and interprets these features using Transformers to generate detailed medical reports, indicating the presence or absence of various pathologies.

3. Proposed Methodology

This work proposes a methodology for a CAD system aimed at the automatic generation of medical reports from radiographs of Polydactylys. The methodology is divided into the following stages: (1) acquisition of the dataset; (2) image preprocessing, including operations such as cropping, zero-padding, resizing, metallic token segmentation, and contrast enhancement; (3) text preprocessing, involving analysis, cleaning, processing, tokenization, and embedding generation (these stages are illustrated in Figure 1); (4) feature extraction from the two incidences using fine-tuned pre-trained CNNs, followed by the analysis and interpretation of these features using Transformers; and (5) automatic generation of the medical report. Figure 2 illustrates the stages of the proposed methodology.
The selection of the specific models and methods employed in this study was motivated by previous experiences and preliminary experiments for this work. Pre-trained and fine-tuned CNNs, particularly Inception-V3, were chosen for feature extraction due to their capability in handling complex image data and their ability to capture the details [34] necessary for accurate medical image analysis. Transformers were selected for their superior ability to handle sequential data, enabling the generation of coherent and contextually accurate medical reports [20,23,35].
Our primary focus was on solving the problem rather than conducting an exhaustive comparison of all possible models and methods. The chosen methodology takes into account the computational capacity available in our laboratories (Section 5) and the model’s performance, ensuring the feasibility of implementation within practical constraints in the real world. Additionally, the design of these models was based on their compatibility and complementary strengths, providing a robust framework for our CAD system.

4. Dataset Acquisition

One of the main challenges in developing CAD systems is acquiring extensive and heterogeneous datasets that can provide effective training for the model. For this study, we assembled a dataset comprising 16,710 exams. Among these, 6667 (39.9%) are classified as normal and 10,043 (60.1%) as abnormal, indicating a predominance of exams performed due to specific findings, leading to a certain imbalance in the data. The primary objective of this work is to use pododactyl radiographs and medical reports to automatically generate medical reports using generative models. Thus, despite having class labels, we will use the dataset without class separation for this study; class differentiation will only be considered in the analysis of the results, as our main goal is to generate reports for these exams regardless of their clinical classification. Figure 1 illustrates the data acquisition process.
Although this imbalance in the dataset, with a greater number of abnormal cases, is a common concern in many machine learning applications, in this specific context of the automated generation of polydactyl medical reports, we understand that it does not represent a significant problem. In fact, we recognize that the greater representation of the abnormal class is beneficial, as these samples exhibit more variability in the features that classify them as abnormal. This diversity is important for training models that can accurately capture and report a wide range of abnormal conditions. In contrast, normal cases tend to be more homogeneous, and their smaller proportion does not significantly affect the model’s ability to learn to identify and describe abnormal patterns. Therefore, the imbalance in the dataset can contribute to the robustness of the model in identifying anomalies, aligning with our goal of creating an effective system for generating automated medical reports.
This dataset was collected from hospitals and clinics in all regions of Brazil (North, Northeast, Central-West, Southeast, and South), providing heterogeneity and a wide variety of resolutions, angles, genders, and age ranges, totaling 16,710 exam samples with images and anonymized reports. The exams were obtained in DICOM format and converted to “.png” format for subsequent processing. The images have the Photometric Interpretation attribute set to MONOCHROME2, in which higher-intensity pixels appear light and lower-intensity pixels appear dark. Figure 3 shows examples of frontal and lateral view images from the same exam. Each view provides specific information about the patient’s condition. For exam labeling, we used the Radiologist Report Interpretation (RIR) methodology [36,37], in which specialists analyze the medical reports and classify them according to their content. This methodology was used for all normal-class exams and part of the abnormal exams.
The dataset used in this study includes radiographs of Polydactylys and their corresponding medical reports produced by specialists. The reports are divided into sections such as “FINDINGS” and “IMPRESSIONS”, highlighting important characteristics of the images and summarizing their clinical interpretations [23]. We focused on these sections for analysis. Figure 4 shows a Polydactyly radiograph with its medical report in Portuguese and its translation into English. It also displays four radiographs and their respective reports in both languages, providing a comprehensive view of the data.
To create a model for the automatic generation of medical reports, it is crucial to understand the associated dataset. In Figure 5, we see the distribution of medical reports, which helps us determine the maximum length of a report that our model should generate in terms of words. After analyzing the data, we found 2256 unique words in the dataset, which forms the available vocabulary for the model. The maximum number of words in a report is 1167, and the minimum is 14, while the average is approximately 193.53, with a standard deviation of 107.24, as shown in Figure 5. This information is essential for properly configuring our generative model.
In addition to counting words, it is important to examine the words present in our data. To this end, we created charts showing the most common words (Figure 6a) and the least common words (Figure 6b). Words such as “articulares” (joint), “espaços” (spaces), “relatório” (report), “alterações” (changes), and “moles” (soft tissues) appear frequently and will be easier for the model to learn. On the other hand, words such as “lysfranc” (Lisfranc), “sobretudo” (especially), “metatarsolafangeana” (metatarsophalangeal), “calcaneocuboide” (calcaneocuboid), and “aspecrtos” (aspects), with only one occurrence, may present challenges for the model.
To deepen our understanding of the most frequent words, we created a word cloud that visually highlights the top 150 words in the dataset, as shown in Figure 7.

4.1. Image Preprocessing

The dataset employed in this investigation is derived from multiple sources, generated by disparate equipment and technicians, culminating in images with distinct attributes such as size, contrast, brightness, and framing. To enhance the learning process of the proposed methodology, we have instituted a preprocessing phase aimed at standardizing certain characteristics of the radiographs. Figure 8 delineates the effects of this preprocessing on the original images in both incidences, demonstrating the transformation of the raw data into a more uniform and analyzable format. This rigorous approach ensures the robustness of our method against variations inherent in the data-collection process, thereby increasing the reliability and reproducibility of our results.
The implemented preprocessing protocol encompasses three crucial phases: (1) utilization of the Otsu threshold [38] for the elimination of the peripheral region, preserving only the region of interest with pertinent data; (2) application of zero-padding for the conversion of the image proportions into a square shape, centralizing the image and preventing distortions during the resizing to the standard input dimensions of the CNN used (299 × 299); (3) employment of U-Net for the segmentation of tokens, followed by the application of the Fast Marching Method (FMM) [39] to fill the segmented region. Figure 8 illustrates the results of these phases for both frontal and lateral incidences (As shown in Figure 8, the frontal view contained a metallic token, resulting in the generation of a segmentation mask by U-Net. In contrast, the lateral view did not have the token, leading to the absence of a segmentation mask generated by U-Net).
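As a rough illustration of these three phases, the sketch below (assuming OpenCV and NumPy, a grayscale input image, and a binary token mask produced externally by the U-Net; not the authors' exact implementation) chains the Otsu-based crop, the zero-padding, the FMM inpainting via cv2.INPAINT_TELEA, and the resizing to the CNN input dimensions:

```python
import cv2
import numpy as np

def preprocess_radiograph(image_path, token_mask=None, target_size=299):
    """Sketch of the crop / pad / inpaint / resize pipeline (assumed inputs)."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # (1) Otsu threshold to locate the foreground and crop away the peripheral region.
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    ys, xs = np.nonzero(binary)
    img = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    # (2) Zero-padding to a square frame so the later resize does not distort the anatomy.
    h, w = img.shape
    side = max(h, w)
    padded = np.zeros((side, side), dtype=img.dtype)
    y0, x0 = (side - h) // 2, (side - w) // 2
    padded[y0:y0 + h, x0:x0 + w] = img

    # (3) Fill the metallic-token region with the Fast Marching Method (cv2.INPAINT_TELEA),
    #     using a binary mask assumed to come from the U-Net segmentation step.
    if token_mask is not None:
        mask = cv2.resize(token_mask.astype(np.uint8), (side, side),
                          interpolation=cv2.INTER_NEAREST)
        padded = cv2.inpaint(padded, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)

    # Resize to the 299 x 299 input expected by Inception-V3.
    return cv2.resize(padded, (target_size, target_size), interpolation=cv2.INTER_AREA)
```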
In the radiograph-acquisition process, metallic tokens containing institutional and patient metadata are frequently incorporated into the final image, as exemplified in Figure 3. These tokens, as indicated by [20,40,41], can engender bias in CNN learning. To address this, we employed U-Net for token segmentation, leveraging its proven generalizability in complex medical imaging scenarios with limited samples [42]. The efficacy of U-Net, demonstrated by an accuracy of 0.991, a Dice index of 0.902, and a Jaccard index of 0.806 in a study by [20], validates its application in our dataset, characterized by diverse token attributes.
The images in our dataset exhibit distinct characteristics as they were obtained from various hospitals across all regions of Brazil, using different equipment and technicians. In particular, Polydactyly X-rays can have low contrast, which led us to employ a post-processing step aimed at improving the contrast quality of these images.
In X-rays, less dense bones and tissues have smooth borders, making anomaly detection difficult [13]. To overcome this limitation, we enhanced the contrast of the images by applying the Contrast Limited Adaptive Histogram Equalization (CLAHE) method. This method divides the images into blocks, equalizing them based on the histogram of these regions. If there is noise in the region, the contrast limitation prevents its propagation [43]. In Figure 8, it can be observed that the enhanced contrast by CLAHE makes the bones of the feet more evident.
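A minimal sketch of this step is shown below; the clip limit and tile grid size are illustrative assumptions rather than values reported in this work:

```python
import cv2

def enhance_contrast(gray_img, clip_limit=2.0, tile_grid=(8, 8)):
    """Apply CLAHE block-wise histogram equalization to an 8-bit grayscale radiograph."""
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    return clahe.apply(gray_img)
```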
The image preprocessing aims to improve certain aspects that require refinement, such as standardization and sample uniformity. For instance, zero-padding is applied to make the image format square, preventing distortions during resizing. Resizing is employed to adjust the images to the input standard of the Inception-V3 CNN, ensuring consistency in processing and preventing the model from focusing on minor details that are not relevant to the analysis. Additionally, preprocessing addresses the heterogeneity of the dataset by standardizing characteristics such as the image size and format, making the dataset more suitable for the planned pipeline and improving the model’s robustness and generalization.

4.2. Text Preprocessing and Analysis

The reports in this dataset, originating from a variety of sources and written by specialists with diverse demographic and professional backgrounds, exhibit significant textual diversity. To optimize the learning of the proposed method, we established a preprocessing phase for the reports, aiming to enhance the method’s learning capacity and extract crucial information for the process.
We began the preprocessing by removing irrelevant information, such as stamps often found in standardized reports, exemplified by “OBS.: Exame documentado em CD” (in English, “NOTE: Exam documented on CD”), and introducing [START] and [END] markers to guide the model to the beginning and end of the report generation. Subsequently, we conducted an analysis of the reports to extract crucial information that helps in defining hyperparameters for the model, resulting in quantitative findings about the word distribution in our dataset, as shown in Table 1.
Informed by the data in Table 1, we ascertained the vocabulary size for the model’s repertoire to be 2256 words. Furthermore, the data presented in Table 1 facilitated the establishment of the maximum sentence length, set at 107 words.
Finally, the data provided by Table 1 also allowed us to define the dimension of the embedding through Equation (1), which can be mathematically represented as:
\[ \text{embedding\_dim}(df, \text{min}, \text{max}) = \operatorname{int}\!\left( \frac{(\text{max} - \text{min}) \times \text{total\_words}}{\operatorname{len}(df)} \right) + \text{min}, \]
where df is the data frame containing the report texts; min and max are, respectively, the minimum and maximum number of words in a report; total_words is the total number of unique words across the entire dataset; and len(df) is the number of samples in the dataset.
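Under this reading of Equation (1), the computation reduces to a one-line function; the parameter names follow the definitions above, and the usage values in the comment are the dataset statistics reported in Section 4:

```python
def embedding_dim(df, min_dim, max_dim, total_words):
    """Equation (1) as reconstructed above: scale the unique-word count by the
    allowed report-length range and divide by the number of report samples."""
    return int((max_dim - min_dim) * total_words / len(df)) + min_dim

# Hypothetical usage with the statistics reported earlier (16,710 reports,
# 2256 unique words, report lengths from 14 to 1167 words):
# embedding_dim(reports_df, 14, 1167, 2256)  -> 169 under this reading of Equation (1)
```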
We finalized the text processing by applying embedding, tokenization, and shifted sequence techniques to the targets. Embedding transforms words into dense vectors, capturing their semantic meaning [44,45]. Tokenization, in turn, converts the text into sequences of tokens, simplifying the model’s processing and enabling the generation of texts corresponding to the input images in image-to-text generative models [23,46]. To make the learning process more robust, Transformer models generally employ the “shifted sequence” technique on target texts. This technique shifts the text sequences during training, encouraging the model to predict the next token in a shifted sequence, which improves its generalization ability and accuracy in text generation [23,47].
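The following sketch illustrates this preparation with Keras utilities, which are listed in our computing environment (Section 5); the vectorization settings are assumptions, while the vocabulary size (2256) and maximum sentence length (107) come from Table 1:

```python
import tensorflow as tf

VOCAB_SIZE, MAX_LEN = 2256, 107  # vocabulary size and maximum sentence length from Table 1

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE, output_sequence_length=MAX_LEN + 1,
    standardize="lower")  # lowercase only, so the [START]/[END] markers stay intact

reports = ["[START] ossos com textura e densidade normais [END]"]  # toy example report
vectorizer.adapt(reports)

tokens = vectorizer(reports)      # shape (batch, MAX_LEN + 1), integer token ids
decoder_input = tokens[:, :-1]    # sequence fed to the decoder
decoder_target = tokens[:, 1:]    # the same sequence shifted one position ahead
```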

4.3. Data Augmentation

Despite the substantial data volume in our study’s dataset for generative AI application to Polydactyly radiographs, the complexity of the adopted model necessitates a broad sample variety [23]. Consequently, we employed data augmentation techniques, increasingly prevalent in deep learning applications for medical images, to enhance our model’s learning and develop more generalizable models, particularly in scenarios with limited samples or complex models [20,48,49].
Our methodology incorporates various data augmentation techniques, with specific parameters designed to ensure that synthetic samples are plausible representations of real-world scenarios. These techniques simulate variations in framing, positions, and image quality, preserving the essential characteristics of the examination. The techniques were applied to both frontal and lateral images of the exams and include the following: negative zoom of −10% and positive of +10%; rotation of 15% to the right and left; Gaussian noise of up to 10%; displacement of 4%; and histogram equalization. All these data augmentation techniques were combined randomly during execution to generate new synthetic images. Figure 9 illustrates examples of this process.
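A possible implementation of this augmentation pipeline with Keras preprocessing layers is sketched below; the mapping of the percentages above onto layer parameters is our own reading, not a definitive specification:

```python
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomZoom(height_factor=(-0.10, 0.10)),                    # -10% / +10% zoom
    tf.keras.layers.RandomRotation(factor=0.15),                                # "15%" rotation left/right (assumed factor)
    tf.keras.layers.RandomTranslation(height_factor=0.04, width_factor=0.04),   # 4% displacement
    tf.keras.layers.GaussianNoise(stddev=0.10),                                 # up to ~10% noise on [0, 1] images
])

# Applied on the fly so each epoch sees different synthetic variants:
# augmented_batch = augment(image_batch, training=True)
# Histogram equalization, also mentioned above, would be applied as a separate image-level step.
```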
To ensure the heterogeneity of the dataset and enhance the model’s ability to handle different variations, we employed data augmentation techniques as described. However, it is important to note that the primary goal of these techniques was not specifically to mitigate data imbalance, but rather to make the dataset more diverse and representative of real clinical scenarios. Nevertheless, we understand that this approach, by providing a richer and more varied dataset, helps the model cope with the inherent imbalance between normal (39.9%) and abnormal (60.1%) cases, promoting more robust and generalizable learning for the detection of anomalies and distinct clinical features.

4.4. Convolutional Neural Networks

Convolutional Neural Networks (CNNs), a class of feed-forward artificial neural networks, are essential for the processing and analysis of digital images. Their main layers, the convolutional layer and the pooling layer, identify patterns in the image and reduce the dimensionality of the feature maps, respectively [50]. As image data progress through the layers of the CNN, the network begins to recognize elements of the object, revolutionizing computer vision by offering a scalable approach to image classification and object-recognition tasks [13,51].
Fine-tuning in CNNs is a technique that adapts pre-trained models to specific tasks, reconfiguring them instead of starting training from scratch. This approach saves resources by using pre-trained weights, such as those from the ImageNet dataset [52], and offers knowledge transfer, leveraging the models’ prior knowledge of general image features for tasks such as object classification or disease detection [53,54].
In this study, we adopted a hybrid training approach. Initially, we removed the fully connected (FC) layers from the pre-trained model and attached new, randomly initialized layers (in our case, the Transformer) to fit the new task. The preceding convolutional layers were frozen during this process to preserve previously learned features. We then commenced training, focusing only on the Transformer in a shallow training stage. In the second training stage, we unfroze all convolutional layers of the two CNNs so that they could also learn from the new problem, updating their kernels only after the rest of the model had absorbed knowledge from the new dataset; this second stage deepens the learning process [13].
In this study, we opted for the Inception-V3 CNN architecture for image feature extraction, motivated by the availability of pre-trained weights on the ImageNet dataset, enabling resource-efficient learning from millions of images. Inception-V3, a deep neural network employing deep convolutions and Inception modules, captures information at various scales and abstraction levels, facilitating complex object and feature representations in images [34]. This is crucial for tasks like image-to-text conversion, aligning with our goal of automating medical reports for Polydactyly radiographs. This architecture has a total of 21,802,784 parameters (83.17 MB), with 21,768,352 trainable parameters (83.04 MB) and 34,432 non-trainable parameters (134.50 KB).
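The two-stage scheme can be sketched as follows; the dense head is only a stand-in for the Transformer described in the next subsection, and the learning rates reproduce those reported in Section 6:

```python
import tensorflow as tf

# Inception-V3 with ImageNet weights as the feature extractor, frozen in stage 1
# and unfrozen for deep fine-tuning in stage 2 (loss and data wiring omitted).
base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                         input_shape=(299, 299, 3), pooling="avg")

# Stage 1: freeze the convolutional backbone; only the new head learns.
base.trainable = False
head = tf.keras.layers.Dense(512, activation="relu")(base.output)  # stand-in for the Transformer
model = tf.keras.Model(base.input, head)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3))

# Stage 2: unfreeze the backbone and recompile with a smaller learning rate.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4))
```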

4.5. Transformers

Models based on Transformers, derived from neural networks, have shown promising results in tasks related to natural language processing (NLP), especially in generating captions for images, also known as image-to-text [23,35,55]. The preparation of these models involves a data preprocessing phase, in which images are converted into numerical representations, typically using CNNs, while captions are encoded into numerical vectors [55]. Figure 2 illustrates the architecture of the Transformer, as well as the process of extracting features from images by CNNs, which serve as inputs for the Transformer to generate reports automatically.
After training, the model becomes capable of generating captions for new images by encoding the image through the same CNN used during training. Subsequently, the image representation is combined with the textual representation generated by the Transformer model, resulting in the production of the final report from the embeddings, using the Softmax layer to calculate the probabilities [23,55]. Recently, encoder–decoder models employing Transformers have been applied in medical report generation, introducing a relational memory mechanism that aids in retaining text patterns present in similar examination reports [23].
The encoder, part of the Transformer architecture, plays a crucial role in processing input sequences in various natural language processing tasks. Comprising layers of self-attention followed by dense neural networks, each layer uses multi-head self-attention mechanisms to capture relationships between different words in the input sequence, employing residual connections and layer normalization to ensure training stability. Its output provides a contextualized representation of the input sequence, vital for subsequent tasks [55]. The architecture of the encoder is shown in Figure 2.
On the other hand, the Transformer decoder is enhanced with a mechanism designed to aid in remembering text patterns present in medical reports, such as “bones with normal texture and density” in Polydactyly examination [55]. This mechanism, called relational memory, utilizes input and forget gates, resembling those of a long short-term memory (LSTM) cell. This pioneering approach in using Transformers for medical reports has shown promising results [23]. The architecture of the decoder is depicted in Figure 2.
For this study, we configured the encoder with 8 attention heads and the decoder with 1024 feed-forward units and 12 attention heads, based on empirical lessons learned from previous investigations by our team. Opting for a larger number of units and heads in the decoder is a conventional practice to handle the complexity and diversity of linguistic relationships in text-generation tasks. Meanwhile, the moderate number of heads in the encoder provides a balance between computational efficiency and modeling capacity, an approach that has proven suitable for solving the problem at hand [56].
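For illustration, this configuration corresponds to attention and feed-forward layers along the following lines; the embedding size and the exact layer composition are assumptions, not the full architecture:

```python
import tensorflow as tf

EMBED_DIM = 256  # illustrative embedding size; the heads/units below follow the text

encoder_attention = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=EMBED_DIM)
decoder_attention = tf.keras.layers.MultiHeadAttention(num_heads=12, key_dim=EMBED_DIM)
decoder_ffn = tf.keras.Sequential([
    tf.keras.layers.Dense(1024, activation="relu"),  # the 1024 decoder units
    tf.keras.layers.Dense(EMBED_DIM),
])
```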

4.6. Image-to-Text Transition Process in Report Generation

The transition process from image features to text generation in the model is orchestrated by a combination of CNN and Transformer architectures, working together to translate visual information into coherent textual descriptions. This process can be subdivided into several key stages, as described below:
  • Initially, visual features are extracted from the images using four pre-trained CNNs, specifically the Inception-V3 model. These CNNs process the four images from the different views (frontal and lateral of both feet), each learning to extract the intrinsic and important features of the view it was trained on. Every image passes through a series of convolutional layers, which extract rich hierarchical representations of the visual inputs. The outputs of these convolutional layers are then converted into feature vectors through a Flatten operation for each image view;
  • These vector representations of each view are then concatenated to form a single combined representation that encapsulates the information from all views of the exam images. This combined vector is then passed through a Transformer-based encoder, which applies layer normalization and multi-head attention to further process and refine the visual features. The encoder uses normalization and attention layers to maintain feature integrity while capturing contextual dependencies between different parts of the image;
  • After encoding the visual features, the resulting vector is fed into the decoder, also Transformer-based. The decoder starts generating the textual description word by word, using a causal attention mask to ensure that sequence generation is autoregressive, meaning each generated word depends only on the previous words in the sequence. The decoder incorporates word embeddings, applies multiple attention layers to integrate information from the encoder and previously generated words, and uses dense feed-forward layers to transform these integrations into probabilities over the output vocabulary;
  • During training, the decoder’s input includes the sequence of words up to the current word, and the model is trained to predict the next word in the sequence. The loss is calculated by comparing the generated sequence with the true target sequence, and accuracy is assessed based on the match between predicted and true words, adjusted for the attention mask. During inference, the model uses the generated output at each step as input for the next, continuing to generate words until it encounters the end-of-sequence token.
This complex mechanism allows the model to effectively capture the transition from visual features to textual representations, ensuring that the generated descriptions are not only accurate, but also contextually rich and coherent.
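The inference loop described above can be summarized by the following greedy-decoding sketch, in which encode_views and decode_step are hypothetical callables standing in for the trained CNN-plus-encoder and the Transformer decoder:

```python
import numpy as np

def generate_report(encode_views, decode_step, token_to_id, id_to_token, max_len=107):
    """Greedy autoregressive decoding sketch (callables are placeholders, not real APIs)."""
    memory = encode_views()                        # fused features of the four exam views
    token_ids = [token_to_id["[START]"]]
    for _ in range(max_len):
        probs = decode_step(memory, token_ids)     # distribution over the output vocabulary
        next_id = int(np.argmax(probs))            # greedy choice of the next word
        if id_to_token[next_id] == "[END]":
            break
        token_ids.append(next_id)
    return " ".join(id_to_token[t] for t in token_ids[1:])
```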

4.7. Validation Metrics

An essential component of this process is the validation of the results produced by the proposed methodology. To ensure the accuracy and relevance of the generated reports, we used validation metrics commonly employed in the literature (Section 2). Among these, the Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit ORdering (METEOR), and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) stand out.
BLEU is a metric that evaluates the quality of texts generated by the model by comparing them to one or more reference texts, considering up to four-word n-grams (BLEU-1 to BLEU-4). This metric is widely used in machine translation and text generation by computational models [57], and it is defined in Equations (2)–(4):
\[ \text{BLEU} = BP \times \exp\!\left( \sum_{n=1}^{N} w_n \cdot \log(p_n) \right), \]
where N is the maximum number of considered n-grams, w_n is the weight assigned to each n-gram, p_n is the precision of the n-grams (Equation (3)), and BP is the brevity penalty factor (Equation (4)).
\[ p_n = \frac{\sum_{\text{candidate}} \text{count of matching } n\text{-grams}}{\sum_{\text{candidate}} \text{count of possible } n\text{-grams}}, \]
\[ BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,(1 - r/c)} & \text{if } c \le r, \end{cases} \]
where c is the length of the candidate text and r is the average length of the references.
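Using NLTK (listed in Section 5), BLEU-1 to BLEU-4 can be computed along the following lines; the toy sentences and the smoothing choice are illustrative assumptions:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["ossos com textura e densidade normais".split()]  # toy reference report
candidate = "ossos com textura normal".split()                 # toy generated report

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)  # uniform over the first n n-grams
    score = sentence_bleu(reference, candidate, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```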
METEOR, on the other hand, goes beyond n-grams and takes into account synonyms and morphological variations, providing a more sensitive evaluation of the linguistic and contextual variations in texts generated by models [58]. The METEOR formula is represented in Equation (5):
\[ \text{METEOR} = \big( (1 - \lambda) \cdot \text{precision} + \lambda \cdot \text{recall} \big) \cdot (1 - \beta \cdot \text{penalty}), \]
where
\[ \text{precision} = \frac{\text{matched unigrams}}{\text{total candidate unigrams}}, \quad \text{recall} = \frac{\text{matched unigrams}}{\text{total reference unigrams}}, \quad \text{penalty} = \gamma \cdot \left( \frac{\text{total candidate}}{\text{total reference}} \right)^{\delta}, \]
and where λ is the balancing parameter between precision and recall, β is the parameter controlling the impact of the penalty, γ is the penalty decay parameter, and δ is the penalty growth parameter.
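Read literally, the reconstructed Equation (5) corresponds to the following function; the default parameter values are placeholders rather than those used by the evaluation library:

```python
def meteor_from_counts(matched, total_candidate, total_reference,
                       lam=0.9, beta=3.0, gamma=0.5, delta=1.0):
    """Direct reading of Equation (5) as reconstructed above (parameter values are placeholders)."""
    precision = matched / total_candidate
    recall = matched / total_reference
    penalty = gamma * (total_candidate / total_reference) ** delta
    return ((1 - lam) * precision + lam * recall) * (1 - beta * penalty)
```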
ROUGE is a metric that measures the overlap of units such as n-grams, words, and bigrams between the generated text and the reference text. It is particularly useful for evaluating the quality of automatically generated summaries by models [59]. We can observe the ROUGE equation in Equation (6).
\[ \text{ROUGE-L} = \frac{\sum_{\text{documents}} \text{LCS}(\text{candidate}, \text{reference})}{\sum_{\text{documents}} \text{word count in reference}}, \]
where LCS (candidate, reference) is the length of the longest common subsequence, documents is the set of reference documents, and word count in reference is the total number of words in the references.
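Equation (6) can be implemented directly from the longest-common-subsequence definition, as sketched below; note that evaluation toolkits often report an F-measure variant of ROUGE-L rather than this recall-oriented form:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence between two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidates, references):
    """ROUGE-L as written in Equation (6): summed LCS lengths over summed reference lengths."""
    lcs_total = sum(lcs_length(c.split(), r.split()) for c, r in zip(candidates, references))
    ref_total = sum(len(r.split()) for r in references)
    return lcs_total / ref_total
```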

5. Computing Environment

For this work, we used a computational environment with the following configurations: 64 GB of RAM; CPU: 12 cores; Graphics Processing Unit (GPU): RTX 4080, 2505 MHz, 9728 CUDA cores, 16 GB at 22.4 Gbps; Operating System: Linux Ubuntu 23.10; Programming Language: Python 3.10.12; Libraries: TensorFlow 2.15.0, Keras 2.15.0, OpenCV 4.8.1, scikit-learn 1.3.2, NumPy 1.26.2, scikit-image 0.22.0, Pillow (PIL) 10.1.0, TQDM 4.66.1, NLTK 3.8.1, aac-metrics 0.5.4.

6. Design of Experiments

Considering the computational challenges faced by the methodology developed in this study, which aims to develop and test a generative AI for automatically generating medical reports from Polydactyly radiographs, we designed experiments that address these challenges. This approach ensures that the experimental framework rigorously evaluates the methodology’s performance in generating accurate and reliable medical reports, while also optimizing the computational efficiency of the system.
Therefore, we adopted the K-fold methodology with 10 folds to evaluate the performance of our method. In this approach, the samples are randomly shuffled and divided into 10 sets of a similar size. In each iteration, a single set is retained as the validation data, while the remaining nine sets are used as the training data. This process is repeated 10 times, so each set is used once for validation. The results of all iterations are then aggregated to evaluate the architecture, using the average performance as the method’s evaluation index. Although this approach is more computationally expensive compared to the traditional static separation of training data, it has the advantage of maximizing the exploration of the dataset. Additionally, this methodology allows for a better assessment of the model’s generalization ability to unseen data during training, as all samples will be used in this stage at some point, providing a more robust and reliable measure of the model’s performance [60,61].
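This protocol can be set up with scikit-learn (Section 5) as sketched below; the random seed is an assumption, and the further split of the remaining exams into training and validation sets is omitted:

```python
import numpy as np
from sklearn.model_selection import KFold

# 10-fold shuffled split of the 16,710 exams (the random seed is an assumption).
exam_indices = np.arange(16710)
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kfold.split(exam_indices), start=1):
    # In each fold, the held-out 10% acts as the test set; a further 10% of the
    # remaining exams is reserved for validation (that split is not shown here).
    print(f"fold {fold}: {len(train_idx)} train+val exams, {len(test_idx)} test exams")
```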
The experimental data were divided into training, validation, and test sets, representing 80%, 10%, and 10%, respectively, for each fold. The test set was reserved to evaluate the model’s generalization capability for future unknown cases in each fold execution. In each fold, the Transformer was trained for up to 20 epochs with a patience of 10. (Patience is a procedure that ensures the model will continue training for at least 10 consecutive epochs without interruption.) In the first stage, only the Transformer learns problem characteristics, while the CNNs remain frozen, without learning dataset characteristics. After completing the first training stage, the second stage begins with the CNNs unfrozen, allowing them to also learn problem characteristics for up to 10 epochs.
In the first stage of training, we employed the following hyperparameters: an alpha (a parameter that controls the magnitude of adjustments in the model’s weights during optimization, influencing the convergence rate in training) of 0.001 with a decay (the decay of alpha refers to the gradual reduction of the learning rate during the model’s training, adjusting it to smaller values over time to improve the stability and efficiency of convergence) of 0.1. For the second stage of fine-tuning, we used an alpha of 0.0001 with the same decay of 0.1. These values were chosen to allow for rapid and effective convergence in the first stage, while the deep fine-tuning phase aimed to capture subtle details not previously learned, also taking advantage of the unfreezing of the convolutional layers of the CNNs to enhance the model’s learning capability.
To mitigate the risk of overfitting in the model, we employed the early stopping technique, which halts training upon detecting the onset of performance deterioration on a validation set, reducing the likelihood of model overfitting. This intervention was triggered after a period of just two epochs without improvement. We opted for the ADAM [62] optimizer due to its ability to automatically adjust learning rates for different parameters, combining effective adaptive methods to improve model convergence [63]. During training, we monitored accuracy and the loss computed with sparse categorical cross-entropy [64]. We established a batch size of 16 and a shuffle buffer of 4300 to ensure adequate randomization of samples (this configuration was determined by the 16 GB memory limit of the GPU, which proved insufficient to simultaneously hold the model, images, and texts with larger batches).
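The per-stage configuration can be summarized as follows; model and the dataset pipelines are placeholders, the learning-rate decay of 0.1 is not shown, and restoring the best weights is an assumption:

```python
import tensorflow as tf

def train_stage(model, train_ds, val_ds, learning_rate, epochs):
    """One training stage with the settings described above (datasets and model are placeholders)."""
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2,
                                                  restore_best_weights=True)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model.fit(train_ds.shuffle(4300).batch(16),          # buffer of 4300, batch size 16
                     validation_data=val_ds.batch(16),
                     epochs=epochs, callbacks=[early_stop])

# train_stage(model, train_ds, val_ds, learning_rate=1e-3, epochs=20)  # stage 1 (shallow)
# train_stage(model, train_ds, val_ds, learning_rate=1e-4, epochs=10)  # stage 2 (deep fine-tuning)
```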
In terms of computational time, the shallow training phase took approximately 33 min and 51 s, while the deep training phase lasted about 35 min and 9 s. Loading the trained model into memory took around 4 min and 52 s, and generating a single medical report with the model took approximately 1 s. Calculating the metrics for the test set took about 40 min and 12 s. The final model size was 582.2 megabytes. These times significantly influenced the design of the experiments, ensuring that the training and inference processes were efficient and manageable within practical constraints. Additionally, these time and resource limitations were taken into account to ensure a rigorous and detailed evaluation of the proposed methodology.

7. Results

In this section, we present the results obtained by the methodology for medical reports on Polydactyly X-rays, as shown in Table 2, followed by a detailed discussion.
The proposed methodology demonstrated promising performance in the automatic generation of Polydactyly X-ray reports. The results of the BLEU, METEOR, and ROUGE-L metrics for the labels ABNORMAL, NORMAL, and TOGETHER indicate that the model is particularly effective in normal cases, with significantly higher scores (BLEU-1 of 0.653 and METEOR of 0.548) compared to abnormal cases (BLEU-1 of 0.426 and METEOR of 0.326). (The term “together” is used when we do not separate the samples into the “abnormal” and “normal” categories.) This performance difference can be attributed to the inherent complexity of describing abnormal findings and the scarcity of samples for certain pathologies, making the learning process challenging. The low variability observed in the metrics, as indicated by the standard deviations, suggests that the model exhibits robust consistency across different folds of the validation set. However, the higher variability in the BLEU-2 to BLEU-4 scores for normal cases indicates room for improvement in handling longer sequences. Overall, the model’s ability to generate accurate reports for both normal and abnormal cases reinforces its applicability in the real world. For future advancements, we recommend the inclusion of more training data and the application of advanced fine-tuning techniques to enhance the model’s accuracy and reliability, contributing to more efficient and precise medical diagnoses.
The results observed in Table 2 suggest that the methodology achieved promising outcomes. This methodology enables the model to capture textual nuances present in Polydactyly X-ray images. Furthermore, the table indicates that the model performs better for reports without abnormalities compared to those considered abnormal. This information is crucial for guiding decisions on how the methodology can assist medical professionals.
A more in-depth analysis can be made from Figure 10, where the box plot reveals a consistent distribution of the evaluation metrics (BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, and ROUGE-L) across the 10 folds. The relatively high median in all metrics, considering that all are above 0.50 when we analyze the entire test set under “All Labels”, indicates consistency in the quality of the predictions. The presence of outliers, on the other hand, reflects variations in certain iterations of cross-validation. However, the overall distribution of the metrics, including METEOR and ROUGE-L, remains stable across the folds, suggesting a general consistency in the model’s ability to generate automatic medical reports. These results indicate that our method maintains uniform and reliable performance, establishing a solid foundation for the reliability and robustness of the proposed model.
When examining the box plot, we observe that both the “abnormal” and “normal” classes exhibit similar metric distributions across the folds. However, it is important to highlight that the results for the abnormal class exams show slightly lower performance compared to the normal class, with differences in data dispersion. The “abnormal” class has slightly lower medians and wider interquartile ranges compared to the “normal” class. This reinforces that, although the overall performance remains consistent, there may be slightly more variability in predictions for abnormal X-rays.
Analyzing the combination of both classes, we observed that the average results lie between those of the abnormal and normal classes. The model demonstrates overall robustness, suggesting that it is promising for the automatic generation of medical reports in various clinical scenarios. However, due to the greater variability observed in the specific findings of the abnormal class, the model proved to be less efficient in this scenario. This information is crucial for decision-making regarding the use of the model in a real-world setting. Additionally, it indicates that including a larger number of abnormal class samples could further contribute to the robustness of the results, making the model more reliable and effective for a wider range of diagnoses.
We present a comparative analysis between our proposed method and the state-of-the-art (SOTA) methods, detailed in Table 3. It is important to note that this comparison is made solely to understand the stage at which our method stands in generating automatic reports for examinations in CAD systems. A direct comparison cannot be made since, in this study, we address a different and novel problem highlighted in our SOTA.
Our method consistently excels across various evaluation metrics, demonstrating its potential for generating medical reports for radiographic examinations of Polydactylys. However, when compared to the study by [26], which investigated the use of endoscopy images, we observed superior results on specific metrics. It is important to highlight that, unlike our approach, the study by Cao et al. does not provide a comprehensive evaluation of all metrics used in our study. Although we acknowledge the quality of this study, it is crucial to mention its limitations, such as the absence of detailed annotations in the reports, constraints on the dataset size, with only 3069 samples, and the lack of cross-validation in their experiments. These considerations underscore the importance of a thorough and comprehensive evaluation when comparing the results of different studies.
Given that the proposed method did not achieve the best results in the literature, it is necessary to understand that the objective of this comparison is not to determine the best methodology, as such a comparison would be unfair since the methods address different problems. However, this comparison is important to understand the performance level of our method compared to others that also solve the problem of the automatic generation of medical reports. That said, the advantages of our method compared to [26] are as follows: (1) a larger and more heterogeneous dataset; (2) an evaluation of the results with more metrics, which allows for better analysis and interpretation of our results. Our study contributes to the emerging field of artificial intelligence in medicine by exploring new methodologies to improve the automatic interpretation of medical images.
Regarding the comparison with other works, our BLEU-1 score of 0.516 surpasses that presented by [29] and is superior to all other methods considered in Table 3, indicating a precise match between the automatically generated medical reports and the human references. Additionally, we obtained comparable results in other metrics, such as BLEU-2, BLEU-3, and BLEU-4, highlighting our approach’s ability to capture subtle linguistic nuances. We achieved a METEOR score of 0.417, superior to all other methodologies, and a ROUGE-L score of 0.364, further reinforcing the quality of our results compared to human references. This analysis demonstrates that, by addressing the problem of automatic report generation for Polydactyly examinations, we achieved comparable and, in some cases, superior results in certain metrics compared to the SOTA, evidencing the effectiveness of our methodology.
Table 3 presents the p-values [65] obtained from Student’s T-tests [66,67], comparing our method with the state-of-the-art (SOTA) methods for medical report generation from medical images. The table shows the metric results for each of the SOTA methods compared with our method. The p-values indicate the statistical significance of the differences between our method and the SOTA methods in the evaluation metrics. p-values less than 0.05 generally indicate that the observed difference is statistically significant, meaning it is unlikely to have occurred by chance. Based on the presented results, it is observed that our methods outperform or are on par with several SOTA methods, especially for the BLEU-1, BLEU-2, BLEU-3, and METEOR metrics, where we achieved the best results for METEOR and competitive performances for the others. This suggests that our model is not only robust, but also offers a viable and effective alternative for automatic medical report generation. On the other hand, the method by [26] presented the best absolute results on the BLEU and ROUGE-L scores, although statistical analysis revealed that the differences between the two methods were not always significant, indicating that our method can be a valid alternative. These statistical analyses are essential to confirm the superiority or equivalence of the methods, ensuring that the observed improvements are consistent and not attributable to chance.

8. Discussion

In this section, we present the results of our method for the automatic generation of pododactyl medical reports using CNNs to extract image features and Transformers to interpret and generate the reports. We begin by illustrating the training and validation performance of our model through various graphs. Additionally, we provide examples of generated reports alongside the corresponding medical images, including Grad-CAM visualizations to enhance result interpretability [13,68].
For this section, we selected the model that showed average results among the 10 folds. We opted for this approach to mitigate sampling biases and ensure a robust representation of the model’s generalization. Choosing a model with average performance is crucial as it represents a mean performance across different validation sets, providing a more realistic assessment of the method’s capability to handle new data. This not only increases confidence in the replicability of the results, but also offers a balanced view of the model’s potential in practical clinical application scenarios.
For a deeper understanding of the results, we present the graphs in Figure 11a,b, which illustrate the model’s performance during the training and validation periods. Figure 11a shows that the model achieved satisfactory convergence in the loss and accuracy metrics on the training and validation sets, with the loss consistently below 1.0 and accuracy above 0.80. At this stage, only the Transformer learned features from the dataset. Figure 11b shows that, in the second training stage, in which the convolutional layers of the four CNNs were also allowed to learn alongside the continued training of the Transformer, the model began to overfit, leading to an interruption of the training. Despite this behavior, the second stage guided the model to learn the nuances between the X-rays and the reports while keeping the overfitting under control, as observed in Section 7.
To elucidate how the combination of CNNs and Transformers achieved the results presented in this study, we employed the Gradient-weighted Class Activation Mapping (Grad-CAM) technique [69] to visualize the image regions considered important by the CNNs during the process. We then combined the reports generated by the Transformers with the Grad-CAMs, providing additional insights into our methodology. Thus, we created Figure 12 and Figure 13, which display an X-ray of the Polydactylys, the original reports, the reports produced by our methodology, the calculated metrics, and the Grad-CAMs for key words generated by the methodology.
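For reference, a standard Keras formulation of Grad-CAM over one Inception-V3 branch is sketched below; the target layer name ("mixed10") and the scalar used as the score are simplifying assumptions, since in our pipeline the gradients flow back from the report-generation head rather than from a pooled feature output:

```python
import numpy as np
import tensorflow as tf

def grad_cam(cnn, image, conv_layer_name="mixed10"):
    """Grad-CAM sketch for one Inception-V3 feature extractor (assumed layer name and score)."""
    conv_layer = cnn.get_layer(conv_layer_name)
    grad_model = tf.keras.Model(cnn.input, [conv_layer.output, cnn.output])

    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...].astype("float32"))
        score = tf.reduce_max(preds)                        # scalar derived from the model output
    grads = tape.gradient(score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))            # global-average-pooled gradients
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()      # normalized heatmap for overlay
```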
Analyzing Figure 12a, we see that the model showed promising performance, with a BLEU-4 score of 0.63, a METEOR score of 0.45, and a ROUGE-L score of 0.73, indicating a strong correspondence with real medical reports. Although the model did not detect an early-stage valgus deformity of the hindfoot/midfoot, it was accurate in describing normal characteristics. The analysis of the Grad-CAMs generated by the four CNNs for each examination image revealed that the model focused on areas of the feet used by specialists to generate reports.
Figure 12b shows that the model achieved BLEU-4 scores of 0.60, a METEOR score of 0.53, and a ROUGE-L score of 0.91, indicating a strong correspondence with real medical reports. Both reports mention normal bone texture, intact joint surfaces and spaces, and reduced plantar arch, demonstrating that the model accurately captured normal clinical features. Although the scores suggest high agreement, there is variation in wording due to the diverse origin of the dataset. The analysis of the Grad-CAMs reveals that the model focuses on specific areas of the feet, indicating that it searches for specific findings.
In the final example, shown in Figure 12c, the model received low scores: a BLEU-4 of 0.00, a METEOR of 0.10, and a ROUGE-L of 0.18, indicating low correspondence with the actual medical report. The physician performed a brief analysis and reported no specific findings, whereas the model conducted a more detailed analysis, identifying the presence of hallux valgus and a surgical pin that the physician did not mention. This example illustrates the difficulty of quantifying image-to-text results: even when the model generates an accurate report, the use of different words and expressions can result in low metric values.
Figure 13a presents the results of the model’s analysis of X-ray images of the right and left feet, with visualization of important regions identified by the Grad-CAM technique. The reports generated by the model, when compared to medical reports, showed significant correspondence, as evidenced by a BLEU-4 score of 0.76, a METEOR score of 0.50, and a ROUGE-L score of 0.81. We note that the model was accurate in identifying normal bone and joint characteristics, as well as detecting hallux valgus in both feet. However, there was a discrepancy in identifying signs of osteoarthritis in the left foot, not mentioned by the model. This suggests that, while the model performs well overall, there is still room for improvement in detecting more subtle anomalies such as osteoarthritis.
In Figure 13b, the results show that the model successfully captured various normal features of the right foot, with a BLEU-4 metric of 0.59, a METEOR metric of 0.41, and a ROUGE-L metric of 0.72. However, there were significant discrepancies in the left foot, where the model failed to identify the hallux valgus, signs of osteoarthritis, and the presence of a metal screw fixing the metatarsal to the proximal phalanx of the first toe. The analysis of the Grad-CAMs reveals that the model’s attention is concentrated on significant parts of the images, indicating that the model recognizes relevant regions, but still fails to accurately identify certain anomalies in the left foot. This suggests that, although the model is correctly focusing on important areas, there may be a need for additional adjustments to improve sensitivity and accuracy in detecting anomalous features.
Figure 13c shows a relatively low performance of the model, with a BLEU-4 metric of 0.08, a METEOR metric of 0.17, and a ROUGE-L metric of 0.31. The report generated by the model failed to identify signs of talonavicular osteoarthritis and hallux valgus present in the medical report. A relevant observation is the blurred aspect of the images in this exam, with a whitish coloration outside the feet, which may have contributed to the unsatisfactory results of the model. The analysis of the Grad-CAMs reveals that the image has characteristics not common to the rest of the dataset, making it difficult for the model to focus on important features and negatively affecting performance.
The BLEU, METEOR, and ROUGE-L scores achieved in this study are quantitative indicators of the correspondence between the reports generated by the model and the actual medical reports. In clinical practice, these scores are highly relevant as they provide an objective measure of the accuracy and quality of automated descriptions compared to the observations made by experienced radiologists. A BLEU-4 score of 0.76, for example, suggests that the model is effective at replicating the language used in medical reports, which can speed up report generation and reduce the radiologists’ workload. Similarly, the METEOR and ROUGE-L scores, which reached values of up to 0.50 and 0.81, respectively, indicate strong agreement in the structure and content of the reports, highlighting the model’s ability to capture critical details from radiological images. However, the variability of scores in different clinical scenarios underscores the need for continuous improvements, particularly in detecting subtle anomalies.
It is important to note, however, that these metrics do not always represent a realistic analysis of the model’s capability. Variations in word choice and phrasing can result in relatively low scores even when the model generates reports as precise as those of the physicians. For example, the use of synonyms or different sentence structures can lower the scores despite equivalent clinical content. This indicates that, while the BLEU, METEOR, and ROUGE-L metrics are useful for an initial assessment, a deeper and qualitative analysis is necessary to fully evaluate the model’s effectiveness. Therefore, the full integration of this technology into clinical practice should consider both quantitative and qualitative analyses to ensure the accuracy and utility of the generated reports.
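To make this limitation concrete, the snippet below computes the three metrics for two clinically equivalent but differently worded sentences. It is a minimal sketch using NLTK and the rouge-score package; the example sentences are invented and are not taken from the dataset.

```python
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)   # required by METEOR
nltk.download("omw-1.4", quiet=True)

# Two phrasings with the same clinical content.
reference = "bone texture is normal joint surfaces and spaces are intact plantar arch is reduced"
candidate = "normal bone texture preserved joint surfaces and spaces with reduction of the plantar arch"

ref_tokens, cand_tokens = reference.split(), candidate.split()
smooth = SmoothingFunction().method1

bleu4 = sentence_bleu([ref_tokens], cand_tokens,
                      weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
meteor = meteor_score([ref_tokens], cand_tokens)
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure

# Despite equivalent meaning, n-gram overlap is only partial, so the scores stay well below 1.0.
print(f"BLEU-4 = {bleu4:.3f}  METEOR = {meteor:.3f}  ROUGE-L = {rouge_l:.3f}")
```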
This discussion suggests that, while the model shows promising performance in describing normal clinical features and identifying specific findings, further adjustments are needed to improve the detection of early anomalies and mitigate the impact of linguistic variation. With these enhancements, such as adding new exam samples to the dataset, the model has great potential to be a valuable tool in clinical practice, aiding in the accurate and efficient interpretation of Polydactyly X-rays.
The proposed methodology has shown promising results in the automatic generation of Polydactyly radiograph reports, indicating its potential for near-term implementation in a real clinical setting. The model’s robust performance on metrics such as BLEU, METEOR, and ROUGE-L, especially in normal cases, suggests that it can be a valuable tool to assist specialists in drafting medical reports, thus alleviating the workload of radiologists and increasing diagnostic efficiency. However, some limitations must be addressed before practical deployment. The performance gap between normal and abnormal cases highlights the need to enhance the model’s accuracy in describing abnormal findings, which can be achieved by expanding the dataset and refining learning techniques. Therefore, despite encouraging results, additional efforts are needed to refine the methodology and ensure its effectiveness and safety in a real clinical environment.
It is important to emphasize that the proposed system should be used as a second-opinion or pre-diagnosis tool, with the final confirmation and review of medical reports remaining the sole responsibility of physicians. This approach ensures that the model assists specialists without replacing human clinical evaluation.

9. Conclusions

In this study, we developed and evaluated a methodology based on generative AI to automatically create medical reports from Polydactyly X-rays. Our goal was to create a precise and specialized method for interpreting complex medical images, advancing CAD systems.
The results are promising for the automatic generation of medical reports, with evaluation metrics such as BLEU (0.72), METEOR (0.65), and ROUGE-L (0.69) indicating good correspondence between the reports generated by the model and those created by specialists when compared with the state of the art. The interpretability of the results, facilitated by the Grad-CAM technique, provided valuable insights into how the model interprets X-rays and generates reports.
We acknowledge the limitations of our method, particularly in generating reports for exams that exhibit anomalies compared to the normal class. The need to increase the number of samples in the dataset from the abnormal class is evident to enhance the model’s learning. Additionally, our method currently focuses solely on pododactyl exams, limiting its applicability to other types of medical imaging. Expanding the scope to include various types of exams would make the model more versatile and useful in a broader range of clinical scenarios. We understand the complexity involved in the task of automatic medical report generation and the ongoing importance of refining and improving the model.
For future work, we suggest the use of different CNN architectures for feature extraction, as well as the exploration of different hyperparameters in Transformers. Additionally, future research could focus on incorporating larger and more diverse datasets, including a broader range of anomalies, to improve the model’s robustness and generalizability.
In summary, this study provides a significant contribution to the emerging field of AI in medicine, demonstrating the potential of advanced machine learning approaches to improve the process of medical image interpretation and support clinical decisions.

Author Contributions

Conceptualization, P.d.A.V. and R.R.V.e.S.; methodology, P.d.A.V.; software, P.d.A.V.; validation, P.d.A.V., M.J.M., P.d.A.d.S.N. and R.R.V.e.S.; formal analysis, P.d.A.V.; investigation, P.d.A.V.; resources, P.d.A.d.S.N.; data curation, P.d.A.d.S.N.; writing—original draft preparation, P.d.A.V.; writing—review and editing, P.d.A.V., M.J.M. and R.R.V.e.S.; visualization, P.d.A.V.; supervision, R.R.V.e.S.; project administration, R.R.V.e.S.; funding acquisition, P.d.A.V. and R.R.V.e.S. All authors have read and agreed to the published version of the manuscript.

Funding

The present work was carried out with the support of the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior-Brasil (CAPES)—Finance Code 88881.846250/2023-01 and the National Council for Scientific and Technological Development (CNPq)—Finance Code 311289/2022-3.

Institutional Review Board Statement

The ethics committee approved this project (CAAE: 66278422.6.0000.0229; Opinion Number: 5.840.059). The data used in this research (examinations) were collected without any identification, containing only the image and its classification (normal, abnormal, or another class of interest to the study group), along with the anonymized reports.

Informed Consent Statement

Not applicable.

Data Availability Statement

After completing the research, we will make the code and additional information available in a public repository. Interested readers can obtain these data by contacting the article’s corresponding author.

Conflicts of Interest

There are no conflicts of interest to declare.

References

  1. Gebo, D.L. Foot Morphology and Locomotor Adaptation in Eocene Primates. Folia Primatol. 1988, 50, 3–41. [Google Scholar] [CrossRef]
  2. Tomassoni, D.; Traini, E.; Amenta, F. Gender and age related differences in foot morphology. Maturitas 2014, 79, 421–427. [Google Scholar] [CrossRef]
  3. Saltzman, C.L.; Nawoczenski, D.A. Complexities of Foot Architecture as a Base of Support. J. Orthop. Sports Phys. Ther. 1995, 21, 354–360. [Google Scholar] [CrossRef]
  4. Cavanagh, P.; Morag, E.; Boulton, A.; Young, M.; Deffner, K.; Pammer, S. The relationship of static foot structure to dynamic foot function. J. Biomech. 1997, 30, 243–250. [Google Scholar] [CrossRef]
  5. Matthews, J. The developmental anatomy of the foot. Foot 1998, 8, 17–25. [Google Scholar] [CrossRef]
  6. Trojian, T.; Tucker, A.K. Plantar fasciitis. Am. Fam. Physician 2019, 99, 744–750. [Google Scholar]
  7. Bergmann, J.N. History and mechanical control of heel spur pain. Clin. Podiatr. Med. Surg. 1990, 7, 243–259. [Google Scholar] [CrossRef]
  8. Van Boerum, D.H.; Sangeorzan, B.J. Biomechanics and pathophysiology of flat foot. Foot Ankle Clin. 2003, 8, 419–430. [Google Scholar] [CrossRef]
  9. Roddy, E.; Menz, H.B. Foot osteoarthritis: Latest evidence and developments. Ther. Adv. Musculoskelet. Dis. 2018, 10, 91–103. [Google Scholar] [CrossRef]
  10. Deschamps, K.; Birch, I.; Desloovere, K.; Matricali, G.A. The impact of hallux valgus on foot kinematics: A cross-sectional, comparative study. Gait Posture 2010, 32, 102–106. [Google Scholar] [CrossRef]
  11. Pensec, V.D.; Saraux, A.; Berthelot, J.M.; Alapetite, S.; Jousse, S.; Chales, G.; Thorel, J.B.; Hoang, S.; Nouy-Trolle, I.; Martin, A.; et al. Ability of foot radiographs to predict rheumatoid arthritis in patients with early arthritis. J. Rheumatol. 2004, 31, 66–70. [Google Scholar]
  12. Grushky, A.D.; Im, S.J.; Steenburg, S.D.; Chong, S. Traumatic Injuries of the Foot and Ankle. In Seminars in Roentgenology; Elsevier: Amsterdam, The Netherlands, 2021; Volume 56, pp. 47–69. [Google Scholar]
  13. Vieira, P.; Sousa, O.; Magalhães, D.; Rabêlo, R.; Silva, R. Detecting pulmonary diseases using deep features in X-ray images. Pattern Recognit. 2021, 119, 108081. [Google Scholar] [CrossRef] [PubMed]
  14. U.S. Food and Drug Administration. Medical X-ray Imaging. 2019. Available online: https://www.fda.gov/radiation-emitting-products/medical-imaging/medical-x-ray-imaging (accessed on 22 December 2023).
  15. Candemir, S.; Antani, S. A review on lung boundary detection in chest X-rays. Int. J. Comput. Assist. Radiol. Surg. 2019, 14, R183–R231. [Google Scholar] [CrossRef] [PubMed]
  16. Gefter, W.B.; Hatabu, H. Reducing errors resulting from commonly missed chest radiography findings. Chest 2023, 163, 634–649. [Google Scholar] [CrossRef] [PubMed]
  17. Gefter, W.B.; Post, B.A.; Hatabu, H. Commonly missed findings on chest radiographs: Causes and consequences. Chest 2023, 163, 650–661. [Google Scholar] [CrossRef] [PubMed]
  18. Karar, M.E.; Hemdan, E.E.D.; Shouman, M.A. Cascaded deep learning classifiers for computer-aided diagnosis of COVID-19 and pneumonia diseases in X-ray scans. Complex Intell. Syst. 2020, 7, 235–247. [Google Scholar] [CrossRef] [PubMed]
  19. Zeng, Y.; Liu, X.; Xiao, N.; Li, Y.; Jiang, Y.; Feng, J.; Guo, S. Automatic Diagnosis Based on Spatial Information Fusion Feature for Intracranial Aneurysm. IEEE Trans. Med. Imaging 2020, 39, 1448–1458. [Google Scholar] [CrossRef] [PubMed]
  20. de Abreu Vieira, P.; Vogado, L.; Lopes, L.; Ricardo, R.; Santos Neto, P.; Mathew, M.J.; Magalhães, D.; Silva, R. Deep learning approach for disease detection in lumbosacral spine radiographs using ConvNet. Comput. Methods Biomech. Biomed. Eng. Imaging Vis. 2023, 11, 2560–2575. [Google Scholar] [CrossRef]
  21. He, X.; Deng, L. Deep Learning for Image-to-Text Generation: A Technical Overview. IEEE Signal Process. Mag. 2017, 34, 109–116. [Google Scholar] [CrossRef]
  22. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  23. Pavlopoulos, J.; Kougia, V.; Androutsopoulos, I.; Papamichail, D. Diagnostic captioning: A survey. Knowl. Inf. Syst. 2022, 64, 1691–1722. [Google Scholar] [CrossRef]
  24. Xue, Y.; Tan, Y.; Tan, L.; Qin, J.; Xiang, X. Generating radiology reports via auxiliary signal guidance and a memory-driven network. Expert Syst. Appl. 2024, 237, 121260. [Google Scholar] [CrossRef]
  25. Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Summers, R.M. TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-Rays. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9049–9058. [Google Scholar] [CrossRef]
  26. Cao, Y.; Cui, L.; Zhang, L.; Yu, F.; Li, Z.; Xu, Y. MMTN: Multi-Modal Memory Transformer Network for Image-Report Consistent Medical Report Generation. Proc. AAAI Conf. Artif. Intell. 2023, 37, 277–285. [Google Scholar] [CrossRef]
  27. Huang, X.; Yan, F.; Xu, W.; Li, M. Multi-Attention and Incorporating Background Information Model for Chest X-Ray Image Report Generation. IEEE Access 2019, 7, 154808–154817. [Google Scholar] [CrossRef]
  28. Zhao, G.; Zhao, Z.; Gong, W.; Li, F. Radiology report generation with medical knowledge and multilevel image-report alignment: A new method and its verification. Artif. Intell. Med. 2023, 146, 102714. [Google Scholar] [CrossRef] [PubMed]
  29. Mohsan, M.M.; Akram, M.U.; Rasool, G.; Alghamdi, N.S.; Baqai, M.A.A.; Abbas, M. Vision Transformer and Language Model Based Radiology Report Generation. IEEE Access 2023, 11, 1814–1824. [Google Scholar] [CrossRef]
  30. Kougia, V.; Pavlopoulos, J.; Papapetrou, P.; Gordon, M. RTEX: A novel framework for ranking, tagging, and explanatory diagnostic captioning of radiography exams. J. Am. Med. Inform. Assoc. 2021, 28, 1651–1659. [Google Scholar] [CrossRef]
  31. Tsaniya, H.; Fatichah, C.; Suciati, N. Automatic Radiology Report Generator Using Transformer With Contrast-Based Image Enhancement. IEEE Access 2024, 12, 25429–25442. [Google Scholar] [CrossRef]
  32. Shaik, N.S.; Cherukuri, T.K. Gated contextual transformer network for multi-modal retinal image clinical description generation. Image Vis. Comput. 2024, 143, 104946. [Google Scholar] [CrossRef]
  33. Kong, J.W.; Oh, B.D.; Kim, C.; Kim, Y.S. Sequential Brain CT Image Captioning Based on the Pre-Trained Classifiers and a Language Model. Appl. Sci. 2024, 14, 1193. [Google Scholar] [CrossRef]
  34. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. arXiv 2015, arXiv:1512.00567. [Google Scholar]
  35. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: New York, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
  36. Çallı, E.; Sogancioglu, E.; van Ginneken, B.; van Leeuwen, K.G.; Murphy, K. Deep Learning for Chest X-ray Analysis: A Survey. Med. Image Anal. 2021, 72, 102125. [Google Scholar] [CrossRef] [PubMed]
  37. Vogado, L.; Araújo, F.; Neto, P.S.; Almeida, J.; Tavares, J.M.R.; Veras, R. A ensemble methodology for automatic classification of chest X-rays using deep learning. Comput. Biol. Med. 2022, 145, 105442. [Google Scholar] [CrossRef]
  38. Otsu, N. A Threshold Selection Method from Gray-Level Histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66. [Google Scholar] [CrossRef]
  39. Telea, A. An Image Inpainting Technique Based on the Fast Marching Method. J. Graph. Tools 2004, 9, 23–34. [Google Scholar] [CrossRef]
  40. Zech, J.R.; Badgeley, M.A.; Liu, M.; Costa, A.B.; Titano, J.J.; Oermann, E.K. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS Med. 2018, 15, e1002683. [Google Scholar] [CrossRef] [PubMed]
  41. Geirhos, R.; Jacobsen, J.H.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F.A. Shortcut Learning in Deep Neural Networks. arXiv 2020, arXiv:2004.07780. [Google Scholar] [CrossRef]
  42. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI); Springer: Berlin/Heidelberg, Germany, 2015; Volume 9351, pp. 234–241. [Google Scholar]
  43. Pizer, S.M.; Johnston, R.E.; Ericksen, J.P.; Yankaskas, B.C.; Muller, K.E. Contrast-Limited Adaptive Histogram Equalization: Speed and Effectiveness. In Proceedings of the First Conference on Visualization in Biomedical Computing, Atlanta, GA, USA, 22–25 May 1990. [Google Scholar]
  44. Butnaru, A.M.; Ionescu, R.T. From Image to Text Classification: A Novel Approach based on Clustering Word Embeddings. Procedia Comput. Sci. 2017, 112, 1783–1792. [Google Scholar] [CrossRef]
  45. Gong, Y.; Cosma, G.; Fang, H. On the Limitations of Visual-Semantic Embedding Networks for Image-to-Text Information Retrieval. J. Imaging 2021, 7, 125. [Google Scholar] [CrossRef] [PubMed]
  46. Islam, S.; Elmekki, H.; Elsebai, A.; Bentahar, J.; Drawel, N.; Rjoub, G.; Pedrycz, W. A comprehensive survey on applications of transformers for deep learning tasks. Expert Syst. Appl. 2024, 241, 122666. [Google Scholar] [CrossRef]
  47. Xiao, T.; Zhu, J. Introduction to Transformers: An NLP Perspective. arXiv 2023, arXiv:2311.17633. [Google Scholar]
  48. Shorten, C.; Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
  49. Chlap, P.; Min, H.; Vandenberg, N.; Dowling, J.; Holloway, L.; Haworth, A. A review of medical image data augmentation techniques for deep learning applications. J. Med. Imaging Radiat. Oncol. 2021, 126, 545–563. [Google Scholar] [CrossRef] [PubMed]
  50. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  51. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 6999–7019. [Google Scholar] [CrossRef] [PubMed]
  52. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009. [Google Scholar]
  53. Tajbakhsh, N.; Shin, J.; Gurudu, S.; Hurst, R.; Kendall, C.; Gotway, M.; Liang, J. Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning? IEEE Trans. Med. Imaging 2016, 35, 1299–1312. [Google Scholar] [CrossRef] [PubMed]
  54. Lakhani, P. Deep Convolutional Neural Networks for Endotracheal Tube Position and X-ray Image Classification: Challenges and Opportunities. J. Digit. Imaging 2017, 30, 460–468. [Google Scholar] [CrossRef]
  55. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.u.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
  56. Guo, J.; Wong, K.; Cheng, B.; Chung, C. Neural data-to-text generation: An encoder-decoder structure with Multi-Candidate-based Context Module. In Proceedings of the 2022 International Symposium on Intelligent Signal Processing and Communication Systems, Penang, Malaysia, 22–25 November 2022. [Google Scholar] [CrossRef]
  57. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Stroudsburg, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef]
  58. Denkowski, M.; Lavie, A. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the Ninth Workshop on Statistical Machine Translation; Bojar, O., Buck, C., Federmann, C., Haddow, B., Koehn, P., Monz, C., Post, M., Specia, L., Eds.; Association for Computational Linguistics: Baltimore, MD, USA, 2014; pp. 376–380. [Google Scholar] [CrossRef]
  59. Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81. [Google Scholar]
  60. Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1995; Volume 14, pp. 1137–1145. [Google Scholar]
  61. Saraiva, A.A.; Ferreira, N.M.F.; de Sousa, L.L.; Costa, N.J.C.; Sousa, J.V.M.; Santos, D.B.S.; Valente, A.; Soares, S. Classification of Images of Childhood Pneumonia using Convolutional Neural Networks. In Proceedings of the 12th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC), Prague, Czech Republic, 22–24 February 2019. [Google Scholar] [CrossRef]
  62. Ye, M.; Yan, X.; Chen, N.; Liu, Y. A robust multi-scale learning network with quasi-hyperbolic momentum-based Adam optimizer for bearing intelligent fault diagnosis under sample imbalance scenarios and strong noise environment. Struct. Health Monit. 2024, 23, 1664–1686. [Google Scholar] [CrossRef]
  63. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980. [Google Scholar]
  64. Nguyen, D.; Fablet, R. A transformer network with sparse augmented data representation and cross entropy loss for ais-based vessel trajectory prediction. IEEE Access 2024, 12, 21596–21609. [Google Scholar] [CrossRef]
  65. Krzywinski, M.; Altman, N. Points of significance: Significance, P values and t-tests. Nat. Methods 2013, 10, 1041. [Google Scholar] [CrossRef] [PubMed]
  66. Kalpić, D.; Hlupić, N.; Lovrić, M. Student’s t-Tests. In International Encyclopedia of Statistical Science; Lovric, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 1559–1563. [Google Scholar] [CrossRef]
  67. Fiandini, M.; Nandiyanto, A.B.D.; Al Husaeni, D.F.; Al Husaeni, D.N.; Mushiban, M. How to calculate statistics for significant difference test using SPSS: Understanding students comprehension on the concept of steam engines as power plant. Indones. J. Sci. Technol. 2024, 9, 45–108. [Google Scholar] [CrossRef]
  68. Siripattanadilok, W.; Siriborvornratanakul, T. Recognition of partially occluded soft-shell mud crabs using Faster R-CNN and Grad-CAM. Aquac. Int. 2024, 32, 2977–2997. [Google Scholar] [CrossRef]
  69. Selvaraju, R.R.; Das, A.; Vedantam, R.; Cogswell, M.; Parikh, D.; Batra, D. Grad-CAM: Why did you say that? arXiv 2016, arXiv:1611.07450. [Google Scholar]
Figure 1. Steps one to three of the proposed methodology. Stage 1: Acquisition of Polydactyly exams. Stage 2: Image preprocessing. Stage 3: Text preprocessing.
Figure 2. Step four of the proposed methodology, generative artificial intelligence for the automatic generation of reports from pododactyl radiographs: Initially, four CNNs (Inception-V3) extract the features from the incidences, which are then concatenated into a single feature vector. This feature vector is then passed to a Transformer architecture, which interprets it and generates the report automatically.
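As a concrete illustration of this architecture, the following Keras sketch is a deliberately simplified, hypothetical version: a single Inception-V3 backbone is shared across the four views for brevity (the methodology uses four separate CNNs), the concatenated features serve as the memory of a minimal Transformer-style decoder, and all dimensions and hyperparameters are assumptions rather than the study’s configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# All sizes below are assumptions for illustration, not the paper's configuration.
IMG_SHAPE, EMBED_DIM, VOCAB, MAX_LEN = (299, 299, 3), 256, 2256, 200

# One Inception-V3 backbone shared across the four incidences for brevity
# (the methodology uses four separate CNNs, one per view).
backbone = tf.keras.applications.InceptionV3(include_top=False, weights=None,
                                             pooling="avg", input_shape=IMG_SHAPE)

views = [layers.Input(shape=IMG_SHAPE, name=f"view_{i}") for i in range(4)]
feats = layers.Concatenate()([backbone(v) for v in views])   # single fused feature vector
memory = layers.Dense(EMBED_DIM, activation="relu")(feats)
memory = layers.Reshape((1, EMBED_DIM))(memory)              # image "memory" for cross-attention

# Minimal Transformer-style decoder: token embedding, causal self-attention,
# cross-attention over the image features, and a next-word softmax.
tokens = layers.Input(shape=(MAX_LEN,), dtype="int32", name="report_tokens")
x = layers.Embedding(VOCAB, EMBED_DIM)(tokens)
x = x + layers.MultiHeadAttention(num_heads=4, key_dim=64)(x, x, use_causal_mask=True)
x = x + layers.MultiHeadAttention(num_heads=4, key_dim=64)(x, memory)
out = layers.Dense(VOCAB, activation="softmax")(x)

model = tf.keras.Model(views + [tokens], out)
model.summary()
```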
Figure 3. Example of a Polydactyly examination, showing both frontal and lateral views of the same right foot. This radiograph allows us to see the regions of the phalanges, metatarsals, and tarsal bones, as well as metallic markers placed by specialists.
Figure 4. Example of a Polydactyly examination with its medical report: (a) left frontal view; (b) right frontal view; (c) right lateral view; (d) left lateral view; (e) medical report in the original language Portuguese (PT-BR); and (f) medical report translated into English.
Figure 5. Distribution of text lengths in the dataset.
Figure 6. Graph showing the frequency of words in the dataset: (a) The most frequent words in the text include the following: presente “present”, textura “texture”, método “method”, estrutura “structure”, normais “normal”, pé “foot”, óssea “bone”, densidade “density”, superfícies “surfaces”, partes “parts”, moles “soft”, alterações “changes”, relatório “report”, espaços “spaces”, articulações “joints”; (b) The least frequent words in the text include: 140, 138, lados “sides”, sesamóides “sesamoids”, oslados (misspelling) “oslados”, cacâneos (misspelling) “cacâneos”, falange “phalanx”, tarsometatarseano “tarsometatarsal”, entésofito “enthesophyte”, v, aspecrtos (misspelling) “aspecrtos”, calcaneocuboide “calcaneocuboid”, metatarsolafangeana (misspelling) “metatarsolafangeana”, sobretudo “above all”, lysfranc “Lisfranc”.
Figure 7. Word cloud representing the 150 most frequent words in the dataset.
Figure 8. Preprocessing workflow for radiographic examination of Polydactylys from frontal and lateral views. Initially, Otsu thresholding was employed to eliminate extraneous borders. Subsequently, zero-padding was applied to achieve undistorted resizing into a square format. Metal tokens were then segmented using a U-Net architecture, and the segmented regions were filled using the Fast Marching Method (FMM). Finally, the resulting image was histogram-equalized with CLAHE.
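An OpenCV-based sketch of this preprocessing chain is given below. It is illustrative only: the U-Net segmentation step is replaced by a pre-computed metal-token mask passed in by the caller, and the parameter values (CLAHE clip limit, tile size, inpainting radius, output size) are assumptions, not the values used in the study.

```python
import cv2
import numpy as np

def preprocess_xray(img_gray, metal_mask=None, size=299):
    """Sketch of the Figure 8 pipeline. `metal_mask` is a uint8 binary mask that
    stands in for the U-Net output (255 where a metal token was detected)."""
    # 1. Otsu thresholding to locate the foot and crop extraneous borders.
    _, th = cv2.threshold(img_gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    ys, xs = np.where(th > 0)
    img = img_gray[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    # 2. Zero-padding to a square so resizing does not distort the anatomy.
    h, w = img.shape
    side = max(h, w)
    pad_y, pad_x = (side - h) // 2, (side - w) // 2
    img = cv2.copyMakeBorder(img, pad_y, side - h - pad_y, pad_x, side - w - pad_x,
                             cv2.BORDER_CONSTANT, value=0)

    # 3. Fill the segmented metal-token regions with the Fast Marching Method (FMM).
    if metal_mask is not None:
        mask = cv2.resize(metal_mask, img.shape[::-1], interpolation=cv2.INTER_NEAREST)
        img = cv2.inpaint(img, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)

    # 4. CLAHE histogram equalization and final resize.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return cv2.resize(clahe.apply(img), (size, size))

# Toy usage with a synthetic image standing in for a radiograph.
demo = (np.random.rand(600, 400) * 255).astype(np.uint8)
print(preprocess_xray(demo).shape)  # (299, 299)
```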
Figure 9. Example of Polydactyly data augmentation: (a,e) original frontal and lateral images; (b–d) examples of synthetic samples generated from (a) using random data augmentation techniques; (f–h) examples of synthetic samples generated from (e) using random data augmentation techniques.
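An augmentation pipeline of the kind illustrated in Figure 9 could be written as follows; the specific transforms and their ranges are assumptions for illustration, and horizontal flips are deliberately avoided so that left and right feet are not swapped.

```python
import tensorflow as tf

# Illustrative random augmentation; ranges are assumed, not the study's settings.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.03),           # roughly +/- 10 degrees
    tf.keras.layers.RandomTranslation(0.05, 0.05),  # small shifts
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomContrast(0.1),
])

batch = tf.random.uniform((4, 299, 299, 3))         # a batch of preprocessed radiographs
synthetic = augment(batch, training=True)           # new synthetic samples on every call
print(synthetic.shape)                              # (4, 299, 299, 3)
```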
Figure 10. Box plot of the results of the automatic generation of medical reports with the metrics BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, and ROUGE-L across the 10 folds, for both the “anormal” and “normal” classes, as well as for the combination of both “All Labels”.
Figure 11. Graphs illustrating the model’s performance during training and validation. (a) Convergence of loss and accuracy metrics, with only the Transformer learning dataset characteristics. (b) Continuation of learning with deep fine-tuning, where both the CNNs and the Transformer learn the dataset characteristics.
Figure 12. Examples of exams (ac), their medical reports and those generated by our method, and the evaluation metrics. Right foot references are in blue, left foot in red. Green text indicates exact matches; purple text shows omissions; pink text highlights the additions or discrepancies of the model.
Figure 13. Additional examples of exams (ac), their medical reports and those generated by our method, and the evaluation metrics. Right foot references are in blue, left foot in red. Green text indicates exact matches; purple text shows omissions; pink text highlights the additions or discrepancies of the model.
Table 1. Quantitative information about reports.
Information | Value
Unique words | 2256
Most words in a report | 1167
Fewest words in a report | 14
Average words per report | 193.53
Standard deviation of word count | 107.24
Table 2. Average metric results for each fold along with the standard deviation.
Class | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L
AB | 0.426 ± 0.030 | 0.340 ± 0.031 | 0.303 ± 0.034 | 0.282 ± 0.037 | 0.326 ± 0.022 | 0.271 ± 0.027
N | 0.653 ± 0.072 | 0.570 ± 0.102 | 0.509 ± 0.094 | 0.503 ± 0.094 | 0.548 ± 0.068 | 0.504 ± 0.078
T | 0.516 ± 0.018 | 0.432 ± 0.026 | 0.386 ± 0.020 | 0.370 ± 0.019 | 0.414 ± 0.018 | 0.364 ± 0.021
AB is abnormal, N is normal and T is together.
Table 3. SOTA results in AI medical report generation for various medical images compared with our methodology. In bold are the best results.
METHODS | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L
[24] | * 0.372 | * 0.233 | * 0.154 | * 0.112 | * 0.152 | 0.286
[25] | * 0.286 | * 0.159 | * 0.104 | * 0.074 | * 0.108 | * 0.226
[26] | * 0.799 | * 0.692 | * 0.634 | * 0.589 | - | * 0.748
[27] | 0.476 | 0.340 | * 0.238 | * 0.169 | - | 0.347
[28] | * 0.399 | * 0.158 | * 0.109 | * 0.152 | * 0.275 | -
[29] | 0.532 | 0.344 | * 0.233 | * 0.158 | * 0.218 | 0.387
[30] | - | - | - | - | - | 0.267
[31] | * 0.363 | 0.371 | 0.388 | 0.412 | - | -
[32] | * 0.297 | * 0.230 | * 0.214 | * 0.142 | - | 0.391
[33] | * 0.280 | * 0.210 | * 0.170 | * 0.140 | * 0.140 | 0.290
Our | 0.516 | 0.432 | 0.386 | 0.370 | 0.414 | 0.364
* Studies with p-values less than 0.05, when compared with our metrics.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
