The main objective of this article is to locate the scaphoid bone in X-ray images and assist doctors in determining whether it is fractured, as well as in identifying the fracture area. Displaced fractures are typically easy to detect in X-ray images, allowing doctors to diagnose them quickly. In contrast, non-displaced and occult fractures are more challenging to identify because of X-ray projection shadowing and variations in bone density among patients. Additionally, discrepancies in interpretation between radiologists and orthopedic surgeons are common. For fractures not visible in the anterior–posterior view, doctors must examine the scaphoid bone for discontinuities, making the diagnosis time-consuming and demanding a thorough understanding of the scaphoid.
2.2.1. Scaphoid Bone Area Detection
As mentioned earlier, the first stage aims to extract the scaphoid bone from the entire wrist X-ray image. For this purpose, we used a Faster RCNN model combined with a Feature Pyramid Network (FPN). Faster RCNN is a well-known two-stage detector whose Region Proposal Network (RPN) speeds up detection by replacing the hand-crafted proposal algorithms (such as selective search) used in earlier two-stage detectors. The Feature Pyramid Network improves the utilization of feature maps at various scales, addressing the issue that deeper networks may lose small objects.
In our research, if the first stage fails to locate the scaphoid bone accurately, subsequent fracture detection cannot proceed effectively. Therefore, we employed the two-stage Faster RCNN model with the FPN at this stage, despite its slightly slower processing time, because of its improved accuracy. Unlike the original Faster RCNN model, which uses a single-level feature map to generate candidate boxes, the FPN integrates feature maps from multiple levels. This approach helps retain small objects that might otherwise be lost during pooling, thereby enhancing detection performance for small and medium-sized objects.
Considering that a typical wrist X-ray image measures approximately 2000 × 1600 pixels and the scaphoid bone region is around 150 × 150 pixels, the scaphoid falls into the category of small and medium-sized objects. To address this, we used the Feature Pyramid Network (FPN), which upsamples and enlarges deep features to match the size of previous layers and then integrates them using a convolutional layer. Specifically, after incorporating the FPN, a 1 × 1 convolutional layer adjusts the channel number to 256, and deep features are upsampled and combined with the previous layer’s features. A final 3 × 3 convolutional layer refines the feature map. This multi-layer approach, with varied anchor box sizes and aspect ratios, enhances overall detection performance.
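As an illustrative sketch of this top-down merging step (a minimal PyTorch example, not the implementation used in this work; the 256-channel adjustment, the upsampling and addition, and the final 3 × 3 convolution follow the description above, while the module and argument names are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNLateralMerge(nn.Module):
    """Minimal FPN-style merge of one deep and one shallow feature map."""
    def __init__(self, deep_channels, shallow_channels, out_channels=256):
        super().__init__()
        # 1x1 convolutions adjust both inputs to the shared channel count (256).
        self.lateral_deep = nn.Conv2d(deep_channels, out_channels, kernel_size=1)
        self.lateral_shallow = nn.Conv2d(shallow_channels, out_channels, kernel_size=1)
        # 3x3 convolution refines the merged feature map.
        self.refine = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, deep, shallow):
        # Upsample the deep feature map to the spatial size of the shallower one.
        top_down = F.interpolate(self.lateral_deep(deep),
                                 size=shallow.shape[-2:], mode="nearest")
        merged = top_down + self.lateral_shallow(shallow)
        return self.refine(merged)
```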
Additionally, the Faster RCNN model in this study uses RoI Align to generate feature maps of consistent dimensions for different proposals. RoI Align addresses the limitations of RoI Pooling by avoiding the quantization (rounding) errors introduced when a region is divided into smaller blocks; instead, it samples each block with bilinear interpolation, yielding more accurate feature maps and improving detection results.
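For reference, a sketch of this operation using torchvision's RoI Align on dummy data (the feature-map size, stride, and 7 × 7 output size here are assumptions, not values from this study):

```python
import torch
from torchvision.ops import roi_align

# One feature map of shape [batch, channels, H, W].
features = torch.randn(1, 256, 50, 50)
# Proposals in (batch_index, x1, y1, x2, y2) format, in input-image coordinates.
boxes = torch.tensor([[0, 40.0, 60.0, 190.0, 210.0]])
# spatial_scale maps image coordinates onto the feature map (e.g., stride 16 -> 1/16).
pooled = roi_align(features, boxes, output_size=(7, 7),
                   spatial_scale=1.0 / 16, sampling_ratio=2)
print(pooled.shape)  # torch.Size([1, 256, 7, 7])
```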
To handle multiple overlapping bounding boxes, Non-Maximum Suppression (NMS) is applied to retain the most accurate ones. NMS selects the bounding box with the highest score, moves it to the output list, and calculates the Intersection over Union (IoU) between this box and the remaining candidates. Any candidate whose IoU exceeds a predefined threshold is removed. The process repeats with the next highest-scoring candidate until no candidates remain, so that only the highest-scoring, non-overlapping boxes are kept to represent the detected objects.
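A minimal sketch of this greedy procedure (equivalent in spirit to torchvision.ops.nms; the 0.5 IoU threshold is illustrative):

```python
import torch
from torchvision.ops import box_iou

def nms(boxes: torch.Tensor, scores: torch.Tensor, iou_threshold: float = 0.5) -> torch.Tensor:
    """Greedy NMS: keep the highest-scoring box, drop overlapping boxes, repeat."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())
        if order.numel() == 1:
            break
        # IoU between the selected box and all remaining candidates.
        ious = box_iou(boxes[best].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= iou_threshold]
    return torch.tensor(keep, dtype=torch.long)
```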
2.2.2. Fracture Classification and Detection
The following sections detail the multi-view fusion model proposed in this article. Multi-view image techniques have shown promise in enhancing classification capabilities, particularly in clinical settings where accurately locating fractures is crucial. The effective fusion of multi-view data is essential to improving fracture detection beyond what can be achieved with 2D projection images alone. The model consists of the following key components: the backbone network, the context extractor, the Feature Pyramid Network (FPN), and the multi-view fusion module, as illustrated in Figure 4.
Backbone Network: The anterior–posterior (AP) and lateral (LA) views are first processed through the backbone network to extract features. The backbone network must have a sufficiently large receptive field to capture meaningful features, so deeper networks such as ResNet, which use residual connections to mitigate problems like vanishing gradients, are well suited. Among ResNet-50, ResNet-101, and ResNet-152, ResNet-152 proved the most effective in our experiments. It provides deep, high-quality features, and the outputs of its last three blocks are fed into the context extractors.
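A sketch of how the last three residual stages of a torchvision ResNet-152 can be tapped for this purpose (the input size, weight initialization, and the use of a single backbone instance are assumptions for illustration):

```python
import torch
from torchvision.models import resnet152
from torchvision.models.feature_extraction import create_feature_extractor

backbone = resnet152(weights=None)  # or weights="IMAGENET1K_V1" for pretrained weights
# Expose the outputs of the last three residual stages (often called C3, C4, C5).
extractor = create_feature_extractor(
    backbone, return_nodes={"layer2": "c3", "layer3": "c4", "layer4": "c5"}
)

xray_view = torch.randn(1, 3, 512, 512)  # one AP or LA view; the size is assumed
feats = extractor(xray_view)
for name, f in feats.items():
    print(name, tuple(f.shape))  # channel counts for ResNet-152: 512, 1024, 2048
```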
Context Extractor: To address the challenge of detecting occult or non-displaced fractures, which can be camouflaged by lighting or bone-density variations, the context extractor expands the receptive field without pooling by using dilated convolutions. This approach, as demonstrated by Mei et al. [33], helps to extract rich semantic features across a broader context.
Each context extractor contains four context extraction branches, as shown in Figure 5. In each branch CE_i (i = 1, …, 4), a 1 × 1 convolution first compresses the number of channels, and a k_i × k_i convolution followed by a dilated convolution with a branch-specific dilation rate then extracts context-aware features. Each convolutional layer is followed by a batch normalization layer and a ReLU. The output of each branch CE_i is added to the output of the adjacent branch to stack the different ranges of receptive fields together. Finally, the outputs of the four context extraction branches are concatenated, and a 1 × 1 convolution integrates all features. Such a design enables the context extractor to obtain a wide range of rich perceptual features, giving the model contextual reasoning capabilities and the ability to see finer differences in the feature maps. The FPN then enhances the feature maps extracted at different scales by combining deep and shallow features, improving the detection of small- and medium-sized objects like the scaphoid bone.
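A minimal sketch of such a context extractor under our reading of this design (the channel reduction to one-quarter, the kernel sizes, and the dilation rates below are illustrative assumptions; the exact settings follow Mei et al. [33]):

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, k, dilation=1):
    pad = dilation * (k - 1) // 2  # keep the spatial size unchanged
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class ContextBranch(nn.Module):
    """One branch: 1x1 channel reduction, kxk local conv, dilated conv."""
    def __init__(self, channels, k, dilation):
        super().__init__()
        reduced = channels // 4  # four branch outputs are concatenated afterwards
        self.reduce = conv_bn_relu(channels, reduced, 1)
        self.local = conv_bn_relu(reduced, reduced, k)
        self.context = conv_bn_relu(reduced, reduced, 3, dilation=dilation)

    def forward(self, x, previous=None):
        y = self.context(self.local(self.reduce(x)))
        if previous is not None:
            y = y + previous  # stack the receptive field of the adjacent branch
        return y

class ContextExtractor(nn.Module):
    def __init__(self, channels, kernels=(1, 3, 5, 7), dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            ContextBranch(channels, k, d) for k, d in zip(kernels, dilations)
        )
        self.fuse = conv_bn_relu(channels, channels, 1)  # integrate the concatenation

    def forward(self, x):
        outputs, prev = [], None
        for branch in self.branches:
            prev = branch(x, prev)
            outputs.append(prev)
        return self.fuse(torch.cat(outputs, dim=1))
```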
Multi-View Fusion Module: After feature extraction and context enhancement, the FPN outputs from the AP and LA views are fed into the multi-view fusion module, which integrates features from both views to enhance fracture classification and detection accuracy. The FPN feature maps of corresponding levels from the AP and LA views are input into a fusion block, as shown in Figure 6.
The fusion block first performs global average pooling on the feature maps of the two views, producing [B, C, 1, 1] tensors that are flattened and concatenated. The designed weight transform then calculates the common important information of the two views and multiplies this weight back onto the respective views. Finally, the features of the two views are concatenated for the final classification. The goal is to find as much of the fractured scaphoid region as possible. We add the output of F_i to the output of the adjacent fusion block, where F_i represents the i-th fusion block.
The algorithm is as follows:
- (1) Feature Extraction: Both AP and LA views are processed through the backbone network to extract initial features.
- (2) Contextual Information: The context extractor refines these features by expanding the receptive field and integrating contextual information.
- (3) Feature Fusion: The FPN combines features from multiple scales to improve detection performance.
- (4) Final Classification and Detection: The multi-view fusion module combines features from both views to classify fractures and re-evaluate detection boxes for precise localization (see the sketch after this list).
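Putting the four steps together, a schematic forward pass could look as follows (the module names are placeholders for the components described above; sharing one backbone and one context extractor across views and scales is an assumption, not necessarily the configuration used in this work):

```python
def forward_pass(ap_image, la_image, backbone, context_extractor, fpn,
                 fusion_module, detector_head, classifier_head):
    # (1) Feature extraction from both views.
    ap_feats = backbone(ap_image)
    la_feats = backbone(la_image)
    # (2) Contextual enrichment of each view's multi-scale feature maps.
    ap_feats = [context_extractor(f) for f in ap_feats]
    la_feats = [context_extractor(f) for f in la_feats]
    # (3) Multi-scale combination with the FPN.
    ap_pyramid = fpn(ap_feats)
    la_pyramid = fpn(la_feats)
    # (4) Per-view detection plus multi-view fused classification.
    ap_boxes = detector_head(ap_pyramid)
    la_boxes = detector_head(la_pyramid)
    fracture_prob = classifier_head(fusion_module(ap_pyramid, la_pyramid))
    return ap_boxes, la_boxes, fracture_prob
```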
The model aims to improve the accuracy of fracture detection, particularly for occult and non-displaced fractures, by leveraging comprehensive multi-view information and advanced feature extraction techniques.
Most detection methods, whether one-stage or two-stage, use only a single scale for prediction. However, in order to extract better features, models are often made very deep, which risks losing information about small objects. The FPN combines deep and shallow information well, allowing the feature map at each scale to be used effectively and greatly improving detection results for small objects. The ground-truth fracture regions in this article are also small objects, so an FPN is added after the context extractor. Its structure is the same as in the first-stage Faster RCNN: first, a 1 × 1 convolution layer adjusts the number of channels of the input feature map to 256; then, the feature map is upsampled and added to the previous layer's feature map; and finally, the information is integrated through a 3 × 3 convolution.
Weight transform (see Figure 7) is mainly used to combine information from the AP and LA views. As mentioned above, directly classifying the merged features is too coarse. In general, the importance of the AP view is higher than that of the LA view, so classification after direct concatenation may let the AP features dominate and cause the LA features to be lost. Therefore, we design a weight transform that combines the features of the AP and LA views, calculates the common important information of the two views, and then multiplies it back onto the respective views. In this way, the information of the LA views is preserved and its weight is not discarded entirely. Equations (1) and (2) show the module construction for the feature maps AP_i and LA_i of scale i, where i indexes the layers of the feature pyramid, FC denotes a fully connected layer, and Concat denotes concatenation. The number of channels output by the two fully connected layers is a fraction of the number of input channels.
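A minimal sketch of one fusion block with its weight transform (the sigmoid gate and the channel-halving ratio of the first fully connected layer are assumptions where the text leaves the exact values open):

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Fuse AP and LA feature maps of one pyramid level via a shared channel weight."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling -> [B, C, 1, 1]
        # Weight transform: two fully connected layers on the concatenated descriptors.
        self.weight_transform = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
            nn.Sigmoid(),
        )

    def forward(self, ap_feat, la_feat):
        b, c, _, _ = ap_feat.shape
        descriptor = torch.cat(
            [self.pool(ap_feat).flatten(1), self.pool(la_feat).flatten(1)], dim=1
        )
        weight = self.weight_transform(descriptor).view(b, c, 1, 1)
        # Multiply the shared importance weight back onto each view, then concatenate.
        return torch.cat([ap_feat * weight, la_feat * weight], dim=1)
```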
The architecture of the detection branch in the second stage is a fully convolutional network. The classification and regression prediction heads of the detection branch share the same structure, consisting of 3 × 3 convolutional layers, each followed by a batch normalization layer and a ReLU, and a final output convolutional layer. The number of output channels of the classification head is A × C, and the number of output channels of the regression head is A × 5, where A and C represent the number of anchor boxes and categories, respectively, and the five outputs (x, y, w, h, θ) represent the center-point coordinates, width, height, and angle of the prediction box [22]. Each layer of the FPN has its own prediction head, and finally the results of each layer are merged, matched with the ground truth, and used to calculate the loss. The multi-view fusion classification branch consists of a fully connected layer, a ReLU, and a second fully connected layer. The first fully connected layer takes the refined features from the multi-view fusion module as input, and its number of output channels is half the number of inputs. The output of the second fully connected layer has 2 dimensions, indicating the probability of fracture and non-fracture.
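A sketch of these prediction heads (the number of stacked 3 × 3 convolutions, the 1 × 1 output convolution, and the anchor/class counts are illustrative assumptions; A, C, and the five box parameters follow the definitions above):

```python
import torch.nn as nn

def make_head(in_channels, out_channels, num_convs=4):
    """Stack of 3x3 conv + BN + ReLU blocks followed by a 1x1 output convolution."""
    layers = []
    for _ in range(num_convs):
        layers += [
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.Conv2d(in_channels, out_channels, 1))
    return nn.Sequential(*layers)

num_anchors, num_classes, fpn_channels = 9, 1, 256   # A, C, and channel count assumed
cls_head = make_head(fpn_channels, num_anchors * num_classes)  # A x C output channels
box_head = make_head(fpn_channels, num_anchors * 5)            # (x, y, w, h, angle)

# Multi-view fusion classification branch: FC -> ReLU -> FC with 2 outputs.
fused_dim = 2 * fpn_channels
fusion_classifier = nn.Sequential(
    nn.Linear(fused_dim, fused_dim // 2),
    nn.ReLU(inplace=True),
    nn.Linear(fused_dim // 2, 2),  # fracture vs. non-fracture
)
```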
To pass information from the two views to fracture detection through this classification branch, the classification results are used to rescore the prediction boxes in the final prediction stage. In the ideal case, both the classification score and the predicted box score are close to 1, so when both branches predict their maximum, the sum is 2, and the minimum is 0. We took 0.5 as the threshold of the classification branch: the classification score is kept if it is greater than or equal to the threshold and set to 0 otherwise. The purpose of this is to give priority to the classification results, because the classification branch directly uses the multi-view features and can therefore better distinguish whether the scaphoid is fractured. The rescoring formula is shown in Equation (3).
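A sketch of this rescoring step as described in the text (the exact form of Equation (3) is not reproduced here; adding the thresholded classification score to each box score is our reading):

```python
import torch

def rescore(box_scores: torch.Tensor, cls_score: float, threshold: float = 0.5) -> torch.Tensor:
    """Add the multi-view classification score to every box score,
    zeroing it first if it falls below the threshold."""
    gated = cls_score if cls_score >= threshold else 0.0
    return box_scores + gated  # best case: 1 + 1 = 2, worst case: 0
```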
The model in the second stage consists of three parts, namely, the detector for the AP views, the detector for the LA views, and the classifier for multi-view fusion. The following describes their loss functions. The detector part is similar to Faster RCNN but regresses an additional angle. First, the ground-truth box and the prediction box are encoded relative to the anchor box as the regression target t* and the prediction t, which are defined in Equations (4) and (5). Then, the smooth L1 loss is used to regress the positive samples, where the smooth L1 loss is defined in Equation (6).
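A sketch of a Faster RCNN-style box encoding extended with an angle term, together with the smooth L1 loss applied to positive samples (how the angle offset is normalized in Equations (4)–(6) is an assumption here):

```python
import torch

def encode(box, anchor):
    """Encode (x, y, w, h, theta) relative to an anchor, Faster RCNN style plus angle."""
    x, y, w, h, theta = box.unbind(-1)
    xa, ya, wa, ha, ta = anchor.unbind(-1)
    return torch.stack([
        (x - xa) / wa,
        (y - ya) / ha,
        torch.log(w / wa),
        torch.log(h / ha),
        theta - ta,  # angle offset; the normalization scheme is assumed
    ], dim=-1)

def smooth_l1(pred, target, beta=1.0):
    diff = torch.abs(pred - target)
    return torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).sum(-1)

# Regression loss over positive anchors only:
# reg_loss = smooth_l1(pred_deltas, encode(gt_boxes, anchors)).mean()
```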
The classification prediction head of the detection branch uses the Binary Cross Entropy loss, defined in Equations (7) and (8) as follows:
Since the number of negative samples is much larger than that of positive samples, we divide by the number of positive samples rather than the total number of samples when taking the average; this prevents the classifier from being dominated by the many negative samples and better addresses the problem of sample imbalance. Here, N represents the total number of samples, and N_pos represents the number of positive samples. The loss function of the multi-view fusion classification is the same as that of Faster RCNN; both use the Cross Entropy loss. The total loss is defined in Equation (9).
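A sketch of the positive-normalized binary cross entropy and the total loss as a plain sum of the three parts (equal weighting of the terms in Equation (9) is an assumption):

```python
import torch
import torch.nn.functional as F

def bce_normalized_by_positives(logits, targets):
    """Binary cross entropy summed over all N samples but averaged over the
    N_pos positive samples, reducing the influence of the many negatives."""
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="sum")
    num_pos = targets.sum().clamp(min=1.0)  # N_pos, clamped to avoid division by zero
    return loss / num_pos

# total_loss = loss_detector_ap + loss_detector_la + loss_multiview_classifier
```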