4.3.1. Mask Design
The mask concept draws on photolithography in semiconductor manufacturing, where an opaque template covers selected areas so that etching or diffusion occurs only outside the designated region; hence the name “mask.” In this experiment, the role of the mask is to extract the region of interest. First, the mask is initialized as an all-zero matrix of the same size as the image; the matrix values inside the region of interest are then set to 1, while the rest remain 0. Each target corresponds to one mask matrix; for example, a sagittal image containing 11 vertebral body and disc marks generates a total of 11 mask images.
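The mask construction described above can be sketched as follows. This is an illustrative example, not the authors' code; the image size and box coordinates are hypothetical.

```python
import numpy as np

def make_mask(image_shape, y0, y1, x0, x1):
    """Return an all-zero matrix with the region of interest set to 1."""
    mask = np.zeros(image_shape, dtype=np.uint8)
    mask[y0:y1, x0:x1] = 1
    return mask

# One mask per annotated structure, e.g. 11 masks for 11 targets
# (box positions here are made-up placeholders).
masks = [make_mask((512, 512), 100 + 30 * i, 120 + 30 * i, 200, 260)
         for i in range(11)]
```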
There are many ways to make a mask. A common approach is manual annotation with image annotation tools such as VGG Image Annotator (VIA) and Labelme, which support rectangular, circular, polygonal, or point-based borders and can simultaneously generate classification labels that are directly usable for training. Graphical annotation tools can delineate the region of interest more precisely, but this method is time-consuming and labor-intensive. This article instead uses the marked-point information provided with the data to generate a rectangular box as the mask, which is more convenient.
This experiment involves nine classification categories: (1) BG (background); (2) disc_v1; (3) disc_v2; (4) disc_v3; (5) disc_v4; (6) disc_v5; (7) vertebra_v1; (8) vertebra_v2; (9) T12_S1. Among them, T12_S1 is specifically used to define the boundaries of the entire lumbar spine, preventing the Mask RCNN algorithm from incorrectly identifying vertebrae and discs outside the designated lumbar area.
This experiment takes the marked points of the intervertebral discs and vertebrae as centers and constructs horizontal rectangular frames whose width and height are set in proportion to the width and height of the sagittal image, using these frames as the masks of the discs and vertebrae. A sagittal image is randomly selected from the training set, and the masks are displayed on it, as shown in Figure 4.
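A minimal sketch of this rectangular-box mask construction is given below. The fractions of the image width and height used for the box (`frac_w`, `frac_h`) are hypothetical placeholders, as are the example coordinates.

```python
import numpy as np

def box_mask(image_h, image_w, cx, cy, frac_w=0.15, frac_h=0.08):
    """Build a horizontal rectangle mask centered on a marked point (cx, cy).

    Box size is a fixed fraction of the sagittal image size; the box is
    clipped to the image borders.
    """
    bw, bh = int(image_w * frac_w), int(image_h * frac_h)
    x0 = max(cx - bw // 2, 0)
    y0 = max(cy - bh // 2, 0)
    x1 = min(cx + bw // 2, image_w)
    y1 = min(cy + bh // 2, image_h)
    mask = np.zeros((image_h, image_w), dtype=np.uint8)
    mask[y0:y1, x0:x1] = 1
    return mask
```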
4.3.2. Spine Detection
For the positioning of the vertebrae and intervertebral discs, adjacent vertebral bodies and discs could in principle be located from the positions of the vertebrae or discs above and below them, but this method is not feasible in actual engineering practice. In fact, some annotation points are missing on the spine images, so if a single vertebra or disc fails to be located, all subsequent predictions will be wrong. In this paper, image-based localization is used instead: the middle frame of the sagittal MRI sequence is fed into the fine-tuned Mask RCNN model to locate and classify all intervertebral discs and vertebrae as a whole. Target detection of the spine is divided into a positioning task and a classification task. The input is a spine image, and the output is a prediction of the positions of the discs and vertebrae in the image together with their corresponding categories. The positioning task takes the spine MRI image as input and outputs the coordinate position of each vertebra or disc. The classification task takes the same image as input and outputs a feature vector representing the probability that the region belongs to each category; the largest probability determines the predicted category. Therefore, the target detection task outputs not only coordinate position information but also a feature vector of classification information.
Take the recognition of a sagittal image as an example. When a spine image is input into a multi-layer convolutional neural network, the network outputs a feature map, then classifies the feature map through the fully connected layer, and finally obtains a feature vector. This feature vector consists of two parts: the coordinates, and a one-hot code representing the category together with a credibility value p in the range 0–1. The coordinates consist of the center coordinates of the box and its width and height. One-hot codes indicate the categories. The discs in the lumbar spine are divided into five categories and the vertebral bodies into two categories; in this paper, the five disc types and two vertebra types are classified together, plus the background and T12_S1 categories, giving nine categories in total. The codes are 0 0000 0001, 0 0000 0010, 0 0000 0100, 0 0000 1000, 0 0001 0000, 0 0010 0000, 0 0100 0000, 0 1000 0000, and 1 0000 0000, respectively. The combined label concatenates the credibility value, the coordinates, and the one-hot category code. With this label template, the real label and the model output can be used to calculate the loss during training and to backpropagate, and the coordinates, category, and credibility predicted by the model can be obtained during testing. With the predicted coordinates, the target can be marked on the image.
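The combined label described above can be sketched as follows: a credibility value p, the box coordinates (x, y, w, h), and a 9-way one-hot category code. The class order follows the list in Section 4.3.1; the example coordinate values are hypothetical.

```python
import numpy as np

CLASSES = ["BG", "disc_v1", "disc_v2", "disc_v3", "disc_v4",
           "disc_v5", "vertebra_v1", "vertebra_v2", "T12_S1"]

def make_label(p, x, y, w, h, class_name):
    """Concatenate credibility, box coordinates, and one-hot category code."""
    one_hot = np.zeros(len(CLASSES), dtype=np.float32)
    one_hot[CLASSES.index(class_name)] = 1.0
    return np.concatenate([[p], [x, y, w, h], one_hot]).astype(np.float32)

# Example label: a disc_v2 box with made-up normalized coordinates.
label = make_label(1.0, 0.5, 0.4, 0.2, 0.1, "disc_v2")
# label has length 1 (credibility) + 4 (coordinates) + 9 (one-hot) = 14
```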
4.3.3. Model Training Configuration
In this experiment, suitable adjustments are made to the configuration of the Mask RCNN training.
Table 4 shows most of the model training hyperparameters in this experiment. All of the images have been scaled down to 512 pixels by 512 pixels. The number of classification classes, which is denoted by the parameter NUM_CLASSES and is set to 9, consists of the following: one class for the background, one class for the full lumbar spine (T12_S1), two classes for the vertebra categories, and five classes for the disc categories. We use ResNet 18, 34, 50, and 101 as the backbone networks, and each network undergoes training for a total of 100 epochs.
Due to the unique challenges associated with identifying vertebral bodies and intervertebral discs, we did not use a pre-trained model in this study. Compared to general image recognition tasks, these anatomical structures display distinct and limited features. The use of non-specific datasets may lead to overfitting to non-relevant features, which may obscure critical medical details and specific pathological characteristics that are essential for accurate medical imaging analysis. Further, reliance on general datasets for pretraining might impair the model’s ability to generalize, leading to suboptimal performance on real medical images.
We set the learning rate to 0.001. VALIDATION_STEPS is the number of validation steps run at the end of each training epoch, and STEPS_PER_EPOCH is the number of training steps executed in each epoch. Although setting them to large numbers can yield higher accuracy, we set them to 100 and 30, respectively, because the dataset and the number of epochs were rather small; this strikes a balance between accuracy and training efficiency. We set the TRAIN_ROIS_PER_IMAGE hyperparameter to 200, which means that 200 ROIs per image are fed to the classifier and mask heads. The DETECTION_MIN_CONFIDENCE parameter is set to 0.8, the minimum probability value required to detect an instance, so regions of interest (ROIs) below this threshold are skipped. DETECTION_NMS_THRESHOLD is the non-maximum suppression threshold for detection; it is set to 0.3 so that overlapping results exceeding this threshold are discarded, ensuring that only structures within the lumbar region are analyzed. Apart from the above hyperparameters, the other hyperparameters in the experiments are kept consistent with those in the original Mask RCNN experiments.
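The configuration described in this subsection can be summarized in a settings class in the style of the Matterport Mask R-CNN `Config` subclass; the class itself and the `NAME`/`IMAGE_MAX_DIM` attributes are illustrative assumptions, while the numeric values follow the text above.

```python
class SpineConfig:
    """Sketch of the Mask RCNN training configuration used in this experiment."""
    NAME = "spine"                   # hypothetical configuration name
    IMAGE_MAX_DIM = 512              # images scaled to 512 x 512 pixels
    NUM_CLASSES = 1 + 1 + 2 + 5      # BG + T12_S1 + 2 vertebra + 5 disc = 9
    LEARNING_RATE = 0.001
    VALIDATION_STEPS = 100           # validation steps per epoch (per the text)
    STEPS_PER_EPOCH = 30             # training steps per epoch (per the text)
    TRAIN_ROIS_PER_IMAGE = 200       # ROIs per image fed to the heads
    DETECTION_MIN_CONFIDENCE = 0.8   # ROIs below this probability are skipped
    DETECTION_NMS_THRESHOLD = 0.3    # NMS overlap threshold for detections
```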