1. Introduction
Cancers of the brain and the central nervous system cause the death of over two hundred thousand people every year [1]. Life expectancy after diagnosis depends on several factors: whether the tumor is primary or metastatic; whether it is an aggressive form (also called high-grade glioma (HGG)) or a less aggressive one (low-grade glioma (LGG)); and, of course, how early the tumor is diagnosed [2]. Patients with HGG live fifteen months on average after diagnosis. With an LGG, it is possible to live for several years, as this form of the tumor does not always require aggressive treatment immediately after diagnosis.
Magnetic resonance imaging (MRI) is the technology that has become the most frequently utilized in the diagnosis of gliomas. MRI is preferred because it is much less invasive than other imaging modalities, like positron emission tomography (PET) or computed tomography (CT). With its high contrast and good resolution, it can provide accurate data about the structure of the tumor. Multi-modal or multi-spectral MRI, through its T1-weighted (with or without contrast enhancement) and T2-weighted (with or without fluid attenuated inversion recovery) data channels, can significantly contribute to the better visibility of intracranial structures [3].
The quick development of computerized medical devices and the economic rise of several underdeveloped countries both contribute to the fast spreading of MRI equipment in hospitals worldwide. These MRI devices produce more and more image data. Training enough human experts to process these records would be very costly, if possible at all. This is why there is a strong need for automatic algorithms that can reliably process the acquired image data and select those records that need to be inspected by the human experts, who have the final word in establishing the diagnosis. Although such algorithms are never perfect, they may contribute to cost reduction and allow for screening large masses of the population, leading to more tumors being detected early.
The brain tumor segmentation problem is a difficult task in medical image processing, facing major obstacles like: (1) the large variety of the locations, shapes, and appearances of the tumor; (2) displacement and distortion of normal tissues caused by the focal lesion; (3) the variety of imaging modalities and weighting schemes applied in MRI, which provide different types of biological information; (4) multi-channel data that are not perfectly registered together; (5) numerical values produced by MRI that do not directly reflect the observed tissues and need to be interpreted in their context; (6) intensity inhomogeneity that may be present in MRI measurements due to the non-uniformity of the magnetic field.
The history of automatic brain tumor segmentation from MRI records can be divided into two eras: the pre-BraTS era and the BraTS era, where BraTS refers to the Brain Tumor Segmentation Challenge [3,4], organized every year since 2012 by the MICCAI society, which had an enormous impact by introducing a multi-spectral MRI data set that can serve as a standard in the evaluation of segmentation methods. Earlier solutions were usually developed for single data channels and even for 2D data (or a reduced number of distant slices), and were mostly validated on private collections of MRI records that do not allow for objective comparison. A remarkable review of early segmentation methods, including manual, semi-automatic, and fully automatic solutions, was provided by Gordillo et al. [5]. Methods developed in the BraTS era are usually fully automatic and employ one or a combination of: (1) advanced general-purpose image segmentation techniques (mostly unsupervised); (2) classical machine learning algorithms (both supervised and unsupervised); or (3) deep learning convolutional neural networks (supervised learning methods).
All methods developed and published in the pre-BraTS era belong to the first two groups or their combination. They dominated the first years of the BraTS era as well, and they still hold significant research interest. Unsupervised methods have the advantage that they do not require large amounts of training data, and they provide segmentation results in a relatively short time. They organize the input data into several clusters, each consisting of highly similar data. However, they either have difficulty correctly labeling the clusters or they require manual interaction. Unsupervised methods involving active contours or region growing strongly depend on initialization as well. Sachdeva et al. [6] proposed a content-based active contour model that utilized both intensity and texture information to evolve the contour towards the tumor boundary. Njeh et al. [7] proposed a quick unsupervised graph-cut algorithm that performed distribution matching and identified tumor boundaries from a single data channel. Li et al. [8] combined the fuzzy c-means (FCM) clustering algorithm with spatial region growing to segment the tumors, while Szilágyi et al. [9] proposed a cascade of FCM clustering steps placed into a semi-supervised framework.
Supervised learning methods deployed for tumor segmentation first construct a decision model using image-based handcrafted features and use it for prediction in the testing phase. These methods mainly differ from each other in the employed classification technique, the features involved, and the extra processing steps that apply constraints to the intermediary segmentation outcome. Islam et al. [10] extracted so-called multi-fractional Brownian motion features to characterize the tumor texture and compared their performance with Gabor wavelet features, using a modified AdaBoost classifier. Tustison et al. [11] built a supervised segmentation procedure by cascading two random forest (RF) classifiers and trained them using first-order statistical, geometrical, and asymmetry-based features. They used a Markov random field (MRF)-based segmentation to refine the probability maps provided by the random forests. Pinto et al. [12] deployed extremely randomized trees (ERTs) and trained them with features extracted from local intensities and neighborhood context to perform a hierarchical segmentation of the brain tumor. Soltaninejad et al. [13] reformulated simple linear iterative clustering (SLIC) [14] for the extraction of 3D superpixels, extracted statistical and texton features from these, and fed them to an RF classifier to distinguish superpixels belonging to tumors from those of normal tissues. Imtiaz et al. [15] classified the pixels of MRI records with ERTs, using superpixel features from three orthogonal planes. Kalaiselvi et al. [16] proposed a brain tumor segmentation procedure that combined clustering techniques, region growing, and support vector machine (SVM)-based classification. Supervised learning methods usually provide better segmentation accuracy than unsupervised ones, at the cost of longer processing times.
In recent years, convolutional neural networks (CNNs) and deep learning have been conquering a wide range of application fields in pattern recognition [17]. Features are no longer handcrafted, as these architectures automatically build a hierarchy of increasingly complex features from the training data [18]. There are several successful applications in medical image processing, e.g., the detection of kidney abnormalities [19], prostate cancer [20], and lesions caused by diabetic retinopathy [21] and melanoma [22], and the segmentation of the liver [23], cardiac structures [24], the colon [25], the renal artery [26], the mandible [27], and bones [28]. The brain tumor segmentation problem is no exception. Pereira et al. [29] proposed a CNN architecture with small-sized kernels and accomplished a thorough analysis of data augmentation techniques for glioma segmentation, attempting to compensate for the imbalance of classes. Zhao et al. [30] applied conditional random fields (CRFs) to process the segmentation output produced by a fully convolutional neural network (FCNN) and to assure spatial and appearance consistency. Wu et al. [31] proposed a so-called multi-feature refinement and aggregation scheme for convolutional neural networks that allows for a more effective combination of features and leads to more accurate segmentation. Kamnitsas et al. [32] proposed a deep CNN architecture with 3D convolutional kernels and double-pathway learning, which exploits a dense inference technique applied to image segments, thus reducing the computational burden. Ding et al. [33] successfully combined deep residual networks with dilated convolution, achieving fine segmentation results. Xue et al. [34] proposed a solution inspired by the concept of generative adversarial networks (GANs): they set up a CNN to produce pixel-level labeling, complemented it with an adversarial critic network, and trained them to learn local and global features able to capture both short- and long-distance relationships among pixels. Chen et al. [35] proposed a dual-force-based learning strategy, employed the DeepMedic (Version 0.7.0, Biomedical Image Analysis Group, Imperial College London, London, UK, 2018) and U-Net (Version 1.0, Computer Science Department, University of Freiburg, Freiburg, Germany, 2015) architectures for glioma segmentation, and refined the output of the deep networks using a multi-layer perceptron (MLP). In general, CNN architectures and deep learning can lead to slightly better accuracy than well-designed classical machine learning techniques in the brain tumor detection problem, at the cost of a much higher computational burden in all processing phases, especially training.
The goal of this paper is to propose a solution to the brain tumor segmentation problem that produces a high-quality result with a reduced amount of computation, without needing special hardware that might not be available in underdeveloped countries. The solution employs classification via ensemble learning, assisted by several image processing steps designed for MRI records that may contain focal lesions. Further, the proposed procedure only employs machine learning methods that fully comply with the explainable decision making paradigm, which is soon going to be required by law in the European Union [36]. Multiple pre-processing steps are utilized for bias field reduction, histogram normalization, and atlas-based data enhancement. A twofold post-processing is employed to improve the spatial consistency of the initial segmentation produced by the decision ensemble.
Ensemble learning generally combines several classifiers and aggregates their predictions into a final one, thus allowing for more accurate decisions than the individual classifiers are capable of [37,38]. The ensembles deployed in this study consist of binary decision trees (BDTs) [39]. BDTs are preferred because of their ability to learn any complex pattern that contains no contradiction. Further, with our own implementation of the BDT model, we have full control over its functional parameters.
In image segmentation problems, atlases and multiple atlases are generally employed to provide shape or texture priors and to guide the segmentation toward a regularized solution via label fusion [40,41,42]. Our solution uses multiple atlases in a different manner: before proceeding to ensemble training and testing, the atlases are trained to characterize the local appearance of normal tissues and are then applied to transform all feature values so as to emphasize their deviation from normal.
The main contributions consist of: (1) the way the multiple atlases are involved in the pre-processing, to prepare the feature data for segmentation via ensemble learning; (2) the ensemble model built from binary decision trees with unconstrained depth; (3) the two-stage post-processing scheme that discards a large number of false positives.
Some components of the procedure proposed here, together with their preliminary results, were presented in previous works: for example, part of the feature generation and the classification with ensembles of binary decision trees [43], and the multi-atlas-assisted enhancement of the MRI data [44]. However, the procedure proposed here puts them all together, adds further components such as the post-processing steps, and performs a thorough evaluation using the BraTS training data sets for the years 2015 and 2019. The evaluation demonstrates that the proposed brain tumor segmentation procedure competes with state-of-the-art methods in terms of both accuracy and efficiency.
The rest of this paper is structured as follows: Section 2 presents the proposed brain tumor segmentation procedure, with all its steps, starting from histogram normalization and feature generation, followed by initial decision making via an ensemble learning technique and a twofold post-processing that produces the final segmentation. Section 3 evaluates the proposed segmentation procedure using recent standard data sets of the BraTS challenge. Section 4 compares the performance of the proposed method with several state-of-the-art algorithms, from the point of view of both accuracy and efficiency. Section 5 concludes this study.
2. Materials and Methods
2.1. Overview
The steps of the proposed method are exhibited in Figure 1. All MRI data volumes involved in this study go through a multi-step pre-processing, whose main goals are to provide uniform histograms for all data channels of the involved MRI records and to generate further features for all pixels. After separating the training data records from the evaluation (test) data records, the former are fed to the ensemble learning step. The trained ensembles are used to provide an initial prediction for the pixels of the test data records. A twofold post-processing is employed to regularize the shape of the estimated tumor. Finally, statistical markers are used to evaluate the accuracy of the whole proposed procedure.
2.2. Data
This study uses the MICCAI BraTS training data sets, both low- and high-grade glioma volumes, for the years 2015 and 2019. The main attributes of these data sets are exhibited in Table 1. Only one of these four data sets is used at a time, so in any scenario, the total number of involved records varies between 54 and 259, as indicated in the first row of the table.
The records in each set have the same format. Records are multi-spectral, which means that every pixel in the volumes has four different observed intensity values (named T1, T2, T1C, and FLAIR after the weighting scheme used by the MRI device), recorded independently of each other and registered together afterwards with an automatic algorithm. Each volume contains 155 square-shaped slices of 240 × 240 pixels. Pixels are isovolumetric, as each of them represents brain tissues from a 1 mm³ sized cubic region. Pixels were annotated by human experts using a semi-automatic algorithm, so they have a label that can be used as the ground truth for supervised learning. Figure 2 shows two arbitrary slices, with all four observed data channels, and the ground truth for the whole tumor, without distinguishing tumor parts. Since the adult human brain has a volume of approximately 1500 cm³, records contain around 1.5 million brain pixels. Each record contains gliomas with a total size between 7 and 360 cm³. The skull was removed from all volumes so that researchers can concentrate on classifying brain tissues only, but some of the records intentionally have missing or altered intensity values, in an amount of up to one third of the pixels, in one of the four data channels. An overview of these cases is also reported in Table 1.
Intensity non-uniformity (INU) is a low-frequency noise with possibly high magnitude [45,46]. In the early years of the BraTS challenges, INU was filtered out of the training and test data provided by the organizers. However, in later data sets, this facility is not granted. This is why we employed the method of Tustison et al. [47] to check for the presence of INU and to eliminate it when necessary.
There are several non-brain pixels in the volume of any record, indicated by zero intensity on all four data channels. Furthermore, in most volumes there are some brain pixels for which not all four nonzero intensity values exist; some of them are missing. Our first option for filling a missing value is to replace it with the average computed from the cubic neighborhood of the pixel, if there are any pixels in the neighborhood with valid nonzero intensity. Otherwise, the missing value is set, after pre-processing, to the middle value of the target intensity interval; see the definition in Section 2.3.
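As an illustration of this gap-filling rule, the following minimal sketch replaces a missing intensity with the mean of the valid values in a small cubic neighborhood. The neighborhood size (3 × 3 × 3 here) and all names are assumptions for illustration, as the text does not fix them.

```python
import numpy as np

def fill_missing(channel, brain_mask, half_width=1, fallback=0.0):
    # half_width=1 gives a 3x3x3 cube -- an assumed size, not the paper's setting.
    # `fallback` stands for the middle value of the target intensity interval
    # defined in Section 2.3 (applied after pre-processing).
    filled = channel.astype(np.float64).copy()
    missing = brain_mask & (channel == 0)
    for x, y, z in zip(*np.nonzero(missing)):
        x0, x1 = max(0, x - half_width), x + half_width + 1
        y0, y1 = max(0, y - half_width), y + half_width + 1
        z0, z1 = max(0, z - half_width), z + half_width + 1
        block = channel[x0:x1, y0:y1, z0:z1]
        valid = block[block > 0]          # valid nonzero intensities only
        # neighborhood average if available, otherwise the interval middle
        filled[x, y, z] = valid.mean() if valid.size else fallback
    return filled
```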
2.3. Histogram Normalization
Although very popular among medical imaging devices due to its high contrast and relatively good resolution, MRI has a serious drawback: the numerical values it produces do not directly reflect the observed tissues. The correct interpretation of the MRI data requires adapting the numerical values to the context, which is usually accomplished via a histogram normalization or standardization.
Figure 3 gives a convincing example of why the normalization of histograms is necessary: it shows, on the same scale, the observed intensities from twenty different MRI records, extracted from the same data channel (T1). In some cases, the intensity spectra of two different records do not even overlap. No classification method could deal with such intensities without transforming the input data before classification.
Several algorithms exist in the literature for this purpose (e.g., [48,49,50]), including increasingly sophisticated histogram matching techniques, which unfortunately are not designed for records containing focal lesions. These lesions may grow to 20–30% of the brain volume, which strongly distorts the histogram of MRI channels of any weighting. The most popular histogram normalization method, proposed by Nyúl et al. [48], works in batch mode, registering the histogram of each record to the same template. Modifying all histograms to look similar, no matter whether they contain tumor or not, is likely to obstruct accurate segmentation. The method of Nyúl et al. uses some predefined landmark points (percentiles) to accomplish a two-step transformation of intensities. Many recent works [51,52,53,54,55] reported using this method while omitting any information on the employed landmark points. Pinto et al. [12] and Soltaninejad et al. [13] used 10 and 12 landmark points, respectively. Alternately, Tustison et al. [11] deployed a simple linear transformation of intensities, because it gave slightly better accuracy, but they did not specify any details. This finding was confirmed later in a comparative study by Győrfi et al. [56], which also showed that the above-mentioned popular method of Nyúl et al. provides better segmentation of focal lesions when used with a smaller number (no more than five) of landmark points, or in other words, with fewer restrictions.
Our histogram normalization method relies on a context-dependent linear transform. The main goal of this transform is to map the pixel intensities from each data channel of a volume, independently of each other, onto a target interval $[l_0, l_1]$ in such a way that the target intensities satisfy some not very restrictive criteria. The inner values of the target interval can be written as $l_\lambda = l_0 + \lambda (l_1 - l_0)$, with $\lambda \in (0,1)$. Obviously, $\lambda = 0$ gives $l_0$ and $\lambda = 1$ gives $l_1$. To compute the two coefficients of the linear transform applied to the intensities from a certain data channel of any record, first we identify the 25th-percentile and 75th-percentile values of the original intensities and denote them by $p_{25}$ and $p_{75}$, respectively. Then, we define the linear transform of the intensities in such a way that $p_{25}$ is mapped onto $l_{\lambda_{25}}$ and $p_{75}$ becomes $l_{\lambda_{75}}$. This assumption leads to the coefficients:
$$a = \frac{l_{\lambda_{75}} - l_{\lambda_{25}}}{p_{75} - p_{25}}, \qquad b = l_{\lambda_{25}} - a\, p_{25}. \tag{1}$$
Finally, all pixel intensities from the given data channel of the current MRI record are transformed according to the formula:
$$\hat{I} = \min\big(l_1, \max(l_0,\ a I + b)\big), \tag{2}$$
using the values $a$ and $b$ given by Equation (1), where $I$ and $\hat{I}$ stand for the original and transformed intensities, respectively. This formula assures that the transformed intensities are always situated within the target interval $[l_0, l_1]$.
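For illustration, a minimal sketch of this transform applied to one data channel is given below. The target interval bounds and the landmark positions $\lambda_{25}$, $\lambda_{75}$ used here are placeholder values, not the settings used in the paper.

```python
import numpy as np

def normalize_channel(intensities, l0=200.0, l1=1000.0, lam25=0.25, lam75=0.75):
    # Placeholder target interval [l0, l1] and landmark positions lam25/lam75.
    vals = intensities[intensities > 0]            # brain pixels only
    p25, p75 = np.percentile(vals, [25, 75])
    t25 = l0 + lam25 * (l1 - l0)                   # target for the 25th percentile
    t75 = l0 + lam75 * (l1 - l0)                   # target for the 75th percentile
    a = (t75 - t25) / (p75 - p25)                  # Equation (1)
    b = t25 - a * p25
    return np.clip(a * intensities + b, l0, l1)    # Equation (2)
```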
2.4. Feature Generation
Feature generation is a key component of the procedure, because the four observed features, namely the T1, T2, T1C, and FLAIR intensities provided by the MRI device, do not contain all the discriminative information that can be employed to distinguish tumor pixels from normal ones. A major motivation for feature generation is the fact that the automated registration algorithm used by the BraTS experts to align the four data channels never performs a perfect job, so for an arbitrary cubic millimeter of brain tissue, represented by the pixel situated at coordinates $(x, y, z)$ in volume T1, the corresponding information in the other volumes is not in exactly the same place, but somewhere in the close neighborhood of $(x, y, z)$. A further motivation is the usual property of image data that neighboring pixels correlate with each other.
All four MRI data channels contribute equally to the feature generation process. Observed features are treated independently of each other: 20 computed features are extracted from each of them. The inventory of extracted features is presented in Table 2. Minimum, maximum, and average values are extracted in such a way that only the brain pixels are taken into consideration from the neighborhoods indicated in the table. The computation of the gradient features involves the masks presented in Figure 4. The current pixel is part of all masks to avoid division by zero in Equation (3). The 16 gradient features of an arbitrary pixel $p$ are computed with the formula:
$$g_m^{(c)}(p) = l_{1/2} + \varphi \left( I^{(c)}(p) - \frac{1}{|\mathcal{M}_m(p) \cap \mathcal{B}|} \sum_{q \in \mathcal{M}_m(p) \cap \mathcal{B}} I^{(c)}(q) \right), \tag{3}$$
followed by
$$g_m^{(c)}(p) \leftarrow \min\big(l_1, \max(l_0,\ g_m^{(c)}(p))\big) \tag{4}$$
to stay within the target interval $[l_0, l_1]$, where $c$ is one of the four data channels, $m$ is the index of the current gradient mask, $\mathcal{M}_m(p)$ is the set of pixels covered by mask $m$ centered at pixel $p$, $\mathcal{B}$ stands for the set of all brain pixels in the volume, $l_{1/2} = (l_0 + l_1)/2$ is the middle of the target interval of intensities, $\varphi$ is a globally constant scaling factor, and $|Q|$ represents the cardinality of any set denoted by $Q$.
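A minimal sketch of the gradient feature computation follows. The two directional masks and the scaling factor $\varphi$ are illustrative assumptions; the actual masks are those of Figure 4, which are not reproduced here.

```python
import numpy as np

# Hypothetical directional masks: each mask is a list of (dx, dy, dz) offsets
# that includes (0, 0, 0), so the denominator in Equation (3) is never zero.
MASKS = [
    [(0, 0, 0), (1, 0, 0), (2, 0, 0)],     # e.g., towards +x
    [(0, 0, 0), (-1, 0, 0), (-2, 0, 0)],   # e.g., towards -x
]

def gradient_feature(channel, brain, p, mask, l0=200.0, l1=1000.0, phi=2.0):
    # channel: 3D intensity array; brain: 3D boolean mask; p: a brain pixel (x, y, z)
    x, y, z = p
    vals = []
    for dx, dy, dz in mask:
        q = (x + dx, y + dy, z + dz)
        if all(0 <= q[i] < channel.shape[i] for i in range(3)) and brain[q]:
            vals.append(channel[q])
    g = (l0 + l1) / 2 + phi * (channel[x, y, z] - np.mean(vals))  # Equation (3)
    return float(np.clip(g, l0, l1))                              # Equation (4)
```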
2.5. Multi-Atlas-Based Data Enhancement
In image segmentation, atlases are usually used as approximate maps of the objects that should be present in the image in the normal case. Atlases are usually established based on prior measurements. The image data subject to the current segmentation are registered to the atlas, and the current segmentation outcome is fused with the atlas to obtain a final, atlas-assisted segmentation result.
In this study, we use atlases in a different manner. We build a spatial atlas for each feature using the training records, which contains the local average and standard deviation of the intensity values taken from normal pixels only. Of course, the training records need to be registered to each other so that we can build consistent atlases. These atlases are then used to refine the feature values in both the training and test records, before proceeding to the ensemble learning-based classification.
First of all, it is important to define the spatial resolution of the atlases, which depends on a single parameter denoted by $S$. Each atlas is a discrete spatial array defined on $\mathcal{A} = \{-S, \ldots, S\}^3$, where $S$ is a positive integer. Within the atlas array, the neighborhood of the atlas point $\alpha$ having the coordinates $(x_\alpha, y_\alpha, z_\alpha)$ is defined as:
$$\mathcal{N}(\alpha) = \left\{ \beta \in \mathcal{A} : |x_\beta - x_\alpha| \le \nu,\ |y_\beta - y_\alpha| \le \nu,\ |z_\beta - z_\alpha| \le \nu \right\},$$
where $\nu$ is a small positive integer, typically one or two, which determines the size of the neighborhood.
The mathematical description of the atlas building process requires the introduction of several notations. Let $\mathcal{R}$ stand for the set of all MRI records of the chosen data set, while $\mathcal{R}_{\mathrm{train}}$ and $\mathcal{R}_{\mathrm{test}}$ represent the current set of training and evaluation (or test) records, respectively. Obviously, $\mathcal{R}_{\mathrm{train}} \cup \mathcal{R}_{\mathrm{test}} = \mathcal{R}$, and $\mathcal{R}_{\mathrm{train}} \cap \mathcal{R}_{\mathrm{test}} = \emptyset$. Let $R_i$ be the record with index $i$, belonging to either $\mathcal{R}_{\mathrm{train}}$ or $\mathcal{R}_{\mathrm{test}}$ in a certain scenario. The set of pixels belonging to record $R_i$ is denoted by $\mathcal{P}_i$, for any $i \in \{1, \ldots, |\mathcal{R}|\}$. The set of all pixels from all MRI records is $\mathcal{P} = \bigcup_i \mathcal{P}_i$. Any pixel $p \in \mathcal{P}$ has a feature vector with $n_F$ elements. For any pixel $p$, the value of the feature with index $k$ is denoted by $f_k(p)$. Further, let $\mathcal{P}^{(-)}$ and $\mathcal{P}^{(+)}$ be the set of negative and positive pixels, respectively, as indicated by the ground truth. Obviously, $\mathcal{P}^{(-)} \cup \mathcal{P}^{(+)} = \mathcal{P}$, and $\mathcal{P}^{(-)} \cap \mathcal{P}^{(+)} = \emptyset$.
A rigid registration is defined in the following, so that we can map all volumes onto the atlas. For each record $R_i$ ($i \in \{1, \ldots, |\mathcal{R}|\}$), a function $\Phi_i : \mathcal{P}_i \to \mathcal{A}$ is needed that maps the pixels onto the atlas, to find the corresponding atlas position for all brain pixels. These functions map the gravity center of each brain to the atlas origin, and the standard deviations of the $x$, $y$, and $z$ coordinates are all transformed to $\sigma_0$, where $\sigma_0$ is a predefined constant. From the pixel coordinates, we compute averages and standard deviations as follows:
$$\mu_x^{(i)} = \frac{1}{|\mathcal{P}_i|} \sum_{p \in \mathcal{P}_i} x_p, \qquad \mu_y^{(i)} = \frac{1}{|\mathcal{P}_i|} \sum_{p \in \mathcal{P}_i} y_p, \qquad \mu_z^{(i)} = \frac{1}{|\mathcal{P}_i|} \sum_{p \in \mathcal{P}_i} z_p,$$
and then:
$$\sigma_x^{(i)} = \sqrt{\frac{1}{|\mathcal{P}_i|} \sum_{p \in \mathcal{P}_i} \big(x_p - \mu_x^{(i)}\big)^2}, \qquad \sigma_y^{(i)} = \sqrt{\frac{1}{|\mathcal{P}_i|} \sum_{p \in \mathcal{P}_i} \big(y_p - \mu_y^{(i)}\big)^2}, \qquad \sigma_z^{(i)} = \sqrt{\frac{1}{|\mathcal{P}_i|} \sum_{p \in \mathcal{P}_i} \big(z_p - \mu_z^{(i)}\big)^2},$$
where $x_p$, $y_p$, and $z_p$ are the coordinates of pixel $p$ in the brain volume. The formula of the mapping $\Phi_i$ is:
$$\Phi_i(p) = \left( \mathrm{round}\!\left(\sigma_0 \frac{x_p - \mu_x^{(i)}}{\sigma_x^{(i)}}\right),\ \mathrm{round}\!\left(\sigma_0 \frac{y_p - \mu_y^{(i)}}{\sigma_y^{(i)}}\right),\ \mathrm{round}\!\left(\sigma_0 \frac{z_p - \mu_z^{(i)}}{\sigma_z^{(i)}}\right) \right),$$
where $\mathrm{round}(\cdot)$ stands for the operation of rounding a floating point variable to the closest integer.
For any feature with index $k \in \{1, \ldots, n_F\}$, the atlas function has the form $A_k : \mathcal{A} \to \mathbb{R}^3$, which for any atlas point $\alpha \in \mathcal{A}$ is defined as:
$$A_k(\alpha) = \big( n(\alpha),\ \mu_k(\alpha),\ \sigma_k(\alpha) \big),$$
where the components $n(\alpha)$, $\mu_k(\alpha)$, and $\sigma_k(\alpha)$ are established with the following formulas:
$$n(\alpha) = |\Pi(\alpha)|, \qquad \mu_k(\alpha) = \frac{1}{n(\alpha)} \sum_{p \in \Pi(\alpha)} f_k(p), \qquad \sigma_k(\alpha) = \sqrt{\frac{1}{n(\alpha)} \sum_{p \in \Pi(\alpha)} \big( f_k(p) - \mu_k(\alpha) \big)^2},$$
where:
$$\Pi(\alpha) = \left\{ p \in \mathcal{P}^{(-)} : R_{i(p)} \in \mathcal{R}_{\mathrm{train}} \ \wedge\ \Phi_{i(p)}(p) \in \mathcal{N}(\alpha) \right\}$$
is the set of negative (normal) training pixels that map into the neighborhood of atlas point $\alpha$, with $i(p)$ denoting the index of the record containing pixel $p$.
The feature values of each pixel $p \in \mathcal{P}_i$ ($i \in \{1, \ldots, |\mathcal{R}|\}$), no matter whether $R_i$ belongs to the training or the test data, are updated with the following formula:
$$\hat{f}_k(p) = \mu^\star + \sigma^\star \cdot \frac{f_k(p) - \mu_k(\Phi_i(p))}{\sigma_k(\Phi_i(p))},$$
where parameters $\mu^\star$ and $\sigma^\star$ represent the recommended target average and standard deviation, respectively.
Algorithm 1 summarizes the construction and usage of the atlas. The updated feature values compose the pre-processed data denoted by $\mathcal{D}_A$ in Figure 1. Both $\mathcal{D}$ and $\mathcal{D}_A$ follow the very same classification and post-processing steps.
2.6. Ensemble Learning
The records are randomly separated into six equal or almost equal groups. Any of these groups can serve as test data, while the other five groups are used to train the decision making ensemble. The ensembles consist of binary decision trees, whose main properties are listed in the following:
The size of the ensemble is defined by the number of trees, denoted by $n_T$, whose value was set based on intermediary tests performed on the out-of-bag (OOB) data, which are the feature vectors from the training data that are never selected for training the trees of a given ensemble.
The training data size, represented by the number of randomly selected feature vectors used to train each tree and denoted by $n_{TD}$, is chosen to vary between 100 thousand and one million in four steps. These four values are used during the experimental evaluation of the proposed method to establish the effect of $n_{TD}$ on the segmentation accuracy and efficiency.
The selection of feature vectors for training the decision trees is not totally random. The balance of positive and negative feature vectors in the training set of each BDT is defined by the rate of negatives $r_n$ and the rate of positives $r_p = 1 - r_n$. For the LGG and HGG data sets, $r_n = 93\%$ and $r_n = 91\%$ are used, respectively, values set according to the findings of intermediary tests performed on OOB data.
Trees are trained using an entropy-based rule, without limiting their depth, so that they can provide a perfect separation of the training feature vectors representing pixels belonging to tumors and normal tissues. The trees of the ensemble have maximum depths ranging between 25 and 45, depending on $n_{TD}$ and the random training data itself.
Algorithm 1: Build the atlas function $A_k$ for each feature $k \in \{1, \ldots, n_F\}$, and apply it to all feature data.
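The flow of Algorithm 1 can be illustrated with the following sketch for a single feature: accumulate the mean and standard deviation of normal-tissue values on the atlas grid, then re-standardize every feature value. The data layout and all numeric constants ($S$, $\nu$, $\sigma_0$, $\mu^\star$, $\sigma^\star$) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def build_and_apply_atlas(records, S=40, sigma0=15.0, mu_t=500.0, sd_t=100.0):
    """Each record is a dict with 'coords' (N x 3 brain pixel coordinates),
    'f' (N feature values), 'is_normal' (N bools, ground truth, used for
    training records only), and 'train' (bool)."""
    side = 2 * S + 1
    sums = np.zeros((side,) * 3)
    sqs = np.zeros((side,) * 3)
    cnts = np.zeros((side,) * 3)

    def to_atlas(rec):                       # rigid mapping Phi_i
        c = rec['coords'].astype(np.float64)
        z = (c - c.mean(axis=0)) / c.std(axis=0) * sigma0
        return np.clip(np.rint(z).astype(int) + S, 0, side - 1)

    for rec in records:                      # accumulate normal-tissue statistics
        if not rec['train']:
            continue
        a = to_atlas(rec)[rec['is_normal']]
        f = rec['f'][rec['is_normal']]
        np.add.at(sums, tuple(a.T), f)
        np.add.at(sqs, tuple(a.T), f ** 2)
        np.add.at(cnts, tuple(a.T), 1.0)

    # ... pooling over the (2*nu+1)^3 neighborhood N(alpha) omitted for brevity ...
    mu = sums / np.maximum(cnts, 1.0)
    sd = np.sqrt(np.maximum(sqs / np.maximum(cnts, 1.0) - mu ** 2, 1e-12))

    for rec in records:                      # re-standardize all feature values
        idx = tuple(to_atlas(rec).T)
        rec['f'] = mu_t + sd_t * (rec['f'] - mu[idx]) / sd[idx]
    return records
```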
When all trees of the ensemble are trained, they can be used to obtain predictions for the feature vectors of the test data records. Each pixel from the test records receives $n_T$ votes, one from each BDT. The decision of the ensemble is established by the majority rule, which gives a label to each test pixel. However, these are intermediary labels only, as they undergo a twofold post-processing. The intermediary labels received by the test pixels compose the segmentation results denoted by $S_1$ and $S_1^A$ in Figure 1.
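A minimal sketch of this training and voting scheme is shown below, using scikit-learn's decision trees in place of our own BDT implementation. The ensemble size, training data size, and random seed are illustrative; $r_n$ follows the HGG setting (91%).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_ensemble(X, y, n_trees=125, n_td=100_000, r_neg=0.91, seed=0):
    # n_trees and n_td are illustrative placeholder values.
    rng = np.random.default_rng(seed)
    neg, pos = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    trees = []
    for _ in range(n_trees):
        n_neg = int(r_neg * n_td)
        idx = np.concatenate([rng.choice(neg, n_neg, replace=True),
                              rng.choice(pos, n_td - n_neg, replace=True)])
        # entropy split criterion, unconstrained depth -> perfect separation
        # of the (contradiction-free) training sample
        tree = DecisionTreeClassifier(criterion="entropy", max_depth=None)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    votes = np.sum([t.predict(X) for t in trees], axis=0)
    return (votes * 2 > len(trees)).astype(np.uint8)   # majority rule
```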
2.7. Random Forest-Based Post-Processing
The first post-processing step reevaluates the initial label received by each pixel of the test data records. The decision is made by a random forest classifier that relies on morphological features. The details of this post-processing step are listed in the following:
The RF is trained to separate positive and negative pixels using six features extracted from the intermediary labels produced by the BDT ensemble. Let us consider five concentric cubic neighborhoods of the current pixel, denoted by $\mathcal{C}_j$ ($j \in \{1, \ldots, 5\}$), each having the size $(2j+1) \times (2j+1) \times (2j+1)$. Inside the neighborhood $\mathcal{C}_j$ of the current pixel, the count of brain pixels $b_j$ and the count of positively labeled (intermediary) brain pixels $\pi_j$ are extracted. The ratio $\rho_j = \pi_j / b_j$ is called the rate of positives within neighborhood $\mathcal{C}_j$. The feature vector has the form $(\rho_1, \rho_2, \rho_3, \rho_4, \rho_5, \nu)$, where $\nu$ is the normalized value of the number of complete neighborhoods of the current pixel, determined as:
$$\nu = \frac{1}{5} \left| \left\{ j \in \{1, \ldots, 5\} : b_j = (2j+1)^3 \right\} \right|,$$
where a neighborhood is called complete if it consists of brain pixels only.
To assure that testing runs on data never seen by the training process, pixels from HGG tumor volumes are used to train the random forest applied in the segmentation of LGG tumor volumes, and vice versa. Each forest is trained using the feature vectors of randomly selected voxels whose feature values fulfil $\rho_1 + \rho_2 + \cdots + \rho_5 > 0$.
Pixels whose features satisfy $\rho_1 = \rho_2 = \cdots = \rho_5 = 0$ are declared negatives by default; they are not used for training the RF and are not tested with the RF either.
The number of trees in the RF is set to 100, while the maximum allowed depth of the trees is eight.
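The feature extraction of this post-processing step can be sketched as follows, under the neighborhood reconstruction given above; array names and the exact cube sizes are assumptions.

```python
import numpy as np

def rf_features(labels, brain, p):
    """Six morphological features for pixel p = (x, y, z): rates of positives
    in five concentric cubes plus the normalized count of complete
    neighborhoods. `labels` and `brain` are boolean 3D arrays."""
    x, y, z = p
    rho, complete = [], 0
    for j in range(1, 6):                      # cube half-widths 1..5
        sl = tuple(slice(max(0, c - j), c + j + 1) for c in (x, y, z))
        b = int(brain[sl].sum())               # brain pixels in the cube
        pos = int((labels[sl] & brain[sl]).sum())
        rho.append(pos / b if b else 0.0)
        if b == (2 * j + 1) ** 3:              # cube fully inside the brain
            complete += 1
    return np.array(rho + [complete / 5.0])
```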
The result of this post-processing step can be seen in Figure 1, represented by segmentation results $S_2$ and $S_2^A$.
2.8. Structural Post-Processing
The structural post-processing handles only the pixels that are labeled positive in segmentation results $S_2$ and $S_2^A$; consequently, it has the option to approve or discard the current positive labels. As a first operation, it searches for contiguous spatial regions formed by positive pixels within the volume, using a region growing algorithm. Contiguous regions of positively labeled pixels with a cardinality below 100 are discarded, because such small lesions cannot be reliably declared gliomas. Larger lesions are subject to shape-based validation. For this purpose, the coordinates of all positive pixels belonging to the current contiguous region undergo a principal component analysis (PCA), which establishes the three main axes determined by the three eigenvectors, and the corresponding radii represented by the square roots of the three eigenvalues provided by PCA. We denote by $\lambda_1 \ge \lambda_2 \ge \lambda_3$ the three eigenvalues in decreasing order. Lesions having the third radius $\sqrt{\lambda_3}$ below a predefined threshold are discarded, as they are considered unlikely shapes for a glioma. All detected lesions that are not discarded by the criteria presented above are finally declared gliomas, and all their pixels receive final positive labels. This is the final solution, denoted by $S_3$ and $S_3^A$ in Figure 1.
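A sketch of this structural validation is given below; the threshold on the third PCA radius is an assumed value, as the concrete threshold is not reproduced here.

```python
import numpy as np
from scipy import ndimage

def structural_postprocess(labels, min_size=100, min_radius=2.0):
    # min_radius (in pixels) stands in for the unspecified threshold
    # on the third PCA radius and is an assumed value.
    comps, n = ndimage.label(labels)           # 3D connected components
    out = np.zeros_like(labels)
    for c in range(1, n + 1):
        coords = np.argwhere(comps == c).astype(np.float64)
        if len(coords) < min_size:             # too small to be a glioma
            continue
        cov = np.cov(coords, rowvar=False)     # PCA via covariance eigenvalues
        lam3 = np.linalg.eigvalsh(cov)[0]      # smallest eigenvalue
        if np.sqrt(lam3) < min_radius:         # too flat to be a glioma
            continue
        out[comps == c] = 1
    return out
```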
2.9. Evaluation Criteria
The whole set of pixels of the MRI record with index $i$ is $\mathcal{P}_i$, which is separated by the ground truth into two disjoint sets: $\mathcal{P}_i^{(+)}$ and $\mathcal{P}_i^{(-)}$, the sets of positive and negative pixels, respectively. The final segmentation result provides another separation of $\mathcal{P}_i$ into two disjoint sets, denoted by $\widetilde{\mathcal{P}}_i^{(+)}$ and $\widetilde{\mathcal{P}}_i^{(-)}$, which represent the positive and negative pixels, respectively, according to the final decision (labeling) of the proposed procedure. Statistical accuracy indicators reflect in different ways how much the subsets $\mathcal{P}_i^{(+)}$ and $\widetilde{\mathcal{P}}_i^{(+)}$, and their complementary subsets $\mathcal{P}_i^{(-)}$ and $\widetilde{\mathcal{P}}_i^{(-)}$, overlap. Using the counts of true positives $\mathrm{TP}_i = |\mathcal{P}_i^{(+)} \cap \widetilde{\mathcal{P}}_i^{(+)}|$, true negatives $\mathrm{TN}_i = |\mathcal{P}_i^{(-)} \cap \widetilde{\mathcal{P}}_i^{(-)}|$, false positives $\mathrm{FP}_i = |\mathcal{P}_i^{(-)} \cap \widetilde{\mathcal{P}}_i^{(+)}|$, and false negatives $\mathrm{FN}_i = |\mathcal{P}_i^{(+)} \cap \widetilde{\mathcal{P}}_i^{(-)}|$, the most important accuracy markers obtained for any record with index $i$ ($i \in \{1, \ldots, |\mathcal{R}|\}$) are:
Dice similarity coefficient (DSC) or Dice score: $\mathrm{DSC}_i = \dfrac{2\,\mathrm{TP}_i}{2\,\mathrm{TP}_i + \mathrm{FP}_i + \mathrm{FN}_i}$;
Sensitivity or true positive rate (TPR): $\mathrm{TPR}_i = \dfrac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FN}_i}$;
Specificity or true negative rate (TNR): $\mathrm{TNR}_i = \dfrac{\mathrm{TN}_i}{\mathrm{TN}_i + \mathrm{FP}_i}$;
Positive predictive value (PPV): $\mathrm{PPV}_i = \dfrac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FP}_i}$;
Accuracy or rate of correct decisions (ACC): $\mathrm{ACC}_i = \dfrac{\mathrm{TP}_i + \mathrm{TN}_i}{\mathrm{TP}_i + \mathrm{TN}_i + \mathrm{FP}_i + \mathrm{FN}_i}$.
To characterize the global accuracy, we may compute the average values of the above-defined indicators over a whole set of MRI records. We denote them by $\overline{\mathrm{DSC}}$, $\overline{\mathrm{TPR}}$, $\overline{\mathrm{TNR}}$, $\overline{\mathrm{PPV}}$, and $\overline{\mathrm{ACC}}$. The average Dice similarity coefficient is given by the formula:
$$\overline{\mathrm{DSC}} = \frac{1}{|\mathcal{R}|} \sum_{i=1}^{|\mathcal{R}|} \mathrm{DSC}_i.$$
The average values of the other accuracy indicators are computed with analogous formulas.
The overall Dice similarity coefficient ($\widetilde{\mathrm{DSC}}$) is a single Dice score extracted from the set of all pixels of $\mathcal{R}$, given by the formula:
$$\widetilde{\mathrm{DSC}} = \frac{2 \sum_i \mathrm{TP}_i}{2 \sum_i \mathrm{TP}_i + \sum_i \mathrm{FP}_i + \sum_i \mathrm{FN}_i}.$$
All accuracy indicators are defined on the $[0, 1]$ interval. A perfect segmentation sets all indicators to the maximum value of 1.
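These indicators translate directly into code; the following sketch computes the per-record markers and the overall Dice score from boolean ground truth and prediction masks.

```python
import numpy as np

def record_metrics(gt, pred):
    """Per-record accuracy indicators; gt and pred are boolean 3D arrays."""
    tp = np.sum(gt & pred)
    tn = np.sum(~gt & ~pred)
    fp = np.sum(~gt & pred)
    fn = np.sum(gt & ~pred)
    return {
        "DSC": 2 * tp / (2 * tp + fp + fn),
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
    }

def overall_dsc(pairs):
    """Overall Dice over a list of (gt, pred) pairs, pooling all pixels."""
    tp = sum(np.sum(g & p) for g, p in pairs)
    fp = sum(np.sum(~g & p) for g, p in pairs)
    fn = sum(np.sum(g & ~p) for g, p in pairs)
    return 2 * tp / (2 * tp + fp + fn)
```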
3. Results
The proposed brain tumor segmentation procedure was experimentally evaluated in various scenarios. Tests were performed using four different data sets: the LGG and HGG records, separately, from the BraTS 2015 training data set, and similarly, the LGG and HGG records, separately, from the BraTS 2019 training data set. Detailed information on these data sets is given in Table 1. The whole segmentation process, indicated in Figure 1, was the same for all four data sets.
All records of these four data sets underwent the mandatory pre-processing steps, namely the histogram normalization and feature generation, resulting in the pre-processed data denoted by $\mathcal{D}$. At this point, the records were randomly separated into six equal or almost equal groups, which took turns serving as the test data. An atlas was built using the training data only, namely the records from the other five groups. The atlas was then used to extract enhanced feature values for all records, resulting in the pre-processed data denoted by $\mathcal{D}_A$. During the whole processing, there were six different such sets $\mathcal{D}_A$, one for the testing of each group.
Pre-processed data sets $\mathcal{D}$ and $\mathcal{D}_A$ underwent the ensemble learning process separately, so that we could evaluate the benefits of the atlas-based pre-processing. All ensembles involved in this study consisted of $n_T$ binary decision trees, but there were four variants for each data set in terms of training data size. Each tree of an ensemble was trained to separate $n_{TD}$ feature vectors, with $n_{TD}$ ranging from 100 thousand to one million, as indicated in Section 2.6. The intermediary segmentation obtained at the ensemble output is denoted by $S_1$ or $S_1^A$, depending on the use of the atlas during pre-processing. The labels of the pixels in $S_1$ or $S_1^A$ were treated with the two post-processing steps described in detail in Section 2.7 and Section 2.8, reaching the final segmentation denoted by $S_3$ or $S_3^A$. Theoretically, all six segmentation results $S_1$, $S_2$, $S_3$, $S_1^A$, $S_2^A$, and $S_3^A$ can be equally evaluated with the statistical measures presented in Section 2.9. The main result is represented by the final outcome of the segmentation: $S_3$ for the data that are not pre-processed with the atlas and $S_3^A$ for the data with atlas-based enhancement.
Table 3 presents the average ($\overline{\mathrm{DSC}}$) and overall ($\widetilde{\mathrm{DSC}}$) values of the Dice similarity coefficients obtained for the various data sets and training data sizes, at four different phases of the segmentation process. A larger training data size always led to better accuracy: both the average and the overall value rose by 0.5–0.8% when the training data size changed from $10^5$ to $10^6$. If we consider segmentation accuracy the only important quality marker, then it is worth using the largest training data size. The use of the atlas was beneficial in all scenarios, but there were large differences in the strength of this effect: the improvement at the ensemble output was considerably larger than at the final segmentation. The difference between the average and overall values in the same scenario was caused by the distribution of the Dice similarity coefficients obtained for the individual MRI records. The obtained Dice scores were quite competitive.
Table 4, Table 5, Table 6 and Table 7 exhibit the average sensitivity, positive predictive value, specificity, and rate of correct decisions (accuracy), respectively, obtained for the different data sets and various training data sizes. Not all of these indicators increased together with the training data size, but the rate of correct decisions did, showing that it is worth using a larger amount of training data to achieve better segmentation quality. Sensitivity values were in the range 0.8–0.85, and positive predictive values were at a similar level, which indicates a fine recognition of true tumor pixels. Specificity rates were very high, which is important because it grants a reduced number of false positives. The average rate of correct decisions, with its values mostly above 0.98, indicates that roughly one out of every fifty or sixty pixels was misclassified by the proposed segmentation procedure. All values in Table 4, Table 5, Table 6 and Table 7 reflect the evaluation of the final segmentation outcome denoted by $S_3^A$, obtained from the atlas-enhanced pre-processed data.
Figure 5 exhibits the Dice similarity coefficients obtained for the individual MRI records ($\mathrm{DSC}_i$, for $i \in \{1, \ldots, |\mathcal{R}|\}$), plotted against the true size of the tumor according to the ground truth. Each cross represents one of the MRI records, while the dashed lines indicate the linear trend of the $\mathrm{DSC}_i$ values identified with linear regression. As expected, the linear trends indicate better segmentation accuracy for larger tumors. It is also visible that for most tumors we achieved a Dice score above 0.8, while for some records below that limit the Dice scores could be very low. This is mostly because not all records in the BraTS data have the same image quality; some of them even contain artificially created obstacles. This distribution of the $\mathrm{DSC}_i$ values is also the reason why the overall Dice similarity value was higher than the average ($\overline{\mathrm{DSC}}$) in the case of all four data sets. In fact, $\widetilde{\mathrm{DSC}}$ was close to the median value of the individual Dice scores $\mathrm{DSC}_i$.
Figure 6 presents the Dice similarity scores we may expect for a 10 cm³ and a 100 cm³ sized tumor, according to the linear trends identified from the data presented in Figure 5, for the four data sets separately. Small tumors of 10 cm³ are not even present in all data sets, but the proposed method apparently learned how to segment them accurately, with an expected Dice similarity value around 0.8, which reportedly represents fine segmentation [57]. The Dice score obtained for an average-sized tumor from these data sets was expected in the proximity of 0.85.
Figure 7 exhibits some selected segmentation results. Four MRI volumetric records were selected, one from each of the four data sets, and from each record, the slice containing the highest combined number of ground truth positives and positively labeled pixels was chosen. Thus, each row in this figure exhibits one slice from a volume in five images, which represent the four observed data channels (T1, T2, T1C, and FLAIR) and the segmentation outcome, respectively. In the segmentation outcome, true positives are drawn in green, false negatives in red, false positives in blue, and true negatives in grey. These segmentation results suggest that there is still room for improvement in establishing the exact boundary of the tumor, and also in suppressing the patches of false positives.
The proposed procedure does not require special hardware equipment. The whole evaluation process was performed with an Asus Zenbook computer having a quad-core i7 processor with 1.8 GHz nominal frequency, 16 GB RAM, and a 1 TB SSD for data storage. Most of the time-consuming processing steps were implemented to run as parallelized tasks. The average processing time of an MRI record (coming in uncompressed .nii files) never seen by the trained procedure was 58 s, out of which: (1) 26 s were needed by the pre-processing, including feature generation and atlas-based data enhancement; (2) 21 s were required by the trained ensemble to produce the intermediary segmentation; and (3) 11 s were taken by the post-processing steps. No GPU was involved in the computations, so there is still room for improvement in terms of efficiency.
4. Discussion
Table 8 presents the accuracy of a collection of state-of-the-art methods proposed for the whole brain tumor segmentation problem. Although they are recent solutions, the vast majority still use the BraTS 2013 and BraTS 2015 data sets for evaluation purposes. An important requirement for the selected methods is that they apparently report results obtained by processing the whole data set, not only some records that give high accuracy. The Dice similarity coefficients shown in the table are written in the form published by the authors, with the intention that they be interpreted correctly. We do not agree with publishing Dice values expressed with only two decimals, because the half-percent imprecision can hide an extremely large difference. One important thing reflected by this table is that it is much easier to obtain high accuracy with the BraTS 2013 data, which is probably caused by the reduced number of records in this set, which cannot cover enough variation of tumor shapes and appearances. See, for example, the method of Pereira et al. [29], which obtained average Dice scores of 0.88 and 0.78 on BraTS 2013 and BraTS 2015, respectively. The proposed procedure, with its achieved Dice scores, belongs to the set of methods with top accuracy.
From the point of view of efficiency, the main question is how much time a method needs to perform a whole segmentation on an MRI record of the same type that was never seen during training. Table 9 presents the performance against the clock of some selected state-of-the-art methods. Our goal was to report runtime values that reflect the total duration needed, starting from uncompressed .nii files and ending when the segmentation result is saved, but we cannot be sure that all methods included in this table report this kind of duration. It is not surprising at all that solutions based on CNNs need a longer time even if they run on GPUs, because their architecture is usually more complex. However, other methods can also achieve fine segmentation quality in less time using CPUs only. This is probably because fine-tuning complex architectures is much more difficult, and some of them may still run in a suboptimal state. With its reduced time complexity, our proposed method is very competitive and can be deployed without any special hardware.
The main parameters of the proposed brain tumor segmentation procedure are: (1) the size of the training data, or in other words, the number of feature vectors used to train each BDT in the ensemble, $n_{TD}$; (2) the percentage of negatives and positives ($r_n$ and $r_p$) in the set of feature vectors used to train the BDTs; (3) the number of trees in the ensemble, $n_T$; (4) the presence or absence of the atlas-enhanced pre-processing within the pipeline of the procedure.
The decision on the training data size has to take into account its twofold implications. Larger training data sets enable the procedure to achieve slightly better accuracy in all tested cases. On the other hand, raising $n_{TD}$ from $10^5$ to $10^6$ raises the training time of an ensemble from 2 to 40 min, while the total duration of a segmentation after the ensemble is trained only increases by 4–5 s. Once the large ensembles are trained, they can perform the segmentation quickly enough, without a prohibitive computational load.
The percentage of negatives in the training data can have a strong implication on the segmentation quality. Experiments show that some MRI records in the BraTS data sets give the best segmentation accuracy at the lower end of the tested range of $r_n$, others at the upper end, and many of them at various values between these two. It is very likely that the optimal value of $r_n$ for a given MRI record strongly correlates with the true rate of negative pixels in the volume. However, this true rate is unknown, as we are not allowed to look at the ground truth while testing. The best accuracy at the ensemble output could be achieved if we had a precise estimator of the best choice of $r_n$ for each MRI record, using only the intensity values of the four observed data channels. Such an estimator would allow for 2–3% higher Dice similarity coefficients at the ensemble output. Without such an estimator, a dedicated constant value of $r_n$ needs to be chosen for all MRI records in each data set. In this study, $r_n$ was set to 91% for HGG data and 93% for LGG data, based on evaluations performed on OOB data.
The number of trees in the ensemble ($n_T$) theoretically has an impact on the decision error, which should decrease with any additional tree included in the ensemble. Experiments show that the decision accuracy saturates at a certain level due to the noise present in the data. Further, the number of trees linearly affects the runtime of both the training and the testing process. After several evaluation loops performed on OOB data, with values of $n_T$ ranging from five to 255, the size of the ensemble was fixed [43,64].
The use of the atlas caused up to a 3% improvement in the average Dice scores obtained at the output of the ensemble (see the difference between solutions $S_1$ and $S_1^A$ in Table 3), but this difference diminished to a maximum of 1% in the accuracy of the final segmentation. Even at this magnitude, we consider the use of the atlas beneficial, as it does not affect the execution time of the procedure much.
The relatively high accuracy achieved by the proposed procedure is probably due to the way it handles the imbalanced amounts of positive and negative data. Our BDTs are allowed to grow to unlimited depth during ensemble training, as deep as found necessary by the entropy-based criterion applied in the decision nodes. Unlike the random forest classifier, which limits the depth and sometimes assigns labels to mixtures of positives and negatives when the depth limit is reached, our binary trees have crisp labels assigned to their leaves. This allows for more precise classification, unless the trees are over-trained. With the parameter settings we used, the decision trees did not show signs of over-training, which is supported by the accuracy indicators that improved as the training data size grew. The number of false positives was not high at the output of the ensemble, and it was further reduced by the twofold post-processing, which discarded the predicted positive structures that were too small or too flat to be reliably called tumors.
The proposed procedure also has some limitations. In its current version, it is trained and tuned to segment the whole tumor only, not its parts. Using only a set of handcrafted features, it may not succeed in exploring the input data as thoroughly as a convolutional network can. Further, the approximately three million decision nodes contained in our most complex trained forest of binary decision trees falls short of the number of parameters used in current CNN architectures by up to two orders of magnitude. This might become a visible limitation in accuracy if we increase the number of MRI records used for training and testing.