#### *3.2. Stage 1: Proposed Image Pre-Processing for SbBDEM*

#### 3.2.1. Image Preparation

The overall process for Stage 1 is illustrated in Figure 3. Standard morphological operations were applied to remove stray annotation marks, retaining only the breast area and thereby maximizing the usable image area. To unify features between the CC and MLO views, the pectoral muscle was digitally removed from the MLO view images. To accommodate the needs of different breast densities, the images were segregated by their supplied ACR density scores into non-dense (1 = almost entirely fatty, 2 = scattered dense) and dense (3 = heterogeneously dense, 4 = extremely dense) categories.
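As a minimal illustration of the annotation-mark removal (a sketch, not the authors' exact pipeline), the breast can be assumed to be the largest foreground component after thresholding:

```python
# Sketch of annotation-mark removal with scikit-image: threshold, open to
# drop thin marks, then keep only the largest connected component (breast).
from skimage import filters, measure, morphology

def keep_breast_region(gray):
    """Zero out everything except the largest foreground component."""
    mask = gray > filters.threshold_otsu(gray)                  # rough foreground
    mask = morphology.binary_opening(mask, morphology.disk(5))  # drop thin marks
    labels = measure.label(mask)
    if labels.max() == 0:                                       # nothing found
        return gray
    largest = max(measure.regionprops(labels), key=lambda r: r.area)
    return gray * (labels == largest.label)                     # breast area only
```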

**Figure 3.** Stage 1: proposed SbBDEM technique as a pre-processing step.

#### 3.2.2. Lower Limit Contrast Cap Determination

As the next stage of the proposed framework involves a mass detection process, it is essential to differentiate the mass from its background, whether it overlaps a non-dense or a dense region. To suppress non-dense image information while enhancing features from the denser region (and hence the mass), the image was modified by selecting the best lower-limit contrast. The final output is a breast image with a subdued skin and non-dense region appearance and a pronounced textural definition of the dense region, including the mass, while retaining the textural features of the fibroglandular and vascular tissue within the lower-intensity fatty background. To achieve this, the upper limit of the contrast adjustment was kept the same as that of the original image.
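As a hedged illustration of this step, the sketch below clips a [0, 1] grayscale image at a chosen lower limit `low` while keeping the upper limit at the original maximum; `low` itself is determined in the next subsection:

```python
# Sketch of the lower-limit contrast cap: intensities below `low` saturate
# to black, darkening skin and non-dense tissue, while the upper limit stays
# equal to the original image's maximum (here assumed to be 1.0).
from skimage import exposure

def clip_lower_contrast(gray, low):
    return exposure.rescale_intensity(gray, in_range=(low, 1.0),
                                      out_range=(0.0, 1.0))
```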

#### 3.2.3. Factorized Otsu's Thresholding for Breast Density Group Segregation

Otsu's thresholding calculates an intensity cut-point from the image's intensity spread on a bimodal histogram and separates the image into foreground and background [53]. Since the original mammogram was converted to a normalized grayscale image consisting of two main tissue types closely related to intensity and contrast (higher intensity = dense region, lower intensity = non-dense region), Otsu's value was definitive in determining the middle-intensity value separating these tissue groups. Therefore, Otsu's method was implemented in this study as a reference point for determining the lower-limit contrast to be clipped from the input image. However, a direct Otsu threshold misassigns tissue that belongs to the other side of the histogram, such as the black background to the non-dense region, and calcified vessels and the skin lining, which appear white, to the dense region. To lessen this imbalance effect, the threshold value was scaled by factors from 1.0 to 1.9 for each of the non-dense and dense image groups separated in the previous step, subtly adapting the sudden foreground-to-background transition as a buffer intensity region. Subsequently, the training images were chosen based on their quality score, which is explained in the next stage.
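A sketch of this factorized sweep is given below; the 0.1 step size between factors is an assumption for illustration:

```python
# Sketch of the factorized Otsu step: the Otsu threshold is scaled by factors
# from 1.0 to 1.9 to generate candidate lower-limit contrast caps, one sweep
# per density group (non-dense / dense).
import numpy as np
from skimage.filters import threshold_otsu

def candidate_lower_limits(gray, factors=np.round(np.arange(1.0, 2.0, 0.1), 1)):
    t = threshold_otsu(gray)                    # bimodal histogram split
    return [min(f * t, 1.0) for f in factors]   # scaled thresholds, capped at 1.0
```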

#### 3.2.4. Blind/Reference-Less Image Spatial Quality Evaluator (BRISQUE)

When an image is altered, it is vital to assess it with an image quality metric, typically by referencing a gold-standard image and comparing properties such as sharpness and contrast [54]. Common examples of such tests, in which the reference must come from an image closely linked to the evaluated one, include the mean-square error (MSE) and the peak signal-to-noise ratio (PSNR). However, deep learning pipelines may train on thousands of images, making it impossible to select only one as the reference quality perspective. This is especially true if the dataset spans multiple image acquisition techniques, which further widens the dataset's measurement range [55,56].

In this study, to separate an overlapped mass from its background, the non-dense region becomes darker, enhancing the mass's edge. This is expected to cause substantial image alteration, with mild changes to the mass and dense regions of the resulting image and increased noise in the final image. Hence, the MSE and PSNR scores are likely to perform unsatisfactorily. Moreover, using a reconstruction-quality metric such as PSNR to judge an image destined for a detection algorithm is unwarranted, since a detection algorithm relies on its ability to separate a mass from its surroundings and, by extension, on the overall image, regardless of the final quality of the image used for training.

Therefore, we chose the best Otsu threshold factor with an image perceptual quality evaluator known as the Blind/Reference-less Image Spatial Quality Evaluator (BRISQUE) [57]. BRISQUE is a spatial-domain image assessment metric, commonly described as opinion-aware, that analyses images with similar distortions [57], similar to how visual perception operates. As image distortions affect quality in terms of textural features (texture signifying the pixel differences between the dense-region background and the overlapped mass), BRISQUE was chosen as the primary evaluation metric in this study. The BRISQUE score guided the choice of the optimal quality factor that clearly defines the difference between non-dense and dense breast images without any reference image. It provides a rating by generating matching differential mean opinion score (DMOS) values using a support vector machine (SVM) regression model trained on a spatial-domain image database [57]. During the training of BRISQUE, the database contained both clean and distorted versions of images, with distortions such as additive Gaussian white noise, blur, compression artifacts, and Rayleigh fast-fading channel simulation [57]. In addition, BRISQUE uses natural scene statistics of locally normalized luminance coefficients to measure any loss of naturalness due to distortion, yielding a holistic quality score rather than a user-defined quality, such as the ringing or blurring measured by PSNR [55]. Recent studies of medical images such as mammograms [58–60], lung CT scans [15,58], and kidney and brain MRIs [15] have moved towards reference-less image quality evaluators with good results. In this study, the image group was ultimately selected as the input for mass detection in the subsequent step once the best BRISQUE score was obtained.
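The selection loop can be sketched as follows, reusing `clip_lower_contrast()` and `candidate_lower_limits()` from the earlier sketches. The third-party `piq` package is an assumption here; the study itself may have used a different BRISQUE implementation (e.g., MATLAB's). Lower BRISQUE scores indicate better perceptual quality.

```python
# Hedged sketch of BRISQUE-guided selection of the best Otsu factor.
import numpy as np
import torch
import piq

def best_otsu_factor(gray, factors, limits):
    scores = []
    for low in limits:
        enhanced = clip_lower_contrast(gray, low)            # apply contrast cap
        x = torch.from_numpy(enhanced).float()[None, None]   # 1x1xHxW in [0, 1]
        scores.append(piq.brisque(x, data_range=1.0).item())
    best = int(np.argmin(scores))                            # lowest = best quality
    return factors[best], limits[best], scores[best]
```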

#### 3.2.5. Evaluation and Analysis of the Proposed Enhancement Technique

We measured the quality of the proposed SbBDEM enhancement, and its direct application as input to the detection stage, using both reference-less (BRISQUE) and referenced (MSE) measurements. BRISQUE was calculated following the method proposed by [57], and MSE is given by Equation (1):

$$\text{Mean Squared Error, MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ f(i,j) - g(i,j) \right]^2 \tag{1}$$

where *m* and *n* are the image's height and width, and *i* and *j* index the pixels of the enhanced image, *f*, and the referenced image, *g*. Additional textural feature analysis was performed on the images based on the Gray-Level Co-occurrence Matrix (GLCM) for comparison. The texture properties extracted from the produced matrix were four statistical feature descriptors, contrast, correlation, energy, and homogeneity, as mathematically defined in Equations (2)–(5). Each element *P* reflects the number of co-occurrences of the pixel values *i* and *j* relative to the number of gray levels, where *μ* and *σ* are the means and standard deviations of the corresponding marginal distributions of *P*.

$$\text{Contrast} = \sum_{i,j=0}^{\text{levels}-1} P_{i,j}(i-j)^2 \tag{2}$$

$$\text{Correlation} = \sum_{i,j=0}^{\text{levels}-1} P_{i,j} \left[ \frac{(i - \mu_i)(j - \mu_j)}{\sqrt{\sigma_i^2 \sigma_j^2}} \right] \tag{3}$$

$$\text{Energy} = \sqrt{\sum_{i,j=0}^{\text{levels}-1} P_{i,j}^2} \tag{4}$$

$$\text{Homogeneity} = \sum_{i,j=0}^{\text{levels}-1} \frac{P_{i,j}}{1 + |i-j|} \tag{5}$$
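For reference, the four descriptors in Equations (2)–(5) can be computed with scikit-image as sketched below; the offset (distance 1, angle 0) and `levels=256` are assumptions not stated in the text, and note that scikit-image weights homogeneity by 1 + (*i* − *j*)² rather than 1 + |*i* − *j*|.

```python
# Illustrative GLCM feature extraction for Equations (2)-(5); expects an
# unsigned 8-bit grayscale image.
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_u8, levels=256):
    P = graycomatrix(gray_u8, distances=[1], angles=[0],
                     levels=levels, symmetric=True, normed=True)
    return {prop: float(graycoprops(P, prop)[0, 0])
            for prop in ("contrast", "correlation", "energy", "homogeneity")}
```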

Additional analysis of the images' mean intensity was evaluated for comparison. The mean intensity is the sum of the normalized pixel values across the RGB channels, divided by the total number of pixels in the image, *n*, where *k* indexes the pixels, as given in Equation (6).

$$\text{Mean Intensity} = \frac{1}{n} \sum_{k=1}^{n} \left( R_k + G_k + B_k \right) \tag{6}$$

For pixel mapping evaluation, we assessed an example of True Positive (TP) and False Positive (FP) from a sample mass edge in the enhanced testing image using the proposed SbBDEM technique. We also assessed the probability of edge detection for the next-best performing enhancement according to the BRISQUE and MSE scores. Note that the pixel analysis for mass edge detection emulates the first layer of the modified YOLOv3: the convolution process of Equation (7) with zero padding and a stride of two, followed by maximum-pooling downsampling, to reveal how the pixel changes made during enhancement affect edge detection. Diagonal edge analysis used the kernel matrix *K* = [1 1 0; 1 0 −1; 0 −1 −1] with a window size of 3-by-3, which slides over the image in the convolution process, where *I* is the cropped mass image, *x* and *y* index the output position, *K* represents the kernel indexed by *i*, *j* over channels *k*, and *η<sub>h</sub>*, *η<sub>W</sub>*, and *η<sub>C</sub>* are the height, width, and number of channels of the kernel window, respectively. Consequently, the maximum-pooled downsampled element was chosen to represent both the suspected mass and the background area. The edge pixel difference between mass and background edge detection is denoted as Δ in Equation (8); a higher Δ denotes a larger pixel difference between the neighboring pixels encapsulating the mass.

$$\text{Conv}(\mathbf{I}, \mathbf{K})_{x,y} = \sum_{i=1}^{\eta_h} \sum_{j=1}^{\eta_W} \sum_{k=1}^{\eta_C} K_{i,j,k}\, I_{x+i-1,\, y+j-1,\, k} \tag{7}$$

$$\text{Edge pixel difference, } \Delta = \text{Maxconv}(\text{mass}) - \text{Maxconv}(\text{background}) \tag{8}$$
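A minimal sketch of this analysis for a single-channel crop follows; the crop-based, single-channel setup is an illustrative simplification of Equations (7) and (8):

```python
# The diagonal kernel K is slid with zero padding and stride 2 over a mass
# crop and a background crop; the maximum (max-pooled) response per crop is
# compared to give the edge pixel difference, delta.
import numpy as np

K = np.array([[1, 1, 0],
              [1, 0, -1],
              [0, -1, -1]], dtype=float)         # 3x3 diagonal edge kernel

def max_conv(patch, kernel=K, stride=2):
    kh, kw = kernel.shape
    padded = np.pad(patch, 1)                    # zero padding
    responses = [np.sum(kernel * padded[y:y + kh, x:x + kw])
                 for y in range(0, padded.shape[0] - kh + 1, stride)
                 for x in range(0, padded.shape[1] - kw + 1, stride)]
    return max(responses)                        # max-pooling downsampling

def edge_pixel_difference(mass_patch, background_patch):
    return max_conv(mass_patch) - max_conv(background_patch)   # Equation (8)
```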

#### *3.3. Stage 2: Mass Detection Using Modified YOLOv3*

#### 3.3.1. You Only Look Once (YOLO)

Object detection is the process of detecting a specifically trained object within an image. YOLO and its versions (v2, v3, and so on) implement a single forward pass by splitting the original image into a grid of s-by-s cells, and a bounding box prediction is made for each cell. The algorithm searches for the object's midpoint during training, where the specific cell containing the midpoint is responsible for determining the target object's presence. The cell containing the midpoint defines the bounding box, which is made of four components [*x*, *y*, *w*, *h*]. Here, *x* and *y* are the top left-most coordinates of the bounding box with values from 0 to 1.0, while *w* and *h* are the width and height of the box, respectively. Both *w* and *h* can be greater than 1.0 if the final detected box is wider than an entire s-by-s cell. In addition to the four components, each box has a probability value that indicates the presence of an object in the cell and the number of class predictions. Based on this prediction value, the trained network should output, for each cell, the box coordinates with the highest probability value as the final detected output for class prediction.
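As an illustration of this description, a single cell prediction can be decoded into pixel coordinates as sketched below; conventions vary across YOLO variants (some treat *x*, *y* as the box center rather than the top-left corner), so the names and scaling here are assumptions following the text above:

```python
# Hypothetical decoding of one cell's prediction [x, y, w, h, p] on an
# s-by-s grid; x, y are 0..1 offsets within cell (row, col) and w, h are
# fractions of the whole image, so they may exceed one cell's span.
def decode_box(pred, row, col, s, img_w, img_h):
    x, y, w, h, p = pred
    left = (col + x) / s * img_w    # top-left corner in pixels
    top = (row + y) / s * img_h
    return (left, top, w * img_w, h * img_h), p
```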

#### 3.3.2. YOLOv3 Modification for Mass Detection

This study utilized the simplest form of YOLOv3, using SqueezeNet [61] as its base network, and modified it to improve the overall detection result. Note that SqueezeNet has only 1.2 million learnable parameters, as opposed to the 41.6 million of the original DarkNet-53 [40] network. SqueezeNet-based YOLOv3 was therefore chosen to lessen the burden of training the weight parameters. The benefits of a simpler network architecture include more efficiently distributed training parameters, greater use of spatial information leading to shorter training times, less bandwidth for future model updates, and deployability under smaller memory configurations [62]. Aside from being lightweight, the predefined anchors and detection heads introduced in the YOLOv3 architecture allow smaller objects to be detected [40]. Depending on the base network, YOLOv3 can extract deep features into three-scale feature maps, whose anchors are used in the final bounding-box calculation to predict the best confidence score (CS). YOLOv3 has also been successfully implemented in recent mammogram studies [63,64], showing that it is reliable with good results. A comparison of YOLOv3 and YOLOv4 conducted by [65] shows that even though YOLOv4 is an improvement, there is no substantial difference between the two models, leading the authors to infer that YOLO's performance primarily depends on the features of the dataset and the representativeness of the training images.

Figure 4 illustrates the modified SqueezeNet CNN architecture for the mass detection stage in this study. The input size was set to 227-by-227, and the enhanced whole-mammogram images were used for training. Each image passes through a series of cascaded and parallel convolutions with concatenations along nine repeated layers, reducing information and computation by compacting feature maps as the network deepens. Two detection heads were allocated when this architecture was modified for detection in YOLOv3. The second detection head's downsampled input (28-by-28) was double the size of the first detection head's (14-by-14), allowing smaller masses to be better detected. Since mass sizes vary widely relative to the breast area, and more than 50% of the training data contain masses smaller than a sixth of the overall image, we addressed this problem by modifying the input of the second detection head.

Hence, to improve the detection of small masses and the overall detection performance, we propose two strategies.

Strategy One: Residual feature mapping for the second detection head. Features from a shallower layer (depth concatenation four), carrying higher spatial detail through a skip connection, were element-wise added to the semantic features from a deeper layer (depth concatenation nine). The element-wise addition reduces the feature degradation that occurs during downsampling, enhancing feature contrast and feature discrimination [51]; a conceptual sketch follows below.
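```python
# Conceptual PyTorch sketch of Strategy One: deep semantic features are
# upsampled to the shallow map's size and element-wise added. The layer
# names, sizes, and equal channel counts are assumptions (a 1x1 convolution
# would be needed if the channel counts differed).
import torch.nn.functional as F

def residual_feature_map(shallow, deep):
    # shallow: e.g. (N, C, 28, 28) from depth concatenation four
    # deep:    e.g. (N, C, 14, 14) from depth concatenation nine
    deep_up = F.interpolate(deep, size=shallow.shape[-2:], mode="nearest")
    return shallow + deep_up    # element-wise addition limits degradation
```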

**Figure 4.** Modified SqueezeNet CNN architecture used for YOLOv3 training. The modified layers are shown in bold.

Strategy Two: An additional anchor box assigned to the smaller feature map. This anchor box was introduced at the lower scale of the second detection head (a 4:3 anchor ratio relative to the first detection head). While simply increasing the number of anchor boxes raises the predefined mean intersection over union (IoU), it can also lower performance by overfitting the bounding-box-per-image mapping [66]. Assigning an extra anchor box specifically to the smaller feature map, however, increases bounding-box refinement on the feature map fed by Strategy One, improving the possibility of detecting smaller masses from the images' semantic information; a sketch of one way to derive such an anchor split follows below.
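```python
# Hedged sketch of the 4:3 anchor split: anchors are estimated by k-means
# over training box (width, height) pairs (a common YOLO practice, assumed
# here rather than stated in the text), and the four smallest anchors go to
# the 28x28 head responsible for small masses.
import numpy as np
from sklearn.cluster import KMeans

def split_anchors(box_wh, n_anchors=7):
    km = KMeans(n_clusters=n_anchors, n_init=10).fit(np.asarray(box_wh))
    anchors = sorted(km.cluster_centers_.tolist(), key=lambda a: a[0] * a[1])
    return anchors[:4], anchors[4:]   # (second head, first head) = 4:3 ratio
```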

Each image yielded seven predictions with confidence scores for every grid cell of size s-by-s. The network was trained for 80 epochs with a mini-batch size of 10. The learnable parameters were updated with the stochastic gradient descent with momentum (sgdm) solver. The initial learning rate was set to 0.001, and a confidence score (CS) threshold of 0.5 was defined for determining the overall mean Average Precision (mAP) for mass detection, with the largest-CS bounding box selected as the final prediction. Note that the hyper-parameter tuning values were chosen based on previous studies and this study's repeated trial processes [67].

#### 3.3.3. Performance Evaluation of the Modified YOLOv3 Using Enhanced Images

In this study, mass detection performance was correlated with the image enhancement performance in the prior stage. Therefore, we assessed TP and FP, while the mAP was calculated from the area under the precision-recall curve, following Equation (9):

$$\text{mean Average Precision, mAP} = \frac{1}{|\text{classes}|} \sum_{c \in \text{classes}} \frac{|TP_c|}{|FP_c| + |TP_c|} \tag{9}$$

where *c* indexes the classes. The mAP is the current metric used by computer vision researchers to evaluate the robustness of object detection models. It incorporates the trade-off between precision and recall, balancing the influence of both metrics, given that precision measures the prediction accuracy and recall measures the proportion of ground-truth objects that are predicted.
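A minimal sketch of Equation (9) exactly as written is given below; note that the full mAP additionally integrates precision over recall levels, whereas this mirrors only the per-class formula above:

```python
# Per-class precision TP_c / (TP_c + FP_c) at the 0.5 CS threshold,
# averaged over classes, per Equation (9).
def mean_average_precision(tp, fp):
    classes = list(tp)
    return sum(tp[c] / (tp[c] + fp[c]) for c in classes) / len(classes)

# e.g. mean_average_precision({"mass": 90}, {"mass": 10}) -> 0.9
```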
