*2.7. Statistical Methods*

The U-Net performance was measured using the Dice score. For two spatial target regions *X* and *Y*, their overlap is defined by the Dice score:

$$\text{Dice} = \frac{2 |X \cap Y|}{|X| + |Y|}. \tag{1}$$

The Dice score quantifies the spatial overlap between the manual segmentation and the neural network-derived segmentation (Appendix A, Figure A1). It ranges from 0 (no overlap) to 1 (perfect overlap) and is the most widely used metric for evaluating neural network segmentation performance [19]. To estimate the stability of the neural network during training, the variance of the training Dice score was calculated.
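As an illustration, Equation (1) can be computed for two binary segmentation masks with NumPy. This is a minimal sketch; the masks below are toy examples, not study data, and the empty-mask convention is an assumption:

```python
import numpy as np

def dice_score(x: np.ndarray, y: np.ndarray) -> float:
    """Dice score (Equation (1)) between two binary segmentation masks."""
    x = x.astype(bool)
    y = y.astype(bool)
    denom = x.sum() + y.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated here as perfect overlap
    return 2.0 * np.logical_and(x, y).sum() / denom

# Two toy 4x4 masks that share 2 of their 3 foreground pixels
a = np.zeros((4, 4), dtype=bool); a[1, 1] = a[1, 2] = a[2, 2] = True
b = np.zeros((4, 4), dtype=bool); b[1, 2] = b[2, 2] = b[2, 3] = True
print(dice_score(a, b))  # 2*2 / (3+3) = 0.666...
```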

The total number of cases available for training and validation was 400 MRIs. The U-Net was trained in 12 separate runs with the following training sizes: 8, 16, 24, 32, 40, 80, 120, 160, 200, 240, 280, and 320 cases. For each of the 12 runs, the cases were randomly partitioned into training and validation sets such that the entire set of 400 cases was used. The Dice score was calculated for every validation case in every run, and the mean and standard deviation of these Dice scores were computed per run. For example, the CNN in Run 1 was trained on 8 cases; after training, validation on the remaining 392 cases produced 392 Dice scores, whose mean and standard deviation were 0.424 and 0.206, respectively. Training size, validation size, mean Dice score, and standard deviation of the Dice score are listed for Runs 1 through 12 in Table 2.
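The partitioning scheme can be sketched as follows. This is an illustration only; the random seed is an arbitrary choice, and in the study each split would feed an actual U-Net training run:

```python
import numpy as np

n_total = 400
training_sizes = [8, 16, 24, 32, 40, 80, 120, 160, 200, 240, 280, 320]
rng = np.random.default_rng(0)  # arbitrary seed for this sketch

splits = {}
for n_train in training_sizes:
    idx = rng.permutation(n_total)        # fresh random partition per run
    splits[n_train] = (idx[:n_train], idx[n_train:])

# e.g. Run 1 trains on 8 cases and validates on the remaining 392
train_idx, val_idx = splits[8]
print(len(train_idx), len(val_idx))  # 8 392
```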


**Table 2.** Mean Dice score and standard deviation of Dice score for 12 training sizes.

After calculating the mean Dice score for the 12 runs, the SciPy [20] library in Python was used to fit three nonlinear functions (Equations (2)–(4)) to these 12 data points. Multiple candidate functions were tested to optimize the regression, and the three that best approximated the data are shown (Equations (2)–(4)). These three functions were selected because they most effectively modeled a dataset that increased quickly from training sizes 8 to 32 and then gradually from training sizes 200 to 320. For all three functions, *a*, *b*, and *c* are constants, *y* is the Dice score, and *x* is the training size (Figure 2). The first function was logarithmic, with the formula:

$$y = a \times \ln(x) + b.\tag{2}$$

The second function was asymptotic and used the formula:

$$y = \frac{a}{b + \frac{1}{x}}.\tag{3}$$

The third function was exponential, with the formula:

$$y = 1 - a \times e^{-b \times x} + c. \tag{4}$$
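The three fits can be reproduced with SciPy's `curve_fit`. The sketch below uses illustrative mean Dice values as stand-ins for the Table 2 numbers, and the initial parameter guesses (`p0`) are assumptions chosen so the optimizer converges:

```python
import numpy as np
from scipy.optimize import curve_fit

# Training sizes from the 12 runs; Dice values are illustrative
# placeholders, not the study's actual Table 2 numbers.
x = np.array([8, 16, 24, 32, 40, 80, 120, 160, 200, 240, 280, 320], float)
y = np.array([0.42, 0.55, 0.62, 0.67, 0.70, 0.76,
              0.79, 0.81, 0.82, 0.83, 0.84, 0.85])

def f_log(x, a, b):         # Equation (2): logarithmic
    return a * np.log(x) + b

def f_asym(x, a, b):        # Equation (3): asymptotic
    return a / (b + 1.0 / x)

def f_exp(x, a, b, c):      # Equation (4): exponential
    return 1.0 - a * np.exp(-b * x) + c

results = {}
for f, p0 in ((f_log, (0.1, 0.2)), (f_asym, (0.1, 0.1)),
              (f_exp, (0.6, 0.05, -0.1))):
    params, _ = curve_fit(f, x, y, p0=p0, maxfev=10000)
    # Equation (5): mean squared error of the fit
    results[f.__name__] = float(np.mean((y - f(x, *params)) ** 2))

print(results)
```

Comparing the three mean squared errors then identifies the best-fitting function, as done in Figure 2.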

**Mean Dice Score vs. Number of Cases**

**Figure 2.** The mean Dice score at 12 different training sizes was approximated with several curve functions. (**a**) The first function was logarithmic, with the formula *y* = *a* × ln(*x*) + *b*; *a* was 0.938 and *b* was 0.3594. The mean squared error was 2.55 × 10<sup>−3</sup>. (**b**) The second function was asymptotic, with the formula *y* = *a*/(*b* + 1/*x*); *a* was 0.128 and *b* was 0.145. The mean squared error was 5.70 × 10<sup>−4</sup>. (**c**) The third function was exponential, with the formula *y* = 1 − *a* × *e*<sup>−*b*×*x*</sup> + *c*; *a* was 0.651, *b* was 0.064, and *c* was −0.162. The mean squared error was 8.30 × 10<sup>−4</sup>. The second function (**b**) provided the best approximation because it had the lowest mean squared error.

For each approximation, the mean squared error was calculated with the following formula:

$$\text{Mean Squared Error} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{5}$$

where *n* was 12, *y<sub>i</sub>* was the measured mean Dice score, and *ŷ<sub>i</sub>* was the estimated Dice score produced by the function.
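A worked instance of Equation (5), with illustrative values standing in for the measured and fitted mean Dice scores:

```python
# Equation (5): mean squared error between observed mean Dice scores y_i
# and fitted values yhat_i (illustrative numbers, not study data).
y = [0.42, 0.55, 0.62]
yhat = [0.44, 0.54, 0.63]
n = len(y)
mse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, yhat)) / n
print(mse)  # (0.0004 + 0.0001 + 0.0001) / 3 ≈ 0.0002
```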
