*3.1. Classification Accuracy*

Accuracy was measured by comparing each model's classification of the test-set images against that of the human experts, with the final accuracy rate calculated by dividing the number of correct 'fertile' or 'infertile' predictions by the total number of predictions. Recall and precision were also measured to ensure there was no imbalance between classes. The bespoke architecture achieved a benchmark accuracy of 78% without data augmentation but showed significant skew toward 'fertile' predictions. Accuracy increased to 82%, with less skew toward 'fertile', when the augmented training set was used. With transfer learning, the VGG-16 architecture achieved an accuracy of 89% with a satisfactory balance between classes, while the InceptionV3 architecture reached an accuracy of 80%, similar to the benchmark, but with a slight skew toward 'infertile'. The ResNet-50 architecture attained the highest performance of all architectures, scoring an accuracy of 94% against the test set with a good balance between classes; this approaches the human accuracy rate of 99%. For a confusion matrix detailing the ResNet-50 architecture's performance, see Table 3. As the images used to train and test the models were all pre-processed, the magnification, resolution, or quality of an image should not affect classification, and the accuracy scores reported here should be representative of real-world use.

**Table 3.** Confusion matrix for ResNet-50.


1 Confusion matrix: comparison of actual values with those predicted by the model, detailing true positive (top-left), true negative (bottom-right), false positive (top-right), and false negative (bottom-left) rates.
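The metrics reported above follow directly from the four cells of such a confusion matrix. As a minimal sketch, the calculation can be expressed as below; the counts used are illustrative placeholders with 'fertile' treated as the positive class, not the study's actual data.

```python
def metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts.

    tp/tn/fp/fn correspond to the top-left, bottom-right, top-right, and
    bottom-left cells of the confusion matrix described in Table 3's note.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total      # correct predictions over all predictions
    precision = tp / (tp + fp)        # of predicted 'fertile', fraction correct
    recall = tp / (tp + fn)           # of actual 'fertile', fraction recovered
    return accuracy, precision, recall

# Hypothetical counts for illustration only:
acc, prec, rec = metrics(tp=45, tn=40, fp=8, fn=7)
```

Comparing precision and recall in this way is what reveals the class skew noted above: a model biased toward 'fertile' predictions inflates false positives, depressing precision even when overall accuracy looks acceptable.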
