**1. Introduction**

Zebrafish embryos have gained popularity in biological research since they share 84% of genes associated with human disease [1] and they are nearly transparent under bright-field microscopes. Zebrafish egg is a special form of the embryo, and it is usually used to study the influence of environmental factors on embryo development. To evaluate the biological endpoints based on zebrafish eggs, microscopic screening is frequently performed [2]. By far, the analysis of zebrafish microscopic images is mostly performed by human operators. With the advances in image acquisition systems, the number of microscopic images is increasing rapidly, making manual assessments increasingly time-consuming. Therefore, automatic analysis of zebrafish microscopic image becomes an urgent demand [3].

To meet this stringent demand, a series of studies was conducted for computerized zebrafish microscopic image analysis [4,5]. Most techniques were based on traditional machine learning strategies, i.e., using texture filters to extract hand-crafted image features and then using classification algorithms (e.g., the supported vector machine and random forest) to conduct phenotype pattern recognition. The performances of these methods highly rely on the quality of hand-crafted image features, but the design and selection of hand-crafted features involve subjective human interventions, which limit the objectiveness and robustness of the method.

In the last decade, deep learning methods experienced dramatic development, leading to improvements in many pattern recognition applications, such as image processing, video analysis, and language recognition [6–11]. Compared to the conventional machine learning methods, deep learning overcomes the limitation of hand-crafted features by automatically optimizing the feature extraction and classification procedure. The core of deep learning for image analysis is the revolutionary development of the convolutional neural network (CNN) [12,13]. CNN was originally designed to recognize and classify object patterns in images. As of today, numerous CNN-based powerful image classification models are developed, including Alex Krizhevsky Network (AlexNet) [14], Visual Geometry Group (VGG) nets [15], and Residual Neural Network (ResNet) [16]. These methods were also applied to biological image analysis [17,18], leading to improvement of accuracy and robustness.

Despite the fast development of deep learning techniques, their applications in zebrafish egg microscopic image analysis are rare. A common data analysis task for zebrafish egg images analysis is to classify whether the egg is fertilized or not, in order to verify if the tested drug has impaired the fertilization process. To accomplish this task, there are several challenging problems to solve:


To overcome these problems, this paper proposed a deep learning algorithm for automated zebrafish egg fertilization status classification from microscopic images. Dedicated data augmentation and transfer learning strategy were used to tackle the imbalanced and small training set problem. The global average pooling scheme was used to address the subtle inter-class differences. Experimental results showed that the proposed method yielded dramatic accuracy improvement compared to traditional CNN network, and the classification accuracy for zebrafish eggs could reach up to 98.8%.

#### **2. Materials and Methods**

#### *2.1. Data Collection*

In this study, the microscopic images of zebrafish eggs were acquired using a bright-field microscopy imaging device called ImageXpress [19]. The system automatically placed three or four embryos in a U-shaped bottom transparent well plate. The image of each plate was collected using a ×2 dry objective between 3 and 3.7 h post fertilization. Transition Metal Oxide Nanoparticles were applied to the zebrafish embryos, and some of the eggs became unfertilized due to the toxicity effect of the nanoparticles. Figure 1 shows a typical sample image of the zebrafish eggs. The eggs are to be classified into two classes, fertilized and unfertilized. The fertilized eggs contain the nucleus surrounded by dark yolk membranes, whereas the unfertilized eggs have clear yolk membranes.

**Figure 1.** A typical example of the zebrafish eggs bright-field microscopic image, in which the fertilized and unfertilized eggs are marked. The fertilized eggs contain the nucleus surrounded by dark yolk membranes, whereas the unfertilized eggs have clear yolk membranes.

#### *2.2. Method Workflow*

As illustrated in Figure 2, our automatic zebrafish egg recognition and counting method consisted of three major steps. The input image is a well plate image containing three or four eggs. For the first step, each individual egg was detected and separated as a small patch. The patch of each egg was then fed into a deep convolutional neural network to calculate the classification feature vector. Finally, a global average pooling layer was used to classify the fertilization status based on the feature vector. Details of the proposed method are explained in the following subsections.

**Figure 2.** The workflow of the proposed method.

#### *2.3. Egg Detection*

As required by the classification task, the microscopic images were pre-segmented and cropped into square patches of single eggs. This was achieved via a template matching step, which detected the center of each egg. Figure 3 demonstrates the principle of template matching. As shown in Figure 3a, the template was constructed by manually cropping K typical egg patches of N × N pixels from the training images. The K patches were reoriented into the same direction, and an averaged template was created by calculating the average image of them. Then, the average template was rotated with 30 degrees interval to generate 12 template patches of different orientations (Figure 3b). To perform

template matching, each of the 12 templates was moved with N/10 pixels interval in both *x* and *y* directions throughout the test image. For each moved position, the mutual information between the template and its covered image area was calculated as the similarity metric. The top 20% positions with the largest mutual information were maintained as the candidate egg centers. At last, candidate centers close to each other (within N/5 pixels distance) were clustered, and the mean coordinates of clustered candidates were used as the egg center. Based on the detected egg centers, a bounding box of size N × N was used to crop the egg out of the image. In this study, we found K = 10 sufficed for our needs, and a cropping size of N = 150 pixels ensured to enclose all eggs.

**Figure 3.** The workflow of egg detection. (**a**) The egg template is created by averaging K patches of egg samples; (**b**) The egg template is rotated into different directions, and each rotated patch is moved through the target image to find its matched egg.

### *2.4. Convolutional Feature Extraction*

After the egg detection step, each cropped egg patch was fed into a convolutional neural network to extract the image features for egg classification. To train such a network, we needed to overcome the limitation of the small and imbalanced training dataset. Our study involved only a few hundreds of samples of zebrafish eggs, which were not enough for training a deep neural network. Compared to the popular ImageNet [14] dataset of over ten million sample images, the size of our datasets is at least four orders of magnitude less. When the number of weights to be trained in a neural network is far more than the number of training samples, the problem of overfitting is likely to occur, and the network will have poor generalization ability.

Another problem with our training set is that the sample numbers of different categories were seriously imbalanced. The ratio between fertilized and unfertilized eggs was almost 6:1 in our dataset. Imbalanced training data could potentially diminish the specificity of the network, making the network incompetent to recognize the relatively smaller category, i.e., the unfertilized eggs.

To overcome the limitation of the small and imbalanced training set, we used the image augmentation strategy to increase the training set size and to balance the training sample numbers of different categories. Image augmentation is the process to increase the training set by creating altered versions of the existing sample images, and it is proved to be an effective solution to prevent overfitting [14]. Typical ways of data augmentation include rotation, translation, zooming, flipping, scaling, color perturbation, and adding random noise.

In our study, image augmentation strategies were carefully chosen according to the characteristics of our datasets (as shown in Figure 4). Since the eggs were captured at the same time point, they had similar sizes. All the images were captured under the same environmental light condition so that the grayscale level of different eggs was similar. The most possible variation of the eggs is the different orientation caused by random placement. Therefore, we used image rotation and flipping to simulate possible deviation of egg orientations. In order to improve the balance of the dataset, we augmented the unfertilized eggs more than the fertilized eggs. Each fertilized patch was rotated three times with 60 degrees interval, while each unfertilized patch was rotated 18 times with 10 degrees interval. All the rotated patches were also flipped vertically to simulate the effect of different illumination orientations of the environmental light. As a result, the ratio between fertilized and unfertilized eggs was close to 1:1 after the augmentation.

**Figure 4.** Examples of the one zebrafish egg (the leftmost patch) and its augmented patches, including rotational augmentation and flipping augmentation.

Based on the augmented training dataset, a convolutional neural network was trained. The network used the architecture of VGG-16 [15], the winner of the 2014 Large Scale Visual Recognition Challenge (ILSVRC). As shown in Figure 5, this architecture consisted of 5 blocks of 13 convolutional layers. For each convolutional layer, a convolution kernel of size 3 × 3 was convolved with the layer input to produce a tensor of outputs. The output tensor of the convolutional layer was then transferred into a finite value by an activation function of Rectified Linear Unit (ReLu), i.e., F(*x*) = *x* for *x* > 0 and 0 otherwise. At the end of each block of the convolutional layers, there was a max-pooling layer to perform down-sampling by dividing the output feature map from each block into 2 × 2 pooling regions and computing the maximum of each region. The down-sampled feature map from each max-pooling layer was then fed into the next convolutional layers as an input.

**Figure 5.** The architecture of the convolutional feature extraction network, where 'Conv N' stands for a 3 × 3 convolutional layer with N channels, 'Max Pool' stands for a 2 × 2 max pooling layer.

To train this deep convolutional network, the transfer learning strategy was used. The network weights pre-trained on the ImageNet dataset was adopted as the initialization. By using VGG-16 as the initial model, we were able to take advantage of deep features learned from millions of natural images [20]; therefore, the risk of overfitting was further reduced, and the convergence of the training was accelerated. During the training, the weights of the first three blocks of layers were frozen to retain the extracted simple features by VGG-16. Other two blocks of convolutional layers were fine-tuned with a small learning rate to make sure that the magnitude of the updates from each fine-tuning iteration stayed small.

For the bottleneck feature training phase, RMSprop optimizer was used for faster general localization. For the fine-tuning phase, Stochastic Gradient Descent (SGD) optimizer with momentum was chosen for better generalization ability. Choosing proper optimizers is crucial since it directly affects the convergence of the algorithm. Both optimizers we chose here originate from the optimizer of Gradient Descent. However, the basic Gradient Descent method calculates the gradient of the whole data set for performing only one update. Therefore, it is extremely slow and memory expensive for experiments with large datasets. Stochastic Gradient Descent (SGD) method was designed to rectify the above problems of the regular Gradient Descent method by performing a parameter update for each training example. To further improve convergence accuracy and reduce fluctuation, a momentum term was added to the SGD method. It restricts the oscillation in one direction during searching to improve the speed of the convergence. Based on the SGD with momentum method, RMSprop optimizer restricts the oscillations in the vertical direction. In this way, a larger learning rate could be adapted to have a larger searching pace in the horizontal direction to increase convergence speed. For bottleneck feature training phase, RMSprop optimizer was used for faster general localization at the beginning. While Stochastic Gradient Descent (SGD) optimizer with momentum was chosen for more precisely global minima localization.

Learning rate is one of the most important aspects of Gradient Descent because it determines the pace size for searching the global minima of the optimizing algorithm. Here, a small learning rate of 0.0001 was used to perform fine adjustments to weights without changing the overall weight structure. We used a small learning rate so that the features learned previously were not wrecked. For the training process of each data fold, we ran 50 epochs and saved the best result of model weights at the epoch when the validation loss was the least. The technique of reduced learning rate was used, i.e., the learning rate was multiplied with 0.2 when the training loss stopped reducing for 3 epochs.

#### *2.5. Global Average Pooling Classifier*

After features were extracted by the convolutional network module, a classification module based on global average pooling method was used instead of the traditional fully connected layer classifier. Conventionally, in a convolutional neural network, convolutional layers are usually followed by several fully connected layers to vectorize the feature extracted by convolutional layers and to accomplish the classification task via a softmax logistic regression layer. However, fully connected layers involve many weights to be trained, which increase the cost of computing and reduce the convergence speed of the network. On the other hand, the increment of weights will also increase model complexity, which may easily lead to overfitting. Effective techniques have been proposed to avoid overfitting, such as dropout [21,22]. Using global average pooling (GAP) instead of fully connected layers to classify different categories directly from feature maps is a revolutionary innovative improvement made to traditional convolutional neural networks [23]. Instead of adding fully connected layers on top of the feature maps from convolutional layers, GAP generates one feature map for each corresponding category to be classified, vectorize the features by global average pooling, and feed the vectors directly into the final softmax classifier, as shown in Figure 2. Compared to traditional fully connected layers, GAP had enforced the correspondences between feature maps and categories. Besides, the GAP didn't introduce extra weights to be optimized for the network, which had reduced the prone of overfitting.

#### **3. Results**

In this study, a total of 211 zebrafish egg microscopic images containing 638 eggs were acquired using the ImageXpress system. A human biologist with over 10 years' experience was invited to assign fertilization labels to all the eggs, resulting in 546 fertilized eggs and 92 unfertilized. The labels of human expert were used as the gold standard for method validation. The network was constructed using the Keras platform on a server with NVIDIA K4000 Graphics Processing Unit (GPU). The training process took ~60 min for each training subsample set and took less than 5 s on each test image.

#### *3.1. Zebrafish Egg Classification Accuracy*

To validate the proposed method, a five-fold cross-validation scheme was used. For K-fold cross-validation, the original dataset was randomly partitioned into K equal-sized subsample sets. The training and validation processes were repeated K times. Each time, one subsample set was retained, in turn, as the validation data, while other K−1 subsample sets were used as the training data. We chose 5-fold cross-validation according to the overall size of the dataset so that there were no less than 500 eggs in each training set. The training and validation processes were repeated five times, and the accuracy of each validation subsample set was calculated.

Table 1 reports the accuracy of each cross-validation fold. The accuracy was defined as Accuracy = (NTP + NTN)/Nall, where NTP, NTN, Nall stand for the number of true positive, true negative, and all eggs, respectively. In this study, we considered unfertilized eggs as positive samples and fertilized eggs as negative samples, respectively. As reflected in Table 1, the third fold had the highest accuracy (98.8%), and the second fold had the lowest accuracy (93.2%). Even the lowest accuracy was higher than 93%, and the mean accuracy of all folds was 95.0%, meaning that the proposed method has a quite high classification accuracy for zebrafish egg fertilization status. The standard deviation of all the folds was also small (2.2%), meaning this method performs stably over different test datasets.

**Table 1.** Classification Accuracy of the Proposed Method.

