**3. Methodology**

Figure 2 depicts an overview of our proposed method: distribution-aware pseudo labeling for small defect detection (DAP-SDD).

**Figure 2.** An overview of our proposed distribution-aware pseudo labeling for small defect detection (DAP-SDD). In Step 1 (green dashed box), we use data augmentation techniques to leverage the limited labeled data for training. We then use the trained model to generate initial pseudo labels for the unlabeled data. We also apply bootstrapping to the limited labels to obtain an approximate label distribution whose statistics resemble those of the whole labeled dataset, and use it to guide the threshold setting during pseudo label propagation in Step 2 (orange dashed box). To achieve better detection performance, we update the pseudo labels for unlabeled data iteratively. Once the detection performance plateaus or starts to degrade, we apply warm restart and mixup augmentation to both labeled and pseudo labeled data in Step 3 (purple dashed box). This step overcomes confirmation bias [39] and overfitting, thereby improving model performance further.

### *3.1. Leverage Labeled Data*

Data augmentation usually helps improve detection performance, and there are several commonly used augmentation techniques we could employ, such as random crop, rotation, horizontal flip, color jittering, mixup, etc. [28,40]. However, these commonly-used techniques become ineffective at improving model performance when only limited labels are available, e.g., less than 5% of the fully-labeled dataset. Inspired by the augmentation techniques proposed in [12,41], we rotate images by 0, 90, 180, and 270 degrees to quadruple the labeled data. Unlike augmenting images by directly copy-pasting them multiple times as in [12], our variants of the original data not only enrich the labeled data but also help prevent the model from becoming biased as the amount of pseudo labeled data increases during pseudo label propagation.
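The rotation scheme above can be sketched as follows (a minimal NumPy illustration; function and variable names are ours, not from the paper):

```python
import numpy as np

def rotate_augment(images, masks):
    """Quadruple a labeled set by rotating each image/mask pair
    by 0, 90, 180, and 270 degrees (Section 3.1)."""
    aug_images, aug_masks = [], []
    for img, msk in zip(images, masks):
        for k in range(4):  # k * 90 degrees counter-clockwise
            aug_images.append(np.rot90(img, k=k))
            aug_masks.append(np.rot90(msk, k=k))
    return aug_images, aug_masks
```

Rotating the mask together with the image keeps each augmented label spatially aligned with its defect.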

### *3.2. Distribution-Aware Pseudo Labeling*

In pseudo labeling methods, one common approach is to convert model predictions directly into hard pseudo labels. To illustrate this, let $D_l = \{x_l^{(i)}, y_l^{(i)}\}$ be a labeled dataset and $D_u = \{x_u^{(i)}\}$ be an unlabeled dataset. We first train a model $f_\theta$ on $D_l$ and use the trained model to infer on $D_u$. Let us further denote $p(x_u^{(i)})$ as the prediction for unlabeled sample $x_u^{(i)}$; then the pseudo label for $x_u^{(i)}$ can be denoted as:

$$\tilde{y}\_{u}^{(i)} = \mathbf{1}[p(x\_{u}^{(i)}) > \gamma],\tag{1}$$

where $\gamma \in (0, 1)$ is a threshold for generating pseudo labels. Note that, for a semantic segmentation model such as ours, $p(x_u^{(i)})$ is a probability map and $\tilde{y}_u^{(i)}$ is a binary mask of pseudo labels. As Equation (1) shows, the threshold setting is critical to generating reliable pseudo labels. However, determining an optimal threshold is difficult, and a sub-optimal threshold value can introduce many incorrect pseudo labels, which degrades model performance. Therefore, we propose a novel threshold setting method, which can generate more high-confidence pseudo labels without introducing many noisy predictions.
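Equation (1) amounts to an elementwise comparison against the threshold; a minimal sketch (names are illustrative, not from the paper):

```python
import numpy as np

def hard_pseudo_label(prob_map, gamma):
    """Equation (1): binarize a probability map into a hard
    pseudo-label mask using threshold gamma in (0, 1)."""
    return (prob_map > gamma).astype(np.uint8)
```

For a segmentation model the input is a per-pixel probability map and the output is the binary pseudo-label mask.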

### 3.2.1. Bootstrap Labels

Assuming the distribution of the limited labeled data approximates that of the fully labeled data, we first apply bootstrapping, a resampling technique that estimates summary statistics (e.g., mean and standard deviation) of a population by repeatedly sampling a dataset with replacement. The metric we employ in bootstrapping is *label*\_*pixel*\_*ratio*, defined as:

$$label\\_pixel\\_ratio = \frac{label\\_pixels}{image\\_pixels} \tag{2}$$

where *label*\_*pixels* is the number of pixels of one label, and *image*\_*pixels* is the number of pixels of the image in which the label is located. For example, if a label covers 256 pixels in a 2048 × 2048 image, its *label*\_*pixel*\_*ratio* is 256/(2048 × 2048) ≈ 0.00006104.
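The bootstrap estimate described above can be sketched as follows (a minimal NumPy illustration; the resample count and all names are our assumptions):

```python
import numpy as np

def bootstrap_label_pixel_ratio(ratios, n_resamples=1000, seed=0):
    """Estimate the mean and standard deviation of label_pixel_ratio
    (Equation (2)) by resampling the limited labels with replacement."""
    rng = np.random.default_rng(seed)
    ratios = np.asarray(ratios, dtype=np.float64)
    means = np.empty(n_resamples)
    for i in range(n_resamples):
        # draw a resample of the same size, with replacement
        sample = rng.choice(ratios, size=ratios.size, replace=True)
        means[i] = sample.mean()
    return means.mean(), means.std(ddof=1)
```

The returned statistics stand in for the mean $\mu$ and standard deviation $s$ of the full label distribution used in the following subsections.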

### 3.2.2. Distribution-Aware Pseudo Label Threshold Setting

Once we obtain the mean of *label*\_*pixel*\_*ratio* ($\mu$) in the previous step, we calculate the expected number ($k$) of label pixels in the predictions on $D_l$ ($p(D_l)$); the top $k$ values of the sorted $p(D_l)$ are treated as label pixels. In other words, the threshold for $D_l$ is the $k$-th largest value in $p(D_l)$, which we also use to set the threshold for the unlabeled data $D_u$. This mechanism works because both the labeled and unlabeled data are assumed to be sampled from the distribution of the fully labeled data and thus share the same mean label ratio; we also use the same trained model to infer on both. The $k$ at a specific iteration $n$ with predictions $p_n(D_l)$ is given by:

$$k\_{n,base} = \lfloor \mathcal{N}(p\_n(D\_l)) \cdot \mu \rceil,\tag{3}$$

where $\mathcal{N}(p_n(D_l))$ denotes the total number of pixels in $p_n(D_l)$. The raw outcome of this equation is a real number, so we round it to the nearest integer to obtain $k_{n,base}$. The corresponding $k_{n,base}$-th value in $p_n(D_l)$ can then be used as the threshold.
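The $k$-th-value thresholding of Equation (3) can be sketched as follows (illustrative names; the clamping of $k$ is our own guard against degenerate values, not from the paper):

```python
import numpy as np

def threshold_from_mu(probs, mu):
    """Equation (3): k = round(total_pixels * mu); the threshold is
    the k-th largest value in the flattened prediction map(s)."""
    flat = np.sort(probs.ravel())[::-1]  # descending order
    k = int(round(flat.size * mu))
    k = max(1, min(k, flat.size))        # keep k within [1, total_pixels]
    return flat[k - 1]
```

Pixels with predicted probability above this threshold then form the pseudo-label mask.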

Using Equation (3), we can set a quite reasonable initial threshold, as the calculation utilizes the mean of the estimated label distribution. However, since the pseudo labeling model is encouraged to produce more high-confidence (i.e., low-entropy) predictions as training continues, this method alone may propose an insufficient number of pseudo labels. To incorporate more high-confidence predictions as pseudo labels while reducing the possibility of introducing noisy predictions, we use a confidence interval and gradually widen it. A widening confidence interval allows a larger number of confident predictions to be incorporated as high-confidence pseudo labels. Specifically, we use the t-distribution to compute the 100(1 − $\alpha$)% confidence interval ($CI$):

$$CI = \mu \pm t\_{\alpha/2, m-1} \frac{s}{\sqrt{m}}, \tag{4}$$

where $\mu$ and $s$ are the estimated mean and sample standard deviation of *label*\_*pixel*\_*ratio*, respectively, $m$ is the number of labels, and $t_{\alpha/2, m-1}$ is the critical value of the t-distribution satisfying $P(T \le t) = 1 - \alpha/2$ at $m - 1$ degrees of freedom. The lower bound of the confidence interval ($CI_{lower}$) is $\mu - t_{\alpha/2, m-1}\frac{s}{\sqrt{m}}$, whereas the upper bound ($CI_{upper}$) is $\mu + t_{\alpha/2, m-1}\frac{s}{\sqrt{m}}$. We use the t-distribution in our proposed method because of the lack of available labeled samples. In such a case, the estimated standard deviation tends to be farther from the true standard deviation, and the t-distribution fits better than the normal distribution. We present a comparison of the two in the ablation studies. Once we obtain the confidence interval, we can map its bounds to the lower and upper bounds of $k$ via Equation (3) by replacing $\mu$ with $CI_{lower}$ or $CI_{upper}$. We can then use the $k_{n,ci}$-th value of $p_n(D_l)$, with a given 100(1 − $\alpha_n$)% confidence level, to obtain the threshold $\gamma_n$ at a specific iteration $n$:

$$\gamma\_n = \mathcal{K}\left(p\_n(D\_l), \left\lfloor \mathcal{N}(p\_n(D\_l)) \cdot t\_{\alpha\_n/2, m-1} \frac{2s}{\sqrt{m}} \cdot \nu\_n \right\rceil\right),\tag{5}$$

where $\mathcal{K}(p_n, k)$ is a function that finds the $k$-th value in $p_n$, and $\nu_n$ is an adjustment factor used to slow down or speed up propagation during training.
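A sketch of the CI-driven thresholding of Equations (4) and (5), assuming SciPy is available for the t critical value. The extracted form of Equation (5) is ambiguous, so this sketch uses the CI upper bound $\mu + t_{\alpha_n/2, m-1}\, s/\sqrt{m}$ as the pixel budget, which matches the textual description of mapping CI bounds through Equation (3); all names are ours:

```python
import numpy as np
from scipy import stats

def ci_threshold(probs, mu, s, m, alpha, nu=1.0):
    """Distribution-aware threshold: widen the target pixel budget
    with a t-distribution confidence interval, then pick the k-th
    largest prediction value. `nu` is the propagation adjustment
    factor from the text."""
    # critical value with P(T <= t) = 1 - alpha/2 at m - 1 dof
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=m - 1)
    half_width = t_crit * s / np.sqrt(m)
    target_ratio = (mu + half_width) * nu     # CI upper bound as budget
    flat = np.sort(probs.ravel())[::-1]
    k = int(round(flat.size * target_ratio))
    k = max(1, min(k, flat.size))             # our guard, not from paper
    return flat[k - 1]
```

As $\alpha$ shrinks over iterations, the interval widens, $k$ grows, and the threshold falls, admitting more confident predictions as pseudo labels.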

In addition to using the t-distribution confidence interval for setting thresholds, we also employ a more intuitive method for selecting high-confidence pseudo labels. Specifically, we find the threshold that produces the best performance on labeled data and then use it to generate initial pseudo labels for unlabeled data. To illustrate this, let $P_{l,0}$ denote the precision obtained on the labeled data, which can be regarded as a confidence level for pseudo labels, since precision indicates how many of all predictions are true small defects. We increase the confidence level with a moving step $\tau$ as training proceeds. Along with $k_{n,base}$ from Equation (3), we can obtain the threshold $\gamma_n$ at a specific iteration $n$ by using:

$$\gamma\_n = \mathcal{K}\left(p\_n(D\_l), \lfloor \mathcal{N}(p\_n(D\_l)) \cdot \mu \cdot (P\_{l,0} + \nu\_n \cdot \tau) \rceil\right). \tag{6}$$

Overall, the method utilizing the t-distribution confidence interval (Equation (5)) performs better than the intuitive method (Equation (6)); their comparison results are presented in the ablation studies.

### 3.2.3. Training Strategies

During training, we adjust the moving step of pseudo label propagation to set the threshold adaptively. To accomplish this, we continuously monitor training using the model evaluation results (e.g., Precision, Recall, F1 score) on labeled data. For example, if the monitored results show a decrease in both F1 score and recall but an increase in precision (close to 1.0), the threshold is set too high to incorporate more confident pseudo labels. In that case, the model can speed up propagation by setting the adjustment factor $\nu$ to a larger value so that the threshold is lowered, thereby incorporating more high-confidence pseudo labels, and vice versa. Another strategy we adopt during training is a weighted moving average of thresholds. Due to the random composition of training batches, a model may temporarily suffer a significant performance decrease at a certain iteration. A weighted moving average of thresholds prevents such an outlier threshold from resetting the threshold value the model has learned, ensuring more stable pseudo label propagation.
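The weighted moving average of thresholds can be sketched as follows (the weights follow the [α, β, 1 − α − β] scheme given in Section 4.3; the function name is ours):

```python
def moving_average_threshold(g_n, g_n1, g_n2, alpha=0.5, beta=0.3):
    """Weighted moving average over the current threshold and the
    thresholds of the two previous iterations, with weights
    [alpha, beta, 1 - alpha - beta]."""
    return alpha * g_n + beta * g_n1 + (1.0 - alpha - beta) * g_n2
```

An outlier current threshold `g_n` is damped by the history terms, so one bad batch cannot reset the learned operating point.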

Algorithm 1 presents the training procedure of our proposed distribution-aware pseudo labeling. First, we use the labeled data to train a model $f_{\theta,0}$, and then use the trained model to generate initial pseudo labels and obtain the initial precision $P_{l,0}$. During the iterative pseudo labeling with a maximum number of iterations $N$, we calculate the threshold $\gamma_n$ for each iteration via Equation (5) or (6). Then, we evaluate the obtained threshold $\gamma_n$ and the moving average threshold $\gamma_{n-1,ma}$ on labeled data; the threshold that yields better evaluation results (i.e., F1 Score) is selected. Meanwhile, by comparing the evaluation results of the current iteration (Precision, Recall, and F1 Score, denoted as $P_n$, $R_n$, $F1_n$, respectively) with those of the previous iteration, we obtain the adjustment factor $\nu_n$ to speed up or slow down pseudo label propagation. Moreover, we update the moving average threshold for the next iteration. Next, we use the selected threshold $\gamma_n$ to generate pseudo labels for the unlabeled data $D_u$. We then combine the pseudo labeled data $D_p$ with the labeled data $D_l$ to retrain the model. We repeat these steps to update pseudo labels iteratively and achieve better detection performance. Once the detection performance reaches a certain threshold (e.g., F1 ≥ 0.85) but plateaus or decreases beyond that point, warm restart and mixup augmentation are applied to both labeled and pseudo labeled data to improve detection performance further.

### **Algorithm 1** Distribution-Aware Pseudo Labeling.

1: Train a model $f_{\theta,0}$ using labeled data $D_l$.
2: **for** $n = 1, 2, \ldots, N$ **do**
3: &emsp;Obtain threshold $\gamma_n$
4: &emsp;$P_n, R_n, F1_n, \nu_n, \gamma_n \leftarrow E(D_l, \gamma_n, \gamma_{n-1,ma})$
5: &emsp;$\gamma_{n,ma} \leftarrow M(\gamma_n, \gamma_{n-1}, \gamma_{n-2}, \alpha, \beta)$
6: &emsp;$D_{p,n} \leftarrow$ Pseudo label $D_u$ using $\gamma_n$
7: &emsp;$\tilde{D} \leftarrow D_l \cup D_{p,n}$
8: &emsp;Train $f_{\theta,n}$ using $\tilde{D}$
9: &emsp;$f_\theta, D_p \leftarrow f_{\theta,n}, D_{p,n}$
10: **end for**
11: **return** $f_\theta, D_p$
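Algorithm 1 can be paraphrased as the following Python skeleton. All callables are placeholders for the paper's components, and the moving average is simplified to two terms for brevity; this is a sketch of the control flow, not the authors' implementation:

```python
def dap_sdd_loop(train, evaluate, propose_threshold, pseudo_label,
                 D_l, D_u, N=5):
    """Skeleton of Algorithm 1. `train(data)` returns a model,
    `evaluate` returns (F1, adjustment factor nu, chosen threshold),
    `propose_threshold` stands in for Eq. (5)/(6), and
    `pseudo_label` applies the threshold to D_u."""
    model = train(D_l)                        # line 1: supervised warm-up
    gamma_ma, D_p = None, []
    for n in range(1, N + 1):                 # lines 2-10
        gamma = propose_threshold(model, D_l)
        f1, nu, gamma = evaluate(model, D_l, gamma, gamma_ma)
        # simplified two-term moving average of thresholds
        gamma_ma = gamma if gamma_ma is None else 0.5 * gamma + 0.5 * gamma_ma
        D_p = pseudo_label(model, D_u, gamma)
        model = train(D_l + D_p)              # retrain on D_l ∪ D_p
    return model, D_p                         # line 11
```

With real components plugged in, `f1` and `nu` would drive the adaptive moving step described in Section 3.2.3.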

### 3.2.4. Loss Function

During pseudo labeling propagation, the loss function $\mathcal{L}_p$ incorporates both labeled and pseudo labeled data, which can be denoted as:

$$\mathcal{L}\_p = -\left(\sum\_{D\_l} \mathcal{L}(y\_l, \hat{y}\_l) + \eta \sum\_{D\_u} \mathcal{L}(y\_p, \hat{y}\_p)\right),\tag{7}$$

where $y_l$ denotes the ground truth labels and $y_p$ the pseudo labels. $\hat{y}_l$ and $\hat{y}_p$ denote the predictions for labeled and unlabeled data, respectively, and $\mathcal{L}$ represents the cross-entropy loss function. As pseudo labeling progresses, the amount of pseudo labeled data increases accordingly. To avoid the model increasingly favoring pseudo labeled data over the original labeled data, we add a weight $\eta \in (0, 1)$ to adjust the impact of pseudo labels. In practice, we can achieve this by repeatedly sampling the labeled data or by using augmentation techniques similar to those described in Section 3.1.
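The weighting in Equation (7) can be sketched as follows (the per-term cross-entropy values are assumed to be precomputed; names are illustrative):

```python
import numpy as np

def combined_loss(ce_labeled, ce_pseudo, eta=0.5):
    """Equation (7) at a glance: sum the labeled-data cross-entropy
    terms and add the pseudo-labeled terms down-weighted by
    eta in (0, 1)."""
    return float(np.sum(ce_labeled) + eta * np.sum(ce_pseudo))
```

Shrinking `eta` limits how much the growing pool of pseudo labels can pull the model away from the original labels.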

During the training process using mixup augmentation, we define a mixup loss function $\mathcal{L}_m$, which is given by:

$$\mathcal{L}\_m = -\sum\_{i=1}^{N} \left( \lambda \mathcal{L}(y\_a^{(i)}, \hat{y}\_a^{(i)}) + (1 - \lambda) \mathcal{L}(y\_b^{(i)}, \hat{y}\_b^{(i)}) \right), \tag{8}$$

where $y_a$ and $y_b$ are the original labels of the input images, and $\hat{y}_a$ and $\hat{y}_b$ are the corresponding predictions. $N$ is the number of samples used for training: in the first step of leveraging labeled data, $N$ only counts the labeled data, while in the last step it includes both the labeled and pseudo labeled data. $\lambda \in [0, 1]$ is used in mixup augmentation for constructing virtual inputs and outputs [40]. Specifically, mixup uses the following rules to create virtual training examples:

$$\tilde{x} = \lambda x\_a + (1 - \lambda) x\_b$$

$$\tilde{y} = \lambda y\_a + (1 - \lambda) y\_b,$$

where $(x_a, y_a)$ and $(x_b, y_b)$ are two original samples drawn at random from a training batch, $\lambda \in [0, 1]$, and $\tilde{x}$ and $\tilde{y}$ are the constructed input and corresponding output.
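The mixup construction rules above can be sketched directly (illustrative function name; [40] describes the original technique):

```python
import numpy as np

def mixup(x_a, y_a, x_b, y_b, lam):
    """Mixup virtual example construction:
    x~ = lam * x_a + (1 - lam) * x_b,
    y~ = lam * y_a + (1 - lam) * y_b."""
    x_tilde = lam * x_a + (1.0 - lam) * x_b
    y_tilde = lam * y_a + (1.0 - lam) * y_b
    return x_tilde, y_tilde
```

In practice `lam` is typically drawn from a Beta distribution per batch, and for segmentation the label masks are mixed the same way as the images.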

### **4. Results and Discussion**

*4.1. Datasets*

We use an in-house dataset from the wafer inspection system (WIS) in our evaluation. This dataset contains two types of small defects on wafer images: edge void and arc-like defects. Each defect type has 213 images: 173 for training and 40 for testing. There are 618 labels for edge void and 406 for arc-like; the labels consist of weak labels from the current system tool and verified predictions from a trained model. The image size is 2048 × 2048, and we crop each image into 512 × 512 patches to fit into GPU memory.

We also evaluate our method on two public datasets: the industrial optical inspection dataset of DAGM 2007 [14] and the tiny defect detection dataset for PCB [42]. We use Class 8 and Class 9 of DAGM as they fit the small defect category (denoted as Dagm-1 and Dagm-2). We split each class into two sets of 150 defective gray-scale images, for training and testing, respectively. The image size in DAGM is 512 × 512. The defects are labeled as ellipses, and each image has one labeled defect.

The PCB dataset, on the other hand, includes six types of tiny defects (missing hole, mouse bite, open circuit, short, spur, and spurious copper), and each image may contain multiple defects. PCB contains 693 images with defects: 522 and 101 images for training and testing, respectively. The total number of defects is 2953. PCB images come in different sizes, with an average size of 2777 × 2138 pixels; we also crop them into 512 × 512 patches for training.

### *4.2. Evaluation Metrics*

**Intersection over Prediction (IoP).** In this work, instead of using IoU (intersection over union), we adopt IoP (intersection over prediction) [43] to overcome the issue shown in Figure 1, where one weak label may contain multiple small defects or cover more area than the true defect area. IoP is defined as the intersection area between ground truth and prediction divided by the area of prediction. If the IoP of a prediction for a small defect is greater than a given threshold (0.5 in this work), we count it as a true positive; otherwise, we count it as a false positive. If one weak label contains multiple true positive predictions, we only count it as one true positive.
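The IoP computation can be sketched as follows (a minimal NumPy illustration on binary masks; names are ours):

```python
import numpy as np

def iop(gt_mask, pred_mask):
    """Intersection over Prediction: intersection area between ground
    truth and prediction, divided by the predicted area (rather than
    the union, as in IoU)."""
    pred_area = pred_mask.sum()
    if pred_area == 0:
        return 0.0  # no prediction means no overlap to credit
    inter = np.logical_and(gt_mask, pred_mask).sum()
    return float(inter) / float(pred_area)
```

Unlike IoU, a tight prediction fully inside an over-sized weak label still scores 1.0, which is the property motivating IoP here.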

**Average Precision (AP), F1 Score.** We use AP (average precision) and F1 Score to evaluate the performance of small defect detection.

### *4.3. Experimental Settings and Parameters*

In this work, we adopt the commonly-used segmentation model U-Net [8] in our proposed method, which has proven effective in medical image segmentation tasks such as detecting microcalcifications in mammograms [43,44]. Moreover, U-Net has a relatively small number of model parameters, which is favorable in practical use. Our U-Net consists of three downsampling blocks and three upsampling blocks with skip connections. Each block has two convolution layers, each followed by batch normalization and ReLU. As our proposed pseudo labeling strategy is not confined to a specific deep learning model, it can be easily implemented in other deep neural networks; we will extend our proposed techniques to other segmentation models in future work.

We use the Adam optimizer in model training. The initial learning rate is set to $1 \times 10^{-3}$ and gradually decreases during training. The adjustment factor $\nu_n$ is set to 1.1 to speed up pseudo label propagation or to 0.9 to slow it down. The moving average weights $[\alpha, \beta, (1 - \alpha - \beta)]$ are set to [0.5, 0.3, 0.2] for the current iteration threshold $\gamma_n$ and the thresholds of the previous two iterations $\gamma_{n-1}$, $\gamma_{n-2}$, respectively. The t-distribution confidence level ranges from 0.5 to 0.995 with a moving step of 0.005.

### *4.4. Experiment Results*

We first evaluate our proposed method on two different types of small defects on wafer images (the WIS dataset). Figure 3 demonstrates the improvements brought by our method over the supervised baseline. Overall, our proposed method achieves an AP above 0.93 across the different amounts of labeled data and obtains results comparable to those of the fully-labeled (100%) dataset even when the labeled data ratio is 2% (four labeled images). In contrast, the detection performance of the supervised method decreases dramatically when the labeled data size is limited; for instance, its AP drops below 0.6 when only 2% of labeled data is available.

**Figure 3.** Improvement over the supervised baseline on two small defects in the WIS dataset.

Figure 4 demonstrates the improvements by our method (solid lines) over the supervised baseline (dashed lines) on the DAGM and PCB datasets. Similar to the results on the WIS dataset, our proposed method achieves results comparable to the fully-labeled (100%) setting when the labeled data ratio is 10% across the different small defects in the DAGM and PCB datasets. The average precision (AP) of our method remains above 0.9 when only 5% of labeled data is available, while the AP of the supervised model degrades dramatically. Note that in Figure 4, the values of PCB-average represent the average AP over the six defect types in the PCB dataset.

**Figure 4.** Improvements over the supervised baseline on small defects in the DAGM and PCB datasets.

We then compare our method with state-of-the-art semi-supervised semantic segmentation methods. Table 1 shows the comparison results on the Edge Void defect dataset with 10% of labeled data. CCT [26] is a consistency regularization-based method. As shown in Table 1, CCT alone fails to recognize and locate small defects. CCT also provides a way of training with offline pseudo labels, so we use the pseudo labels generated in the first step of our method. As we can see, CCT+Pseudo improves detection performance as more pseudo labeled data are incorporated. However, the initial pseudo labels might contain incorrect labels, which are not updated iteratively in CCT; therefore, CCT+Pseudo still yields relatively low detection performance. AdvSemSeg [38], in contrast, uses an adversarial network for semi-supervised semantic segmentation. The experimental results show that AdvSemSeg performs better than CCT, which indicates that adversarial networks can be a promising direction for improving small defect detection. However, due to the limited ground truth labels, AdvSemSeg does not perform as well as reported in [38]. In self-training, we exclude the labeled data and use only the initial pseudo labels as the supervisory signals for unlabeled data. During self-training, instead of accepting pseudo labels whose pixel confidence score exceeds 0.5 as in [45], we adopt the same threshold setting strategies as our method to generate pseudo labels effectively. As shown in Table 1, self-training achieves significantly better AP and F1 scores than CCT and AdvSemSeg. Overall, DAP-SDD achieves the highest AP and F1 scores. We attribute this to the fact that ours also incorporates labeled data that contains useful prior knowledge.


**Table 1.** Comparison with state-of-the-art methods on the WIS dataset with 10% of labeled data.

In Figure 5, we present examples of predicted labels for small defects generated by the different methods. We can observe that all the evaluated models can generate labels for relatively large defects, as shown in the first row of edge void defects and the third row of arc-like defects. Compared to CCT+Pseudo and AdvSemSeg, which generate incomplete or overfull labels, self-training and our proposed method obtain more accurate labels. However, for the extremely tiny or dim defects, such as those shown in the second and fourth rows, most of these models suffer from missed detections, while our method can still detect them. Overall, our proposed method performs best across the different sizes of small defects.

**Figure 5.** Examples of predicted labels using different methods for edge void and arc-like defects (marked in red) in the WIS dataset. From left to right, columns are defects, segmentation results using CCT+Pseudo, AdvSemSeg, self-training, and DAP-SDD (ours), respectively.

The prediction results and comparison results with state-of-the-art methods on the DAGM and PCB datasets are shown in Tables 2 and 3.

**Table 2.** Evaluation results (AP, %) on public datasets (DAGM, PCB) when different amounts of labeled data are available. Total data amount (100%): DAGM (150), PCB (522).


**Table 3.** Comparison with state-of-the-art methods on public datasets (DAGM, PCB), evaluation metric: AP (%).


Moreover, we present examples of predicted labels for small defects in the DAGM (Figure 6) and PCB datasets (Figure 7) generated by different methods. As the results have shown, our proposed method consistently outperforms the state-of-the-art semi-supervised segmentation models on various datasets with various types of defects.

**Figure 6.** Examples of predicted labels using different methods on the DAGM dataset. The top two rows show results for Dagm-1, while the bottom two rows show results for Dagm-2. The first column shows defect images with original weak labels (marked in red color), and the remaining columns are segmentation results using CCT+Pseudo, AdvSemSeg, self-training, and DAP-SDD (ours), respectively.

**Figure 7.** Examples of predicted labels using different methods on the PCB dataset. From top to bottom, each row represents six types of defects in the PCB dataset: mouse bite, missing hole, open circuit, short, spur, and spurious copper. The first column shows defect images with original weak labels (marked in red color), and the remaining columns are segmentation results using CCT+Pseudo, AdvSemSeg, self-training, and DAP-SDD (ours), respectively.

### *4.5. Ablation Studies*

**Contribution of components for performance improvement.** Figure 8 demonstrates how the different components in our proposed method contribute to detection performance on both the in-house and public datasets. For a fair comparison, we use the same data augmentations in the supervised baseline and in ours; therefore, the results of the first step using only labeled data also serve as the supervised baseline. As shown in Figure 8, for the WIS dataset (solid bars), the model using 20% of labeled data achieves around 88% AP, which is still lower than our target (our real-world applications typically require an AP of 90% or higher). With 2% of labeled data available, the AP decreases to 56%. In Step 2, the proposed distribution-aware pseudo labeling significantly improves detection performance in all cases, and cases with fewer labeled data benefit more; for example, AP improves from 56% to 92% when using 2% of labeled data. These results demonstrate that our proposed method can effectively leverage the information in massive unlabeled data to improve detection performance. In the final step, warm restart and mixup are employed to improve performance further. We obtain similar results on the public datasets, shown in Figure 8 (bars with patterns). Overall, the proposed distribution-aware pseudo labeling contributes most significantly to detection performance, and the data augmentations we adopt in DAP-SDD are effective in enhancing the model's performance and robustness.

**Figure 8.** Ablation studies on different factors that contribute to performance improvement.

**Comparison with more baselines.** In our proposed DAP-SDD, we assume the distribution of proposed labels approximates the distribution of ground truth labels as training proceeds. We use the Kullback–Leibler (KL) divergence, a commonly-used measure of how one probability distribution differs from a reference distribution, to evaluate the difference between the distribution of proposed labels and that of the ground truth labels during training, as shown in Figure 9. We observe the following:

- (a) t-dist vs. normal-dist: the t-distribution (t-dist) performs better than the normal distribution (normal-dist) because the t-dist has heavier tails and is thus more suitable for estimating the confidence interval (CI) when the sample size is limited, as in our cases. For a given CI range, the normal-dist tends to incorporate more predictions than the t-dist, which in turn introduces too many noisy predictions into the pseudo labels; the accumulated noise overwhelms the impact of the original limited labels as training proceeds.
- (b) Adaptive vs. fixed threshold: adaptive thresholding combining Equations (3) and (5) keeps the model learning more useful information during training and outperforms the fixed threshold obtained via Equation (3) alone.
- (c) Equation (5) vs. Equation (6): Equation (6) is more conservative in incorporating confident predictions than Equation (5) when using the same moving step (0.005), and it requires more training epochs to reach equivalent results.
- (d) With vs. without ma: compared with the baseline without the moving average (ma) threshold, our method incorporates ma, which prevents outlier thresholds (e.g., at epochs 23 and 65) from resetting what the model has learned.

In addition, Table 4 shows the corresponding detection performance and KL divergence of these baselines at the same training epoch (the 100th). As we can observe, DAP-SDD using the t-distribution confidence interval with the moving average achieves the best detection performance while having the smallest KL divergence.

**Figure 9.** KL divergence curves of various baselines.


**Table 4.** Comparison of detection performance (AP, %) and KL divergence (same training epochs 100) on various baselines.
