To demonstrate the feasibility of small-data training, this study builds a database and extracts small samples from it. The database establishment, data sampling, and model training are described in detail below.
4.1. Database Establishment
Selecting features is typically the first step in establishing a database. Researchers need to choose features that are highly correlated with the predictive target [25], which often requires expert knowledge or numerical analysis.
Figure 15 illustrates the features that are highly correlated with WLCSP fatigue life based on expert knowledge. As an extension of the previous research [14], this study retains the same four features, and the database is built on TV2. The four features are the upper pad diameter, the lower pad diameter, the SBL thickness, and the chip thickness.
Following the space-filling concept of previous studies, we used as many data points as possible to fill the entire feature space evenly, which enabled good performance when training the AI model. The database is obtained by determining the value boundaries of each feature, selecting node values, and taking the complete permutation and combination of those nodes, as shown in Table 5, Table 6, Table 7, Table 8, Table 9 and Table 10.
This is the simplest form of space-filling. In total, the database contains over 9000 data points. A certain proportion of them is selected as the training set, and the performance of the AI model improves as the amount of training data increases. Previous studies did not introduce any additional sampling strategy and simply used random sampling. This research introduces an adaptive sampling method that reduces the amount of training data while maintaining the performance of the model.
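As a minimal sketch of this full-factorial (space-filling) construction, the snippet below enumerates every combination of node values for the four features. The node values shown are placeholders for illustration only, not the actual levels listed in Tables 5–10, so the resulting count differs from the 9000+ points of the real database.

```python
# Minimal sketch of the full-factorial database construction.
# The node values below are placeholders, not the levels from Tables 5-10.
from itertools import product

upper_pad = [200, 220, 240, 260, 280]    # upper pad diameter nodes (hypothetical)
lower_pad = [200, 220, 240, 260, 280]    # lower pad diameter nodes (hypothetical)
sbl_thickness = [5, 10, 15, 20]          # SBL thickness nodes (hypothetical)
chip_thickness = [300, 350, 400, 450]    # chip thickness nodes (hypothetical)

# Every permutation/combination of node values becomes one design point;
# each point is later paired with its simulated fatigue life.
design_points = list(product(upper_pad, lower_pad, sbl_thickness, chip_thickness))
print(len(design_points))  # number of data points generated by this toy grid
```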
4.2. Data Sampling (Random Pick)
In Section 4.3, K-medoids will be used for data sampling. As a comparison group, 200 samples were randomly selected for training. Their distribution is visualized in Figure 16.
The three axes of this figure represent the upper pad diameter, lower pad diameter, and chip thickness, while the color bar represents the SBL thickness. The quality of the data distribution cannot be assessed directly from this plot; it must instead be evaluated through the performance of the AI models. The 200 data points are used as the training set, while the remaining data points serve as the testing set, and the performance of the AI model on the testing set is the basis for this evaluation.
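A minimal sketch of this random-pick split is given below. The array shape, random seed, and column layout (four features followed by the fatigue life) are assumptions for illustration; in the study the rows come from the simulated database rather than random numbers.

```python
# Sketch of the random-pick baseline: 200 training points drawn uniformly
# from the database, with the remainder kept as the testing set.
import numpy as np

rng = np.random.default_rng(seed=0)            # fixed seed for reproducibility
database = rng.random((9000, 5))               # placeholder rows: [4 features, fatigue life]

indices = rng.permutation(len(database))
train_idx, test_idx = indices[:200], indices[200:]

X_train, y_train = database[train_idx, :4], database[train_idx, 4]
X_test, y_test = database[test_idx, :4], database[test_idx, 4]
```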
It is worth mentioning that in classification problems, the decision boundary is often the focal point of research. Guan [26] revealed the impact of decision boundary complexity on model generalization. To address generalization issues, a commonly used method is boundary sampling: data points close to the decision boundary are identified within the existing dataset, and additional samples are generated near the boundary using interpolation or other techniques [27]. Although regression problems have no decision boundaries, the performance of AI regression models near the feature boundaries is also worth exploring. In this study, the essence of adaptive sampling lies in sampling near the feature boundaries, and its impact on AI model performance is investigated further in Section 4.3.
To assess the performance of the AI models, evaluation criteria need to be established: the maximum training difference, the average training difference, the maximum testing difference, the average testing difference, the number of testing data points with a difference greater than 50 cycles, and the number of testing data points with a difference percentage greater than 7%. The training differences indicate whether the model is underfitted, whereas the testing differences indicate whether it is overfitted. The last two criteria give an intuitive count of the test points with inaccurate predictions. With this, the preliminary preparation for model training is complete. The ANN learning algorithm is used with 200 training data points and over 9000 testing data points. The hyperparameter settings for the ANN are shown in Table 11.
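A minimal sketch of these six criteria is shown below, assuming `y_true` and `y_pred` are arrays of target and predicted fatigue lives in cycles; the helper name is introduced here for illustration only.

```python
# Sketch of the evaluation criteria described above.
import numpy as np

def evaluate(y_true, y_pred):
    diff = np.abs(y_pred - y_true)        # absolute difference in cycles
    pct = diff / y_true * 100.0           # percentage difference
    return {
        "max_difference": diff.max(),
        "avg_difference": diff.mean(),
        "n_diff_over_50_cycles": int((diff > 50).sum()),
        "n_pct_over_7_percent": int((pct > 7).sum()),
    }

# Example usage: evaluate(y_test, model.predict(X_test)), applied to both the
# training set and the testing set.
```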
As mentioned in Section 2.2, the learning rate of Adam is adaptive. Because this is a simple model with only four inputs and one output, few additional tricks are involved. The hyperparameter settings and the choice of data preprocessing were determined from previous experience [14]. Here, data preprocessing refers specifically to data transformation, a method of adjusting the range of feature values that can improve model performance [28]. A robust scaler is the best choice in this case: it uses quartiles for data standardization, as shown in Equation (5). Grid search is used to find the optimal combination of neuron numbers for each layer, and the maximum number of iterations is used as the stopping condition for model updates.
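The quartile-based transformation can be sketched as follows, assuming Equation (5) takes the standard robust-scaler form x' = (x − median(x)) / (Q3(x) − Q1(x)) applied feature-wise; scikit-learn's RobustScaler is one off-the-shelf implementation of this scaling. The toy feature matrix is illustrative only.

```python
# Sketch of the quartile-based (robust-scaler) data transformation.
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[200.0, 5.0], [240.0, 10.0], [280.0, 20.0]])  # toy feature matrix

scaler = RobustScaler()              # centers on the median, scales by the IQR
X_scaled = scaler.fit_transform(X)   # fit on training data, then reuse on test data
```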
Two sets of results are listed in Table 12. “Neuron number” indicates the number of neurons in each hidden layer. These two models were selected from the many models generated by the grid search, and they exhibit good performance in terms of testing differences.
The entries under “Maximum difference” in the table are, in order, the prediction, the target, the absolute difference, and the percentage difference. The training differences of both models indicate that they have been trained sufficiently and are not underfitted. The testing differences, however, indicate that the models are not accurate when predicting unknown data: in both cases, the maximum testing percentage difference exceeds 15%, and the number of test data points with inaccurate predictions (difference > 50 cycles) exceeds 200. The inadequacy of the training data is undoubtedly a contributing factor to the poor performance, so the small training set obtained through random sampling needs further optimization, either by increasing its size or by improving its distribution.
Increasing the number of data points in regions of high interest is one of the most direct ways to improve the data distribution, and this is what adaptive sampling refers to. The method relies on prior knowledge: the locations of the high-interest regions must be known. As mentioned above, boundary sampling is promising but requires validation. It is therefore necessary to first check whether the inaccurately predicted test points tend to cluster near the feature boundaries; a clear clustering of these points would indicate that the clustered regions are high-interest areas.
Except for the SBL thickness, the results for the other three features are very similar, with proportions close to one half. The detailed results are shown in Table 13.
Among the 319 inaccurately predicted test points for Model I, 220 data points have an SBL thickness of less than 10.5, accounting for a substantial 70%. There are 153 data points, accounting for 48%, whose upper pad diameter lies at a boundary value. The situation for the lower pad diameter and the chip thickness is similar to that of the upper pad diameter. It is therefore necessary to increase the number of training data points near the feature boundaries and in the region of small SBL thickness.
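The boundary-clustering check behind Table 13 can be sketched as follows. The helper name, the column index of the SBL thickness, and the boundary values passed in `bounds` are placeholders introduced for illustration; the 50-cycle and 10.5 thresholds follow the text.

```python
# Sketch: among test points missed by more than 50 cycles, what fraction lies
# at a feature boundary or has a small SBL thickness?
import numpy as np

def boundary_fractions(X_test, y_test, y_pred, bounds, sbl_col=2, sbl_cut=10.5):
    failed = np.abs(y_pred - y_test) > 50          # inaccurately predicted points
    Xf = X_test[failed]
    results = {"n_failed": int(failed.sum()),
               "frac_small_sbl": float(np.mean(Xf[:, sbl_col] < sbl_cut))}
    for col, (lo, hi) in bounds.items():           # fraction at each feature boundary
        results[f"frac_feature_{col}_at_boundary"] = float(
            np.mean((Xf[:, col] == lo) | (Xf[:, col] == hi)))
    return results
```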
4.3. Data Sampling (Adaptive Method)
The results in Section 4.2 are consistent with our expectations: roughly half of the inaccurate predictions are located at the feature boundaries. Just as data points near the decision boundary are prone to classification errors, data points near the feature boundaries also carry a higher risk of inaccurate prediction. Next, K-medoids is used to perform feature boundary sampling. Because the dissimilarity between clusters, the uniformity within each cluster, and the space-filling principle are guaranteed by the clustering algorithm, the cluster centers are chosen to represent the characteristics of their clusters.
To increase the training data near the feature boundaries uniformly, the feature space is split first. Comparative testing showed that partitioning the space using four features instead of three has a limited impact on the final prediction performance, so this study focuses on one of these situations.
By dividing the upper pad, lower pad, and SBL ranges into two sets each and taking all combinations, the entire feature space is divided into eight parts. From Table 14, the upper and lower boundary values of the upper and lower pads are extracted in set 1. Referring to Table 13, 10.5 is set as the dividing value between the two SBL sets. Using K-medoids, 25 cluster centers are generated in each of the eight partitioned regions, for a total of 200 training data points.
Table 15 displays the key configuration parameters of K-medoids. “Metric” specifies the measure used to compute distances between data points, with Euclidean distance being the most widely used. “Method” specifies the clustering approach, with “Alternate” chosen for its lower time cost. “K-medoids++” is an initialization method for the cluster centers that ensures they start sufficiently far apart, which facilitates faster convergence to better clustering outcomes. “Random_state” is set to ensure reproducible results. The distribution of the new training data is shown in Figure 17.
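A minimal sketch of this sampling step for a single partition is shown below, assuming the KMedoids implementation from scikit-learn-extra (the study does not state which implementation was used) configured with the Table 15 settings; the placeholder `region` array and the omitted loop over the eight partitions are illustrative simplifications.

```python
# Sketch of adaptive sampling within one of the eight feature-space partitions.
import numpy as np
from sklearn_extra.cluster import KMedoids

rng = np.random.default_rng(seed=0)
region = rng.random((1200, 4))                # placeholder points for one partition

km = KMedoids(n_clusters=25,                  # 25 cluster centers per region
              metric="euclidean",
              method="alternate",
              init="k-medoids++",
              random_state=0)
km.fit(region)
training_points = region[km.medoid_indices_]  # the medoids become training samples
```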
Compared with Figure 16, Figure 17 shows a significant increase in the number of data points on the boundary surfaces. In addition, the number of blue data points with a small SBL thickness has increased. The AI models are trained with the new training data in Section 4.4.
4.4. AI Model Training
Apart from continuous hyperparameter adjustment, the prediction ability of a model with four inputs and one output relies on the quality and quantity of the data. Given the limited quantity, the previous work improved the quality of the training dataset; the new training set is validated in this section.
Before applying new algorithms, the choice of the ANN should be validated against other known AI models. Kou et al. [4] and Su [6] reported the rather high prediction capability of Support Vector Regression (SVR) and Kernel Ridge Regression (KRR).
Table 16 compares the prediction performance of the different algorithms. To restate the learning task: there are 200 training data points selected with K-medoids, over 9000 testing data points, 4 inputs, and 1 output.
All models use the robust scaler for preprocessing. In the table below, two ANN models with different solvers are presented: ANN-1 uses ‘Adam’, while ANN-2 uses ‘L-BFGS’. Both solvers are commonly used for ANNs, but ‘L-BFGS’ can be more effective for small datasets [29]. The average training difference of every algorithm is small, indicating that all algorithms were trained successfully without underfitting. Based on the average testing difference, the ANN outperforms SVR and KRR, so the ANN is explored in greater depth. The table below shows the performance of ANN models with different solvers, hidden layers, and neuron numbers.
From Table 17, ANN-2 is the best-performing model in this comparison, and it is evident that adjusting the important hyperparameters of the ANN has only a limited impact on the average testing difference. Training a single ANN model therefore cannot improve performance further. On the other hand, the time cost of training on a small dataset is much lower than that of training on a large one, which is one of the advantages of training with small datasets.
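One possible way to set up the algorithm comparison of Tables 16 and 17 is sketched below, with every model wrapped in a robust-scaler pipeline. The hyperparameter values are illustrative placeholders, not the tuned values from the study.

```python
# Sketch of the algorithm comparison: SVR, KRR, and two ANN solvers,
# all preprocessed with the robust scaler.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.neural_network import MLPRegressor

models = {
    "SVR": SVR(kernel="rbf", C=100.0),
    "KRR": KernelRidge(kernel="rbf", alpha=1.0),
    "ANN-1 (Adam)": MLPRegressor(hidden_layer_sizes=(30, 30), solver="adam",
                                 max_iter=5000, random_state=0),
    "ANN-2 (L-BFGS)": MLPRegressor(hidden_layer_sizes=(30, 30), solver="lbfgs",
                                   max_iter=5000, random_state=0),
}

# Example usage with the splits and the evaluate() helper sketched earlier:
# for name, model in models.items():
#     pipe = make_pipeline(RobustScaler(), model)
#     pipe.fit(X_train, y_train)
#     print(name, evaluate(y_test, pipe.predict(X_test)))
```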
Table 18 compares the prediction performance of ANN models trained with the old and new training data: “Random pick” uses Model II, while “K-medoids” uses ANN-2. The comparison demonstrates that the adaptive sampling method is effective.
The new training dataset greatly improves the performance of the AI model: the maximum testing difference improves significantly, and the number of test points with failed predictions drops markedly. Although the performance is already quite good, there is still room for improvement, and with the help of ensemble learning it can be improved further.
4.5. Ensemble Learning
Ensemble learning shows high potential for improving prediction accuracy on small datasets [30]. This study only presents some preliminary results. Voting, or weighted averaging, has long been an effective means of reducing system variance; it is also the core idea of “bagging”. As mentioned earlier, the performance of a single ANN model on the testing data is stable. Although their numerical metrics are close, different ANN models often have distinct regions of inaccurate prediction.
Table 17 presents the performance of some of the ANN models obtained from the grid search; in fact, many ANN models are well trained. Although an individual ANN model may have limited performance, aggregating them can enhance it. For each number of hidden layers, the ANN models with excellent performance on the testing set were selected as sub-models, and all sub-models in this study have equal weights.
Table 19 shows the performance results with 5 and 15 sub-models, and the final comparison results are shown in Table 21.
When new testing data are input, each sub-model provides its own prediction, and the final result is the average of the predictions from the sub-models. Even though only ANN models are used as sub-models and no mixture of algorithms is involved yet, the performance of the ensemble model is outstanding: the number of failed prediction test points decreases further, dropping below 10.
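A minimal sketch of this equal-weight averaging is given below, assuming `sub_models` is a list of already fitted models or pipelines (for example, the grid-search ANN models selected above); the function name is introduced here for illustration.

```python
# Sketch of the equal-weight averaging ensemble: each sub-model predicts
# independently, and the final prediction is the mean over sub-models.
import numpy as np

def ensemble_predict(sub_models, X):
    predictions = np.stack([m.predict(X) for m in sub_models])  # (n_models, n_samples)
    return predictions.mean(axis=0)                              # equal-weight average
```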
Table 20 shows the hyperparameter settings of the sub-models used; any unmentioned hyperparameters remain consistent with those in Table 11. To increase sub-model diversity, some models trained with the standard scaler were added. The ensemble of 5 sub-models uses the sub-models numbered 1–5. The final comparison results are shown in Table 21.
This direct comparison shows that both K-medoids sampling and ensemble learning improve the prediction performance.