3.2. Feature Engineering Based on Dilated CNN
Due to the considerations mentioned in Section 2.2, the dilated CNN [11] is utilized in the feature extraction process of this work to extract the correlation of path delays under various PVT corners. The structure of the dilated CNN, illustrated in Figure 5 and described in detail below, consists of an input layer, convolutional layers, a flatten layer, dense layers, and an output layer. As shown in Figure 5, Qm features are extracted with the dilated CNN from the Nf original features of each of the Ns samples.
The input data is reshaped into a three-dimensional form of Ns × Nf × 1, where Ns represents the number of samples and Nf denotes the number of features, as formulated in Equation (2). From this input, n cascaded convolutional layers transform the data into the shape Ns × Hi × Fi, where Fi is the number of convolutional kernels in layer i (1 ≤ i ≤ n) and Hi is determined by the parameters of the corresponding kernels. The computation of each convolutional layer is defined in Equation (3), where x and y denote the input and output data, respectively, and W is the weight coefficient of the convolutional kernel. It should be noted that, for each convolutional layer of the dilated CNN, the coverage of the convolution kernel can differ even when the kernel size is the same, as demonstrated in Figure 3.
In the flatten layer, the three-dimensional data is reshaped into a two-dimensional form of Ns × HnFn. As shown in Equation (4), each of the Ns matrices of shape Hn × Fn is flattened by concatenating its Hn Fn-dimensional vectors xi into a single vector yj.
The flatten layer is followed by m fully connected dense layers, where Qj is the number of neurons in layer j (1 ≤ j ≤ m). The computation of a dense layer is formulated in Equation (5), where the tanh function is used as the activation function, and W and b are the weight coefficients and bias, respectively, that produce the output y from the input x.
Finally, the predicted results are output with the shape Ns × 1 through the linear transform formulated in Equation (6), where yj denotes one of the Ns elements of the result vector, calculated from the Qm-dimensional xj with the corresponding weight coefficients W and bias b.
It is worth noting that, in this work, the purpose of utilizing the dilated CNN is to extract the correlation among path delays from different PVT corners rather than to perform prediction. To this end, the output of the m-th dense layer is collected as the input to the subsequent training and inference process.
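For illustration, a minimal Keras sketch of such a dilated CNN is given below. The layer counts, kernel sizes, dilation rates, and the activation used in the convolutional layers are assumptions chosen for demonstration, not the exact configuration of Figure 5; only the overall structure (input reshaping, dilated convolutions, flatten, tanh dense layers, linear output, and feature extraction from the last dense layer) follows the description above.

```python
# Minimal sketch of the dilated-CNN feature extractor described above (hyperparameters assumed).
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_dilated_cnn(n_features: int, q_m: int = 16) -> keras.Model:
    inputs = keras.Input(shape=(n_features, 1))            # Equation (2): Ns x Nf x 1
    x = inputs
    for dilation in (1, 2, 4):                              # n cascaded dilated conv layers, cf. Eq. (3)
        x = layers.Conv1D(filters=8, kernel_size=3,
                          dilation_rate=dilation,
                          padding="same", activation="relu")(x)  # ReLU is an assumption
    x = layers.Flatten()(x)                                  # Equation (4): Hn*Fn features
    x = layers.Dense(32, activation="tanh")(x)               # dense layers with tanh, cf. Eq. (5)
    features = layers.Dense(q_m, activation="tanh",
                            name="feature_layer")(x)         # m-th dense layer with Qm neurons
    outputs = layers.Dense(1)(features)                      # linear output, cf. Equation (6)
    return keras.Model(inputs, outputs)

model = build_dilated_cnn(n_features=7)
model.compile(optimizer="adam", loss="mse")
# After training, the Qm-dimensional output of the last dense layer is used as the
# extracted feature vector, rather than the final prediction:
extractor = keras.Model(model.input, model.get_layer("feature_layer").output)
```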
Based on the dilated CNN, the process of feature extraction in this work is depicted in Figure 6; a cross-validation strategy is used to prevent data leakage and overfitting. The flow of feature extraction in Figure 6 mainly consists of a training step, an inference step, and a feature concatenation step. The delays of Np paths at a specific voltage Vi and Nt different temperatures are first partitioned into a training set and a test set, containing Ntrn and Ntest paths, respectively. In the training step, k-fold cross-validation is performed by holding out 1/k of the samples from the training set; this is iterated k times to train k different dilated CNNs on the original path delays at Vi. Then, in the inference step, each of the k groups of held-out cross-validation data is fed to the corresponding dilated CNN to extract new features from its last dense layer, with the shape Ntrn/k × Nnew, where Nnew is equal to the parameter Qm of the dilated CNN. The new features of the training set are then concatenated with the original features, i.e., the original path delays at Vi, in the feature concatenation step. Similarly, the new features generated by the dilated CNN for the test set are also concatenated to the original ones, except that the new features generated for the test set by the k different dilated CNNs are first averaged into one Ntest × Nnew matrix before concatenation.
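A minimal sketch of this out-of-fold feature-extraction flow is shown below, reusing the build_dilated_cnn() helper from the previous sketch. The number of folds, the number of epochs, and the standard fit-on-the-remaining-folds/predict-the-held-out-fold scheme are assumptions consistent with the description above.

```python
# Sketch of the Figure 6 flow: k-fold training, out-of-fold feature extraction,
# averaging of test-set features, and feature concatenation (assumed settings).
import numpy as np
from sklearn.model_selection import KFold
from tensorflow import keras

def extract_features(X_trn, y_trn, X_test, n_new=16, k=5):
    """Return new training and test features of width n_new (= Qm)."""
    oof_feats = np.zeros((len(X_trn), n_new))               # Ntrn x Nnew
    test_feats = np.zeros((k, len(X_test), n_new))          # k predictions to be averaged
    folds = KFold(n_splits=k, shuffle=True, random_state=0).split(X_trn)
    for fold, (fit_idx, val_idx) in enumerate(folds):
        model = build_dilated_cnn(n_features=X_trn.shape[1], q_m=n_new)
        model.compile(optimizer="adam", loss="mse")
        model.fit(X_trn[fit_idx, :, None], y_trn[fit_idx], epochs=50, verbose=0)
        extractor = keras.Model(model.input, model.get_layer("feature_layer").output)
        # Inference step: new features for the held-out fold (Ntrn/k x Nnew)
        oof_feats[val_idx] = extractor.predict(X_trn[val_idx, :, None], verbose=0)
        test_feats[fold] = extractor.predict(X_test[:, :, None], verbose=0)
    # Test-set features from the k CNNs are averaged into one Ntest x Nnew matrix
    return oof_feats, test_feats.mean(axis=0)

# Feature concatenation step: append the new features to the original delays at Vi, e.g.
#   X_trn_aug = np.hstack([X_trn, oof_feats]); X_test_aug = np.hstack([X_test, test_feats_avg])
```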
3.3. Ensemble Model with Two-layer Stacking
In the process of training and inference, an ensemble approach is adopted for modeling, i.e., combining a diverse set of learners (individual models) to improve the stability and predictive power of the model. Here, a learner is used to combine the outputs of different learners, which reduces either the bias or the variance error, depending on the combining learner chosen. Compared with other commonly used ensemble learning techniques, such as bagging and boosting, stacking can transfer the ensemble features to a simple model and does not require much parameter tuning or feature selection [13,14]. In order to improve prediction precision while avoiding overfitting, a two-layer stacking method consisting of a hidden layer and an output layer is applied to build the ensemble model, as illustrated in Figure 7.
In the ensemble model flow shown in Figure 7, the linear regression (LR) [15] and light gradient boosting machine (LightGBM) [16] algorithms are utilized in the two layers due to their complementary characteristics, as explained in the following. LR is an efficient and simple machine learning algorithm that does not require complicated computation, even for large amounts of data. However, LR only captures linear relationships between variables, is very sensitive to outliers, and requires the input features to be independent. To overcome these drawbacks, LightGBM, a widely used gradient boosting framework, is also applied in the ensemble model. Since it uses tree-based learning algorithms, LightGBM is not sensitive to outliers and can achieve high accuracy. The formula of the LR model is written in Equation (7), where θi represents the weight coefficients. The equation of the LightGBM model is given in Equation (8), where f0(x) is the initial solution, ft-1(x) represents the (t-1)-th solution, ctj represents the weight coefficients, and T and J denote the number of iterations and the number of weight coefficients, respectively.
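For reference, the standard forms consistent with the symbols described above are sketched below; the exact notation of Equations (7) and (8) may differ, and the leaf regions Rtj are an assumed symbol introduced here for illustration.

```latex
% Linear regression (cf. Equation (7)): a weighted sum of the input features
\hat{y} = \theta_0 + \sum_{i=1}^{d} \theta_i x_i
% Gradient boosting (cf. Equation (8)): the t-th solution adds J leaf weights c_{tj}
% to the (t-1)-th solution, iterated for t = 1, ..., T
f_t(x) = f_{t-1}(x) + \sum_{j=1}^{J} c_{tj}\, I\!\left(x \in R_{tj}\right)
```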
The parameters θi and ctj used in the LR and LightGBM models are obtained in the training process by the back-propagation method. In this work, the commonly used gradient descent algorithm is applied to update them iteratively from random initial values, as formulated in Equation (9), where θ(i+1) and θ(i) represent the parameter θ in the (i+1)-th and i-th iterations, respectively, f(θ) is the loss function, and η is the learning rate. The derivation process for the parameter ctj is similar.
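Assuming Equation (9) takes the usual gradient descent form, the update described above can be written as:

```latex
\theta^{(i+1)} = \theta^{(i)} - \eta \left.\frac{\partial f(\theta)}{\partial \theta}\right|_{\theta=\theta^{(i)}}
```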
In the proposed framework, the parameters θi form an (Nnew + Nt + 1)-dimensional vector containing Nnew + Nt weight coefficients and one bias for the Nnew + Nt input features, for each voltage combination and each process corner. The parameters ctj consist of T vectors of length at most J for each voltage combination and each process corner, where T indicates the number of trees and J is the upper bound on the number of leaves per tree.
As shown in Figure 7, the hidden layer accepts the features extracted by feature engineering together with the original path delays at the voltage Vi, with shapes of Ntrn × (Nnew + Nt) for the training set and Ntest × (Nnew + Nt) for the test set, defined as Xtrn and Xtest, respectively. LR and LightGBM are trained on these input features, and their predicted results, denoted as XLRtrn/XLRtest and XLGBMtrn/XLGBMtest, are concatenated into matrices of shape Ntrn × 2 and Ntest × 2 to serve as the input features of the output layer. In the output layer, these data are further trained by another LR model, whose predicted results, indicated as Ŷtrn and Ŷtest for the training set and test set, respectively, are the path delays at Vj predicted by the whole framework.
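A minimal sketch of this two-layer stacking flow is given below. The LightGBM hyperparameters (playing the roles of T and J) are illustrative assumptions, and X_trn/X_test denote the concatenated Ntrn × (Nnew + Nt) and Ntest × (Nnew + Nt) feature matrices produced by the feature engineering step.

```python
# Sketch of the Figure 7 stacking flow: LR and LightGBM in the hidden layer,
# a second LR in the output layer (hyperparameters assumed).
import numpy as np
from sklearn.linear_model import LinearRegression
from lightgbm import LGBMRegressor

def two_layer_stacking(X_trn, y_trn, X_test):
    # Hidden layer: train LR and LightGBM on the same input features
    lr = LinearRegression().fit(X_trn, y_trn)
    lgbm = LGBMRegressor(n_estimators=100, num_leaves=31).fit(X_trn, y_trn)  # T trees, <= J leaves
    # Concatenate base-model predictions into Ntrn x 2 and Ntest x 2 matrices
    Z_trn = np.column_stack([lr.predict(X_trn), lgbm.predict(X_trn)])
    Z_test = np.column_stack([lr.predict(X_test), lgbm.predict(X_test)])
    # Output layer: a second LR combines the two predictions into the delay at Vj
    meta = LinearRegression().fit(Z_trn, y_trn)
    return meta.predict(Z_trn), meta.predict(Z_test)   # Y_hat_trn, Y_hat_test
```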