Article

Se-xDeepFEFM: Combining Low-Order Feature Refinement and Interaction Intensity Evaluation for Click-Through Rate Prediction

1 School of Information Engineering, East China Jiaotong University, Nanchang 330000, China
2 School of Software, East China Jiaotong University, Nanchang 330000, China
3 Cyber Science and Engineering School, Wuhan University, Wuhan 430071, China
* Author to whom correspondence should be addressed.
Symmetry 2022, 14(10), 2123; https://doi.org/10.3390/sym14102123
Submission received: 22 August 2022 / Revised: 5 October 2022 / Accepted: 9 October 2022 / Published: 12 October 2022
(This article belongs to the Section Computer)

Abstract

Click-through rate (CTR) prediction can provide considerable economic and social benefits. Few studies have considered the importance of low-order features, usually employing a simple feature interaction method. To address these issues, we propose a novel model called Senet and extreme deep field-embedded factorization machine (Se-xDFEFM) for more effective CTR prediction. We first embed the squeeze-excitation network (Senet) module into Se-xDFEFM to complete low-order feature refinement, which can better filter noisy information. Then, we implement our field-embedded factorization machine (FEFM) to learn the symmetric matrix embeddings for each field pair, along with the single-vector embeddings for each feature, which builds a firm foundation for the subsequent feature interaction. Finally, we design a compressed interaction network (CIN) to realize feature construction with definite order through a vector-wise interaction. We use a deep neural network (DNN) with the CIN to simultaneously implement effective but complementary explicit and implicit feature interactions. Experimental results demonstrate that the Se-xDFEFM model outperforms other state-of-the-art baselines. Our model is effective and robust for CTR prediction. Importantly, our model variants also achieve competitive recommendation performance, demonstrating their scalability.

1. Introduction

Click-through rate (CTR) prediction is a core task in recommendation systems [1,2] that attempts to make accurate recommendations by mining user preferences. It has very high practicality. Specifically, CTR prediction is usually used to estimate the probability of clicks on advertisements or products. It plays an important role in several applications, including e-commerce, social networking, and video and movie websites. As we know, online advertising provides considerable economic benefits. With the gradual conversion from traditional digital advertising to mobile digital advertising, digital advertising has played a vital role in promoting economic development. Accurate recommendation or ranking of digital advertising can improve the user experience to a certain degree. More importantly, it can provide considerable traffic and revenue benefits to online companies, especially short video applications. There is a certain symmetry between the browsing records and click records of different users on short video apps, e-commerce websites, and social networking sites. Websites recommend the same ads to users with similar behaviors according to these records, providing satisfactory economic benefits. Hence, CTR prediction has a wide application scope and has become a research hotspot in both academia and industry in recent years.
Logistic regression (LR) [3] achieved satisfactory results in early research on CTR prediction by virtue of its simplicity and strong interpretability. However, the LR model requires manual construction of feature interactions. He et al. proposed a prediction method based on the gradient boosting decision tree (GBDT) [4], which avoids the need to manually identify features. However, this method is not suitable for very large and sparse datasets; in addition, its training time is long, and its accuracy is difficult to guarantee. To resolve the feature interaction problem, the well-known factorization machine (FM) [5] model was proposed. The FM model sets an implicit vector for each feature to handle the feature interaction problem. Based on the FM model, the field factorization machine (FFM) [6] was proposed, which introduces interactions between different feature fields to learn diverse hidden information for CTR prediction. The above-mentioned traditional models only perform low-order rather than high-order feature interactions, which may limit the prediction ability of CTR models. Recently, deep learning (DL) technology has made breakthroughs in natural language processing [7], computer vision [8], and other related fields. Both industrial and academic researchers have proposed several DL-based CTR models and achieved remarkable progress. He et al. proposed the neural factorization machine (NFM) [9], which combines second-order features with deep neural network (DNN) features to learn implicit higher-order features. Qu et al. proposed the product-based neural network (PNN) [10], which uses a product layer to learn feature interactions. The product layer adopts the inner product, the outer product, or their combination, fully learning high-order, non-linear feature interactions. Yang et al. proposed operation-aware neural networks (ONNs) [11], which generate multiple embedding vectors for each feature so that different feature representations can be chosen for different operations. Google proposed the well-known wide & deep learning (WDL) model [12], which combines a wide component using LR with a deep component using a feed-forward network. The wide part endows the prediction with a certain memorization ability, whereas the deep part endows it with a certain generalization ability. Similarly, Huawei proposed DeepFM [13], which replaces the wide part of the WDL model with FM to obtain better feature interactions.
The above-mentioned models usually use the DNN to learn combined low-order and high-order features without considering the importance of each feature. Hence, Xiao et al. proposed the attentional factorization machine (AFM) [14], which uses the well-known attention mechanism [15] to learn the importance of each feature combination, helping to improve the final CTR prediction performance. Huang et al. proposed the feature importance and bilinear feature interaction network (FiBiNET) [16], which uses the well-known squeeze-excitation network (Senet) [17] to learn the significance of different features. Its second-order feature interactions are learned in a bilinear interaction layer through a combination of inner and Hadamard products, and implicit higher-order features are then obtained using a DNN. AFM and FiBiNET consider the significance of various features. However, AFM only extends the FM model and does not input the second-order cross features into a DNN to learn more valuable high-order feature interactions [18]. Although FiBiNET achieves high-order feature interactions, its feature interactions lack a certain interpretability owing to the inherent characteristics of the DNN.
Hence, some researchers made more substantial modifications to the CTR model. Wang et al. proposed the deep & cross network (DCN) [19], which replaces the wide part of WDL with a cross network. The cross network makes full use of the residual mechanism [20] to automatically construct finite high-order feature interactions. Lian et al. proposed the extreme deep factorization machine (xDeepFM) [21], which uses feature-level (vector-wise) interactions to construct a compressed interaction network (CIN) for explicit high-order feature learning. Liu et al. proposed feature generation by convolutional neural network (FGCNN) [22], which can generate local patterns and recombine them into new features. Recently, Song et al. designed an automatic feature interaction method based on self-attentive neural networks (AutoInt) [23]. AutoInt uses the multi-head self-attention mechanism to explicitly construct high-order features, improving the interpretability of the CTR model and reducing the number of parameters. Pan et al. proposed the field-weighted factorization machine (FWFM) [24], which introduces field-pair weights to more efficiently model features from different fields; however, its number of parameters is not significantly reduced. Pande proposed the field-embedded factorization machine (FEFM) [25], which learns a symmetric matrix embedding for each field pair, along with single-vector embeddings for each feature. Compared with the traditional FFM model, the FEFM model considerably reduces the number of parameters, which also promotes its practicality. Yu et al. proposed the feature structure-oriented model xCrossNet [26], in which dense and sparse features are processed by cross and product layers, respectively, and then spliced and fed into a DNN to learn higher-order features. (The abbreviations and full names of all models can be found in Appendix A, Table A1; their advantages and disadvantages are listed in Appendix A, Table A2.)
The above-mentioned CTR models have achieved satisfactory recommendation results. However, few of them consider the importance of low-order features, which may contain noisy information. The combination of explicit high-order feature construction and implicit feature learning has not been fully explored, especially with respect to the complementary information between them. Moreover, the corresponding feature interactions are relatively simple and do not make full use of the information in the feature fields. To alleviate these issues, we propose a novel CTR model, namely the Senet and extreme deep field-embedded factorization machine (Se-xDFEFM), which combines low-order feature refinement and interaction intensity evaluation to implement CTR prediction. Hence, the two key technologies used in our study, namely low-order feature refinement and interaction intensity evaluation, are symmetrical.
In Se-xDFEFM, we take low-order feature refinement, explicit high-order feature construction, and implicit high-order feature learning into account. We achieve feature refinement of the low-order features using the Senet module and implement a field-embedded factorization machine (FEFM) to improve the quality of feature interactions. Finally, we achieve predictions with a certain generalization ability using a DNN and predictions with a certain memorization ability using a CIN. Conceptually and empirically, the main contributions of this paper can be summarized as follows:
(1)
We propose a novel CTR model called Se-xDFEFM, which can simultaneously explicitly construct high-order features and implicitly learn high-order feature combinations.
(2)
We propose an improved feature interaction method called Se-FEFM, which first performs effective feature refinement before feature interactions and uses the field pair symmetric matrix to more accurately evaluate the interaction intensity between different feature fields. All these help to improve the final recommendation performance.
(3)
We propose several model variants of Se-xDFEFM, which also achieve competitive recommendation performance, firmly demonstrating the scalability of our model.
(4)
We reproduce a group of well-known CTR baselines. Extensive experiments on two public datasets demonstrate that the proposed Se-xDFEFM model outperforms these mainstream baselines. The code for our method and all the reproduced models are available at https://github.com/vancci-xgx/se-xdfefm (accessed on 11 October 2022).

2. Model Basis

2.1. FM

As described above, FM is an effective method for constructing second-order feature combinations. It uses the inner products of the implicit vectors of different features to calculate the coefficients of the interaction terms between features. FM regards its feature interactions as a high-dimensional sparse matrix decomposition problem. Therefore, FM can extract many new cross features and hidden vectors to improve the final CTR performance. The mathematical formula of FM is shown in Equation (1), which is composed of three components:
$$ y(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j \qquad (1) $$
where $n$ is the total number of features; $w_0 \in \mathbb{R}$ is the global bias; $w_i \in \mathbb{R}$ is the parameter of the first-order feature; $\langle \cdot, \cdot \rangle$ denotes the dot product of two vectors; $v_i$ and $v_j$ are the embedding vectors of the $i$-th and $j$-th features, respectively; and $x_i$ and $x_j$ are the values of the $i$-th and $j$-th features, respectively. According to these definitions, the first component, $w_0$, is the global bias. The second component, $\sum_{i=1}^{n} w_i x_i$, is the sum of the first-order feature terms. The third component, $\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j$, represents the second-order interaction between each pair of features, where $\langle v_i, v_j \rangle$ is the dot product of the two feature vectors $v_i$ and $v_j$ and serves as the weight of the feature combination $x_i x_j$, the product of $x_i$ and $x_j$. "+" indicates the addition of the three components. Evidently, the traditional FM model only constructs second-order feature interactions without considering higher-order interactions.
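As a concrete illustration, the following minimal NumPy sketch evaluates Equation (1); the function and variable names (fm_score, w0, w, V, x) are illustrative rather than taken from any released implementation, and the pairwise term is computed with the standard O(nD) reformulation of the double sum.

```python
import numpy as np

def fm_score(x, w0, w, V):
    """FM score of Equation (1).
    x: (n,) feature values, w0: scalar bias, w: (n,) first-order weights,
    V: (n, D) matrix whose i-th row is the embedding vector v_i."""
    linear = w0 + np.dot(w, x)            # bias + first-order term
    xv = V * x[:, None]                   # rows x_i * v_i
    # sum_{i<j} <v_i, v_j> x_i x_j via the identity
    # 0.5 * sum_d [(sum_i x_i v_{i,d})^2 - sum_i (x_i v_{i,d})^2]
    pair = 0.5 * np.sum(np.sum(xv, axis=0) ** 2 - np.sum(xv ** 2, axis=0))
    return linear + pair

# Tiny example: 5 features with embedding dimension 4.
rng = np.random.default_rng(0)
print(fm_score(rng.random(5), 0.1, rng.random(5), rng.random((5, 4))))
```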

2.2. DeepFM

As introduced above, DeepFM is derived from FM. It combines DNN with FM, in which FM focuses on learning the low-order features, whereas DNN attempts to learn the high-order feature combinations. The structure of DeepFM is shown in Figure 1.
As shown in Figure 1, FM and DNN jointly use the output of the embedded layer [27] as their inputs. Then, the DeepFM model concatenates the results of FM and DNN to generate the final output. The final output of DeepFM is expressed as:
$$ y_{DeepFM} = \mathrm{sigmoid}(y_{FM} + y_{DNN}) \qquad (2) $$
where $y_{DeepFM} \in (0, 1)$ is the final CTR prediction result, $\mathrm{sigmoid}$ is the sigmoid activation function, $y_{FM}$ is the output of the FM component, $y_{DNN}$ is the output of the DNN component, and "+" represents the sum of tensors of the same dimension. This "+" operation makes full use of the complementarity between the FM and DNN components. However, the feature interactions of DeepFM lack a certain interpretability because the DNN component only implicitly completes high-order feature interactions; it tends to recommend less relevant items when the user–item interactions are very sparse or high-rank. Hence, DeepFM relies on the FM component introduced above to improve the memorization ability and interpretability of CTR prediction. Another disadvantage of DeepFM is that it does not refine the low-order features before the feature interactions, and the corresponding interaction procedure is too simple. Although the DeepFM model has these apparent disadvantages, it is a firm foundation for the proposed Se-xDFEFM model.

3. The Proposed Se-xDFEFM Model

The Se-xDFEFM model consists of an input layer, embedding layer, Senet layer, linear interaction layer, combination layer, compressed interaction network layer, hidden layer, and output layer. Figure 2 illustrates the structure of the Se-xDFEFM model:
Unlike DeepFM and other related models, the proposed Se-xDFEFM model adopts an improved but effective feature interaction method named Se-FEFM. First, it performs feature refinement to suppress noisy information and select important low-order features through a Senet module, which acts as a kind of attention mechanism. Second, it implements FEFM in the linear interaction layer, which helps learn the low-order feature interactions. Unlike the above-mentioned CTR models, owing to the proposed feature refinement and low-order feature interaction strategies, Se-xDFEFM refines the low-order features to a certain degree and offers much more valuable low-order information for the subsequent high-order feature interactions. Therefore, the feature levels used in the proposed model are symmetrical: we use both low-order and high-order features to complete CTR prediction. Third, a CIN is used to explicitly learn high-order feature interactions, while other valuable higher-order feature interactions are implicitly learned by a DNN. Therefore, unlike the above-mentioned CTR models, the Se-xDFEFM model constructs more complex but effective high-order feature interactions from complementary explicit and implicit perspectives to generate new features with better discriminative ability for the final prediction. Hence, the CIN layer and DNN layer in our model are symmetrical. The final output of the proposed Se-xDFEFM model is expressed as follows:
$$ y_{Se\text{-}xDFEFM} = \mathrm{sigmoid}(y_{Se\text{-}DFEFM} + y_{CIN} + y_{Linear}) \qquad (3) $$
where $y_{Se\text{-}xDFEFM} \in (0, 1)$ is the final CTR prediction result; $\mathrm{sigmoid}$ is the sigmoid activation function; $y_{Se\text{-}DFEFM}$ is the output that passes through the Senet module, the linear interaction layer, the combination layer, and the DNN layer; $y_{CIN}$ is the output of the CIN layer; $y_{Linear}$ is the linear output computed from the raw features; and "+" represents the sum of tensors of the same dimension. This "+" operation makes full use of the complementarity among the CIN, linear, and Se-DFEFM components. Each layer of the proposed model is introduced in detail below. Based on the definition of each layer, we redefine the final output of the Se-xDFEFM model in Section 3.5.

3.1. Embedding Layer

The input features usually have very high but sparse dimensions in a large-scale recommendation system. Additionally, no obvious temporal and spatial correlations can be observed in these features.
Therefore, the corresponding feature vectors must first be compressed into a low-dimensional space [28,29]. Additionally, the dense features in the numerical fields need to be mapped into the same low-dimensional space to ensure the uniformity of the embedding output. Therefore, the embedding layer integrates diverse features and compresses their dimensions. The output vector ($E$) of the embedding layer is defined as follows:
$$ E = [e_1, e_2, \ldots, e_i, \ldots, e_f] \qquad (4) $$
where $f$ is the total number of feature fields, $e_i \in \mathbb{R}^D$ is the $i$-th field-embedding vector ($1 \le i \le f$), and $D$ is the dimension of the embedding vector.
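As an illustration, the sketch below builds such an embedding layer in TensorFlow 2 (the framework used in Section 4.4), assuming one tf.keras.layers.Embedding per sparse categorical field and one Dense projection per numeric field; the vocabulary sizes, field counts, and variable names are hypothetical.

```python
import tensorflow as tf

D = 10                                # embedding dimension (see Section 4.5.5)
vocab_sizes = [1000, 500, 200]        # hypothetical vocabulary sizes of the sparse fields
num_dense_fields = 2                  # hypothetical number of numeric fields

sparse_emb = [tf.keras.layers.Embedding(v, D) for v in vocab_sizes]
dense_proj = [tf.keras.layers.Dense(D, use_bias=False) for _ in range(num_dense_fields)]

def embed(sparse_ids, dense_values):
    """sparse_ids: (batch, num_sparse) integer ids; dense_values: (batch, num_dense).
    Returns E as a (batch, f, D) tensor, one row per field, as in Equation (4)."""
    e_sparse = [emb(sparse_ids[:, i]) for i, emb in enumerate(sparse_emb)]
    e_dense = [proj(dense_values[:, i:i + 1]) for i, proj in enumerate(dense_proj)]
    return tf.stack(e_sparse + e_dense, axis=1)
```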

3.2. Senet Layer

The output features of the embedding layer have different importance. Hence, we use the Senet module to dynamically tune the weight of each feature, which helps to capture the most important low-order features for the subsequent CTR prediction. The Senet module first generates a weight vector $W = [w_1, w_2, \ldots, w_i, \ldots, w_f]$ for the embedding vector and then rescales the original embedding vector using $W$, yielding a new embedding vector $V = [v_1, v_2, \ldots, v_i, \ldots, v_f]$. Here, $w_i$ represents the weight of $v_i$, $1 \le i \le f$.
The Senet module includes three steps: squeeze, excitation, and reweight.
(1) Squeeze calculates the "summary statistics" of each feature field embedding. The original embedding vector $E = [e_1, e_2, \ldots, e_i, \ldots, e_f]$ is compressed into a statistical vector $H = [h_1, h_2, \ldots, h_i, \ldots, h_f]$ through average pooling, where $1 \le i \le f$ and $h_i$ is a scalar representing the "summary statistics" of the $i$-th feature field embedding after averaging over all dimensions. $h_i$ is calculated through the following average pooling strategy:
$$ h_i = f_{sq}(e_i) = \frac{1}{D} \sum_{t=1}^{D} e_i(t) \qquad (5) $$
where $D$ denotes the dimension of each original embedding vector, $e_i$ is the $i$-th original embedding vector, and $f_{sq}(e_i)$ sums the $i$-th embedding vector over all dimensions and averages the result. Hence, $h_i$ represents the global information about the $i$-th field feature, and $H = [h_1, h_2, \ldots, h_i, \ldots, h_f]$ collects the global information of all field features, where $1 \le i \le f$ and $f$ is the total number of feature fields.
(2) Excitation calculates the weight vector of the feature group, namely $G$, based on the "summary statistics" vector $H$, using two fully connected layers. The first layer reduces the dimension with parameters $W_1$, whereas the second layer restores the dimension with parameters $W_2$. The whole "excitation" operation is calculated as follows:
$$ G = f_{ex}(H) = \sigma_2 (W_2 \, \sigma_1 (W_1 H)) \qquad (6) $$
where $W_1 \in \mathbb{R}^{f \times \frac{f}{r}}$ and $W_2 \in \mathbb{R}^{\frac{f}{r} \times f}$ denote the parameters of the first and second layers, respectively; $\sigma_1$ and $\sigma_2$ are the activation functions of the first and second layers, respectively; $r$ is a reduction ratio; $f$ is the total number of feature fields; and $f_{ex}(H)$ calculates the weight vector from the "summary statistics" vector $H$.
(3) Reweight: each weight is used to weigh the original embedding and generate a new embedding. The specific calculation procedure of the “Reweight” operation is shown as follows:
$$ E_{new} = F_{reweight}(G, E) = [g_1 \cdot e_1, g_2 \cdot e_2, \ldots, g_i \cdot e_i, \ldots, g_f \cdot e_f] = [v_1, v_2, \ldots, v_i, \ldots, v_f] \qquad (7) $$
where $g_i \in \mathbb{R}$, $e_i \in \mathbb{R}^D$, $v_i \in \mathbb{R}^D$, and $1 \le i \le f$. $D$ denotes the dimension of the embedding vector, and $f$ is the total number of feature fields. $F_{reweight}(G, E)$ reweights the original embedding $E$ with the weight vector $G$. Therefore, we obtain $E_{new}$, which is composed of the vectors $v_i$.
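To make the three steps concrete, the following Keras layer is a minimal sketch of the Senet module under the assumption that the field embeddings arrive as a (batch, f, D) tensor; the ReLU activations and the reduction ratio of 2 are illustrative choices, since $\sigma_1$, $\sigma_2$, and $r$ are hyperparameters in the formulation above.

```python
import tensorflow as tf

class SenetLayer(tf.keras.layers.Layer):
    def __init__(self, num_fields, reduction_ratio=2):
        super().__init__()
        reduced = max(1, num_fields // reduction_ratio)
        self.w1 = tf.keras.layers.Dense(reduced, activation="relu")      # dimension reduction (W1)
        self.w2 = tf.keras.layers.Dense(num_fields, activation="relu")   # dimension restoration (W2)

    def call(self, E):
        # Squeeze: average pooling over the embedding dimension, (batch, f, D) -> (batch, f).
        H = tf.reduce_mean(E, axis=2)
        # Excitation: two fully connected layers produce one weight per field.
        G = self.w2(self.w1(H))
        # Reweight: rescale each field embedding by its weight, giving E_new.
        return E * tf.expand_dims(G, axis=2)
```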

3.3. Linear Interaction Layer and Combination Layer

The linear interaction layer is designed to calculate the second-order feature interactions. Traditional feature interaction methods include the inner product and the Hadamard product [30,31]. However, these methods do not consider the broader concept of feature fields. To address this problem, the proposed Se-xDFEFM model adopts a field-pair symmetric matrix to better capture the importance of feature interactions. The eigenvalues of the field-pair symmetric matrix represent the interaction intensity of the field pair; this mechanism is called FEFM in this study. The corresponding calculation, consisting of three components, is as follows:
$$ FEFM((w, v, W), x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \langle W_{F(i),F(j)} v_i, v_j \rangle x_i x_j \qquad (8) $$
where $n$ is the total number of features, $w_0 \in \mathbb{R}$ is the global bias, $w_i \in \mathbb{R}$ is the parameter of the first-order feature, $W_{F(i),F(j)}$ is a $D \times D$ symmetric matrix, and $D$ denotes the dimension of the embedding vector. $\langle \cdot, \cdot \rangle$ represents the dot product of two vectors. $v_i$ and $v_j$ are the embedding vectors of the $i$-th and $j$-th features, respectively, and $x_i$ and $x_j$ are the values of the $i$-th and $j$-th features, respectively. FEFM does not learn field-specific feature embeddings; instead, the symmetric matrix $W_{F(i),F(j)}$ is the embedding of the field pair $F(i)$ and $F(j)$, and the interaction between the $i$-th and $j$-th features is obtained indirectly through this matrix, so $W_F$ is defined over feature field pairs. As in Equation (1), the first component, $w_0$, is the global bias, and the second component, $\sum_{i=1}^{n} w_i x_i$, is the sum of the first-order feature terms. The third component, $\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \langle W_{F(i),F(j)} v_i, v_j \rangle x_i x_j$, represents the second-order interaction between each pair of features, where $x_i x_j$ is the product of $x_i$ and $x_j$. The difference between Equation (1) and Equation (8) is that Equation (8) uses $W_{F(i),F(j)}$ to learn the intensity of the interaction between feature fields. Here, "+" indicates the addition of the three components. Se-xDFEFM uses the Hadamard or inner product to calculate the corresponding feature interactions.
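As a rough sketch of the field-pair term in Equation (8), the code below assumes one feature per field (so $F(i) = i$ and $x_i = 1$ for the active categorical values) and enforces the symmetry of each $D \times D$ matrix by explicit symmetrization; this parameterization is only one possible choice and is not necessarily the one used in our released code.

```python
import tensorflow as tf

def fefm_pairwise(V, U):
    """V: (batch, f, D) field embeddings; U: (f, f, D, D) raw field-pair parameters.
    Returns the summed field-pair interaction score per batch element."""
    f = V.shape[1]
    # Symmetrize each D x D block so that W_{F(i),F(j)} is a symmetric matrix.
    W = 0.5 * (U + tf.transpose(U, perm=[0, 1, 3, 2]))
    score = tf.zeros_like(V[:, 0, 0])
    for i in range(f):
        for j in range(i + 1, f):
            Wv_i = tf.linalg.matvec(W[i, j], V[:, i, :])        # (batch, D)
            score += tf.reduce_sum(Wv_i * V[:, j, :], axis=1)   # <W v_i, v_j>
    return score
```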
As shown in Figure 2, an interaction vector $P = [p_1, \ldots, p_m]$ is generated from the original embedding $E$, and an interaction vector $Q = [q_1, \ldots, q_m]$ is obtained from the new embedding $E_{new}$ generated by the FEFM module, where $m$ is the number of embedding vector interactions. The input of the combination layer is therefore the two interaction vectors $P$ and $Q$, which are concatenated to generate the combination $S$, as shown below:
$$ S = F_{combine}(P, Q) = [p_1, \ldots, p_m, q_1, \ldots, q_m] = [c_1, \ldots, c_i, \ldots, c_{2m}] \qquad (9) $$
where $F_{combine}(P, Q)$ is the concatenation of the two interaction vectors $P$ and $Q$, and $c_i$ is an element of $P$ or $Q$.

3.4. CIN and DNN

CIN is adopted to complete explicit high-order feature interactions, and its network complexity does not increase exponentially with the interaction degree. We formulate the embedding matrix as $X^0 \in \mathbb{R}^{f \times D}$, where the $i$-th row of $X^0$ is the embedding vector of the $i$-th field: $X^0_{i,*} = e_i$. The output of the $k$-th layer of the CIN is also a matrix, $X^k \in \mathbb{R}^{L_k \times D}$, where $L_k$ denotes the number of embedding vectors in the $k$-th layer. $X^k$ is calculated as follows:
$$ X^{k}_{l,*} = \sum_{i=1}^{L_{k-1}} \sum_{j=1}^{L_0} W^{k,l}_{i,j} \left( X^{k-1}_{i,*} \circ X^{0}_{j,*} \right) \qquad (10) $$
where $X^{k}_{l,*}$ denotes the $l$-th embedding vector of the $k$-th layer output matrix, $X^{k-1}_{i,*}$ denotes the $i$-th embedding vector of the $(k-1)$-th layer output matrix, $W^{k,l} \in \mathbb{R}^{L_{k-1} \times L_0}$ is the parameter matrix for the $l$-th feature vector of the $k$-th layer, $W^{k,l}_{i,j}$ denotes the value in the $i$-th row and $j$-th column of the parameter matrix, and $X^{0}_{j,*}$ denotes $e_j$. $X^{k}_{l,*}$ is obtained through the interactions between $X^{k-1}$ and $X^0$, and "$\circ$" represents the Hadamard product. Equation (10) takes the $L_{k-1}$ vectors of the previous layer and the $L_0$ vectors of the input layer $X^0$ and calculates their pairwise Hadamard products, obtaining $L_{k-1} L_0$ vectors. These vectors are weighted and summed according to the parameters $W^{k,l}$ to generate $X^{k}_{l,*}$, the $l$-th embedding vector of the $k$-th layer. The output $X^k$ of the $k$-th layer is obtained by using $L_k$ different parameter matrices.
Hence, the CIN can explicitly complete feature interactions, and the interaction degree increases with the layer depth. Each layer depends on the output of the previous layer and on the input layer. Let $T$ denote the network depth. Each hidden layer $X^k$, $k \in [1, T]$, has a connection with the output units. Therefore, sum pooling is performed on the feature map of each layer to obtain $p^{k}_{i}$ as follows:
$$ p^{k}_{i} = \sum_{j=1}^{D} X^{k}_{i,j} \qquad (11) $$
where $D$ denotes the dimension of the embedding vector, and $X^{k}_{i,j}$ denotes the $j$-th dimension of the $i$-th embedding vector of the $k$-th layer output matrix.
Each layer produces a pooling vector $P^k = [p^{k}_{1}, p^{k}_{2}, \ldots, p^{k}_{L_k}]$ of length $L_k$. The pooling vectors of all layers are concatenated before being connected to the output units:
$$ P^{+} = [P^{1}, P^{2}, \ldots, P^{T}] \qquad (12) $$
where T denotes the network depth.
Se-xDFEFM uses a DNN to learn implicit higher-order features. The CIN and DNN complement each other and make full use of each kind of feature interaction, increasing the strength of the proposed model. The DNN is composed of several fully connected layers that implicitly capture high-order feature interactions. As shown in Figure 2, the input of the DNN is the output of the combination layer.
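The following Keras layer is a compact sketch of Equations (10)–(12): each entry of layer_sizes plays the role of $L_k$, the einsum-based formulation is an illustrative implementation choice, and the layer assumes the embeddings $X^0$ arrive as a (batch, f, D) tensor.

```python
import tensorflow as tf

class CINLayer(tf.keras.layers.Layer):
    """Explicit interactions (Equation (10)), sum pooling (Equation (11)),
    and concatenation of all pooled vectors (Equation (12))."""
    def __init__(self, layer_sizes=(200, 200)):
        super().__init__()
        self.layer_sizes = layer_sizes

    def build(self, input_shape):
        f = int(input_shape[1])                  # L_0: number of fields
        prev = f
        self.kernels = []
        for k, L_k in enumerate(self.layer_sizes):
            # One (L_k, L_{k-1}, L_0) weight tensor per CIN layer.
            self.kernels.append(self.add_weight(
                name=f"cin_w_{k}", shape=(L_k, prev, f), initializer="glorot_uniform"))
            prev = L_k

    def call(self, X0):
        X_prev, pooled = X0, []
        for W in self.kernels:
            # Pairwise Hadamard products between rows of X^{k-1} and X^0.
            Z = tf.einsum("bid,bjd->bijd", X_prev, X0)       # (batch, L_{k-1}, L_0, D)
            X_prev = tf.einsum("bijd,lij->bld", Z, W)        # (batch, L_k, D)
            pooled.append(tf.reduce_sum(X_prev, axis=2))     # sum pooling over D
        return tf.concat(pooled, axis=1)                     # p^+ = [P^1, ..., P^T]
```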

3.5. Output Layer

Finally, the Se-xDFEFM model combines the outputs of DNN and CIN with the proposed linear interaction layer, which forms a kind of powerful complementarity and completes high-quality CTR prediction. The final output of Se-xDFEFM is expressed as:
$$ y_{Se\text{-}xDFEFM} = \mathrm{sigmoid}\left( w_{linear}^{T} a + w_{dnn}^{T} x_{dnn}^{k} + w_{cin}^{T} p^{+} + b \right) \qquad (13) $$
where $a$ is the raw feature vector; $x_{dnn}^{k}$ and $p^{+}$ are the corresponding outputs of the DNN and CIN, respectively; $w_{linear}$, $w_{dnn}$, $w_{cin}$, and $b$ are learnable parameters of our model, with $b$ being the bias; and "+" represents the sum of tensors of the same dimension. This "+" operation takes full advantage of the complementary information among the different components. $y_{Se\text{-}xDFEFM} \in (0, 1)$ is the final prediction of the proposed CTR model, and $\mathrm{sigmoid}$ is the sigmoid activation function. Equation (13) is a redefinition of Equation (3), as DeepFM forms the basis of our model.
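For completeness, a minimal sketch of Equation (13) is given below; the tensors and weight vectors are assumed to be produced and learned elsewhere in the model, so the function only shows how the three outputs and the bias are combined.

```python
import tensorflow as tf

def final_output(a, x_dnn, p_plus, w_linear, w_dnn, w_cin, b):
    """a: (batch, n) raw features; x_dnn: (batch, h) last DNN hidden output;
    p_plus: (batch, m) pooled CIN vector; weight vectors have matching lengths."""
    logit = (tf.reduce_sum(a * w_linear, axis=1)       # w_linear^T a
             + tf.reduce_sum(x_dnn * w_dnn, axis=1)    # w_dnn^T x_dnn
             + tf.reduce_sum(p_plus * w_cin, axis=1)   # w_cin^T p^+
             + b)
    return tf.sigmoid(logit)                           # y in (0, 1)
```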

4. Experimental Results and Analysis

4.1. Datasets

We conducted detailed experiments on two public CTR datasets, Criteo [32] and Avazu [33]. The Criteo dataset is widely used to evaluate the performance of CTR models; it is composed of approximately 45 million user advertising behaviors provided by the Criteo advertising company, including 26 categorical features and 13 numerical features. The Avazu dataset consists of ad click logs for several consecutive days sorted in chronological order; it contains 40 million actual click records, and each record contains 24 features. Owing to limited computing resources, only the first 10 million records of each dataset were used to train the proposed model. We randomly split all the instances in a ratio of 8:1:1 for training, validation, and testing, respectively, following [16,23,34]. To ensure fair performance comparisons, we reproduced a group of well-known CTR baselines under the same experimental settings.
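For reproducibility, the random 8:1:1 split described above can be sketched as follows; the file name and random seed are placeholders rather than the exact preprocessing used in our experiments.

```python
import pandas as pd

df = pd.read_csv("criteo_first_10m.csv")        # hypothetical preprocessed file
df = df.sample(frac=1.0, random_state=2022)      # shuffle all instances
n = len(df)
train = df.iloc[: int(0.8 * n)]                  # 80% training
valid = df.iloc[int(0.8 * n): int(0.9 * n)]      # 10% validation
test = df.iloc[int(0.9 * n):]                    # 10% test
```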

4.2. Evaluation Metrics

The area under the ROC curve (AUC) [35] and cross-entropy (LogLoss) [36] metrics were used to evaluate the CTR models. AUC is the area enclosed by the receiver operating characteristic (ROC) curve and the coordinate axes; the closer the curve is to the upper-left corner and the closer the AUC value is to 1, the better the CTR prediction performance. This metric is not sensitive to whether the sample classes are balanced. The LogLoss metric measures the difference between the predicted values and the true values; the smaller the LogLoss value, the better the CTR prediction performance.
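Both metrics can be computed directly with scikit-learn, as in the small sketch below, where y_true are binary click labels and y_pred are the predicted probabilities (both names are illustrative).

```python
from sklearn.metrics import roc_auc_score, log_loss

y_true = [0, 1, 1, 0, 1]
y_pred = [0.2, 0.8, 0.6, 0.3, 0.9]

print("AUC:", roc_auc_score(y_true, y_pred))      # closer to 1 is better
print("LogLoss:", log_loss(y_true, y_pred))       # smaller is better
```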

4.3. Baselines

The baselines are recent works on click-through rate prediction. Three kinds of mainstream models are chosen as the baselines of this study, described as follows:
(1)
The explicit interaction-based CTR prediction models FM [5] and AFM [14];
(2)
The implicit interaction-based CTR prediction models NFM [9], FiBiNET [16], and DeepFEFM [25]; and
(3)
CTR models that perform both explicit and implicit interactions: DCN [19], DeepFM [13], and xDeepFM [21].

4.4. Experimental Environment and Parameter Settings

We conducted experiments on a server with two 1080Ti GPUs and an Intel Xeon E5-2620 v4 CPU. All experiments were implemented on Ubuntu 16.04 with the TensorFlow 2 framework. For the DNN, the number of hidden layers is set to 3, and the number of neurons in each layer is set to 256 through validation experiments. The dropout ratio is 0.5, and the activation function is ReLU. We chose the sigmoid activation function to output the final prediction of the CTR models. All the CTR models are optimized using Adam [37], and the learning rate is uniformly set to 0.0001.
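A minimal sketch of this configuration for a generic Keras CTR model is shown below; build_dnn only reflects the DNN hyperparameters listed above, and the compile step uses binary cross-entropy, which corresponds to the LogLoss objective.

```python
import tensorflow as tf

def build_dnn(input_dim):
    """Three hidden layers of 256 ReLU units with dropout 0.5, as stated above."""
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.InputLayer(input_shape=(input_dim,)))
    for _ in range(3):
        model.add(tf.keras.layers.Dense(256, activation="relu"))
        model.add(tf.keras.layers.Dropout(0.5))
    return model

def compile_ctr_model(model):
    """Adam with learning rate 1e-4, LogLoss objective, and AUC as a metric."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model
```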

4.5. Experimental Results and Discussion

4.5.1. Performance Comparisons with the Baseline Models

First, the proposed Se-xDFEFM model is compared with the above-mentioned state-of-the-art baseline models (we reproduced all these baselines on our server and completed the relevant experiments using the same experimental settings for fair comparisons). Two metrics, AUC and LogLoss, are used to evaluate each CTR model. The experimental results are shown in Table 1. The indicator "Improve1" represents the performance improvement of the Se-xDFEFM model relative to the most competitive baseline. Similarly, the indicator "Improve2" represents the performance improvement of the Se-xDFEFM model relative to the second most competitive baseline. For the CTR prediction task, owing to the large amount of data, these two indicators are measured in ‰; a performance improvement of more than 1‰ is considered effective and evident [38,39]. Such an improvement is essential and helps achieve satisfactory benefits.
As shown in Table 1, the prediction performance of DNN-based CTR models, such as DeepFM [13], DCN [19], etc., is significantly better than that of traditional models, including FM [5], AFM [14], and other models using low-order feature interactions, which further demonstrates that the DNN model is effective for deep feature learning in the CTR prediction task. The DNN model can capture much more non-linear high-order information to complete CTR prediction more effectively. This is also a significant reason why we also employ DNN in our Se-xDFEFM model. Moreover, if both the explicit and implicit feature interactions are considered simultaneously, the corresponding prediction ability of the FiBiNET [16] and xDeepFM [21] models is further improved. The explicit and implicit feature interactions complement each other, which contributes to boosting the final prediction performance. Hence, we need to ensure diverse feature interactions to promote the final CTR prediction performance. This builds a firm foundation for our model.
Notably, the proposed Se-xDFEFM model obtains the best recommendation performance on both datasets. For example, on the Criteo dataset, compared with the state-of-the-art baseline models xDeepFM [21], FiBiNET [16], and DeepFEFM [25], the AUC values of our model increase by 2.9‰, 2.5‰, and 2.5‰, respectively, whereas the corresponding LogLoss values decrease by 2.6‰, 2.1‰, and 2.0‰, respectively. These two groups of values correspond well: a larger loss decrease corresponds to a larger performance improvement. Similarly, on the Avazu dataset, the corresponding AUC values of the Se-xDFEFM model increase by 2.3‰, 1.6‰, and 1.3‰, respectively, and the corresponding LogLoss values decrease by 1.3‰, 1.0‰, and 0.8‰, respectively. The proposed Se-xDFEFM model is therefore effective and robust for CTR prediction. We performed the following detailed analysis to understand the underlying reasons behind these experimental observations.
First, Se-xDFEFM refines the embedding information with the embedded Senet module, which lays a firm foundation for the subsequent diverse feature interactions. Then, the FEFM mechanism is implemented in our linear interaction layer: the field-pair symmetric matrix with improved discriminative ability is learned to capture the low-order feature interactions, which is a necessary but powerful basis for the subsequent high-order feature interactions. Hence, in contrast to most advanced DL-based CTR models, such as xDeepFM [21] and FiBiNET [16], the Se-xDFEFM model not only uses the Senet module to refine the features before using the CIN to explicitly construct higher-order features but also completes low-order feature interactions through the proposed FEFM layer before feeding them into the DNN. This strategy has two evident advantages: it helps filter the noisy information in the original features, and it offers much more valuable low-order information for the subsequent high-order feature interactions. Both advantages are beneficial for promoting the final prediction performance. Third, unlike DeepFEFM [25] and FiBiNET [16], which only use implicit interactions, we combine explicit and implicit feature interactions to obtain a more effective final feature: we use a CIN to explicitly learn high-order feature interactions, and we use an elaborate DNN to implicitly learn higher-order feature interactions. The two kinds of high-order feature interactions complement each other, which builds an important foundation for CTR prediction. More importantly, owing to the mining of much more valuable low-order information, the DNN and CIN can better capture significant information for the CTR prediction, and we make full use of the complementarity between these interaction modes to improve the final prediction performance. Notably, our model also outperforms DeepFM [13] and xDeepFM [21], which employ both implicit and explicit higher-order feature interactions, because Se-xDFEFM not only uses Senet to refine the features but also learns more valuable low-level information through FEFM. Therefore, Se-xDFEFM is an effective CTR model that seamlessly integrates feature refinement, low-order features, high-order features, and implicit and explicit feature interactions into an organic whole. Certainly, there is still some room to improve the recommendation performance on the Avazu dataset; in the future, we plan to use the well-known multi-head self-attention mechanism from the Transformer to achieve more effective explicit feature interactions.
In conclusion, the Se-xDFEFM model is effective and robust for the CTR prediction task. The low-order important information and high-order feature interactions complement each other positively and form a kind of joint force to promote the final CTR performance of the Se-xDFEFM model.

4.5.2. Impact of Different Linear Interaction Methods

As analyzed above, different linear interaction methods, including the inner product and the Hadamard product, as well as the use of the field feature, affect the final performance. Hence, in this section we evaluate modified Se-xDFEFM models that use the inner product or the Hadamard product, with or without the field feature. The inner product differs from the Hadamard product in both theoretical analysis and actual code: the inner product produces a scalar value, whereas the Hadamard product generates an element-wise product with the same shape as its inputs. The experimental results for the four resulting variants of our model are shown in Table 2.
In Table 2, Se-xDi indicates that only the inner product, without the field feature, is used to complete the linear interaction, and Se-xDh indicates that only the Hadamard product, without the field feature, is used. Se-xDFEFMh means that both the field feature and the Hadamard product are employed, whereas Se-xDFEFMi means that both the field feature and the inner product are utilized. Here, we also define two indicators, Improve3 and Improve4, measured in ‰, to observe the actual performance improvement: Improve3 represents the performance improvement of Se-xDFEFMi relative to Se-xDFEFMh, whereas Improve4 represents the performance improvement of Se-xDFEFMi relative to Se-xDi.
As shown in Table 2, the inner product method outperforms the Hadamard product on each dataset. This phenomenon is more evident when the field feature is absorbed into our model, which further improves the prediction performance of the modified Se-xDFEFM model. As analyzed above, the field feature effectively establishes a powerful foundation for the subsequent high-order feature interactions. Our model can make full use of the information in the feature field. However, how to reduce the parameter complexity in the process of extracting field features is still a considerable challenge.

4.5.3. Impact of the CIN Layers

Owing to the deep-level structure, the number of CIN layers could affect the order of explicit feature interactions and impact the final CTR prediction performance of the proposed Se-xDFEFM model. Hence, in this section, we want to decide the best number of CIN layers through validation experiments. The experimental results are shown in Figure 3. We used LogLoss and AUC to draw the corresponding experimental graphs, in which LogLoss follows the left vertical axis, whereas AUC follows the right vertical axis; this represents a concise mode to efficiently present our results.
As shown in Figure 3, when the number of CIN layers is equal to two, the highest AUC and lowest LogLoss are observed on both datasets. The two metrics validate each other, and Figure 3 demonstrates this phenomenon well, indicating that the best prediction performance is obtained when the CIN contains two layers. Hence, suitable feature interactions result in performance improvement; in contrast, too many high-order feature interactions can introduce unexpected noise or model complexity, which decreases the final CTR prediction performance. At the other extreme, setting the number of CIN layers to zero is equivalent to removing the CIN module from our model, and the corresponding performance decreases dramatically. This demonstrates the importance of the proposed explicit feature interactions. It also further validates that the CIN and DNN complement each other, meaning that the implicit and explicit feature interactions form a joint force to improve the final CTR prediction performance of the Se-xDFEFM model. In summary, we set the number of CIN layers to two in all experiments.

4.5.4. Impact of the Number of CIN Neurons

Like the number of CIN layers, the number of CIN neurons also affects the explicit feature interactions procedure and final CTR performance of the proposed Se-xDFEFM model. Hence, in this section, we want to decide the best number of CIN neurons through validation experiments. The experimental results are shown in Figure 4. We used LogLoss and AUC to draw the corresponding experimental graphs, in which LogLoss follows the left vertical axis, whereas AUC follows the right vertical axis.
As shown in Figure 4a, when the number of neurons in each layer reaches 200, the best AUC and lowest LogLoss are observed on the Avazu dataset. As shown in Figure 4b, when the number of neurons in each layer reaches 250, the best AUC and lowest LogLoss are observed on the Criteo dataset. The two metrics validate each other, and Figure 4 demonstrates this phenomenon well. The possible cause is that the Criteo dataset has more fields than the Avazu dataset, so it needs a relatively more complex CIN to complete effective high-order feature interactions. In the future, the corresponding recommendation performance should be further improved from the perspective of model structure optimization. In conclusion, it is necessary to tune the number of neurons in each layer of the CIN to obtain the best CTR prediction performance. In all experiments, we set the number of neurons in each layer to 200 on the Avazu dataset and to 250 on the Criteo dataset.

4.5.5. Impact of Embedding Dimension

As illustrated in Figure 2, the embedding layer contains the most important low-order feature information for the subsequent high-order feature interaction. Hence, we need to set the dimension of the embedding layer elaborately to obtain the best CTR prediction performance. The validation experimental results are shown in Figure 5. We also used LogLoss and AUC to draw the corresponding experimental graphs, in which LogLoss follows the left vertical axis, whereas AUC follows the right vertical axis.
As shown in Figure 5, when the embedding dimension is equal to 10, the best AUC and LogLoss values of the Se-xDFEFM model can be observed on the two datasets. Lower loss makes the proposed model more effective for CTR prediction. The two metrics can validate each other; Figure 5 presents this phenomenon well. However, as the embedding dimension continues to increase, worse AUC and LogLoss values are observed, especially for the Avazu dataset, which indicates that an excessively high dimension of the embedding layer results in a certain amount of noise. This, in turn, negatively affects the final CTR prediction performance. In summary, the embedding dimension of the Se-xDFEFM model needs to be tuned carefully to obtain the best CTR prediction performance.

4.5.6. Ablation Experiment

As mentioned above, the Se-xDFEFM model includes the embedded Senet module, the embedded FEFM module, a DNN, and a CIN. Each module plays a role in our CTR prediction model. In this section, we want to validate the actual contribution of these modules through detailed ablation experiments. This helps to highlight our further research direction. In this experiment, the Se-xDFEFM model without CIN, the Se-xDFEFM model without the FEFM module, the Se-xDFEFM model without the Senet module, and the Se-xDFEFM model without DNN are denoted as Se-xDFEFM-CIN, Se-xDFEFM-FEFM, Se-xDFEFM-Senet, and Se-xDFEFM-DNN, respectively, resulting in another set of model variants (Table 2). The experimental results are shown in Table 3. Moreover, we compared our model variants with many state-of-the-art baselines using AUC (Figure 6).
As shown in Table 3, (1) on the two datasets, removing any module leads to performance degradation of CTR prediction. Hence, all four modules can effectively improve the final recommendation performance. (2) The proposed Se-xDFEFM model combined with the CIN, the embedded Senet module, DNN, and the FEFM modules can obtain the best prediction performance among all the variants. (3) Removing the FEFM module leads to the largest performance degradation on both datasets. This indicates that the proposed embedded FEFM module contributes the most to the final CTR prediction performance among all investigated modules. On the one hand, it directly processes the embedding layer, which can obtain the most primitive and valuable low-order features for CTR prediction. Furthermore, the linear interaction layer embedded in the FEFM module can generate more valuable information for prediction. Hence, unlike the state-of-the-art DL-based CTR models, the FEFM module offers much more valuable low-order information for the successive high-order feature interactions. (4) On the Criteo dataset, CIN contributes slightly more than the Senet module. This means that explicit high-order feature interactions are more effective for the final CTR prediction. The Senet module is a kind of attention mechanism, which can effectively filter noisy information and reduce negative impacts. Therefore, it can improve the final recommendation performance. Conversely, on the Avazu dataset, the Senet module is relatively more important. (5) Using both CIN and DNN models results in better performance than using a single model, which indicates that the implicit and explicit feature interactions complement each other and can form a joint force to improve the final CTR prediction performance of the Se-xDFEFM model. (6) Our model variants, such as Se-xDFEFM-CIN and Se-xDFEFM-SENET, can obtain very competitive prediction performance compared with state-of-the-art baselines (Figure 6). This also validates the robustness, effectiveness, and scalability of our Se-xDFEFM model from another significant perspective.
In summary, the following valuable conclusions can be drawn. The descending order of importance of the modules in our model is "FEFM > DNN > CIN > Senet" on the Criteo dataset and "FEFM > DNN > Senet > CIN" on the Avazu dataset. Therefore, for the more complex Criteo dataset, we should focus more on the low-order feature refinement method, whereas for the Avazu dataset we should further modify the structure of the CIN. These results suggest future research directions. Moreover, our model variants achieve very competitive recommendation performance, demonstrating their scalability.

5. Conclusions and Future Work

We propose a novel CTR prediction model called Se-xDFEFM, which fits the "symmetry" concept well, including symmetric core technologies, symmetric feature levels, symmetric CIN and DNN layers, and the symmetry between browsing records and click records.
Se-xDFEFM seamlessly integrates feature refinement through the embedded Senet module, low-order features extracted by the FEFM module, and implicit and explicit feature interactions into an organic whole. Extensive experimental results on two public datasets demonstrate that the Se-xDFEFM model is effective and robust for CTR prediction, outperforming other state-of-the-art baselines, including DeepFM [13], FiBiNET [16], AFM [14], and xDeepFM [21]. Unlike other models, the FEFM module offers much more valuable low-order information for the subsequent high-order feature interactions. All the modules form a kind of joint force to promote the final CTR performance. Notably, our model variants also achieve competitive performance compared with the state-of-the-art baselines, validating their scalability.
However, our model is subject to three limitations. First, the complexity of the CIN is an important issue, which may increase the complexity of the whole model; we intend to use the state-of-the-art multi-head self-attention network to reduce the complexity of explicit high-order feature interaction. Second, our model does not model sequential user behavior; hence, we plan to use the deep interest network [40] to address this problem. Third, our model does not apply different processing methods to dense and sparse embeddings; how to effectively promote interaction between dense and sparse embeddings is an important challenge for future work. We hope these strategies can form the basis of future research and further improve CTR prediction performance.

Author Contributions

Software, G.L., G.X., G.W., G.L. and H.Z.; Methodology, G.L., G.X., G.W., Y.Y. and H.Z.; Validation, G.L., G.X., G.W., Y.Y. and H.Z.; Writing—Original Draft, G.L., G.X. and H.Z.; Writing—Review and Editing, G.L., G.X. and H.Z.; Validation, G.W., Y.Y., G.L. and D.J.; Validation, G.W., Y.Y., C.L. and D.J.; Data Curation, G.L., G.X., G.W., Y.Y., C.L. and H.Z.; Formal Analysis, G.X., G.W., Y.Y. and C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partly funded by the National Natural Science Foundation of China (Grant Nos. 62161011 and 61861016), the Natural Science Foundation of Jiangxi Provincial Department of Science and Technology (Grant Nos. 20212BAB202006 and 20202BABL202044), the Key Research and Development Plan of Jiangxi Provincial Science and Technology Department (Grant No. 20192BBE50071), the Humanity and Social Science Foundation of the Jiangxi Province (Grant No. 22TQ01), the Science and Technology Projects of Jiangxi Provincial Department of Education (Grant Nos. GJJ190323 and GJJ200644), and the Humanity and Social Science Foundation of Jiangxi University (Grant Nos. TQ20108 and TQ21203).

Data Availability Statement

The datasets used in this study are publicly available and can be accessed at http://labs.criteo.com/downloads/download-terabyte-click-logs/ (accessed on 11 October 2022) and https://www.kaggle.com/c/avazu-ctr-prediction/ (accessed on 11 October 2022) for the Criteo and Avazu datasets, respectively.

Acknowledgments

We thank the authors of [32,33] for collecting and organizing the datasets used in this study. The authors would also like to thank the editor and the reviewers for their helpful suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Abbreviations and full names.
Abbreviation | Full Name | Abbreviation | Full Name
CTR | click-through rate | WDL | wide & deep learning
LR | logistic regression | AFM | attentional factorization machine
GBDT | gradient boosting decision tree | FiBiNET | feature importance and bilinear feature interaction network
FM | factorization machine | DCN | deep & cross network
DeepFM | factorization machine-based neural network | xDeepFM | extreme deep factorization machine
NFM | neural factorization machine | FGCNN | feature generation by convolutional neural network
PNN | product-based neural network | CNN | convolutional neural network
ONN | operation-aware neural network | FWFM | field-weighted factorization machine
Se-xDFEFM | Senet and extreme deep field-embedded factorization machine | FEFM | field-embedded factorization machine
Senet | squeeze-excitation network | xCrossNet | feature structure-oriented model
CIN | compressed interaction network | AutoInt | self-attentive neural networks
DNN | deep neural network | AUC | area under the ROC curve
FFM | field factorization machine | LogLoss | cross entropy
DL | deep learning | ROC | receiver operating characteristic
Table A2. Advantages and disadvantages of the models mentioned in the paper.
Model | Advantages | Disadvantages
LR | Simple and highly interpretable | Manual construction of feature interactions
GBDT | Manual identification of features is avoided | Not suitable for large sparse datasets
FM | Automatic construction of feature interactions | Only low-order interactions can be learned
FFM | Learns the interactions between different feature fields | Large parameter scale
NFM | Learns implicit high-order features through DNN | Implicit construction of high-order features
PNN | Employs inner product combined with outer product | Implicit construction of high-order features
ONN | Generates multiple embedding vectors for each feature | Large parameter scale
WDL | Good generalization and memorization ability | Feature interactions must be built manually
DeepFM | Automatic construction of feature interactions | FM can only learn low-order features
AFM | Learns the weight of feature interactions through an attention mechanism | Deep neural network is not used
FiBiNET | Learns feature weights through Senet and feature interactions through a bilinear interaction layer | Implicit construction of high-order features
DCN | Automatically builds high-order features | Feature combination in a bit-wise mode
xDeepFM | High-order feature combination in a vector-wise mode | The weight of each feature is not considered
FGCNN | Generates new features by CNN | Large parameter scale
FWFM | Efficiently models different features in different fields | Large parameter scale
FEFM | Learns the interactions between different feature fields with lower complexity than FFM | The weight of each feature is not considered
AutoInt | Explicit construction of high-order features using a multi-head self-attention mechanism | Implicit construction of high-order features
xCrossNet | Dense and sparse features are calculated by the cross and product layers, respectively | Implicit construction of high-order features

References

  1. Zhang, X.; Qin, J.; Zheng, J. A Social Recommendation based on metric learning and Users’ Co-occurrence Pattern. Symmetry 2021, 13, 2158. [Google Scholar] [CrossRef]
  2. Sharma, B.; Hashmi, A.; Gupta, C.; Khalaf, O.I.; Abdulsahib, G.M.; Itani, M.M. Hybrid Sparrow Clustered (HSC) Algorithm for Top-N Recommendation System. Symmetry 2022, 14, 793. [Google Scholar] [CrossRef]
  3. Richardson, M.; Dominowska, E.; Ragno, R. Predicting clicks: Estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, Banff, AB, Canada, 8–12 May 2007; pp. 521–530. [Google Scholar]
  4. He, X.; Pan, J.; Jin, O.; Xu, T.; Liu, B.; Xu, T.; Shi, Y.; Atallah, A.; Bowers, S.; Candela, J.Q. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop On Data Mining for Online Advertising, New York, NY, USA, 24 August 2014; pp. 1–9. [Google Scholar]
  5. Rendle, S. Factorization machines. In 2010 IEEE International Conference on Data Mining; IEEE: Sydney, Australia, 2010; pp. 995–1000. [Google Scholar]
  6. Juan, Y.; Zhuang, Y.; Chin, W.S.; Chih-Jen, L. Field-aware factorization machines for CTR prediction. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, 15–19 September 2016; pp. 43–50. [Google Scholar]
  7. Feng, S.Y.; Gangal, V.; Wei, J.; Chandar, S.; Vosoughi, S.; Mitamura, T.; Hovy, E. A survey of data augmentation approaches for nlp. arXiv 2021, arXiv:2105.03075. [Google Scholar]
  8. Buhrmester, V.; Münch, D.; Arens, M. Analysis of explainers of black box deep neural networks for computer vision: A survey. Mach. Learn. Knowl. Extr. 2021, 3, 966–989. [Google Scholar] [CrossRef]
  9. He, X.; Chua, T.S. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Tokyo, Japan, 7–11 August 2017; pp. 355–364. [Google Scholar]
  10. Qu, Y.; Cai, H.; Ren, K.; Zhang, W.; Yu, Y.; Wen, Y.; Wang, J. Product-based neural networks for user response prediction. In 2016 IEEE 16th International Conference on Data Mining (ICDM); IEEE: Barcelona, Spain, 2016; pp. 1149–1154. [Google Scholar]
  11. Yang, Y.; Xu, B.; Shen, F.; Zhao, J. Operation-aware neural networks for user response prediction. Neural Netw. 2020, 1210, 161–168. [Google Scholar] [CrossRef]
  12. Cheng, H.T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, Boston, MA, USA, 15 September 2016; pp. 7–10. [Google Scholar]
  13. Guo, H.; Tang, R.; Ye, Y.; Li, Z.; He, X. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv 2017, arXiv:1703.04247. [Google Scholar]
  14. Xiao, J.; Ye, H.; He, X.; Zhang, H.; Wu, F.; Chua, T. Attentional factorization machines: Learning the weight of feature interactions via attention networks. arXiv 2017, arXiv:1708.04617. [Google Scholar]
  15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  16. Huang, T.; Zhang, Z.; Zhang, J. FiBiNET: Combining feature importance and bilinear feature interaction for click-through rate prediction. In Proceedings of the 13th ACM Conference on Recommender Systems, Copenhagen, Denmark, 16–20 September 2019; pp. 169–177. [Google Scholar]
  17. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  18. Covington, P.; Adams, J.; Sargin, E. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, 15–19 September 2016; pp. 191–198. [Google Scholar]
  19. Wang, R.; Fu, B.; Fu, G.; Wang, M. Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17, Halifax, NS, Canada, 14 August 2017; pp. 1–7. [Google Scholar]
  20. Shan, Y.; Hoens, T.R.; Jiao, J.; Wang, H.; Yu, D.; Mao, J. Deep Crossing: Web-scale modeling without manually crafted combinatorial features. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 255–262. [Google Scholar]
  21. Lian, J.; Zhou, X.; Zhang, F.; Chen, Z.; Xie, X.; Sun, G. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 1754–1763. [Google Scholar]
  22. Liu, B.; Tang, R.; Chen, Y.; Yu, J.; Guo, H.; Zhang, Y. Feature generation by convolutional neural network for click-through rate prediction. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 1119–1129. [Google Scholar]
  23. Song, W.; Shi, C.; Xiao, Z.; Duan, Z.; Xu, Y.; Zhang, M. Autoint: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 1161–1170. [Google Scholar]
  24. Pan, J.; Xu, J.; Ruiz, A.L.; Zhao, W.; Pan, S.; Sun, Y.; Lu, Q. Field-weighted factorization machines for click-through rate prediction in display advertising. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 1349–1357. [Google Scholar]
  25. Pande, H. Field-Embedded Factorization Machines for Click-through rate prediction. arXiv 2020, arXiv:2009.09931. [Google Scholar]
  26. Yu, R.; Ye, Y.; Liu, Q.; Wang, Z.; Yang, C.; Hu, Y.; Chen, E. Xcrossnet: Feature structure-oriented learning for click-through rate predictions. In Pacific-Asia Conference on Knowledge Discovery and Data Mining; Springer: Cham, Switzerland, 2021; pp. 436–447. [Google Scholar]
  27. Wang, J.; Huang, P.; Zhao, H.; Zhang, Z.; Zhao, B.; Lee, D.L. Billion-scale commodity embedding for e-commerce recommendation in alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 839–848. [Google Scholar]
  28. Okada, S.; Ohzeki, M.; Taguchi, S. Efficient partition of integer optimization problems with one-hot encoding. Sci. Rep. 2019, 9, 1–12. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  29. Sun, Z.; Guo, Q.; Yang, J.; Fang, H.; Guo, G.; Zhang, J.; Burke, R. Research commentary on recommendations with side information: A survey and research directions. Electron. Commer. Res. Appl. 2019, 37, 100879. [Google Scholar] [CrossRef] [Green Version]
  30. Zhu, J.; Liu, J.; Yang, S.; Zhang, Q.; He, X. FuxiCTR: An Open Benchmark for Click-Through Rate Prediction. arXiv 2020, arXiv:2009.05794. [Google Scholar]
  31. Zhang, W.; Qin, J.; Guo, W.; Tang, R.; He, X. Deep Learning for Click-Through Rate Estimation. arXiv 2021, arXiv:2104.10584. [Google Scholar]
  32. Criteo. Available online: http://labs.criteo.com/downloads/download-terabyte-click-logs/ (accessed on 11 October 2022).
  33. Avazu. Available online: http://www.kaggle.com/c/avazu-ctr-prediction/ (accessed on 11 October 2022).
  34. Wang, R.; Shivanna, R.; Cheng, D.; Jain, S.; Lin, D.; Hong, L.; Chi, E.H. DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 1785–1797. [Google Scholar]
  35. Graepel, T.; Candela, J.Q.; Borchert, T.; Herbrich, R. Web-scale bayesian click-through rate prediction for sponsored search advertising in microsoft’s bing search engine. In Proceedings of the 27th International Conference on International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; Omnipress: Madison, WI, USA, 2010. [Google Scholar]
  36. Vovk, V. The fundamental nature of the log loss function. In Fields of Logic and Computation II; Springer: Cham, Switzerland, 2015; pp. 307–318. [Google Scholar]
  37. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  38. Jiang, Z. Research on ctr prediction for contextual advertising based on deep architecture model. J. Control Eng. Appl. Inform. 2016, 18, 11–19. [Google Scholar]
  39. Wang, Q.; Liu, F.; Xing, S.; Zhao, X. Research on CTR prediction based on stacked autoencoder. Appl. Intell. 2019, 49, 2970–2981. [Google Scholar] [CrossRef]
  40. Zhou, G.; Zhu, X.; Song, C.; Fan, Y.; Zhu, H.; Ma, X.; Yan, Y.; Jin, J.; Li, H.; Gai, K. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 1059–1068. [Google Scholar]
Figure 1. The architecture of DeepFM.
Figure 2. The architecture of Se-xDFEFM.
Figure 3. (a) Performance impact of the number of CIN layers on Avazu; (b) performance impact of the number of CIN layers on Criteo.
Figure 4. (a) Performance impact of the number of CIN neurons on Avazu; (b) performance impact of the number of CIN neurons on Criteo.
Figure 5. (a) Performance impact of embedding dimension on Avazu; (b) performance impact of embedding dimension on Criteo.
Figure 6. Comparisons between our model variants and the mainstream models on Criteo.
Table 1. Performance comparisons with recent CTR models.
Model | Criteo AUC | Criteo LogLoss | Avazu AUC | Avazu LogLoss
FM [5] | 0.7727 | 0.4697 | 0.7626 | 0.3852
AFM [14] | 0.7756 | 0.4680 | 0.7631 | 0.3849
NFM [9] | 0.7919 | 0.4539 | 0.7649 | 0.3840
DeepFM [13] | 0.7942 | 0.4521 | 0.7691 | 0.3815
DCN [19] | 0.7944 | 0.4518 | 0.7692 | 0.3815
xDeepFM [21] | 0.7947 | 0.4517 | 0.7696 | 0.3813
FiBiNET [16] | 0.7951 | 0.4512 | 0.7703 | 0.3810
DeepFEFM [25] | 0.7951 | 0.4511 | 0.7706 | 0.3808
Se-xDFEFM | 0.7976 | 0.4491 | 0.7719 | 0.3800
Improve1 | 2.5‰ | 2.0‰ | 1.3‰ | 0.8‰
Improve2 | 2.5‰ | 2.1‰ | 1.6‰ | 1.0‰
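As a quick sanity check, the snippet below recomputes the per-mille improvements from the tabulated AUC and LogLoss values. The pairing of Improve1 with DeepFEFM and Improve2 with FiBiNET is inferred from the reported deltas rather than stated in the table, so it should be read as an assumption.

```python
# Recompute the Improve1/Improve2 rows of Table 1 from the tabulated values.
# Assumption: Improve1 compares Se-xDFEFM with DeepFEFM, Improve2 with FiBiNET.
rows = {
    "DeepFEFM":  (0.7951, 0.4511, 0.7706, 0.3808),
    "FiBiNET":   (0.7951, 0.4512, 0.7703, 0.3810),
    "Se-xDFEFM": (0.7976, 0.4491, 0.7719, 0.3800),
}

def per_mille_delta(ours, baseline):
    # AUC improves when it rises, LogLoss when it falls; both are reported as
    # absolute differences expressed in per mille (1/1000).
    return [round(abs(o - b) * 1000, 1) for o, b in zip(ours, baseline)]

print("Improve1:", per_mille_delta(rows["Se-xDFEFM"], rows["DeepFEFM"]))  # [2.5, 2.0, 1.3, 0.8]
print("Improve2:", per_mille_delta(rows["Se-xDFEFM"], rows["FiBiNET"]))   # [2.5, 2.1, 1.6, 1.0]
```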
Table 2. Performance comparisons of Se-xDFEFM model variants.
Model | Criteo AUC | Criteo LogLoss | Avazu AUC | Avazu LogLoss
Se-xDh | 0.7947 | 0.4517 | 0.7701 | 0.3810
Se-xDi | 0.7951 | 0.4513 | 0.7699 | 0.3811
Se-xDFEFMh | 0.7963 | 0.4502 | 0.7705 | 0.3807
Se-xDFEFMi | 0.7976 | 0.4491 | 0.7719 | 0.3800
Improve3 | 1.3‰ | 1.1‰ | 1.4‰ | 0.7‰
Improve4 | 2.5‰ | 2.2‰ | 2.0‰ | 1.1‰
Table 3. Detailed results of ablation analysis experiments.
Model | CIN | FEFM | SENET | DNN | Criteo AUC | Criteo LogLoss | Avazu AUC | Avazu LogLoss
Se-xDFEFM-CIN | – | ✓ | ✓ | ✓ | 0.7966 | 0.4504 | 0.7711 | 0.3804
Se-xDFEFM-FEFM | ✓ | – | ✓ | ✓ | 0.7946 | 0.4516 | 0.7692 | 0.3813
Se-xDFEFM-SENET | ✓ | ✓ | – | ✓ | 0.7969 | 0.4496 | 0.7701 | 0.3808
Se-xDFEFM-DNN | ✓ | ✓ | ✓ | – | 0.7949 | 0.4514 | 0.7694 | 0.3815
Se-xDFEFM | ✓ | ✓ | ✓ | ✓ | 0.7976 | 0.4491 | 0.7719 | 0.3800
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
