2.1. Background: Generative Adversarial Networks
Generative adversarial networks (GANs) are deep models that learn to generate samples representing a data distribution $p_{data}(x)$ [14]. A GAN consists of two functions: a generator $G$ that converts samples $z$ from a prior distribution $p_z(z)$ into candidate examples $G(z)$; and a discriminator $D$ that examines both real samples from $p_{data}(x)$ and those synthesized by $G$, and estimates the probability that a particular example is authentic or fake. $G$ is trained to fool $D$ with artificial samples that appear to be drawn from $p_{data}(x)$. The functions $G$ and $D$ therefore have adversarial objectives, which are described by the minimax function used to adapt their respective parameters:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \quad (1)$$

In the game defined in Equation (1), the discriminator tries to make $D(G(z))$ approach zero; the generator tries to make this quantity approach unity [14].
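For concreteness, the two competing objectives in Equation (1) can be sketched as Keras loss terms. This is a generic illustration of the adversarial losses, not the implementation used in the present work:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def discriminator_loss(d_real, d_fake):
    # D is pushed toward D(x) = 1 on real samples and D(G(z)) = 0 on fakes.
    return bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)

def generator_loss(d_fake):
    # G is trained to make D(G(z)) approach 1, i.e., to fool the discriminator.
    return bce(tf.ones_like(d_fake), d_fake)
```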
Conditional GANs. Additional information about the input data can be used to condition the GAN model. To learn different segments of the target distribution, an auxiliary input signal $y$ is presented to both the generator and discriminator functions. The objective function for the conditional GAN [15,16] becomes

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x|y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|y)))] \quad (2)$$

where the joint density to be learned by the model is $p_{data}(x, y)$.
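A common realization of this conditioning, sketched below with assumed dimensions (`latent_dim`, `n_segments`), embeds the discrete label $y$ and concatenates it with the latent vector $z$ before the first generator layer:

```python
from tensorflow.keras import layers

latent_dim, n_segments = 100, 5  # illustrative sizes

z = layers.Input(shape=(latent_dim,))        # noise sample from p_z(z)
y = layers.Input(shape=(1,), dtype="int32")  # auxiliary user segment label

# Embed the discrete label and join it with the latent vector, so that
# the generator input is conditioned on y.
y_embed = layers.Flatten()(layers.Embedding(n_segments, latent_dim)(y))
gen_input = layers.Concatenate()([z, y_embed])
```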
Coupled GANs. Further generalization of the GAN idea was developed to learn a joint data distribution over multiple domains within an input space [17]. The coupled GAN includes two paired GAN models $(G_1, D_1)$ and $(G_2, D_2)$, each of which is trained only on the marginal distributions $p_{data}(x_1)$, $p_{data}(x_2)$ from the constituent domains of the joint distribution. The coupling mechanism shares weights between the lower layers of $G_1$ and $G_2$, and the upper layers of $D_1$ and $D_2$, respectively. In each case, the architectural location of the shared weights corresponds to the layers handling the greatest degree of abstraction in the data (for $G$, nearest to the input latent space $z$; for $D$, near the encoded semantics of class membership). By constraining weights in this manner, the joint distribution between view and buy behaviors can be learned from training data.
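The weight-sharing constraint can be illustrated with the Keras functional API, in which a single shared trunk feeds two domain-specific heads. Layer sizes here are illustrative, not those of Table 1:

```python
from tensorflow.keras import layers, Model

latent_dim = 100  # illustrative; see Table 1 for the actual dimensions

z = layers.Input(shape=(latent_dim,))

# Shared trunk: one set of weights used by both generators (the layers
# nearest the latent space, where the representation is most abstract).
h = layers.Dense(256, activation="relu")(z)
h = layers.Dense(512, activation="relu")(h)

# Domain-specific heads, one per behavior in the (view, buy) pair.
out1 = layers.Dense(1024, activation="tanh")(h)
out2 = layers.Dense(1024, activation="tanh")(h)

G1 = Model(z, out1)  # "view" generator
G2 = Model(z, out2)  # "buy" generator
```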
2.2. Model Architecture
The present model design combines elements of both conditional and coupled GANs as described above. The constituent networks of the coupled GAN recommender were realized in software using the TensorFlow and Keras [18,19] deep learning libraries. The GAN models were developed by adapting baseline models provided in an open source repository [20].
A schematic view of the RecommenderGAN model used in the current work is presented in Figure 1. The generators (left) receive input latent vectors $z$ and user segment $y$, and output matrices $G_1$, $G_2$. The discriminators (right) are trained alternately on these artificial arrays and on real samples $X_1$, $X_2$, and try to discern the difference. The error in this decision is backpropagated to update the weights of the generators.
Architectural details of the network were settled upon after iterative experimentation and observation of results from different model design configurations and hyperparameter sets. Specifics of the layered configuration and dimensions of the final model appear in
Table 1.
For the present dataset, more extensive (layer-wise) coupling of the weights within the generator networks proved necessary to obtain useful statistical results upon analysis. This is in contrast to the existing literature on coupled GANs, in which the main application domain is computer vision [
17].
It was determined that the use of dropout layers [21] in the generator sub-models improved convergence during training, but such layers had negligible positive effect when included in the discriminator sub-models.
The final model included a total of 326,614,406 parameters, of which 257,407,020 were trainable.
The optimization algorithm used for training the model was Adam, a variant of stochastic gradient descent, using default values for the configurable parameters (https://keras.io/api/optimizers/adam/).
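In Keras, this amounts to instantiating the optimizer with no arguments:

```python
from tensorflow import keras

# Keras defaults: learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07
optimizer = keras.optimizers.Adam()
```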
2.3. Data Preparation
Electronic commerce dataset. The adversarially trained recommender model was developed using a dataset collected from an online retailer (
https://www.kaggle.com/retailrocket/ecommerce-dataset). The dataset describes website visitors and their behaviors (view, “add” or buy items available for purchase); the attributes of these items; and a graph describing the hierarchy of categories to which each item belongs. Customers have been anonymized, identifiable only by a unique session number; all items, properties and categories were similarly hashed for confidentiality reasons.
The dataset comprises 2,756,101 behavioral events (2,664,312 views, 69,332 “add to carts”, 22,457 purchases) observed on 1,407,580 distinct visitors. There are 417,053 unique items represented in the data.
Three different schemes for joint distribution learning can be constructed from the behaviors view, addtocart, and buy in the dataset—(view, add), (add, buy), and (view, buy). The (view, buy) scheme was studied here because it directly connects viewing a raw recommendation and a corresponding purchase.
User segmentation. In Reference [22], models of user engagement with digital services were developed based on metrics covering three aspects of observable variables: popularity, activity, and loyalty. Each of these areas and their metrics suggests a means for grouping users in an implicit feedback situation.
In the current study, users were segmented based on counts of the number of interactions made within a session (referred to as “click depth” in Reference [22]). Each visitor/session in the dataset was assigned to one of five behavioral segments according to click depth, which increases in proportion to bin number. Visitor counts by segment for the (view, buy) scheme are summarized in Table 2.
Note that the first segment is empty in this table. User segmentation was based on the entire sample; for the (view, buy) group, no paired data fell into the lowest click depth bin. (At least two actions are required to both view and buy an item; 86% of the entire sample comprised two or fewer interactions, and buys totaled less than 1%.) The total sample size considered for the four remaining segments was 10,591 visitors.
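The segmentation step can be sketched as follows; the bin edges shown are placeholders for illustration, not the boundaries used in this study:

```python
import pandas as pd

# events: one row per interaction, keyed by a visitor identifier
events = pd.DataFrame({"visitorid": [1, 1, 2, 2, 2, 2, 2, 3]})

# Click depth = number of interactions per visitor/session
click_depth = events.groupby("visitorid").size()

# Assign each visitor to one of five behavioral segments; bin edges
# here are illustrative assumptions only.
segments = pd.cut(click_depth,
                  bins=[0, 2, 5, 10, 50, float("inf")],
                  labels=[1, 2, 3, 4, 5])
```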
Compressed representation. The e-commerce dataset contained 417,053 distinct items in 1669 product categories. Data matrices covering the full-dimensional space (1669 × 417,053) are prohibitively large for practical computation. This is exacerbated by the number of trainable parameters in the model (>257,000,000).
To address this cardinality issue, a compressed representation of the product data was created using an arithmetic coding algorithm [23]. For each visit, training data matrices were constructed having fixed dimensions of 1669 rows (one product category per row) and 300 columns (encoded bit strings representing items in the corresponding category). This was the maximum encoded column dimension; row encodings of lesser length were prepended with zeros. The decoded result is identical regardless of the length of this prefix; this is important in order to subsequently identify specific items recommended by the system.
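A sketch of the fixed-width row encoding follows; `arithmetic_encode` is a hypothetical stand-in for the coder of Reference [23]:

```python
import numpy as np

ENCODED_COLS = 300  # fixed encoded row width

def encode_row(item_vector):
    """Encode one category's item vector to a fixed-width bit string."""
    # `arithmetic_encode` is a hypothetical stand-in for the coder of
    # Reference [23]; it returns a variable-length list of bits (0/1).
    bits = arithmetic_encode(item_vector)
    # Prepend zeros up to the fixed width; the decoder ignores this
    # prefix, so the decoded result is unchanged.
    return np.array([0] * (ENCODED_COLS - len(bits)) + bits, dtype=np.int8)
```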
The encoded, sparse data matrices profiling visitor behavior for Views ($V$) and Buys ($B$) can be expressed symbolically as:

$$V = [v_{mn}]_{r \times c}, \qquad B = [b_{mn}]_{r \times c} \quad (3)$$

where elements $v_{mn}, b_{mn} \in \{0, 1\}$ are indicator variables denoting whether a visitor interacts with category $m$ and item $n$, and $r \times c = 1669 \times 300$ is the encoded matrix dimension.
GAN training is carried out in the compressed data space, and final recommendations are read out after decoding to the full column dimension of all items comprising the product catalog.
2.4. Evaluation Metrics
Assessments of recommendation systems in academic research often include utility statistics of the returned results (such as novelty, precision, sensitivity), or overall system computational efficiency [
24,
25,
26]. To estimate the business value derived from deployed systems, effectiveness measures may be direct (conversion rates, click-through rates), or inferred (increased subscriptions, overall revenue lift, for example) [
27].
In the implicit feedback situation considered here, recommendations are created by sampling a joint distribution of (view, buy) behaviors. Consider the potential paths for system evaluation suggested in Figure 2. In the left-hand column are the paired training data $(V, B)$; on the right, the generated recommendations $(G_1, G_2)$.
Without knowledge of the relevance of recommendations from human ratings or purchasing behavior, evaluation in this preliminary work is based on objective metrics of similarity between the generated $(G_1, G_2)$ lists (path #4). Items contained in the intersection of the generated recommendation sets are taken to signify the highest likelihood for completion of a transaction within the context of a given visitor session. This idea is illustrated in Figure 3.
Two metrics of evaluation are proposed:
1. Specific items contained within the overlapping category sets that are both viewed and “bought”—a putative conversion rate;
2. Coherence between categories in the paired recommendations.
Estimation of conversion rate is the most important statistic considered here; it is crucial for evaluation and optimization of recommender systems in terms of user utility, as well as for targeted advertising [28].
Category overlap is a prerequisite for demonstrating the feasibility of the current approach to product recommendation.
Product conversion rate. Define the conversion rate as the ratio of the number of items recommended and bought to the count of all items recommended, conditioned on the set of overlapping product categories returned by the system:

$$\mathrm{CVR}(y) = \frac{1}{N} \sum_{n=1}^{N} \frac{|i_V^{(n)} \cap i_B^{(n)}|}{|i_V^{(n)}|}, \qquad i_V, i_B \in c_V \cap c_B \quad (4)$$

where $i_V$, $i_B$ are items, $c_V$, $c_B$ are product categories, $N$ is the number of GAN realizations, $y$ denotes the user segment and “$|\cdot|$” denotes the cardinality of its argument. Note that in the current analysis, it is assumed that all recommended items $i_V$ are viewed by a visitor.
Category similarity. The average Jaccard similarity between recommended categories $(c_V, c_B)$ is given by

$$\bar{J}(y) = \frac{1}{N} \sum_{n=1}^{N} \frac{|c_V^{(n)} \cap c_B^{(n)}|}{|c_V^{(n)} \cup c_B^{(n)}|} \quad (5)$$
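Under the set representation implied by Equations (4) and (5), the two metrics can be computed as sketched below, with items carried as (category, item) pairs so the category condition can be applied directly; this is an illustrative reading of the definitions, not the analysis code itself:

```python
def conversion_rate(realizations):
    """Mean ratio of bought-and-viewed items to viewed items, restricted
    to categories recommended in both lists (cf. Equation (4))."""
    rates = []
    for viewed, bought in realizations:  # sets of (category, item) pairs
        cats = {c for c, _ in viewed} & {c for c, _ in bought}
        v = {x for x in viewed if x[0] in cats}
        if v:
            rates.append(len(v & bought) / len(v))
    return sum(rates) / len(rates) if rates else 0.0

def category_similarity(realizations):
    """Average Jaccard similarity between recommended category sets
    (cf. Equation (5))."""
    sims = []
    for viewed, bought in realizations:
        cv, cb = {c for c, _ in viewed}, {c for c, _ in bought}
        sims.append(len(cv & cb) / len(cv | cb) if cv | cb else 0.0)
    return sum(sims) / len(sims) if sims else 0.0
```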
Training distribution statistics. Summary statistics comparing the training and generated distributions (Figure 2, path #1) are observed to provide qualitative information about the effectiveness of target distribution learning.
Null hypothesis tests. A legitimate question to ask upon analyzing the current results is this: “Are the generator realizations samples of the target joint distribution, or do they simply represent random noise?”.
To address this question, the analysis includes statistics estimated from simulation trials (n = 500) in which randomly selected elements from the (V, B) matrices are set equal to unit value, while maintaining the average sparsity observed in the decoded GAN predictions.
The random trials are meant to test the null hypothesis that there is no correlation between paired elements in the generator output.
The alternative hypothesis is that the recommendations contain relevant information that may provide utility to the system user.
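One null trial can be simulated as below; the matrix shape and sparsity value shown are assumed placeholders, chosen only to illustrate matching the average sparsity of the decoded predictions:

```python
import numpy as np

rng = np.random.default_rng()

def random_trial(shape, sparsity):
    """Random 0/1 matrix whose expected fraction of ones equals `sparsity`."""
    return (rng.random(shape) < sparsity).astype(np.int8)

# 500 paired (V, B) null trials; shape and sparsity are placeholders.
trials = [(random_trial((1669, 300), 0.01), random_trial((1669, 300), 0.01))
          for _ in range(500)]
```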
2.5. Recommendation Experiments
Training. The system was trained on the encoded data for 1100 epochs, in randomly selected batches of 16 examples each.
Statistics of the training and generated data for all networks comprising the model were observed during the training iterations.
The generated statistics monotonically approached those of the true distribution until around epoch 1100, at which point the GAN output began to diverge. One explanation for this may be that the representational capacity of the networks on this abstract learning task had become exhausted [29]. Examples of this statistical evolution during training are shown in Figure 4. Note that training data matrix values were scaled onto the range [−1, +1], where the value “−1” corresponds to a zero-valued element in the sparse raw data arrays.
Label smoothing on the positive ground truth examples presented to the discriminators was used to regularize the learning procedure [
30].
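A minimal sketch of this one-sided smoothing, assuming the conventional target value of 0.9:

```python
import tensorflow as tf

batch_size = 16

# Positive ("real") targets shown to the discriminators are softened
# from 1.0 to 0.9 (an assumed, commonly used value); negatives stay at 0.
real_labels = tf.fill((batch_size, 1), 0.9)
fake_labels = tf.zeros((batch_size, 1))
```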
At training stoppage, the observed discriminator accuracies were consistently in the ≈45–55% range, indicating that these models were unable to differentiate between the real samples and the fake distributions produced by the generators [14].
Testing. Testing a machine learning model refers to evaluating its output predictions on out-of-sample (unseen) input data. This provides an estimate of the quality and generalization error of the model. Techniques such as cross-validation are often used to assess generalization potential. In contrast to many other learning algorithms, GANs do not have an objective evaluation function, rendering performance comparison of different models difficult [31].
For a concrete illustration, imagine that a GAN has been trained on a distribution of real images of human faces, and generates synthetic face samples after sufficient training iterations [
32].
The degree to which the sampled distribution approximates the target distribution can be estimated by qualitative scoring; the assessment is made subjectively by human observers, who can readily judge how “facelike” the artificial faces appear to the eye.
Alternatively, objective metrics based on training and generated image data can be applied in some cases. Example objective metrics are proposed in Reference [
31].
In the present application, the generated “images” are abstract representations of consumer activity, not concrete objects. An out-of-sample test in the conventional sense is not possible. The metrics of generative model performance and null hypothesis tests as described in
Section 2.4 constitute the testing of the model developed in this work.
GAN predictions. After training, the model was stimulated with a noise vector and a user segment conditioning signal, producing a series of coupled predictions $(G_1, G_2)$, as depicted in Figure 1. The discriminators $D_1$, $D_2$ serve only to guide training, and are disabled for the inference procedure.
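The inference step can be sketched as follows; `generator` stands for the trained coupled-generator model, and the latent dimension and segment index are assumed values:

```python
import numpy as np

n_samples, latent_dim, segment = 2500, 100, 3  # latent_dim/segment assumed

z = np.random.normal(size=(n_samples, latent_dim))  # noise input
y = np.full((n_samples, 1), segment)                # conditioning signal

# `generator` is the trained coupled-generator model producing the
# paired (view, buy) matrices; the discriminators take no part here.
g1_out, g2_out = generator.predict([z, y])
```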
A total of 2500 generator realizations was produced for each user segment. The recommendation matrices were decoded onto the full-dimensional space (1669 × 417,053) by the inverse of the arithmetic coding algorithm used in data preparation.
A sub-sample of these realizations was taken to compute the key recommender evaluation metrics (Equations (4) and (5)). This was done because the decoding algorithm involves compute-intensive iteration, and is not amenable to computation on GPUs at this writing. Numerical optimization of the arithmetic decoding should be addressed in future extensions of this work.
The evaluation statistics were also calculated in null hypothesis tests, the results of which were averaged to obtain an estimate of the distribution expected under this hypothesis.