Article

Job Recommendations: Benchmarking of Collaborative Filtering Methods for Classifieds

by Robert Kwieciński 1,2, Tomasz Górecki 2,*, Agata Filipowska 3 and Viacheslav Dubrov 1
1 OLX Group, Królowej Jadwigi 43, 61-872 Poznań, Poland
2 Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Uniwersytetu Poznańskiego 4, 61-614 Poznań, Poland
3 Department of Information Systems, Poznań University of Economics and Business, Al. Niepodległości 10, 61-875 Poznań, Poland
* Author to whom correspondence should be addressed.
Electronics 2024, 13(15), 3049; https://doi.org/10.3390/electronics13153049
Submission received: 30 May 2024 / Revised: 24 July 2024 / Accepted: 27 July 2024 / Published: 1 August 2024

Abstract: Classifieds pose numerous challenges for recommendation methods, including the temporary visibility of ads, the anonymity of most users, and the fact that typically only one user can consume an advertised item. In this work, we address these challenges by choosing models and evaluation procedures that account for accuracy, diversity, and efficiency (in terms of memory and time consumption during training and prediction). This paper aims to benchmark various recommendation methods for job classifieds, using OLX Jobs as an example, to enhance the conversion rate of advertisements and user satisfaction. In our research, we implement scalable methods representing different approaches to recommendation: Alternating Least Squares (ALS), LightFM, Prod2Vec, RP3Beta, and Sparse Linear Methods (SLIM). We conducted A/B tests by sending millions of messages with recommendations to perform online evaluations of selected methods. In addition, we have published the dataset created for our research. To the best of our knowledge, this is the first dataset of its kind. It contains 65,502,201 events performed on OLX Jobs by 3,295,942 users who interacted with (displayed, replied to, or bookmarked) 185,395 job ads over two weeks in 2020. We demonstrate that RP3Beta, SLIM, and ALS perform significantly better than Prod2Vec and LightFM when tested in a laboratory setting. Online A/B tests also show that sending messages with recommendations generated by the ALS and RP3Beta models increases the number of users contacting advertisers. Additionally, RP3Beta had a 20% greater impact on this metric than ALS.

1. Introduction

Online classifieds are websites where advertisers post advertisements concerning the sale or rental of services or products [1]. This paper addresses online advertising using the example of OLX Jobs.
Classifieds strive to address the needs of both advertisers and responders to online ads. To match these two groups, good recommendation methods are becoming increasingly important. Compared to traditional e-commerce (e.g., Amazon (https://www.amazon.com/, accessed on 26 June 2024) or eBay (https://www.ebay.com/, accessed on 26 June 2024)), classifieds face different challenges, which we address in our evaluation procedure. One is the temporary visibility of ads, which we decided to address by frequently retraining the model; hence, the model's efficiency, which we report, is an essential part of the evaluation. Another challenge is that an advertiser usually needs only one user to complete the transaction. This aspect was addressed by including a diversity analysis in our assessment.
In our research, we focus on approaches emerging from collaborative filtering. Collaborative filtering models have proven their accuracy, even though they do not utilize additional information about users or items, which is sometimes hard to obtain or process in practice. This paper aims to evaluate different recommendation methods for use in the classifieds sector. The high sparsity of our dataset, along with the significant number of ads and users, contributes to the complexity of the research problem.
Below, we list our main contributions.
  • We comprehensively evaluate diverse, scalable collaborative filtering approaches in a laboratory setting regarding accuracy, diversity, and efficiency. We evaluate the accuracy for different groups of users, depending on the number of items they interacted with, together with a statistical analysis of the differences. The proposed evaluation methodology addresses the major challenges of classifieds.
  • We present the results of two online A/B tests demonstrating the impact of the ALS and RP3Beta models on OLX users. More than 1 million users participated in each test. The reported impact may be used to estimate the importance of recommender systems in the classifieds domain and enable other companies to make more informed decisions about developing such models.
  • We publish and present a job interactions dataset. To our knowledge, this is the largest publicly available dataset on job interactions.
This paper consists of seven sections. Section 2 is a literature review on underlying developments. Section 3 describes the dataset developed and made publicly available on Kaggle. The design of the lab experiment, as well as the implementation of methods, are described in Section 4. Section 5 focuses on the results of the lab evaluation of our implementations as well as the statistical importance of the results. Section 6 presents the results of the online implementations performed during A/B tests with users, as well as a discussion of the results. Conclusions are presented in Section 7.

2. Related Work

In this section, we begin by outlining the unique characteristics of classifieds and job domains. We then provide an overview of recommender systems. Next, we discuss the factors that influence our selection of models and provide a detailed description of each model. Finally, we specify the research gaps.

2.1. Classified Ad Sites

Classified ad sites (classifieds) are online versions of newspaper advertisement sections, with ads concerning vehicles, real estate, employment, pets, etc. Classifieds are a particular type of e-commerce site, as they do not support the finalization of transactions between parties—they are used instead to communicate about the product or service and to set up meetings while the transaction is carried out offline; this distinguishes them from traditional e-commerce sites such as eBay or Amazon [2].
Our research is focused on OLX, which has a presence in over 30 countries worldwide. One of the ad types published on classifieds is job offers. In our research, we identified several characteristics that distinguish job ads on classifieds from, for example, car or real estate ads:
  • The requirements in an ad indicate the type of user who may respond to the ad—not all users possess the required competencies.
  • More intensive involvement of the user during the search process, as finding a job is a primary need compared to, for example, buying a car.
  • The job location strongly impacts the user: to buy a mobile phone or a car, a user may travel or use a parcel service to deliver the item, but a job requires relocation if it is distant from the user's home.
  • From a technical perspective, a job description has many qualitative aspects, and depending on the position, the differences may be significant even between ads from the same company.
On top of these features, during our research, we also identified the following challenges for recommendation methods on classifieds that influenced our implementation of the methods:
  • The number of users and offers: with OLX Jobs, we dealt with millions of users and tens of thousands of offers, both constantly changing.
  • A user does not have to create a profile to interact with ads, meaning that information from a profile cannot be used for recommendation, and recommendations are determined by user behavior during viewing.
  • An ad usually has a limited number of visitors (interacting with the ad), as an ad usually concerns a unique offer. After the need is addressed, the ad is disabled.
  • There is no information on whether a transaction occurred; on classifieds, conversion relates to receiving an answer to a published ad, as any transaction is offline.
These challenges create difficulties not only for implementation but also for testing the accuracy of the proposed methods. We considered these challenges while preparing the dataset and implementing and evaluating the methods described in this paper.

2.2. Recommendation Systems

It has been shown that the problem of overchoice [3] can decrease the consumption of products [4] and reduce revenue to the site [5]. Providers can ask their users for additional information about their expectations regarding items to reduce the number of choices presented. Even if the user is willing to spend time providing this information, the number of possibilities is usually too large to present to the user. To address this problem, many companies, such as Netflix [6] and Amazon [7], have created personalized recommendation systems whose role is to suggest relevant items to users [8].
Recommendations are used in multiple domains, including music, tourism, fashion, and food. Numerous studies have been made within the domain of e-commerce [9], classifieds [10], and job recommendations [11,12,13]. Li et al. [14] significantly improved their recommendations on LinkedIn by utilizing deep transfer learning to create domain-specific job-understanding models. Zhu et al. [15] proposed a cross-domain recommendation model utilizing a heterogeneous graph and community detection algorithm to recommend courses and jobs. Lacic et al. [16] evaluated different autoencoder architectures for session-based job recommendations and presented the results on popular job recommendation datasets.
Typically, we distinguish two approaches for providing recommendations: content-based recommendations and collaborative filtering [8].
In content-based recommendations [17], we utilize information about users (e.g., hobby, education, skills) and items (e.g., price, location, category). These systems usually use the interaction history (ratings, visits, purchases, and replies) of a single user but do not consider the ratings of all users simultaneously.
Collecting and processing information about users and items is not always feasible. In the collaborative filtering approach [18,19], we only utilize information about user interactions with items. Even without user and item features, these systems can provide very accurate recommendations. We chose this approach because of the lack of user features and the difficulty of outperforming collaborative filtering by utilizing item features in a production environment (we elaborate on this in Section 6.3).
One of the most significant shortcomings of collaborative filtering is the impossibility of providing recommendations regarding users and items without interactions, known as the cold-start problem [20]. One possible solution to reducing this problem is the frequent retraining of the model. With many user interactions, this strategy requires a focus on the scalability of the model [21].
Another important factor determining the model choice is the type of interaction between a user and an item [20]. We call the feedback explicit when the user explicitly indicates their preference toward the item (e.g., rates it on a scale of 1–5). Another type is implicit feedback, when we can only assume the user’s preference toward the item based on user behavior (e.g., buying an item or watching a movie). Implicit feedback is less accurate, more abundant, and usually expresses only favorable preferences [22].
Laboratory evaluations of recommender systems are usually based on accuracy metrics [23]. Recently, much attention has been placed on other aspects, such as diversity, uncertainty, novelty, and coverage [24,25]. With classifieds, an item can usually be consumed by only one user, so we should avoid recommending the same item to a significant number of users. Hence, focusing on catalog coverage [26] is essential. One of the most challenging tasks in evaluating recommender systems is finding the offline metric most correlated with the online goal [27].
In this paper, we benchmark collaborative filtering models for a sizable implicit feedback dataset. We compare the accuracy, diversity, and efficiency of selected recommendation methods in a laboratory setting and then prove the methods’ quality in online tests with users.

2.3. Recommendation Methods

Below, we list the most important aspects impacting method selection for our research.
  • Popularity in the literature. We selected methods whose performances were thoroughly evaluated. All but one of the selected methods have been cited at least hundreds of times.
  • Popularity in industry. Evaluating this aspect was challenging because companies rarely share the results of conducted experiments. Some popularity indicators include implementation availability and the number of people who use it (e.g., reflected in the number of stars and forks on GitHub). We also considered the experience of data scientists working within our company, who deployed most of the selected methods for some other use cases in the past.
  • Scalability. We only considered solutions capable of handling millions of users and items. Additionally, we reduced the cold-start problem by frequently retraining the model. Hence, we had to choose techniques to be trained on our dataset within a few hours.
  • Diversity of approaches. We wanted to test methods representing different “families” to assess their applicability to classifieds. We planned to identify the most promising family of approaches to continue research in this direction.
Finally, ALS, LightFM, Prod2Vec, RP3Beta, and SLIM were selected for testing and evaluation. The following subsections describe the reasons for choosing each specific method and the method definitions. A common limitation of these methods is their inability to recommend items that have not been visited by any user and to generate recommendations for users who have not visited any item (the cold-start problem [21]). Since our dataset contains no user or item features, no personalized method can overcome this problem.
Selected methods have already been compared across different domains. Anelli et al. [28] compared RP3Beta, SLIM, ALS, and BPRMF (which is equivalent to LightFM when using the BPR loss function) on three datasets: Amazon Digital Music, Epinions, and MovieLens 1M, in terms of accuracy, diversity, and efficiency. The numbers of interactions in these datasets are, respectively, 145,523, 300,475, and 571,531. Our study utilized a dataset containing tens of millions of interactions, similar to those in OLX production systems. Additionally, we provide the results of an online evaluation.
Dacrema et al. [29] compared the accuracy of SLIM, RP3Beta, and ALS on several datasets (CiteULike-a, MovieLens 1M, Pinterest, Amazon Music, Amazon Movies, MovieLens 20M, Netflix, Yelp, and Gowalla). They showed that each of these algorithms outperformed the other two on at least one dataset. Some of their findings were further confirmed on MovieLens 1M and Pinterest datasets by Anelli et al. [30]. Zhu et al. [31] compared BPRMF, SLIM, and Prod2Vec on three datasets (Amazon Books, Yelp, and Gowalla), showing the superiority of SLIM. Dong et al. [32] analyzed the accuracy of RP3Beta, SLIM, and ALS on nine datasets, with SLIM providing the best results on eight datasets and ALS on one. Unlike our study, these papers neither compared the selected methods in terms of diversity or efficiency nor conducted an online evaluation.
Recently, several recommendation methods based on neural networks have been proposed. It has been shown that adequately determined and tuned simple models often outperform them [28,31,32,33]. Additionally, due to computational complexity, models based on neural networks are less frequently tested on datasets containing tens of millions of interactions [28]. Hence, we leave it for future research to propose more advanced and scalable methods that provide superior results on our dataset.

2.3.1. Matrix Factorization Models

Matrix factorization techniques have been successfully utilized in recommender systems since the Netflix Prize competition [34]. The main idea is to approximate the sparse user–item interaction matrix as a product of two smaller, dense matrices. We evaluate two matrix factorization approaches: LightFM and ALS.
The LightFM [35] model was proposed to overcome the cold-start problem of matrix factorization methods. In our case, without additional information about users and items, it reduces to classical matrix factorization. Despite this, we decided to use this approach because of its efficient official implementation supporting multiple loss functions: logistic, BPR [36], WARP [37], and k-OS WARP [38].
Let $U$ be the set of users and $I$ the set of items. For a given user $u \in U$ and item $i \in I$, we define $r_{ui} = 1$ if user $u$ interacted with item $i$, and $r_{ui} = 0$ otherwise. The LightFM model predicts the score $\hat{r}_{ui}$ as follows:
$$\hat{r}_{ui} = x_u \cdot y_i + b_u + b_i,$$
where $x_u$ and $y_i$ denote user $u$'s and item $i$'s $k$-dimensional latent representations, and $b_u$ and $b_i$ are user $u$'s and item $i$'s biases. All of these parameters are learned during training.
The logistic loss function should be used when both positive and negative feedback types are available, so it is inappropriate for our dataset, which we confirmed experimentally during the hyperparameter tuning.
The BPR, WARP, and k-OS WARP loss functions are pairwise learning-to-rank approaches that are more relevant for our top-k recommendation task. They modify the model parameters to maximize the difference between the scores of items $i \in D_u$ with which user $u$ interacted and items $j \notin D_u$ with which the user did not interact.
Specifically, the BPR loss function [36] attempts to maximize the following expression:
$$\sum_{(u,i,j) \in S} \ln \sigma(\hat{r}_{ui} - \hat{r}_{uj}) - \lambda_\Theta ||\Theta||^2,$$
where
$$S = \{(u,i,j) \mid u \in U,\ i \in D_u\ \text{and}\ j \notin D_u\},$$
$$\sigma(x) = \frac{1}{1 + e^{-x}},$$
$\lambda_\Theta$ is a model-specific regularization parameter, $\Theta$ is the parameter vector, and $||\cdot||$ is the Euclidean norm.
We present WARP [37] as a special case of k-OS WARP [38]. The k-OS WARP loss function is a sum of losses over users:
$$\sum_{u \in U} L(\hat{r}_u, D_u),$$
where $\hat{r}_u$ is the vector of estimated scores $\hat{r}_{ui}$ for each item $i \in I$.
For a given user $u$, we order the interacted items $D_u$ in decreasing order of their estimated scores. Namely, we order the items as $i_1, i_2, \ldots, i_{|D_u|}$, where the indices are chosen such that:
$$\hat{r}_{u i_1} \geq \hat{r}_{u i_2} \geq \cdots \geq \hat{r}_{u i_{|D_u|}}.$$
Then, we have the following:
$$L(\hat{r}_u, D_u) = \frac{1}{Z} \sum_{s=1}^{|D_u|} P(s)\, \Phi\big(\mathrm{rank}_{i_s}(\hat{r}_u)\big),$$
where $\mathrm{rank}_i(\hat{r}_u) = \sum_{j \in I \setminus D_u} \mathbf{1}(1 + \hat{r}_{uj} \geq \hat{r}_{ui})$, $\mathbf{1}$ is the indicator function, $\Phi(n) = \sum_{m=1}^{n} \frac{1}{m}$, and $Z = \sum_{s=1}^{|D_u|} P(s)$ normalizes the weights induced by $P$.
In practice, we sample a positive item $i$ according to the weighting function $P$ and randomly pick unseen items $j$ until, after $N$ samplings, $1 + \hat{r}_{uj} \geq \hat{r}_{ui}$. Then, we perform a gradient step to minimize $\Phi\big(\lfloor |I \setminus D_u| / N \rfloor\big)\,(1 - \hat{r}_{ui} + \hat{r}_{uj})$.
Finally, the k-OS WARP loss is obtained for $P(k) = 1$ and $P(m) = 0$ if $m \neq k$. The WARP loss is obtained when we set $P(m) = 1$ for all $m \in \mathbb{N}$; in this case, we can skip ordering the positive items from $D_u$, which simplifies the computations.
In our research, all loss functions (logistic, BPR, WARP, and k-OS WARP) were treated as hyperparameters of the LightFM model and optimized during training.
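For illustration, a minimal usage sketch of the official LightFM implementation on a toy interaction matrix is given below; the hyperparameter values are illustrative only and do not correspond to the tuned values reported later.

```python
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM

# Toy interaction matrix (3 users x 4 items) standing in for the OLX data;
# r_ui = 1 where a user interacted with an item.
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([0, 2, 1, 2, 3])
interactions = coo_matrix(
    (np.ones_like(rows, dtype=np.float32), (rows, cols)), shape=(3, 4)
)

# Loss can be "logistic", "bpr", "warp", or "warp-kos" (treated as a hyperparameter).
model = LightFM(no_components=16, loss="warp")
model.fit(interactions, epochs=10, num_threads=2)

# Without user/item features, the prediction reduces to x_u . y_i + b_u + b_i.
scores = model.predict(np.array([0, 0, 0, 0], dtype=np.int32),
                       np.array([0, 1, 2, 3], dtype=np.int32))
print(scores)
```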
The ALS model (alternating least squares), also known as WRMF—weighted regularized matrix factorization [39], is an example of a matrix factorization method designed for implicit feedback datasets. We chose this method because of its proven performance, scalability, and popularity in research and industry. Additionally, the ALS model has already been implemented and used in OLX.
We define the confidence of an observation as $c_{ui} = 1 + \alpha r_{ui}$, where $\alpha$ is a hyperparameter, and the preference as $p_{ui} = \mathbf{1}(r_{ui} > 0)$. Then, ALS optimizes the following expression:
$$\min_{x_*, y_*} \sum_{u,i} c_{ui}\,(p_{ui} - x_u \cdot y_i)^2 + \lambda \Big( \sum_u ||x_u||^2 + \sum_i ||y_i||^2 \Big),$$
where $x_u$ is user $u$'s $f$-dimensional embedding, $y_i$ is item $i$'s $f$-dimensional embedding, and $\lambda$ is a regularization parameter. The model's name refers to the optimization procedure for this function: alternating least squares fixes the user (or item) latent factors and analytically determines the other set; the two steps are repeated efficiently for a few iterations [39,40].
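To make the alternating updates concrete, the following dense NumPy sketch implements the procedure on a small scale; a production implementation works on sparse matrices and exploits the structure of the confidence matrix for speed, but the update rule is the same.

```python
import numpy as np

def als_implicit(R, factors=16, alpha=40.0, reg=0.1, iterations=10, seed=0):
    """Dense sketch of implicit-feedback ALS; R is a |U| x |I| array of counts."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    X = 0.01 * rng.standard_normal((n_users, factors))  # user factors x_u
    Y = 0.01 * rng.standard_normal((n_items, factors))  # item factors y_i
    P = (R > 0).astype(float)                            # preferences p_ui
    C = 1.0 + alpha * R                                  # confidences c_ui

    def solve(fixed, pref, conf):
        # For each row, solve (fixed^T C fixed + reg*I) w = fixed^T C p analytically.
        out = np.empty((conf.shape[0], fixed.shape[1]))
        for u in range(conf.shape[0]):
            Cu = np.diag(conf[u])
            A = fixed.T @ Cu @ fixed + reg * np.eye(fixed.shape[1])
            b = fixed.T @ Cu @ pref[u]
            out[u] = np.linalg.solve(A, b)
        return out

    for _ in range(iterations):
        X = solve(Y, P, C)       # update user factors with item factors fixed
        Y = solve(X, P.T, C.T)   # update item factors with user factors fixed
    return X, Y                  # predicted scores: X @ Y.T

# Toy usage: 4 users x 5 items with a few interactions.
R = np.zeros((4, 5))
R[0, 1] = R[0, 3] = R[1, 1] = R[2, 0] = R[3, 4] = 1
X, Y = als_implicit(R)
print((X @ Y.T).round(2))
```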
In matrix factorization approaches, the embedding for each user is learned separately, which might result in high time and memory utilization if the number of users is vast. Additionally, the recommendations might lack explainability because two users with the same interactions might receive different recommendations. Other considered models (SLIM, RP3Beta, Prod2Vec) provide the same set of recommendations for users with the same interactions.

2.3.2. Neighborhood-Based Models

The SLIM model (sparse linear methods for Top-N recommender systems [41]) is based on item-based nearest neighbors regression [8]. We chose this method because of its advantages over item-based k-nearest-neighbors and matrix factorization models [41] and good performance against non-neural and neural approaches [29].
In this model, we approximate the user–item matrix $\mathbf{R}$ by learning a sparse item–item similarity matrix $\mathbf{W}$. Explicitly, $\mathbf{W}$ is learned by minimizing the following expression:
$$\frac{1}{2} ||\mathbf{R} - \mathbf{R}\mathbf{W}||_F^2 + \frac{\beta}{2} ||\mathbf{W}||_F^2 + \lambda ||\mathbf{W}||_1,$$
subject to $w_{ij} \geq 0$ and $w_{ii} = 0$ for all $1 \leq i, j \leq |I|$, where $||\cdot||_F$ is the matrix Frobenius norm:
$$||\mathbf{W}||_F = \sqrt{\sum_{i=1}^{|I|} \sum_{j=1}^{|I|} w_{ij}^2},$$
$||\cdot||_1$ is the entry-wise $\ell_1$-norm:
$$||\mathbf{W}||_1 = \sum_{i=1}^{|I|} \sum_{j=1}^{|I|} |w_{ij}|,$$
and $\beta$ and $\lambda$ are regularization parameters.
We can see that each column of $\mathbf{W}$ can be learned independently, which improves the scalability of this model [41].
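Because each column is an independent regression problem, a common way to implement SLIM is one ElasticNet fit per item, as in the minimal sketch below; the regularization values are illustrative, and a production version would parallelize over columns and prune near-zero coefficients.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import ElasticNet

def slim_similarity(R, l1_reg=1e-3, l2_reg=1e-3, max_iter=100):
    """Column-wise SLIM sketch: R is a sparse |U| x |I| interaction matrix."""
    R = sp.csc_matrix(R, dtype=np.float64)
    n_items = R.shape[1]
    W = sp.lil_matrix((n_items, n_items))

    alpha = l1_reg + l2_reg
    model = ElasticNet(alpha=alpha, l1_ratio=l1_reg / alpha, positive=True,
                       fit_intercept=False, copy_X=False, max_iter=max_iter)

    for j in range(n_items):
        y = np.asarray(R[:, j].todense()).ravel()  # target: interactions with item j
        X = R.copy()
        X[:, j] = 0                                # enforce the constraint w_jj = 0
        model.fit(X, y)
        W[:, j] = model.coef_.reshape(-1, 1)       # sparse, non-negative column w_j
    return W.tocsr()                               # predicted scores: R @ W

# Toy usage on a 4 x 5 interaction matrix.
R = np.zeros((4, 5))
R[0, 1] = R[0, 3] = R[1, 1] = R[1, 3] = R[2, 0] = R[3, 4] = 1
print(slim_similarity(R).toarray().round(2))
```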
A limitation of this method is that it does not generate user and item embeddings, which could be helpful for other tasks (e.g., clustering of users or items). We can treat the columns of the matrix $\mathbf{W}$ as item embeddings, but they are sparse $|I|$-dimensional vectors, so they probably cannot be directly utilized in other tasks.

2.3.3. Graph-Based Approaches

RP3Beta [42] is a graph-based collaborative filtering approach. Similarly to Dacrema et al. [29], we present this model as an extension of the P3alpha model [43]. We consider a bipartite, undirected graph $G = (V, E)$, where each vertex $v \in V$ represents either a user or an item, and each edge $\{v_u, v_i\} \in E$ represents an interaction between a given user $u$ and a given item $i$. Let $\mathbf{A}$ be the adjacency matrix and $\mathbf{D}$ the degree matrix of the graph $G$. Then, the scores of P3alpha are stored in the matrix
$$\big( (\mathbf{D}^{-1}\mathbf{A})^{\circ \alpha} \big)^3,$$
where $\circ$ denotes the element-wise (Hadamard) power of a matrix. More precisely, the score $\hat{r}_{ui}$ is stored at the intersection of the row representing user $u$ and the column representing item $i$. We can see that this score is a sum of scores assigned to paths of length 3 connecting user $u$ and item $i$ in the graph $G$.
In RP3Beta, the score is computed with P3alpha and divided by the popularity of the item raised to the power of $\beta$ to mitigate the popularity bias.
We chose this model because of its simplicity, scalability, and proven performance [29,30]. The high sparsity of our dataset significantly decreases memory utilization when training the model, which enables us to directly compute and store the item–item similarity matrix in memory instead of using the random-walk approximation proposed in the original paper.
RP3Beta, similar to SLIM, does not generate user and item embeddings. Another related limitation is its inability to recommend items whose distance from a given user on the user–item bipartite graph is greater than three. Specifically, item $i$ can be recommended to user $u$ only if at least one user $u'$ has visited both item $i$ and at least one of the items visited by user $u$. Additionally, the RP3Beta model has no parameters optimized during training; hence, it might be unable to exploit complex relations in the data.
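As an illustration, the sketch below computes the RP3Beta item–item similarity directly with sparse matrix operations; it materializes a dense similarity matrix for clarity, whereas a production version keeps the result sparse and prunes small entries.

```python
import numpy as np
import scipy.sparse as sp

def rp3beta_similarity(R, alpha=1.0, beta=0.5):
    """RP3Beta sketch: R is a sparse |U| x |I| binary interaction matrix."""
    R = sp.csr_matrix(R, dtype=np.float64)

    def row_normalize(M):
        row_sums = np.asarray(M.sum(axis=1)).ravel()
        row_sums[row_sums == 0] = 1.0
        return sp.diags(1.0 / row_sums) @ M

    # Transition probabilities user->item and item->user, raised element-wise to alpha.
    P_ui = row_normalize(R).power(alpha)
    P_iu = row_normalize(R.T).power(alpha)

    # Item-item scores from length-2 walks item -> user -> item (P3alpha part).
    W = (P_iu @ P_ui).toarray()

    # Divide each column by the item popularity raised to beta (the "R" in RP3Beta).
    popularity = np.asarray(R.sum(axis=0)).ravel()
    popularity[popularity == 0] = 1.0
    W = W / popularity[np.newaxis, :] ** beta

    np.fill_diagonal(W, 0.0)   # never recommend the seed item itself
    return W                   # user scores: R @ W (ranking unchanged by row scaling)

# Toy usage on a 4 x 5 interaction matrix.
R = np.zeros((4, 5))
R[0, 1] = R[0, 3] = R[1, 1] = R[2, 0] = R[3, 4] = 1
print(rp3beta_similarity(R).round(3))
```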

2.3.4. Prod2Vec

The Prod2Vec [44] (or Item2Vec [45]) model is based on the Word2Vec model [46], widely used in natural language processing. The model focuses on learning vector representations of items, treating a sequence of items as a "sentence" and the items within that sequence as "words". This method was chosen because it is scalable and very different from the compared methods. Additionally, this approach underlies the current state-of-the-art hybrid deep neural network recommender system used in other categories at OLX.
Prod2Vec uses the skip-gram model, maximizing the following objective function over the set $S$ of item sequences:
$$\mathcal{L} = \sum_{s \in S} \sum_{i_k \in s} \sum_{-c \leq l \leq c,\, l \neq 0} \log P(i_{k+l} \mid i_k),$$
where $c$ is the context length for item sequences, and items from the same sequence are ordered arbitrarily. The probability $P(i_{k+l} \mid i_k)$ of observing a neighboring item $i_{k+l}$ given the current item $i_k$ is defined using the soft-max function:
$$P(i_{k+l} \mid i_k) = \frac{\exp(y_{i_k}^{\top} y'_{i_{k+l}})}{\sum_{i=1}^{|I|} \exp(y_{i_k}^{\top} y'_i)},$$
where $y_i$ and $y'_i$ are the input and output vector representations of item $i$, and $|I|$ is the number of unique items in the vocabulary. Prod2Vec models the context of an item sequence, so items with similar contexts (i.e., with similar neighboring interactions) have similar vector representations.
Before generating recommendations, we normalize all input vector representations, which improves efficiency. To generate recommendations, we represent the user as an average of input vector representations of interacted items. Then, we search for items whose input representations have the highest cosine similarity with the user representation (k-nearest neighbors in the latent space).
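A minimal sketch of this procedure using the Gensim Word2Vec implementation (gensim >= 4 API) is shown below; the toy item sequences and hyperparameter values are illustrative only.

```python
import numpy as np
from gensim.models import Word2Vec

# Toy item "sentences": each user's interaction history is one sequence of item ids.
sessions = [["i1", "i7", "i3"], ["i2", "i3", "i4"], ["i7", "i1"], ["i3", "i4"]]

# Skip-gram Word2Vec over item sequences.
model = Word2Vec(sentences=sessions, vector_size=32, window=5,
                 sg=1, min_count=1, epochs=20, workers=2)

# Normalize input vectors once so that dot products equal cosine similarities.
items = list(model.wv.index_to_key)
vectors = np.array([model.wv[i] for i in items])
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

def recommend(user_items, k=10):
    """User = mean of the vectors of interacted items; return the k nearest items."""
    idx = [items.index(i) for i in user_items if i in model.wv]
    user_vec = vectors[idx].mean(axis=0)
    scores = vectors @ user_vec
    ranked = [items[j] for j in np.argsort(-scores) if items[j] not in user_items]
    return ranked[:k]

print(recommend(["i1", "i3"], k=3))
```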
A potential limitation of this method is a strict dependency between user and item representations. Some interactions should potentially have a greater impact on user representations than others (which is possible in matrix factorization models).

2.4. Research Gap

This paper aims to benchmark different recommendation methods in the job classifieds domain, addressing a specific research gap. To the best of our knowledge, the recommendation methods addressed have never been tested or benchmarked on a large scale while working online in the job domain. In this work, we compare these methods in a laboratory setting using the dataset we published and in online A/B tests carried out in a production environment.

3. OLX Job Interactions Dataset

This research provides a commercial dataset that contains 65,502,201 events performed on OLX Jobs by 3,295,942 users who interacted with 185,395 job ads in two weeks in 2020. We published the dataset on Kaggle (https://www.kaggle.com/olxdatascience/olx-jobs-interactions, accessed on 26 June 2024).
Each row of the dataset represents the event a given user performs regarding a given ad at a given time (Table 1). The following events involving user behaviors were included:
  • click: the user visited the item detail page;
  • bookmark: the user added the item to bookmarks;
  • chat_click: the user opened the chat to contact the item’s owner;
  • contact_phone_click_1: the user revealed the phone number attached to the item;
  • contact_phone_click_2: the user clicked to make a phone call to the item’s owner;
  • contact_phone_click_3: the user clicked to send an SMS message to the item’s owner;
  • contact_partner_click: the user clicked to access the item’s owner’s external page;
  • contact_chat: the user sent the item’s owner a message.
The data sparsity (the percentage of missing entries in the user–item matrix) equals 99.9923%, which is very high but typical for classifieds (for comparison, the sparsity of the MovieLens 20M dataset is 99.47%, making MovieLens almost 70 times denser than our dataset). Distributions of interactions per user, interactions per item, and event types are provided in Figure 1, Figure 2, and Table 2, respectively. We used a logarithmic scale because the distributions are strongly right-skewed.
According to De Ruijt and Bhulai [47], the most popular datasets used for the evaluation of job recommendation systems are CareerBuilder12 (https://www.kaggle.com/competitions/job-recommendation, accessed on 16 June 2024), RecSys16 [12], and RecSys17 [13]. Table 3 shows the statistics of the considered datasets. The number of unique interactions is the number of unique user–item pairs in which the user interacted with the item. Even though researchers use the RecSys16 and RecSys17 datasets, they were published during data science competitions under strictly limited conditions and are no longer available. Hence, we could not calculate the number of unique interactions for these datasets. We did observe that the OLX Job Interactions dataset has the highest number of interactions and interaction types among the considered datasets.
It should be noted that maintaining the confidentiality of ads and users was a priority when preparing this dataset. The measures taken to protect privacy included the following:
  • Original user and item identifiers were replaced with unique random integers;
  • Some undisclosed constant integer was added to each timestamp;
  • Some fractions of interactions were filtered out;
  • Some additional artificial interactions were added.

4. Experimental Setup

4.1. Implementation

We implemented the methods mentioned above based on previous developments. Details on how our implementations differ from the previously tested approaches are provided in Table 4. The implementation of the methods for reproducibility of the results is available on GitHub (https://github.com/rob-kwiec/olx-jobs-recommendations, accessed on 29 May 2024).

4.2. Laboratory Research and Online Testing Goals

Our ultimate business goal was to increase the number of users applying for jobs through OLX Jobs. To achieve that, we try to solve a ranking problem; more precisely, we recommend ten items with the highest chance of receiving a given user’s interaction. The issue of finding an offline metric most correlated with the application rate in the job domain was considered by Mogenet et al. [27]. Since our goal is similar, we followed their results by optimizing for precision@10 (precision for the first ten recommended items [23]). However, we also report and discuss other accuracy, diversity, and efficiency metrics.
We are focused on users actively looking for jobs by interacting with job ads. Therefore, we do not consider users without interactions (cold-start users).

4.3. Data Split

The dataset described in Section 3 was split into training–validation and test datasets as required by the research methodology. The test set includes 20% of the newest interactions. This means that out of 14 days, approximately 3 days were included in the test set (see Figure 3). This way of splitting the dataset into training and test sets is the most realistic setting according to Meng et al. [55], who studied several splitting strategies for evaluating recommendation systems.
In both datasets, we also limited the number of interactions to unique interactions between a user and an item. If a user interacted more than once with an item, only the first interaction was counted, and the timestamp of that interaction was associated with the item. We also did not distinguish between types of interactions. In addition, two more modifications were applied: only users who appeared in the training and validation dataset were included in the test set (because we could not provide personalized recommendations to other users without additional knowledge about them), and we filtered out all user–item pairs that were present in the training set (to avoid recommending items that had already been seen). As a result, we ended up with about 38 M rows in the training and validation set and about 6 M rows in the test set. More statistics of these two datasets are presented in Table 5.
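For concreteness, a minimal pandas sketch of this splitting procedure is shown below; the file name and the column names (user, item, timestamp) are assumptions about the dataset layout.

```python
import pandas as pd

# Minimal sketch of the temporal split described above.
df = pd.read_csv("interactions.csv")

# The newest 20% of interactions form the test set.
cutoff = df["timestamp"].quantile(0.8)
train = df[df["timestamp"] <= cutoff]
test = df[df["timestamp"] > cutoff]

# Within each set, keep only the first interaction of every user-item pair.
dedup = lambda d: d.sort_values("timestamp").drop_duplicates(["user", "item"])
train, test = dedup(train), dedup(test)

# Keep only test users seen in training, and drop pairs already present in training.
test = test[test["user"].isin(train["user"].unique())]
test = test.merge(train[["user", "item"]], on=["user", "item"],
                  how="left", indicator=True)
test = test[test["_merge"] == "left_only"].drop(columns="_merge")
```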

4.4. Model Tuning

To limit cost and time, the hyperparameter optimization was restricted to 20% of users and 20% of items from the training and validation set. Then, we split this restricted dataset into training and validation parts using the rules described in the section above. For each model, 100 iterations of Bayesian optimization using Gaussian processes were run with scikit-optimize (https://scikit-optimize.github.io/stable/modules/generated/skopt.gp_minimize.html, accessed on 29 May 2024). The configuration file defining the hyperparameter search spaces is available in our code repository (https://github.com/rob-kwiec/olx-jobs-recommendations/blob/main/src/tuning/config.py, accessed on 29 May 2024). For example, for RP3Beta, we restricted the search for $\alpha$ and $\beta$ to the interval [0, 2].
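As an illustration, the sketch below shows such a tuning loop for RP3Beta's $\alpha$ and $\beta$ with scikit-optimize; the evaluation function is a hypothetical stand-in for training the model on the restricted training set and computing precision@10 on the validation split.

```python
from skopt import gp_minimize
from skopt.space import Real

def evaluate_precision_at_10(alpha, beta):
    # Hypothetical stand-in: the real pipeline trains RP3Beta with the given
    # hyperparameters and returns precision@10 on the validation split.
    return 1.0 / (1.0 + (alpha - 1.0) ** 2 + (beta - 0.5) ** 2)

space = [Real(0.0, 2.0, name="alpha"), Real(0.0, 2.0, name="beta")]

def objective(params):
    alpha, beta = params
    return -evaluate_precision_at_10(alpha, beta)   # gp_minimize minimizes

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("best (alpha, beta):", result.x, "score:", -result.fun)
```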
Recommendations in the optimization process were prepared for 30,000 users, and, as mentioned above, the optimization score was precision@10. Figure 4 shows the precision score depending on the parameters. It is visible that ALS did not benefit much from the optimization process, whereas for the other methods this optimization was crucial to improving precision. The hyperparameters selected for all models are presented in Table 6.

4.5. Model Evaluation

We evaluate our models in terms of accuracy, diversity, and efficiency.
The accuracy of a recommendation method on implicit feedback datasets is typically evaluated by examining which items from the list of top-k recommendations the user interacted with in the test set [8]. Many evaluation procedures compute the metric for each user separately and report the mean as the final evaluation score of the method [23]. The recommendation system may be considered a classifier, where the user's interactions in the test set are positive examples, and the top-k recommendations are items classified as positive by the model. Therefore, methods used for evaluating classifiers can be directly applied to recommender systems. For instance, the value of precision@k for a given user is the number of recommended items the user interacted with in the test set divided by the number k of items recommended to that user. Other metrics discussed in this work were described by Tamm et al. [23].
In the classifieds domain, primarily the jobs domain, only one user is usually needed to complete a transaction regarding a given ad. In this case, the diversity of recommendations is even more critical than in other domains (like music, movies, or other e-commerce, excluding classifieds). To assess this aspect, we use three measures: test coverage, Shannon entropy [56], and the Gini index [56]. Test coverage refers to the fraction of test items recommended to at least one user. With Shannon entropy and the Gini Index, we ignore items outside the test set. Greater test coverage and Shannon entropy values indicate higher diversity, whereas greater values of the Gini index indicate lower diversity.
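A minimal sketch of these three diversity measures is given below; it assumes the top-k lists have already been generated and may differ in small details from the implementation in our repository.

```python
import numpy as np

def diversity_metrics(recommendations, test_items):
    """recommendations: list of top-k item lists (one per user);
    test_items: set of items appearing in the test set."""
    # How often each test-set item was recommended.
    counts = {}
    for rec in recommendations:
        for item in rec:
            if item in test_items:
                counts[item] = counts.get(item, 0) + 1

    # Test coverage: fraction of test items recommended to at least one user.
    coverage = len(counts) / len(test_items)

    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()

    # Shannon entropy of the recommendation distribution (higher = more diverse).
    entropy = -(p * np.log(p)).sum()

    # Gini index of the same distribution (higher = less diverse).
    q = np.sort(p)
    n = len(q)
    gini = (2 * np.arange(1, n + 1) - n - 1) @ q / n
    return coverage, entropy, gini

print(diversity_metrics([["a", "b"], ["a", "c"], ["a", "d"]], {"a", "b", "c", "d", "e"}))
```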
The final laboratory evaluation step focuses on efficiency. For this purpose, we utilized an AWS SageMaker ml.m5.4xlarge instance equipped with 64 GB of RAM and 16 vCPUs, featuring an Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz, running on Amazon Linux AMI 2018.03. All methods were evaluated concerning three functions: data preprocessing, model fitting, and item recommendation. The execution time was calculated for the total time needed to compute the model (in hours). The memory peak was calculated in GB, and maximum memory usage was considered when executing a given function. This evaluation was performed using the tracemalloc library (https://docs.python.org/3/library/tracemalloc.html, accessed on 29 May 2024).
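The time and memory measurements themselves can be sketched as follows; fit_model() is a hypothetical placeholder for any of the three profiled functions.

```python
import time
import tracemalloc

def fit_model():
    # Hypothetical placeholder for data preprocessing, model fitting, or recommending.
    return [i * i for i in range(1_000_000)]

tracemalloc.start()
start = time.time()
fit_model()
elapsed_hours = (time.time() - start) / 3600
_, peak_bytes = tracemalloc.get_traced_memory()   # (current, peak) traced memory
tracemalloc.stop()

print(f"execution time: {elapsed_hours:.6f} h, memory peak: {peak_bytes / 1e9:.3f} GB")
```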

5. Laboratory Evaluation

This section describes the laboratory evaluation of the discussed methods on the developed dataset. We used the best-performing hyperparameters found in the previous section. Each model was trained on the entire training and validation dataset, and recommendations were produced for all 619,389 users from the test set. To put the values of each offline metric into context, we compared the described methods with simple, non-personalized baselines: the most popular item approach and random recommendations.

5.1. Accuracy

All metrics in this section consider the first ten recommendations, e.g., precision@10, recall@10. Higher metric values indicate a higher quality of recommendations. The values should not be directly compared with the results achieved on other datasets because metrics heavily depend on the distribution of the dataset (for example, high sparsity) and the train/test splitting strategy [55].

5.1.1. Evaluation of Accuracy

Table 7 presents the outcomes of the evaluation. Most of the observed differences are statistically significant due to the large number of evaluated users. We elaborate more on the statistical significance of the results in Section 5.1.2.
Our approaches significantly outperform the random and most popular item recommendations. The low performance of the most popular approach results from the nature of classifieds, primarily the job domain, where users are mostly interested in a specific location and category. Building more sophisticated, personalized recommendation systems is reasonable in such a scenario.
RP3Beta outperforms the other approaches in terms of accuracy metrics. We believe that this is related to the high sparsity of our dataset. RP3Beta calculates the recommendations deterministically (by leveraging paths of length three on the user–item bipartite graph). All other approaches utilize machine learning techniques to find user representations (ALS, LightFM), item representations (ALS, LightFM, Prod2Vec), or direct item–item similarities (SLIM), which might be less reliable when users or items have few interactions.

5.1.2. Statistical Comparison for Precision@10

We present a detailed statistical comparison for our main metric, precision@10, to identify differences between the methods. To begin with, we test the null hypothesis that all methods perform the same and that the observed differences are merely random (omnibus test). The Friedman test [57,58] with the Iman and Davenport [59] extension is probably the most popular omnibus test, and it is usually a good choice when comparing more than five different algorithms [60,61]. Let $R_{ij}$ be the rank of the $j$th of $K$ methods on the $i$th of $N$ samples and
$$R_j = \frac{1}{N} \sum_{i=1}^{N} R_{ij}.$$
In our case, the number of samples $N$ is the number of users in the test set, $N = 619{,}389$. The test compares the mean ranks of the methods and is based on the following statistic:
$$F_F = \frac{(N-1)\,\chi_F^2}{N(K-1) - \chi_F^2},$$
where
$$\chi_F^2 = \frac{12N}{K(K+1)} \sum_{j=1}^{K} R_j^2 - 3N(K+1)$$
is the Friedman statistic; $F_F$ follows the F distribution with $K-1$ and $(K-1)(N-1)$ degrees of freedom. The p-value from this test is equal to 0. The obtained p-value indicates that we can safely reject the null hypothesis that all algorithms perform the same. Therefore, we can proceed with post hoc tests to detect significant pairwise differences among the methods. Demšar [62] proposes the use of the Nemenyi test [63], which compares all algorithms pairwise. For a significance level $\alpha$, the test determines the critical difference (CD). If the difference between the average rankings of two algorithms is greater than
$$\mathrm{CD} = q_\alpha \sqrt{\frac{K(K+1)}{6N}},$$
the null hypothesis that the two algorithms have the same performance is rejected ($q_\alpha$ is based on the studentized range statistic divided by $\sqrt{2}$). Demšar [62] also proposed a plot (the CD plot) to visually check the differences: algorithms not joined by a line in the plot can be considered different.
In our case, with a significance level of $\alpha = 0.05$, any two algorithms with a difference in mean rank above 0.0114 are regarded as non-equal (Figure 5).
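The procedure can be sketched as follows; per-user scores are replaced by random toy data, and the $q_\alpha$ value is taken from Demšar's table rather than computed.

```python
import numpy as np
from scipy import stats

K = 7            # number of compared methods (5 models + 2 baselines)
N = 619_389      # number of users in the test set

# The real comparison uses an N x K array of per-user precision@10 values;
# random toy data is used here only so the sketch runs end to end.
rng = np.random.default_rng(0)
scores = rng.random((1_000, K))

# Friedman omnibus test (one sample per method).
chi2, p_value = stats.friedmanchisquare(*scores.T)

# Nemenyi critical difference for mean ranks at alpha = 0.05; q_alpha is the
# studentized range quantile divided by sqrt(2), ~2.949 for K = 7 (Demsar, 2006).
q_alpha = 2.949
cd = q_alpha * np.sqrt(K * (K + 1) / (6 * N))
print(f"critical difference = {cd:.4f}")   # ~0.0114
```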

5.1.3. Accuracy for Users with Different Numbers of Interactions

The only information we have about our users is their interactions. In Figure 6, we examine how the number of items a user has interacted with influences our main metric, precision@10. We divide our users into ten similarly sized groups based on the number of interactions.
We can see that the order of the models sorted by precision@10 does not depend on the group, except that SLIM performed better than RP3Beta for users with at least 22 interactions. The difference is statistically significant for each group of users (discussed in Section 5.1.4). The RP3Beta model can be seen as a particular case of the item-based collaborative filtering approach that uses a deterministic similarity measure between items [64]. On the other hand, SLIM uses machine learning methods to learn the similarities. We suppose that such learned similarity scores are less biased but have higher variance, which leads to better performance for users with many interactions because averaging is done over a larger number of scores.

5.1.4. Statistical Evaluation of Accuracy for Users with Different Numbers of Interactions

In this subsection, we discuss the difference between the RP3Beta and SLIM models depending on the number of user interactions, as observed in Figure 6.
To compare two methods over multiple datasets statistically, Demšar [62] recommends the Wilcoxon signed-rank test [65]. The Wilcoxon signed-rank test is a non-parametric alternative to the paired t test, which ranks the differences in performances of two algorithms for each dataset, ignoring the signs. It compares the ranks of the positive and negative differences.
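Within each user group, the comparison reduces to a paired test on per-user precision@10 values; a minimal sketch with toy data is shown below.

```python
import numpy as np
from scipy import stats

# Toy per-user precision@10 values for two methods within one user group;
# in the real comparison, one test is run per interaction-count group.
rng = np.random.default_rng(0)
rp3beta = rng.integers(0, 4, size=200) / 10   # precision@10 takes values k/10
slim = rng.integers(0, 4, size=200) / 10

stat, p_value = stats.wilcoxon(rp3beta, slim)
print(p_value)
```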
Table 8 shows that the difference between the RP3Beta and SLIM models is statistically significant for each user group.

5.2. Diversity

In Table 9, we can observe that Prod2Vec and LightFM provide the greatest diversity after excluding random recommendations (which should have the greatest diversity). The diversities of the most accurate models, RP3Beta, SLIM, and ALS, are similar, except for the case of the lower test coverage of ALS. The relation between test coverage and precision@10 is shown in Figure 7.

5.3. Overlap of Methods

In our research, we focus on methods that are worth testing with real users. Therefore, we studied the overlap between recommendations offered by the implemented methods concerning the user–item pairs (Overlap Coefficient [66]). We expect more significant differences in recommendations when comparing methods with a low overlap.
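The measure itself is straightforward; a minimal sketch over sets of recommended (user, item) pairs:

```python
def overlap_coefficient(pairs_a, pairs_b):
    """Overlap coefficient between two sets of recommended (user, item) pairs."""
    a, b = set(pairs_a), set(pairs_b)
    return len(a & b) / min(len(a), len(b))

# Toy example: two models' top-k lists flattened to (user, item) pairs.
model_a = {(1, "i1"), (1, "i2"), (2, "i3")}
model_b = {(1, "i2"), (2, "i3"), (2, "i4")}
print(overlap_coefficient(model_a, model_b))   # 2 / 3 ~ 0.67
```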
The results of this experiment are reported in Table 10. RP3Beta and SLIM offer similar recommendations; although they are computed differently, both use a sparse item–item similarity matrix. ALS has a medium overlap with the other methods. The methods most distinct from the overlap perspective are also the most diverse models, namely Prod2Vec and LightFM. The low overlap between ALS and LightFM emphasizes the importance of the choice of loss function in matrix factorization approaches.

5.4. Evaluation of Efficiency

The results of the efficiency evaluation are presented in Figure 8. RP3Beta, ALS, and LightFM outperform SLIM and Prod2Vec regarding execution time, whereas Prod2Vec requires the least memory. To retrain the model frequently, we set the maximum acceptable execution time to 3 h, which was satisfied by ALS, LightFM, and RP3Beta. From a cost perspective, the acceptable memory utilization was set to 110 GB, which all models fulfilled. We can observe that our models could be trained even on a single laptop.

5.5. Summary

The laboratory research enabled us to select models to be tested online with OLX users. Firstly, we chose the RP3Beta model for online tests because it is the best approach regarding accuracy and execution time. The relatively high memory consumption is still low enough for the model to be tested in production. Investigating the impact of this method’s relatively low diversity may be an exciting line of future research.
SLIM is the second-best model in terms of precision@10. Nevertheless, we did not select this method for online evaluation due to its low efficiency and high overlap with the already selected RP3Beta model. Improving the efficiency of the SLIM implementation might change this decision.
Among other models, the highest precision@10 is achieved by ALS. Additionally, ALS was already implemented and used for job recommendations at OLX; therefore, we decided to include it in the online comparison.
Even though Prod2Vec and LightFM produce the most diverse recommendations, we decided not to test them online due to their significantly worse precision@10 and, in the case of Prod2Vec, efficiency.

6. Online Evaluation

To evaluate the effectiveness of the selected methods, we conducted two online A/B tests with OLX Jobs users. We focused on users who had recently visited job ads on our platform. We sent them emails with ten recommended job ads, and when a user had installed an OLX application, a push notification with one recommended job ad was also sent. Then, we observed whether the user applied for any job within the next 48 h (converted user).

6.1. Impact of the ALS Model

The first experiment aimed to answer the question: does the introduction of the recommendation method increase the number of users applying for jobs? In this test, we split the users into two groups: a control group without recommendations and a group with recommendations generated using the ALS model. Only 10% of users were assigned to the control group because the ALS model was already implemented and used for job recommendations at OLX before the test. We preferred to keep the control sample minimal and run a more extended test to observe the longer-term impact on our users. The test was carried out for 25 days in March 2021. The results are reported in Table 11. The statistical significance of the advantage of ALS over the control group was examined using the chi-squared test, and the p-value was found to equal 0. This experiment confirmed our hypothesis that recommendations affect activity, as the percentage of converted users after receiving a communication increased significantly, by $(16.83\% - 15.98\%)/15.98\% \approx 5.3\%$. The observed rise translates into hundreds of additional users applying for jobs through OLX Jobs daily.
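The significance check can be sketched as follows; the absolute counts below are illustrative (chosen only to reproduce the reported conversion rates) and are not the real experiment sizes.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative contingency table: [converted, not converted] per variant.
control = np.array([15_980, 84_020])      # ~15.98% conversion
als = np.array([168_300, 831_700])        # ~16.83% conversion

chi2, p_value, dof, expected = chi2_contingency(np.vstack([control, als]))
control_rate = control[0] / control.sum()
als_rate = als[0] / als.sum()
lift = (als_rate - control_rate) / control_rate
print(f"p-value = {p_value:.3g}, relative lift = {lift:.1%}")   # lift ~ 5.3%
```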

6.2. Comparison of ALS and RP3Beta Models

In the second experiment, we tested whether the best model from laboratory testing outperformed ALS. We split our users into three groups receiving recommendations from different systems: ALS, RP3Beta, and a mixed variant in which half of the recommendations were generated with ALS and the other half with RP3Beta. The second experiment was conducted for 28 days in March and April 2021. The results are presented in Table 12. The percentage of converted users is calculated here relative to all users; however, changing the recommendation system did not affect users who did not open the message, so we filtered these users out, as their behavior added unnecessary noise to our data. Table 13 presents the impact on users who opened the message. The advantage of RP3Beta over ALS is statistically significant (p-value $< 10^{-6}$). The difference between the RP3Beta and ALS+RP3Beta variants is not statistically significant (p-value of 0.19).
Out of the users who opened the message, we observed 1.28 pp (percentage points) more converted users for RP3Beta than for ALS. Since the message open rate was 13.2%, this is equivalent to an increase of $1.28 \times 0.132 \approx 0.17$ pp over all targeted users. Based on the first experiment's results, ALS brings a 0.85 pp increase over the control group. This means that replacing ALS with RP3Beta increases our impact on all targeted users by approximately $0.17 / 0.85 \approx 20\%$.

6.3. Discussion

We have proved that sending job recommendations from the ALS model increases the number of users responding to job ads by more than 5%. Moreover, RP3Beta significantly outperforms ALS.
Additionally, we carried out another online A/B test comparing the RP3Beta model with a variant where we replaced half of the RP3Beta recommendations with recommendations generated from the deep learning hybrid recommender system utilized at OLX (https://tech.olx.com/item2vec-neural-item-embeddings-to-enhance-recommendations-1fd948a6f293, accessed on 29 May 2024). We decided not to describe that test in this paper, as this work focuses on collaborative filtering approaches. We also wanted to keep our results reproducible on the published dataset. Unfortunately, we could not publish all of the features utilized by our internal model. We did not observe any improvements over RP3Beta, even though the second model utilizes additional knowledge about job ads. Hence, we confirmed that a properly determined and tuned simple model might outperform more advanced deep solutions. This is consistent with the results from Dacrema et al. [64] and Anelli et al. [30], who compared several non-neural approaches, including RP3Beta, with more advanced neural models.
The efficiency of the RP3Beta model enables OLX Jobs to train it from scratch multiple times per day at very low cost (on a CPU). This collaborative filtering model is currently used to generate email and push recommendations for our users. We recently adapted it to serve on-site recommendations in real time, considering recent user interactions with items [67]. We also proposed the P3 learning-to-rank (P3LTR) model [68], a generalization of RP3Beta in which we learn the importance of user–item relations based on the types and timestamps of past interactions.
In this work, we have addressed the research gap of the lack of comparisons of classical collaborative filtering approaches in the job domain by providing comprehensive laboratory research on a real-world dataset and an online evaluation with millions of users. Additionally, we have made the source code for offline assessment and the dataset publicly available. We encourage other researchers to develop new recommendation methods to help users find their ideal job more quickly.
The methods considered in this work only require (user, item, timestamp) triplets as input data for training and evaluation; hence, they can be applied across several domains, even for datasets consisting of tens of millions of interactions. We expect good results in the classifieds domain based on our initial results obtained in other categories of OLX.

7. Summary

This paper has addressed the topic of job recommendations in online classifieds, using the example of OLX Jobs. We performed an extensive evaluation of approaches representing different families of recommendation methods.
During the laboratory research, we demonstrated that RP3Beta, SLIM, and ALS perform significantly better than LightFM and Prod2Vec. We also found that Prod2Vec offers the greatest diversity, which may be important when working on the novelty of recommendations. Moreover, we examined how similar the recommendations generated by different models are.
We also conducted an evaluation by sending millions of messages to online users. A/B tests were carried out using ALS and RP3Beta. The results of these tests demonstrated that sending job recommendations generated by these models increases the number of users contacting advertisers from OLX Jobs. With RP3Beta available in the production environment, we improved our effectiveness by 20% (when considering user reactions).
When evaluating the methods from the classifieds perspective, a model recommending the most popular (as yet unseen) items was unsuitable for job recommendations. A recommendation model based on random recommendations was also added for comparison to demonstrate the advantage of the researched methods.
Another significant result of this research is the dataset, which will enable further research on recommendations for classified needs.
One of the weaknesses of our work is the lack of a comparison against deep neural networks (DNNs) or graph neural networks (GNNs). For such methods, satisfying our time constraints without more expensive infrastructure (e.g., with a GPU) is challenging. Relaxing our constraints or comparing against the most efficient DNNs or GNNs would be interesting future work. Another weakness is that the comparison is performed on a single dataset. The size of the dataset has a significant impact on method selection, so it would be interesting to benchmark the same methods on datasets of different sizes, either subsets of the original dataset or other datasets from the job domain (e.g., from different markets in which OLX operates). Additionally, it would be interesting to explore the correlation between diversity and online results, aiming to develop an offline metric that better aligns with business goals.

Author Contributions

Conceptualization, R.K.; methodology, R.K. and T.G.; software, R.K. and V.D.; validation, R.K., V.D., T.G. and A.F.; formal analysis, R.K. and T.G.; investigation, R.K. and T.G.; resources, R.K.; data curation, R.K. and V.D.; writing—original draft preparation, R.K., A.F. and T.G.; writing—review and editing, R.K., A.F. and T.G.; visualization, R.K. and T.G.; supervision, T.G. and A.F.; project administration, R.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Conflicts of Interest

Authors Robert Kwieciński, Agata Filipowska and Viacheslav Dubrov were employed by the company OLX Group. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Davis, M. What are Online Classifieds? Big Easy Magazine, 4 January 2021. [Google Scholar]
  2. Jones, S. Online Classifieds; Pew Internet & American Life Project: Washington, DC, USA, 2009. [Google Scholar]
  3. Gourville, J.; Soman, D. Overchoice and Assortment Type: When and Why Variety Backfires. Mark. Sci. 2005, 24, 382–395. [Google Scholar] [CrossRef]
  4. Chernev, A. When More Is Less and Less Is More: The Role of Ideal Point Availability and Assortment in Consumer Choice. J. Consum. Res. 2003, 30, 170–183. [Google Scholar] [CrossRef]
  5. Boatwright, P.; Nunes, J. Reducing Assortment: An Attribute-Based Approach. J. Mark. 2001, 65, 50–63. [Google Scholar] [CrossRef]
  6. Gómez-Uribe, C.; Hunt, N. The Netflix Recommender System. ACM Trans. Manag. Inf. Syst. 2015, 6, 1–19. [Google Scholar] [CrossRef]
  7. Smith, B.; Linden, G. Two Decades of Recommender Systems at Amazon.com. IEEE Internet Comput. 2017, 21, 12–18. [Google Scholar] [CrossRef]
  8. Aggarwal, C.C. Recommender Systems: The Textbook, 1st ed.; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  9. Li, S.; Karahanna, E. Online Recommendation Systems in a B2C E-Commerce Context: A Review and Future Directions. J. Assoc. Inf. Syst. 2015, 16, 72–107. [Google Scholar] [CrossRef]
  10. Twardowski, B. Modelling Contextual Information in Session-Aware Recommender Systems with Neural Networks. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, 15–19 September 2016; pp. 273–276. [Google Scholar] [CrossRef]
  11. Alotaibi, S. A survey of job recommender systems. Int. J. Phys. Sci. 2012, 7, 5127–5142. [Google Scholar] [CrossRef]
  12. Abel, F.; Benczúr, A.; Kohlsdorf, D.; Larson, M.; Pálovics, R. RecSys Challenge’16: Proceedings of the Recommender Systems Challenge; Association for Computing Machinery: New York, NY, USA, 2016. [Google Scholar]
  13. Abel, F.; Deldjoo, Y.; Elahi, M.; Kohlsdorf, D. RecSys Challenge’17: Proceedings of the Recommender Systems Challenge 2017; Association for Computing Machinery: New York, NY, USA, 2017. [Google Scholar]
  14. Li, S.; Shi, B.; Yang, J.; Yan, J.; Wang, S.; Chen, F.; He, Q. Deep Job Understanding at LinkedIn. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; pp. 2145–2148. [Google Scholar] [CrossRef]
  15. Zhu, G.; Chen, Y.; Wang, S. Graph-Community-Enabled Personalized Course-Job Recommendations with Cross-Domain Data Integration. Sustainability 2022, 14, 7439. [Google Scholar] [CrossRef]
  16. Lacic, E.; Reiter-Haas, M.; Kowald, D.; Reddy Dareddy, M.; Cho, J.; Lex, E. Using autoencoders for session-based job recommendations. User Model. User-Adapt. Interact. 2020, 30, 617–658. [Google Scholar] [CrossRef]
  17. Pazzani, M.J.; Billsus, D. Content-Based Recommendation Systems. In The Adaptive Web: Methods and Strategies of Web Personalization; Springer: Berlin/Heidelberg, Germany, 2007; pp. 325–341. [Google Scholar] [CrossRef]
  18. Chen, R.; Hua, Q.; Chang, Y.S.; Wang, B.; Zhang, L.; Kong, X. A Survey of Collaborative Filtering-Based Recommender Systems: From Traditional Methods to Hybrid Methods Based on Social Networks. IEEE Access 2018, 6, 64301–64320. [Google Scholar] [CrossRef]
  19. Wang, X.; He, X.; Wang, M.; Feng, F.; Chua, T.S. Neural Graph Collaborative Filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 165–174. [Google Scholar] [CrossRef]
  20. Schafer, J.B.; Frankowski, D.; Herlocker, J.; Sen, S. Collaborative filtering recommender systems. In The Adaptive Web; Springer: Berlin/Heidelberg, Germany, 2007; pp. 291–324. [Google Scholar] [CrossRef]
  21. Takács, G.; Pilászy, I.; Németh, B.; Tikk, D. Scalable Collaborative Filtering Approaches for Large Recommender Systems. J. Mach. Learn. Res. 2009, 10, 623–656. [Google Scholar] [CrossRef]
  22. Jawaheer, G.; Szomszor, M.; Kostkova, P. Comparison of implicit and explicit feedback from an online music recommendation service. In Proceedings of the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems, Barcelona, Spain, 26 September 2010; pp. 47–51. [Google Scholar] [CrossRef]
  23. Tamm, Y.M.; Damdinov, R.; Vasilev, A. Quality Metrics in Recommender Systems: Do We Calculate Metrics Consistently? In Proceedings of the Fifteenth ACM Conference on Recommender Systems, Amsterdam, The Netherlands, 27 September–1 October 2021; pp. 708–713. [Google Scholar] [CrossRef]
  24. Kaminskas, M.; Bridge, D. Diversity, Serendipity, Novelty, and Coverage: A Survey and Empirical Analysis of Beyond-Accuracy Objectives in Recommender Systems. ACM Trans. Interact. Intell. Syst. 2016, 7, 1–42. [Google Scholar] [CrossRef]
  25. Kunaver, M.; Pozrl, T. Diversity in Recommender Systems, A Survey. Knowl.-Based Syst. 2017, 123, 154–162. [Google Scholar] [CrossRef]
  26. Ge, M.; Delgado-Battenfeld, C.; Jannach, D. Beyond Accuracy: Evaluating Recommender Systems by Coverage and Serendipity. In Proceedings of the Fourth ACM Conference on Recommender Systems, Barcelona, Spain, 26–30 September 2010; pp. 257–260. [Google Scholar] [CrossRef]
  27. Mogenet, A.; Pham, T.A.N.; Kazama, M.; Kong, J. Predicting Online Performance of Job Recommender Systems with Offline Evaluation. In Proceedings of the 13th ACM Conference on Recommender Systems, Copenhagen, Denmark, 16–20 September 2019; pp. 477–480. [Google Scholar] [CrossRef]
  28. Anelli, V.W.; Bellogín, A.; Di Noia, T.; Jannach, D.; Pomo, C. Top-N Recommendation Algorithms: A Quest for the State-of-the-Art. In Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization, Barcelona, Spain, 4–7 July 2022; pp. 121–131. [Google Scholar] [CrossRef]
  29. Dacrema, M.F.; Boglio, S.; Cremonesi, P.; Jannach, D. A Troubling Analysis of Reproducibility and Progress in Recommender Systems Research. ACM Trans. Inf. Syst. 2021, 39, 1–49. [Google Scholar] [CrossRef]
  30. Anelli, V.W.; Bellogín, A.; Di Noia, T.; Pomo, C. Reenvisioning the Comparison between Neural Collaborative Filtering and Matrix Factorization. In Proceedings of the Fifteenth ACM Conference on Recommender Systems, Amsterdam, The Netherlands, 27 September–1 October 2021; pp. 521–529. [Google Scholar] [CrossRef]
  31. Zhu, J.; Dai, Q.; Su, L.; Ma, R.; Liu, J.; Cai, G.; Xiao, X.; Zhang, R. BARS: Towards Open Benchmarking for Recommender Systems. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 2912–2923. [Google Scholar] [CrossRef]
  32. Dong, Y.; Li, J.; Schnabel, T. When Newer is Not Better: Does Deep Learning Really Benefit Recommendation From Implicit Feedback? In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023; pp. 942–952. [Google Scholar] [CrossRef]
  33. Anelli, V.W.; Malitesta, D.; Pomo, C.; Bellogin, A.; Di Sciascio, E.; Di Noia, T. Challenging the Myth of Graph Collaborative Filtering: A Reasoned and Reproducibility-driven Analysis. In Proceedings of the 17th ACM Conference on Recommender Systems, Singapore, 18–22 September 2023; pp. 350–361. [Google Scholar] [CrossRef]
  34. Koren, Y.; Bell, R.; Volinsky, C. Matrix Factorization Techniques for Recommender Systems. Computer 2009, 42, 30–37. [Google Scholar] [CrossRef]
  35. Kula, M. Metadata embeddings for user and item cold-start recommendations. arXiv 2015, arXiv:1507.08439. [Google Scholar]
  36. Rendle, S.; Freudenthaler, C.; Gantner, Z.; Schmidt-Thieme, L. BPR: Bayesian Personalized Ranking from Implicit Feedback. arXiv 2012, arXiv:1205.2618. [Google Scholar]
  37. Weston, J.; Bengio, S.; Usunier, N. Wsabie: Scaling up to large vocabulary image annotation. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, 16–22 July 2011; pp. 2764–2770. [Google Scholar]
  38. Weston, J.; Yee, H.; Weiss, R.J. Learning to Rank Recommendations with the K-Order Statistic Loss. In Proceedings of the 7th ACM Conference on Recommender Systems, Hong Kong, China, 12–16 October 2013; pp. 245–248. [Google Scholar] [CrossRef]
  39. Hu, Y.; Koren, Y.; Volinsky, C. Collaborative Filtering for Implicit Feedback Datasets. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 263–272. [Google Scholar] [CrossRef]
  40. Takács, G.; Pilászy, I.; Tikk, D. Applications of the Conjugate Gradient Method for Implicit Feedback Collaborative Filtering. In Proceedings of the Fifth ACM Conference on Recommender Systems, Chicago, IL, USA, 23–27 October 2011; pp. 297–300. [Google Scholar] [CrossRef]
  41. Ning, X.; Karypis, G. SLIM: Sparse Linear Methods for Top-N Recommender Systems. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining, Vancouver, BC, Canada, 11–14 December 2011; pp. 497–506. [Google Scholar] [CrossRef]
  42. Christoffel, F.; Paudel, B.; Newell, C.; Bernstein, A. Blockbusters and Wallflowers: Accurate, Diverse, and Scalable Recommendations with Random Walks. In Proceedings of the 9th ACM Conference on Recommender Systems, Vienna, Austria, 16–20 September 2015; pp. 163–170. [Google Scholar] [CrossRef]
  43. Cooper, C.; Lee, S.H.; Radzik, T.; Siantos, Y. Random Walks in Recommender Systems: Exact Computation and Simulations. In Proceedings of the 23rd International Conference on World Wide Web, New York, NY, USA, 7–11 April 2014; pp. 811–816. [Google Scholar] [CrossRef]
  44. Grbovic, M.; Radosavljevic, V.; Djuric, N.; Bhamidipati, N.; Savla, J.; Bhagwan, V.; Sharp, D. E-Commerce in Your Inbox: Product Recommendations at Scale. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10–13 August 2015; pp. 1809–1818. [Google Scholar] [CrossRef]
  45. Barkan, O.; Koenigstein, N. ITEM2VEC: Neural item embedding for collaborative filtering. In Proceedings of the 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), Vietri sul Mare, Italy, 13–16 September 2016; pp. 1–6. [Google Scholar] [CrossRef]
  46. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems—Volume 2, Red Hook, NY, USA, 5–10 December 2013; pp. 3111–3119. [Google Scholar]
  47. De Ruijt, C.; Bhulai, S. Job recommender systems: A review. arXiv 2021, arXiv:2111.13576. [Google Scholar]
  48. Kula, M. LightFM—Github Repository. 2023. Available online: https://github.com/lyst/lightfm (accessed on 29 May 2024).
  49. Frederickson, B. ALS—Github Repository. 2022. Available online: https://github.com/benfred/implicit/blob/main/implicit/als.py (accessed on 29 May 2024).
  50. Levy, M.; Grisel, O. SLIM—Github Repository. 2014. Available online: https://github.com/Mendeley/mrec/blob/master/mrec/item_similarity/slim.py (accessed on 29 May 2024).
  51. Quadrana, M. SLIM—Github Repository. 2021. Available online: https://github.com/MaurizioFD/RecSys2019_DeepLearning_Evaluation/blob/master/SLIM_ElasticNet/SLIMElasticNetRecommender.py (accessed on 29 May 2024).
  52. Bernardis, C. RP3Beta—Github Repository. 2021. Available online: https://github.com/MaurizioFD/RecSys2019_DeepLearning_Evaluation/blob/master/GraphBased/RP3betaRecommender.py (accessed on 29 May 2024).
  53. Paudel, B.; Christoffel, F.; Newell, C.; Bernstein, A. Updatable, Accurate, Diverse, and Scalable Recommendations for Interactive Applications. ACM Trans. Interact. Intell. Syst. 2016, 7, 1–34. [Google Scholar] [CrossRef]
  54. Řehůřek, R.; Sojka, P. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, 22 May 2010; pp. 45–50, ISBN 2-9517408-6-7. Available online: http://is.muni.cz/publication/884893/en (accessed on 29 May 2012).
  55. Meng, Z.; McCreadie, R.; Macdonald, C.; Ounis, I. Exploring Data Splitting Strategies for the Evaluation of Recommendation Models. In Proceedings of the 14th ACM Conference on Recommender Systems, Virtual, 22–26 September 2020; pp. 681–686. [Google Scholar] [CrossRef]
  56. Shani, G.; Gunawardana, A. Evaluating Recommendation Systems. In Recommender Systems Handbook; Springer: Boston, MA, USA, 2011; pp. 257–297. [Google Scholar] [CrossRef]
  57. Friedman, M. The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. J. Am. Stat. Assoc. 1937, 32, 675–701. [Google Scholar] [CrossRef]
  58. Friedman, M. A Comparison of Alternative Tests of Significance for the Problem of m Rankings. Ann. Math. Stat. 1940, 11, 86–92. [Google Scholar] [CrossRef]
  59. Iman, R.L.; Davenport, J.M. Approximations of the critical region of the Friedman statistic. Commun. Stat.-Theory Methods 1980, 9, 571–595. [Google Scholar] [CrossRef]
  60. García, S.; Herrera, F. An extension on “Statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. J. Mach. Learn. Res. 2008, 9, 2677–2694. [Google Scholar]
  61. García, S.; Fernández, A.; Luengo, J.; Herrera, F. Advanced Nonparametric Tests for Multiple Comparisons in the Design of Experiments in Computational Intelligence and Data Mining: Experimental Analysis of Power. Inf. Sci. 2010, 180, 2044–2064. [Google Scholar] [CrossRef]
  62. Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
  63. Nemenyi, P. Distribution-Free Multiple Comparisons. Ph.D. Thesis, Princeton University, Princeton, NJ, USA, 1963. [Google Scholar]
  64. Dacrema, M.F.; Cremonesi, P.; Jannach, D. Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches. In Proceedings of the 13th ACM Conference on Recommender Systems, Copenhagen, Denmark, 16–20 September 2019; pp. 101–109. [Google Scholar] [CrossRef]
  65. Wilcoxon, F. Individual Comparisons by Ranking Methods. Biom. Bull. 1945, 1, 80–83. [Google Scholar] [CrossRef]
  66. Vijaymeena, M.; Kavitha, K. A Survey on Similarity Measures in Text Mining. Mach. Learn. Appl. Int. J. 2016, 3, 19–28. [Google Scholar] [CrossRef]
  67. Kwieciński, R.; Melniczak, G.; Górecki, T. Comparison of Real-Time and Batch Job Recommendations. IEEE Access 2023, 11, 20553–20559. [Google Scholar] [CrossRef]
  68. Kwieciński, R.; Górecki, T.; Filipowska, A. Learning edge importance in bipartite graph-based recommendations. In Proceedings of the 2022 17th Conference on Computer Science and Intelligence Systems (FedCSIS), Sofia, Bulgaria, 4–7 September 2022; pp. 227–233. [Google Scholar] [CrossRef]
Figure 1. Number of interactions per user. Source: own study.
Figure 2. Number of interactions per item. Source: own study.
Figure 3. Dataset preparation (split into training–validation and testing, depending on the number of days). Source: own study.
Figure 4. Precision for each model, depending on the parameters. Source: own study.
Figure 5. Critical difference plot (for precision). Source: own study.
Figure 6. Precision@10 for each model, depending on the number of items with which the user interacted. It is best viewed in color. Source: own study.
Figure 7. Test coverage versus precision@10 for considered models. Source: own study.
Figure 8. Execution time and memory utilization. It is best viewed in color. Source: own study.
Table 1. Example rows from the dataset.
UserItemEventTimestamp
1745587168661click1582216025
84300862838click1582485868
1428530469bookmark1582247367
114294480122click1581805847
283565923728chat_click1582397836
Source: own study.
Table 2. Distribution of events.
Action | Frequency (%)
click | 89.794
contact_phone_click_1 | 2.628
bookmark | 2.511
chat_click | 2.136
contact_chat | 1.448
contact_partner_click | 0.701
contact_phone_click_2 | 0.679
contact_phone_click_3 | 0.103
Source: own study.
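For readers who want to recompute this distribution from the published dataset, the following minimal sketch assumes the interactions are stored in a CSV file with the four columns shown in Table 1; the file name and column names are assumptions, not the official layout of the released dataset.

```python
import pandas as pd

# Minimal sketch: recomputing the event distribution of Table 2.
# The file name and column names are assumptions based on Table 1.
interactions = pd.read_csv(
    "olx_jobs_interactions.csv",
    names=["user", "item", "event", "timestamp"],
)

event_share = interactions["event"].value_counts(normalize=True) * 100
print(event_share.round(3))  # should resemble the percentages in Table 2
```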
Table 3. Comparison between job recommendation datasets. The letter D denotes the standard deviation.
 | OLX Jobs Interactions | CareerBuilder12 | RecSys16 | RecSys17
Interactions | 65.50 M | 1.60 M | 8.83 M | 8.27 M
Unique interactions | 47.17 M | 1.60 M | - | -
Users | 3.30 M | 0.32 M | 1.37 M | 1.50 M
Average number of unique interactions per user | 14.31 (D = 29.23) | 4.99 (D = 11.42) | <6.46 | <5.51
Items | 0.19 M | 0.37 M | 1.36 M | 1.31 M
Average number of unique interactions per item | 254.42 (D = 426.02) | 4.38 (D = 8.19) | <6.50 | <6.31
Density of the interaction matrix (%) | 0.0077% | 0.0014% | <0.0005% | <0.0004%
Types of interactions | 8 | 1 | 4 | 5
Timestamps
User features
Item features
Source: own study.
Table 4. Methods re-implemented for the research.
Method | Family of Methods | Source | Difference from the Original Method
LightFM | Matrix factorization | Implementation based on: [48]. Supporting paper: [35]. | None
ALS | Matrix factorization | Implementation based on: [49]. Supporting papers: [39,40]. | None
SLIM | Neighborhood-based | Sources that inspired our implementation: [50,51]. Supporting paper: [41]. | None
RP3Beta | Graph-based | Source that inspired our implementation: [52]. Supporting papers: [29,53]. | Performed direct computations on sparse matrices instead of random walks approximation.
Prod2Vec | Word2Vec | Implementation based on: [54]. Supporting papers: [44,45]. | Sequences of interactions, ordered by a timestamp. Representing the user as an average of the representations of interacted items. We also experimented with CBOW (continuous bag of words).
Source: own study.
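To illustrate the modification listed for RP3Beta in Table 4, the sketch below computes the item-item similarity matrix directly with sparse-matrix products instead of sampling random walks. It follows the formulation of Paudel et al. [53], but it is only an illustration under assumed default parameters, not the exact code used in our experiments.

```python
import numpy as np
import scipy.sparse as sp

def rp3beta_similarity(interactions, alpha=0.6, beta=0.15):
    """Illustrative RP3Beta item-item similarity computed directly on
    sparse matrices (no random-walk sampling), following [53].
    `interactions` is a sparse binary user-item matrix."""
    A = sp.csr_matrix(interactions, dtype=np.float64)

    # Row-normalized transition matrices: user -> item and item -> user.
    P_ui = (sp.diags(1.0 / np.maximum(A.sum(axis=1).A1, 1.0)) @ A).tocsr()
    P_iu = (sp.diags(1.0 / np.maximum(A.T.sum(axis=1).A1, 1.0)) @ A.T).tocsr()

    # Raise the transition probabilities to the power alpha.
    P_ui.data **= alpha
    P_iu.data **= alpha

    # Item-to-item transition through a common user (item -> user -> item),
    # expressed as a single sparse matrix product.
    W = (P_iu @ P_ui).tocsr()

    # Penalize popular items: divide each column by its popularity^beta.
    popularity = np.maximum(A.sum(axis=0).A1, 1.0)
    W = W @ sp.diags(1.0 / popularity ** beta)
    return W.tocsr()

# Recommendation scores for all users are then obtained as
# interactions @ rp3beta_similarity(interactions).
```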
Table 5. Comparison between training/validation and test sets. The letter D denotes standard deviation.
 | OLX Jobs Interactions | Training and Validation Set | Test Set
Interactions | 65.50 M | 52.40 M | 6.11 M
Unique interactions | 47.17 M | 38.25 M | 6.11 M
Users | 3.30 M | 2.83 M | 0.62 M
Average number of unique interactions per user | 14.31 (D = 29.23) | 13.50 (D = 26.11) | 9.87 (D = 14.38)
Items | 0.19 M | 0.18 M | 0.13 M
Average number of unique interactions per item | 254.42 (D = 426.02) | 214.05 (D = 365.93) | 47.51 (D = 83.84)
Density of the interaction matrix (%) | 0.0077% | 0.0076% | 0.0077%
Table 6. Model hyperparameters.
Model | Model Hyperparameters
ALS | {'factors': 357, 'regularization': 0.001, 'iterations': 20, 'event_weights_multiplier': 63}
LightFM | {'no_components': 512, 'k': 3, 'n': 20, 'learning_schedule': 'adadelta', 'loss': 'warp', 'max_sampled': 61, 'epochs': 11}
Prod2Vec | {'vector_size': 168, 'alpha': 0.028728, 'window': 20, 'min_count': 16, 'sample': 0.002690026, 'min_alpha': 0.0, 'sg': 1, 'hs': 1, 'negative': 200, 'ns_exponent': -0.16447846705441527, 'cbow_mean': 0, 'epochs': 22}
RP3Beta | {'alpha': 0.61447198, 'beta': 0.1443548}
SLIM | {'alpha': 0.00181289, 'l1_ratio': 0.0, 'iterations': 3}
Source: own study.
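Most of these dictionaries mirror constructor arguments of the underlying libraries [48,49,54]. The sketch below shows a plausible instantiation of the models with these values; it is an illustration, not the exact training pipeline used in the experiments, and event_weights_multiplier belongs to our own preprocessing (it is not an argument of the implicit library).

```python
from implicit.als import AlternatingLeastSquares
from lightfm import LightFM
from gensim.models import Word2Vec

# Illustrative mapping of the Table 6 hyperparameters onto the libraries.

als = AlternatingLeastSquares(factors=357, regularization=0.001, iterations=20)
# 'event_weights_multiplier': 63 is assumed to be applied when building the
# weighted interaction (confidence) matrix before calling als.fit().

lightfm = LightFM(
    no_components=512,
    k=3,
    n=20,
    learning_schedule="adadelta",
    loss="warp",
    max_sampled=61,
)
# lightfm.fit(train_matrix, epochs=11)

prod2vec = Word2Vec(
    vector_size=168,
    alpha=0.028728,
    window=20,
    min_count=16,
    sample=0.002690026,
    min_alpha=0.0,
    sg=1,
    hs=1,
    negative=200,
    ns_exponent=-0.16447846705441527,
    cbow_mean=0,
    epochs=22,
)
# prod2vec.build_vocab(user_sessions)
# prod2vec.train(user_sessions, total_examples=prod2vec.corpus_count, epochs=22)
```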
Table 7. Accuracy: evaluation results. All presented metrics were described by Tamm et al. [23]. For the mAP metric, we used the variant with x = min(k, r).
Metric | RP3Beta | SLIM | ALS | Prod2Vec | LightFM | Most Popular | Random
precision | 0.0484 | 0.0472 | 0.0434 | 0.0368 | 0.0359 | 0.0012 | 0.00006
recall | 0.0783 | 0.0736 | 0.0657 | 0.0580 | 0.0564 | 0.0012 | 0.00005
ndcg | 0.0759 | 0.0721 | 0.0657 | 0.0567 | 0.0545 | 0.0016 | 0.00007
mAP | 0.0393 | 0.0365 | 0.0329 | 0.0282 | 0.0264 | 0.0006 | 0.00002
MRR | 0.1365 | 0.1314 | 0.1230 | 0.1065 | 0.1034 | 0.0038 | 0.00019
LAUC | 0.5391 | 0.5368 | 0.5328 | 0.5289 | 0.5281 | 0.5006 | 0.49999
HR | 0.3131 | 0.3066 | 0.2878 | 0.2537 | 0.2547 | 0.0112 | 0.00059
Source: own study.
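To clarify the mAP variant mentioned in the caption, the sketch below divides the sum of precisions at hit positions by x = min(k, r), where r is the number of the user's relevant test items. It is an illustrative implementation consistent with [23], not our evaluation code.

```python
def average_precision_at_k(recommended, relevant, k=10):
    """Illustrative AP@k with the normalization x = min(k, r) used for the
    mAP values in Table 7 (r = number of relevant test items of the user)."""
    relevant = set(relevant)
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    denominator = min(k, len(relevant))
    return precision_sum / denominator if denominator else 0.0

# mAP@10 is this value averaged over all test users, e.g.:
# map_at_10 = sum(average_precision_at_k(recs[u], test_items[u]) for u in users) / len(users)
```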
Table 8. Results of Wilcoxon signed-rank test between RP3Beta and SLIM models depending on the number of items with which the user interacted.
Bin | p-Value
[1.0, 3.0) | 0
[3.0, 5.0) | 0
[5.0, 8.0) | 0
[8.0, 11.0) | 0
[11.0, 16.0) | 0
[16.0, 22.0) | 1.25 × 10⁻⁶
[22.0, 31.0) | 9.27 × 10⁻⁷
[31.0, 45.0) | 0
[45.0, 74.0) | 0
[74.0, 852.0) | 0
Source: own study.
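The p-values above come from paired per-user comparisons within each activity bin. A minimal sketch of such a test is given below; the input structures (dictionaries mapping user ids to precision@10 and bin labels to user lists) are assumptions for illustration only.

```python
from scipy.stats import wilcoxon

def per_bin_wilcoxon(precision_a, precision_b, bins):
    """Illustrative version of the test behind Table 8: within each
    user-activity bin, paired per-user precision@10 values of two models
    are compared with the Wilcoxon signed-rank test [65].
    precision_a / precision_b: dict user_id -> precision@10;
    bins: dict bin_label -> list of user_ids. All names are assumptions."""
    p_values = {}
    for label, users in bins.items():
        _, p_value = wilcoxon(
            [precision_a[u] for u in users],
            [precision_b[u] for u in users],
        )
        p_values[label] = p_value
    return p_values
```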
Table 9. Diversity: evaluation results.
Metric | RP3Beta | SLIM | ALS | Prod2Vec | LightFM | Most Popular | Random
test coverage | 0.5725 | 0.5171 | 0.3038 | 0.7400 | 0.7031 | 0.0002 | 0.9778
Shannon | 9.5271 | 9.6728 | 9.6270 | 10.4031 | 10.1385 | 2.3296 | 11.7267
Gini | 0.9083 | 0.9029 | 0.9120 | 0.7956 | 0.8397 | 0.9999 | 0.1159
Source: own study.
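A minimal sketch of how such diversity statistics can be computed from the generated top-10 lists follows; the logarithm base of the Shannon entropy and the exact Gini formulation are assumptions and may differ in detail from our evaluation code.

```python
import numpy as np
from collections import Counter

def diversity_metrics(recommendation_lists, n_test_items):
    """Illustrative computation of the Table 9 metrics: test coverage
    (share of test items recommended at least once), Shannon entropy and
    Gini index of the distribution of recommended items [24,26].
    The natural logarithm and the Gini formula below are assumptions."""
    counts = Counter(item for recs in recommendation_lists for item in recs)

    coverage = len(counts) / n_test_items

    # Probability of each item being recommended, sorted in increasing order.
    p = np.array(sorted(counts.values()), dtype=float)
    p /= p.sum()

    shannon = -np.sum(p * np.log(p))

    n = len(p)
    gini = np.sum((2 * np.arange(1, n + 1) - n - 1) * p) / (n - 1)
    return coverage, shannon, gini
```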
Table 10. Overlap of models.
Model | RP3Beta | SLIM | ALS | Prod2Vec | LightFM
RP3Beta | 100% | 73% | 53% | 37% | 38%
SLIM | 73% | 100% | 50% | 35% | 35%
ALS | 53% | 50% | 100% | 38% | 37%
Prod2Vec | 37% | 35% | 38% | 100% | 28%
LightFM | 38% | 35% | 37% | 28% | 100%
Source: own study.
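The symmetric values in Table 10 are consistent with counting, for each user, the fraction of top-10 items shared by two models and averaging over users. The sketch below uses this assumed definition; the exact computation in the paper is not quoted.

```python
def average_overlap(recs_a, recs_b, k=10):
    """Illustrative overlap statistic consistent with Table 10: the average
    (over users present in both dictionaries) share of items appearing in
    the top-k lists of both models. The definition is an assumption."""
    common_users = recs_a.keys() & recs_b.keys()
    overlaps = [
        len(set(recs_a[u][:k]) & set(recs_b[u][:k])) / k
        for u in common_users
    ]
    return sum(overlaps) / len(overlaps)
```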
Table 11. Comparison of activity with or without recommendations.
Variant | Users | % Converted Users
control | 129,308 | 15.98%
ALS | 1,170,262 | 16.83%
Source: own study.
Table 12. Comparison of different recommendation methods.
Variant | Users | % Converted Users
ALS | 343,892 | 15.25%
RP3Beta | 345,273 | 15.40%
ALS+RP3Beta | 343,896 | 15.30%
Source: own study.
Table 13. Comparison of different recommendation methods for users who opened the message.
Variant | Users | % Converted Users
ALS | 44,775 | 19.66%
RP3Beta | 46,097 | 20.94%
ALS+RP3Beta | 45,469 | 20.59%
Source: own study.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
