Article
Peer-Review Record

Job Recommendations: Benchmarking of Collaborative Filtering Methods for Classifieds

by Robert Kwieciński 1,2, Tomasz Górecki 2,*, Agata Filipowska 3 and Viacheslav Dubrov 1
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Electronics 2024, 13(15), 3049; https://doi.org/10.3390/electronics13153049
Submission received: 30 May 2024 / Revised: 24 July 2024 / Accepted: 27 July 2024 / Published: 1 August 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This article presents a comprehensive benchmarking of various collaborative filtering methods for job recommendations on classifieds, specifically using the OLX Jobs dataset. It evaluates the performance of different models like ALS, LightFM, Prod2Vec, RP3Beta, and SLIM in terms of accuracy, diversity, and efficiency. The study includes results from online A/B tests with over 1 million users, demonstrating that recommendations from ALS and RP3Beta models significantly increase user engagement with advertisers. Additionally, the authors have made their dataset publicly available for further research in the field of job recommendations.

I have some major concerns as follows:
1. The tested methods are quite old, and most of them were released more than 10 years ago. Why not adopt any methods published more recently?
2. The purpose of writing this kind of article is not clear. Simply using published works to evaluate in a new scenario does not have much academic meaning from my point of view. The authors should at least propose a new method or new insight.
3. While the paper discusses scalability, the actual scalability of the models in a constantly changing environment like classifieds may vary.
4. The paper indicates a trade-off between recommendation diversity and accuracy, which could be explored further for optimization if the authors truly want to present an innovative paper.

 

Comments on the Quality of English Language

The article generally is easy to follow.

Author Response

Comment 1: The tested methods are quite old, and most of them were released more than 10 years ago. Why not adopt any methods published more recently?

Response 1: The recommendation systems domain is currently under a replication crisis. There are several papers outlining the problem:

  • Maurizio Ferrari Dacrema, Simone Boglio, Paolo Cremonesi, Dietmar Jannach. A troubling analysis of reproducibility and progress in recommender systems research. ACM Transactions on Information Systems, 39:1–49, 2021.
  • Vito Walter Anelli, Alejandro Bellogín, Tommaso Di Noia, Dietmar Jannach, Claudio Pomo. Top-N recommendation algorithms: A quest for the state-of-the-art. In Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization, UMAP '22, pages 121–131, New York, NY, USA, 2022. Association for Computing Machinery.
  • Yushun Dong, Jundong Li, Tobias Schnabel. When newer is not better: Does deep learning really benefit recommendation from implicit feedback? In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23, pages 942–952, New York, NY, USA, 2023. Association for Computing Machinery.
  • Vito Walter Anelli, Daniele Malitesta, Claudio Pomo, Alejandro Bellogín, Eugenio Di Sciascio, Tommaso Di Noia. Challenging the myth of graph collaborative filtering: A reasoned and reproducibility-driven analysis. In Proceedings of the 17th ACM Conference on Recommender Systems, RecSys '23, pages 350–361, New York, NY, USA, 2023. Association for Computing Machinery.

As indicated by the authors of these papers, there are several reasons:

  • weak baselines,
  • hyperparameter tuning conducted only for the proposed method, not for the baselines,
  • train/test splits that may result in data leakage (e.g., some test-set interactions happened before some train-set interactions of other users),
  • evaluation on different, usually unpublished, datasets,
  • not publishing the source code.

We provide a substantial industrial dataset for the job recommendation domain and establish baselines that avoid the above-mentioned mistakes. Because of the size of the data, several recent approaches may not satisfy our efficiency constraints (for example, we recently experimented with graph neural networks).

Comment 2: The purpose of writing this kind of article is unclear. Using published works to evaluate in a new scenario does not have much academic meaning from my point of view. The authors should at least propose a new method or new insight.

Response 2: The following contributions have academic meaning (sorted from the most impactful to the least).

  • We publish and present a dataset of job interactions. To our knowledge, this is the largest publicly available dataset on job interactions.
  • We establish a baseline on this dataset. Our work is fully reproducible due to the published source code and dataset.
  • We propose an evaluation methodology that addresses the significant challenges of classifieds.
  • We provide results of online A/B testing with over 1 million users.
  • We concisely present a mathematical formulation of all considered methods.

Comment 3: While the paper discusses scalability, the actual scalability of the models in a constantly changing environment like classifieds may vary.

Response 3: Yes. The models are frequently retrained to capture the newest interactions, users, and items. Hence, all model metrics (accuracy, diversity, efficiency) may change over time. 

Comment 4: The paper indicates a trade-off between recommendation diversity and accuracy, which could be explored further for optimization if the authors truly want to present an innovative paper.

Response 4: We added a plot showing the dependency between test coverage and precision@10 for the considered models (an illustrative sketch of such a plot is shown below). Studying the impact of diversity on online metrics to specify the trade-off between optimizing for accuracy and diversity could be an interesting direction for future research.
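For illustration, below is a minimal matplotlib sketch of how such a precision@10 versus test coverage plot can be drawn. The model names follow the manuscript, but the metric values are placeholders, not the results reported in the paper.

```python
import matplotlib.pyplot as plt

# Placeholder metric values for illustration only; the actual numbers are in the paper.
models = ["ALS", "LightFM", "Prod2Vec", "RP3Beta", "SLIM"]
test_coverage = [0.55, 0.40, 0.35, 0.60, 0.58]          # hypothetical diversity values
precision_at_10 = [0.030, 0.025, 0.022, 0.032, 0.031]   # hypothetical accuracy values

fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(test_coverage, precision_at_10)
for name, x, y in zip(models, test_coverage, precision_at_10):
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(5, 3))
ax.set_xlabel("test coverage")
ax.set_ylabel("precision@10")
ax.set_title("Accuracy vs. diversity (illustrative values)")
fig.tight_layout()
plt.show()
```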

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript focuses on benchmarking different recommendation methods for job classifieds, aiming to improve advertisement conversion rates and user satisfaction. The study implemented scalable methods like ALS, LightFM, Prod2Vec, RP3Beta, and SLIM and evaluated them based on accuracy, diversity, and efficiency criteria. It also explored the overlap of recommendations offered by different methods, highlighting the importance of choosing loss function in matrix factorization approaches. Results showed that RP3Beta, SLIM, and ALS had similar diversities, with RP3Beta having a 20% more significant impact on user engagement than ALS in online A/B tests.

Despite these strong aspect assessments above, weak aspects and suggestions for improvement are listed below.

1. The manuscript lacks a detailed discussion of the specific challenges faced while implementing the recommendation methods, such as any unexpected issues or limitations encountered during the research process. To enhance the manuscript, it would be beneficial to include a section dedicated to discussing the practical challenges faced while implementing the recommendation models.

2. Including a subsection that outlines the limitations of each recommendation method used in the study would offer a more comprehensive understanding of each approach's strengths and weaknesses.

3. A detailed analysis of the computational resources required for each method and how they impact scalability and efficiency could further enrich the discussion and help readers assess the practical implications of adopting these methods.

4. The reviewer suggests incorporating a discussion of the findings' generalizability to other similar platforms or domains, which would strengthen the manuscript by providing insights into the broader applicability of the benchmarked recommendation methods.

Author Response

Comment 1: The manuscript lacks a detailed discussion of the specific challenges faced while implementing the recommendation methods, such as any unexpected issues or limitations encountered during the research process. To enhance the manuscript, it would be beneficial to include a section dedicated to discussing the practical challenges faced while implementing the recommendation models.

Response 1: The main challenge was to select methods able to handle such an extensive dataset and to implement them efficiently, which we already indicate in the manuscript. Beyond that, we do not recall any significant challenges.

Comment 2:  Including a subsection that outlines the limitations of each recommendation method used in the study would offer a more comprehensive understanding of each approach's strengths and weaknesses.

Response 2: We describe why each presented method was selected and discuss its limitations at the end of the respective subsection. Since we had not discussed Prod2Vec's limitations, we added such a discussion to address this comment.

Comment 3: A detailed analysis of the computational resources required for each method and how they impact scalability and efficiency could further enrich the discussion and help readers assess the practical implications of adopting these methods.

Response 3: We comment that the selected models can be trained even on a single laptop. We also specified the maximal acceptable execution time and memory constraints (see our response to Reviewer #4, Comment #7).

Comment 4: The reviewer suggests incorporating a discussion of the findings' generalizability to other similar platforms or domains, which would strengthen the manuscript by providing insights into the broader applicability of the benchmarked recommendation methods.

Response 4: We added a paragraph in Section 6.3.

Reviewer 3 Report

Comments and Suggestions for Authors

The authors benchmarked collaborative filtering methods for classifieds in the application of job recommendations. There are various parts to be enhanced by the authors:
Comment 1. Abstract: Clarify “real-world setting”.
Comment 2. More terms should be included in the keywords.
Comment 3. Section 1 Introduction:
(a) Estimations beyond 2024 are also important.
(b) Paragraph 2, references are missing.
(c) The second and fourth contributions need to be further elaborated.
Comment 4. Section 2 Related works:
(a) An introductory paragraph is required before Subsection 2.1. In addition, the authors should explain the reasons to divide the literature review into three parts.
(b) Too many references (about 40) were cited in this section. As a technical article (instead of a review article), it is unusual to have excessive references without detailed analysis and proper comparison.
(c) Numbering should be applied to equations.
(d) Most of the cited references were not the latest references (recent 5-year). The literature review is considered incomplete and requires a significant update.
Comment 5. Section 3 Dataset:
(a) The authors should emphasize and reflect in the headings of sections and subsections that one of the contributions is data collection of a large-scale dataset and publishing it online.
(b) The first URL provided for accessing the dataset was not valid.
Comment 6. Section 4 Experimental setup:
(a) Some of the content belongs to methodology instead of the experiment (i.e., the evaluation of the models).
(b) Please justify “The test set includes 20% of the newest interaction times. This means that out of 14 days, approximately 3 days were included in the test set (see Figure 3).” How about the results of other settings?
(c) Table 4, apart from the optimal settings, numerical analysis is required for the hyper-parameter tuning.

Author Response

Comment 1: Abstract: Clarify “real-world setting”.

Response 1: We rephrased the sentence, “We conducted online A/B tests by sending millions of messages with recommendations to evaluate models in a real-world setting.”

Comment 2: More terms should be included in the keywords.

Response 2: We added a few more.

Comment 3:

Section 1 Introduction:
(a) Estimations beyond 2024 are also important.
(b) Paragraph 2, references are missing.
(c) The second and fourth contributions need to be further elaborated.

Response 3:
(a) We removed the sentence regarding estimations for the classifieds market.
(b) We added references for eBay and Amazon.
(c) Improved.

Comment 4:

Section 2 Related works:
(a) An introductory paragraph is required before Subsection 2.1. In addition, the authors should explain the reasons to divide the literature review into three parts.
(b) Too many references (about 40) were cited in this section. As a technical article (instead of a review article), it is unusual to have excessive references without detailed analysis and proper comparison.
(c) Numbering should be applied to equations.
(d) Most of the cited references were not the latest references (recent 5-year). The literature review is considered incomplete and requires a significant update.

Response 4:
(a) Indeed, added.
(b) We removed some references. The remaining ones improve our manuscript, but we can consider removing them if advised.
(c) Numbering all the equations could decrease readability. Hence, we decided to number only the ones we are referring to (which, in this case, means there is no numbering at all because we do not refer to equations).
(d) We added references to five relevant papers from 2020–2023. Unfortunately, we need to keep the references to several older articles because the methods we consider were proposed between 2008 and 2016 (we explain the reason for using relatively old methods in our response to Reviewer #1, Comment #1).

Comment 5: Section 3 Dataset:
(a) The authors should emphasize and reflect in the headings of sections and subsections that one of the contributions is data collection of a large-scale dataset and publishing it online.
(b) The first URL provided for accessing the dataset was not valid.

Response 5:
(a) We renamed Section 3.
(b) The URL works fine on our side. Please let us know if it still does not work for you.

Comment 6: Section 4 Experimental setup:
(a) Some of the content belongs to methodology instead of the experiment (i.e., the evaluation of the models).
(b) Please justify “The test set includes 20% of the newest interaction times. This means that out of 14 days, approximately 3 days were included in the test set (see Figure 3).” How about the results of other settings?
(c) Table 4, apart from the optimal settings, numerical analysis is required for the hyper-parameter tuning.

Response 6:
(a) Indeed. We moved this content to the newly created Section 4.5, Model Evaluation.
(b) This way of splitting the dataset into train and test sets is the most realistic setting according to Meng, Z.; McCreadie, R.; Macdonald, C.; Ounis, I. Exploring Data Splitting Strategies for the Evaluation of Recommendation Models. In Proceedings of the 14th ACM Conference on Recommender Systems, 2020. Several authors use leave-one-last or user-based temporal splits, but these carry a risk of data leakage (a sketch of such a global temporal split is given after this response).
(c) In Figure 4, we outline the impact of hyperparameter selection on each model's performance on the test set. We used Bayesian optimization to find the optimal hyperparameters; a deeper analysis would therefore require advanced techniques (e.g., functional analysis of variance), which we decided to skip due to space constraints.
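A minimal pandas sketch of such a global temporal split (the newest 20% of interactions form the test set) is shown below; the `event_time` column name and the toy data frame are assumptions for illustration, not necessarily the published dataset's schema.

```python
import pandas as pd

def temporal_split(interactions: pd.DataFrame, test_fraction: float = 0.2,
                   time_col: str = "event_time"):
    """Global temporal split: the newest `test_fraction` of interactions form the
    test set, so no training interaction happens after any test interaction."""
    interactions = interactions.sort_values(time_col)
    cutoff = int(len(interactions) * (1.0 - test_fraction))
    return interactions.iloc[:cutoff], interactions.iloc[cutoff:]

# Toy example; in practice the OLX Jobs interactions would be loaded instead.
df = pd.DataFrame({
    "user": [1, 1, 2, 3, 3],
    "item": [10, 11, 10, 12, 13],
    "event_time": pd.to_datetime(
        ["2021-03-01", "2021-03-04", "2021-03-07", "2021-03-10", "2021-03-13"]),
})
train, test = temporal_split(df)
print(len(train), len(test))  # 4 1
```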

Reviewer 4 Report

Comments and Suggestions for Authors

1) The newly introduced dataset should be compared with existing datasets for evaluating job recommendation systems.
2) In the introduced dataset, a given user-item pair may have several interactions. The number of unique user-item pairs in the dataset should be reported.
3) Statistics of train and test sets should be presented.
4) In the Friedman test, it should be explicitly written what is considered a sample and, hence, what the value of N is.
5) The labels in Figure 6 should be bigger.
6) It would be useful to plot an accuracy metric (e.g., precision) against a diversity metric (e.g., test coverage).
7) In Section 5.5, the authors wrote, "The relatively high memory consumption is still low enough for the model to be tested in production." Authors should describe more precisely the maximal acceptable training time and memory consumption.
8) Authors should explain why, in the experiment described in Section 6.1, only ~10% of users were included in the control group.

Comments on the Quality of English Language

Use the active voice instead of the passive.

Author Response

Comment 1: The newly introduced dataset should be compared with existing datasets for evaluating job recommendation systems. 

Response 1: We added a comparison between the OLX Jobs Interactions dataset and the three most popular datasets used in the job recommendations domain (CareerBuilder12, RecSys16, and RecSys17).

Comment 2: In the introduced dataset, a given user-item pair may have several interactions. The number of unique user-item pairs in the dataset should be reported. 

Response 2: We report it in the table comparing different datasets.

Comment 3: Statistics of train and test sets should be presented. 

Response 3: We added a table comparing train and test sets.

Comment 4: In the Friedman test, it should be explicitly written what is considered a sample and, hence, what the value of N is.

Response 4: We clarified that each user in the test set is considered a sample.
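As an illustrative sketch of this setup with `scipy.stats.friedmanchisquare`: each test-set user is one related sample (so N equals the number of users), and each model contributes one vector of per-user metric values. The vectors below are random placeholders, not our measurements.

```python
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
n_users = 1000  # N in the Friedman test: the number of users in the test set

# One per-user metric vector (e.g., precision@10) per model; random placeholders here.
als = rng.random(n_users)
lightfm = rng.random(n_users)
rp3beta = rng.random(n_users)
slim = rng.random(n_users)

statistic, p_value = friedmanchisquare(als, lightfm, rp3beta, slim)
print(f"Friedman chi-square = {statistic:.2f}, p = {p_value:.4f}")
```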

Comment 5: The labels in Figure 6 should be bigger.

Response 5: Fixed.

Comment 6: It would be useful to plot an accuracy metric (e.g., precision) against a diversity metric (e.g., test coverage).

Response 6: We added a plot of precision@10 versus test coverage.

Comment 7: In section 5.5, the authors wrote, "The relatively high memory consumption is still low enough for the model to be tested in production." Authors should describe more precisely the maximal acceptable training time and memory consumption.

Response 7: To retrain the models frequently, we set the maximal acceptable execution time at 3 hours, which was satisfied by ALS, LightFM, and RP3Beta. From a cost perspective, the acceptable memory utilization was set at 110 GB, which all models fulfilled. We added these clarifications to the manuscript.
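For reference, a minimal sketch of how such constraints can be checked around a training run, using only the Python standard library. The `train_model` function is a placeholder, and `ru_maxrss` is interpreted as kilobytes, which holds on Linux.

```python
import time
import resource

MAX_TRAIN_SECONDS = 3 * 60 * 60   # 3-hour training budget
MAX_MEMORY_GB = 110               # memory budget per training job

def train_model():
    # Placeholder standing in for fitting one of the benchmarked models.
    time.sleep(1)

start = time.perf_counter()
train_model()
elapsed = time.perf_counter() - start

# On Linux, ru_maxrss reports the peak resident set size in kilobytes.
peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 ** 2
print(f"training time: {elapsed:.0f} s (limit {MAX_TRAIN_SECONDS} s), "
      f"peak memory: {peak_gb:.1f} GB (limit {MAX_MEMORY_GB} GB)")
assert elapsed <= MAX_TRAIN_SECONDS and peak_gb <= MAX_MEMORY_GB
```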

Comment 8: Authors should explain why, in the experiment described in Section 6.1, only ~10% of users were included in the control group

Response 8: Only ~10% of users were assigned to the control group because the ALS model had already been implemented and used for job recommendations at OLX before the test. We preferred to use a minimal control group and run a more extended test to observe the longer-term impact on our users. We added these clarifications to the manuscript.

 

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The response from the authors failed to address my concerns. For instance, the authors claimed, "Because of the size of the data, several recent approaches may not satisfy our efficiency constraints (e.g., we recently experimented with graph neural networks)." Yet, various GNNs can handle large graph datasets efficiently. One of the well-known scalable GNNs is GraphSAGE, which was excluded from their baselines [1]. I maintain that the evaluated methods are too old and do not represent recent advances in recommender systems. So, I cannot support its publication this time.

[1] Zhang, J.; Xue, R.; Fan, W.; et al. Linear-Time Graph Neural Networks for Scalable Recommendations. In Proceedings of the ACM on Web Conference 2024, 2024; pp. 3533–3544.

Author Response

Comment 1:

The response from the authors failed to address my concerns. For instance, the authors claimed, "Because of the size of the data, several recent approaches may not satisfy our efficiency constraints (e.g., we recently experimented with graph neural networks)." Yet, various GNNs can handle large graph datasets efficiently. One of the well-known scalable GNNs is GraphSAGE, which was excluded from their baselines [1]. I maintain that the evaluated methods are too old and do not represent recent advances in recommender systems. So, I cannot support its publication this time.

Response 1: 

In the classifieds domain, new items constantly appear, so we decided to retrain our collaborative filtering models frequently (a few times a day). Additionally, we had budget constraints that forced us to use CPUs only (no GPU) with at most 110 GB of memory, and each model had to be trained from scratch in at most 3 hours (we mention these constraints in the manuscript). Even though we did not consider deep learning methods, the time constraint was still not satisfied by 2 out of the 5 selected methods.

In the paper you mentioned, the running time of the newly proposed method LTGNN was 4.2 times shorter than that of a 3-layer LightGCN but 5.5 times longer than that of matrix factorization on the largest considered dataset. This may not be acceptable in our case, especially without a GPU.

We experimented with several GNNs on a smaller dataset (coming from a smaller market in which OLX operates), including GraphSAGE and LightGCN. We also checked how fast they are on the published OLX Jobs Interactions dataset using the same machine as in the manuscript. For both models, we used a batch size of 1024, 3 layers, an embedding size of 128, 10 randomly sampled positive neighbors on each layer, and one negative sample for each positive sample when constructing mini-batches. Additionally, we used a pooling aggregator and the ReLU activation function for GraphSAGE. We implemented these methods using the DGL library with the PyTorch backend (a configuration sketch is given at the end of this response). We observed that it would take ~32 hours per epoch to train the LightGCN model and ~45 hours per epoch for the GraphSAGE model (screenshots for LightGCN and GraphSAGE, respectively, are attached below):

We can read in the introduction of the paper you mentioned:
“GNN-based CF models have not been widely employed in industrial-level applications majorly due to their scalability limitations. In fact, classic CF models like MF and DNNs are still playing major roles in real-world applications due to their computational advantages, especially in large-scale industrial recommender systems."
It confirms that employing GNNs efficiently in large-scale applications is still challenging (the paper was published in May 2024). Hence, we leave for future work the comparative analysis of the newest GNN approaches on the large dataset we introduced.
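For concreteness, here is a minimal sketch of the GraphSAGE configuration described above (3 layers, 128-dimensional embeddings, 10 sampled neighbors per layer, pooling aggregator, ReLU), assuming a recent DGL release with the PyTorch backend; the toy graph, input feature size, and forward-pass loop are placeholders rather than our exact training pipeline.

```python
import torch
import torch.nn as nn
import dgl
import dgl.nn as dglnn

class GraphSAGE(nn.Module):
    """Three SAGEConv layers with a pooling aggregator and ReLU, 128-dim embeddings."""
    def __init__(self, in_feats, hidden_feats=128, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            in_dim = in_feats if i == 0 else hidden_feats
            self.layers.append(dglnn.SAGEConv(in_dim, hidden_feats, aggregator_type="pool"))

    def forward(self, blocks, x):
        h = x
        for layer, block in zip(self.layers, blocks):
            h = torch.relu(layer(block, h))
        return h

# A small random graph standing in for the user-item interaction graph.
g = dgl.rand_graph(1000, 10000)
g.ndata["feat"] = torch.randn(g.num_nodes(), 64)

# Mini-batches of 1024 seed nodes with 10 sampled neighbors per layer.
sampler = dgl.dataloading.NeighborSampler([10, 10, 10])
loader = dgl.dataloading.DataLoader(
    g, torch.arange(g.num_nodes()), sampler, batch_size=1024, shuffle=True)

model = GraphSAGE(in_feats=64)
input_nodes, output_nodes, blocks = next(iter(loader))
embeddings = model(blocks, blocks[0].srcdata["feat"])  # one mini-batch forward pass
print(embeddings.shape)
```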

 

Reviewer 2 Report

Comments and Suggestions for Authors

The authors handled all comments.

Author Response

We didn't find any comments from Reviewer 2.

Reviewer 3 Report

Comments and Suggestions for Authors

The authors enhanced the quality of the paper; however, some key issues remain unaddressed.

 

Follow-up comment 1: Two newly added research contributions, “We provide the results…” and “We concisely present a mathematical formulation…”, are not research contributions but standard elements in a research article. Instead, the authors should elaborate on how the results outperform the existing works and explain the novelties of the methods.
Follow-up comment 2: Limited changes are made in Section 2. As a literature review, the authors should ensure the cited works align with the topic of interest of this research paper. In addition, please summarize the methodologies, results, and limitations of the existing works.
Follow-up comment 3: The resolutions of some figures are not good enough. Please enlarge the figures to ensure that no content is blurred.
Follow-up comment 4: Conventionally, we use “.” instead of “,” for numerical values.
Follow-up comment 5: Table 6, what are the ranges of hyper-parameter tuning?

Author Response

Comment 1:

Two newly added research contributions, “We provide the results…” and “We concisely present a mathematical formulation…”, are not research contributions but standard elements in a research article. Instead, the authors should elaborate on how the results outperform the existing works and explain the novelties of the methods.

Response 1:

Research articles often do not provide the results of A/B tests and restrict the evaluation to offline experiments. Providing the results of an online evaluation involving many users helps estimate the importance of recommender systems in the classifieds domain and allows other companies to make more informed decisions about developing such models. We improved the description of this contribution in our paper. Regarding the contribution “We concisely present a mathematical formulation…”, we describe the methods fully yet concisely, in many cases more briefly than in the original papers. We agree that this is (or should be) a standard element of a research article; hence, we decided to remove it from the list of contributions.

 

Comment 2: Limited changes are made in Section 2. As a literature review, the authors should ensure the cited works align with the topic of interest of this research paper. In addition, please summarize the methodologies, results, and limitations of the existing works.

Response 2: 

We added two paragraphs in Section 2.3 discussing works in which at least three of our five selected personalized algorithms were considered, and we indicated how our work differs from them: we extensively evaluate accuracy, diversity, and efficiency and report online experiments on a large dataset, whereas the related works focus on smaller datasets or report only accuracy results. Additionally, we provide a dataset for the job recommendations domain, which has very few publicly available datasets, so establishing a baseline on this dataset will be valuable to the research community.

Additionally, we removed a few other citations that were less crucial.

Comment 3: 

The resolutions of some figures are not good enough. Please enlarge the figures to ensure that no content is blurred.

Response 3:

We are unsure which figures require improvements, but we will improve them before publication if needed.

Comment 4: 

Conventionally, we use “.” instead of “,” for numerical values.

Response 4:

Thank you for pointing it out. We accidentally used “,” instead of “.” in some newly added tables. Fixed.

Comment 5: 

Table 6, what are the ranges of hyper-parameter tuning?

Response 5:

The manuscript mentions that the ranges of hyperparameters are available in our code repository: https://github.com/rob-kwiec/olx-jobs-recommendations/blob/main/src/tuning/config.py. If needed, we can add a table to describe them.

 

Round 3

Reviewer 1 Report

Comments and Suggestions for Authors

I partially agree with the authors' response this time, but I maintain my previous opinion. I still have some concerns about the tested methods, as outlined below.

(1) Cold start problem: The selected methods cannot recommend items that have not been visited by any user, or provide recommendations to users who have not visited any items.

(2) User and item embeddings: Some methods (such as SLIM and RP3Beta) cannot generate user and item embeddings, which may limit the application of other tasks (such as clustering of users or items).

(3) Complex relations: The RP3Beta model cannot optimize parameters during training, so it may not fully utilize the complex relations in the data.

(4) Dependency: The Prod2Vec method strictly relies on the relationship between user and item representations, which may cause some interactions to have too large an impact on user representations.

These deficiencies may affect the performance and scalability of recommendation systems in practical applications. If the authors could address at least one of the issues I have mentioned, I would be pleased to recommend its publication.

Author Response

Thank you for raising these highly valid points. We already addressed some of them in our other works:
[1] R. Kwieciński, T. Górecki, and A. Filipowska, "Learning edge importance in bipartite graph-based recommendations," in 2022 17th Conference on Computer Science and Intelligence Systems (FedCSIS), Sofia, Bulgaria, 2022, pp. 227–233, doi: 10.15439/2022F191.
[2] R. Kwieciński, G. Melniczak, and T. Górecki, "Comparison of Real-Time and Batch Job Recommendations," IEEE Access, vol. 11, pp. 20553–20559, 2023, doi: 10.1109/ACCESS.2023.3249356.

We briefly mention these works in Section 6.3. The work we discuss here (Job Recommendations: Benchmarking of Collaborative Filtering Methods for Classifieds) was initially prepared before [1] and [2] but was published only on arXiv; we spent almost two years in a prolonged review process consisting of several rounds of reviews at another journal. The current version is significantly improved compared to the arXiv version.

Comment 1. Cold start problem: The selected method cannot recommend items that have not been visited by any user, or provide recommendations to users who have not visited any items.

Response 1. It is indeed a standard limitation of all selected methods. We decided to retrain our models frequently (a few times a day) to minimize the number of cold-start users and items. Retraining the models so often is a significant constraint and requires selecting more computationally efficient methods; even so, we decided to do it because we observed outstanding performance of collaborative filtering methods during online tests. For instance, we developed a hybrid recommendation system at OLX that solves the cold-start problem for items, yet during A/B tests we observed that our collaborative filtering models provide better recommendations (we briefly describe this in the second paragraph of Section 6.3).

Frequent retraining significantly reduced the cold-start problem but did not solve it. In the case of item cold start, the impact is relatively small in the jobs (or classifieds) domain: starting to recommend new items a few hours after publication is acceptable (in domains such as news recommendations, recommending fresh items is much more important).

In the case of user cold start, let us consider two scenarios:

  1. The user does not have any interactions at the prediction time.
    Providing personalized recommendations to such users requires additional user features, which are often hard to gather (users are anonymous and browse ads without creating an account). Obtaining some features (e.g., user location) is possible, but addressing this problem was not a priority because, at OLX, we usually start providing job recommendations to users who are actively looking for a job (i.e., interacting with some job ads). Additionally, we wanted to make our work fully reproducible and hence rely only on the dataset we published. We decided not to take the legal risk of adding extra features to the dataset (e.g., some Netflix users filed a class action lawsuit against Netflix regarding the possibility of identifying them in the Netflix Prize dataset).
  2. The user does not have any interactions at the training time but has some interaction during prediction time.
    We addressed this issue in our work Comparison of Real-Time and Batch Job Recommendations [2], where we proposed an infrastructure that allows taking into account interactions performed even a few seconds before prediction time. Our solution can be applied to several models:
    - for RP3Beta and SLIM, the recommendations provided in real-time and in batch mode for users with the same set of interactions are identical;
    - ALS, LightFM, and Prod2Vec can still be applied, but they require some modifications in how recommendations are calculated.

Comment 2. User and item embeddings: Some methods (such as SLIM and RP3Beta) cannot generate user and item embeddings, which may limit the application of other tasks (such as clustering of users or items).

Response 2. Indeed, SLIM and RP3Beta do not provide user and item embeddings, which limits their applicability to other tasks. This is one of the main motivations for our work on graph neural networks. We developed a new method, P3GNN, which generalizes the RP3Beta model and provides user and item embeddings. The technique was described by one of the authors in a Ph.D. dissertation defended last month (available at https://bip.amu.edu.pl/nadanie-stopnia-doktora2/informatyka/kwiecinski-robert, unfortunately only in Polish). We did not describe it in this work because its training time significantly exceeds our time constraints (it is similar to the LightGCN mentioned in our response to the previous review). We are considering optimizing the implementation and using more computational resources (a GPU) to make offline comparisons against the methods described in this work. We prefer not to introduce this method here (we think the paper would become too long).

Comment 3. Complex relations: The RP3Beta model cannot optimize parameters during training, so it may not fully utilize the complex relations in the data.

Response 3. We fully agree. To address this, we introduced a generalization of RP3Beta, which utilizes additional information about interactions (e.g., type of interaction, diversity, and timestamps) in our work Learning edge importance in bipartite graph-based recommendations [1].

Comment 4. Dependency: The Prod2Vec method strictly relies on the relationship between user and item representations, which may cause some interactions to have too large an impact on user representations.

Response 4. In Prod2Vec, item representations are L2-normalized, and user representations are calculated as the average of the representations of the items that the given user interacts with. We think this normalization may reduce the extremely high impact of some interactions. Even though all item representations have the same length after normalization, some interactions may still have a higher impact than others (this heavily depends on how the items are distributed in the latent space).
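As a small numpy sketch of the construction described above (L2-normalized item vectors averaged into a user vector); the dimensions and vectors are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(5, 8))  # 5 interacted items, 8-dimensional vectors

# L2-normalize each item representation, then average them to form the user vector.
normalized = item_embeddings / np.linalg.norm(item_embeddings, axis=1, keepdims=True)
user_vector = normalized.mean(axis=0)

# Every normalized item vector has unit length, yet its pull on the average still
# depends on where the items lie in the latent space.
print(np.linalg.norm(normalized, axis=1))  # all ~1.0
print(user_vector)
```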

On the other hand, we think treating some interactions as more important than others may be beneficial. For instance, there may be some job roles with very few ads, and the users who interact with at least one of these ads are highly likely to interact with the others. In this case, even if the user has interacted with several other ads, it may be better to recommend other ads related to this job role to this user (which can be achieved by giving the interaction with the ad from this job role a higher impact than the other interactions).

In the P3LTR model, our extension of the RP3Beta model [1], we learn the importance of different interactions. We could use these importances to weight item representations when building user representations in Prod2Vec or even train the Prod2Vec and edge importances end-to-end. We prefer not to introduce such methods in this work (we think the work could become too long).
