Article

Measuring Inter-Bias Effects and Fairness-Accuracy Trade-Offs in GNN-Based Recommender Systems

by Nikzad Chizari 1, Keywan Tajfar 2 and María N. Moreno-García 1,*
1 Department of Computer Science and Automation, University of Salamanca, 37008 Salamanca, Spain
2 Department of Statistics, University of Tehran, Tehran 1417614411, Iran
* Author to whom correspondence should be addressed.
Future Internet 2025, 17(10), 461; https://doi.org/10.3390/fi17100461
Submission received: 10 September 2025 / Revised: 2 October 2025 / Accepted: 6 October 2025 / Published: 8 October 2025
(This article belongs to the Special Issue Deep Learning in Recommender Systems)

Abstract

Bias in artificial intelligence is a critical issue because these technologies increasingly influence decision-making in a wide range of areas. The recommender system field is one of them, where biases can lead to unfair or skewed outcomes. Their origin usually lies in data biases stemming from historical inequalities or irregular sampling. Recommendation algorithms trained on such data contribute, to a greater or lesser extent, to amplifying and perpetuating those imbalances. In addition, different types of biases can be found in the outputs of recommender systems, and they can be evaluated by a variety of metrics specific to each of them. However, biases should not be treated independently, as they are interrelated and can potentiate or mask each other. Properly assessing these biases is crucial for ensuring fair and equitable recommendations. This work focuses on analyzing the interrelationship between different types of biases and proposes metrics designed to jointly evaluate multiple interrelated biases, with particular emphasis on those that tend to mask or obscure discriminatory treatment of minority or protected demographic groups, evaluated in terms of disparities in recommendation quality. This approach enables a more comprehensive assessment of algorithmic performance in terms of both fairness and predictive accuracy. Special attention is given to Graph Neural Network-based recommender systems, due to their strong performance in this application domain.

1. Introduction

Studying and addressing biases in Artificial Intelligence (AI) models is a pressing concern in current research. If not properly managed, AI can amplify existing social inequalities by reflecting and reinforcing biases present in the training data or algorithmic design [1,2]. In recommender systems (RS), bias affects the fairness and diversity of the content users receive. These systems can prioritize popular items (popularity bias), limit the variety of suggested content (item coverage bias), and overlook the preferences of protected or minority groups. This not only reduces user satisfaction for underrepresented communities but also perpetuates inequalities by limiting access to certain items and providing more biased information to train new models. Identifying and mitigating these biases is crucial for creating inclusive and balanced recommendation environments.
The interrelation among different types of bias in RS can be complex and self-reinforcing. Popularity bias occurs when a system disproportionately recommends popular items, often at the expense of less popular ones. This bias stems from a power-law distribution in the data, also known as the long tail, in which very few items have many ratings (or interactions) while a large number of items have few ratings [3,4]. Recommendations skewed towards the most popular items can directly impact item coverage, as the system focuses on a small subset of frequently interacted items while neglecting a broader, more diverse range of options. Low item coverage exacerbates biased recommendations for protected or minority groups [5]. If a system favors content reflecting majority preferences, it may fail to surface items relevant to underrepresented communities, further marginalizing their cultural or identity-based content. Additionally, as these minority-rated items receive less exposure, they generate fewer interactions, reinforcing their invisibility in future recommendations. This feedback loop means that protected groups receive less personalized and relevant suggestions, while their content remains buried under the weight of popular, mainstream items.
In addition to coverage, popularity bias also affects evaluation metrics such as precision, recall, NDCG (Normalized Discounted Cumulative Gain), MAP (Mean Average Precision), and MRR (Mean Reciprocal Rank) [6]. When training is performed on datasets with a long-tail distribution, where the most popular items have many more ratings or interactions, the test sets can be expected to present this distribution as well. Therefore, since metrics are obtained by averaging the results over all users in the test set, recommendation methods biased toward the most popular items tend to score better, even though they disadvantage minorities.
There are different metrics to assess bias in AI models and RS. Popularity bias can be evaluated by means of widely accepted metrics such as average popularity, the Gini index, and item coverage [7,8,9]. However, there is no clear consensus on how to measure unfair recommendations for protected or minority groups. First, some metrics are based on assessing whether users in minority groups receive similar recommendations to those in majority groups. This approach is not appropriate in RS because it is precisely the distinctive characteristics of the groups that can strongly influence their preferences. It would then be more convenient to use metrics based on the quality of recommendations. Second, many of the metrics that measure differences in recommendation quality between groups take as a reference the errors in rating prediction, instead of the quality of the top-N recommendation lists, which is the procedure currently considered more appropriate. Finally, each type of bias is usually evaluated separately, without taking into account the impact of biases on one another.
In this paper we address the problems discussed above in order to analyze the influence of some metrics on others and the behavior of different types of recommendation algorithms. We pay special attention to methods based on GNN (Graph Neural Networks) because of their proven effectiveness in this field, although some studies have shown that graph learning algorithms are more prone to bias [10,11].
Specifically, this work attempts to answer the following research questions (RQ).
RQ1. To what extent do the different types of biases interrelate and how do they affect or mask the quality of the recommendations?
RQ2. How can the recommendation algorithm (or algorithm category) that performs best in terms of quality and bias minimization be selected?
With respect to RQ1, this study examines the way in which popularity bias shapes the outcomes of recommendation algorithms and obscures their discriminatory effects on gender-based minority groups.
The contribution of this work with respect to RQ2 lies in the proposal of a metric that jointly evaluates the item coverage of the algorithms (as the inverse of popularity) and the disparity in recommendation quality between minority and majority groups. This metric enables the identification of the algorithm that exhibits the most favorable performance with respect to both types of bias.
The rest of the paper is organized as follows. The following section provides a brief overview of related work. Next, the proposed evaluation approach is introduced. Section 4 reports on a case study addressing popularity bias and gender discrimination, together with an analysis of the resulting findings. Finally, the conclusions and potential directions for future work are outlined.

2. Related Work

Among the various types of biases, popularity bias is one of the most extensively studied in the recommender systems literature, from both item and user perspectives. The item-centered perspective considers only the rather evident fact that the most popular items appear more frequently in the recommendations, without considering how this may affect different users [3]. In contrast, the user-centered perspective evaluates the way in which this skewed distribution of items may result in unfair treatment of users [12]. The work presented in [13] illustrates this unequal treatment in the domain of movie recommendation, categorizing users into three groups based on their interest in more or less popular movies. The study conducted in [8] shows that the most widely used recommendation algorithms suffer from popularity bias in the domain of book recommendation, resulting in users with more common preferences being more likely to receive higher-quality recommendations.
All these studies focus on different types of users based on the degree of popularity of the items they prefer. However, none of them examine how popularity bias affects other biases, such as the unfair treatment of protected demographic groups.
The quantification of popularity bias can be performed by measuring the average popularity of the recommended items [8]. An alternative way is to evaluate the diversity of these items by means of the Gini index [7]. The item coverage metric, inversely proportional to the popularity, can also be used [9].
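As an illustration of how these three measures can be computed from top-N recommendation lists and training interaction counts, the following Python sketch may help; the function names are ours and the formulas follow the usual definitions rather than the code of the cited works.

```python
# Illustrative sketch of three common popularity-bias measures; function names
# are ours, and items are assumed to be indexed 0..n_items-1.
from collections import Counter

def average_popularity(rec_lists, train_counts):
    """Mean training popularity of recommended items, averaged over users."""
    per_user = [sum(train_counts.get(i, 0) for i in recs) / len(recs)
                for recs in rec_lists if recs]
    return sum(per_user) / len(per_user)

def gini_index(rec_lists, n_items):
    """Gini index of recommendation exposure across the whole catalog."""
    exposure = Counter(i for recs in rec_lists for i in recs)
    x = sorted(exposure.get(i, 0) for i in range(n_items))
    total = sum(x)
    if total == 0:
        return 0.0
    # Standard Gini formula over the sorted exposure values.
    return sum((2 * (k + 1) - n_items - 1) * v for k, v in enumerate(x)) / (n_items * total)

def item_coverage(rec_lists, n_items):
    """Fraction of catalog items appearing in at least one recommendation list."""
    return len({i for recs in rec_lists for i in recs}) / n_items
```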
In the area of AI, several metrics have been proposed to assess biases or unfair treatment of certain demographic groups, which are often minorities. Some of them evaluate inequalities in algorithm outputs for these groups. In the RS domain, some are based on comparing the distributions of predicted rating values for different groups. However, these differences do not necessarily represent unfair treatment but rather different group preferences. For example, when analyzing age discrimination, the fact that young people are recommended different music than older people does not indicate that the recommendations are biased, but rather that these age groups have different tastes. This category of metrics includes demographic parity [14], also known as statistical parity or independence, which compares the probabilities of machine learning algorithm outputs for the different groups. A similar measure is differential fairness [15], which also compares output probabilities, but computes differences as ratios and can be applied to multiple protected attributes. Non-parity unfairness [16,17] also belongs to this type but refers to RS outputs. It represents the absolute difference between the average ratings of the different groups. Equality of representation [18] is another metric derived from demographic parity that has two variants: at the network level, it measures bias between different groups among all recommendations given in the network, and at the user level, it measures the fraction of users having a particular sensitive attribute value among the recommendations given to each user.
A proper evaluation of bias in RS requires comparing the quality of recommendations across different demographic groups. Existing fairness metrics can be broadly categorized along two dimensions: prediction-based vs. ranking-based metrics, and group fairness vs. individual fairness.
Prediction-based metrics focus on errors in rating prediction. Standard error measures such as MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error) quantify the deviation between predicted and actual ratings. Within the fairness literature, some measures extend this approach by quantifying deviations in predicted ratings across user groups. In this group we can mention the equalized odds metric [19], which compares classifier error rates among protected groups. A combination of demographic parity and equalized odds is proposed in [20] to measure disparate impact, a concept associated with group fairness. Other metrics, such as Value Unfairness [21] and Absolute Unfairness [21], quantify deviations in predicted ratings across user groups, either considering the direction of errors (Value Unfairness) or focusing solely on absolute errors (Absolute Unfairness). These metrics provide insights into group disparities at the level of rating predictions.
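For reference, these two measures can be written, following our reading of [21], roughly as

$$ U_{val} = \frac{1}{n}\sum_{j=1}^{n} \Big| \big(\mathbb{E}_{g}[y]_{j} - \mathbb{E}_{g}[r]_{j}\big) - \big(\mathbb{E}_{\neg g}[y]_{j} - \mathbb{E}_{\neg g}[r]_{j}\big) \Big| $$

$$ U_{abs} = \frac{1}{n}\sum_{j=1}^{n} \Big| \big|\mathbb{E}_{g}[y]_{j} - \mathbb{E}_{g}[r]_{j}\big| - \big|\mathbb{E}_{\neg g}[y]_{j} - \mathbb{E}_{\neg g}[r]_{j}\big| \Big| $$

where y and r denote the predicted and observed ratings of item j, and the expectations are taken over users in the protected group g and its complement ¬g, respectively.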
Ranking-based metrics evaluate the quality of top-N recommendation lists, which is more relevant in practice, as users typically interact only with the highest-ranked items. Therefore, errors made on items not on that list have no impact on the user [22]. Standard metrics in this category include precision, recall, and ranking-aware measures such as NDCG (Normalized Discounted Cumulative Gain) and MRR (Mean Reciprocal Rank). Recent works have adapted these metrics to fairness assessment. For instance, Neophytou et al. [23] analyzed variations in NDCG across demographic groups defined by gender and age, and [24] proposed a metric to quantify the relative NDCG difference between groups. Zehlike et al. [25] introduced the notion of utility in fair top-k ranking, ensuring that each individual included in the top-k is more qualified than candidates not included.
Other work focuses on the fairness of groups according to their side of the interaction, for example, whether they are producers or consumers [12]. Unfairness towards providers from different continents is studied in [26], focusing on disparities in both the visibility and the exposure given to content providers. One of these studies considers the integration of multiple, heterogeneous fairness definitions [27] but does not specifically address the interrelationship between different types of biases.
The bias sensitivity of GNN methods has also been extensively studied [10,11], although it has been less explored in the specific field of GNN-based recommender systems [28,29,30,31]. In this context, some evaluation metrics have also been proposed, such as the relative difference between the NDCG@N values of demographic groups [24]. The work in [29] focuses on analyzing the robustness of recommendations in GNN-based RS from both the user and provider perspectives. The fairness metric used is based on Demographic Parity (DP), which evaluates disparities in recommendations across different demographic groups according to a given metric. However, the aim is not to quantify fairness, but rather to examine its stability against attacks on graph data.

3. Methodology

The previous sections have presented studies that provide evidence of the interrelationship among different types of biases. One of the most prominent is the connection between unfairness in recommendation quality for protected groups and popularity bias. Recommender systems that favor popular items, resulting in lower item coverage, may perform better on quality metrics, not necessarily due to superior recommendation quality, but because popular items are more likely to appear in the test set. This fact may mask the unfairness of recommendation algorithms in terms of the quality of results provided to minority groups. Prior research has further shown that popularity bias is not only a source of biased performance evaluation, but also a reinforcing mechanism that amplifies existing inequalities in exposure and visibility. By disproportionately recommending already popular items, such systems reduce the opportunities for niche or minority-preferred content to be surfaced, thereby exacerbating representational harms. This dynamic has been shown to intersect with demographic biases, such that users or items associated with underrepresented groups may receive systematically poorer recommendations; however, these disparities may remain undetected in standard evaluations due to the confounding effect of popularity bias.

3.1. Proposed Approach

The objective of this work is to examine the interrelationship between biases and to provide a means of assessing model performance across different demographic groups while accounting for the confounding effects of other biases. To do so, it is necessary to complement traditional accuracy-oriented metrics with measures such as item coverage that explicitly capture the diversity or breadth of the items recommended. Accordingly, we consider an accuracy-oriented metric M, along with the item coverage metric, denoted IC.
First, for a given metric M, which can be precision, recall, NDCG, MRR, etc., we compute the relative difference (RD_M) between groups for the top-N lists with respect to M. We propose a variant of the absolute relative difference introduced in [24], in which we explicitly distinguish between protected (or minority) groups and non-protected (or majority) groups. Unlike the original formulation, we do not apply the absolute value, so that the resulting bias measure can be negative when the metric favors protected groups over non-protected ones.
$$ RD_M = \frac{M_{UG} - M_{PG}}{M_{UG} + M_{PG}} \qquad (1) $$
where M_UG is the metric value for the majority (unprotected) group and M_PG is the value for the minority (protected) group.
Since low RD_M and high IC values are desirable, the geometric mean may be an appropriate way to combine them, provided that both metrics are normalized. As the relative difference RD_M ranges from −1 to 1, it is normalized by means of Equation (2). In this way, both metrics range between 0 and 1.
$$ NRD_M = \frac{RD_M + 1}{2} \qquad (2) $$
Consequently, we define the inter-bias metric IB_M as the geometric mean of the normalized RD_M and the complement of the item coverage metric, (1 − IC):
$$ IB_M = \sqrt{NRD_M \cdot (1 - IC)} \qquad (3) $$
The inter-bias metric allows for the evaluation of the overall bias of the recommendations by incorporating both perspectives. NRD_M is multiplied by (1 − IC) to align the directionality of the two metrics: higher NRD_M indicates greater bias, and lower coverage represents greater concentration on popular items, both of which signify undesirable outcomes. This combination method penalizes imbalances, ensuring that both individual metrics contribute meaningfully and preventing a high value in one metric from offsetting a low value in the other, thereby promoting a more balanced evaluation.
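A minimal Python sketch of Equations (1)-(3) is shown below; the helper names and the numeric example are ours, intended only to illustrate how the metric behaves.

```python
# Sketch of the proposed metrics (Equations (1)-(3)); names are ours.
import math

def relative_difference(m_ug, m_pg):
    """RD_M: signed relative difference between unprotected and protected groups."""
    return (m_ug - m_pg) / (m_ug + m_pg)

def normalized_rd(rd):
    """NRD_M: map RD_M from [-1, 1] to [0, 1]."""
    return (rd + 1) / 2

def inter_bias(m_ug, m_pg, ic):
    """IB_M: geometric mean of NRD_M and (1 - IC)."""
    return math.sqrt(normalized_rd(relative_difference(m_ug, m_pg)) * (1 - ic))

# Hypothetical example: NDCG@10 of 0.30 for the majority group, 0.25 for the
# minority group, and an item coverage of 0.10 give an IB-NDCG@10 of about 0.70.
print(round(inter_bias(0.30, 0.25, 0.10), 3))  # 0.701
```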

3.2. Evaluation Metrics

To assess the quality of the top-N recommendation lists in the case study presented in the following section, we employed a set of widely used evaluation metrics, including NDCG, Precision, Recall, Hit Rate, and MRR. These metrics capture different aspects of recommendation performance, from ranking quality and accuracy to coverage and the prioritization of relevant items. Their definitions are provided below.
  • NDCG (Normalized Discounted Cumulative Gain): Evaluates the quality of a recommendation list by considering both the relevance of the items in the list and their positions in the ranking. It assigns higher importance to relevant items appearing at the top of the list, reflecting the impact of ranking order on user experience.
  • Precision: Measures the proportion of relevant items among those recommended to the user. It captures the accuracy of the system by indicating how many of the suggested results are actually useful.
  • Recall: Assesses the proportion of relevant items retrieved by the system out of all relevant items available for the user. It reflects the system’s ability to cover as many relevant elements as possible.
  • Hit Rate (Hit): Indicates whether at least one relevant item appears in the recommendation list. It is a binary metric at the user level, registering a “hit” if at least one relevant element is included in the top-N recommendations.
  • MRR (Mean Reciprocal Rank): Evaluates the position of the first relevant item in the recommendation list. It rewards systems that rank relevant items higher, giving more credit when the first relevant result appears near the top.
In addition, item coverage is quantified as the proportion of distinct items recommended by the system with respect to the total set of items available in the datasets [9].
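For concreteness, per-user versions of these metrics under binary relevance could be sketched as follows; this is our own sketch, not the RecBole implementations actually used in the experiments.

```python
# Illustrative per-user implementations of the top-N metrics listed above.
import math

def ndcg_at_n(ranked, relevant, n):
    """Binary-relevance NDCG@N for one user."""
    dcg = sum(1 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:n]) if item in relevant)
    idcg = sum(1 / math.log2(pos + 2) for pos in range(min(len(relevant), n)))
    return dcg / idcg if idcg > 0 else 0.0

def precision_at_n(ranked, relevant, n):
    """Fraction of the top-N that is relevant."""
    return sum(1 for item in ranked[:n] if item in relevant) / n

def recall_at_n(ranked, relevant, n):
    """Fraction of the user's relevant items retrieved in the top-N."""
    return sum(1 for item in ranked[:n] if item in relevant) / len(relevant)

def hit_at_n(ranked, relevant, n):
    """1 if at least one relevant item appears in the top-N, else 0."""
    return 1.0 if any(item in relevant for item in ranked[:n]) else 0.0

def mrr_at_n(ranked, relevant, n):
    """Reciprocal rank of the first relevant item in the top-N (0 if none)."""
    for pos, item in enumerate(ranked[:n]):
        if item in relevant:
            return 1 / (pos + 1)
    return 0.0
```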

4. Case Study and Results

To answer the two research questions initially posed, an experimental study was conducted to analyze the recommendations generated by different recommendation algorithms on two public datasets. Both exhibit a long-tail distribution to different degrees and include the gender attribute of the users, which is the sensitive feature we considered to evaluate biases related to minority-group unfairness. Our case study therefore focuses on gender bias and its interrelation with popularity bias.
A random sample of 100,000 interactions from the LastFM [4] and MovieLens [32] datasets was used in the study. The number of users and interactions is unbalanced between genders, with females being the minority, as shown in Figure 1. Figure 2 reveals a long-tail distribution in both datasets, more pronounced in LastFM.
This distribution could lead, as explained in the previous section, to recommendations biased toward the most popular items and obscure the unfair treatment of minority groups by recommendation algorithms. To confirm this, we examined the results obtained for the two groups with the following recommendation methods, whose implementations have been provided by the RecBole [33] and RecBole-GNN [34] libraries: Item-KNN, MF (Matrix Factorization), DMF (Deep Matrix Factorization), NeuMF (Neural Collaborative Filtering), NNCF (Neighborhood-based Neural Collaborative Filtering), NGCF (Neural Graph Collaborative Filtering), SGL (Self-supervised Graph Learning), LightGCN (Light Graph Convolution Network) and DGCF (Disentangled Graph Collaborative Filtering). The choice of these widely used methods reflects their proven effectiveness, while recent studies have shown that graph neural network-based approaches can provide even more efficient solutions [35] for large and complex graphs. The methods selected for the study are particularly suitable, as they are widely adopted in comparative evaluations and fairness analyses, providing a well-established benchmark for assessing group-level disparities in recommendation outcomes.
First, we examine the behavior of the algorithms without considering gender groups. Subsequently, we analyze their behavior individually for each group in order to identify potential differences and to assess whether the hypothesis stated in Research Question 1 can be confirmed. Both datasets were partitioned into training (80%), validation (10%), and testing (10%) splits. Hyperparameter tuning was performed using RecBole’s Hyper-Tuning module with grid search.
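For illustration, such an experiment can be launched through RecBole’s quick-start interface roughly as sketched below; the dataset name and configuration values shown are generic assumptions, not the exact settings used in this study.

```python
# Hedged sketch of a RecBole quick-start run; the dataset and configuration
# values are illustrative assumptions rather than the paper's exact setup.
from recbole.quick_start import run_recbole

config_dict = {
    "eval_args": {
        "split": {"RS": [0.8, 0.1, 0.1]},   # 80/10/10 split, as described above
        "order": "RO",
        "mode": "full",
    },
    "metrics": ["NDCG", "Precision", "Recall", "Hit", "MRR", "ItemCoverage"],
    "topk": [5, 10, 15],
}

# GNN-based models such as LightGCN, NGCF, SGL and DGCF are available through
# RecBole and the RecBole-GNN extension.
run_recbole(model="LightGCN", dataset="ml-100k", config_dict=config_dict)
```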
Figure 3 illustrates the relationship between item coverage and item popularity across different recommendation algorithms in the two studied datasets. It can be observed that algorithms achieving higher item coverage are precisely those recommending less popular items, whereas algorithms favoring highly popular items exhibit substantially lower coverage.
Figure 4 and Figure 5 show the values of different top-N list quality metrics for each method in the LastFM and MovieLens datasets, respectively. These results do not reflect the behavior of the models with respect to gender groups, but rather their overall performance. The most immediate observation, in relation to the objective of our study, is that models characterized by a stronger tendency to recommend popular items achieve higher values on quality metrics. This finding provides support for our previously formulated hypothesis that popularity bias can obscure the evaluation of recommendation quality, and it provides a partial response to research question RQ1.
The results of the studied metrics disaggregated by gender groups (male and female) are presented in Figure 6 and Figure 7, each corresponding to a different dataset. The results indicate that, in the LastFM dataset, which is characterized by a pronounced long-tail distribution, the minority group (female) outperforms the majority group across most metrics and algorithms, with some exceptions, such as the results for the LightGCN algorithm and, to a lesser extent, the DGCF algorithm, both of which are GNN-based methods. In contrast, in the MovieLens dataset, the metric values for the minority group are slightly lower than those of the majority, with the exception of the SGL algorithm and the recall metric, which exhibit the best performance for the female group. These findings support the hypothesis posed in RQ1, suggesting that strong popularity bias may enhance performance while potentially obscuring underlying group-specific disparities and may also hide discriminatory treatment of minority groups.
Next, we present the results of our proposed joint bias evaluation for this case study. First, we report the normalized relative difference N R D M between groups (Equation (2)), using NDCG as the representative metric M . The results, illustrated in Figure 8, corroborate the previous interpretation by showing that the normalized relative difference (NRD) of NDCG@N across recommendation models for the Male and Female groups exhibits minimal variation in the LastFM dataset, likely due to bias masking, whereas in MovieLens, the NRDs are more pronounced, possibly due to its lower popularity bias, which reduces the masking effect. The behavior of the remaining metrics studied (precision, recall, hit and MRR) is very similar.
The previous outcomes highlight the importance of jointly assessing both performance and fairness-related biases. This is the aim of the proposed Inter-Bias I B M metric, whose results across various recommendation algorithms when selecting NDCG as metric M (IB-NDCG@N) are shown in Table 1 and Figure 9, thereby addressing RQ2. Analogous to NRD, the other top-N list evaluation metrics show patterns closely resembling those of the NDCG metric.
Regarding the IB metric, the algorithms exhibit markedly unequal behavior on the LastFM dataset, which has a strong long-tail distribution, whereas their performance is more uniform on the MovieLens dataset, where the long-tail effect is substantially less pronounced. This highlights the significant impact that the distribution of user–item interactions in the dataset has on model behavior with respect to bias. Another noteworthy observation is the contrasting behavior of the matrix factorization (MF) algorithm, which exhibits the lowest bias in one dataset while becoming the most biased in the other. Regarding the bias sensitivity of GNN-based recommendation algorithms compared to other approaches, the results indicate that their overall bias levels are comparable to those of traditional algorithms, and in some cases even lower. On the LastFM dataset, neural network-based methods (DMF, NeuMF and NNCF) achieve the highest NDCG scores but also exhibit the highest bias levels. Notably, NGCF demonstrates the lowest overall bias among the GNN-based models while maintaining a reasonably good NDCG value. In the MovieLens dataset, SGL displays the lowest overall bias among the evaluated algorithms; however, this comes at the cost of relatively low NDCG values.
The study reveals that different modeling approaches involve trade-offs between accuracy and bias. Neural models often favor accuracy at the cost of increased bias, particularly in long-tail datasets like LastFM. GNN-based methods such as NGCF and SGL are less prone to bias, sometimes at the expense of precision.

5. Conclusions

In this work, we propose a novel approach for jointly evaluating biases in recommender systems, with a focus on how one type of bias may mask or intensify the effects of another. Specifically, the study examines two key biases. The first involves disparities in recommendation quality across demographic groups, with particular attention to gender-based differences between protected (minority) and unprotected (majority) users. The second concerns the overrepresentation of popular items, assessed through item coverage, a metric that inversely reflects item popularity and captures the system’s ability to promote diversity. This combined evaluation offers a more comprehensive understanding of algorithmic behavior and provides a solid foundation for designing more equitable recommender systems.
In future work, we intend to explore additional bias dimensions beyond gender and popularity, such as age and other demographic attributes. We also aim to apply the framework to more diverse domains and datasets and to investigate mitigation strategies capable of addressing multiple biases simultaneously.

Author Contributions

Conceptualization, N.C., K.T. and M.N.M.-G.; methodology, N.C. and M.N.M.-G.; software, N.C. and K.T.; validation, N.C., K.T. and M.N.M.-G.; formal analysis, N.C., K.T. and M.N.M.-G.; investigation, N.C., K.T. and M.N.M.-G.; data curation, N.C.; writing—review and editing, N.C., K.T. and M.N.M.-G.; visualization, N.C., K.T. and M.N.M.-G.; supervision, M.N.M.-G.; project administration, M.N.M.-G.; funding acquisition, M.N.M.-G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Regional Ministry of Education of the Junta de Castilla y León (Spain). Project SA061G24, under ORDEN EDU/740/2024 of July 19.

Data Availability Statement

Publicly available datasets have been used. Details are provided in Section 4.

Acknowledgments

During the preparation of this manuscript, the authors used GPT-5 to correct grammatical errors in the text and improve the wording. The authors have reviewed the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
GNN	Graph Neural Networks
RS	Recommender Systems
AI	Artificial Intelligence

References

  1. Fahse, T.; Huber, V.; Giffen, B.V. Managing Bias in Machine Learning Projects. In Proceedings of the International Conference on Wirtschaftsinformatik, Essen, Germany, 9–11 March 2021; pp. 94–109. [Google Scholar]
  2. Kordzadeh, N.; Ghasemaghaei, M. Algorithmic Bias: Review, Synthesis, and Future Research Directions. Eur. J. Inf. Syst. 2022, 31, 388–409. [Google Scholar] [CrossRef]
  3. Celma, O.; Cano, P. From hits to niches? or how popular artists can bias music recommendation and discovery. In Proceedings of the 2nd KDD, Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition, Las Vegas, NV, USA, 24–27 August 2008. [Google Scholar]
  4. Celma, O. Music Recommendation and Discovery: The Long Tail, Long Fail, and Long Play in the Digital Music Space; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2010. [Google Scholar] [CrossRef]
  5. Sánchez-Moreno, D.; Muñoz, M.D.; López, V.F.; Gil, A.B.; Moreno-García, M.N. A session-based song recommendation approach involving user characterization along the play power-law distribution. Complexity 2020, 2020, 7309453. [Google Scholar] [CrossRef]
  6. Bellogín, A.; Castells, P.; Cantador, I. Statistical biases in information retrieval metrics for recommender systems. Inf. Retr. J. 2017, 20, 606–634. [Google Scholar] [CrossRef]
  7. Sun, W.; Khenissi, S.; Nasraoui, O.; Shafto, P. Debiasing the human-recommender system feedback loop in collaborative filtering. In Proceedings of the 2019 World Wide Web Conference (WWW ’19), San Francisco, CA, USA, 13–17 May 2019; pp. 645–651. [Google Scholar]
  8. Naghiaei, M.; Rahmani, H.A.; Dehghan, M. The Unfairness of Popularity Bias in Book Recommendation. arXiv 2022, arXiv:2202.13446. [Google Scholar] [CrossRef]
  9. Wang, X.; Wang, W.H. Providing item-side individual fairness for deep recommender systems. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’22), Seoul, Republic of Korea, 21–24 June 2022; pp. 117–127. [Google Scholar]
  10. Dai, E.; Wang, S. Say no to the discrimination: Learning fair graph neural networks with limited sensitive attribute information. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM ’21), Virtual, 8–12 March 2021; pp. 680–688. [Google Scholar]
  11. Dong, Y.; Wang, S.; Wang, Y.; Derr, T.; Li, J. On structural explanation of bias in graph neural networks. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’22), Washington, DC, USA, 14–18 August 2022; pp. 316–326. [Google Scholar]
  12. Abdollahpouri, H.; Mansoury, M.; Burke, R.; Mobasher, B. The unfairness of popularity bias in recommendation. arXiv 2019, arXiv:1907.13286. [Google Scholar] [CrossRef] [PubMed]
  13. Abdollahpouri, H.; Mansoury, M.; Burke, R.; Mobasher, B.; Malthouse, E. User centered evaluation of popularity bias in recommender systems. In Proceedings of the 29th ACM Conference on User Modeling, Adaptation and Personalization (UMAP ’21), Utrecht, The Netherlands, 21–25 June 2021; pp. 119–129. [Google Scholar]
  14. Dwork, C.; Hardt, M.; Pitassi, T.; Reingold, O.; Zemel, R. Fairness through awareness. In Proceedings of the Innovations in Theoretical Computer Science (ITCS ’12), Cambridge, MA, USA, 8–10 January 2012; pp. 214–226. [Google Scholar]
  15. Islam, R.; Keya, K.N.; Pan, S.; Sarwate, A.D.; Foulds, J.R. Differential fairness: An intersectional framework for fair AI. Entropy 2023, 25, 660. [Google Scholar] [CrossRef] [PubMed]
  16. Kamishima, T.; Akaho, S.; Asoh, H.; Sakuma, J. Enhancement of the neutrality in recommendation. In Decisions@RecSys; ACM: New York, NY, USA, 2012; pp. 8–14. [Google Scholar]
  17. Kamishima, T.; Akaho, S.; Asoh, H.; Sakuma, J. Efficiency improvement of neutrality-enhanced recommendation. In Decisions@RecSys; ACM: New York, NY, USA, 2013; pp. 1–8. [Google Scholar]
  18. Rahman, T.; Surma, B.; Backes, M.; Zhang, Y. Fairwalk: Towards fair graph embedding. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI ’19), Macao, China, 10–16 August 2019; ACM: New York, NY, USA; pp. 3289–3295. [Google Scholar]
  19. Hardt, M.; Price, E.; Srebro, N. Equality of opportunity in supervised learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS ’16), Barcelona, Spain, 5–10 December 2016; pp. 3315–3323. [Google Scholar]
  20. Spinelli, I.; Scardapane, S.; Hussain, A.; Uncini, A. Fairdrop: Biased edge dropout for enhancing fairness in graph representation learning. IEEE Trans. Artif. Intell. 2021, 3, 344–354. [Google Scholar] [CrossRef]
  21. Yao, S.; Huang, B. Beyond parity: Fairness objectives for collaborative filtering. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS ’17), Long Beach, CA, USA, 5–7 December 2017. [Google Scholar]
  22. Valcarce, D.; Bellogín, A.; Parapar, J.; Castells, P. Assessing ranking metrics in top-N recommendation. Inf. Retr. J. 2020, 23, 411–448. [Google Scholar] [CrossRef]
  23. Neophytou, N.; Mitra, B.; Stinson, C. Revisiting popularity and demographic biases in recommender evaluation and effectiveness. In Proceedings of the European Conference on Information Retrieval (ECIR ’22), Stavanger, Norway, 10–16 April 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 641–654. [Google Scholar]
  24. Chizari, N.; Tajfar, K.; Shoeibi, N.; Moreno-García, M.N. Quantifying fairness disparities in graph-based neural network recommender systems for protected groups. In Proceedings of the 19th International Conference on Web Information Systems and Technologies (WEBIST ’23), Rome, Italy, 15–17 November 2023; pp. 176–187. [Google Scholar] [CrossRef]
  25. Zehlike, M.; Sühr, T.; Baeza-Yates, R.; Bonchi, F.; Castillo, C.; Hajian, S. Fair Top-k ranking with multiple protected groups. Inf. Process. Manag. 2022, 59, 102707. [Google Scholar] [CrossRef]
  26. Gómez, E.; Boratto, L.; Salamó, M. Provider fairness across continents in collaborative recommender systems. Inf. Process. Manag. 2022, 59, 102719. [Google Scholar] [CrossRef]
  27. Aird, A.; Štefancová, E.; All, C.; Voida, A.; Homola, M.; Mattei, N.; Burke, R. Social choice for heterogeneous fairness in recommendation. In Proceedings of the 18th ACM Conference on Recommender Systems (RecSys ’24), Bari, Italy, 14–18 October 2024. [Google Scholar] [CrossRef]
  28. Boratto, L.; Fabbri, F.; Fenu, G.; Marras, M.; Medda, G. Explaining unfairness in GNN-based recommendation (Extended Abstract). In Proceedings of the Second Learning on Graphs Conference (LoG 2023), Virtual, 27–30 November 2023. [Google Scholar]
  29. Boratto, L.; Fabbri, F.; Fenu, G.; Marras, M.; Medda, G. Robustness in fairness against edge-level perturbations in GNN-based recommendation. In Proceedings of the European Conference on Information Retrieval (ECIR ’24), Glasgow, Scotland, 24–28 March 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 38–55. [Google Scholar]
  30. Chizari, N.; Shoeibi, N.; Moreno-García, M.N. A comparative analysis of bias amplification in graph neural network approaches for recommender systems. Electronics 2022, 11, 3301. [Google Scholar] [CrossRef]
  31. Chizari, N.; Tajfar, K.; Moreno-García, M.N. Bias assessment approaches for addressing user-centered fairness in GNN-based recommender systems. Information 2023, 14, 131. [Google Scholar] [CrossRef]
  32. Harper, M.; Konstan, J.A. The MovieLens Datasets: Distributed by GroupLens at the University of Minnesota. 2021. Available online: https://grouplens.org/datasets/movielens/ (accessed on 4 September 2025).
  33. Zhao, W.; Zhang, S.; Xia, Y.; Zhang, Z.; Wang, J.; Liu, Y.; Tang, J. RecBole: Towards a unified, comprehensive and efficient framework for recommendation algorithms. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM ‘21), Virtual, 1–5 November 2021; pp. 4653–4662. [Google Scholar] [CrossRef]
  34. RUCAIBox. RecBole-GNN: Efficient and extensible GNNs enhanced recommender library based on RecBole. 2021. Available online: https://github.com/RUCAIBox/RecBole-GNN (accessed on 4 September 2025).
  35. Pan, C.-H.; Qu, Y.; Yao, Y.; Wang, M.-J.-S. HybridGNN: A Self-Supervised Graph Neural Network for Efficient Maximum Matching in Bipartite Graphs. Symmetry 2024, 16, 1631. [Google Scholar] [CrossRef]
Figure 1. Number of users and interactions by gender.
Figure 2. Long-tail distribution in the datasets.
Figure 3. Item popularity and item coverage of top-N recommendation list provided by the recommender algorithms in the LastFM and the MovieLens datasets for different values of N.
Figure 4. Top-N list quality metrics for the recommender algorithms in the LastFM dataset for different values of N.
Figure 5. Top-N list quality metrics for the recommender algorithms in the MovieLens dataset for different values of N.
Figure 6. Evaluation of top-N recommendation list for different N values and recommendation models in the LastFM dataset.
Figure 7. Evaluation of top-N recommendation list for different N values and recommendation models in the MovieLens dataset.
Figure 8. Normalized Relative Difference (NRD) of NDCG@N for different recommendation models.
Figure 9. IB-NDCG@N for different recommendation models.
Table 1. IB-NDCG@N for different recommendation models.

              LastFM                               MovieLens
Model         IB-NDCG@5  IB-NDCG@10  IB-NDCG@15    IB-NDCG@5  IB-NDCG@10  IB-NDCG@15
ItemKNN       0.641      0.621       0.608         0.624      0.580       0.545
MF            0.224      0.070       0.020         0.717      0.695       0.685
DMF           0.663      0.636       0.614         0.613      0.563       0.526
NeuMF         0.663      0.641       0.617         0.593      0.535       0.495
NNCF          0.662      0.639       0.615         0.625      0.569       0.521
NGCF          0.341      0.162       0.077         0.637      0.578       0.536
SGL           0.503      0.363       0.253         0.395      0.276       0.221
LightGCN      0.552      0.339       0.227         0.627      0.568       0.527
DGCF          0.544      0.339       0.227         0.606      0.545       0.506
