1. Introduction
With the advent of the internet and web-based systems, users have gained access to vast amounts of information. This abundance has introduced the problem of information overload, making it more difficult to find relevant and appropriate content. Recommender systems have become a vital tool to address this issue, offering personalized suggestions based on user interaction histories. Numerous studies have demonstrated their effectiveness in supporting decision-making across various domains, including e-commerce, travel, tourism, and multimedia services through streaming platforms [1,2,3]. The increasing relevance of these platforms, particularly in relation to their content offerings and online retail, has driven significant interest in this research area, resulting in its prominence in high-impact conferences and journals [4,5].
Among the different types of recommender systems, collaborative filtering and content-based methods remain the most established and widely applied approaches [6]. Content-based methods rely on descriptive attributes of items to recommend items similar to those previously favored by the user. In contrast, collaborative filtering uses user–item interaction matrices to infer preferences by identifying patterns among users with similar behavior.
Although recommender systems have been extensively studied and have evolved significantly, they still face key challenges that limit their effectiveness. Two major issues are data sparsity and the cold-start problem [7,8]. Sparsity arises when the number of user–item interactions is low relative to the total number of available items, hindering the system’s ability to learn accurate preferences. The cold-start problem occurs with new users or items for which no prior interactions exist, making it difficult to generate high-quality recommendations. Common solutions include suggesting popular items to new users or eliciting preferences through brief surveys.
To mitigate these issues, various forms of side information have been employed. These include user-specific data such as demographic attributes (e.g., age, gender, nationality) and social information (e.g., friends or followers). Item-specific information depends on the application domain, for instance, genre, director, or release year in movie recommendations, or low-level acoustic features in music recommendations. Another valuable form of side information is contextual data, used broadly in context-aware recommender systems. These systems tailor recommendations based on a user’s situation at a given time, but such contextual factors can also improve recommendations independently of the specific context.
Recent advances in computer vision, particularly the development of Vision Transformers (ViTs), have opened new possibilities for enhancing recommender systems. Unlike traditional convolutional approaches, ViTs leverage self-attention mechanisms to capture global semantic relationships within images, enabling the extraction of high-level and contextually rich visual representations. ViTs can help infer user preferences even in the absence of extensive historical data, thereby improving personalization and recommendation accuracy. Their flexibility and effectiveness in modeling unstructured data make ViTs a promising tool for developing more robust and multimodal recommender systems. This study aims to evaluate how incorporating complex side information, particularly visual characteristics, influences the performance of various recommender systems.
This work contributes a novel methodological framework for improving traditional recommender systems by integrating top-k recommendation approaches with unstructured visual information, specifically through item-associated images. Employing Vision Transformers to extract semantic features from item images, the system refines base recommendations without altering underlying models, thus ensuring adaptability and scalability. Unlike prior studies that embed visual data as side information in the recommendation process, our approach explicitly isolates and evaluates the impact of image-based features, providing a clear framework for assessing their role. The proposed architecture is designed for flexibility, enabling adaptation to different case studies and datasets, while offering a clear protocol for assessing the added value of multimodal integration. Experimental validation using well-known movie datasets demonstrates the effectiveness of the approach, highlighting significant performance gains when visual side information is incorporated into the recommendation pipeline.
The article is organized into several sections addressing key aspects of the proposed approach. A review of the state of the art and related work is presented (Section 2) to provide an overview of current research on recommender systems using side information. The proposed system is detailed next (Section 3), followed by the selected case studies (Section 4) and the results obtained (Section 5). Finally, the conclusions are discussed, along with the main limitations and challenges encountered during the development process (Section 6).
2. Related Work
Recommender systems have become essential tools for personalized content delivery across a wide range of domains, from e-commerce to entertainment. As discussed earlier, traditional recommendation techniques, such as collaborative filtering and content-based methods, often face limitations due to sparse data and the cold-start problem. To address these challenges, recent research has increasingly focused on incorporating side information that enriches the user–item interaction matrix. This section reviews the types of side information typically used in recommender systems and explores how they are integrated into different recommendation paradigms.
2.1. Types of Side Information in Recommender Systems
To mitigate challenges such as data sparsity and the cold-start problem, different techniques leveraging various types of side information are employed. These include item metadata and attributes, as well as data derived from social networks that allow inferring preferences based on close contacts. Tags or categories associated with items, such as movie genres or music types, also often reflect user tastes. Sun et al. [9] propose a categorization of side information based on its structure, dividing it into structured and non-structured data.
2.1.1. Structured Data
Structured data in recommender systems can be grouped into flat, network-based, hierarchical, and knowledge graph-based features [10,11,12]. Flat features, such as metadata or genres, represent attributes at the same level and are widely used to compute item similarity and mitigate the cold-start problem [7,8]. For instance, MARec [13] is a metadata alignment model that combines non-linear similarities between flat feature vectors, such as genres, actors, or directors, in movie recommendations, helping to mitigate the cold-start problem by merging multiple similarity matrices. Network features leverage user relationships, such as friendships, to infer preferences based on social influence, improving recommendations in sparse or cold-start scenarios [11,14]. Hierarchical features capture inclusion or specialization relationships (e.g., genre > subgenre), enabling better generalization and explainability [15]. For example, hierarchical structures allow transferring preferences across related categories. Finally, knowledge graphs unify heterogeneous information (e.g., user demographics and item attributes) into a semantic space, enabling nuanced preference modeling beyond direct similarities [12,16,17].
2.1.2. Non-Structured Data
Non-structured data, such as text, images, and video, play a complementary role in enhancing traditional recommendation systems by enriching item and user representations [7,18].
Textual features, often derived from associated content like user reviews or item descriptions, help capture the nuanced and subjective dimensions of user preferences [7,19]. For instance, Chou et al. [7] introduce an ordinal-consistency-based matrix factorization model that exploits textual side information to guide collaborative filtering by aligning the ordinal nature of ratings with latent factor inference.
Visual features, on the other hand, are extracted from item-related images, such as posters or covers, using autoencoders, neural networks, or, more recently, Transformer-based models such as the Vision Transformer (ViT); such features have been shown to significantly influence user decision-making and are often combined with textual information [20,21,22,23]. To better understand the proposed research, it is essential to mention the work of Dosovitskiy et al. [24], where the first ViT was introduced. It adapts the Transformer architecture from natural language processing to image classification by splitting each image into fixed-size patches (e.g., 16 × 16 pixels), embedding them linearly, and adding positional encodings before processing them as a sequence in a standard Transformer encoder. When pretrained on large datasets and fine-tuned for specific tasks, ViT achieves performance comparable or superior to advanced convolutional neural networks (CNNs), while requiring fewer computational resources during training.
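To make the patch-based tokenization concrete, the following minimal sketch (an illustrative reimplementation, not code from [24]) splits a 224 × 224 RGB image into 16 × 16 patches and flattens them, yielding the 196 patch tokens that, after linear projection, positional encoding, and the addition of a CLS token, form the input sequence of the Transformer encoder.

```python
import numpy as np

# Toy 224x224 RGB image (values in [0, 1]); in practice this would be a decoded poster.
image = np.random.rand(224, 224, 3)

patch = 16                        # ViT-Base/16 patch size
n_side = image.shape[0] // patch  # 224 / 16 = 14 patches per side
patches = (
    image.reshape(n_side, patch, n_side, patch, 3)
         .transpose(0, 2, 1, 3, 4)
         .reshape(n_side * n_side, patch * patch * 3)
)

print(patches.shape)  # (196, 768): 196 tokens, each later projected and given a positional encoding
```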
2.2. Recommendation Methods That Incorporate Side Information
This section reviews techniques that integrate side information, especially non-structured textual and visual data, to improve collaborative filtering, content-based, and deep learning recommendation systems.
Regarding collaborative filtering models with side information, incorporating textual and visual features has been shown to improve personalization. For text, approaches include memory-based models that compute semantic similarity or analyze sentiment in reviews [19], and latent factor models enriched with semantic word representations [8], sentiment analysis [25], aspect modeling [26], and topic modeling [27]. There are also recent proposals that utilize K-Nearest Neighbor (KNN) techniques in tandem with textual side information to improve recommendations [28,29]. For visual data, convolutional neural networks extract features that are combined with latent factors, benefiting systems in domains with sparse data [20,30].
Regarding content-based methods with side information, base text-representation methods such as Term Frequency–Inverse Document Frequency (TF-IDF) are used with other techniques, such as word embeddings and synset extraction, to represent textual side information, improving evaluation and mitigating the cold-start problem [31,32]. For visual features, computer vision algorithms and architectures like VGG19 are employed to complement textual information obtained with semantic and embedding techniques and strengthen user feedback [31,33,34,35].
Regarding deep learning models, these integrate side information to overcome data limitations, using neural networks (such as CNNs or RNNs) and autoencoders to process text [36,37,38]. Neural architectures are also combined with factorization techniques to extract aspects and sentiments [39]. Transformer-based models and Large Language Models, such as BERT, generate contextual embeddings that improve recommendations and explanations [40,41]. For visual features, both from-scratch and pretrained networks (e.g., ResNet, VGG) are used to extract visual representations [23,42,43], along with specialized Transformer models such as FashionBERT [44]. Recent hybrid vision architectures like ViT-CoMer combine CNN-based multi-scale feature pyramids with a Vision Transformer backbone to enhance dense visual representation, constituting a notable application for leveraging side information [45]. Similarly, Transformer-based multi-view, multi-label frameworks, such as Label-Guided Masked View- and Category-Aware Transformers, tackle incomplete views and labels by enabling cross-view feature aggregation and label-aware embeddings [46].
This review indicates that recent approaches employ a wide range of techniques to exploit side information. In content-based recommender systems, although less extensively explored, researchers have incorporated text-representation methods, such as TF-IDF and its variants, as well as embedding-based techniques, including word embeddings, in combination with visual side information. Likewise, in collaborative filtering, latent factor models, such as Singular Value Decomposition (SVD), and neighborhood-based methods, such as KNN, have been adapted to integrate side information, thereby improving recommendation performance. Despite their relative simplicity, these techniques remain effective vehicles for leveraging side information.
Unlike prior studies that embed visual side information directly during model training, our approach applies a modular late-stage re-ranking using semantic features extracted from item images. This design enhances traditional top-k recommender systems while maintaining flexibility, interpretability, and applicability in cold-start scenarios.
3. Proposed System
Building on the review of the state of the art, this section defines the methodology that serves as the foundation for the subsequent case studies. The core of the proposal lies in the design and implementation of a two-stage system, structured in two differentiated modules, aiming to optimize the use of available side information to generate more accurate and personalized final recommendations. In this case, visual side information associated with items is chosen due to the increasing relevance of computer vision in recent years. The overall structure is illustrated in Figure 1.
The main motivation behind this modular architecture is its flexibility: it can be easily adapted to different scenarios by combining and varying system components according to specific study requirements. At the same time, it enables the evaluation of how incorporating unstructured visual information sources impacts the quality of the generated recommendations.
In the first module, a base top-k recommendation system is implemented. It relies on traditional recommendation approaches to generate the list of items as a starting point in the use of this methodology. This system takes as input a dataset comprising user-related features, such as explicit or implicit preferences depending on the specific use case.
The second module focuses on visual feature extraction and re-ranking. This module incorporates side information from images associated with the previously recommended items. The aim is to enrich the recommendation process by integrating visual attributes from catalog items, improving standard evaluation metrics in top-k systems, as well as the personalization of final recommendations. Each module is described in detail below:
Base Top-k Recommender System: This module relies on traditional recommendation systems, specifically collaborative filtering and content-based filtering approaches. These techniques have proven effective in various contexts and provide a solid foundation for experimentation. The dataset employed, depending on the specific situation, is introduced at the beginning of the module and serves as the foundation for the subsequent recommendations. A common requirement for all implementations is their ability to generate top-k recommendation lists, i.e., lists containing the k most relevant items for each user according to the model.
Feature Extraction and Re-ranking Module: This module aims to implement a system for re-ranking the previously generated top-k recommendation lists. For each recommended item, the system compares its associated image with the images of the user’s favorite items, computing pairwise similarities. A visual score is then calculated by averaging these values, re-ranking the list so that items with higher scores appear earlier. To compute these similarities, the system relies on the extraction of visual features from image data. A recent and powerful approach for this task is the use of Vision Transformers (ViTs). ViT is a neural network architecture designed for image processing, which divides each image into patches, flattens them, and applies a self-attention mechanism to model relationships between patches. With this approach, feature vectors can be obtained, acting as the basis for computing visual similarity and subsequently re-ranking the top-k list.
A key feature of our framework is the incorporation of a visual re-ranking stage powered by pretrained ViT models. Posters of recommended movies are compared against those of the user’s highly rated films using deep visual embeddings. Cosine similarity between feature vectors is computed, and the recommendation list is re-ranked based on visual affinity, capturing aesthetic preferences not addressed by traditional models.
By eliminating convolutions, ViT demonstrates that self-attention mechanisms can capture long-range dependencies between patches, allowing the model to scale without losing global spatial information. The term “long-range” refers to the model’s ability to directly capture interactions between distant regions of an image. ViT enables more direct and comprehensive integration of visual information, enhancing the model’s ability to understand complex contexts and distributed patterns across the image.
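The sketch below illustrates the re-ranking logic just described under simplified assumptions: the embed() helper is hypothetical and stands in for any pretrained ViT feature extractor mapping an item to its poster feature vector (e.g., a CLS embedding), and the visual score of a recommended item is the average cosine similarity between its poster and those of the user's highly rated films.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def visual_rerank(topk_items, liked_items, embed):
    """Re-rank a top-k list by average visual similarity to the user's liked items.

    topk_items  : item ids produced by the base recommender
    liked_items : item ids the user rated highly
    embed       : callable mapping an item id to its poster feature vector
                  (e.g., a ViT CLS embedding); assumed to be provided externally.
    """
    liked_vecs = [embed(i) for i in liked_items]
    scores = {
        item: np.mean([cosine(embed(item), lv) for lv in liked_vecs])
        for item in topk_items
    }
    # Items with higher average visual similarity move to the top of the list.
    return sorted(topk_items, key=lambda item: scores[item], reverse=True)
```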
Experimental Workflow
Figure 2 outlines the general pipeline. Each experiment follows the same stages:
Data Preparation: Load and preprocess metadata or rating data, depending on the recommendation paradigm.
Base Recommendation Model: Train and apply either a content-based or collaborative model.
Initial Evaluation: Assess recommendation performance using held-out test data and standard metrics.
Visual Re-Ranking: Extract visual features using ViT models and re-rank recommendations based on cosine similarity with user-preferred posters.
Re-ranking Evaluation: Re-evaluate the top-k recommendations to measure the impact of the visual re-ranking stage.
4. Case Studies and Experimental Setup
To evaluate the performance of various recommender systems, a set of case studies was designed to simulate real-world movie recommendation scenarios. These case studies allow for a systematic comparison of different algorithms and assess the effect of an additional visual re-ranking phase on recommendation quality. Two personalized splits are generated per user, composed of ten films each: the first split is used to generate the recommendations, and the second one serves as ground truth for evaluation. Content-based and collaborative filtering approaches are evaluated using standard top-k ranking metrics. The visual embeddings are extracted using several state-of-the-art ViT-based models from the HuggingFace (https://huggingface.co/ (accessed on 9 February 2025)) repository. This diversity ensures robustness and mitigates model-specific bias.
4.1. Datasets Used
To develop the proposed recommender systems, various datasets were selected according to the adopted approach (content-based or collaborative filtering) and the application domain. In this work, focused on movie recommendation, datasets include both textual and structured information about items (e.g., descriptions, genres, or metadata), as well as explicit user ratings. To incorporate complementary visual information, external resources were used to retrieve graphical representations, such as posters, associated with each item. The methodological process includes the selection and integration of different data sources, extraction of relevant features depending on the system type (e.g., item attributes for content-based systems, interaction patterns for collaborative filtering systems), and normalization or transformation of these data for use in the recommendation models.
The primary source is The Movies Dataset (https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset (accessed on 26 January 2025)) for both metadata and ratings. The metadata_small.csv dataset provides enriched metadata for a wide range of movies, including approximately 45,000 films. This dataset supports content-based models through structured and textual information, such as
movieId: Unique identifier for each movie;
Title, genres, overview, and tagline: Semantic descriptors;
release_date, production_companies, runtime: Auxiliary metadata.
These fields facilitate the creation of content vectors used to compute similarities between movies and generate personalized recommendations.
To incorporate user preferences, a MovieLens (https://grouplens.org/datasets/movielens/ (accessed on 8 February 2025))-based dataset is used, particularly the ratings_small.csv dataset integrated within The Movies Dataset, which has over 100,000 user ratings. In collaborative filtering cases, this dataset is used for training the models based on the user ratings. A subset for testing is created, for both the content-based and collaborative filtering cases, by selecting active users who have rated at least 20 movies, employing 50 curated users. It contains user–item interactions and includes
userId, movieId: Identifiers linking users and movies;
Rating: user-provided score (0.5 to 5.0);
Timestamp: Rating date in Unix epoch format.
This dataset supports collaborative filtering by identifying patterns of user behavior and is also used in content-based models to infer user preferences based on positively rated items.
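As a minimal sketch of this preparation step, the snippet below selects active users and draws two disjoint ten-film splits per user from ratings_small.csv, one for building the user profile and one as ground truth. The ≥20-rating filter, the 50-user sample, and the two ten-film splits follow the setup described above; the "liked" threshold (rating ≥ 4.0) and the chronological ordering are illustrative assumptions not specified in the text.

```python
import pandas as pd

ratings = pd.read_csv("ratings_small.csv")  # columns: userId, movieId, rating, timestamp

# Keep active users (at least 20 rated movies), then sample a fixed pool of test users.
counts = ratings.groupby("userId")["movieId"].count()
active_users = counts[counts >= 20].index
test_users = pd.Series(active_users).sample(n=50, random_state=42)

profile_split, ground_truth = {}, {}
for user in test_users:
    user_ratings = ratings[ratings["userId"] == user].sort_values("timestamp")
    liked = user_ratings[user_ratings["rating"] >= 4.0]["movieId"].tolist()
    # First ten liked films build the user profile; the next ten are held out as ground truth.
    profile_split[user] = liked[:10]
    ground_truth[user] = liked[10:20]
```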
To enrich the recommendation process with visual elements, the TMDB API (https://developers.themoviedb.org/3 (accessed on 27 January 2025)) is employed to retrieve official movie posters. The integration is achieved via links.csv, found in The Movies Dataset repository, which includes the movieId, imdbId, and tmdbId identifiers that map each MovieLens entry to its IMDb and TMDB records. Posters are then processed using computer vision models to extract visual features, which are later used to re-rank recommendations based on visual similarity with the user’s preferences.
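A minimal sketch of this integration step is shown below; it assumes a valid TMDB API key and uses the public v3 movie endpoint and image CDN paths, so the exact URL patterns should be checked against the current TMDB documentation.

```python
import pandas as pd
import requests

API_KEY = "YOUR_TMDB_API_KEY"      # placeholder; obtain a key from TMDB

links = pd.read_csv("links.csv")   # columns: movieId, imdbId, tmdbId

def poster_url(movie_id: int):
    """Resolve a MovieLens movieId to a TMDB poster URL (None if unavailable)."""
    row = links.loc[links["movieId"] == movie_id]
    if row.empty or pd.isna(row["tmdbId"].iloc[0]):
        return None
    tmdb_id = int(row["tmdbId"].iloc[0])
    resp = requests.get(
        f"https://api.themoviedb.org/3/movie/{tmdb_id}",
        params={"api_key": API_KEY},
        timeout=10,
    )
    resp.raise_for_status()
    path = resp.json().get("poster_path")
    return f"https://image.tmdb.org/t/p/w500{path}" if path else None
```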
The choice of datasets is grounded in their widespread use in recommender systems research and the richness of their information. The Movies Dataset provides detailed semantic metadata for content-based approaches. MovieLens is a benchmark for collaborative filtering due to its volume and quality of user interactions. Finally, the TMDB API adds visual data, particularly movie posters, enhancing the system with multimodal inputs. Its open-access policy makes it especially suitable for academic purposes.
4.2. Base Recommendation Models
To properly frame the base recommendation models, this study evaluates four techniques that the recent literature has combined with visual side information. These methods cover two complementary paradigms: content-based approaches, which rely on item metadata, and collaborative filtering approaches, which exploit user–item interaction patterns.
In the content-based case, item metadata is leveraged to capture textual semantics. This study uses the overview field from The Movies Dataset, where descriptions are transformed into vector representations using two techniques:
TF-IDF: A classical text-representation method that produces sparse, high-dimensional vectors by quantifying the relative importance of terms within documents, enabling content comparison through keyword relevance.
Sentence Embeddings: Dense semantic vectors generated with the SentenceTransformer framework, which extend text-representation methods by encoding entire sentences into lower-dimensional representations that capture contextual meaning.
The first model is trained using the scikit-learn library (https://scikit-learn.org/ (accessed on 5 April 2025)), while the second one is trained using the Sentence Transformers library (https://sbert.net/ (accessed on 12 June 2025)). The final recommendation lists are re-ranked using the same visual similarity process described above.
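The following sketch outlines how the two content-based variants can be built from the overview field under simplified assumptions: TF-IDF vectors come from scikit-learn and dense vectors from the sentence-transformers checkpoint all-MiniLM-L6-v2 (an illustrative choice, not stated in the text); the user profile is represented by the mean vector of the films in the user's profile split, and the top-k most similar unseen films are recommended.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

movies = pd.read_csv("metadata_small.csv", low_memory=False)
movies = movies.dropna(subset=["overview"]).reset_index(drop=True)

# Variant 1: sparse lexical vectors (TF-IDF).
tfidf_vectors = TfidfVectorizer(stop_words="english").fit_transform(movies["overview"])

# Variant 2: dense semantic vectors (Sentence Embeddings).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
se_vectors = encoder.encode(movies["overview"].tolist(), show_progress_bar=False)

def recommend(profile_rows, item_vectors, k=10):
    """Top-k unseen movies most similar to the mean vector of the user's profile movies."""
    profile = np.asarray(item_vectors[profile_rows].mean(axis=0)).reshape(1, -1)
    sims = cosine_similarity(profile, item_vectors).ravel()
    sims[profile_rows] = -np.inf       # exclude already-seen titles
    return np.argsort(-sims)[:k]       # row indices of the recommended movies
```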
In the collaborative filtering case, recommendations are derived from patterns in user feedback from the MovieLens dataset. Similarity scores are computed between candidate items and those positively rated by the user, and the top-k unseen items with the highest scores are recommended. Two representative approaches are considered:
KNN: A neighborhood-based collaborative filtering approach, where item similarity is computed using Pearson correlation and recommendations are derived from the nearest neighbors of a candidate item.
SVD: A latent factor model that decomposes the user–item interaction matrix into a lower-dimensional space, uncovering hidden patterns in preferences and enabling recommendations through user–item proximity in the latent feature space.
Both models are trained with the Surprise library (https://surpriselib.com/ (accessed on 5 April 2025)), and recommendations are generated based on the predicted user–item ratings. The final recommendation lists are then re-ranked using the same visual similarity process described above.
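A minimal sketch of the two collaborative baselines with the Surprise library is shown below (hyperparameters are library defaults rather than values reported in the text); top-k lists are obtained by predicting ratings for each user's unseen items and keeping the k highest.

```python
import pandas as pd
from surprise import Dataset, Reader, SVD, KNNBasic

ratings = pd.read_csv("ratings_small.csv")
data = Dataset.load_from_df(ratings[["userId", "movieId", "rating"]],
                            Reader(rating_scale=(0.5, 5.0)))
trainset = data.build_full_trainset()

# Item-based KNN with Pearson correlation, and the SVD latent factor model.
knn = KNNBasic(sim_options={"name": "pearson", "user_based": False}).fit(trainset)
svd = SVD().fit(trainset)

def top_k(model, user_id, k=10):
    """Predict ratings for the user's unseen items and return the k best movieIds."""
    seen = set(ratings.loc[ratings["userId"] == user_id, "movieId"])
    candidates = [m for m in ratings["movieId"].unique() if m not in seen]
    preds = [(m, model.predict(user_id, m).est) for m in candidates]
    return [m for m, _ in sorted(preds, key=lambda p: p[1], reverse=True)[:k]]

print(top_k(svd, user_id=1))
```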
4.3. Visual Re-Ranking Methodology
Posters of recommended and previously liked movies are processed via multiple ViT-based models. Cosine similarities are averaged to compute a visual similarity score for each recommendation. This re-ranking stage aims to enrich the recommendation with implicit aesthetic preferences and evaluate its contribution in terms of improved top-k ranking metrics. The models used are the following:
google/vit-base-patch16-224-in21k (https://huggingface.co/google/vit-base-patch16-224-in21k (accessed on 9 February 2025)): This model represents the original implementation of ViT. It was trained in a supervised manner on the ImageNet-21k dataset, which contains approximately 14 million images distributed across 21,843 classes. The model divides images into 16 × 16 pixel patches, uses absolute positional embeddings, and employs a special CLS token for classification tasks. Its training on such a large dataset allows it to capture general visual representations useful for various computer vision tasks.
google/vit-large-patch16-224-in21k (https://huggingface.co/google/vit-large-patch16-224-in21k (accessed on 9 February 2025)): This extended version of the original ViT model significantly increases model capacity by having a larger number of parameters and layers. Like its base counterpart, it is trained in a supervised manner on the ImageNet-21k dataset, but its larger size enables it to capture more complex and detailed visual patterns. The architecture continues to divide images into 16 × 16 pixel patches and uses a CLS token, but benefits from greater network depth to generate richer semantic representations.
facebook/deit-base-patch16-224 (https://huggingface.co/facebook/deit-base-patch16-224 (accessed on 9 February 2025)): Based on the Data-efficient Image Transformer (DeiT) model, this was developed by Facebook, now Meta, with the goal of improving training efficiency for ViTs. Unlike the original ViT, DeiT incorporates an additional distillation token that allows the model to learn from a teacher, such as a convolutional network, during training. This strategy facilitates more efficient training with fewer data while maintaining competitive performance on classification tasks.
facebook/dinov2-large-imagenet1k-1-layer (https://huggingface.co/facebook/dinov2-large-imagenet1k-1-layer (accessed on 9 February 2025)): This model belongs to the second generation of DINO (Self-Distillation with No Labels), a family of self-supervised learning methods. DINOv2 trains ViTs without the need for annotations, using a distillation strategy between augmented views of the same image. The large variant with a single layer focuses on capturing general-purpose representations from natural images, trained on ImageNet-1k. The result is an efficient model capable of learning versatile and transferable visual features without relying on traditional supervised learning.
microsoft/beit-base-patch16-224 (https://huggingface.co/microsoft/beit-base-patch16-224 (accessed on 9 February 2025)): The Bidirectional Encoder representation from Image Transformers (BEiT) model developed by Microsoft introduces a self-supervised pretraining approach inspired by language modeling methods such as BERT. BEiT is initially trained to predict masked visual tokens, using a visual tokenizer based on VQ-VAE. It is then fine-tuned on supervised tasks such as image classification. This strategy allows the model to learn rich semantic representations without requiring large amounts of labeled data.
microsoft/beit-large-patch16-224-pt22k-ft22k (https://huggingface.co/microsoft/beit-large-patch16-224-pt22k-ft22k (accessed on 9 February 2025)): This is an enhanced version of the BEiT model featuring a deeper architecture and trained using a two-stage approach. The first stage is a self-supervised pretraining on ImageNet-22k using masked token prediction, and the second is supervised fine-tuning on the same dataset. This combination allows the model to learn robust representations at both semantic and categorical levels. Thanks to its size and training strategy, this version is particularly suitable for tasks requiring detailed and contextual visual understanding.
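The sketch below shows one way to extract poster embeddings with any of the checkpoints listed above through the Hugging Face transformers library; the CLS token of the last hidden state is used as the image representation, which is a common choice for ViT-style encoders (the pooling strategy is an assumption, as the text does not specify it).

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

CHECKPOINT = "google/vit-base-patch16-224-in21k"   # any checkpoint from the list above
processor = AutoImageProcessor.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT).eval()

@torch.no_grad()
def poster_embedding(path: str) -> torch.Tensor:
    """Return the CLS embedding of a poster image as a 1D tensor."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0].squeeze(0)   # CLS token

a = poster_embedding("poster_recommended.jpg")
b = poster_embedding("poster_liked.jpg")
similarity = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
print(round(similarity, 3))
```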
4.4. Evaluation Metrics
The evaluation of the recommender system models in this work is based on a set of standardized metrics in the field. These metrics allow for an accurate quantification of the model’s performance when generating recommendations for unseen users or items, as well as an assessment of its ability to generalize, reduce prediction errors, and provide robust and relevant suggestions across diverse user contexts.
Firstly, NDCG (Normalized Discounted Cumulative Gain) (Equation (1)) is calculated by giving more importance to relevant items that appear in the top positions of the recommendation list. First, the DCG or Discounted Cumulative Gain (Equation (2)) is calculated, which sums the relevance of the recommended items applying a discount based on position, and then it is normalized using the IDCG or Ideal DCG (Equation (3)), which represents the DCG of an ideally ordered list. The resulting value of NDCG ranges from 0 to 1. Unlike other metrics, NDCG takes into account not only the relevance of the recommended items but also their position in the list, assigning greater weight to relevant items appearing earlier. This makes it particularly useful when the order of recommendations is critical for the user experience. It is defined as

$$\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}, \quad (1)$$

$$\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \quad (2)$$

$$\mathrm{IDCG@}k = \sum_{i=1}^{k} \frac{rel_i^{\,ideal}}{\log_2(i+1)}, \quad (3)$$

where $k$ represents the position up to which the result list is evaluated. The variable $rel_i$ indicates the relevance of the element at position $i$ in the obtained result list, while $rel_i^{\,ideal}$ denotes the relevance of the element at position $i$ in the ideally ordered list by relevance, which represents the best possible ranking.
Secondly, MAP (Mean Average Precision) (Equation (4)) calculates the proportion of relevant items in a recommendation list of length k for a given user, and then averages over all users. In this way, a global view of performance across all system users is obtained. It is defined as

$$\mathrm{MAP@}k = \frac{1}{n}\sum_{i=1}^{n} AP_i\mathrm{@}k, \quad (4)$$

$$AP_i\mathrm{@}k = \frac{1}{|R_i|}\sum_{j=1}^{k} P(j)\,rel(j), \quad (5)$$

where $n$ represents the total number of queries or users considered. The $\mathrm{MAP@}k$ metric is calculated as the average of the individual average precisions $AP_i\mathrm{@}k$. In the expression for $AP_i\mathrm{@}k$, $R_i$ is the set of relevant items for query $i$, $P(j)$ denotes the precision at position $j$ of the ordered result list, and $rel(j)$ is a binary function that is 1 if the item at position $j$ is relevant, and 0 otherwise.
Lastly, MRR (Mean Reciprocal Rank) (Equation (6)) evaluates the rank, or position, of the first relevant item in the recommended ranking for each user. As with the previous metric, a global view of performance is obtained by averaging over all users of the system. It is defined as

$$\mathrm{MRR} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{rank_i}, \quad (6)$$

where $n$ represents the total number of queries or users. The variable $rank_i$ indicates the rank of the first relevant result in the list of results returned for query $i$.
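For reference, the three metrics can be computed as in the following sketch, a straightforward implementation of Equations (1)–(6) assuming binary relevance (an item is relevant if it appears in the user's ground-truth split).

```python
import numpy as np

def ndcg_at_k(recommended, relevant, k):
    """NDCG@k with binary relevance (Equations (1)-(3))."""
    rels = [1.0 if item in relevant else 0.0 for item in recommended[:k]]
    dcg = sum(r / np.log2(i + 2) for i, r in enumerate(rels))
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

def ap_at_k(recommended, relevant, k):
    """Average precision for one user (Equation (5))."""
    hits, score = 0, 0.0
    for j, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / j          # precision at position j, accumulated only at hits
    # Some implementations normalize by min(len(relevant), k) instead of |R_i|.
    return score / len(relevant) if relevant else 0.0

def rr_at_k(recommended, relevant, k):
    """Reciprocal rank of the first relevant item within the top-k list."""
    for j, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            return 1.0 / j
    return 0.0

# MAP@k and MRR@k (Equations (4) and (6)) are the means of ap_at_k and rr_at_k over all test users.
```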
4.5. Test Environment
The different case studies developed were trained and evaluated in a local computing environment without the need for cloud services. The specifications of the equipment and software used can be seen in Table 1.
This hardware was adequate for the case studies, supporting the generation of recommendations and the re-ranking process employing item images. However, for more demanding tasks, such as experiments involving ViT techniques, the use of a dedicated GPU could further enhance performance.
5. Results
Based on the case studies defined in the previous section, this part presents a detailed analysis of the results obtained for each evaluated recommendation approach. The primary objective is to quantify system performance before and after visual re-ranking, and to compare the quality of recommendations delivered to users. The structure of the different experiments is shown in Figure 3.
Based on the case studies, four techniques were selected as the basis for the experiments: TF-IDF, Sentence Embeddings, KNN, and SVD. Together, these techniques provide complementary perspectives, allowing us to evaluate how visual features enhance both content-based and collaborative filtering paradigms. An additional objective is to test whether this side information can help content-based methods, which have traditionally performed worse than collaborative filtering methods, become competitive alternatives.
The analysis was carried out using the previously described metrics, calculated on user-specific test sets. This approach enables us to observe performance differences depending on the type of information used, whether it is a content-based or a collaborative filtering approach, and also to evaluate the added effect of visual similarity on the final recommendation ranking.
The results are organized according to the two main approaches considered in the study: content-based methods and collaborative filtering techniques. For each case, the metric values are reported and analyzed both with and without the re-ranking stage, enabling a more precise and comprehensive evaluation of each strategy.
Based on the proposed system, experiments were conducted using recommendation lists of lengths 5 and 10, referred to as top@5 and top@10. These list sizes were selected to avoid overwhelming the re-ranking module with excessive visual input, which negatively impacts the re-ranking performance and, consequently, the overall recommendation quality.
Before detailing the evaluations of each use case, the abbreviations used to identify the models and their variants are presented, in order to simplify and streamline the presentation:
TF-IDF: Base content-based model using text frequency.
SE: Base content-based model using Sentence Embeddings.
KNN: Base collaborative filtering model applying the k-nearest neighbors method.
SVD: Base collaborative filtering model using Singular Value Decomposition.
Model + G: Base model with visual re-ranking using Google’s ViT model.
Model + G-L: Base model with visual re-ranking using Google’s ViT Large model.
Model + F: Base model with visual re-ranking using Facebook’s DEiT model.
Model + F-L: Base model with visual re-ranking using Facebook’s DINOv2 Large model.
Model + M: Base model with visual re-ranking using Microsoft’s BEiT Base model.
Model + M-L: Base model with visual re-ranking using Microsoft’s BEiT Large model.
5.1. Evaluation of the Content-Based Approach
In this section, two representative approaches of content-based methods are explored. The first employs the classical technique of Term Frequency–Inverse Document Frequency (TF-IDF), which represents items through lexical feature vectors. The second relies on semantic representations obtained through Sentence Embeddings, capable of capturing deeper relationships between terms. Both methods share the idea of recommending items based on their similarity to the user profile, constructed from previously consumed content, although they differ in how they represent and process textual information.
In addition, the system’s behavior is evaluated both in its original version and after the application of visual re-ranking. The objective is to determine whether the inclusion of visual side information significantly contributes to improving the quality of the recommendations generated by this method.
5.1.1. TF-IDF-Based Experiment
The model based on TF-IDF constitutes the baseline version without visual information, serving as a reference to evaluate the improvements from the re-ranking.
Table 2 and Table 3 show the values obtained for each metric under the different configurations.
The top@10 results confirm that re-ranking based on visual side information significantly enhances recommendation performance. The baseline TF-IDF model, which does not use visual features, shows considerably lower metrics, with NDCG@10 of 0.215, MAP@10 of 0.019, and MRR@10 of 0.177. In contrast, models using visual embeddings from ViT, BEiT, and DINOv2 achieve improvements ranging from 27% to 49% in NDCG@10, 47% to 84% in MAP@10, and 45% to 79% in MRR@10.
Among Google-based models, the ViT-Large variant (TF-IDF + G-L) delivers consistent improvements, showing the advantage of larger model size and deeper pretraining. Facebook models also perform well, especially the DINOv2-based TF-IDF + F-L, which obtains the highest scores across all metrics, demonstrating the strength of self-supervised features. Microsoft’s BEiT models, particularly TF-IDF + M-L, achieve competitive results close to DINOv2, highlighting the potential of masked patch learning.
Top@5 metrics further support these findings. All visual models outperform the TF-IDF baseline, with the best results from TF-IDF + F-L and TF-IDF + M-L, showing improvements of 31.6% in NDCG@5, 55.6% in MAP@5, and 48.8% in MRR@5. These results underscore the robustness of visual re-ranking in prioritizing relevant items.
Finally, a case study illustrated in Table 4 shows how the TF-IDF + F-L model reorders items more effectively, promoting relevant content to higher ranks and exemplifying the practical impact of self-supervised visual features in recommendation tasks.
In summary, the results show that the use of visual models for re-ranking consistently improves the performance of the recommendation system. The differences between models are explained not only by their architecture and size, but also by the training strategies used. DINOv2 large stands out as the best option in terms of visual feature extraction, especially useful in scenarios where ranking precision in the top positions is critical. It is closely followed by BEiT Large, which offers a solid alternative with a good balance between complexity and performance. Finally, the models from Google also show considerable improvements, being particularly attractive in contexts where integration with existing infrastructures is prioritized or where visual embeddings generated with ViT are already available.
5.1.2. Sentence Embedding-Based Experiment
The second approach is based on the Sentence Embedding (SE) technique. The evaluated configurations include the visual re-ranking variants mentioned earlier.
Table 5 and Table 6 show the values obtained for each metric under the different configurations.
The top@10 results indicate that incorporating visual side information for re-ranking significantly improves performance even when starting from a Sentence Embedding (SE) baseline. The SE model without re-ranking shows low effectiveness, with NDCG@10 of 0.121 and zero values for MAP@10 and MRR@10, reflecting poor ranking of relevant items.
By adding visual features, notable gains are achieved, with relative improvements of up to 42.1% in NDCG@10, 52.9% in MAP@10, and 57.1% in MRR@10. These results confirm the positive impact of visual re-ranking on models optimized at the latent level.
The ViT-based models yield the highest improvements, with SE + ViT-G reaching an NDCG@10 of 0.217, MAP@10 of 0.021, and MRR@10 of 0.211. The large version, SE + ViT-G-L, achieves similar outcomes. Facebook models, particularly SE + DINO-F-L, also show enhanced performance with an NDCG@10 of 0.190, MAP@10 of 0.018, and MRR@10 of 0.178. The DeiT-based version performs slightly below DINOv2 but still improves over the baseline. Microsoft’s BEiT models provide consistent gains, with SE + BEiT-M-L achieving NDCG@10 of 0.196, MAP@10 of 0.018, and MRR@10 of 0.183, approaching the results of DINOv2.
The top@5 analysis supports these trends. While the SE baseline reaches only 0.075 in NDCG@5 and 0.067 in MRR@5, visual re-ranking models raise these values to approximately 0.100. Despite a slight drop in MAP@5, the overall improvements highlight better ranking precision in the most visible positions.
Overall, the results confirm that even when starting from a recommendation model based on SE, the incorporation of visual information through computer vision models improves the positioning of relevant items. Although the impact is less pronounced than in the case of the TF-IDF model, the gains obtained justify the exploration of multimodal approaches in recommendation tasks.
5.2. Evaluation of the Collaborative Filtering Approach
This section presents the evaluation results of two classic collaborative filtering approaches. One is based on nearest neighbors, or K-nearest neighbors (KNNs), and the other is based on matrix factorization using Singular Value Decomposition (SVD). Both approaches share the goal of estimating preferences based on past interactions between users and items, although they differ in how they model this information. KNN relies on explicit similarities between users or items, while SVD extracts underlying latent factors.
As in the previous case, the system’s behavior is evaluated both in its original version and after the application of visual re-ranking. The objective is to determine whether the inclusion of visual side information significantly contributes to improving the quality of the recommendations generated by this method.
5.2.1. K-Nearest Neighbor-Based Experiment
The KNN model constitutes the baseline version without visual information, serving as a reference point to evaluate the benefits of re-ranking. The evaluated configurations include the visual re-ranking variants mentioned earlier.
Table 7 and Table 8 show the values obtained for each metric under the different configurations.
The top@10 results in the collaborative filtering scenario using the KNN method confirm that visual re-ranking significantly enhances recommendation quality. The baseline KNN model, without visual input, achieves NDCG@10 of 0.322, MAP@10 of 0.055, and MRR@10 of 0.266. Visual re-ranking leads to improvements ranging from 4.2% to 36.6% in NDCG@10, 21.2% to 56.1% in MAP@10, and 22.2% to 61.8% in MRR@10.
Google-based models deliver strong gains, with KNN + ViT-G achieving NDCG@10 of 0.439, MAP@10 of 0.086, and MRR@10 of 0.430. The large variant (KNN + G-L) maintains similar performance, though with slight drops in NDCG and MRR, suggesting that larger models may be more sensitive to collaborative signal variation. Facebook’s models show a similar pattern. While the base version (KNN + F) offers moderate improvements, the large variant (KNN + F-L) stands out with NDCG@10 of 0.420, MAP@10 of 0.082, and MRR@10 of 0.410. These results highlight the strength of DINOv2’s self-supervised features even when visual content is not central to the original recommendation process. Microsoft’s BEiT models also contribute positively. KNN + BEiT-M slightly outperforms the baseline, while the large version (KNN + M-L) reaches competitive levels, with NDCG@10 of 0.435, MAP@10 of 0.082, and MRR@10 of 0.430, reinforcing the value of masked patch representations in collaborative contexts.
Top@5 metrics show consistent improvements as well. The baseline reaches NDCG@5 of 0.287, MAP@5 of 0.040, and MRR@5 of 0.251, while the best visual models increase these values to NDCG@5 of 0.338, MAP@5 of 0.052, and MRR@5 of 0.333. These gains underline the ability of visual re-ranking to prioritize more relevant items at the top ranks.
As illustrated in Table 9, the KNN + F-L model effectively promotes item ID 18872 from the sixth position to the first, demonstrating its ability to surface higher-relevance items overlooked by the baseline. This example highlights the practical advantages of integrating visual side information in collaborative filtering tasks.
The results for this KNN-based approach confirm that visual re-ranking also significantly benefits systems based on collaborative filtering. Although the system already starts from a strong signal (derived from user behavior), the incorporation of visual side information helps refine the final ranking and promote items with higher visual similarity. Among the best-performing models, Google’s models, specifically the KNN + G variant, emerge as the most robust options in terms of performance, while DINOv2 and BEiT Large models offer a competitive alternative with lower complexity and greater interoperability in existing environments.
5.2.2. Singular Value Decomposition-Based Experiment
The second evaluated approach is based on the Singular Value Decomposition (SVD) technique, which allows capturing latent relationships between users and items. As with KNN, the effect of visual re-ranking on the recommendations generated by this model is analyzed, using the same ViT variants.
Table 10 and Table 11 show the values obtained for each metric under the different configurations.
The top@10 results demonstrate that visual models substantially improve SVD-based recommendations. The baseline SVD model achieves NDCG@10 of 0.373, MAP@10 of 0.067, and MRR@10 of 0.308. With visual models, improvements reach up to 40.5% in NDCG@10, 52.2% in MAP@10, and 66.2% in MRR@10, confirming the value of visual features even in latent factor models.
The Google-based variants show solid performance, with SVD + G reaching NDCG@10 of 0.447 and its large version (SVD + G-L) offering slight improvements in MAP and MRR. Microsoft and Facebook models provide stronger and similar results, particularly SVD + F-L and SVD + M-L, the latter achieving the best overall performance with NDCG@10 of 0.524, MAP@10 of 0.102, and MRR@10 of 0.512. These findings confirm that visual re-ranking enhances recommendation quality, with SVD + F-L and SVD + M-L emerging as the most effective configurations. Meanwhile, the Google-based variants offer a simpler alternative with consistent gains and easier integration.
Top@5 results reinforce these trends. The base SVD model obtains NDCG@5 of 0.323, MAP@5 of 0.055, and MRR@5 of 0.289. Visual re-ranking with large models from Facebook and Microsoft improves performance across all metrics, reaching NDCG@5 of 0.405, MAP@5 of 0.075, and MRR@5 of 0.400, further validating the effectiveness of visual information in refining top-ranked recommendations.
In conclusion, the results show that integrating visual side information consistently enhances the quality of item lists generated by recommender systems, both in models based on explicit similarity like KNN and those based on latent factors like SVD. Representations from DINOv2 and BEiT Large stand out for their robustness, while Google’s ViT models provide a strong balance between simplicity and performance. These results confirm the usefulness of combining these techniques with visual side information to improve the identification of relevant items, especially in the top positions of the ranking.
5.3. Comparison Between the Content-Based and Collaborative Filtering Case Studies
This section aims to compare the results obtained in the previous sections for both the content-based system and those based on collaborative filtering.
Figure 4 presents a direct comparison of the NDCG@10, MAP@10, and MRR@10 values, along with their top@5 counterparts, for each of the variants used across all case studies. This allows for a clear visualization of the impact of visual side information in different contexts.
In the case of the KNN and SVD models, visual re-ranking also yields significant relative improvements in NDCG performance. For KNN, variants such as KNN + G and KNN + M-L achieve up to 36% and 35% relative improvements in NDCG@10 and NDCG@5, respectively, compared to the base model. Similarly, SVD + M-L reaches an NDCG@10 of 0.524 and an NDCG@5 of 0.405, representing relative gains of approximately 40% and 25% over the base SVD model. These results demonstrate that even strong collaborative filtering approaches can benefit substantially from visual re-ranking, leading to more relevant and accurate recommendations.
Content-based models also benefit from visual re-ranking, though the magnitude of improvement varies. In the case of TF-IDF, the variant TF-IDF + F-L achieves an NDCG@10 of 0.321 and an NDCG@5 of 0.254, which correspond to relative improvements of approximately 49% and 32% over the base TF-IDF model. Similarly, TF-IDF + M-L performs comparably, indicating that visual features from different sources consistently enhance retrieval effectiveness. For the SE model, the best-performing variant SE + G improves NDCG@10 by 79% and NDCG@5 by 33% relative to the base SE model. These results suggest that visual side information effectively complements content-based approaches, helping them approach the performance of more complex collaborative filtering models.
In summary, the results confirm the following:
The highest relative improvements are observed in the content-based cases, given their lower starting points.
The highest absolute metrics are achieved by the SVD models with re-ranking.
DINOv2 and BEiT-Large stand out for their performance in TF-IDF and SVD cases, while Google models stand out in the SE and KNN cases.
Methods based on Sentence Embedding (SE) show great potential to match the baseline performance of TF-IDF when combined with visual re-ranking techniques.
Likewise, TF-IDF methods can reach the baseline performance of much more complex models like KNN.
This comparison suggests that visual re-ranking techniques not only enhance TF-IDF systems to make them competitive, but also significantly boost SE, KNN, and SVD models, establishing themselves as a cross-cutting improvement strategy for recommendation systems.
5.4. Computational Cost of the Process
This section discusses the computational cost, comparing execution times of base models with those after ViT-based re-ranking.
The base models exhibited very low inference times, with simpler methods such as TF-IDF averaging just 0.02 s per user to generate recommendations. This efficiency partially offsets the additional latency incurred during the more computationally demanding re-ranking phase. When re-ranking was applied using the more basic ViT models, the execution time increased to an average of 80.56 s per user, an increase by a factor of more than 4028 with respect to the base TF-IDF model. These results highlight the substantial computational overhead introduced by the re-ranking process. For more complex ViT models, the relative increase is even greater, further emphasizing the trade-off between accuracy gains and efficiency. For example, the DINOv2 model takes up to 200 s per user, making the evaluation of the case studies significantly more time-consuming.
In addition to execution time, RAM usage is another critical factor to consider. The use of basic ViT models typically consumes up to 8 GB of RAM, while more complex architectures can require up to 10 GB or more, depending on the implementation and batch size. This increased memory demand stems from the need to store intermediate feature maps, attention matrices, and model parameters during inference. It is important to account for this consumption alongside the operating system, background tasks, and development environment overhead, such as an IDE, which can further increase memory pressure. Considering both execution time and memory requirements, it is recommended to use systems with high RAM capacity when working with ViT-based re-ranking models. Ensuring sufficient memory helps prevent slowdowns or crashes, providing a more stable and efficient environment for experimentation and deployment.
Regarding the inference of ViT models, basic architectures require approximately 9.49 s to extract feature vectors for 10 pairs of images and calculate their similarity scores. More complex ViT models can take up to 21.4 s for the same task due to larger parameter counts and more intricate attention mechanisms. These inference times directly impact the scalability of the re-ranking process, especially when dealing with large datasets or real-time recommendation scenarios. For instance, in systems requiring hundreds or thousands of similarity calculations, the cumulative delay can become a bottleneck, highlighting the need to carefully balance model complexity with performance requirements.
In summary, while ViT-based re-ranking significantly improves recommendation accuracy, it comes at the cost of increased execution time, memory usage, and inference latency. Many of these challenges can be mitigated through the use of a dedicated GPU, optimizing the implementation of the algorithms or precomputing and mapping all feature vectors from all items in advance, allowing for more efficient and scalable deployment without compromising performance.
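As a sketch of the precomputation strategy mentioned above (file layout and function names are illustrative), poster embeddings can be extracted once per catalog item and stored on disk, so that the re-ranking stage reduces to cheap cosine similarities over cached vectors instead of repeated ViT inference per user request.

```python
import numpy as np

def build_embedding_cache(item_ids, embed, path="poster_embeddings.npz"):
    """Extract and store one feature vector per item (run once, offline).

    embed: callable mapping an item id to its ViT feature vector (numpy array).
    """
    ids = np.array(list(item_ids))
    vectors = np.stack([embed(i) for i in item_ids])
    np.savez_compressed(path, ids=ids, vectors=vectors)

def load_embedding_cache(path="poster_embeddings.npz"):
    """Load the cached vectors as a dict of item id -> numpy array."""
    data = np.load(path)
    return dict(zip(data["ids"].tolist(), data["vectors"]))

# At serving time, the re-ranking stage only needs dictionary lookups and dot products,
# avoiding a new forward pass through the ViT for every recommended item.
```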
6. Conclusions
This study introduced and validated a methodology for enhancing recommendation systems by integrating traditional top-k recommendation techniques with non-structured side information, specifically visual content associated with the recommended items. The proposed architecture combines a baseline recommender, content-based or collaborative filtering, with a visual feature extraction module. This module re-ranks recommended items according to the visual similarity of their movie posters. This modular design facilitates experimentation while isolating the contribution of visual information.
Experiments conducted on well-known datasets, including The Movies Dataset and MovieLens (augmented with images via the TMDB API), demonstrate consistent performance improvements across all baseline models—CB, KNN, and SVD. Notably, content-based methods showed the most significant gains, often matching or surpassing collaborative techniques when enhanced with visual re-ranking. Among the tested vision models, ViT-based architectures such as Google ViT and ViT-Large, as well as more complex models such as DINOv2 and BEiT-Large, provided the strongest improvements. These results confirm the value of incorporating visual side information into recommendation systems.
Despite these promising results, the study has several limitations. The computational cost of Vision Transformers remains a barrier to scalability, with execution times increasing by factors of up to 4028 relative to the base models, particularly in resource-constrained environments. The quality and availability of visual data (e.g., missing or uninformative posters) also affect performance. Furthermore, pretrained ViT models were used without fine-tuning due to hardware constraints, which may have limited the expressiveness of the visual features. Finally, the fusion of structured and visual data was limited to a late-stage re-ranking process, missing opportunities for deeper multimodal integration.
In addition to movie recommendation, the proposed methodology has the potential to enhance recommender systems across a variety of domains where visual content is available. For instance, e-commerce platforms could leverage product images to improve personalized suggestions or fashion and furniture recommendation systems could benefit from visual–semantic alignment. Future work will aim to extend experiments to such applications, evaluating the effectiveness of integrating visual features with traditional recommendation techniques in diverse settings.
Another future research direction involves building more integrated multimodal architectures, which offer promising avenues to enhance recommendation performance. Instead of limiting visual information to a late-stage re-ranking process, the system could incorporate visual feature vectors directly into the recommendation model, enabling end-to-end learning that jointly optimizes structured and visual data. Advanced architectures, such as multimodal Transformers or hybrid neural recommender networks, could facilitate deeper interactions between modalities and capture more complex relationships.
Finally, more advanced and recent base recommendation methods are planned for future application, building upon the promising results obtained with the more traditional models presented here. While this work proposes a framework based primarily on established recommender techniques, the outcomes highlight a strong potential for future experiments leveraging state-of-the-art approaches. In addition, vision models will be fine-tuned on domain-specific data to improve their effectiveness for recommendation tasks. These developments seek to improve the scalability and personalization capabilities of multimodal recommendation systems.