1. Introduction
With the development of Information and Communication Technology (ICT) and the widespread use of the Internet, the rapidly growing e-commerce market allows users to easily search for and purchase a variety of newly released products [1]. Despite these advantages, users face information overload when selecting items that suit their preferences [2,3]. This phenomenon has led e-commerce platforms to recognize the importance of recommender systems, which take user preferences into account and recommend items that are likely to be purchased based on users' needs [4]. Recommender systems work by analyzing past user behavior to generate recommendations [5]. As a result, users can reduce their information search costs, and companies can secure competitive advantages in customer management and sales improvement.
Collaborative Filtering (CF) is one of the most widely used recommender system techniques [6]. However, because CF relies solely on users' past behavior and fails to fully capture their behavioral motivations, it suffers from a data sparsity problem [7]. To address this issue, previous studies have used auxiliary information such as online reviews, which contain information related to user and item features [2,8]. For example, Zheng, Noroozi and Yu [6] applied a Convolutional Neural Network (CNN) to the review sets of users and items to extract textual representations for recommendation. Liu, Chen and Chang [8] used Bidirectional Encoder Representations from Transformers (BERT) and Robustly optimized BERT approach (RoBERTa) models as encoders to extract textual information from review texts. Their experimental results showed that such methods can significantly enhance recommendation performance, as they incorporate more diverse information from the review texts. Although these review-based recommender systems mitigate data sparsity and improve recommendation performance, they fail to capture user preferences expressed through visual content [9].
Nowadays, most online reviews are multimodal, containing both images and texts [10]. Images convey detailed aspects of items and users' preferences in an intuitive manner that review texts may not be able to deliver [11]. From the sentence "I love this design" in the review text shown in Figure 1, it is evident that the user appreciates the design of the phone case. Relying solely on the text, however, it is difficult to determine which specific part of the design the user appreciates. In other words, images can directly convey user preferences that are not expressed in the text [9]. Texts and images in multimodal reviews thus provide different types of information that can be complementary or substitutive [11]. Meanwhile, some studies confirm that fusing these two types of information enhances the effectiveness of recommendations [10]. It is therefore necessary to propose a recommender system that can capture the complementary and substitutive effects between texts and images.
As for the fusion strategy, most studies have used multiplication or concatenation operations, which do not fully reflect the interactions between different modalities [12]. To address this, our study applies an advanced attention-based fusion method to capture the complementarity between texts and images. Specifically, we leverage co-attention to model the complementarity between review texts and images by focusing on aligned features. We therefore propose a novel recommender system model called CAMRec (Co-Attention based Multimodal Recommender system), which effectively integrates the complementary effects of texts and images into the recommendation task. First, we use the RoBERTa model, which demonstrates excellent performance on various Natural Language Processing (NLP) tasks, to extract textual information. Second, we use the VGG-16 model, which effectively recognizes complex patterns and features of images, to extract visual information. Third, we apply a co-attention mechanism to obtain joint representations of textual and visual information through the dependencies between them. Finally, we integrate the overall features to predict a user's preference for a specific item (a feature-extraction sketch is given after the contribution list below). To evaluate the recommendation performance of the proposed CAMRec model, we use two datasets from Amazon. The experimental results confirm that the proposed CAMRec outperforms various baseline models. The primary contributions of this study can be summarized as follows:
This study proposes CAMRec, which reflects user preferences from textual and visual perspectives and provides recommendations based on the complementarity between the two modalities.
This study explores the impact of users' perception of visual representations on the performance of multimodal recommender systems, offering valuable insights into how visual features reflect user preferences.
This study conducts extensive experiments using real-world datasets from Amazon. The experimental results offer new directions for future research in the field of recommender systems.
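As referenced above, the following is a minimal sketch of the feature-extraction step, assuming the Hugging Face transformers and torchvision libraries; the pooling and layer choices are illustrative assumptions, not the exact CAMRec configuration.

```python
import torch
from transformers import RobertaTokenizer, RobertaModel
from torchvision import models, transforms
from PIL import Image

# Pre-trained encoders used as frozen feature extractors (no fine-tuning),
# as described in Section 4.4.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
text_encoder = RobertaModel.from_pretrained("roberta-base").eval()
vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
# Keep VGG-16 up to its penultimate fully connected layer (4096-d output).
visual_encoder = torch.nn.Sequential(
    vgg16.features, vgg16.avgpool, torch.nn.Flatten(),
    *list(vgg16.classifier.children())[:-1],
)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(review_text: str, review_image: Image.Image):
    tokens = tokenizer(review_text, truncation=True, max_length=512,
                       return_tensors="pt")
    # Token-level textual representations: (seq_len, 768).
    text_feat = text_encoder(**tokens).last_hidden_state.squeeze(0)
    # One 4096-d visual representation per review image.
    img_feat = visual_encoder(preprocess(review_image).unsqueeze(0)).squeeze(0)
    return text_feat, img_feat
```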
The rest of this study is organized as follows: Section 2 addresses related work, and Section 3 describes our proposed CAMRec model. Section 4 outlines the experimental data, evaluation metrics, and baseline models used in this study. Section 5 presents the experimental results and discusses them. Finally, Section 6 concludes this study.
4. Experiments
In this study, we conducted extensive experiments using two publicly available datasets from the e-commerce platform Amazon. To verify the performance of the proposed model, we answer the following research questions (RQs).
RQ 1: Does the proposed CAMRec model provide better recommendation performance compared to other baseline models?
RQ 2: How do the fused features of texts and images impact recommendation performance?
RQ 3: Which fusion method is the most effective in fusing texts and images?
4.1. Datasets
This study uses two datasets from Amazon as the experimental datasets: Cell Phones and Accessories, and Electronics (https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/, accessed on 10 May 2024). Amazon datasets are widely used in recommender system studies because they contain vast amounts of purchase records as well as review texts and images. For the experiments, this study filters the data and uses only user reviews that contain both text and image data.
Table 1 summarizes the detailed statistics of the datasets used in this study. We randomly divided the data into 70% for training, 10% for validation, and 20% for testing [1].
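As an illustration, the following sketch performs this split, assuming the reviews are held in a pandas DataFrame; the use of scikit-learn's train_test_split is our assumption, not necessarily the original implementation.

```python
from sklearn.model_selection import train_test_split

def split_dataset(reviews_df, seed=42):
    """Randomly split reviews into 70% train, 10% validation, 20% test."""
    train_df, rest_df = train_test_split(reviews_df, test_size=0.3,
                                         random_state=seed)
    # Split the remaining 30%: one third (10% overall) for validation,
    # two thirds (20% overall) for testing.
    val_df, test_df = train_test_split(rest_df, test_size=2 / 3,
                                       random_state=seed)
    return train_df, val_df, test_df
```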
4.2. Evaluation Metrics
To evaluate the performance of our model, we use Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), which are widely used metrics in recommender systems. MAE is calculated as in Equation (14): the absolute differences between predicted and actual ratings are summed and then divided by the total number of evaluated ratings, so each error contributes equally to the final measure regardless of its magnitude. RMSE is calculated as in Equation (15): the squared errors between the predicted and actual ratings are averaged over the total number of evaluated ratings, and the square root of this average is taken. Compared to MAE, RMSE assigns relatively larger weight to larger errors.
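For reference, the two metrics can be written as follows (reconstructed from the descriptions above; here $N$ denotes the number of evaluated ratings, and $\hat{r}_{u,i}$ and $r_{u,i}$ denote the predicted and actual ratings of user $u$ for item $i$):

$$\mathrm{MAE} = \frac{1}{N}\sum_{(u,i)}\left|\hat{r}_{u,i} - r_{u,i}\right| \quad (14)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{(u,i)}\left(\hat{r}_{u,i} - r_{u,i}\right)^{2}} \quad (15)$$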
4.3. Baseline Models
To validate the effectiveness of the proposed CAMRec model, we selected the following baseline models for comparison, all of which have been widely used in recommender system studies. Each baseline model is described below.
PMF [28]: This model predicts ratings by modeling the latent factors of users and items, taking the rating matrix as input under a Gaussian distribution assumption. It is effective on sparse, imbalanced rating data.
NeuMF [29]: This model combines Generalized Matrix Factorization (GMF) and a Multi-Layer Perceptron (MLP) to capture the nonlinear relationships between user and item latent factors.
DeepCoNN [6]: This model uses two parallel CNNs to extract features from a user's reviews and an item's reviews, respectively, and combines them with a Factorization Machine (FM) to predict ratings.
RSBM [7]: This model extracts semantic features from review texts using a CNN and a self-attention mechanism, and predicts ratings based on the importance of each extracted feature.
VBPR [18]: This model extracts visual representations from item images and incorporates them into a Matrix Factorization (MF) model. It can alleviate cold-start problems and provides accurate recommendations using Bayesian Personalized Ranking (BPR).
UCAM [30]: This model predicts ratings by integrating context information into the NeuMF model. In this study, we use the RoBERTa and VGG-16 models to extract feature representations from review texts and images, and then integrate these representations into the model as context information.
4.4. Experimental Settings
To effectively compare the proposed CAMRec model with the baseline models, we trained our model on the training set, selected the optimal hyperparameter values on the validation set, and measured recommendation performance on the test set. To prevent overfitting, we applied early stopping when the validation loss did not decrease for five consecutive iterations, and all reported results are averaged over five runs. We conducted the experiments on a machine with 128 GB of RAM and an NVIDIA V100 GPU.
We tuned the batch size, learning rate, embedding size, and number of attention heads to find the optimal hyperparameters of the proposed CAMRec model. Specifically, we searched the following ranges: [64, 128, 256, 512] for the batch size, [0.001, 0.005, 0.0001, 0.0005] for the learning rate, [32, 64, 128, 256] for the user and item embedding size, and [2, 4, 6, 8, 10, 12] for the number of attention heads. For the Cell Phones and Accessories dataset, the optimal values were 256, 0.001, 128, and 6 for the batch size, learning rate, embedding size, and number of attention heads, respectively. For the Electronics dataset, the optimal values were 256, 0.005, 128, and 4. For the hyperparameters of the baseline models, we determined their optimal settings empirically by referring to the authors' original papers, and applied our own settings when specific hyperparameters were not reported. As for the pre-trained RoBERTa and VGG-16 models, we did not fine-tune them on the experimental datasets; the textual and visual features used in this study were extracted directly from these pre-trained models, leveraging their ability to capture general language and visual patterns.
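A minimal sketch of a hyperparameter search over these ranges with validation-based early stopping is shown below; the full grid enumeration and the callables build_model, train_one_epoch, and validate are hypothetical stand-ins for the actual training pipeline.

```python
import itertools

# Hyperparameter search spaces described above.
GRID = {
    "batch_size": [64, 128, 256, 512],
    "lr": [0.001, 0.005, 0.0001, 0.0005],
    "emb_size": [32, 64, 128, 256],
    "num_heads": [2, 4, 6, 8, 10, 12],
}

def grid_search(build_model, train_one_epoch, validate, patience=5):
    best_cfg, best_loss = None, float("inf")
    for values in itertools.product(*GRID.values()):
        cfg = dict(zip(GRID.keys(), values))
        model = build_model(**cfg)
        cfg_best, waited = float("inf"), 0
        # Early stopping: halt when the validation loss has not
        # decreased for `patience` consecutive iterations.
        while waited < patience:
            train_one_epoch(model, cfg)
            val_loss = validate(model)
            if val_loss < cfg_best:
                cfg_best, waited = val_loss, 0
            else:
                waited += 1
        if cfg_best < best_loss:
            best_cfg, best_loss = cfg, cfg_best
    return best_cfg, best_loss
```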
5. Experimental Results
5.1. Performance Comparison to Baseline Models (RQ 1)
We compare the recommendation performance of the proposed CAMRec model with that of the baseline models. The results in Table 2 indicate that CAMRec outperforms the baseline models on all datasets. Next, we provide insights from three different aspects.
First, the models that use ratings as the sole source of information (e.g., PMF and NeuMF) show the lowest recommendation performance among all baseline models. These results indicate that a recommendation method using only rating information is limited in capturing users' specific purchase motives. Nevertheless, since NeuMF leverages the nonlinearity of deep neural networks, it can model complex interactions between users and items that linear models like PMF cannot.
Second, the models that use online reviews (e.g., DeepCoNN, RSBM, VBPR, and UCAM) perform better than the models that use only rating information. These results confirm that online reviews improve recommendation performance because they contain richer information about users' preferences for specific items. Meanwhile, among the models using review text information, RSBM outperforms DeepCoNN. RSBM uses the target user's review text for an item, whereas DeepCoNN learns a user's latent representation from all review texts, overlooking the fact that the evaluation of an item varies from user to user. The results therefore suggest that learning from the target user's review text is more effective for personalized recommendations.
Third, multimodal models such as UCAM show better recommendation performance than models that use a single source of information. Multimodal models can minimize information loss by utilizing both texts and images as information sources. These results therefore suggest that using textual and visual information simultaneously minimizes information loss and enables accurate recommendations.
Finally, our proposed CAMRec model shows the best recommendation performance among all models, for the following reasons: (1) Unlike the PMF and NeuMF models, which use only rating information to capture the interactions between users and items, CAMRec enhances recommendation performance by using the diverse preference representations inherent in online reviews. (2) Unlike the DeepCoNN, RSBM, and VBPR models, which incorporate a single modality, CAMRec learns from both textual and visual representations. It can therefore provide recommendations that more accurately reflect diverse user preferences that a single modality may not capture. (3) Unlike the UCAM model, which simply concatenates image and text features, CAMRec applies a co-attention mechanism to model the complementarity between the two modalities. Its recommendation performance thus improves because the model captures the complementary and substitutive effects of texts and images.
5.2. Effect of Components of CAMRec (RQ 2)
The CAMRec model uses fused textual and visual features to produce more accurate recommendations. In this section, we conduct ablation studies to verify whether the fused features enhance recommendation performance. CAM-R is the variant that uses only rating information; CAM-RT uses rating information together with textual information from review texts; CAM-RI uses rating information together with visual information from review images.
As shown in Table 3, the variants using online reviews outperform the variant using only ratings on all datasets. This indicates that review-based representations can effectively improve recommendation performance, as online reviews contain more specific information about users' preferences and item features. Using online reviews is therefore crucial for modeling the preference features of users and items. Moreover, CAM-RI performs worse than CAM-RT. This result suggests that written language expresses users' specific opinions and evaluations more clearly than images do, thereby providing more accurate information about the item. Finally, the full CAMRec model shows better recommendation performance than all other variants. Fusing rating, textual, and visual information leads to enhanced performance, which reflects the complementarity of texts and images and enables the model to learn more comprehensive representations of users and items.
5.3. Effect of Fusion Strategy (RQ 3)
In this study, we apply the co-attention mechanism to fuse textual and visual features based on complementarity modeling. We therefore conducted an experiment to verify which fusion method is most effective for fusing these features. Specifically, we selected the addition, average, multiplication, and concatenation operations, which are widely used in previous studies, as comparison methods. Here, the multiplication operation is performed element-wise. The experimental results are shown in Table 4.
The experimental results confirm that the co-attention mechanism provides the best performance: it leverages attention to simultaneously learn the interactions between the different modalities and fuses the information by weighting the degree of influence each modality has on the other. In contrast, the other operations fuse the modalities while overlooking the complementary and substitutive effects between them, thereby limiting performance.
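As an illustration, the following PyTorch sketch contrasts the simple fusion operations with a co-attention-style fusion built from multi-head attention; the dimensions and layer choices are illustrative assumptions rather than the exact CAMRec implementation.

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Illustrative co-attention fusion: each modality attends to the
    other, so fused features are weighted by cross-modal relevance."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.text_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_feat, img_feat):
        # text_feat: (batch, n_tokens, dim); img_feat: (batch, n_regions, dim)
        # Text queries attend over image regions, and vice versa.
        text_attended, _ = self.text_to_img(text_feat, img_feat, img_feat)
        img_attended, _ = self.img_to_text(img_feat, text_feat, text_feat)
        # Pool each attended sequence and join the two views.
        return torch.cat([text_attended.mean(dim=1),
                          img_attended.mean(dim=1)], dim=-1)

# The simple fusion baselines from Table 4, on pooled feature vectors:
def fuse_simple(t, v, mode):
    return {"add": t + v,
            "average": (t + v) / 2,
            "multiply": t * v,  # element-wise product
            "concat": torch.cat([t, v], dim=-1)}[mode]
```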
6. Conclusions and Future Work
To address the data sparsity issue in recommender systems, previous studies have extracted various features from online reviews and integrated them into their models. Since online reviews contain rich and specific information on user preferences and the motivations behind them, they enhance recommendation performance. Although online reviews contain multimodal information, including texts and images, many studies focus solely on textual information and thus overlook the informational value of images. Given that each modality provides different information, previous studies have been limited in considering the complementary or substitutive effects between modalities. To address this research gap, this study proposes a novel recommender system with a co-attention mechanism that incorporates the complementarity between texts and images. The model was evaluated on two datasets from Amazon, and the proposed CAMRec model outperformed the baseline models. This implies that the complementary fusion of texts and images contributes significantly to enhancing recommendation performance. Moreover, the CAMRec model is designed to be scalable and can efficiently process large-scale datasets, making it well-suited for real-world applications, particularly in environments where data volumes continuously increase. Therefore, this study provides a new perspective for the field of recommender systems and demonstrates that the proposed model enhances recommendation performance while remaining adaptable to large-scale implementations.
The theoretical implications of this study are as follows. First, this study fuses textual and visual features through co-attention, an approach that has not been fully explored in previous recommender system studies. It therefore provides a new direction and extends the scope of recommender system research. Second, this study offers significant insights into how visual information from images reflects users' preferences in multimodal settings and into identifying an image processing method that maximizes recommendation performance. Third, this study utilizes a co-attention mechanism to model the complex interactions between texts and images. It thereby lays a theoretical foundation for the exploitation of multimodal data, contributes to the fields of multimodal learning and deep learning, and opens new possibilities for effectively fusing diverse data sources.
The practical implications of this study are as follows. First, since the proposed CAMRec model can provide more sophisticated and personalized recommendations, it can significantly improve the user experience, contributing to greater customer satisfaction and user retention on e-commerce platforms. Second, the proposed CAMRec model is designed to be domain-agnostic, making it highly scalable and adaptable across various domains. Although this study uses datasets from Amazon, which is common in recommender system studies, the model's architecture is flexible enough to be applied to other domains such as fashion, entertainment, or social media platforms. Since each of these domains involves multimodal data (e.g., texts and images), the CAMRec model can process their data through its co-attention mechanism. Moreover, the scalability of the model supports large-scale datasets and growing operational demands, making it well-suited for providing efficient recommendations in contexts beyond e-commerce. Finally, the feature composition of the proposed CAMRec model is model-agnostic, which allows for flexible implementation. Businesses can therefore apply it to their existing hardware or software environments without major overhauls. Such flexibility enables companies to overcome various technical constraints while optimizing the model for their specific operational needs, and the insights gained from this study can further enhance customer experience and business performance.
Although the effectiveness of this study has been demonstrated from various perspectives, some limitations remain. First, this study uses the RoBERTa and VGG-16 models to extract textual and visual information from online reviews. Other models have also been used for feature extraction in previous studies. Future work can therefore compare different extraction models to verify whether the choice of extraction model affects recommendation performance. Second, the experimental datasets used in this study were collected from Amazon, which limits the evaluation to the e-commerce domain. Since the proposed CAMRec model is designed to be domain-agnostic, its effectiveness and generalizability in other domains need to be verified. Domains beyond e-commerce involve different types of multimodal data and user behaviors, so future work can explore the applicability of the CAMRec model across diverse domains. Third, this study approaches the recommendation task using multimodal data consisting of review texts and images. Other auxiliary information, such as item titles and descriptions, has also been widely used in previous studies and affects recommendation performance. Future work can therefore incorporate additional auxiliary information to verify whether it improves recommendation performance. Fourth, this study applies co-attention to introduce the complementarity between texts and images into the recommendations. For the fusion strategy, some studies have proposed other advanced techniques, such as cross-attention mechanisms, gated multimodal units, and multimodal factorized bilinear pooling. Future work can therefore compare these advanced fusion techniques to verify the effectiveness of the applied co-attention mechanism. Finally, although this study demonstrates the potential of multimodal data for improving recommendation performance, the cold-start issues commonly present in recommender systems have not been specifically addressed. Since this study leverages rich multimodal data, we believe it can mitigate cold-start issues through detailed descriptive content for items. Future work can therefore analyze whether the proposed CAMRec model better handles cold-start scenarios.