Article

Cross-Modal Manifold Propagation for Image Recommendation

1 Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
2 School of Water Conservancy and Environment, University of Jinan, Jinan 250022, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2022, 12(6), 3180; https://doi.org/10.3390/app12063180
Submission received: 18 January 2022 / Revised: 18 March 2022 / Accepted: 19 March 2022 / Published: 21 March 2022

Abstract

The growing user intention gap and information overload are obstacles that keep users from the content they desire. User interactions and the involved content provide rich evidence of users' interests, so investigating interaction characteristics over the user interest and information distributions helps alleviate information overload in personalized recommendation. This work therefore explores user interests with interactions and visual information from users' historical records for image recommendation. The paper introduces cross-modal manifold propagation (CMP) for personalized image recommendation. CMP investigates the trend of user preferences by propagating users' historical records along the users' interest distribution, which produces interest-aware image candidates for each user. CMP simultaneously leverages the visual distribution to spread users' visual records over the dense semantic visual manifold. Visual manifold propagation estimates detailed semantic-level user-image correlations for ranking candidate images in recommendations. In the proposed CMP, the user interest manifold and the images' visual manifold compensate each other in propagating users' records to predict user interactions. Experimental results illustrate the effectiveness of the collaborative user-image propagation of CMP for personalized image recommendation, with performance improving by more than 20% over existing baselines.

1. Introduction

The recent ubiquitous impact of social platforms and related techniques has greatly accelerated the creation and spread of multimedia information [1]. The social multimedia environment provides rich interactions for users to generate and enrich content, and even collaborate. As reported in November 2021 (https://www.visualcapitalist.com/from-amazon-to-zoom-what-happens-in-an-internet-minute-in-2021/, accessed on 15 January 2022), during an Internet minute, the Google engine processes 5.7 million searches, Facebook receives 44.0 million views, TikTok serves 167.0 million videos, and 12.0 million emails are sent. Such a variety of social platforms ushers in an evolution in which people communicate with each other by sharing information and following the activities and comments of others. This inevitably produces a wide range of overloaded multimedia content, which lays tremendous obstacles to quality of service, especially response speed. To this end, personalized recommendation is promising for guaranteeing effective and efficient access to users' desired multimedia content. However, users' diverse preferences cannot be easily uncovered due to the extremely sparse user-image correspondence in current social environments [2]. Figure 1 takes content from Huaban.com to roughly illustrate the relationship between the user space and the visual content space with intention and semantic gaps. Since both users and visual content play critical roles in the modern social environment, it is crucial to understand and bridge the user intention and multimedia semantic gaps for personalized recommendation.
Personalized image recommendation is intrinsically modeled as a cross-modal matching problem between users and images [1]. Recent works [3,4] successfully modeled user interests with graph-based learning due to the inherently heterogeneous pairwise interactions. Their advance inspired further investigation of the heterogeneous interaction graph by manifold propagation [5] to learn users' interests, since manifold learning actively captures the global graph structure. Contextual structures are involved in compensating collaborative signals to mine cross-modal correlation in personalized recommendation [2,6,7]. We thus propose cross-modal manifold propagation (CMP) to match user-image pairs by correlation estimation. Figure 2 illustrates the framework of CMP, which performs cross-modal learning collaboratively on both the user interest and visual distributions. As Figure 2 shows, CMP takes the user interest distribution (blue) to spread characteristics of users' image records over the users' interest manifold, which induces a set of interest-aware image candidates for each user. On the other hand, the visual recommendation module (green) is designed to spread users' image records along the visual semantic manifold of images and produce a detailed visual estimation as dense user-image correlation. An interest-aware semantic fusion strategy is further introduced to incorporate the detailed visual-aware user-image correlation with the interest-aware image candidates, rank the personalized candidates, and recommend high-score images to the user. The main contributions of this work are as follows.
  • We designed user interest-oriented semantic image ranking for cross-modal collaborative recommendation by investigating the distributions of both user interest and visual semantics.
  • The proposed CMP reveals the trend of interests for users and estimates interest-aware user-image correlation by spreading users’ image records on users’ interest manifold.
  • CMP leverages visual manifold modularization to reduce the computational burden of visual manifold propagation and to promote the estimation of semantic visual-aware user-image scores for recommendation.
Experiments were performed on the public Huaban dataset, achieving 45.0% precision for top-1 recommendation. Experimental results with detailed analysis verified the collaborative recommendation ability of the proposed CMP. A preliminary version of this work [8] was presented at the ACM SIGMM International Conference on Multimedia Retrieval, 2019.
The rest of this paper is organized as follows. Related works are reviewed in Section 2. Details of the proposed CMP method are introduced in Section 3. Section 4 outlines the conducted experiments and provides corresponding analysis. Section 5 discusses and concludes this work.

2. Related Work

This section reviews and discusses works relevant to the proposed CMP method on collaborative filtering, visual recommendation, and multimodal collaborative recommendation.

2.1. Collaborative Filtering and Visual Recommendation

Many research efforts have been put into personalized content mining to match relevant multimedia content to users' personalized requirements. Representative user-image correlation estimation methods are built on collaborative filtering (CF) [3,9,10] or content-based (CB) learning [11,12,13]. CF recommends content consumed by other users to a specific user on the basis of common interactions among them; its quality is substantially influenced by the available historical records. NBI [14] measures pairwise correlation relying on resource-allocation dynamics, in which a bipartite representation extracts the intrinsic relationships of the network for personalized recommendation. Wu et al. [9] proposed a generalized flexible autoencoder in view of denoising to embed users and items for recommendations. Zhang et al. [15] explored the embedding efficiency of CF models and proposed discrete collaborative filtering to maintain the inherent user-image correlation with a Hamming similarity-based constraint. It produces compact binary embeddings under balanced and uncorrelated coding constraints, resulting in an apparent improvement in recommendation performance. To address cold starts and data sparsity, cross-network collaboration [16] runs an auxiliary network in parallel and integrates user embeddings propagated from it to promote collaborative recommendation performance. Chen et al. [10] introduced a hierarchical attention mechanism into modeling users' interests. In a tower structure, the component-level attention module selects informative components, and the item-level attention module learns to score item preferences. The attention provides the detailed contribution of each component in modeling users' interests, which effectively enhances recommendation performance. GCN [17] propagates users' interests with graph convolution on a heterogeneous interaction graph. NeuMF [1] proposes a nonlinear interaction mapping on the pairwise latent embeddings of users and items for personalized recommendation. Xue et al. [3] investigated nonlinear and higher-order relationships among items with nonlinear neural networks to capture the complicated effects of users in decision making for recommendations. NGCF [4] explicitly encoded high-order interactions in embeddings to promote the representative capability on users' interests. The progressively improved recommendation performance of these CF models verifies the effectiveness of graph-based propagation in modeling users' interests. However, graph-based propagation is still subject to sparse interactions, and auxiliary information is highly required to bridge the information gap for propagation.
By considering visual information, content-based recommendation [11] aims to recommend items to users on the basis of matching between item descriptions and user interest profiles. Cantador et al. [12] measured content similarity with tag frequencies to jointly build user and image profiles for recommendation. A LASSO regressor was employed in [13] to emphasize discriminative content in each user's preference for matching users and visual content. Li et al. [18] modeled image collections using group sparse reconstruction and measured their dynamic similarity to users' historical records for image collection recommendation. You et al. [19] proposed inferring users' interests from posted images by analyzing their visual content and aggregating them together. Hong et al. [20] narrowed the semantic gap by producing a joint semantic space of visual attributes and image indexing; they integrated the feature space with spectral hashing, resulting in improved image matching performance. A hybrid model [21] implemented item-based collaborative filtering involving latent semantic embeddings, which simultaneously guaranteed the relevance of recommended content to users' interests. MMR [22,23] employs the dense visual distribution to propagate and estimate users' interests for personalized recommendation, which provides rich paths for graph propagation; its performance verified the active role of visual signals in mining users' interests. Nevertheless, user-image interaction estimation is limited by single-modal collaborative filtering or visual methods due to the sparse interactions between users and images/items.

2.2. Multimodal Collaborative Recommendation

The current multimedia explosion has inspired multimodal collaborative learning, in which modalities compensate each other in bridging the gaps between users' preferences and multimedia information for personalized recommendation. Mei et al. [24] integrated multimodal content relevance and user feedback by performing relevance feedback and attention fusion in the recommendation system. Forestiero [25] proposed a heuristic recommender system featuring multiagent swarm clustering. To address dynamic and unreliable environments, smart objects are associated with mobile agents, and a global agent organization autonomously performs agent discrimination to easily select smart objects; recommendation performance improved by about 50% over existing models. Comito et al. [26] built an online model to filter relevant topics of interest from huge and complex social media environments. They employed both the textual content and latent representations of the words in a text to cluster short texts, which increases the chance of observing a small corpus and showed significant performance improvement. Yuan et al. [27] investigated multimodal collaborative feature learning of images, texts, and videos with a unified deep architecture. In multimodal learning, the deep Boltzmann machine acts to generate fused representations by combining cross-modal features [28]. Geng et al. [6] investigated a unified feature-learning model to transform the heterogeneous user and image spaces into a unified feature space to facilitate image recommendation. In cross-network learning, boosted multifeature learning [29] modeled the item weight distribution by classification error and domain similarity to bridge the cross-domain gap. Yang et al. [30] further constructed a modal correlation constraint with a cross-domain constraint in an autoencoder to maintain consistency in multimodal feature learning. Cross-modal feature matching in [31] learnt the matrices of different modalities jointly with a prior of local groups for image retrieval. User records, visual content, and tags were involved in building a tripartite graph in [32] to estimate cross-modal user-image correlation in personalized recommendations. Sejal et al. [33] combined text and visual features to fill the semantic gap, employing text-based retrieval and pairwise visual similarity for image recommendation. Harakawa et al. [34] proposed field-aware factorization to investigate emotional factors and model users' interests for recommendation by collaboratively fusing multimodal factors. Bai et al. [35] proposed a two-layer tensor model of joint interaction and context operation. Tang et al. [36] conducted cross-modal learning with tensor completion by anchor graph regularization. Auxiliary information is embedded to expand users and items as two-dimensional locations in a decision tree for inferring dense correlation [37]. VNPR [7] extracts users' visual interests to aid latent ID embedding for joint matching between users and items. HASC [2] involved social contextual signals in a hierarchical interest propagation to infer users' interests. These achievements illustrate that modeling users' diverse and complex interests is still far from complete, even when fed with a wide range of auxiliary information. Most of these multimodal learning methods emphasize embedding learning with multiple sources of auxiliary information rather than interaction estimation. Interaction estimation, as the core of personalized recommendation, remains to be explored.
Therefore, this work examines the compensation between interactions and visual content, and implements interaction estimation from a multimodal perspective of the user interest and visual distributions.

3. Cross-Modal Manifold Propagation

Due to the extremely sparse interactions between users and images in current social networks, single-modal learning is inadequate to uncover users' interests in the face of dynamic, diverse interests and information overload. Multimodal collaborative filtering for personalized recommendation thus becomes a promising direction for the sustainable advance of social networks. As illustrated in Figure 1, visual images and individual user preferences play dominant roles in the social environment; however, they belong to two independent spaces with different characteristics. This work investigates the collaborative effects of user interest and visual semantics to infer cross-modal interactions between users and images and facilitate personalized image recommendations.
Cross-modal manifold propagation (CMP) is proposed as illustrated in Figure 2, with user interest and visual semantic modules jointly modeling user-image interactions for personalized recommendation. Manifold propagation can successfully capture global discriminative structures hidden in data [5]. In terms of manifold construction, several elaborately designed deep graph models, such as [38], build detailed structures for recommendation. Despite the excellent performance of deep graph construction, it does not suit investigating the collaborative nature of both users' interests and visual semantics. For simplicity, CMP employs pairwise common interests in the user interest module and the visual distribution in the visual module to form the weighted graphs of their manifolds. In this section, the details of the proposed CMP are introduced as user interest manifold propagation (UMP) in Section 3.1, visual semantic manifold propagation (VMP) in Section 3.2, and cross-modal collaborative image ranking for recommendation in Section 3.3.

3.1. Interest Manifold Propagation

Users' historical records intrinsically reflect their preferences for multimedia information, which inspired us to investigate users' personalized preferences with their historical records. However, considering the volume of multimedia content, user-image records are too incomplete to uncover users' entire interests, let alone to learn interest representations for users. To leverage the precious historical records, the proposed CMP evaluates common interests among users and constructs a manifold of the user interest distribution. The user interest manifold takes a user interest graph G^u = (U, W^u), with W^u = [W_{ij}^u] representing the common interest degree among users U = [u_1, ..., u_C] as in Equation (1), i, j = 1, 2, ..., C.
W_{ij}^{u} = \frac{|S_i \cap S_j|}{\sqrt{|S_i|\,|S_j|}}
where S_i/S_j denotes the set of images with which user u_i/u_j interacts, |S_i|/|S_j| is the size of S_i/S_j, |S_i ∩ S_j| measures the number of images that users u_i and u_j share in common, ∩ is the intersection operation, and i, j = 1, 2, ..., C. Equation (1) evaluates the proportion of images shared by users u_i and u_j to represent their common interest: the more images the two users share, the higher the common interest degree W_{ij}^u between them.
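For concreteness, a minimal sketch of this graph construction from a binary interaction matrix is given below (NumPy; the variable names and the square-root normalization follow the reconstruction of Equation (1) above and are assumptions rather than the authors' released code).

```python
import numpy as np

def user_interest_graph(Y):
    """Common-interest matrix W^u from a binary interaction matrix (Equation (1)).

    Y: (n_images, n_users) array with Y[i, j] = 1 if user j interacted with image i.
    Returns W^u of shape (n_users, n_users).
    """
    Y = (Y > 0).astype(float)
    shared = Y.T @ Y                         # |S_i ∩ S_j|: images shared by users i and j
    sizes = np.diag(shared)                  # |S_i|: number of images of each user
    denom = np.sqrt(np.outer(sizes, sizes))  # assumed normalization sqrt(|S_i| |S_j|)
    denom[denom == 0] = 1.0                  # guard users with no records
    return shared / denom

# Toy usage: 5 images, 3 users.
Y = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 0],
              [1, 0, 1],
              [0, 0, 1]], dtype=float)
print(np.round(user_interest_graph(Y), 3))
```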
On the interest manifold, as shown in the blue module in Figure 2, characteristics of users' image records can be spread along the user interest distribution. The interest-based recommendation module uncovers the interest-aware cross-modal correlation F^u from the view of user interests. Suppose that V = [v_1, ..., v_n] represents the image set and Y = [y_{ij}]_{n×C} contains the corresponding interactions from users' historical records, where y_{ij} = 1 records an existing interaction between image v_i and user u_j and y_{ij} = 0 otherwise, for i = 1, 2, ..., n, j = 1, 2, ..., C. CMP employs the following constrained interest smoothness operation to perform interest manifold propagation and infer the personalized interest degree F^u for images V as
F^{u*} = \arg\min_{F^{u}} \; (1-\alpha)\,\|F^{u} - Y\|_F^{2} + \alpha\,\mathrm{tr}\big(\langle F^{u}, L^{u} F^{u}\rangle\big)
where F^u = [f_{ij}^u]_{n×C} represents the estimated correlation f_{ij}^u of user u_j to image v_i, with the superscript u denoting that the user interest manifold is used in propagating users' records, image index i = 1, 2, ..., n, and user index j = 1, 2, ..., C. ||·||_F denotes the Frobenius norm, and α is a trade-off parameter balancing global interest smoothness against the distinguishing interests of users. The user graph Laplacian L^u = I − D^{−1/2} W^u D^{−1/2} is derived from the common interest degree matrix W^u = [W_{ij}^u]_{C×C}, the identity matrix I, and the user degree matrix D diagonalized by the column sums of W^u; tr(·) denotes the trace operator. A higher f_{ij}^u implies a higher possibility that user u_j would prefer image v_i.
Furthermore, an analytical solution of interest manifold propagation was derived as Equation (3), representing estimated interest-aware user-image correlation.
F^{u*} = \big((1-\alpha) I + \alpha L^{u}\big)^{-1} Y
The user interest manifold propagation module is referred to as UMP (blue) in the collaborative learning framework of Figure 2. The user interest-based recommendation module makes each user aware of others' interests in visual images, i.e., of the trend of general visual preferences. Considering the incompleteness of the interest manifold in the UMP module, the user-image correlation in Equation (3) reflects only users' interest tendency, not a detailed interest degree. Therefore, the estimated user-image correlation in UMP is treated as preliminary filtering of users' personalized interests and is employed to select a user-specific candidate set for recommendation.
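The closed form in Equation (3) amounts to solving a regularized linear system; a minimal sketch shared by both modules could look as follows (NumPy, dense matrices, illustrative names). The first dimension of Y is assumed to align with the nodes of the graph W, so users' records would be transposed accordingly when propagating on the user graph.

```python
import numpy as np

def manifold_propagate(W, Y, alpha=0.5):
    """Propagate records Y along the manifold defined by affinity matrix W (Equation (3)).

    Solves ((1 - alpha) I + alpha L) F = Y with the normalized Laplacian
    L = I - D^{-1/2} W D^{-1/2}.
    W: (m, m) symmetric affinity matrix; Y: (m, C) record matrix aligned with W's nodes.
    """
    d = W.sum(axis=1)
    d[d == 0] = 1.0                               # guard isolated nodes
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt
    A = (1.0 - alpha) * np.eye(W.shape[0]) + alpha * L
    return np.linalg.solve(A, Y)                  # F*: propagated correlation scores
```

For UMP, the user graph would be applied to the transposed records, e.g., `manifold_propagate(W_u, Y.T, alpha).T`, so that propagation runs over the user dimension.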

3.2. Visual Manifold Propagation

Informative visual content in social multimedia environments brings a massive amount of visual information at low cost. This inspired us to take the informative visual semantics of images as a supplement to the sparse user-image records. The green module in Figure 2 shows the constructed visual manifold propagation (VMP) used to estimate semantically detailed cross-modal correlation. VMP transfers characteristics of users' visual records along the visual semantic distribution and produces a semantic-aware user-image correlation estimation from the view of visual semantics. Visual distribution-oriented propagation greatly benefits from images' dense semantic visual correlation, which provides a more informative and detailed correlation than the sparse user-image correlation in UMP.
Since deep models such as AlexNet [39] have achieved great success in discriminative feature learning, high-level semantic visual features can be robustly extracted instead of mining explicit semantics by classification or clustering. To analyze visual semantics, CMP directly adopts AlexNet [39] to capture visual semantic features and construct the visual manifold distribution. The visual recommendation module in Figure 2 builds the visual manifold as a visual adjacency graph G^v = (V, W^v) on the image set V, where W^v = [W_{ij}^v] contains the pairwise visual similarity between images v_i and v_j measured by the cosine of their semantic features.
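As a rough sketch of this step (assuming precomputed deep features, e.g., activations from a pretrained CNN; the optional kNN sparsification is an assumption, not stated in the paper), the cosine graph W^v could be built as follows.

```python
import numpy as np

def visual_graph(features, k=None):
    """Cosine-similarity graph over image features.

    features: (n_images, d) array of semantic features (e.g., CNN activations).
    k: optionally keep only the k most similar neighbors per image to sparsify the graph.
    """
    X = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    W = X @ X.T                      # pairwise cosine similarity
    np.fill_diagonal(W, 0.0)         # no self-loops
    if k is not None:                # optional kNN sparsification (an assumption)
        thresh = -np.sort(-W, axis=1)[:, k - 1:k]
        W = np.where(W >= thresh, W, 0.0)
        W = np.maximum(W, W.T)       # keep the graph symmetric
    return W
```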
Unlike the data incompleteness in the interest-based recommendation module, visual correlations come in massive volumes and tend to be more complicated. This leads the visual manifold to span a wide range of visual semantics. In the case of such complex distributions, Gong et al. [5] found that manifold propagation would result in biased structure learning with high probability. A predefined learning strategy is expected to simplify and guide the learning procedure without bias from complex data correlation. Therefore, the visual module is designed to learn user-image correlation in a decomposed manner. Since semantically similar content is naturally distributed in relatively compact regions, the proposed CMP employs the semantic-compact distribution of the visual manifold to break down the global manifold into several submanifolds. This operation is called manifold modularization. It relieves the semantic complexity of visual manifold propagation and alleviates the computational burden on the global visual manifold. The modularity Q of the visual manifold is measured by Equation (4).
Q = \frac{1}{2m}\sum_{i,j=1}^{n}\left( W_{ij}^{v} - \frac{k_i k_j}{2m}\right)\delta(o_i, o_j) = \sum_{a\in A}\left[\frac{a_{in}}{2m} - \left(\frac{a_{tot}}{2m}\right)^{2}\right]
where W_{ij}^v indicates the semantic visual correlation between images v_i and v_j in W^v, m = (1/2) Σ_{i,j} W_{ij}^v normalizes the weights in W^v, k_i = Σ_j W_{ij}^v is the degree of v_i in graph G^v, and o_i indexes the submanifold of image v_i. δ(o_i, o_j) indicates whether v_i and v_j belong to the same submanifold, with δ(o_i, o_j) = 1 if o_i = o_j and δ(o_i, o_j) = 0 otherwise. A is the set of submanifolds decomposed by modularity, a_{in} is the sum of the weights inside a specific submanifold a, and a_{tot} is the sum of the weights of the edges incident to submanifold a, for a ∈ A, i, j = 1, 2, ..., n.
At the beginning of manifold modularization, the images in V = [v_1, ..., v_n] are individually treated as single submanifolds to initialize the modularized manifold. The modularity changes when an image v_i is allotted to one of its neighboring submanifolds; removing an image v_i from a submanifold or inserting it into another both change the global modularity. The gain of modularity ΔQ obtained by allotting an image v_i into a specific submanifold is calculated by Equation (5).
\Delta Q = \left[\frac{a_{in} + k_{i,in}}{2m} - \left(\frac{a_{tot} + k_i}{2m}\right)^{2}\right] - \left[\frac{a_{in}}{2m} - \left(\frac{a_{tot}}{2m}\right)^{2} - \left(\frac{k_i}{2m}\right)^{2}\right] = \frac{1}{2m}\left(k_{i,in} - \frac{a_{tot}\, k_i}{m}\right)
where k_{i,in} denotes the sum of the weights between image v_i and the images in its neighboring submanifold a. The image is then allotted to the submanifold a with the maximal gain of modularity ΔQ: if max ΔQ > 0, image v_i is allotted to the corresponding submanifold a; otherwise, without a positive gain (max ΔQ ≤ 0), image v_i stays in its original submanifold. This is repeated until the submanifold to which each image belongs no longer changes. Lastly, with a stable maximized modularity Q, the global visual manifold is separated into several semantic submanifolds A containing locally compact semantic distributions. The visual manifold modularization is summarized in Algorithm 1. It provides relatively dominant semantic distributions for the subsequent manifold propagation to infer pairwise user-image interactions.
Algorithm 1 Visual manifold modularization.
Input: V = [v_1, ..., v_n]—visual image database.
Output: Modularized visual manifold.
1: Treat each image in V = [v_1, ..., v_n] individually as a single submanifold.
2: For each image v_i, calculate the change in modularity ΔQ with Equation (5) for assigning v_i to each of its neighboring submanifolds.
3: If max ΔQ > 0, assign image v_i to the submanifold with the maximal gain of modularity ΔQ.
4: Repeat steps 2 and 3 until a stable modularity is achieved; the global manifold is then modularized into several submanifolds.
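A simplified sketch of this modularization (a single level of Louvain-style local moving, without the coarsening phase, written for clarity rather than efficiency) is given below; it follows Equation (5) and Algorithm 1 under the stated assumptions.

```python
import numpy as np

def modularize(W, max_iter=20):
    """Greedy modularity-based partition of a weighted graph W (Algorithm 1, simplified).

    Each image starts in its own submanifold and is repeatedly moved to the
    neighboring submanifold with the largest positive modularity gain (Equation (5)).
    Returns an array of submanifold labels, one per node.
    """
    n = W.shape[0]
    labels = np.arange(n)                  # every image is initially its own submanifold
    k = W.sum(axis=1)                      # node degrees k_i
    m = W.sum() / 2.0                      # total edge weight m
    if m == 0:
        return labels                      # empty graph: nothing to modularize
    for _ in range(max_iter):
        moved = False
        for i in range(n):
            current = labels[i]
            neighbors = np.nonzero(W[i])[0]
            best_gain, best_label = 0.0, current
            for c in set(labels[neighbors]) - {current}:
                members = labels == c
                k_i_in = W[i, members].sum()        # weight between i and submanifold c
                a_tot = k[members].sum()            # total degree of submanifold c
                gain = (k_i_in - a_tot * k[i] / m) / (2.0 * m)   # Equation (5), simplified
                if gain > best_gain:
                    best_gain, best_label = gain, c
            if best_label != current:
                labels[i] = best_label
                moved = True
        if not moved:
            break
    return labels
```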
With manifold modularization, the proposed CMP method conducts manifold learning to individually propagate users' records along each visual submanifold a ∈ A by optimizing Equation (6).
F^{v*} = \bigcup_{a \in A} \arg\min_{F^{v}} \; (1-\alpha)\,\|F^{v} - Y\|_F^{2} + \alpha\,\mathrm{tr}\big(\langle F^{v}, L^{a} F^{v}\rangle\big)
where ∪_{a∈A} is the union operation over the submanifolds a ∈ A, and F^v = [f_{ij}^v]_{n×C} is gathered over all semantic submanifolds a ∈ A, denoting the estimated semantic correlation degree f_{ij}^v of image v_i to user u_j propagated through the visual semantic manifold. α acts as a trade-off parameter between the global smoothness over the visual semantic distribution and the discriminative features of images. L^a = I − D^{−1/2} W^a D^{−1/2} is the graph Laplacian of W^a = [w_{ij}^a]_{p×p}, with identity matrix I and degree matrix D built from the column sums of W^a; tr(·) calculates the matrix trace, and p is the number of images in submanifold a. A higher f_{ij}^v implies more semantic correlation of image v_i to the interest of user u_j.
An analytical solution of semantic visual manifold propagation can be derived from Equation (6) for user-image correlation estimation as
F^{v*} = \bigcup_{a \in A} \big((1-\alpha) I + \alpha L^{a}\big)^{-1} Y
The visual-based recommendation module produces the semantic-aware user-image correlation F^v to infer users' potential preference for images from the view of the global visual distribution.
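Reusing the routines sketched earlier (`manifold_propagate` from Section 3.1 and `modularize` from Algorithm 1; both are assumptions of this sketch rather than released code), the modularized propagation of Equation (7) can be assembled roughly as follows.

```python
import numpy as np

def visual_propagate(W_v, Y, alpha=0.5):
    """Propagate user records along each visual submanifold (Equation (7)).

    W_v: (n_images, n_images) visual affinity matrix; Y: (n_images, n_users) records.
    Returns F^v of shape (n_images, n_users), assembled over all submanifolds.
    """
    labels = modularize(W_v)                         # Algorithm 1
    F_v = np.zeros_like(Y, dtype=float)
    for a in np.unique(labels):
        idx = np.nonzero(labels == a)[0]
        W_a = W_v[np.ix_(idx, idx)]                  # graph restricted to submanifold a
        F_v[idx] = manifold_propagate(W_a, Y[idx], alpha)
    return F_v
```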

3.3. Cross-Modal Collaborative Ranking

As described in Section 3.1, the interest-based recommendation module selects interest-aware image candidates from the view of user interest, where local details tend to be coarse and attributes of the content users are interested in are ignored. The incompleteness of user-image correspondence in UMP worsens this. In contrast, VMP focuses heavily on local semantic distribution; it is designed to prefer images that are as similar as possible to a user's records, which may limit the interest scope of recommendation to a semantic subspace. As shown in the framework of Figure 2, the proposed CMP collaboratively fuses the cross-modal estimations of the UMP and VMP modules to incorporate their advantages in both the global interest and local visual semantic views. The final dense user-image correlation is estimated by jointly optimizing UMP and VMP as Equation (8).
\begin{aligned}
F^{u*} &= \arg\min_{F^{u}} \; (1-\alpha)\,\|F^{u} - Y\|_F^{2} + \alpha\,\mathrm{tr}\big(\langle F^{u}, L^{u} F^{u}\rangle\big) \\
F^{v*} &= \bigcup_{a \in A} \arg\min_{F^{v}} \; (1-\alpha)\,\|F^{v} - Y\|_F^{2} + \alpha\,\mathrm{tr}\big(\langle F^{v}, L^{a} F^{v}\rangle\big)
\end{aligned}
where the first terms focus on respecting the historical records Y, and the second terms propagate the records on the user interest manifold and the visual semantic manifold, respectively, for user-image correlation estimation.
Fusion discussion: With the user-image correlations F^u and F^v from the user interest and visual semantic views, CMP introduces a fusion to estimate the final user-image correlation as follows:
F = F^{u} \odot F^{v}
where F^u and F^v are the estimated user-image correlations propagated along the user interest manifold by UMP and the visual semantic manifold by VMP, respectively, and ⊙ can be any commonly used fusion operator, such as min, max, or average pooling. Considering the merits of the UMP and VMP modules, CMP designs an interest-aware semantic fusion to derive the final user-image correlation for recommendation. On the one hand, an interest-aware image candidate set is selected by ranking F^u to match each user's interests. On the other hand, the detailed semantic user-image correlations produced by VMP rank these candidates, conducting a multi-view fusion of user interest and visual semantics. Lastly, CMP recommends images with high correlation F as customized visual content for a specific user. In CMP, user interest manifold propagation plays the role of selecting user-interest image candidates, while visual manifold propagation estimates detailed user-image semantic scores for the candidates. This endows the recommended images with the characteristics of both the user and visual spaces while following users' individual preferences. Algorithm 2 summarizes the proposed cross-modal collaborative manifold propagation for image recommendation, and a sketch of the fusion step is given after the algorithm.
Algorithm 2 Cross-modal collaborative manifold propagation.
Input: V = [v_1, ..., v_n]—visual image database; U = [u_1, ..., u_C]—set of users; Y = [y_{ij}]_{n×C}—historical records between V and U.
Output: Image list for personalized recommendation.
1: Construct the interest manifold of users and the visual manifold of images with graphs G^u and G^v by measuring common interests between users and semantic visual correlation of images, respectively.
2: Propagate users' historical image records along the user interest manifold to estimate interest-aware user-image correlations F^u by Equation (3).
3: Generate an interest-aware image candidate set for each user on the basis of UMP.
4: Decompose the semantic visual manifold into semantically compact submanifolds by measuring the modularity gain as in Equation (5).
5: Propagate users' historical image records along the modularized semantic visual submanifolds by Equation (7) for semantic-aware user-image correlations F^v.
6: Fuse F^u and F^v with Equation (9), and rank the candidate images of each user according to their semantic-aware user-image correlations for personalized recommendation.
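Complementing Algorithm 2, a minimal sketch of the interest-aware semantic fusion (steps 3 and 6: candidate selection by F^u, reranking by F^v) could look as follows; the candidate size K and the exclusion of already-seen images are assumptions consistent with Section 4.3.

```python
import numpy as np

def recommend(F_u, F_v, records, K=1500, top_n=10):
    """Interest-aware semantic fusion for ranking (Equation (9), Algorithm 2 steps 3 and 6).

    F_u, F_v: (n_images, n_users) correlations from UMP and VMP.
    records: (n_images, n_users) binary matrix of known interactions to exclude.
    Returns, for each user, the indices of the top_n recommended images.
    """
    n_images, n_users = F_u.shape
    recommendations = []
    for j in range(n_users):
        scores_u = np.where(records[:, j] > 0, -np.inf, F_u[:, j])    # drop seen images
        candidates = np.argsort(-scores_u)[:K]                        # interest-aware candidates (UMP)
        reranked = candidates[np.argsort(-F_v[candidates, j])]        # rerank by VMP scores
        recommendations.append(reranked[:top_n].tolist())
    return recommendations
```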

4. Experimental Analysis

This section presents experiments and the corresponding analysis to show the capability of the proposed CMP in fusing the advantages of both user interest and visual semantics for image recommendation. A social database was collected from the social image sharing platform Huaban (http://huaban.com/, accessed on 15 January 2022), which contains actual records of user-image correspondence composed of 33,926 images and 4737 users with 1,610,984 feedback instances at a sparsity of 1.0%. User-image interactions were randomly split in half for training and half for testing. In the image recommendation scenario, manifold propagation in the proposed CMP builds graphs with all users and images, on which only the given training interactions are available to learn the pairwise user-image correlations F^u and F^v for recommendation. The following experiments were conducted to evaluate the effectiveness of CMP.
  • As a primary component in CMP, the manifold construction of user interests and visual semantics is explored first to investigate their complementary role in image recommendation.
  • On collaborative fusion, experiments investigate multiple fusion rules compared with the introduced interest-aware semantic fusion to illustrate its merit.
  • The performance of CMP is compared with that of single-modal UMP and VMP to illustrate the cross-modal collaborative ability of user manifold propagation and visual manifold propagation in CMP.
  • Experiments were conducted to compare the recommendation performance of CMP with that of network-based inference (NBI) [14], collaborative filtering-based (CF) recommendation [40], content-based (CB) recommendation [11], the content-based bipartite graph (CBG) [41], collaborative representation-based inference (CRC) [42], SVM-based inference [43], hybrid recommendation [21], progressive manifold ranking (PMR), and modularized manifold ranking (MMR) [23].
Precision, mean average precision (MAP), and mean reciprocal rank (MRR) are used to quantitatively evaluate image recommendation performance. Precision evaluates the percentage of correctly recommended images in top-N image recommendation; higher precision indicates more recommended images matching users' personalized preferences and better performance. MAP evaluates the mean of the average precision over the ranking positions of each correctly recommended image, measuring recommendation performance while considering the order of recommended images; higher MAP implies better performance, with correct images taking front rankings in recommendations. MRR assesses the reciprocal rank of each user's first correctly recommended image and averages it over all users as the performance measurement; higher MRR indicates better performance, with correct recommendations ranked toward the front. We carefully implemented the proposed CMP and its comparisons with the same experimental settings and report results on the same dataset for a fair comparison. All experiments were conducted on a PC with a 3.6 GHz CPU and 32 GB of RAM running a Linux environment and MATLAB R2013a.
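For reference, the three metrics can be computed per user roughly as follows (a sketch under the usual definitions; the exact evaluation protocol, e.g., the truncation depth used for MAP, is not specified here and would follow the paper's setup).

```python
import numpy as np

def metrics_at_n(ranked, relevant, n):
    """Precision@N, AP@N, and reciprocal rank for one user's ranked recommendation list.

    ranked: recommended image ids in rank order; relevant: set of ground-truth image ids.
    MAP and MRR are then the means of ap and rr over all users.
    """
    hits, precisions, rr = 0, [], 0.0
    for rank, item in enumerate(ranked[:n], start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at each hit position
            if rr == 0.0:
                rr = 1.0 / rank              # reciprocal rank of the first correct image
    precision = hits / n
    ap = float(np.mean(precisions)) if precisions else 0.0
    return precision, ap, rr
```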

4.1. Interest Manifold Construction

In interest-oriented recommendation, researchers prefer to model user profiles with a user-interest representation learned from user-image correspondence or relationships [3,9,31,44]. As a representative method, Liu et al. [44] employed Word2vec on user-image correspondence to learn user-interest representation. With user representation, cosine similarity between users is another promising way to construct a manifold. Therefore, two options are compared for user interest manifold construction:
  • Relationship-based manifold on the common interest of users from user–image relationships.
  • Representation-based manifold on user representation and similarity evaluation.
For user interest manifold construction, experiments compare the recommendation performance of UMP with the user-image relationship-based manifold and the user representation-based manifold. Figure 3 compares the precision of top-N recommendation of the UMP module using both manifolds and shows that UMP with the relationship-based manifold outperformed the representation-based manifold. The results imply that the common interest of users derived from relationships captures more of the inherent interest distribution for personalized recommendation and makes better use of the rare, sparse cross-modal information in user interest manifold construction. The relationship-based interest manifold directly leverages the given user-image information without distortion; therefore, the user interest manifold of common interest derived from the user-image relationship preserves the inherent structure of the user interest space and achieves better performance. Although user representation is also derived from user-image networks, the sparse, rare user-image correspondence cannot be fully preserved in feature learning, which loses information to some degree, and the predefined similarity measurement causes the captured manifold to lose even more. Representation-based manifold construction thus suffers two rounds of information loss, in feature learning and in similarity measurement, which distorts the inherent user interest manifold and degrades recommendation performance. Therefore, the user interest manifold propagation module in the proposed CMP employs the manifold derived from the common-interest relationship to propagate users' image records in the user space, as illustrated in Figure 2.

4.2. Visual Manifold Construction

Along with sparse user-image interactions, informative visual content is involved in CMP to perform semantic-oriented recommendation. The visual distribution of images, with its abundant information, is more powerful than the valuable but incomplete user-image correspondence. Similar to user interest manifold construction, Sarwar et al. [45] evaluated the pairwise correlation between images according to commonly shared users: the more common users two images share, the higher the correlation between them. The correlation of images measures the proportion of common users shared by the two images as follows:
W_{ij} = \frac{|M_i \cap M_j|}{\sqrt{|M_i|\,|M_j|}}
where M_i represents the set of users having correspondence with image v_i, |M_i| is the size of M_i, and |M_i ∩ M_j| calculates the number of common users having a relationship with both images v_i and v_j, for i, j = 1, 2, ..., n. It is obvious that the manifold of images constructed on commonly shared users is extremely sparse due to the rare user-image correspondences. It might contain very limited information, which would not fit the large-scale visual recommendation problem. To verify this point, experiments compared image recommendation by VMP with two options of image manifold construction, as follows.
  • Relationship-based manifold by commonly shared users between images over user-image relationships.
  • Representation-based manifold on semantic visual correlations over AlexNet-based visual features.
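The relationship-based option mirrors Equation (1) with the roles of users and images swapped; under the assumptions of the sketch in Section 3.1, it amounts to applying the same construction to the transposed interaction matrix.

```python
# Relationship-based image graph (Equation (10)): shared users between images,
# reusing the hypothetical user_interest_graph sketch from Section 3.1 on the
# transposed interaction matrix Y, so that images take the role of "users".
W_rel = user_interest_graph(Y.T)
```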
Figure 4 presents the precision of top-N recommendation of visual manifold propagation with the user-image relationship-based manifold and the visual representation-based manifold, and shows that VMP with the representation-based manifold outperformed the relationship-based manifold. The results verify that (1) representation-based visual features are more powerful in capturing the image distribution for propagating users' visual records, and (2) the sparse user-image relationship provides extremely limited information between images in the face of the large-scale scope of images and their complicated contents. Because visual content contains rich information, and AlexNet provides a good representation for revealing images' semantic distribution, the representation-based visual manifold offers a dense, innate, and detailed visual network. In addition, since semantic visual correlation produces a better construction of the visual distribution, it enables user preference propagation on the visual manifold to exploit more detailed visual relationships when recommending images, which intuitively compensates the UMP module on local semantics. Therefore, the visual manifold propagation module employs the dense visual distribution of high-level semantic features extracted by AlexNet, a different but better data source, to build the visual manifold, as illustrated in Figure 2.

4.3. Collaborative Fusion

Candidate size in cross-modal fusion: CMP propagates user records along the user interest manifold by UMP and generates a user interest-aware candidate set, which is reranked with the visual-view scores from VMP. The size of the candidate set affects recommendation performance, so a suitable candidate size is important for leveraging the advantages of both the user interest and visual semantic views. Figure 5 presents the precision of recommendation with different candidate sizes in CMP. CMP performed best when approximately 1500 images were selected as candidates for recommendation, implying that the trade-off between interest and visual semantics in recommending candidates lies at around 1500 images. Fewer candidates cannot guarantee the scope of user interests, whereas more candidates would likely let content similar to users' historical records crowd the front of the recommendation list. Both cases would significantly affect users' experience. Therefore, CMP takes the trade-off of 1500 candidates for cross-modal image recommendation.
Cross-modal fusion rules: To verify the fusion strategy, a recommendation list was generated with local visual propagation within the candidate set of UMP, i.e., VMP conducted only on the candidate set of UMP, called UMIM. We also performed local UMP recommendation on the candidate set of VMP, called IMUM. Experiments compared the designed interest-aware visual fusion in CMP with UMIM, IMUM, and min, max, and average pooling of the user-image correlations estimated by the UMP and VMP modules. Figure 6 shows the recommendation precision of the proposed CMP compared with the different fusion strategies. CMP performed better than the other strategies. Compared with UMIM and IMUM, CMP considers the global distributions of both user interests and visual semantics and achieves higher precision, whereas UMIM and IMUM treat one distribution globally and the other only locally. This means the proposed interest-aware visual fusion in CMP successfully leverages the merits of both UMP and VMP in estimating user-image correlation, and demonstrates that it fits the recommendation requirements better than its comparisons.
Cross-modal vs. single-modal: As analyzed in Section 4.1 and Section 4.2, the UMP module builds its recommendation on user-image correspondence, while the VMP module builds its recommendation on semantic visual features. They employ different data sources to investigate the user interest and visual distributions, respectively, so the two modules in CMP intuitively compensate each other to some degree. To verify the collaborative capability of UMP and VMP in the proposed CMP framework, an ablation study was performed on them. Figure 7 presents the precision and MAP of top-N recommendation for the proposed cross-modal CMP compared with single-modal UMP and VMP. The results indicate that CMP consistently performed better than VMP regarding both precision and MAP, and CMP achieved better MAP than UMP. On the first six recommended images, CMP recommended more correct images than UMP did, while UMP outperformed the others in precision for the remaining ranks. CMP performed better in MAP than in precision, which means that CMP placed correct images toward the front of the recommendation sequence, and the order is greatly important in recommendation.
The global evaluation MRR is further provided for the order of correct recommendations of CMP in Table 1. It shows that the proposed CMP achieved 45.1% in MRR, which is significantly higher than the 25.6% of UMP and the 32.0% of VMP. The MAP and MRR results indicate that the correct images recommended by CMP were ranked in better order, because in CMP, user interest manifold propagation ensures interest-aware recommendation and semantic visual manifold propagation guarantees the semantic preference of recommended images to users. The performance demonstrates that the UMP and VMP modules collaborate well and compensate each other in learning user-image correlations for recommendation.

4.4. Experimental Comparison

To verify its effectiveness, the recommendation performance of the proposed CMP was further compared with that of NBI [14], CF [40], CB [11], CBG [41], CRC [42], SVM [43], hybrid [21], PMR, and MMR [23]. Figure 8 presents the precision and MAP of CMP compared with these state-of-the-art methods. The proposed CMP consistently achieved better performance than its competitors in terms of precision and MAP, demonstrating that CMP learns better user-image correlations by incorporating the user interest and semantic visual distributions in recommendation. The proposed CMP also has a greater advantage when recommending the front-ranked images. Such performance greatly benefits from the semantic visual propagation of users' records, which involves semantic correlation in user-image inference. The proposed CMP performed better than CF and NBI, which illustrates the effectiveness of visual semantics in cross-modal collaborative learning. CMP outperforms MMR in terms of precision and MAP, which demonstrates the collaborative ability of user interest-based manifold propagation in capturing the global interest trend for recommendation. Lacking investigation of users' interests, CB recommends images similar to users' records, which limits its performance.
Table 2 provides a global evaluation of the proposed CMP recommendation by MRR compared with the other methods. It shows that the proposed CMP achieved better MRR than the others, implying that correct recommendations take front positions in the list. Regarding MRR, CMP achieved 45.1%, a 13.1-point improvement over the second-best 32.0% of MMR. This also verifies the effectiveness of collaborative learning over user interest and visual semantics in CMP. CMP thus effectively utilizes the semantic visual distribution of images and the global user-image correspondence through collaborative manifold propagation, which successfully recommends high-quality, interest-aware images for users.

5. Conclusions

This paper presented cross-modal manifold propagation (CMP) for personalized recommendation. User interest propagation spreads users' image records in the user space, ensuring awareness of global interest trends in recommendations. Visual propagation emphasizes propagating users' image records in the semantic visual space, which guarantees detailed semantic correlation in recommendations. The proposed CMP collaboratively investigates user-image correlation learning across the user and visual spaces with sparse user-image correspondence, and unveils the inherent compensation between interactions and visual signals in revealing users' interests. Experimental results and the corresponding analysis demonstrated the collaborative recommendation ability of user interest and visual semantics in estimating user-image correspondence, which successfully bridges the user intention gap and the visual semantic gap for recommendations.
As a graph-based learning model, the proposed CMP inevitably meets the oversmoothing issue, which limits the effectiveness of manifold propagation in learning users' interests. Therefore, for the recommendation task, the rationality of manifold propagation should be further discussed and modified with elaborate deep propagation layers to alleviate oversmoothing when learning users' interests. On social networks, multimodal information such as texts and videos jointly helps users communicate and express themselves. It is promising to explore multimodal compensation by building contrastive layers to jointly propagate and learn users' interests. We will further investigate the semantics embedded in multiple modalities and fuse their roles with users' personalized interests for recommendations.

Author Contributions

Methodology, M.J., J.G. and T.J.; writing—original draft preparation, X.F.; Writing—review & editing, L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grants no. 62176011, 61702022, 61802011, and 61976010; the Beijing Municipal Education Committee Science Foundation under grant no. KM201910005024; the Inner Mongolia Autonomous Region Science and Technology Foundation under grant no. 2021GG0333; and the Beijing Postdoctoral Research Foundation under grant no. Q6042001202101.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; Chua, T.S. Neural Collaborative Filtering. In Proceedings of the International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 173–182. [Google Scholar]
  2. Wu, L.; Chen, L.; Hong, R.; Fu, Y.; Xie, X.; Wang, M. A Hierarchical Attention Model for Social Contextual Image Recommendation. IEEE Trans. Knowl. Data Eng. 2019, 32, 1854–1867. [Google Scholar] [CrossRef] [Green Version]
  3. Xue, F.; He, X.; Wang, X.; Xu, J.; Liu, K.; Hong, R. Deep item-based collaborative filtering for top-N recommendation. ACM Trans. Inf. Syst. (TOIS) 2019, 37, 1–25. [Google Scholar] [CrossRef] [Green Version]
  4. Wang, X.; He, X.; Wang, M.; Feng, F.; Chua, T.-S. Neural Graph Collaborative Filtering. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 165–174. [Google Scholar]
  5. Gong, C.; Tao, D.; Yang, J.; Liu, W. Teaching-to-Learn and Learning-to-Teach for multi-label propagation. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 1610–1616. [Google Scholar]
  6. Geng, X.; Zhang, H.; Bian, J.; Chua, T.-S. Learning image and user features for recommendation in social networks. In Proceedings of the IEEE International Conference on Computer Vision, Washington, DC, USA, 7–13 December 2015; pp. 4274–4282. [Google Scholar]
  7. Niu, W.; Caverlee, J.; Lu, H. Neural Personalized Ranking for Image Recommendation. In Proceedings of the ACM International Conference on Web Search and Data Mining (ACM WSDM), Marina Del Rey, CA, USA, 5–9 February 2018; pp. 423–431. [Google Scholar]
  8. Jian, M.; Jia, T.; Yang, X.; Wu, L.; Huo, L. Cross-modal collaborative manifold propagation for image recommendation. In Proceedings of the ACM SIGMM International Conference on Multimedia Retrieval, Ottawa, ON, Canada, 10–13 June 2019; pp. 344–348. [Google Scholar]
  9. Wu, Y.; Dubois, C.; Zheng, A.X.; Ester, M. Collaborative denoising auto-encoders for top-N recommender systems. In Proceedings of the ACM International Conference on Web Search & Data Mining, Shanghai, China, 31 January–6 February 2016; pp. 153–162. [Google Scholar]
  10. Chen, J.; Zhang, H.; He, X.; Nie, L.; Wei, L.; Chua, T.S. Attentive collaborative filtering: Multimedia recommendation with item- and component-level attention. In Proceedings of the ACM International Conference on Research & Development in Information Retrieval (SIGIR), Tokyo, Japan, 7–11 August 2017; pp. 153–162. [Google Scholar]
  11. Pazzani, M.J.; Billsus, D. Content-based recommendation systems. Adapt. Web 2007, 4321, 325–341. [Google Scholar]
  12. Cantador, I.; Bellogín, A.; Vallet, D. Content-based recommendation in social tagging systems. In Proceedings of the ACM Conference on Recommender Systems, Barcelona, Spain, 26–30 September 2010; pp. 237–240. [Google Scholar]
  13. Lovato, P.; Bicego, M.; Segalin, C.; Perina, C.; Sebe, N.; Cristani, M. Faved! Biometrics: Tell me which image you like and I’ll tell you who you are. IEEE Trans. Inf. Forensics Secur. 2014, 9, 364–374. [Google Scholar] [CrossRef] [Green Version]
  14. Zhou, T.; Ren, J.; Medo, M.; Zhang, Y.-C. Bipartite network projection and personal recommendation. Phys. Rev. E 2007, 76, 046115. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  15. Zhang, H.; Shen, F.; Liu, W.; He, X.; Luan, H.; Chua, T.-S. Discrete collaborative filtering. In Proceedings of the ACM International Conference on Research & Development in Information Retrieval (SIGIR), Pisa, Italy, 17–21 July 2016; pp. 325–334. [Google Scholar]
  16. Yan, M.; Sang, J.; Xu, C.; Hossain, M.S. A unified video recommendation by cross-network user modeling. ACM Trans. Multimed. Comput. Commun. Appl. 2016, 12, 53. [Google Scholar] [CrossRef]
  17. Niepert, M.; Ahmed, M.H.; Kutzkov, K. Learning Convolutional Neural Networks for Graphs. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2014–2023. [Google Scholar]
  18. Li, Y.; Mei, T.; Cong, Y.; Luo, J. User-curated image collections: Modeling and recommendation. In Proceedings of the IEEE International Conference on Big Data, Santa Clara, CA, USA, 29 October–1 November 2015; pp. 591–600. [Google Scholar]
  19. You, Q.; Bhatia, S.; Luo, J. A picture tells a thousand words–About you! User interest profiling from user generated visual content. Signal Process. 2016, 124, 45–53. [Google Scholar] [CrossRef] [Green Version]
  20. Hong, R.; Li, L.; Cai, J.; Tao, D.; Wang, M.; Tian, Q. Coherent Semantic-Visual Indexing for Large-Scale Image Retrieval in the Cloud. IEEE Trans. Image Process. 2017, 26, 4128–4138. [Google Scholar] [CrossRef]
  21. Wang, H.; Zhang, P.; Lu, T.; Gu, H.; Gu, N. Hybrid recommendation model based on incremental collaborative filtering and content-based algorithms. In Proceedings of the 2017 IEEE International Conference on Computer Supported Cooperative Work in Design, Wellington, New Zealand, 26–28 April 2017; pp. 337–342. [Google Scholar]
  22. Jia, T.; Jian, M.; Wu, L.; He, Y. Modular manifold ranking for image recommendation. In Proceedings of the IEEE International Conference on Multimedia Big Data (BigMM), Xi’an, China, 13–16 September 2018; pp. 1–5. [Google Scholar]
  23. Jian, M.; Guo, J.; Zhang, C.; Jia, T.; Wu, L.; Yang, X.; Huo, L. Semantic Manifold Modularization-based Ranking for Image Recommendation. Pattern Recognit. 2021, 120, 108100. [Google Scholar] [CrossRef]
  24. Mei, T.; Yang, B.; Hua, X.-S.; Li, S. Contextual video recommendation by multimodal relevance and user feedback. ACM Trans. Inf. Syst. 2011, 39, 1–24. [Google Scholar] [CrossRef]
  25. Forestiero, A. Heuristic recommendation technique in Internet of Things featuring swarm intelligence approach. Expert Syst. Appl. 2022, 197, 115904. [Google Scholar] [CrossRef]
  26. Comito, C.; Forestiero, A.; Pizzuti, C. Word Embedding based Clustering to Detect Topics in Social Media. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI), Thessaloniki, Greece, 14–17 October 2019; pp. 192–199. [Google Scholar]
  27. Yuan, Z.; Sang, J.; Xu, C.; Liu, Y. A unified framework of latent feature learning in social media. IEEE Trans. Multimed. 2014, 16, 1624–1635. [Google Scholar] [CrossRef]
  28. Srivastava, N.; Salakhutdinov, R. Multimodal Learning with Deep Boltzmann Machines. In Proceedings of the Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 2222–2230. [Google Scholar]
  29. Yang, X.; Zhang, T.; Xu, C.; Yang, M.-H. Boosted multifeature learning for cross-domain transfer. ACM Trans. Multimed. Comput. Commun. Appl. 2015, 11, 1–18. [Google Scholar]
  30. Yang, X.; Zhang, T.; Xu, C. Cross-domain feature learning in multimedia. IEEE Trans. Multimed. 2014, 17, 64–78. [Google Scholar] [CrossRef]
  31. Kang, C.; Xiang, S.; Liao, S.; Xu, C.; Pan, C. Learning consistent feature representation for cross-modal multimedia retrieval. IEEE Trans. Multimed. 2015, 17, 370–381. [Google Scholar] [CrossRef]
  32. Zhang, J.; Yang, Y.; Tian, Q.; Zhuo, L.; Liu, X. Personalized social image recommendation method based on user-image-tag model. IEEE Trans. Multimed. 2017, 19, 2439–2449. [Google Scholar] [CrossRef]
  33. Sejal, D.; Ganeshsingh, T.; Venugopal, K.R.; Iyengar, S.S.; Patnaik, L.M. ACSIR: ANOVA cosine similarity image recommendation in vertical search. Int. J. Multimed. Inf. Retr. 2017, 6, 1–12. [Google Scholar] [CrossRef]
  34. Harakawa, R.; Takehara, D.; Ogawa, T.; Haseyama, M. Sentiment-aware personalized tweet recommendation through multimodal FFM. Multimed. Tools Appl. 2018, 77, 18741–18759. [Google Scholar] [CrossRef]
  35. Bai, P.; Ge, Y.; Liu, F.; Lu, H. Joint interaction with context operation for collaborative filtering. Pattern Recognit. 2019, 88, 729–738. [Google Scholar] [CrossRef] [Green Version]
  36. Tang, J.; Shu, X.; Li, Z.; Jiang, Y.-G.; Tian, Q. Social anchor-unit graph regularized tensor completion for large-scale image retagging. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2027–2034. [Google Scholar] [CrossRef] [Green Version]
  37. Yu, X.; Jiang, F.; Du, J.; Gong, D. A cross-domain collaborative filtering algorithm with expanding user and item features via the latent factor space of auxiliary domains. Pattern Recognit. 2019, 94, 96–109. [Google Scholar] [CrossRef]
  38. Wu, L.; Sun, P.; Hong, R.; Fu, Y.; Wang, X.; Wang, M. SocialGCN: An efficient graph convolutional network based model for social recommendation. arXiv 2018, arXiv:1811.02815. [Google Scholar]
  39. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
  40. Koren, Y.; Bell, R.; Volinsky, C. Matrix factorization techniques for recommender systems. Computer 2009, 42, 30–37. [Google Scholar] [CrossRef]
  41. Wu, L.; Zhang, L.; Jian, M.; Zhang, D.; Liu, H. Image recommendation on content-based bipartite graph. In Proceedings of the International Conference on Internet Multimedia Computing and Service, Qingdao, China, 23–25 August 2017; pp. 339–348. [Google Scholar]
  42. Zhang, L.; Yang, M.; Feng, X. Sparse representation or collaborative representation: Which helps face recognition? In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 471–478. [Google Scholar]
  43. Joachims, T. Making large-scale SVM learning practical. Tech. Rep. 1998, 8, 499–526. [Google Scholar]
  44. Liu, H.; Wu, L.; Zhang, D.; Jian, M.; Zhang, X. Multi-perspective User2Vec: Exploiting re-pin activity for user representation learning in content curation social network. Signal Process. 2018, 142, 450–456. [Google Scholar] [CrossRef]
  45. Sarwar, B.M.; Karypis, G.; Konstan, J.A.; Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the Web Conference (WWW), Hong Kong, China, 1–5 May 2001; pp. 285–295. [Google Scholar]
Figure 1. Illustration of users’ intention gap between user and visual spaces, visual semantic gap between semantic concepts in visual space, and sparse user-image correspondence in current social networks.
Figure 2. Framework of proposed cross-modal collaborative manifold propagation involving two modules, interest-based (blue) and visual (green) inference modules for image recommendation.
Figure 3. Precision of Top-N recommendation of user interest manifold propagation with user-image relationship-based graph and user representation-based graph.
Figure 4. Precision of top-N recommendation of image manifold propagation with user-image relationship-based graph and visual representation-based graph.
Figure 5. Effect of different candidate sizes in CMP by precision of Top-N recommendation.
Figure 6. Precision of Top-N recommendation for the proposed CMP with different fusion strategies of UMIM, IMUM, and min, max, and ave pooling.
Figure 7. Precision and MAP of top-N recommendation for proposed cross-modal collaborative manifold propagation (CMP) as compared with single-modal user interest manifold propagation (UMP) and visual manifold propagation (VMP).
Figure 8. Performance comparison between CMP and NBI, CF, CB, CBG, CRC, SVM, hybrid, PMR, and MMR regarding precision and MAP.
Table 1. Recommendation performance of CMP as compared with that of UMP and VMP by MRR.
        UMP      VMP      CMP
MRR     25.6%    32.0%    45.1%
Bold number represents best performance by MRR.
Table 2. Performance of CMP and comparisons regarding MRR.
        CRC      SVM     CF      CBG      CB      NBI      Hybrid    PMR      MMR      CMP
MRR     0.16%    4.0%    1.3%    12.7%    2.9%    22.9%    5.4%      26.8%    32.0%    45.1%
Bold number represents the best performance by MRR.