#### **1. Introduction**

Content curation social networks (CCSNs) are booming social networks on which users display, collect, and organize multimedia content. Pinterest is a typical CCSN: since its inception in March 2010, it has grown rapidly, breaking through the 10-million-user barrier [1] in January 2012, and CCSNs have since become popular worldwide. In China, many Pinterest-like networks, such as Huaban, Meilishuo, Mogu Street, and Duitang, came online between 2010 and 2012. The rapid growth of CCSNs has attracted much attention from multiple research fields, such as network characteristic analysis [1,2], user behavior research [3–5], social influence analysis [6], link analysis [7], word embedding [8], search engines [9], user modeling [10–12], and recommender systems [13–25].

As is well known, CCSNs are content-centric social networks [10]. Unlike on user-centric social networks, users on CCSNs pay more attention to the content they collect, which serves not only as a communication carrier but also as a carrier of user interests. Taking Pinterest, one of the best-known CCSNs, as an example, a "pin" is an image with a brief text description supplied by the user. A "board" is a collection of pins of a similar style; in other words, pins are curated into boards by category [19]. On CCSNs, a user's collection is composed of several boards, and a board is composed of pins. The relationships between users, boards, categories, and pins are shown in Figure 1. A "user" represents a user on a CCSN. A "board" is a container of pins and is organized into different categories. A "category" is the category of a board, assigned by the user. A "pin", which is created by a user, is the basic unit, composed of an image and a corresponding brief text description. Users on CCSNs can "follow" the users they are interested in, as on Twitter or Facebook. "Re-pin" is an action like a "repost" or a "retweet", whereby users can re-save pins to their own boards and re-organize them with new descriptions and new categories. "Create" is similar to "post" on Twitter or Weibo, allowing users to post their original content on CCSNs. From the figure, we can see that the pin is the basic unit in CCSNs.

**Figure 1.** Items on content curation social networks (CCSNs) and their relationships.
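To make these relationships concrete, the following is a minimal sketch of the entities described above. The class and field names are illustrative assumptions, not an actual CCSN data model; the point is that a re-pin copies an image into a new board with a new description, while the destination board carries the (possibly different) category.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Pin:
    image_url: str
    description: str           # brief text supplied by the user who saved it

@dataclass
class Board:
    category: str              # chosen by the board's owner
    pins: List[Pin] = field(default_factory=list)

@dataclass
class User:
    boards: List[Board] = field(default_factory=list)
    followees: List["User"] = field(default_factory=list)  # "follow" relation

def repin(pin: Pin, board: Board, new_description: str) -> Pin:
    """Re-save another user's pin into one of the current user's boards.

    The image itself is unchanged; the description is rewritten, and the
    category is implied by the destination board."""
    new_pin = Pin(pin.image_url, new_description)
    board.pins.append(new_pin)
    return new_pin
```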

Besides content collections, there are also abundant social behaviors on CCSNs. Users can follow other users or other users' boards, and they can re-pin other users' pins into their own boards. Furthermore, the re-pin path is recorded in CCSNs: all the users who have re-pinned a pin can be connected along a re-pin path. All users on the same re-pin path have collected the same image, but they have organized it into different boards and different categories, as shown in Figure 2.

On content-centric CCSNs, most user activities are related to pins. Based on statistics from Huaban, Liu et al. [25] found that only 30% of pins are re-pinned from their followers. Furthermore, users often do not follow the users from whose boards they re-pin [4]. A non-trivial number of pins are collected from non-followees [5], and more pins come from native followees than from cross-domain followees [3]. These observations suggest that social relationships are not the main motivation of content discovery on CCSNs; rather, user interests, as represented by pins, play an important role in user behaviors. It is therefore plausible that content-based recommender algorithms will be more effective than social behavior-based algorithms such as collaborative filtering. Inspired by this, we implement recommendations for different tasks based on identical representations of pins. The problem can be broken down into two questions: how to represent a given pin effectively, and how to implement the different tasks with the obtained representation.

**Figure 2.** Illustration of a re-pin tree composed of several re-pin paths. Each star represents a pin, and the *Ci* next to it is the category given by the corresponding user. Note that all pins in the same re-pin tree share an identical image.

As shown in Figure 3, a pin is an image with a text description; hence, both modalities should be utilized for a complete representation. In order to fully exploit the two modalities, we propose a framework that learns a multimodal joint representation of pins. Image representations and text representations are obtained separately by deep models and are then fused to form multimodal joint representations. An intermediate layer of a convolutional neural network (CNN) is used to extract image representations. In order to establish the relation between image representations and user interests, selected images are annotated with their category distributions, computed from the statistics of users' category selections, and these are used to fine-tune the CNN. Text representations are the means of word vectors from a word2vec model trained on public text corpora. Then, a multimodal deep Boltzmann machine (DBM) is trained with the two modalities as inputs, and the activation probabilities of its top layer are extracted as the final representation of pins.


**Figure 3.** Examples of pins on CCSNs: (**a**) Pinterest; (**b**) Huaban.
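The fine-tuning step described above can be sketched as follows. This is a minimal sketch, not the exact setup of this paper: it assumes a ResNet-18 backbone and a soft-label cross-entropy loss, neither of which is specified here. The essential point is that the targets are category distributions rather than one-hot labels.

```python
import torch
import torch.nn as nn
import torchvision.models as models

NUM_CATEGORIES = 32  # assumption: the number of predefined CCSN categories

# Load an ImageNet-pretrained backbone (an assumption) and replace the
# classifier head so the network regresses a category distribution
# instead of predicting a single hard label.
cnn = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
cnn.fc = nn.Linear(cnn.fc.in_features, NUM_CATEGORIES)

def soft_label_loss(logits, target_dist):
    """Cross-entropy against soft targets: each target row is the category
    distribution aggregated from the choices of all users who saved the image."""
    log_probs = torch.log_softmax(logits, dim=1)
    return -(target_dist * log_probs).sum(dim=1).mean()

# One illustrative training step on a placeholder batch.
optimizer = torch.optim.SGD(cnn.parameters(), lr=1e-3, momentum=0.9)
images = torch.randn(8, 3, 224, 224)                       # placeholder images
targets = torch.softmax(torch.randn(8, NUM_CATEGORIES), dim=1)
optimizer.zero_grad()
loss = soft_label_loss(cnn(images), targets)
loss.backward()
optimizer.step()
```

After fine-tuning, the activations of an intermediate layer (rather than the classifier output) serve as the image representation.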

The recommendation tasks include recommending pins to users, recommending thumbnails for boards, recommending categories for boards, and recommending boards to users. On the basis of the representation of pins, pin recommendation becomes a similarity measurement problem in the representation space, which can be solved by ranking the similarities between the candidates and the target pin. A board thumbnail consists of representative pins, which can be selected by clustering the pins in the board. The board category, which is the coarse-grained interest selected by the board's owner, is treated as the accumulation of the category distributions of its pins, and the category distribution of a pin can be obtained with a trained multidimensional logistic regression (LR). Boards and users are treated as pin collections, modeled as the Fisher vector (FV) of all their pin representations, and recommended to target users in a manner similar to the pin recommendation method.
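As a concrete instance of the pin recommendation step, the following sketch ranks candidates by cosine similarity in the joint representation space. Cosine similarity is one plausible choice here, used as an assumption; the similarity measure is not fixed above.

```python
import numpy as np

def recommend_pins(target, candidates, k=10):
    """Rank candidate pins by cosine similarity to the target pin.

    target: (d,) joint representation of the target pin.
    candidates: (n, d) joint representations of candidate pins.
    Returns the indices of the top-k most similar candidates."""
    t = target / np.linalg.norm(target)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = c @ t                      # cosine similarity to every candidate
    return np.argsort(-scores)[:k]
```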

This paper makes the following contributions:


#### **2. Related Work**

With the rise of CCSNs, several studies have been performed, among which research on search engines, user modeling, and recommender systems is the most relevant to ours. Most prior work only studied monomodal data. Yang et al. [16] recommended boards with a text representation model and then re-ranked them with image representations. Liu et al. [22] recommended pins with two unimodal representations separately. Cinar et al. [11] predicted the categories of pins with two kinds of unimodal representations and fused the results of the two modalities using decision fusion. All of these are late fusion models that do not involve multimodal joint representations.

Multimodal joint representation involves unimodal representation models and multimodal fusion schemes. For image representation, CNNs have achieved remarkable performance in the field of computer vision. Creating a large labelled dataset is the key to training CNNs. Cinar et al. [11] and You et al. [12] directly used a pin's category as its label. However, this label may not be absolutely correct, since the same image may be assigned different categories by different users. Geng et al. [10] trained a multitask CNN with ontological concepts, but the ontology was constructed in the fashion domain and is difficult to extend to other domains. Zhai et al. [21] extracted more detailed labels on Pinterest by taking top text search queries, but the quality and cost of this annotation depend heavily on the search engine. Inspired by the fact that the predefined categories on CCSNs are not independent objects but related notions, we use labels formed from the statistics of category distributions and fine-tune a CNN as a multilabel regressor. With regard to text representation, one-hot representations [13,16] and distributed representations, such as those of the word2vec tool [11], have been used. From a practical point of view, the word2vec tool [26], which can capture syntactic and semantic relationships between words in a corpus, is more scalable. In addition, mean vectors [27] of word2vec word embeddings yield usable text representations without further learning.
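As an illustration of this mean-vector scheme, the following sketch computes a pin's text representation with gensim; the model path and the whitespace tokenizer are placeholder assumptions.

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumption: a word2vec model pre-trained on public corpora, stored in the
# standard word2vec binary format (the path is illustrative).
wv = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

def text_representation(description):
    """Mean of the word vectors of all in-vocabulary tokens in a pin's text
    description; a zero vector if no token is in the vocabulary."""
    tokens = [t for t in description.lower().split() if t in wv]
    if not tokens:
        return np.zeros(wv.vector_size)
    return np.mean([wv[t] for t in tokens], axis=0)
```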

Several multimodal fusion studies have been performed on classification and retrieval. Apart from directly concatenating modalities, most existing schemes are designed based on models such as CNNs [28] and recurrent neural networks [8]. These models mainly learn the consistency between multiple modalities and cannot deal well with missing input modalities. On the generative side, latent Dirichlet allocations (LDAs) [29], restricted Boltzmann machines (RBMs) [30], deep autoencoders (DAEs) [31], and deep Boltzmann machines (DBMs) [32] have been proven to be feasible methods for learning both the consistency and the complementarity between modalities, and they can easily deal with missing modalities. However, limited studies have focused on fusing features obtained from these deep learning models. Zhang et al. [33] used a DAE to fuse textual features extracted by training the word2vec tool [26] and visual features generated by the sixth layer of AlexNet [34]. However, no existing studies have used information from all modalities on CCSNs for recommendation tasks. In this paper, we train a multimodal DBM to handle the situation in which the data from CCSNs are unlabeled and some modality inputs are missing, and we use features obtained by deep learning as the inputs to make our multimodal representation more accurate and compact.

Compared to pin and board category recommendation, few studies have been performed on board and user recommendation. Kamath et al. [13] and Wu et al. [23] model boards and users, respectively, with text data, and some collaborative filtering methods [15,20,25] recommend users based on user behaviors, but none of them take images, the essential content on CCSNs, into account. Yang et al. [17] represent boards by sparse coding the descriptors of images, but, similarly to Yang et al. [16] mentioned above, their method requires cross-domain information. Moreover, the information loss of a sparse code based on a cluster dictionary is greater than that of the FV based on a Gaussian mixture model (GMM). Furthermore, no studies on board thumbnails have been published.
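To make the FV encoding of a board concrete, the following is a simplified sketch that keeps only the gradients with respect to the GMM means (a full improved FV also includes the gradients with respect to the variances). The dimensionalities and data are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(pin_reps, gmm):
    """Simplified Fisher vector of a board: gradients of the GMM
    log-likelihood with respect to the component means only.

    pin_reps: (n, d) representations of the pins in the board.
    gmm: GaussianMixture with diagonal covariances, fitted on a large
         sample of pin representations."""
    n = pin_reps.shape[0]
    gamma = gmm.predict_proba(pin_reps)                  # (n, K) posteriors
    diff = (pin_reps[:, None, :] - gmm.means_) / np.sqrt(gmm.covariances_)
    grad = (gamma[:, :, None] * diff).sum(axis=0)        # (K, d)
    grad /= n * np.sqrt(gmm.weights_)[:, None]
    fv = grad.ravel()
    # Power and L2 normalization, as is standard for improved FVs.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

# Hypothetical usage: fit the GMM on pooled pin representations, then
# encode a board as the FV of its own pins.
all_pins = np.random.randn(1000, 64)                     # placeholder data
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(all_pins)
board_fv = fisher_vector(all_pins[:20], gmm)
```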

Existing research focuses on only one recommendation task at a time, whereas the method in this paper uses an identical pin representation to accomplish different recommendation tasks, which simplifies the problems and saves resources.

#### **3. Multimodal Joint Representation of Pins**

A pin, that is, an image with a text description, is the basic item and the carrier of user interests on CCSNs. The purpose of this section is to learn the representation of pins from both modalities. As the foundation of further applications, the representation should encode information about user interests.

The proposed framework for learning the multimodal joint representation of pins is shown in Figure 4. The process comprises three parts: image representation learning, text representation learning, and multimodal fusion. For given pins, their images are fed into a CNN that has been fine-tuned on an automatically annotated image dataset, and the activations of one of its intermediate layers are extracted as image representations. Meanwhile, text representations are computed by applying mean pooling to word vectors derived from the word2vec tool, trained on public text corpora. Then, a multimodal DBM is trained on both the image and text representations. Finally, the activation probabilities of the last hidden layer of the multimodal DBM are inferred as the desired multimodal joint representation of pins.

**Figure 4.** Framework of learning multimodal joint representation of pins (CNN: convolutional neural network).
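The inference step of the final stage can be sketched as follows. This is a heavily simplified stand-in, assuming one hidden layer per modality pathway below the joint layer and a single bottom-up pass to approximate the mean-field activation probabilities; an actual multimodal DBM uses deeper pathways and iterative mean-field inference, and all dimensions and weights below are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_representation(img_feat, txt_feat, params):
    """Approximate the activation probabilities of the DBM's top layer with
    a single bottom-up pass (real mean-field inference would iterate these
    updates over all layers until convergence)."""
    h_img = sigmoid(img_feat @ params["W_img"] + params["b_img"])
    h_txt = sigmoid(txt_feat @ params["W_txt"] + params["b_txt"])
    h = np.concatenate([h_img, h_txt])        # join the two pathways
    return sigmoid(h @ params["W_joint"] + params["b_joint"])

# Placeholder dimensions and random weights, for illustration only.
rng = np.random.default_rng(0)
d_img, d_txt, h_dim, j_dim = 4096, 300, 256, 128
params = {
    "W_img": rng.normal(scale=0.01, size=(d_img, h_dim)), "b_img": np.zeros(h_dim),
    "W_txt": rng.normal(scale=0.01, size=(d_txt, h_dim)), "b_txt": np.zeros(h_dim),
    "W_joint": rng.normal(scale=0.01, size=(2 * h_dim, j_dim)), "b_joint": np.zeros(j_dim),
}
rep = joint_representation(rng.normal(size=d_img), rng.normal(size=d_txt), params)
```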
