#### *4.1. Pin Recommendation*

Pin recommendation is a crucial function for content discovery on CCSNs. Since pins with similar interests lie close to each other in the representation vector space, recommendations can be made directly in that space. Considering that the different boards a user collects have different characteristics, for a given target user the similarity between the pins in each board and the candidate pins is computed in the vector space, and the candidates are ranked by similarity in descending order, so that different pins are recommended for different boards. Most similarity metrics can be used; cosine similarity was computed in our work. The most similar pins are selected as candidates.
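
As a minimal sketch of this ranking step (assuming pin representations are already available as NumPy row vectors; the aggregation over a board's pins and the function name are illustrative, since they are not specified above), the recommendation can be written as:

```python
import numpy as np

def recommend_pins(board_vectors, candidate_vectors, top_k=10):
    """Rank candidate pins by cosine similarity to the pins of one board.

    board_vectors:     (n_board_pins, d) representations of the pins in the board.
    candidate_vectors: (n_candidates, d) representations of the candidate pins.
    Returns the indices of the top_k most similar candidates.
    """
    # L2-normalize rows so that dot products equal cosine similarities
    b = board_vectors / np.linalg.norm(board_vectors, axis=1, keepdims=True)
    c = candidate_vectors / np.linalg.norm(candidate_vectors, axis=1, keepdims=True)

    # Similarity of each candidate to the board: maximum over the board's pins
    sims = (c @ b.T).max(axis=1)

    # Rank by similarity in descending order and keep the most similar pins
    return np.argsort(-sims)[:top_k]
```

Applying this per board yields different recommendations for the different boards of the same user.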

#### *4.2. Board Thumbnail Recommendation*

Boards are displayed as thumbnails on all public and personal home pages. A thumbnail consists of a cover plus two or three small images, or simply six small images. A well-designed thumbnail can attract other users to access the board. Both Pinterest and Huaban allow users to select a cover from the pins of a board, but neither recommends candidates to the user. As illustrated in Figure 6a, once a cover is selected, the small images are automatically taken from the two latest pins; if the user has not selected a cover, the thumbnail is composed of the six latest pins. Without any recommendation, it is difficult for a user to select a suitable image to represent the board. Furthermore, a thumbnail composed of the latest pins may fail to represent the board: boards such as the bottom two in Figure 6a cover such a wide range of interests that the images in the thumbnail cannot fully express them. Thumbnails on Huaban, which consist of the cover and the three latest pins, have the same drawbacks, as shown in Figure 6b,c.

In view of the above, we defined a new task for recommending board thumbnails. The mean vector of the pins in a board is computed as the center of the board, and the pins nearest to this center are selected as cover candidates. We then cluster the pins of the board and select the images closest to the cluster centers as substitutes for the latest pins.
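
A possible sketch of this procedure (assuming NumPy pin representations, Euclidean distance to the board center, and scikit-learn's KMeans; all names and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def recommend_thumbnail(pin_vectors, n_small_images=3, n_cover_candidates=5):
    """Suggest cover candidates and small images for a board.

    pin_vectors: (n_pins, d) array of the representations of the board's pins.
    Returns (cover_candidate_indices, small_image_indices).
    """
    # Center of the board: mean of its pin vectors
    center = pin_vectors.mean(axis=0)

    # Cover candidates: the pins nearest to the board center
    dists = np.linalg.norm(pin_vectors - center, axis=1)
    cover_candidates = np.argsort(dists)[:n_cover_candidates]

    # Small images: cluster the pins and take the pin closest to each cluster center
    km = KMeans(n_clusters=n_small_images, n_init=10).fit(pin_vectors)
    small_images = [int(np.argmin(np.linalg.norm(pin_vectors - c, axis=1)))
                    for c in km.cluster_centers_]
    return cover_candidates, small_images
```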

**Figure 6.** Examples of board thumbnails on CCSNs: (**a**) includes examples from Pinterest; (**b**,**c**) are from Huaban.

#### *4.3. Board Category Recommendation*

On CCSNs, every board should be assigned a category, although some boards without a category were created before category selection became compulsory. Such uncategorized boards are illogical: even when it is difficult to choose a single category for a board that mixes different user interests, the user can still select the category "other". Board category recommendation makes this choice convenient, not only at first selection but also for later editing.

As mentioned in Section 3.1, the interests associated with an image can be spread over the categories that occur in its re-pin tree. The only way to estimate a user's preference for one image is to analyze its description and category. Since individual understanding of notions differs, even manual analysis cannot always determine which single category the user intends; such ambiguity is common in this setting. We consider that personalization on CCSNs is mainly formed by the way a user organizes his or her boards. Hence, just as the user interests reflected by pins span more than one category, so do the interests reflected by boards. As the number of pins grows, the dominant category preference of a board is reinforced. The interest distribution of a board *B* can be calculated as the average of the interest distributions of all its pins:

$$Interest\_{B} = \frac{1}{N\_{B}} \sum\_{i}^{N\_{B}} Interest\_{I\_{i}} = \left(\frac{1}{N\_{B}} \sum\_{i}^{N\_{B}} p\_{iC\_{j}}\right) \in [0, 1]^{N\_{C}},\tag{8}$$

where *NB* denotes the pin count of *B*, and *InterestIi* = (*piCj*) ∈ [0, 1]<sup>*NC*</sup> is the interest distribution of the *i*-th pin *Ii*. In order to infer *InterestIi*, we trained a multidimensional LR between the representations of pins and the labels obtained in Equation (1). The resulting *InterestB* is then normalized. The recommended category is the one with the highest value in the board interest distribution.
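
A hedged sketch of this recommendation step (assuming a fitted multidimensional LR `lr` whose `predict_proba`-style output gives one value per category for each pin; the helper name and the renormalization are illustrative):

```python
import numpy as np

def recommend_board_category(pin_vectors, lr, category_names):
    """Average per-pin interest distributions (Eq. 8) and return the dominant category.

    pin_vectors:    (n_pins, d) multimodal joint representations of the board's pins.
    lr:             fitted multidimensional logistic regression over pin representations.
    category_names: list of the N_C category names.
    """
    # Per-pin interest distributions Interest_{I_i}
    pin_interests = lr.predict_proba(pin_vectors)      # shape (n_pins, N_C)

    # Board interest distribution: mean over pins, then normalized
    board_interest = pin_interests.mean(axis=0)
    board_interest /= board_interest.sum()

    # Recommend the category with the highest value
    return category_names[int(np.argmax(board_interest))], board_interest
```

The same averaging, applied to all pins of a user instead of a board, yields the user interest distribution of Equation (9) below.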

This method can also be used for computing the interest distribution of a user. As an important part of the user profile, the interest distribution of a user can be intuitively represented by normalizing the frequency distribution of the categories of his or her boards or pins. However, such a frequency distribution has several limitations. First, it cannot deal with the absence of some categories: a user having collected nothing in a category does not mean the user is not interested in it. Second, the ratios between categories may be inaccurate, not only because categories are related, but also because images related to certain fine-grained interests are rarer than others, so the user cannot collect enough pins related to those interests. Third, it cannot be used to represent the interests of a board, because all pins in a board share the same category. Our interest distribution of a target user *U* is instead computed as

$$Interest\_{U} = \frac{1}{N\_{U}} \sum\_{i}^{N\_{U}} Interest\_{I\_{i}} = \left(\frac{1}{N\_{U}} \sum\_{i}^{N\_{U}} p\_{iC\_{j}}\right) \in \left[0, 1\right]^{N\_{C}}, \tag{9}$$

where *NU* denotes the pin count of *U*. Because *InterestIi* actually spreads over all the categories, *InterestU* does not suffer from the absence of some categories. In addition, the ratio error caused by the imbalance between the pin counts of boards is reduced to some extent, since strong categories accumulate faster than weak categories.

#### *4.4. Board and User Recommendation*

As pins are assembled, the theme of a board emerges, and well-organized boards make it easy for users to collect pins. For this reason, board recommendation is another important function for content discovery on CCSNs. In this section, we discuss how to model boards and users using the acquired multimodal joint representations of pins.

There is an analogy between user content on CCSNs and articles: user content consists of boards, which consist of pins, while articles are composed of paragraphs or sentences, which are composed of words. One clear difference is that the order of pins or boards is far less important, so the loss of order information is not an issue when modeling. Inspired by this, we consider that applying pooling methods to transform a variable number of pins into a fixed-dimension vector, as mentioned in Section 3.2, is reasonable for board and user modeling. Among pooling methods, the Fisher vector (FV) was chosen as our solution for board and user modeling.

The FV [37] was designed for encoding the patch descriptors of an image into a high-dimensional vector. Since boards and users are image collections, a pin can be treated as a descriptor of them. A common method to encode a set of descriptors is to assign them to a visual dictionary composed of prototypical elements such as cluster centers, while the FV approximates the distribution of descriptors with a GMM, whose Gaussian components can be treated as a universal probabilistic visual dictionary. For the representation of a pin *VXi* = (*vij*) ∈ R<sup>*J*</sup>, the GMM is defined as

$$\text{GMM}\left(V\_X\right) = \sum\_{k=1}^{K} \omega\_k \, \text{norm}\_k\left(V\_X\right),\tag{10}$$

where norm*k* denotes the *k*-th multivariate normal distribution, *ωk* is the weight of the *k*-th mixture component, subject to the constraints that all *ωk* ≥ 0 and the weights sum to one, and *K* is the number of mixture components. The parameters of the GMM also include the mean vector *μk* = (*μkj*) ∈ R<sup>*J*</sup> and the covariance matrix Σ*k* of the *k*-th mixture component. The FV first computes the partial derivatives of the logarithm of the GMM with respect to its parameters, and then normalizes them with the Fisher information matrix. The simplified normalized partial derivatives for a board *B* are given by

$$\mathbf{g}\_{\omega\_k} = \frac{1}{N\_B \sqrt{\omega\_k}} \sum\_{i=1}^{N\_B} (\gamma\_{ik} - \omega\_k),\tag{11}$$

$$\mathbf{g}\_{\mu\_k} = \left( \frac{1}{N\_B \sqrt{\omega\_k}} \sum\_{i=1}^{N\_B} \gamma\_{ik} \left( \frac{v\_{ij} - \mu\_{kj}}{\sigma\_{kj}} \right) \right) \in \mathbb{R}^J,\tag{12}$$

$$\mathbf{g}\_{\sigma\_k} = \left( \frac{1}{N\_B \sqrt{2\omega\_k}} \sum\_{i=1}^{N\_B} \gamma\_{ik} \left[ \frac{\left(v\_{ij} - \mu\_{kj}\right)^2}{\sigma\_{kj}^2} - 1 \right] \right) \in \mathbb{R}^J,\tag{13}$$

where *σkj* denotes the standard deviation of the *j*-th dimension of the *k*-th mixture component, and *γik* is the soft assignment of *VXi* to the *k*-th mixture component, which is written as

$$\gamma\_{ik} = \frac{\omega\_k \text{norm}\_k \left(V\_{X\_i}\right)}{\sum\_{k=1}^K \omega\_k \text{norm}\_k \left(V\_{X\_i}\right)} = \frac{\omega\_k \text{norm}\_k \left(V\_{X\_i}\right)}{\text{GMM} \left(V\_{X\_i}\right)},\tag{14}$$

and is also known as the posterior probability or responsibility. All partial derivatives are concatenated to compose the FV. Since one of *ωk* is redundant because of the constraints, the dimension of the FV is (2*J* + 1) *K* − 1. Power normalization and *L*2-normalization [38] are applied to improve the quality of the FV as follows:

$$g\_i \leftarrow \text{sgn}\left(g\_i\right)|g\_i|^\rho,\tag{15}$$

$$g\_i \leftarrow \frac{g\_i}{\sqrt{\sum\_{j=1}^{(2J+1)K-1} g\_j^2}},\tag{16}$$

where *ρ* ∈ [0, 1] is the normalization parameter. The FV of a user can be computed in the same manner. Please refer to the original paper for more details regarding the FV.

In essence, the FV is the gradient of the log-likelihood of a board. Notice that the computations of Equations (11)–(13) can be simplified with

$$S\_k^0 = \sum\_{i}^{N\_B} \gamma\_{ik},\tag{17}$$

$$S\_k^1 = \left(\sum\_{i}^{N\_B} \gamma\_{ik} v\_{ij}\right) \in \mathbb{R}^J,\tag{18}$$

$$S\_k^2 = \left(\sum\_{i}^{N\_B} \gamma\_{ik} v\_{ij}^2\right) \in \mathbb{R}^J,\tag{19}$$

where *S*<sup>0</sup><sub>*k*</sub>, *S*<sup>1</sup><sub>*k*</sub>, and *S*<sup>2</sup><sub>*k*</sub> are the zeroth-, first-, and second-order statistics of the board, respectively. Accordingly, the FV preserves more information than other pooling methods, such as the vector of locally aggregated descriptors (VLAD) and sparse coding, for the same dictionary capacity: it measures not only which words of the visual dictionary the pins belong to, but also the differences between the mean vectors of the GMM and the board or user. On the other hand, the FV uses a relatively small dictionary to generate a vector of the same dimension as the others, so the computational complexity is lower. In addition, the FV is interpretable: if we regard the mean vectors as centers of interests, increasing *K* makes the FV more fine-grained, although the curse of dimensionality is a significant limitation of the FV. For large-scale applications, the FV can be losslessly compressed by sparsity encoding with product quantization [39].
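
The following sketch assembles Equations (11)–(16) for one board with NumPy, assuming a diagonal-covariance GMM whose parameters have been fitted beforehand (e.g., with scikit-learn); variable names are illustrative:

```python
import numpy as np

def fisher_vector(pin_vectors, weights, means, sigmas, rho=0.5):
    """Fisher vector of a board under a diagonal-covariance GMM (Eqs. 11-16).

    pin_vectors: (N_B, J) representations of the board's pins.
    weights: (K,) mixture weights; means: (K, J); sigmas: (K, J) standard deviations.
    """
    N_B, J = pin_vectors.shape
    K = weights.shape[0]

    # Responsibilities gamma_{ik} (Eq. 14), computed in log space for stability
    log_prob = np.empty((N_B, K))
    for k in range(K):
        z = (pin_vectors - means[k]) / sigmas[k]
        log_prob[:, k] = (np.log(weights[k]) - 0.5 * np.sum(z ** 2, axis=1)
                          - np.sum(np.log(sigmas[k])) - 0.5 * J * np.log(2 * np.pi))
    log_prob -= log_prob.max(axis=1, keepdims=True)
    gamma = np.exp(log_prob)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Simplified normalized partial derivatives (Eqs. 11-13)
    g_w, g_mu, g_sigma = [], [], []
    for k in range(K):
        diff = (pin_vectors - means[k]) / sigmas[k]
        g_w.append(np.sum(gamma[:, k] - weights[k]) / (N_B * np.sqrt(weights[k])))
        g_mu.append(gamma[:, k] @ diff / (N_B * np.sqrt(weights[k])))
        g_sigma.append(gamma[:, k] @ (diff ** 2 - 1) / (N_B * np.sqrt(2 * weights[k])))

    # One weight derivative is redundant; the final dimension is (2J + 1)K - 1
    fv = np.concatenate([np.asarray(g_w[:-1]), np.ravel(g_mu), np.ravel(g_sigma)])

    # Power normalization (Eq. 15) followed by L2 normalization (Eq. 16)
    fv = np.sign(fv) * np.abs(fv) ** rho
    return fv / np.linalg.norm(fv)
```

With *K* = 1 the weight part is empty, so the FV reduces to the concatenation of the mean and standard-deviation gradients, which is the setting used in Section 5.1.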

After modeling, boards can be recommended according to the similarity between their representations and that of the target board. Because users can be regarded as image collections with wider interests than boards, user recommendation performed in the same manner is also helpful for content discovery, although following other users is not a major activity on CCSNs.

#### **5. Experiments and Results**

In this section, the datasets and implementation details are first introduced. Then, the performance of our pin representations is evaluated through an interest analysis. Thereafter, the results of experiments on real-world datasets are presented to verify the feasibility and effectiveness of our recommendation methods.

#### *5.1. Datasets and Implementation Details*

We crawled the data used in the experiments from Huaban, a typical Chinese CCSN. Huaban provides applications similar to those of Pinterest, with the following main differences: Huaban offers "like" operations for pins and boards, which Pinterest does not; and Huaban records both the users and the paths in a re-pin tree, whereas Pinterest only records the users and the original creator.

We first crawled the pins of 5957 users without images, and then sampled 88 users according to board categories and pin counts. Some extremely active users and some cold-start users were deliberately included to make the dataset diverse and to take the influence of pin counts into account. We then crawled all images of the sampled users and all of their "like" pins. In addition, we crawled the top 1000 pins recommended by the system for each category to fine-tune AlexNet, along with their re-pin paths for automatic annotation. The dataset for recommendation included 151,631 pins from 1694 boards, categorized into 33 categories, and the number of unique images for both fine-tuning and recommendation was 167,747. All pins were used as supplementary elements for obtaining the category distributions of the recommended pins. The average re-pin path length was 47.57.

After a small amount of manual label balancing, the labelled images were split into 80% for training and validation and the remaining 20% for testing. Because the input dimension of AlexNet must be constant, every image was first rescaled so that its shorter side was 256 pixels, and then the central 256 × 256 patch of the rescaled image was cropped out. The loss layer of our AlexNet was replaced. As a comparison, the most frequent category was used as the label to fine-tune a multiclass AlexNet. The dimensions of the FC8 layers of both AlexNets were changed to 33. Image representations were generated from the FC7 layer of the multilabel AlexNet.
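
A minimal preprocessing sketch with Pillow (the resampling filter is an assumption; only the target sizes are specified above):

```python
from PIL import Image

def preprocess(path, short_side=256, crop=256):
    """Rescale so the shorter side is `short_side` pixels, then take the central crop."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = short_side / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)

    w, h = img.size
    left, top = (w - crop) // 2, (h - crop) // 2
    return img.crop((left, top, left + crop, top + crop))
```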

We trained our Word2Vec model on Wikipedia dumps (https://dumps.wikimedia.org/) and the Sogou Labs dataset (http://www.sogou.com/labs/resource/list\_news.php) using the CBOW (Continuous Bag of Words) model with negative sampling, and the vector dimension was 300. Words with a frequency lower than five were ignored. Text preprocessing, including removing punctuation, converting traditional Chinese to simplified Chinese, word tokenization, machine translation, and removing stop words, was applied to the pin descriptions.
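
A sketch of the corresponding gensim setup (gensim ≥ 4 parameter names; the corpus placeholder and the negative-sample count are assumptions, since only the dimension, CBOW, negative sampling, and the frequency threshold are stated above):

```python
from gensim.models import Word2Vec

# Placeholder corpus: in practice, an iterable over tokenized, preprocessed
# Wikipedia / Sogou sentences and pin descriptions.
sentences = [["example", "tokenized", "sentence"], ["another", "example"]]

model = Word2Vec(
    sentences,
    vector_size=300,  # 300-dimensional word vectors
    sg=0,             # CBOW
    negative=5,       # negative sampling (the sample count is an assumption)
    min_count=5,      # ignore words with a frequency lower than five
)
```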

All image and text representations were used for training the multimodal DBM. The dimensions of *HT*1, *HT*2 and *HV*1 were the same as their corresponding visible inputs, and the dimensions of *HV*2 and *H*3 were set to 2048 to compress the vectors, since the FV would later increase the dimension. Each layer was pretrained with a contrastive divergence strategy to accelerate the training of the DBM. Then, missing text representations were inferred with a Gibbs sampler, and the multimodal joint representations of the pins were obtained.

*K* in Equation (10) was set to 1, such that the dimension of the FV of a board was twice that of the pin vector. *ρ* in Equation (15) was set to 0.5.
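
As a quick check of the resulting dimension, with *K* = 1 the redundant weight derivative disappears, so that

$$(2J + 1)K - 1 = 2J,$$

i.e., a 2048-dimensional pin vector gives a 4096-dimensional board FV, twice the pin vector dimension as stated above.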

To evaluate the effectiveness of the proposed model, we compared it with the following multimodal deep architectures: the Multimodal Autoencoder (MAE), which was proposed in [31] and connects two deep autoencoders of different modalities through a shared hidden layer; and ICMAE, which imposes Independent Component Analysis (ICA) constraints on the MAE architecture to de-correlate the relationships among the variables. All baseline methods had the same number of layers, and we used the same features as inputs to ensure that the comparisons were fair.

#### *5.2. Analysis of Interests Represented by Pins*

The analysis of interests based on pins is the prerequisite for the analysis of interests based on boards and users. As mentioned above, it is hard to measure the interest distribution of a single pin. Hence, we treated the interest distribution of its image as an approximation, even though some categories would be better captured if the text description were also taken into account.

Multidimensional LRs were trained on the fine-tuning dataset for all unimodal and multimodal representations. Table 1 shows the results, together with those of multiclass classification with softmax. The mean nonzero error is the average error between all nonzero categories of the labels and the corresponding predictions. The accuracy of the dominant category checks whether the most frequent category agrees between labels and predictions. The comparison of the multiclass and multilabel CNNs shows that our multilabel annotation improves the accuracy significantly, not only because the interference of related categories is eliminated by the category distributions, but also because more information from the users' collective intelligence is provided for learning. Although the performance of the text representations was not comparable to that of the image representations, the multimodal joint model outperformed the image representations alone because the two modalities are complementary.
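
The two metrics could be computed along the following lines (a sketch; the exact error measure behind the mean nonzero error is an assumption, absolute error is used here, and the function names are illustrative):

```python
import numpy as np

def mean_nonzero_error(y_true, y_pred):
    """Average error over the nonzero categories of each label distribution."""
    mask = y_true > 0
    return float(np.abs(y_true - y_pred)[mask].mean())

def dominant_category_accuracy(y_true, y_pred):
    """Fraction of samples whose dominant (most frequent) category is predicted correctly."""
    return float((y_true.argmax(axis=1) == y_pred.argmax(axis=1)).mean())
```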


**Table 1.** Comparison of Pin Category Prediction (MAE: Multimodal Autoencoder).

From the results, we can see that our method had the best performance: all unimodal and multimodal representations contained information about user interests, and our joint representation contains richer information than the other methods. Our method could also be used to analyze the interests of images on other networks. The comparison with MAE/ICMAE shows that the joint representation of pins learned by our method has a higher correlation with their categories.
