#### *5.3. Pin Recommendation*

We invited 10 users to participate in the evaluation of pin recommendation. Each user was given 200 randomly selected target images and the corresponding recommendation results of the different methods. For each target, they decided whether they would pin any of the three candidates, unless they were the owner of the target pin. Table 2 shows the precision of the recommendations. A simple content-based filtering method, which randomly selects an image in the same category as the target image, was implemented as a reference. All other methods achieved higher accuracy than this category-based method, simply because they utilized more information to reduce the effect of related categories. The object-based and interest-based methods used the probability layer of the original AlexNet and the multilabel AlexNet, respectively. The results of these two methods were comparable, although interest distributions were more compact than object distributions. This indicates that even coarse-grained interests of an image were slightly more informative than what the image depicted on CCSNs. The other methods computed cosine similarity between representations. We note that fine-tuning AlexNet with only dominant categories as labels led to a decline, which may have been caused by confusion between similar images with different categories. Notice that the performance of multimodal features was worse than that of image features. We believe that the descriptions could not fully capture all the interests and characteristics of the images. Our text representations were clearly not as effective as our image representations; therefore, image representations were more suitable for image recommendation.
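The similarity-based methods above rank candidates by the cosine similarity between feature vectors. A minimal sketch of this ranking step (the function name and toy vectors are illustrative, not from the paper):

```python
import numpy as np

def recommend_pins(target_vec, candidate_vecs, top_k=3):
    """Rank candidate pins by cosine similarity to the target image's
    feature vector and return the indices of the top_k matches."""
    target = target_vec / np.linalg.norm(target_vec)
    cands = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = cands @ target             # cosine similarity per candidate
    return np.argsort(-sims)[:top_k]  # highest similarity first

# Toy example: candidate 1 points in nearly the same direction as the target.
target = np.array([1.0, 0.0, 0.0])
candidates = np.array([[0.0, 1.0, 0.0],
                       [2.0, 0.1, 0.0],
                       [0.5, 0.5, 0.5]])
print(recommend_pins(target, candidates, top_k=2))  # → [1 2]
```

The same ranking applies whether the vectors are image, text, or multimodal representations; only the feature extractor changes.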


**Table 2.** Comparison of image recommendation.

Figure 7 illustrates 10 images and their recommendation results. Intrinsic characteristics such as background, scene, pattern, texture, color, object, and material are clearly preserved in the image representation and usually influenced the recommendations, especially for images in the left panel. Images in the right panel show that more abstract notions, for example, style and user interest, also influenced the results. All these high-level image features learned by CNNs could significantly improve the accuracy and diversity of the recommendations.

**Figure 7.** Examples of recommendation results based on image representation. Images with red borders are the target images.

From the recommendation results, we can clearly see that our model recommended images of similar styles and types. This means that our model achieved good content-based recommendation and further illustrates that the features extracted by our multimodal joint representation model were effective. The recommendation data differed from the training data, which indicates that our model generalized well.

#### *5.4. Board Thumbnail Recommendation*

In this experiment, we recommended the board thumbnail according to the interest distributions of pins and the representations of pins. Because Huaban does not yet offer a thumbnail-editing function, we manually re-pinned all pins from the original board and changed the order of the pins to display our results.

Figure 8a,b show the recommendation results for the board in Figure 6b, which covers narrow interests. As shown in Figure 6b, the pins on the board are album covers of a music group. The four pins in the original thumbnail all came from the same album. The strongest categories for this board were "film music books" (20.87%), "design" (16.24%), and "architecture" (11.50%), while those for the cover in Figure 8a were "film music books" (20.44%), "design" (14.15%), and "architecture" (10.81%). Three clusters, whose centers mainly belonged to "photography" (15.72%), "film music books" (80.93%), and "architecture" (47.19%), contained 30, 7, and 4 pins, respectively. This indicates that the recommendation results are consistent with the target board thumbnail. On the other hand, the four components of the recommended thumbnail came from different albums. Similar to the result generated with interest distributions, the result generated with image features comprises pins from different albums, partly because image representations were also related to interests. Our results also indicate that even narrow interests could be subdivided. Recommending thumbnails for a board covering wide interests was obviously easier; the recommendations for Figure 6c are shown in Figure 8c,d. We believe that our recommended thumbnails, which depicted more interests, were more attractive.

**Figure 8.** Results of the board thumbnail recommendation: (**a**,**c**) are generated with interest distributions of pins; (**b**,**d**) are generated with representations of pins.
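The clustering step described above, i.e., grouping pins and picking one representative per cluster as a thumbnail component, can be sketched with a tiny k-means over interest distributions. This is a simplified stand-in (the function name, cluster count, and toy data are illustrative; the paper's exact clustering procedure is not reproduced here):

```python
import numpy as np

def thumbnail_pins(interest_dists, k=4, iters=20, seed=0):
    """Cluster pins by their interest distributions with a tiny k-means and
    pick the pin closest to each cluster centre as a thumbnail component."""
    rng = np.random.default_rng(seed)
    X = np.asarray(interest_dists, dtype=float)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centres[None], axis=2)
        labels = dists.argmin(axis=1)          # nearest-centre assignment
        for c in range(k):                     # recompute non-empty centres
            if np.any(labels == c):
                centres[c] = X[labels == c].mean(axis=0)
    dists = np.linalg.norm(X[:, None] - centres[None], axis=2)
    return sorted({int(dists[:, c].argmin()) for c in range(k)})

# Toy board: two pins about photography, two about architecture.
pins = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
print(thumbnail_pins(pins, k=2))  # → [0, 2], one representative per cluster
```

Choosing the pin nearest each cluster center is what makes the thumbnail reflect each distinct interest on the board rather than repeating one dominant interest.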

#### *5.5. Board Category Recommendation*

The ground truth for board category recommendation was the crawled board category. The performance metric was the mean reciprocal rank (MRR). Because each board has only one correct category, we report only the overall MRR. The results are shown in Table 3.
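For reference, MRR averages the reciprocal of the 1-based rank at which the single correct item appears for each query. A minimal sketch (function name and toy ranks are illustrative):

```python
def mean_reciprocal_rank(ranks):
    """MRR over queries, where ranks[i] is the 1-based rank at which the
    single correct category appeared for query i."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three boards whose true categories were ranked 1st, 2nd, and 4th:
print(mean_reciprocal_rank([1, 2, 4]))  # → (1 + 0.5 + 0.25) / 3 ≈ 0.583
```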

From the table, we can see that our model achieved the highest MRR. Because the board category recommendations were based on different features but the same classifier, the best result implies the best features. Our best recommendation results illustrate that multimodal representations, with the benefit of personalized text representations, performed better than the other baselines.

**Table 3.** Comparison of board category recommendation (MRR: mean reciprocal rank).


#### *5.6. Board Recommendation*

Every board was divided into two halves based on the order of its pins. The two halves of a board are naturally similar: the owner of one half should be interested in the other and would further like, follow, or re-pin from it. Based on this observation, one half of the board was treated as the only correct recommendation result for the other, and we recorded its index in the similarity-ranked sequence. Because the top row on Huaban shows five pins on common screen resolutions, the top-5 MRR was also reported. Table 4 shows the experimental results.
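The evaluation protocol above can be sketched as a retrieval task: each board half is a query, all candidate halves form the gallery, and only the matching half counts as relevant. This is an illustrative reconstruction (the function name, the cutoff handling, and the toy vectors are assumptions, not the paper's code):

```python
import numpy as np

def board_half_mrr(query_vecs, gallery_vecs, truth, k=None):
    """Rank all candidate halves (gallery) for each query half by cosine
    similarity; truth[i] is the index of the only relevant half for query i.
    Returns MRR, counting a hit as 0 if its rank exceeds k (MRR@k)."""
    q = np.asarray(query_vecs, dtype=float)
    g = np.asarray(gallery_vecs, dtype=float)
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    sims = q @ g.T
    rr = []
    for i, t in enumerate(truth):
        order = np.argsort(-sims[i])                # best match first
        rank = int(np.where(order == t)[0][0]) + 1  # 1-based rank of true half
        rr.append(1.0 / rank if (k is None or rank <= k) else 0.0)
    return float(np.mean(rr))

# Two boards, each split into halves; gallery index i matches query i here.
queries = [[1.0, 0.0], [0.0, 1.0]]
gallery = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
print(board_half_mrr(queries, gallery, truth=[0, 1], k=5))  # → 1.0
```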


**Table 4.** Comparison of board recommendation.

From the table, we can draw two conclusions. First, for the same feature, FV encoding performed best. For example, each type of pin vector encoded with the FV, except for the text vector, performed better than the corresponding pin vectors aggregated with the mean vector. This better performance is due to the FV's use of higher-order statistics. Second, when different features were encoded with the same method, our representation demonstrated the best performance. The results also illustrate that multimodal joint representations model boards better than unimodal representations with lower dimensions.
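The advantage of higher-order statistics can be illustrated without a full Fisher Vector implementation (which encodes first- and second-order deviations from a GMM). The toy below is a deliberately simplified stand-in: it contrasts mean pooling with a mean-plus-standard-deviation encoding, showing that two boards whose pins merely average to the same point stay distinguishable once second-order statistics are kept. All names and data here are illustrative:

```python
import numpy as np

def mean_encode(pin_vecs):
    """Baseline: board vector = mean of its pin vectors (first-order only)."""
    return np.mean(np.asarray(pin_vecs, dtype=float), axis=0)

def moment_encode(pin_vecs):
    """Crude stand-in for FV-style encoding: also keep second-order
    (per-dimension standard deviation) statistics of the pin vectors."""
    X = np.asarray(pin_vecs, dtype=float)
    return np.concatenate([X.mean(axis=0), X.std(axis=0)])

# Two boards with identical mean pin vectors but different spreads:
tight = [[0.4, 0.6], [0.6, 0.4]]   # pins close together
wide  = [[0.0, 1.0], [1.0, 0.0]]   # pins far apart
print(np.allclose(mean_encode(tight), mean_encode(wide)))    # → True (indistinguishable)
print(np.allclose(moment_encode(tight), moment_encode(wide)))  # → False (distinguishable)
```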
