*3.1. Image Representation*

Image representation aims to learn image features that not only preserve intrinsic visual characteristics but also reflect user interests on CCSNs. CNNs have become the dominant approach in computer vision. The top layers of a CNN extract high-level image features that can be interpreted as color, material, scene, texture, object, and so on. Intermediate layers, especially the fully connected (FC) layers, are often used for image representation and for downstream applications. As supervised learning models, CNNs can capture the relationships between user interests and images if user interests are used as labels during training.

As a typical deep learning framework, a CNN requires a training set containing a large number of images with corresponding labels. Social networks are good sources for collecting images, but noisy labels remain a primary problem. On CCSNs, all pins are collected by users and their categories are assigned by users, so the categories can be treated as labels with a high level of confidence. Users create boards and then create or collect pins into those boards to exhibit their interests. When a user creates a board, they are asked to choose one of the predefined categories on the CCSN, and the chosen category applies to all pins on the board. That is to say, every pin has a user-selected category. Since the category reflects the theme of the board and of the pin, it describes coarse-grained user interests and can be used as the label of an image.

On CCSNs, different users may select different categories for the same image. For example, the pin in Figure 3a was re-pinned by 50 users. Because the image is a poster of the video game NBA 2K12, 26 users categorized the pin as 'sports', 16 users categorized it as 'entertainments', and the other eight categorized it as 'design'. On the basis of the statistical distribution of the predefined categories chosen by users, the category distribution of a pin can be computed as

$$\text{Interest}_I = \left( p_{C_i} = \frac{f_{C_i}}{\sum_{i=1}^{N_C} f_{C_i}} \right) \in [0,1]^{N_C}, \tag{1}$$

where $f_{C_i}$ denotes the frequency of the *i*-th category $C_i$, and $N_C$ is the total number of predefined categories. Because minority opinions are sometimes hard to interpret and spammers exist, in practice, before computing the category distribution, we set

$$f\_{C\_i} = 0 \quad \text{if} \quad f\_{C\_i} < \frac{\sum\_{i=1}^{M\_C} f\_{C\_i}}{M\_C} \, \text{} \tag{2}$$

where $M_C$ is the total number of chosen categories that appear in the re-pin tree of $I$; this removes spam and makes the distribution represent the majority opinion. Using the proposed annotation method, we acquired labels for the collected images without any additional human labor. Furthermore, compared to expensive human-labeled data, we believe the category distribution contributed by the collective user intelligence of re-pin trees is more suitable as the label of a pin. In contrast to existing image representation learning methods, which rely on high-quality label supervision, our category distribution of pins is acquired by mining the rich re-pin relationships in the inexhaustible content of CCSNs.
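To make the labeling procedure concrete, the following is a minimal sketch of Equations (1) and (2) in Python; the function name, the input format (a list of category indices, one per re-pin), and the example values are our own illustrative assumptions rather than part of the original pipeline.

```python
import numpy as np

def category_distribution(repin_categories, num_categories):
    """Compute the label vector Interest_I of a pin from its re-pin tree.

    repin_categories: category index chosen by each user who (re-)pinned I.
    num_categories:   N_C, the total number of predefined categories.
    """
    # Raw frequency f_{C_i} of each predefined category.
    freq = np.bincount(repin_categories, minlength=num_categories).astype(float)

    # Eq. (2): zero out infrequent choices (spam or hard-to-interpret
    # minority opinions). M_C counts the distinct chosen categories, so the
    # threshold is the mean frequency over the categories that were chosen.
    chosen = freq > 0
    threshold = freq[chosen].sum() / chosen.sum()
    freq[freq < threshold] = 0.0

    # Eq. (1): normalize the surviving frequencies into a distribution.
    return freq / freq.sum()

# Figure 3a example: 50 re-pins split 26/16/8 over three categories. The
# threshold 50/3 ~ 16.7 removes the two minority categories here, so the
# label collapses to the majority category 'sports'.
labels = category_distribution([0] * 26 + [1] * 16 + [2] * 8, num_categories=33)
```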

We then fine-tuned a pretrained CNN model to accelerate the training process. A deeper and wider architecture commonly performs better, but is usually more time- and space-consuming. Thus, AlexNet [34] was chosen as the basis; the core visual deep model could be replaced by any other state-of-the-art model, such as GoogLeNet or ResNet. AlexNet, with weights pretrained on ImageNet [35], is commonly used to classify independent objects, whereas we needed a multilabel regression model. Accordingly, the loss layer was changed from a softmax with logarithmic loss to a sigmoid with cross-entropy loss. We define the loss function as

$$\mathcal{E} = -\sum_{i=1}^{N_C} \left[ p_{C_i} \ln \hat{p}_{C_i} + \left( 1 - p_{C_i} \right) \ln \left( 1 - \hat{p}_{C_i} \right) \right], \tag{3}$$

where $p_{C_i}$ denotes the percentage in Equation (1), and $\hat{p}_{C_i}$ is the corresponding sigmoid output. After fine-tuning the CNN, its weights are stored for feature extraction. The image representations are then the activation values of the FC layer.
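As an illustration of this setup, below is a minimal PyTorch sketch, assuming torchvision's ImageNet-pretrained AlexNet; the number of categories, the optimizer settings, and the choice of which FC layer to read features from are our own assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CATEGORIES = 33  # N_C; assumed value, set to the CCSN's category count

# Start from AlexNet pretrained on ImageNet and replace the last FC layer so
# that it emits one logit per predefined category instead of 1000 classes.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_CATEGORIES)

# Sigmoid + cross-entropy loss of Eq. (3); BCEWithLogitsLoss fuses the
# sigmoid and the cross-entropy terms for numerical stability, and
# reduction='sum' matches the summation over the N_C categories.
criterion = nn.BCEWithLogitsLoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def fine_tune_step(images, label_distributions):
    """One fine-tuning step; labels are the Interest_I vectors of Eq. (1)."""
    optimizer.zero_grad()
    logits = model(images)  # sigmoid(logits) gives the predictions \hat{p}_{C_i}
    loss = criterion(logits, label_distributions)
    loss.backward()
    optimizer.step()
    return loss.item()

# After fine-tuning, a pin's image representation is the activation of an FC
# layer. Here we read the second FC layer (4096-d, after its ReLU); the text
# does not pin down which FC layer is used, so this is one plausible choice.
model.eval()  # disable dropout for deterministic feature extraction
feature_extractor = nn.Sequential(
    model.features, model.avgpool, nn.Flatten(),
    *list(model.classifier.children())[:6],  # stop before the final Linear
)
with torch.no_grad():
    features = feature_extractor(torch.randn(1, 3, 224, 224))  # shape (1, 4096)
```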
