#### *3.2. Text Representation*

The text description is an important personalized supplement to the image representation. As with the image representation, we generate the text representation in order to discover the relationships between the descriptions of pins and their categories. Unlike its image, however, the descriptions of the same pin may differ from user to user, which makes it difficult to build a large, high-quality labelled text dataset on CCSNs.

Since the words used on CCSNs do not differ noticeably from those in everyday usage, we trained a word2vec [26] model on public corpora to encode words. This efficient shallow model was designed for learning word representations, and the learned word vectors capture a large number of syntactic and semantic word relationships. The training dictionary should include both the category words and the words of the text descriptions, so that relationships between the text representation and the categories can be captured. In addition, word vectors, which encode words in a compact vector space, scale better than one-hot representations, because the vocabulary of natural language is extremely large. Both the training speed and the quality of the vectors can be improved by several extensions, including hierarchical softmax, negative sampling, noise contrastive estimation, and subsampling of frequent words [36]. For details of the word2vec model, please refer to the original paper.
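As an illustration, the following minimal sketch shows how such a model could be trained with the gensim library (4.x API); the toy corpus and all hyperparameter values are assumptions for demonstration, not the settings used in this work.

```python
from gensim.models import Word2Vec

# Hypothetical tokenized corpus; in practice this would be the public
# corpora plus the category words and pin descriptions mentioned above.
sentences = [
    ["travel", "photography", "sunset", "beach"],
    ["diy", "home", "decor", "handmade"],
]

# Skip-gram with negative sampling and subsampling of frequent words;
# all hyperparameters here are illustrative placeholders.
model = Word2Vec(
    sentences,
    vector_size=300,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=1,      # keep every word in this toy corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # number of negative samples
    sample=1e-3,      # subsampling threshold for frequent words
    workers=4,
)

vector = model.wv["travel"]  # a 300-dimensional word vector
```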

Because texts vary in length, it is necessary to generate a vector of constant dimension from a set of word vectors to represent a complete text. Pooling methods, such as mean pooling [27], have proven feasible for this problem. For a text $T = \{Word\_1, Word\_2, \cdots, Word\_{M\_T}\}$, we compute the mean vector in Equation (4) as its text representation,

$$V\_T = \frac{1}{M\_T} \sum\_{i=1}^{M\_T} KeyedVector\_{Word\_i}, \tag{4}$$

where $KeyedVector\_{Word\_i}$ denotes the vector of the $i$-th word ($Word\_i$), and $M\_T$ is the length of the text.
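A minimal implementation of Equation (4) on top of the gensim sketch above could look as follows; skipping out-of-vocabulary words and falling back to a zero vector are assumed conventions, not details given in the text.

```python
import numpy as np

def text_representation(words, keyed_vectors):
    """Mean-pool word vectors (Equation 4); words missing from the
    vocabulary are skipped, which is one common convention."""
    vectors = [keyed_vectors[w] for w in words if w in keyed_vectors]
    if not vectors:  # no in-vocabulary word: fall back to a zero vector
        return np.zeros(keyed_vectors.vector_size)
    return np.mean(vectors, axis=0)

# Usage with the model above: v_t = text_representation(tokens, model.wv)
```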

#### *3.3. Multimodal Fusion*

Different modalities can provide both consistent and complementary information, but their distinct statistical properties make it difficult for a shallow architecture to combine them into a joint representation that preserves their specific characteristics. A multimodal DBM [32], which adds a shared hidden layer on top of modality-specific DBMs, can effectively model the joint distribution over the modalities.

As illustrated in Figure 5, a multimodal DBM is an undirected graphical model with full bipartite connections between adjacent layers. Each of its pathways is a DBM, constructed by stacking two restricted Boltzmann machines (RBMs) hierarchically. All layers, except the two bottom layers, use standard binary units. An RBM with hidden units $H = (h\_j) \in \{0, 1\}^F$ and visible units $V = (v\_i) \in \{0, 1\}^D$ defines the energy function as follows:

$$E\left(V, H; \theta\right) = -\sum\_{i=1}^{D} \sum\_{j=1}^{F} v\_i w\_{ij} h\_j - \sum\_{i=1}^{D} a\_i v\_i - \sum\_{j=1}^{F} b\_j h\_j, \tag{5}$$

where $\theta = \{(w\_{ij}) \in \mathbb{R}^{D \times F}, (a\_i) \in \mathbb{R}^{D}, (b\_j) \in \mathbb{R}^{F}\}$ denotes the model parameters, comprising the symmetric interaction term $w\_{ij}$ between hidden unit $j$ and visible unit $i$, the visible unit bias $a\_i$, and the hidden unit bias $b\_j$. RBMs can be regarded as a form of autoencoder, and one of their applications is dimensionality reduction by reducing $F$. Both bottom layers of our model are replaced with Gaussian–Bernoulli RBMs, which use Gaussian distributions to model real-valued inputs. The energy function of a Gaussian–Bernoulli RBM with visible variables $V = (v\_i) \in \mathbb{R}^{D}$ and hidden variables $H = (h\_j) \in \{0, 1\}^F$ is defined as

$$E\left(V, H; \theta\right) = \sum\_{i=1}^{D} \frac{\left(v\_{i} - a\_{i}\right)^{2}}{2\sigma\_{i}^{2}} - \sum\_{i=1}^{D} \sum\_{j=1}^{F} \frac{v\_{i}}{\sigma\_{i}} w\_{ij} h\_{j} - \sum\_{j=1}^{F} b\_{j} h\_{j}, \tag{6}$$

where $\sigma\_i$ denotes the standard deviation of the $i$-th visible unit and $\theta = \{(w\_{ij}) \in \mathbb{R}^{D \times F}, (a\_i) \in \mathbb{R}^{D}, (b\_j) \in \mathbb{R}^{F}, (\sigma\_i) \in \mathbb{R}^{D}\}$. During the unsupervised pretraining of the multimodal DBM, the modalities can be thought of as labels for each other. Each layer of the multimodal DBM makes a small contribution to eliminating modality-specific correlations. Therefore, in contrast to the modality-full input layers, the top layer can learn representations that are relatively modality-free. The joint distribution over the image and text inputs can be written as follows:
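For concreteness, a minimal NumPy sketch of the two energy functions (Equations 5 and 6) is given below; the parameter values are random placeholders, not learned weights.

```python
import numpy as np

def rbm_energy(v, h, W, a, b):
    """Binary RBM energy (Equation 5): v in {0,1}^D, h in {0,1}^F,
    W is the D x F interaction matrix, a and b are the bias vectors."""
    return -v @ W @ h - a @ v - b @ h

def gb_rbm_energy(v, h, W, a, b, sigma):
    """Gaussian-Bernoulli RBM energy (Equation 6): v is real-valued and
    sigma holds the per-unit standard deviations of the visible units."""
    return np.sum((v - a) ** 2 / (2 * sigma ** 2)) - (v / sigma) @ W @ h - b @ h

# Toy usage with random parameters (illustrative values only).
rng = np.random.default_rng(0)
D, F = 6, 4
W = rng.normal(size=(D, F))
a, b = rng.normal(size=D), rng.normal(size=F)
v_bin, h = rng.integers(0, 2, D), rng.integers(0, 2, F)
print(rbm_energy(v_bin, h, W, a, b))
print(gb_rbm_energy(rng.normal(size=D), h, W, a, b, np.ones(D)))
```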

$$P\left(V\_I, V\_T; \theta\right) = \sum\_{H\_{I2}, H\_{T2}, H\_3} P\left(H\_{I2}, H\_{T2}, H\_3\right) \left(\sum\_{H\_{I1}} P\left(V\_I, H\_{I1} \mid H\_{I2}\right)\right) \left(\sum\_{H\_{T1}} P\left(V\_T, H\_{T1} \mid H\_{T2}\right)\right), \tag{7}$$

where *θ* denotes all model parameters. The reader may refer to the original paper for more details of multimodal DBMs.

**Figure 5.** Architecture of a multimodal deep Boltzmann machine (DBM) for fusing image and text representations.

An advantage of multimodal DBMs is that they can deal with missing modalities. After training our multimodal DBM, even for pins that have no description, the activation probabilities of $H\_3$, which serve as our final multimodal joint representation of a pin, can be inferred from the corresponding conditional distributions with a standard Gibbs sampler. The missing text representation can be generated in a similar manner. Moreover, multimodal DBMs can be trained in a supervised manner by connecting an additional label layer on top.
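To illustrate this kind of inference, the sketch below runs a Gibbs sampler over the hidden layers of a purely hypothetical multimodal DBM while the image input is clamped and the text input is treated as missing; the layer sizes and random weights are placeholders, bias terms are omitted for brevity, and the Gaussian visible units are assumed standardized ($\sigma = 1$). Averaging the activation probabilities of $H\_3$ after a burn-in period approximates the joint representation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

# Placeholder layer sizes and randomly initialized weights; a real model
# would load the matrices learned during pretraining.
D_I, D_T, F1, F2, F3 = 128, 300, 64, 64, 32
W_I1 = rng.normal(scale=0.1, size=(D_I, F1))  # image input -> H_I1
W_I2 = rng.normal(scale=0.1, size=(F1, F2))   # H_I1 -> H_I2
W_T1 = rng.normal(scale=0.1, size=(D_T, F1))  # text input  -> H_T1
W_T2 = rng.normal(scale=0.1, size=(F1, F2))   # H_T1 -> H_T2
W_3I = rng.normal(scale=0.1, size=(F2, F3))   # H_I2 -> H_3
W_3T = rng.normal(scale=0.1, size=(F2, F3))   # H_T2 -> H_3

def joint_representation_given_image(v_i, n_steps=200, burn_in=100):
    """Estimate the activation probabilities of H_3 given only the image
    input, by Gibbs sampling over the hidden layers and the missing text."""
    h_i1, h_i2 = sample(np.full(F1, 0.5)), sample(np.full(F2, 0.5))
    h_t1, h_t2 = sample(np.full(F1, 0.5)), sample(np.full(F2, 0.5))
    h_3 = sample(np.full(F3, 0.5))
    v_t = rng.normal(size=D_T)  # missing text input, resampled each sweep
    h3_probs = np.zeros(F3)
    for step in range(n_steps):
        # Each hidden layer is conditioned on its two neighbouring layers.
        h_i1 = sample(sigmoid(v_i @ W_I1 + h_i2 @ W_I2.T))
        h_t1 = sample(sigmoid(v_t @ W_T1 + h_t2 @ W_T2.T))
        h_i2 = sample(sigmoid(h_i1 @ W_I2 + h_3 @ W_3I.T))
        h_t2 = sample(sigmoid(h_t1 @ W_T2 + h_3 @ W_3T.T))
        p_h3 = sigmoid(h_i2 @ W_3I + h_t2 @ W_3T)
        h_3 = sample(p_h3)
        v_t = h_t1 @ W_T1.T + rng.normal(size=D_T)  # Gaussian visible, sigma=1
        if step >= burn_in:
            h3_probs += p_h3
    return h3_probs / (n_steps - burn_in)

rep = joint_representation_given_image(rng.normal(size=D_I))
```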

#### **4. Implementation of Recommendations for Different Tasks**

Once the representations of pins have been obtained, we apply them to the recommender system. According to the practical applications on CCSNs, there are four recommendation tasks: recommending pins to users, recommending boards to users, recommending thumbnails to boards, and recommending board categories to boards. All of the recommendation methods are content-based.
