#### *4.1. Basic Definitions*

Consider the case of a discrete distribution $P$ specified by a set of support points $x_i$, with $i = 1, \ldots, m$, and their associated probabilities $w_i$, such that $\sum_{i=1}^{m} w_i = 1$, with $w_i \geq 0$ and $x_i \in M$ for $i = 1, \ldots, m$. Usually, $M = \mathbb{R}^d$ is the $d$-dimensional Euclidean space in which the $x_i$ are support vectors, but $M$ can also be a symbolic set equipped with a symbol-to-symbol similarity. Then $P$ can be written as in Equation (1):

$$P(\mathbf{x}) = \sum\_{i=1}^{m} w\_i \delta(\mathbf{x} - \mathbf{x}\_i) \tag{1}$$

where $\delta(\cdot)$ is the Kronecker delta.

The WST distance between two distributions $P^{(1)} = \left\{ \left( w_i^{(1)}, x_i^{(1)} \right) : i = 1, \ldots, m_1 \right\}$ and $P^{(2)} = \left\{ \left( w_j^{(2)}, x_j^{(2)} \right) : j = 1, \ldots, m_2 \right\}$ is obtained by solving the linear program in Equation (2):

$$\mathcal{W}\left(P^{(1)}, P^{(2)}\right) = \min\_{\gamma\_{ij} \in \mathbb{R}^+} \sum\_{i \in I\_1, \ j \in I\_2} \gamma\_{ij} \, d\left(x\_i^{(1)}, x\_j^{(2)}\right) \tag{2}$$

The cost of transport between $x_i^{(1)}$ and $x_j^{(2)}$, denoted by $d\left( x_i^{(1)}, x_j^{(2)} \right)$, is defined as the $p$-th power of the norm $\left\| x_i^{(1)} - x_j^{(2)} \right\|$, which is usually the Euclidean distance.

Two index sets can be defined as $I_1 = \{1, \ldots, m_1\}$ and, likewise, $I_2 = \{1, \ldots, m_2\}$, such that

$$\sum\_{i \in I\_1} \gamma\_{ij} = w\_j^{(2)}, \; \forall j \in I\_2 \tag{3}$$

$$\sum\_{j \in I\_2} \gamma\_{ij} = w\_i^{(1)} \; , \; \forall i \in I\_1 \tag{4}$$

Equations (3) and (4) represent the in-flow and out-flow constraints, respectively. The terms $\gamma_{ij}$ are called matching weights between support points $x_i^{(1)}$ and $x_j^{(2)}$, and together they form the optimal coupling for $P^{(1)}$ and $P^{(2)}$. The basic computation of OT between two discrete distributions involves solving a network flow problem whose cost typically scales cubically in the sizes of the measures. In the case of one-dimensional histograms, the Wasserstein distance can instead be computed by a simple sorting algorithm followed by the application of Equation (5):

$$\mathcal{W}\_p\left(P^{(1)},P^{(2)}\right) = \left(\frac{1}{n}\sum\_{i=1}^{n} \left|x\_{i}^{(1)\*} - x\_{i}^{(2)\*}\right|^{p}\right)^{\frac{1}{p}}\tag{5}$$

where $x_i^{(1)*}$ and $x_i^{(2)*}$ are the sorted samples. The discrete version of the WST distance is usually called the Earth Mover's Distance (EMD). For instance, when measuring the distance between grey-scale images, the histogram weights are given by the pixel values and the coordinates by the pixel positions.
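As a concrete illustration, the sketch below (ours, not the paper's code; the toy data and function names are assumptions) solves the linear program (2)-(4) directly with `scipy.optimize.linprog` and checks the result against the sorting shortcut of Equation (5) and SciPy's built-in one-dimensional distance:

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import wasserstein_distance

def emd_lp(x1, w1, x2, w2, p=1):
    """Solve the transport LP (2)-(4) for two discrete 1-D distributions.

    Equation (2) yields the p-th power of the distance when d is the
    p-th power of the norm, so the 1/p root is taken to match Equation (5).
    """
    m1, m2 = len(x1), len(x2)
    # cost d(x_i, x_j) = |x_i - x_j|^p, flattened row-major like gamma_ij
    C = np.abs(x1[:, None] - x2[None, :]) ** p
    # out-flow constraints (4): sum_j gamma_ij = w_i^(1)
    A_out = np.zeros((m1, m1 * m2))
    for i in range(m1):
        A_out[i, i * m2:(i + 1) * m2] = 1.0
    # in-flow constraints (3): sum_i gamma_ij = w_j^(2)
    A_in = np.zeros((m2, m1 * m2))
    for j in range(m2):
        A_in[j, j::m2] = 1.0
    res = linprog(C.ravel(),
                  A_eq=np.vstack([A_out, A_in]),
                  b_eq=np.concatenate([w1, w2]),
                  bounds=(0, None))
    return res.fun ** (1.0 / p)

# two toy empirical distributions with uniform weights
x1 = np.array([0.0, 1.0, 2.0]); w1 = np.ones(3) / 3
x2 = np.array([1.0, 2.0, 4.0]); w2 = np.ones(3) / 3

print(emd_lp(x1, w1, x2, w2, p=1))                 # LP solution: 4/3
print(np.mean(np.abs(np.sort(x1) - np.sort(x2))))  # Equation (5), p = 1
print(wasserstein_distance(x1, x2))                # SciPy's 1-D W1
```

All three computations agree here because, in one dimension with uniform weights, the optimal coupling is exactly the sorted matching used by Equation (5).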

Consider now the three univariate histograms in Figure 3, which represent three different stores. The support $x_i$ is the range of values of the KPI, and the weights $w_i$ are the numbers of users whose KPI scores fall into each interval. Table 2 shows the differences between the Wasserstein distance and the Manhattan and Euclidean distances.


**Table 2.** The differences between the Manhattan, Euclidean, and Wasserstein distances for the three histograms in Figure 3.

The Wasserstein distance agrees with the intuition that $S_h$ is closer to $S_j$ than $S_i$ is. The Manhattan distance, instead, does not discriminate, because it assigns the same value to the pairs ($S_h$, $S_j$) and ($S_i$, $S_j$). In [30], it was remarked that the information reflected in histograms lies more in the relative values of their coordinates than in their absolute values.
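A toy numerical check of this behavior (with bin placements of our own choosing, not the store data of Figure 3) makes the contrast explicit: histograms that differ only by a shift along the support are equidistant under the bin-wise Manhattan distance but not under the Wasserstein distance.

```python
import numpy as np
from scipy.stats import wasserstein_distance

bins = np.arange(10, dtype=float)
S_h = np.zeros(10); S_h[2] = 1.0   # all mass in bin 2
S_j = np.zeros(10); S_j[3] = 1.0   # all mass in bin 3
S_i = np.zeros(10); S_i[8] = 1.0   # all mass in bin 8

# the bin-wise Manhattan distance ignores where the mass sits:
print(np.abs(S_h - S_j).sum(), np.abs(S_i - S_j).sum())   # 2.0 2.0

# the Wasserstein distance uses the ground metric, so the shift matters:
print(wasserstein_distance(bins, bins, S_h, S_j))          # 1.0
print(wasserstein_distance(bins, bins, S_i, S_j))          # 5.0
```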

The computational cost of optimal transport can quickly become prohibitive. The method of entropic regularization [13] enables scalable computations, but large values of the regularization parameter can induce an undesirable smoothing effect, whereas low values not only reduce scalability but can also induce numerical instabilities.
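A minimal sketch of the Sinkhorn iterations behind entropic regularization, under assumptions of ours (a dense cost matrix and a fixed iteration count), shows where this trade-off comes from: the regularization parameter enters through the Gibbs kernel, which underflows for small values and blurs the transport plan for large ones.

```python
import numpy as np

def sinkhorn(w1, w2, C, reg=0.1, n_iter=500):
    """Entropy-regularized OT cost between weight vectors w1 and w2
    with ground-cost matrix C and regularization parameter reg."""
    K = np.exp(-C / reg)   # Gibbs kernel; underflows for very small reg
    u = np.ones_like(w1)
    for _ in range(n_iter):
        # alternately rescale to satisfy the two marginal constraints
        v = w2 / (K.T @ u)
        u = w1 / (K @ v)
    gamma = u[:, None] * K * v[None, :]   # regularized transport plan
    return np.sum(gamma * C)
```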

#### *4.2. Barycenter and Clustering*

Under the optimal transport metric, it is possible to compute the mean of a set of empirical probability measures. This mean is known as the Wasserstein barycenter and is the measure that minimizes the sum of its Wasserstein distances to each element in that set. Consider a set of $N$ discrete distributions $\mathbf{P} = \left\{ P^{(1)}, \ldots, P^{(N)} \right\}$, with $P^{(k)} = \left\{ \left( w_i^{(k)}, x_i^{(k)} \right) : i = 1, \ldots, m_k \right\}$ and $k = 1, \ldots, N$. The associated barycenter, denoted by $\overline{P} = \{(\overline{w}_1, \overline{x}_1), \ldots, (\overline{w}_m, \overline{x}_m)\}$, is computed as in Equation (6):

$$\overline{P} = \underset{P}{\text{argmin}} \frac{1}{N} \sum\_{k=1}^{N} \lambda\_k \mathcal{W}\left(P, P^{(k)}\right) \tag{6}$$

where the values $\lambda_k$ weigh the contribution of each distribution to the barycenter. Without loss of generality, they can be set to $\lambda_k = 1/N$ for all $k = 1, \ldots, N$.
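In general, solving (6) is itself an optimization problem. In the special case of one-dimensional empirical measures with uniform weights and equal sample counts, however, the Wasserstein-2 barycenter reduces to averaging quantile functions, i.e., sorted samples. The sketch below works only under those assumptions (which are ours, not stated in the paper):

```python
import numpy as np

def barycenter_1d(samples, lam=None):
    """W2 barycenter of N 1-D empirical measures, each given as a row of
    m uniform-weight samples; lam are the weights lambda_k of Equation (6)."""
    samples = np.asarray(samples)
    N = len(samples)
    lam = np.full(N, 1.0 / N) if lam is None else np.asarray(lam)
    q = np.sort(samples, axis=1)   # empirical quantile functions
    return lam @ q                 # weighted quantile average, shape (m,)
```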

The concept of the barycenter enables clustering among distributions in a space whose metric is the Wasserstein distance. More simply, the barycenter in a space of distributions is the analog of the centroid in a Euclidean space. The most common and well-known algorithm for clustering data in the Euclidean space is *k*-means. Since it is an iterative distance-based (also known as representative-based) algorithm, it is easy to propose variants of *k*-means by simply changing the distance adopted to create clusters, such as the Manhattan distance (leading to *k*-medians) or any kernel allowing for nonspherical clusters (i.e., kernel *k*-means). The crucial point is that only the distance is changed, and the overall iterative two-step algorithm is maintained. This also holds for Wasserstein *k*-means, where the Euclidean distance is replaced by the Wasserstein distance and centroids are replaced by barycenters.
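A minimal sketch of this two-step loop for one-dimensional empirical measures, reusing `barycenter_1d` from above, is given below. Using SciPy's $W_1$ distance for the assignment step is a simplifying assumption of ours; the general setting may use $W_2$ or multi-dimensional supports.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def wasserstein_kmeans(samples, k, n_iter=20, seed=0):
    """Two-step k-means over 1-D empirical measures (rows of `samples`)."""
    samples = np.asarray(samples)
    rng = np.random.default_rng(seed)
    centers = samples[rng.choice(len(samples), size=k, replace=False)]
    for _ in range(n_iter):
        # step 1: assign each distribution to its closest barycenter
        labels = np.array([
            np.argmin([wasserstein_distance(s, c) for c in centers])
            for s in samples])
        # step 2: recompute each cluster's barycenter
        centers = np.array([
            barycenter_1d(samples[labels == j]) if np.any(labels == j)
            else centers[j]
            for j in range(k)])
    return labels, centers
```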
