#### *5.3. Clustering Analysis*

The barycenters enable the clustering of a set of distributions using the Wasserstein distance in a framework similar to *k*-means, as explained in Section 3.2. Two different clustering approaches are considered and compared with the standard *k*-means.

Consider a network of *n* stores and a set of *k* KPIs. Each store can be represented either as *k* univariate histograms (one for each KPI) or as one *k*-dimensional histogram. Clustering is performed to divide the stores into two groups. In the first case, each clustering iteration requires the computation of 2*k* univariate barycenters and 2*kn* Wasserstein distances between univariate histograms. In the second case, each clustering iteration requires the computation of two *k*-dimensional barycenters and 2*n* Wasserstein distances between *k*-dimensional histograms. The first approach considers only the marginals of the joint distribution of the KPIs: it loses the correlations between them, which makes the algorithm more efficient but less effective. The second approach, by contrast, can quickly become too computationally expensive as the number of KPIs *k* grows.
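The first approach (clustering on the marginals) can be sketched as follows. This is a minimal illustration under assumptions that are not stated in the paper: raw observations are available for each store and KPI, the 2-Wasserstein distance is used, the dissimilarity between a store and a cluster is the sum over KPIs of the squared univariate distances, and the initial centers are chosen by a deterministic farthest-point rule. The function names (`quantile_profiles`, `sq_w2_marginals`, `wasserstein_kmeans_marginals`) are hypothetical.

```python
import numpy as np

def quantile_profiles(samples, n_quantiles=50):
    """samples: (n_stores, k_kpis, n_obs) -> quantile curves (n_stores, k_kpis, n_quantiles)."""
    levels = (np.arange(n_quantiles) + 0.5) / n_quantiles
    # np.quantile puts the quantile axis first; move it to the end.
    return np.quantile(samples, levels, axis=-1).transpose(1, 2, 0)

def sq_w2_marginals(profiles, centers):
    """Sum over KPIs of squared univariate 2-Wasserstein distances.

    In one dimension the squared distance is the mean squared difference
    between quantile curves. Returns an array (n_stores, n_clusters).
    """
    diff = profiles[:, None, :, :] - centers[None, :, :, :]
    return (diff ** 2).mean(axis=-1).sum(axis=-1)

def wasserstein_kmeans_marginals(samples, n_clusters=2, n_quantiles=50, n_iter=50):
    profiles = quantile_profiles(samples, n_quantiles)
    # Assumed initialisation: first store, then the store farthest from the chosen centers.
    centers = [profiles[0]]
    for _ in range(1, n_clusters):
        d = sq_w2_marginals(profiles, np.stack(centers)).min(axis=1)
        centers.append(profiles[np.argmax(d)])
    centers = np.stack(centers)
    labels = None
    for _ in range(n_iter):
        new_labels = sq_w2_marginals(profiles, centers).argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for c in range(n_clusters):
            members = profiles[labels == c]
            if len(members):
                # Univariate barycenter of each KPI: average of the quantile curves.
                centers[c] = members.mean(axis=0)
    return labels, centers

# Synthetic example: 10 stores, 3 KPIs, two groups with shifted KPI distributions.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(5, 3, 200))
group_b = rng.normal(5.0, 1.0, size=(5, 3, 200))
labels, centers = wasserstein_kmeans_marginals(np.concatenate([group_a, group_b]))
```

With two clusters, `centers` holds 2*k* univariate barycenters (here 2 × 3 quantile curves), and each iteration evaluates one univariate distance per store, cluster and KPI, which matches the operation count given above for the first approach.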

These two approaches are compared with the *k*-means algorithm performed on the mean of KPIs. Each store is represented as a *k*-dimensional vector, where each component contains the mean of a KPI. To enable the visualization of the clustering, the results of the three algorithms are mapped on the network representation of the stores, as shown in Figure 9.

**Figure 9.** Clusters resulting from the three different approaches: *k*-means (**left**), clustering of the marginals (**center**) and clustering of the multi-dimensional histograms (**right**).

The clusters obtained with the standard *k*-means approach and with the approach that considers only the marginals are visually similar, while the approach that considers the whole distribution of the KPIs yields different groups. Using the multi-dimensional histogram representation of the stores therefore captures the entire distribution of the KPIs, including their correlations, and thus provides different insights.

#### *5.4. Nonlinear Structures in Data*

A key assumption in this paper is that large datasets can exhibit a nonlinear structure, which is not easily captured by a Euclidean space. A key conjecture of this paper is that the WST space of histograms is a nonlinear manifold. As a consequence, one can expect that embedding the problem in a Wasserstein space and using barycenters provides a better synthesis of the dataset than the Euclidean mean.

To test this conjecture, the difference between the Euclidean mean and the barycenter is analyzed. First, a single KPI is considered, and the Euclidean mean and the barycenter of the histograms associated with the 50 stores are computed. The same process is also repeated in the cases of two, three and four KPIs to consider multi-dimensional histograms.

The computational results support the initial hypothesis. Figure 10 shows the Wasserstein distances between the Euclidean means of the histograms and the barycenters. This distance monotonically increases with the dimension of the support space of the histograms.

**Figure 10.** Distance between the Euclidean mean and the Wasserstein barycenter as the histograms' dimensionality increases. The dimensionality of the histograms is on the x-axis, and the Wasserstein distances between the Euclidean mean of the histograms and their Wasserstein barycenters are on the y-axis.
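The difference between the two summaries can be illustrated in the univariate case with the sketch below. It is not the authors' experiment: it assumes raw samples of equal size for each store, uses the 2-Wasserstein distance, and covers a single KPI only. Under these assumptions, the Euclidean mean of the histograms corresponds to the pooled sample (an equally weighted mixture), while the barycenter is obtained by averaging the quantile curves. The function name `mean_vs_barycenter_gap` is hypothetical.

```python
import numpy as np

def mean_vs_barycenter_gap(samples, n_levels=200):
    """samples: (n_stores, n_obs) observations of one KPI, same n_obs per store.

    Returns the 2-Wasserstein distance between the Euclidean mean of the
    store distributions and their Wasserstein barycenter.
    """
    levels = (np.arange(n_levels) + 0.5) / n_levels
    # Euclidean mean of the histograms = equally weighted mixture = pooled sample.
    q_mean = np.quantile(samples.ravel(), levels)
    # Univariate barycenter: average of the stores' quantile curves.
    q_bary = np.quantile(samples, levels, axis=1).mean(axis=1)
    # In one dimension, W2 is the L2 distance between quantile curves.
    return float(np.sqrt(np.mean((q_mean - q_bary) ** 2)))

rng = np.random.default_rng(0)
# Two stores with the same KPI distribution: the two summaries nearly coincide.
similar = rng.normal(0.0, 1.0, size=(2, 2000))
# Two stores with shifted KPI distributions: the mean is bimodal, the barycenter is not.
shifted = np.stack([rng.normal(-2.0, 1.0, 2000), rng.normal(2.0, 1.0, 2000)])

gap_similar = mean_vs_barycenter_gap(similar)
gap_shifted = mean_vs_barycenter_gap(shifted)
print(f"similar stores: {gap_similar:.3f}, shifted stores: {gap_shifted:.3f}")
```

When the stores differ mainly by a shift, the bin-wise average keeps both modes, whereas the barycenter is a single distribution located between them, so the gap grows with the heterogeneity of the stores.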

#### **6. Conclusions, Limitations and Perspectives**

The analytics proposed in this paper, based on the Wasserstein distance and barycenters, enables one to capture the quality of the customer experiences and to provide performance measures for the entire network of stores. It is the authors' opinion that the growing diversity and heterogeneity of customers makes a distributional approach more effective for analyzing samples of customer behavior than relying only on parameters such as average and variance. The Wasserstein distance (also known as the optimal transport distance) is shown to uncover nonlinear dependencies in the dataset without requiring the alignment of the distributions' support. This is demonstrated by the growing gap between the Euclidean average and the barycenter as the dimensionality of the support increases. The histograms can also be clustered in the Wasserstein space.

These features are demonstrated in a challenging business problem: the performance evaluation of the Italian store network (50 stores) of a multinational retailer. Assessing the relative performance of each store with respect to the others is a critical decision for a company as a basis for the distribution of a performance-related bonus. The results enable the company to move towards a different evaluation platform. The analytics proposed in this paper, based on the Wasserstein distance and barycenters, is suitable for obtaining a credible ranking system for the stores.

In terms of limitations, it is fair to remark that, although univariate distributions can be easily handled using the quantile-based closed formula, computational problems may hinder the application of the WST distance to large-scale multivariate problems. This problem is amplified in the computation of the barycenter and in the clustering of histograms in the Wasserstein space.
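For reference, the standard quantile-based expressions for univariate distributions are recalled here. The notation ($F^{-1}$ for the quantile function, $p$ for the order of the distance, $\lambda_i$ for the barycenter weights) is assumed for this illustration and may differ from the notation used in the earlier sections.

$$
W_p(\mu, \nu) = \left( \int_0^1 \left| F_\mu^{-1}(t) - F_\nu^{-1}(t) \right|^p \, dt \right)^{1/p}, \qquad p \ge 1,
$$

and, for $p = 2$, the barycenter $\bar{\mu}$ of $\mu_1, \ldots, \mu_n$ has the quantile function

$$
F_{\bar{\mu}}^{-1}(t) = \sum_{i=1}^{n} \lambda_i \, F_{\mu_i}^{-1}(t), \qquad \lambda_i \ge 0, \quad \sum_{i=1}^{n} \lambda_i = 1.
$$

Both expressions only require evaluating quantile functions, which amounts to sorting in the empirical case. No comparable closed form is available for multivariate histograms, where an optimization problem over transport plans has to be solved instead.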

In terms of perspectives, it should be remarked that a byproduct of the computation of the WST distance between two stores is an optimal transport plan that indicates how much of the "probability mass" is to be moved between each pair of bins in the multivariate histograms representing the two stores. This result can be read as the impact of an improvement of each KPI on the overall score of a store.
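As an illustration of such a transport plan, the sketch below solves the discrete optimal transport problem between two small histograms as a linear program with `scipy.optimize.linprog`. The paper does not state which solver was used; the linear-programming formulation, the squared Euclidean cost between bin centers, the toy 2 × 2 grid of bins and the function name `transport_plan` are all assumptions made for this example.

```python
import numpy as np
from scipy.optimize import linprog

def transport_plan(mass_a, mass_b, bin_centers):
    """Optimal transport plan between two histograms defined on the same bins.

    mass_a, mass_b: (n_bins,) probability masses summing to 1.
    bin_centers:    (n_bins, d) coordinates of the bin centers.
    Returns (plan, cost): plan[i, j] is the mass moved from bin i to bin j,
    and cost is the squared 2-Wasserstein distance.
    """
    n = len(mass_a)
    diff = bin_centers[:, None, :] - bin_centers[None, :, :]
    cost = (diff ** 2).sum(axis=-1)                 # squared Euclidean ground cost
    row_sums = np.kron(np.eye(n), np.ones((1, n)))  # mass leaving each bin of A
    col_sums = np.kron(np.ones((1, n)), np.eye(n))  # mass arriving in each bin of B
    result = linprog(
        cost.ravel(),
        A_eq=np.vstack([row_sums, col_sums]),
        b_eq=np.concatenate([mass_a, mass_b]),
        bounds=(0, None),
        method="highs",
    )
    if not result.success:
        raise RuntimeError(result.message)
    return result.x.reshape(n, n), result.fun

# Toy example with two KPIs discretised on a 2 x 2 grid of bins.
bin_centers = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
store_a = np.array([0.5, 0.5, 0.0, 0.0])  # low values of the first KPI
store_b = np.array([0.0, 0.0, 0.5, 0.5])  # high values of the first KPI
plan, cost = transport_plan(store_a, store_b, bin_centers)
```

In this toy case the plan moves mass only along the first KPI (from bin 0 to bin 2 and from bin 1 to bin 3), which is the kind of per-KPI reading of the transport plan suggested above. The number of variables grows with the square of the number of bins, so this formulation is practical only for small histograms.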

**Author Contributions:** All authors contributed equally to this paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** The data are available upon request to Ilaria Giordani (giordani@oaks.cloud). The data are built on a real-world project and were randomized during the study.

**Acknowledgments:** The authors gratefully acknowledge the Data Science Lab, Department of Economics, Management and Statistics (DEMS), for supporting this work by providing computational resources.

**Conflicts of Interest:** The authors declare no conflict of interest.
