An Optimal-Transport-Based Multimodal Big Data Clustering

Yang, Zheng; Shi, Chongyang; Guan, Ying

doi:10.3390/electronics14040666


by Zheng Yang ¹,*, Chongyang Shi ² and Ying Guan

¹ College of Information, Shenyang Institute of Engineering, Shenyang 110136, China
² School of Computer Science, Beijing Institute of Technology, Beijing 100081, China
* Author to whom correspondence should be addressed.

Electronics 2025, 14(4), 666; https://doi.org/10.3390/electronics14040666

Submission received: 24 December 2024 / Revised: 28 January 2025 / Accepted: 29 January 2025 / Published: 8 February 2025

(This article belongs to the Special Issue Advances in Data-Driven Artificial Intelligence)


Abstract:

Multimodal clustering achieves outstanding performance in various applications by aggregating information from heterogeneous devices. However, previous methods rely on strong-notion distances to fuse crossmodal complementary knowledge, which rests on the fragile assumption that the heterogeneous manifolds of different modalities always share a non-negligible intersection. Because of this unstable theoretical basis, previous methods show limited performance on general multimodal data. To address this challenge, an optimal-transport-based multimodal clustering (OTMC) method is proposed, which defines clustering as the optimal transport (OT) from multimodal data distributions to clustering distributions and leverages a weak-topology measure to capture complementary knowledge with clear discriminative structures. OTMC consists of a modality-specific OT delivering private structures and a modality-common OT delivering shared structures, which transport category structures scattered in the manifolds of each modality and of all modalities, respectively, to common prototypes. Furthermore, variational solutions to OTMC are derived by matching the data-prototype joint distribution, which induces the multimodal OT clustering network used to capture discriminative structures. Finally, experimental results on four real-world datasets demonstrate the superiority of OTMC, which does not rely on intersections between heterogeneous manifolds. In particular, OTMC obtains 92.15% ACC, 84.96% NMI, and 83.35% ARI on Handwritten, improving by 2.25%, 2.82%, and 3.28%, respectively.

Keywords:

multimodal data; deep multimodal clustering; optimal transport; big data; heterogeneous manifold; variational solution

1. Introduction

With the rapid development and widespread applications of emerging computing technologies, data produced by human beings are experiencing explosive growth [1]. For instance, a smart power plant acquires real-time data through various devices, such as flow sensors, current transducers, and power meters, to monitor the working status of power equipment, in which an enormous amount of data with heterogeneous distributions are collected. In smart cities, smart car-detection technology utilizes multiple cameras and microphones to verify the categories of cars, amassing a great volume of information about cars. These extensive data collected from multiple devices are often referred to as multimodal data, in which the data of each modality contain consistent information that describes shared properties and complementary information about different properties among modalities [2,3,4]. The fusion of complementary and consistent information in multimodal data can further facilitate the understanding of patterns hidden in complex phenomena of the real world [5]. Thus, it is of great importance to design innovative computing paradigms to integrate information from multimodal data.

Multimodal clustering (MC), as a fundamental unsupervised task, aims to aggregate knowledge scattered in heterogeneous modalities to recognize intrinsic patterns of data [6,7,8]. Recently, some pioneering deep multimodal clustering methods have been proposed to leverage hierarchical nonlinear embedding features to explore patterns from fusion information, which can be roughly separated into two classes, i.e., deep fusion methods [9,10] and deep clustering methods [11,12]. The former utilizes self-supervision information of data to extract generalized fusion representations to characterize fusion information, and then performs vanilla single-modality clustering methods to mine patterns, whose two-stage learning strategy may disconnect fusion representation learning and clustering partition, causing suboptimal fusion representations for clustering partitioning. To close the gap between the two processes, deep clustering methods combine fusion representation learning and clustering partition into a joint optimization strategy. They utilize the divergences of data structures to guide the fusion of complementary information among modalities, as well as mining clustering structures in an end-to-end manner.

Existing MC methods have made significant progress in the pattern mining of multimodal data. However, most of them ignore the manifold topology of multimodal data and focus on strong-notion distance metrics, such as Kullback–Leibler (KL) divergence and Jensen–Shannon (JS) divergence, which assume strict overlaps between the heterogeneous data distributions of different modalities. This degrades pattern-mining performance on general multimodal data [13]. Specifically, general multimodal data collected from diverse sources obey heterogeneous distributions and are embedded in their corresponding manifolds, and the manifolds of different modalities have almost no non-negligible intersection [14]. In that case, the assumption above does not hold everywhere, so the differences between heterogeneous manifolds cannot be measured when integrating complementary information from multimodal data. In other words, strong-notion distance metrics provide no useful gradient information for training deep clustering models, which blurs discriminative structures when fusing representations of multimodal data. Thus, previous methods may not exhibit desirable performance in pattern mining in real-world applications.
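The gradient argument can be illustrated with a small numerical sketch that is not taken from the paper: two point masses on a hypothetical 1-D grid have disjoint supports, so the KL divergence is infinite and the JS divergence stays at log 2 regardless of how far apart they are, whereas the 1-Wasserstein distance grows with the shift. The grid, the shifts, and the helper names are illustrative assumptions.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q, grid):
    """1-Wasserstein distance on a sorted 1-D grid via the CDF formula."""
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))
    return float(np.sum(cdf_gap[:-1] * np.diff(grid)))

grid = np.arange(10, dtype=float)   # hypothetical support points 0, 1, ..., 9
p = np.zeros(10)
p[0] = 1.0                          # point mass at 0

results = []
for shift in (1, 3, 6):
    q = np.zeros(10)
    q[shift] = 1.0                  # point mass at `shift`, disjoint from p
    results.append((shift, js_divergence(p, q), wasserstein_1d(p, q, grid)))

for shift, js, w1 in results:
    print(f"shift={shift}  JS={js:.4f}  W1={w1:.1f}")
```

The JS column is constant across shifts, so it gives no signal about which direction reduces the discrepancy, while the Wasserstein column changes with the shift.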

To address this challenge, an optimal-transport-based multimodal clustering method (OTMC) is proposed to conduct multimodal clustering structure mining based on the optimal-transport (OT) theory, which ensures clustering-friendly discriminative structures when fusing complementary modality information with heterogeneous distributions. Specifically, multimodal clustering is defined as a map from multimodal data measures to clustering prototype set measures in OTMC, which utilizes the Wasserstein distance to obtain clear discriminative structures from heterogeneous information with no intersection of data distributions in different modalities. Furthermore, OTMC is decomposed into a modality-specific OT (MsOT) and a modality-common OT (McOT) within a consistent constraint, to disentangle the transportation of private and shared information of each modality. In addition, a generative transport plan is designed to derive the variational solution to OT, which can effectively search the map with a minimal transport cost for mining intrinsic patterns. Finally, although designed for all data with heterogeneous distributions in different modalities, e.g., text and video, OTMC is conducted on four real-world benchmark image datasets in extensive experiments to validate its superiority on high-dimension data.

The main contributions of this paper are as follows:

An innovative weak-notion distance metric-based method is designed to measure differences between the manifold structures of data collected from diverse devices, which ensures the full fusion of complementary information from data with heterogeneous distributions.
Multimodal clustering is innovatively modeled using an optimal-transport-based multimodal clustering method (OTMC), which can capture fusion information with clear discriminative structures from heterogeneous modalities for mining intrinsic patterns.
A variational solution is derived to solve OT based on a generative transport plan, which can precisely match the transport map for transporting the multimodal data to clustering prototypes in OTMC.
Extensive experiments are conducted on four real-world benchmark datasets, which verify the superiority of OTMC over other multimodal clustering methods; this advantage is attributed to OTMC not relying on intersections between heterogeneous manifolds. In particular, OTMC obtains 92.15% ACC, 84.96% NMI, and 83.35% ARI on Handwritten, improving by 2.25%, 2.82%, and 3.28%, respectively.

In the rest of this paper, Section 2 gives the related deep fusion methods and deep clustering methods; Section 3 introduces the theoretical framework of OTMC; Section 4 derives the variational generative solution to OTMC and network implementations; Section 5 showcases the experimental results of OTMC in comparison with deep fusion methods and deep clustering methods; and Section 6 concludes this work and points out some possible future research directions for subsequent exploration.

2. Related Work

Clustering is a classical unsupervised learning task. Various effective clustering algorithms like k-means have been proposed and widely utilized as cornerstones of the solutions to problems such as pavement deterioration [15], facility location [16], and the energy management of wireless sensor networks [17]. OTMC learns intrinsic patterns of multimodal data via neural networks, and is closely related to deep fusion methods and deep clustering methods in multimodal clustering.

Deep fusion methods, as two-stage methods, focus on extracting generalized fusion representations of multimodal data, and then leveraging a single-modality clustering algorithm to mine clustering patterns. For example, DCCA [18,19] learned fusion representations of multimodal data via hierarchical nonlinear mappings along with crossmodal correlation maximizing. DMF [20] mined the consensus semantics of multimodal data by leveraging multi-layer transformations subject to non-negative constraints to explore informative spatial bases. AE2-Nets [9] modeled a degradation process from multimodal data to fusion representations within a nested autoencoder architecture to capture consistent and complementary information among modalities. CMIB [10] utilized the rate-distortion principle to explore robust fusion representations by preserving complementary and consistent information and discarding the superfluous information of modalities. MIFN-ML [21] conducted intermediate fusion of biometric and facial representations after manifold-learning-based dimension reduction techniques, which then made a clustering decision for stress detection via a nonlinear mapping. MCBVAD [22] introduced attentions from other views to suppress information from irrelevant views on the basis of maximizing information entropy and mutual information of views, which produced information-fruitful representations to improve k-means clustering results. Further, with the help of a Transformer-based attention learning block, VMC-CD [23] applied cross-view contrastive learning and maximum entropy of categories to obtain discriminative representations for k-means clustering. OSCAMC [24] jointly learned and optimized view-collective matrices, including a combined non-negative embedding matrix, a collective similarity matrix, a joint coefficient matrix, a unified spectral projection matrix, and so on, and then obtained the clustering results from those matrices.

Deep clustering methods integrate complementary information fusion with clustering pattern recognition in an end-to-end architecture, where divergences of data structures are used to supervise model training. MVaDE [25] modeled the generation process from hidden categories to visible data, where the Gaussian distribution was used to constrain the information fusion of each modality. CoMVC [26] utilized crossmodal contrastive learning followed by adaptive linear combination to fuse modality information, and then conducted Cauchy–Schwarz divergence in the fusion representation space for clustering. SDMVC [27] utilized KL-divergence to mine clustering patterns from concatenated complementary information, which uses consistent semantics to constrain the alignments of representation distributions in each modality. CMSC [28] fused modality information on the basis of a consensus subspace which maximized the correlations of modalities, and then performed spectral clustering to mine clustering patterns. DMIM [29] fused complementary information by maximizing the dependence between modalities and mined multimodal data patterns with the help of an over-clustering strategy. DMAC-SI [5] combined invariant semantics representation learning and the Markov decision process of multimodal data clustering into an end-to-end clustering framework, ensuring the flexibility of clustering structure exploration. EMVFC [30] integrated low-dimensional representation learning and cooperative learning in clustering to improve performance, which reduced redundancy features and noise, enhancing the complementary information and correlation.

The deep fusion methods above extracted generalized fusion representations and then performed vanilla single-modality clustering methods to mine patterns, which may disconnect fusion representation learning and clustering partition, causing suboptimal fusion representations for clustering partition. The deep clustering methods above combined fusion representation learning and clustering partition into a joint optimization strategy for mining clustering structures in an end-to-end manner. However, they still relied on strong-notion distances to fuse crossmodal complementary knowledge, which limited their clustering performance. In contrast, optimal transport can define multimodal clustering as a map from multimodal data measures to clustering prototype set measures, which avoids representation-partition disconnections and dependency on the intersection of data distributions in different modalities.

Comparing optimal transport with clustering reveals one difference: optimal transport exchanges the form of the operation for the lowest cost. That is, among the mappings that send data points to cluster centers, optimal transport chooses the one with the lowest cost and does not consider how the mapping works. In this sense, optimal transport is more flexible and more difficult than clustering, so a solution to optimal transport can be used for clustering. In multimodal clustering, dividing the data into modality-specific and modality-common components is a standard approach. OTMC distinguishes itself by incorporating optimal transport into this framework: it partitions the transport itself into modality-common and modality-specific components. This allows OTMC to better capture the intrinsic relationships both within and across modalities, offering a more nuanced representation of the data. In addition, opinions differ on how to evaluate the quality of a clustering algorithm, so many metrics have been proposed. If the objective of clustering is simply to minimize the sum of within-cluster distances, as in k-means, no such ambiguity arises. In this context, our method is a step toward a new direction for clustering, i.e., evaluating clustering quality based on transport cost.

3. OT for Multimodal Clustering

In this section, multimodal clustering is redefined within the optimal-transport architecture, to mine patterns of multimodal data by adaptively learning an optimal transport map between multimodal data and clustering prototypes. Then, rooted in private and common information hidden in multimodal data, OTMC is decomposed into modality-specific OT and modality-common OT, as shown in Figure 1. In addition, the most frequently used notations are listed in Table 1.

3.1. Modeling Multimodal Clustering Based on OT

Optimal transport utilizes a weak topology measure to estimate the similarity between two manifolds defined over a space by computing the cost of moving one point to another within the space. Thus, optimal-transport-based multimodal clustering (OTMC) is proposed to aggregate complementary and consistent information in modalities with heterogeneous manifolds, which provides representations with clear discriminative structures without the help of intersections between data manifolds of different modalities. Specifically, let $X = \{X^{m}\}_{m=1}^{M}$ denote the measurable space of a multimodal instance with measure $\mu$, where $X^{m}$ is the $m$-th modality space in $M$ modalities, and let $C = \{c_{j}\}_{j=1}^{k}$ denote the prototype set with cluster number $k$ and prototype measure $\nu$, in which $c_{j}$ denotes the $j$-th clustering prototype. Multimodal clustering $T$ groups data into corresponding clustering prototypes with minimal pre-defined criteria as follows:

$$T \Leftrightarrow \underset{W}{\arg\min} \sum_{j=1}^{k} \int_{W_{j}} d(x, c_{j}) \, d\mu(x), \tag{1}$$

where $W = \{(W_{j}, c_{j})\}_{j=1}^{k}$ stands for a clustering scheme and $W_{j}$ stands for the data partition belonging to the prototype $c_{j}$.

Multimodal clustering groups data into corresponding clusters by minimizing discrepancies between data and prototypes, which is natural and consistent with the optimal transport that finds an optimal-transport map from data to the corresponding clusters, as follows:

$$T \Leftrightarrow \underset{W}{\arg\min} \sum_{j=1}^{k} \int_{W_{j}} d(x, c_{j}) \, d\mu(x) \Leftrightarrow \underset{T_{\#}\mu = \nu}{\arg\min} \int_{X} d(x, T(x)) \, d\mu(x), \tag{2}$$

where $d(\cdot, \cdot)$ measures the transport map cost between data and prototypes, and $\nu = \sum_{j=1}^{k} \nu_{j} \, \delta(c - c_{j})$ stands for the prototype measure based on the Dirac function $\delta(\cdot)$, in which $\nu_{j}$ stands for the $j$-th prototype mass. Meanwhile, $T_{\#}\mu = \nu$ indicates that the overall data mass transported to the prototype $c_{j}$ is the same as the $c_{j}$ mass, that is, $\mu(T^{-1}(c_{j})) = \nu_{j}$. The cluster partition $W$ is equivalent to the map $T$ when the prototype measure $\nu$ is fixed. That is, the map $T$ projects instances of cluster $W_{j}$ to the prototype $c_{j}$, i.e., $W_{j} = \{x_{i} \mid T(x_{i}) = c_{j},\ x_{i} \in X\}$.
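The correspondence between a partition $W$ and a map $T$ in Equations (1) and (2) can be sketched on an empirical measure. The snippet below is a minimal illustration under stated assumptions, not the OTMC solver: the data are random toy points, the prototypes are fixed by hand, the ground cost is taken to be the squared Euclidean distance, and $\nu$ is defined as the push-forward of $\mu$ under the nearest-prototype map, so that $T_{\#}\mu = \nu$ holds by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                          # toy samples (assumption)
mu = np.full(len(X), 1.0 / len(X))                     # uniform empirical measure
C = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.5]])    # k = 3 hypothetical prototypes

# Ground cost d(x, c_j): squared Euclidean distance (an assumption).
cost = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)

# Map T: each sample is sent to its cheapest prototype (stored as an index).
T = cost.argmin(axis=1)

# Total cost of Equation (1) for the partition W_j = {x_i | T(x_i) = c_j}.
total_cost = float(np.sum(mu * cost[np.arange(len(X)), T]))

# Push-forward masses nu_j = mu(T^{-1}(c_j)).
nu = np.bincount(T, weights=mu, minlength=len(C))

print("total cost:", total_cost)
print("prototype masses:", nu)
```

When $\nu$ is prescribed in advance rather than induced by $T$, the nearest-prototype map generally violates the mass constraint, which is why a constrained transport problem has to be solved instead.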

3.2. Decomposition of OTMC

In multimodal data, each modality of the same instance utilizes heterogeneous structures to deliver the information from different aspects. In other words, the data of each modality contain both private and common information. Thus, the optimal-transport-based multimodal clustering method is decomposed into the modality-specific OT (MsOT) that captures the private information, and the modality-common OT (McOT) that catches the common information.

Modality-specific OT $T_{s}^{m}$ $(m = 1, \dots, M)$ aims to capture the $m$-th modality private information that contributes to recognizing inherent patterns in each modality. Specifically, $T_{s}^{m}$ represents the map from the data of the $m$-th modality space $X^{m}$ to the set of $k$ prototypes $C_{s}^{m}$ under the measure-preserving condition $(T_{s}^{m})_{\#}\mu_{s}^{m} = \nu_{s}^{m}$, which pushes forward $\mu_{s}^{m}$ to $\nu_{s}^{m}$, i.e., $T_{s}^{m} : X^{m} \to C_{s}^{m}$. Here, $\mu_{s}^{m}$ and $\nu_{s}^{m}$ denote the measures of the space $X^{m}$ and the set $C_{s}^{m}$, respectively. $T_{s}^{m}$ can be obtained by minimizing the following total cost based on Equation (2):

$$\mathrm{MsOT} = \underset{(T_{s}^{m})_{\#}\mu_{s}^{m} = \nu_{s}^{m}}{\arg\min} \int_{X^{m}} d_{s}^{m}(x, T_{s}^{m}(x)) \, d\mu_{s}^{m}(x), \quad m = 1, \dots, M, \tag{3}$$

where $d_{s}^{m} : X^{m} \times C_{s}^{m} \to \mathbb{R}^{+}$ is the transport map cost between data and prototypes in the $m$-th modality. The modality-specific OT explores manifold differences between modality-specific data and clustering prototypes by utilizing a measure with weak topology, to learn an optimal transport map from a local perspective, which effectively captures the private information for mining patterns in each modality.

Modality-common OT $T_{c}$ catches common information that reveals the consistent semantics of instances among modalities. To be specific, $T_{c}$ maps data in multimodal space $X$ to the modality-common prototype set $C_{c}$ with measure $\nu_{c}$, which means the map $T_{c}$ pushes forward $\mu$ to $\nu_{c}$. $T_{c}$ can be obtained by minimizing the cost based on Equation (2):

$$\mathrm{McOT} = \underset{(T_{c})_{\#}\mu = \nu_{c}}{\arg\min} \int_{X} d_{c}(x, T_{c}(x)) \, d\mu(x), \tag{4}$$

where $d_{c} : X \times C_{c} \to \mathbb{R}^{+}$ is the transport map cost of modality-common OT $T_{c}$. Modality-common OT explores manifold differences between multimodal data and modality-common clustering prototypes by utilizing a measure with weak topology to learn an optimal transport map from a global perspective, which effectively captures the common semantic information for mining patterns in multimodal data.

3.3. The Consistent Constraint

Since both modality-specific OT and modality-common OT are used to mine the intrinsic clustering patterns of multimodal data, their transport plans $T_{s}^{m}$ and $T_{c}$ should be consistent. To this end, a consistent constraint $L_{\mathrm{con}}$ between $\nu_{s}^{m}$ and $\nu_{c}$ is introduced to achieve the above objective by keeping the clustering prototypes of modality-specific OT and modality-common OT subject to the same measure:

$$L_{\mathrm{con}} = \min \sum_{m=1}^{M} D(\nu_{s}^{m}, \nu_{c}), \tag{5}$$

where $D(\cdot, \cdot)$ denotes a metric function that evaluates differences between the measures $\nu_{s}^{m}$ and $\nu_{c}$. Finally, modality-common OT $T_{c}$ is used for the cluster assignment result.
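As a rough illustration of Equation (5), the sum of discrepancies can be evaluated for discrete prototype masses. Equation (5) leaves $D$ generic; the sketch below assumes the KL divergence, which is the choice the paper makes for the categorical priors in Equation (12), and the mass vectors are made-up values for two hypothetical modalities.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete measures; eps guards against log(0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

nu_c = np.array([0.30, 0.30, 0.40])        # modality-common masses (hypothetical)
nu_s = [
    np.array([0.25, 0.35, 0.40]),          # modality 1: close to nu_c
    np.array([0.50, 0.20, 0.30]),          # modality 2: further from nu_c
]

per_modality = [kl_divergence(nu_m, nu_c) for nu_m in nu_s]
l_con = sum(per_modality)                  # the quantity minimized in Eq. (5)

print("per-modality discrepancy:", per_modality)
print("L_con:", l_con)
```

The second modality contributes the larger penalty, so minimizing the sum pulls its prototype masses toward the common ones.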

4. The Variational Generative Solution and Network Implementation

In this section, a variational generative solution to OT is derived, and then, it is generalized to multimodal scenarios for optimizing data partitioning within an optimal-transport clustering network.

4.1. Variational Generative Solution

The variational generative solution learns the generative relationship between data and prototypes by constructing the joint distribution $p(x, c)$, which provides a theoretically accurate analytical solution to OT. Specifically, $p(x, c) = p(c \mid x) \, p(x)$, where $p(x)$ stands for the data distribution and $p(c \mid x)$ denotes the cluster conditional distribution. Since the OT that transports the multimodal data to the clustering prototypes is actually finding the optimal prototype $c_{j}$ for each data point $x_{i}$, which is fitting the cluster conditional distribution $p(c \mid x)$, the joint distribution has the same solution as OT.

To fit the joint distribution over the data measure $\mu$ and the clustering prototype measure $\nu$, the solution is to find an optimal coupling measure $\pi^{*}$ between $\mu$ and $\nu$ such that $\pi(\cdot, c) = \mu$ and $\pi(x, \cdot) = \nu$ are satisfied for any corresponding marginal measures. The optimal coupling $\pi^{*}$ is obtained by finding the infimum of the coupling cost in the set of coupling measures $\Pi(\mu, \nu)$ as follows:

$$\pi^{*} = \underset{\pi \in \Pi(\mu, \nu)}{\arg\inf} \int_{X \times C} d(x, c) \, d\pi(x, c), \tag{6}$$

where $d(x, c)$ stands for the coupling cost between $x$ and $c$. The optimal coupling solution $\pi^{*}$ is obtained by reaching the infimum of the total coupling cost over the product space $\Pi(X \times C)$.
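For intuition about Equation (6), a coupling between a discrete data measure and a prototype measure can be approximated numerically. The sketch below uses Sinkhorn iterations, i.e., an entropic-regularized approximation of the optimal coupling; this is a swapped-in standard technique for illustration and not the variational generative solution derived in this paper. The data, prototypes, masses, and regularization strength `eps` are assumptions.

```python
import numpy as np

def sinkhorn(mu, nu, cost, eps=0.2, n_iter=1000):
    """Entropic-regularized approximation of the optimal coupling pi*."""
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(n_iter):
        u = mu / (K @ v)                 # rescale rows toward marginal mu
        v = nu / (K.T @ u)               # rescale columns toward marginal nu
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                           # toy data (assumption)
C = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])    # hypothetical prototypes
mu = np.full(50, 1.0 / 50)                             # data measure
nu = np.array([0.3, 0.3, 0.4])                         # prototype measure

cost = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
cost = cost / cost.max()                 # normalize so exp(-cost / eps) is well scaled

pi = sinkhorn(mu, nu, cost)
transport_cost = float((pi * cost).sum())

print("row-marginal error:", np.abs(pi.sum(axis=1) - mu).max())
print("column marginals:", pi.sum(axis=0))
print("transport cost:", transport_cost)
```

Smaller values of `eps` bring the result closer to the unregularized coupling at the price of slower convergence, so the returned plan should be read as an approximation of $\pi^{*}$.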

Since $p(x, c) = p(x \mid c) \, p(c)$, the fitting of $p(x, c)$ can be transformed into the learning of the categorical prior distribution $p(c)$ and the generative conditional distribution $p(x \mid c)$. The former models prior knowledge of the data’s categorical distribution, which can be represented by any distribution $\mathrm{Cat}$. The latter learns the generation mechanism of the data to better reflect the probabilistic properties of the data, which can be achieved through the generative transport plan $G$ from the clustering prototypes to the data. Then, the categorical distribution $\mathrm{Cat}$ and the generative transport plan $G$ can be updated by minimizing the transport cost between the generated data and the prototypes:

$$\underset{G, \mathrm{Cat}}{\arg\inf} \int_{c \sim \mathrm{Cat}} d(G(c), c) \, d\nu(c), \tag{7}$$

where $d(\cdot, \cdot)$ is the ground transport cost between $x$ and $c$. After fitting the categorical prior distribution $p(c)$ and the generative conditional distribution $p(x \mid c)$, $p(x, c)$ can be obtained via the Bayesian rule.
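The two factorizations of the joint distribution can be checked on a small discrete table. The numbers below are hypothetical and the data space is reduced to four discrete outcomes purely for illustration: a prior $p(c)$ and a generative conditional $p(x \mid c)$ give the joint, from which the cluster conditional $p(c \mid x)$ follows by the Bayes rule.

```python
import numpy as np

p_c = np.array([0.5, 0.3, 0.2])            # categorical prior p(c), hypothetical
# Generative conditional p(x | c) over 4 discrete outcomes; each row sums to 1.
p_x_given_c = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
])

joint = p_c[:, None] * p_x_given_c         # p(x, c) = p(x | c) p(c), indexed [c, x]
p_x = joint.sum(axis=0)                    # marginal p(x)
p_c_given_x = joint / p_x[None, :]         # Bayes rule: p(c | x)
assignment = p_c_given_x.argmax(axis=0)    # most probable prototype per outcome

print("p(x):", p_x)
print("assignment:", assignment)
```

Reading off the most probable prototype for each outcome is the discrete analogue of choosing the prototype $c_{j}$ for each data point $x_{i}$.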

4.2. The Variational Generative Solution to OTMC

Based on the variational generative solution to OT, the modality-specific OT and the modality-common OT decomposed from OTMC can easily be solved. Specifically, the solution to modality-specific OT can be achieved by solving the generative transport plan $G_{s}^{m}$ from the categorical prior distribution $\mathrm{Cat}_{s}^{m}$ to the data of the $m$-th modality space $X^{m}$, minimizing the following ground transport cost:

$$\mathrm{MsOT} = \underset{G_{s}^{m}, \mathrm{Cat}_{s}^{m}}{\arg\min} \int_{c \sim \mathrm{Cat}_{s}^{m}} d_{s}^{m}(G_{s}^{m}(c), c) \, d\nu_{s}^{m}(c), \quad m = 1, \dots, M, \tag{8}$$

where $G_{s}^{m} : \mathrm{Cat}_{s}^{m} \to X^{m}$ is the generative transport map of the $m$-th modality, and $G_{s}^{m}$ generates instances from the categorical distribution, which can retain the real underlying structure information under the supervision of multimodal data. The generative transport plan $G_{s}^{m}$ and the categorical distribution $\mathrm{Cat}_{s}^{m}$ can jointly learn the private data structure information and clustering information of the $m$-th modality.

In practice, the integral over the measure $\nu_{s}^{m}$ is approximated by Monte Carlo sampling, i.e., the integral of the measure $\nu_{s}^{m}$ over the categorical distribution $\mathrm{Cat}_{s}^{m}$ is equivalent to the expectation of all clustering prototypes over $\mathrm{Cat}_{s}^{m}$. Thus, Equation (8) can induce the following objective function of modality-specific OT $L_{s}^{m}$:

$$\underset{G_{s}^{m}, \mathrm{Cat}_{s}^{m}}{\arg\min} \ \mathbb{E}_{X^{m}} \mathbb{E}_{\mathrm{Cat}_{s}^{m}} \left[ \nu_{s}^{m}(c) \, d_{s}^{m}(G_{s}^{m}(c), c) \right], \tag{9}$$

where $m = 1, \dots, M$ indicates the $m$-th modality, and the objective function $L_{s}^{m}$ fully captures inherent information of the $m$-th modality by jointly learning structure differences between data and semantic information within data.
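The Monte Carlo approximation behind Equation (9) can be sketched for one modality. Everything concrete here is an assumption made for illustration: the prototypes and their masses are fixed by hand, the generative map is a toy affine function standing in for $G_{s}^{m}$, the ground cost is the squared Euclidean distance, and the outer expectation over $X^{m}$ is omitted because the toy integrand does not depend on the data.

```python
import numpy as np

rng = np.random.default_rng(0)
prototypes = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])   # c_j (hypothetical)
cat_probs = np.array([0.2, 0.3, 0.5])                          # Cat over prototypes
nu = np.array([0.2, 0.3, 0.5])                                 # masses nu(c_j)

A = np.array([[1.1, 0.0], [0.0, 0.9]])
b = np.array([0.1, -0.1])

def G(c):
    """Toy affine stand-in for the generative transport map."""
    return c @ A.T + b

def d(x, c):
    """Ground cost: squared Euclidean distance (an assumption)."""
    return ((x - c) ** 2).sum(axis=-1)

# Integrand nu(c) * d(G(c), c) evaluated at every prototype.
per_prototype = nu * d(G(prototypes), prototypes)

# Exact expectation over Cat versus its Monte Carlo estimate.
exact = float(np.sum(cat_probs * per_prototype))
idx = rng.choice(len(prototypes), size=100_000, p=cat_probs)
mc_estimate = float(per_prototype[idx].mean())

print("exact:", exact, "Monte Carlo:", mc_estimate)
```

In a trainable setting the same sampled average would be differentiated with respect to the parameters of the generator, which is what makes the sampling form convenient.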

Similarly, the optimization of modality-common OT is equivalent to solving the generative transport plan $G_{c}$ from the categorical distribution to the data distribution for all modalities:

$$\mathrm{McOT} = \underset{G_{c}, \mathrm{Cat}_{c}}{\arg\min} \int_{c \sim \mathrm{Cat}_{c}} d_{c}(G_{c}(c), c) \, d\nu_{c}(c), \tag{10}$$

where $G_{c} : \mathrm{Cat}_{c} \to X$ is the generative transport map for all modalities, which can generate data of all modalities from the common categorical distribution $\mathrm{Cat}_{c}$ to capture common semantics of data.

The joint optimization of the generative transport plan $G_{c}$ and the categorical distribution $\mathrm{Cat}_{c}$ can simultaneously capture the data structure information of all modalities and crossmodal clustering information. The integral over the measure $\nu_{c}$ is approximated by the expectation of all prototypes over $\mathrm{Cat}_{c}$ based on the Monte Carlo sampling, so the objective function of modality-common OT $L_{c}$ can be written as:

$$\underset{G_{c}, \mathrm{Cat}_{c}}{\arg\min} \ \mathbb{E}_{X} \mathbb{E}_{\mathrm{Cat}_{c}} \left[ \nu_{c}(c) \, d_{c}(G_{c}(c), c) \right]. \tag{11}$$

The objective function $L_{c}$ effectively preserves common information in multimodal data by jointly optimizing the structure differences of all modalities and crossmodal semantic information.

Thus, OTMC can be optimized via the following variational solution:

$$\begin{aligned} L ={} & \sum_{m=1}^{M} \underset{G_{s}^{m}, \mathrm{Cat}_{s}^{m}}{\arg\min} \ \mathbb{E}_{X^{m}} \mathbb{E}_{\mathrm{Cat}_{s}^{m}} \left[ \nu_{s}^{m}(c) \, d_{s}^{m}(G_{s}^{m}(c), c) \right] \\ & + \underset{G_{c}, \mathrm{Cat}_{c}}{\arg\min} \ \mathbb{E}_{X} \mathbb{E}_{\mathrm{Cat}_{c}} \left[ \nu_{c}(c) \, d_{c}(G_{c}(c), c) \right] \\ & + \sum_{m=1}^{M} D_{KL}(\mathrm{Cat}_{s}^{m}, \mathrm{Cat}_{c}), \end{aligned} \tag{12}$$

where $D_{KL}$ denotes the consistent constraint of categorical prior distributions $\mathrm{Cat}_{s}^{m}$ and $\mathrm{Cat}_{c}$, which enables MsOT and McOT to have the same transport target, so that the results of the two transport schemes tend to be the same. It is implemented by the KL-divergence penalty between $\mathrm{Cat}_{s}^{m}$ and $\mathrm{Cat}_{c}$.

4.3. Multimodal OT Clustering Network

Although generative adversarial networks (GANs) attract much attention from researchers, their adversarial training is unstable and computationally expensive. Hence, to implement OTMC, a multimodal OT clustering network is designed within a deep neural network architecture that contains a generative network and a clustering network, instead of a GAN. The generative network implements the generative transport plan $G$ together with the categorical distribution $\mathrm{Cat}(\varpi)$, and the clustering network produces final assignment results from multimodal data.

The clustering network aims to learn the modality-specific and modality-common soft-clustering assignments that are included in MsOT and McOT. Specifically, in modality-specific OT, the clustering network $E_{s}^{m}(\cdot; \theta_{s}^{m})$ encodes modality-specific data into corresponding latent features, and utilizes Student's $t$-distribution in the last layer of $E_{s}^{m}(\cdot; \theta_{s}^{m})$ to calculate the similarity between latent features $z_{i}^{m}$ and clustering prototypes $c_{j}^{m}$ to obtain the soft-clustering-assignment matrix $P^{m}$ as follows:

$$\begin{aligned} z_{i}^{m} &= E_{s}^{m}(x_{i}^{m}; \theta_{s}^{m}), \\ p_{ij}^{m} &= \frac{\left(1 + \| z_{i}^{m} - c_{j}^{m} \|^{2} / \alpha\right)^{-\frac{\alpha + 1}{2}}}{\sum_{j'} \left(1 + \| z_{i}^{m} - c_{j'}^{m} \|^{2} / \alpha\right)^{-\frac{\alpha + 1}{2}}}, \end{aligned} \tag{13}$$

where $x_{i}^{m} \in X^{m}$ is the $i$-th instance from the $m$-th modality; $c_{j}^{m} \in \mathrm{Cat}^{m}(\varpi^{m})$ is the $j$-th clustering prototype of the $m$-th modality; and $p_{ij}^{m}$ can be interpreted as the probability that the $i$-th instance is assigned to the $j$-th cluster. To guide the learning of networks, a target distribution $Q^{m}$ is introduced, and the KL-divergence between $Q^{m}$ and the soft-assignment matrix $P^{m}$ is used as the objective function:

$$L_{\mathrm{cluster}}^{m} = D_{KL}(Q^{m}, P^{m}), \tag{14}$$

where $q_{ij} = \frac{p_{ij}^{2} / \sum_{i} p_{ij}}{\sum_{j'} \left(p_{ij'}^{2} / \sum_{i} p_{ij'}\right)}$ is the target assignment probability of the $i$-th instance to the $j$-th clustering prototype. This objective function guides network learning by aligning the soft assignment to the target probability distribution.
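Equations (13) and (14) can be written out directly for one modality. The following is a minimal NumPy sketch under stated assumptions: the latent features and prototypes are random placeholders instead of encoder outputs, $\alpha = 1$ is an arbitrary choice, and the function names are hypothetical.

```python
import numpy as np

def soft_assign(Z, C, alpha=1.0):
    """Student's t soft assignment p_ij of Eq. (13) for latents Z and prototypes C."""
    sq_dist = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    kernel = (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)
    return kernel / kernel.sum(axis=1, keepdims=True)

def target_distribution(P):
    """Target q_ij: square p_ij, divide by the soft cluster size, renormalize rows."""
    weight = P ** 2 / P.sum(axis=0, keepdims=True)
    return weight / weight.sum(axis=1, keepdims=True)

def kl_divergence(Q, P):
    """D_KL(Q, P) of Eq. (14), summed over instances and clusters."""
    return float(np.sum(Q * np.log(Q / P)))

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 5))     # placeholder latent features z_i
C = rng.normal(size=(3, 5))       # placeholder prototypes c_j

P = soft_assign(Z, C)
Q = target_distribution(P)
loss = kl_divergence(Q, P)

print("clustering loss:", loss)
```

In training, `Q` is typically treated as a fixed target when differentiating the loss with respect to the encoder and the prototypes; whether the paper refreshes it every step is not stated in this section.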

For McOT, the objective of the clustering network $E_{c}(\cdot; \theta_{c})$ is $D_{KL}(Q, P)$, where $P$ and $Q$ are the soft-assignment matrix and target probability distribution over all modalities, respectively. According to the consistent constraint $L_{\mathrm{con}}$ in Equation (5), each $D(P^{m}, P)$ is also regarded as an objective component. Hence, the total objective of the clustering networks is

$$L_{\mathrm{cluster}} = \sum_{m=1}^{M} \left( D(Q^{m}, P^{m}) + D(P^{m}, P) \right) + D(Q, P), \tag{15}$$

where $D(\cdot, \cdot)$ denotes the KL-divergence between two assignment matrices. The first term guides the network to learn the clustering information within each modality, the second term is the consistency constraint of modality-specific and modality-common clustering information, and the last term guides the network to capture the clustering information over all modalities. Finally, the soft-assignment matrix $P$ is used as the cluster assignment matrix.
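The three groups of terms in Equation (15) can be assembled as follows. This is a sketch only: the soft-assignment matrices are random row-stochastic placeholders for two hypothetical modalities, and the target distributions are computed with the same sharpening rule as in Equation (14).

```python
import numpy as np

def kl(A, B):
    """KL-divergence between two assignment matrices, summed over all entries."""
    return float(np.sum(A * np.log(A / B)))

def target_distribution(P):
    """Sharpened target distribution built from a soft-assignment matrix."""
    weight = P ** 2 / P.sum(axis=0, keepdims=True)
    return weight / weight.sum(axis=1, keepdims=True)

def cluster_loss(P_specific, P_common):
    """Total clustering objective of Eq. (15)."""
    loss = kl(target_distribution(P_common), P_common)      # D(Q, P)
    for P_m in P_specific:
        loss += kl(target_distribution(P_m), P_m)           # D(Q^m, P^m)
        loss += kl(P_m, P_common)                           # D(P^m, P)
    return loss

rng = np.random.default_rng(0)

def random_assignments(n, k):
    """Random row-stochastic matrix standing in for a network output."""
    logits = rng.normal(size=(n, k))
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

P_specific = [random_assignments(50, 4) for _ in range(2)]  # M = 2 modalities
P_common = random_assignments(50, 4)

loss = cluster_loss(P_specific, P_common)
labels = P_common.argmax(axis=1)     # final hard assignment read from P

print("L_cluster:", loss)
```

The consistency term vanishes when a modality-specific assignment equals the common one, which is the situation the constraint drives toward.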

The generative network is designed to learn the categorical distribution $Cat(ϖ)$ and the generative transport plan $G$ included in MsOT and McOT. Specifically, for MsOT, the generative network $G_{s}^{m}(\cdot; φ_{s}^{m})$ decodes latent features $z_{i}^{m}$ sampled from the categorical distribution $Cat(\cdot)$ to the corresponding data:

$$\begin{aligned} z_{i}^{m} &= \mathrm{sample}(Cat(ϖ^{m})), \\ \tilde{x}_{i}^{m} &= G_{s}^{m}(z_{i}^{m}; φ_{s}^{m}), \end{aligned} \tag{16}$$

where $Cat(ϖ^{m})$ is the categorical distribution of the $m$-th modality with parameter $ϖ^{m}$. To update the parameter $φ_{s}^{m}$ of the generative network and the categorical distribution parameter $ϖ^{m}$, the topological reconstruction loss between the generated data and the original data is used as a guide, namely,

$$L_{rec}^{m} = \sum_{i=1}^{N} D_{w}(x_{i}^{m}, G_{s}^{m}(\mathrm{sample}(Cat(ϖ^{m})))), \tag{17}$$

where $D_{w}(\cdot, \cdot)$ represents the topological reconstruction loss based on optimal transport, which measures the differences in manifold structures between the original and reconstructed data. The network structure and loss of the generative network $G_{c}(\cdot; φ_{c})$ of McOT are similar to those of MsOT. Thus, the total objective function of the generative networks can be written as follows:

$$L_{rec} = \sum_{m=1}^{M} \sum_{i=1}^{N} D_{w}(x_{i}^{m}, G_{s}^{m}(z_{i}^{m})) + D_{w}(x_{i}, G_{c}(z_{i})), \tag{18}$$

where the first term optimizes the categorical distribution and the generative network within each modality, and the second term optimizes them over all modalities, so that the data structure information within and between modalities is captured, respectively.
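The sample-then-decode step of Equation (16) and an optimal-transport reconstruction loss can be illustrated on one-dimensional toy data. This is only a sketch under stated assumptions: the text does not give the exact form of $D_{w}$, so the one-dimensional Wasserstein-1 distance between empirical samples is used here as a stand-in, and `toy_generator`, the cluster centers, and the noise level are hypothetical replacements for the generative network.

```python
import random

def sample_categorical(weights, n, rng):
    """Draw n cluster indices z_i from a categorical distribution Cat(weights)."""
    return rng.choices(range(len(weights)), weights=weights, k=n)

def toy_generator(z, centers, rng, noise=0.1):
    """Hypothetical stand-in for a generative network: index -> noisy 1-D point."""
    return [centers[j] + rng.gauss(0.0, noise) for j in z]

def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    In one dimension the optimal transport plan matches sorted samples.
    """
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

rng = random.Random(0)
# Toy data with two well-separated clusters around -2 and +2.
data = [rng.gauss(-2.0, 0.1) for _ in range(100)] + [rng.gauss(2.0, 0.1) for _ in range(100)]
z = sample_categorical([0.5, 0.5], len(data), rng)

# A generator whose prototypes match the data clusters versus one that ignores them.
good = toy_generator(z, centers=[-2.0, 2.0], rng=rng)
bad = toy_generator(z, centers=[0.0, 0.0], rng=rng)
loss_good = wasserstein_1d(data, good)
loss_bad = wasserstein_1d(data, bad)
```

The generator whose prototypes sit on the data clusters yields the smaller transport cost, which is the signal a reconstruction loss of this kind would provide to both the generator and the categorical parameters.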

4.4. The Overall Loss

The overall loss function $L$ of the multimodal OT clustering network is defined as follows:

$$L = L_{rec} + α L_{cluster}, \tag{19}$$

where $α$ denotes the trade-off parameter for balancing the topological reconstruction loss $L_{rec}$ and the clustering pattern mining loss $L_{cluster}$, and $L$ is used to optimize the multimodal OT clustering network via the stochastic gradient descent strategy, which effectively boosts the performance of multimodal clustering:

$$θ_{t+1} = θ_{t} - η_{t} \nabla_{θ} L(θ_{t}), \tag{20}$$

where $θ_{t}$ and $η_{t}$ denote the parameter and the learning rate of the multimodal OT clustering network at time step $t$, respectively, and $\nabla_{θ} L(θ_{t})$ denotes the gradient at time step $t$. The overall optimization algorithm is shown in Algorithm 1.
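The update rule of Equation (20) applied to a loss of the form of Equation (19) can be sketched on a scalar toy problem. The two quadratic terms below are hypothetical stand-ins for $L_{rec}$ and $L_{cluster}$, the gradient is computed on the full toy loss rather than on mini-batches, and the learning rate is held constant.

```python
def overall_loss(theta, alpha):
    """Toy stand-in for Equation (19): L = L_rec + alpha * L_cluster."""
    l_rec = (theta - 1.0) ** 2       # hypothetical reconstruction term
    l_cluster = (theta + 1.0) ** 2   # hypothetical clustering term
    return l_rec + alpha * l_cluster

def gradient(theta, alpha):
    """Analytic gradient of the toy loss with respect to theta."""
    return 2.0 * (theta - 1.0) + alpha * 2.0 * (theta + 1.0)

def descend(theta, alpha, lr=0.1, steps=200):
    """Repeat theta <- theta - lr * grad L(theta), as in Equation (20)."""
    for _ in range(steps):
        theta = theta - lr * gradient(theta, alpha)
    return theta

alpha = 0.5
theta_star = descend(theta=5.0, alpha=alpha)
```

For this toy loss the minimizer is $(1 - α) / (1 + α)$, so a larger $α$ moves the solution toward the optimum of the clustering term, which illustrates the role of the trade-off parameter.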

Algorithm 1. Multimodal OT clustering network.

Input: Multimodal dataset $X \in \mathcal{X}$ and convergence criterion thr.
Output: Soft-clustering-assignment matrix $P$ and the parameters of the clustering networks $E_{s}^{m}(\cdot; θ_{s}^{m})$, $E_{c}(\cdot; θ_{c})$, the generative networks $G_{s}^{m}(\cdot; φ_{s}^{m})$, $G_{c}(\cdot; φ_{c})$, and the categorical distributions $Cat(ϖ_{s}^{m})$, $Cat(ϖ_{c})$.
Initialize: the modality-specific categorical distributions $Cat(ϖ_{s}^{m})$, the modality-common categorical distribution $Cat(ϖ_{c})$, the parameters of the modality-specific clustering networks $E_{s}^{m}(\cdot; θ_{s}^{m})$, the parameters of the modality-common clustering network $E_{c}(\cdot; θ_{c})$, the parameters of the modality-specific generative networks $G_{s}^{m}(\cdot; φ_{s}^{m})$, and the parameters of the modality-common generative network $G_{c}(\cdot; φ_{c})$.
while the convergence criterion thr is not reached by Equation (19) do
    for each $m = 1, \dots, M$ do
        Sample data $X^{m}$ in the $m$-th modality.
        Generate the $m$-th modality soft-assignment matrix $P^{m}$ according to Equation (13).
        Update $E_{s}^{m}(\cdot; θ_{s}^{m})$ by minimizing Equation (14).
        Sample modality-specific clustering prototypes from $Cat(ϖ_{s}^{m})$.
        Generate the $m$-th modality reconstructed data $\tilde{x}_{i}^{m}$ according to Equation (16).
        Update $G_{s}^{m}(\cdot; φ_{s}^{m})$ and $Cat(ϖ_{s}^{m})$ by minimizing Equation (17).
    end
    Sample multimodal data $X$.
    Generate the modality-common soft-assignment matrix $P$.
    Update the modality-common clustering network $E_{c}(\cdot; θ_{c})$ by minimizing Equation (15).
    Sample modality-common clustering prototypes from $Cat(ϖ_{c})$.
    Generate multimodal reconstructed data.
    Update the modality-common generative network $G_{c}(\cdot; φ_{c})$ by minimizing Equation (18).
    Update all network parameters by minimizing Equation (19) according to Equation (20).
end

5. Experiments

5.1. Experimental Setup

Datasets. Four real-world high-dimensional image datasets are utilized in the experiments to evaluate the performance of OTMC. Specifically, Handwritten contains 2000 objects of the digits 0–9, where the pixel features and profile correlations are used as two different modalities [31]. ORL contains 400 objects of 40 subjects, described by intensity features and Gabor features and collected under 10 different class conditions, such as lighting and facial expressions [32]. LandUse consists of 2100 satellite images distributed in 21 scene categories, where each satellite image is described by LBP and PHOG features [33]. Scene contains 4485 objects of 15 natural scene categories, in which PHOG and GIST features are used for each object [34]. Detailed information on these datasets is listed in Table 2.

Comparison methods. The comparison methods are grouped into two categories, deep fusion methods and deep clustering methods. Specifically, the deep fusion methods include DCCA [18], DCCAE [19], AE2-Nets [9], and CMIB [10]. These methods conduct k-means on the fusion features extracted by the corresponding deep architectures. The deep clustering methods include DMF [20], MVaDE [25], SiMVC [26], CoMVC [26], and SDMVC [27]. These methods are directly conducted on multimodal data to obtain the clustering results in an end-to-end manner. Moreover, FeatConcate is also used as a comparison method, which performs k-means on concatenated features.

Evaluation metrics. Three widely used evaluation metrics, i.e., accuracy (ACC) [35], normalized mutual information (NMI) [36], and adjusted Rand index (ARI), are used to measure the performance of all the methods. Higher values of these metrics indicate better clustering performance.

ACC measures the ratio of correctly clustered elements to all elements as follows:

$$ACC = \frac{\sum_{i=0}^{N-1} δ(l_{i}, m(p_{i}))}{N}, \tag{21}$$

where $l_{i}$ and $p_{i}$ are the labels and the predictions, respectively; $m(\cdot)$ is the Kuhn–Munkres function, which maps each predicted cluster to a label through the best one-to-one matching; $δ(\cdot, \cdot)$ is the Dirac delta function, which here equals 1 when its two arguments are equal and 0 otherwise; and $N$ is the size of the multimodal dataset.
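A minimal sketch of Equation (21) is given below. To stay dependency-free, it replaces the Kuhn–Munkres algorithm with an exhaustive search over one-to-one mappings, which returns the same optimal matching but is only practical for a handful of clusters; the function name `clustering_accuracy` and the toy labels are illustrative assumptions.

```python
from itertools import permutations

def clustering_accuracy(labels, preds):
    """ACC of Equation (21) under the best one-to-one cluster-to-label mapping."""
    classes = sorted(set(labels))
    clusters = sorted(set(preds))
    assert len(clusters) <= len(classes), "sketch assumes no more clusters than classes"
    best = 0
    # Brute-force stand-in for Kuhn-Munkres: try every injective mapping m.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for l, p in zip(labels, preds) if mapping[p] == l)
        best = max(best, hits)
    return best / len(labels)

labels = [0, 0, 1, 1, 2, 2]
preds = [1, 1, 0, 0, 2, 0]   # cluster ids are arbitrary; one instance is misplaced
acc = clustering_accuracy(labels, preds)
```

Because cluster identifiers carry no meaning, the score depends only on the partition: relabeling the predicted clusters leaves the accuracy unchanged.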

NMI measures the dependence between the label distribution and prediction distribution as follows:

$$NMI = \frac{2 I(L; P)}{H(L) + H(P)}, \tag{22}$$

where $L$ and $P$ denote the label distribution and the prediction distribution, respectively; and $I(\cdot)$ and $H(\cdot)$ denote the mutual information function and the entropy function, respectively.
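Equation (22) can be evaluated directly from label counts, as in the following sketch. The helper names are illustrative, natural logarithms are used (the base cancels in the ratio), and returning 1.0 when both partitions are constant is a convention assumed here rather than something stated in the text.

```python
import math
from collections import Counter

def entropy(xs):
    """Empirical entropy H of a label sequence."""
    n = len(xs)
    return -sum((c / n) * math.log(c / n) for c in Counter(xs).values())

def mutual_information(ls, ps):
    """Empirical mutual information I(L; P) from joint and marginal counts."""
    n = len(ls)
    joint = Counter(zip(ls, ps))
    count_l, count_p = Counter(ls), Counter(ps)
    return sum(
        (c / n) * math.log(c * n / (count_l[l] * count_p[p]))
        for (l, p), c in joint.items()
    )

def nmi(ls, ps):
    """NMI of Equation (22): 2 I(L; P) / (H(L) + H(P))."""
    denom = entropy(ls) + entropy(ps)
    if denom == 0:
        return 1.0  # assumed convention: two constant partitions are identical
    return 2.0 * mutual_information(ls, ps) / denom

same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The two toy calls cover the extremes: a relabeled but identical partition gives a score of 1, and a partition that is statistically independent of the labels gives 0.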

ARI is a measure based on the Rand index (RI), defined as follows:

$$ARI = \frac{RI - E(RI)}{\max(RI) - E(RI)}, \tag{23}$$

where RI is the Rand index, and $E$ denotes the expectation operator.
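In practice, Equation (23) is commonly evaluated in its pair-counting form over the contingency table of labels and predictions. The sketch below follows that standard form; the function name and the toy inputs are illustrative, and returning 1.0 in the degenerate case where the denominator vanishes is an assumed convention.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels, preds):
    """Pair-counting form of the ARI in Equation (23)."""
    n = len(labels)
    # Pairs placed together in both partitions, in the labels, and in the predictions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels, preds)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(preds).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # plays the role of E(RI)
    max_index = (sum_rows + sum_cols) / 2.0       # plays the role of max(RI)
    if max_index == expected:
        return 1.0  # assumed convention for degenerate partitions
    return (sum_cells - expected) / (max_index - expected)

perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
worse_than_chance = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Unlike ACC and NMI, the ARI can be negative: the second toy call splits every true pair and scores below the chance level of 0.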

Implementation details. OTMC is implemented in PyTorch. In the experiments, vector representations of instances are normalized to $[0, 1]$ and then fed into the network for training. During training, the Adam solver is used with a batch size of 100 to optimize the network, and the learning rate is set to $10^{-3}$ for all modules of the network. In addition, all the reported final results are the average results of 5 repeated experiments.

5.2. Clustering Performance Evaluation

Table 3 reports the numerical clustering results in terms of ACC, NMI, and ARI. The results show that OTMC obtains the best results on all four benchmark datasets. Two observations can also be made about the ranking of the comparison methods. First, the clustering results produced by deep fusion and deep clustering methods are better than those produced by FeatConcate, which verifies the capability of the deep methods to model correlations of heterogeneous modalities when fusing complementary knowledge. Second, the deep clustering methods outperform the deep fusion methods in clustering pattern mining, since the deep clustering methods obtain clustering-specific fusion knowledge on multimodal data, rather than the generalized fusion knowledge extracted by the deep fusion methods. In summary, OTMC produces the best results compared with the ten baseline methods. These observations validate the effectiveness of optimal transport between data and clusters. That is, the modality-specific OT and the modality-common OT, grounded in the Wasserstein distance, can effectively measure divergences hidden in heterogeneous distributions of modalities. In particular, in comparison with FeatConcate, which relies on the strong-notion Euclidean distance, OTMC achieves larger improvements on Handwritten than on the other three datasets. A potential reason is the difference in heterogeneous manifold overlaps, i.e., the heterogeneous manifolds in Handwritten have fewer overlaps than those in the other three datasets, whose features undergo feature engineering that enhances the intersection.

Figure 2 shows the results of the Nemenyi tests on the four datasets. It can be seen that the average rank of OTMC is higher than that of the other methods on all four datasets, which statistically indicates the superiority of OTMC.
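For reference, the Nemenyi post hoc test is usually read through its critical difference (CD): two methods are considered significantly different when their average ranks differ by at least CD. In the standard formulation below, $k$ is the number of compared methods, $n$ is the number of measurements over which the ranks are averaged, and $q_{\alpha}$ is the critical value derived from the Studentized range statistic. How $n$ is chosen for each panel of Figure 2 is not stated in the text, so the symbols are given generically.

```latex
\mathrm{CD} = q_{\alpha} \sqrt{\frac{k (k + 1)}{6 n}}
```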

5.3. Further Evaluation

Ablation analysis. To investigate the effectiveness of each component of OTMC, an ablation analysis is conducted on Handwritten. As shown in Table 4, modality-specific OT produces the lowest results, which are similar in two modalities, due to the limited information of each modality. Modality-common OT produces higher results than modality-specific OT in two modalities, because it can correct the inconsistencies based on modality-common knowledge when transporting common information of modalities. OTMC produces the optimal results, since OT can fuse the private and shared information of modalities. The above observations validate the effectiveness of the collaboration between modality-specific OT and modality-common OT in the pattern mining of multimodal data.

Convergence analysis. To verify the convergence of OTMC, Figure 3 shows the training losses on the four datasets at each epoch. During epochs 0–10, the losses on all four datasets decrease rapidly. During epochs 11–20, the decrease slows down, and during epochs 21–30, the change in loss tends to zero. In general, OTMC converges after 30 epochs of training (black dotted line in Figure 3).

6. Conclusions

A multimodal clustering method based on optimal transport (OTMC) is designed in this paper to mine intrinsic patterns from the fusion information of multimodal data. It utilizes the weak-notion distance to measure the differences among heterogeneous manifolds, which overcomes the fragile assumption on strict overlaps between data manifolds in previous MC methods. In particular, multimodal clustering is defined as the transport mapping between multimodal data and clustering prototypes, which effectively fuses the complementary information in multimodal data. Then, the variational solution to OT is derived based on a generative transport plan, which utilizes fusion information to produce accurate clustering patterns. Afterwards, a multimodal OT clustering network is designed to achieve the above definition. Finally, extensive experiments illustrate the superiority of OTMC.

The possible future directions of this research are three-fold: First, in current OTMC, the transport plan is completely implemented by a deep neural network. Hence, the performance of OTMC heavily depends on the degree of fitting between the deep neural network and the real transport plan, which introduces uncertainty and may be a potential limitation. In the future, a variational solution with more compact margins will be explored to address this dependence and achieve a more precise map for pattern mining. Second, the performance of OTMC on different types of multimodal datasets, especially those with varying degrees of distribution overlap or complementarity across modalities via feature engineering, will be checked to further verify the theoretical basis of OTMC. Third, OTMC will be tried in other multimodal data learning cases such as graph data classification and time series prediction.

Author Contributions

Methodology, Y.G.; writing—original draft preparation, Z.Y. and C.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Liaoning Provincial Department of Education Scientific Research Fund Project (Approval Number: LJ22241163204).

Data Availability Statement

The original data presented in the study are openly available in Handwritten at https://archive.ics.uci.edu/ml/datasets/Multiple+Features (accessed on 28 January 2025), ORL at https://cam-orl.co.uk/facedatabase.html (accessed on 28 January 2025), LandUse at http://weegee.vision.ucmerced.edu/datasets/landuse.html (accessed on 28 January 2025), and Scene at https://figshare.com/articles/dataset/15-Scene_Image_Dataset/7007177 (accessed on 28 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Kiaei, I.; Lotfifard, S. A two-stage fault location identification method in multiarea power grids using heterogeneous types of data. IEEE Trans. Ind. Inform. 2019, 15, 4010–4020.
2. Li, Y.; Yang, M.; Zhang, Z. A survey of multi-view representation learning. IEEE Trans. Knowl. Data Eng. 2019, 31, 1863–1883.
3. Fu, L.; Lin, P.; Vasilakos, A.V.; Wang, S. An overview of recent multi-view clustering. Neurocomputing 2020, 402, 148–161.
4. Gao, J.; Liu, M.; Li, P.; Laghari, A.A.; Javed, A.R.; Victor, N.; Gadekallu, T.R. Deep incomplete multi-view clustering via information bottleneck for pattern mining of data in extreme-environment IoT. IEEE Internet Things J. 2023, 11, 26700–26712.
5. Gao, J.; Liu, M.; Li, P.; Zhang, J.; Chen, Z. Deep multiview adaptive clustering with semantic invariance. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 12965–12978.
6. Wang, H.; Yang, Y.; Liu, B. GMC: Graph-based multi-view clustering. IEEE Trans. Knowl. Data Eng. 2020, 32, 1116–1129.
7. Caron, M.; Bojanowski, P.; Joulin, A.; Douze, M. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 139–156.
8. Gao, J.; Li, P.; Laghari, A.A.; Srivastava, G.; Gadekallu, T.R.; Abbas, S.; Zhang, J. Incomplete multiview clustering via semidiscrete optimal transport for multimedia data mining in IoT. ACM Trans. Multim. Comput. Commun. Appl. 2024, 20, 1–20.
9. Zhang, C.; Liu, Y.; Fu, H. AE2-Nets: Autoencoder in autoencoder networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 2577–2585.
10. Wan, Z.; Zhang, C.; Zhu, P.; Hu, Q. Multi-view information-bottleneck representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 10085–10092.
11. Yang, Y.; Wang, H. Multi-view clustering: A survey. Big Data Min. Anal. 2018, 1, 83–107.
12. Chao, G.; Sun, S.; Bi, J. A survey on multi-view clustering. arXiv preprint 2017, arXiv:1712.06246.
13. Li, P.; Laghari, A.A.; Rashid, M.; Gao, J.; Gadekallu, T.R.; Javed, A.R.; Yin, S. A deep multimodal adversarial cycle-consistent network for smart enterprise system. IEEE Trans. Ind. Inform. 2023, 19, 693–702.
14. Xiao, Q.; Dai, J.; Luo, J.; Fujita, H. Multi-view manifold regularized learning-based method for prioritizing candidate disease miRNAs. Knowl. Based Syst. 2019, 175, 118–129.
15. Neema, I.; Ardecani, F.B.; Shoghli, O. Cluster-based deterioration prediction of composite pavements with incorporation of flooding. In Proceedings of the 39th International Symposium on Automation and Robotics in Construction, Bogota, Colombia, 13–15 July 2022; pp. 99–106.
16. Mirghaderi, H.; Hassanizadeh, B. k-most suitable locations problem: Greedy search approach. Int. J. Ind. Syst. Eng. 2022, 42, 80–95.
17. Rahiminasab, A.; Tirandazi, P.; Ebadi, M.J.; Ahmadian, A.; Salimi, M. An energy-aware method for selecting cluster heads in wireless sensor networks. Appl. Sci. 2020, 10, 7886.
18. Andrew, G.; Arora, R.; Bilmes, J.A.; Livescu, K. Deep canonical correlation analysis. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1247–1255.
19. Wang, W.; Arora, R.; Livescu, K.; Bilmes, J.A. On deep multi-view representation learning. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1083–1092.
20. Zhao, H.; Ding, Z.; Fu, Y. Multi-view clustering via deep matrix factorization. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 2921–2927.
21. Bodaghi, M.; Hosseini, M.; Gottumukkala, R. A multimodal intermediate fusion network with manifold learning for stress detection. arXiv preprint 2024, arXiv:2403.08077.
22. Ma, Z.; Yu, J.; Wang, L.; Chen, H.; Zhao, Y.; He, X.; Wang, Y.; Song, Y. Multi-view clustering based on view-attention driven. Int. J. Mach. Learn. Cybern. 2023, 14, 2621–2631.
23. Liu, S.; Zhu, C.; Li, Z.; Yang, Z.; Gu, W. View-driven multi-view clustering via contrastive double-learning. Entropy 2024, 26, 470.
24. Dornaika, F.; Hajjar, S.E.; Charafeddine, J.; Barrena, N. Unified multi-view data clustering: Simultaneous learning of consensus coefficient matrix and similarity graph. Cogn. Comput. 2025, 17, 38.
25. Yin, M.; Huang, W.; Gao, J. Shared generative latent representation learning for multi-view clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 6688–6695.
26. Trosten, D.J.; Lokse, S.; Jenssen, R.; Kampffmeyer, M. Reconsidering representation alignment for multi-view clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 1255–1265.
27. Xu, J.; Ren, Y.; Tang, H.; Yang, Z.; Pan, L.; Yang, Y.; Pu, X.; Yu, P.S.; He, L. Self-supervised discriminative feature learning for multi-view clustering. IEEE Trans. Knowl. Data Eng. 2023, 35, 7470–7482.
28. Gao, Q.; Lian, H.; Wang, Q.; Sun, G. Cross-modal subspace clustering via deep canonical correlation analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 3938–3945.
29. Mao, Y.; Yan, X.; Guo, Q.; Ye, Y. Deep mutual information maximin for cross-modal clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 8893–8901.
30. Yang, H.; Deng, Z.; Zhang, W.; Wu, Q.; Choi, K.; Wang, S. End-to-end multiview fuzzy clustering with double representation learning and visible-hidden view cooperation. IEEE Trans. Fuzzy Syst. 2024, 32, 483–497.
31. Multiple Features – UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/ml/datasets/Multiple+Features (accessed on 28 January 2025).
32. The Database of Faces. Available online: https://cam-orl.co.uk/facedatabase.html (accessed on 28 January 2025).
33. UC Merced Land Use Dataset. Available online: http://weegee.vision.ucmerced.edu/datasets/landuse.html (accessed on 28 January 2025).
34. 15-Scene Image Dataset. Available online: https://figshare.com/articles/dataset/15-Scene_Image_Dataset/7007177 (accessed on 28 January 2025).
35. Fränti, P.; Sieranoja, S. Clustering accuracy. Appl. Comput. Intell. 2024, 4, 24–44.
36. Kvalseth, T.O. Entropy and correlation: Some comments. IEEE Trans. Syst. Man and Cybern. 1987, 17, 517–519.

Figure 1. A scheme of OTMC. Multimodal clustering is defined as OTMC that contains modality-specific optimal transport (MsOT) and modality-common optimal transport (McOT). Naturally, MsOT and McOT have similar structures because both of them model OT from the data to the prototype space. MsOT adaptively learns a soft-assignment matrix $P_{s}^{m}$ between the data $X^{m}$ and the clustering prototypes $C_{s}^{m}$, and McOT adaptively learns a soft-assignment matrix $P_{c}$ between the multimodal data $X$ and the common clustering prototypes $C_{c}$. Furthermore, a consistent constraint is applied to the categorical distributions $Cat(ϖ_{s}^{m})$, $Cat(ϖ_{s}^{v})$, and $Cat(ϖ_{c})$, whose details can be found in Section 3.3 and Section 4.2. Note that the number of MsOTs is equal to that of modalities, and the two MsOTs in the figure are only shown for esthetic reasons.

Figure 2. Results of Nemenyi test on four datasets. (a) Handwritten; (b) ORL; (c) LandUse; (d) Scene.

Figure 3. Convergence analysis on four datasets.

Table 1. Frequently used notations.

Notations	Description
$X, \tilde{X}$	Multimodal data/reconstructed data space.
$X^{m}, {\tilde{X}}^{m}$	$m$ -th modality data/reconstructed data space.
$M$	Number of modalities.
$k$	Number of clusters.
$μ$	Multimodal data measure.
$μ_{s}^{m}$	$m$ -th modality-specific data measure.
$C$	Clustering prototype set.
$c_{j}$	$j$ -th clustering prototype.
$C_{s}^{m}$	$m$ -th modality-specific clustering prototype set.
$c_{j}^{m}$	$m$ -th modality, $j$ -th clustering prototype.
$ν$	Clustering prototype measure over all modalities.
$ν_{s}^{m}$	$m$ -th modality-specific clustering prototype measure.
$T$	Clustering transport map from data to prototypes.
$T_{s}^{m}$	$m$ -th modality-specific transport map.
$T_{c}$	Common transport map over all modalities.
$W$	Multimodal data partition scheme.
$W_{j}$	Set of data split into $j$ -th cluster.
$π$	Coupled measure of measures $μ$ and $ν$ .
$G$	Generative transport plan from prototypes to data.
$G_{s}^{m}$	$m$ -th modality-specific generative transport plan.
$C a t$	Categorical prior distribution.
$C a t_{s}^{m}$	$m$ -th modality-specific categorical prior distribution.
$C a t (ϖ)$	Categorical distribution with parameter $ϖ$ .
$E_{s}^{m} (\cdot; θ_{s}^{m})$	Clustering network with parameter $θ_{s}^{m}$ .
$P^{m}$	$m$ -th modality soft-clustering-assignment matrix.
$G_{s}^{m} (\cdot; φ_{s}^{m})$	Generative network with parameter $φ_{s}^{m}$ .
$D_{w} (\cdot, \cdot)$	OT-based topological reconstruction loss.

Table 2. The information of datasets.

Dataset	Number	Modality	Class
Handwritten	2000	2 (pixel/profile correlation)	10 (handwritten numbers 0–9)
ORL	400	2 (intensity/Gabor)	10 (face image shooting conditions)
LandUse	2100	2 (LBP/PHOG)	21 (satellite image scene categories)
Scene	4485	2 (PHOG/GIST)	15 (natural scene categories)

Table 3. Average clustering results on Handwritten, ORL, LandUse, and Scene datasets. Best results are marked in bold.

Dataset	Handwritten			ORL			LandUse			Scene
Metric	ACC	NMI	ARI	ACC	NMI	ARI	ACC	NMI	ARI	ACC	NMI	ARI
FeatConcate	0.6104	0.6070	0.5532	0.5710	0.7528	0.4770	0.1232	0.1608	0.0365	0.3076	0.3540	0.1862
DCCA [18]	0.6626	0.6601	0.6136	0.5968	0.7784	0.5020	0.1551	0.2315	0.0443	0.3618	0.3892	0.2087
DCCAE [19]	0.6917	0.6696	0.6327	0.5940	0.7752	0.4993	0.1562	0.2441	0.0442	0.3644	0.3978	0.2147
DMF [20]	0.6962	0.7160	0.5993	0.6733	0.8164	0.5407	0.1450	0.1543	0.0360	0.3393	0.3576	0.1862
AE2-Nets [9]	0.8152	0.7139	0.6667	0.6885	0.8573	0.5637	0.2479	0.3036	0.1035	0.3610	0.4039	0.2208
MVaDE [25]	0.8875	0.8076	0.7765	0.6950	0.8356	0.5643	0.2248	0.2848	0.0936	0.3782	0.3992	0.2178
CoMVC [26]	0.8205	0.8142	0.7559	0.7063	0.8652	0.5353	0.2624	0.3083	0.1085	0.3859	0.4117	0.2231
SiMVC [26]	0.8295	0.7608	0.6985	0.6921	0.8560	0.5256	0.2448	0.2581	0.0957	0.3775	0.3935	0.2259
SDMVC [27]	0.8990	0.8214	0.8007	0.7104	0.8557	0.5942	0.2681	0.2986	0.1201	0.3857	0.4132	0.2259
CMIB [10]	0.8972	0.8178	0.7988	0.7207	0.8823	0.6004	0.2716	0.3016	0.1242	0.3954	0.4177	0.2383
OTMC	0.9215	0.8496	0.8335	0.7650	0.8837	0.6496	0.2814	0.3254	0.1312	0.4181	0.4373	0.2621

Table 4. Ablation analysis on Handwritten in terms of ACC, NMI, and ARI. Best results are marked in bold.

Methods	ACC	NMI	ARI
modality-specific OT 1	0.7215	0.7276	0.6212
modality-specific OT 2	0.7255	0.7092	0.6141
modality-common	0.9090	0.8306	0.8098
OTMC	0.9215	0.8496	0.8335

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
