Unsupervised Deep Embedded Clustering for High-Dimensional Visual Features of Fashion Images

Malhi, Umar Subhan; Zhou, Junfeng; Yan, Cairong; Rasool, Abdur; Siddeeq, Shahbaz; Du, Ming

doi:10.3390/app13052828

¹ School of Computer Science and Technology, Donghua University, Shanghai 200051, China

² Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(5), 2828; https://doi.org/10.3390/app13052828

Submission received: 29 December 2022 / Revised: 18 February 2023 / Accepted: 20 February 2023 / Published: 22 February 2023

(This article belongs to the Special Issue Big Data Analytics: Correspondence Factor Analysis, Clustering and Classification Algorithms and Applications)

Abstract

Fashion image clustering is key to fashion retrieval, forecasting, and recommendation applications. Clustering based on manual labeling is both time-consuming and less accurate. Currently, popular methods for extracting features from data use deep learning techniques, such as Convolutional Neural Networks (CNNs). These methods can generate high-dimensional feature vectors, which are effective for image clustering. However, high dimensionality can lead to the curse of dimensionality, which makes subsequent clustering difficult. This paper proposes the fashion images-oriented deep clustering method (FIDC). The method uses a CNN to generate a 4096-dimensional feature vector for each fashion image through transfer learning, then performs dimensionality reduction with a deep-stacked auto-encoder, and finally clusters the resulting low-dimensional vectors. The high-dimensional vectors represent the images, and dimensionality reduction avoids the curse of dimensionality during clustering. A distinguishing feature of the method is the joint learning and optimization of the dimensionality reduction process and the clustering task. Optimization is performed with back-propagation and stochastic gradient descent. The experimental findings show that FIDC achieves state-of-the-art performance.

Keywords:

clustering algorithm; dimensionality reduction; fashion images; deep embedding; stacked autoencoder

1. Introduction

Fashion apparel analysts, researchers, and stakeholders want to understand how fashion develops and changes in the real world. Modeling fashion behavior is challenging because of personal creativity, interest in fashion choices, adoption, social dependence, and moral values. Many retailers use social network data to analyze fashion trends [1,2]. Artificial intelligence plays a critical role in the fashion business by assessing a huge number of attributes and processing data accurately, quickly, and efficiently, and AI algorithms have become proficient assistants for human stylists. With the development of machine learning, classification and clustering algorithms are widely used to study the diversity of fashion. The most commonly used algorithms for clustering and classification tasks are K-means [3] and the Gaussian mixture model [4].

From the perspective of fashion demand, these techniques can classify fashion products accurately and automatically, which shows that these algorithms help solve decision-making problems. Usually, these algorithms are applied in a supervised or semi-supervised way based on product attributes, descriptions, and visual features. In some cases, attribute-based segmentation is not practical because some product attributes may not distinguish one product from another.

For real-world datasets such as fashion, some researchers have worked on product recommendation and retrieval based on non-visual, text-based features. Non-visual attributes require manually describing product characteristics such as appearance, color, pattern, and material [5,6,7,8]. Such data become noisy because of human error, and producing them can be quite expensive [9].

We mainly utilize these attributes for product retrieval, article mapping, and product clustering. To cluster products without labeling data manually, the task must be performed in an unsupervised way, finding similar representations among products based on image features [10]. Clustering algorithms often use distance-based similarity to perform this segmentation.

This paper uses the Amazon dataset, which contains high-quality product images. We use a CNN to extract visual features so that each image is represented by a 4096-dimensional feature vector. In such a high-dimensional data space, a phenomenon called the curse of dimensionality arises [11]: the ratio between the distances to the farthest and nearest points approaches 1, i.e., the points become almost equally distant. The idea behind nearest-neighbor methods is that closer items are more relevant than farther items, so distance-based similarity becomes uninformative if all items are nearly uniformly far apart.
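The distance concentration effect described above can be checked numerically. The following is a minimal sketch, not part of the original experiments: it assumes points drawn uniformly from the unit hypercube and Euclidean distance, and the function names `farthest_to_nearest_ratio` and `mean_ratio` are hypothetical. It uses only the Python standard library for readability; an array library would be the usual choice in practice.

```python
import math
import random


def farthest_to_nearest_ratio(n_points, dim, rng):
    """Ratio of the farthest to the nearest Euclidean distance from a random query point."""
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return max(dists) / min(dists)


def mean_ratio(n_points, dim, trials=20, seed=0):
    """Average the ratio over several random datasets to smooth out sampling noise."""
    rng = random.Random(seed)
    total = sum(farthest_to_nearest_ratio(n_points, dim, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    # The printed ratio shrinks toward 1 as the dimension grows.
    for dim in (2, 32, 512):
        print(dim, round(mean_ratio(100, dim), 2))
```

In this setting the ratio is large in two dimensions and close to 1 in several hundred dimensions, which is why nearest-neighbor style similarity loses contrast on raw 4096-dimensional features.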

In Figure 1, we consider a dataset containing 20 data items whose coordinates are distributed between 0 and 2 in each dimension. Figure 1a shows the items placed in one dimension, where almost half of the items fall within a single cube. Figure 1b shows the items placed in a two-dimensional data space; with the added dimension, the distribution is less concentrated, and only 6 points fall in the green cube. Figure 1c shows the items placed in a three-dimensional data space. From this example, it can be concluded that as more dimensions are added, the data points move further apart and distance-based similarity becomes less effective. The curse of dimensionality then becomes severe, and clustering items directly in such a high-dimensional space becomes unreliable.

Dimension reduction is a crucial technique in machine learning and data science for combating the challenge of high-dimensional data, commonly known as the curse of dimensionality [12,13,14,15]. Most approaches convert data from high to low dimensions and then perform clustering directly on the compressed version of the original input. Another approach is to perform dimension reduction and clustering assignment simultaneously [16,17,18], which has been shown to perform better. In our previous study, we enhanced the joint learning approach, which motivated us to perform further experiments [19].

This paper introduces the fashion images deep clustering (FIDC) method, which performs input dimension reduction and clustering assignment jointly. We adopt a deep-stacked denoising auto-encoder to achieve dimension reduction and search for optimal parameters for our real-world fashion dataset. Further, we improve the visual feature representation using a Kullback–Leibler divergence training loss and employ an auxiliary target distribution to refine the cluster assignments.

Labeled data are not used to train the FIDC model; labels are used only afterwards to evaluate the model’s performance. This approach allows the model to learn patterns and relationships in the data without bias toward any particular outcome.

The significant contributions in this paper are summarized as follows:

This study is the first to use real-world Amazon fashion images for a deep clustering task without any prior supervision.
In building the Amazon-Fashion dataset for unsupervised learning, we extracted a subset of clothing items and split them into distinct category labels based on tags. Further, we reproduced the experimental results of the baseline methods on the Amazon-Fashion dataset.
We used an under-complete autoencoder to extract embedded features while maintaining the local structure of the data distribution. This process facilitates the simultaneous execution of dimension reduction and clustering assignment.
Our results suggest that a machine learning model’s performance can be improved by concentrating more on the data rather than constantly creating new algorithms. They demonstrate the effectiveness of our method in comparison to highly effective baseline methods.

The paper is organized as follows. Section 2 discusses previous research and baseline methods. Section 3 explains the theory of the proposed method in detail. Section 4 describes the experimental design. Section 5 evaluates and discusses the performance of the proposed model (FIDC) through various experiments. Finally, Section 6 concludes the paper. The symbols and terms used throughout the paper are defined in Table 1.

2. Related Work

Clustering is an unsupervised learning task in which a model is trained without any labels. Unsupervised learning has an advantage over supervised learning because it avoids the cost of labeling data. Cluster analysis aims to explore or summarize the characteristics of data, and the field has seen much research in recent decades; k-means, for example, has a history of 50 years. Clustering is used for categorical data [3] and in image processing, pattern recognition, and image segmentation [20,21]. It is also widely used in biotechnology, geography, astronomy, psychology, customer relation management, marketing, time series forecasting [5,22,23], and stock management [9]. The choice of clustering method and algorithm depends on both the size of individual instances and the number of instances under consideration.

High-dimensional data have always been a challenge for clustering methods [24]: high dimensionality makes clustering tedious and calls for more general methods that can cluster different datasets [25]. Some algorithms, such as subspace clustering techniques, ignore features that are irrelevant to the clustering task in high-dimensional data spaces [26,27]. However, general subspace clustering methods are usually not applicable to combinations of arbitrary features. Projected clustering is another approach, which assigns each point to a single cluster while allowing clusters to exist in different subspaces. This approach uses a special distance function together with a regular clustering algorithm [24]. It should be noted that, because of overlap, not all algorithms can discover a distinct cluster for each point, nor the relationships among all clusters in the subspaces of a high-dimensional space whose attributes share feature commonalities. Correlations between features or subsets of features lead to distinct spatial shapes of clusters. Therefore, local patterns are used to identify the differences between clustered items. Biclustering can be regarded as a form of correlation clustering since both are strongly linked. Biclustering identifies clusters of objects based on shared attributes [28].

Due to the discovery of novel methods, interest in deep architectures has been revived, with deep auto-encoders used to convert high-dimensional data into low-dimensional representations. Recent research has proved successful at learning their parameters. However, there is still no explicit knowledge of what constitutes useful representations for initializing deep-stacked architectures or which unsupervised criteria can best guide their learning. Nevertheless, some algorithms perform well, such as semi-supervised embedding, auto-encoders, restricted Boltzmann machines [29], and kernel PCA [30].

Broadly, there are two categories of deep clustering algorithms: (i) those that learn a latent-space representation of the data separately and then perform clustering directly on the compressed representation, and (ii) those that apply dimension reduction and clustering simultaneously [31,32]. Existing unsupervised deep learning frameworks and methods directly benefit the first category; an example is using a deep-stacked auto-encoder to learn a compressed representation and then running k-means on it. The second category explicitly defines a clustering loss, which plays the role that the classification error plays in supervised deep learning, and optimizes it iteratively [16,17,33,34].

In addition to fully connected approaches, other methods employ convolutional networks for learning feature representations. For instance, Deep Cluster [35] uses an iterative approach in which k-means first groups the features, and the resulting cluster assignments are then used as supervision to adjust the network’s weights. Some methods use an adaptive learning technique with a neural network to learn feature representations and cluster image features [36]. However, this approach is computationally intensive. There are also semi-supervised settings [37] and data augmentation methods [38].

However, current unsupervised algorithms do not consider the relationship between representation learning and clustering and are not effective in combining these two tasks when working with high-dimensional data. The proposed algorithm in this paper uses an under-complete autoencoder to extract embedded features while preserving the local structure of the data distribution. This method is explained in the following section.

3. Methodology

The proposed method relies on a deep-stacked denoising auto-encoder that integrates an under-complete auto-encoder to preserve the local structure. Moreover, our method performs feature encoding and clustering jointly. We first present the framework of FIDC and then describe how a deep-stacked auto-encoder reduces the dimension. The remaining subsections present the clustering process and its optimization algorithm.

3.1. Framework of FIDC

The FIDC method consists of the following steps: feature extraction using the CNN-F architecture [39], adding noise to the original input, feature encoding, normalization, pooling, and a clustering layer. Figure 2 illustrates the complete architecture of the FIDC method. We denote the visual feature extracted by the CNN as the original input $x$. The encoder input $\tilde{x}$ is obtained by corrupting a fraction of the elements of $x$ with noise; we then train the stacked auto-encoder layers to denoise these corrupted versions and recover the original input $x$. Simply copying the input $x$ to the reconstruction $z$ does not guarantee that a valid latent representation has been extracted. We therefore avoid this behavior by changing the reconstruction criterion to a more interesting objective [40]. Accordingly, the latent representation $Y$ is forced to be obtained from the noised input while reconstructing the original input. During feature encoding, we reduce the input dimension in a way that is ideally cluster-friendly. At the same time, our primary motivation is to minimize the reconstruction loss, i.e., the input $x$ and the reconstructed input $z$ should be nearly identical, $x \approx z$, which ensures that we have learned an efficient latent representation $Y$ at the bottleneck.

The clustering layer is attached to the latent space at the bottleneck, and we use back-propagation with stochastic gradient descent (SGD) to optimize the network in an unsupervised way.

3.2. Dimensionality Reduction Based on Deep Stacked Auto-Encoder

An auto-encoder is a neural network whose purpose is to take an input, compress it, and reconstruct the same input from the compressed version. The auto-encoder has two parts: (i) the encoder, which encodes the features into a smaller representation, and (ii) the decoder, which reconstructs the input from that representation, so that the network learns the important abstract features. These abstract features help improve the accuracy of classifiers and other unsupervised learning approaches.

The auto-encoder is trained with back-propagation so that the reconstruction obtained from the compressed representation is as close to the original input as possible, i.e., $z^{i} \approx x^{i}$.

Our model draws inspiration from the denoising auto-encoder (DAE) to train an auto-encoder in which the mapping between the input $x$ and the output $z$ is nonlinear [13,40]. A good representation is one that can be robustly derived from a distorted input and that helps recover the corresponding original input. We emphasize that our objective is not denoising in itself. Instead, denoising is used and investigated as a training criterion for learning to extract valuable characteristics that provide a better representation. The utility of a learned representation can then be objectively evaluated by assessing the accuracy of a classifier that uses it as input.

This method transforms the input $x$ into a corrupted form $\tilde{x}$ through a stochastic mapping $\tilde{x} \sim q_{D}(\tilde{x} \mid x)$. The corrupted input $\tilde{x}$ is then transformed into a hidden representation $y$ through the encoding function $f_{e}$, i.e., $y = f_{e}(W \tilde{x} + b)$, as in a basic auto-encoder. We then reconstruct $z = f_{d}(W' y + b')$, which is closer to what sparse coding has learned [41]. As shown in Figure 3, the denoising auto-encoder still tries to minimize the reconstruction loss $L_{r}(x, z)$ between the original input $x$ and its reconstruction $z$ from the reduced representation $y$. The main distinction is that $z$ is now a deterministic function of $\tilde{x}$ instead of the original input $x$. As before, the reconstruction error is either the squared error loss or the cross-entropy loss with an affine decoder. In this way, the model is forced to learn a mapping that extracts features useful for denoising the input rather than a simple identity mapping.

$$y = f_{e}(W \tilde{x} + b) \tag{1}$$

where $y$ is the reduced representation code, $W$ is the weight matrix from the input to the hidden layer, $b$ is the bias, and $f_{e}$ is the ReLU activation function used for the encoder/decoder hidden layers [42].

The reconstruction $z$ is obtained through the mapping function $f_{d}$:

$$z = f_{d}(W' y + b') \tag{2}$$

where $f_{d}$ denotes the decoding function and $W'$ denotes the tied decoder weights.
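As an illustration of Equations (1) and (2), the following minimal sketch runs one forward pass of a single-layer denoising auto-encoder. It rests on several assumptions that the text does not fix: masking noise (a random fraction of the entries set to zero) as the corruption process, ReLU for both $f_{e}$ and $f_{d}$, and tied weights interpreted as $W' = W^{T}$. All function names are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
import random


def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def transpose(W):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*W)]


def corrupt(x, fraction, rng):
    """Masking noise (an assumption): zero out a random fraction of the entries of x."""
    n_corrupt = int(round(fraction * len(x)))
    idx = set(rng.sample(range(len(x)), n_corrupt))
    x_tilde = [0.0 if j in idx else a for j, a in enumerate(x)]
    return x_tilde, idx


def dae_forward(x, W, b, b_prime, fraction, rng):
    """One denoising auto-encoder pass with tied weights W' = W^T."""
    x_tilde, idx = corrupt(x, fraction, rng)
    # Eq. (1): encode the corrupted input.
    y = relu([a + c for a, c in zip(matvec(W, x_tilde), b)])
    # Eq. (2): decode with the transposed encoder weights.
    z = relu([a + c for a, c in zip(matvec(transpose(W), y), b_prime)])
    return x_tilde, y, z, idx


if __name__ == "__main__":
    rng = random.Random(0)
    x = [0.1 * (j + 1) for j in range(8)]
    W = [[rng.uniform(-0.5, 0.5) for _ in range(8)] for _ in range(3)]
    x_tilde, y, z, idx = dae_forward(x, W, [0.0] * 3, [0.0] * 8, 0.25, rng)
    print(len(x_tilde), len(y), len(z), sorted(idx))
```

The returned index set records which components were corrupted, which is the information a loss that treats corrupted and uncorrupted components differently would need.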

For a stack of $L$ layers, the encoder and decoder are applied layer by layer:

$$y^{(k+1)} = f_{e}(W^{(k+1)} y^{(k)} + b^{(k+1)}) \tag{3}$$

$$z^{(k+1)} = f_{d}({W^{(L-k)}}^{T} z^{(k)} + b'^{(k+1)}) \tag{4}$$

where $k$ ranges from 0 to $L-1$, $y^{(0)}$ is the original input, $y^{(L)}$ is the output of the last encoding layer, $z^{(0)}$ is the input to the first decoding layer, taken from the reduced representation $Y$, $z^{(L)}$ is the output of the last decoding layer, and the superscript $T$ denotes the matrix transpose. The cross-entropy loss $L_{r}(x, z)$ measures the loss between the original input $x$ and the reconstructed value $z$.

The loss function is the objective used for updating the weights in the stochastic gradient descent algorithm.

$$L_{r}(x, z) = \alpha \left( -\sum_{j \in J(\tilde{x})} [x_{j} \log z_{j} + (1 - x_{j}) \log (1 - z_{j})] \right) + \beta \left( -\sum_{j \notin J(\tilde{x})} [x_{j} \log z_{j} + (1 - x_{j}) \log (1 - z_{j})] \right) \tag{5}$$

where $J(\tilde{x})$ denotes the set of indices of the components of $x$ that were corrupted, and $\alpha$ and $\beta$ weight the reconstruction error on the corrupted and uncorrupted components, respectively.
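The weighted cross-entropy of Equation (5) can be sketched as follows. This is only an illustration under stated assumptions: the values chosen for `alpha` and `beta` are placeholders rather than the settings used in the experiments, the reconstruction `z` is assumed to lie in (0, 1), and each cross-entropy term is taken with a negative sign so that the loss is non-negative. The function name is hypothetical.

```python
import math


def emphasized_cross_entropy(x, z, corrupted, alpha=1.0, beta=0.5, eps=1e-12):
    """Cross-entropy reconstruction loss that weights corrupted and clean components separately.

    x          original input, values in [0, 1]
    z          reconstruction, values in (0, 1)
    corrupted  set of indices J(x_tilde) that were corrupted
    alpha/beta placeholder weights for corrupted / uncorrupted components
    """
    loss_corrupted = 0.0
    loss_clean = 0.0
    for j, (xj, zj) in enumerate(zip(x, z)):
        zj = min(max(zj, eps), 1.0 - eps)  # clip so that log() stays finite
        ce = -(xj * math.log(zj) + (1.0 - xj) * math.log(1.0 - zj))
        if j in corrupted:
            loss_corrupted += ce
        else:
            loss_clean += ce
    return alpha * loss_corrupted + beta * loss_clean


if __name__ == "__main__":
    x = [1.0, 0.0, 1.0, 0.0]
    z = [0.9, 0.1, 0.8, 0.3]
    print(emphasized_cross_entropy(x, z, corrupted={0, 1}))
```

Choosing `alpha` larger than `beta` puts more emphasis on reconstructing the components that were corrupted, which is the part of the task that requires the network to capture structure in the data.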

To learn a better latent representation, we present the FIDC architecture and auto-encoder setup in Figure 4. The auto-encoder is designed to reduce the data representation and minimize the reconstruction error through training. The deep-stacked auto-encoder incorporates an objective function that measures the distance between the data and the cluster centers in the latent feature space. The cluster centers and the data representation are optimized iteratively, leading to stable and efficient clustering performance.

We are interested in learning low-dimensional feature vectors from a high-dimensional data space in order to perform better clustering. One essential aspect of the clustering task is the means of measuring distance (or dissimilarity). Another important aspect is the feature space in which those measurements are performed. With this aim, we initialize our framework in a manner that facilitates learning a more compact latent space.

In the encoder architecture, the number of neurons in the first hidden layer of the deep-stacked auto-encoder exceeds the number of neurons in the input layer. One motivation is biological: the human brain comprises billions of neurons arranged hierarchically to process stimuli, which is one of the reasons for studying deep learning. The sparsity constraint has been used to justify stacking deep networks with many neurons per layer, similar to biological networks. Another argument for sparsity also comes from findings about the human brain, which suggest that neuronal activity is sparse: neurons have low activations, and many of them do not fire at all for a given stimulus. In deep learning, many theoretical frameworks likewise suggest that having more neurons in the hidden layer allows the model to learn better representations of the input data and improves model efficiency.

Furthermore, this concept is motivated by sparse coding [41] and ICA [43]. Sparse auto-encoders are distinct from normal auto-encoders: they use a KL divergence penalty to push each individual sample to be represented within a limited number of active dimensions.

Based on these considerations, we initialize our model with carefully chosen parameters for pre-training the network, which helps to extract a valuable representation of the data through layer-wise training of a deep-stacked auto-encoder. The properties of this type of neural network, such as the number of hidden layers, the number of nodes per layer, and the deviation of the noise, are all influential factors [13]. The first hidden layer of the network has 8192 nodes, so its dimension is significantly greater than that of the input vector, to prevent under-learning and over-fitting of the model [20,44]. Although over-fitting may not matter for data compression, it becomes crucial for discovering meaningful patterns in the data distribution [44].

Moreover, in training the deep-stacked auto-encoder, the Vincent model [44] is used to optimize the feature representation and improve the classifier’s performance. The model is initialized with appropriate weights, and the first hidden layer is trained using noisy input. Previous models such as DEC, IDEC, and Parametric t-SNE [16,17,30] adopt the experimental setup of Salakhutdinov and Hinton (2007) [45]. However, their network initialization with the dimensions (input–500–500–2000–d) was found to be suboptimal on our dataset. Our model’s bottleneck is determined by the compressed representation obtained by passing the high-dimensional data through the entire network. Through this process, a compressed representation of 256 nodes was found to give better performance [20].

3.3. Joint Clustering

In the original data space, the data are not well separated because of large intra-class variance, as shown in Figure 5. Traditional clustering algorithms perform poorly because they cannot provide a nonlinear transformation from the original data space to a latent feature space.

To resolve this problem, we use the deep-stacked auto-encoder, a kind of deep neural network, to map the original data space to an embedding that is suitable for clustering. In the FIDC network, the encoder is a nonlinear mapping function, and the decoder controls the reconstruction error and demands high accuracy. We perform this task iteratively, which ensures that the nonlinear mapping function is effective and stable for representing data in the latent space. The illustration is shown in Figure 5.

The clustering layer in the proposed method transforms the original input feature $x$ into soft labels $q_{ij}$. There are $k$ clusters, $n$ data points, and initial cluster centers $\mu_{j}$ calculated using K-means. Clustering is performed on the compressed feature representation $Y$, which has a much lower dimension than the original input $x$. The similarity between a cluster center $\mu_{j}$ and an embedded point $y_{i}$, i.e., the embedding corresponding to $x_{i}$, is measured using Student’s t-distribution [16,17,30]. This results in a soft assignment $q_{ij}$. The clusters are refined iteratively by learning from high-confidence assignments through the use of an auxiliary target distribution, as illustrated in Figure 6.

$$q_{ij} = \frac{(1 + \|y_{i} - \mu_{j}\|^{2})^{-1}}{\sum_{j'} (1 + \|y_{i} - \mu_{j'}\|^{2})^{-1}} \tag{6}$$

Additionally, the target distribution $p_{ij}$ is defined as

$$p_{ij} = \frac{q_{ij}^{2} / \sum_{i} q_{ij}}{\sum_{j'} (q_{ij'}^{2} / \sum_{i} q_{ij'})} \tag{7}$$
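Equations (6) and (7) translate directly into code. The sketch below is illustrative only: the function names `soft_assign` and `target_distribution` are hypothetical, and embeddings and cluster centers are represented as plain lists of floats to avoid external dependencies.

```python
def soft_assign(Y, centers):
    """Eq. (6): Student's t kernel between each point and each center, normalized per point."""
    Q = []
    for y in Y:
        kernel = [1.0 / (1.0 + sum((a - c) ** 2 for a, c in zip(y, mu))) for mu in centers]
        total = sum(kernel)
        Q.append([v / total for v in kernel])
    return Q


def target_distribution(Q):
    """Eq. (7): square q_ij, divide by the soft cluster frequency, renormalize per point."""
    freq = [sum(row[j] for row in Q) for j in range(len(Q[0]))]
    P = []
    for row in Q:
        weights = [q * q / f for q, f in zip(row, freq)]
        total = sum(weights)
        P.append([w / total for w in weights])
    return P


if __name__ == "__main__":
    Y = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
    centers = [[0.0, 0.0], [5.0, 5.0]]
    Q = soft_assign(Y, centers)
    P = target_distribution(Q)
    print(Q[0], P[0])
```

Squaring the soft assignments makes the target distribution sharper than $q_{ij}$ for confident points, and dividing by the soft cluster frequency keeps large clusters from dominating the target.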

We define our objective as the KL divergence loss between the target distribution $p_{ij}$ and the soft assignments $q_{ij}$. The encoder is fine-tuned by optimizing this objective, and the predicted label for $x_{i}$ is $\arg\max_{j} q_{ij}$.

$$L = KL(P \| Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \tag{8}$$

where $KL$ denotes the Kullback–Leibler divergence, which measures the non-symmetric difference between the soft assignments $q_{i}$ and the target distributions $p_{i}$.
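The clustering loss of Equation (8) and the label prediction rule can be sketched as below. The function names are hypothetical, and the inputs are assumed to be row-stochastic matrices stored as lists of lists, with one row per data point and one column per cluster.

```python
import math


def kl_clustering_loss(P, Q):
    """Eq. (8): sum over points i and clusters j of p_ij * log(p_ij / q_ij)."""
    total = 0.0
    for p_row, q_row in zip(P, Q):
        for p, q in zip(p_row, q_row):
            if p > 0.0:  # a term with p_ij = 0 contributes nothing
                total += p * math.log(p / q)
    return total


def predict_labels(Q):
    """Hard label for each point: the cluster index j with the largest q_ij."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]


if __name__ == "__main__":
    Q = [[0.7, 0.3], [0.2, 0.8]]
    P = [[0.9, 0.1], [0.1, 0.9]]
    print(kl_clustering_loss(P, Q), predict_labels(Q))
```

The loss is zero only when the soft assignments already match the target distribution, so minimizing it pulls each embedded point toward the cluster it is most confidently assigned to.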

3.4. Optimization

We fine-tune the network’s weights using back-propagation and the stochastic gradient descent (SGD) optimization method. The gradients of $L$ with respect to each embedded point $y_{i}$ and each cluster center $\mu_{j}$ are computed as follows [12]:

$$\frac{\partial L}{\partial y_{i}} = 2 \sum_{j} (1 + \|y_{i} - \mu_{j}\|^{2})^{-1} (p_{ij} - q_{ij}) (y_{i} - \mu_{j}) \tag{9}$$

$$\frac{\partial L}{\partial \mu_{j}} = -2 \sum_{i} (1 + \|y_{i} - \mu_{j}\|^{2})^{-1} (p_{ij} - q_{ij}) (y_{i} - \mu_{j}) \tag{10}$$
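The gradients of the clustering loss can be checked against finite differences. The sketch below is a verification aid under stated assumptions, not the training code: the target distribution $P$ is held fixed while differentiating, the analytic expressions include the constant factor 2 that arises when differentiating Equation (8) with the kernel of Equation (6), and all function names are hypothetical.

```python
import math


def soft_assign(Y, centers):
    """Eq. (6): row-normalized Student's t kernel between points and centers."""
    Q = []
    for y in Y:
        kernel = [1.0 / (1.0 + sum((a - c) ** 2 for a, c in zip(y, mu))) for mu in centers]
        total = sum(kernel)
        Q.append([v / total for v in kernel])
    return Q


def kl_loss(P, Q):
    """Eq. (8), with the target distribution P treated as a constant."""
    return sum(p * math.log(p / q)
               for pr, qr in zip(P, Q) for p, q in zip(pr, qr) if p > 0.0)


def analytic_gradients(Y, centers, P):
    """Analytic gradients of the loss w.r.t. every embedded point and every center."""
    Q = soft_assign(Y, centers)
    dim = len(Y[0])
    grad_y = [[0.0] * dim for _ in Y]
    grad_mu = [[0.0] * dim for _ in centers]
    for i, y in enumerate(Y):
        for j, mu in enumerate(centers):
            kernel = 1.0 / (1.0 + sum((a - c) ** 2 for a, c in zip(y, mu)))
            coeff = 2.0 * kernel * (P[i][j] - Q[i][j])
            for d in range(dim):
                grad_y[i][d] += coeff * (y[d] - mu[d])   # gradient for y_i
                grad_mu[j][d] -= coeff * (y[d] - mu[d])  # gradient for mu_j
    return grad_y, grad_mu


def numeric_gradient(loss_fn, params, h=1e-6):
    """Central finite differences of loss_fn with respect to a list of vectors.

    params is perturbed in place and restored after each evaluation.
    """
    grads = []
    for r in range(len(params)):
        row = []
        for d in range(len(params[r])):
            original = params[r][d]
            params[r][d] = original + h
            up = loss_fn()
            params[r][d] = original - h
            down = loss_fn()
            params[r][d] = original
            row.append((up - down) / (2.0 * h))
        grads.append(row)
    return grads


if __name__ == "__main__":
    Y = [[0.2, -0.1], [1.5, 0.7], [-0.4, 1.1]]
    centers = [[0.0, 0.0], [1.0, 1.0]]
    P = [[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]]
    grad_y, grad_mu = analytic_gradients(Y, centers, P)
    print(grad_y[0], numeric_gradient(lambda: kl_loss(P, soft_assign(Y, centers)), Y)[0])
```

Since a constant factor in the gradient only rescales the effective learning rate, it does not change the direction of the updates, but including it keeps the analytic and numerical values in agreement.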

The cluster centroids $\mu_{j}$ are updated with a learning rate $\lambda = 0.001$, a clustering coefficient $\gamma = 0.1$, and a batch size $m = 256$:

$$\mu_{j} = \mu_{j} - \frac{\lambda}{m} \sum_{i=1}^{m} \frac{\partial L_{c}}{\partial \mu_{j}} \tag{11}$$

The encoder weights $W$ (input to hidden layer) and the decoder weights $W'$ of the auto-encoder are updated as follows:

$$W = W - \frac{\lambda}{m} \sum_{i=1}^{m} \left( \frac{\partial L_{r}}{\partial W} + \gamma \frac{\partial L_{c}}{\partial W} \right) \tag{12}$$

$$W' = W' - \frac{\lambda}{m} \sum_{i=1}^{m} \frac{\partial L_{r}}{\partial W'} \tag{13}$$

where $L_{r}$ is the reconstruction loss, $L_{c}$ is the clustering loss, $m$ is the batch size, and $\lambda$ is the learning rate. The optimization stops when a predefined tolerance limit is reached. The proposed FIDC algorithm is described in Algorithm 1.
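The mini-batch updates of Equations (11)–(13) share one pattern: average the per-sample gradients over the batch and take a step of size $\lambda$. The sketch below shows that pattern under simplifying assumptions: each parameter is flattened to a vector, the per-sample gradients are supplied by the caller, and the function names are hypothetical.

```python
def batch_mean(per_sample_grads):
    """Average a list of per-sample gradient vectors over the mini-batch."""
    m = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[d] for g in per_sample_grads) / m for d in range(dim)]


def sgd_step(param, per_sample_grads, lr):
    """param <- param - (lr / m) * sum of per-sample gradients, as in Eqs. (11) and (13)."""
    mean_grad = batch_mean(per_sample_grads)
    return [p - lr * g for p, g in zip(param, mean_grad)]


def joint_grads(recon_grads, cluster_grads, gamma):
    """Per-sample encoder gradient of Eq. (12): dL_r/dW + gamma * dL_c/dW."""
    return [[r + gamma * c for r, c in zip(gr, gc)]
            for gr, gc in zip(recon_grads, cluster_grads)]


if __name__ == "__main__":
    mu = [1.0, 2.0]
    grads = [[1.0, 0.0], [3.0, 2.0]]  # two samples in the batch
    print(sgd_step(mu, grads, lr=0.1))
```

For the encoder weights, the output of `joint_grads` would be passed to `sgd_step`, so that the clustering coefficient $\gamma$ controls how strongly the clustering loss influences the encoder relative to the reconstruction loss.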

Algorithm 1: FIDC Algorithm

Input: Visual features of fashion images.
Output: Clusters of images.
1: For each image, extract the CNN-F feature $x$ (4096 dimensions);
2: Initialize the cluster centroids $\mu_{j}$, encoder weights $W$, and decoder weights $W'$;
3: Set MaxIteration;
4: Learn the condensed feature representation $Y$;
5: While $i <$ MaxIteration (starting from $i = 0$) do
6: Calculate all latent points $Y$;
7: Assign labels $l$ to $x_{i}$ as $\arg\max_{j} q_{ij}$;
8: Update the target distribution $p_{ij}$, which is derived from $q_{ij}$, using Equations (6) and (7);
9: Update all condensed representation points $y_{i}$;
10: Save the last labels as $l_{last} = l$;
11: Compute new labels $l$ for $x_{i}$;
12: If $\mathrm{sum}(l_{last} \neq l) / n < \delta$ then // $\delta$ is the stopping threshold
13: Stop training;
14: Update the cluster centroids $\mu_{j}$, encoder weights $W$, and decoder weights $W'$ using Equations (11)–(13), respectively;
15: End while
Return: Clusters of images.
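The stopping test in step 12 of Algorithm 1 can be sketched as follows, assuming it compares the fraction of points whose label changed between two consecutive iterations with the threshold $\delta$. The function names are hypothetical.

```python
def label_change_fraction(previous, current):
    """Fraction of data points whose cluster label differs between two iterations."""
    if len(previous) != len(current):
        raise ValueError("label lists must have the same length")
    changed = sum(1 for a, b in zip(previous, current) if a != b)
    return changed / len(current)


def should_stop(previous, current, delta):
    """Stop training once fewer than a fraction delta of the labels have changed."""
    return label_change_fraction(previous, current) < delta


if __name__ == "__main__":
    last_labels = [0, 1, 1, 2]
    new_labels = [0, 1, 2, 2]
    print(label_change_fraction(last_labels, new_labels), should_stop(last_labels, new_labels, 0.3))
```

A small $\delta$ keeps training until the assignments are nearly stable, while a larger value stops earlier at the cost of less refined clusters.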

FIDC’s time complexity is $O(nD^{2} + ndk)$, where $D$ is the maximum number of neurons in the hidden layers, $d$ is the dimension of the condensed representation, and $k$ is the number of clusters. Because $D \geq d \geq k$ commonly holds, the time complexity reduces to $O(nD^{2})$.

4. Experimental Design

4.1. Dataset

To evaluate the efficiency of our methodology, we adopt a high-dimensional real-world fashion dataset that we developed from the Amazon web store [10,46]. In addition, we evaluate the proposed method on a handwritten digit dataset (MNIST) and the Fashion-MNIST dataset to assess its performance on different types of image data. Incorporating these datasets helps demonstrate the method’s robustness and versatility and its potential on other datasets, even those outside the fashion domain.

Amazon-Fashion: Our first dataset consists of clothing items crawled from Amazon.com. It contains high-quality product images, which are typically centered on a white background. The Amazon dataset also includes descriptions, tags, and reviews of the products. First, we extract a subset of the men’s clothing dataset containing all fashion categories. We then split the men’s clothing items into specific category labels based on tags. Figure 7 illustrates the experimental dataset.

In the experimental setup, we consider eight randomly selected categories. We used image features for model training in an unsupervised manner and evaluated clustering performance against the tag-based labels.

Fashion-MNIST: A dataset of Zalando article images comprising 70,000 examples, each of which is a 28 × 28 grayscale image classified into ten categories [47].

MNIST: A dataset of handwritten digits in 10 classes, comprising 70,000 grayscale images of 28 × 28 pixels [48].

4.2. Image Feature Extraction

We utilized a pre-trained network based on the CNN-F architecture to extract visual features of fashion images [39]. CNN-F is similar to the model proposed by Krizhevsky [49]. The CNN was trained on a dataset containing over a thousand object categories from the ImageNet database. The extracted features therefore capture high-level characteristics of fashion images, e.g., shape, color, and texture. We found that the visual features extracted by this model are appropriate for showing the effectiveness of the FIDC model.

Table 2 shows the details of the architecture. We utilized the features extracted by the second fully connected layer, which outputs a 4096-dimensional vector [10,46]. In contrast, the last fully connected layer has k dimensions, as is usual for the general-purpose task. During the experiments, the weight decay parameter is set to 10⁻³ and the dropout probability is set to 0.5. (In Table 2, pad denotes spatial padding, and st represents stride.)

4.3. Model Training

When training the deep-stacked auto-encoder, we use the 4096-dimensional vector, with values in [0, 1], as the input layer. The output of each dense layer becomes the input of the following layer. Our architecture has three hidden layers. We compile the model with the stochastic gradient descent optimizer and the mean squared error (mse) as the loss function. The SGD parameters are lr = 1 and momentum = 0.9. We use the ReLU activation function for all layers and train the model for 100 epochs with a batch size of 250. See Table 3 for the architecture and parameter details.

4.4. Evaluation Metrics

We assess the effectiveness of our clustering model using three widely used evaluation metrics: Clustering Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI) [50]. ACC and NMI range from zero to one, where a higher value reflects better performance and zero represents the lowest possible performance. ARI ranges from −1 to 1, with values near 1 indicating strong agreement with the ground truth, values near 0 indicating chance-level agreement, and negative values indicating worse-than-chance agreement.

ACC = \max_{m} \frac{\sum_{i=1}^{N} \mathbf{1}\{c_{i} = m(c_{i}^{'})\}}{N}

(14)

NMI = \frac{I(c, c^{'})}{(H(c) + H(c^{'})) / 2}

(15)

ARI = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_{i} \binom{a_{i}}{2} \sum_{j} \binom{b_{j}}{2}\right] / \binom{n}{2}}{\frac{1}{2} \left[\sum_{i} \binom{a_{i}}{2} + \sum_{j} \binom{b_{j}}{2}\right] - \left[\sum_{i} \binom{a_{i}}{2} \sum_{j} \binom{b_{j}}{2}\right] / \binom{n}{2}}

(16)

where $c_{i}^{'}$ is the clustering assignment and $c_{i}$ is the ground-truth label, N is the number of data points, m(·) ranges over the possible one-to-one mappings between clusters and labels, and 1(·) is an indicator function. In Equation (15), H denotes the entropy and I(·) denotes the mutual information. In Equation (16), $n_{ij}$, $a_{i}$, and $b_{j}$ are the contingency table values [51], and n = N is the total number of data points.
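To make Equations (14)–(16) concrete, the sketch below computes the three metrics with the Python standard library only. It is an illustrative implementation under stated assumptions, not the evaluation code used in the paper: the function names are hypothetical, and the ACC mapping m(·) is found by brute force over permutations, which assumes a small number of clusters and no more clusters than ground-truth labels.

```python
from collections import Counter
from itertools import permutations
from math import comb, log


def clustering_accuracy(truth, pred):
    """ACC, Eq. (14): best one-to-one cluster-to-label mapping, by brute force."""
    labels, clusters = sorted(set(truth)), sorted(set(pred))
    pairs = Counter(zip(truth, pred))  # missing pairs count as 0
    best = 0
    for perm in permutations(labels, len(clusters)):
        hits = sum(pairs[(lab, clu)] for lab, clu in zip(perm, clusters))
        best = max(best, hits)
    return best / len(truth)


def _entropy(counts, n):
    return -sum(c / n * log(c / n) for c in counts if c)


def normalized_mutual_info(truth, pred):
    """NMI, Eq. (15): I(c, c') / ((H(c) + H(c')) / 2)."""
    n = len(truth)
    a, b = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    mi = sum(nij / n * log(n * nij / (a[i] * b[j]))
             for (i, j), nij in joint.items())
    denom = (_entropy(a.values(), n) + _entropy(b.values(), n)) / 2
    return mi / denom if denom else 1.0  # both partitions trivial


def adjusted_rand_index(truth, pred):
    """ARI, Eq. (16), from the contingency counts n_ij, a_i, b_j."""
    n = len(truth)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(truth).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


# Toy example: cluster ids are arbitrary, so a relabelled perfect
# clustering still scores 1 on all three metrics.
truth = [0, 0, 1, 1, 2, 2]
pred = [2, 2, 0, 0, 1, 1]
scores = (clustering_accuracy(truth, pred),
          normalized_mutual_info(truth, pred),
          adjusted_rand_index(truth, pred))
```

For larger numbers of clusters, the brute-force search in `clustering_accuracy` becomes impractical; the Hungarian algorithm is the usual replacement.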

5. Results and Discussions

To evaluate the performance of our fully unsupervised model in creating clusters of real-world fashion images, we compare our results with those of other approaches found in the literature. The first comparison is with k-means applied directly to the CNN features. We also compare against conventional, variational, and adversarial auto-encoders, each combined with k-means clustering. Finally, we compare against Deep Embedded Clustering (DEC) [16], its improved version (IDEC) [17], Variational Deep Embedding (VaDE) [34], the Deep Clustering Network (DCN) [18], and joint unsupervised learning for image clustering (JULE) [33]. The results are shown in Table 4.

Our model, FIDC, achieves an accuracy of 84% on the Amazon-Fashion dataset, outperforming the other models in ACC and ARI; its NMI (0.756) is second only to AE + k-means (0.765). As a baseline test, the same network was applied with optimal parameters to the MNIST handwritten digit and Fashion-MNIST datasets, resulting in 94% and 68% accuracy, respectively. Figure 8 compares FIDC’s performance on the Amazon-Fashion dataset with other commonly used techniques, such as DEC, IDEC, DCN, VaDE, and JULE. The experimental results are also presented visually in Figure 9, where each visual space represents a single cluster.

Our model, FIDC, shows improved results compared to other methods studied. The efficiency of FIDC is commendable, and its latent space representation is relatively small compared to the input. The improved feature learning and compact latent space representation and distribution contribute to the superior performance of FIDC. The convergence of the model is rapid and consistent. After reaching 100 epochs, FIDC demonstrates state-of-the-art efficiency.

We use the confusion matrix, also known as an error matrix, to visually represent the clustering results. Figure 10 displays this matrix, where the rows represent the actual labels and the columns represent the labels assigned by clustering. Because clustering is an unsupervised learning technique, it has no pre-existing labels or target outputs for the data points, and cluster indices carry no inherent correspondence to class labels. As a result, the largest numbers in the matrix are not found on the diagonal. In this specific case, we matched the colors of clusters and classes using our knowledge of the actual labels, even though the cluster indices did not line up exactly.
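The effect of arbitrary cluster indices on the confusion matrix can be illustrated with a small sketch. The code below is a hypothetical example with made-up toy labels, not the procedure used to produce Figure 10: it builds a confusion matrix and then reorders the columns so that the diagonal sum is maximal, using brute force over column permutations (an assumption that only suits a small number of clusters).

```python
from itertools import permutations


def confusion_matrix(truth, pred, k):
    """Rows are true labels, columns are cluster ids, both in range(k)."""
    matrix = [[0] * k for _ in range(k)]
    for t, p in zip(truth, pred):
        matrix[t][p] += 1
    return matrix


def align_columns(matrix):
    """Reorder columns so that the diagonal sum is as large as possible."""
    k = len(matrix)
    best = max(permutations(range(k)),
               key=lambda order: sum(matrix[i][order[i]] for i in range(k)))
    return [[row[j] for j in best] for row in matrix]


# Toy labels: the clustering is mostly right, but its ids are shifted.
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred = [1, 1, 1, 2, 2, 0, 0, 0, 0]
raw = confusion_matrix(truth, pred, k=3)
aligned = align_columns(raw)
```

In `raw` the diagonal is empty even though eight of the nine points are grouped correctly; after alignment those counts move onto the diagonal, which is the form in which a confusion matrix is easiest to read.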

The visualization of the clustering results for a subset of the Amazon dataset is depicted in Figure 11, where different colors signify different clusters. The t-SNE results indicate that the form of each cluster is largely preserved.

In practice, the number of natural clusters is often unknown, so a method for determining the optimal number of clusters is necessary. To allow fair comparison with previously published results, this paper assumes that the number of clusters is given and does not use supervised pre-trained models. It also follows the training and testing methodologies outlined in prior publications to keep the comparison consistent. Future studies could use non-parametric Bayesian methods, such as the Dirichlet process mixture model, to improve the clustering model and eliminate the need for a preset number of clusters.

The performance of the FIDC approach is highly dependent on the optimization step. Different optimization strategies can have varying effects on the outcomes: the choice of optimizer can affect the convergence rate, stability, solution quality, and latent representation. A gradient-based optimization method, such as stochastic gradient descent (SGD), may converge faster, but it may also be more vulnerable to getting stuck in local minima. There are numerous strategies to mitigate this, including careful initialization, alternative optimization techniques, repeated optimization runs, early stopping, and hyperparameter adjustment. Our method uses early stopping: we set a threshold for the improvement in the objective function and stop the optimization when the progress falls below that threshold. Early stopping can be a simple and effective way to avoid local minima.
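The threshold-based stopping rule described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the `min_delta` value, and the `patience` parameter (how many consecutive epochs of insufficient improvement are tolerated) are assumptions introduced for the example.

```python
def early_stopping_epoch(losses, min_delta=1e-3, patience=1):
    """Return the index of the epoch at which training stops, or None.

    Training stops once the loss has failed to improve on the best value
    seen so far by more than `min_delta` for `patience` epochs in a row.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(losses):
        if best - loss > min_delta:
            best = loss   # sufficient progress: remember it and reset
            stale = 0
        else:
            stale += 1    # progress fell below the threshold
            if stale >= patience:
                return epoch
    return None


# Hypothetical loss curve that flattens out after the third epoch.
history = [1.0, 0.5, 0.3, 0.2995, 0.2994]
stop_at = early_stopping_epoch(history, min_delta=1e-3, patience=1)
```

A larger `patience` makes the rule more tolerant of short plateaus, at the cost of a few extra epochs before training halts.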

6. Conclusions and Future Work

We have shown how to train a deep-stacked auto-encoder network that reduces the dimensionality of the data while maintaining its local structure. Our experiments with FIDC on three datasets show that it is superior to other unsupervised clustering methods and models on high-dimensional visual features. In the future, unlike prior metadata-based work, we will use this study to identify and forecast fashion trends based on visual appearance. We also plan to explore how our model can be combined with supervised dimensionality reduction techniques to achieve better generalization performance in semi-supervised learning settings.

Author Contributions

U.S.M. and C.Y. conceived and designed the experiments; U.S.M. performed the experiments; U.S.M. and S.S. analyzed the data; U.S.M. and C.Y. wrote the paper; U.S.M. and A.R. reviewed and edited the paper; J.Z., M.D. and C.Y. supervised, tested, gave comments, and approved this work. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by grants from the Natural Science Foundation of Shanghai (No. 20ZR1402700) and the Natural Science Foundation of China (No. 61472339, 61873337).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Amazon-Fashion dataset can be generated by the following script available at https://github.com/umarsubhanmalhi/Amazon-Fashion (accessed on 15 February 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Han, A.; Kim, J.; Ahn, J. Color Trend Analysis using Machine Learning with Fashion Collection Images. Cloth. Text. Res. J. 2022, 40, 308–324.
2. Zhao, L.; Lee, S.H.; Li, M.; Sun, P. The Use of Social Media to Promote Sustainable Fashion and Benefit Communications: A Data-Mining Approach. Sustainability 2022, 14, 1178.
3. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1967.
4. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
5. Al-Halah, Z.; Stiefelhagen, R.; Grauman, K. Fashion forward: Forecasting visual style in fashion. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
6. Liu, Z.; Luo, P.; Qiu, S.; Wang, X.; Tang, X. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
7. Deldjoo, Y.; Nazary, F.; Ramisa, A.; Mcauley, J.; Pellegrini, G.; Bellogin, A.; Di Noia, T. A review of modern fashion recommender systems. arXiv 2022, arXiv:2202.02757.
8. Jain, V.; Wah, C. Computer Vision in Fashion Trend Analysis and Applications. J. Stud. Res. 2022, 11.
9. Roh, Y.; Heo, G.; Whang, S.E. A survey on data collection for machine learning: A big data-ai integration perspective. arXiv 2018, arXiv:1811.03402.
10. McAuley, J.; Targett, C.; Shi, Q.; Van Den Hengel, A. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, 9–13 August 2015.
11. Keogh, E.; Mueen, A. Curse of dimensionality. Encycl. Mach. Learn. Data Min. 2017, 314–315.
12. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507.
13. Chowdhury, A.M.S.; Rahman, M.S.; Khanom, A.; Chowdhury, T.I.; Uddin, A. On stacked denoising autoencoder based pre-training of ANN for isolated handwritten Bengali numerals dataset recognition. arXiv 2018, arXiv:1812.05758.
14. Hsu, C.-C.; Lin, C.-W. Cnn-based joint clustering and representation learning with feature drift compensation for large-scale image data. IEEE Trans. Multimed. 2017, 20, 421–429.
15. Liu, T.; Lu, Y.; Zhu, B.; Zhao, H. Clustering high-dimensional data via feature selection. Biometrics 2022.
16. Xie, J.; Girshick, R.; Farhadi, A. Unsupervised Deep Embedding for Clustering Analysis. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016.
17. Guo, X.; Gao, L.; Liu, X.; Yin, J. Improved deep embedded clustering with local structure preservation. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017.
18. Yang, B.; Fu, X.; Sidiropoulos, N.D.; Hong, M. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, Sydney, NSW, Australia, 6–11 August 2017.
19. Yan, C.; Malhi, U.S.; Huang, Y.; Tao, R. Unsupervised Deep Clustering for Fashion Images. In International Conference on Knowledge Management in Organizations; Springer: Cham, Switzerland, 2019.
20. Krizhevsky, A.; Hinton, G.E. Using very deep autoencoders for content-based image retrieval. In Proceedings of the European Symposium on Artificial Neural Networks, Bruges, Belgium, 27–29 April 2011.
21. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
22. Papadopoulos, S.-I.; Koutlis, C.; Papadopoulos, S.; Kompatsiaris, I. Multimodal Quasi-AutoRegression: Forecasting the visual popularity of new fashion products. Int. J. Multimed. Inf. Retr. 2022, 11, 717–729.
23. Rasool, A.; Jiang, Q.; Qu, Q.; Ji, C. WRS: A novel word-embedding method for real-time sentiment with integrated LSTM-CNN model. In Proceedings of the 2021 IEEE International Conference on Real-Time Computing and Robotics (RCAR), Xining, China, 15–19 July 2021.
24. Zhang, T.; Ramakrishnan, R.; Livny, M. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Montreal, QC, Canada, 4–6 June 1996.
25. Anand, S.K.; Kumar, S. Experimental comparisons of clustering approaches for data representation. ACM Comput. Surv. (CSUR) 2022, 55, 1–33.
26. Zhu, W.; Lu, J.; Zhou, J. Nonlinear subspace clustering for image clustering. Pattern Recognit. Lett. 2018, 107, 131–136.
27. Dey, S.; Das, S.; Mallipeddi, R. The Sparse MinMax k-Means Algorithm for High-Dimensional Clustering. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Japan, 11–17 July 2020.
28. Pontes, B.; Giráldez, R.; Aguilar-Ruiz, J.S. Biclustering on expression data: A review. J. Biomed. Inform. 2015, 57, 163–180.
29. Montúfar, G. Restricted Boltzmann Machines: Introduction and Review. In Information Geometry and Its Applications IV; Springer: Tokyo, Japan, 2016.
30. Maaten, L.v.d.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
31. Huang, Z.; Ren, Y.; Pu, X.; He, L. Deep Embedded Multi-View Clustering via Jointly Learning Latent Representations and Graphs. arXiv 2022, arXiv:2205.03803.
32. Cai, J.; Wang, S.; Xu, C.; Guo, W. Unsupervised deep clustering via contractive feature representation and focal loss. Pattern Recognit. 2022, 123, 108386.
33. Yang, J.; Parikh, D.; Batra, D. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
34. Jiang, Z.; Zheng, Y.; Tan, H.; Tang, B.; Zhou, H. Variational deep embedding: An unsupervised and generative approach to clustering. arXiv 2016, arXiv:1611.05148.
35. Caron, M.; Bojanowski, P.; Joulin, A.; Douze, M. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
36. Chang, J.; Wang, L.; Meng, G.; Xiang, S.; Pan, C. Deep adaptive image clustering. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
37. Ren, Y.; Hu, K.; Dai, X.; Pan, L.; Hoi, S.C.; Xu, Z. Semi-supervised deep embedded clustering. Neurocomputing 2019, 325, 121–130.
38. Guo, X.; Zhu, E.; Liu, X.; Yin, J. Deep embedded clustering with data augmentation. In Proceedings of the Asian Conference on Machine Learning, Beijing, China, 14–16 November 2018.
39. Chatfield, K.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Return of the devil in the details: Delving deep into convolutional nets. arXiv 2014, arXiv:1405.3531.
40. Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.-A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010, 11, 3371–3408.
41. Olshausen, B.A.; Field, D.J. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vis. Res. 1997, 37, 3311–3325.
42. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011.
43. Bell, A.J.; Sejnowski, T.J. The “independent components” of natural scenes are edge filters. Vis. Res. 1997, 37, 3327–3338.
44. Sarle, W.S. Stopped training and other remedies for overfitting. Comput. Sci. Stat. 1996, 352–360.
45. Salakhutdinov, R.; Hinton, G. Learning a non-linear embedding by preserving class neighbourhood structure. In Proceedings of the International Conference on Artificial Intelligence and Statistics, San Juan, Puerto Rico, 21–24 March 2007.
46. He, R.; McAuley, J. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, Montreal, QC, Canada, 11–15 April 2016.
47. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747.
48. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
49. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2012.
50. Cai, D.; He, X.; Han, J. Locally Consistent Concept Factorization for Document Clustering. IEEE Trans. Knowl. Data Eng. 2011, 23, 902–913.
51. Santos, J.M.; Embrechts, M. On the use of the adjusted rand index as a metric for evaluating supervised classification. In International Conference on Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 2009.

Figure 1. An instance of item representation in different feature spaces. (a) Data points are comparatively closely packed along one axis. (b) Data points in a two-dimensional space are stretched in the direction of the longitudinal axis, so the distances between points become larger. (c) Data points are stretched further in a three-dimensional space.

Figure 2. Illustration of the FIDC method. It is made up of four parts: input, encoder, clustering, and decoder. The original input $x$ is the CNN feature of a fashion image. The noisy input $\tilde{x}$ is obtained by adding noise to the original input. The encoder function computes a latent representation $Y$, and the clustering task is executed on $Y$.

Figure 3. Illustration of the denoising auto-encoder. An example $x$ is randomly corrupted to $\tilde{x}$; the encoder function $f_{e}$ then maps it to $Y$, and the decoder function $f_{d}$ tries to reconstruct $z$. The reconstruction error $L_{r}(x, z)$ is measured in terms of a loss function.

Figure 4. Optimized architecture of the deep-stacked auto-encoder. Following the input direction, the first hidden layer of the encoder is a network of 8192 neurons. We then reduce the dimensions to 256 neurons, i.e., $Y$, through two other hidden layers. In the decoder section, we reconstruct $z$, ideally the same as the input $x$, using the same number of neurons and layers in the opposite direction.

Figure 5. An instance of the clustering layer. The left box shows the original data space where we can see the different positions of digits. Due to huge intra-variance, it seems complicated to perform a clustering task. These digits are mapped to some latent feature spaces. The oval in the upper right corner displays a linear mapping, and the one in the lower right corner shows a nonlinear mapping.

Figure 6. The deep-stacked auto-encoder-based clustering. On the left, data representation is computed using an auto-encoder. On the right, data evolution in the feature space occurs after constraints are imposed on the feature layer. The red numerals represent the former clusters, and the green numerals signify the updated clusters in the feature space.

Figure 7. Demonstration of 10 randomly selected visual spaces of men’s clothing. Each row shows one visual space.

Figure 8. Performance comparison of Amazon-Fashion clustering.

Figure 9. t-SNE representation of the Amazon-Fashion dataset reduced to two dimensions, revealing four distinct visual spaces, each representing a single class.

Figure 10. The Confusion Matrix illuminates the FIDC model’s performance on the Amazon-Fashion dataset.

Figure 11. The t-SNE visualization shows the clustering of the Amazon dataset in 2 dimensions (tSNE_1 and tSNE_2) after reducing the dataset from 256 dimensions. Different colors represent different clusters.

Table 1. Notations and explanations.

Notation	Explanation
$Y$	Feature representation
$W$	Input to hidden layer weight
$b$	Bias
$f_{e}$	Activation function, Relu
$W^{'}$	Tied weights
$f_{d}$	Decoding function
$F$	The dimensionality of deep CNN feature
$x$	Original input
$\tilde{x}$	Noise input
$z$	Reconstructed input
$\mu_{j}$	Cluster centroid

Table 2. Parameters of CNN-F.

Conv1	Conv2	Conv3	Conv4	Conv5	Full6	Full7	Full8
64 × 11 × 11	256 × 5 × 5	256 × 3 × 3	256 × 3 × 3	256 × 3 × 3	4096	4096	k
st. 4, pad 0, ×2 pool	st. 1, pad 2, ×2 pool	st. 1, pad 1	st. 1, pad 1	st. 1, pad 1, ×2 pool	dropout	dropout	-

Table 3. Summary of deep-stacked auto-encoder training models.

Layer (Type)	Output Shape	Param
input (InputLayer)	(None, 4096)	0
encoder_0 (Dense)	(None, 8192)	33,562,624
encoder_1 (Dense)	(None, 2048)	16,779,264
encoder_2 (Dense)	(None, 512)	1,049,088
encoder_3 (Dense)	(None, 256)	131,328
decoder_3 (Dense)	(None, 512)	131,584
decoder_2 (Dense)	(None, 2048)	1,050,624
decoder_1 (Dense)	(None, 8192)	16,785,408
decoder_0 (Dense)	(None, 4096)	33,558,528
Total params: 103,048,448 Trainable params: 103,048,448 Non-trainable params: 0
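The parameter counts in Table 3 can be checked by hand: a dense layer with n_in inputs and n_out outputs has n_in × n_out weights plus n_out biases. The short sketch below reproduces the per-layer and total counts from the output shapes listed in the table; the function and variable names are illustrative and do not come from the paper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Layer widths read off the "Output Shape" column of Table 3,
# from the 4096-d input through the 256-d bottleneck and back.
widths = [4096, 8192, 2048, 512, 256, 512, 2048, 8192, 4096]
per_layer = [dense_params(a, b) for a, b in zip(widths, widths[1:])]
total = sum(per_layer)
```

The two widest layers (4096 ↔ 8192) account for roughly two thirds of all parameters, so the 8192-neuron first hidden layer dominates the memory footprint of the model.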

Table 4. Clustering performance of various methods on open-source datasets expressed in terms of accuracy (ACC), normalized mutual information (NMI), and adjusted Rand index (ARI). All baseline method results were obtained from the corresponding code or original papers. The mark “-” indicates that a result is not available.

Models	Amazon-Fashion			MNIST			Fashion-MNIST
Models	ACC	NMI	ARI	ACC	NMI	ARI	ACC	NMI	ARI
K-means	0.587	0.705	0.481	0.532	0.499	0.365	0.474	0.499	0.512
AE + k-means	0.749	0.765	0.617	0.815	0.785	0.720	0.482	0.556	0.390
VAE + k-means	0.743	0.685	0.615	0.793	0.716	-	0.485	0.513	-
AAE + k-means	0.776	0.672	0.632	0.813	0.743	-	0.512	0.534	-
DEC [16]	0.780	0.678	0.590	0.865	0.827	0.740	0.516	0.546	0.419
IDEC [17]	0.791	0.716	0.633	0.884	0.867	0.881	0.529	0.557	0.428
DCN [18]	0.723	0.699	0.614	0.831	0.825	0.756	0.531	0.571	0.386
VaDE [34]	0.801	0.745	0.662	0.945	0.876	-	0.578	0.630	-
JULE [33]	0.809	0.752	0.679	0.964	0.913	-	0.563	0.608	-
FIDC (Proposed)	0.843	0.756	0.694	0.940	0.887	0.893	0.681	0.692	0.561

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
