SimCDL: A Simple Framework for Contrastive Dictionary Learning

Ilie-Ablachim, Denis C.; Dumitrescu, Bogdan

doi:10.3390/app142210082

Open AccessArticle

SimCDL: A Simple Framework for Contrastive Dictionary Learning

by

Denis C. Ilie-Ablachim

and

Bogdan Dumitrescu

^*

Faculty of Automatic Control and Computers, National University of Science and Technology Politehnica Bucharest, Splaiul Independentei nr. 313, Sector 6, 060042 București, Romania

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(22), 10082; https://doi.org/10.3390/app142210082

Submission received: 26 August 2024 / Revised: 26 October 2024 / Accepted: 29 October 2024 / Published: 5 November 2024

(This article belongs to the Special Issue AI, Machine Learning and Deep Learning in Signal Processing, 2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

:

In this paper, we propose a novel approach to the dictionary learning (DL) initialization problem, leveraging the SimCLR framework from deep learning in a self-supervised manner. Dictionary learning seeks to represent signals as sparse combinations of dictionary atoms, but effective initialization remains challenging. By applying contrastive learning, we encourage similar representations for augmented versions of the same sample while distinguishing between different samples. This results in a more diverse and incoherent set of atoms, which enhances the performance of DL applications in classification and anomaly detection tasks. Our experiments across several benchmark datasets demonstrate the effectiveness of our method for improving dictionary learning initialization and its subsequent impact on performance in various applications.

Keywords:

sparse learning; dictionary learning; contrastive learning

1. Introduction

Dictionary learning (DL) is a branch of signal processing and machine learning that aims to find a frame (or dictionary) for sparsely representing signals as a combination of few elements. The initial basis is unknown and is usually learned from the data. This method has been widely used in multiple fields, such as image and audio processing, inpainting, compression, feature extraction, clustering, and classification.

The DL problem can be formulated as follows:

\begin{matrix} min_{D, X} & {∥ Y - D X ∥}_{F}^{2} \\ s . t . & {∥x_{ℓ}∥}_{0} \leq s, ℓ = 1 : N \\ ∥d_{j}∥ = 1, j = 1 : n, \end{matrix}

(1)

where

Y \in R^{m \times N}

is the matrix that contains N signals of size m, stored compactly as columns,

D \in R^{m \times n}

is named dictionary and is usually overcomplete (

m < n

), and

X \in R^{n \times N}

is the coefficients’ matrix. The column vectors of matrix

D

are named atoms and define the basis vectors used for the linear combination. For each signal sample, only s atoms are used for the representation. The matrix

X

is a sparse matrix that contains the coefficients associated with the atoms used for the linear representation.

The DL problem is typically solved in two optimization stages: dictionary update and sparse coding. At the beginning of the algorithm, the dictionary

D

is randomly initialized with normalized atoms; random initialization does not favor obtaining the best results, but satisfactory results can still be obtained. In the sparse coding stage, the dictionary is considered fixed and all sparse coefficients (the elements of the matrix

X

) are computed for each sample signal. A popular method for sparse coding is Orthogonal Matching Pursuit (OMP) [1]. This greedy algorithm selects the atoms sequentially based on their correlation with the residual representation. In the next stage, the dictionary matrix,

D

, is updated, while the coefficient matrix is considered fixed. Several algorithms are available for updating the dictionary matrix [2]. The K-means Singular Value Decomposition (K-SVD) [3] and Approximate K-SVD (AK-SVD) [4] are among the most used update algorithms. The optimization continues by alternating between these two stages until a stopping criterion is met. In general, two different criteria are used: the maximum number of iterations or the representation error dropping off under a threshold value.

Dictionary learning (DL) can be used in various application problems. Considering the sparse representation capabilities and the light optimization procedure, relevant results can be obtained in practice. However, performance usually depends on the initialization of the dictionary and its properties. For example, in classification problems or discriminative learning, the incoherence of the atoms becomes relevant. Incoherence is the property of the atoms to be far apart one from another; equivalently, their scalar product is small in absolute value. Several methods include discriminative terms in the optimization procedure to meet the incoherence goal. On the other hand, a pre-trained dictionary can be advantageous when prior knowledge about the data is known. In this paper, we address the dictionary initialization problem using a previous problem validated for deep neural networks in a self-supervised manner. This initialization can be used for general problems or classification problems solved with DL.

Contrastive learning is a method that aims to learn similar and dissimilar representations from the data. This method can be used in supervised problems with known sample labels. However, the discrimination problem becomes untractable when dealing with large datasets. A straightforward method for learning instance discrimination is the use of categorical cross-entropy loss. As the dataset increases in size, performing the softmax operation to calculate class probabilities becomes excessively costly. The challenge of using categorical cross-entropy loss, such as discrimination, lies in its computational infeasibility when dealing with large datasets. Researchers have been searching for more efficient ways to approximate this loss. Recent advances are inspired by metric learning, as well as the work in [5,6].

A significant challenge within the instance discrimination framework is the lack of intraclass variability. In traditional supervised learning, there are typically hundreds or thousands of examples per class, which helps the algorithm to learn the inherent variation within each class. However, in many applications, there are only a few examples per class, which clearly hinders the learning process. This issue can be tackled through extensive data augmentation. By applying different transformations to a specific data point, we can generate slightly varied versions while maintaining its fundamental semantic meaning. This approach allows us to learn valuable representations without relying on explicit labels.

To set up the contrast between instances, several views of the inputs are produced using a process

T

and then evaluated in the representation space. For a particular input

y_{i}

, an anchor is calculated as

y^{(a)} \sim T (y_{i})

and then compared to a positive sample

y^{(p)} \sim T (y_{i})

, which is another transformation of the same input or a sample from the same class with the anchor. A negative sample,

y^{(n)} \sim T (y_{j})

, which represents a transformation of a different input, is also contrasted with the anchor. In addition, the process

T

is modified or updated so that it represents positive pairs in a compact form, while negative pairs are projected far apart.

In the general context of Self-Supervised Representation Learning (SSRL), this approach involves a pretext task generator

P

that creates pretext inputs for multiple pairs of raw input instances. These inputs have pseudo-labels that indicate whether the pairs are matching or not.

The anchor and positive and negative samples are then processed by a feature extractor

h (\cdot)

to derive their respective representations:

x^{(a)} = h (y^{(a)})

for the anchor,

x^{(p)} = h (y^{(p)})

for the positive sample, and

x^{(n)} = h (y^{(n)})

for the negative sample. After this, a similarity function denoted by

Φ

is used to evaluate the similarity between pairs of projections. The whole model is subsequently trained to minimize the distance between positive pairs and to maximize the distance between negative pairs. A simple formulation of contrastive learning loss is formulated as follows:

L_{con} = - E [log \frac{Φ (x^{(a)}, x^{(p)})}{Φ (x^{(a)}, x^{(p)}) + \sum_{j = 1}^{k} Φ (x^{(a)}, x^{(n)})}],

(2)

where k represents the number of negative samples that have been used in contrast with the anchor. The training process can also include the transformation process

T

or only the encoder

h (\cdot)

.

Multiple versions of this strategy can be employed within this framework. The methods vary based on the similarity function, the family of transformations

T

, the encoder function

h (\cdot)

, and approaches to the sampling anchor, and positive and negative examples.

There are many unsupervised applications [7] that have been developed in the spirit of Self-Supervised Representation Learning (SSRL). These methods enable the development of generalizable models that have the potential to learn and recognize a wide variety of patterns in the data.

Several strategies are available, such as Momentum Contrast (MoCo) [8], Pretext-Invariant Representation Learning (PIRL) [9], and a Simple Framework for Contrastive Learning of Visual Representations (SimCLR) [10]. MoCo is a framework that uses a queue and a moving average encoder to learn visual representations. In PIRL, invariant representations are learned by solving pretext tasks. SimCLR uses a simple contrastive loss function to learn visual representations. The simple contrastive loss is applied by learning similar representations for augmented versions of the same input image while discriminating representations of dissimilar images.

Contributions. In this paper, we adapt the SimCLR framework to the dictionary learning problem, with the purpose of obtaining more incoherent atoms that are better adapted for DL applications (classification and anomaly detection). The learned atoms can then be used to improve sparse representations, leading to smaller representation errors and better discriminative performances.

The main contribution is reconfiguring the initial SimCRL algorithm in the context of dictionary learning. This includes the substitution of the base encoder network with a dictionary learning problem. The network projection head is no longer used since the encoding and projection are performed using the OMP algorithm. The augmentation procedure was adapted for n-dimensional vectors. For this, we only used four elementary operations, which were changed in the context of dictionary learning. This self-supervised framework is capable of building more incoherent dictionaries, which facilitates a better representation error and has an impact on further supervised and semi-supervised applications.

The use of SimCLR can be beneficial for dictionary learning applications from different perspectives. In many real-world applications, large amounts of unlabeled data are used. SimCLR can learn robust feature representations from the unlabeled data, which can then be used to initialize the dictionary. This initialization can improve the performance of downstream tasks, such as classification, anomaly detection, or clustering, even when labeled data are scarce. On the other hand, the initialization of the dictionary using the SimCLR framework can boost the optimization process. The learning process can start from a more informative and structured point, potentially leading to faster convergence and more stable solutions. This not only enhances the efficiency and effectiveness of the learning process, but also improves the interpretability and stability of the resulting model.

The content of this paper is organized as follows. In Section 2, we introduce the self-supervised contrastive framework in the context of dictionary learning. The augmentation methods required for this framework are included in Section 3. In Section 4, we include several algorithms that have been used for our experiments. These algorithms [11] also aim to respect a discriminatory representation using contrastive learning. Our tests demonstrate that the use of the SimCLR framework is beneficial for dictionary learning algorithms that promote contrastive learning. We then continue with the presentations of our experiments in Section 5. We conducted tests on two mainstream tasks, namely classification (Section 5.1) and anomaly detection (Section 5.2). In the last section, we end up with some conclusions.

2. Contrastive Dictionary Learning

This section explains the Simple framework for Contrastive Dictionary Learning (SimCDL), which is used for dictionary initialization. This framework is developed in the spirit of SimCLR [10], a powerful approach to unsupervised representation learning. At the base of this framework lies the loss function, designed to learn rich and discriminative embeddings from unlabeled data. For the Contrastive Dictionary Learning framework, we consider the same loss function, but we propose a different logic for computing the encodings representation.

In the context of dictionary learning (DL), we apply stochastic data augmentation transformations to generate pairs of correlated signals from the same example, denoted

\tilde{y_{i}}

and

\tilde{y_{j}}

. These two samples are derived from an initial sample

y

. Let

T

represent the space of augmentation operations that can be applied to the samples. For two different random initializations of the augmentation operators, we have

t \sim T

and

t^{'} \sim T

. The next step is to follow an encoding process that aims to maximize the agreement between the two augmented samples in the representation space.

The SimCLR problem has been adapted without using a projection head or a base encoder. Instead, the dictionary

D

is used directly to calculate the encodings using an OMP procedure. The embeddings are represented by the column vectors of the matrix

X

.

To build a positive pair of encodings, denoted as

x_{i}

and

x_{j}

, we compute the representation coefficients of the two augmented samples,

\tilde{y_{i}}

and

\tilde{y_{j}}

. In addition, contrastive loss is calculated to measure similarities between positive pairs of encodings and discriminate them from negative pairs.

This setup leverages the principles of SimCLR, primarily contrastive learning, in a sparse coding problem. SimCLR is typically used to learn useful representations by maximizing the agreement between different augmented examples of the same data sample via a contrastive loss in the latent space. An example illustrating the SimCDL steps is shown in Figure 1.

To compute the loss function, we randomly select a mini-batch of K examples at each iteration. After that, we create pairs of augmented samples for each of the K examples, resulting in a total of

2 K

data points. By doing so, we do not need to sample negative examples explicitly.

Additionally, we consider other augmented examples as negative samples in relation to the positive pair. After this, we calculate an embedding for each augmented sample using the OMP algorithm

x_{i} = OMP ({\tilde{y}}_{i}, D, s),

(3)

where

D

is the dictionary matrix and s is the sparsity level.

OMP aims to find the best sparse representation of a signal by iteratively selecting the most relevant atoms from the dictionary

D

. Iteratively, the algorithm selects the atom most correlated with the current residual. After the selection is made, the residual is updated by projecting the signal onto the subspace spanned by the selected atoms. This process is repeated until a stopping criterion is met (e.g., a desired sparsity level or error threshold). This whole process substitutes the encoding base network that was previously used in SimCLR.

The similarity of the encodings, obtained with OMP, is computed using the cosine similarity of their normalized feature vectors,

Φ (u, v) = u^{⊤} v / ∥ u ∥ ∥ v ∥

. In contrast, the negative encoding should use other atoms in their sparse representation, leading to incoherent embeddings. The loss for a positive pair of examples is calculated using a softmax function over the similarity scores of all pairs within the mini-batch, scaled by a temperature parameter

τ

:

ℓ (i, j) = - log \frac{exp (Φ (x_{i}, x_{j}) / τ)}{\sum_{k = 1}^{2 K} [1_{[k \neq i]} exp (Φ (x_{i}, x_{k}) / τ)]} .

(4)

The global loss function is summed across all positive pairs, representing the normalized temperature-scaled cross-entropy loss (NT-XEnt). The numerator

exp (Φ (x_{i}, x_{j}) / τ)

encourages positive pairs to be closer, while the denominator

\sum_{k = 1}^{2 K} [1_{[k \neq i]} exp (Φ (x_{i}, x_{k}) / τ)]

introduces competition with all other representations in the batch, treating them as negatives. The contrastive loss can be interpreted as the maximization of the similarity between positive pairs in relation to the similarity between negative pairs. This process effectively forms a distribution over possible pairs, emphasizing the relative similarity of positive pairs over negatives. The temperature parameter

τ

controls the sharpness of the similarity scores. From a mathematical point of view, this parameter affects the relative weighting of similarities. A lower-value

τ

leads to sharper distributions, which heavily penalizes dissimilarities between positive pairs, leading to a stronger focus on very close positives. On the other hand, a higher-value

τ

treats similarities equally, which can lead to the avoidance of overemphasizing the few closest pairs. In general, the temperature term

τ

can be seen as controlling the entropy of the similarity distribution.

Using the NT-XEnt loss function, the full dictionary is updated using a Stochastic Gradient Descent (SGD) procedure, where the gradient is computed using reverse-mode automatic differentiation. The optimization result leads to more diverse quasi-orthogonal atoms that can better represent all the samples available in the training set. This problem is similar to a frame design problem, in which the atoms are designed to represent the samples better. In the context of SimCDL, we want to randomly initialize a dictionary and optimize it following the SGD procedure. Our experiments demonstrate that relevant results can be obtained for small batch sizes and several iterations, leading to better representation errors. The idea of SimCDL is summarized in Algorithm 1.

The contrastive representation learning optimizes asymptotically two main properties: the alignment of features from positive pairs and uniformity of the induced distribution of the features on the hypersphere. In a theoretical study [12], the authors demonstrated that the NT-Xent loss inherently promotes uniformity on the hypersphere. The property of alignment ensures that a pair of positive samples

(x_{i}, x_{j})

pushes them closer together in the latent space. This can be seen as maximizing the similarity

E [Φ (x_{i}, x_{j})] \to \max

, encouraging the alignment of different augmentations from the same sample. The property of uniformity pushes negative pairs apart. In this way, the representations are spread over the entire space,

E_{k \neq i} [Φ (x_{i}, x_{k})] \to \min

, with the aim of achieving a uniform distribution over the unit hypersphere. These properties share similarities with the concept of the incoherence of atoms in dictionary learning. The underlying goal of the problem is the same, but applied in a different context. In dictionary learning, coherence refers to the similarity between different atoms. This suggests that the use of contrastive representation learning should lead to more incoherent dictionaries. The underlying mechanisms of building incoherent atoms is similar to the strategy of contrastive representation learning. Moreover, contrastive learning is related to maximizing the mutual information (MI) [13] between different augmentations of the same sample:

MI (\tilde{x_{i 1}}; \tilde{x_{i 2}}) \geq E [log p (x_{i} | x_{j}) - log p (x_{i})],

(5)

where

\tilde{x_{1}}

and

\tilde{x_{2}}

are different augmentations of

x_{i}

. The maximization of mutual information for the data samples is related to the problem of reducing mutual coherence in the problem of dictionary learning. Since we want to enhance the representation capabilities (mutual information), more diverse atoms are needed, leading to a reduction in mutual coherence.

Algorithm 1 SimCDL: main learning algorithm

Require: batch size K, constant $τ$ , augmentation function T, dictionary $D$ , sparsity constraint s
for sampled minibatch ${y_{k}}_{k = 1}^{K}$ do
for all $k \in {1, \dots, K}$ do
draw two augmentation functions $t \sim T$ , $t^{'} \sim T$
# the first augmentation
${\tilde{y}}_{2 k - 1} = t (y_{k})$
$x_{2 k - 1} = OMP ({\tilde{y}}_{2 k - 1}, D, s)$
# the second augmentation
${\tilde{y}}_{2 k} = t^{'} (y_{k})$
$x_{2 k} = OMP ({\tilde{y}}_{2 k}, D, s)$
end for
for all $i \in {1, \dots, 2 K}$ and $j \in {1, \dots, 2 K}$ do
$Φ (i, j) = x_{i}^{⊤} x_{j} / (∥ x_{i} ∥ ∥ x_{j} ∥)$
end for
compute gradient of $L = \frac{1}{2 K} \sum_{k = 1}^{K} [ℓ (2 k - 1, 2 k) + ℓ (2 k, 2 k - 1)]$ , where ℓ is defined in (4)
update dictionary $D$ using gradient descent
end for
return learned dictionary $D$

The use of SimCDL can be beneficial for the initialization of dictionaries with incoherent atoms or even incoherent sub-dictionaries. In classification problems, with

Y = [Y_{1}, \dots, Y_{c}, \dots, Y_{C}]

representing a set of feature vectors, we want to learn local dictionaries,

D_{c}

, for each class. In general, the initialization problem is not addressed; simple methods like random matrices or a random selection of signals are used for initialization; we will tackle the problem using SimCDL. Considering that a class dictionary,

D_{c} \in R^{m \times N_{c}}

, should achieve good representations for its class, we will further adapt the SimCRL framework for the initialization of dictionaries in classification problems. Since we need C dictionaries, we will further optimize a wide dictionary,

D = [D_{1}, \dots, D_{c}, \dots, D_{C}] \in R^{m \times (C \cdot N)}

. During optimization, the sparsity constraint s is set to N, which is the number of atoms per class. Since we need N atoms, we want to overspecialize enough atoms for each class.

After training the general dictionary, the problem of atom distribution must be solved; each atom of

D

must be assigned to a class c. For this task, we propose two different approaches. The first one is greedy: we loop over each class and search for the atoms that are most used in the sparse representation of the samples of the current class. Since class dictionaries do not have common atoms, once an atom is assigned to a class, it becomes unavailable for other classes. The second approach is based on the linear sum assignment problem [14]. This algorithm solves an optimization problem where the goal is to assign a set of workers to a set of tasks in a one-to-one manner while minimizing the total cost of the assignment. This scenario is similar to the problem of assigning the atom class with respect to their use. During our experiments, we tested both methods and decided to use the second approach.

3. Data Augmentation for Dictionary Learning

This section introduces the main data augmentation transformations that can be applied on a signal sample

y \in R^{m}

. This strategy is widely used in supervised and unsupervised learning.

We follow the example provided in [10], where the authors used augmentation operators such as cropping, resizing, flipping, color distortion (drop and jitter), image rotation (90°, 180°, 270°), cutout, Gaussian noise, Gaussian blur, and Sobel filtering. Since we do not focus on images in DL, we further use only four augmentation operators that have been adapted to signal samples.

3.1. Rotation

The first augmentation operator that we use is rotation. Since we are working with n-dimensional vectors, we need a method to perform rotations in any dimension. We achieve this by applying Rodrigues’ rotation formula [15]. This formula is used to rotate a vector in a multidimensional space in the hyperplane spanned by two orthogonal unit vectors, named

u

and

v

. Notice that for spaces with dimensions larger than three, we do not have a unique axis that can be used for rotations. Different initializations are used for the unit vectors during our experiments each time a rotation is generated. The resulting rotation matrix

R

is defined as

R = I_{m} + sin (α) \cdot (v u^{T} - u v^{T}) + (cos (α) - 1) \cdot (u u^{T} + v v^{T}),

(6)

where

I_{m}

is the identity matrix and

α

is the angle under which the rotation is made.

In a n-dimensional space, a rotation using Rodrigues’ formula is confined to the subspace (hyperplane) defined by the two orthogonal vectors

u

and

v

. These vectors define a plane within the n-dimensional space. The rotation matrix

R

only affects the components of the vector

y

that lie in the plane spanned by

u

and

v

. Any component of

y

orthogonal to this plane remains unchanged. For a given n-dimensional sample

y

, we generate an augmented version by applying a simple rotation, defined as follows:

y \leftarrow R y,

(7)

where

R

is the rotation matrix (6).

3.2. Cutout

In general, the cutout transformation selectively zeros out some input data elements. In our case, the cutout operator refers to a reduced DL problem in which we isolate some features. More precisely, considering

m_{c}

as the number of features that will be further used, with

m_{c} < m

, the DL becomes

y_{I_{c}} = D_{I_{c}} x = \sum_{j = 1}^{n} x_{j} D_{I_{c}, j} = \sum_{j \in S} x_{j} D_{I_{c}, j},

(8)

where the subscript

I_{c}

denotes the restriction of the vectors or matrices to the rows with indices in

I_{c}

. Notice that when the cutout operator is used, only

m_{c}

features of each atom are updated in the update stage of DL, namely those belonging to the set

I_{c}

. Moreover, the atom normalization that follows the update is made with respect to all features. This strategy demonstrates that the cutout operator can obtain more robust dictionaries. In this way, the dictionary focuses more on the underlying structure of the data representation. In addition, the cutout operator makes the dictionary more generalizable, reducing overfitting risks. Figure 2 presents a general illustration of the cutout operator.

3.3. Gaussian Noise

Another augmentation operator is the Gaussian noise, which involves creating a noise vector

ϵ

, where each element is drawn from a Gaussian distribution

N (0, σ^{2})

. The transformation is simple and is performed as follows:

\tilde{y} = y + ϵ,

(9)

where

\tilde{y}

is the new sample vector, and

ϵ

is the noise vector centered in 0 with standard deviation

σ

. This type of augmentation is trivial for developing robust atoms.

3.4. Gaussian Blur

The Gaussian blur operator is obtained by convolving the sample vector

y

with a one-dimensional Gaussian kernel. The kernel

k

is defined as

k_{i} = \frac{1}{\sqrt{2 π σ^{2}}} exp (- \frac{i^{2}}{2 σ^{2}})

(10)

where i ranges over the kernel size, centered at

μ = 0

, the middle of the kernel; and

σ

is the standard deviation of the blur. The resulting vector is

\tilde{y_{i}} = {(k * y)}_{i} = \sum_{j} k_{j} \cdot y_{i - j},

(11)

where ’∗’ is the convolution operator.

4. Algorithms

In this section, we present three algorithms [11] developed for Incoherent Dictionary Learning (IDL) in the context of classification tasks. These algorithms include the standard Incoherent Dictionary Learning (IDL), Incoherent Dictionary Learning via Distance Barrier (IDB), and Incoherent Dictionary Learning via Triplet Distance Barrier (ITDB). Each algorithm is designed to promote incoherence between class dictionaries, thereby enhancing their discriminative power.

4.1. Incoherent Dictionary Learning (IDL)

The Incoherent Dictionary Learning (IDL) problem aims to learn class-specific dictionaries

D_{i}

from a set of feature vectors

Y = [Y_{1}, \dots, Y_{C}]

, where

Y_{c} \in R^{m \times N_{c}}

represents the samples from class c and C is the total number of classes. Each dictionary

D_{c}

is optimized to accurately represent its corresponding class while being poorly aligned with the representations of other classes. The optimization problem is formulated as follows:

\begin{matrix} min_{D_{i}, X_{i}} & \sum_{i = 1}^{C} {∥Y_{i} - D_{i} X_{i}∥}_{F}^{2} + γ \sum_{i = 1}^{C} \sum_{\begin{matrix} j \neq i \end{matrix}} {∥D_{i}^{⊤} D_{j}∥}_{F}^{2} \\ s . t . & {∥x_{i, ℓ}∥}_{0} \leq s, ℓ = 1 : N \\ ∥D_{i, j}∥ = 1, j = 1 : n \end{matrix}

(12)

where

x_{i, ℓ}

is the row vector on which the sparsity constraint is imposed,

d_{i, j}

denotes the jth column of matrix

D_{i}

and

γ > 0

is a trade-off factor that balances reconstruction accuracy with inter-class incoherence. After learning the dictionaries, a test signal

y \in R^{m}

is classified by finding the dictionary with the smallest reconstruction error:

c = \underset{i = 1 : C}{argmin} ∥ y - D_{i} x_{i} ∥, subject to {∥x_{i}∥}_{0} \leq s,

(13)

where s is the sparsity constraint.

This formulation, as introduced in [16], projects the dictionaries into quasi-orthogonal spaces, thereby enhancing class discrimination while preserving their representational capabilities.

The problem in (12) can be solved using an approach similar to the AK-SVD method. During the sparse representation phase, the Orthogonal Matching Pursuit (OMP) is employed as the incoherence terms are independent of the coefficients

X

. In the dictionary update phase, each atom in

D_{i}

is sequentially adjusted to improve the representation of its class while maintaining near orthogonality to atoms of other classes. The focus is on atoms within a certain proximity, ensuring relevant incoherence for classification.

4.2. Incoherent Dictionary Learning via Distance Barrier (IDB)

Incoherent Dictionary Learning via Distance Barrier (IDB) introduces an Incoherent Distance Barrier (IDB) function to discriminate between class dictionaries. This formulation, inspired by work on incoherent frames [17], aims to promote incoherence between classes for classification tasks. The IDB optimization problem is expressed as

\begin{matrix} min_{D_{i}, X_{i}} & \sum_{i = 1}^{C} {∥Y_{i} - D_{i} X_{i}∥}_{F}^{2} + γ \sum_{i = 1}^{C} \sum_{\begin{matrix} j \neq i \end{matrix}} f (D_{i}, D_{j}) \\ s . t . & {∥x_{i, ℓ}∥}_{0} \leq s, ℓ = 1 : N \\ ∥D_{i, j}∥ = 1, j = 1 : n \end{matrix}

(14)

where

f (\cdot)

is a function that enforces incoherence.

The first version of IDB utilizes a distance barrier function:

\begin{matrix} b (D_{i}, D_{j}) = \sum_{k} \sum_{l} [max (0, M - ∥ d_{k}^{(i)} - d_{l}^{(j)} ∥^{2}) + max (0, M - ∥ d_{k}^{(i)} + d_{l}^{(j)} ∥^{2})], \end{matrix}

(15)

where M is a soft margin that indicates the minimum allowable distance between atoms from different classes. This function considers scenarios where atoms are either aligned or opposed. The soft margin M is derived from the mutual coherence constraint

μ

between atoms:

| d_{i}^{T} d_{j} | \leq μ, \forall i \neq j,

(16)

which is equivalent to

\{\begin{matrix} ∥ d_{i} - d_{j} ∥^{2} \geq M \\ ∥ d_{i} + d_{j} ∥^{2} \geq M \end{matrix}, \forall i \neq j,

(17)

where

M = 2 (1 - μ)

.

To further enhance incoherence, the function

f (\cdot)

in (14) includes terms like

f_{j} (d_{j}) = {∥ W_{j} {\bar{D}}_{j}^{T} d_{j} ∥}^{2} + λ b (d_{j}),

(18)

where

{\bar{D}}_{j}

concatenates all dictionaries except the one corresponding to class j, and

W_{j}

is a diagonal weighting matrix defined as

w_{i j}^{2} = max (| d_{i}^{T} d_{j} | / μ, 1) .

(19)

The first term discourages coherence across all atoms, while the barrier function

b (\cdot)

prioritizes updates for nearby atoms. The optimization is carried out using block coordinate descent, updating one atom at a time with a gradient descent approach:

d_{j} \leftarrow d_{j} - γ_{k} g_{j} (d_{j}),

(20)

where the gradient

g_{j} (d_{j})

accounts for both representation error and inter-class incoherence. This equation represents an update step in the optimization procedure, which adjusts

d_{j}

to minimize the objective function; the left side is the new value of the atom, and the right side uses the old value.

4.3. Incoherent Dictionary Learning via Triplet Distance Barrier (ITDB)

Incoherent Dictionary Learning via Triplet Distance Barrier (ITDB) is a variant of the IDB approach, inspired by the triplet loss function [18]. The optimization problem is reformulated as follows:

\begin{matrix} min_{D_{i}, X_{i}} & \sum_{i = 1}^{C} {∥Y_{i} - D_{i} X_{i}∥}_{F}^{2} + γ \sum_{i = 1}^{C} \sum_{\begin{matrix} j \neq i \end{matrix}} \tilde{b} (D_{i}, D_{j}) \\ s . t . & {∥x_{i, ℓ}∥}_{0} \leq s, ℓ = 1 : N \\ ∥D_{i, j}∥ = 1, j = 1 : n \end{matrix}

(21)

where the triplet-based distance barrier function

\tilde{b} (D_{i}, D_{j})

is defined as

\begin{matrix} \sum_{\begin{matrix} (d^{(a)}, d^{(p)}, d^{(n)}) \end{matrix}} [max (0, M + ∥ d^{(a)} - d^{(p)} ∥^{2} - ∥ d^{(a)} - d^{(n)} ∥^{2}) \\ + max (0, M + ∥ d^{(a)} - d^{(p)} ∥^{2} - ∥ d^{(a)} + d^{(n)} ∥^{2})] . \end{matrix}

(22)

In this approach,

d^{(a)}

,

d^{(p)}

, and

d^{(n)}

form triplets of atoms. In this approach,

d^{(a)}

,

d^{(p)}

, and

d^{(n)}

represent the anchor, positive, and negative elements in a triplet of atoms, respectively;

d^{(a)}

(anchor) is the reference point;

d^{(p)}

(positive pair) is similar to the anchor, and the goal is to minimize the distance between them, represented by

∥ d^{(a)} - d^{(p)} ∥^{2}

;

d^{(n)}

(negative pair) is dissimilar to the anchor, and the objective is to maximize the distance between the anchor and negative pair, represented by

∥ d^{(a)} - d^{(n)} ∥^{2}

. The max ensures that these distances are adjusted based on a margin M, promoting better separation between the positive and negative pairs. The method ensures that atoms from the same class are closer to each other than atoms from different classes.

The triplet-based approach (ITDB) offers a more refined control over the atom distribution but incurs higher computational costs due to the management of triplets instead of pairs. To alleviate this, we propose optimizing only a subset of possible triplets, randomly selected, to reduce computational load while maintaining effective representation.

In summary, the IDB and ITDB methods extend the standard IDL approach by introducing advanced barriers that ensure greater separation between class dictionaries, thereby enhancing their effectiveness in classification tasks.

5. Experiments

We present here the results of our numerical experiments on two problems: classification and anomaly detection.

5.1. Classification

This section contains experiments carried out with the SimCDL framework. For our tests, we reproduce the results of the algorithms presented in the previous section, named Incoherent Dictionary Learning (IDL), Incoherent Dictionary Learning via Distance Barrier (IDB), and Incoherent Dictionary Learning via Triplet Distance Barrier (ITDB). We used several popular datasets that are widely used in computer vision and machine learning. Some of them were explicitly adapted for DL problems. The procedure for obtaining these datasets is available in [11]. These datasets were extensively used for research in face recognition, scene classification, object recognition, and human action recognition. In the following, we provide some general details for each dataset.

The YaleB Face Dataset [19] is a well-known collection of face images for studying problems in the field of computer vision, such as face recognition and facial expression categorization. It consists of 2414 grayscale images of 38 different persons. This dataset is particularly challenging due to the variations in lighting and facial expressions.

The AR Face Dataset [20] comprises over 4000 color images corresponding to 126 people’s faces (70 men and 56 women). The images feature different facial expressions, illumination conditions, and occlusions. The pictures were taken under varying conditions, making the dataset suitable for evaluating robustness in facial recognition tasks.

The CMU Pose, Illumination, and Expression (PIE) dataset [21] contains 41,368 images of 68 people, each captured under 13 different poses, 43 different illumination conditions, with 4 different expressions. The CMU PIE dataset is extensively used to develop and test algorithms for face recognition across pose and illumination variations and is one of the largest databases of its kind.

The Scene 15 dataset [21] is a collection of scene images categorized into 15 categories, including office, kitchen, street, and living room, totaling over 4000 images. This dataset has been used to evaluate algorithms for scene recognition. It is notable for its scale, resolution, and subject matter variety, presenting a realistic benchmark for scene classification tasks.

Caltech101 [22] consists of a database of 9144 images, subdivided into 101 object categories plus a background clutter category. With an average of 40 to 800 images per category, the dataset includes various objects taken from different angles and backgrounds. It has been a standard for assessing the performance of algorithms in object recognition and category classification.

The UCF50 dataset [23] is an action recognition dataset consisting of 50 action categories, each from videos in the wild. It contains realistic action videos collected from YouTube, totaling over 6680 clips. The UCF50 dataset is designed to evaluate human action recognition algorithms in varied scenarios, capturing the complexities of appearance, motion, interaction, and context.

Table 1 presents a general description of these datasets.

In our experimental setup, we utilized the augmentation techniques described in Section 3: cutout, rotation, Gaussian noise, and Gaussian blur. A systematic grid search was performed for various parameters: rotation angles were set at

α \in {0, 1, 2, 3, 4}

; cutout percentages were

{0 %, 20 %, 40 %, 60 %, 80 %}

; standard deviations for noise were

{0, 0.2, 0.4}

and for blur were

{0, 0.25, 0.5}

. The batch sizes tested included

{16, 32, 64, 128}

, and the dictionary was updated across

{5, 10, 20}

iterations. Optimal parameters identified for most datasets were a rotation angle of 2°, a noise standard deviation of

0.2

, a blur standard deviation of

0.5

, and a cutout percentage of

20 %

. The temperature parameter,

τ

, was set to

0.5

. All subsequent experiments were conducted with batch sizes of 16 pairs of samples, resulting in 32 augmented samples in total. The dictionary optimization was completed over 10 iterations, yielding satisfactory results. The results provided were obtained by computing the mean results over 10 independent rounds with different initialization seeds. However, when comparing a standard DL with one that benefits from initialization with SimCDL, we ensured that both methods had the same starting random dictionary. All experiments were performed in Python 3.10 and Matlab 2023a.

Our programs are openly available in the SimCDL repository at https://github.com/denisilie94/simcdl (accessed on 1 September 2024).

5.1.1. Dictionary Learning Initialization

For the general DL problem, we conducted experiments on a large amount of samples. To enhance the understanding of the outcomes, we made two experiments using the AK-SVD algorithm with two different initializations. The first is a random initialization,

D_{0}

, while the second starts from

D_{0}

and is trained with the SimCDL framework provided in Algorithm 1. We denote with

D_{t}

the dictionary obtained by SimCDL.

We further include the experimental details for the YaleB dataset. For this experiment, we only used the training dataset, which consisted of 1216 samples and 32 for each of the 38 classes. The dictionary

D_{0}

was initialized with a total of 1520 atoms, 40 atoms for each class. Using this dictionary, we further obtained the dictionary

D_{t}

by updating it for 10 iterations of SGD following the SimCDL framework. We applied AK-SVD starting with the two initializations,

D_{0}

and

D_{t}

. For this stage, we set the sparsity constraint to 40, which is equal to the number of atoms from each class. This constraint will prove useful in future experiments.

To balance the experimental conditions, we allowed additional AK-SVD iterations when initialized with

D_{0}

to compensate for the time needed for the computation of

D_{t}

. This adjustment was essential to ensure that the comparison between the dictionaries is fair. In general, the computation of the dictionary

D_{t}

is equivalent to several iterations of AK-SVD. This depends on the experimental configuration. More precisely, the batch size, the employed transformations, and the number of optimization iterations give the execution time for the SimCDL initialization. The compared results were obtained with a different number of iterations but the same execution time. For the YaleB dataset, we used 10 additional iterations to compensate for the SimCDL initialization. Even if the AK-SVD algorithm, when initialized with

D_{0}

, benefited from more iterations, the representation error still tended to have poorer results compared to the AK-SVD algorithm initialized with SimCDL.

A similar experiment was carried out for the UCF50 dataset. We provide the results obtained on these two datasets in Figure 3 and Figure 4.

We can observe that the error representation is better when the starting point is a dictionary trained with SimCDL.

To better understand the nature of the SimCDL initialization, we plotted the mutual coherence of the two dictionaries,

D_{0}

and

D_{t}

. The result is shown in Figure 5. We observed that even though SimCDL follows a Self-Supervised Representation Learning strategy, the algorithm is capable of obtaining more incoherent atoms.

This shows that the augmentation transformation tends to build some incoherence barriers around the samples. This situation propagates further into the process of dictionary optimization and leads to more diverse and incoherent atoms.

During our investigation, we also wanted to understand how the augmentation transformation affected the samples. Since the datasets consist of large vectors, a dimensional reduction task was employed. For this, we used the t-distributed Stochastic Neighbor Embedding (t-SNE) [24] algorithm to reduce the data dimensions to two. We first reduced the original data points to 2D (Figure 6), and then reduced the size of the same samples, but only after the augmentation task was applied (Figure 7).

We noticed that the augmentation process tended to produce a more compact form of the classes compared to the original dataset. It is also possible that t-SNE was helped by the augmentation process to better group the samples from the same class; in this case, it is likely that the classification algorithms also directly benefited from augmentation. This is true especially in the case where there were only a few samples per class.

5.1.2. Dictionary Learning Initialization for Classification Problems

We further reproduced several classification experiments from [11], including methods that promote the incoherence of atoms. We reproduced the results according to the instructions available in the paper. What we changed was the dictionary initialization procedure. Just like in the previous experiments with the YaleB and UCF50 datasets, we trained a general dictionary using the SimCDL framework and distributed atoms based on the linear sum assignment algorithm. The results are given in Table 2 and demonstrate that, in general, this type of initialization can slightly improve accuracy performance. We appended ‘*’ to the name of a method when the initialization was computed with SimCDL.

For these experiments, it is relevant that the batch size has to be larger than the number of samples per class. Since we want to obtain incoherent atoms for each class dictionary, we need to have augmented samples from negative classes. For several scenarios, the SimCDL initialization does not improve the classification accuracy. This is due to the fact that the SimCDL optimization direction is not aligned with the discriminative nature of the dictionary learning algorithm. This demonstrates that a different discriminative term might be more beneficial in the spirit of the SimCDL framework. For now, regarding the classification problem, we consider the results to be satisfactory enough, but they also give a start to a new strategy that must be researched.

5.1.3. Ablation Analysis

To analyze the impact of individual components within the augmentation operators, we conducted an ablation study by systematically removing each augmentation operator. The performance was measured from two points of view: we measured the representation error for each dictionary, using the SimCDL dictionary initialization, and measured the classification accuracy. In Table 3, we register the error obtained at the last iteration, while in Table 4, we register the accuracy obtained for the classification problem. We denote the columns as follows: Origin stands for the original dictionary

D_{0}

; Rotate*, Cutout*, Gaussian noise*, and Gaussian blur* suggest that this component was eliminated from the SimCDL framework, while a single * indicates that no elimination was made and all the augmentation operators are used. For the representation problem, it is clear that the use of the cutout operator reduces the performance of the dictionary. Since we are cutting information from data, the dictionary is not capable of capturing the right direction of representation. We observe that every time the SimCDL framework is used (with at least three augmentation operators) the representation error is improved. On the other hand, the cutout operator is beneficial for the classification problem. The classification accuracy varies for each dataset, demonstrating that the augmentation operators have different behaviors. Except for the AR dataset, we obtain better accuracy results whenever augmentation operators are used. So, although in many cases the use of three out of the four augmentation operators is the best option, the benefits of augmentation are proven.

5.2. Anomaly Detection

Dictionary learning can also be applied to anomaly detection tasks. A straightforward approach for outlier detection is to calculate the representation error as follows:

E = Y - D X

(23)

Anomalous signals are those that usually produce poor representations. The anomaly score of signal i is simply the norm

∥ e_{i} ∥

of the i-th column of

E

. A higher error norm indicates a greater likelihood that the signal is an outlier. The core assumption is that similar signals are better captured by the dictionary created by solving the DL problem.

In this section, we present some experiments regarding anomaly detection applications. We conducted investigations in the spirit of the Python Outlier Detection (PyOD) toolbox [25]. More precisely, we used the Anomaly Detection Benchmark (ADBench) [26] to run experiments for anomaly detection tasks using dictionary learning. ADBench is currently the most comprehensive anomaly detection benchmark; it contains 30 anomaly detection algorithms and 57 datasets from different domains, including healthcare (e.g., disease diagnosis), audio and language processing (e.g., speech recognition), image processing (e.g., object identification), finance (e.g., financial fraud detection), etc. In addition to the original datasets, this benchmark provides support for generating four types of anomalies: local, global, dependency, and cluster. As stressed in [26], no method obtains good results for all types. During our experiments, we concluded that dictionary learning is a good competitor regarding the dependency anomalies.

Out of the 57 available datasets, we selected 30 for testing, ensuring a diverse range of data types with varying dimensions and anomaly ratios. To align with the sparse representation framework, all chosen datasets have signals with more than three features. For all our tests, we used undersampled dictionaries that were half the size of the data. This reduction ensures that, during the learning process, the dictionary atoms capture the overall structure and key features of the representation. Additionally, the sparsity constraint s was set to 2. This configuration facilitates a compact representation that effectively disregards abnormal or outlier samples. All the dictionaries were trained for 10 iterations.

In Table 5, we summarize the ROC AUC (Receiver Operating Characteristic Area Under Curve) results obtained on the 30 testing datasets. With AK-SVD [4], we denote the results obtained using a random initialization. On the other hand, AK-SVD* denotes the same method, but with the benefits of a SimCDL initialization. The initialization was obtained using a batch size of 64, with 40 dictionary optimization steps and a sparsity constraint of 2. The problem of anomaly detection was solved in an unsupervised manner. The datasets were split into training test sets with a 70-30 ratio. During the training phase, both normal and abnormal samples were used to update the dictionary. However, due to the relatively small number of anomalies, the dictionary is expected to predominantly capture the representation of the normal samples. The ROC AUC values reported in Table 5 are obtained by averaging 5 independent runs and are computed for the test sets.

The results indicate that for 21 out of 30 datasets, the AK-SVD method performs better when initialized with SimCDL, while for one dataset, we obtain the same result. For the 20 improved datasets, the average ROC AUC gain is 0.0058, whereas, for the remaining datasets, the average performance decrease is 0.0010. The best improvement, nearly 0.0317, is observed for the Vertebral dataset, while the wine dataset shows the largest decline, with a drop of 0.0041. Generally, the datasets with negative improvements exhibit similar results for the two initializations. Overall, the AK-SVD method achieves a mean ROC AUC of 0.9151 across all datasets, while AK-SVD* yields a higher average of 0.9189, demonstrating an improvement. When comparing the DL method to the 14 unsupervised algorithms from the ADBench benchmark, we achieve the third-best result. Only two methods outperform ours: COF with 0.9274 and LOF with 0.9256. The next closest method in ADBench is KNN, which achieves 0.9154, placing it between the performance of AK-SVD and AK-SVD*. Additionally, in Table 5, we have highlighted the results where a DL method achieved the top 3 results; a superscript of the ROC AUC values indicates the ranking of the respective methods in the context of the 14 (unsupervised) methods from ADBench.

6. Conclusions

This paper presents a novel framework for initializing a dictionary using the SimCLR technique. The framework is suitable for general dictionary learning problems, classification problems that involve multiple dictionaries, and anomaly detection tasks. This approach makes the atoms in the dictionary more incoherent, resulting in a better representation of the input data and improved representation errors. The results obtained from several datasets demonstrate the efficacy of this framework for applications such as classification and outlier detection.

Future work will concentrate on enhancing the SimCDL framework, with a particular focus on the development of incoherent atoms within the DL problems in a supervised manner.

Author Contributions

Conceptualization, D.C.I.-A. and B.D.; methodology, D.C.I.-A. and B.D.; software, D.C.I.-A.; validation, D.C.I.-A. and B.D.; formal analysis, D.C.I.-A. and B.D.; investigation, D.C.I.-A.; data curation, D.C.I.-A.; writing—original draft preparation, D.C.I.-A.; writing—review and editing, B.D. and D.C.I.-A.; visualization, D.C.I.-A.; supervision, B.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a grant of the Ministry of Research, Innovation and Digitization, CNCS—UEFISCDI, project number PN-III-P4-PCE-2021-0154, within PNCDI III.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

We used the original datasets YaleB, AR, Scene 15, and Caltech101, available at http://www.zhuolin.umiacs.io/projectlcksvd.html (accessed on 1 September 2024) and first presented in [27]. For the UCF50 dataset, we downloaded the videos from https://cse.buffalo.edu/~jcorso/r/actionbank/ (accessed on 1 September 2024) and followed the procedure described in [23] to extract the action bank features. The initial features obtained were reduced using PCA to a size dimension of 5000. The CMU dataset and the source code for the classification problem are publicly available at https://github.com/denisilie94/class-idb (accessed on 1 September 2024). In this repository, we also provide the extractor and PCA dimensional reduction program. For the anomaly detection task, we utilized datasets available in the ADBench repository, which can be accessed at https://github.com/Minqi824/ADBench (accessed on 1 September 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DL	Dictionary Learning
OMP	Orthogonal Matching Pursuit
K-SVD	K-means Singular Value Decomposition (K-SVD)
AK-SVD	Approximate K-means Singular Value Decomposition (K-SVD)
SSRL	Self-Supervised Representation Learning
MoCo	Momentum Contrast
PIRL	Pretext-Invariant Representation Learning
SimCLR	Simple Framework for Contrastive Learning of Visual Representations
SimCDL	Simple Framework for Contrastive Dictionary Learning
NT-Xent	Normalized Temperature-scaled Cross Entropy Loss
SGD	Stochastic Gradient Descent
MI	Mutual Information
IDL	Incoherent Dictionary Learning
IDB	Incoherent Dictionary Learning via Distance Barrier
ITDB	Incoherent Dictionary Learning via Triplet Distance Barrier
t-SNE	t-Distributed Stochastic Neighbor Embedding
PyOD	Python Outlier Detection
ADBench	Anomaly Detection Benchmark
ROC AUC	Receiver Operating Characteristic Area Under Curve

References

Pati, Y.; Rezaiifar, R.; Krishnaprasad, P. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 1–3 November 1993; pp. 40–44. [Google Scholar]
Dumitrescu, B.; Irofti, P. Dictionary Learning Algorithms and Applications; Springer: Cham, Switzerland, 2018. [Google Scholar]
Aharon, M.; Elad, M.; Bruckstein, A. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 2006, 54, 4311–4322. [Google Scholar] [CrossRef]
Rubinstein, R.; Zibulevsky, M.; Elad, M. Efficient Implementation of the K-SVD Algorithm Using Batch Orthogonal Matching Pursuit; Technical report; Computer Science Department—Technion: Haifa, Israel, 2008. [Google Scholar]
Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; Volume 2, pp. 1735–1742. [Google Scholar]
Gutmann, M.; Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; JMLR Workshop and Conference Proceedings. pp. 297–304. [Google Scholar]
Ericsson, L.; Gouk, H.; Loy, C.C.; Hospedales, T.M. Self-supervised representation learning: Introduction, advances, and challenges. IEEE Signal Process. Mag. 2022, 39, 42–62. [Google Scholar] [CrossRef]
He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9729–9738. [Google Scholar]
Misra, I.; Maaten, L. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6707–6717. [Google Scholar]
Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, Virtual Event, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
Ilie-Ablachim, D.; Dumitrescu, B. Classification With Dictionary Learning and A Distance Barrier Promoting Incoherence. In Proceedings of the 2023 IEEE 33rd International Workshop on Machine Learning for Signal Processing (MLSP), Rome, Italy, 17–20 September 2023; pp. 1–6. [Google Scholar]
Wang, T.; Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Proceedings of the International Conference on Machine Learning, Virtual Event, 13–18 July 2020; pp. 9929–9939. [Google Scholar]
Poole, B.; Ozair, S.; Van Den Oord, A.; Alemi, A.; Tucker, G. On variational bounds of mutual information. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 5171–5180. [Google Scholar]
Crouse, D. On implementing 2D rectangular assignment algorithms. IEEE Trans. Aerosp. Electron. Syst. 2016, 52, 1679–1696. [Google Scholar] [CrossRef]
Jan. Rotation Matrix. MATLAB Central File Exchange. Retrieved 15 October 2024. 2024. Available online: https://www.mathworks.com/matlabcentral/fileexchange/66446-rotation-matrix (accessed on 1 September 2024).
Ramirez, I.; Sprechmann, P.; Sapiro, G. Classification and clustering via dictionary learning with structured incoherence and shared features. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 3501–3508. [Google Scholar]
Ilie-Ablachim, D.; Dumitrescu, B. Incoherent frames design and dictionary learning using a distance barrier. Signal Process. 2023, 209, 109019. [Google Scholar] [CrossRef]
Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
Georghiades, A.; Belhumeur, P.; Kriegman, D. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 643–660. [Google Scholar] [CrossRef]
Martinez, A. The AR Face Database: CVC Technical Report 24. 1998. Available online: https://portalrecerca.uab.cat/en/publications/the-ar-face-database-cvc-technical-report-24 (accessed on 1 September 2024).
Sim, T.; Baker, S.; Bsat, M. The CMU pose, illumination, and expression (PIE) database. In Proceedings of the Fifth IEEE International Conference on Automatic Face Gesture Recognition, Washinton, DC, USA, 20–21 May 2002; pp. 53–58. [Google Scholar]
Fei-Fei, L.; Fergus, R.; Perona, P. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In Proceedings of the 2004 Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 27 June–2 July 2004; p. 178. [Google Scholar]
Sadanand, S.; Corso, J. Action bank: A high-level representation of activity in video. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1234–1241. [Google Scholar]
Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Zhao, Y.; Nasrullah, Z.; Li, Z. Pyod: A python toolbox for scalable outlier detection. J. Mach. Learn. Res. 2019, 20, 1–7. [Google Scholar]
Han, S.; Hu, X.; Huang, H.; Jiang, M.; Zhao, Y. Adbench: Anomaly detection benchmark. Adv. Neural Inf. Process. Syst. 2022, 35, 32142–32159. [Google Scholar] [CrossRef]
Jiang, Z.; Lin, Z.; Davis, L.S. Label consistent K-SVD: Learning a discriminative dictionary for recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2651–2664. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Contrastive dictionary learning framework.

Figure 2. DL with cutout; the grayed-out cells indicate features that are ignored during the atoms’ update, while the atoms’ normalization is made using all the features.

Figure 3. YaleB—representation error evolution.

Figure 4. UCF50—representation error evolution.

Figure 5. YaleB—mutual coherence.

Figure 6. UCF50—t-SNE on original dataset.

Figure 7. UCF50—t-SNE on augmented dataset.

Table 1. Dataset summary.

Dataset Name	# Samples	# Dim	# Classes
YaleB	2414	504	38
AR	2600	540	100
CMU PIE	11,554	256	68
15 Scene	4485	3000	15
Caltech101	9144	3000	102
UCF50	6680	5000	50

Table 2. Accuracy results (percentages). The table highlights the higher performance values in bold for each pair of methods (random initialization and SimCDL initialization) to indicate which method performs better.

	IDL	IDB	ITDB	IDL*	IDB*	ITDB*
YaleB	94.28	94.35	94.47	94.48	94.50	94.54
AR	93.13	93.07	93.92	93.02	92.57	94.10
CMU PIE	90.56	90.60	91.98	90.69	90.55	91.93
15 Scene	95.62	94.17	92.39	94.72	94.04	95.66
Caltech101	69.16	72.18	70.42	70.26	72.30	70.07
UCF50	59.94	60.14	60.63	60.26	60.08	60.47

Table 3. Ablation study: representation error. The best result for each dataset is in bold.

Dataset	Origin	Rotate*	Cutout*	Gaussian Noise*	Gaussian Blur*	*
YaleB	$1.99 \times 10^{- 3}$	$1.56 \times 10^{- 3}$	$1.47 \times 10^{- 3}$	$1.65 \times 10^{- 3}$	$1.66 \times 10^{- 3}$	$1.62 \times 10^{- 3}$
AR	$2.66 \times 10^{- 3}$	$1.94 \times 10^{- 3}$	$1.81 \times 10^{- 3}$	$1.97 \times 10^{- 3}$	$1.98 \times 10^{- 3}$	$1.96 \times 10^{- 3}$
CMU PIE	$2.29 \times 10^{- 3}$	$1.87 \times 10^{- 3}$	$1.75 \times 10^{- 3}$	$1.89 \times 10^{- 3}$	$1.79 \times 10^{- 3}$	$1.89 \times 10^{- 3}$
15 Scene	$7.24 \times 10^{- 9}$	$7.67 \times 10^{- 9}$	$7.73 \times 10^{- 9}$	$7.25 \times 10^{- 9}$	$7.03 \times 10^{- 9}$	$7.64 \times 10^{- 9}$
Caltech101	$3.50 \times 10^{- 8}$	$3.50 \times 10^{- 8}$	$3.56 \times 10^{- 8}$	$3.42 \times 10^{- 8}$	$3.39 \times 10^{- 8}$	$3.33 \times 10^{- 8}$
UCF50	$9.21 \times 10^{- 4}$	$7.86 \times 10^{- 4}$	$6.89 \times 10^{- 4}$	$7.37 \times 10^{- 4}$	$7.26 \times 10^{- 4}$	$7.30 \times 10^{- 4}$

Table 4. Ablation study: classification accuracy. The best result for each dataset is in bold.

Dataset	Origin	Rotate*	Cutout*	Gaussian Noise*	Gaussian Blur*	*
YaleB	94.35	94.39	94.39	94.36	94.37	94.50
AR	93.07	93.00	92.57	92.85	92.85	92.57
CMU PIE	90.60	90.71	90.94	90.74	90.80	90.55
15 Scene	94.17	94.22	94.17	93.83	93.99	94.04
Caltech101	72.18	72.22	72.27	72.31	72.22	72.30
UCF50	60.14	60.13	60.07	60.16	60.12	60.08

Table 5. ADBench results. The best result for each dataset is in bold.

Dataset	AK-SVD	AK-SVD*	# Samples	# Features	# Anomaly	Category
annthyroid	0.782634	0.789434	7200	6	534	Healthcare
breastw	0.813422²	0.810371	683	9	239	Healthcare
cardio	0.976233²	0.975875³	1831	21	176	Healthcare
cover	0.942390	0.945926	286,048	10	2747	Botany
fault	0.994302²	0.993869³	1941	27	673	Physical
glass	0.854623²	0.873868¹	214	7	9	Forensic
Hepatitis	0.894448³	0.901473²	80	19	13	Healthcare
Ionosphere	0.990133	0.990810	351	32	126	Oryctognosy
landsat	0.999017¹	0.998700³	6435	36	1333	Astronautics
letter	0.998711¹	0.998607²	1600	32	100	Image
Lymphography	0.954972²	0.955347¹	148	18	6	Healthcare
magic.gamma	0.899432	0.901307	19,020	10	6688	Physical
mammography	0.817071	0.831252	11,183	6	260	Healthcare
musk	0.998768³	0.999148²	3062	166	97	Chemistry
pendigits	0.980712	0.983217	6870	16	156	Image
Pima	0.809388	0.814498³	768	8	268	Healthcare
satellite	0.999136³	0.999298²	6435	36	2036	Astronautics
satimage	0.999900²	0.999934¹	5803	36	71	Astronautics
SpamBase	0.908187	0.907505	4207	57	1679	Document
Stamps	0.856465	0.859910	340	9	31	Document
thyroid	0.820620	0.830234	3772	6	93	Healthcare
vertebral	0.858781	0.890556¹	240	6	30	Biology
vowels	0.932859	0.935261	1456	12	50	Linguistics
Waveform	0.992237³	0.992642²	3443	21	100	Physics
WBC	0.823793	0.825724³	223	9	10	Healthcare
WDBC	0.998545²	0.998288	367	30	10	Healthcare
Wilt	0.726567³	0.726060	4819	5	257	Botany
wine	0.956704³	0.952583	129	13	10	Chemistry
WPBC	0.993470	0.994286³	198	33	47	Healthcare
yeast	0.880533³	0.891971²	1484	8	507	Biology

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Ilie-Ablachim, D.C.; Dumitrescu, B. SimCDL: A Simple Framework for Contrastive Dictionary Learning. Appl. Sci. 2024, 14, 10082. https://doi.org/10.3390/app142210082

AMA Style

Ilie-Ablachim DC, Dumitrescu B. SimCDL: A Simple Framework for Contrastive Dictionary Learning. Applied Sciences. 2024; 14(22):10082. https://doi.org/10.3390/app142210082

Chicago/Turabian Style

Ilie-Ablachim, Denis C., and Bogdan Dumitrescu. 2024. "SimCDL: A Simple Framework for Contrastive Dictionary Learning" Applied Sciences 14, no. 22: 10082. https://doi.org/10.3390/app142210082

APA Style

Ilie-Ablachim, D. C., & Dumitrescu, B. (2024). SimCDL: A Simple Framework for Contrastive Dictionary Learning. Applied Sciences, 14(22), 10082. https://doi.org/10.3390/app142210082

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

SimCDL: A Simple Framework for Contrastive Dictionary Learning

Abstract

1. Introduction

2. Contrastive Dictionary Learning

3. Data Augmentation for Dictionary Learning

3.1. Rotation

3.2. Cutout

3.3. Gaussian Noise

3.4. Gaussian Blur

4. Algorithms

4.1. Incoherent Dictionary Learning (IDL)

4.2. Incoherent Dictionary Learning via Distance Barrier (IDB)

4.3. Incoherent Dictionary Learning via Triplet Distance Barrier (ITDB)

5. Experiments

5.1. Classification

5.1.1. Dictionary Learning Initialization

5.1.2. Dictionary Learning Initialization for Classification Problems

5.1.3. Ablation Analysis

5.2. Anomaly Detection

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI