This study introduces a simple yet effective framework for image retrieval. A feature descriptor is extracted from a backbone network and then passed through a projection module to produce embedding vectors for retrieval. We study the behavior of the output representation space when training with a contrastive loss, in particular, how augmentation affects the properties of the space and the performance of image retrieval on low-resolution inputs. Our framework is illustrated in Figure 2. This section describes the three modules of the framework. A feature extractor module extracts a visual representation of a given image, while a projection head trained in a contrastive manner maps the visual representation to an embedding space in which the similarity of samples can be computed. Finally, we introduce an auxiliary module with a classification loss and a triplet loss, which significantly enhances category retrieval performance.
2.1. Feature Extractor
We adopt the Vision Transformer (ViT) [11] model to extract visual representations of given images. The main component of the ViT model is the self-attention encoder module, which implements the Transformer architecture [10] in the most standard way. Following the original version of ViT, our feature extractor involves three main steps.
Patch embedding: We split an image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of patches and map each patch to a $D$-dimensional embedding space. Precisely, we put the image through a 2D convolution with a kernel size of $P \times P$ and a stride of $P$, resulting in a feature map of size $\frac{H}{P} \times \frac{W}{P} \times D$, and then flatten the feature map into a sequence of latent vectors with a constant size of $D$. In the above configuration, $N = HW/P^2$ is the number of embedded patches, where $H \times W$ represents the resolution of the original image and $P \times P$ is the resolution of an image patch. Apart from the embedded patches, the ViT model adds an extra learnable class embedding for classification tasks, and the results obtained from using this class token are referred to as “ViT-class” in our study. To maintain the position of the patches after flattening, we follow the standard approach of adding a learnable 1D positional embedding to each patch; this positional embedding does not share weights across patches.
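As a concrete illustration, the patch-embedding step can be sketched in PyTorch roughly as follows; the image size, patch size, and embedding dimension (224, 16, 768) are ViT-Base defaults used purely as example values, not our exact configuration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Minimal ViT-style patch embedding: a PxP convolution with stride P,
    flattening, a learnable class token, and learnable 1D positional embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # N = HW / P^2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                       # x: (B, C, H, W)
        z = self.proj(x)                        # (B, D, H/P, W/P)
        z = z.flatten(2).transpose(1, 2)        # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        z = torch.cat([cls, z], dim=1)          # prepend the class token
        return z + self.pos_embed               # add positional embeddings
```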
Encoder: The encoder is a computational block consisting of a multi-head attention module [10] and an MLP with two consecutive linear layers. The input and output of the encoder module are both batches of embedded vectors. LayerNorm is applied before feeding the embedded vectors into the attention module and the MLP. The multi-head attention module expands the model's ability to jointly focus on different positions, thus providing different representation subspaces from different attention heads:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}\big(Q W_i^{Q}, K W_i^{K}, V W_i^{V}\big), \qquad (1)$$

where each head is a context vector from scaled dot-product attention,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{D}}\right) V. \qquad (2)$$

$Q$, $K$, and $V$ represent a query, key, and value, respectively, calculated inside the Transformer architecture so that the information encoded from the image's patches can mutually attend to itself through the self-attention mechanism. $D$ is the dimension of the patch embeddings; Equation (2) employs $\frac{1}{\sqrt{D}}$ to scale the attention scores. The encoder module is illustrated in Figure 3.
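A minimal sketch of one pre-norm encoder block is given below; it relies on PyTorch's nn.MultiheadAttention, which implements Equations (1) and (2) internally and, by default, returns attention weights averaged over heads. The layer sizes are again only illustrative.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm Transformer encoder block: LayerNorm -> multi-head
    self-attention -> residual, then LayerNorm -> two-layer MLP -> residual."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, z):                               # z: (B, N+1, D)
        y = self.norm1(z)
        # scaled dot-product attention of Eq. (2); weights averaged over heads
        attn_out, attn_weights = self.attn(y, y, y, need_weights=True)
        z = z + attn_out
        z = z + self.mlp(self.norm2(z))
        return z, attn_weights                          # weights: (B, N+1, N+1)
```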
Visual descriptor: We develop a novel visual descriptor for the corresponding image based on the attention mechanism. Our technique naturally arises from attention maps of the ViT model. We generate visual descriptors by combining the final embeddings of informative regions selected by ranking attention weights.
Given an image $x$, let $z^{(i)} = \big\{z^{(i)}_1, \dots, z^{(i)}_N\big\}$ be the list of embedded patches resulting from the $i$th encoder. The first list of embedded patches $z^{(0)}$ is spawned from the summation of the linear projections and the positional embeddings as described above, $z^{(0)}_j = E\,x^{p}_j + e^{pos}_j$. The $i$th embedding layer is the output of the $i$th encoder module, whose input is the $(i-1)$th embedding layer $z^{(i-1)}$. Let $a_j$ be the joint self-attention weight of a local patch $x^{p}_j$, $j = 1, \dots, N$; then our visual descriptor for an image $x$ is the ranking weighted sum of its patch embeddings,

$$f(x) = \sum_{j=1}^{r} a_{\sigma(j)}\, z^{(L)}_{\sigma(j)}, \qquad (3)$$

where $r$ is the desired rank, $\sigma$ is a permutation that sorts the list of joint attention weights, and $\big(z^{(L)}_{\sigma(1)}, \dots, z^{(L)}_{\sigma(N)}\big)$ is the corresponding list of embedded patches output from the final encoder, sorted in the same order. When $\sigma$ is the arrangement from greatest to least, our visual descriptor is the weighted sum of the $r$ most attentive regions.
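The ranking weighted sum of Equation (3) can be sketched as follows; visual_descriptor is a hypothetical helper, and the rank r = 32 is an arbitrary example value.

```python
import torch

def visual_descriptor(patch_embeddings, joint_attention, r=32):
    """Ranking weighted sum of Eq. (3).

    patch_embeddings: (N, D) output of the final encoder (class token removed).
    joint_attention:  (N,) joint attention weight of each patch (diagonal of Eq. (5)).
    r:                number of most-attentive patches to keep (the desired rank).
    """
    weights, order = torch.sort(joint_attention, descending=True)   # permutation sigma
    top_w = weights[:r]                                              # a_{sigma(1..r)}
    top_z = patch_embeddings[order[:r]]                              # z_{sigma(1..r)}
    return (top_w.unsqueeze(1) * top_z).sum(dim=0)                   # (D,) descriptor
```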
According to the ViT architecture, each encoder block has its own self-attention maps from multi-head attention. The self-attention mechanism allows ViT to integrate information across the entire image, even in the early layers: some heads consistently focus on small areas while others attend to most of the image, indicating that the ability to aggregate information globally is already exercised in the early layers. Meanwhile, the attention regions of all heads tend to widen in higher layers, showing that the model captures increasingly global information with depth. This behavior is analogous to the receptive field concept in CNNs, where convolutional layers integrate both local and global information.
Instead of considering the correlation among the patches of an image, we investigate which parts of the image should be attended to and extract their corresponding embedded patches. As highlighted in Equation (3), the embedded patches are output from the final encoder block, but the attention weights are jointly measured across the multi-head attention modules. Let $A^{(i)} \in \mathbb{R}^{h \times N \times N}$ be the attention map calculated inside the multi-head attention module of the $i$th encoder, where $h$ is the number of attention heads and $N$ is the number of patches, as mentioned earlier. Then, $\big\{A^{(1)}, \dots, A^{(L)}\big\}$ is a collection of attention maps, which consist of square matrices of size $N \times N$ whose coefficients estimate the degree of attention between two patches. To derive the joint attention map, we first average the attention maps over the different heads to obtain a single attention map for each layer. To account for residual connections, we add an identity matrix to the attention map and re-normalize the weights,

$$\hat{A}^{(i)} = \mathrm{norm}\big(\bar{A}^{(i)} + I\big), \qquad (4)$$

where $\hat{A}^{(i)}$ is the normalized average attention map of the $i$th layer, $I$ is an identity matrix of size $N \times N$, and $\bar{A}^{(i)} = \frac{1}{h}\sum_{k=1}^{h} A^{(i)}_{k}$ is the head-averaged attention map. Finally, the joint attention map is obtained by multiplying the attention maps across all layers,

$$\tilde{A} = \hat{A}^{(L)}\, \hat{A}^{(L-1)} \cdots \hat{A}^{(1)}, \qquad (5)$$

where $\tilde{A} \in \mathbb{R}^{N \times N}$ and $L$ is the number of encoder blocks. To generate the visual descriptors, we only require the attention weight that each image patch assigns to itself, which can be obtained by extracting the diagonal of the joint attention matrix, $a_j = \tilde{A}_{jj}$. The attention weights derived here are exactly the weights involved in Equation (3). The aforementioned method is referred to as attention rollout and was introduced in [11].
Figure 4 illustrates our approach to extract visual representations from the ViT model.
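A compact sketch of the rollout computation in Equations (4) and (5) might look as follows, assuming the per-head attention maps of every encoder block have been collected during the forward pass (e.g., by requesting non-averaged weights from the attention modules); if the class token is kept, N here includes it.

```python
import torch

def attention_rollout(attn_maps):
    """Joint attention map of Eqs. (4)-(5) via attention rollout.

    attn_maps: list of L tensors of shape (h, N, N), one per encoder block.
    Returns the per-patch joint attention weights a_j used in Eq. (3).
    """
    joint = None
    for A in attn_maps:
        A = A.mean(dim=0)                                    # average over heads
        A = A + torch.eye(A.shape[-1], device=A.device)      # account for residual connections
        A = A / A.sum(dim=-1, keepdim=True)                  # re-normalize the weights, Eq. (4)
        joint = A if joint is None else A @ joint            # multiply across layers, Eq. (5)
    return torch.diagonal(joint)                             # self-attention weights a_j
```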
2.2. Contrastive Learning Framework
We train projection heads on top of the ViT model in a contrastive way to maximize the agreement of positive pairs in an embedding space. The contrastive learning framework can be decomposed into four modules: data augmentation, backbone network, projection head, and contrastive objective function.
Data augmentation: This study concentrates on low-resolution image retrieval, and thus the augmentation applied here only varies the input's resolution. A stochastic data augmentation module transforms the input data randomly, resulting in two correlated views of the same sample at different resolutions. In this work, we apply random cropping followed by a low-pass filter. The low-pass filter used here is a Gaussian blur with a fixed kernel size and a random standard deviation.
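Using torchvision, this two-view augmentation could be sketched as below; the crop size, the kernel size of 23, and the sigma range (0.1, 2.0) are example values rather than our exact settings.

```python
import torchvision.transforms as T

def low_resolution_views(img_size=224):
    """Stochastic augmentation producing two correlated views of one image:
    random cropping followed by a Gaussian blur with a fixed kernel size and a
    random standard deviation, so the two views differ mainly in effective resolution."""
    view = T.Compose([
        T.RandomResizedCrop(img_size),
        T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),   # fixed kernel, random sigma
        T.ToTensor(),
    ])
    return lambda img: (view(img), view(img))               # two views of the same sample
```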
Base encoder: The input to the base encoder network is the augmented data, and the output is a representation vector. We mainly use ViT as our base encoder model; however, we also experiment with other encoder networks, such as BiT [20] and EfficientNet [21]. In the case of the BiT and EfficientNet encoders, we extract the final convolutional layer before the fully connected layers and then apply adaptive average pooling to obtain a 1D representation vector. This study uses $h$ as the notation for the representation vector; consequently, the dimension of the representation vector varies with the encoder architecture. The base encoder is visualized as the two “Encoder” blocks in Figure 2.
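For the CNN encoders, the pooling step can be sketched as follows.

```python
import torch.nn as nn

def pool_representation(feature_map):
    """Turn a CNN backbone's final convolutional feature map (B, C, H', W')
    into a 1D representation vector h of shape (B, C) via adaptive average pooling."""
    pooled = nn.AdaptiveAvgPool2d(1)(feature_map)   # (B, C, 1, 1)
    return pooled.flatten(1)                        # (B, C)
```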
Projection head: The projection head consists of a non-linear mapping that maps representation vectors into an embedding space where the similarity between samples can be measured. As suggested in [16], we use an MLP with one hidden layer to form the projection head. This study uses $z$ as the notation for the embedded vector. The projection head is depicted in Figure 2 as the “Projector” blocks with one hidden layer; the size of the hidden layer is the same as the dimension of the embedded vectors.
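A minimal sketch of the projection head, assuming a 768-dimensional representation h and a 128-dimensional embedding z purely as example sizes:

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """MLP with one hidden layer mapping the representation h to the embedding z;
    the hidden layer has the same size as the embedding dimension."""
    def __init__(self, in_dim=768, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, h):
        return self.net(h)
```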
Contrastive objective function: A contrastive function is defined so that minimizing it maximizes the agreement between positive pairs; in other words, it pulls similar samples closer to each other and pushes dissimilar samples apart. We measure the similarity within a minibatch of $B$ random samples. Following the setting of the data augmentation module, two different resolution versions of each original input are generated inside a minibatch, resulting in $2B$ data points. Within a multiview minibatch, let $i \in \{1, \dots, 2B\}$ be the index of an arbitrary augmented sample, and let $j(i)$ be the index of the other augmented sample obtained from the same source sample. In self-supervised contrastive learning [16], the loss function for a positive pair of examples $\big(i, j(i)\big)$ is defined as

$$\mathcal{L}^{self}_{i} = -\log \frac{\exp\big(z_i \cdot z_{j(i)} / \tau\big)}{\sum_{k=1}^{2B} \mathbb{1}_{[k \neq i]} \exp\big(z_i \cdot z_k / \tau\big)}, \qquad (6)$$

where $z_i$ is the embedding obtained from the projection head and $\tau$ is the temperature scaling factor. This loss is the InfoNCE loss, which maximizes a lower bound on the mutual information of the two views.
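A possible PyTorch implementation of Equation (6), assuming the embeddings of the two views arrive as two (B, D) tensors and that similarity is the dot product of L2-normalized embeddings; τ = 0.1 is an example value.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, tau=0.1):
    """Self-supervised contrastive loss of Eq. (6) for a batch of B positive pairs.
    z1, z2: (B, D) embeddings of the two augmented views of the same B samples."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)     # (2B, D), unit norm
    sim = z @ z.t() / tau                                  # pairwise similarities
    sim.fill_diagonal_(float('-inf'))                      # exclude self-pairs (k != i)
    B = z1.shape[0]
    # index j(i) of the positive for each of the 2B anchors
    pos = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, pos)
```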
In the presence of labels, a supervised contrastive loss [37] can be used, which also generalizes to an arbitrary number of positive pairs. The loss function takes the following form:

$$\mathcal{L}^{sup}_{i} = \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\big(z_i \cdot z_p / \tau\big)}{\sum_{k=1}^{2B} \mathbb{1}_{[k \neq i]} \exp\big(z_i \cdot z_k / \tau\big)}, \qquad (7)$$

where $P(i)$ is the set of indices of all positives of anchor $i$ in the multiview batch. In addition to the augmented version of the anchor, the supervised contrastive loss treats the samples within the minibatch that share the anchor's label as positive pairs.
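Equation (7) can be sketched in the same style; the implementation below assumes the whole multiview batch and its (repeated) labels are passed in together.

```python
import torch
import torch.nn.functional as F

def sup_con_loss(z, labels, tau=0.1):
    """Supervised contrastive loss of Eq. (7).
    z:      (2B, D) embeddings of the multiview batch (both augmented views).
    labels: (2B,)   class labels, repeated for the two views."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()       # numerical stability
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    exp_logits = torch.exp(logits).masked_fill(self_mask, 0.0)        # k != i in the denominator
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True))
    # average log-probability over the positives P(i) of each anchor
    mean_log_prob_pos = (pos_mask.float() * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return -mean_log_prob_pos.mean()
```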
2.3. Auxiliary Module: Classification Loss and Triplet Loss
Contrastive loss maximizes agreement between augmented versions derived from the same source without explicitly sampling negative pairs [16]. To achieve good performance, many negative pairs must be present to ensure the convergence of the contrastive objective function; for example, [16] used a batch size of 8192, in which case 16,382 negative samples per positive pair are available from both augmentation views, and similar conditions were applied in [37] with a batch size of 6144. Using such a large batch size is a computational burden and is hard to train with regular optimizers [38,39]. In this study, instead of using a large batch size, we leverage the samples' labels to implicitly generate pairs of negative and positive samples. We also note that training with labels, i.e., supervised training, is standard practice for learning embedded vectors for category image retrieval.
As proposed in the previous literature on image retrieval, a softmax cross-entropy loss and a ranking loss, such as the triplet loss, are used either for end-to-end training of a CNN backbone [6] or to fine-tune a model with the triplet loss on top of a classifier trained with the cross-entropy loss [5,8]. In this study, we train the model with an auxiliary classification loss to maximize the inter-class distance and utilize the triplet loss to rank the embeddings of inter-class pairs over intra-class pairs. We add label smoothing [40] and temperature scaling [41] to the auxiliary cross-entropy loss function to prevent overconfidence and to learn better embeddings,
$$\mathcal{L}_{CE} = -\frac{1}{2B} \sum_{i=1}^{2B} \sum_{c=1}^{M} \tilde{y}_{i,c} \log \frac{\exp\big(s_{i,c}/\tau\big)}{\sum_{m=1}^{M} \exp\big(s_{i,m}/\tau\big)}, \qquad (8)$$

where $2B$ is the batch size including the augmented samples, $M$ is the number of classes, $\tau$ is the temperature scaling factor, and $\tilde{y}_{i,c}$ is the smoothed label of the $i$th sample. $s_i$ denotes the logits of the $i$th sample, obtained by adding a trainable linear layer over the embedded vector $z_i$.
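A sketch of this auxiliary classifier, relying on the label_smoothing option of PyTorch's nn.CrossEntropyLoss; the smoothing factor, temperature, and layer sizes are example values.

```python
import torch.nn as nn

class AuxiliaryClassifier(nn.Module):
    """Trainable linear layer over the embedding z, trained with the
    label-smoothed, temperature-scaled cross-entropy of Eq. (8)."""
    def __init__(self, embed_dim=128, num_classes=1000, tau=2.0, smoothing=0.1):
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_classes)
        self.tau = tau
        self.ce = nn.CrossEntropyLoss(label_smoothing=smoothing)

    def forward(self, z, labels):
        logits = self.fc(z) / self.tau      # temperature-scaled logits s_i / tau
        return self.ce(logits, labels)
```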
Additionally, we add a triplet loss to the objective function. Minimizing the triplet loss in the embedding space encourages instances with the same label, together with their augmentations, to move closer to each other and form well-separated clusters. The version of the triplet loss used in this study is online triplet mining with the batch-hard strategy [42]: for each sample in the batch, we select the hardest positive and the hardest negative samples within the batch when forming the triplets used to compute the loss,
$$\mathcal{L}_{tri} = \sum_{i=1}^{P} \sum_{a=1}^{K} \Big[\, \alpha + \max_{p=1,\dots,K} \big\| z^{i}_{a} - z^{i}_{p} \big\|_2 \;-\; \min_{\substack{j=1,\dots,P,\; n=1,\dots,K \\ j \neq i}} \big\| z^{i}_{a} - z^{j}_{n} \big\|_2 \,\Big]_{+}, \qquad (9)$$

where $\alpha$ is the margin, and $P$ and $K$ are the number of classes and the number of samples per class, respectively, calculated within a minibatch.
The final loss function for end-to-end training of our framework is the weighted summation of the contrastive loss and the two auxiliary losses,

$$\mathcal{L} = \lambda_1 \mathcal{L}^{con} + \lambda_2 \mathcal{L}_{CE} + \lambda_3 \mathcal{L}_{tri}, \qquad (10)$$

where $\mathcal{L}^{con}$ denotes the contrastive loss (Equation (6) or (7)) and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyperparameters that control the contribution of each loss.
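A training step under Equation (10) could tie the pieces together as below; training_step and the unit loss weights are purely illustrative and reuse the hypothetical helpers sketched earlier in this section.

```python
import torch

def training_step(backbone, projector, classifier, batch, lam=(1.0, 1.0, 1.0)):
    """One end-to-end step of Eq. (10): contrastive loss plus the two auxiliary losses.
    `classifier` is an AuxiliaryClassifier instance; `lam` are the loss weights."""
    (view1, view2), labels = batch                    # two resolutions of each sample
    z1 = projector(backbone(view1))                   # embeddings of view 1
    z2 = projector(backbone(view2))                   # embeddings of view 2
    z = torch.cat([z1, z2], dim=0)                    # multiview batch of size 2B
    y = torch.cat([labels, labels], dim=0)
    return (lam[0] * sup_con_loss(z, y)
            + lam[1] * classifier(z, y)
            + lam[2] * batch_hard_triplet_loss(z, y))
```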