Article

Auto-Probabilistic Mining Method for Siamese Neural Network Training

by Arseniy Mokin 1,2,*, Alexander Sheshkus 1,3 and Vladimir L. Arlazarov 1,3

1 Smart Engines Service LLC, 117312 Moscow, Russia
2 The Faculty of Mechanics and Mathematics, Lomonosov Moscow State University, 119991 Moscow, Russia
3 Federal Research Center “Computer Science and Control” of Russian Academy of Sciences, 119333 Moscow, Russia
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(8), 1270; https://doi.org/10.3390/math13081270
Submission received: 30 October 2024 / Revised: 7 April 2025 / Accepted: 9 April 2025 / Published: 12 April 2025
(This article belongs to the Special Issue Artificial Intelligence: Deep Learning and Computer Vision)

Abstract

Training deep learning models for classification with limited data and computational resources remains a challenge when the number of classes is large. Metric learning offers an effective solution to this problem. However, it has its own shortcomings due to the known imperfections of widely used loss functions such as contrastive loss and triplet loss, as well as of sample mining methods. This paper addresses these issues by proposing a novel mining method and metric loss function. Firstly, it presents an auto-probabilistic mining method designed to automatically select the most informative training samples for Siamese neural networks. Combined with a previously proposed auto-clustering technique, the method improves model training by optimizing the utilization of available data and reducing computational overhead. Secondly, it proposes a novel cluster-aware triplet-based metric loss function that addresses the limitations of contrastive and triplet loss, enhancing the overall training process. To evaluate the proposed methods, experiments were conducted on the optical character recognition task using the PHD08 and Omniglot datasets. The proposed loss function with the random-mining method achieved 82.6% classification accuracy on the PHD08 dataset with full training on the Korean alphabet, surpassing the known baseline. The same experiment, using a reduced training alphabet, set a new baseline of 88.6% on the PHD08 dataset. The application of the novel mining method further enhanced the accuracy to 90.6% (+2.0%) and, combined with auto-clustering, achieved 92.3% (+3.7%) compared with the new baseline. On the Omniglot dataset, the proposed mining method reached 92.32%, rising to 93.17% with auto-clustering. These findings highlight the potential effectiveness of the developed loss function and mining method in addressing a wide range of pattern recognition challenges.

1. Introduction

Deep learning methods have been widely used in recent years, transforming industries such as natural language processing and computer vision. Convolutional neural networks (CNNs), capable of efficiently building hierarchical representations from raw inputs, have become a cornerstone of many applications. Optical character recognition (OCR) is one such application where CNNs have shown outstanding performance. OCR systems are essential for digitizing and extracting text from documents, images and other scanned materials, simplifying tasks like text-based analysis and automated data entry [1,2].
The ability of CNNs to automatically learn discriminative features directly from raw pixel data accounts for their effectiveness in OCR tasks [3]. By using convolutional layers to extract hierarchical structures and local patterns, CNNs capture the details necessary for precise character identification. In OCR tasks, the number of neurons in the final layer of a neural network typically matches the number of distinct characters or symbols being recognized. However, handling a very large alphabet (Figure 1) poses a problem.
Under such circumstances, the number of neurons in the last layer grows dramatically, inflating the model size and requiring a large amount of processing power and training data. As a result, training these networks becomes inefficient.
Metric learning helps to avoid this issue [4]. It operates by mapping the input vector to a point in the metric space, ensuring that the point is positioned as close as possible to points of the same class and as far as possible from points of other classes [5].
One such metric approach is the Siamese neural network (SNN). SNNs, initially introduced by Bromley et al. [6], have attracted considerable attention for their capability to learn robust representations in scenarios with limited labeled data. Unlike traditional CNN architectures that aim to directly predict class labels, SNNs learn a similarity metric between inputs, allowing them to distinguish subtle differences and similarities even without sufficient labeled data.
An SNN consists of several CNN branches with shared weights. The Euclidean norm is used to calculate the distances between the outputs of the branches, and the results are then passed to the loss function.
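As a minimal illustration (a sketch assuming PyTorch; the backbone here is a placeholder, not the architecture described in Section 5), the weight sharing and distance computation can be expressed as follows:

```python
import torch
import torch.nn as nn

class SiameseNetwork(nn.Module):
    """All branches share the same weights: one backbone is applied
    to every input independently."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # shared CNN branch

    def forward(self, x1, x2):
        e1 = self.backbone(x1)  # embedding of the first input
        e2 = self.backbone(x2)  # embedding of the second input
        # Euclidean (L2) distance between the branch outputs
        return torch.norm(e1 - e2, p=2, dim=1)

# Placeholder backbone; the actual branch is listed in Table 1.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(37 * 37, 25))
snn = SiameseNetwork(backbone)
distances = snn(torch.rand(8, 1, 37, 37), torch.rand(8, 1, 37, 37))  # shape (8,)
```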
The primary contributions of this paper are as follows:
  • The development of a mining method for training SNNs that enhances data representation and reduces computational costs.
  • The proposal of a novel metric loss function that combines the advantages of contrastive loss and triplet loss while introducing a new mechanism to ensure cluster compactness.
  • An experimental evaluation of the proposed methods’ effectiveness in OCR tasks using the PHD08 and Omniglot datasets, demonstrating significant improvements in classification accuracy and computational efficiency.
The remainder of this paper is organized as follows: Section 2 provides an overview of existing mining methods and loss functions in metric learning tasks. Section 3 is dedicated to the considered datasets. Section 4 describes the proposed methods, including the mining approach and the novel metric loss function. Section 5 focuses on the metric network architecture, training and experimental evaluation. The results tables are presented in Section 6. Finally, Section 7 presents the conclusions and directions for future research.

2. Related Work

OCR involves classifying characters and associating them with their respective classes. While some languages, like English, have achieved high recognition rates [7], the task becomes increasingly challenging as the size of the alphabet exceeds 10,000, requiring novel training and recognition methodologies.

2.1. Contrastive Loss

The contrastive loss function [8,9] (Figure 2), initially proposed by Chopra et al. for face verification tasks, has gained popularity in metric learning experiments, as shown by its use in studies such as [10,11,12].
In summary, contrastive loss (Equation (1)) uses the pairwise distance, pulling positive pairs as close together as possible and pushing negative pairs apart to a distance no less than the selected margin (α).
L_{\alpha}(x_i, x_j; f) = y_{ij} \, d_{ij}^2 + (1 - y_{ij}) \cdot \max(0, \alpha - d_{ij})^2, \qquad (1)
where d_{ij} = \| f(x_i) - f(x_j) \|_2 is the Euclidean distance between the pair, y_{ij} equals 1 if y_i = y_j and 0 otherwise, and x_i, x_j are the inputs of the SNN f.
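A short sketch of Equation (1), assuming PyTorch, where emb_i and emb_j are the branch outputs for a batch of pairs and y holds the pair labels (1 for a genuine pair, 0 for an impostor pair):

```python
import torch

def contrastive_loss(emb_i, emb_j, y, margin: float = 1.0):
    """Equation (1): pull genuine pairs together and push impostor
    pairs at least `margin` apart."""
    d = torch.norm(emb_i - emb_j, p=2, dim=1)                   # d_ij
    genuine = y * d.pow(2)                                      # y_ij * d_ij^2
    impostor = (1 - y) * torch.clamp(margin - d, min=0).pow(2)  # margin term
    return (genuine + impostor).mean()
```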
The absence of a natural stopping mechanism is one of the notable aspects of contrastive loss. This loss function encourages the minimization of distance between a positive pair to zero. In practice, the neural network may use computational resources to refine pairs that are already close to convergence, preventing it from exploring and optimizing other examples.
The recent work [13] mentions a different interpretation of a contrastive loss function that has similar meaning but is formulated differently. In our work, we refer to the classic version of contrastive loss, proposed in [8,9].

2.2. Triplet Loss

The triplet loss function (Equation (2)), as proposed and described in [14], provides an alternative approach to learning embeddings in metric learning tasks. Unlike contrastive loss, which deals with pairs of samples, triplet loss operates on triplets comprising an anchor, a positive sample from the same class as the anchor and a negative sample from a different class.
L_{\alpha}(x_a, x_p, x_n; f) = \max\big( 0, \| f(x_a) - f(x_p) \|_2 - \| f(x_a) - f(x_n) \|_2 + \alpha \big), \qquad (2)
where x_a, x_p and x_n represent the anchor, positive and negative inputs of the SNN f.
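Equation (2) can be sketched in the same way (assuming PyTorch; this is functionally equivalent to torch.nn.TripletMarginLoss):

```python
import torch

def triplet_loss(emb_a, emb_p, emb_n, margin: float = 1.0):
    """Equation (2): the anchor-positive distance must be smaller than
    the anchor-negative distance by at least `margin`."""
    d_ap = torch.norm(emb_a - emb_p, p=2, dim=1)
    d_an = torch.norm(emb_a - emb_n, p=2, dim=1)
    return torch.clamp(d_ap - d_an + margin, min=0).mean()
```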
The objective (Figure 3) of triplet loss is to train the network to reduce the distance between the anchor and positive samples while simultaneously increasing the distance between the anchor and negative samples by at least a specified margin ( α ).
However, it is important to note that triplet loss introduces cubic complexity [15], meaning that there are significantly more possible triplets than pairs. Not all triplets are equally informative for training, and including all possible triplets can lead to slower convergence and computational inefficiency. Therefore, selecting “hard” triplets that are particularly challenging [16] and informative is crucial for effectively training the model and achieving faster convergence and improved performance.
The proposed mining method (Section 4) in this work introduces a strategy to solve the “hard” triplets problem.
Another challenge associated with triplet loss is related to the absence of a direct requirement for closeness between the anchor and positive instances. Rather, the emphasis lies on ensuring that the distance between the anchor and positive samples is smaller than the distance between the anchor and negative samples by a predefined margin.
The proposed metric loss (Section 4.4) offers an effective solution to this challenge by addressing key limitations of classic loss functions, such as contrastive and triplet loss, improving the model’s ability to learn discriminative features and achieve better performance in metric learning tasks.

2.3. Quadruplet Loss

There exists a quadruplet loss function [17], which extends traditional triplet loss by introducing an additional negative sample. This approach enforces a stricter constraint, ensuring that the anchor-positive pair is not only closer than the first negative but also closer than the second negative, thereby improving feature discrimination. While this method enhances generalization in metric learning, it also increases computational complexity and requires more sophisticated mining strategies for effective sample selection. Despite these challenges, quadruplet loss has demonstrated improved performance in representation learning tasks.

2.4. Character Decomposition

Moreover, certain languages permit character decomposition, enabling segmentation into individual components for recognition and subsequent composition of the final result [10,18]. This approach obviates the need for neural networks with tens or even hundreds of thousands of outputs. However, it suffers from drawbacks such as vulnerability to image distortions and heavy reliance on segmentation quality. Other researchers [19] choose deep neural networks with numerous trainable parameters, which offer exceptional quality but demand significant computational resources and are therefore unsuitable for mobile applications and similar resource-constrained settings.

2.5. Mining Methods

Furthermore, training metric networks presents challenges in managing the sequence of pair/triplet mining. While random data selection is common [20], some works prioritize addressing this issue by focusing on similarity. For instance, in [21], an aggressive hard-mining strategy is employed, mining the pairs with the highest error for backpropagation so that the network is trained exclusively on challenging examples. However, this strategy's sensitivity to noise in the data can make it difficult to reach a good local minimum.
Another interesting point is the mathematical justification [22] for the effectiveness of hard-mining in metric learning, using the Isometric Approximation Theorem. The authors show that hard-mining is equivalent to minimizing the Hausdorff distance between the neural network and its ideal function, which explains the empirical success of this approach. In [23], the authors propose Easy Positive Triplet Mining (EPTM) to address the instability of traditional hard-mining methods. Hard Positive Mining often selects outliers or mislabeled samples, leading to noisy optimization and poor generalization. EPTM mitigates this issue by selecting easier yet informative positives, ensuring better clustering while maintaining stable training. Combined with Semi-Hard Negative Mining, which avoids extreme negatives, this approach leads to more robust embeddings and improved convergence.
Additionally, distance-based pairs mining methods, such as the approach proposed by [11], demonstrate effective results by generating samples based on creating a vector of distances for all possible impostor pairs. While promising in terms of training quality, this type of method is significantly time-consuming.
The auto-clustering [24] method partially solves the mining problem in SNNs. This method is based on creating clusters, i.e., groups consisting of classes that are similar from the network's point of view. The use of clusters allows the network to pay more attention to classes that are difficult to differentiate. According to the auto-clustering method, the negative sample is selected from the same cluster as the positive sample. Let us now consider how to effectively choose a positive sample.
In the process of training the neural network, we can identify which classes are far from their cluster in the metric space and manually increase the probability of choosing them when generating a positive class. However, automatic methods are preferable for such a task, being both more convenient and more broadly applicable.

3. Datasets

3.1. Korean Hangul Recognition

3.1.1. PHD08 Dataset

The Korean alphabet (Hangul) is known for its large set of characters and the high visual similarity between many of them, making it particularly challenging for OCR tasks. These characteristics align well with the goals of our research, as they test the robustness and discriminative power of the proposed methods. Thus, the printed Hangul character dataset PHD08 [25] was chosen to evaluate the suggested methods. Moreover, the same dataset was used in previous works [10,11,24].
The dataset contains 2350 classes, and each class has 2187 images of Korean characters. There are a total of 5,139,450 binary images that have different sizes, rotations and distortions. Examples of the dataset images are shown in Figure 4.

3.1.2. Synthetic Korean Alphabet Training Data

To ensure an objective assessment of the proposed approach, synthetic training data were generated in the same way as in [24] and used to train the network as in [10,11,24], with an average of eight fonts per class. The total number of classes is 11,172 characters of the Korean language. Moreover, in our experiments we also decided not to generate all possible classes but to make the number of classes equal to the evaluation dataset size. This decision was made to evaluate the proposed auto-probabilistic mining method (Section 4.2) more accurately, because otherwise the network could train on classes that do not exist in the evaluation dataset, and this fact, in combination with the considered approach, could significantly increase the probabilities of such characters.

3.2. Omniglot Dataset

The Omniglot dataset was collected by Brenden Lake and his collaborators at MIT via Amazon’s Mechanical Turk to produce a standard benchmark for learning from few examples in the handwritten character recognition domain [26].
Omniglot contains 1623 characters from 50 different alphabets (Figure 5). It includes not only widely used scripts such as Korean and Latin, but also lesser-known local scripts and fictional character sets such as Futurama and Klingon. The number of letters in each alphabet varies considerably, from about 15 to upwards of 40 characters, and each character was drawn a single time by each of 20 different people. The dataset is composed of two subsets: images_background, containing 19,289 images, and images_evaluation, containing 13,180 images.

3.3. Augmentation

Data augmentation [24] was applied to the images during the training process, with a probability of 0.7 for each sample and the following distortions (a code sketch is given after the list):
  • Projective transformation: each point of the image is transformed, with the shifts along the x- and y-axes chosen randomly; the minimum and maximum offsets are specified as shares of the image width and height in the range [0.0, 1.0].
  • Rotation: the image is rotated by an angle in the range [-5, 5] degrees.
  • Scale: the image is scaled down by a factor between 0.7 and 0.9 of its original width and height and then scaled back to the original size, making the image slightly pixelated.
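A sketch of this augmentation pipeline, assuming OpenCV and NumPy; the per-corner offset scheme and the max_shift value are illustrative assumptions, since the text specifies only the parameter ranges:

```python
import cv2
import numpy as np

def augment(img: np.ndarray, p: float = 0.7) -> np.ndarray:
    """Apply the distortions from Section 3.3 with probability p per sample."""
    if np.random.rand() > p:
        return img
    h, w = img.shape[:2]

    # Projective transformation: shift the corners along x and y by a random
    # share of the image size (max_shift is a hypothetical value; the paper
    # specifies the share range but not the per-corner scheme).
    max_shift = 0.1
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    offsets = np.random.uniform(-max_shift, max_shift, (4, 2)) * [w, h]
    dst = (src + offsets).astype(np.float32)
    img = cv2.warpPerspective(img, cv2.getPerspectiveTransform(src, dst), (w, h))

    # Rotation by an angle in [-5, 5] degrees.
    angle = np.random.uniform(-5, 5)
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, rot, (w, h))

    # Scale down by a factor in [0.7, 0.9] and back up, making the image
    # slightly pixelated.
    k = np.random.uniform(0.7, 0.9)
    small = cv2.resize(img, (max(1, int(w * k)), max(1, int(h * k))))
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)
```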

3.4. Experimental Setup

For all experiments in this paper, the images were resized to 37 × 37 pixels and converted to grayscale to align with the preprocessing standards of previous studies for objective comparative analysis.
During each epoch, 50 iterations were performed with 10,240 elements generated per iteration. For contrastive loss, pairs were dynamically sampled (3072 genuine pairs and 7168 impostor pairs), while CATML utilized triplets. A total of 10,240 elements were generated for both the training and validation sets during every epoch. These settings were applied uniformly to all datasets used in the experiments, including PHD08 and Omniglot, to keep the training conditions consistent and allow for an unbiased evaluation of the proposed methods.

4. Suggested Method

4.1. Auto-Probabilistic Mining Method

Assigning specific probabilities to classes during neural network training can, in fact, improve the model’s capacity to identify certain classes. Prioritizing some classes with greater probability encourages the network to produce them more frequently, which enhances the network’s ability to recognize these symbols.
Manually assigning probability raises practical issues in a variety of situations. Firstly, the cognitive strain of maintaining comprehensive knowledge of several varied classes is beyond human capacity when their number is in the thousands. Furthermore, while automatic probability assignment is possible for some organized data, such as languages with symbol decompositions based on keys, many real-world settings lack such inherent structure, making hand assignment problematic. An automated technique can be applied to a wider range of items, regardless of their type or complexity.
As a result, the auto-probabilistic mining (APM) method provides a more flexible approach. The network gains the capacity to dynamically adapt and prioritize classes according to their significance for learning by automatically computing the probabilities for each class during training. This makes it possible for the network to determine which classes need more attention and to produce them more frequently, which improves recognition performance overall and optimizes the training process. Therefore, the APM method offers a promising way to enhance neural network training efficiency.
Using the computed probabilities, the algorithm in this method determines which classes the positive and negative samples of each pair are drawn from. This approach differs from uniform sampling in that it favors classes with the greatest average distance between their examples and the corresponding cluster.

4.2. Automatic Calculation of Class Appearance Probabilities

To enhance the training process using the APM method, we calculate a class probability vector P^e at each epoch e. The method leverages the average distances between samples and their corresponding class centers to define class probabilities.
Let f(x_{i,j}) be the feature representation of the input image x_{i,j} (i-th class, j-th sample) obtained from the model f. Define d as the distance (L2 norm of the difference) between f(x_{i,j}) and the center c_i of the i-th class, where c_i represents the mean feature vector of the samples of class i. The class probability vector P^e is calculated as
P_i^e = \frac{\left( \frac{1}{M_i} \sum_{j=1}^{M_i} d(f(x_{i,j}), c_i) \right)^{\gamma}}{\sum_{k=1}^{N} \left( \frac{1}{M_k} \sum_{j=1}^{M_k} d(f(x_{k,j}), c_k) \right)^{\gamma}}, \quad i = 1, \dots, N, \qquad (3)
where
  • M_i is the number of samples of the i-th class;
  • x_{i,j} is the j-th sample of the i-th class;
  • N is the total number of classes;
  • e is the epoch index;
  • γ is the amplification factor controlling the impact of distances to emphasize dominant classes.
To adapt probabilities across epochs, the class probability vector P^e at epoch e is updated recursively as
P^e = (1 - w) \cdot P^e + w \cdot P^{e-1},
where
  • P^{e-1} is the class probability vector from the previous epoch;
  • w (0 ≤ w ≤ 1) is the weight factor controlling the contribution of the last epoch probabilities;
  • γ amplifies the updated probabilities to emphasize dominant classes.
Finally, P^e is normalized to form a discrete probability distribution, which is utilized to guide the mining of positive pairs during training. Specifically, this distribution ensures that positive pairs are sampled more effectively by prioritizing classes whose samples lie far from their class centers.
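A sketch of the APM probability computation described above, assuming NumPy; embeddings[i] is taken to be the array of feature vectors of class i produced by the current model:

```python
import numpy as np

def class_probabilities(embeddings, gamma: float = 1.0):
    """Equation (3): classes whose samples lie far from their own
    center receive a higher sampling probability."""
    avg_dists = []
    for feats in embeddings:                          # feats: (M_i, dim)
        center = feats.mean(axis=0)                   # class center c_i
        avg_dists.append(np.linalg.norm(feats - center, axis=1).mean())
    weights = np.power(avg_dists, gamma)              # amplification by gamma
    return weights / weights.sum()                    # discrete distribution P^e

def update_probabilities(p_curr, p_prev, w: float = 0.0):
    """Recursive update: blend with the previous epoch and renormalize."""
    p = (1 - w) * p_curr + w * p_prev
    return p / p.sum()

# Usage: sample a positive class index for the next pair/triplet.
# probs = update_probabilities(class_probabilities(embeddings), prev_probs, w=0.0)
# positive_class = np.random.choice(len(probs), p=probs)
```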

4.3. Auto-Clustering Method Improvements

The base auto-clustering method contains the following steps: calculate all norms between the ideal vectors of every pair of classes, sort them and, since there can be many of them, select some of them for analysis.
The previous version of the auto-clustering method required many confusing hyperparameters for network training, which made it inconvenient for experiments. At this stage, we decided to simplify it while retaining its essence, leaving just two parameters:
  • The probability of selecting a class from a cluster for a negative pair/triplet ( θ );
  • The number of the lowest norms considered for cluster generation ( η ).
This method is used for mining negative samples in pairs/triplets during training; a code sketch is given below.
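A sketch of this negative-mining scheme under the two parameters θ and η, assuming NumPy; the cluster-merging step is our reading of the description above and may differ in detail from the original implementation:

```python
import numpy as np

def build_clusters(class_centers: np.ndarray, eta: int) -> np.ndarray:
    """Group classes whose ideal vectors are close: take the eta lowest
    pairwise norms and merge the corresponding classes into clusters."""
    n = len(class_centers)
    dists = np.linalg.norm(class_centers[:, None] - class_centers[None, :], axis=-1)
    rows, cols = np.triu_indices(n, k=1)
    lowest = np.argsort(dists[rows, cols])[:eta]   # eta lowest norms
    cluster_of = np.arange(n)                      # start: one class per cluster
    for idx in lowest:
        a, b = rows[idx], cols[idx]
        cluster_of[cluster_of == cluster_of[b]] = cluster_of[a]  # merge clusters
    return cluster_of

def mine_negative(pos_class: int, cluster_of: np.ndarray, theta: float) -> int:
    """With probability theta draw the negative class from the same cluster
    as the positive class, otherwise draw it uniformly at random."""
    n = len(cluster_of)
    same = np.flatnonzero((cluster_of == cluster_of[pos_class]) & (np.arange(n) != pos_class))
    if np.random.rand() < theta and len(same) > 0:
        return int(np.random.choice(same))
    return int(np.random.choice(np.setdiff1d(np.arange(n), [pos_class])))
```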

4.4. Cluster-Aware Triplet-Based Metric Loss

We propose the novel cluster-aware triplet-based metric loss (CATML) function (Equation (4)) combining the contrastive and triplet loss functions’ advantages.
\mathrm{CATML} = \rho \, g_1 + \tau \, g_2 + \xi \, g_3, \qquad (4)
where
  • ρ g_1 is the contrastive loss contribution, where ρ is responsible for reducing the distance between the anchor and the positive;
  • τ g_2 is the triplet loss contribution, where τ regulates the principle that the distance between the anchor and the positive should be less than the distance between the anchor and the negative by a selected value equal to or greater than α;
  • ξ g_3 is the cluster contribution, where ξ is responsible for stability, ensuring that the clusters created during training do not fall apart and receive a gradient to stay in place.
In this work, we let ρ = 0.1 , τ = 1.0 , ξ = 1.0 and α = 10 .
The components g 1 , g 2 and g 3 are defined as follows:
  • g_1 = d_{AP} = \| s(f(x_{Anc})) - s(f(x_{Pos})) \|_2;
  • g_2 = s(d_{AP} - d_{AN} + \alpha) = s\big( \| s(f(x_{Anc})) - s(f(x_{Pos})) \|_2 - \| s(f(x_{Anc})) - s(f(x_{Neg})) \|_2 + \alpha \big), where α is the margin value;
  • g_3 = \frac{1}{M_i} \sum_{j=1}^{M_i} d(f(x_{i,j}), c_i), \; i = 1, \dots, N, where x_{i,j} \in \{ x_{Anc}, x_{Pos}, x_{Neg} \} and c_i is the ideal vector of the corresponding class (the average vector of the samples of the given class).
Here, f denotes the deep learning model used for feature extraction, and s(x) = \mathrm{softrelu}(x) is an activation function.
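A sketch of CATML for a batch of triplets, assuming PyTorch; here softrelu is taken to be the softplus function log(1 + e^x), and g_3 is averaged over the three triplet members, which is our reading of the definition above:

```python
import torch
import torch.nn.functional as F

def catml_loss(f_a, f_p, f_n, c_a, c_p, c_n,
               rho=0.1, tau=1.0, xi=1.0, alpha=10.0):
    """Cluster-aware triplet-based metric loss (Equation (4)).
    f_a, f_p, f_n: embeddings of anchor/positive/negative, shape (B, dim).
    c_a, c_p, c_n: ideal (mean) vectors of the corresponding classes."""
    s = F.softplus                                    # s(x), one reading of "softrelu"
    d_ap = torch.norm(s(f_a) - s(f_p), p=2, dim=1)
    d_an = torch.norm(s(f_a) - s(f_n), p=2, dim=1)
    g1 = d_ap                                         # contrastive contribution
    g2 = s(d_ap - d_an + alpha)                       # triplet contribution
    g3 = (torch.norm(f_a - c_a, p=2, dim=1) +         # cluster contribution: keep
          torch.norm(f_p - c_p, p=2, dim=1) +         # every sample close to its
          torch.norm(f_n - c_n, p=2, dim=1)) / 3      # class ideal vector
    return (rho * g1 + tau * g2 + xi * g3).mean()
```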

5. Experiments

5.1. Model

We chose the architecture (Table 1) from the works [10,11,24] for consistency in our comparison. A more illustrative scheme is represented in Figure 6.
The softsign activation function (Equation (5)) is described in detail in [28,29] and is recommended as the most suitable for this kind of problem. Softsign is a bounded activation function that promotes stable convergence and eliminates numerical inconsistencies. This function was also used in [10,11,24].
\mathrm{softsign}(x) = \frac{x}{1 + |x|} \qquad (5)
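For reference, a sketch of the branch from Table 1 with softsign activations, assuming PyTorch and a single-channel 37×37 input:

```python
import torch
import torch.nn as nn

def make_branch() -> nn.Sequential:
    """The metric network branch from Table 1 (25-dimensional embedding)."""
    act = nn.Softsign()                                  # x / (1 + |x|), Equation (5)
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, stride=1, padding=0), act,   # 35x35x16
        nn.Conv2d(16, 16, 5, stride=2, padding=2), act,  # 18x18x16
        nn.Conv2d(16, 16, 3, stride=1, padding=1), act,  # 18x18x16
        nn.Conv2d(16, 24, 5, stride=2, padding=2), act,  # 9x9x24
        nn.Conv2d(24, 24, 3, stride=1, padding=1), act,  # 9x9x24
        nn.Conv2d(24, 24, 3, stride=1, padding=1), act,  # 9x9x24
        nn.Flatten(),
        nn.Linear(9 * 9 * 24, 25),                       # 25 outputs
    )

embedding = make_branch()(torch.rand(1, 1, 37, 37))      # shape (1, 25)
```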

5.2. Training

5.2.1. Synthetic Korean Characters

We performed experiments in which the network was trained on synthetic images of Korean characters with the new CATML function. We considered two sizes of training alphabet: full (11,172 classes) and reduced (2350 classes). The evaluated PHD08 dataset contains only 2350 classes, so the reduced setting allows the APM method to have a more objective impact and avoids focusing on classes that are not evaluated.
Firstly, the network was trained with CATML on the full alphabet using the random-mining method. Secondly, it was trained on the reduced alphabet with random-mining, auto-clustering and auto-probabilistic mining. Moreover, the specificity of the last two methods allows us to use them together, with the APM method responsible for positive sample mining and auto-clustering responsible for negative samples.
In our experiments, the APM method on its own showed the best accuracy with γ = 1, whereas the combined APM + auto-clustering configuration performed best with γ = 2. The auto-clustering method demonstrated high accuracy with θ = 0.5 and η = 1000, and the APM method achieved its best result with w = 0, meaning that there was no dependency on the previous probability distribution; this weight factor remains a subject for future experiments.

5.2.2. Omniglot Dataset

For the Omniglot dataset, we trained our network with 60% of the total data set aside for training. The test and validation parts were set as in [3], each equal to 20%. We fixed a uniform number of training examples per alphabet so that each alphabet receives equal representation during optimization, although this is not guaranteed for the individual character classes within each alphabet.
All parameters used for the suggested methods were the same as for the experiments on the Korean alphabet.

5.3. Evaluation

The accuracy ( A c c ) for classification was defined as
Acc = \frac{N_{correct}}{N_{total}} \cdot 100\%,
where N_{total} is the size of the dataset and N_{correct} is the number of images correctly classified by the network.

6. Results

All experiments followed a common evaluation protocol with data partitioned into three distinct sets: training, validation and test sets. The validation set was exclusively used for hyperparameter tuning and model selection through epoch-wise performance monitoring, while the test set remained strictly isolated for final evaluation. The final metrics were computed through a single evaluation on the test set using the optimal model checkpoint, selected based on the validation performance. To ensure reproducibility and prevent data leakage, no parameter updates or architecture decisions were influenced by test set observations during training.
For each experimental configuration, we provide the following:
  • Epoch-wise validation curves demonstrating convergence patterns and training dynamics.
  • A results table reporting both the training and test accuracies for the selected model. The training set metrics are included specifically to verify model behavior; the close alignment between training and test performance provides empirical evidence against overfitting.

6.1. PHD08

The results of the experiments with a large number of classes are presented in Table 2. The results for random-mining, hard-mining and distance-based mining are taken from [11]. The CATML function showed notable accuracy (Figure 7) compared with all previous results.
Table 3 shows that the proposed APM method improves classification accuracy. Moreover, this method is expected to be an efficient and convenient solution regardless of the type of object being recognized. Additionally, it can be successfully applied both with and without the previously suggested auto-clustering method (Figure 8).

6.2. Omniglot

As can be seen in Table 4, using the proposed methods improves the classification accuracy. It is also worth noting that the auto-clustering method does not improve accuracy significantly (Figure 9). This could be attributed to the dataset's diversity, as it contains multiple alphabets, including Korean.

7. Conclusions

We have proposed the auto-probabilistic mining method, which enhances the accuracy of Siamese neural networks, making them more effective for classification tasks such as OCR. This method can be successfully used both in combination with and without the auto-clustering method. It improves network training and achieves high accuracy, as demonstrated on the PHD08 dataset, where it surpasses the previous baseline, and on the Omniglot dataset. The developed method does not require any knowledge of the object's nature and can be used to train networks for any type of object, such as characters, feature point descriptors and faces. In addition, the proposed cluster-aware triplet-based metric loss function combines the benefits of the contrastive and triplet loss functions and improves training accuracy.
In future research, the proposed methods, including the novel loss function and the auto-probabilistic mining strategy, could be explored in the context of one-shot learning and re-identification tasks. These problems present unique challenges, such as limited data and the necessity for robust discrimination between highly similar instances, which align well with the strengths of metric learning. By applying the proposed techniques to these tasks, we aim to investigate their potential to enhance generalization, reduce dependency on large-scale datasets and achieve competitive performance in scenarios requiring fine-grained feature representations and adaptability.

Author Contributions

Conceptualization, A.M.; Methodology, A.M.; Software, A.M.; Validation, A.M.; Writing—original draft, A.M.; Writing—review & editing, A.S.; Supervision, A.S. and V.L.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

All authors are employed by Smart Engines LLC. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors received no specific funding for this work.

References

  1. Khaustov, P.A. Algorithms for handwritten character recognition based on constructing structural models. Comput. Opt. 2017, 41, 67–78. [Google Scholar] [CrossRef]
  2. Nikolaev, D.P.; Polevoy, D.V.; Tarasova, N.A. Training data synthesis in text recognition problem solved in three-dimensional space. Informatsionnye Tekhnologii I Vychslitel’nye Sist. 2014, 3, 82–88. [Google Scholar]
  3. Koch, G.; Zemel, R.; Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; Volume 2. [Google Scholar]
  4. Hoffer, E.; Ailon, N. Deep metric learning using triplet network. In Proceedings of the International Workshop on Similarity-Based Pattern Recognition, Copenhagen, Denmark, 12–14 October 2015; Springer: Cham, Switzerland, 2015; pp. 84–92. [Google Scholar]
  5. Oh Song, H.; Xiang, Y.; Jegelka, S.; Savarese, S. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4004–4012. [Google Scholar]
  6. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature verification using a “siamese” time delay neural network. Adv. Neural Inf. Process. Syst. 1993, 6, 737–744. [Google Scholar] [CrossRef]
  7. Tafti, A.P.; Baghaie, A.; Assefi, M.; Arabnia, H.R.; Yu, Z.; Peissig, P. OCR as a service: An experimental evaluation of Google Docs OCR, Tesseract, ABBYY FineReader, and Transym. In Proceedings of the 12th International Symposium on Visual Computing, Las Vegas, NV, USA, 12–14 December 2016; Springer: Cham, Switzerland, 2016; pp. 735–746. [Google Scholar]
  8. Chopra, S.; Hadsell, R.; LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 1, pp. 539–546. [Google Scholar]
  9. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Washington, DC, USA, 17–22 June 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 2, pp. 1735–1742. [Google Scholar]
  10. Ilyuhin, S.A.; Sheshkus, A.V.; Arlazarov, V.L. Recognition of images of Korean characters using embedded networks. In Proceedings of the Twelfth International Conference on Machine Vision (ICMV 2019), Amsterdam, The Netherlands, 16–18 November 2019; International Society for Optics and Photonics: Bellingham, WA, USA, 2020; Volume 11433, p. 1143311. [Google Scholar]
  11. Kondrashev, I.V.; Sheshkus, A.V.; Arlazarov, V.V. Distance-based online pairs generation method for metric networks training. In Proceedings of the Thirteenth International Conference on Machine Vision, Singapore, 20–22 February 2021; International Society for Optics and Photonics: Bellingham, WA, USA, 2021; Volume 11605, p. 1160508. [Google Scholar]
  12. Wang, X.; Hua, Y.; Kodirov, E.; Hu, G.; Robertson, N.M. Deep metric learning by online soft mining and class-aware attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27–28 January 2019; Volume 33, pp. 5361–5368. [Google Scholar]
  13. Wang, F.; Liu, H. Understanding the behaviour of contrastive loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2495–2504. [Google Scholar]
  14. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  15. Yuan, Y.; Chen, W.; Yang, Y.; Wang, Z. In defense of the triplet loss again: Learning robust person re-identification with fast approximated triplet loss and label distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 354–355. [Google Scholar]
  16. Suloev, K.; Sheshkus, A.; Arlazarov, V. Spherical constraints in the triplet loss function. In Proceedings of the Institute for Systems Analysis Russian Academy of Sciences, Moscow, Russia, 24–27 October 2023; Volume 73, pp. 50–58. [Google Scholar] [CrossRef]
  17. Chen, W.; Chen, X.; Zhang, J.; Huang, K. Beyond triplet loss: A deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 403–412. [Google Scholar]
  18. Franken, M.; van Gemert, J.C. Automatic Egyptian hieroglyph recognition by retrieving images as texts. In Proceedings of the 21st ACM international conference on Multimedia, Barcelona, Spain, 21–25 October 2013; ACM: New York, NY, USA, 2013; pp. 765–768. [Google Scholar]
  19. Kim, Y.g.; Cha, E.y. Learning of Large-Scale Korean Character Data through the Convolutional Neural Network. In Proceedings of the Korean Institute of Information and Communication Sciences Conference, Jeongseon, Republic of Korea, 27–29 January 2016; The Korea Institute of Information and Communication Engineering: Busan, Republic of Korea, 2016; pp. 97–100. [Google Scholar]
  20. Bell, S.; Bala, K. Learning visual similarity for product design with convolutional neural networks. ACM Trans. Graph. (TOG) 2015, 34, 1–10. [Google Scholar] [CrossRef]
  21. Simo-Serra, E.; Trulls, E.; Ferraz, L.; Kokkinos, I.; Fua, P.; Moreno-Noguer, F. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 118–126. [Google Scholar]
  22. Xu, A.; Hsieh, J.Y.; Vundurthy, B.; Cohen, E.; Choset, H.; Li, L. Mathematical Justification of Hard Negative Mining via Isometric Approximation Theorem. arXiv 2022, arXiv:2210.11173. [Google Scholar]
  23. Xuan, H.; Stylianou, A.; Pless, R. Improved embeddings with easy positive triplet mining. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 2474–2482. [Google Scholar]
  24. Mokin, A.K.; Gayer, A.V.; Sheshkus, A.V.; Arlazarov, V.L. Auto-clustering pairs generation method for Siamese neural networks training. In Proceedings of the Fourteenth International Conference on Machine Vision (ICMV 2021), Virtual Conference, 8–12 November 2021; SPIE: Bellingham, WA, USA, 2022; Volume 12084, pp. 369–376. [Google Scholar]
  25. Ham, D.S.; Lee, D.R.; Jung, I.S.; Oh, I.S. Construction of printed Hangul character database PHD08. J. Korea Contents Assoc. 2008, 8, 33–40. [Google Scholar] [CrossRef]
  26. Lake, B.; Salakhutdinov, R.; Gross, J.; Tenenbaum, J. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, Boston, MA, USA, 20–23 July 2011; Volume 33. [Google Scholar]
  27. Lake, B.M.; Salakhutdinov, R.; Tenenbaum, J.B. Human-level concept learning through probabilistic program induction. Science 2015, 350, 1332–1338. [Google Scholar] [CrossRef] [PubMed]
  28. Bergstra, J.; Desjardins, G.; Lamblin, P.; Bengio, Y. Quadratic polynomials learn better image features. Tech. Rep. 2009, 1337. [Google Scholar]
  29. Lin, G.; Shen, W. Research on convolutional neural network based on improved Relu piecewise activation function. Procedia Comput. Sci. 2018, 131, 977–984. [Google Scholar] [CrossRef]
Figure 1. Similar characters of the Korean alphabet (Hangul).
Figure 2. Contrastive loss principle illustrated with Korean language samples.
Figure 3. Triplet loss principle illustrated with Korean language samples.
Figure 4. PHD08 dataset examples.
Figure 5. The Omniglot dataset examples from the original paper [27].
Figure 6. Visualization of the metric network architecture.
Figure 7. Epoch-wise validation accuracy (Acc) for the full alphabet with 11,172 classes.
Figure 8. Epoch-wise validation accuracy (Acc) for the reduced alphabet with 2350 classes.
Figure 9. Epoch-wise validation accuracy (Acc) for the Omniglot dataset.
Table 1. A list of the layers for the metric network.

N | Layer Type      | Parameters                              | Output Size | Activation Function
1 | conv            | 16 filters 3×3, stride 1×1, no padding  | 35×35×16    | softsign
2 | conv            | 16 filters 5×5, stride 2×2, padding 2×2 | 18×18×16    | softsign
3 | conv            | 16 filters 3×3, stride 1×1, padding 1×1 | 18×18×16    | softsign
4 | conv            | 24 filters 5×5, stride 2×2, padding 2×2 | 9×9×24      | softsign
5 | conv            | 24 filters 3×3, stride 1×1, padding 1×1 | 9×9×24      | softsign
6 | conv            | 24 filters 3×3, stride 1×1, padding 1×1 | 9×9×24      | softsign
7 | fully connected | 25 outputs                              | 1×1×25      | -
Table 2. Classification accuracy for the full alphabet with 11,172 classes.

Loss             | Mining Method              | Train Acc | Test Acc
Contrastive loss | Random-mining [20]         | -         | 63.7%
Contrastive loss | Hard-mining [21]           | -         | 64.5%
Contrastive loss | Distance-based mining [11] | -         | 69.7%
Contrastive loss | Auto-clustering [24]       | 79.3%     | 76.1%
CATML            | Random-mining              | 86.2%     | 82.6%
Table 3. Classification accuracy for the reduced alphabet with 2350 classes.

Loss  | Mining Method                        | Train Acc | Test Acc
CATML | Random-mining                        | 89.1%     | 88.6%
CATML | Auto-probabilistic                   | 94.8%     | 90.6%
CATML | Auto-clustering                      | 93.7%     | 91.1%
CATML | Auto-probabilistic + Auto-clustering | 95.1%     | 92.3%
Table 4. Classification accuracy for the Omniglot dataset.

Loss  | Mining Method                        | Train Acc | Test Acc
CATML | Random-mining                        | 93.33%    | 91.25%
CATML | Auto-clustering                      | 95.21%    | 91.60%
CATML | Auto-probabilistic                   | 94.78%    | 92.32%
CATML | Auto-probabilistic + Auto-clustering | 95.89%    | 93.17%