Two-Stage Feature Generator for Handwritten Digit Classification

Vakifbank, 06200 Ankara, Turkey
Department of Avionics, Atilim University, 06830 Ankara, Turkey
Department of Computer Engineering, Konya Food and Agriculture University, 42080 Konya, Turkey
Department of Computer Engineering, KTH Royal Institute of Technology, SE-114 28 Stockholm, Sweden
Department of Computer Engineering, OSTIM Technical University, 06370 Ankara, Turkey
Author to whom correspondence should be addressed.
Sensors 2023, 23(20), 8477;
Submission received: 8 June 2023 / Revised: 12 September 2023 / Accepted: 9 October 2023 / Published: 15 October 2023
(This article belongs to the Section Sensing and Imaging)


In this paper, a novel feature generator framework is proposed for handwritten digit classification. The proposed framework includes a two-stage cascaded feature generator. The first stage is based on principal component analysis (PCA), which generates projected data on principal components as features. The second one is constructed by a partially trained neural network (PTNN), which uses projected data as inputs and generates hidden layer outputs as features. The features obtained from the PCA and PTNN-based feature generator are tested on the MNIST and USPS datasets designed for handwritten digit sets. Minimum distance classifier (MDC) and support vector machine (SVM) methods are exploited as classifiers for the obtained features in association with this framework. The performance evaluation results show that the proposed framework outperforms the state-of-the-art techniques and achieves accuracies of 99.9815% and 99.9863% on the MNIST and USPS datasets, respectively. The results also show that the proposed framework achieves almost perfect accuracies, even with significantly small training data sizes.

1. Introduction

Pattern recognition typically involves both feature generation and classification. In pattern recognition approaches, such as face recognition and digit recognition, a feature extractor aims to find the characteristics of patterns that can discriminate and separate classes. However, variability in features can lead to difficulties in such approaches. For example, even though small within-class variability is desirable in face recognition, varying lighting conditions can lead to differences in features. Similarly, digits written by different people in digit recognition systems can cause variability in features [1,2,3,4,5,6,7]. Hence, determining and using the most efficient framework for feature generation and classification is crucial in pattern recognition approaches.
Feature generation and classification have been studied extensively. For instance, linear transformation techniques, such as principal component analysis (PCA), singular value decomposition (SVD), independent component analysis, discrete Fourier transform, Hadamard and Haar transforms, and discrete time wavelet transform (DTWT), are used for feature generation [7]. Moreover, neural networks (NNs) are used for classification in numerous studies, such as [2,3,4,5,6,7]. It should be stressed that typical recognition architectures use a single feature extractor followed by a supervised classifier. However, as stated in [8,9], two successive stages of feature generation yield higher accuracies than a one-stage extractor. There are also studies in which two or more feature extractors are cascaded, and the resulting features are used to train a supervised classifier. Even though such approaches have been used for feature generation and classification, they do not handle within-class variability well, which makes it difficult for them to separate features into different classes.
In this paper, a new feature extraction method is presented. The method uses two consecutive attribute extractors. The first one generates the projected patterns on principal components or eigenvectors obtained from the covariance matrix of the data via PCA. The second provides the hidden layer outputs of a partially trained neural network (PTNN), where training of the neural network is stopped after a few epochs, i.e., the training is not fully completed. These two generators are cascaded, that is, the outputs of the PCA stage become the inputs of PTNN. We show that the proposed feature generator reduces within-cluster variability. This makes it much easier to distinguish data from different classes. The original input data are first transformed into a new space referred to as the PCA feature space. Then, the feature space is transformed into another space through a PTNN with one hidden layer with various hidden units. We show that a two-stage feature generator is advantageous in terms of the distribution of clusters in the feature space.
It should be stressed here that the framework proposed in this study reduces within-cluster variability as compared to state-of-the-art studies. In this way, it becomes much easier to differentiate data from different classes using the framework. Moreover, the proposed framework can enable achieving low intra-class and high inter-class variations. In addition to all these advantages, more importantly, the proposed framework achieves the best performance on the MNIST and USPS handwritten digit datasets compared to all the studies in the literature. To assess the clusterability of the features generated using the proposed method, minimum distance classifier (MDC) and support vector machine (SVM) are used as classifiers.
This paper is organized as follows. Section 2 presents state-of-the-art studies in the literature. The proposed framework is discussed in Section 3. Section 4 discusses the verification of intra-class and inter-class feature distributions. The experimental results are described in Section 5. Finally, Section 6 concludes the study.

2. State of the Art

There are studies based on handwritten character recognition in the literature. For instance, Mellouli et al. [1] proposed a new convolutional neural network (CNN) architecture using morphological filters for digit recognition. The morphological configuration was called Morph-CNN, which achieved a test accuracy of 99.66% on the MNIST dataset. Patel et al. proposed a multi-resolution technique using a discrete wavelet transform (DWT)-based approach for handwritten character recognition [10]. The authors used the DWT to extract features and they also used the MDC to recognize the system output. Their technique achieved an overall success rate of 90.00%. Ayyaz et al. [11] proposed a hybrid feature extraction system based on the SVM. Their system was tested on both handwritten digits and uppercase alphabets, which achieved higher efficiency compared to other methods. Shubhangi et al. [12] proposed a structural micro-feature system based on the SVM to recognize handwritten English characters and digits with a high recognition rate.
Liu et al. [13] proposed an NN-based system, which achieved improved accuracy by discriminative training and achieved a 98.45% recognition rate on the CENPARMI dataset. Suen et al. [14] developed a system to sort and identify cheques and financial documents on the CENPARMI dataset, which achieved a success rate of 98.85%. Lee et al. [15] proposed an offline handwritten digit recognition system for the CEDAR dataset, which achieved a recognition rate of 99.09%. Filatov et al. [16] designed a system based on an address script to identify handwritten postal addresses for US mail on the CEDAR dataset, which achieved a success rate of 99.54%.
In [17], a discriminative cascaded CNN model was used, which achieved an error rate of 0.18% on the MNIST dataset. Ganapathy et al. [18] studied a multiscale NN recognition system. In [19], a single-layer NN achieved a 98.39% accuracy on the MNIST dataset. In [20], four different techniques, i.e., the PCA, CNN, SVM, and multi-classifier systems, were used to develop a powerful system for handwritten character recognition, which achieved a success rate of 98.50% on the MNIST dataset. In [21], a cascaded PCA, binary hashing, and block-wise histograms were used with a very simple deep learning network for image classification, which achieved a 99.67% recognition rate. In [22], a system based on a multicolumn deep neural network (MCDNN) was developed using 35 pre-trained CNNs, which achieved an error rate of 0.23% on the MNIST dataset. Bruna et al. [9] used an invariant scattering convolution network, which achieved an error rate of 0.43% on the MNIST dataset. Goodfellow et al. [23] used a convolution max-out system to regularize dropout, which achieved an error rate of 0.45% on the MNIST dataset. Zeiler et al. [24] proposed stochastic pooling on deep CNN, which achieved an error rate of 0.47% on the MNIST dataset. In [25], a context-dependent deep NN/hidden Markov model was used for large-vocabulary speech recognition. This system was tested on both the MNIST and TIMIT datasets, which achieved an error rate of 0.83% on the MNIST dataset. Jarrett et al. [8] used large CNNs and achieved an error rate of 0.53% on the MNIST dataset without distortions. Yu et al. [26] used a hierarchical two-layer sparse coding network on pixels, which achieved an error rate of 0.77% on the MNIST dataset. Keysers et al. [27] proposed an image distortion model based on local optimization, which achieved a low error rate of 0.54% on the MNIST dataset.
In [28], a scalable generative model based on a convolutional deep belief network was used for unlabeled data from the MNIST dataset, which achieved an error rate of 0.82%.
In [29], pattern recognition using average patterns of categorical k-nearest neighbors was proposed, which achieved error rates of 1.27% and 3.44% on the MNIST and USPS datasets, respectively, using kernel classification on categorical average patterns. In [30], a discriminative-based supervised dictionary learning was developed, which achieved test error rates of 0.60% and 2.40% for the MNIST and USPS datasets, respectively. Error rates of 1.66% and 2.59% were achieved using SVM and KNN, respectively, on the USPS test set in [31]. In [32], perceptron learning of a modified quadratic discriminant function (MQDF) was used to achieve error rates of 1.49% and 2.19% on the MNIST and USPS datasets, respectively, which indicates that discriminative learning of MQDF can further improve MQDF’s performance. Xu et al. [33] presented a nonnegative representation-based classifier for pattern classification, which achieved accuracies of 99% and 95.1% on the MNIST and USPS datasets, respectively. Prasad et al. [34] presented novel features and cascaded KNN and SVM classifiers, resulting in an accuracy of 99.26% on the MNIST dataset.

3. Proposed Framework

Employing appropriate features to classify data can directly influence desired learning results. Therefore, selecting and generating features that are easily separable is vital for accurate classification [1,2,3]. Considering this motivation, a two-stage cascaded feature generator framework is proposed in this study.
In this section, a one-stage feature generator, which provides the basis for the proposed framework, is discussed first, and then the proposed two-stage feature generator framework is introduced.

3.1. Soft Sensor Implementation for the Feature Generation

The proposed method has been implemented by using both hardware sensors (cameras, scanners, etc.) and soft sensors. The former captures the digits. The latter provides features that no hardware sensor is able to measure. In this study, a soft sensor model was developed to generate features for handwritten digit classification. The soft sensor has been realized by two cascaded modules, namely PCA and PTNN. The following section presents the details of each module.
Handwritten digit classification (HDC) has found such applications as postal automation, bank check automation, and human–computer interactions in practice. Many studies have been conducted for the classification of digits, as mentioned in Section 2. The first and most vital step in the recognition cycle is the collection of handwritten digits from people. There exist various ways to acquire the digits depending on the way the digits are generated. Therefore, different sensors are utilized for capturing the digits. While the digits written on paper can be recorded by handheld scanners or cameras, the digits created in the air can be captured by Kinect cameras, wearable inertial measurement unit (IMU) sensors, and wearable smart gloves and armbands. They rely on capturing hand and finger movements. In addition, a smart pen that exploits the inertial force sensors can record the digits [35,36,37,38].

3.2. One-Stage Feature Generator

Figure 1 depicts a one-stage feature generator framework that employs the PCA for the feature extraction. As can also be seen from the figure, the implementation of the one-stage classifier is based on either the MDC, which is a simple algorithm, or the SVM, which is a sophisticated algorithm, for classification [11,12]. The MDC calculates the Euclidean distance between an unknown sample and each class center and assigns the sample to the class whose center is nearest.
Algorithm 1, which is presented below, describes the steps to generate the features based on the PCA within this framework.
Algorithm 1: Obtaining principal component (PC)-based features
Input: The data matrix $D = (d_1 \; d_2 \; \dots \; d_N)$ of size $M \times N$, where $d_i$ represents the $i$th sample from the data matrix, $i = 1, \dots, N$, and $N$ is the number of examples in the data matrix.
  • S1: Scale the values of the data matrix in $[0, 1]$. The resulting matrix is called $D_s$.
  • S2: Calculate the principal components (PCs) of $D_s$.
  • S3: Select the PCs corresponding to the $K$ highest eigenvalues.
  • S4: Construct the matrix whose columns are formed by the principal component vectors (eigenvectors of $D_s$), $C = (c_1 \; c_2 \; \dots \; c_K)$, with the size of $M \times K$.
  • S5: Calculate the feature matrix by
    $F = D_s^{T} C = (d_1 c_1 \;\; d_1 c_2 \;\; \dots \;\; d_1 c_K)$ with the size of $N \times K$.
    Although the algorithm ends in this step, the following step demonstrates the effectiveness of the generated features.
  • S6: Train a classifier (such as SVM or MDC) by the rows of the matrix $F$.
It is easy to show that the elements of matrix $F$ are the projections of each data sample on the principal component vectors. In $F$, the product $d_i c_j$ denotes the inner product of the two vectors. Hence, we can express this product as:
$$\langle d_i, c_j \rangle = \lVert d_i \rVert \, \lVert c_j \rVert \cos\theta \tag{1}$$
where $\theta$ is the angle between $d_i$ and $c_j$, $i = 1, \dots, N$ and $j = 1, \dots, K$.
Since $\lVert c_j \rVert = 1$, the inner product can be written as:
$$\langle d_i, c_j \rangle = \lVert d_i \rVert \cos\theta \tag{2}$$
Equation (2) represents the projection of $d_i$ onto $c_j$, that is:
$$\mathrm{proj}_{c_j} d_i = \lVert d_i \rVert \cos\theta \tag{3}$$
Consequently, the projected data are employed as features to train the selected classifier, which is based on the MDC or SVM.
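The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `pca_features`, the choice of a global min-max scaling for step S1, and the random stand-in data are assumptions made for the example.

```python
import numpy as np


def pca_features(D, K):
    """Return (F, C): N x K projected features and M x K principal components.

    D is an M x N data matrix with one sample per column, as in Algorithm 1.
    """
    D = np.asarray(D, dtype=float)
    # S1: scale all values into [0, 1] (global min-max scaling is an assumption).
    span = D.max() - D.min()
    Ds = (D - D.min()) / span if span > 0 else np.zeros_like(D)
    # S2: eigen-decomposition of the M x M covariance matrix of the samples.
    cov = np.cov(Ds)                        # rows of Ds are treated as variables
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues, unit-norm eigenvectors
    # S3/S4: keep the eigenvectors that belong to the K largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:K]
    C = eigvecs[:, order]                   # M x K
    # S5: F = Ds^T C, so F[i, j] is the inner product of sample d_i with c_j.
    F = Ds.T @ C                            # N x K
    return F, C


# Hypothetical stand-in data: 200 flattened 8 x 8 "images" with 8-bit pixel values.
rng = np.random.default_rng(0)
D = rng.integers(0, 256, size=(64, 200))
F, C = pca_features(D, K=10)
```

Because `np.linalg.eigh` returns unit-norm eigenvectors, each entry of `F` equals the length of the sample times the cosine of its angle to the corresponding principal component, as in Equation (2).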
The variance within the clusters obtained using the PCA alone is very large; hence, the MDC or SVM classifiers cannot successfully separate one cluster from another. This is because the features are sparsely scattered around the center of the cluster (i.e., distances between samples within the same cluster are high), which reduces the classification performance and yields low success rates.

3.3. Two-Stage Feature Generator

To enhance the performance of the one-stage generator, we propose inserting another transformation operator between the PCA and MDC/SVM modules to form a two-stage feature generator framework. Figure 2 depicts the proposed framework. The framework for a two-stage feature generator is explained in more detail in Algorithm 2 step by step.
The PTNN module in the framework is simply a multilayer perceptron (MLP) [2] with one hidden layer and a varying number of neurons. It is structured for the purpose of classification. Thus, the outputs of the network correspond to the clusters to be identified, i.e., the number of digits in our test cases. The network is fed by the projected data features. However, the network is not fully trained, but partially trained. Therefore, training is halted after a few epochs. The epoch errors are high, which indicates that training is far from complete. When training is halted, the network cannot correctly identify the clusters. However, we keep on training the network to observe the behavior of the PTNN at the early stage of the training. In summary, the PTNN is simply an MLP whose training is stopped after a predefined number of epochs rather than run to completion.
Figure 3a,b illustrate mean squared error (MSE) results obtained from the fully trained NN and PTNN training, respectively. MSE is computed as the mean of the squared differences between the actual output and the estimated output. Figure 3b represents the performance of the neural network at the early stages of training. The results show that the MSE decreases rapidly at the beginning of the training phase and changes slowly until 2000 epochs are reached. Then, it remains almost constant, implying that the NN is fully trained. In total, 60,000 and 10,000 samples are employed during the training and testing phases, respectively, and a test accuracy of 98.58% is achieved [39]. Additionally, when an MLP NN is trained for classifying the digits in the MNIST dataset with zero feature extraction, the number of the epochs required varies from 40 to 50 to achieve test accuracies between 87% and 98% using 60,000 samples for the training set and 10,000 samples for the test set [40,41].
Despite stopping the training at a significantly early stage, if the outputs of the hidden units of the partially trained network are used as features, we find that intra-cluster distances are reduced as compared to those in the PCA feature space. On the other hand, the size of the feature vectors in the two-stage feature generator is larger than that in the one-stage feature generator (i.e., larger than K). That is, the feature space composed of the two-stage feature generator includes more features than that of the one-stage feature generator. Hence, the proposed approach does not reduce the number of features. However, it improves the accuracy of the classifier.
Algorithm 2 describes the transformation to generate features based on the PCA plus PTNN.
Algorithm 2: Obtaining neural network-based features from projected data on the PCs
Input: The feature matrix $F = (f_1 \; f_2 \; \dots \; f_K)$
  • S1: Build an MLP network with one hidden layer and $P$ hidden nodes (neurons).
  • S2: Start training the network for classifying the examples represented by the rows of $F$.
  • S3: Halt training in early iterations.
  • S4: Calculate the outputs of the hidden layer
    $h_i = \mathrm{sigmoid}(F W + b)$
    where $W$ is the weight matrix between the input and hidden layers and $i = 1, \dots, P$.
  • S5: Construct the hidden layer output matrix $H = (h_1 \; h_2 \; \dots \; h_P)$ whose size is $N \times P$. Although the algorithm ends in this step, the following step demonstrates the effectiveness of the generated features.
  • S6: Train a classifier (such as SVM or MDC) by the rows of the matrix $H$.
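A minimal NumPy sketch of Algorithm 2 is given below. It is an illustration under stated assumptions rather than the authors' code: the function names `ptnn_features` and `hidden_outputs`, the sigmoid output layer, the full-batch gradient descent on the MSE, the weight initialization, and the random stand-in data are all choices made for this example. The learning rate of 0.5 and the 50 hidden nodes follow the settings reported in Section 5.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def ptnn_features(F, labels, P=50, epochs=10, lr=0.5, seed=0):
    """Train a one-hidden-layer MLP for a few epochs and return its hidden outputs.

    F: N x K feature matrix (rows are examples); labels: N integer class ids.
    Returns (H, (W1, b1)), where H is the N x P hidden-layer output matrix.
    """
    rng = np.random.default_rng(seed)
    F = np.asarray(F, dtype=float)
    labels = np.asarray(labels)
    N, K = F.shape
    n_classes = int(labels.max()) + 1
    Y = np.eye(n_classes)[labels]                  # one-hot targets, N x n_classes
    # S1: one hidden layer with P sigmoid units (small random initial weights).
    W1 = rng.normal(0.0, 0.1, size=(K, P))
    b1 = np.zeros(P)
    W2 = rng.normal(0.0, 0.1, size=(P, n_classes))
    b2 = np.zeros(n_classes)
    # S2/S3: gradient descent on the MSE, halted after `epochs` passes.
    for _ in range(epochs):
        H = sigmoid(F @ W1 + b1)
        out = sigmoid(H @ W2 + b2)
        d_out = (out - Y) * out * (1.0 - out)      # error signal at the output layer
        d_hid = (d_out @ W2.T) * H * (1.0 - H)     # error signal at the hidden layer
        W2 -= lr * (H.T @ d_out) / N
        b2 -= lr * d_out.mean(axis=0)
        W1 -= lr * (F.T @ d_hid) / N
        b1 -= lr * d_hid.mean(axis=0)
    # S4/S5: hidden-layer outputs of the partially trained network are the features.
    return sigmoid(F @ W1 + b1), (W1, b1)


def hidden_outputs(F, params):
    """Map unseen PCA features to the hidden-layer feature space."""
    W1, b1 = params
    return sigmoid(np.asarray(F, dtype=float) @ W1 + b1)


# Hypothetical stand-ins for PCA features (N = 100, K = 10) and digit labels.
rng = np.random.default_rng(0)
F = rng.random((100, 10))
labels = rng.integers(0, 10, 100)
H, params = ptnn_features(F, labels, P=50, epochs=15, lr=0.5)
```

The rows of `H` would then be passed to the MDC or SVM in step S6, and `hidden_outputs` would be applied to the PCA features of the test samples.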
The algorithms discussed above are tested on the MNIST and USPS digit datasets to analyze the distance distribution of each digit class in this study. For this purpose, the distances within the class and between classes are calculated, where the Euclidean distance is used as the distance metric. Let $d_i$ and $d_j$ be row vectors in $\mathbb{R}^N$. Then, the Euclidean distance between these two vectors is defined as
$$L = \lVert d_j - d_i \rVert = \sqrt{(d_{j1} - d_{i1})^2 + \dots + (d_{jN} - d_{iN})^2} \tag{4}$$
The within-cluster distances are calculated by Algorithm 3.
Algorithm 3: Calculating the distances among the feature vectors within a digit class
Assume that $F_m = (f_1 \; f_2 \; \dots \; f_S)$ is a feature matrix for the $m$th class, where $m = 1, 2, \dots, O$ and $S$ is the number of the examples in a given class.
  • S1: Calculate the centroid of the class $m$ as:
    $F_c^m = \frac{1}{S} \sum_{s=1}^{S} f_s$
  • S2: Calculate the Euclidean distance between each example and centroid vectors as:
    $L_m^c = \lVert f_m - F_c^m \rVert$
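Algorithm 3 can be illustrated with a short standard-library Python sketch. The function names, the toy two-dimensional features, and the use of the population standard deviation to summarize the spread of the distances are assumptions made for this example.

```python
import math
import statistics


def centroid(vectors):
    """Component-wise mean of equally sized feature vectors (S1 of Algorithm 3)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def within_class_distances(vectors):
    """Euclidean distance from every example to its class centroid (S2)."""
    c = centroid(vectors)
    return [math.dist(v, c) for v in vectors]


# Hypothetical 2-D features for one digit class (corners of a square).
class_m = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
dists = within_class_distances(class_m)
# Spread of the distances within the class (population standard deviation, assumed).
spread = statistics.pstdev(dists)
```

In this toy class every example lies at the same distance from the centroid, so the spread of the distances is zero; real feature clusters would give a positive value.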
It is expected that the proposed framework yields minimized intra-cluster distances or maximized inter-class distances. This expectation is verified in the following section for both the one-stage and two-stage feature generators and the algorithms considered.

4. Verification of Inter and Intra Class Distributions

In this section, the intra-class and inter-class distance distributions are verified using the distance metric presented in Equation (5). To form the metric, first of all, the standard deviation, which indicates how sparsely or densely distributed the distances are within a class, is determined for each class. Then, to quantify the distance between the classes, the separation metric (SM) is formed:
$$SM = \frac{d_{ij}}{(\sigma_i + \sigma_j)/2} \tag{5}$$
where $d_{ij}$ is the distance between the centers of classes $i$ and $j$, while $\sigma_i$ and $\sigma_j$ are the standard deviations for classes $i$ and $j$, respectively. This metric represents the degree of separability. The inter-class distances are calculated by Algorithm 4.
Algorithm 4: Calculating the distances between two digit classes
Assume that $F_m = (f_1 \; f_2 \; \dots \; f_S)$ is a feature matrix for the $m$th class, where $m = 1, 2, \dots, O$ and $S$ is the number of the examples in a given class.
  • S1: Calculate the centroids for each class in a given dataset.
  • S2: Calculate the distance between the centroids of two classes.
    $L_{m(m-1)} = \lVert F_c^m - F_c^{m-1} \rVert$
In step 2 of Algorithm 4, $m(m-1)$ represents the distance of the center of each class from the centers of all other classes. Suppose the following:
  • Case 1: if the distance remains constant and the standard deviations in Equation (5) have small values, then SM becomes higher. Note that a higher SM indicates better separation.
  • Case 2: if the standard deviations in Equation (5) are constant and the distance has high values, then SM becomes higher.
These two cases are illustrated in Figure 4. Table 1 and Table 2 show the standard deviations of distances between the centers of classes using the one-stage and two-stage feature generators, respectively, in the USPS dataset. The results show that the standard deviations (i.e., σ's) using the two-stage feature generator are smaller than those using the one-stage feature generator. This is associated with the fact that samples in the given class are distributed close to the center of the class. A consequence of this is that the data are more separable in the feature space formed by the PCA plus PTNN. In other words, the boundary or volume of each cluster shrinks inward. On the other hand, in the one-stage case, the samples in each class are scattered away from the center of the class, so that large values of standard deviations are obtained. Consequently, the variation within a cluster without the PTNN is higher than that with the PTNN.
Table 3 and Table 4 show the separability values calculated by Equation (5) for the one-stage and two-stage feature generators, respectively. It can be seen that classes scattered in the feature space are more separable in the two-stage case (i.e., the separability increases). In pattern recognition, this is one of the desired requirements for a classifier to classify data accurately. Furthermore, we can compute an SM ratio by dividing the value for a selected class in Table 4 by the value for that class in Table 3. Once these ratios are calculated, it can be seen from Table 5 that they are mostly greater than 1. Thus, the classes in the feature space built from the PCA plus PTNN are more separable compared to those in the PCA space.
The same cluster behavior is also observed for the MNIST digit dataset. Table 6 and Table 7 show the standard deviations for the clusters formed with 5000 and 10,000 samples, respectively. It can be seen from the tables that the variation within a cluster with the two-stage extractor is lower than that with the one-stage generator. The separability values and SM ratios for the MNIST dataset are shown in Table 8, Table 9 and Table 10, respectively.
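The behavior of the separation metric in Equation (5), including Case 1 and Case 2, can be checked with a small standard-library Python sketch. The function name `separation_metric` and all class centers and standard deviations below are hypothetical values chosen only to make the arithmetic easy to follow; they are not taken from the tables.

```python
import math


def separation_metric(center_i, center_j, sigma_i, sigma_j):
    """SM = d_ij / ((sigma_i + sigma_j) / 2), as in Equation (5)."""
    d_ij = math.dist(center_i, center_j)
    return d_ij / ((sigma_i + sigma_j) / 2.0)


# Baseline: centers 5 units apart, both standard deviations equal to 2.
baseline = separation_metric([0.0, 0.0], [3.0, 4.0], 2.0, 2.0)   # 5 / 2 = 2.5
# Case 1: same distance, smaller standard deviations.
case_1 = separation_metric([0.0, 0.0], [3.0, 4.0], 1.0, 1.0)     # 5 / 1 = 5.0
# Case 2: same standard deviations, larger distance.
case_2 = separation_metric([0.0, 0.0], [6.0, 8.0], 2.0, 2.0)     # 10 / 2 = 5.0
# Ratio of two SM values, analogous to the SM ratios reported in the tables.
ratio = case_1 / baseline                                         # 2.0
```

A ratio greater than 1 means the pair of classes is more separable in the second feature space than in the first, which is how the SM ratios are interpreted above.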

5. Results and Discussion

The performance of the proposed feature generator is tested on the MNIST and USPS digit datasets. The USPS handwritten digit dataset is derived from a project on recognizing handwritten digits on envelopes [42]. The digits have sizes of 16 × 16 pixels. It contains 7291 samples for the training set and 2007 samples for the test set. The standard MNIST dataset is derived from the NIST dataset and was created by LeCun et al. [43]. The digits have sizes of 28 × 28 pixels. It has 60,000 samples for the training set and 10,000 samples for the test set. Figure 5 and Figure 6 show some examples of the digits from the MNIST and USPS datasets, respectively. The MDC and SVM are utilized to identify digits in these datasets. The MDC is a simple classifier. In the training phase, training vectors are separated by each class. Then, the mean values of each class are computed. In the test phase, the closest mean to the test vector is calculated via the Euclidean distance. Then, the corresponding class is predicted. The SVM is much more complex than the MDC. It is capable of extracting not only linear but also curved decision boundaries. Thus, more accurate classification can be achieved by setting a maximum margin separator among the sample points, where the margin is defined as the distance of the decision boundary to the closest sample.
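The training and test phases of the MDC described above can be sketched in standard-library Python as follows. The function names `mdc_fit` and `mdc_predict` and the toy feature vectors are assumptions for illustration; in the framework, the inputs would be the PCA features or the PTNN hidden-layer outputs.

```python
import math
from collections import defaultdict


def mdc_fit(features, labels):
    """Training phase: group the vectors by class and store each class mean."""
    groups = defaultdict(list)
    for x, y in zip(features, labels):
        groups[y].append(x)
    return {
        y: [sum(col) / len(rows) for col in zip(*rows)]
        for y, rows in groups.items()
    }


def mdc_predict(means, x):
    """Test phase: return the class whose mean is closest in Euclidean distance."""
    return min(means, key=lambda y: math.dist(x, means[y]))


# Hypothetical 2-D features for two classes.
train_x = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
train_y = [0, 0, 1, 1]
means = mdc_fit(train_x, train_y)
predictions = [mdc_predict(means, x) for x in [[0.1, 0.2], [0.8, 0.7]]]
```

Because the MDC stores only one mean vector per class, its accuracy depends directly on how tightly the features cluster around their class centers, which is why it is a useful probe of the feature generators.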
In Table 11, the results of the MDC for the USPS digit classes are shown for both the one-stage and two-stage feature generators. We then determine the accuracies for different eigenvalues and different training sizes. The results show that the best recognition rate is achieved using 4000 samples for the training set and 5298 samples for the test set with K = 10. Note that the NN is partly trained for various epochs, i.e., the training is halted in the early stage of iterations. As an example, Table 12 presents the accuracies for K = 10 at different epochs and different training sizes. The table shows that the performance of PCA plus PTNN (two-stage generator) is higher than that of the one-stage extractor. Moreover, as an example, the performance of the two-stage generator framework with a training size of 500 samples is improved by 2.386 points with reference to the one-stage extractor at an epoch of 15 for the USPS dataset. During the training for each scenario, the learning rate and the number of hidden nodes are set to 0.5 and 50, respectively. Then, hidden layer outputs are extracted from the NN. The mean values of these outputs are calculated for each digit class. For the unseen test data, the hidden layer output is calculated. Then, Euclidean distances of the test data to the mean values of digit classes are computed. The test data are classified according to the digit class with minimum distance. For all the scenarios, two-stage features lead to higher performance than one-stage features. The average test recognition rates for 10 classes are 91.60% and 90.13% at the training size of 4000 for the two-stage and one-stage cases, respectively.
Table 13 presents the performance rates for the MNIST digit classes. As seen, the performance of a two-stage extractor is lower than that of the one-stage extractor for small training sizes. However, an improvement in the performance appears for the full training size of 60,000.
Table 14 and Table 15 show the test success rates of the SVM classifier for the USPS and MNIST datasets, respectively. The experiments on SVM are held with the RBF kernel function. Although the best performance is obtained using 60,000 samples for the training set and 10,000 samples for the test set, it is clear that small training sizes also result in very high accuracies. PTNN is trained with a learning rate of 0.50 and 50 hidden nodes.
Table 16 lists the accuracies with K = 8 at different epochs and training sizes. The improvements in the performance of the two-stage extractor are clear; for instance, accuracy is increased by 1.5869 points with respect to the one-stage extractor at an epoch of 10 for a training size of 5000.
Although the PTNN is trained for 5% to 30% of the MNIST and USPS datasets, the proposed method achieves almost perfect performance with the SVM. Furthermore, the performance is acceptable even for a simple MDC. The results show that the proposed approach provides more relevant features for the data. Hence, the classifier achieves much better performance scores, i.e., 99.9863% and 99.9815%, for the USPS and MNIST datasets, respectively. To the best of our knowledge, these are the best performances in the current literature.
Table 17 shows the effectiveness of the two-stage feature extractor. Improvements in the accuracies with respect to the one-stage extractor are clear for each classifier. This shows that the proposed features considerably improve the generalization ability of the classifiers.
Table 18 and Table 19 show comparisons of the performances of our framework and some state-of-the-art methods on the MNIST and USPS datasets, respectively. The results show that the proposed method outperforms well-known techniques in the literature. Note that the SVM using two-stage features achieves error rates of 0.0185% and 0.0137% for the MNIST and USPS datasets, respectively, which are currently the best performances in the literature.

6. Conclusions and Future Work

In this paper, we proposed a novel framework based on a two-stage feature generator for handwritten digit classification. The first stage of this framework relies on the PCA, which generates the projected data from the eigenvectors corresponding to the K highest eigenvalues. The second stage has been constructed by a PTNN whose training has been halted at early epochs, i.e., it was not fully trained to recognize the input classes. This PTNN has been fed by the projected data on principal components and then its hidden layer outputs have been selected as new features, which have been used to train the MDC and SVM classifiers.
We evaluated the performance of the proposed method on the MNIST and USPS datasets. In both datasets, the best results are obtained using an SVM classifier. We found that the two-stage feature extractor leads to noticeable improvements in terms of accuracy. Moreover, compared to current state-of-the-art methods, the proposed framework has resulted in almost perfect performances even with small training sizes. In addition, our experiments have shown that the proposed method can achieve error rates of 0.0185% and 0.0137% for the MNIST and USPS datasets, respectively, which can currently be considered the best performances in the literature.
In future work, as an easier but meaningful expansion, sign recognition will be added to the study. As a more complex study, we will use face and texture datasets to further evaluate the usefulness of our proposed framework.

Author Contributions

Conceptualization, H.T., M.A.G.P. and K.O.; methodology, H.T., M.A.G.P., İ.B. and K.O.; software, H.T. and M.A.G.P.; validation, H.T., M.A.G.P. and K.O.; formal analysis, H.T. and K.O.; investigation, H.T., M.A.G.P., K.O. and İ.B.; resources, H.T. and İ.B.; data curation, H.T., M.A.G.P. and K.O.; writing—original draft preparation, M.A.G.P.; writing—review and editing, H.T., K.O. and İ.B.; visualization, H.T. and K.O.; supervision, H.T. and K.O.; project administration, H.T. and K.O.; funding acquisition, İ.B. All authors have read and agreed to the published version of the manuscript.


Funding

This research was funded by KTH Royal Institute of Technology (via KTH Library’s Open Access Policy).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Correction Statement

Due to an error in article production, an incorrect author was listed as a corresponding author in the original publication. This manuscript has been updated and this change does not affect the scientific content of the article.


