Article

Active Learning: Encoder-Decoder-Outlayer and Vector Space Diversification Sampling

1 Department of Computer Science, University of Toronto, Toronto, ON M5S 0A5, Canada
2 Department of Mechanical & Industrial Engineering, Northeastern University, Boston, MA 02115, USA
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2023, 11(13), 2819; https://doi.org/10.3390/math11132819
Submission received: 25 May 2023 / Revised: 19 June 2023 / Accepted: 19 June 2023 / Published: 23 June 2023

Abstract:
This study introduces a training pipeline comprising two components: the Encoder-Decoder-Outlayer framework and the Vector Space Diversification Sampling method. This framework efficiently separates the pre-training and fine-tuning stages, while the sampling method employs pivot nodes to divide the subvector space and selectively choose unlabeled data, thereby reducing the reliance on human labeling. The pipeline offers numerous advantages, including rapid training, parallelization, buffer capability, flexibility, low GPU memory usage, and a sample method with nearly linear time complexity. Experimental results demonstrate that models trained with the proposed sampling algorithm generally outperform those trained with random sampling on small datasets. These characteristics make it a highly efficient and effective training approach for machine learning models. Further details can be found in the project repository on GitHub.

1. Introduction

The outcome of modeling is significantly influenced by labeled datasets, which are usually costly in terms of human effort. For many years, researchers relied on intuition or randomly sampled data points and split train–test data randomly. This paper was inspired by an active learning methodology that leverages neural networks to guide humans in preparing and labeling data, minimizing human effort and improving the overall performance of models.
Researchers in the field of natural language processing (NLP) commonly employ zero-shot, one-shot, and few-shot methods [1] to address issues of limited labeled data. However, these methods have limitations: the model parameters are not updated and customization is limited, which makes it difficult for most large language models (LLMs) to reach industry-level scores. An alternative is the pre-training and fine-tuning approach [2], which, however, requires substantial labeled data and is costly to train. Therefore, this paper proposes an Encoder-Decoder-Outlayer framework that addresses these shortcomings and provides additional benefits.
To address the challenge of adapting pre-trained language models to specific downstream tasks without requiring extensive fine-tuning or re-training of the entire model, there are some approaches similar to the adapter method [3]. However, these approaches, like the adapter, are nested in the large language model (LLM), necessitating the entire LLM to be accommodated in GPU memory for training and prediction. In this context, the utilization of sampling methods will be examined to categorize unlabeled datasets, thereby choosing data that can enhance modeling accuracy. During training and prediction, only the necessary parts need to be stored in GPU memory, and the method supports a large Encoder vector buffer during training.
Active learning [4] can be an effective approach to improve the performance of neural networks and Encoder-Decoder models. To this end, a novel framework has been proposed that combines active learning with the Vector Space Diversification (VSD) sampling technique. The first step in this approach is to train an Encoder-Decoder model and then apply the VSD sampling method to it. This sampling method uses a tree-level sample to efficiently explore the diversity of the Encoder vector data points. Compared with traditional clustering techniques, such as DBSCAN, and dimension reduction methods, such as t-SNE and PCA, the tree-level sample approach is more efficient and has nearly linear time complexity in practice. PCA is an algorithm that finds the principal components of a dataset, which can reduce the number of dimensions while preserving the most important information in the dataset. t-SNE is a technique that maps high-dimensional data points onto a low-dimensional space, producing a map that preserves the similarities between points in the high-dimensional space. The difference between the two methods is that PCA is linear while t-SNE is non-linear; t-SNE is better at preserving local structure and can be used to create more visually appealing maps. By incorporating active learning into the VSD sampling technique, the proposed framework enables the model to iteratively select and label the most informative data points from the remaining unlabeled data, thereby improving the overall performance of the model. This approach is particularly useful for large datasets where manually labeling all the data points is not feasible or practical. The authors of [5] propose a sample selection strategy for active learning to enhance quality prediction performance with limited labeled data; it uses a minimax game with a latent-enhanced variational autoencoder to deceive an adversarial network and Gaussian process regression to incrementally select informative unlabeled samples. The authors of [6] developed an active learning method to explore information from multiphase flow process data, facilitating smart process modeling and prediction; an index is proposed to describe the process dynamics and nonlinearity, and a criterion to judge learning termination is designed. The authors of [7] propose energy-based active domain adaptation (EADA), which queries groups of target data that incorporate both domain characteristics and instance uncertainty; experiments show that EADA surpasses state-of-the-art methods on challenging benchmarks with substantial improvements. The authors of [8] developed a multi-purpose haze removal framework for nighttime hazy images; it uses a nonlinear model based on Retinex theory and a variational Retinex model to estimate a smoothed illumination component and predict the noise map. Experiments show that the proposed framework outperforms well-known nighttime image dehazing methods and can also be applied to other types of degraded images.
The Encoder used in this study is based on Sentence-BERT, which combines Transformer [9] and ResNet architectures. Transformers have become a popular neural network architecture for natural language processing (NLP) tasks, as well as image recognition and other applications. They are particularly useful for classifying unlabeled datasets because of their ability to learn from vast amounts of data and identify patterns and relationships in complex datasets. ResNets, in contrast, are deep convolutional neural networks that have achieved state-of-the-art performance in image classification tasks. They are designed to address the vanishing-gradient problem that can occur in deep neural networks and have been successfully used in various applications, such as object detection and image segmentation.
This study aims to explore the use of sampling methods for classifying unlabeled datasets and selecting data that can improve modeling accuracy. The initial section will present the description of the data. Subsequently, various sampling methodologies will be analyzed, highlighting their benefits and drawbacks, and discussing the requisite methodologies and tools for their implementation. The experiment’s specifics and outcomes will be presented, followed by real-world illustrations demonstrating the applicability of these techniques in resolving intricate classification challenges. The following are the contributions of this work:
- Proposal for an Encoder-Decoder-Outlayer (EDO) active learning method for text classification;
- Exploration of the applicability of EDO, demonstrating its effectiveness in addressing issues of limited labeled data;
- Exploration of the utilization of different models and techniques, such as BERTbase, S-BERT, Universal Sentence Encoder, Word2Vec, and Document2Vec, to optimize datasets for deep learning;
- Proposal for the use of t-SNE for dimension reduction and comparison of sentence vectors.

2. Literature Review

The optimization of datasets is a crucial part of deep learning and has been a critical research field for many researchers. This section reviews and compares related studies on classifying (clustering) datasets and on how data are selected. There has been a limited amount of research on active learning (AL) in the context of text classification, especially with regard to the latest, cutting-edge natural language processing (NLP) models. The work in [10] involved an empirical analysis that evaluated various uncertainty-based algorithms using BERTbase as the classifier. To compare different strategies for obtaining sentence vectors, ref. [11] defines three objective functions for training and optimizing different tasks: a Classification Objective Function, a Regression Objective Function, and a Triplet Objective Function. All-mpnet-base-v2 is based on S-BERT; this framework is used to generate sentence or text embeddings, which can be compared to find sentences with similar meanings. In addition to S-BERT, the Universal Sentence Encoder [12], Word2Vec [13], and Document2Vec [14] are all viable options.
For dimension reduction, ref. [15] presents t-SNE, a technique for visualizing high-dimensional data by giving each data point a location on a two- or three-dimensional map. The information contained in high-dimensional vectors is preserved after they are transformed into low-dimensional vectors. The basic idea is that two vectors that are similar in the high-dimensional space should remain close to each other after being reduced to a low dimension.
The authors of [1] propose fine-tuning pre-trained models on small datasets with adapters that store in-domain knowledge and that are pre-trained in a task-specific way on a large corpus of unannotated customer reviews, using held-out reviews as pseudo summaries; this improves summary quality over standard fine-tuning and allows for summary personalization through aspect keyword queries. The authors of [2] examined the brittleness of fine-tuning pre-trained contextual word embedding models for natural language processing tasks by experimenting with four datasets from the GLUE benchmark and varying random seeds. They found substantial performance increases, quantified how the performance of the best-found model varies with the number of fine-tuning trials, and explored factors influenced by the choice of random seed, such as weight initialization and training data order.
The Encoder-Decoder architecture is commonly used in NLP tasks, such as machine translation and text summarization. The Encoder takes an input sequence, such as a sentence in one language, and transforms it into a fixed-dimensional vector representation. The Decoder then takes this representation as input and generates an output sequence, such as a translated sentence in another language. The work in [9] brings up a Transformer based on this Encoder-Decoder architecture.
For deeper neural network training, ref. [16] presents ResNet to ease the training of networks that are substantially deeper than those used previously. The learned representations also generalize well to other recognition tasks, although overfitting may worsen results, and combining the approach with stronger regularization may improve them. The authors of [9] propose the Transformer, a network architecture based solely on attention mechanisms, which outperforms complex recurrent or convolutional neural networks with Encoder-Decoder attention mechanisms in machine translation tasks, achieving state-of-the-art BLEU scores with significantly less training time and cost and generalizing well to other tasks. The residual learning framework introduced in [16] won first place in the ILSVRC 2015 classification task and improved performance on the COCO object detection dataset; it eases the training of substantially deeper neural networks and achieves higher accuracy.
The authors of [17] developed a procedure for Int8 matrix multiplication in Transformer that reduces the memory needed for inference by half while retaining full precision performance by using vector-wise quantization and a mixed-precision decomposition scheme to cope with highly systematic emergent features in language models, enabling up to 175B-parameter LLMs to be used without any performance degradation.
For image classification, the authors of [16] present the use of parametric rectified linear units (PReLU) and a robust initialization method in training extremely deep rectified neural networks for image classification, achieving a 4.94% top-5 test error on the ImageNet 2012 classification dataset, surpassing human-level performance for the first time. In 2018, the work in [18] involved comparing the performance of seven commonly used stochastic-gradient-based optimization techniques in a convolutional neural network (ConvNet), and Nadam achieved the best performance.

3. Data Description

The IMDB dataset contains highly polar movie reviews. The Amazon_polarity dataset contains product reviews from Amazon. Each sample from these two datasets is annotated with a label: 0 (negative) or 1 (positive). The Ag_news dataset is a collection of news articles gathered by ComeToMyHead; each sample is labeled according to its category: World (0), Sports (1), Business (2), and Sci/Tech (3). The Emotion dataset contains Twitter messages classified by emotion: sadness (0), joy (1), love (2), anger (3), fear (4), and surprise (5). The DBpedia dataset is constructed from 14 different classes in DBpedia, and each sample is annotated with its class. The YelpReviewFull dataset contains reviews from Yelp, each annotated with a label from 0 to 4 corresponding to the score associated with the review.
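The datasets above are publicly available; the sketch below shows one way they could be loaded, assuming the Hugging Face datasets library and the hub identifiers listed in the code (these identifiers are our assumption and are not stated in the paper).

```python
# Minimal sketch of loading the six benchmark datasets with the Hugging Face
# `datasets` library. The hub identifiers below are assumptions and may differ
# from the exact copies used by the authors.
from datasets import load_dataset

DATASETS = {
    "imdb": "imdb",                    # binary sentiment (0/1)
    "amazonpolar": "amazon_polarity",  # binary sentiment (0/1)
    "agnews": "ag_news",               # 4 news categories
    "emotion": "emotion",              # 6 emotions
    "dbpedia": "dbpedia_14",           # 14 ontology classes
    "yelp": "yelp_review_full",        # 5 review scores
}

def load_text_and_labels(name: str):
    """Return (train_texts, train_labels, test_texts, test_labels) for one dataset."""
    ds = load_dataset(DATASETS[name])
    # Some datasets store the text in a "content" column instead of "text".
    text_col = "content" if "content" in ds["train"].column_names else "text"
    train, test = ds["train"], ds["test"]
    return train[text_col], train["label"], test[text_col], test["label"]
```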

4. Methodology

The approach described draws inspiration from the concept of diversified investments in the realm of financial investment. In traditional financial investment strategies, the fundamental idea is to construct a portfolio by combining a set of unrelated assets. By diversifying the portfolio, investors aim to reduce the overall risk while potentially increasing the return on investment [19,20]. Similarly, the approach being discussed adopts a similar principle of diversification to tackle a different kind of risk in the context of machine learning models.
To reduce variance and enhance the performance of the model, the approach utilizes a collection of smaller models instead of relying on a single large model. This ensemble of models is designed to work in tandem within an Encoder-Decoder framework. The Encoder part of the framework maps real data to encoded vectors, while the Decoder part allows for sampling on these encoded vectors. By performing sampling, the approach introduces an ordered structure to the dataset, where the first few data points provide the greatest amount of diversity.
The encoded vector spaces generated by the model maintain a unique property: similar original data points will have a smaller distance between each other in the vector space. This property facilitates the effective organization and representation of the data, enabling the model to capture important patterns and relationships more efficiently.
In the process of training the model, a subset of the dataset is selected for manual labeling, as opposed to labeling the entire dataset. This strategic approach minimizes the resources required for manual labeling while still obtaining valuable labeled data. The labeled subset is then used to train a simple Outlayer model. This model takes the encoded vectors as input and produces human labels as output. By training on this subset, the model can learn to generalize and predict labels for the remaining unlabeled data.
For building the training model, the approach adopts the Nadam optimizer. Nadam, a combination of Nesterov accelerated gradient descent [21] and Adam optimization algorithm [22], offers distinct advantages. It provides greater control over the learning rate and directly influences the gradient update, resulting in improved convergence and potentially faster training times.
By incorporating these strategies and techniques inspired by diversified financial investments, the approach aims to mitigate risk and enhance performance in the realm of machine learning. The use of smaller models, the organization of data through encoded vectors, and the selective labeling process all contribute to a more robust and efficient learning framework. Additionally, the choice of the Nadam optimizer further optimizes the training process, ultimately leading to better outcomes in terms of accuracy and generalization.
One advantage of this approach is the separation of feature extraction and output, which helps reduce the GPU RAM usage. During the pre-training stage, only raw data are required, and the Encoder-Decoder model is stored in the GPU RAM. During the encoded vector buffering step, only the Encoder is kept in the GPU RAM. Similarly, during the sampling stage, only the sample algorithm is running. When training the Outlayer, only the simple Outlayer and batch data are stored in the RAM.
By breaking down the large prediction model into smaller parts and executing the process step-by-step, the Outlayer model can accommodate more encoded vectors and process a larger batch of items within a fixed GPU RAM capacity. This partitioning of tasks and resource allocation allows for more efficient memory management during the different stages of the approach. It ensures that only the necessary components are stored in the GPU RAM at any given time, freeing up space for other operations. The advantage of this approach becomes particularly evident when dealing with large datasets or when working with limited GPU resources. By carefully managing the GPU RAM usage, the approach enables the model to handle a greater number of encoded vectors and process larger batches of items, without exceeding the memory constraints. This scalability and flexibility contribute to the overall effectiveness and practicality of the approach, making it suitable for a wide range of applications.

Basic Framework

The model utilized in this study consists of three primary components: Encoder, Decoder, and Outlayer. The Encoder component is responsible for transforming the data into feature vectors, which are then subjected to Vector Space Diversification sampling. This sampling process reorganizes the dataset, and the first N samples are selected for training the model. Figure 1 depicts the Encoder-Decoder-Outlayer framework, which comprises an Encoder, a Decoder, and an Outlayer.
During training, the chosen loss function is cross-entropy loss with weight. This loss function helps measure the discrepancy between the predicted outputs and the actual labels, taking into account the importance assigned to each class. Additionally, F1-score guidance is employed as a trigger mechanism. If the F1-score decreases below the previous score, specific actions are initiated to address and rectify the issue. Overall, the model’s architecture and training process aim to effectively encode the data, generate diverse samples through vector space diversification, and train the model using the selected samples. The use of cross-entropy loss with weight assists in optimizing the model’s performance, while the F1-score guidance helps monitor and manage the training progress, ensuring that the model maintains or improves its performance throughout the training process.

5. Experimental Results

5.1. Settings

The Encoder used in this study was the “all-mpnet-base-v2” Sentence-BERT model with 768 features. It was employed to transform both the train and test datasets into vectors. Figure 2 illustrates the architecture of the Outlayer, which consists of a three-layer ResNet framework with PReLU activation, batch normalization, a linear layer, and a hidden layer size set to twice the cluster number.
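The sketch below gives a minimal PyTorch rendition of this Outlayer under the stated configuration; the exact wiring of the residual blocks and the input projection are our assumptions.

```python
# Minimal PyTorch sketch of the Outlayer: a small residual MLP over the
# 768-dimensional Sentence-BERT vectors with three residual blocks, PReLU
# activation, batch normalization before each linear layer, and a hidden size
# of twice the number of classes. The exact block wiring and input projection
# are our assumptions.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.BatchNorm1d(dim)          # batch norm before the linear layer
        self.linear = nn.Linear(dim, dim)
        self.act = nn.PReLU()

    def forward(self, x):
        return x + self.act(self.linear(self.norm(x)))   # skip connection

class Outlayer(nn.Module):
    def __init__(self, in_dim: int = 768, num_classes: int = 4):
        super().__init__()
        hidden = 2 * num_classes                 # hidden size = twice the cluster number
        self.proj = nn.Linear(in_dim, hidden)    # map encoder vectors to the hidden width
        self.blocks = nn.Sequential(*[ResidualBlock(hidden) for _ in range(3)])
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):
        return self.head(self.blocks(self.proj(x)))

model = Outlayer(num_classes=4)
optimizer = torch.optim.NAdam(model.parameters(), lr=0.1)   # Nadam, initial lr 0.1
```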
The study employed the Nadam optimizer with an initial learning rate of 0.1 to train the three-layer ResNet framework. The activation function was PReLU, and a batch normalization [23] layer was applied before the linear layer. The hidden layer size was set to twice the cluster number, and cross-entropy loss with weight was used. The weight was determined using the following formula:
$$W_c = \frac{\left(1 + \sum_{i=1}^{N} \mathbf{1}[y_i = c]\right)^{-1}}{\sum_{c'=1}^{K} \left(1 + \sum_{i=1}^{N} \mathbf{1}[y_i = c']\right)^{-1}}$$
where N is the number of labeled training samples, K is the number of classes, and $\mathbf{1}[\cdot]$ is the indicator function.
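In other words, each class weight is the inverse of one plus the class count, normalized over the K classes. A small sketch of this computation in PyTorch, feeding a weighted cross-entropy loss, is shown below.

```python
# Sketch of the class-weight computation: the weight of class c is the inverse
# of (1 + its sample count), normalized over the K classes, and is passed to a
# weighted cross-entropy loss.
import torch

def class_weights(labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    counts = torch.bincount(labels, minlength=num_classes).float()
    inv = 1.0 / (1.0 + counts)   # (1 + sum_i 1[y_i = c])^(-1)
    return inv / inv.sum()       # normalize over the K classes

labels = torch.tensor([0, 0, 1, 2, 2, 2])
weights = class_weights(labels, num_classes=3)
criterion = torch.nn.CrossEntropyLoss(weight=weights)
```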
F1-score guidance was used: it was triggered whenever the F1-score decreased below the previous score. When this occurred, the learning rate was reduced by half and the forgiveness count was decreased by one. The forgiveness count was initialized at 12, and when it reached zero, training stopped. The F1-score threshold for early stopping was set at 0.995.
The use of F1-score guidance and early stopping eliminated the need for a validation set, as no other models were compared. During training, the model was only saved if the loss was not NaN and the F1-score had improved. The study was conducted on an NVIDIA GeForce RTX 3080 Ti GPU.
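The guidance loop can be sketched as follows; the helper functions are illustrative placeholders rather than functions from the released code.

```python
# Schematic sketch of the F1-score guidance described above. The helpers
# `train_one_epoch` and `compute_f1` are illustrative placeholders supplied by
# the caller, not functions from the released code.
import copy
import math

def train_with_f1_guidance(model, optimizer, train_one_epoch, compute_f1,
                           forgiveness=12, f1_threshold=0.995):
    prev_f1, best_state = 0.0, None
    while forgiveness > 0:
        loss = train_one_epoch(model, optimizer)   # one pass over the labeled subset
        f1 = compute_f1(model)
        if f1 < prev_f1:                           # F1 dropped below the previous score
            for group in optimizer.param_groups:
                group["lr"] *= 0.5                 # halve the learning rate
            forgiveness -= 1                       # spend one unit of forgiveness
        elif f1 > prev_f1 and not math.isnan(loss):
            best_state = copy.deepcopy(model.state_dict())  # save only on improvement
        prev_f1 = f1
        if f1 >= f1_threshold:                     # early-stop threshold
            break
    return best_state
```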
GPU memory optimization and data buffering:
(1) Train an auto Encoder-Decoder (omitted; borrowed from Sentence-BERT)
  a. Only the Encoder, Decoder, and training data are in GPU memory
(2) Encoder vector data buffering (encode all data items into vectors)
  a. Parallelizable
  b. Only the Encoder and prediction data are in GPU memory
(3) Train the Outlayer
  a. Only the Outlayer, batch training vectors, and labels are in GPU memory
This optimized GPU memory and data buffering pipeline allows for efficient training of an auto Encoder-Decoder with separate training steps for each part of the network. The encoded vectors are smaller in size than the raw data, allowing for larger batches in Outlayer training. This approach can reduce the overall training cost of the neural network.
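A sketch of the buffering step is given below, assuming the sentence-transformers API; the cache path and batch size are illustrative.

```python
# Sketch of the vector-buffering step: every text is encoded exactly once with
# the Sentence-BERT encoder and the vectors are cached to disk, so the Outlayer
# can later be trained on large batches without keeping the encoder in GPU
# memory. The cache path and batch size are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

def buffer_encoder_vectors(texts, cache_path="encoded_vectors.npy",
                           model_name="all-mpnet-base-v2", batch_size=256):
    encoder = SentenceTransformer(model_name)   # only the encoder sits on the GPU here
    vectors = encoder.encode(texts, batch_size=batch_size,
                             convert_to_numpy=True, show_progress_bar=True)
    np.save(cache_path, vectors)                # 768-dimensional float32 vectors
    return vectors
```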

5.2. Vector Space Diversification Sampling

The basic idea is to find the ‘center’ point among the training set encoding vectors for each dataset. Then, we select the center point as the root and perform a binary split in each feature dimension. We record the comparison status as 0 or 1, which allows us to obtain a binary representation of an integer. We can then utilize these integers as keys to represent the vector subspace and create branches based on these keys. This process is repeated recursively for each branch.
In this study, a method is proposed for sampling points in a vector space to explore the variety of the feature space. A center-pivot picking algorithm is used to select a representative point of the space and divide the space into smaller subspaces. Distance measures, such as Euclidean distance and cosine similarity, are used to measure the distance between points. To introduce randomness, the rank is merged with each algorithm and the indices are re-sorted. The outputs of different methods are blended to create a series of sample methods. The behavior of the algorithms is visualized using 2D points sampled in a circle. The results show that exploring the first few indices after reranking provides the greatest diversity of the feature space. Although cosine similarity may be a reasonable choice as a distance measure, since Sentence-BERT is designed to work with unit vectors and perform cosine similarity on text pairs, our experiments did not find it to make a significant difference [24]. Nonetheless, our approach still provides a useful method for sampling points in a vector space to explore its variety. Figure 3 shows sampling executed on a 2D unit circle employing a Gaussian distribution of theta values; its subfigures illustrate the effect of different sampling methods, including pure random sampling and picking pivots using the mean or median, on the 2D unit circle.
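A simplified sketch of this ordering is given below; restricting the binary split to a handful of feature dimensions and interleaving the resulting subspaces is our simplification of the tree-level sampling, not the authors' exact implementation.

```python
# Simplified sketch of Vector Space Diversification (VSD) ordering: a center
# pivot (here the per-dimension median) splits the current subspace, each point
# receives a binary key from its per-dimension comparison with the pivot, and
# the branches are visited in an interleaved fashion so that the first few
# indices cover as many distinct subspaces as possible. Restricting the key to
# `max_dims` dimensions is our simplification, not the authors' exact code.
import numpy as np

def vsd_order(vectors: np.ndarray, max_dims: int = 4, min_leaf: int = 2) -> list:
    d = min(max_dims, vectors.shape[1])

    def recurse(indices):
        if len(indices) <= min_leaf:
            return list(indices)
        sub = vectors[indices]
        pivot = np.median(sub, axis=0)                 # center pivot of this subspace
        bits = (sub[:, :d] > pivot[:d]).astype(int)    # binary split per dimension
        keys = bits.dot(1 << np.arange(d))             # bit pattern -> integer key
        branches = {}
        for idx, key in zip(indices, keys):
            branches.setdefault(int(key), []).append(int(idx))
        if len(branches) == 1:                         # cannot split further
            return list(indices)
        ordered = [recurse(np.array(b)) for b in branches.values()]
        merged, depth = [], 0
        while any(depth < len(b) for b in ordered):    # interleave the subspaces
            for b in ordered:
                if depth < len(b):
                    merged.append(b[depth])
            depth += 1
        return merged

    return recurse(np.arange(len(vectors)))

# Usage: label only the first N items of the reordered training set, e.g.
# chosen = vsd_order(np.load("encoded_vectors.npy"))[:50]
```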

5.3. Data and Vector Space: Understanding and Unfamiliarity

Understanding refers to measuring the extent to which a given set of vectors comprehends the properties of the data points or subspaces contained within it. It is measured by assessing how well the system identifies and represents all possible points in the vector space.
Unfamiliarity, on the other hand, refers to evaluating the degree to which a given data point is unfamiliar to a specific subspace within the vector space. This measure can be used to inform AI systems of the level of confusion or disinterest they should feel towards certain data points, based on their level of familiarity with the subspace to which they belong. These new metrics can be described mathematically using the following formulas:
Let V be a vector space, and let B be a set of real or virtual points within that vector space. Let x be a vector that belongs to V, and let D be a distance function (e.g., Euclidean distance, the arccosine of cosine similarity, etc.). Finally, let g be an adjustment function that is positive and monotonically increasing and has a monotonically decreasing derivative.
$$\mathrm{Unfamiliarity}(x, B) = \min_{b \in B} D(x, b)$$
$$\mathrm{Understanding}(B) = \sum_{x \in B} g\left(\mathrm{Unfamiliarity}(x, B \setminus \{x\})\right)$$
$$\mathrm{UnfamiliarityRate}(x, B) = \frac{\mathrm{Unfamiliarity}(x, B)}{\max_{S \subseteq V} \mathrm{Unfamiliarity}(x, S)}$$
$$\mathrm{UnderstandingRate}(B) = \frac{\mathrm{Understanding}(B)}{\max_{S \subseteq V} \mathrm{Understanding}(S)}$$
To ensure stable results and secure float representation, we can use the inverse of percentiles to obtain the rate, although this may introduce additional complexity.
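A small sketch of these metrics is given below, using Euclidean distance for D and g(t) = sqrt(t) as one admissible adjustment function; this particular choice of g is ours.

```python
# Sketch of the Unfamiliarity and Understanding metrics following the
# definitions above, with Euclidean distance as D and g(t) = sqrt(t) as one
# admissible adjustment function (this particular choice of g is ours).
import numpy as np

def unfamiliarity(x: np.ndarray, B: np.ndarray) -> float:
    """Distance from x to its nearest point in the set B."""
    return float(np.min(np.linalg.norm(B - x, axis=1)))

def understanding(B: np.ndarray, g=np.sqrt) -> float:
    """Sum of adjusted unfamiliarities of each point with respect to the rest of B."""
    total = 0.0
    for i in range(len(B)):
        rest = np.delete(B, i, axis=0)   # B \ {x}
        if len(rest) > 0:
            total += float(g(unfamiliarity(B[i], rest)))
    return total
```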

5.4. Results

This study investigated the effectiveness of different data sampling methods on the performance of trained models. The VSD sampling algorithm, which selects items that maximize understanding during the sampling process, was compared to a random sampling method. The experimental results indicate that the model trained with VSD sampling algorithms typically outperforms the random sampling method on small datasets. However, for a large portion of the dataset, there is not much difference since the sample will eventually be the same as the whole training data as the size increases to the maximum size.
The improvement in model performance is more significant in metrics, such as recall, F1, and accuracy, but not in precision score. The experiment demonstrated that VSD sampling leads to a substantial improvement in F1-score for several datasets, including amazonpolar, dbpedia, agnews, and emotion, on small datasets. A 50-item trained model was evaluated using F1-score on a test set. The results of the study are presented in Table 1. F1 (trivial) represents the F1-score of randomly selecting items from each class. F1 (rand) represents the F1-score of the random sampling method. F1 (VSD min), F1 (VSD ave), and F1 (VSD max) are the F1-scores of the VSD sampling algorithm when selecting items with the minimum, average, and maximum understanding, respectively. F1 (VSD min-rand), F1 (VSD ave-rand), and F1 (VSD max-rand) represent the differences between the F1-scores of the VSD sampling algorithm and the random sampling method when selecting items with the minimum, average, and maximum understanding, respectively. The dataset’s F1-score, accuracy, precision, and recall for each sampling approach are illustrated in Figure 4, Figure 5 and Figure 6.
However, the nature of the Sentence-BERT encoding used in the experiment may have limited the performance in some datasets. The black line in the figure represents the trivial F1-score baseline achieved by randomly selecting items from each class. As the size of the dataset increases, the difference between the VSD sampling and random sampling methods becomes less significant. Table 1 shows the enhancements observed across various dataset sample sizes.
Overall, these findings suggest that VSD sampling can improve model performance on small datasets, but its effectiveness may vary depending on the nature of the data and the encoding method used. Therefore, researchers should consider using VSD sampling in conjunction with appropriate encoding techniques to improve model performance.

6. Conclusions

Data diversity is crucial for enhancing the performance of neural network models, and simply increasing the amount of data without considering diversity can be misleading. Traditionally, training datasets have contained redundant data, and researchers have resorted to brute force or AI-generated data to enhance diversity, which can be resource-intensive.
To address this issue, we propose an Encoder-Decoder-Outlayer (EDO) pipeline and a VSD sampling algorithm that leverages a pre-trained Encoder-Decoder framework for feature extraction. Our approach involves using a compact output layer and efficiently exploring the diversity of the encoded feature or hidden layer vector space to prevent overfitting and improve performance, even with limited data.
Experimental results demonstrate that our approach can yield satisfactory results in tasks that previously demanded substantial amounts of data. By employing a pretrained Encoder model for feature extraction and incorporating a small output layer, we can conserve computational resources and reduce human labor. Furthermore, storing the encoding process in a buffer allows for data to be encoded only once, further diminishing computational costs. Future work may involve extending the application of the EDO pipeline and VSD sampling to other tasks and developing a more generalized Encoder-Decoder approach.

Author Contributions

Abstract, F.K. and H.Z.; Introduction, F.K. and H.Z.; Literature Review, F.K. and H.Z.; Data Description, F.K. and H.Z.; Methodology, H.Z. and F.K.; Experiment & Results, H.Z. and F.K.; Conclusions, H.Z. and F.K.; Writing—original draft, F.K. and H.Z.; Writing—review & editing, H.Z. and F.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Acknowledgments

Hongyi Zeng and Fanyi Kong contributed equally to this work and should be considered as co-first authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bražinskas, A.; Nallapati, R.; Bansal, M.; Dreyer, M. Efficient few-shot fine-tuning for opinion summarization. arXiv 2022, arXiv:2205.02170. [Google Scholar]
  2. Dodge, J.; Ilharco, G.; Schwartz, R.; Farhadi, A.; Hajishirzi, H.; Smith, N. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv 2020, arXiv:2002.06305. [Google Scholar]
  3. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  4. Settles, B. Active Learning Literature Survey. Available online: https://www.semanticscholar.org/paper/Active-Learning-Literature-Survey-Settles/818826f356444f3daa3447755bf63f171f39ec47 (accessed on 1 April 2023). [Google Scholar]
  5. Dai, Y.; Yang, C.; Liu, Y.; Yao, Y. Latent-Enhanced Variational Adversarial Active Learning Assisted Soft Sensor. IEEE Sens. J. 2023. [Google Scholar] [CrossRef]
  6. Deng, H.; Yang, K.; Liu, Y.; Zhang, S.; Yao, Y. Actively exploring informative data for smart modeling of industrial multiphase flow processes. IEEE Trans. Ind. Inform. 2020, 17, 8357–8366. [Google Scholar] [CrossRef]
  7. Xie, B.; Yuan, L.; Li, S.; Liu, C.H.; Cheng, X.; Wang, G. Active learning for domain adaptation: An energy-based approach. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 8708–8716. [Google Scholar]
  8. Liu, Y.; Yan, Z.; Tan, J.; Li, Y. Multi-purpose Oriented Single Nighttime Image Haze Removal Based on Unified Variational Retinex Model. In IEEE Transactions on Circuits and Systems for Video Technology; IEEE: Piscataway, NJ, USA, 2022. [Google Scholar]
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. Available online: https://arxiv.org/pdf/1706.03762.pdf (accessed on 2 April 2023).
  10. Floris Jacobs, P.; Maillette de Buy Wenniger, G.; Wiering, M.; Schomaker, L. Active Learning for Reducing Labeling Effort in Text Classification Tasks. arXiv 2021, arXiv:2109.04847. [Google Scholar]
  11. Reimers, N.; Gurevych, I. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv 2019, arXiv:1908.10084. [Google Scholar]
  12. Cer, D.; Yang, Y.; Kong, S.Y.; Hua, N.; Limtiaco, N.; John, R.S.; Constant, N.; Guajardo-Cespedes, M.; Yuan, S.; Tar, C.; et al. Universal sentence encoder. arXiv 2018, arXiv:1803.11175. [Google Scholar]
  13. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  14. Le, Q.; Mikolov, T. Distributed Representations of Sentences and Documents. arXiv 2014, arXiv:1405.4053. [Google Scholar]
  15. Van der Maaten, L.; Hinton, G. Visualizing Data Using T-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  16. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv 2015, arXiv:1502.01852. [Google Scholar]
  17. Dettmers, T.; Lewis, M.; Belkada, Y.; Zettlemoyer, L. Llm.int8(): 8-bit matrix multiplication for transformers at scale. arXiv 2022, arXiv:2208.07339. [Google Scholar]
  18. Dogo, E.M.; Afolabi, O.J.; Nwulu, N.I.; Twala, B.; Aigbavboa, C.O. A Comparative Analysis of Gradient Descent-Based Optimization Algorithms on Convolutional Neural Networks. In Proceedings of the 2018 International Conference on Computational Techniques, Electronics and Mechanical Systems (CTEMS), Belgaum, India, 21–22 December 2018; Niranjan, S.K., Kavitha, C., Kavitha, K.S., Sathish Kumar, T., Eds.; IEEE Bangalore Section; IEEE: Piscataway, NJ, USA, 2018; pp. 92–99. [Google Scholar]
  19. Markowitz, H. Portfolio Selection. J. Financ. 1952, 7, 77–91. [Google Scholar] [CrossRef]
  20. Leung, M.F.; Wang, J. Cardinality-constrained portfolio selection based on collaborative neurodynamic optimization. Neural Netw. 2022, 145, 68–79. [Google Scholar] [CrossRef] [PubMed]
  21. Gu, P.; Tian, S.; Chen, Y. Iterative Learning Control Based on Nesterov Accelerated Gradient Method. IEEE Access 2019, 7, 115836–115842. [Google Scholar] [CrossRef]
  22. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  23. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  24. Bello, A.; Ng, S.-C.; Leung, M.-F. A BERT Framework to Sentiment Analysis of Tweets. Sensors 2023, 23, 506. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Pipeline illustrating the Encoder-Decoder-Outlayer framework.
Figure 2. The architecture of the Outlayer.
Figure 3. Sampling executed on a 2D unit circle employing Gaussian distribution of theta values is demonstrated.
Figure 4. F1-score, accuracy, precision, and recall of the yelp and imdb datasets.
Figure 5. F1-score, accuracy, precision, and recall of the emotion and dbpedia datasets.
Figure 6. F1-score, accuracy, precision, and recall of the amazonpolar and agnews datasets.
Table 1. The enhancements observed across various dataset sample sizes.
Dataset Name | #Items | Percent | F1 (Trivial) | F1 (Rand) | F1 (VSD Min) | F1 (VSD Ave) | F1 (VSD Max) | F1 (VSD Min-Rand) | F1 (VSD Ave-Rand) | F1 (VSD Max-Rand)
agnews | 15 | 0.0019737 | 0.25 | 0.2002 | 0.3228 | 0.5219 | 0.6358 | 0.1226 | 0.3217 | 0.4356
amazonpolar | 15 | 3.75 × 10^-5 | 0.5 | 0.3702 | 0.3425 | 0.5259 | 0.6999 | -0.0277 | 0.1556 | 0.3296
dbpedia | 15 | 0.0002143 | 0.0714 | 0.1766 | 0.2006 | 0.3052 | 0.3899 | 0.024 | 0.1286 | 0.2133
emotion | 15 | 0.0075 | 0.1667 | 0.1115 | 0.2069 | 0.2293 | 0.2613 | 0.0954 | 0.1178 | 0.1498
imdb | 15 | 0.0006 | 0.5 | 0.55 | 0.484 | 0.6614 | 0.7568 | -0.0659 | 0.1114 | 0.2069
yelp | 15 | 0.0003 | 0.2 | 0.1122 | 0.2269 | 0.2529 | 0.3048 | 0.1148 | 0.1408 | 0.1926
agnews | 25 | 0.0032895 | 0.25 | 0.3564 | 0.6411 | 0.7071 | 0.7356 | 0.2847 | 0.3508 | 0.3793
amazonpolar | 25 | 6.25 × 10^-5 | 0.5 | 0.3975 | 0.6152 | 0.6831 | 0.7443 | 0.2178 | 0.2856 | 0.3468
dbpedia | 25 | 0.0003571 | 0.0714 | 0.2002 | 0.364 | 0.4611 | 0.5514 | 0.1638 | 0.2609 | 0.3512
emotion | 25 | 0.0125 | 0.1667 | 0.1675 | 0.2343 | 0.2513 | 0.2805 | 0.0668 | 0.0838 | 0.113
imdb | 25 | 0.001 | 0.5 | 0.7555 | 0.7061 | 0.7439 | 0.7749 | -0.0494 | -0.0116 | 0.0194
yelp | 25 | 0.0005 | 0.2 | 0.1501 | 0.2229 | 0.2901 | 0.3296 | 0.0728 | 0.14 | 0.1795
agnews | 50 | 0.0065789 | 0.25 | 0.3747 | 0.6887 | 0.764 | 0.7933 | 0.314 | 0.3893 | 0.4186
amazonpolar | 50 | 0.000125 | 0.5 | 0.7615 | 0.6986 | 0.7393 | 0.7825 | -0.0629 | -0.0222 | 0.021
dbpedia | 50 | 0.0007143 | 0.0714 | 0.3767 | 0.5342 | 0.6205 | 0.6847 | 0.1574 | 0.2438 | 0.308
emotion | 50 | 0.025 | 0.1667 | 0.234 | 0.2581 | 0.2854 | 0.3171 | 0.0241 | 0.0514 | 0.0831
imdb | 50 | 0.002 | 0.5 | 0.5518 | 0.725 | 0.7491 | 0.7981 | 0.1732 | 0.1973 | 0.2463
yelp | 50 | 0.001 | 0.2 | 0.1907 | 0.3106 | 0.3361 | 0.3589 | 0.1199 | 0.1454 | 0.1682
agnews | 70 | 0.0092105 | 0.25 | 0.686 | 0.7384 | 0.7722 | 0.8027 | 0.0523 | 0.0861 | 0.1167
amazonpolar | 70 | 0.000175 | 0.5 | 0.6215 | 0.7615 | 0.7775 | 0.7926 | 0.14 | 0.156 | 0.1711
dbpedia | 70 | 0.001 | 0.0714 | 0.6304 | 0.7154 | 0.7532 | 0.7797 | 0.0851 | 0.1228 | 0.1494
emotion | 70 | 0.035 | 0.1667 | 0.2284 | 0.2912 | 0.3144 | 0.3532 | 0.0629 | 0.086 | 0.1248
imdb | 70 | 0.0028 | 0.5 | 0.6959 | 0.7356 | 0.7709 | 0.7924 | 0.0396 | 0.0749 | 0.0965
yelp | 70 | 0.0014 | 0.2 | 0.3397 | 0.3512 | 0.3706 | 0.4004 | 0.0115 | 0.0309 | 0.0607
agnews | 100 | 0.0131579 | 0.25 | 0.6515 | 0.7909 | 0.8005 | 0.8194 | 0.1394 | 0.149 | 0.1679
amazonpolar | 100 | 0.00025 | 0.5 | 0.7612 | 0.7697 | 0.7852 | 0.8049 | 0.0085 | 0.024 | 0.0437
dbpedia | 100 | 0.0014286 | 0.0714 | 0.5767 | 0.8067 | 0.8254 | 0.8538 | 0.23 | 0.2487 | 0.2771
emotion | 100 | 0.05 | 0.1667 | 0.3133 | 0.2981 | 0.3223 | 0.3429 | -0.0152 | 0.009 | 0.0296
imdb | 100 | 0.004 | 0.5 | 0.7312 | 0.7404 | 0.7638 | 0.7783 | 0.0092 | 0.0326 | 0.0471
yelp | 100 | 0.002 | 0.2 | 0.3616 | 0.3499 | 0.3778 | 0.3936 | -0.0117 | 0.0162 | 0.032
agnews | 200 | 0.0263158 | 0.25 | 0.8166 | 0.7887 | 0.8047 | 0.8314 | -0.0279 | -0.0118 | 0.0148
amazonpolar | 200 | 0.0005 | 0.5 | 0.7259 | 0.8026 | 0.8162 | 0.8338 | 0.0767 | 0.0904 | 0.108
dbpedia | 200 | 0.0028571 | 0.0714 | 0.77 | 0.8874 | 0.8968 | 0.9134 | 0.1174 | 0.1268 | 0.1434
emotion | 200 | 0.1 | 0.1667 | 0.3619 | 0.3158 | 0.3451 | 0.3659 | -0.0461 | -0.0168 | 0.004
imdb | 200 | 0.008 | 0.5 | 0.8139 | 0.7795 | 0.7864 | 0.7938 | -0.0344 | -0.0275 | -0.0201
yelp | 200 | 0.004 | 0.2 | 0.3788 | 0.3804 | 0.3956 | 0.4087 | 0.0016 | 0.0168 | 0.0299
agnews | 300 | 0.0394737 | 0.25 | 0.8077 | 0.8077 | 0.8257 | 0.8468 | -0.0001 | 0.018 | 0.039
amazonpolar | 300 | 0.00075 | 0.5 | 0.8344 | 0.8216 | 0.8391 | 0.8519 | -0.0128 | 0.0047 | 0.0174
dbpedia | 300 | 0.0042857 | 0.0714 | 0.8926 | 0.9048 | 0.9182 | 0.9269 | 0.0122 | 0.0256 | 0.0343
emotion | 300 | 0.15 | 0.1667 | 0.3807 | 0.3537 | 0.3675 | 0.3876 | -0.027 | -0.0132 | 0.007
imdb | 300 | 0.012 | 0.5 | 0.7767 | 0.7714 | 0.789 | 0.8036 | -0.0053 | 0.0123 | 0.0269
yelp | 300 | 0.006 | 0.2 | 0.3996 | 0.408 | 0.4186 | 0.4282 | 0.0084 | 0.019 | 0.0286
agnews | 500 | 0.0657895 | 0.25 | 0.8327 | 0.8167 | 0.832 | 0.8401 | -0.0159 | -0.0006 | 0.0075
amazonpolar | 500 | 0.00125 | 0.5 | 0.8355 | 0.8282 | 0.8384 | 0.8452 | -0.0074 | 0.0028 | 0.0096
dbpedia | 500 | 0.0071429 | 0.0714 | 0.902 | 0.9266 | 0.9389 | 0.9464 | 0.0246 | 0.0369 | 0.0444
emotion | 500 | 0.25 | 0.1667 | 0.4049 | 0.3811 | 0.3987 | 0.4088 | -0.0238 | -0.0062 | 0.0038
imdb | 500 | 0.02 | 0.5 | 0.8126 | 0.8092 | 0.811 | 0.8173 | -0.0034 | -0.0016 | 0.0047
yelp | 500 | 0.01 | 0.2 | 0.4317 | 0.4299 | 0.4381 | 0.4432 | -0.0017 | 0.0064 | 0.0115
agnews | 700 | 0.0921053 | 0.25 | 0.8477 | 0.8275 | 0.8427 | 0.8534 | -0.0202 | -0.005 | 0.0058
amazonpolar | 700 | 0.00175 | 0.5 | 0.8421 | 0.8373 | 0.8496 | 0.8638 | -0.0048 | 0.0076 | 0.0217
dbpedia | 700 | 0.01 | 0.0714 | 0.9239 | 0.9425 | 0.9484 | 0.9524 | 0.0186 | 0.0245 | 0.0285
emotion | 700 | 0.35 | 0.1667 | 0.464 | 0.4084 | 0.4222 | 0.4388 | -0.0556 | -0.0417 | -0.0251
imdb | 700 | 0.028 | 0.5 | 0.8334 | 0.8153 | 0.8212 | 0.8261 | -0.0182 | -0.0123 | -0.0073
yelp | 700 | 0.014 | 0.2 | 0.4501 | 0.4356 | 0.4446 | 0.4565 | -0.0145 | -0.0055 | 0.0064
agnews | 1000 | 0.1315789 | 0.25 | 0.8626 | 0.8407 | 0.8507 | 0.8592 | -0.0219 | -0.012 | -0.0034
amazonpolar | 1000 | 0.0025 | 0.5 | 0.85 | 0.8519 | 0.8588 | 0.8657 | 0.0019 | 0.0088 | 0.0157
dbpedia | 1000 | 0.0142857 | 0.0714 | 0.9482 | 0.9477 | 0.9545 | 0.962 | -0.0005 | 0.0063 | 0.0138
emotion | 1000 | 0.5 | 0.1667 | 0.4519 | 0.4304 | 0.4469 | 0.4565 | -0.0215 | -0.005 | 0.0046
imdb | 1000 | 0.04 | 0.5 | 0.843 | 0.8197 | 0.8274 | 0.8405 | -0.0233 | -0.0157 | -0.0025
yelp | 1000 | 0.02 | 0.2 | 0.4644 | 0.4447 | 0.4553 | 0.4697 | -0.0197 | -0.0091 | 0.0053
agnews | 2000 | 0.2631579 | 0.25 | 0.8688 | 0.8636 | 0.8668 | 0.8721 | -0.0053 | -0.002 | 0.0033
amazonpolar | 2000 | 0.005 | 0.5 | 0.8737 | 0.8722 | 0.8774 | 0.8839 | -0.0015 | 0.0037 | 0.0102
dbpedia | 2000 | 0.0285714 | 0.0714 | 0.9611 | 0.9623 | 0.965 | 0.9665 | 0.0012 | 0.0039 | 0.0054
emotion | 2000 | 1 | 0.1667 | 0.5049 | 0.478 | 0.4848 | 0.4945 | -0.0269 | -0.0201 | -0.0104
imdb | 2000 | 0.08 | 0.5 | 0.8507 | 0.8386 | 0.8472 | 0.8527 | -0.0121 | -0.0035 | 0.002
yelp | 2000 | 0.04 | 0.2 | 0.4857 | 0.4683 | 0.4776 | 0.4839 | -0.0174 | -0.0082 | -0.0018
agnews | 3000 | 0.3947368 | 0.25 | 0.8711 | 0.8689 | 0.8704 | 0.8721 | -0.0022 | -0.0007 | 0.0011
amazonpolar | 3000 | 0.0075 | 0.5 | 0.8822 | 0.8748 | 0.8822 | 0.8869 | -0.0074 | 0 | 0.0048
dbpedia | 3000 | 0.0428571 | 0.0714 | 0.9672 | 0.967 | 0.9699 | 0.9716 | -0.0001 | 0.0027 | 0.0045
emotion | 3000 | 1.5 | 0.1667 | 0.512 | 0.5063 | 0.5129 | 0.5188 | -0.0057 | 0.001 | 0.0068
imdb | 3000 | 0.12 | 0.5 | 0.839 | 0.8522 | 0.8541 | 0.8578 | 0.0132 | 0.0151 | 0.0187
yelp | 3000 | 0.06 | 0.2 | 0.4875 | 0.4853 | 0.4885 | 0.4931 | -0.0022 | 0.001 | 0.0056
agnews | 5000 | 0.6578947 | 0.25 | 0.8794 | 0.8736 | 0.8785 | 0.8865 | -0.0058 | -0.0009 | 0.0071
amazonpolar | 5000 | 0.0125 | 0.5 | 0.892 | 0.8813 | 0.889 | 0.8942 | -0.0107 | -0.003 | 0.0022
dbpedia | 5000 | 0.0714286 | 0.0714 | 0.9706 | 0.9713 | 0.972 | 0.973 | 0.0007 | 0.0014 | 0.0024
emotion | 5000 | 2.5 | 0.1667 | 0.5372 | 0.5241 | 0.534 | 0.543 | -0.0131 | -0.0031 | 0.0059
imdb | 5000 | 0.2 | 0.5 | 0.8703 | 0.8565 | 0.8642 | 0.8727 | -0.0138 | -0.0061 | 0.0024
yelp | 5000 | 0.1 | 0.2 | 0.504 | 0.4986 | 0.505 | 0.5137 | -0.0055 | 0.001 | 0.0097
agnews | 10,000 | 1.3157895 | 0.25 | 0.8861 | 0.8865 | 0.8884 | 0.8909 | 0.0003 | 0.0023 | 0.0048
amazonpolar | 10,000 | 0.025 | 0.5 | 0.9012 | 0.898 | 0.8999 | 0.9029 | -0.0032 | -0.0013 | 0.0017
dbpedia | 10,000 | 0.1428571 | 0.0714 | 0.9758 | 0.9743 | 0.9752 | 0.9761 | -0.0014 | -0.0006 | 0.0003
emotion | 10,000 | 5 | 0.1667 | 0.5621 | 0.5508 | 0.5567 | 0.5634 | -0.0112 | -0.0053 | 0.0013
imdb | 10,000 | 0.4 | 0.5 | 0.8787 | 0.8694 | 0.8744 | 0.8766 | -0.0093 | -0.0043 | -0.0021
yelp | 10,000 | 0.2 | 0.2 | 0.528 | 0.5232 | 0.5284 | 0.5332 | -0.0047 | 0.0004 | 0.0052
agnews | 16,000 | 2.1052632 | 0.25 | 0.8977 | 0.8927 | 0.895 | 0.8983 | -0.005 | -0.0027 | 0.0006
amazonpolar | 16,000 | 0.04 | 0.5 | 0.9042 | 0.9029 | 0.9057 | 0.9074 | -0.0012 | 0.0016 | 0.0032
dbpedia | 16,000 | 0.2285714 | 0.0714 | 0.9778 | 0.9779 | 0.9781 | 0.9782 | 0.0001 | 0.0003 | 0.0004
emotion | 16,000 | 8 | 0.1667 | 0.5841 | 0.5743 | 0.5844 | 0.5958 | -0.0099 | 0.0003 | 0.0116
imdb | 16,000 | 0.64 | 0.5 | 0.8802 | 0.8808 | 0.8844 | 0.8869 | 0.0006 | 0.0041 | 0.0066
yelp | 16,000 | 0.32 | 0.2 | 0.5412 | 0.5387 | 0.5426 | 0.5463 | -0.0025 | 0.0014 | 0.0051
agnews | 20,000 | 2.6315789 | 0.25 | 0.8944 | 0.8949 | 0.8964 | 0.8986 | 0.0005 | 0.002 | 0.0041
amazonpolar | 20,000 | 0.05 | 0.5 | 0.9098 | 0.9055 | 0.9077 | 0.9107 | -0.0043 | -0.0021 | 0.0009
dbpedia | 20,000 | 0.2857143 | 0.0714 | 0.9791 | 0.9785 | 0.979 | 0.9794 | -0.0005 | -0.0001 | 0.0004
imdb | 20,000 | 0.8 | 0.5 | 0.8844 | 0.8797 | 0.8831 | 0.8846 | -0.0047 | -0.0013 | 0.0002
yelp | 20,000 | 0.4 | 0.2 | 0.5511 | 0.548 | 0.5502 | 0.5556 | -0.0031 | -0.0009 | 0.0044
agnews | 25,000 | 3.2894737 | 0.25 | 0.9004 | 0.8952 | 0.8982 | 0.9005 | -0.0052 | -0.0021 | 0.0002
amazonpolar | 25,000 | 0.0625 | 0.5 | 0.9106 | 0.9092 | 0.9103 | 0.912 | -0.0014 | -0.0003 | 0.0015
dbpedia | 25,000 | 0.3571429 | 0.0714 | 0.9798 | 0.9793 | 0.9798 | 0.9804 | -0.0005 | 0 | 0.0006
imdb | 25,000 | 1 | 0.5 | 0.8863 | 0.8842 | 0.8866 | 0.8879 | -0.0021 | 0.0003 | 0.0016
yelp | 25,000 | 0.5 | 0.2 | 0.5537 | 0.5476 | 0.5539 | 0.5568 | -0.0061 | 0.0002 | 0.003
