Article

Active Learning: Encoder-Decoder-Outlayer and Vector Space Diversification Sampling

1 Department of Computer Science, University of Toronto, Toronto, ON M5S 0A5, Canada
2 Department of Mechanical & Industrial Engineering, Northeastern University, Boston, MA 02115, USA
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2023, 11(13), 2819; https://doi.org/10.3390/math11132819
Submission received: 25 May 2023 / Revised: 19 June 2023 / Accepted: 19 June 2023 / Published: 23 June 2023

Abstract:
This study introduces a training pipeline comprising two components: the Encoder-Decoder-Outlayer framework and the Vector Space Diversification Sampling method. This framework efficiently separates the pre-training and fine-tuning stages, while the sampling method employs pivot nodes to divide the subvector space and selectively choose unlabeled data, thereby reducing the reliance on human labeling. The pipeline offers numerous advantages, including rapid training, parallelization, buffer capability, flexibility, low GPU memory usage, and a sample method with nearly linear time complexity. Experimental results demonstrate that models trained with the proposed sampling algorithm generally outperform those trained with random sampling on small datasets. These characteristics make it a highly efficient and effective training approach for machine learning models. Further details can be found in the project repository on GitHub.

1. Introduction

The outcome of modeling is significantly influenced by labeled datasets, which are usually costly in terms of human effort. For many years, researchers relied on intuition or randomly sampled data points and split train–test data randomly. This paper was inspired by an active learning methodology that leverages neural networks to guide humans in preparing and labeling data, minimizing human effort and improving the overall performance of models.
Researchers in the field of natural language processing (NLP) commonly employ zero-shot, one-shot, and few-shot methods [1] to address issues of limited labeled data. However, these methods have limitations: the model parameters are not updated and customization is limited, which makes it difficult for most large language models (LLMs) to reach industry-level scores. An alternative is the pre-training and fine-tuning approach [2], which, however, requires substantial labeled data and is costly to train. Therefore, this paper proposes an Encoder-Decoder-Outlayer framework that addresses these shortcomings and provides additional benefits.
To address the challenge of adapting pre-trained language models to specific downstream tasks without requiring extensive fine-tuning or re-training of the entire model, there are some approaches similar to the adapter method [3]. However, these approaches, like the adapter, are nested in the large language model (LLM), necessitating the entire LLM to be accommodated in GPU memory for training and prediction. In this context, the utilization of sampling methods will be examined to categorize unlabeled datasets, thereby choosing data that can enhance modeling accuracy. During training and prediction, only the necessary parts need to be stored in GPU memory, and the method supports a large Encoder vector buffer during training.
Active learning [4] can be an effective approach to improve the performance of neural networks and Encoder-Decoder models. To this end, a novel framework has been proposed that combines active learning with the Vector Space Diversification (VSD) sampling technique. The first step in this approach is to train an Encoder-Decoder model and then apply the VSD sampling method to it. This sampling method uses a tree-level sample to efficiently explore the diversity of the Encoder vector data points. Compared with traditional clustering techniques, such as DBSCAN, and dimension reduction methods, such as t-SNE and PCA, the tree-level sample approach is more efficient and has nearly linear time complexity in practice. PCA is an algorithm that finds the principal components of a dataset, which can reduce the number of dimensions while preserving the most important information in the dataset. t-SNE is a technique that maps high-dimensional data points onto a low-dimensional space, producing a map that preserves the similarities between points in the high-dimensional space. The difference between the two methods is that PCA is linear while t-SNE is non-linear; t-SNE is better at preserving local structure and can be used to create more visually appealing maps. By incorporating active learning into the VSD sampling technique, the proposed framework enables the model to iteratively select and label the most informative data points from the remaining unlabeled data, thereby improving the overall performance of the model. This approach is particularly useful for large datasets where manually labeling all the data points is not feasible or practical. The authors of [5] propose a sample selection strategy for active learning to enhance quality prediction performance with limited labeled data; it uses a minimax game with a latent-enhanced variational autoencoder to deceive an adversarial network and Gaussian process regression to incrementally select informative unlabeled samples. The authors of [6] developed an active learning method to explore information from multiphase flow process data, facilitating smart process modeling and prediction; an index is proposed to describe the process dynamics and nonlinearity, and a criterion to judge learning termination is designed. The authors of [7] propose energy-based active domain adaptation (EADA), which queries groups of target data that incorporate both domain characteristics and instance uncertainty; experiments show that EADA surpasses state-of-the-art methods on challenging benchmarks with substantial improvements. The authors of [8] developed a multi-purpose haze removal framework for nighttime hazy images; it uses a nonlinear model based on Retinex theory and a variational Retinex model to estimate a smoothed illumination component and predict the noise map. Experiments show that the proposed framework outperforms well-known nighttime image dehazing methods and can also be applied to other types of degraded images.
The Encoder used in this study is based on Sentence-BERT, which combines Transformer [9] and ResNet architectures. Transformers have become a popular neural network architecture for natural language processing (NLP) tasks, as well as image recognition and other applications. They are particularly useful for classifying unlabeled datasets because of their ability to learn from vast amounts of data and identify patterns and relationships in complex datasets. ResNets, in contrast, are deep convolutional neural networks that have achieved state-of-the-art performance in image classification tasks. They are designed to address the vanishing-gradient problem that can occur in deep neural networks and have been successfully used in various applications, such as object detection and image segmentation.
This study aims to explore the use of sampling methods for classifying unlabeled datasets and selecting data that can improve modeling accuracy. The initial section will present the description of the data. Subsequently, various sampling methodologies will be analyzed, highlighting their benefits and drawbacks, and discussing the requisite methodologies and tools for their implementation. The experiment’s specifics and outcomes will be presented, followed by real-world illustrations demonstrating the applicability of these techniques in resolving intricate classification challenges. The following are the contributions of this work:
- Proposal for an Encoder-Decoder-Outlayer (EDO) active learning method for text classification;
- Exploration of the applicability of EDO, demonstrating its effectiveness in addressing issues of limited labeled data;
- Exploration of the utilization of different models and techniques, such as BERTbase, S-BERT, Universal Sentence Encoder, Word2Vec, and Document2Vec, to optimize datasets for deep learning;
- Proposal for the use of t-SNE for dimension reduction and comparison of sentence vectors.

2. Literature Review

The optimization of datasets is a crucial part of deep learning and has been a critical research field for many researchers. This section reviews and compares related studies on classifying (clustering) datasets and on how data are selected. There has been a limited amount of research on active learning (AL) in the context of text classification, especially with regard to the latest, cutting-edge natural language processing (NLP) models. The work in [10] involved an empirical analysis that evaluated various uncertainty-based algorithms using BERTbase as the classifier. To compare different strategies for obtaining sentence vectors, ref. [11] defines three objective functions for training and optimizing different tasks: a Classification Objective Function, a Regression Objective Function, and a Triplet Objective Function. All-mpnet-base-v2 is based on S-BERT; this framework is used to generate sentence or text embeddings, which can be compared to find sentences with similar meanings. In addition to S-BERT, the Universal Sentence Encoder [12], Word2Vec [13], and Document2Vec [14] are all viable options.
For dimension reduction, ref. [15] presents t-SNE, a technique for visualizing high-dimensional data by giving each data point a location on a two- or three-dimensional map. The information contained in high-dimensional vectors is preserved after they are transformed into low-dimensional vectors. The basic idea is that two vectors that are similar in the high-dimensional space should remain close to each other after being reduced to a low dimension.
The authors of [1] propose fine-tuning pre-trained models on small datasets with adapters that store in-domain knowledge and that are pre-trained in a task-specific way on a large corpus of unannotated customer reviews, using held-out reviews as pseudo summaries; this improves summary quality over standard fine-tuning and allows for summary personalization through aspect keyword queries. The authors of [2] examined the brittleness of fine-tuning pre-trained contextual word embedding models for natural language processing tasks by experimenting with four datasets from the GLUE benchmark and varying random seeds. They found substantial performance increases, quantified how the performance of the best-found model varies with the number of fine-tuning trials, and explored factors influenced by the choice of random seed, such as weight initialization and training data order.
The Encoder-Decoder architecture is commonly used in NLP tasks, such as machine translation and text summarization. The Encoder takes an input sequence, such as a sentence in one language, and transforms it into a fixed-dimensional vector representation. The Decoder then takes this representation as input and generates an output sequence, such as a translated sentence in another language. The work in [9] brings up a Transformer based on this Encoder-Decoder architecture.
For deeper neural network training, ref. [16] presents ResNet to ease the training of networks that are substantially deeper than those used previously. The learned representations also generalize well to other recognition tasks, although overfitting may worsen results, and combining the approach with stronger regularization may improve them. The authors of [9] propose the Transformer, a network architecture based solely on attention mechanisms, which outperforms complex recurrent or convolutional neural networks with Encoder-Decoder attention mechanisms in machine translation tasks, achieving state-of-the-art BLEU scores with significantly less training time and cost and generalizing well to other tasks. The residual learning framework introduced in [16] won first place in the ILSVRC 2015 classification task and improved performance on the COCO object detection dataset; it eases the training of substantially deeper neural networks and achieves higher accuracy.
The authors of [17] developed a procedure for Int8 matrix multiplication in Transformer that reduces the memory needed for inference by half while retaining full precision performance by using vector-wise quantization and a mixed-precision decomposition scheme to cope with highly systematic emergent features in language models, enabling up to 175B-parameter LLMs to be used without any performance degradation.
For image classification, the authors of [16] present the use of parametric rectified linear units (PReLU) and a robust initialization method in training extremely deep rectified neural networks for image classification, achieving a 4.94% top-5 test error on the ImageNet 2012 classification dataset, surpassing human-level performance for the first time. In 2018, the work in [18] involved comparing the performance of seven commonly used stochastic-gradient-based optimization techniques in a convolutional neural network (ConvNet), and Nadam achieved the best performance.

3. Data Description

The IMDB dataset contains highly polar movie reviews. The Amazon_polarity dataset contains product reviews from Amazon. Each sample from these two datasets is annotated with a label: 0 (negative) or 1 (positive). The Ag_news dataset is a collection of news articles gathered by ComeToMyHead; each sample is labeled according to its category: World (0), Sports (1), Business (2), and Sci/Tech (3). The Emotion dataset contains Twitter messages classified by emotion: sadness (0), joy (1), love (2), anger (3), fear (4), and surprise (5). The DBpedia dataset is constructed from 14 different classes in DBpedia, and each sample is annotated with its class. The YelpReviewFull dataset contains reviews from Yelp, each annotated with a label from 0 to 4 corresponding to the score associated with the review.
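The datasets above are publicly available; the sketch below shows one way they could be loaded, assuming the Hugging Face datasets library and the hub identifiers listed in the code (these identifiers are our assumption and are not stated in the paper).

```python
# Minimal sketch of loading the six benchmark datasets with the Hugging Face
# `datasets` library. The hub identifiers below are assumptions and may differ
# from the exact copies used by the authors.
from datasets import load_dataset

DATASETS = {
    "imdb": "imdb",                    # binary sentiment (0/1)
    "amazonpolar": "amazon_polarity",  # binary sentiment (0/1)
    "agnews": "ag_news",               # 4 news categories
    "emotion": "emotion",              # 6 emotions
    "dbpedia": "dbpedia_14",           # 14 ontology classes
    "yelp": "yelp_review_full",        # 5 review scores
}

def load_text_and_labels(name: str):
    """Return (train_texts, train_labels, test_texts, test_labels) for one dataset."""
    ds = load_dataset(DATASETS[name])
    # Some datasets store the text in a "content" column instead of "text".
    text_col = "content" if "content" in ds["train"].column_names else "text"
    train, test = ds["train"], ds["test"]
    return train[text_col], train["label"], test[text_col], test["label"]
```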

4. Methodology

The approach described draws inspiration from the concept of diversified investments in the realm of financial investment. In traditional financial investment strategies, the fundamental idea is to construct a portfolio by combining a set of unrelated assets. By diversifying the portfolio, investors aim to reduce the overall risk while potentially increasing the return on investment [19,20]. Similarly, the approach being discussed adopts a similar principle of diversification to tackle a different kind of risk in the context of machine learning models.
To reduce variance and enhance the performance of the model, the approach utilizes a collection of smaller models instead of relying on a single large model. This ensemble of models is designed to work in tandem within an Encoder-Decoder framework. The Encoder part of the framework maps real data to encoded vectors, while the Decoder part allows for sampling on these encoded vectors. By performing sampling, the approach introduces an ordered structure to the dataset, where the first few data points provide the greatest amount of diversity.
The encoded vector spaces generated by the model maintain a unique property: similar original data points will have a smaller distance between each other in the vector space. This property facilitates the effective organization and representation of the data, enabling the model to capture important patterns and relationships more efficiently.
In the process of training the model, a subset of the dataset is selected for manual labeling, as opposed to labeling the entire dataset. This strategic approach minimizes the resources required for manual labeling while still obtaining valuable labeled data. The labeled subset is then used to train a simple Outlayer model. This model takes the encoded vectors as input and produces human labels as output. By training on this subset, the model can learn to generalize and predict labels for the remaining unlabeled data.
For building the training model, the approach adopts the Nadam optimizer. Nadam, a combination of Nesterov accelerated gradient descent [21] and Adam optimization algorithm [22], offers distinct advantages. It provides greater control over the learning rate and directly influences the gradient update, resulting in improved convergence and potentially faster training times.
By incorporating these strategies and techniques inspired by diversified financial investments, the approach aims to mitigate risk and enhance performance in the realm of machine learning. The use of smaller models, the organization of data through encoded vectors, and the selective labeling process all contribute to a more robust and efficient learning framework. Additionally, the choice of the Nadam optimizer further optimizes the training process, ultimately leading to better outcomes in terms of accuracy and generalization.
One advantage of this approach is the separation of feature extraction and output, which helps reduce the GPU RAM usage. During the pre-training stage, only raw data are required, and the Encoder-Decoder model is stored in the GPU RAM. During the encoded vector buffering step, only the Encoder is kept in the GPU RAM. Similarly, during the sampling stage, only the sample algorithm is running. When training the Outlayer, only the simple Outlayer and batch data are stored in the RAM.
By breaking down the large prediction model into smaller parts and executing the process step-by-step, the Outlayer model can accommodate more encoded vectors and process a larger batch of items within a fixed GPU RAM capacity. This partitioning of tasks and resource allocation allows for more efficient memory management during the different stages of the approach. It ensures that only the necessary components are stored in the GPU RAM at any given time, freeing up space for other operations. The advantage of this approach becomes particularly evident when dealing with large datasets or when working with limited GPU resources. By carefully managing the GPU RAM usage, the approach enables the model to handle a greater number of encoded vectors and process larger batches of items, without exceeding the memory constraints. This scalability and flexibility contribute to the overall effectiveness and practicality of the approach, making it suitable for a wide range of applications.

Basic Framework

The model utilized in this study consists of three primary components: Encoder, Decoder, and Outlayer. The Encoder component is responsible for transforming the data into feature vectors, which are then subjected to Vector Space Diversification sampling. This sampling process reorganizes the dataset, and the first N samples are selected for training the model. Figure 1 depicts the Encoder-Decoder-Outlayer framework, which comprises an Encoder, a Decoder, and an Outlayer.
During training, the chosen loss function is cross-entropy loss with weight. This loss function helps measure the discrepancy between the predicted outputs and the actual labels, taking into account the importance assigned to each class. Additionally, F1-score guidance is employed as a trigger mechanism. If the F1-score decreases below the previous score, specific actions are initiated to address and rectify the issue. Overall, the model’s architecture and training process aim to effectively encode the data, generate diverse samples through vector space diversification, and train the model using the selected samples. The use of cross-entropy loss with weight assists in optimizing the model’s performance, while the F1-score guidance helps monitor and manage the training progress, ensuring that the model maintains or improves its performance throughout the training process.

5. Experimental Results

5.1. Settings

The Encoder used in this study was the “all-mpnet-base-v2” Sentence-BERT model with 768 features. It was employed to transform both the train and test datasets into vectors. Figure 2 illustrates the architecture of the Outlayer, which consists of a three-layer ResNet framework with PReLU activation, batch normalization, a linear layer, and a hidden layer size set to twice the cluster number.
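The sketch below gives a minimal PyTorch rendition of this Outlayer under the stated configuration; the exact wiring of the residual blocks and the input projection are our assumptions.

```python
# Minimal PyTorch sketch of the Outlayer: a small residual MLP over the
# 768-dimensional Sentence-BERT vectors with three residual blocks, PReLU
# activation, batch normalization before each linear layer, and a hidden size
# of twice the number of classes. The exact block wiring and input projection
# are our assumptions.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.BatchNorm1d(dim)          # batch norm before the linear layer
        self.linear = nn.Linear(dim, dim)
        self.act = nn.PReLU()

    def forward(self, x):
        return x + self.act(self.linear(self.norm(x)))   # skip connection

class Outlayer(nn.Module):
    def __init__(self, in_dim: int = 768, num_classes: int = 4):
        super().__init__()
        hidden = 2 * num_classes                 # hidden size = twice the cluster number
        self.proj = nn.Linear(in_dim, hidden)    # map encoder vectors to the hidden width
        self.blocks = nn.Sequential(*[ResidualBlock(hidden) for _ in range(3)])
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):
        return self.head(self.blocks(self.proj(x)))

model = Outlayer(num_classes=4)
optimizer = torch.optim.NAdam(model.parameters(), lr=0.1)   # Nadam, initial lr 0.1
```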
The study employed the Nadam optimizer with an initial learning rate of 0.1 to train the three-layer ResNet framework. The activation function was PReLU, and a batch normalization [23] layer was applied before the linear layer. The hidden layer size was set to twice the cluster number, and cross-entropy loss with weight was used. The weight was determined using the following formula:
$$W_c = \frac{\left(1 + \sum_{i=1}^{N} \mathbf{1}[y_i = c]\right)^{-1}}{\sum_{c'=1}^{K} \left(1 + \sum_{i=1}^{N} \mathbf{1}[y_i = c']\right)^{-1}}$$
where N is the number of labeled training samples, K is the number of classes, and $\mathbf{1}[\cdot]$ is the indicator function.
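In other words, each class weight is the inverse of one plus the class count, normalized over the K classes. A small sketch of this computation in PyTorch, feeding a weighted cross-entropy loss, is shown below.

```python
# Sketch of the class-weight computation: the weight of class c is the inverse
# of (1 + its sample count), normalized over the K classes, and is passed to a
# weighted cross-entropy loss.
import torch

def class_weights(labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    counts = torch.bincount(labels, minlength=num_classes).float()
    inv = 1.0 / (1.0 + counts)   # (1 + sum_i 1[y_i = c])^(-1)
    return inv / inv.sum()       # normalize over the K classes

labels = torch.tensor([0, 0, 1, 2, 2, 2])
weights = class_weights(labels, num_classes=3)
criterion = torch.nn.CrossEntropyLoss(weight=weights)
```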
F1-score guidance was used: it was triggered whenever the F1-score decreased below the previous score. When this occurred, the learning rate was reduced by half and the forgiveness count was decreased by one. The forgiveness count was initialized at 12, and when it reached zero, training stopped. The F1-score threshold for early stopping was set at 0.995.
The use of F1-score guidance and early stopping eliminated the need for a validation set, as no other models were compared. During training, the model was only saved if the loss was not NaN and the F1-score had improved. The study was conducted on an NVIDIA GeForce RTX 3080 Ti GPU.
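The guidance loop can be sketched as follows; the helper functions are illustrative placeholders rather than functions from the released code.

```python
# Schematic sketch of the F1-score guidance described above. The helpers
# `train_one_epoch` and `compute_f1` are illustrative placeholders supplied by
# the caller, not functions from the released code.
import copy
import math

def train_with_f1_guidance(model, optimizer, train_one_epoch, compute_f1,
                           forgiveness=12, f1_threshold=0.995):
    prev_f1, best_state = 0.0, None
    while forgiveness > 0:
        loss = train_one_epoch(model, optimizer)   # one pass over the labeled subset
        f1 = compute_f1(model)
        if f1 < prev_f1:                           # F1 dropped below the previous score
            for group in optimizer.param_groups:
                group["lr"] *= 0.5                 # halve the learning rate
            forgiveness -= 1                       # spend one unit of forgiveness
        elif f1 > prev_f1 and not math.isnan(loss):
            best_state = copy.deepcopy(model.state_dict())  # save only on improvement
        prev_f1 = f1
        if f1 >= f1_threshold:                     # early-stop threshold
            break
    return best_state
```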
GPU memory optimization and data buffering:
(1) Train an auto Encoder-Decoder (omitted; borrowed from Sentence-BERT)
  a. Only the Encoder, Decoder, and training data are in GPU memory
(2) Encoder vector data buffering (encode all data items into vectors)
  a. Parallelizable
  b. Only the Encoder and prediction data are in GPU memory
(3) Train the Outlayer
  a. Only the Outlayer, batch training vectors, and labels are in GPU memory
This optimized GPU memory and data buffering pipeline allows for efficient training of an auto Encoder-Decoder with separate training steps for each part of the network. The encoded vectors are smaller in size than the raw data, allowing for larger batches in Outlayer training. This approach can reduce the overall training cost of the neural network.
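A sketch of the buffering step is given below, assuming the sentence-transformers API; the cache path and batch size are illustrative.

```python
# Sketch of the vector-buffering step: every text is encoded exactly once with
# the Sentence-BERT encoder and the vectors are cached to disk, so the Outlayer
# can later be trained on large batches without keeping the encoder in GPU
# memory. The cache path and batch size are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

def buffer_encoder_vectors(texts, cache_path="encoded_vectors.npy",
                           model_name="all-mpnet-base-v2", batch_size=256):
    encoder = SentenceTransformer(model_name)   # only the encoder sits on the GPU here
    vectors = encoder.encode(texts, batch_size=batch_size,
                             convert_to_numpy=True, show_progress_bar=True)
    np.save(cache_path, vectors)                # 768-dimensional float32 vectors
    return vectors
```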

5.2. Vector Space Diversification Sampling

The basic idea is to find the ‘center’ point among the training set encoding vectors for each dataset. Then, we select the center point as the root and perform a binary split in each feature dimension. We record the comparison status as 0 or 1, which allows us to obtain a binary representation of an integer. We can then utilize these integers as keys to represent the vector subspace and create branches based on these keys. This process is repeated recursively for each branch.
In this study, a method is proposed for sampling points in a vector space to explore the variety of the feature space. A center-pivot picking algorithm is used to select a representative point of the space and divide the space into smaller subspaces. Distance measures, such as Euclidean distance and cosine similarity, are used to measure the distance between points. To introduce randomness, the rank is merged with each algorithm and the indices are re-sorted. The outputs of different methods are blended to create a series of sample methods. The behavior of the algorithms is visualized using 2D points sampled in a circle. The results show that exploring the first few indices after reranking provides the greatest diversity of the feature space. Although cosine similarity may be a reasonable choice as a distance measure, since Sentence-BERT is designed to work with unit vectors and perform cosine similarity on text pairs, our experiments did not find it to make a significant difference [24]. Nonetheless, our approach still provides a useful method for sampling points in a vector space to explore its variety. Figure 3 shows sampling executed on a 2D unit circle employing a Gaussian distribution of theta values; its subfigures illustrate the effect of different sampling methods, including pure random sampling and picking pivots using the mean or median, on the 2D unit circle.
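A simplified sketch of this ordering is given below; restricting the binary split to a handful of feature dimensions and interleaving the resulting subspaces is our simplification of the tree-level sampling, not the authors' exact implementation.

```python
# Simplified sketch of Vector Space Diversification (VSD) ordering: a center
# pivot (here the per-dimension median) splits the current subspace, each point
# receives a binary key from its per-dimension comparison with the pivot, and
# the branches are visited in an interleaved fashion so that the first few
# indices cover as many distinct subspaces as possible. Restricting the key to
# `max_dims` dimensions is our simplification, not the authors' exact code.
import numpy as np

def vsd_order(vectors: np.ndarray, max_dims: int = 4, min_leaf: int = 2) -> list:
    d = min(max_dims, vectors.shape[1])

    def recurse(indices):
        if len(indices) <= min_leaf:
            return list(indices)
        sub = vectors[indices]
        pivot = np.median(sub, axis=0)                 # center pivot of this subspace
        bits = (sub[:, :d] > pivot[:d]).astype(int)    # binary split per dimension
        keys = bits.dot(1 << np.arange(d))             # bit pattern -> integer key
        branches = {}
        for idx, key in zip(indices, keys):
            branches.setdefault(int(key), []).append(int(idx))
        if len(branches) == 1:                         # cannot split further
            return list(indices)
        ordered = [recurse(np.array(b)) for b in branches.values()]
        merged, depth = [], 0
        while any(depth < len(b) for b in ordered):    # interleave the subspaces
            for b in ordered:
                if depth < len(b):
                    merged.append(b[depth])
            depth += 1
        return merged

    return recurse(np.arange(len(vectors)))

# Usage: label only the first N items of the reordered training set, e.g.
# chosen = vsd_order(np.load("encoded_vectors.npy"))[:50]
```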

5.3. Data and Vector Space: Understanding and Unfamiliarity

Understanding refers to measuring the extent to which a given set of vectors comprehends the properties of the data points or subspaces contained within it. It is measured by assessing how well the system identifies and represents all possible points in the vector space.
Unfamiliarity, on the other hand, refers to evaluating the degree to which a given data point is unfamiliar to a specific subspace within the vector space. This measure can be used to inform AI systems of the level of confusion or disinterest they should feel towards certain data points, based on their level of familiarity with the subspace to which they belong. These new metrics can be described mathematically using the following formulas:
Let V be a vector space, and let B be a set of real or virtual points within that vector space. Let x be a vector that belongs to V, and let D be a distance function (e.g., Euclidean distance, the arccosine of cosine similarity, etc.). Finally, let g be an adjustment function that is positive and monotonically increasing and has a monotonically decreasing derivative.
$$\mathrm{Unfamiliarity}(x, B) = \min_{b \in B} D(x, b)$$
$$\mathrm{Understanding}(B) = \sum_{x \in B} g\left(\mathrm{Unfamiliarity}(x, B \setminus \{x\})\right)$$
$$\mathrm{UnfamiliarityRate}(x, B) = \frac{\mathrm{Unfamiliarity}(x, B)}{\max_{S \subseteq V} \mathrm{Unfamiliarity}(x, S)}$$
$$\mathrm{UnderstandingRate}(B) = \frac{\mathrm{Understanding}(B)}{\max_{S \subseteq V} \mathrm{Understanding}(S)}$$
To ensure stable results and secure float representation, we can use the inverse of percentiles to obtain the rate, although this may introduce additional complexity.
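A small sketch of these metrics is given below, using Euclidean distance for D and g(t) = sqrt(t) as one admissible adjustment function; this particular choice of g is ours.

```python
# Sketch of the Unfamiliarity and Understanding metrics following the
# definitions above, with Euclidean distance as D and g(t) = sqrt(t) as one
# admissible adjustment function (this particular choice of g is ours).
import numpy as np

def unfamiliarity(x: np.ndarray, B: np.ndarray) -> float:
    """Distance from x to its nearest point in the set B."""
    return float(np.min(np.linalg.norm(B - x, axis=1)))

def understanding(B: np.ndarray, g=np.sqrt) -> float:
    """Sum of adjusted unfamiliarities of each point with respect to the rest of B."""
    total = 0.0
    for i in range(len(B)):
        rest = np.delete(B, i, axis=0)   # B \ {x}
        if len(rest) > 0:
            total += float(g(unfamiliarity(B[i], rest)))
    return total
```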

5.4. Results

This study investigated the effectiveness of different data sampling methods on the performance of trained models. The VSD sampling algorithm, which selects items that maximize understanding during the sampling process, was compared to a random sampling method. The experimental results indicate that the model trained with VSD sampling algorithms typically outperforms the random sampling method on small datasets. However, for a large portion of the dataset, there is not much difference since the sample will eventually be the same as the whole training data as the size increases to the maximum size.
The improvement in model performance is more significant in metrics, such as recall, F1, and accuracy, but not in precision score. The experiment demonstrated that VSD sampling leads to a substantial improvement in F1-score for several datasets, including amazonpolar, dbpedia, agnews, and emotion, on small datasets. A 50-item trained model was evaluated using F1-score on a test set. The results of the study are presented in Table 1. F1 (trivial) represents the F1-score of randomly selecting items from each class. F1 (rand) represents the F1-score of the random sampling method. F1 (VSD min), F1 (VSD ave), and F1 (VSD max) are the F1-scores of the VSD sampling algorithm when selecting items with the minimum, average, and maximum understanding, respectively. F1 (VSD min-rand), F1 (VSD ave-rand), and F1 (VSD max-rand) represent the differences between the F1-scores of the VSD sampling algorithm and the random sampling method when selecting items with the minimum, average, and maximum understanding, respectively. The dataset’s F1-score, accuracy, precision, and recall for each sampling approach are illustrated in Figure 4, Figure 5 and Figure 6.
However, the nature of the Sentence-BERT encoding used in the experiment may have limited the performance in some datasets. The black line in the figure represents the trivial F1-score baseline achieved by randomly selecting items from each class. As the size of the dataset increases, the difference between the VSD sampling and random sampling methods becomes less significant. Table 1 shows the enhancements observed across various dataset sample sizes.
Overall, these findings suggest that VSD sampling can improve model performance on small datasets, but its effectiveness may vary depending on the nature of the data and the encoding method used. Therefore, researchers should consider using VSD sampling in conjunction with appropriate encoding techniques to improve model performance.

6. Conclusions

Data diversity is crucial for enhancing the performance of neural network models, and simply increasing the amount of data without considering diversity can be misleading. Traditionally, training datasets have contained redundant data, and researchers have resorted to brute force or AI-generated data to enhance diversity, which can be resource-intensive.
To address this issue, we propose an Encoder-Decoder-Outlayer (EDO) pipeline and a VSD sampling algorithm that leverages a pre-trained Encoder-Decoder framework for feature extraction. Our approach involves using a compact output layer and efficiently exploring the diversity of the encoded feature or hidden layer vector space to prevent overfitting and improve performance, even with limited data.
Experimental results demonstrate that our approach can yield satisfactory results in tasks that previously demanded substantial amounts of data. By employing a pretrained Encoder model for feature extraction and incorporating a small output layer, we can conserve computational resources and reduce human labor. Furthermore, storing the encoding process in a buffer allows for data to be encoded only once, further diminishing computational costs. Future work may involve extending the application of the EDO pipeline and VSD sampling to other tasks and developing a more generalized Encoder-Decoder approach.

Author Contributions

Abstract, F.K. and H.Z.; Introduction, F.K. and H.Z.; Literature Review, F.K. and H.Z.; Data Description, F.K. and H.Z.; Methodology, H.Z. and F.K.; Experiment & Results, H.Z. and F.K.; Conclusions, H.Z. and F.K.; Writing—original draft, F.K. and H.Z.; Writing—review & editing, H.Z. and F.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Acknowledgments

Hongyi Zeng and Fanyi Kong contributed equally to this work and should be considered as co-first authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bražinskas, A.; Nallapati, R.; Bansal, M.; Dreyer, M. Efficient few-shot fine-tuning for opinion summarization. arXiv 2022, arXiv:2205.02170. [Google Scholar]
  2. Dodge, J.; Ilharco, G.; Schwartz, R.; Farhadi, A.; Hajishirzi, H.; Smith, N. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv 2020, arXiv:2002.06305. [Google Scholar]
  3. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  4. Settles, B. Active Learning Literature Survey. Available online: https://www.semanticscholar.org/paper/Active-Learning-Literature-Survey-Settles/818826f356444f3daa3447755bf63f171f39ec47 (accessed on 1 April 2023). [Google Scholar]
  5. Dai, Y.; Yang, C.; Liu, Y.; Yao, Y. Latent-Enhanced Variational Adversarial Active Learning Assisted Soft Sensor. IEEE Sens. J. 2023. [Google Scholar] [CrossRef]
  6. Deng, H.; Yang, K.; Liu, Y.; Zhang, S.; Yao, Y. Actively exploring informative data for smart modeling of industrial multiphase flow processes. IEEE Trans. Ind. Inform. 2020, 17, 8357–8366. [Google Scholar] [CrossRef]
  7. Xie, B.; Yuan, L.; Li, S.; Liu, C.H.; Cheng, X.; Wang, G. Active learning for domain adaptation: An energy-based approach. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 8708–8716. [Google Scholar]
  8. Liu, Y.; Yan, Z.; Tan, J.; Li, Y. Multi-purpose Oriented Single Nighttime Image Haze Removal Based on Unified Variational Retinex Model. In IEEE Transactions on Circuits and Systems for Video Technology; IEEE: Piscataway, NJ, USA, 2022. [Google Scholar]
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. Available online: https://arxiv.org/pdf/1706.03762.pdf (accessed on 2 April 2023).
  10. Floris Jacobs, P.; Maillette de Buy Wenniger, G.; Wiering, M.; Schomaker, L. Active Learning for Reducing Labeling Effort in Text Classification Tasks. arXiv 2021, arXiv:2109.04847. [Google Scholar]
  11. Reimers, N.; Gurevych, I. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv 2019, arXiv:1908.10084. [Google Scholar]
  12. Cer, D.; Yang, Y.; Kong, S.Y.; Hua, N.; Limtiaco, N.; John, R.S.; Constant, N.; Guajardo-Cespedes, M.; Yuan, S.; Tar, C.; et al. Universal sentence encoder. arXiv 2018, arXiv:1803.11175. [Google Scholar]
  13. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  14. Le, Q.; Mikolov, T. Distributed Representations of Sentences and Documents. arXiv 2014, arXiv:1405.4053. [Google Scholar]
  15. Van der Maaten, L.; Hinton, G. Visualizing Data Using T-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  16. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv 2015, arXiv:1502.01852. [Google Scholar]
  17. Dettmers, T.; Lewis, M.; Belkada, Y.; Zettlemoyer, L. Llm.int8(): 8-bit matrix multiplication for transformers at scale. arXiv 2022, arXiv:2208.07339. [Google Scholar]
  18. Dogo, E.M.; Afolabi, O.J.; Nwulu, N.I.; Twala, B.; Aigbavboa, C.O. A Comparative Analysis of Gradient Descent-Based Optimization Algorithms on Convolutional Neural Networks. In Proceedings of the 2018 International Conference on Computational Techniques, Electronics and Mechanical Systems (CTEMS), Belgaum, India, 21–22 December 2018; Niranjan, S.K., Kavitha, C., Kavitha, K.S., Sathish Kumar, T., Eds.; IEEE Bangalore Section; IEEE: Piscataway, NJ, USA, 2018; pp. 92–99. [Google Scholar]
  19. Markowitz, H. Portfolio Selection. J. Financ. 1952, 7, 77–91. [Google Scholar] [CrossRef]
  20. Leung, M.F.; Wang, J. Cardinality-constrained portfolio selection based on collaborative neurodynamic optimization. Neural Netw. 2022, 145, 68–79. [Google Scholar] [CrossRef] [PubMed]
  21. Gu, P.; Tian, S.; Chen, Y. Iterative Learning Control Based on Nesterov Accelerated Gradient Method. IEEE Access 2019, 7, 115836–115842. [Google Scholar] [CrossRef]
  22. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  23. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  24. Bello, A.; Ng, S.-C.; Leung, M.-F. A BERT Framework to Sentiment Analysis of Tweets. Sensors 2023, 23, 506. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Pipeline illustrating the Encoder-Decoder-Outlayer framework.
Figure 2. The architecture of the Outlayer.
Figure 3. Sampling executed on a 2D unit circle employing Gaussian distribution of theta values is demonstrated.
Figure 4. F1-score, accuracy, precision, and recall of the yelp and imdb datasets.
Figure 5. F1-score, accuracy, precision, and recall of the emotion and dbpedia datasets.
Figure 6. F1-score, accuracy, precision, and recall of the amazonpolar and agnews datasets.
Table 1. The enhancements observed across various dataset sample sizes.
Dataset Name | #Items | Percent | F1 (Trivial) | F1 (Rand) | F1 (VSD Min) | F1 (VSD Ave) | F1 (VSD Max) | F1 (VSD Min-Rand) | F1 (VSD Ave-Rand) | F1 (VSD Max-Rand)
agnews | 15 | 0.0019737 | 0.25 | 0.2002 | 0.3228 | 0.5219 | 0.6358 | 0.1226 | 0.3217 | 0.4356
amazonpolar | 15 | 3.75 × 10^-5 | 0.5 | 0.3702 | 0.3425 | 0.5259 | 0.6999 | -0.0277 | 0.1556 | 0.3296
dbpedia | 15 | 0.0002143 | 0.0714 | 0.1766 | 0.2006 | 0.3052 | 0.3899 | 0.024 | 0.1286 | 0.2133
emotion | 15 | 0.0075 | 0.1667 | 0.1115 | 0.2069 | 0.2293 | 0.2613 | 0.0954 | 0.1178 | 0.1498
imdb | 15 | 0.0006 | 0.5 | 0.55 | 0.484 | 0.6614 | 0.7568 | -0.0659 | 0.1114 | 0.2069
yelp | 15 | 0.0003 | 0.2 | 0.1122 | 0.2269 | 0.2529 | 0.3048 | 0.1148 | 0.1408 | 0.1926
agnews | 25 | 0.0032895 | 0.25 | 0.3564 | 0.6411 | 0.7071 | 0.7356 | 0.2847 | 0.3508 | 0.3793
amazonpolar | 25 | 6.25 × 10^-5 | 0.5 | 0.3975 | 0.6152 | 0.6831 | 0.7443 | 0.2178 | 0.2856 | 0.3468
dbpedia | 25 | 0.0003571 | 0.0714 | 0.2002 | 0.364 | 0.4611 | 0.5514 | 0.1638 | 0.2609 | 0.3512
emotion | 25 | 0.0125 | 0.1667 | 0.1675 | 0.2343 | 0.2513 | 0.2805 | 0.0668 | 0.0838 | 0.113
imdb | 25 | 0.001 | 0.5 | 0.7555 | 0.7061 | 0.7439 | 0.7749 | -0.0494 | -0.0116 | 0.0194
yelp | 25 | 0.0005 | 0.2 | 0.1501 | 0.2229 | 0.2901 | 0.3296 | 0.0728 | 0.14 | 0.1795
agnews | 50 | 0.0065789 | 0.25 | 0.3747 | 0.6887 | 0.764 | 0.7933 | 0.314 | 0.3893 | 0.4186
amazonpolar | 50 | 0.000125 | 0.5 | 0.7615 | 0.6986 | 0.7393 | 0.7825 | -0.0629 | -0.0222 | 0.021
dbpedia | 50 | 0.0007143 | 0.0714 | 0.3767 | 0.5342 | 0.6205 | 0.6847 | 0.1574 | 0.2438 | 0.308
emotion | 50 | 0.025 | 0.1667 | 0.234 | 0.2581 | 0.2854 | 0.3171 | 0.0241 | 0.0514 | 0.0831
imdb | 50 | 0.002 | 0.5 | 0.5518 | 0.725 | 0.7491 | 0.7981 | 0.1732 | 0.1973 | 0.2463
yelp | 50 | 0.001 | 0.2 | 0.1907 | 0.3106 | 0.3361 | 0.3589 | 0.1199 | 0.1454 | 0.1682
agnews | 70 | 0.0092105 | 0.25 | 0.686 | 0.7384 | 0.7722 | 0.8027 | 0.0523 | 0.0861 | 0.1167
amazonpolar | 70 | 0.000175 | 0.5 | 0.6215 | 0.7615 | 0.7775 | 0.7926 | 0.14 | 0.156 | 0.1711
dbpedia | 70 | 0.001 | 0.0714 | 0.6304 | 0.7154 | 0.7532 | 0.7797 | 0.0851 | 0.1228 | 0.1494
emotion | 70 | 0.035 | 0.1667 | 0.2284 | 0.2912 | 0.3144 | 0.3532 | 0.0629 | 0.086 | 0.1248
imdb | 70 | 0.0028 | 0.5 | 0.6959 | 0.7356 | 0.7709 | 0.7924 | 0.0396 | 0.0749 | 0.0965
yelp | 70 | 0.0014 | 0.2 | 0.3397 | 0.3512 | 0.3706 | 0.4004 | 0.0115 | 0.0309 | 0.0607
agnews | 100 | 0.0131579 | 0.25 | 0.6515 | 0.7909 | 0.8005 | 0.8194 | 0.1394 | 0.149 | 0.1679
amazonpolar | 100 | 0.00025 | 0.5 | 0.7612 | 0.7697 | 0.7852 | 0.8049 | 0.0085 | 0.024 | 0.0437
dbpedia | 100 | 0.0014286 | 0.0714 | 0.5767 | 0.8067 | 0.8254 | 0.8538 | 0.23 | 0.2487 | 0.2771
emotion | 100 | 0.05 | 0.1667 | 0.3133 | 0.2981 | 0.3223 | 0.3429 | -0.0152 | 0.009 | 0.0296
imdb | 100 | 0.004 | 0.5 | 0.7312 | 0.7404 | 0.7638 | 0.7783 | 0.0092 | 0.0326 | 0.0471
yelp | 100 | 0.002 | 0.2 | 0.3616 | 0.3499 | 0.3778 | 0.3936 | -0.0117 | 0.0162 | 0.032
agnews | 200 | 0.0263158 | 0.25 | 0.8166 | 0.7887 | 0.8047 | 0.8314 | -0.0279 | -0.0118 | 0.0148
amazonpolar | 200 | 0.0005 | 0.5 | 0.7259 | 0.8026 | 0.8162 | 0.8338 | 0.0767 | 0.0904 | 0.108
dbpedia | 200 | 0.0028571 | 0.0714 | 0.77 | 0.8874 | 0.8968 | 0.9134 | 0.1174 | 0.1268 | 0.1434
emotion | 200 | 0.1 | 0.1667 | 0.3619 | 0.3158 | 0.3451 | 0.3659 | -0.0461 | -0.0168 | 0.004
imdb | 200 | 0.008 | 0.5 | 0.8139 | 0.7795 | 0.7864 | 0.7938 | -0.0344 | -0.0275 | -0.0201
yelp | 200 | 0.004 | 0.2 | 0.3788 | 0.3804 | 0.3956 | 0.4087 | 0.0016 | 0.0168 | 0.0299
agnews | 300 | 0.0394737 | 0.25 | 0.8077 | 0.8077 | 0.8257 | 0.8468 | -0.0001 | 0.018 | 0.039
amazonpolar | 300 | 0.00075 | 0.5 | 0.8344 | 0.8216 | 0.8391 | 0.8519 | -0.0128 | 0.0047 | 0.0174
dbpedia | 300 | 0.0042857 | 0.0714 | 0.8926 | 0.9048 | 0.9182 | 0.9269 | 0.0122 | 0.0256 | 0.0343
emotion | 300 | 0.15 | 0.1667 | 0.3807 | 0.3537 | 0.3675 | 0.3876 | -0.027 | -0.0132 | 0.007
imdb | 300 | 0.012 | 0.5 | 0.7767 | 0.7714 | 0.789 | 0.8036 | -0.0053 | 0.0123 | 0.0269
yelp | 300 | 0.006 | 0.2 | 0.3996 | 0.408 | 0.4186 | 0.4282 | 0.0084 | 0.019 | 0.0286
agnews | 500 | 0.0657895 | 0.25 | 0.8327 | 0.8167 | 0.832 | 0.8401 | -0.0159 | -0.0006 | 0.0075
amazonpolar | 500 | 0.00125 | 0.5 | 0.8355 | 0.8282 | 0.8384 | 0.8452 | -0.0074 | 0.0028 | 0.0096
dbpedia | 500 | 0.0071429 | 0.0714 | 0.902 | 0.9266 | 0.9389 | 0.9464 | 0.0246 | 0.0369 | 0.0444
emotion | 500 | 0.25 | 0.1667 | 0.4049 | 0.3811 | 0.3987 | 0.4088 | -0.0238 | -0.0062 | 0.0038
imdb | 500 | 0.02 | 0.5 | 0.8126 | 0.8092 | 0.811 | 0.8173 | -0.0034 | -0.0016 | 0.0047
yelp | 500 | 0.01 | 0.2 | 0.4317 | 0.4299 | 0.4381 | 0.4432 | -0.0017 | 0.0064 | 0.0115
agnews | 700 | 0.0921053 | 0.25 | 0.8477 | 0.8275 | 0.8427 | 0.8534 | -0.0202 | -0.005 | 0.0058
amazonpolar | 700 | 0.00175 | 0.5 | 0.8421 | 0.8373 | 0.8496 | 0.8638 | -0.0048 | 0.0076 | 0.0217
dbpedia | 700 | 0.01 | 0.0714 | 0.9239 | 0.9425 | 0.9484 | 0.9524 | 0.0186 | 0.0245 | 0.0285
emotion | 700 | 0.35 | 0.1667 | 0.464 | 0.4084 | 0.4222 | 0.4388 | -0.0556 | -0.0417 | -0.0251
imdb | 700 | 0.028 | 0.5 | 0.8334 | 0.8153 | 0.8212 | 0.8261 | -0.0182 | -0.0123 | -0.0073
yelp | 700 | 0.014 | 0.2 | 0.4501 | 0.4356 | 0.4446 | 0.4565 | -0.0145 | -0.0055 | 0.0064
agnews | 1000 | 0.1315789 | 0.25 | 0.8626 | 0.8407 | 0.8507 | 0.8592 | -0.0219 | -0.012 | -0.0034
amazonpolar | 1000 | 0.0025 | 0.5 | 0.85 | 0.8519 | 0.8588 | 0.8657 | 0.0019 | 0.0088 | 0.0157
dbpedia | 1000 | 0.0142857 | 0.0714 | 0.9482 | 0.9477 | 0.9545 | 0.962 | -0.0005 | 0.0063 | 0.0138
emotion | 1000 | 0.5 | 0.1667 | 0.4519 | 0.4304 | 0.4469 | 0.4565 | -0.0215 | -0.005 | 0.0046
imdb | 1000 | 0.04 | 0.5 | 0.843 | 0.8197 | 0.8274 | 0.8405 | -0.0233 | -0.0157 | -0.0025
yelp | 1000 | 0.02 | 0.2 | 0.4644 | 0.4447 | 0.4553 | 0.4697 | -0.0197 | -0.0091 | 0.0053
agnews | 2000 | 0.2631579 | 0.25 | 0.8688 | 0.8636 | 0.8668 | 0.8721 | -0.0053 | -0.002 | 0.0033
amazonpolar | 2000 | 0.005 | 0.5 | 0.8737 | 0.8722 | 0.8774 | 0.8839 | -0.0015 | 0.0037 | 0.0102
dbpedia | 2000 | 0.0285714 | 0.0714 | 0.9611 | 0.9623 | 0.965 | 0.9665 | 0.0012 | 0.0039 | 0.0054
emotion | 2000 | 1 | 0.1667 | 0.5049 | 0.478 | 0.4848 | 0.4945 | -0.0269 | -0.0201 | -0.0104
imdb | 2000 | 0.08 | 0.5 | 0.8507 | 0.8386 | 0.8472 | 0.8527 | -0.0121 | -0.0035 | 0.002
yelp | 2000 | 0.04 | 0.2 | 0.4857 | 0.4683 | 0.4776 | 0.4839 | -0.0174 | -0.0082 | -0.0018
agnews | 3000 | 0.3947368 | 0.25 | 0.8711 | 0.8689 | 0.8704 | 0.8721 | -0.0022 | -0.0007 | 0.0011
amazonpolar | 3000 | 0.0075 | 0.5 | 0.8822 | 0.8748 | 0.8822 | 0.8869 | -0.0074 | 0 | 0.0048
dbpedia | 3000 | 0.0428571 | 0.0714 | 0.9672 | 0.967 | 0.9699 | 0.9716 | -0.0001 | 0.0027 | 0.0045
emotion | 3000 | 1.5 | 0.1667 | 0.512 | 0.5063 | 0.5129 | 0.5188 | -0.0057 | 0.001 | 0.0068
imdb | 3000 | 0.12 | 0.5 | 0.839 | 0.8522 | 0.8541 | 0.8578 | 0.0132 | 0.0151 | 0.0187
yelp | 3000 | 0.06 | 0.2 | 0.4875 | 0.4853 | 0.4885 | 0.4931 | -0.0022 | 0.001 | 0.0056
agnews | 5000 | 0.6578947 | 0.25 | 0.8794 | 0.8736 | 0.8785 | 0.8865 | -0.0058 | -0.0009 | 0.0071
amazonpolar | 5000 | 0.0125 | 0.5 | 0.892 | 0.8813 | 0.889 | 0.8942 | -0.0107 | -0.003 | 0.0022
dbpedia | 5000 | 0.0714286 | 0.0714 | 0.9706 | 0.9713 | 0.972 | 0.973 | 0.0007 | 0.0014 | 0.0024
emotion | 5000 | 2.5 | 0.1667 | 0.5372 | 0.5241 | 0.534 | 0.543 | -0.0131 | -0.0031 | 0.0059
imdb | 5000 | 0.2 | 0.5 | 0.8703 | 0.8565 | 0.8642 | 0.8727 | -0.0138 | -0.0061 | 0.0024
yelp | 5000 | 0.1 | 0.2 | 0.504 | 0.4986 | 0.505 | 0.5137 | -0.0055 | 0.001 | 0.0097
agnews | 10,000 | 1.3157895 | 0.25 | 0.8861 | 0.8865 | 0.8884 | 0.8909 | 0.0003 | 0.0023 | 0.0048
amazonpolar | 10,000 | 0.025 | 0.5 | 0.9012 | 0.898 | 0.8999 | 0.9029 | -0.0032 | -0.0013 | 0.0017
dbpedia | 10,000 | 0.1428571 | 0.0714 | 0.9758 | 0.9743 | 0.9752 | 0.9761 | -0.0014 | -0.0006 | 0.0003
emotion | 10,000 | 5 | 0.1667 | 0.5621 | 0.5508 | 0.5567 | 0.5634 | -0.0112 | -0.0053 | 0.0013
imdb | 10,000 | 0.4 | 0.5 | 0.8787 | 0.8694 | 0.8744 | 0.8766 | -0.0093 | -0.0043 | -0.0021
yelp | 10,000 | 0.2 | 0.2 | 0.528 | 0.5232 | 0.5284 | 0.5332 | -0.0047 | 0.0004 | 0.0052
agnews | 16,000 | 2.1052632 | 0.25 | 0.8977 | 0.8927 | 0.895 | 0.8983 | -0.005 | -0.0027 | 0.0006
amazonpolar | 16,000 | 0.04 | 0.5 | 0.9042 | 0.9029 | 0.9057 | 0.9074 | -0.0012 | 0.0016 | 0.0032
dbpedia | 16,000 | 0.2285714 | 0.0714 | 0.9778 | 0.9779 | 0.9781 | 0.9782 | 0.0001 | 0.0003 | 0.0004
emotion | 16,000 | 8 | 0.1667 | 0.5841 | 0.5743 | 0.5844 | 0.5958 | -0.0099 | 0.0003 | 0.0116
imdb | 16,000 | 0.64 | 0.5 | 0.8802 | 0.8808 | 0.8844 | 0.8869 | 0.0006 | 0.0041 | 0.0066
yelp | 16,000 | 0.32 | 0.2 | 0.5412 | 0.5387 | 0.5426 | 0.5463 | -0.0025 | 0.0014 | 0.0051
agnews | 20,000 | 2.6315789 | 0.25 | 0.8944 | 0.8949 | 0.8964 | 0.8986 | 0.0005 | 0.002 | 0.0041
amazonpolar | 20,000 | 0.05 | 0.5 | 0.9098 | 0.9055 | 0.9077 | 0.9107 | -0.0043 | -0.0021 | 0.0009
dbpedia | 20,000 | 0.2857143 | 0.0714 | 0.9791 | 0.9785 | 0.979 | 0.9794 | -0.0005 | -0.0001 | 0.0004
imdb | 20,000 | 0.8 | 0.5 | 0.8844 | 0.8797 | 0.8831 | 0.8846 | -0.0047 | -0.0013 | 0.0002
yelp | 20,000 | 0.4 | 0.2 | 0.5511 | 0.548 | 0.5502 | 0.5556 | -0.0031 | -0.0009 | 0.0044
agnews | 25,000 | 3.2894737 | 0.25 | 0.9004 | 0.8952 | 0.8982 | 0.9005 | -0.0052 | -0.0021 | 0.0002
amazonpolar | 25,000 | 0.0625 | 0.5 | 0.9106 | 0.9092 | 0.9103 | 0.912 | -0.0014 | -0.0003 | 0.0015
dbpedia | 25,000 | 0.3571429 | 0.0714 | 0.9798 | 0.9793 | 0.9798 | 0.9804 | -0.0005 | 0 | 0.0006
imdb | 25,000 | 1 | 0.5 | 0.8863 | 0.8842 | 0.8866 | 0.8879 | -0.0021 | 0.0003 | 0.0016
yelp | 25,000 | 0.5 | 0.2 | 0.5537 | 0.5476 | 0.5539 | 0.5568 | -0.0061 | 0.0002 | 0.003
