1. Introduction
Gems are high-value and diverse minerals or rocks [
1]. Effectively identifying gem species can prevent confusion between counterfeits, synthetic gems, and natural gems. Traditional gem identification methods mainly rely on manual observation, testing, and experience-based judgment, which are subjective, unstable, inaccurate, and inefficient. To overcome these issues, spectroscopic analysis technology, as a non-destructive, rapid, accurate, and objective identification method, has gradually gained widespread attention and application [
2]. Spectral analysis is a method that uses the spectrum of a substance to identify its composition and relative chemical content. Ultraviolet–visible (UV–VIS) absorption spectroscopy determines the composition, structure, and properties of jade by analyzing the selective absorption characteristics of valence electrons within the material under electromagnetic radiation [
3]. The position of absorption peaks and the intensity of absorption in the spectrum are used for qualitative and quantitative analysis of the components. Due to the different crystal structures and chemical compositions of gemstones, they exhibit distinct ultraviolet-visible absorption spectral features. UV–VIS absorption spectroscopy is also included as a common method for gem identification in the national standard GB/T 42645-2023 [
3] “Gem Identification—Ultraviolet–Visible Absorption Spectroscopy”. UV–VIS spectroscopy is highly efficient and enables rapid, direct, and non-destructive testing of samples [
4]. However, because gemstone types are so varied, their ultraviolet spectra take complex and diverse forms. Identification requires jointly considering the wavelength position, number, shape, and relative intensity of the absorption peaks, which makes spectra difficult to describe and classify quickly and accurately by subjective judgment. Gemstone identification based on ultraviolet spectroscopy therefore relies mainly on chemometric models that characterize the relationship between spectral data and gemstone categories [
5]. Traditional spectral analysis methods include Principal Component Analysis (PCA) [
6,
7], Partial Least Squares Discriminant Analysis (PLS-DA), Support Vector Machines (SVM) [
8], among others. Although the resolution of spectral measurement equipment continues to improve, yielding more useful information and less noise, spectra remain high-dimensional, susceptible to interference, and subject to unavoidable residual noise. These properties make it difficult for traditional chemometric methods to extract effective features directly from spectra. How to use advanced computational techniques to improve the accuracy and efficiency of jade ultraviolet spectral analysis, and thereby achieve rapid jade identification, is therefore a pressing problem.
In recent years, deep-learning-based spectral analysis techniques have rapidly developed [
9] and have been widely applied in fields such as agricultural products [
10], pharmaceuticals [
11], minerals, and medicine [
12]. In 2017, Acquarelli et al. were the first to apply convolutional neural network (CNN)-based methods to classify ten different types of spectra [
13]. Compared to traditional chemometric methods, CNNs achieved higher accuracy in quantitative analysis and were less dependent on spectral data processing. In 2019, another study [
14] introduced the Inception structure [
15] based on CNN, using combinations of convolution kernels of different sizes to extract multi-scale local features of the spectra, thus improving the adaptability of the network. CNN-based classification models [
16,
17,
18] can effectively capture local features of spectra through convolutional layers, but the fixed size of the convolution kernels limits CNNs in learning long-range features. Spectral data, characterized by wavelength or frequency amplitudes, have properties similar to time–frequency sequences [
19], and CNNs struggle to capture long-distance feature dependencies. One study [
20] combined visible-near-infrared spectra with LSTM and CNN to propose a prediction model for orange quality, which generally outperformed the CNN-based DeepSpectra and CNN-AT models in predicting five quality aspects of oranges, except for vitamin C content prediction, where it was slightly inferior to DeepSpectra.
The Transformer model, which emerged in natural language processing and is now widely applied across its tasks, has achieved considerable success in deep learning. Compared with CNNs, Transformers perform better in tasks requiring long-distance dependencies [
21]. In recent years, Transformer-based methods have also been proposed for one-dimensional signal classification [
22,
23,
24]. Compared with CNNs, Transformers more easily leverage the self-attention mechanism, compensating for CNNs’ inherent limitations in long-range dependencies. One study proposed the SpectraTr [
11] model based on Transformers for spectral classification, but Transformers have significantly more parameters than CNNs, slower inference speeds, and require more training samples. The combination of CNNs and Transformers has shown significant advantages in recent research. CNNs excel at extracting local features, while Transformers can capture long-range dependencies and global information. This combination has achieved remarkable results and widespread application in image classification and recognition, natural language processing, medical image analysis, and multimodal fusion. One study [
25] proposed a lightweight CNN–Transformer model to solve the traveling salesman problem (TSP), combining CNN embedding layers and partial self-attention mechanisms to better learn spatial features from input data and reduce redundancy in fully connected attention models, showing clear performance and accuracy advantages over other deep learning models. Another study [
26] similarly combined the strengths of CNNs and Transformers, modeling the time–frequency features of EEG signals through alternating model structures, capturing both local features and long-range dependencies, achieving an ROC curve area (AUC) of 93.5% on the CHB-MIT database. Although spectral data have properties similar to time–frequency sequences, there are currently no models combining CNNs and Transformers applied to spectral analysis tasks.
Motivated by this similarity between spectral data and time–frequency sequences, this paper proposes SpectraViT, a hybrid CNN–Transformer spectral classification model. On a jade ultraviolet spectral dataset, the model outperformed the traditional SVM and PLS-DA methods as well as other deep learning classification methods, indicating that it is an effective solution. Compared with a pure Transformer model, it also consumes fewer computational resources and less memory, balancing efficiency and accuracy.
The rest of this paper is organized as follows: the second part introduces the jade ultraviolet spectral dataset and the architecture of the hybrid model, including the inverted residual structure and Transformer. The third part presents the application experiments of the model and discusses the experimental results in detail. The final section provides the conclusion.
3. Results
3.1. Experimental Environment
The experiments were conducted on a desktop computer equipped with a GTX 1050Ti graphics processing unit (GPU). The software environment comprised the Windows 10 operating system, CUDA 12.1, and cuDNN 8.9.7, with PyTorch 2.0 as the deep learning framework and Python 3.10 as the programming language. The network was trained with the Adam optimizer for 500 epochs, using a batch size of 10 and an initial learning rate of 1 × 10⁻⁴. The learning rate was adjusted dynamically by cosine annealing, which lowered it to a final value of 1 × 10⁻⁶. Under this schedule the learning rate decreases slowly at first, more quickly in the middle of training, and slowly again near the end, so that the update steps stay large while the loss is far from a minimum and become small as training approaches one. In this experiment, the schedule helped the model avoid poor local optima, converge faster, and generalize better.
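The learning-rate schedule described above can be sketched in a few lines. This is a minimal illustration that assumes the standard closed-form cosine annealing rule without warm restarts; the function name `cosine_annealed_lr` is hypothetical, and only the initial rate, the final rate, and the 500 epochs are taken from the text.

```python
import math


def cosine_annealed_lr(epoch, total_epochs=500, lr_max=1e-4, lr_min=1e-6):
    """Closed-form cosine annealing from lr_max (epoch 0) to lr_min (last epoch)."""
    cos_term = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos_term)


if __name__ == "__main__":
    # Print the rate at a few epochs to see the slow-fast-slow decay.
    for epoch in (0, 125, 250, 375, 500):
        print(epoch, f"{cosine_annealed_lr(epoch):.3e}")
```

In PyTorch, the same rule is available as `torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500, eta_min=1e-6)`; whether the experiments used this exact scheduler class is not stated in the text.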
3.2. Model Performance on the Jade Spectral Dataset
The model with the highest accuracy on the validation set was then evaluated on the test set. As shown in
Table 2, the model achieved an accuracy of 99.24% in the jade classification task, demonstrating excellent classification capability. The model’s recall rate is slightly higher than its precision, indicating a tendency to capture positive samples, although this may lead to some false positives. The F1 score is 99.23%, reflecting a good balance between precision and recall. As shown in
Figure 6, the loss curve (a) and accuracy curve (b) indicate that our model converged after 190 training iterations and achieved high accuracy. The confusion matrix in
Table 3 shows that the model performs very well on the D, CVD, and MS categories, but there are some misclassifications in the HPHT category, mainly misclassifying HPHT samples as CVD. Overall, the model exhibits an outstanding performance.
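The relationship between a confusion matrix and the reported metrics can be made concrete with a short sketch. It assumes that rows are true classes, columns are predicted classes, and that the multi-class scores are macro-averaged; the averaging scheme is an assumption, since the text does not state it. The counts below are hypothetical and are not the values of Table 3.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, f1) lists; rows = true class, columns = predicted class."""
    n = len(confusion)
    precision, recall, f1 = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                           # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if (p + r) else 0.0)
    return precision, recall, f1


def macro_average(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)


def accuracy(confusion):
    """Fraction of samples on the diagonal."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical counts (NOT the paper's Table 3); class order: D, CVD, MS, HPHT.
example = [
    [50, 0, 0, 0],
    [0, 50, 0, 0],
    [0, 0, 50, 0],
    [0, 2, 0, 48],
]

if __name__ == "__main__":
    precision, recall, f1 = per_class_metrics(example)
    print("accuracy:", accuracy(example))
    print("macro precision:", macro_average(precision))
    print("macro recall:", macro_average(recall))
    print("macro F1:", macro_average(f1))
```

In this toy matrix, two HPHT samples predicted as CVD lower the recall of HPHT and the precision of CVD while leaving the other classes untouched, which is the kind of error pattern described above.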
3.3. Comparison of the Model with Other Models
To validate the superiority of the SpectraViT model used in this study, comparisons were first made with traditional classification algorithms, namely SVM and PLS-DA. Then, comparisons were conducted with CNN-based models such as AlexNet and DeepSpectra, as well as with the Transformer-based SpectraTr model for time-series sequences.
3.3.1. Comparison with SVM and PLS-DA
SVM and PLS-DA are two traditional machine learning methods widely applied in spectral analysis. SVM is a popular supervised learning model used for classification and regression analysis, with the core idea of finding a hyperplane that maximizes the margin between classes. PLS-DA is a classification method that combines partial least squares regression with linear discriminant analysis and is particularly suitable for high-dimensional data.
In this section, we compared the deep learning model SpectraViT with SVM and PLS-DA to verify its superiority over traditional methods. To implement these two classification algorithms, we first standardized the data. For the SVM model, we used a linear kernel function with parameters C = 1 and γ = auto, applying the One-vs-One strategy to classify the four types of UV spectra. For the PLS-DA model, we selected 100 principal components for feature extraction. All of the models were implemented in Python using the scikit-learn library.
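Two steps of this baseline pipeline, standardization and One-vs-One voting, can be illustrated without any library. The sketch below is not the scikit-learn implementation used in the experiments: the pairwise rule is a nearest-centroid stand-in for a trained linear SVM, and all function names and centroid values are hypothetical.

```python
from collections import Counter
from itertools import combinations


def standardize(train, test):
    """Z-score each feature using statistics estimated on the training set only."""
    n_features = len(train[0])
    means = [sum(row[j] for row in train) / len(train) for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((row[j] - means[j]) ** 2 for row in train) / len(train)
        stds.append(var ** 0.5 or 1.0)  # constant feature: divide by 1 instead of 0

    def scale(rows):
        return [[(row[j] - means[j]) / stds[j] for j in range(n_features)] for row in rows]

    return scale(train), scale(test)


def nearest_centroid_rule(centroids):
    """Stand-in binary rule (NOT an SVM): choose the closer of two class centroids."""
    def decide(x, a, b):
        def dist(c):
            return sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[c]))
        return a if dist(a) <= dist(b) else b
    return decide


def one_vs_one_predict(x, classes, pairwise_decide):
    """Majority vote over all class pairs; pairwise_decide(x, a, b) returns a or b."""
    votes = Counter(pairwise_decide(x, a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


# Hypothetical two-feature centroids for the four classes named in the text.
classes = ["D", "CVD", "MS", "HPHT"]
centroids = {"D": [0.0, 0.0], "CVD": [4.0, 0.0], "MS": [0.0, 4.0], "HPHT": [4.0, 4.0]}
label = one_vs_one_predict([0.5, 3.6], classes, nearest_centroid_rule(centroids))
```

With four classes, One-vs-One builds six pairwise classifiers and assigns the class that wins the most pairwise votes. In scikit-learn, `sklearn.svm.SVC` handles multi-class problems with a one-vs-one scheme internally, and PLS-DA is commonly assembled from `sklearn.cross_decomposition.PLSRegression` fitted to one-hot class labels.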
As shown in
Table 4, the SpectraViT model achieved an accuracy, recall, precision, and F1 score of 99.24%, 99.25%, 99.06%, and 99.23%, respectively. In comparison, SVM and PLS-DA performed worse. These results indicate that SpectraViT offers performance advantages in spectral analysis tasks over traditional machine learning methods, particularly in providing a higher classification accuracy and better model generalization when handling complex datasets.
3.3.2. Comparison with Other Deep Learning Models
In this study, we compared SpectraViT with the AlexNet, DeepSpectra, and SpectraTr deep learning models. The comparison was based on the number of parameters, FLOPs (the number of floating-point operations, a measure of computational cost), and performance metrics on the test set, including accuracy, recall, precision, and F1 score.
Table 5 details the performance of each model:
In terms of the number of parameters, SpectraViT has the fewest, with only 0.852 M, significantly fewer than the other models. AlexNet has 8.291 M parameters, DeepSpectra has as many as 207.398 M, and SpectraTr has 16.030 M. Fewer parameters indicate that our model is more efficient in terms of storage and computation, making it suitable for deployment in resource-constrained environments. SpectraViT also has the lowest FLOPs, at just 0.009 G, indicating the lowest computational complexity. In contrast, AlexNet’s FLOPs are 0.022 G, DeepSpectra’s are 0.213 G, and SpectraTr’s are as high as 2.101 G. Lower FLOPs suggest an advantage for SpectraViT in inference speed and energy consumption.
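To show where such parameter and FLOP figures come from, the sketch below counts both for a 1D convolution and a fully connected layer. The layer sizes are hypothetical and do not correspond to any layer of the compared models. Profiling tools also differ in whether one multiply-accumulate (MAC) counts as one or two floating-point operations, and the convention behind the reported values is not stated.

```python
def conv1d_cost(c_in, c_out, kernel_size, out_length, bias=True):
    """Parameters and multiply-accumulates (MACs) of a standard 1D convolution."""
    params = c_in * c_out * kernel_size + (c_out if bias else 0)
    macs = c_in * c_out * kernel_size * out_length
    return params, macs


def linear_cost(in_features, out_features, bias=True):
    """Parameters and MACs of a fully connected layer applied to one input vector."""
    params = in_features * out_features + (out_features if bias else 0)
    macs = in_features * out_features
    return params, macs


if __name__ == "__main__":
    # Hypothetical layers: a 1 -> 16 channel convolution over 1000 output positions,
    # followed by a 128 -> 4 classification head.
    layers = [conv1d_cost(1, 16, 3, 1000), linear_cost(128, 4)]
    total_params = sum(p for p, _ in layers)
    total_macs = sum(m for _, m in layers)
    print(f"parameters: {total_params}, MACs: {total_macs}")
```

The convolution's cost grows with the sequence length while its parameter count does not, which is why a model can have few parameters yet a comparatively high operation count, or the reverse.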
On the validation set, SpectraViT achieved an accuracy of 99.31%, outperforming all other models. AlexNet and DeepSpectra achieved validation accuracies of 98.77% and 98.43%, respectively, while SpectraTr achieved 97.82%. On the test set, SpectraViT also outperformed the other models, with an accuracy of 99.24%, recall of 99.25%, precision of 99.06%, and F1 score of 99.23%. AlexNet achieved an accuracy of 98.02%, recall of 98.02%, precision of 98.09%, and F1 score of 98.05%. DeepSpectra’s metrics were close to AlexNet’s, at around 98.36%. SpectraTr performed comparatively worse.
Overall, SpectraViT excels in all aspects, particularly in terms of parameter count and computational complexity, while also achieving better accuracy, recall, precision, and F1 score on both the validation and test sets compared to other models. This indicates that SpectraViT is not only more efficient in terms of resource consumption but also superior in performance, with better generalization capability. Thus, SpectraViT demonstrates better performance than other deep convolutional neural network models and pure Transformer networks for jade UV spectrum recognition.
3.4. Impact of Preprocessing Algorithms on Model Performance
The experimental results shown in
Table 6 indicate that preliminary preprocessing of spectral data through interpolation and resampling can significantly enhance the accuracy of classification algorithms. This finding confirms the effectiveness of interpolation and resampling techniques in adjusting data sampling rates and distributions, especially when handling unevenly sampled signals or converting signals between different sampling rates.
In the deep learning process, models using resampling techniques performed the best. With the original data, the SpectraViT model achieved an accuracy and recall of 98.29%, precision of 98.30%, and an F1 score of 98.30%. This demonstrates that even without any preprocessing, the model performs well, indicating a strong robustness to the raw data.
When the raw spectral data were resampled, model performance improved significantly, with accuracy reaching 99.24%, recall of 99.25%, precision of 99.06%, and an F1 score of 99.23%. This suggests that resampling preserves and expresses the most useful initial features of the data.
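The interpolation and resampling step can be sketched as follows. This is a minimal example that assumes linear interpolation onto an evenly spaced wavelength grid and strictly increasing input wavelengths; the interpolation method and target grid actually used are not specified in this section, and the function name `resample_uniform` is hypothetical.

```python
from bisect import bisect_right


def resample_uniform(wavelengths, values, n_points):
    """Linearly interpolate an unevenly sampled spectrum onto n_points evenly spaced wavelengths."""
    if n_points < 2 or len(wavelengths) < 2:
        raise ValueError("need at least two input samples and two output points")
    start, stop = wavelengths[0], wavelengths[-1]
    step = (stop - start) / (n_points - 1)
    grid = [start + i * step for i in range(n_points)]
    grid[-1] = stop  # avoid floating-point overshoot at the last point
    resampled = []
    for w in grid:
        # Index of the right-hand neighbour, clamped so both neighbours exist.
        hi = min(max(bisect_right(wavelengths, w), 1), len(wavelengths) - 1)
        lo = hi - 1
        t = (w - wavelengths[lo]) / (wavelengths[hi] - wavelengths[lo])
        resampled.append(values[lo] + t * (values[hi] - values[lo]))
    return grid, resampled


if __name__ == "__main__":
    # Hypothetical unevenly sampled spectrum (wavelength in nm, absorbance).
    grid, spectrum = resample_uniform([200.0, 201.0, 203.0, 206.0], [0.0, 1.0, 3.0, 6.0], 7)
    print(list(zip(grid, spectrum)))
```

Resampling every spectrum to the same grid gives the network inputs of identical length and spacing, so that a given input position always refers to the same wavelength.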
Further experiments explored the effects of different preprocessing algorithms, including normalization (NMS), standard score transformation (SS), Savitzky–Golay smoothing (SG), and multiplicative scatter correction (MSC), applied after interpolation and resampling. The results showed a slight decrease in model performance with these preprocessing methods. This indicates that these preprocessing methods may lead to some loss of feature information, reducing the model’s classification performance. However, our model still demonstrated strong nonlinear fitting and adaptive learning capabilities, effectively extracting key features from the data. Additionally, our model is capable of achieving good spectral recognition without extensive preprocessing.
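Three of the preprocessing methods named above can be sketched in plain Python. The definitions below are common textbook forms and are assumptions about what the abbreviations denote here: NMS is taken as per-spectrum min-max scaling, SS as per-spectrum z-scoring, and MSC as a least-squares fit of each spectrum to the mean spectrum. Savitzky–Golay smoothing is omitted for brevity, and the MSC sketch assumes the mean spectrum is not flat.

```python
def min_max_normalize(spectrum):
    """Scale one spectrum to the range [0, 1]."""
    lo, hi = min(spectrum), max(spectrum)
    span = (hi - lo) or 1.0  # flat spectrum: avoid division by zero
    return [(v - lo) / span for v in spectrum]


def standard_score(spectrum):
    """Subtract the mean of one spectrum and divide by its standard deviation."""
    mean = sum(spectrum) / len(spectrum)
    std = (sum((v - mean) ** 2 for v in spectrum) / len(spectrum)) ** 0.5 or 1.0
    return [(v - mean) / std for v in spectrum]


def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum.

    Each spectrum s is modelled as s = intercept + slope * reference and
    corrected to (s - intercept) / slope.
    """
    n = len(spectra[0])
    ref = [sum(s[j] for s in spectra) / len(spectra) for j in range(n)]
    ref_mean = sum(ref) / n
    ref_var = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / n
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(ref, s)) / ref_var
        intercept = s_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in s])
    return corrected
```

MSC maps spectra that differ only by an additive offset and a multiplicative gain onto the same curve. That is useful for removing scatter, but it can also discard intensity differences that carry class information, which is one possible reading of the slight performance decrease reported above.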
3.5. Impact of Loss Functions on Model Performance
In the spectral classification experiments, we compared the model accuracy obtained with four different loss functions: Cross Entropy Loss, Weighted Cross Entropy Loss, Focal Loss, and Dice Loss. Cross Entropy Loss achieved the highest accuracy, 99.24%. Weighted Cross Entropy Loss, Focal Loss, and Dice Loss achieved accuracies of 98.54%, 98.32%, and 98.65%, respectively, as shown in
Table 7. These results indicate that, for spectral data classification tasks, the standard Cross Entropy Loss function is the optimal choice among the four loss functions.
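For reference, the four loss functions can be written for a single sample as below. These are common textbook forms operating on predicted class probabilities; the class weights, the focusing parameter `gamma`, and the smoothing constant `eps` are hypothetical defaults, since the settings used in the experiments are not given in this section.

```python
import math


def cross_entropy(probs, target, class_weights=None):
    """Negative log-probability of the true class, optionally scaled by a class weight."""
    weight = class_weights[target] if class_weights else 1.0
    return -weight * math.log(probs[target])


def focal_loss(probs, target, gamma=2.0):
    """Cross entropy scaled by (1 - p_t)^gamma, which down-weights easy samples."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def dice_loss(probs, target, eps=1e-7):
    """One minus the Dice overlap between the probabilities and the one-hot target."""
    one_hot = [1.0 if k == target else 0.0 for k in range(len(probs))]
    intersection = sum(p * y for p, y in zip(probs, one_hot))
    return 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(one_hot) + eps)


if __name__ == "__main__":
    probs = [0.7, 0.1, 0.1, 0.1]  # hypothetical softmax output for four classes
    print("cross entropy:", cross_entropy(probs, 0))
    print("weighted cross entropy:", cross_entropy(probs, 0, [2.0, 1.0, 1.0, 1.0]))
    print("focal:", focal_loss(probs, 0))
    print("dice:", dice_loss(probs, 0))
```

With `gamma = 0`, the focal loss reduces to the standard cross entropy, and with equal class weights the weighted cross entropy does as well, so the variants differ from the standard loss only in how they reweight samples or classes.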
3.6. Ablation Experiments
To investigate the impact of each module on the model, ablation experiments were conducted. Specifically, key modules were removed from the complete model to compare the changes in model performance. The complete model includes four modules: conv1, mv2, transformer, and conv2. Here, conv1 and conv2 are convolution modules with 1 × 1 kernels, used for linear combinations of features across different channels, increasing cross-channel information interaction, and performing dimensionality reduction or expansion to adjust the number of output channels.
In this experiment, comparisons were made by removing either the Mv2 module or the Transformer module from the complete model.
Table 8 shows the results of the ablation experiments, where Accuracy refers to the accuracy on the test set. As seen in
Table 8, removing each module leads to a decrease in the model’s accuracy. Notably, the removal of the Transformer module causes a significant drop in accuracy, indicating the crucial role of the global attention mechanism in spectral classification. The Transformer module, combined with the convolution modules, can more effectively learn spectral features and improve accuracy.
3.7. Performance of SpectraViT on Public Datasets
To further validate the recognition and generalization performance of the SpectraViT model, we conducted a comparative analysis using the publicly available fruit puree dataset. The primary connection between this dataset and the jade dataset lies in their use for qualitative analysis, allowing us to apply similar analysis methods to process and compare the spectral features of different substances. Specifically, this dataset includes 351 strawberry samples, 159 raspberry samples, and 665 “non-fruit” samples (including various other fruits and “contaminated” strawberry and raspberry purees, where the weight percentage of other fruits is greater than 10%). The dataset was sourced from reference [
31]. The results are shown in
Table 9. According to the performance comparison on the fruit puree spectral dataset, our SpectraViT model outperforms all other models across all evaluation metrics (accuracy, recall, precision, and F1 score), achieving 98.47%, 98.49%, 98.47%, and 98.48%, respectively. AlexNet and SpectraTr also performed well, with AlexNet achieving accuracy, recall, precision, and F1 scores of 97.96%, 97.95%, 97.99%, and 97.96%, respectively, and SpectraTr achieving 97.45%, 97.45%, 97.43%, and 97.44%, respectively. In contrast, DeepSpectra had noticeably lower metrics, with an accuracy of about 95.41%.
Traditional machine learning methods did not perform as well on the fruit puree spectral dataset compared with the jade UV spectral dataset, with SVM and PLS-DA achieving accuracies of only 92.93% and 93.88%, respectively. Traditional machine learning methods struggle to achieve a strong performance across multiple datasets, whereas deep learning models can effectively classify multiple spectral datasets. This indicates that deep learning models have a significant advantage in handling spectral data. Furthermore, among deep learning models, our SpectraViT model demonstrates superior accuracy, recall, precision, and F1 score on both the jade UV spectral dataset and the fruit puree infrared spectral dataset compared with other deep learning models. This suggests that SpectraViT is a better-suited deep learning model for spectral data classification.