Article

Deep Learning for Refined Lithology Identification of Sandstone Microscopic Images

1 Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
2 University of Chinese Academy of Sciences, Beijing 100045, China
* Author to whom correspondence should be addressed.
Minerals 2024, 14(3), 275; https://doi.org/10.3390/min14030275
Submission received: 9 January 2024 / Revised: 26 February 2024 / Accepted: 1 March 2024 / Published: 5 March 2024

Abstract:
Refined lithology identification is an essential task, often constrained by the subjectivity and low efficiency of classical methods. Computer-aided automatic identification, while useful, has seldom been geared specifically toward refined lithology identification. In this study, we introduce Rock-ViT, a novel machine learning approach. Its architecture, rooted in visual Transformer principles and enhanced with a supervised contrastive loss, markedly improves accuracy in identifying complex lithological patterns. To validate the method, we collected public datasets and applied data augmentation, using sandstone as a focal point. The results demonstrate that Rock-ViT achieves superior accuracy and effectiveness in the refined lithology identification of sandstone, presenting a feasible approach and fresh insights for detailed lithological analysis.

1. Introduction

In recent years, image recognition technology has made remarkable strides, achieving significant advancements in general applications such as entity identification, fine-grained classification, and scene understanding [1,2]. Despite these advancements, the application of this technology in petrology, particularly in the refined lithology identification through the analysis of microscopic images of rock mineral thin sections, presents unique challenges. These challenges stem from the complex structures, varied features, and inherent ambiguities and uncertainties of rock thin section microscopic images [3,4,5]. Further complexities arise from difficulties in data collection and the specialized nature of domain-specific knowledge, leading to a slower pace of technological integration in this field. Accurate lithological identification of rock thin sections is essential in mineral resource exploration and geological surveys [6,7], particularly for sedimentary rocks such as sandstones. The fine-grained lithological classification of sandstones not only reveals their genesis and geological environment but also plays a critical role in understanding tectonic evolution, sedimentary system development, and hydrocarbon reservoir formation mechanisms. Traditional lithological identification, primarily conducted manually by geologists, is not only time-consuming but also faces challenges in quantitative analysis of large volumes of thin sections [8,9].
Recent advancements in computer-assisted methods, particularly in neural network-based lithological identification, have largely focused on coarse-grained hand specimen recognition or specific mineral identification under microscopes [10,11,12]. However, there remains a significant gap in the detailed classification and recognition of lithologies at a finer scale. In 2010, Baykan and Yılmaz developed a method for mineral identification using color spaces and artificial neural networks. This method utilized RGB and HSV color parameters from thin rock sections, analyzed through a three-layer feed-forward neural network trained on manually classified minerals. Focusing on five minerals, including quartz and biotite, the network demonstrated a high success rate of 81%–98% in identifying minerals, highlighting its potential in geological studies [13]. In their 2017 research, Li and colleagues developed “Festra”, a transfer learning method for automatically classifying sandstone microscopic images. This method effectively addresses inter-regional variations in sandstone samples by combining feature selection and enhanced TrAdaBoost for instance transfer. Tested on sandstone images from four Tibetan regions, Festra demonstrated high accuracy and efficiency, making it a practical solution for field-based geological classification [14].
In this study, we introduce Rock-ViT, a novel automated machine learning approach for refined lithology identification. This approach leverages a Vision Transformer (ViT)-based model adept at discerning subtle variations among different rock lithologies. Enhanced by supervised contrastive loss, the model’s architecture, derived from visual Transformer principles, shows improved discrimination abilities in identifying challenging lithological samples. To evaluate the effectiveness of our proposed machine-learning method in lithological classification, we initially constructed a comprehensive dataset. This dataset was compiled from five micrograph rock-slice datasets sourced from ScienceDB (https://www.scidb.cn/en, accessed on 29 February 2024), encompassing over 2000 images. To enhance the dataset and further bolster network robustness, we incorporated a self-labeling image augmentation mechanism [15], expanding the dataset to over 7000 images. Extensive experiments conducted on both the original and augmented datasets demonstrate that our algorithm outperforms existing machine learning-based approaches and traditional tools in lithological classification.
The rest of this paper is organized as follows: Section 2 elaborates the techniques; Section 3 describes the experimental scheme; Section 4 analyzes and discusses the experimental results; and finally, Section 5 briefly draws the conclusions and provides an overview of potential future work.

2. Literature Review

Lithology identification is fundamental for stratigraphic analysis, resource estimation, and geological genesis analysis [16]. It provides essential information for the prevention and treatment of geological disasters and geological engineering. While field-based visual judgment of hand specimens is common, it often lacks precision and objectivity. Therefore, more refined lithology identification in laboratories is essential.

2.1. Traditional Methods

Traditional lithology identification predominantly relies on physical testing methods. Techniques like Scanning Electron Microscopy (SEM), X-ray Diffraction (XRD), and Electron Probe Microanalysis (EPMA) are employed to analyze rock properties such as density, magnetism, conductivity, and elemental content. SEM, in particular, is widely used and can provide various types of information like secondary electrons, back-scattered electrons, characteristic X-rays [17], and light (cathodoluminescence). Grain counting can be done using an electron microprobe, but the throughput is limited for large grain counts, as each analysis can take several seconds. Laboratory-based lithology identification necessitates high-precision equipment and a specific work environment. Different equipment facilities may generate various types of data [18]. However, due to the high cost and time-consuming nature of these methods, thin section identification remains the primary method for lithology identification in current engineering practices.
Thin section identification is a traditional method that uses image recognition to identify mineral lithology. Rock samples are sliced into thin sections, and geologists observe the crystallization characteristics of minerals under a polarizing microscope. The mineral composition of rocks is determined by measuring their optical properties. Compared to experimental analysis, this method is time- and cost-effective. However, because the results are strongly subjective and demand considerable expertise from researchers, lithology identification is often challenging [19,20]. If intelligent lithology identification can be achieved, it will not only reduce the workload of researchers but also enable more practitioners to obtain efficient and objective identification results.

2.2. Computer Vision-Based Computational Methods

With the rapid development of computer vision and machine learning technologies, these techniques have been widely applied in general fields such as autonomous driving, image recognition, automatic medical diagnosis, and security surveillance [21]. This advancement has further propelled the application of machine learning-based methods in the field of Earth sciences, particularly in the intelligent analysis and application of petrology [22]. It provides new perspectives and tools for Earth sciences. Machine learning-based lithology identification methods have demonstrated tremendous potential in enhancing identification speed and accuracy, especially when dealing with complex lithology identification. As algorithms and computational abilities continue to improve, it is expected that these technologies will play an increasingly vital role in future geological research and practice.
Currently, the rapid advancements in deep learning technology have positioned it as a key research tool in the field of rock identification. With the application of these advanced methods, the automation and intelligent recognition of lithology have become a reality. The introduction of deep learning significantly reduces the workload of researchers in complex data processing and manual feature extraction, enhancing the efficiency and accuracy of rock identification [23]. In 2019, Hao et al. [24] applied these principles to classify heavy minerals from river samples, using a dataset of 3067 grains across 22 classes. By employing 26 decision attributes and a Random Forest algorithm, they achieved an impressive 98.8% accuracy. This highlighted the potential of machine learning in accurately classifying complex geological samples with high precision.
In 2020, Baraboshkin and team [25] developed a time-efficient rock description method using convolutional neural networks (CNNs), including AlexNet, VGG, GoogLeNet, and ResNet. Their approach, particularly with GoogLeNet, achieved up to 95% precision, significantly accelerating the rock typing process in geological studies.
In 2021 [26], a pioneering study leveraged deep learning for semantic segmentation of sandstone thin sections, enhancing automated mineralogical analysis in sedimentary petrology. Employing CNN-based methods, the research focused on pixel-scale mineral identification from 2D RGB images obtained via transmission light microscopy. The study successfully trained models for binary pore–mineral segmentation and a detailed 10-class segmentation, showing that models like DeepLabv3+ and ResNet-18 outperform VGG networks in accuracy. This advancement underscores the potential of CNNs in providing detailed petrological insights from thin section images, offering valuable contributions to geological studies and petrophysical evaluations.
After 2022, building upon these foundations, researchers have further advanced the field by utilizing deep neural networks for automated mineral segmentation and recognition. Latif et al. [27] proposed a superpixel segmentation method followed by the use of the ResNet architecture for mineral recognition, achieving 49.23% accuracy in a five-class classification, a notable result given the complexity of the task. Liu et al. and Cui et al. [28,29] focused on direct recognition of rock thin slices, with Liu et al. incorporating the Mask Region-based Convolutional Neural Network (Mask R-CNN) and Cui et al. proposing a Vision Transformer-based deep model. These methods offer promising avenues for the identification and analysis of rock thin slices, a critical aspect of geological research. Rock thin-slice identification can be framed as a pure recognition task, in which the segmentation step can be simplified or omitted.
In 2023, Tang et al. [30] introduced a modified DeepLabv3+ model for mineral segmentation, combined with a Fully Connected Conditional Random Field (FC-CRF) for recognition, achieving up to 99% accuracy for certain minerals. This model, which employs a semantic segmentation algorithm with a spatial pyramid module and an encoder–decoder architecture, represents a significant step forward in precise mineral recognition. Shi et al. [12] introduced a classification model for fine-grained lithology identification, which was tested on 160 laboratory lithologies and 13 on-site lithologies. Achieving an F1 score of 0.9764 in lab and 0.6143 in field conditions, the study underscores the model’s varying performance across different environments. It delves into the challenges of identifying fine-grained lithologies with macroscopic images, offering insights for future research on reservoir-related issues.
In 2024 [31], a study used convolutional neural networks (CNNs) to identify key microstructures affecting the elastic wave velocity and resistivity of Berea sandstone. By applying machine learning to visualize characteristic microstructures, the research found that larger grains influence P-wave and S-wave velocities, while pore spaces affect the VP/VS ratio. Electrical properties were linked to grain edges and pore tortuosity. This approach, with a relative error of 2%–7% for elastic properties, highlights CNNs’ potential in geophysical property prediction and microstructure characterization.
Liu et al. [32] utilized the Segment Anything Model (SAM) for quantifying mineral grain morphology in Yichun rare-metal granite, applying fractal methods to analyze quartz, lepidolite, and albite grains. Their study revealed scaling invariance in mineral grains and precise fractal dimensions, highlighting distinct distribution patterns among different minerals. This approach efficiently characterizes complex mineral patterns, aiding in understanding mineralization processes.
The integration of Transformer architectures in geological studies signifies a trend towards more sophisticated analyses in lithology identification and mineral analysis. These advanced models enhance precision and depth, enabling nuanced interpretations of complex geological data. Transformers’ impact is profound, facilitating groundbreaking approaches in segmenting, recognizing, and analyzing geological features, thus marking a transformative shift in Earth sciences research.

3. Methods

The workflow and key techniques of the Rock-ViT method are elaborated in this section. The workflow of Rock-ViT mainly includes three stages: establishment of micrograph rock-slice datasets, augmentation of micrograph rock-slice datasets, and training the model on the dataset for lithology identification.

3.1. Dataset Acquisition and Augmentation

We collected thin section images of sandstone from five different regions from the publicly available ScienceDB dataset. The geographical locations of these data primarily cover the northeastern part of the Ordos Basin (https://cstr.cn/31253.11.sciencedb.j00001.00087, accessed on 29 February 2024), the eastern edge of the Ordos Basin (https://cstr.cn/31253.11.sciencedb.j00001.00048, accessed on 29 February 2024), the central and northern part of the Lhasa terrane in Tibet (https://cstr.cn/31253.11.sciencedb.j00001.00013, accessed on 29 February 2024), southwestern China (https://cstr.cn/31253.11.sciencedb.j00001.00024, accessed on 29 February 2024), and the middle Yangtze River (https://cstr.cn/31253.11.sciencedb.j00001.00046, accessed on 29 February 2024). The dataset spans from the Permian to the Cretaceous periods. After merging the dataset and conducting quality screening, the number of thin section images exceeded 2000. The main subdivided lithologies include Quartz Lithic Sandstone, Feldspathic Quartz Sandstone, Feldspathic Litharenite, Lithic Sandstone, and Quartz Sandstone, along with a small number of sandstones with complex compositions; see Figure 1.
To elucidate the petrological framework of our study, we present a brief overview of the petrological characteristics of the sandstones from each region, as follows:
  • Northeastern Ordos Basin: Characterized by Middle Jurassic clastic rocks, predominantly lithic sandstone, interspersed with fine-grained sandstone and gravel. The quartz is mainly single-crystal, complemented by microcline and plagioclase feldspar. The lithic fragments consist mostly of siliceous and clayey materials, with some metamorphic debris, reflecting the sedimentary dynamics and mineralogical compositions pertinent to the basin’s Middle Jurassic period.
  • Eastern Ordos Basin: Features Upper Paleozoic tight sandstone reservoirs with low porosity and permeability, showcasing a significant degree of heterogeneity. The predominant rock types include feldspathic quartz sandstone, lithic quartz sandstone, and quartz sandstone, indicative of the complex depositional and diagenetic history of the region.
  • Central and Northern Lhasa Terrane, Tibet: Encompasses Cretaceous clastic rocks, recording significant geological events such as the closure of the Bangong-Nujiang Ocean and the early uplift of the Tibetan Plateau. These formations provide insights into the region’s sedimentary and tectonic evolution during the Cretaceous.
  • Southwestern China: Comprises Permian volcanolithic sandstones, marked by a high content of volcanic rock fragments. This highlights the region’s volcanic activity during the Permian and its impact on sedimentary processes.
  • Middle Yangtze River: The dataset includes Triassic to Jurassic sandstones, with a notable presence of metamorphic grains. These characteristics are essential for understanding the sedimentary provenance, tectonic settings, and depositional history of the region during the Mesozoic era.
These detailed petrological characteristics significantly enhance our understanding of sedimentary geology across a range of geological epochs. The breadth of the dataset, covering extensive geological periods and a diverse array of lithologies, provides an unparalleled resource for in-depth exploration of sedimentary processes, diagenetic histories, and mineralogical compositions. This foundation not only deepens our comprehension of the complex nature of sedimentary geology but also sets the stage for advanced analytical endeavors, offering unique insights into the sedimentary record with unprecedented detail.
This type of sandstone data is quite representative and generic; see Figure 2. However, in current deep learning practice, model quality is strongly correlated with the number of training samples [33,34,35]. Building such datasets still relies on extensive manual annotation and verification, which is impractical for expanding small-scale data. Therefore, we introduced image augmentation techniques [36], including flipping, color transformation, cropping, rotation, translation, and noise injection, to simulate common scenarios that occur when observing rock thin sections under a real microscope.
  • Rotation: 1000 randomly selected images are rotated by 30°, 60°, 90°, and 120° clockwise.
  • Flipping: 1000 randomly selected images are changed through horizontal, vertical, and diagonal mirror flipping.
  • Gaussian Blur: 3×3, 5×5, and 7×7 Gaussian kernels are applied to globally blur 1000 randomly selected images.
  • Noise Injection: 5%, 7.5%, 10%, and 12.5% global Gaussian noise are injected into 1000 randomly selected images.
  • Contrast Enhancement: the visual disparities between dark and light regions in 1000 randomly selected images are amplified to improve the perceptual differentiation between various intensity levels.
  • Adding Sunlight: simulated incandescent lighting is added to 500 randomly selected images, manipulating the spectral composition and intensity of the illumination within the image.
The number of valid samples after augmentation reached 7731, which were divided into a training set and test set at the ratio of 7:3. Based on image processing, the self-labeling mechanism was introduced to complete the augmentation of labeled image samples and corresponding label files, as shown in Figure 3.
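As a rough illustration (not the paper's implementation), the flipping and noise-injection transforms above can be sketched in NumPy; the 5% noise level and the mirror directions follow the list above, while the array shapes are arbitrary examples:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image: np.ndarray) -> list[np.ndarray]:
    """Generate simple augmented views of one thin-section image.

    A minimal sketch of a subset of the augmentations described above;
    the full pipeline also includes arbitrary-angle rotation, Gaussian
    blur, contrast enhancement, and simulated incandescent lighting.
    """
    views = []
    # Flipping: horizontal, vertical, and diagonal (transpose) mirrors.
    views.append(image[:, ::-1])
    views.append(image[::-1, :])
    views.append(image.transpose(1, 0, 2))
    # Rotation: 90-degree step (np.rot90 keeps the array rectangular).
    views.append(np.rot90(image))
    # Noise injection: additive Gaussian noise at 5% of the dynamic range.
    noisy = image + rng.normal(0.0, 0.05 * image.max(), size=image.shape)
    views.append(np.clip(noisy, 0.0, 1.0))
    return views

img = rng.random((64, 64, 3))
aug = augment(img)
print(len(aug))  # 5
```

Under the self-labeling mechanism, each augmented view inherits the label of its source image, which is how the label files are expanded alongside the images.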

3.2. Model: Region Vision Transformer

With the increasing complexity of tasks in computer vision and artificial intelligence, deep neural network models are becoming progressively intricate. Deep learning methods have seen successful applications in various fields recently, as referenced in sources [37,38]. Interest in convolutional neural networks (CNNs) surged in 2012 with the introduction of AlexNet. Subsequent developments included GoogLeNet and ResNet [39,40]. More recently, the Vision Transformer (ViT) and its variants [38] have demonstrated their robust abilities, achieving results comparable to those of CNNs in image classification.
To handle 2D images, the image $X \in \mathbb{R}^{h \times w \times c}$ is reshaped into a group of non-overlapping flattened 2D patches, $X_p \in \mathbb{R}^{n \times (p^2 \cdot c)}$, where $c$ is the number of channels, $(h, w)$ is the resolution of the original image, and $(p, p)$ is the resolution of each image patch. RegionViT takes large non-overlapping patches (e.g., 28 × 28) as regional tokens and smaller patches (e.g., 4 × 4) as local tokens for each region [38]. Similar to BERT's [class] token in natural language processing, a linear projection is applied to the patches to obtain the learnable embeddings, as shown in Figure 4 and Figure 5.
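The patch-splitting step can be illustrated with a small NumPy sketch (the 4 × 4 local and 28 × 28 regional patch sizes follow the example above; the 224 × 224 image size and the omission of the linear projection are simplifications for illustration):

```python
import numpy as np

def to_patches(x: np.ndarray, p: int) -> np.ndarray:
    """Reshape an (h, w, c) image into (n, p*p*c) flattened, non-overlapping patches."""
    h, w, c = x.shape
    assert h % p == 0 and w % p == 0
    x = x.reshape(h // p, p, w // p, p, c)
    # Gather each p-by-p window into one row of the output.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
local = to_patches(img, 4)      # local tokens: (224/4)^2 = 3136 patches
regional = to_patches(img, 28)  # regional tokens: (224/28)^2 = 64 patches
print(local.shape, regional.shape)  # (3136, 48) (64, 2352)
```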
Multi-Head Attention (MHA) [29] is the key component of Transformer architectures [41]. It gives the Transformer a robust structure in which multiple independent heads attend to different information (global and local) to extract more comprehensive and richer features. Let $Q$ denote the concatenation of $\{Q_i\}_{i=1}^{h}$ (and similarly for $K$ and $V$); the computation of MHA is defined in Equation (1):
$$\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i, \quad \mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i), \quad \mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W,$$
where $Q_i$, $K_i$, and $V_i$ denote the query, key, and value matrices of the $i$-th head, $d_k$ is the dimension of each head, and $W$ is the output projection matrix.
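Equation (1) can be sketched directly in NumPy; the identity output projection $W$ and the shared input for queries, keys, and values are simplifications for illustration:

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, n_heads):
    """MHA per Equation (1): split d_model across heads, attend, concatenate."""
    d_model = Q.shape[-1]
    d_k = d_model // n_heads
    heads = []
    for i in range(n_heads):
        s = slice(i * d_k, (i + 1) * d_k)
        Qi, Ki, Vi = Q[:, s], K[:, s], V[:, s]
        # Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / sqrt(d_k)) V_i
        heads.append(softmax(Qi @ Ki.T / np.sqrt(d_k)) @ Vi)
    W = np.eye(d_model)  # output projection, identity here for brevity
    return np.concatenate(heads, axis=-1) @ W

rng = np.random.default_rng(0)
x = rng.random((10, 64))            # 10 tokens, d_model = 64
out = multi_head_attention(x, x, x, n_heads=8)
print(out.shape)  # (10, 64)
```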
To enable communication between the two types of tokens, the model first performs self-attention on the regional tokens and then jointly attends to the local tokens of each region together with their associated regional token. In this way, regional tokens pass global contextual information to local tokens efficiently while also learning from the local tokens themselves. For clarity, we refer to this two-stage attention mechanism as Regional-to-Local (R2L) attention (see Figure 4 for an illustration). Let $x_r^{(d-1)}$ and $x_l^{(d-1)}$ denote the regional and local tokens entering the $d$-th layer. A standard Transformer encoder then processes the regional and local tokens separately. The R2L Transformer encoder is expressed in Equation (2):
$$y_r^{(d)} = x_r^{(d-1)} + \mathrm{RSA}\big(\mathrm{LN}(x_r^{(d-1)})\big), \qquad y_{i,j}^{(d)} = \Big[\, y_{r_{i,j}}^{(d)} \,\Big\|\, \big\{ x_{l_{i,j,m,n}}^{(d-1)} \big\}_{m,n \in M} \,\Big],$$
where $i$ and $j$ are the spatial indices of the regional tokens, and $m$ and $n$ index the local tokens within a window of size $M^2$. The input to the LSA, $y_{i,j}^{(d)}$, includes one regional token and its corresponding local tokens, so information is exchanged between local and regional tokens. The RSA exchanges information among all regional tokens, covering the context of the whole image. A feed-forward network (FFN) is then applied after the self-attention layers in each encoder block. It comprises two linear transformation layers with a nonlinear activation function between them, as represented in Equation (3):
$$z_{i,j}^{(d)} = y_{i,j}^{(d)} + \mathrm{LSA}\big(\mathrm{LN}(y_{i,j}^{(d)})\big), \qquad x_{i,j}^{(d)} = z_{i,j}^{(d)} + \mathrm{FFN}\big(\mathrm{LN}(z_{i,j}^{(d)})\big), \qquad \mathrm{FFN}(X) = W_2\, \sigma(W_1 X),$$
where $W_1$ and $W_2$ are the parameter matrices of the two linear transformation layers, and $\sigma$ is a nonlinear activation function such as GELU. The LSA combines features among the tokens belonging to one spatial region, including both the regional and local tokens. Note that the weights are shared between the RSA and LSA, except for the layer normalization; therefore, the number of parameters does not increase significantly compared to a standard Transformer encoder. With these two attentions, R2L can effectively and efficiently exchange information among all regional and local tokens: the self-attention on regional tokens extracts high-level information and acts as a bridge that passes information on local tokens from one region to the others, while the local attention focuses on local contextual information within one region, including its regional token. For classification, the model is trained with the standard cross-entropy loss, defined in Equation (4):
$$\mathcal{L}_{ce} = -\frac{1}{N} \sum_{i=1}^{N} y_{ij} \log(\hat{y}_{ij}),$$
where $N$ is the number of training samples; $y_{ij}$ is an indicator that equals 1 when sample $i$ belongs to category $j$, and 0 otherwise; and $\hat{y}_{ij}$ denotes the predicted probability that sample $i$ belongs to category $j$. As our model is a ViT-based architecture, we use a ViT pre-trained on large datasets and then fine-tune it for the downstream task with smaller data.
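The two-stage R2L attention described above can be sketched as follows. This is a single-head simplification with identity projections and the FFN omitted, not the RegionViT implementation; the region count, window size, and embedding dimension are arbitrary illustrative values:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attn(x: np.ndarray) -> np.ndarray:
    """Single-head self-attention with identity Q/K/V projections, for brevity."""
    d = x.shape[-1]
    return softmax(x @ x.T / np.sqrt(d)) @ x

def r2l_block(x_reg: np.ndarray, x_loc: np.ndarray):
    """One R2L step: x_reg is (R, d) regional tokens, x_loc is (R, M*M, d) local tokens."""
    # Stage 1 (RSA): regional tokens attend to each other (global context).
    y_reg = x_reg + self_attn(x_reg)
    out_reg, out_loc = [], []
    # Stage 2 (LSA): each region's token is prepended to its own local tokens,
    # so global context flows into the local tokens and back.
    for r in range(x_reg.shape[0]):
        y = np.concatenate([y_reg[r:r + 1], x_loc[r]], axis=0)
        z = y + self_attn(y)  # FFN and LayerNorm omitted in this sketch
        out_reg.append(z[0])
        out_loc.append(z[1:])
    return np.stack(out_reg), np.stack(out_loc)

rng = np.random.default_rng(0)
reg, loc = r2l_block(rng.random((4, 16)), rng.random((4, 49, 16)))
print(reg.shape, loc.shape)  # (4, 16) (4, 49, 16)
```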
Contrastive learning [42,43] learns representations by contrasting positive pairs (similar instances) against negative pairs (dissimilar instances). Many works demonstrate that contrastive learning can alleviate the problems of imbalanced class distributions and numerous categories [42,44]; accordingly, Rock-ViT incorporates a supervised multi-class contrastive loss as an auxiliary loss. For a set of $N$ randomly sampled sample–label pairs $\{x_k, y_k\}_{k=1,\dots,N}$, the corresponding training batch consists of $2N$ pairs $\{\hat{x}_l, \hat{y}_l\}_{l=1,\dots,2N}$, where $\hat{x}_{2k}$ and $\hat{x}_{2k-1}$ are two random augmentations (also known as "views") of $x_k$, $k = 1,\dots,N$, and $\hat{y}_{2k-1} = \hat{y}_{2k} = y_k$. Let $i \in I \equiv \{1,\dots,2N\}$ be the index of an arbitrary augmented sample, and let $A(i) \equiv I \setminus \{i\}$. The index $i$ is called the anchor, index $j(i)$ is called the positive, and the other $2(N-1)$ indices, $\{k \in A(i) \setminus \{j(i)\}\}$, are called negatives; thus, for each anchor $i$, there is one positive pair and $2N-2$ negative pairs. The contrastive loss is then defined in Equation (5):
$$\mathcal{L}_{contrastive} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\hat{z}_i \cdot \hat{z}_p / \tau)}{\sum_{a \in A(i)} \exp(\hat{z}_i \cdot \hat{z}_a / \tau)},$$
where $P(i) \equiv \{p \in A(i) : \hat{y}_p = \hat{y}_i\}$ is the set of indices of all positives in the multi-viewed batch distinct from $i$, and $\tau \in \mathbb{R}^{+}$ is a scalar temperature parameter. The aim of this auxiliary objective is to increase the model's predicted score for the correct class while keeping the scores for other classes relatively low, thereby enhancing the differentiation between classes.
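Equation (5) can be implemented directly. The sketch below assumes L2-normalized embeddings and a loop over anchors for readability; it is illustrative rather than the authors' code, and the batch size, embedding dimension, and temperature are arbitrary:

```python
import numpy as np

def sup_con_loss(z: np.ndarray, y: np.ndarray, tau: float = 0.1) -> float:
    """Supervised contrastive loss, Equation (5).

    z: (2N, d) embeddings of the multi-viewed batch; y: (2N,) labels.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize views
    sim = z @ z.T / tau                               # pairwise similarities / tau
    n = z.shape[0]
    loss = 0.0
    for i in range(n):
        a = [k for k in range(n) if k != i]           # A(i): everything but the anchor
        p = [k for k in a if y[k] == y[i]]            # P(i): same-label positives
        if not p:
            continue
        denom = np.sum(np.exp(sim[i, a]))
        # -1/|P(i)| * sum_p log( exp(z_i.z_p/tau) / sum_a exp(z_i.z_a/tau) )
        loss += -np.mean([np.log(np.exp(sim[i, k]) / denom) for k in p])
    return loss / n

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                 # batch of 2N = 8 views
y = np.array([0, 0, 1, 1, 2, 2, 3, 3])       # paired views share labels
print(sup_con_loss(z, y) > 0)  # True
```

Each per-anchor term is the negative log of a softmax probability, so the loss is strictly positive and shrinks as same-class embeddings move together and different-class embeddings move apart.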
Finally, the overall objective training function is defined in Equation (6):
$$\mathcal{L} = \lambda \mathcal{L}_{ce} + (1 - \lambda)\, \mathcal{L}_{contrastive},$$
where $\lambda$ is a hyperparameter that balances the cross-entropy loss and the contrastive-learning loss.

4. Experiment and Discussion

In comparison with conventional domains such as human recognition or vehicle identification, lithology identification from rock microscopic images embodies inherent complexity and dynamic characteristics unique to geosciences. The distinctive lithological characteristics result from the influences of the minerals composing them, their modes of formation, and the geological processes from their formation periods, leading to diverse mineral crystallization and optical information in rock thin sections.
Addressing these complexities, our Rock-ViT model employs the advanced Region Vision Transformer (RegionViT) architecture, offering a detailed and comprehensive analysis of rock thin section images under a microscope. This model particularly excels in distinguishing rock categories with highly similar appearances by integrating both local detail features and global comprehensive information from rock thin section images. Additionally, the incorporation of the supervised contrastive loss technique significantly enhances the model’s capability to differentiate and recognize various categories within complex datasets of rock microscopic thin section images.
Despite the significant contributions of traditional models such as Festra, Mask R-CNN, and FC-CRF to geological analysis, these models exhibit certain limitations when applied to the nuanced field of lithological analysis. Festra and other transfer learning methods are limited by the similarity between source and target domains, while Mask R-CNN’s performance heavily depends on extensive annotated datasets, which are labor-intensive to compile. Furthermore, FC-CRF, although effective in refining segmentation accuracy, demands considerable computational resources, especially for high-resolution images.
In contrast, the Rock-ViT model, leveraging the Vision Transformer (ViT) architecture, overcomes these challenges by treating images as sequences of patches. This approach enables a nuanced understanding of complex geological features, capturing long-range dependencies and subtle variations within the rock images. The model’s architecture is specifically tailored to the unique demands of lithological classification, setting a new benchmark in accuracy and efficiency for lithological classification tasks. Unlike traditional models, the Rock-ViT model reduces dependency on extensive annotated datasets and optimizes computational efficiency, making it more practical for high-resolution lithological analyses.

4.1. Experimental Analysis

To ensure the validity and reliability of our comparative study, we employed a uniform experimental setup for evaluating the Rock-ViT model against established deep learning architectures. The experiments ran on a high-performance computing system with 256 GB of memory and an NVIDIA Tesla V100 GPU (5120 CUDA cores). Programming tasks, covering both classification and preprocessing of the original image datasets, were implemented in Python 3.8. We allocated 80% of the data for training and the remaining 20% for testing; the training data were further partitioned 80/20 into actual training and validation subsets. We tested various epoch counts to identify the optimal number for preventing overfitting and underfitting, ensuring the model's generalization capabilities, and Rock-ViT's parameters were fine-tuned during these experiments to determine the most effective configuration for refined lithology identification tasks.
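The nested 80/20 split described above can be expressed as a short sketch (the actual partitioning code is not published here; the sample count of 7731 comes from the augmented dataset described in Section 3.1):

```python
import numpy as np

def split_indices(n: int, test_frac: float = 0.2, val_frac: float = 0.2, seed: int = 0):
    """80/20 train/test split, then a further 80/20 train/validation split
    within the training portion, mirroring the protocol described above."""
    idx = np.random.default_rng(seed).permutation(n)
    n_test = int(n * test_frac)
    test, rest = idx[:n_test], idx[n_test:]
    n_val = int(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train, val, test = split_indices(7731)
print(len(train), len(val), len(test))  # 4948 1237 1546
```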
In this study, the deliberate use of both unenhanced and enhanced datasets aims to provide a comprehensive evaluation of model performance in lithological identification. The unenhanced data establish a baseline, revealing each model’s raw image processing capabilities without additional manipulations, crucial for understanding their core effectiveness. Meanwhile, the enhanced datasets mirror the variability encountered in real-world geological analysis, testing the models in practical, variable conditions. This dual approach not only demonstrates the impact of data enhancement on accuracy and model robustness but also verifies the models’ generalizability and their preparedness for diverse, real-world applications. Furthermore, the enhanced data serve to reduce overfitting risks, ensuring sustained model performance on novel data. Therefore, our methodology facilitates a thorough assessment, underscoring the models’ operational reliability for authentic geological tasks.
Initially, our experiments were conducted on both unenhanced and enhanced datasets, assessing the performance trajectory of the foundational ViT model against our Rock-ViT model in terms of accuracy and loss at various stages of training. As depicted in Figure 8 and Figure 9, Rock-ViT demonstrates a consistent improvement in accuracy over time, despite initially lagging slightly behind ViT. This trend suggests that Rock-ViT, starting with marginally lower accuracy, incrementally masters the distinctive features of sandstone microscopic image samples. Figure 8 further shows that the training accuracy of Rock-ViT, particularly on the enhanced dataset, overtakes that of ViT after a certain number of training steps; the gap widens with further training, indicating the model’s adaptability and learning efficacy on augmented data with complex features. Figures 6 and 7 illustrate that Rock-ViT exhibits a steeper decline in loss, notably on the enhanced dataset, reaching a lower loss more rapidly than ViT. This denotes Rock-ViT’s superior optimization and error-minimization capabilities, which are essential for intricate tasks such as lithology identification.
To summarize, the Rock-ViT model has demonstrated high accuracy and precision in lithological identification tasks. Its effectiveness in processing enhanced datasets confirms its suitability for geological applications.

4.2. Experimental Results

We evaluate the overall performance of Rock-ViT against various models using the metrics Accuracy, Precision, Recall, and F1 score. Accuracy refers to the proportion of true positive and true negative predictions to the total number of observations, as shown in the following Equation (7):
Accuracy = (TP + TN) / (TP + TN + FP + FN).
Precision refers to the ratio of correctly identified positive observations to the total predicted positives, illustrated in Equation (8), where TP represents true positives and FP represents false positives.
Precision = TP / (TP + FP).
Recall indicates the ratio of correctly identified positive observations within the actual positive class, as depicted in Equation (9), with FN denoting false negatives.
Recall = TP / (TP + FN).
The F1 score is defined based on Precision and Recall, as demonstrated in the following Equation (10):
F1 Score = 2 · Precision · Recall / (Precision + Recall).
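As a quick sanity check, Equations (7)–(10) can be computed directly from the four confusion counts. The helper below is an illustrative sketch, not the authors' implementation (which aggregates these metrics across six classes).

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Accuracy, Precision, Recall and F1 from confusion counts,
    following Equations (7)-(10)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=80, tn=90, fp=10, fn=20)
# acc = 0.85, prec ~= 0.889, rec = 0.80, f1 ~= 0.842
```

For the multi-class setting of Table 1, these per-class values would be averaged (e.g. macro-averaged) over the lithology classes.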
The performance of various deep learning models, including our proposed Rock-ViT, is quantitatively assessed in Table 1. This table presents a comparative analysis of model accuracy across unenhanced and enhanced datasets using well-established metrics.
In the baseline experiments, we use the Adam optimizer with a learning rate of 0.01, a weight decay of 1 × 10⁻⁵, and an exponential learning rate decay strategy with a gamma of 0.9. In Rock-ViT, we use an SGD optimizer with a learning rate of 0.01, a weight decay of 1 × 10⁻⁵, a momentum of 0.95, and an exponential learning rate decay strategy with a gamma of 0.99. Both are trained until convergence.
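The exponential decay schedule referred to above follows lr_t = lr₀ · γᵗ; a minimal pure-Python sketch of the rule (in a PyTorch pipeline, `torch.optim.lr_scheduler.ExponentialLR` with the same `gamma` would apply it per epoch):

```python
def exponential_lr(base_lr, gamma, epoch):
    """Exponentially decayed learning rate: lr_t = base_lr * gamma**epoch."""
    return base_lr * gamma ** epoch

# Rock-ViT settings from the paper: base lr = 0.01, gamma = 0.99
lrs = [exponential_lr(0.01, 0.99, t) for t in range(3)]
# approximately [0.01, 0.0099, 0.009801]
```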
In the unenhanced dataset, Rock-ViT achieved an accuracy of 67.18%, exceeding ViT’s 65.21% by 1.97 percentage points, with an F1 score of 66.44%; ViT retained a slight edge in precision and recall (Table 1). This indicates Rock-ViT’s ability to parallel the foundational model’s performance even without data enhancement.
The effectiveness of Rock-ViT is more pronounced in the enhanced dataset, where it achieved an accuracy of 91.75%, surpassing ViT by 1.84%. The model also led in precision by 0.97% and in recall by 1.89%, with an F1 score of 91.69%, 1.72% higher than ViT. These results highlight Rock-ViT’s ability to handle complex, nuanced sandstone microscopic image data, demonstrating its suitability for advanced lithological analysis.

4.3. Discussion

Expanding on the quantitative model performance metrics previously discussed, we now direct our examination toward the confusion matrices depicted in Figure 10 and Figure 11. These matrices provide a nuanced analysis of the Rock-ViT model’s capacity for fine-grained classification of sandstone lithologies. They detail the model’s classification accuracy across a spectrum of sandstone types, from Feldspathic Quartz Sandstone to Quartz Sandstone, as well as other variants, elucidating the model’s efficacy and its discernment challenges among closely similar lithological classes.
In Figure 10, we observe that, when processing unenhanced data, the Rock-ViT model is prone to misclassifying Feldspathic Quartz Sandstone (A) as Quartz Lithic Sandstone (B) and to confusing it with Quartz Sandstone (E). This pattern suggests that, without the aid of data enhancement, the model struggles to capture the subtle distinctions between these lithologies. As illustrated in Figure 12, both Feldspathic Quartz Sandstone (A) and Quartz Sandstone (E) are composed primarily of quartz; however, their accessory mineral contents and textures differ, and these differences are not always distinct in thin section images. The model’s difficulty in distinguishing Feldspathic Quartz Sandstone from Quartz Sandstone could stem from the varying degrees of feldspar content, which may lack significant contrast in the image data, particularly when the images are unenhanced or the feldspar visually resembles quartz. Moreover, the presence of lithic fragments and the overall granularity and sorting within the rock can further complicate the classification.
Quartz Lithic Sandstone (B) and Quartz Sandstone (E), while both quartz-dominant, differ in that Quartz Lithic Sandstone contains lithic fragments; nevertheless, the two can appear very similar in thin section images. If these lithic fragments are not clearly visible, or are indistinguishable from quartz grains due to image quality, the model’s feature extraction capabilities become essential. If the feature extraction layers are not fine-tuned to emphasize the textural and compositional differences presented by these fragments, the model may fail to capture the defining feature of Quartz Lithic Sandstone (its lithic fragments). Data imbalance may also contribute to this effect, and inaccuracies in white balance or uneven lighting in the original data can lead to similar feature confusion.
Furthermore, our analysis of Lithic Sandstone (C) reveals its propensity for misclassification as Quartz Lithic Sandstone (B), Feldspathic Litharenite (D), and Quartz Sandstone (E). Each of these lithologies has unique features but also shares overlapping characteristics, complicating their classification. Lithic Sandstone is defined by a substantial content of rock fragments or lithics, while Quartz Lithic Sandstone also contains lithics, but with a higher proportion of quartz. Feldspathic Litharenite encompasses feldspar, lithics, as well as quartz. Considering these similarities, the model’s ability to discriminate among them depends on subtle differences in texture and compositional attributes, which may not be evident in thin section images. High-quality, high-resolution images are imperative for capturing the fine details necessary to differentiate these lithologies. If the resolution is insufficient to display the textures and compositions of grains and lithics clearly, or if the images lack contrast between these components, the model may face difficulties in accurately classifying lithologies.
Further complicating the classification of Feldspathic Litharenite (D) relative to Quartz Lithic Sandstone (B) and Lithic Sandstone (C) is the fact that Feldspathic Litharenite is fundamentally a sandstone with an abundance of feldspar (more so than in Quartz Lithic Sandstone) and lithic fragments. Lithic Sandstone (C) is distinguished by a higher proportion of lithic fragments compared to the feldspar content seen in Feldspathic Litharenite, yet the presence of feldspar in both could induce confusion. Quartz Lithic Sandstone (B), although predominantly quartz, also contains lithic fragments, albeit in a lesser proportion than Lithic Sandstone, and has lower feldspar content than Feldspathic Litharenite.
These classification challenges are amplified by texture similarities across the lithologies in question. The size, shape, and arrangement of grains within the rock matrix can exhibit a high degree of resemblance, particularly when viewed through thin section images. The Rock-ViT model heavily relies on these textural cues to differentiate between lithological classes, which highlights a critical dependency on subtle, often granular, image details that may not be sufficiently captured without high-resolution imaging. This reliance on texture is a double-edged sword; while it can enhance the model’s ability to classify based on fine details, it also means that without clear, high-contrast images, the model’s performance can be significantly hindered. The observed confusion is not solely a consequence of the model’s limitations but also reflects the long-tail distribution of data, with image quality and resolution playing a significant role in the accurate identification of lithologies. By addressing these factors, we can further refine the model’s performance, ensuring that it more reliably discerns between lithologies with closely overlapping textural characteristics.
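One practical way to surface exactly these confusions is to rank the off-diagonal entries of the confusion matrix by count. The sketch below uses hypothetical counts for the six classes (A–F); the matrix values are illustrative only, not the paper's actual results.

```python
def most_confused_pairs(matrix, labels, top_k=3):
    """Return the top-k off-diagonal (true, predicted, count) entries,
    i.e. the class pairs the model confuses most often."""
    pairs = [
        (labels[i], labels[j], matrix[i][j])
        for i in range(len(matrix))
        for j in range(len(matrix))
        if i != j and matrix[i][j] > 0
    ]
    return sorted(pairs, key=lambda p: -p[2])[:top_k]

labels = ["A", "B", "C", "D", "E", "F"]
# Hypothetical confusion counts, rows = true class, columns = predicted class
cm = [
    [50, 12, 0, 0, 8, 0],
    [3, 60, 5, 0, 9, 0],
    [0, 7, 40, 6, 4, 0],
    [0, 5, 6, 45, 0, 0],
    [4, 10, 0, 0, 70, 0],
    [0, 0, 0, 0, 0, 30],
]
top = most_confused_pairs(cm, labels)
# -> [('A', 'B', 12), ('E', 'B', 10), ('B', 'E', 9)]
```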
Upon the application of data enhancement techniques, the Rock-ViT model exhibits a substantial improvement in classification accuracy, particularly noticeable in the differentiation of Feldspathic Litharenite and Quartz Sandstone. This enhancement is captured in Figure 11, reflecting the model’s enhanced ability to discern intricate lithological features with greater precision.
The success of data augmentation in boosting Rock-ViT’s performance can be attributed to several factors. The image enhancement techniques, including flipping, rotation, scaling, and the injection of noise, have likely contributed to the model’s improved robustness. These techniques introduce a level of variability and complexity that mimics the conditions encountered when observing rock thin sections under a microscope, thus better preparing the model to handle real-world data.
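Two of the geometric augmentations mentioned, flipping and rotation, reduce to simple index manipulations; a minimal sketch on a toy 2-D image (real pipelines would operate on RGB arrays, but the logic is the same):

```python
def hflip(img):
    """Horizontal flip of a 2-D image stored as a list of rows."""
    return [row[::-1] for row in img]

def rotate90(img):
    """Rotate a 2-D image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

img = [[1, 2],
       [3, 4]]
assert hflip(img) == [[2, 1], [4, 3]]
assert rotate90(img) == [[3, 1], [4, 2]]
```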
Additionally, the implementation of contrast adjustment and the addition of simulated lighting conditions are particularly notable. By amplifying the visual disparities between dark and light regions within the images, these enhancements aid in accentuating textural and compositional contrasts that are crucial for the model’s feature extraction processes. The introduction of varied lighting conditions, meanwhile, challenges the model to maintain classification accuracy despite changes in illumination that affect the perception of mineral components.
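Contrast adjustment of the kind described can be modeled as scaling each pixel's deviation from the image mean; the sketch below is a simplified illustration of this idea, not the exact enhancement pipeline used in the paper.

```python
def adjust_contrast(pixels, factor, mean=None):
    """Scale pixel deviations from the mean by `factor` (> 1 increases
    contrast), clamping results to the 0-255 range."""
    if mean is None:
        mean = sum(pixels) / len(pixels)
    return [min(255, max(0, round(mean + factor * (p - mean)))) for p in pixels]

# factor 2.0 pushes dark pixels darker and light pixels lighter
out = adjust_contrast([100, 120, 140, 160], 2.0)
# mean = 130 -> [70, 110, 150, 190]
```

This is exactly the effect described above: amplifying the visual disparity between dark and light regions around the image mean.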
The combination of these enhancement strategies results in a dataset that closely represents the variability and challenges inherent in geological analysis, thereby enabling the Rock-ViT to deliver a refined performance. With the augmented dataset, the model not only demonstrates increased classification accuracy but also showcases its potential in handling complex and nuanced sandstone microscopic image data, confirming its suitability for advanced lithological analysis. It is important to acknowledge, however, that while these data augmentation techniques significantly enrich the dataset’s diversity, they may not fully replicate the range of conditions that could be achieved through direct manipulation of thin sections, including adjustments in polarizing conditions or angles. Nonetheless, this limitation is not insurmountable for the model. By incorporating additional training data in future iterations, we can further mitigate these issues. This approach represents an ongoing effort, with future work aimed at expanding the training dataset to more directly address these challenges.
In our cross-comparison of different models, we compiled data on the efficiency and performance of each model. This included an analysis of the number of parameters for each model against both unenhanced and enhanced datasets, leading to some insightful observations. These insights are crucial for understanding the computational efficiency and resource requirements of models for fine lithological analysis, which directly impact their applicability in tasks of fine lithological identification.
Table 2 shows that AlexNet, ViT, and Rock-ViT carry relatively large parameter counts, whereas GoogLeNet and ResNet are far more compact. Notably, parameter count alone does not dictate runtime: despite their size, ViT and Rock-ViT achieve the shortest per-epoch training times, and Rock-ViT’s inference time is comparable to that of the fastest baselines. In this comparison, Rock-ViT stands out for balancing parameter count against computational efficiency, yielding reduced training and inference times. This indicates that Rock-ViT requires less computation time for both training and inference, a significant advantage in practical applications.
Our research harnesses these sophisticated models, with a particular focus on the Rock-ViT model, which is grounded in the innovative Region Vision Transformer (RegionViT) architecture. This model stands out for its proficiency in analyzing complex structures and subtle patterns in rock thin section images, significantly enhancing the precision in discriminating between similar lithologies.
The development of models like Rock-ViT indicates a significant progression from conventional AI techniques to cutting-edge models based on Transformer architecture. These improvements are laying the foundation for their application in intelligent reservoir analysis, core analysis, and other geoscientific scenarios, where fine lithology identification is vital.
Specifically, in the context of oil and gas reservoirs, refined lithology identification through these models aids in intelligent analysis, allowing for a better understanding of reservoir characteristics and the optimization of extraction strategies. In core analysis, these models facilitate a detailed examination of core samples, providing insights into the mineralogical composition and textural features of the subsurface materials.
By moving away from labor-intensive, manual identification processes to automated, intelligent systems, we are not only enhancing the efficiency of geological analyses but also improving the accuracy and reliability of the interpretations and decisions based on these analyses.
As we continue to develop and refine these models, we expect their applicability and significance to grow, further enriching the field of geological sciences with precise, data-driven analysis and interpretations.

5. Conclusions

To investigate the refined lithology identification of sandstone microscopic images, this study proposes the Rock-ViT model, built on the RegionViT architecture. The results show that Rock-ViT outperforms traditional methods in identifying lithologies in sandstone microscopic images. Its high Accuracy, Precision, Recall, and F1 Score, especially in complex image scenarios, highlight its potential in advanced geological tasks. The conclusions of this study are as follows:
(1)
The Rock-ViT model, founded on the RegionViT architecture, exhibits remarkable capabilities in processing the complex structures and intricate features of rock thin section images, particularly excelling in differentiating rock categories with closely similar appearances. It integrates local detailed features and global information from sandstone microscopic images, enhancing Precision and reinforcing the recognition of subtle differences between various lithologies;
(2)
Rock-ViT demonstrates superior adaptability and learning efficiency when handling enhanced datasets with complex features, surpassing the ViT series models and traditional approaches in optimization and error minimization. This capability is crucial for the intricate task of lithology identification;
(3)
In both unenhanced and enhanced datasets, Rock-ViT shows strong performance. In unenhanced datasets, it exceeds the foundational ViT model in Accuracy. In enhanced datasets, Rock-ViT’s Accuracy, Precision, and Recall are all significantly improved, demonstrating its ability to process complex sandstone microscopic images and its suitability for more complicated lithological analysis scenarios.

Author Contributions

C.W.: planning, methodology, analysis, data collection, initial draft writing and revision. P.L.: methodology and experiments. Q.L.: methodology, draft writing, and experiments. H.C.: data collection, data augmentation, and experiments. P.W.: planning, supervision, review, and editing. Z.M.: funding procurement. X.W.: funding procurement, supervision, review, and editing. Y.Z.: funding procurement and supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Key Research Program of Frontier Sciences, CAS (grant number: ZDBS-LY-DQC016).

Data Availability Statement

The datasets used in this study were sourced from several collections available on ScienceDB. These datasets have been specifically processed and adapted to meet the objectives of our research. Researchers interested in obtaining the processed datasets for academic or research purposes can contact us for more information and to request data access. Inquiries regarding data sharing should be directed to Chengrui Wang at [email protected].

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Dataset class distribution. Others: Composite Siltstone; Calcareous Siltstone; Hydrothermal Metamorphosed Sandstone; Metamorphosed Siltstone; Clayey Sandstone; Hydrothermal Metamorphosed Quartz Sandstone.
Figure 2. Examples of typical sandstone micrographs in the dataset: (a) Quartz Lithic Sandstone; (b) Feldspathic Quartz Sandstone; (c) Feldspathic Litharenite; (d) Lithic Sandstone; (e) Quartz Sandstone; (f) Others (Composite Siltstone, Hydrothermal Metamorphosed Quartz Sandstone, etc.)
Figure 3. Sample images of Quartz Lithic Sandstone. (a) Rotation; (b) flipping; (c) blurring; (d) contrast enhancement; (e) irregular noise; (f) microscope dark-field noise.
Figure 4. Regional-to-local attention for Vision Transformers. RegionViT combines a pyramid structure with an efficient regional-to-local (R2L) attention mechanism to reduce computation and memory usage. Our approach divides the input image into two groups of tokens: regional tokens of large patch size (red) and local ones of small patch size (black). The two types of tokens communicate efficiently through R2L attention, which jointly attends to the local tokens in the same region and the associated regional token. In the end, we average all regional tokens and use them for the classification.
Figure 5. The framework of our Rock-ViT. All regional tokens are first passed through Regional Self-Attention (RSA) to exchange the information among regions, and then Local Self-Attention (LSA) performs parallel self-attention, wherein each takes one regional token and corresponding local tokens. Contrastive learning is added to alleviate the problems of numerous categories and imbalanced class distributions.
Figure 6. Training loss visualization during the training process.
Figure 7. Validation loss visualization during the training process.
Figure 8. Training accuracy visualization during the training process.
Figure 9. Evaluation accuracy visualization during the training process.
Figure 10. Confusion matrix for unenhanced data classification with Rock-ViT. (A) Feldspathic Quartz Sandstone; (B) Quartz Lithic Sandstone; (C) Lithic Sandstone; (D) Feldspathic Litharenite; (E) Quartz Sandstone; (F) Others.
Figure 11. Confusion matrix for enhanced data classification with Rock-ViT. (A) Feldspathic Quartz Sandstone; (B) Quartz Lithic Sandstone; (C) Lithic Sandstone; (D) Feldspathic Litharenite; (E) Quartz Sandstone; (F) Others.
Figure 12. Microscopic images of the original sandstone (a–f). (a) Feldspathic Quartz Sandstone; (b) Quartz Lithic Sandstone; (c) Lithic Sandstone; (d) Feldspathic Litharenite; (e) Quartz Sandstone; (f) Others.
Table 1. Overall performance comparison of refined lithology identification methods. The best results are highlighted in bold.
| Data Type  | Model    | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) |
|------------|----------|--------------|---------------|------------|--------------|
| Unenhanced | AlexNet  | 26.85        | 5.38          | 23.19      | 8.73         |
|            | GoogLeNet| 52.08        | 55.64         | 54.70      | 55.17        |
|            | ResNet   | 65.42        | 63.72         | 63.68      | 63.70        |
|            | ViT      | 65.21        | **70.07**     | **68.49**  | **69.27**    |
|            | Rock-ViT | **67.18**    | 68.19         | 67.18      | 66.44        |
| Enhanced   | AlexNet  | 24.67        | 5.90          | 24.28      | 9.49         |
|            | GoogLeNet| 84.24        | 84.61         | 84.28      | 84.42        |
|            | ResNet   | 87.70        | 87.86         | 87.70      | 87.78        |
|            | ViT      | 89.91        | 90.81         | 89.86      | 89.97        |
|            | Rock-ViT | **91.75**    | **91.78**     | **91.75**  | **91.69**    |
Table 2. Comparative analysis of model parameters and epoch runtime efficiency (unit: seconds).
| Model     | Parameters | Unenhanced Train | Unenhanced Infer | Enhanced Train | Enhanced Infer |
|-----------|------------|------------------|------------------|----------------|----------------|
| AlexNet   | 58,461,612 | 26.83            | 0.59             | 62.69          | 0.38           |
| GoogLeNet | 5,644,684  | 26.69            | 0.60             | 62.73          | 0.34           |
| ResNet    | 4,421,820  | 27.27            | 0.92             | 64.41          | 0.96           |
| ViT       | 85,885,868 | 20.66            | 0.52             | 47.68          | 0.42           |
| Rock-ViT  | 71,837,092 | 20.83            | 0.54             | 50.97          | 0.43           |

Share and Cite

MDPI and ACS Style

Wang, C.; Li, P.; Long, Q.; Chen, H.; Wang, P.; Meng, Z.; Wang, X.; Zhou, Y. Deep Learning for Refined Lithology Identification of Sandstone Microscopic Images. Minerals 2024, 14, 275. https://doi.org/10.3390/min14030275
