5.1. Model Training
The Minima-YOLO architecture was trained on the lithium mineral dataset alongside YOLOv8n and YOLOv8s as baseline comparators under standardized hyperparameters to ensure experimental validity. The training configuration included 200 epochs (batch size = 8), SGD optimization (momentum = 0.937, initial learning rate = 0.01, and weight decay = 5 × 10⁻⁴), and early stopping (patience = 50 epochs). Mosaic augmentation was deactivated during the final 10 epochs to stabilize convergence.
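The settings above map directly onto Ultralytics-style training arguments. The following sketch collects them into one configuration; the dataset file name `minerals.yaml` is a placeholder, and launching training this way assumes the `ultralytics` package:

```python
# Hyperparameters from Section 5.1, expressed with Ultralytics-style
# training-argument names. "minerals.yaml" is a placeholder path for
# the lithium mineral dataset description file.
train_cfg = {
    "data": "minerals.yaml",  # placeholder dataset YAML
    "epochs": 200,            # total training epochs
    "batch": 8,               # batch size
    "optimizer": "SGD",
    "momentum": 0.937,
    "lr0": 0.01,              # initial learning rate
    "weight_decay": 5e-4,
    "patience": 50,           # early-stopping patience (epochs)
    "close_mosaic": 10,       # disable mosaic for the final 10 epochs
}

# With ultralytics installed, training would be launched roughly as:
#   from ultralytics import YOLO
#   YOLO("minima-yolo.yaml").train(**train_cfg)
```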
Figure 14 details the training dynamics, demonstrating progressive loss reduction and metric stabilization across all models.
As the curves show, YOLOv8n and YOLOv8s exhibit significant accuracy fluctuations in the early training stages and fail to improve steadily, suggesting that their structures struggle to adapt quickly to the dataset and leading to instability during training. Both models also show accuracy drops in the mid-training stage, indicating poor adaptability and slower convergence, so they require more epochs to achieve better results. In contrast, Minima-YOLO performs well in the early training stages, with steadily increasing accuracy and minimal fluctuations, indicating its ability to quickly extract useful features and adapt to the training data. In the mid-training stage, Minima-YOLO demonstrates more stable and sustained performance improvements compared with the large fluctuations of the baseline models. The same trends are observed in the classification loss curve. Notably, Minima-YOLO converges at around 150 epochs, ending training early as the first of the three models to meet the early-stopping condition. This demonstrates that the improved model structure is more efficient, achieving better performance in less time and substantially improving training efficiency. Overall, Minima-YOLO performs significantly better during training than YOLOv8n and YOLOv8s.
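The early-stopping behaviour described above (patience = 50 epochs, with Minima-YOLO halting around epoch 150) follows the usual best-fitness-tracking rule. A minimal, framework-independent sketch of that rule:

```python
def should_stop(fitness_history, patience=50):
    """Return True once the best fitness has not improved for
    `patience` consecutive epochs (one entry per epoch)."""
    if not fitness_history:
        return False
    # Index of the first epoch attaining the best fitness so far.
    best_epoch = max(range(len(fitness_history)),
                     key=fitness_history.__getitem__)
    return len(fitness_history) - 1 - best_epoch >= patience

# Illustrative run: fitness rises for 100 epochs, then plateaus;
# the stop condition fires after 50 stagnant epochs.
history = [i / 100 for i in range(100)] + [0.99] * 50
print(should_stop(history))   # → True
```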
5.3. Ablation Experiments
To systematically validate the efficacy of individual architectural modifications, YOLOv8-tiny was employed as the baseline model, with proposed enhancements incrementally integrated through a phased implementation protocol. Adhering to the principle of controlled variables, identical hyperparameter settings were applied, and multiple ablation experiments were designed based on mathematical combinations, including the following improvements:
A: Replacing the C2f module in the backbone with the Faster-EMA module;
B: Replacing the downsampling operation in the backbone with GhostConv;
C: Improving the Neck component with the Slim-Neck structure.
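The ablation grid implied by these three modifications can be enumerated mechanically as every non-empty subset of {A, B, C} applied on top of the YOLOv8-tiny baseline; a small sketch using the standard library:

```python
from itertools import combinations

# Labels follow the list above.
mods = {"A": "Faster-EMA", "B": "GhostConv", "C": "Slim-Neck"}

# Every non-empty subset of {A, B, C} defines one ablation variant.
variants = [
    "+".join(combo)
    for r in range(1, len(mods) + 1)
    for combo in combinations(sorted(mods), r)
]
print(variants)
# → ['A', 'B', 'C', 'A+B', 'A+C', 'B+C', 'A+B+C']
```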
The experimental results are summarized in Table 4. The observed high accuracy metrics are primarily attributable to the standardized microscopic imaging protocol and the rigorous preprocessing steps (segmentation and cropping), which effectively eliminated background interference and minimized false negatives. This consistency, however, does not preclude comparative model evaluation or architectural optimization based on these performance benchmarks.
Specifically, compared to YOLOv8n, YOLOv8-tiny reduced parameters, model size, and FLOPs by 34%, 33%, and 31%, respectively, while improving FPS by 14%. This demonstrates that limiting and reducing backbone channels can effectively simplify the model structure. Furthermore, applying improvements A, B, and C individually to YOLOv8-tiny yielded varying degrees of lightweight optimization. Improvement A, utilizing the Faster-EMA module with PConv and EMA, further lightened the model while slightly improving P, R, and mAP50; however, the deeper network layers introduced by the Faster-EMA Bottleneck led to a decrease in FPS. Combining Faster-EMA with GhostConv (A + B) improved downsampling efficiency through cheap operations, mitigating the FPS drop and further reducing model size and complexity. Combinations with Slim-Neck (A + C and B + C) produced significant lightweight effects, particularly in reducing parameters and model size.
After incorporating all of the enhancements, the Minima-YOLO model achieved exceptional lightweight performance while maintaining high accuracy. Compared to YOLOv8n, parameters and model size were reduced by 76% and 73%, FLOPs decreased by 72%, and FPS slightly increased by 5%. The ablation experiments validated the effectiveness of these improvements in model lightweighting.
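The reduction figures quoted throughout this section follow the usual relative formula, (baseline − value) / baseline. A small helper makes the arithmetic explicit; the absolute counts in the example are hypothetical, chosen only to exercise the formula:

```python
def percent_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in %."""
    return 100.0 * (baseline - value) / baseline

# Hypothetical example: a baseline of 3.0 M parameters trimmed to
# 0.72 M corresponds to the 76% reduction reported for Minima-YOLO
# relative to YOLOv8n (the absolute counts here are illustrative).
print(round(percent_reduction(3.0, 0.72)))   # → 76
```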
5.4. Comparison Experiments
To rigorously validate Minima-YOLO’s performance superiority, comparative analyses were conducted against other lightweight YOLO models and subsequent iterations (v9 [38], v10 [39], v11 [40]). To ensure fairness, all models were trained without pre-trained weights and with their respective default hyperparameter settings, without any optimization or adjustment. The detailed comparison results are shown in Table 5 and Figure 15.
The comparison results can be grouped into three categories. First, compared to earlier models, Minima-YOLO demonstrated superior performance in both accuracy and complexity, particularly excelling in parameters, model size, and FLOPs. Compared to YOLOv3-tiny, YOLOv5n, YOLOv5s, YOLOv6 [41], and YOLOv7-tiny [42], Minima-YOLO reduced parameters by 94%, 59%, 90%, 83%, and 88%, respectively, and model size by 93%, 56%, 88%, 80%, and 86%, respectively. Its FPS was only 17% lower than that of the fastest model, YOLOv5n. Second, compared to contemporary lightweight models such as YOLOv8s and YOLOv8-FasterNet [43], Minima-YOLO maintained its lead in parameters and model size while also improving speed: it reduced FLOPs by 92% and 57%, respectively, and its FPS was 21% higher than that of YOLOv8s. Third, although the YOLOv10 and YOLOv11 series offer more efficient structures and slightly outperformed Minima-YOLO in FPS, Minima-YOLO still excelled in the other lightweight metrics; its parameter count is about one-fourth and its model size about one-third those of YOLOv10n and YOLOv11n. Notably, compared to YOLOv11s, Minima-YOLO sacrificed only 0.1% mAP50 while reducing parameters, model size, and FLOPs by 92%, 91%, and 89%, respectively, highlighting its advantages in lightweight design.
Furthermore, comparative analyses were conducted between Minima-YOLO and established object detection architectures, including Faster R-CNN [44], RetinaNet [45], SSD [46], and EfficientDet [47]. The dataset annotations were standardized to comply with each architecture’s specifications, and default input resolutions were employed during training. The results are presented in Table 6.
RetinaNet achieved only 31 FPS, indicating poor real-time performance and making it unsuitable for scenarios with demanding real-time identification requirements. A similar trend was observed for Faster R-CNN and SSD, which not only lagged behind Minima-YOLO in accuracy but also face deployment challenges due to their large model sizes. EfficientDet achieved a balance between accuracy and lightweight design but still underperformed Minima-YOLO across all metrics.
Figure 16 visually illustrates the differences in FLOPs, parameters, and model sizes across these networks.
5.5. Visualization Experiments
Following comprehensive ablation studies and comparative analyses, performance improvements were quantitatively validated. Subsequent evaluations focused on practical applicability through visual comparative assessments. Six representative mineral samples (#1–6) were selected from the validation set, encompassing diverse morphological characteristics across categories (shape, scale, and edge definition).
Benchmark models (YOLOv8n, YOLOv8s, and YOLOv11s) and optimized variants (YOLOv8-tiny and Minima-YOLO) were applied for detection analysis. The comparative visualization results (Figure 17) demonstrate detection consistency across models, with mineral classifications color-coded as follows: red (feldspar), orange-red (quartz), and orange (lepidolite). The confidence scores (range 0–1) quantify detection certainty, where higher values indicate greater probabilistic assurance that the target mineral is present within the localized region.
The results show that all models correctly identified the specific categories of the mineral blocks but differed in confidence scores. Block #1 is a feldspar captured under strong lighting; its reflective surface may resemble quartz, resulting in lower confidence scores for YOLOv8n and YOLOv8-tiny, at 0.59 and 0.65, respectively. In contrast, YOLOv8s, YOLOv11s, and Minima-YOLO achieved confidence scores above 0.9, demonstrating superior discriminative ability. Block #2 is a feldspar captured under low lighting, whose opaque features were well learned; all models showed consistent confidence scores around 0.8. Block #3 is a stacked quartz with thick edges and a glass-like luster; all models identified it with high confidence, with YOLOv11s scoring highest at 0.97, followed by YOLOv8s and Minima-YOLO at 0.96. Block #4 is a quartz located at the edge of the image; despite being cropped at the top, its distinct features allowed all models to achieve high confidence, with YOLOv11s scoring highest and YOLOv8n lowest. Block #5 is a single lepidolite block, appearing transparent against a black background; Minima-YOLO achieved confidence scores comparable to YOLOv8s, both higher than YOLOv8n. Block #6 is a stacked lepidolite with cropped edges; Minima-YOLO again showed a high confidence level comparable to YOLOv8s and higher than YOLOv8n.
Overall, Minima-YOLO performed strongly in microscopic mineral identification, maintaining accuracy competitive with YOLOv8n and YOLOv8s. The optimizations achieve an effective balance between computational efficiency and detection fidelity, establishing a robust solution for resource-constrained mineralogical analysis.
Block #1, which displayed pronounced confidence variance, was selected from the visualization results to investigate the efficacy of hierarchical feature abstraction. Grad-CAM visualization [48] was employed to extract fourth-layer backbone feature activations from both architectures. The comparative visualization revealed different attention allocation patterns between YOLOv8n and Minima-YOLO, with the latter demonstrating an enhanced focus on texturally discriminative mineralogical signatures. The results are shown in Figure 18.
It can be observed that the extracted features vary significantly across feature maps, with some focusing on edges and others emphasizing the center. Overall, Minima-YOLO more clearly highlighted the actual texture and edge details of Block #1, which is also reflected in its higher confidence scores.
This difference arises from the distinct ways the two models process input images. YOLOv8n uses traditional convolution for downsampling and the C2f module for feature extraction, whereas Minima-YOLO employs two rounds of GhostConv operations and Faster-EMA modules. More specifically, compared to the CBS module’s traditional convolution, which generates redundant feature maps, GhostConv achieves downsampling more efficiently and at lower cost through cheap operations, preserving the lightweight design. Additionally, the Faster-EMA Bottleneck structure within the Faster-EMA module lets Minima-YOLO use PConv to reduce computational cost while introducing the EMA attention mechanism, whose redesigned multi-branch parallel structure effectively captures cross-channel and spatial interactions. The module can therefore extract both local details and global features, achieving a comprehensive understanding of the input data. These synergistic architectural optimizations collectively enable Minima-YOLO to achieve feature discriminability comparable to YOLOv8n while capturing finer textural detail for specific mineral particulates, thereby explaining the observed confidence score differentials in localized detection tasks.
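The "cheap operation" argument can be made concrete by counting multiply–accumulate operations (MACs). The sketch below compares a standard convolution against a simplified GhostConv (half the outputs from a primary convolution, half from a depthwise cheap operation) and a PConv (a convolution applied to only a fraction of the channels); the layer dimensions, kernel sizes, and channel ratio are illustrative assumptions, not values taken from the paper:

```python
def conv_macs(h, w, c_in, c_out, k):
    """MACs of a standard k×k convolution on an h×w feature map."""
    return h * w * c_out * c_in * k * k

def ghostconv_macs(h, w, c_in, c_out, k=3, cheap_k=5):
    """Simplified GhostConv: a primary conv yields c_out/2 maps, and a
    depthwise 'cheap operation' generates the remaining c_out/2."""
    primary = h * w * (c_out // 2) * c_in * k * k
    cheap = h * w * (c_out // 2) * cheap_k * cheap_k  # depthwise
    return primary + cheap

def pconv_macs(h, w, c, k=3, ratio=0.25):
    """PConv: a k×k conv applied to only `ratio` of the c channels."""
    cp = int(c * ratio)
    return h * w * cp * cp * k * k

# Illustrative 40×40 feature map with 128 channels in and out.
std = conv_macs(40, 40, 128, 128, 3)
ghost = ghostconv_macs(40, 40, 128, 128)
part = pconv_macs(40, 40, 128)
print(ghost / std, part / std)   # both well below 1.0
```

Under these assumptions GhostConv costs roughly half of the standard convolution and PConv a small fraction of it, which is the mechanism behind the lightweighting reported in the ablation study.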