Article

Performance Evaluation of YOLOv8, YOLOv9, YOLOv10, and YOLOv11 for Stamp Detection in Scanned Documents

PAVIC Laboratory, University of Acre (UFAC), Rio Branco 69915-900, Brazil
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(6), 3154; https://doi.org/10.3390/app15063154
Submission received: 7 February 2025 / Revised: 2 March 2025 / Accepted: 7 March 2025 / Published: 14 March 2025
(This article belongs to the Special Issue Deep Learning for Object Detection)

Abstract

Stamps are an essential mechanism for authenticating documents in various sectors and institutions. Given the high volume of documents and the increase in forgery, it is necessary to adopt automated methods to identify stamps on documents. In this context, techniques based on deep learning stand out as an efficient solution for automating this process. To this end, this article presents a performance evaluation of YOLOv8s, YOLOv9s, YOLOv10s, and YOLOv11s in detecting stamps on scanned documents. To train, validate, and test the models, an adapted dataset with 732 images, built from the combination of the StaVer and DDI-100 datasets, is used. The performance of the models is evaluated by means of quantitative and qualitative analyses and by analyzing the computational cost. The results show that, in terms of detection performance, the YOLOv9s model obtained the best result, with a mAP (Mean Average Precision) of 98.7% and precision and recall of 97.6%. In terms of computational cost and inference time, the YOLOv11s model stands out. This comparative approach contributes to the state of the art toward implementation in automatic stamp authentication devices.

1. Introduction

Stamps are important mechanisms for securing documents in various sectors [1]. These tools are essential for verifying the authenticity of documents in banks, companies, financial institutions, hospitals, and other organizations. Additionally, stamps are essential for book registration in libraries, serving as institutional identifiers that enable item tracking in cases of theft or loss [2]. Given the delays in bureaucratic processes caused by the large volume of processed documents and the increasing occurrence of document forgery [3], it is essential to develop automated methods for identifying stamps and verifying their authenticity. The efficient detection of these seals in documents and books enhances institutional security, preserves documentary heritage, combats fraud, and optimizes cataloging and information retrieval.
In this context, machine learning-based techniques emerge as a robust solution by leveraging object detection models to automate stamp identification in documents [4]. Object detection models identify and categorize the objects present in any image and label them with bounding boxes to indicate their presence and confidence levels [5]. These models learn from a dataset composed of images and the corresponding annotations, where the annotations define bounding boxes that specify the object’s location and assigned class. These annotations, along with the target detections of the models, are also referred to as ground truths.
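To make the annotation format concrete, the following minimal sketch decodes one YOLO-style label line into a pixel-space bounding box; the five-field normalized format follows the common YOLO convention, and the file contents and image size in the example are hypothetical:

```python
def parse_yolo_label(line: str, img_w: int, img_h: int):
    """Decode one YOLO-format annotation line into a class id and a pixel-space box."""
    class_id, xc, yc, w, h = line.split()
    # Fields are normalized to [0, 1]; scale them back to pixel coordinates
    xc, yc = float(xc) * img_w, float(yc) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    return int(class_id), (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)

# Class 0 ("stamp"), box centered at 45%/30% of a hypothetical 1000 x 1400 page
print(parse_yolo_label("0 0.45 0.30 0.20 0.10", img_w=1000, img_h=1400))
# -> (0, (350.0, 350.0, 550.0, 490.0))
```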
Object detection models are classified into Two-Stage and One-Stage Detectors [6]. The first type, for example, Faster R-CNN [7], generates region proposals before performing detection and classification, while the second type, such as YOLO [8] (You Only Look Once), performs these processes in a single step, providing greater speed. Thus, for applications requiring real-time detection, One-Stage Detectors become the more efficient option. YOLO models outperform not only Two-Stage Detectors like Faster R-CNN, but also other One-Stage models such as SSD [9] (Single Shot MultiBox Detector) and EfficientDet [10] in various applications [11,12,13,14,15]. In comparison, YOLO is significantly faster, enabling real-time applications with minimal loss of accuracy. Therefore, YOLO models stand out as a powerful option for object detection.
The YOLO architecture was first introduced by Joseph Redmon in 2016, and it was designed for real-time object detection in images and videos. Since then, several versions of YOLO have been developed. YOLOv8 [16], released in January 2023 by Ultralytics founder Glenn Jocher, supports multiple computer vision tasks, including object detection, segmentation, pose estimation, tracking, and classification. In February 2024, YOLOv9 [17] was introduced, incorporating the GELAN (Generalized Efficient Layer Aggregation Network) architecture and the PGI (Programmable Gradient Information) concept. Just three months later, YOLOv10 [18] was launched, with its primary innovation being the elimination of dependence on Non-Maximum Suppression (NMS), a technique heavily relied upon by its predecessors. Building on the advancements of previous YOLO versions, YOLOv11 was announced in October 2024, introducing significant architectural improvements that enhanced speed and efficiency without compromising accuracy [19].
In this context, this paper presents a performance evaluation of four state-of-the-art YOLO models—YOLOv8, YOLOv9, YOLOv10, and YOLOv11—in their small versions for stamp detection in scanned documents. The models are trained using the StaVer (Stamp Verification) [20] and DDI-100 (Distorted Document Images) [21] datasets. To evaluate the performance of deep learning-based models, a quantitative analysis is conducted using precision, recall, mAP, and computational cost metrics, such as model complexity and speed. Additionally, a qualitative evaluation is performed based on the bounding boxes predicted by the models and their confidence levels. Furthermore, the ability of the best-performing model to detect stamps in the presence of image distortions is analyzed.
In summary, the main contributions of this research are as follows:
  • Based on the combination and annotation of the StaVer and DDI-100 datasets, an adapted dataset was created.
  • Comparative evaluation of the YOLOv8s, YOLOv9s, YOLOv10s, and YOLOv11s models for stamp detection in scanned documents, with the YOLOv9 to YOLOv11 models applied to this task for the first time.
  • Specification of the deep learning-based YOLO model with the best performance and/or lowest computational cost and robustness for stamp detection, enabling implementation in document authentication tools.
This paper is organized as follows: Section 2 presents a literature review; Section 3 describes the materials and methods, covering the dataset, the architectures of the object detection models, and the evaluation metrics; Section 4 describes and discusses the experiments and results; and Section 5 presents the conclusions and directions for future research.

2. Literature Review

Currently, open datasets with annotated stamps are quite limited [22], which makes it difficult to develop and train detection models. However, the literature presents some approaches based on machine learning and deep learning that can serve as a starting point for future research.
In the study by [23], the SFDL (spectral filtering-based deep learning algorithm) was proposed to detect logos and seals in scanned document images. The SFDL was used to separate the graphical regions of the image in the StaVer dataset. Candidate regions were classified using a deep convolutional neural network, achieving a precision of 94.7% and a recall of 85.8% for 300 dpi images. In [24], the authors employed a fully convolutional neural network to segment images into two classes, stamp and background. The model architecture is based on the pre-trained VGG-16, evaluated on the StaVer dataset, where it achieved a precision of 87% and a recall of 84%. The authors report that this method has presented difficulties in segmenting black stamps.
In the approach proposed by [25], the authors present a compact neural network model called YOLO-Stamp, based on the YOLO network architecture for detection, with 128,000 parameters. To perform stamp detection, the proposed model was tested on the scanned pseudo-official dataset (SPODS) and achieved a precision of 90.6% and a recall of 90.7%. On the other hand, the research by [26] conceptualizes a stamp extraction network framework, based on a GAN (generative adversarial network), to extract texture features from images containing stamps. Accordingly, an optimized stamp text recognition method, PP-OCR, was proposed for text identification in stamps of various formats. Similarly, in the approach by [27], a verification detector was used, employing a CNN (Convolutional Neural Network) for the automatic segmentation of document elements. The results were compared with a Haar+Adaboost detector, highlighting the performance of the proposed models.
In [4], a three-step approach for stamp verification was proposed using the StaVer dataset. The first step segmented candidate object areas; the second used an SVM to classify each area as stamp or non-stamp, achieving an accuracy of 90%; and the third verified whether the stamp was genuine. In the same direction, in the approach by [22], different versions of the YOLOv8 model were considered, including YOLOv8n, YOLOv8s, and YOLOv8m, with the goal of implementing an efficient and accurate stamp detector. To evaluate the implemented models, precision, recall, and mAP50 metrics were used, with the best result achieved by YOLOv8n at a mAP of 98.63%.
Thus, considering the rapid evolution of object detection methodologies, the most recent versions of the YOLO family (YOLOv9, YOLOv10, and YOLOv11), released in 2024, present significant advancements in accuracy, computational efficiency, and generalization. However, their application in stamp detection in documents has not yet been widely explored, representing a gap in the literature. Given their potential, these architectures offer relevant contributions compared to previous approaches and are being applied in this research.

3. Materials and Methods

3.1. Dataset

The study presents an adapted dataset, created by combining and annotating the StaVer [20] and DDI-100 [21] datasets, as shown in Figure 1. The processes applied to the datasets are described in detail in Section 4.2. StaVer contains 400 images of scanned documents with the respective ground truths for the stamp locations. The documents are automatically generated invoice files that were printed, stamped, and scanned, containing stamps, text, logos, and tables, as exemplified in Figure 1a. The ground truths are binarized images with masks marking the location of the stamps in the document. DDI-100 is a synthetic dataset based on 7000 pages of unique real documents and consists of 100,000 images augmented by the application of distortions. The documents are reports and book pages containing textual elements, figures, and graphs, and they include 99 different types of stamps, as exemplified in Figure 1b. In addition to the mask files, the dataset provides annotations with the stamp locations in pickle format.
The two integrated datasets include different scenarios and a wide variety of stamps in black, blue, green, red, and yellow colors, covering various shapes and styles, such as circular, triangular, rectangular, textual, and stamp designs, as illustrated in Figure 2. Additionally, the stamps appear at different angles and may be partially or fully overlapped with text or figures, as well as faded or blurred.

3.2. YOLOv8

The YOLOv8 [28] model consists of the backbone, neck, and head components, as shown in Figure 3, and it is an anchor-free model, meaning it does not contain predefined bounding boxes. The process starts with image preprocessing at the input layer, followed by feature extraction through the adapted CSPDarknet53 backbone layer, which contains sequences of convolutional layers that extract relevant features from the input image at various resolution levels. Additionally, the C2f module (cross-stage partial bottleneck with two convolutions) combines contextual information and high-level features to improve detection.
Next, the extracted features are passed to the neck layer, where they are concatenated directly without enforcing exact feature dimensions, combining resources from varied scales. Finally, the prediction result is obtained through the head layer, which is decoupled and independently processes object detection, classification, and regression tasks.

3.3. YOLOv9

The YOLOv9 [17] model combines the GELAN architecture with a proposed concept called PGI (Programmable Gradient Information). GELAN is developed by combining CSPNet [29] (Cross-Stage Partial Network) and ELAN (Efficient Layer Aggregation Network) [30], both designed with gradient path planning principles. It integrates the partial cross-stage connections of CSPNet and the efficient layer aggregation of ELAN for effective gradient propagation and feature aggregation. Figure 4 shows the GELAN architecture: after the input image passes through the backbone layer and features are extracted at different levels of abstraction, the feature map is split into two parts. One part undergoes a stack of computational blocks to obtain high-level semantic information, while the other passes directly through the entire stage and is then integrated with the part processed by the computational blocks.
PGI was proposed for addressing information loss during spatial transformations and feature extraction, a phenomenon known as the information bottleneck. A key advantage of PGI is its applicability to lightweight models, such as those developed with GELAN, effectively solving the deep supervision constraint that was previously limited to complex models.

3.4. YOLOv10

The YOLOv10 [18] model introduces new approaches to object detection, addressing post-processing and architectural limitations present in previous versions. Its core concept is Consistent Dual Assignment during training, which integrates one-to-many assignment and one-to-one matching, as illustrated in Figure 5. Traditionally, earlier YOLO versions relied on one-to-many assignment, generating abundant supervision signals but requiring NMS (Non-Maximum Suppression) post-processing to remove redundant predictions, leading to inefficiencies and increased inference latency. YOLOv10 incorporates one-to-one matching, where each ground truth instance is assigned a single prediction, eliminating the need for NMS. Additionally, the CIB (Compact Inverted Block) structure is introduced to address redundancies observed in previous YOLO architectures, where the same basic building block was used across all stages.
Its architecture consists of three main components: a backbone for feature extraction; a neck that aggregates features at different scales, using PAN (Path Aggregation Network) [31] layers, and forwards them to the head; and the head itself. The head implements Dual Label Assignments, comprising a one-to-many assignment head and a one-to-one matching head, both aligned using the proposed consistent matching metric.

3.5. YOLOv11

The YOLOv11 [32] architecture enhances the foundation established primarily by YOLOv8, introducing architectural innovations and parameter optimizations. Its architecture is illustrated in Figure 6.
In the backbone and neck, the C2f block is replaced by the C3k2 (Cross-Stage Partial with kernel size 2) block, which provides a more efficient implementation of CSPnet [29]. The C3k2 block consists of two smaller convolutions instead of a single large-scale convolution. Additionally, the C2PSA (Cross-Stage Partial with Spatial Attention) block is introduced after the SPPF (Spatial Pyramid Pooling Fast) block, enhancing spatial attention in feature maps and helping the model focus on the most relevant regions of the image for detection.

3.6. Metrics and Validation

The metrics described in this section are based on [33,34].

3.6.1. Intersection over Union (IoU)

IoU is a metric that measures the overlap between two bounding boxes, the predicted box and the ground-truth box, according to Equation (1).
$$ J(B_p, B_{gt}) = \mathrm{IoU} = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})} \tag{1} $$
For detection, the computed IoU value is compared with a predefined threshold: values above the threshold are counted as correct predictions, and values below it as incorrect predictions.
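As an illustrative sketch (not code from the paper), Equation (1) can be computed for boxes given as (x_min, y_min, x_max, y_max) corners:

```python
def iou(box_p, box_gt):
    """Intersection over Union for two boxes in (x_min, y_min, x_max, y_max) format."""
    # Corners of the intersection rectangle
    x1, y1 = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
    x2, y2 = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0

# IoU = 400 / 2800 = 0.143, below a 0.5 threshold, so this would count as incorrect
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))
```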

3.6.2. Precision and Recall

Precision measures the model's ability to identify only relevant objects; it is the ratio of correct predictions to all detections, as described by Equation (2). Recall measures the ability to find all objects to be detected (ground truths); mathematically, it is the ratio of correct predictions to all ground truths, according to Equation (3). Here, TP (True Positives) are predictions classified as correct, FP (False Positives) are predictions classified as incorrect, and FN (False Negatives) are ground truths that were not detected.
$$ P = \frac{TP}{TP + FP} = \frac{TP}{\text{all detections}} \tag{2} $$
$$ R = \frac{TP}{TP + FN} = \frac{TP}{\text{all ground truths}} \tag{3} $$
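A minimal sketch of Equations (2) and (3) computed from raw counts (the example numbers are hypothetical):

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / all detections; Recall = TP / all ground truths."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 98 correctly detected stamps, 2 spurious boxes, 2 missed stamps
p, r = precision_recall(tp=98, fp=2, fn=2)
print(f"P = {p:.3f}, R = {r:.3f}")  # P = 0.980, R = 0.980
```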

3.6.3. Average Precision and Mean Average Precision

From the precision-recall curve, the most common metric for evaluating detection models is extracted: the Average Precision (AP), which is the area under the curve of the maximum precision values interpolated at all recall levels, defined mathematically in Equations (4) and (5).
$$ AP = \sum_{n} (R_{n+1} - R_n)\, P_{\mathrm{interp}}(R_{n+1}), \tag{4} $$
where
$$ P_{\mathrm{interp}}(R_{n+1}) = \max_{\tilde{R}:\, \tilde{R} \ge R_{n+1}} P(\tilde{R}). \tag{5} $$
Mean Average Precision (mAP) measures the average precision of an object detector across all classes in a given dataset: it is the AP averaged over all classes.
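A sketch of the all-point interpolation of Equations (4) and (5), assuming a list of recall/precision pairs measured while sweeping the confidence threshold:

```python
import numpy as np

def average_precision(recalls, precisions):
    """All-point interpolated AP: area under the precision envelope across recall."""
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], np.asarray(recalls, dtype=float)[order], [1.0]))
    p = np.concatenate(([0.0], np.asarray(precisions, dtype=float)[order], [0.0]))
    # Equation (5): replace each precision with the maximum at any recall >= R
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Equation (4): sum rectangle areas at the points where recall increases
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Hypothetical curve: perfect precision up to recall 0.5, then a gradual drop
print(average_precision([0.25, 0.5, 0.75, 1.0], [1.0, 1.0, 0.8, 0.6]))  # 0.85
```

With a single class, as in the adapted stamp dataset used here, the mAP equals the AP of the stamp class.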

3.6.4. Confidence Score

The confidence score expresses how certain the model is that a predicted bounding box contains the detected object, with values ranging from 0 to 1.

4. Experimental Results

4.1. Hardware and Software

For the development of the experiments, the PyTorch 2.3.0 deep learning framework and the YOLO framework were used, with hardware comprising an NVIDIA GeForce RTX 3060 4 GB GPU (Nvidia Corporation, Santa Clara, CA, USA), an Intel Core i7-12700H 2.30 GHz CPU (Intel Corporation, Santa Clara, CA, USA), and 64 GB of RAM.

4.2. Preprocessing

To extract annotations for all 400 images in the StaVer dataset, the stamp contours were located in the corresponding mask files. Using the image dimensions (width and height) and the coordinates of the located contours (x_min, y_min, x_max, y_max), annotations were automatically generated in text format for each bounding box. A blur filter was applied to the masks to avoid multiple bounding boxes being generated for the same stamp, as illustrated in Figure 7.
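A minimal sketch of this mask-to-annotation step, assuming binary masks with bright stamp pixels; the blur kernel size and threshold below are illustrative, since the paper does not report them:

```python
import cv2

def mask_to_yolo_annotation(mask_path: str, out_path: str, class_id: int = 0) -> None:
    """Derive YOLO-format stamp annotations from a stamp-location mask image."""
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    h, w = mask.shape
    # Blurring merges fragmented mask regions so that one stamp yields one box
    blurred = cv2.GaussianBlur(mask, (51, 51), 0)
    _, binary = cv2.threshold(blurred, 10, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    with open(out_path, "w") as f:
        for c in contours:
            x, y, bw, bh = cv2.boundingRect(c)  # (x_min, y_min, width, height)
            # One line per stamp: class x_center y_center width height, normalized
            f.write(f"{class_id} {(x + bw / 2) / w:.6f} {(y + bh / 2) / h:.6f} "
                    f"{bw / w:.6f} {bh / h:.6f}\n")
```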
Within the StaVer dataset, 61 images without stamps and 27 additional images of scanned documents without any corresponding ground truth were identified. Using the Roboflow tool (https://roboflow.com), accessed on 9 September 2024, the 27 images were annotated and labeled, as illustrated in Figure 8. Thus, after removing the images without stamps and incorporating the newly annotated images, the StaVer dataset totaled 366 images.
The DDI-100 dataset contains a large amount of data with low stamp variety. Therefore, 366 images were selected, ensuring the inclusion of all the different stamp types present in the set and matching the size of the StaVer dataset. The selected images and their corresponding annotations were then imported into the Roboflow tool and integrated into the adapted dataset (https://universe.roboflow.com/marcos-7aslt/stampdet/dataset/10), accessed on 9 September 2024. Table 1 shows the number of annotations in the adapted dataset, corresponding to the number of stamps and their various types.
The images were divided into training, validation, and test sets following a 7:1.5:1.5 ratio, as described in Table 2.

4.3. Comparison of YOLO Models

For the analysis of the stamp detection process, the models YOLOv8s, YOLOv9s, YOLOv10s, and YOLOv11s were used. These architectures have different numbers of parameters, layers, sizes, and FLOPS, as shown in Table 3. A gradual reduction in these characteristics is observed with the progression of versions, except for the number of layers in YOLOv9s. In this case, the significant increase in layers, compared to the other versions, directly impacts the computational cost and model complexity.
The hyperparameters used for training the detection models include an input image size of 640 × 640, an initial learning rate of 0.01, a weight decay of 0.0005, a momentum parameter of 0.937, and a batch size set to 16. Patience was set to 50 for early stopping in case there was no improvement in metrics during the last 50 epochs of training, with the number of epochs limited to 500. The best weights were saved for evaluation and inference on the test set.
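Expressed with the Ultralytics training API, this configuration corresponds to the following sketch (the data.yaml path describing the adapted dataset is an assumption):

```python
from ultralytics import YOLO

# One of the four evaluated models; the same settings apply to the other versions
model = YOLO("yolov9s.pt")
model.train(
    data="data.yaml",      # assumed dataset descriptor: train/val/test paths, class "stamp"
    imgsz=640,             # input image size 640 x 640
    epochs=500,            # upper limit on training epochs
    patience=50,           # early stopping after 50 epochs without improvement
    batch=16,
    lr0=0.01,              # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
)
```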
The training performance is shown in Table 4. YOLOv9s required almost six times more training time than YOLOv10s and roughly twelve times more than YOLOv8s and YOLOv11s.

4.3.1. Quantitative Comparisons

Table 5 presents the quantitative results of the detection performance of the models, measured by the precision, recall, and mAP metrics. For comparison purposes, it also displays the inference time of the trained models, reflecting the average time per image in the test set.
Among the detection results, the YOLOv9s model demonstrated excellent performance, with precision and recall of 97.6% and a mAP of 98.7%. This indicates that, owing to its more robust architecture and greater number of layers, the model achieves a remarkable balance between precision and recall, effectively detecting objects while minimizing false detections. The YOLOv8s, YOLOv10s, and YOLOv11s models also showed positive results, although YOLOv11s achieved the lowest detection performance of the four; on the other hand, it stands out with the lowest average inference time. Despite its superior performance, the YOLOv9s model exhibits an average prediction time roughly twice that of the other models, likely due to its more robust architecture and higher FLOPS count. Meanwhile, the YOLOv8s model strikes an effective balance between high performance (97.3% mAP) and low inference time (13.9 ms).
To analyze the detection performance and error patterns, confusion matrices were generated for each model, as shown in Figure 9. These matrices display the correct detection rates (values along the main diagonal) and the types of errors (values off the diagonal). According to the confusion matrix, the YOLOv9s model demonstrated robust performance, achieving a high true positive rate for stamp detection (98%), surpassing the other models. The YOLOv8s model achieved a true positive rate of 95%, followed by YOLOv10s (93%) and YOLOv11s (92%). These values are inversely proportional to the number of false negatives, with the models showing the least occurrence of this error listed in order. Regarding false positives, the YOLOv9s model exhibited the lowest occurrence (2%), while the YOLOv8s and YOLOv11s models showed the highest (4%).

4.3.2. Qualitative Comparisons

For the qualitative analysis, three samples presenting complex detections from the test set of the adapted dataset were considered, as illustrated in Figure 10. The first sample, presented in the first row, contains partially occluded stamps that overlap with other components of the document. The second sample, shown in the second row, includes black stamps that may be confused with textual elements. In the third sample, shown in the last row, the stamp is partially occluded and mixed with other document components. To visualize the models' detections, a confidence threshold of 0.4 was used. The number above each bounding box is the confidence score of the stamp detection, ranging from 0 to 1; for a given stamp, the model producing the highest confidence score can be regarded as the best at detecting that type of stamp. Each column illustrates the detection results of YOLOv8s, YOLOv9s, YOLOv10s, and YOLOv11s, respectively.
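For reference, a sketch of producing such predictions with the Ultralytics API at the 0.4 threshold (the weights path and image name are hypothetical):

```python
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # best weights saved during training
results = model.predict("document.png", conf=0.4)  # 0.4 confidence threshold, as in Figure 10

for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # predicted box corners in pixels
    print(f"stamp at ({x1:.0f}, {y1:.0f})-({x2:.0f}, {y2:.0f}), "
          f"confidence {float(box.conf):.2f}")
```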
In the first sample, the YOLOv8s, YOLOv9s, and YOLOv10s models were able to detect all the stamps present in the document. However, the YOLOv11s model failed to detect the overlapped stamp in the image. Considering detection reliability, YOLOv9s demonstrates better generalization capability when analyzing the detection of all three stamps. In the second sample, it is observed that the detection performed by the YOLOv8s model resulted in a false positive, incorrectly detecting a stamp. In contrast, the YOLOv9s model reliably detected the stamp present in the document. However, the YOLOv10s and YOLOv11s models failed to detect the stamp. Finally, when analyzing the results for the third sample, the only model that successfully detected the stamp was YOLOv9s.

4.4. Robustness Analysis

To simulate distortions that may be introduced during the image scanning process, disturbance scenarios were created by preprocessing the images. These disturbances comprise low image resolution, Gaussian noise, lighting variation with both low and high exposure, and rotation, corresponding to the five scenarios shown in the columns of Figure 11. Specifically, the low-resolution image is produced by downscaling and compressing the image with 80% lossy compression, resulting in a size of 270 × 380. The additive Gaussian noise, with a mean of 0 and a standard deviation of 40, simulates moderate graininess in the image. The rotation scenario tilts the samples by an angle of 10°.
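A sketch of how these disturbances can be generated with OpenCV and NumPy; interpreting the 80% lossy compression as JPEG quality 20 and the exposure changes as fixed brightness offsets are assumptions:

```python
import cv2
import numpy as np

img = cv2.imread("sample.png")
h, w = img.shape[:2]

# Scenario 1: low resolution via downscaling plus lossy JPEG re-encoding
small = cv2.resize(img, (270, 380))
_, enc = cv2.imencode(".jpg", small, [cv2.IMWRITE_JPEG_QUALITY, 20])
low_res = cv2.imdecode(enc, cv2.IMREAD_COLOR)

# Scenario 2: additive Gaussian noise with mean 0 and standard deviation 40
noise = np.random.normal(0.0, 40.0, img.shape)
noisy = np.clip(img.astype(np.float64) + noise, 0, 255).astype(np.uint8)

# Scenarios 3 and 4: low and high exposure as simple brightness shifts (illustrative)
dark = cv2.convertScaleAbs(img, alpha=1.0, beta=-80)
bright = cv2.convertScaleAbs(img, alpha=1.0, beta=80)

# Scenario 5: rotation by 10 degrees around the image center
M = cv2.getRotationMatrix2D((w / 2, h / 2), 10, 1.0)
rotated = cv2.warpAffine(img, M, (w, h))
```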
In order to test the robustness of the YOLOv9s model, which achieved the best quantitative and qualitative performance results, two samples from the test set of the dataset were considered, referred to as sample 1 and sample 2. In Figure 12, the model’s predictions for both samples are shown.
Based on this, the disturbance scenarios, presented earlier, were introduced to these samples in order to compare the performance of the predictions, as shown in Figure 13. The rows represent the samples, and the columns illustrate the disturbance scenarios, along with the results of the YOLOv9s model predictions.
The results achieved for a low-resolution image in Scenario 1, shown in Figure 13a, illustrate that the model detected the circular stamp from sample 1 with a high confidence score, but it was unable to detect the rectangular stamp present in the image. For sample 2, the model also managed to perform detection but with a reduced confidence score.
In Scenario 2, as seen in Figure 13b, when Gaussian noise is present, the model was only able to locate the circular stamp from sample 1, while neither the second stamp nor the stamp from sample 2 was detected.
For the prediction on images with low luminosity variation, Scenario 3 presented in Figure 13c, the model detects all stamps, providing a high confidence score for the circular stamp from sample 1 and the stamp from sample 2, showing only a low confidence score for the rectangular stamp located in the darker region of the image.
In Scenario 4, shown in Figure 13d, a similar behavior to Scenario 3 is observed, with high confidence scores for the detected stamps and difficulty only for the rectangular stamp from sample 1, due to the loss of its characteristics caused by high light exposure.
Finally, Scenario 5, illustrated in Figure 13e, presents a rotation in both samples, where it is observed that the model was able to detect all the stamps present, maintaining a high confidence score. Thus, considering the results for the five disturbance scenarios, the greatest challenge for the evaluated model was observed when the stamp was fully overlapped by text and in Scenario 2, where the addition of noise made pattern recognition more difficult for stamp classification.
To complement the robustness analysis of the YOLOv9s model, a real-world test was conducted using three document samples from the StaVer dataset, which did not contain any stamps and were not part of the adapted dataset used for model training. These samples were printed, manually stamped, scanned, and then, predictions were made by the model, as shown in Figure 14.
The model was able to locate all the stamps present in Figure 14a,c. For Figure 14b, the bottom stamp overlapping the table was detected; however, the top stamp was not detected. It is observed that the confidence scores were slightly reduced due to the printing and scanning process of the documents. The results show that the model exhibited significant performance in the detections, indicating that a subsequent integration with preprocessing techniques may improve the obtained values.

5. Conclusions

This paper presents a performance evaluation for stamp detection in scanned documents using deep learning models. Experiments were conducted with the following four architectures from the YOLO family: YOLOv8s, YOLOv9s, YOLOv10s, and YOLOv11s. The dataset used in this research was adapted from the StaVer and DDI-100 datasets, and the robustness of the best-performing model was assessed under conditions of low resolution, noise, low and high lighting exposure, and rotation.
The quantitative evaluation was based on object detection metrics such as precision, recall, and mAP. The YOLOv9s model obtained the best results, standing out for its greater robustness compared to YOLOv8s and YOLOv10s, which had similar mAP values. In addition, YOLOv9s outperformed other approaches in the literature for stamp detection. However, when evaluating the complexity of the architecture and the inference time of the predictions, YOLOv11s performs better than the other models.
A qualitative analysis was performed with samples from the test set in order to compare the models' predictions and their confidence scores for each detection. Considering the detection of all stamps present in the documents, this analysis showed that the YOLOv9s model presented the best behavior, in agreement with the results obtained in the quantitative analysis. The robustness of the best trained model was then evaluated in the presence of disturbances across five scenarios. The predictions demonstrate the ability of YOLOv9s to perform the detection task even under disturbances, with a limitation only when noise makes it difficult to recognize and classify the stamps. Finally, the analysis of a real case reveals significant detection performance, suggesting that future integration with preprocessing techniques could further improve the results obtained.
Based on the analyses conducted, it is possible to determine the most suitable model for a given scenario, depending on the context required for object detection. In terms of both quantitative and qualitative performance, YOLOv9s achieved the best results. Regarding computational cost, YOLOv11s stands out due to its lightweight architecture and fast inference speed. Meanwhile, YOLOv8s offers a balance between high performance and low inference time. This comparative study contributes to the state of the art by identifying the most effective approach for stamp detection in digitized documents.
For future work, the aim is to increase the diversity of the dataset, apply integrated pre- and post-processing techniques to enhance predictions, conduct a detailed statistical analysis including confidence intervals, modify YOLO architectures for stamp identification, and outline the steps for implementing an automated stamp authenticator.

Author Contributions

Conceptualization, J.B., T.P. and A.B.A.; methodology, J.B.; software, J.B.; validation, J.B., T.P. and A.B.A.; formal analysis, J.B.; investigation, J.B., T.P. and A.B.A.; resources, A.B.A.; data curation, J.B.; writing—original draft preparation, J.B. and T.P.; writing—review and editing, J.B., T.P. and A.B.A.; visualization, J.B.; supervision, A.B.A.; project administration, T.P. and A.B.A.; funding acquisition, A.B.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the PAVIC Laboratory, University of Acre, Brazil.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors gratefully acknowledge support from the PAVIC Laboratory, which benefited from SUFRAMA fiscal incentives under Brazilian Law No. 8387/1991.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tan, Y.; Sim, H.; Lim, C.; Yang, C. A Preliminary Study on Stamp Impressions with the Same Placement and Orientation on Reproduced Documents—How Easily can it be Achieved by Deliberately Stamping at the Same Relative Position and Orientation? J. Am. Soc. Quest. Doc. Exam. 2020, 23, 33–40. [Google Scholar] [CrossRef]
  2. de Araújo, J.M.G. Carimbo, sim: O carimbo como um aliado da segurança em coleções especiais. PontodeAcesso 2022, 16, 566–581. [Google Scholar] [CrossRef]
  3. da Silva, E.B.; Costantin de Sá, D.C.; Martins Barbosa, S.A. Document Forgery in Brazil: General Panorama and Prospects of Combat. Rev. Ciênc. Juríd. Sociais-IURJ 2024, 5, 107–125. [Google Scholar] [CrossRef]
  4. Duy, H.L.; Nghia, H.M.; Vinh, B.T.; Hung, P.D. An Efficient Approach to Stamp Verification. In Proceedings of the Smart Trends in Computing and Communications, Singapore, 24–25 January 2023; pp. 781–789. [Google Scholar] [CrossRef]
  5. Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object Detection with Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef]
  6. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  8. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  9. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar] [CrossRef]
  10. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  11. Vilcapoma, P.; Parra Meléndez, D.; Fernández, A.; Vásconez, I.N.; Hillmann, N.C.; Gatica, G.; Vásconez, J.P. Comparison of Faster R-CNN, YOLO, and SSD for Third Molar Angle Detection in Dental Panoramic X-rays. Sensors 2024, 24, 6053. [Google Scholar] [CrossRef] [PubMed]
  12. Kim, J.-A.; Sung, J.-Y.; Park, S.-H. Comparison of Faster-RCNN, YOLO, and SSD for Real-Time Vehicle Type Recognition. In Proceedings of the 2020 IEEE International Conference on Consumer Electronics—Asia (ICCE-Asia), Seoul, Republic of Korea, 1–3 November 2020; pp. 1–4. [Google Scholar] [CrossRef]
  13. Li, M.; Zhang, Z.; Lei, L.; Wang, X.; Guo, X. Agricultural Greenhouses Detection in High-Resolution Satellite Images Based on Convolutional Neural Networks: Comparison of Faster R-CNN, YOLO v3 and SSD. Sensors 2020, 20, 4938. [Google Scholar] [CrossRef] [PubMed]
  14. Khin, P.P.; Htaik, N.M. Gun Detection: A Comparative Study of RetinaNet, EfficientDet and YOLOv8 on Custom Dataset. In Proceedings of the 2024 IEEE Conference on Computer Applications (ICCA), Yangon, Myanmar, 16 March 2024; pp. 1–7. [Google Scholar] [CrossRef]
  15. Munteanu, D.; Moina, D.; Zamfir, C.G.; Petrea, Ș.M.; Cristea, D.S.; Munteanu, N. Sea Mine Detection Framework Using YOLO, SSD and EfficientDet Deep Learning Models. Sensors 2022, 22, 9536. [Google Scholar] [CrossRef] [PubMed]
  16. Hussain, M. YOLOv5, YOLOv8 and YOLOv10: The Go-To Detectors for Real-time Vision. arXiv 2024, arXiv:2407.02988. [Google Scholar] [CrossRef]
  17. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  18. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar] [CrossRef]
  19. Alif, M.A.R. YOLOv11 for Vehicle Detection: Advancements, Performance, and Applications in Intelligent Transportation Systems. arXiv 2024, arXiv:2410.22898. [Google Scholar] [CrossRef]
  20. Micenková, B.; van Beusekom, J. Stamp Detection in Color Document Images. In Proceedings of the 2011 International Conference on Document Analysis and Recognition, Beijing, China, 18–21 September 2011; pp. 1125–1129. [Google Scholar] [CrossRef]
  21. Zharikov, I.; Nikitin, P.; Vasiliev, I.; Dokholyan, V. DDI-100: Dataset for Text Detection and Recognition. In Proceedings of the 2020 4th International Symposium on Computer Science and Intelligent Control, ACM, Newcastle upon Tyne, UK, 17–19 November 2020. [Google Scholar] [CrossRef]
  22. Prokudina, K.; Skriplyonok, M.; Vostrikov, A. Development of a Detector for Stamps on Images. In Proceedings of the 2024 International Conference on Industrial Engineering, Applications and Manufacturing (ICIEAM), Sochi, Russia, 12–16 May 2024; pp. 865–869. [Google Scholar]
  23. Nandedkar, A.V.; Mukherjee, J.; Sural, S. A spectral filtering based deep learning for detection of logo and stamp. In Proceedings of the 2015 Fifth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), Patna, India, 16–19 December 2015; pp. 1–4. [Google Scholar] [CrossRef]
  24. Younas, J.; Afzal, M.Z.; Malik, M.I.; Shafait, F.; Lukowicz, P.; Ahmed, S. D-StaR: A Generic Method for Stamp Segmentation from Document Images. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; pp. 248–253. [Google Scholar] [CrossRef]
  25. Gayer, A.; Ershova, D.; Arlazarov, V. Fast and Accurate Deep Learning Model for Stamps Detection for Embedded Devices. Pattern Recognit. Image Anal. 2022, 32, 772–779. [Google Scholar] [CrossRef]
  26. Jin, X.; Mu, Q.; Chen, X.; Liu, Q.; Xiao, C. Digital Archive Stamp Detection and Extraction. In International Symposium on Artificial Intelligence and Robotics; Springer: Singapore, 2024; pp. 165–174. [Google Scholar] [CrossRef]
  27. Forczmański, P.; Smolinski, A.; Nowosielski, A.; Małecki, K. Segmentation of Scanned Documents Using Deep-Learning Approach. In Progress in Computer Recognition Systems 11; Springer: Cham, Switzerland, 2020; pp. 141–152. [Google Scholar] [CrossRef]
  28. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  29. Wang, C.Y.; Mark Liao, H.Y.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580. [Google Scholar] [CrossRef]
  30. Wang, C.Y.; Liao, H.Y.; Yeh, I.H. Designing Network Design Strategies Through Gradient Path Analysis. arXiv 2022, arXiv:2211.04800. [Google Scholar] [CrossRef]
  31. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar] [CrossRef]
  32. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  33. Padilla, R.; Netto, S.L.; da Silva, E.A.B. A Survey on Performance Metrics for Object-Detection Algorithms. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niterói, Brazil, 1–3 July 2020; pp. 237–242. [Google Scholar] [CrossRef]
  34. Arya, S.; Kashyap, A. A novel method for real-time object-based copy-move tampering localization in videos using fine-tuned YOLO V8. Forensic Sci. Int. Digit. Investig. 2024, 48, 301663. [Google Scholar] [CrossRef]
Figure 1. Example of the dataset images: (a) Image of the StaVer dataset. (b) Image of the DDI-100 dataset.
Figure 2. Variety of stamps on documents.
Figure 3. YOLOv8 architecture.
Figure 4. GELAN architecture, a component of YOLOv9.
Figure 5. YOLOv10 architecture.
Figure 6. YOLOv11 architecture.
Figure 7. Example illustrating the application of the blur filter.
Figure 8. Scanned document with extracted annotation: (a) Image from the StaVer dataset. (b) Mask representing the location of the object in the image. (c) Annotated image.
Figure 9. Confusion matrix for YOLO models: (a) YOLOv8s. (b) YOLOv9s. (c) YOLOv10s. (d) YOLOv11s.
Figure 10. Comparison of stamp detection results: (a) Using YOLOv8s. (b) Using YOLOv9s. (c) Using YOLOv10s. (d) Using YOLOv11s. The confidence score for each stamp detection is depicted above the bounding boxes.
Figure 11. Pre-processed disturbances in the images: (a) Original. (b) With low resolution. (c) With Gaussian noise. (d) With shadow simulation. (e) With high light exposure simulation. (f) With rotation.
Figure 12. YOLOv9s detection in two samples chosen from the test dataset, before applying disturbance scenarios.
Figure 13. Predictions of the trained YOLOv9s model on two samples in various disturbance scenarios: (a) With low resolution. (b) With Gaussian noise. (c) With shadow simulation. (d) With high light exposure simulation. (e) With rotation.
Figure 14. Predictions of the YOLOv9s model in real tests: (a) Document containing two stamps. (b) Document containing two stamps. (c) Document containing one stamp.
Table 1. Distribution of the different types of stamps in the dataset.
Total | Circular | Triangular | Textual | Cartoon | Rectangular
1311  | 610      | 196        | 196     | 187     | 122
100%  | 46.5%    | 15.0%      | 15.0%   | 14.3%   | 9.3%
Table 2. Organization of data from the adapted dataset.
Total Images | Train | Validation | Test
732          | 512   | 110        | 110
100%         | 70%   | 15%        | 15%
Table 3. Characteristics of the YOLO models architectures.
Models   | #Param. (M) | Layers | FLOPS (G) | Size (MB)
YOLOv8s  | 11.14       | 225    | 28.6      | 22.0
YOLOv9s  | 9.76        | 1269   | 40.4      | 19.9
YOLOv10s | 8.07        | 402    | 24.8      | 16.2
YOLOv11s | 9.43        | 319    | 21.5      | 18.7
Table 4. YOLO models training performance.
Models   | Complete Epochs | Train Time (min) | Time per Epoch (min)
YOLOv8s  | 227             | 32.4             | 0.14
YOLOv9s  | 330             | 361.2            | 1.09
YOLOv10s | 327             | 64.8             | 0.20
YOLOv11s | 179             | 27.6             | 0.15
Table 5. Quantitative results of the models in stamp detection.
Models   | Precision (%) | Recall (%) | mAP (%) | Inference Time (ms)
YOLOv8s  | 95.8          | 95.6       | 97.3    | 13.9
YOLOv9s  | 97.6          | 97.6       | 98.7    | 33.1
YOLOv10s | 97.0          | 93.5       | 97.5    | 14.5
YOLOv11s | 95.8          | 91.7       | 96.5    | 13.8
Note: The values in bold indicate the best results among the models.
