Article

Optimizing Text Recognition in Mechanical Drawings: A Comprehensive Approach

Department of Management and Engineering, Linköping University, SE-581 83 Linköping, Sweden
*
Author to whom correspondence should be addressed.
Machines 2025, 13(3), 254; https://doi.org/10.3390/machines13030254
Submission received: 6 February 2025 / Revised: 14 March 2025 / Accepted: 18 March 2025 / Published: 20 March 2025

Abstract

The digitalization of engineering drawings is a pivotal step toward automating and improving the efficiency of product design and manufacturing systems (PDMSs). This study presents eDOCr2, a framework that combines traditional OCR and image processing to extract structured information from mechanical drawings. It segments drawings into key elements—such as information blocks, dimensions, and feature control frames—achieving a text recall of 93.75% and a character error rate (CER) below 1% in a benchmark with drawings from different sources. To improve semantic understanding and reasoning, eDOCr2 integrates Vision Language models (Qwen2-VL-7B and GPT-4o) after segmentation to verify, filter, or retrieve information. This integration enables PDMS applications such as automated design validation, quality control, or manufacturing assessment. The code is available on GitHub.

1. Introduction

Mechanical engineering drawings (EDs) are precise 2D representations of 3D parts or assemblies, providing essential information for accurate manufacturing. They specify critical details such as dimensions, tolerances, material properties, and surface finishes to ensure components meet exact standards. Beyond serving as the primary medium for design communication, EDs also function as legally binding documents, ensuring compliance and alignment among stakeholders.
The digitalization of mechanical EDs is a crucial step toward automating and integrating various stages of the product development lifecycle, from design to manufacturing and quality control [1,2]. Currently, critical information such as dimensions, tolerances, and materials is often manually extracted and recorded in separate systems [3]. Automating this process can enhance efficiency, reduce errors, and enable companies to allocate resources to higher-value tasks [4,5].
While some companies are adopting Production Manufacturing Information (PMI) features in CAD software to embed critical data directly into 3D models, traditional 2D drawings remain widely used [1]. Although 3D annotations provide valuable insights, 2D drawings are essential for effectively conveying design intent through section views, rotations, and zooms. This structured annotation in 2D helps filter out unnecessary details and highlight critical information [6].
Previous research has explored various approaches to digitalizing EDs, with optical character recognition (OCR) delivering the best results. However, due to the complexity of these documents, existing open-source solutions often fall short of industry expectations. An effective algorithm must not only extract text accurately but also interpret the content and filter relevant information.
Digitalizing EDs is more than just automating data extraction; it is a crucial step toward advancing next-generation Product Design and Manufacturing Systems (PDMSs). These systems can leverage reasoning-driven workflows powered by Large Language Models (LLMs) and Vision Language (VL) models, transforming how design and manufacturing information is processed and utilized.
Unlike traditional OCR, these models do more than recognize text; they analyze content, extract relevant information, and integrate it into broader business workflows. This capability, known as Intelligent Document Processing (IDP), is now offered as a service by major providers such as Microsoft and Amazon Web Services. Additionally, models like OpenAI’s GPT-4o, Mistral’s Pixtral, and local VL solutions such as Meta’s LLama3-VL and Alibaba’s Qwen2-VL can be adapted for IDP applications.
Effective IDP solutions rely on robust OCR engines to accurately extract text and derive meaningful insights from images. This work has two primary objectives: The first is to develop a more robust and efficient OCR tool than existing solutions. The proposed PDMS segments EDs into components and analyzes them using dedicated pipelines. The second is to evaluate the integration of state-of-the-art VL models inside such pipelines to perform tasks such as extracting dimensions or querying specific ED information. The study aims to compare and assess the performance of publicly available VL models in extracting critical data from EDs.

2. Background

Optical character recognition (OCR) refers to the capability of a computer vision system to read characters from an image, and it has been a central research focus in computer vision for decades. The development of the first convolutional neural network (CNN) by LeCun was specifically aimed at identifying handwritten characters [7]. With the rise of the digitalization era, significant efforts have been devoted to developing robust OCR frameworks. The common approach divides OCR into two primary tasks: text detection, which involves locating text within an image, and text recognition, which identifies the content within the detected text bounding boxes.
Prominent algorithms for text detection include EAST [8], Textboxes++ [9], and CRAFT [10]. Recently, the research in text detection has shifted toward tackling complex scene text detection, as exemplified by MixNet [11], which currently ranks first on the Total-Text dataset metrics [12]. However, these advanced detection methods are often more powerful than necessary for document processing applications, where text generally follows standard fonts and is primarily rotated within the document plane, as seen in EDs.
Text recognition is the second stage of the OCR pipeline, with widely used architectures including the convolutional recurrent neural network (CRNN) [13] and Long Short-Term Memory (LSTM) networks [14]. Popular OCR frameworks such as Tesseract [15], keras-ocr [16], and EasyOCR [17] integrate these methods to provide comprehensive OCR solutions.
OCR has become one of the most popular tools for extracting information from EDs across various fields. Das et al. [18] applied OCR to recognize both handwritten and printed text in architectural and construction EDs. Similarly, Mani et al. [19], Rahul et al. [20], and Kang et al. [21] incorporated OCR into their frameworks for digitizing textual elements in Piping and Instrumentation Diagrams (P&IDs) while relying on traditional image processing techniques to detect diagram symbols.
Other methods have also been explored for extracting specific types of information. For instance, Moreno et al. utilized hierarchical segmentation to analyze P&IDs [22], while Stegmaier et al. [23] developed a pipeline that constructs graphs by identifying symbols and their connections within fluid circuit diagrams. Broader insights into digitalization techniques for various ED types can be found in reviews such as Moreno et al. [24].
Depending on the type of information presented and the country of origin, a product in the mechanical engineering sector may be represented by one of several types of EDs. The two most common types are production drawings and assembly drawings. Production drawings, sometimes referred to as machine drawings, typically focus on a single part, providing the detailed information necessary for its manufacturing, including specifications for surface finish, tolerances, and material grades. In contrast, assembly drawings convey instructions for the assembly of multiple parts, presenting partial or comprehensive information about how these parts fit and work together in the final product [1].
The information block in production EDs encompasses critical data such as general tolerances, material specifications, weight, and metadata, along with additional details like surface finishing or welding requirements when applicable. While ISO 7200:2004 [25] provides a standard for layout, companies often customize it to suit their needs. Specific tolerances and surface finishing are typically indicated alongside dimensions in part views, adhering to established dimensioning standards like ISO 5459-1 [26] or ASME Y14.5 [27]. Geometric tolerances, addressing aspects such as form, orientation, location, and run-out—collectively referred to as Geometric Dimensioning and Tolerancing (GD&T)—are conveyed using Feature Control Frames (FCFs). A comprehensive understanding of these three core elements—the information block, dimensioning, and geometric tolerances—is essential to accurately interpret the content of a production ED.
In production EDs, the most relevant content is typically textual—considering GD&T symbols as specialized characters—and organized within tables or scattered across the ED as dimensions and annotations. Efforts to digitize production EDs date back to the 1990s. Early contributions include an object detection system by Das et al. [28] and Lu [29], which used image processing techniques to isolate text from other ED line elements. In 1998, Dori et al. [30] proposed a method combining heuristic-based text detection with a neural network recognizer, which is an approach that closely resembles many of today’s standard OCR techniques.
In recent years, Scheibel et al. [4] proposed using PDF scraping to directly extract textual elements from EDs, which are then clustered together. This method bypasses the need for an OCR recognizer, eliminating the uncertainties associated with character recognition, as the text is already available through scraping. However, the approach has shown challenges in effectively clustering the extracted text in their test EDs. Apart from this method, nearly all recent contributions leverage deep learning in some capacity.
Seliger et al. [2] focused on element detection in EDs using Faster-RCNN for object detection and the SAM segmentation tool from Meta for part segmentation. Haar et al. [5] combined OCR, specifically EasyOCR, with symbol detection using the YOLOv5 model, which is a single-shot object detector from the “You Only Look Once” family. Their framework identifies arrows, tolerance symbols, edges, and surface elements, with training conducted on synthetic data. Schlagenhauf et al. [3] also employed synthetic drawing data to fine-tune the keras-ocr detector and recognizer, achieving improved performance over keras-ocr baseline results.
Villena Toro et al. [1] leveraged the keras-ocr framework as well, fine-tuning the recognizer specifically to handle GD&T symbols and marking the first comprehensive digitalization approach for EDs. Their method uses heuristics for ED segmentation and implements a sliding window to increase detection recall. Another comprehensive approach by Lin et al. [6] also relies on synthetic data to train multiple YOLOv7-based models. Their workflow begins by detecting part views, followed by the detection of scattered text within those views. Text detections are then aligned horizontally, with Pytesseract performing the final recognition step.
The absence of labeled datasets for EDs has driven many authors to rely on synthetic data and pre-trained models, often necessitating extensive preprocessing and postprocessing steps. Emerging solutions like eDOCr [1] and Lin et al. [6], as well as commercial tools such as Werk24, are making ED digitalization increasingly viable, helping to bridge this gap. This paper presents four key contributions. First, we introduce a novel segmentation technique for engineering drawings (EDs) based on heuristics, which improves text detection recall and precision while also enhancing vision language (VL) model performance. Second, we propose an efficient clustering algorithm that groups detected text using bounding box borders instead of traditional center-based approaches like DBSCAN, leading to better text block formation and improved processing speed. Third, we provide the first benchmark in the literature that includes a small but annotated dataset of EDs, facilitating future research in this domain. Fourth, we integrate a VL model within an ED processing framework for the first time, demonstrating its potential for improved downstream tasks. These advancements collectively lead to a lower character error rate (CER), reduced processing time, and overall superior performance compared to existing solutions. Additionally, we conduct the first one-to-one comparison of our method against previous approaches in the literature, providing a direct performance evaluation.

3. eDOCr2 Workflow

The workflow builds extensively on our previous framework, eDOCr [1], which is a software developed with similar goals to those of the tool presented in this paper but with notable shortcomings identified by the community, primarily in processing time, accuracy, and robustness when handling various types of frames and tables. The new tool introduces significant advances in text prediction, recognition, and drawing comprehension, along with a substantial reduction in inference times and the integration of the LLM and VL tools. A detailed comparison with eDOCr is provided in Section 5.
The tool relies on a robust feature extractor to segment the image, directing different parts of the drawing to the three primary pipelines shown in Figure 1: the Table, FCF, and Dimension pipelines. Given Pytesseract’s proven versatility and efficiency in handling structured text within top-to-bottom, left-to-right documents in various languages, it has been selected for OCR in tables, allowing the user to specify the language of the drawing. However, Pytesseract has limitations when reading oriented and scattered text, and its pretrained models do not include typical symbols found in drawings, such as GD&T symbols. In these cases, such as for GD&T boxes, a custom CRNN recognizer is synthetically trained to identify these specialized characters.
In the dimension pipeline, the remaining text in the image is detected using the pretrained CRAFT detector provided by keras-ocr [16]. An initial recognition pass with Pytesseract identifies text that is not related to dimensions; if the text identified by Pytesseract is not part of the dimension recognizer’s alphabet, it is classified as “other information”. This dual-recognition approach is used because Pytesseract does not read the CRAFT output with the same accuracy as the custom-trained CRNN, often resulting in insertions, substitutions, or omissions.
As shown in Figure 1, certain sections of the pipeline can be replaced with VL models either to locate specific tabular information or to identify dimensions. The following sections provide a detailed discussion of the methods used for ED segmentation and the dimension pipeline (Section 3.1 and Section 3.3), as well as the combination and integration of VL models into eDOCr2 (Section 4).

3.1. Segmentation

EDs are typically presented as wide images containing scattered text within a frame, including sector references, tables of content, and, following ISO and ASME Y14.5 standards, an information block in the bottom right corner. In vertically oriented EDs, this block may extend across the entire lower border. Without preprocessing, text detection yields poor performance, resulting in mixed and disorganized information. A clear understanding of the layout in these types of documents is essential for accurate text identification.
This step is often overlooked in some studies, where documents are cropped to only process the views [3,4]. Other recent approaches apply deep learning to classify views before performing text detection within them [6], introducing additional uncertainty into the pipeline. Both eDOCr and eDOCr2 employ heuristic segmentation, which has been refined and made more robust in this latest version. Using traditional feature extraction from the widely used OpenCV library, the segmentation process consists of the following:
  • Frame detection: Vertical and horizontal lines are identified using directional kernels. These lines are analyzed for outliers, which are represented by peaks in line length exceeding a certain threshold and are potential frame candidates (see Figure 2a for reference). The lines closest to the center are selected to ensure the correct segmentation of double-framed EDs. This step is essential, as frames often contain text or numbers that could be mistakenly identified as dimensions, as shown in the ED example in Figure 2.
  • Rectangle detection: Based on contour detection and filtering for polygons that are parallel to the image borders and have four sides, a hierarchy of rectangles is constructed. The inner rectangles of the boxes of interest are classified as children. Using a proximity algorithm, the boxes are clustered together, and these clusters are classified as tables or FCFs depending on their size (by default, FCFs are smaller than 2% of the image area, but this can be adjusted by the user); a minimal sketch of this step is shown after this list. This approach has limitations for EDs that do not adhere to standards and allow text to touch the box lines, which may prevent the contour from forming a polygon. Additionally, if the boxes are imperfect or contain noise at the corners, the binary threshold in the contour function may need adjustment (see Figure 2b; some boxes are not identified). However, in such cases, the clustered bounding box usually encompasses these areas (see Figure 2c).
  • Image Processing: After identifying the frame and boxes and conducting recognition in tables using Pytesseract and a custom-trained CRNN model for the FCFs, the image is cropped to the size of the inner frame. The successfully predicted boxes (those containing text) are highlighted in white to remove them, resulting in a processed image that retains only the views and text.
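To make the rectangle detection step concrete, the following is a minimal OpenCV sketch under simplifying assumptions (a fixed binary threshold, axis-aligned boxes, and the default 2% area limit for FCF candidates); the function and variable names are illustrative and not part of the eDOCr2 API.

```python
import cv2

def detect_boxes(image_path, fcf_area_ratio=0.02):
    """Find rectangular boxes (table/FCF candidates) via contour detection and filtering."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(img, 200, 255, cv2.THRESH_BINARY_INV)
    contours, hierarchy = cv2.findContours(binary, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

    boxes = []
    for cnt in contours:
        # Approximate the contour with a polygon and keep only four-sided shapes
        approx = cv2.approxPolyDP(cnt, 0.02 * cv2.arcLength(cnt, True), True)
        if len(approx) != 4:
            continue
        x, y, w, h = cv2.boundingRect(approx)
        # Keep polygons roughly parallel to the image borders: the axis-aligned
        # bounding rectangle should fit the contour tightly
        if cv2.contourArea(approx) < 0.8 * w * h:
            continue
        boxes.append((x, y, w, h))

    # Classify candidates by relative area: small boxes -> FCF candidates, large -> tables
    img_area = img.shape[0] * img.shape[1]
    fcf_candidates = [b for b in boxes if b[2] * b[3] < fcf_area_ratio * img_area]
    table_candidates = [b for b in boxes if b[2] * b[3] >= fcf_area_ratio * img_area]
    return fcf_candidates, table_candidates
```

In the actual pipeline, the contour hierarchy (inner boxes as children) and a proximity-based clustering step refine these raw candidates before classification.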

3.2. FCF Block Pipeline

As a result of the segmentation process, different clusters (representing FCF blocks) of rectangles (representing each box in the FCF) are available for recognition. A recognizer capable of processing the GD&T control symbols is required for this task. The idea is to train a CRNN to recognize their UTF-8 representation, treating each symbol as just another character. Since no labeled corpus with sufficient repetition of the symbols is available, a synthetic generator was built to train the CRNN using specialized fonts that support GD&T writing. Simple strings of text were pasted onto a white background and used for network training (see Appendix A for details on the training process). A limitation of this method is the limited availability of fonts that support GD&T UTF-8 characters, which restricts the variability in the shape of the symbols, letters, and numbers.
During inference, each box in each cluster is processed to identify the content of the FCF from left to right, top to bottom, ensuring the correct reading and positioning of the symbols and characters displayed.
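A minimal sketch of this reading order, assuming each recognized box in a cluster is an axis-aligned (x, y, w, h) tuple paired with its recognized text (the names and the row tolerance are our own choices, not eDOCr2 internals):

```python
def read_fcf_cluster(boxes, row_tol=10):
    """Order FCF boxes left-to-right, top-to-bottom and join their recognized text.

    boxes: list of ((x, y, w, h), text) tuples for a single FCF cluster.
    row_tol: vertical tolerance (pixels) for grouping boxes into the same row.
    """
    # Sort by vertical position, then group boxes whose tops are within row_tol
    boxes = sorted(boxes, key=lambda b: b[0][1])
    rows, current = [], [boxes[0]]
    for b in boxes[1:]:
        if abs(b[0][1] - current[-1][0][1]) <= row_tol:
            current.append(b)
        else:
            rows.append(current)
            current = [b]
    rows.append(current)

    # Within each row, read boxes from left to right
    lines = ["|".join(t for _, t in sorted(row, key=lambda b: b[0][0])) for row in rows]
    return "\n".join(lines)

# Example: a position tolerance frame read as "⌖|0.1|A"
print(read_fcf_cluster([((120, 40, 30, 30), "0.1"), ((90, 40, 30, 30), "⌖"), ((150, 40, 30, 30), "A")]))
```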

3.3. Dimensions Pipeline

The dimension pipeline takes the processed image as input, excluding any FCFs or tables. It is the most complex pipeline, featuring interconnected networks alongside intricate preprocessing and postprocessing operations. This pipeline represents the most significant contribution of this work, and this subsection summarizes the various algorithms involved in the process.

3.3.1. Text Detection and Detection Processing

As no publicly available raster ED dataset exists for training to date, there are limited alternatives for performing text detection on them. The first option is to use a pretrained model; however, in other datasets, text typically occupies a larger portion of the image compared to EDs, resulting in suboptimal performance for detecting scattered dimensions. This issue was previously addressed by eDOCr, which divided the ED into overlapping patches, yielding good results. The second option explored was to create a synthetic text generator using ED backgrounds, but this attempt did not enhance the performance of the pretrained CRAFT detector, even with the patch subdivision.
For this reason, the pretrained CRAFT model was once again applied to the overlapping patches of the drawing. A clustering algorithm, using a user-defined threshold—set at 20 pixels for optimal results—was then employed to group text together. This algorithm is essential because the text from tolerances and the text in overlapping areas are detected independently. It examines whether the borders of dimension boxes are within the defined threshold distance from one another, clustering them into a combined oriented rectangle if they meet this criterion. Non-clustered and clustered boxes are shown in Figure 3a. A parameter that can affect both detection and recognition is the max. image size. This parameter controls the number of patches and the scaling factor for detection. A higher max. image size often results in better detection and more computing time.
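The patch subdivision itself is straightforward; the sketch below illustrates the idea, with the patch size, overlap ratio, and return format chosen for illustration rather than taken from the eDOCr2 implementation.

```python
def make_patches(img, patch=1240, overlap=0.2):
    """Split a large drawing into overlapping square patches for text detection.

    Returns (x0, y0, crop) tuples so that detections in each crop can be mapped
    back to full-image coordinates before the clustering step merges duplicates
    found in overlapping regions.
    """
    h, w = img.shape[:2]
    step = max(1, int(patch * (1 - overlap)))
    patches = []
    for y0 in range(0, h, step):
        for x0 in range(0, w, step):
            crop = img[y0:min(y0 + patch, h), x0:min(x0 + patch, w)]
            patches.append((x0, y0, crop))
    return patches
```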
The final oriented rectangle angle is computed assuming it complies with standard orientation; specifically, the angle of the text should always fall within the first and fourth quadrants, −90° ≤ α < 90°, where α is the angle between the longer side of the oriented rectangle and the horizontal axis. In this step, a contour operation is performed, as dimensions with only one character will have the vertical axis as their longer side. Consequently, the region of interest needs to be rotated 90 degrees to achieve proper alignment.
Other preprocessing operations include padding the dimension box, which improves Pytesseract’s performance, and adjusting the font stroke by dilating the text when the stroke is very thin (<2.5 pixels). This adjustment helps the recognizer reduce character omissions.
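Both operations map onto standard OpenCV calls. The sketch below is a rough version with assumed settings (white padding, a distance-transform stroke estimate, a 2 × 2 kernel); the actual eDOCr2 parameters may differ.

```python
import cv2
import numpy as np

def preprocess_dimension_crop(crop, pad=10, stroke_threshold=2.5):
    """Pad a grayscale dimension crop and thicken very thin strokes before recognition."""
    # White padding around the crop tends to help Pytesseract
    crop = cv2.copyMakeBorder(crop, pad, pad, pad, pad,
                              cv2.BORDER_CONSTANT, value=255)

    # Rough stroke-width estimate: twice the mean distance-transform value of text (dark) pixels
    ink = (crop < 128).astype(np.uint8)
    dist = cv2.distanceTransform(ink, cv2.DIST_L2, 3)
    stroke = 2 * dist[ink > 0].mean() if ink.any() else 0

    if stroke and stroke < stroke_threshold:
        # Eroding a white-background image shrinks the bright regions,
        # which effectively dilates (thickens) the dark font strokes
        crop = cv2.erode(crop, np.ones((2, 2), np.uint8), iterations=1)
    return crop
```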

3.3.2. Text Recognition

During text recognition, Pytesseract performs a preliminary classification. While this could have sufficed for final predictions, the character error rate (CER) for dimensions is extremely high. On the other hand, when reading other text information, the performance increases significantly. For this reason, a second recognizer was synthetically trained to perform dimension recognition specifically for symbols expected in an ED. A text box is classified as a dimension if the characters identified by Pytesseract are present in the dimension alphabet. The detection results for a sample ED are displayed in Figure 3b.
If all characters in Pytesseract’s recognition belong to the dimension alphabet or to a list of exceptions, the recognizer performs a subdivision of the adjusted box image, searching for tolerances—see Section 4.5 in the eDOCr paper [1]—to yield the prediction. Different subdivisions are separated by a space in the string, so if a space is present in the final recognition, the dimension contains a tolerance. Final recognition results for the sample ED from Figure 3 are gathered in the second column in Table 1. The parameter max. image size was set to 1240 px to yield the best results, with only one substitution. Lowering the max. image size for this particular sample affects the size of the detected boxes, inducing two additional errors.
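The classification rule reduces to an alphabet-membership check. The sketch below is illustrative: the alphabet and exception list are placeholders, not the exact ones shipped with eDOCr2 (see Table A1 in the appendix for the trained alphabets).

```python
# Illustrative alphabet and exceptions; the shipped alphabets differ (see Table A1)
DIMENSION_ALPHABET = set("0123456789RGHhMmxd(),.+-±:/°⌀= ")
EXCEPTIONS = {"THRU", "TYP"}  # hypothetical tokens tolerated in a dimension string

def is_dimension(pytesseract_text):
    """Classify a detected text box as a dimension if every token fits the dimension alphabet."""
    tokens = pytesseract_text.split()
    if not tokens:
        return False
    return all(tok in EXCEPTIONS or set(tok) <= DIMENSION_ALPHABET for tok in tokens)

print(is_dimension("21.5 ±0.1"))    # True  -> routed to the custom CRNN recognizer
print(is_dimension("SECTION A-A"))  # False -> kept as "other information"
```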

3.3.3. Region-Oriented Symbol Template Matching

The pretrained CRAFT detector exhibited very good recall in predicting text in EDs, as demonstrated in the results section. However, it often overlooked a recurring symbol in EDs: “⌀”. If more EDs were available, this issue could potentially be addressed through fine-tuning of the detector. Until such a dataset becomes publicly accessible, the chosen approach is to explore the surroundings of the dimension-oriented boxes and perform template matching for the diameter symbol. Upon a successful match, the boxes are clustered together, and the recognition process is repeated. The detection results after using template matching are displayed in Figure 3c. The limited region of interest used for template matching added only a 6% increase to the dimension pipeline’s running time for this sample.
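Conceptually, this step is a cv2.matchTemplate call restricted to a small region of interest around each dimension box; the margin and matching threshold below are assumed values, not the ones used in eDOCr2.

```python
import cv2

def find_diameter_symbol(image, dim_box, template, margin=40, threshold=0.7):
    """Search a small region around a dimension box for the '⌀' glyph via template matching.

    image: grayscale drawing; dim_box: (x, y, w, h) of a detected dimension;
    template: grayscale crop of the diameter symbol at drawing scale.
    Returns the match location in full-image coordinates, or None.
    """
    x, y, w, h = dim_box
    x0, y0 = max(0, x - margin), max(0, y - margin)
    roi = image[y0:y + h + margin, x0:x + w + margin]
    if roi.shape[0] < template.shape[0] or roi.shape[1] < template.shape[1]:
        return None

    result = cv2.matchTemplate(roi, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    if max_val < threshold:
        return None
    # On a hit, the caller merges this location into the dimension cluster and re-runs recognition
    return (x0 + max_loc[0], y0 + max_loc[1])
```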

4. LLM Tools

Extracting relevant information is the first step in applying IDP (Intelligent Document Processing) to EDs. Previous solutions typically conclude at this stage. In contrast, for the first time, we integrated the results with LLMs through the Hugging Face Transformers package.

4.1. Information Block Data Extraction with VL

eDOCr2 extracts all table data using Pytesseract with a preselected language. That said, the information block often contains very specific text that the user is interested in extracting, such as part numbers, surface finishes, material specifications, weight, general tolerances, etc. Pytesseract returns all textual elements without distinguishing between labels, title blocks, and the actual important information.
A VL model can understand the contextual relationships between different textual elements and return only the requested information. Since the information is organized in a table, a very specific query should be enough to extract it, without an explicit system role. The prompt engineering required to extract data from the information block is as follows.
Query Input: Based on the image, return only a python dictionary extracting this information: {query_string}.
The variable query_string is a Python list converted to a string, containing user-defined elements to extract from the table, e.g., [’name’, ’part number’, ’material’]. Figure 4 shows a segmented table and the extracted information using Qwen2-VL-7B. Although the model loading and query phases took significantly longer than with Pytesseract, the extracted information is clean and presented in a Python-friendly format, making it ready for integration into the next business process.
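As an illustration, the query above can be issued to Qwen2-VL-7B with the standard Hugging Face Transformers recipe for that model (the qwen_vl_utils helper comes from the Qwen team); the image file name, query fields, and output parsing below are our own choices, not part of the eDOCr2 API.

```python
import ast
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

query_string = str(["name", "part number", "material"])  # user-defined fields
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "table_crop.png"},  # segmented information block from eDOCr2
        {"type": "text", "text": f"Based on the image, return only a python dictionary extracting this information: {query_string}."},
    ],
}]

# Standard Qwen2-VL chat-template and vision preprocessing steps
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]
answer = processor.batch_decode(trimmed, skip_special_tokens=True)[0]

# The model is asked to emit a Python dict literal; strip any code-fence wrapping before parsing
cleaned = answer.strip().strip("`").removeprefix("python").strip()
table_data = ast.literal_eval(cleaned)
print(table_data)
```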

4.2. Manufacturing Context Information Understanding

As a case study, we used the ED shown in Figure 5 to assess the level of understanding an LLM can achieve once the information has been segmented. In this case, the LLM was provided with the information block and part views after running the eDOCr2 segmentation process. The task is to identify the sequence of processes required to manufacture the part.
GPT-4 was selected for this task, since it was judged to be state of the art with more versatile knowledge than Qwen2.
Query input: You are getting the information block of the drawing in the first image and the views of the part in the second. I need you to return a python dictionary with the manufacturing processes (keys) and short description (values) that are best for this part.
The answer, once re-formatted, resulted in the following:
  • “Turning”: “Used for creating the cylindrical shape of the part, including the outer diameters and chamfers”.
  • “Drilling/Boring”: “To achieve the internal diameter and the countersink specified in section A-A”.
  • “Milling”: “For creating the flat surfaces if needed, and refining any features not achievable by turning”.
  • “Reaming”: “To ensure the internal diameter of 21.5 ± 0.1 is precise and smooth”.
  • “Grinding”: “To achieve the surface finish required on precise diameters, such as 38 H12”.
  • “Deburring”: “To break all sharp edges and remove burrs as specified in the drawing notes”.
This part was designed as a teaching example to introduce students to the use of the lathe; therefore, turning is the primary manufacturing process, which GPT-4 correctly identified. While the overall manufacturing assessment provided by GPT-4 is accurate, there are nuances that a manufacturing engineer would point out. For instance, milling is suggested as an optional step, but it is unnecessary, since turning alone can achieve a better surface finish. Additionally, grinding is not required for the diameter 38 H12, as there is no surface finish specification, and the given tolerance can be achieved through turning. However, if a surface finish requirement or a tighter tolerance were specified, grinding would be necessary. A noteworthy aspect of the response is its mention of deburring, which aligns with the information explicitly stated in the information block.

4.3. Quality Control Checklist Generation

Continuing with potential uses of LLMs in this field, another case study was conducted in which GPT-4 was asked to identify the measurements that need to be checked during the quality control process. The analyzed ED is the same as the one in Figure 5. In this case, the query and response were as follows:
Query input: I need you to provide a Python list containing only the measurements—numerical values and tolerances—that need to be checked in the quality control process.
Answer: [“Ø21.5 ± 0.1”, “Ø38 H12”]
The response is precise and to the point. The only values that need to be checked in this part are those identified by GPT-4, as they are the only measurements that will affect the interaction with other fitting components.

4.4. Replacing the Dimension Pipeline with a VL Model

While segmentation and OCR on tables and FCFs run very quickly, taking around 10% of the total processing time, the dimension pipeline is the most computationally expensive task, particularly the text detection for the different patches into which the image has been divided. A VL model is a potential substitute for this pipeline. At this point, the image is processed to a degree where only dimensions and other annotations are displayed, making it a reasonable task to ask an LLM to extract the dimensions. Two different LLMs were tested: the smaller model Qwen2-VL-7B (tested locally) and GPT-4 (accessed via the OpenAI server API). In both cases, the system role and query input were the same.
System Role: You are a specialized OCR system capable of reading mechanical drawings. You read Measurements, usually scattered and oriented text in the image and with arrows in the surroundings. If tolerances are present, read them as “nominal” “upper” “lower”, e.g., “10 +0.1 0”; Angles, usually oriented text with arrows in the surroundings.
Query Input: Based on the image, return only a python list of strings extracting dimensions.
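For reference, a hedged sketch of how such a query can be sent through the OpenAI Python SDK (chat completions with a base64-encoded image). The file name and generation settings are illustrative, and the system role is abbreviated from the one quoted above.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_ROLE = (
    "You are a specialized OCR system capable of reading mechanical drawings. "
    "You read Measurements, usually scattered and oriented text in the image with arrows "
    "in the surroundings. If tolerances are present, read them as 'nominal' 'upper' 'lower'; "
    "Angles, usually oriented text with arrows in the surroundings."
)

with open("processed_drawing.png", "rb") as f:  # image after eDOCr2 segmentation
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_ROLE},
        {"role": "user", "content": [
            {"type": "text", "text": "Based on the image, return only a python list of strings extracting dimensions."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]},
    ],
)
print(response.choices[0].message.content)  # expected: a Python list literal of dimension strings
```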
Table 1 shows the results for the various combinations tested against the ED in Figure 3. All three methods exhibited high recall in detecting the dimension text and high accuracy in recognizing the characters. In terms of recognizing additional information, the VL models performed better at interpreting the context and content of the predictions, while the rule-based system in eDOCr2 incorporated view information for zoomed views. However, none of the methods successfully identified that three text boxes contain roughness specifications. Further exploration of the support line detection could enhance eDOCr2, as implemented by Haar et al. [5]. One advantage of the dimension pipeline over VL models is its provision of positional information relative to the image, while the VL models return only the dimension strings without coordinate data.

5. Results

5.1. LLM Benchmark

eDOCr2 is a PDMS that supports VL integration and can be used for reasoning over tables and analyzing detected text. This section evaluates how well the LLMs perform in extracting dimensions and FCFs. Table 2 presents a comparison between eDOCr2 integrated with GPT4o, referred to as eDOCr2 (GPT4o), and the standard eDOCr2. As a baseline, GPT4o was tested in a controlled benchmark on seven EDs from various sources without the segmentation processing steps provided by eDOCr2.
The system role was modified for the GPT4o model so that it was also asked to read FCFs.
System Role: Feature Control Frames, usually in boxes, return either the symbol or its description, then the rest of the text.
Even with this adjustment, GPT4o failed to return FCF-related data, resulting in a recall of 0% for FCFs. Furthermore, the low recall for dimensions in GPT4o and eDOCr2 (GPT4o) can be partially explained by the omission of repeated dimensions within EDs. For example, when analyzing a 10 × 10 mm square, GPT4o would return only one dimension. When repeated dimensions were counted as matches, the recall for GPT4o and eDOCr2 (GPT4o) improved to 92.68% and 96.34%, respectively.
An interesting finding is that the GPT4o recall performance worsened when used alone compared to its integration within eDOCr2. This can be attributed to the fact that images fed to GPT4o within the eDOCr2 framework had been preprocessed to remove extraneous textual elements, such as tables, FCFs, and frames, allowing GPT4o to focus exclusively on the text within the drawing area.
Regarding Qwen2-VL-7B, although it initially produced encouraging results in the sample ED in Figure 3, it encountered difficulties with the remaining benchmark EDs by interpreting the tolerances as separate dimensions. This issue limits the reliability of the metrics in Table 2 for the performance assessment of Qwen2-VL-7B. On the other hand, when simpler EDs with limited or no tolerances were processed, eDOCr2 (Qwen2-VL-7B) showed promising CER and recall results, as shown in Table 1.
Finally, although the FCF CER may seem high, exceeding 5%, it is important to note that only three EDs in the benchmark contain FCFs, and all errors occurred in a single drawing. Specifically, there were three missing periods (.) and one substitution error in a GD&T symbol out of 69 predicted characters.

5.2. Ablation Study on Processing Steps

An ablation study in machine learning is a systematic approach used to understand the impact of different components or design choices of a model or system. In an ablation study, certain parts of the model (such as layers, features, or algorithms) are intentionally removed or modified to observe how their absence affects the model’s performance. The goal is to isolate and understand the contribution of each component to the overall performance.
In this work, we conducted an analysis similar to an ablation study, but instead of modifying the deep learning models, we evaluated the contribution of different processing steps within the pipeline. This approach helps identify the most critical steps in the pipeline and optimize the flow, leading to better detection, recognition, and reduced processing times, without the need to change the underlying neural networks.
The processing steps considered for this ablation study are not essential for the overall functioning of the framework. For example, removing the segmentation of tables and FCFs would not provide a fair comparison, as the pipeline is heavily reliant on them. The selected processing steps represent a tradeoff between processing time and performance improvement, and they include the following: the frame detection algorithm, the sliding window approach for sending cropped windows to the CRAFT detector, postprocessing template matching to find missing “⌀” characters, and stroke correction for thin text. The results of the ablation study are presented in Table 3.
The frame detection step had a substantial impact on the precision of the text detection by ensuring that all information outside the frame boundaries was removed. This improvement in focus within the frame also contributed to a minor increase in the IoU, positively affecting the character recall and consequently reducing the character error rate (CER).
This effect was even more pronounced when the sliding window was removed. The sliding window approach, which divides the ED into smaller patches, had the most significant impact across all metrics, and its removal resulted in a notable decline in both the recall and IoU. However, removing it does provide an advantage in processing time, as processing each smaller patch requires more computational time overall than analyzing the entire image at once.
Template matching for the “⌀” symbol achieved a positive tradeoff between processing efficiency and improved CER by specifically targeting symbols that would otherwise be difficult for the text detector to capture accurately.
Finally, stroke correction played a crucial role in improving the OCR performance in EDs with extremely thin text. In one specific ED, stroke correction reduced the CER from 10% to 0%, demonstrating its effectiveness. While the overall impact on the complete set of EDs is moderate, stroke correction is particularly valuable in cases where text legibility is compromised.

5.3. State-of-the-Art Comparison

This subsection presents a detailed comparison of eDOCr2 against recent solutions for processing engineering drawings, including its predecessor eDOCr [1], DigiEdraw [4], and the approach introduced by Lin et al. [6]. Due to differences in methodologies, evaluation metrics, and datasets used in these publications, a direct, unified comparison across all methods is not feasible. Instead, we conducted a one-to-one quantitative and qualitative analysis assessing eDOCr2 against the specific strengths and reported advantages of each approach. This included evaluating improvements in the text detection recall and precision, processing speed, character error rate (CER), and overall segmentation accuracy. By structuring the comparison in this way, we ensured a fair assessment of eDOCr2’s contributions while highlighting its advancements over prior work.

5.3.1. eDOCr

eDOCr2 demonstrated significant improvements over its predecessor, eDOCr, in terms of the text recognition accuracy and overall processing efficiency. These advancements resulted from enhanced frame detection, a better-trained recognizer, and additional steps, such as template matching and stroke correction. Consequently, eDOCr2 outperformed its predecessor in text metrics, table detection and processing, and in reduced runtime, which were the primary limitations of eDOCr. Both the qualitative evidence (see Figure 6) and quantitative results (see Table 4) are provided to highlight the necessity of eDOCr2 by illustrating the shortcomings of its predecessor. The EDs used to compute the metrics in Table 4 were the same as those used in the LLM benchmark above.

5.3.2. DigiEDraw

DigiEDraw [4] is a tool designed for extracting textual information from EDs. It employs a PDF scraping technique, which is followed by a clustering algorithm that groups the extracted textual information and applies rules to organize it. The metrics presented in their paper only refer to detection, as they assume that the text extracted from the document is accurate. However, DigiEDraw lacks a filtering mechanism to distinguish dimensions from other textual elements present in the ED, resulting in low precision for dimension extraction. A comparison with eDOCr2 is presented in Table 5 using the EDs published in their study.
To complement the quantitative results in Table 5, the segmentation and detection of EDs titled “Elevator Bottom” (E), “Aufspannung” (A), and “Aufspannung Ecke” (A3) are shown in Figure 7. It can be observed that, although the text detection for E should result in 100% recall, one dimension was incorrectly categorized as “other information”, causing errors in the text recognition. Similarly, in A, the roughness specification was split into two boxes, leading the algorithm to treat the number as a separate detection, which reduced the precision.

5.3.3. Lin et al. [6]

In the work presented by Lin et al. [6], a pipeline was proposed that utilizes a concatenation of synthetically trained object detection models (YOLOv7) to segment ED views. The cropped views are then used to extract features such as dimensions, datum planes, and FCFs. To evaluate their framework, they presented a single case study. A quantitative comparison between the metrics reported in their paper and those of eDOCr2 for this case study is provided in Table 6.
The qualitative results for their case study are presented in Figure 8. All dimensions and FCFs were detected and recognized without errors. Some text was found inside one of the views, but it was correctly filtered out as “other information”.

6. Discussion

This section discusses the capabilities of VL models and eDOCr2 to digitalize mechanical and other EDs based on the results, how these tools can be integrated into PDMSs, and future avenues for research.
The benchmark results indicate that eDOCr2 achieved the best performance, narrowly surpassing GPT4o and eDOCr2 (GPT4o). The CER obtained in the benchmark for eDOCr2 was highly dependent on the processing steps, as demonstrated in the ablation study, while both GPT4o and eDOCr2 (GPT4o) consistently maintained the CER below 2%. Furthermore, eDOCr2 (GPT4o) outperformed eDOCr2 in terms of the dimension recall when repeated dimensions were treated as double matches. The conclusions drawn from these results are as follows:
  • Preprocessing steps and initial segmentation significantly improve the performance of traditional OCR and VL models on dimension predictions.
  • GPT4o does not identify FCFs or extract their information, even when specifically asked to do so.
  • Both eDOCr2 and eDOCr2 (GPT4o) delivered near-perfect results, with very few errors and minimal missing dimensions.
Moreover, the results obtained using GPT4o provide strong evidence that digitalizing EDs is now feasible without the need to develop specific algorithms, thanks to the powerful server-based LLM models available today. Using such methods does not come without some challenges. For example, companies are often reluctant to upload sensitive data to servers, where they could be used to retrain models, potentially exposing these data to third parties. This concern has led to the release of features such as OpenAI’s “temporary chat”, which promises not to use input data for training. From a technical standpoint, VL models do not provide the coordinates of the predicted text in the image, which may limit their application in user interfaces or any situation requiring positional information. In these cases, traditional OCR frameworks may be a better approach. On the other hand, for tasks where only content extraction is required, such as analyzing information blocks, VL models are better suited than traditional OCR. The use of large VL models raises concerns about their sustainability and environmental impact. Running smaller VL models locally can be more environmentally friendly, though this may come at the cost of reduced accuracy.
The understanding and reasoning capabilities of LLMs were successfully tested at two different levels. In the first instance, engineering data extraction from the information block can be performed using smaller LLMs such as Qwen2-7B-VL. Meanwhile, more complex reasoning tasks, such as manufacturing assessment and quality control analysis, have been evaluated using GPT-4. The integration of these tasks within the eDOCr framework through prompt engineering enables a seamless information flow across subsequent processes. This tool has various industry-specific applications, particularly in automating and streamlining engineering workflows. In quality control automation, it enables efficient extraction and validation of dimensions for inspection processes, reducing manual effort and minimizing errors. It also enhances drawing database management by facilitating efficient indexing, retrieval, and filtering of engineering drawings. Additionally, it can be integrated into CAD systems for metadata population, automatically filling in design parameters and annotations. Another key application is bill of materials (BOM) generation, where the tool automates component extraction from drawings, improving accuracy and reducing the time required for manual data entry.
The proven versatility of VL models allows them to be integrated into other ED frameworks, such as P&ID diagrams, where symbols can be detected using object detection or traditional computer vision techniques such as template matching, and text can be processed with VL models. These advances have opened up new possibilities for digitalizing EDs, and specific tools that enable IDP in different contexts are rapidly becoming mainstream. In this trend, this paper introduces a PDMS that enhances VL capabilities by combining them with traditional OCR and image processing techniques. Although the results using eDOCr2 are superior to those of GPT-4o and eDOCr2 (GPT-4o) in the test benchmark, it could easily combine both outputs to reduce model uncertainty. In such a setup, GPT-4o would act as a reviewer of the OCR predictions, identifying missing predictions or characters, while most of the positional information would remain intact.
With regard to future work, the research gap in digitalizing machine-produced EDs is closing. However, legacy EDs with manual writing have not been explored extensively. In these cases, eDOCr2 may not yield optimal results, since the recognizers are fine-tuned for machine-printed text. However, the VL models could still show good accuracy in extracting information from these legacy drawings.

7. Conclusions

The digitalization of EDs represents a critical step toward automating and streamlining various stages of product development and quality control processes. This study introduces eDOCr2, an OCR system for EDs, and offers a comparison with VL models for extracting essential information from mechanical EDs. Although server-based LLM models, such as GPT-4, offer powerful capabilities, they are unable to provide positional information, which may limit their applicability in certain scenarios. In contrast, eDOCr2, which combines traditional OCR and image processing methods, excels in accurately extracting positional data and content—93.75% on text recall and a less than 1% CER—while offering integration with VL models as part of its data extraction workflow. The VL model demonstrated its versatility in the case study, excelling in tasks such as explicit information extraction (Qwen2-7B-VL), manufacturing assessment (GPT4), and quality control checklist generation (GPT4).
The tools introduced in this study establish a solid foundation for future advances in IDP within the engineering domain. As the field progresses, we foresee that hybrid approaches that combine VL models with traditional computer vision techniques will play a pivotal role in driving greater automation, efficiency, and accuracy across the manufacturing industry.

Author Contributions

J.V.T.: Conceptualization, Methodology, Software, Validation, Resources, Data curation, Original draft preparation, Review and editing, Visualization. M.T.: Conceptualization, Resources, Review and editing, Supervision, Project administration, Funding acquisition. During the preparation of this work, the authors used ChatGPT-4o to improve the readability and language of the manuscript. After using this tool, the authors reviewed and edited the content as needed and assume full responsibility for the content of the published article. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Vinnova grant number 2021-02481, iPROD project, and grant 2024-01420, DART project.

Data Availability Statement

The code and data used for this work are publicly available in the GitHub repository at https://github.com/javvi51/edocr2 (accessed on 19 March 2025).

Acknowledgments

The authors thank Vinnova for making this research project possible.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Model Training & Setup

While the pretrained text detector achieves high accuracy when using an appropriate scale and a sliding window, the text recognizer has been synthetically trained to handle the new characters and symbols found in EDs. This appendix discusses the hardware setup, architectural details, synthetic data generation, and training hyperparameters.
The training of both the GD&T and dimension pipeline recognizers was conducted on a single Nvidia A4000 GPU (16 GB), with an AMD Ryzen 7700 CPU and 64 GB of RAM. When combining local VL models with the recognizers, a second Nvidia A4000 GPU is required, as a single GPU becomes insufficient when loading both TensorFlow (for the recognizers) and PyTorch (for the VL models). This limitation can be mitigated by using a GPU with higher VRAM.
The recognizer model follows the convolutional recurrent neural network (CRNN) architecture introduced by Shi et al. [13]. Since its introduction, the CRNN has gained popularity as a powerful yet efficient model for OCR tasks. It combines feature extraction through a deep CNN, which organizes features into a sequential representation, with subsequent processing using a Long Short-Term Memory (LSTM) network and a final transcription layer.
A convenient implementation of the CRNN has been provided by Morales F. in the keras-ocr framework [16], upon which eDOCr2 was built for both text detection and recognition.
Training the CRNN requires only image crops and the corresponding text strings, with no information about the position of each character within the image, which makes the generation of synthetic data easier. Due to the lack of a large labeled dataset of EDs, both the GD&T and dimension pipeline CRNNs were trained using only synthetic data, following Table A1.
Table A1. Synthetic data generation details for both pipeline models.
Pipeline | Alphabet | Samples | String Length | Bias Char.
GD&T | “0123456789., ABCDZ⌀” + GD&T symbols (see the https://github.com/javvi51/edocr2/blob/main/test_train.py file, accessed on 17 March 2025) | 25,000 | (2, 6) | “.,”
Dimensions | “0123456789AaBCDRGHh MmnxZtd(),.+-±:/°”⌀=” | 40,000 | (4, 10) | “.,”
The synthetic data generator allows the user to customize the alphabet, the number of samples, and the string length. By default, the generator uses a weighted random distribution to ensure an even character distribution, meaning that the weights are updated dynamically based on character appearance. If the user chooses to introduce a bias for certain characters, this option increases their default weight by a bias factor. This feature was implemented after empirically observing that the CRNN tended to miss small punctuation marks such as “.” and “,”.
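The biased, frequency-aware sampling can be sketched as follows; the font path, image size, and bias factor are assumptions, and the released generator additionally randomizes fonts and other rendering parameters.

```python
import random
from collections import Counter
from PIL import Image, ImageDraw, ImageFont

# Close to the dimension alphabet in Table A1 (illustrative, not the exact string)
ALPHABET = "0123456789AaBCDRGHh MmnxZtd(),.+-±:/°⌀="
BIAS_CHARS, BIAS_FACTOR = ".,", 3.0

def sample_string(counts, length_range=(4, 10)):
    """Draw characters with weights that favor under-represented (and biased) characters."""
    weights = [(BIAS_FACTOR if c in BIAS_CHARS else 1.0) / (1 + counts[c]) for c in ALPHABET]
    chars = random.choices(ALPHABET, weights=weights, k=random.randint(*length_range))
    counts.update(chars)
    return "".join(chars)

def render_sample(text, font_path="some_gdt_font.ttf", size=(256, 48)):
    """Paste a text string onto a white background, mimicking a dimension crop."""
    img = Image.new("L", size, color=255)
    draw = ImageDraw.Draw(img)
    # font_path is a placeholder: any font supporting the alphabet (GD&T fonts for the GD&T pipeline)
    draw.text((8, 8), text, font=ImageFont.truetype(font_path, 28), fill=0)
    return img

counts = Counter()
pairs = [(render_sample(s), s) for s in (sample_string(counts) for _ in range(5))]
# scale the sample count up to the 25,000 / 40,000 figures in Table A1 for actual training
```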
Beyond the synthetic data generation parameters, the user can specify hyperparameters such as the number of epochs, batch size, and validation split. These parameters depend on the hardware setup and the number of generated samples. A larger number of generated samples requires fewer epochs for the CRNN to converge. Given our hardware and generated samples, the batch size was set to 256, and training was conducted for three and two epochs, respectively, with a validation split of 0.2.
In addition to the training utility, a script was provided to perform testing on the generated samples. The results after testing the GD&T recognizer were CER = 0.42% and recall = 100%. Some examples of the generated test samples are shown in Figure A1. Very similar results were obtained when training the dimension pipeline recognizer.
Figure A1. Synthetic samples generated as part of the test set for the GD&T recognizer.
As a final note to this appendix, synthetic training of the detector has also been explored, and the corresponding code has been released. However, the sim-to-real gap was not effectively solved, as the pretrained CRAFT detector outperformed its synthetically trained counterpart on real EDs.

Appendix B. Frame Detection Algorithm

The problem that this algorithm aims to solve when enabled is identifying the design space within the ED. This design space is enclosed by an inner rectangle that contains tables, views, and other information while excluding the frame references (see Figure 2c). In other words, the frame is defined as the smallest rectangle that encompasses all these elements. Since the goal is to find the smallest enclosing rectangle, a minimum threshold (user-defined) must be set to initiate the search. By default, the requirement for the frame is that the design space must occupy at least 70% of the image’s height and width. This threshold is imposed to prevent the selection of other rectangles in the image that correspond to individual views (such as views in Figure 8).
Looking at the ED from Figure 8, it becomes evident that retrieving contours and searching for rectangles is not a viable solution, as the design space has an L shape. The auto-frame algorithm addresses this by applying directional kernels to detect pixel accumulation in both horizontal and vertical directions. This accumulation highlights horizontal and vertical peaks where lines exceed 70% of the ED’s width or height. The algorithm then selects the minimum distance between top–bottom and left–right peak pairs, defining the frame’s top, bottom, left, and right boundaries based on the image center. The frame is determined by the combination that results in the smallest design space area.
Algorithm A1: Auto-frame algorithm
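Since Algorithm A1 appears as a figure in the published layout, the following is a rough reconstruction of its core idea from the description above; the thresholds, fallbacks, and return format are our own choices, not the authors' exact algorithm.

```python
import numpy as np

def auto_frame(binary, min_ratio=0.7):
    """Rough reconstruction of the auto-frame idea.

    binary: 2D array with line pixels set to 1 (e.g., after thresholding and
    directional morphology). Rows/columns whose line accumulation exceeds
    min_ratio of the image width/height are frame-line candidates; the
    innermost candidates around the image center bound the design space.
    Returns (top, bottom, left, right) row/column indices.
    """
    h, w = binary.shape
    row_fill = binary.sum(axis=1) / w   # horizontal line accumulation per row
    col_fill = binary.sum(axis=0) / h   # vertical line accumulation per column

    rows = np.where(row_fill >= min_ratio)[0]   # candidate horizontal frame lines
    cols = np.where(col_fill >= min_ratio)[0]   # candidate vertical frame lines

    def innermost(cands, mid, default_low, default_high):
        # Pick the candidates closest to the image center on each side,
        # i.e., the pair yielding the smallest enclosing design space
        low, high = cands[cands < mid], cands[cands >= mid]
        return (low.max() if low.size else default_low,
                high.min() if high.size else default_high)

    top, bottom = innermost(rows, h // 2, 0, h - 1)
    left, right = innermost(cols, w // 2, 0, w - 1)
    return top, bottom, left, right
```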

Appendix C. Detection Clustering

The pretrained CRAFT detector is highly effective in detecting scattered text; however, many text boxes, such as tolerances displayed as superscripts and subscripts, are detected independently. A traditional clustering algorithm like DBSCAN could be used to group these text boxes together. The challenge with DBSCAN is that the threshold ϵ determines whether other points belong to a cluster, and when working with bounding boxes, the most common approach is to use their center points. However, in highly packed regions, such as dense dimension annotations, the distances between centers of different dimensions may cause them to be incorrectly clustered together.
To address this issue, we introduced a clustering algorithm specifically designed for box clustering in EDs, where the threshold ϵ was based on the distance between box boundaries rather than their centers. This algorithm supports the clustering of oriented bounding boxes. Figure A2 illustrates a real ED example, comparing how our clustering algorithm groups bounding boxes relative to DBSCAN.
Figure A2. (a) DBSCAN clustering using box centers. (b) Clustering algorithm using box boundaries. Boxes in blue represent CRAFT detections.
In Figure A2a, it is evident that when using box centers, clustering the elements of the measurement 2.8 +0.1/0 together required setting the threshold ϵ so high that other dimensions were also clustered incorrectly. In contrast, when using box boundaries, a much smaller ϵ was sufficient to group the elements of a single measurement while still maintaining a clear separation between different measurements.
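A minimal, axis-aligned version of this boundary-based criterion (the released implementation also handles oriented boxes) could look as follows; the function names and the union-find bookkeeping are ours.

```python
def box_gap(a, b):
    """Smallest gap between two axis-aligned boxes (x, y, w, h); 0 if they overlap."""
    dx = max(b[0] - (a[0] + a[2]), a[0] - (b[0] + b[2]), 0)
    dy = max(b[1] - (a[1] + a[3]), a[1] - (b[1] + b[3]), 0)
    return (dx ** 2 + dy ** 2) ** 0.5

def cluster_boxes(boxes, eps=20):
    """Group boxes whose borders lie within eps pixels, using simple union-find."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if box_gap(boxes[i], boxes[j]) <= eps:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(boxes)):
        clusters.setdefault(find(i), []).append(boxes[i])
    return list(clusters.values())

# A nominal value and its stacked tolerance sit close at their borders even though
# their centers are far apart, so they merge at eps=20, while a neighboring
# dimension further away stays separate.
print(len(cluster_boxes([(100, 100, 80, 30), (185, 95, 30, 15), (300, 100, 60, 30)], eps=20)))  # -> 2
```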
By default, this parameter is set to 20 pixels, which is a value that has proven effective across benchmarks with EDs from various sources. A sensitivity analysis of ϵ on the ED benchmark is presented in Table A2. However, ϵ is a crucial parameter that significantly influences the clustering results and should be tuned based on image resolution variations.
Table A2. Sensitivity analysis for different ϵ values.
ϵ (Pixels) | Recall | Precision | IoU | CER | Char. Recall
5 | 95.0% | 76.0% | 79.68% | 14.9% | 85.09%
10 | 95.0% | 87.35% | 83.76% | 5.53% | 94.47%
20 | 93.75% | 96.15% | 86.38% | 0.73% | 99.27%
30 | 90% | 100% | 81.56% | 8.98% | 97%
50 | 83.75% | 104.69% | 77% | 18.81% | 92%
Table A2 illustrates how the clustering threshold impacts the reading of the ED. When ϵ was too small, dimension information was not properly clustered, leading to a reduced IoU. As a result, parts of the measurement may be missing, worsening the character recall and increasing the CER due to omission.
Conversely, setting ϵ too high caused multiple measurements to be clustered together, which explains the precision values exceeding 100%, as a single cluster can surpass the IoU threshold for multiple dimensions. Inversely to the case of a very small ϵ , a large ϵ also increased the CER, as unexpected text was included in the clusters, contributing to additional predicted text.

References

  1. Villena Toro, J.; Wiberg, A.; Tarkian, M. Optical character recognition on engineering drawings to achieve automation in production quality control. Front. Manuf. Technol. 2023, 3, 1154132. [Google Scholar] [CrossRef]
  2. Seliger, R.; Gül-Ficici, S.; Göhner, U. From Paper to Pixels: A Multi-modal Approach to Understand and Digitize Assembly Drawings for Automated Systems. In Proceedings of the Database and Expert Systems Applications-DEXA 2024 Workshops, Naples, Italy, 26–28 August 2024; Moser, B., Fischer, L., Mashkoor, A., Sametinger, J., Glock, A.C., Mayr, M., Luftensteiner, S., Eds.; Springer: Cham, Switzerland, 2024; pp. 77–88. [Google Scholar]
  3. Schlagenhauf, T.; Netzer, M.; Hillinger, J. Text Detection on Technical Drawings for the Digitization of Brown-field Processes. In Proceedings of the 16th CIRP Conference on Intelligent Computation in Manufacturing Engineering, Online, 13–15 July 2022; Elsevier: Amsterdam, The Netherlands, 2022. [Google Scholar] [CrossRef]
  4. Scheibel, B.; Mangler, J.; Rinderle-Ma, S. Extraction of dimension requirements from engineering drawings for supporting quality control in production processes. Comput. Ind. 2021, 129, 103442. [Google Scholar] [CrossRef]
  5. Haar, C.; Kim, H.; Koberg, L. AI-Based Engineering and Production Drawing Information Extraction. In Proceedings of the Flexible Automation and Intelligent Manufacturing: The Human-Data-Technology Nexus, Detroit, MI, USA, 19–23 June 2022; Kim, K.Y., Monplaisir, L., Rickli, J., Eds.; Springer: Cham, Switzerland, 2022; pp. 374–382. [Google Scholar]
  6. Lin, Y.H.; Ting, Y.H.; Huang, Y.C.; Cheng, K.L.; Jong, W.R. Integration of Deep Learning for Automatic Recognition of 2D Engineering Drawings. Machines 2023, 11, 802. [Google Scholar] [CrossRef]
  7. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  8. Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. EAST: An Efficient and Accurate Scene Text Detector. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2642–2651. [Google Scholar] [CrossRef]
  9. Liao, M.; Shi, B.; Bai, X. TextBoxes++: A Single-Shot Oriented Scene Text Detector. IEEE Trans. Image Process. 2018, 27, 3676–3690. [Google Scholar] [CrossRef] [PubMed]
  10. Baek, Y.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character Region Awareness for Text Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  11. Zeng, Y.X.; Hsieh, J.W.; Li, X.; Chang, M.C. MixNet: Toward Accurate Detection of Challenging Scene Text in the Wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar] [CrossRef]
  12. Ch’ng, C.K.; Chan, C.S. Total-Text: A Comprehensive Dataset for Scene Text Detection and Recognition. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 935–942. [Google Scholar] [CrossRef]
  13. Shi, B.; Bai, X.; Yao, C. An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2298–2304. [Google Scholar] [CrossRef] [PubMed]
  14. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  15. Smith, R. An Overview of the Tesseract OCR Engine. In Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Curitiba, Brazil, 23–26 September 2007; Volume 2, pp. 629–633. [Google Scholar] [CrossRef]
  16. Morales, F. keras-ocr. 2020. Available online: https://github.com/faustomorales/keras-ocr (accessed on 17 March 2025).
  17. JaidedAI. EasyOCR. 2020. Available online: https://github.com/JaidedAI/EasyOCR (accessed on 17 March 2025).
  18. Das, S.; Banerjee, P.; Seraogi, B.; Majumder, H.; Mukkamala, S.; Roy, R.; Chaudhuri, B.B. Hand-Written and Machine-Printed Text Classification in Architecture, Engineering & Construction Documents. In Proceedings of the 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), Niagara Falls, NY, USA, 5–8 August 2018; pp. 546–551. [Google Scholar] [CrossRef]
  19. Mani, S.; Haddad, M.A.; Constantini, D.; Douhard, W.; Li, Q.; Poirier, L. Automatic Digitization of Engineering Diagrams using Deep Learning and Graph Search. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 673–679. [Google Scholar] [CrossRef]
  20. Rahul, R.; Paliwal, S.; Sharma, M.; Vig, L. Automatic Information Extraction from Piping and Instrumentation Diagrams. arXiv 2019, arXiv:1901.11383. [Google Scholar] [CrossRef]
  21. Kang, S.O.; Lee, E.B.; Baek, H.K. A Digitization and Conversion Tool for Imaged Drawings to Intelligent Piping and Instrumentation Diagrams (P&ID). Energies 2019, 12, 2593. [Google Scholar] [CrossRef]
  22. Moreno-García, C.F.; Elyan, E.; Jayne, C. Heuristics-Based Detection to Improve Text/Graphics Segmentation in Complex Engineering Drawings. In Proceedings of the Engineering Applications of Neural Networks, Online, 2 August 2017; Boracchi, G., Iliadis, L., Jayne, C., Likas, A., Eds.; Springer: Cham, Switzerland, 2017; pp. 87–98. [Google Scholar]
  23. Stegmaier, V.; Jazdi, N.; Weyrich, M. A method for the automated digitalization of fluid circuit diagrams. Comput. Ind. 2024, 162, 104139. [Google Scholar] [CrossRef]
  24. Moreno-García, C.F.; Elyan, E.; Jayne, C. New trends on digitisation of complex engineering drawings. Neural Comput. Appl. 2018, 31, 1695–1712. [Google Scholar] [CrossRef]
  25. ISO 7200:2004; Technical Product Documentation—Data Fields in Title Blocks and Document Headers. ISO: Geneva, Switzerland, 2004.
  26. ISO 5459-1:2011; Geometrical Product Specifications (GPS)—Geometrical Tolerancing—Datums and Datum Systems. ISO: Geneva, Switzerland, 2011.
  27. ASME Y14.5; ASME Y14.5—Dimensioning and Tolerancing. ASME: New York, NY, USA, 2018.
  28. Das, A.K.; Langrana, N.A. Recognition and Integration of Dimension Sets in Vectorized Engineering Drawings. Comput. Vis. Image Underst. 1997, 68, 90–108. [Google Scholar] [CrossRef]
  29. Lu, Z. Detection of text regions from digital engineering drawings. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 431–439. [Google Scholar] [CrossRef]
  30. Dori, D.; Velkovitch, Y. Segmentation and Recognition of Dimensioning Text from Engineering Drawings. Comput. Vis. Image Underst. 1998, 69, 196–201. [Google Scholar] [CrossRef]
Figure 1. A simplified flowchart illustrating the tool’s information processing workflow. Light yellow boxes indicate stages where deep learning plays a fundamental role. The tool outputs four information lists (green boxes) containing both textual and positional data.
Figure 2. Intermediate segmentation processes. (a) displays the peaks detected that can define the frame, (b) illustrates the identification of boxes, and (c) presents the result after performing recognition on the FCFs and tables prior to processing the image for dimensions. The ED was sourced from DigiEDraw [4].
Figure 3. Intermediate dimension pipeline processes. (a) displays the raw detected text boxes in orange and the cluster boxes in blue, (b) illustrates the final processing of boxes without template matching, and (c) presents the final result after template matching.
Figure 4. Information block extracted from the same drawing used in Figure 3 using the eDOCr2 segmentation process, and the results produced by the Qwen2-VL-7B model.
Figure 5. Segmented result after running eDOCr2 on an ED produced in the university workshop.
Figure 6. eDOCr2 and eDOCr comparison in two different EDs from the benchmark. eDOCr was worse at segmenting the image and detecting text.
Figure 7. Segmentation and detection results of eDOCr2 for three EDs published by Scheibel et al. [4]. The detected frame is shown in light blue, segmented tables in light yellow, FCFs masked in brown, dimensions highlighted in green, and other detected information in the ED masked in dark blue.
Figure 8. Detection and segmentation results of eDOCr2 on the case study from Lin et al. [6].
Table 1. A summary of the recognition results for the ED shown in Figure 3, using the dimension pipeline from eDOCr2 (second column) and replacing it with VL models (third and fourth columns). Results below the horizontal line represent information that is not a dimension, since it refers to surface roughness or other information. Substitutions and inclusions are highlighted in red, and omissions are shown in strike-through red.
Ground Truth | eDOCr2 | eDOCr2 (Qwen2-VL-7B) | eDOCr2 (GPT4o)
12,5 | 12,5 | 12,5 | 12,5
15 | 15 | 15 | 15
25 | 25 | 25 | 25
⌀16,5 +0,1 0 | ⌀16,5 +0,1 0 | 16,5 +0.1 0 | ⌀16,5 +0,1 0
2x⌀19,9 0 −0,1 | 2x⌀19,9 0:0,1 | 2x⌀19,9 +0.1−0.1 | 2x ⌀19,9 0 −0,1
2xR0,15 | 2xR0,15 | 2xR0,15 | 2xR0,15
2xR0,2 | 2xR0,2 | 2xR0,2 | 2xR0,2
2,5 +0,2 0 | 2,5 +0,2 0 | 2,5 +0.2 0 | 2,5 +0,2 0
20° | 20° | 20° | 20°
20 | 20 | 20 | 20
M8 | M8 | M8 | M8
16 | 16 | 16 | 16
2 | 2 | 2 | 2
1,6 | 1,6 | 1,6 | 1,6
A 20:1 | A 20:1 | - | -
0,8 | 0,8 | 0,8 | 0,8
1,6 | 1,6 | - | -
0 | - | - | 0
- | - | 20 | -
Table 2. A benchmark against state-of-the-art frameworks on the testing ED set. The dimension recall of GPT4o is conditional, since the model excludes repeated dimensions. * When repeated dimensions were counted as matches, the recall for GPT4o and eDOCr2 (GPT4o) improved to 92.68% and 96.34%, respectively.
Metric | Dim. Recall | FCF Recall | FCF CER | Dim. CER
eDOCr2 | 93.75% | 100% | 5.70% | 0.73%
GPT4o | 84.14% * | 0% | - | 1.95%
eDOCr2 (GPT4o) | 87.80% * | 100% | 5.70% | 1.83%
Table 3. Ablation study of processing steps. Impact of suppressing various processing steps on different metrics. For text prediction, the metrics presented are recall, precision, and Intersection over Union. For dimension recognition, character error rate (CER) and character recall were measured. Finally, the average processing time per drawing was also considered.
Suppressed Step | Recall | Precision | IoU | CER | Char. Recall | Time (s)
Frame detection | 95.0% | 66.67% | 83.58% | 3.59% | 97.61% | 10.65
Sliding window | 66.25% | 94.64% | 79.06% | 8.69% | 93.04% | 7.70
Template match | 93.75% | 94.94% | 85.30% | 2.20% | 97.80% | 10.39
Correct Stroke | 93.75% | 96.15% | 86.38% | 2.69% | 98.77% | 10.46
All Enabled | 93.75% | 96.15% | 86.38% | 0.73% | 99.27% | 10.69
Table 4. Quantitative comparison of eDOCr2 against eDOCr. The improvement becomes evident in detection precision, CER, and processing times.
Contribution | Detection Recall | Detection Precision | CER | Time (s)
eDOCr [1] | 91% | 70% | 8% | 53.5
eDOCr2 | 94% | 96% | 0.7% | 10.7
Table 5. Comparison between eDOCr2 and DigiEDraw for the EDs and metrics published in their paper [4].
ED Sample | G | E | A | A2 | H | A3 | Average
Detection Recall:
DigiEDraw [4] | 82% | 93% | 85% | 93% | 85% | 91% | 88%
eDOCr2 | 91% | 94% | 100% | 86% | 92% | 100% | 94%
Detection Precision:
DigiEDraw [4] | 60% | 88% | 58% | 87% | 73% | 67% | 72%
eDOCr2 | 91% | 100% | 89% | 100% | 92% | 86% | 93%
Table 6. In the case study presented by Lin et al. [6], detection and recognition accuracy were perfect when using eDOCr2, while the prediction time was reduced by more than half.
Contribution | Detection Acc. | Recognition Acc. | Time (s)
Lin et al. [6] | 70% | 80% | 28.8
eDOCr2 | 100% | 100% | 13.4
