**Enhanced Image-Based Endoscopic Pathological Site Classification Using an Ensemble of Deep Learning Models**

## **Dat Tien Nguyen, Min Beom Lee, Tuyen Danh Pham, Ganbayar Batchuluun \*, Muhammad Arsalan and Kang Ryoung Park**

Division of Electronics and Electrical Engineering, Dongguk University, 30 Pildong-ro 1-gil, Jung-gu, Seoul 04620, Korea; nguyentiendat@dongguk.edu (D.T.N.); mblee@dongguk.edu (M.B.L.); phamdanhtuyen@dongguk.edu (T.D.P.); arsal@dongguk.edu (M.A.); parkgr@dongguk.edu (K.R.P.) **\*** Correspondence: ganabata87@dongguk.edu; Tel.: +82-10-9948-3771; Fax: +82-2-2277-8735

Received: 9 September 2020; Accepted: 21 October 2020; Published: 22 October 2020

**Abstract:** In vivo diseases such as colorectal cancer and gastric cancer are increasingly common in humans. These are two of the most common types of cancer that cause death worldwide. Therefore, the early detection and treatment of these types of cancer are crucial for saving lives. With the advances in technology and image processing techniques, computer-aided diagnosis (CAD) systems have been developed and applied in several medical systems to assist doctors in diagnosing diseases using imaging technology. In this study, we propose a CAD method to preclassify in vivo endoscopic images into negative (images without evidence of a disease) and positive (images that possibly include pathological sites such as a polyp or suspected regions with complex vascular information) cases. The goal of our study is to help doctors focus on the positive frames of an endoscopic sequence rather than the negative frames, thereby enhancing the performance and reducing the workload of doctors in the diagnosis procedure. Although previous studies have addressed this problem, they were mostly based on a single classification model, which limits the classification performance. Therefore, we propose the use of multiple classification models based on ensemble learning techniques to enhance the performance of pathological site classification. Through experiments with an open database, we confirmed that an ensemble of multiple deep learning-based models with different network architectures enhances the performance of pathological site classification in a CAD system compared to state-of-the-art methods.

**Keywords:** pathological site classification; in vivo endoscopy; computer-aided diagnosis; artificial intelligence; ensemble learning

#### **1. Introduction**

Currently, cancer is a leading cause of human death worldwide [1]. There are several types of cancer such as lung [2,3], breast [4,5], skin [6,7], stomach [8–10], colorectal (also known as colon cancer) [11–16], thyroid [17–19], and brain [20–22] cancers. Among these, stomach and colorectal cancer (CRC) are two of the most common types causing death in humans. To diagnose these types of cancer, in vivo endoscopy is widely used. This technique allows for a detailed visualization of the in vivo structure of the colon or stomach, which is significantly useful to doctors for examining the evidence of disease. However, conventional medical imaging-based diagnostic techniques are still predominantly dependent on the personal knowledge and experience of doctors (radiologists). Consequently, the diagnosis results have a large variance. To reduce this variance, a double-screening process can be invoked, in which two or more experts (radiologists) are required to read the captured medical images of a single case. Although this method is more efficient than the conventional diagnosis method, it is costly and time-consuming. Recently, computer-aided diagnosis (CAD) has been widely employed to assist doctors in the diagnosis process, and it is now becoming important for enhancing the performance of the diagnosis process using medical imaging techniques.

Depending on the stage of the disease, several signatures may appear on the colon or stomach, such as polyps or gastritis regions. These regions are called pathological sites. Previous studies on pathological site detection/classification mostly used a single convolutional neural network (CNN). In a previous study [11], Patino-Barrientos et al. proposed the use of a Visual Geometry Group (VGG)-based network for classifying endoscopic colon images into malignant and nonmalignant (benign) cases. Similarly, Ribeiro et al. [13] used a CNN for the automated classification of polyp images. Experimental results with a relatively shallow network (five convolution layers and three fully connected layers) showed that the CNN is useful for polyp classification problems and outperforms other handcrafted feature extraction-based methods. Instead of training a new classification network, as done in the studies by Patino-Barrientos et al. [11] and Ribeiro et al. [13], Fonolla et al. [16] used a pretrained CNN model based on the residual network [23], trained on the ImageNet dataset, as an image feature extractor, and classified input images into malignant and nonmalignant categories using conventional classification methods. Zhu et al. [9] also used a deep residual network and the transfer learning technique to determine the invasion depth of gastric cancer and reported the classification accuracy. They showed that the CNN-based method achieves significantly higher accuracy than human endoscopists. Similarly, Li et al. [10] used a CNN model, named GastricNet, for gastric cancer identification. As reported in their study, deep CNN-based methods are efficient for gastric cancer identification problems.

Although these studies showed that CNN-based methods have been successfully applied to the pathological site (polyp) classification problem, they all share a common limitation, namely the use of a single CNN model. As indicated by several previous studies on CNNs, the performance of a CNN-based method is highly dependent on several factors such as the depth and width of the network, the number of network parameters, and the architecture of the network. Among these factors, the design of the network architecture plays an important role, especially as the depth of the network increases. All these previous studies used relatively shallow networks [11,13] or extracted image features at the last convolution layer of a deep residual network for classification [16]. As indicated by previous studies, the depth of a deep learning-based system can significantly enhance the performance of a detection/classification system, while a small number of network parameters helps to prevent over/underfitting problems and makes the network easy to train [23–27]. Therefore, the performance of a classification system is limited by the use of a shallow network or the extraction of image features using a pretrained model. In addition, it is very difficult to recognize pathological sites as benign or malignant at an early stage. Therefore, the classification of polyps into benign or malignant cases, as performed in previous studies, can yield incorrect results at the early stage of a disease.

In a recent study, Kowsari et al. [28] proposed a new ensemble deep learning approach for classification, namely random multi-model deep learning (RMDL). The RMDL approach addresses the problem of finding the best deep learning structure and architecture to improve classification. The authors showed that the ensemble learning method is efficient in enhancing the classification performance of various systems, such as image- and text-based classification systems. Inspired by the work of Kowsari et al., we propose the use of multiple CNNs that differ in network architecture and depth to efficiently extract image texture features from input images and thereby overcome these limitations of previous studies.

In contrast to previous studies that classify pathological site images into malignant and nonmalignant cases, our study classifies an input endoscopy image into one of two categories, with or without the appearance of pathological sites. Although this task can be accomplished by using a method that detects pathological sites in endoscopic images, the training process of a detection method requires considerable effort to accurately localize the pathological sites in the training dataset. Moreover, at the early stage of the disease, the pathological sites are small and/or unclear, which can cause difficulties in creating ground-truth labels as well as in the detection process. Therefore, we simplified this task by classifying the input images into two classes (with or without the appearance of pathological sites) without requiring the correct labeling of pathological sites in the input training images. This approach helps doctors focus on images with pathological sites rather than normal images during the diagnosis process. Table 1 lists comparative summaries of the proposed and previous studies. Our proposed method is novel in the following four ways:


The remainder of this paper is organized as follows. In Section 2, we propose an image-based pathological site classification method based on an ensemble of deep learning models. In Section 3, we validate the performance of the proposed method using a public in vivo gastrointestinal (GI) endoscopic image dataset, namely the in vivo GI endoscopy dataset [30], compare it with previous studies, and discuss our results. Finally, we present the conclusions of our study in Section 4.


**Table 1.** Comparative summaries of proposed and previous studies on image-based polyp or pathological site classification.

#### **2. Proposed Method**

#### *2.1. Overview of Proposed Method*

Figure 1 shows the overall procedure of our proposed method for enhancing the performance of the pathological site classification system. As explained in Section 1, previous studies predominantly classified the pathological sites using conventional classification methods [11] or used a single CNN model [9–11,13,16]. Consequently, the classification performance is limited. These studies have a common drawback in that they used a single network architecture for the classification problem; therefore, the extracted image features depend on that network architecture. As stated in previous studies [23–26], the architecture of a CNN plays an important role in the performance of deep learning-based systems. For example, conventional CNNs, which are a linear stack of convolution layers, are suitable for a shallow design that solves a simple problem. The residual and dense networks are used to increase the depth of conventional CNNs; consequently, they can more easily learn a complex problem. The inception network, in turn, is used to extract richer features in a single network compared to conventional CNNs. Based on these observations, we designed our network for the pathological site classification problem by incorporating these observations into a single classification framework, as shown in Figure 1.

**Figure 1.** Overview of the proposed method.

As shown in Figure 1, our proposed method first preprocesses the input endoscopic images to reduce noise and prepare them for input to the subsequent stages, which are based on deep learning techniques for the classification problem. A detailed explanation of this step is presented in Section 2.2. Our proposed method comprises the following three main steps for classification. The first step is classification using a conventional CNN based on the VGG16 network architecture [25]. The second step is classification using an inception-based network to extract richer features according to the different sizes of objects [26]. Finally, the third step is based on a DenseNet architecture to exploit the effect of a significantly deep network [31]. From these three networks, we obtain three classification results. Individually, these three branches are equivalent to the previous studies that are based on a single CNN for classification. To enhance the performance of pathological site classification, our study further combines these classification results using the ensemble learning technique and performs classification based on the combined results. These steps are explained in detail in Section 2.3.

#### *2.2. Preprocessing of Captured Endoscopic Images*

As explained in Section 2.1, the first step in our proposed method is the preprocessing step, which eliminates redundant information from the endoscopic images before inputting them to the CNNs. In Figure 2a, we show an example of a captured GI endoscopic image. As shown in this figure, the captured image normally contains two parts: the background region with low illumination and the foreground region with higher illumination. The background contains no useful information for our classification steps; therefore, it should be removed before classification. This step is useful because the background region not only contains no information about the pathological sites but also introduces noise that can decrease the performance of the subsequent classification networks.

As shown in Figure 2a, the background region appears with significantly lower illumination compared to the foreground region. Based on this characteristic, we implemented a simple method for background removal. A graphical explanation using a GI endoscopic image is shown in Figure 2a, where the accumulated histogram-like features of pixels in the horizontal direction are represented. As shown in this figure, we first accumulate the histogram-like features of the input image by projecting the gray levels of the image pixels in the horizontal and vertical directions. Because of the low illumination of the background regions compared to the foreground regions, we can set a threshold for separating the background and foreground. An example of the result of this step using the image in Figure 2a is shown in Figure 2b. We can observe that, although Figure 2b still contains certain small background regions owing to the characteristics of the capturing devices, most of the background region was removed from the input image, and the image is ready for further use in our proposed method.

**Figure 2.** (**a**) Example of a captured gastrointestinal endoscopic image and (**b**) the corresponding result of the preprocessing step.
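The projection-based background removal described above can be sketched as follows. This is a minimal illustration only: the paper does not specify the threshold value or whether summed or mean projections are used, so the `min_level` value and the use of mean projections are our assumptions.

```python
import numpy as np

def remove_background(gray, min_level=30):
    # Accumulate gray levels along the vertical and horizontal directions
    # (the "histogram-like" projections described above), then keep only
    # the rows/columns whose mean intensity exceeds a threshold that
    # separates the dark background from the brighter foreground.
    col_mean = gray.mean(axis=0)   # projection onto the horizontal axis
    row_mean = gray.mean(axis=1)   # projection onto the vertical axis
    cols = np.where(col_mean > min_level)[0]
    rows = np.where(row_mean > min_level)[0]
    return gray[rows.min():rows.max() + 1, cols.min():cols.max() + 1]

# toy image: dark background with a bright "foreground" block
img = np.zeros((100, 100))
img[20:80, 30:90] = 200
cropped = remove_background(img)   # keeps only the bright region
```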

#### *2.3. Pathological Site Classification Method*

#### 2.3.1. Deep Learning Framework

Recently, with the development of learning-based techniques, deep learning has been widely used in computer vision research. This technique has been successfully applied to various computer vision/pattern recognition problems such as image classification [23–27], object detection [32,33], and image generation [34–36]. The success of deep learning comes from the fact that this technique simulates the way in which the human brain processes information (images). Originally, a deep learning network was constructed by using several neural network layers to create a deep network that processes incoming information (images, voice, text, etc.). For computer vision research, an efficient type of deep learning technique, called the CNN, has received significant attention. The idea of this type of deep learning technique is the use of convolution operations to extract useful information from input images and the use of a neural network to efficiently solve a classification/regression problem. Figure 3 shows a graphical representation of a conventional CNN. As shown in this figure, a conventional CNN is composed of two primary parts, i.e., multiple convolution layers and classification/regression layers. The primary purpose of the convolution layers is to extract abstract and useful texture features that satisfactorily represent the characteristics/content of input images. The classification/regression layers are used to classify the input images (image classification problem) or regress continuous values such as the height and width position of an object (object detection problem) based on the extracted image features produced by the first part of the CNN. Because of the use of convolution operations with a weight-sharing scheme, the number of network parameters is significantly reduced compared to a fully connected network with the same number of layers. Consequently, this helps to successfully train a deep network and reduces the over/underfitting problem that normally occurs in deep neural networks because of the large number of network parameters. However, with the increase in the depth and width of modern networks, the number of network parameters is still large, which hinders the successful training of a significantly deep network.

**Figure 3.** Conventional architecture of convolutional neural networks.
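The two-part structure in Figure 3 can be illustrated with a minimal single-channel forward pass in NumPy: convolution and pooling extract features, and a fully connected softmax layer performs classification. All sizes, the random weights, and the single conv layer are illustrative assumptions, not the networks actually used in this study.

```python
import numpy as np

def conv2d(img, kernel):
    """Naive 'valid' 2D convolution for a single-channel image."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0)

def max_pool(x, size=2):
    h, w = x.shape
    h, w = h - h % size, w - w % size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# forward pass: conv -> ReLU -> pool (feature extraction),
# then fully connected + softmax (classification into 2 classes)
rng = np.random.default_rng(0)
image = rng.random((28, 28))
kernel = rng.standard_normal((3, 3))
features = max_pool(relu(conv2d(image, kernel)))    # feature maps
weights = rng.standard_normal((2, features.size))   # 2 output neurons
probs = softmax(weights @ features.ravel())         # class probabilities
```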

#### 2.3.2. Enhanced Convolutional Neural Network (CNN) Structures for Efficient Feature Extraction

As explained in Section 2.3.1, although the CNN is efficient for many computer vision tasks, it still has several drawbacks caused by the large number of layers (depth) and network parameters. As the number of layers in a CNN increases, the network becomes more difficult to train and suffers from the vanishing gradient problem. In addition, training a network with a large number of parameters requires a large amount of training data to prevent the over/underfitting problem. Further, extracting efficient texture features is crucial for enhancing the performance of a CNN-based system. To reduce the effects of these problems and enhance the performance of CNN-based systems, several network architectures have been proposed. In our study, we used two popular network architectures designed for this purpose, namely the inception and dense networks.

First, the conventional CNN is constructed by linearly stacking layers to create a deep network, and each convolution layer uses a fixed convolution kernel (such as a 3 × 3, 5 × 5, or 7 × 7 kernel), as shown in Figure 3. However, when the problem becomes complex, with complex texture structures in the input images and/or objects of different sizes, the use of a fixed, single convolution kernel is insufficient for extracting efficient features for the classification/detection problem. To solve this problem, Szegedy et al. [26] proposed a new network architecture, named the inception block, to extract richer information from input images than conventional convolution layers. The concept of the inception block is depicted in Figure 4. In this figure, the input from the previous layer is shown as a red box, while the output and the different feature maps obtained using various convolution kernels are marked as blue, yellow, green, and purple boxes. As shown in Figure 4, the inception block is constructed using multiple convolution layers with different convolution kernels, such as 1 × 1, 3 × 3, and 5 × 5, and a max-pooling layer. These operations are performed in parallel, and the results of all operations are concatenated to form the final output of the inception layer. It can easily be observed from Figure 4 that the inception layer can extract more useful texture information than a conventional convolution layer because of the use of different convolution kernels. A small convolution kernel can extract small texture features (small object texture); by increasing the kernel size, larger receptive fields are used to extract texture information from the input images. By concatenating the results of all the single convolution operations, the final feature maps are expected to contain a larger amount of information at various texture levels compared to a conventional convolution layer. Thus, more useful and efficient information is extracted using the inception layer. As demonstrated by the authors of the inception method, it is significantly more efficient than a conventional CNN for the classification problem using the ImageNet dataset [26].

**Figure 4.** Naïve inception architecture for rich feature extraction used in inception network.
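The naïve inception block of Figure 4 can be sketched in NumPy as parallel 'same'-padded branches (1 × 1, 3 × 3, and 5 × 5 convolutions plus max pooling) whose outputs are concatenated along a channel axis. Single-channel input and random kernels are simplifying assumptions for illustration.

```python
import numpy as np

def conv_same(x, k):
    """'Same'-padded 2D convolution for a single-channel image."""
    ph, pw = k.shape[0] // 2, k.shape[1] // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def max_pool_same(x, size=3):
    """'Same'-padded max pooling with stride 1."""
    pad = size // 2
    xp = np.pad(x, pad, constant_values=-np.inf)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = xp[i:i + size, j:j + size].max()
    return out

def naive_inception(x, k1, k3, k5):
    # parallel branches with different receptive fields, concatenated
    # along the channel axis to form the block's output
    branches = [conv_same(x, k1), conv_same(x, k3),
                conv_same(x, k5), max_pool_same(x)]
    return np.stack(branches, axis=-1)

rng = np.random.default_rng(0)
x = rng.random((16, 16))
out = naive_inception(x,
                      rng.standard_normal((1, 1)),
                      rng.standard_normal((3, 3)),
                      rng.standard_normal((5, 5)))
```

Each branch preserves the spatial size, so the concatenated output holds one feature map per branch at every location.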

The second CNN architecture used in our study, the dense connection, was proposed by Huang et al. [31] to alleviate the vanishing gradient problem, reuse features, make the network easier to train, and reduce the number of network parameters. In Figure 5, we show the difference between the conventional CNN architecture (Figure 5a) and the dense-connection network architecture (Figure 5b). In this figure, Conv-ReLU-BN indicates the sequence of convolution (Conv), activation (rectified linear unit, ReLU), and batch normalization (BN) blocks that are used to manipulate the input feature maps. While the conventional CNN architecture only has a connection between a single parent/child layer pair, the dense-connection architecture uses all the outputs of the preceding layers as inputs to the current layer. This design helps to reuse the features from early layers and reduces the effects of the vanishing gradient problem. By using dense connections, the network can be thinner and more compact. Consequently, this helps to reduce the number of network parameters that normally cause the over/underfitting problem in CNNs.

**Figure 5.** Comparison between (**a**) plain convolutional blocks and (**b**) dense-connection block architecture.
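The dense-connectivity pattern of Figure 5b can be sketched as follows: each layer receives the channel-wise concatenation of all preceding feature maps. The per-layer "Conv-ReLU-BN" operation is replaced here by a toy 1 × 1 channel-mixing step with ReLU; the sizes and the growth rate of 2 are illustrative assumptions.

```python
import numpy as np

def dense_block(x, layers):
    """Each layer sees the channel-concatenation of all preceding outputs."""
    features = [x]
    for layer in layers:
        inp = np.concatenate(features, axis=-1)  # dense connectivity
        features.append(layer(inp))
    return np.concatenate(features, axis=-1)

def make_layer(in_ch, growth, rng):
    # toy stand-in for a Conv-ReLU-BN unit: 1x1 channel mixing + ReLU,
    # producing `growth` new feature maps
    w = rng.standard_normal((in_ch, growth))
    return lambda t: np.maximum(t @ w, 0)

rng = np.random.default_rng(0)
x = rng.random((8, 8, 4))                       # H x W x C input
layers = [make_layer(4, 2, rng),                # sees 4 channels
          make_layer(6, 2, rng),                # sees 4 + 2 channels
          make_layer(8, 2, rng)]                # sees 4 + 2 + 2 channels
out = dense_block(x, layers)                    # 4 + 3 * 2 = 10 channels
```

Because earlier feature maps are reused rather than recomputed, each layer only needs to produce a small number of new channels, which is why dense networks can stay thin and compact.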

Inspired by the advantages of the inception and dense-connection networks, we used these two networks in addition to the conventional CNN based on the VGG architecture in our proposed method and experiments. Using these three networks with three different architectures, we processed input images in a different manner to enhance the performance of our classification system when combining the strength of each network.

#### 2.3.3. Ensemble of CNN Models for Pathological Site Classification

Based on the success of the CNN in the image classification problem, we propose a method for pathological site image classification, as shown in Figure 1 and Section 2.1. As explained in Section 2.3.1, although CNNs have been successfully applied to various computer vision problems, this technique still has several drawbacks that prevent us from obtaining high-performance classification systems. First, the large number of network parameters can increase the difficulty of training the network and can also result in the over/underfitting problem. Second, differences in the size of the objects (texture features) that appear in the input images can affect the classification performance. To reduce the effects of these drawbacks and enhance the classification performance of the pathological site classification system, we propose the use of an ensemble learning technique that combines the classification results of three different CNN architectures, as shown in Figure 1.

In our study, we used three CNN architectures with different characteristics and depths for the classification problem, including VGG-based, DenseNet-based, and inception-based networks, as shown in Figure 1. Although it is possible to use other network architectures, we selected these network architectures because of our purpose for ensemble learning, that is, to combine the classification results of different network architectures in which each network classifies input images in a specific way. In our study, the VGG-based network serves as a conventional deep CNN for classification problems, which is composed of a linear stack of convolution layers. The DenseNet-based network serves as a very deep CNN with a short-cut path, which helps to easily train the network and extract more abstract and efficient image features. Further, the inception-based network helps extract image features with rich texture features and different sizes of objects. The detailed descriptions of these network architectures are listed in Table 2.


**Table 2.** Detailed description of the convolutional neural networks (CNNs) used in our study (N/A means "not available").

In this table, the convolution layers of the base networks (VGG16 [25], DenseNet121 [31], and inception [26] networks) are marked as "Main Convolution Layers." For the classification part, we modified the number of output neurons from 1000 in the original networks (the original networks were designed for the ImageNet classification challenge and therefore contain 1000 output neurons) to 2 neurons that represent the two possible cases of our problem (with/without the appearance of pathological sites).

In the final step of our proposed method, as mentioned in Section 2.1, we combine the classification results of the three different CNN models, i.e., the VGG-, inception-, and DenseNet-based networks. In our study, we evaluate three combination methods, namely the MAX, AVERAGE, and VOTING rules, which are presented in Equations (1)–(3), respectively.

$$\text{MAX rule} = \operatorname*{argmax}\big(\max_{i}(S_i)\big) \tag{1}$$

$$\text{AVERAGE rule} = \operatorname*{argmax}\left(\frac{\sum_{i=1}^{n} S_i}{n}\right) \tag{2}$$

$$\text{VOTING rule} = \operatorname{sign}\left(\sum_{i=1}^{n} w_i \times \operatorname*{argmax}(S_i)\right) \tag{3}$$

In Equations (1)–(3), *S<sub>i</sub>* indicates the classification probability vector of the *i*th classifier, and *n* indicates the number of classifiers. The "*argmax*" operator selects the class label whose classification probability is maximum. In our experiments, we used *n* = 3 because three classifiers are used. As explained in the previous sections, we are dealing with a binary classification problem. Therefore, *S<sub>i</sub>* is a vector of two components, i.e., *S<sub>i</sub>* = (*S*<sub>0*i*</sub>, *S*<sub>1*i*</sub>), where *S*<sub>0*i*</sub> indicates the probability that the input image belongs to class 0 (negative case, without pathological sites), and *S*<sub>1*i*</sub> the probability that the input image belongs to class 1 (positive case, with pathological sites). In Equation (3), *w<sub>i</sub>* indicates the weight of the *i*th classifier, and *argmax*(*S<sub>i</sub>*) indicates the classification result of the *i*th classifier (*argmax*(*S<sub>i</sub>*) = 0 predicts that the input image belongs to class 0, and *argmax*(*S<sub>i</sub>*) = 1 predicts that it belongs to class 1). The weight value *w<sub>i</sub>* is determined using Equation (4). In our experiments, we measure the performance of the classification system using all three combination rules and select the rule most suitable for our pathological site classification problem based on the experimental results, as shown in Section 3.

$$w_i = \begin{cases} -1 & \text{if } \operatorname*{argmax}(S_i) = 0 \\ +1 & \text{if } \operatorname*{argmax}(S_i) = 1 \end{cases} \tag{4}$$
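For the binary case, the three combination rules can be sketched as follows. Following Equations (3) and (4), we read the VOTING rule as a majority vote over the per-classifier decisions (each classifier votes +1 for class 1 or −1 for class 0, and the sign of the sum decides); this reading is our interpretation.

```python
import numpy as np

def max_rule(S):
    # Eq. (1): class whose best single-classifier probability is highest
    return int(np.argmax(S.max(axis=0)))

def average_rule(S):
    # Eq. (2): class with the highest mean probability over classifiers
    return int(np.argmax(S.mean(axis=0)))

def voting_rule(S):
    # Eqs. (3)-(4), read as a majority vote: w_i = +1 for a class-1 vote,
    # w_i = -1 for a class-0 vote; the sign of the sum gives the decision
    w = np.where(np.argmax(S, axis=1) == 1, 1, -1)
    return int(w.sum() > 0)

# rows = classifiers, columns = (class 0, class 1) probabilities
S = np.array([[0.6, 0.4],
              [0.3, 0.7],
              [0.2, 0.8]])
```

Here two of the three classifiers favor class 1 and the strongest single score (0.8) also belongs to class 1, so all three rules agree.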

#### *2.4. Classification Performance Measurement*

To measure the performance of a CAD system, previous studies used three popular metrics: sensitivity, specificity, and overall accuracy [37,38]. These metrics measure three different aspects of a binary classification system, i.e., the classification/detection ability of the system with respect to the positive (with the appearance of disease), negative (without the appearance of disease), and overall cases. The definitions of these metrics are presented in Equations (5)–(7). In these equations, Sens, Spec, and Accuracy stand for sensitivity, specificity, and overall accuracy; TP indicates the number of true-positive cases (a positive case correctly classified as positive); FN indicates the number of false-negative cases (a positive case incorrectly classified as negative); TN indicates the number of true-negative cases (a negative case correctly classified as negative); finally, FP indicates the number of false-positive cases (a negative case incorrectly classified as positive).

$$\text{Sens} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{5}$$

$$\text{Spec} = \frac{\text{TN}}{\text{TN} + \text{FP}} \tag{6}$$

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \tag{7}$$

As indicated in Equation (5), the sensitivity is the ratio of the TP samples to the total number of positive samples (TP + FN). Therefore, it indicates the ability of the classification system to detect positive cases (distinguish disease cases from all possible disease cases). The specificity is the ratio of the TN samples to the total number of negative samples; consequently, the specificity measures the classification performance in detecting negative samples from all possible negative samples. Finally, the overall accuracy is computed as the ratio of all correctly classified samples (TP + TN) to all testing samples (TP + TN + FP + FN). Thus, the overall accuracy indicates the ability of a classification system to classify samples correctly over the entire set of samples.
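Equations (5)–(7) can be computed directly from the predicted and ground-truth labels, as in the following short sketch (labels: 1 = positive, 0 = negative):

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, and overall accuracy per Eqs. (5)-(7)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn)                       # Eq. (5)
    spec = tn / (tn + fp)                       # Eq. (6)
    acc = (tp + tn) / (tp + tn + fp + fn)       # Eq. (7)
    return sens, spec, acc
```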

In our study, we used these three measurements to evaluate the performance of our proposed method. In addition, as indicated by the meaning of the overall accuracy, we used the overall accuracy measurement to compare the performance of our proposed method with that of previous studies.

#### **3. Experimental Results**

Using the proposed method described in Section 2, we conducted various experiments with the publicly available in vivo GI endoscopy dataset [30] to measure the classification performance of our proposed method. The detailed experimental results are presented in the following subsections.

#### *3.1. Dataset and Experimental Setups*

To evaluate the performance of our proposed method and compare it with previous studies, we conducted experiments using a public dataset, namely the in vivo GI endoscopic dataset [30]. We call this dataset Hamlyn-GI for convenience, as it was collected and provided by the Hamlyn Center for Robotic Surgery [30]. This dataset was originally collected for the tracking and retargeting of GI endoscopic pathological sites using Olympus narrow-band imaging and Pentax i-Scan endoscope devices [30]. Specifically, this dataset contains 10 video sequences of GI endoscopic scans. Each video is saved as a sequence of successive still images. In the study by Ye et al. [30], the authors first manually defined a pathological site (a small polyp or suspected region) at the beginning of each endoscopic image sequence. Then, they tracked and retargeted this region for the remaining images of the sequence. The information regarding the selected region and the ground-truth tracked-retargeted region is provided for each video sequence in an annotation file. Because these regions were carefully set by experts and focus on polyp or possible polyp regions, we used the information in the annotation file as an indicator of the existence of pathological sites in the still images. If a still image contains a pathological site region, an approximate location of the pathological site is provided in the annotation file; otherwise, a negative value is provided. Based on the information provided in the annotation file, we preclassified the still images into two categories: with and without the existence of a pathological site in the stomach. That is, if the annotation of a still image is provided, the still image is considered to contain a pathological site and is assigned to the "with pathological site" class; otherwise, the still image is assigned to the "without pathological site" class. In Figure 6, we show examples of images from the Hamlyn-GI dataset. Figure 6a shows example images without the existence of pathological sites, whereas Figure 6b shows images containing pathological sites, which are marked with white bounding boxes. From Figure 6b, it can be observed that the bounding boxes are only approximately provided by the authors of the dataset; therefore, they do not fit the correct location of the pathological site regions.
Consequently, it is difficult to perform a detection to solve this problem. Instead, we classified the input still images of this dataset into two classes: With and without the existence of pathological sites. In Table 3, we list the detailed statistical information regarding the Hamlyn-GI dataset. In total, the Hamlyn-GI dataset contains 7894 images.

**Table 3.** Detailed description of images in the Hamlyn-GI dataset.


To measure the performance of the proposed method, we performed two-fold cross-validation. For this purpose, we divided the Hamlyn-GI dataset into two separate parts, namely the training and testing datasets. In the first fold, we assigned the images of the first five video sequences (video files 1–5 in Table 3) as the training dataset and the images of the remaining five video sequences (video files 6–10 in Table 3) as the testing dataset. In the second fold, we exchanged the training and testing datasets of the first fold, i.e., the training dataset contains the images of the last five video sequences (video files 6–10 in Table 3) and the testing dataset contains the images of the first five video sequences (video files 1–5 in Table 3). This division method ensures that the images of the same person (identity) exist only in either the training or the testing dataset. Finally, the overall performance with the two-fold cross-validation approach is measured by calculating the average (weighted by the number of testing images) over the two folds. In Table 4, we list a detailed description of the training and testing datasets used in our experiments. In this table, "With PS" indicates the existence of a pathological site, and "Without PS" indicates its absence. Although it is possible to use other cross-validation methods such as three-fold, five-fold, or leave-one-out approaches, we decided to use the two-fold approach in our experiments to reduce the processing time of the experiments.
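The weighted two-fold average described above can be sketched as follows. The per-fold accuracies and testing-set sizes used in the example call are illustrative placeholders, not the paper's actual per-fold numbers.

```python
# Overall accuracy as the per-fold accuracy weighted by the number of
# testing images in each fold (sizes sum to the 7894 images of Hamlyn-GI).
def weighted_average(fold_accuracies, fold_sizes):
    total = sum(fold_sizes)
    return sum(a * n for a, n in zip(fold_accuracies, fold_sizes)) / total

# Placeholder values for illustration only:
overall = weighted_average([70.0, 68.0], [4000, 3894])
```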

**Figure 6.** Example of images: (**a**) Gastric images without the existence of pathological sites and (**b**) gastric images with the existence of pathological sites (Upper and lower images indicate those including polyp and complex vascular regions, respectively).



#### *3.2. Training*

In our first experiment, we performed a training process to train the three deep learning-based models illustrated in Figure 1 and Table 2. For this experiment, we programmed our networks in the Python programming language with the help of the Tensorflow library [39] for the implementation of the deep learning-based models. A detailed description of the parameters is listed in Table 5. For the training method, we used the adaptive moment estimation (Adam) optimizer with an initial learning rate of 0.0001, and we trained each model for 30 epochs. As the number of epochs increases, the network parameters require finer adjustment; therefore, we continuously reduced the learning rate after every epoch. In addition, a batch size of 32 was used in our experiments.
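The per-epoch learning-rate reduction can be sketched as below. The text states only that the rate is continuously reduced after every epoch from an initial value of 0.0001; the exponential decay factor of 0.9 is an assumption for illustration.

```python
# Learning-rate schedule sketch: exponential decay after every epoch.
# DECAY = 0.9 is an assumed value; the paper does not give the factor.
INITIAL_LR = 1e-4
DECAY = 0.9

def lr_schedule(epoch):
    """Learning rate for a given (0-indexed) training epoch."""
    return INITIAL_LR * DECAY ** epoch
```

A function of this form can be passed to, e.g., a Keras `LearningRateScheduler` callback when training each of the three models for 30 epochs with a batch size of 32.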


**Table 5.** Training parameters used in our study.

In Figure 7, we illustrate the results of the training process using the training datasets. As mentioned in Section 3.1, we used a two-fold cross-validation procedure in our experiments. Therefore, we calculated the average result of the two folds and present it in Figure 7. In this figure, we show the loss and training accuracy curves of the training procedure for all three CNN models. As shown in these curves, the losses continuously decrease while the training accuracies increase as the training epoch increases. Thus, we can consider that the training procedures were successful in our experiments.

**Figure 7.** Training results (loss vs. accuracy) of Visual Geometry Group (VGG)-based, Inception-based, and DenseNet-based models in our experiments.

#### *3.3. Testing of Proposed Method (Ablation Studies)*

For the ablation studies, we performed the experiments presented in Sections 3.3.1–3.3.3.

#### 3.3.1. Classification Results Based on Individual CNN Models with the Preprocessing Procedure

As explained in Section 2 and Figure 1, our proposed method is composed of three main classification branches. Based on this structure, in our first experiment, we performed experiments using a single CNN-based classification method, as shown in Figure 1, using each of the three classification networks described in Table 2. For this experiment, we scaled the output images of the preprocessing step to 224 × 224 pixels and inputted them to each individual network for classification. The detailed experimental results are listed in Table 6. From this table, it can be observed that the VGG-based network achieved an overall classification accuracy of 68.912% with a sensitivity of 66.271% and specificity of 71.946%. Similarly, we obtained an overall classification accuracy of 66.505% with a sensitivity of 87.438% and specificity of 42.476% using the inception-based network; moreover, an overall classification accuracy of 52.609% with a sensitivity of 19.554% and specificity of 90.558% was achieved using the DenseNet-based network. These experimental results show that CNNs are suitable for the pathological site classification problem. However, the overall classification performance was still low, as the highest classification accuracy, obtained using the VGG-based network, was only 68.912%. This is caused by the fact that the dataset used in our experiments was not large and the GI endoscopic images have complex structures. In addition, we can observe from these experimental results that the VGG-based network outperforms the inception- and DenseNet-based networks. This is because the inception- and DenseNet-based networks are deeper than the VGG-based network; consequently, their training is affected by the fact that our dataset is not large, with images of only five patients.


**Table 6.** Classification performance using individual CNN-based model (unit: %).

We can also observe from Table 6 that the sensitivity and specificity measurements vary according to the CNN-based model. Using the VGG-based model, we obtained a sensitivity of 66.271% and specificity of 71.946%. This result indicates that the VGG-based network is more efficient in detecting negative images (images without possible pathological sites) than positive images (images with possible pathological sites). Similarly, we obtained a sensitivity of 19.554% and specificity of 90.558% using the DenseNet-based network. Given its large specificity, the DenseNet-based network is also more efficient in classifying negative images than positive ones. However, the situation is different for the inception-based network, with which we obtained a sensitivity of 87.438% and specificity of 42.476%. The high sensitivity indicates that the inception-based network is more efficient in classifying positive images than negative images. In addition, the difference between the sensitivity and specificity of the VGG-based network is approximately 5.675%, which is significantly smaller than the differences of 44.962% obtained using the inception-based network and 71.004% obtained using the DenseNet-based network. This result indicates that the VGG-based network has minimal bias in classifying positive and negative images when compared to the other networks. This is because the VGG-based network is significantly shallower than the inception- and DenseNet-based networks; therefore, the negative effects caused by the over/underfitting problem are less significant than those of the inception- and DenseNet-based networks.
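The sensitivity, specificity, and overall accuracy discussed above follow directly from the confusion-matrix counts of the two-class setting (positive = "with pathological site", negative = "without"). A minimal sketch:

```python
# Metrics (in %) from confusion-matrix counts:
#   tp/fn: positive images classified correctly/incorrectly,
#   tn/fp: negative images classified correctly/incorrectly.
def classification_metrics(tp, fn, tn, fp):
    sensitivity = 100.0 * tp / (tp + fn)  # recall on positive images
    specificity = 100.0 * tn / (tn + fp)  # recall on negative images
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy
```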

#### 3.3.2. Classification Results Based on Individual CNN Models without the Preprocessing Procedure

As shown in Figure 1, our proposed method performs a preprocessing step before the classification step using the deep learning-based models to reduce the effect of noise on the classification results. In this experiment, we demonstrated the effects of noise and the advantages of the preprocessing step by measuring the classification performance of a system that omits the preprocessing step. For this purpose, we trained and evaluated the performance of the deep learning-based models without the noise reduction of the preprocessing step. The detailed experimental results are listed in Table 7. As shown in this table, we obtained overall classification accuracies of 67.392%, 47.745%, and 52.356% using the VGG-, inception-, and DenseNet-based methods, respectively. When compared to the experimental results in Table 6, the preprocessing procedure helps to enhance the classification results for all three CNN models. Specifically, the classification accuracy drops from 68.912% with the preprocessing step to 67.392% without it in the experiment with the VGG-based method. In the case of the inception-based method, the overall classification accuracy is significantly reduced from 66.505% to 47.745% for the cases with and without the preprocessing step, respectively. Finally, a marginal reduction is observed in the case of the DenseNet-based method, with overall classification accuracies of 52.609% and 52.356% for the cases with and without the preprocessing step, respectively. Through these experimental results, we can observe that the background and noise have strong negative effects on the classification accuracy of the pathological site classification problem, and that the preprocessing step is efficient in reducing these negative effects.


**Table 7.** Classification performance using an individual CNN-based model (unit: %).

#### 3.3.3. Classification Results of the Proposed Ensemble Model of Three CNNs with the Preprocessing Procedure

As listed in Table 6, the three CNN-based models work differently with the same dataset. This indicates that each network has its own advantages and disadvantages for our classification problem. Based on these characteristics, we considered a combination of the results of these networks to enhance the performance of our classification system. Then, we performed an experiment to evaluate the performance of our proposed method mentioned in Section 2.1 using Equations (1)–(3). The detailed experimental results are listed in Table 8 for all three combination methods.

**Table 8.** Classification accuracy using our proposed method (unit: %).


In the first experiment in this section, we used the MAX combination rule to combine the classification results of the three CNN-based models. As explained in Equation (1), the classification is performed based on the maximum classification scores of all three networks. From Table 8, it can be noted that the overall classification accuracy (the average result of the two folds) is 64.630% with a sensitivity of 57.952% and specificity of 72.299%. We can observe that this classification result is higher than the 52.609% obtained using only the DenseNet-based model (as listed in Table 6). However, this classification accuracy is lower than the 68.912% and 66.505% obtained using the VGG- and inception-based networks, respectively. As the specificity of 72.299% is higher than the sensitivity of 57.952%, we can conclude that the combined system based on the MAX rule is more efficient in recognizing negative images than positive images. In addition, the difference between the sensitivity and specificity using the MAX rule is approximately 14.347% (72.299−57.952%), which is smaller than in the case of using only the inception- or DenseNet-based network.

In the second experiment, we used the AVERAGE combination rule to combine the classification results of the three CNN-based models. As its name indicates, classification based on the AVERAGE rule is performed by classifying images according to the average classification scores of the three CNN-based models for the negative and positive classes. As listed in Table 8, the AVERAGE combination rule produces an overall classification accuracy of 70.775% with a sensitivity of 68.452% and specificity of 73.442%. When compared to the classification accuracies listed in Table 6, we can observe that the classification accuracy of the AVERAGE rule is significantly better than those of the individual CNN-based models. The overall classification accuracy of 70.775% is significantly higher than the results of 68.912%, 66.505%, and 52.609% obtained using the VGG-, inception-, and DenseNet-based networks, respectively. In addition, the difference between the sensitivity and specificity of the AVERAGE rule is approximately 4.99% (73.442−68.452%), which is smaller than the value of 5.675% produced by the VGG-based network (the smallest difference between the sensitivity and specificity of the individual CNN-based models, as listed in Table 6). This result indicates that the AVERAGE combination rule suffers less from over/underfitting effects than the individual CNNs.

Finally, we performed an experiment using the VOTING combination rule, as indicated by Equations (3)–(4). As listed in Table 8, the VOTING combination rule produces an overall classification accuracy of 70.369% with a sensitivity of 68.452% and specificity of 72.571%. It can be observed that this overall classification accuracy is higher than the overall classification accuracies produced by the individual CNN-based models listed in Table 6. Although the overall classification accuracy produced by the VOTING rule is marginally lower than that produced by the AVERAGE rule (70.369% versus 70.775%), the difference is insignificant (approximately 0.406%). In addition, the difference between the sensitivity and specificity produced by the VOTING rule is 4.119% (72.571−68.452%). This is the smallest difference between the sensitivity and specificity, indicating that the VOTING combination rule is the least affected by the over/underfitting problem among the three combination rules and the three individual CNN-based models.
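The three combination rules can be sketched as follows over the per-model class scores. The exact formulations of Equations (1)–(3) are not reproduced in this section, so this follows the rules as described in the text; `scores` holds each model's (negative, positive) probabilities for one image.

```python
import numpy as np

# Ensemble combination sketch: `scores` has shape (n_models, n_classes);
# row i holds model i's class probabilities for a single input image.
def combine(scores, rule):
    scores = np.asarray(scores, dtype=float)
    if rule == "MAX":      # class owning the single highest score anywhere
        return int(np.argmax(scores.max(axis=0)))
    if rule == "AVERAGE":  # class with the highest mean score
        return int(np.argmax(scores.mean(axis=0)))
    if rule == "VOTING":   # majority vote over per-model decisions
        votes = np.argmax(scores, axis=1)
        return int(np.bincount(votes, minlength=scores.shape[1]).argmax())
    raise ValueError(rule)
```

Note how the rules can disagree: one very confident model can dominate MAX and shift AVERAGE, while VOTING only counts each model's decision once.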

To analyze the experimental results in greater depth, we obtained the receiver operating characteristic (ROC) curves of the various system configurations, including the three systems based on the individual CNN networks and the three systems based on the combination rules, as shown in Figure 8. The ROC curve is one of the most popular measurements used in medical statistical tests, as it provides a visualization of classification performance. The ROC curve demonstrates the trade-off between the false acceptance rate (FAR) and the false rejection rate (FRR). In our experiment, we used the genuine acceptance rate (GAR) instead of the FRR when measuring the ROC curve, where the GAR is calculated as (100 − FRR (%)). Because of this measurement method, the ROC curves appear as shown in Figure 8, and a position closer to the upper-left corner indicates higher performance of a classification system. As we performed a two-fold cross-validation procedure, the ROC curve of each system configuration was obtained by averaging the two ROC curves of the two folds. From Figure 8, we can see that the AVERAGE rule outperforms the other system configurations, whereas the MAX and VOTING rules perform similarly to the VGG-based or inception-based network and better than the DenseNet-based network.
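The FAR/GAR pairs behind such an ROC curve can be sketched as below: a decision threshold is swept over the positive-class scores, FAR is the fraction of negative images accepted as positive, FRR is the fraction of positive images rejected, and GAR = 100 − FRR as defined above. The threshold sweep is an assumed implementation detail.

```python
import numpy as np

# ROC-point sketch: labels are 1 (with pathological site) / 0 (without),
# scores are the positive-class probabilities of a classifier.
def roc_points(scores, labels, thresholds):
    scores, labels = np.asarray(scores), np.asarray(labels)
    points = []
    for t in thresholds:
        accepted = scores >= t
        far = 100.0 * np.mean(accepted[labels == 0])   # negatives accepted
        frr = 100.0 * np.mean(~accepted[labels == 1])  # positives rejected
        points.append((far, 100.0 - frr))              # (FAR, GAR)
    return points
```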

**Figure 8.** Receiver operating characteristic (ROC) curves of various system configurations in our experiment.

These experimental results show that the combination of the three CNN-based models is efficient in enhancing the classification performance of pathological site classification for a GI endoscopic examination when compared to previous studies. In addition, the AVERAGE and VOTING rules are more efficient than the MAX combination rule. As the best accuracy was obtained using the AVERAGE rule, we can conclude that the AVERAGE rule is the most suitable combination rule for the Hamlyn-GI dataset. However, as the size of the Hamlyn-GI dataset is relatively small, the applicability of these rules should be further investigated using larger datasets.

#### *3.4. Comparisons with the State-of-the-Art Methods and Processing Time*

Based on the experimental results presented in the above sections, we compared the classification performance of our proposed method with those of previous studies. As explained in Section 1, most of the previous studies used a conventional CNN (training a single shallow network or extracting image features from a pretrained network) for classification purposes. As discussed in Sections 2.3 and 3.3, our proposed method applies the ensemble learning technique to three individual deep CNNs, i.e., VGG-, inception-, and DenseNet-based networks. Therefore, we can consider that the performances of previous studies correspond to the cases of using each individual network. Based on this assumption, we conducted a comparison experiment, and its results are listed in Table 9. From the table, it can be observed that, using the individual CNNs, we obtained overall classification accuracies of 68.912%, 66.505%, and 52.609% with the VGG-, inception-, and DenseNet-based networks, respectively. The best accuracy of 68.912% was obtained using the VGG-based method. Using our proposed method, we obtained the best accuracy of 70.775% with the AVERAGE combination rule, which is higher than the accuracy produced by any individual network. Moreover, our proposed method with the VOTING combination rule also produced an accuracy of 70.369%, which is likewise higher than the best value of 68.912% among the individual networks. From these experimental results, we can conclude that our proposed method outperforms previous studies on the pathological site classification problem. However, as the best accuracy is still low (70.775%), the classification system still requires enhancement in our future work.


**Table 9.** Comparison of the classification performances between the proposed method and the previous studies (unit: %).

In the final experiment, we measured the processing time of our proposed method for the pathological site classification problem. For this purpose, we created a deep CNN program in the Python programming language using the Tensorflow library [39]. To run the program, we used a desktop computer with an Intel Core i7-6700 central processing unit with a clock speed of 3.4 GHz and 64 GB of RAM. To increase the speed of the deep learning networks, we used a graphics processing unit, namely a GeForce Titan X, to run the inference of the three deep learning models [40]. The detailed experimental results are listed in Table 10. We also compared the processing time of the proposed method with those of previous studies. As shown in Table 10, it takes about 37.646 ms, 67.472 ms, and 65.901 ms to classify an input image using the VGG-based [11,25], inception-based [26], and DenseNet-based [31] networks, respectively. Using our proposed method, it takes 179.440 ms to classify a single input image; thus, our proposed method can operate at a speed of about 5.6 frames per second. From these experimental results, we find that our proposed method takes a longer processing time than previous studies. However, this is acceptable in medical image processing applications, where accuracy, rather than processing time, is the primary requirement.
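Per-image latency of this kind can be measured with a simple wall-clock sketch (illustrative only; the paper's timings were measured with TensorFlow on a Titan X GPU). `infer` stands for any callable that classifies one image; warm-up iterations are excluded so one-time initialization does not skew the average.

```python
import time

# Mean per-image inference latency in milliseconds.
def mean_latency_ms(infer, images, warmup=3):
    for img in images[:warmup]:   # warm-up runs, not timed
        infer(img)
    start = time.perf_counter()
    for img in images:
        infer(img)
    return 1000.0 * (time.perf_counter() - start) / len(images)
```

At the reported 179.440 ms per image, throughput is 1000 / 179.440 ≈ 5.6 frames per second, matching the figure in the text.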



#### *3.5. Analysis and Discussion*

As shown in our experimental results, the AVERAGE rule outperforms the MAX and VOTING rules in enhancing the classification performance. This is caused by the fact that the AVERAGE rule directly uses the prediction scores of all individual CNN models, whereas the MAX rule is based only on the maximum classification score of one CNN model among the several CNN models, and the VOTING rule is based only on the majority decision of the CNN models. As a result, the classification score combined by the AVERAGE rule contains more detailed classification information from all CNN models than the other two combination rules.

We used a public dataset, namely Hamlyn-GI, which exhibits large visual differences among the collected images [30]. To measure the performance of the proposed method, we first divided the entire Hamlyn-GI dataset into training and testing sets. The division ensures that the training and testing datasets are disjoint, i.e., images of the same patient exist only in either the training or the testing dataset, as in an open-world evaluation. As a result, the training and testing sets are independent of each other, as shown in Figure 6 (for example, the left-bottom and middle-bottom images are samples of the training data, whereas the others are samples of the testing data for the first-fold validation). We used the training dataset to train the classification models and the testing dataset to measure the performance of the trained models. Therefore, the measured performance reflects the generalization ability of the models, because the testing set is independent of the training set.

To demonstrate the efficiency of our proposed method, we show several examples of the classification results in Figure 9. As shown in this figure, images without a pathological site are accurately classified as negative cases (Figure 9a), while images with pathological sites (small polyp regions in Figure 9b and complex vascular structure regions in Figure 9c) are classified as positive cases. However, our proposed method also produces some incorrect classification results, as shown in Figure 10. In Figure 10a, we show sample images that were classified as positive cases (images with pathological sites) even though they did not contain pathological sites. These errors were caused either by the endoscopic images containing complex vascular structures (the left-most image in Figure 10a) or by imperfect input images (the middle and right-most images in Figure 10a). In Figure 10b,c, we show example images that our classification method incorrectly classified. As we can observe from these figures, text boxes in the input images can cause classification errors (right-most image in Figure 10c). In addition, input images containing small or blurred polyp regions result in classification errors, which could be addressed by super-resolution reconstruction or deblurring in our future work.

To provide a deeper look into the actual operation of the deep learning-based models for the classification problem in our study, we measured the regions of focus in the input images that the classification models use to accomplish their functions, and the results are illustrated in Figure 11. For this purpose, we obtained the class activation maps using the gradient-weighted class activation mapping (grad-CAM) method [41]. Grad-CAM is a popular method for explaining the operation of a deep CNN. In the activation maps of these figures, the brighter regions indicate the regions of the feature maps that our system focuses on for classification. As shown in Figure 11, our deep learning models focus on pathological sites (possible polyps) or complex vascular structures when classifying input images into the negative or positive class. By providing class activation maps to medical doctors, our system can offer a reasonable explanation of why an input image is classified as positive, thereby demonstrating the functionality of explainable artificial intelligence. This is a significantly important characteristic of a high-performance classification system in CAD applications.
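The core of grad-CAM [41] can be sketched as follows, assuming the last convolutional feature maps (shape H × W × C) and the gradients of the target class score with respect to those maps are already available; how the gradients are obtained (e.g., automatic differentiation in TensorFlow) is framework-specific and omitted here.

```python
import numpy as np

# Grad-CAM sketch: weight each feature-map channel by its global-average-
# pooled gradient, sum over channels, apply ReLU, and normalize to [0, 1].
def grad_cam(feature_maps, gradients):
    weights = gradients.mean(axis=(0, 1))                       # (C,) channel weights
    cam = np.tensordot(feature_maps, weights, axes=([2], [0]))  # (H, W) weighted sum
    cam = np.maximum(cam, 0.0)                                  # ReLU keeps positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                                   # scale to [0, 1]
    return cam
```

The resulting (H, W) map, upsampled to the input size, gives the bright "regions of focus" shown in Figure 11.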

**Figure 9.** Examples of correct classification results by the proposed method: (**a**) True-negative cases and (**b**,**c**) true-positive cases (Upper and lower images indicate the ground-truth and testing images, respectively).

(**c**)

**Figure 10.** Examples of incorrect classification results by the proposed method: (**a**) False-positive cases and (**b**,**c**) false-negative cases (Upper and lower images indicate the ground-truth and testing images, respectively).

In our experiments, we use only three sub-models in the ensemble algorithm for two reasons. First, these models (VGG-based, inception-based, and DenseNet-based) differ in network architecture; therefore, each sub-model has its own advantages and disadvantages that can be compensated for by the other sub-models. Second, although more sub-models could be used in the ensemble algorithm, the complexity and processing time of the proposed method would increase, which can make the training and deployment of the model difficult. For these reasons, we used only three sub-models in our experiments. However, it is possible to use more sub-models to potentially enhance the system performance. In that case, our work can be seen as a specific example demonstrating the potential of the ensemble algorithm to enhance pathological site classification. When the number of sub-models increases, it not only increases the complexity and computation of the system but can also introduce noise into the system caused by the error cases of each individual model.

(**c**)

**Figure 11.** Obtained class activation maps corresponding to the pathological site class using (**a**) VGG-based network, (**b**) inception-based network, and (**c**) DenseNet-based network.

For demonstration purposes, we additionally performed experiments with more than three sub-models, i.e., an ensemble method with four sub-models. For this purpose, we used an additional CNN classifier based on a residual network [23]. The experimental results are given in Table 11. From these results, we see that the highest classification accuracy of 69.686% was obtained using the AVERAGE rule, which is slightly worse than the accuracy of 70.775% obtained using the three sub-models mentioned in Section 3.3. The reason is that, when the number of models increases, noise caused by the error cases of the new model is also introduced into the ensemble system. In addition, the residual network uses shortcut connections in a similar way to the DenseNet-based network; therefore, the residual network shares similar characteristics with the DenseNet-based network. For these reasons, the performance of the ensemble system based on four sub-models is slightly lower than that of the system using three sub-models.

**Table 11.** Classification accuracy using our proposed method with four sub-models (unit: %).


#### **4. Conclusions**

In this study, we proposed an ensemble learning-based method that combines the classification results of multiple deep learning models to enhance the performance of an endoscopic pathological site classification system. For this purpose, we first constructed and trained several CNN models, including VGG-, inception-, and DenseNet-based networks, for the pathological site classification problem. We then combined the classification results of these CNN models using three combination rules: the MAX, AVERAGE, and VOTING rules. Using our proposed method, we enhanced the classification performance over that observed in previous studies. In addition, we showed that the AVERAGE rule outperforms the other two combination rules.

**Author Contributions:** D.T.N. and G.B. designed and implemented the overall system, made the code and performed experiments, and wrote this paper. M.B.L., T.D.P., M.A. and K.R.P. helped with parts of experiments and gave comments during experiments. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (MSIT) through the Basic Science Research Program under Grant NRF-2020R1A2C1006179, in part by the NRF funded by the MSIT through the Basic Science Research Program under Grant NRF-2019R1F1A1041123, and in part by the NRF funded by the MSIT through the Basic Science Research Program under Grant NRF-2019R1A2C1083813.

**Acknowledgments:** This work was supported in part by the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (MSIT) through the Basic Science Research Program under Grant NRF-2020R1A2C1006179, in part by the NRF funded by the MSIT through the Basic Science Research Program under Grant NRF-2019R1F1A1041123, and in part by the NRF funded by the MSIT through the Basic Science Research Program under Grant NRF-2019R1A2C1083813.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).


## **DoF-Dependent and Equal-Partition Based Lens Distortion Modeling and Calibration Method for Close-Range Photogrammetry**

#### **Xiao Li <sup>1</sup>, Wei Li <sup>1,\*</sup>, Xin'an Yuan <sup>1</sup>, Xiaokang Yin <sup>1</sup> and Xin Ma <sup>2</sup>**


Received: 24 September 2020; Accepted: 16 October 2020; Published: 20 October 2020

**Abstract:** Lens distortion is closely related to the spatial position of the depth of field (DoF), especially in close-range photography. The accurate characterization and precise calibration of DoF-dependent distortion are very important for improving the accuracy of close-range vision measurements. In this paper, to meet the needs of short-distance and small-focal-length photography, a DoF-dependent and equal-partition-based lens distortion modeling and calibration method is proposed. Firstly, considering the direction along the optical axis, a DoF-dependent yet focusing-state-independent distortion model is proposed. With this method, manual adjustment of the focus and zoom rings is avoided, thus eliminating human errors. Secondly, considering the direction perpendicular to the optical axis, to solve the problem of insufficient distortion representation caused by using only one set of coefficients, a 2D-to-3D equal-increment partitioning method for lens distortion is proposed. Accurate characterization of DoF-dependent distortion is thus realized by fusing the distortion partitioning method and the DoF distortion model. Lastly, a calibration control field is designed. After extracting the line segments within a partition, the decoupled calibration of the distortion parameters and the other camera model parameters is realized. Experimental results show that the maximum/average projection and angular reconstruction errors of the equal-increment-partition-based DoF distortion model are 0.11/0.05 pixels and 0.013°/0.011°, respectively. This demonstrates the validity of the lens distortion model and calibration method proposed in this paper.

**Keywords:** lens distortion; DoF-dependent; distortion partition; vision measurement

#### **1. Introduction**

Vision measurement is a subject that allows quantitative perception of scene information by combining image processing with calibrated camera parameters. Therefore, the calibration accuracy of the parameters is an important determinant of the vision measurement uncertainty. Lens distortion is closely related to the depth of field (DoF), which refers to the distance between the nearest and the farthest objects that are in acceptably sharp focus in an image. For medium- or high-accuracy applications, close-range imaging parameters (e.g., a short object distance (<1 m) and a small focal length) are often adopted. In such cases, the DoF has a significant influence on lens distortion and, hence, becomes a major cause of vision measurement errors. For instance, to ensure that the vision system has micron-level accuracy when detecting the contouring error of a machine tool [1], the camera is placed 400 mm away from the focal plane to collect and analyze the image sequence of the interpolation trajectory running in the DoF. In this case, measurement errors ranging from dozens to hundreds of microns can be caused by a large lens distortion. Therefore, to improve the vision measurement accuracy in close-range photogrammetry, accurate modeling and calibration of the DoF-dependent lens distortion are urgently needed.

The lens distortion model maps the relation between distorted and undistorted image points. Models describing this relation vary with the type of optical system; they include the polynomial distortion model, the logarithmic fish-eye distortion model [2], the polynomial fish-eye distortion model [2–5], the field-of-view (FoV) distortion model [6], the division distortion model [7,8], the rational function distortion model [9,10], and so on. In 1971, Brown [11,12] proposed the Gaussian polynomial function to express radial and decentering distortion, which is particularly suitable for studying the distortion of a standard lens in high-accuracy measurements [13,14]. Later, researchers noticed that the observed radial and decentering distortion varies with the focal length, the lens focusing state (i.e., focused or defocused), and the DoF position. Since then, research has focused on improving distortion calibration and modeling methods to obtain a precise representation of distortion behavior. Distortion calibration has developed in two directions: the coupled-calibration method and the decoupled-calibration method. The former can be generally divided into three types: the self-calibration method [15], the active calibration method, and the traditional calibration method [16]. Among the traditional ones, Zhang's calibration method [17] and its improved variants [18–20], used widely in industry and scientific research, are the most popular. In this coupled-calibration method, the distortion parameters are calculated by performing a full-scale optimization over all parameters. Due to the strong coupling effect, the estimation errors of the other parameters in the camera model (i.e., the intrinsic and extrinsic parameters) are propagated to the distortion parameters, preventing optimal solutions from being obtained.
By contrast, the decoupled-calibration method does not involve coupling other factors or entail any prior geometric knowledge of the calibration object, and only geometric invariants of some image features, such as straight lines [6,12,21–23], vanishing points [24], or spheres [25], are needed to solve the parameters. Among these features, straight lines can be easily reflected in scenes and extracted from noise images, thus having enormous potential.

Regarding the distortion modeling, some researchers incorporated the DoF into the distortion function. Magill [26] used the distortion of two focal planes at infinity to solve that of an arbitrary focal plane. Then, Brown [12] improved Magill's model by establishing distortion models of any focal plane and any defocused plane (the plane perpendicular to the optical axis in the DoF) on the condition that the distortions of two focal planes are known. Soon after, Fryer [27], based on Brown's model, realized the lens distortion calibration of an underwater camera [28]. Fraser and Shortis [29] introduced an empirical model and solved the Brown model's problem of inaccurate description of large image distortion. Additionally, Dold [30] established a DoF distortion model that is different from Brown's and solved the model parameters through the strategy of bundle adjustment. In 2004, Brakhage [31] characterized the DoF distortion of the telecentric lens in a fringe projection system by using Zernike Polynomials. Moreover, in 2006, the DoF distortion distribution of the grating projection system was experimentally analyzed by Bräuer-Burchardt. In 2008, Hanning [32] introduced depth (object distance) into the spline function to form a distortion model and used the model to calibrate radial distortion.

The above DoF distortion models not only depend on the focusing state but also rely on the distortion coefficients on the focal plane. For these models, on the one hand, the focusing state is usually adjusted by manually twisting the zoom and focus rings, which introduces human errors and changes the camera parameters. On the other hand, the focus distance and the distortion parameters on the focal plane cannot be determined accurately. To overcome this problem, Alvarez [33], based on Brown's and Fraser's models, deduced a radial distortion model suitable for planar scenarios. With this model, when the focal length is locked, the distortion at any image position can be estimated using two lines in a single photograph. In 2017, Dong [34] proposed a DoF distortion model with which the distortion parameters on arbitrary object planes were accurately calibrated, reducing the error from 0.055 mm to 0.028 mm in a measuring volume of 7.0 m × 3.5 m × 2.5 m at a large object distance of 6 m. Additionally, in 2019, Ricolfe-Viala [35] proposed a depth-dependent high-distortion lens calibration method that embeds the object distance in the division distortion model, so that highly distorted images can be corrected with only one distortion parameter. However, these researchers used only one set of coefficients, which is not sufficient to represent the distortion accurately. To address this problem, some scholars adopted the idea of partitioning, which uses several sets of distortion coefficients to characterize the image distortion. Such work, however, is only applicable to the partitioning of a 2D object plane; it fails to take into account the distortion partition within the DoF and the correlation between lens distortion and DoF. Our previous work partitioned the distortion with an equal radius [36]. Although it improved the vision measurement accuracy, the distortion correction accuracy within the partition corresponding to the image edge remained low. Besides, the distortion model we adopted depends on the focusing state of the lens and is thus less practical. In general, the current distortion models and partitioning methods cannot accurately reflect the lens DoF distortion behavior in close-range photography, especially for short-distance measurements.

To solve the above problems, a lens distortion model and calibration method for short-distance measurement, which take into consideration the dimensions of the DoF and the equal-increment partition of distortion, are proposed in this paper. The rest of this paper is organized as follows. In Section 2, a focusing-state-independent DoF distortion model, which only involves the spatial position of the observed point, is constructed. In Section 3, based on the model in the previous section, an equal-increment partitioning DoF distortion model is proposed, which enables a fine representation of the lens distortion in the photographic field. Section 4 details the calibration method for both the DoF distortion and the camera model parameters, as well as the image processing of the control field for distortion calibration. In Section 5, experimental verification of the proposed lens distortion model and calibration method is carried out. Finally, Section 6 concludes this paper.

#### **2. Focusing-State-Independent DoF Distortion Model**

The observed distortion of a point varies with its position within the DoF. Though the close-range imaging configuration increases the visible range, it enlarges the DoF image distortion, consequently affecting the measurement accuracy. To break the limitations of the aforementioned in-plane and DoF distortion model in the vision measurement of short-distance and small-focal-length settings, a DoF-dependent yet focusing-state-independent distortion model is proposed in this paper.

#### *2.1. Pinhole Camera Model with Distortion*

As illustrated in Figure 1, the linear pinhole camera model depicts the one-to-one mapping between a 3D point in the object space and its 2D projection in the image. Let $p(u_{li}, v_{li})$ be the undistorted coordinates mapped from a spatial point in the world coordinate system $O_w X_w Y_w Z_w$ to the image coordinate system $ouv$ through the optical center $O_C$. Then, the camera mapping can be expressed as [17]

$$z' \begin{bmatrix} u_{li} \\ v_{li} \\ 1 \end{bmatrix} = \mathbf{K} \cdot \mathbf{M} \cdot \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} \tag{1}$$

where *z* ′ describes the scaling factor; *K* is the intrinsic parameter matrix, which quantitatively characterizes the critical parameters of the image sensor (i.e., Charge Coupled Device (CCD) or Complementary Metal-Oxide Semiconductor (CMOS)); Matrix *M*, expressing the transformation between the vision coordinate system (VCS) and the world coordinate system, consists of the rotation matrix *R* and translation matrix *T*.

**Figure 1.** Schematic diagram of camera model and lens distortion: (**a**) camera model; (**b**) barrel distortion; (**c**) pincushion distortion.
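To make the mapping of Equation (1) concrete, the following is a minimal numpy sketch; the intrinsic values (focal lengths of 1000 px, principal point (640, 480)) and the identity pose are illustrative assumptions, not calibrated parameters.

```python
import numpy as np

def project_point(K, R, T, Pw):
    """Project a 3D world point to pixel coordinates via Equation (1).

    K: 3x3 intrinsic matrix; R: 3x3 rotation; T: translation (3,).
    Returns the undistorted pixel coordinates (u_li, v_li).
    """
    M = np.hstack([R, np.asarray(T, float).reshape(3, 1)])  # 3x4 extrinsic matrix
    Pw_h = np.append(np.asarray(Pw, float), 1.0)            # homogeneous world point
    uvw = K @ M @ Pw_h                                      # z' * [u, v, 1]^T
    return uvw[:2] / uvw[2]                                 # divide out the scale z'

# Hypothetical intrinsics and a camera aligned with the world frame.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 480.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
T = np.zeros(3)
print(project_point(K, R, T, [0.1, 0.05, 1.0]))  # -> [740. 530.]
```

The scale factor $z'$ is simply the depth of the point in the camera frame, which is why the last row of the product is divided out.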

However, manufacturing and assembly errors can lead to radial and decentering lens distortion. Consequently, the pinhole assumption does not hold for real camera systems, and the image projection of a straight line would be bent into a curve (Figure 1b,c). To characterize the lens distortion, Brown proposed the distortion model in a polynomial form [11,12]:

$$\begin{cases} u_{li} = \overline{u}_{li} + \delta u_{li} \\ v_{li} = \overline{v}_{li} + \delta v_{li} \\ \delta u_{li} = \overline{u}_{li} \cdot (K_1 \cdot r^2 + K_2 \cdot r^4 + \cdots) + \left[ P_1 \cdot (r^2 + 2\overline{u}_{li}^2) + 2 P_2 \cdot \overline{u}_{li} \cdot \overline{v}_{li} \right] + \cdots \\ \delta v_{li} = \overline{v}_{li} \cdot (K_1 \cdot r^2 + K_2 \cdot r^4 + \cdots) + \left[ P_2 \cdot (r^2 + 2\overline{v}_{li}^2) + 2 P_1 \cdot \overline{u}_{li} \cdot \overline{v}_{li} \right] + \cdots \end{cases} \tag{2}$$

where $(\overline{u}_{li}, \overline{v}_{li})$ are the distorted coordinates; $\delta u_{li}$ and $\delta v_{li}$ are the distortion functions of an image point in the *u* and *v* directions respectively; $(u_0, v_0)$ denotes the distortion center; $r = \sqrt{(u_{li} - u_0)^2 + (v_{li} - v_0)^2}$ stands for the distortion radius of the image point; $K_1$ and $K_2$ are the first- and second-order radial distortion coefficients respectively; and $P_1$ and $P_2$ are the first- and second-order decentering distortion coefficients respectively.
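As a sketch of how Equation (2) is evaluated in practice, the snippet below applies the radial and decentering terms to a distorted observation; the coefficient magnitudes are made-up values for illustration, not calibrated ones.

```python
def correct_point(u_d, v_d, K1, K2, P1, P2, u0=0.0, v0=0.0):
    """Evaluate Brown's polynomial model (Equation (2)) at a distorted
    observation (u_d, v_d) and return the corrected coordinates."""
    ub, vb = u_d - u0, v_d - v0            # offset from the distortion center
    r2 = ub ** 2 + vb ** 2                 # squared distortion radius r^2
    radial = K1 * r2 + K2 * r2 ** 2        # radial part: K1*r^2 + K2*r^4
    du = ub * radial + P1 * (r2 + 2 * ub ** 2) + 2 * P2 * ub * vb
    dv = vb * radial + P2 * (r2 + 2 * vb ** 2) + 2 * P1 * ub * vb
    return u_d + du, v_d + dv

# A point 100 px right of the center under pure radial distortion K1 = 1e-6
# is shifted outward by about 1 px.
print(correct_point(100.0, 0.0, 1e-6, 0.0, 0.0, 0.0))  # approx. (101.0, 0.0)
```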

#### *2.2. Distortion Model in the Focal Plane*


#### 2.2.1. Radial Distortion Model

Let $\delta r_{\infty}$ be the radial distortion for a lens focused at plus infinity, and $\delta r_{-\infty}$ that for a lens focused at minus infinity. $m_s$ refers to the vertical magnification in the focal plane at object distance *s*. According to Magill's model [26], $\delta r_s$, the lens radial distortion in the focal plane, can be expressed as

$$
\delta r\_s = \delta r\_{-\infty} - m\_s \cdot \delta r\_{\infty} \tag{3}
$$

Let $\delta r_{s_m}$ and $\delta r_{s_k}$ be the radial distortions in the focal planes when the lens is focused at the distances $s_m$ and $s_k$ respectively. Then, the distortion function $\delta r_s$ for the focal plane at distance *s* can be written as


$$
\delta r\_{\rm s} = \alpha\_{\rm s} \cdot \delta r\_{\rm s\_{\rm m}} + (1 - \alpha\_{\rm s}) \cdot \delta r\_{\rm s\_{\rm k}} \tag{4}
$$


where *f* is the focal length and $\alpha_s = \frac{(s_k - s) \cdot (s_m - f)}{(s_k - s_m) \cdot (s - f)}$, so that $\alpha_s = 1$ at $s = s_m$ and $\alpha_s = 0$ at $s = s_k$. The *i*-th radial distortion coefficient $K_i^s$ for the focal plane at distance *s* is

$$K_i^s = \alpha_s \cdot K_i^{s_m} + (1 - \alpha_s) \cdot K_i^{s_k}, \quad i = 1, 2. \tag{5}$$

where $K_i^{s_m}$ and $K_i^{s_k}$ are the *i*-th radial distortion coefficients when the lens is focused at the distances $s_m$ and $s_k$ respectively. As can be easily seen from Equation (5), if the radial distortion coefficients of two different focal planes are known, the radial distortion coefficients of any focal plane can be obtained.
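Equations (4) and (5) can be sketched as below; the interpolation factor $\alpha_s$ is written in the form that satisfies the boundary conditions $\alpha_s(s_m) = 1$ and $\alpha_s(s_k) = 0$, and the numeric values (a 25 mm focal length, reference planes at 400 mm and 800 mm) are illustrative.

```python
def alpha_s(s, s_m, s_k, f):
    """Interpolation factor of Equation (4): 1 at s = s_m, 0 at s = s_k."""
    return ((s_k - s) * (s_m - f)) / ((s_k - s_m) * (s - f))

def radial_coeff_focal_plane(K_m, K_k, s, s_m, s_k, f):
    """Equation (5): i-th radial coefficient for the focal plane at distance s,
    interpolated from the coefficients of two known focal planes."""
    a = alpha_s(s, s_m, s_k, f)
    return a * K_m + (1 - a) * K_k

# Illustrative coefficients for the two reference focal planes.
K_400, K_800 = 2.0e-7, 1.2e-7
print(radial_coeff_focal_plane(K_400, K_800, 400.0, 400.0, 800.0, 25.0))  # -> K_400
print(radial_coeff_focal_plane(K_400, K_800, 600.0, 400.0, 800.0, 25.0))  # in between
```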

#### 2.2.2. Decentering Distortion

As for the decentering distortion, the equations are as follows [12]:

$$\begin{cases} \delta r_u = \left(1 - \frac{f}{s}\right) \cdot \left(P_1 \cdot (r^2 + 2u^2) + 2 P_2 \cdot u \cdot v\right) \\ \delta r_v = \left(1 - \frac{f}{s}\right) \cdot \left(P_2 \cdot (r^2 + 2v^2) + 2 P_1 \cdot u \cdot v\right) \\ r^{s,s'} = \frac{s - f}{s' - f} \cdot \frac{s'}{s} \end{cases} \tag{6}$$

where $(1 - \frac{f}{s}) \cdot r^{s,s'}$ is the compensation coefficient; $\delta r_u$ and $\delta r_v$ represent the components of the decentering distortion in the *u* and *v* directions respectively; *s* and $s'$ denote the object distances corresponding to the two focal planes, respectively.

#### *2.3. DoF-Dependent Distortion Model for Arbitrary Defocused Plane*

#### 2.3.1. DoF-Dependent Radial Distortion Model

Fraser and Shortis [29] proposed an empirical model for describing the distortion of any object plane (or defocused plane), which solved Brown model's problem of inaccurate description of severe distortion caused by the image configuration of short-distance and small-focal-length settings. The equation is as follows:

$$K^{s,s_p} = K^s + g \cdot (K^{s_p} - K^s) \tag{7}$$

where $K^{s,s_p}$ denotes the radial distortion coefficient in the defocused plane at depth $s_p$ when the lens is focused at distance *s*; *g* is the empirical coefficient; $K^{s_p}$ and $K^s$ represent the radial distortion coefficients in the focal planes at distances $s_p$ and *s* respectively. By extending the equation, we can express the radial distortion function $\delta r^{s,s_n}$ at $s_n$ through $\delta r^{s,s_m}$ at $s_m$ when the lens is focused at the distance *s*:

$$\begin{cases} \delta r^{s,s_m} = \delta r^s + g \cdot (\delta r^{s_m} - \delta r^s) \\ \delta r^{s,s_n} - \delta r^{s,s_m} = \delta r^s + g \cdot (\delta r^{s_n} - \delta r^s) - \delta r^s - g \cdot (\delta r^{s_m} - \delta r^s) \end{cases} \tag{8}$$

From the above equation, we can easily obtain $\delta r^{s,s_n} = \delta r^{s,s_m} + \alpha_{s,s_m(s_n)} \cdot (\delta r^s - \delta r^{s,s_m})$. Then, by extending the result to the radial distortion of a point in the defocused plane at distance $s_k$, the relationship between $\delta r^{s,s_n}$, $\delta r^{s,s_m}$ and $\delta r^{s,s_k}$ can be given by

$$\begin{cases} \delta r^{s,s_n} = \delta r^{s,s_m} + \alpha_{s,s_m(s_n)} \cdot (\delta r^s - \delta r^{s,s_m}) \\ \delta r^{s,s_n} = \delta r^{s,s_k} + \alpha_{s,s_k(s_n)} \cdot (\delta r^s - \delta r^{s,s_k}) \\ \delta r^{s,s_m} = \delta r^{s,s_k} + \alpha_{s,s_k(s_m)} \cdot (\delta r^s - \delta r^{s,s_k}) \end{cases} \tag{9}$$

in which $\alpha_{s,s_m(s_n)} = \frac{s_m - s_n}{s_m - s} \cdot \frac{s - f}{s_n - f}$, $\alpha_{s,s_k(s_n)} = \frac{s_k - s_n}{s_k - s} \cdot \frac{s - f}{s_n - f}$, and $\alpha_{s,s_k(s_m)} = \frac{s_k - s_m}{s_k - s} \cdot \frac{s - f}{s_m - f}$.

After eliminating the focus distance and the distortion in the focal plane, we can obtain the following equation:

$$\begin{cases} \delta r^{s,s_m} = \delta r^{s,s_k} + \frac{s_k - s_m}{s_k - s} \cdot \frac{s - f}{s_m - f} \cdot \left(\delta r^s - \delta r^{s,s_k}\right) \\ \delta r^{s,s_n} = \delta r^{s,s_k} + \frac{s_k - s_n}{s_k - s} \cdot \frac{s - f}{s_n - f} \cdot \left(\delta r^s - \delta r^{s,s_k}\right) \end{cases} \tag{10}$$

Then, we have $\frac{s - f}{s_k - s} \cdot (\delta r^s - \delta r^{s,s_k}) = (\delta r^{s,s_m} - \delta r^{s,s_k}) \cdot \frac{s_m - f}{s_k - s_m}$, so $\delta r^{s,s_n}$ can be expressed as

$$\delta r^{s,s_n} = \frac{\delta r^{s,s_k} \cdot (s_n - f) \cdot (s_k - s_m) + (s_k - s_n) \cdot (s_m - f) \cdot (\delta r^{s,s_m} - \delta r^{s,s_k})}{(s_n - f) \cdot (s_k - s_m)} \tag{11}$$

Obviously, from the two distortions calibrated on the object planes at $s_m$ and $s_k$, the radial distortion coefficient in any defocused plane at depth $s_n$ when the lens is focused at distance *s* can be obtained:

$$K_i^{s,s_n} = \frac{K_i^{s,s_k} \cdot (s_n - f) \cdot (s_k - s_m) + (s_k - s_n) \cdot (s_m - f) \cdot (K_i^{s,s_m} - K_i^{s,s_k})}{(s_n - f) \cdot (s_k - s_m)}, \quad i = 1, 2. \tag{12}$$

When the two object planes are set, $K_i^{s,s_m}$, $K_i^{s,s_k}$, $s_m$, $s_k$ and *f* are known. Thus, $K_i^{s,s_n}$ in Equation (12) depends only on $s_n$; it is independent of the distortion coefficient $K_i^s$ on the focal plane and of the focus distance *s*.
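The relation in Equation (12) can be sketched as a small helper; it is written so that it returns the calibrated coefficient of the corresponding reference plane when $s_n$ coincides with $s_m$ or $s_k$, and the numbers below are illustrative.

```python
def radial_coeff_defocused(K_sm, K_sk, s_m, s_k, s_n, f):
    """Equation (12): radial distortion coefficient on the defocused plane at
    depth s_n, computed from the coefficients calibrated on the object planes
    at s_m and s_k. Neither the focus distance s nor the focal-plane
    coefficient K_i^s appears in the relation."""
    num = K_sk * (s_n - f) * (s_k - s_m) + (s_k - s_n) * (s_m - f) * (K_sm - K_sk)
    den = (s_n - f) * (s_k - s_m)
    return num / den

# Illustrative values: f = 25 mm, reference planes at 400 mm and 800 mm.
K_400, K_800 = 2.0e-7, 1.2e-7
# The relation reproduces the calibrated coefficient on each reference plane.
print(radial_coeff_defocused(K_400, K_800, 400.0, 800.0, 400.0, 25.0))
print(radial_coeff_defocused(K_400, K_800, 400.0, 800.0, 600.0, 25.0))
```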

#### 2.3.2. DoF-Dependent Decentering Distortion Model

In Equation (6), since $\delta P^{s,s_m} = r^{s,s_m} \cdot \delta P^{\infty}$, the distortion in the focal plane can be written as

$$\begin{cases} \delta P^{s,s_m} = \left(1 - \frac{f}{s}\right) \cdot \frac{s - f}{s_m - f} \cdot \frac{s_m}{s} \cdot \delta P^{\infty} \\ \delta P^{s,s_k} = \left(1 - \frac{f}{s}\right) \cdot \frac{s - f}{s_k - f} \cdot \frac{s_k}{s} \cdot \delta P^{\infty} \\ \delta P^{s,s_n} = \left(1 - \frac{f}{s}\right) \cdot \frac{s - f}{s_n - f} \cdot \frac{s_n}{s} \cdot \delta P^{\infty} \end{cases} \tag{13}$$

where $\delta P^{s,s_m}$, $\delta P^{s,s_k}$ and $\delta P^{s,s_n}$ are the decentering distortion functions in the defocused planes at the object distances $s_m$, $s_k$ and $s_n$ when the lens is focused at distance *s*, respectively. From the first two lines of the above equation, we get $\frac{\delta P^{s,s_k}}{\delta P^{s,s_m}} = \frac{s - f}{s_k - f} \cdot \frac{s_m - f}{s - f} \cdot \frac{s_k}{s} \cdot \frac{s}{s_m} = \frac{s_m - f}{s_k - f} \cdot \frac{s_k}{s_m} = M^{s_k,s_m}$, and then

$$\begin{cases} f = \frac{1 - M^{s_k,s_m}}{s_k - s_m \cdot M^{s_k,s_m}} \\ \frac{\delta P^{s,s_n}}{\delta P^{s,s_m}} = \frac{s_m - f}{s_n - f} \cdot \frac{s_n}{s_m} \end{cases} \tag{14}$$

Substituting the first line into the second one, we obtain

$$\frac{\delta P^{s,s_n}}{\delta P^{s,s_m}} = \frac{s_m - \frac{1 - M^{s_k,s_m}}{s_k - s_m \cdot M^{s_k,s_m}}}{s_n - \frac{1 - M^{s_k,s_m}}{s_k - s_m \cdot M^{s_k,s_m}}} \cdot \frac{s_n}{s_m} = \frac{s_m \cdot s_k - s_m^2 \cdot M^{s_k,s_m} - 1 + M^{s_k,s_m}}{s_n \cdot s_k - s_m \cdot s_n \cdot M^{s_k,s_m} - 1 + M^{s_k,s_m}} \cdot \frac{s_n}{s_m} \tag{15}$$

Equation (15) can be simplified to

$$\begin{cases} \delta P^{s,s_n} = \frac{M^{s_k,s_m} \cdot (1 - s_m^2) + (s_m \cdot s_k - 1)}{M^{s_k,s_m} \cdot (1 - s_n \cdot s_m) + (s_n \cdot s_k - 1)} \cdot \frac{s_n}{s_m} \cdot \delta P^{s,s_m} \\ P_i^{s,s_n} = \frac{M^{s_k,s_m} \cdot (1 - s_m^2) + (s_m \cdot s_k - 1)}{M^{s_k,s_m} \cdot (1 - s_n \cdot s_m) + (s_n \cdot s_k - 1)} \cdot \frac{s_n}{s_m} \cdot P_i^{s,s_m}, \quad i = 1, 2. \end{cases} \tag{16}$$

Substituting $M^{s_k,s_m} = \frac{P_i^{s,s_k}}{P_i^{s,s_m}}$ ($i = 1, 2$) into Equation (16) yields

$$P_i^{s,s_n} = \frac{P_i^{s,s_k} \cdot (1 - s_m^2) + P_i^{s,s_m} \cdot (s_m \cdot s_k - 1)}{P_i^{s,s_k} \cdot (1 - s_m \cdot s_n) + P_i^{s,s_m} \cdot (s_n \cdot s_k - 1)} \cdot \frac{s_n}{s_m} \cdot P_i^{s,s_m}, \quad i = 1, 2. \tag{17}$$

Given that the parameters $P_i^{s,s_m}$, $P_i^{s,s_k}$, $s_m$ and $s_k$ are known, Equation (17) shows that the decentering distortion coefficient $P_i^{s,s_n}$ in any defocused plane depends only on the object distance $s_n$, and is independent of the focus distance *s* and the distortion $P_i^s$ in the focal plane. Moreover, since the focal length *f* does not appear in Equation (17), the decentering distortion is not affected by this parameter.
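Similarly, Equation (17) can be sketched as follows, with the distances treated as plain (dimensionless) numbers and the coefficient values purely illustrative; the helper reproduces the coefficient calibrated on the plane at $s_m$ when $s_n = s_m$.

```python
def decentering_coeff_defocused(P_sm, P_sk, s_m, s_k, s_n):
    """Equation (17): decentering distortion coefficient P_i^{s,s_n} on the
    defocused plane at s_n from the coefficients on the planes at s_m and s_k.
    Neither the focal length f nor the focus distance s appears."""
    num = P_sk * (1 - s_m ** 2) + P_sm * (s_m * s_k - 1)
    den = P_sk * (1 - s_m * s_n) + P_sm * (s_n * s_k - 1)
    return (num / den) * (s_n / s_m) * P_sm

# Illustrative values; at s_n = s_m the relation returns P_sm itself.
print(decentering_coeff_defocused(1.0e-7, 0.5e-7, 2.0, 4.0, 2.0))  # -> 1e-07
```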


At this point, the DoF-dependent yet focusing-state-independent distortion model suitable for close-range, short-distance measurement scenes has been established. It overcomes the limited practicability of calibrating the DoF distortion by manually adjusting the focus and zoom rings, and it remains valid when neither the current focus position nor the distortion parameters of the focal plane are exactly known.

#### **3. Equal-Increment Partition Based DoF Distortion Model**

The distortion coefficients are solved by minimizing the straightness error of the observed points. If a single set of distortion coefficients is used to describe the distortion in the whole image, the coefficients represent an error balance over all points, so the error is not minimal in each individual region of the image. Hence, an equal-increment partition based DoF distortion model is proposed in this section. The distortion spreads outward from the image center along circumferential contours, being small near the center and large at the image edge. In this paper, we first partition the in-plane distortion in an equal-increment way, and then extend the 2D partition strategy to the 3D photographic field.

#### *3.1. Equal-Increment Based Distortion Partitioning Method*

Figure 2 presents two distortion partitioning methods. The *X* axis represents the distance from an image point to the distortion center $(u_0, v_0)$, namely the distortion radius, and the *Y* axis describes the distortion in pixels. The blue curve is the distortion curve calculated from the features in the whole image. As illustrated in Figure 2a, when the DoF distortion is partitioned by an equal radius, the distortion increment of each partition is different ($\Delta_1 < \Delta_2 < \Delta_3 < \Delta_4 < \Delta_5$) despite the same distortion radius increment ($R_1 = R_2 = R_3 = R_4 = R_5$) [36]. For a polynomial-based distortion function, the more scattered the distorted points and the larger the distortion increments are, the lower the regression accuracy of the function. As a result, the estimated accuracy of the partitions' distortion parameters decreases gradually from inside to outside ($\varepsilon_1 > \varepsilon_2 > \varepsilon_3 > \varepsilon_4 > \varepsilon_5$).

**Figure 2.** Two distortion partitioning methods: (**a**) equal-radius partition; (**b**) equal-increment partition.
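The equal-increment idea of Figure 2b can be sketched numerically: given a monotone distortion-magnitude curve, partition boundaries are found by inverting the curve at equally spaced distortion values. The cubic curve below is a stand-in for a calibrated distortion curve, not a real one.

```python
import numpy as np

def equal_increment_radii(delta, r_max, n_parts, samples=10000):
    """Split [0, r_max] into n_parts rings so that each ring spans the same
    distortion increment (Figure 2b), given a monotone distortion curve
    delta(r). Returns the n_parts - 1 interior boundary radii."""
    r = np.linspace(0.0, r_max, samples)
    d = delta(r)                                           # sampled distortion curve
    targets = np.linspace(d[0], d[-1], n_parts + 1)[1:-1]  # equal increments
    return np.interp(targets, d, r)                        # invert delta(r)

# Illustrative cubic radial distortion magnitude |delta r| = K1 * r^3 (pixels).
K1 = 1.0e-7
bounds = equal_increment_radii(lambda r: K1 * r**3, r_max=1000.0, n_parts=4)
print(bounds)  # boundaries crowd toward the edge, where distortion grows fast
```

Because the distortion grows quickly near the image edge, the equal-increment boundaries are denser there, which is exactly what keeps the per-partition regression accuracy balanced.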

To solve the problem, a DoF distortion model based on the equal-increment partition is proposed in this paper. The in-plane distortion is first partitioned so that each partition spans the same distortion increment rather than the same radius increment; the partition boundaries are therefore denser toward the image edge, where the distortion grows quickly.
Then the distortion partition of the 2D object plane is extended to the 3D DoF. As can be seen from Equation (1), the object-to-image mapping satisfies the following:

$$\begin{cases} \mathbf{x} = f \cdot \frac{\mathbf{X}\_m}{\mathbf{Z}\_m} = f \cdot \frac{\mathbf{X}\_k}{\mathbf{Z}\_k} \\ \mathbf{y} = f \cdot \frac{\mathbf{Y}\_m}{\mathbf{Z}\_m} = f \cdot \frac{\mathbf{Y}\_k}{\mathbf{Z}\_k} \end{cases} \tag{18}$$

where $P_m(X_m, Y_m, Z_m)$ and $P_k(X_k, Y_k, Z_k)$ are two points in the VCS. The 2D point $p(x, y)$ (in millimeters) is the image projection of $P_m$ and $P_k$ ($P_m$, $P_k$, and $O_C$ are collinear). Let $\rho$ be the partition radius; then $x^2 + y^2 = \rho^2$, and we get

$$\begin{cases} f^2 \cdot \frac{X\_m^2}{Z\_m^2} + f^2 \cdot \frac{Y\_m^2}{Z\_m^2} = \rho^2\\ f^2 \cdot \frac{X\_k^2}{Z\_k^2} + f^2 \cdot \frac{Y\_k^2}{Z\_k^2} = \rho^2 \end{cases} \tag{19}$$

From the above equation, we know that $f \cdot R_m = \rho \cdot Z_m$ and $Z_m \cdot R_k = Z_k \cdot R_m$, where $Z_m$ and $Z_k$ are the depths of the *m*-th ($\Pi_m$) and the *k*-th ($\Pi_k$) object planes in the VCS respectively, and $R_m$ and $R_k$ are the partition radii of the two object planes. Let $s_m = Z_m$ and $s_k = Z_k$, and then extend the above distortion partitions to the 3D DoF domain. As shown in Figure 3, if the range of the *g*-th partition in the object plane $\Pi_m$ is $\left[(g-1) \cdot R_m,\ g \cdot R_m\right]$, the partition ranges in the object planes $\Pi_k$ and $\Pi_n$ are $\left[(g-1) \cdot (s_k \cdot R_m / s_m),\ g \cdot (s_k \cdot R_m / s_m)\right]$ and $\left[(g-1) \cdot (s_n \cdot R_m / s_m),\ g \cdot (s_n \cdot R_m / s_m)\right]$ respectively. In this way, although the partition radius differs between planes, the distortion coefficients can be obtained with high accuracy because the image distortion is partitioned by equal distortion increments.

**Figure 3.** The geometric relationship between the partition radii in different object planes.
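The scaling of partition ranges across object planes described above can be sketched as a one-line helper; the distances and the 40 mm radius are illustrative.

```python
def partition_range(g, R_m, s_m, s_target):
    """Radial range [(g-1)*R, g*R] of the g-th partition on the object plane
    at distance s_target, scaled from the partition radius R_m defined on the
    plane at s_m (from f*R_m = rho*Z_m and Z_m*R_k = Z_k*R_m in Section 3.1)."""
    R = s_target * R_m / s_m   # partition radius on the target plane
    return ((g - 1) * R, g * R)

# A 40 mm partition radius on the plane at 400 mm corresponds to an 80 mm
# radius on the plane at 800 mm (the same cone of projection rays).
print(partition_range(1, 40.0, 400.0, 800.0))  # -> (0.0, 80.0)
print(partition_range(2, 40.0, 400.0, 800.0))  # -> (80.0, 160.0)
```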


#### *3.2. Equal-Increment Partition Based DoF Distortion Model*

After partitioning the DoF distortion, we incorporate the partitions into the DoF distortion model. Procedures to solve the partition radius and distortion coefficients on any object distance *s<sup>n</sup>* are as follows:


$$\begin{cases} {}^{g}K_i^{s,s_n} = f({}^{g}K_i^{s,s_m},\ {}^{g}K_i^{s,s_k},\ s_n) \\ {}^{g}P_i^{s,s_n} = f({}^{g}P_i^{s,s_m},\ {}^{g}P_i^{s,s_k},\ s_n) \end{cases} \tag{20}$$

where the left superscript *g* denotes the *g*-th partition. From the equation, we know

$$\begin{cases} {}^{g}K_i^{s,s_n} = \frac{{}^{g}K_i^{s,s_k} \cdot (s_n - f) \cdot (s_k - s_m) + (s_k - s_n) \cdot (s_m - f) \cdot ({}^{g}K_i^{s,s_m} - {}^{g}K_i^{s,s_k})}{(s_n - f) \cdot (s_k - s_m)} \\ {}^{g}P_i^{s,s_n} = \frac{{}^{g}P_i^{s,s_k} \cdot (1 - s_m^2) + {}^{g}P_i^{s,s_m} \cdot (s_m \cdot s_k - 1)}{{}^{g}P_i^{s,s_k} \cdot (1 - s_m \cdot s_n) + {}^{g}P_i^{s,s_m} \cdot (s_n \cdot s_k - 1)} \cdot \frac{s_n}{s_m} \cdot {}^{g}P_i^{s,s_m} \end{cases} \quad i = 1, 2. \tag{21}$$

At this point, we have established an equal-increment partition based DoF distortion model for any object plane at *s<sup>n</sup>* when the lens is focused at distance *s*.

#### **4. Calibration Method for Camera Parameters**

In close-range photography, the DoF images are seriously distorted, so the calibration accuracy of the distortion parameters is the decisive factor affecting the vision measurement accuracy. When the coupled-calibration method is used to solve the distortion parameters, the estimated errors of intrinsic and extrinsic parameters will be propagated to distortion parameters. Thus, a two-step method is proposed to calibrate the camera parameters, in which distortion parameters are estimated independently.

#### *4.1. Independent Distortion Calibration Method Based on Linear Conformation*

Figure 4 details the experimental system for DoF lens distortion, which consists of a monocular camera, a control field, a light source, an electric control platform, and a multi-axis motion controller. The X, Y, A, and C axes of the platform are in the object space, while the Z-axis is in the image space. A control field, with the features of circle, corner, and line, is used to calibrate the lens distortion, and the geometric relationship between the features is known accurately. On this basis, the pose of the control field relative to the image plane can be adjusted by the Perspective-n-Point (PnP) algorithm.

In this paper, the distortion coefficients can be estimated by the plumb-line method [12] alone. It is defined by Brown (1971) as "a straight line in the object space will be mapped to the image plane in a straight way after a perfect lens, and any change of straightness can be reflected as the lens distortion described by the radial and decentering distortion coefficients."


**Figure 4.** Experimental system for calibrating depth of field (DoF) lens distortion.

As demonstrated in Figure 5, when *N* edge points $(u_1, v_1), \cdots, (u_N, v_N)$ on the same curve are known, the regression line equation determined by the point group is

$$
\alpha' u + \beta' v - \gamma' = 0 \tag{22}
$$

Let $\alpha' = \sin\theta$, $\beta' = \cos\theta$, $\gamma' = A_u \sin\theta + A_v \cos\theta$, and $\tan 2\theta = -\frac{2V_{uv}}{V_{uu} - V_{vv}}$, where $\theta$ is the angle between the regression line and the $u$ axis (Figure 5), $A_u = \frac{1}{N}\sum_{i=1}^{N} u_i$, $A_v = \frac{1}{N}\sum_{i=1}^{N} v_i$, $V_{uu} = \frac{1}{N}\sum_{i=1}^{N} (u_i - A_u)^2$, $V_{uv} = \frac{1}{N}\sum_{i=1}^{N} (u_i - A_u)(v_i - A_v)$, and $V_{vv} = \frac{1}{N}\sum_{i=1}^{N} (v_i - A_v)^2$.
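These closed-form definitions translate directly into code. The sketch below fits the regression line through a point group; it is our illustrative implementation, using `atan2` in place of the tangent so that vertical and 45-degree lines need no special casing.

```python
import math

# Closed-form orthogonal regression line through N edge points, following the
# definitions of A_u, A_v, V_uu, V_uv, V_vv, and theta in the text.

def fit_line(points):
    """Return (alpha, beta, gamma) with alpha*u + beta*v - gamma = 0."""
    n = len(points)
    a_u = sum(u for u, _ in points) / n
    a_v = sum(v for _, v in points) / n
    v_uu = sum((u - a_u) ** 2 for u, _ in points) / n
    v_vv = sum((v - a_v) ** 2 for _, v in points) / n
    v_uv = sum((u - a_u) * (v - a_v) for u, v in points) / n
    # 2*theta = atan2(-2*V_uv, V_uu - V_vv) picks the residual-minimizing axis.
    theta = 0.5 * math.atan2(-2.0 * v_uv, v_uu - v_vv)
    alpha, beta = math.sin(theta), math.cos(theta)
    gamma = a_u * alpha + a_v * beta
    return alpha, beta, gamma
```

For any point, the signed point-to-line distance is then simply `alpha*u + beta*v - gamma`, which is the residual that enters the cost in Equation (23).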

Given that there are *L* lines, with *N<sub>l</sub>* points in the *l*-th line, the average sum of squared distances from the points $(u_{li}, v_{li})$ to their corresponding lines can be written as

$$D = \frac{1}{L} \cdot \sum\_{l=1}^{L} \frac{1}{N\_l} \cdot \sum\_{i=1}^{N\_l} \left( \alpha\_l' u\_{li} + \beta\_l' v\_{li} - \gamma\_l' \right)^2 \tag{23}$$

Any distortion of a line's straightness in the image plane can be corrected by a mapping involving radial and decentering distortion. Thus, substituting Equation (2) into Equation (23), we can get

$$F(\overline{u}\_{li}, \overline{v}\_{li}; K\_1, K\_2, P\_1, P\_2) = 0 \tag{24}$$


**Figure 5.** Schematic diagram of distortion calibration based on linear conformation configuration.

If there are *L* lines in an image and *N<sup>l</sup>* observation points are extracted from each line, we can have *L* · *N<sup>l</sup>* equations. In these equations, there are *L* + 4 variables (*L* line coefficients and 4 distortion coefficients). If *L* · *N<sup>l</sup>* > *L* + 4, the optimal solution of distortion coefficients can be obtained.
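The quantity an optimizer would drive to zero over the distortion coefficients is the straightness cost *D*. Below is a numeric sketch of that objective; the Brown-style radial/decentering model used here is our assumption for Equation (2), which is defined earlier in the paper, and only the cost evaluation is shown (a Levenberg–Marquardt solver would minimize it over *K*<sub>1</sub>, *K*<sub>2</sub>, *P*<sub>1</sub>, *P*<sub>2</sub>).

```python
import math

def distort(u, v, k1, k2, p1, p2):
    """Brown-style radial + decentering model about the principal point (0, 0).
    Assumed form of Equation (2), for illustration only."""
    r2 = u * u + v * v
    radial = k1 * r2 + k2 * r2 * r2
    ud = u + u * radial + p1 * (r2 + 2 * u * u) + 2 * p2 * u * v
    vd = v + v * radial + p2 * (r2 + 2 * v * v) + 2 * p1 * u * v
    return ud, vd

def straightness_cost(lines):
    """Average squared point-to-line distance D (Equation (23)) over a list of
    point lists, fitting the regression line of each curve in closed form."""
    total = 0.0
    for pts in lines:
        n = len(pts)
        a_u = sum(u for u, _ in pts) / n
        a_v = sum(v for _, v in pts) / n
        v_uu = sum((u - a_u) ** 2 for u, _ in pts) / n
        v_vv = sum((v - a_v) ** 2 for _, v in pts) / n
        v_uv = sum((u - a_u) * (v - a_v) for u, v in pts) / n
        theta = 0.5 * math.atan2(-2.0 * v_uv, v_uu - v_vv)
        a, b = math.sin(theta), math.cos(theta)
        g = a_u * a + a_v * b
        total += sum((a * u + b * v - g) ** 2 for u, v in pts) / n
    return total / len(lines)
```

Perfectly straight observations give *D* near zero, while applying radial distortion to the same points bends them and raises the cost, which is exactly the signal the plumb-line calibration exploits.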

After solving the image distortion coefficients, the inverse mapping $imR(u, v) = imD(u_d, v_d)$ between the undistorted image $imR$ and the distorted image $imD$ is established by cubic B-spline interpolation, and the image distortion can thereby be corrected. In addition, three straightness indicators, namely the maximum, average, and root mean square (RMS) $d = \sqrt{D / \sum_{l=1}^{L} N_l}$ of the point-to-line distances, together with the Peak Signal-to-Noise Ratio, $PSNR = 10 \times \log_{10}\left((2^n - 1)^2 / MSE\right)$, are used to evaluate the distortion correction effect. $D$ is defined in Equation (23), and $MSE$ is the mean square error between the images before and after distortion correction.
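The two evaluation metrics above are standard formulas and can be computed directly; the helper names below are ours.

```python
import math

def rms_straightness(D, points_per_line):
    """RMS straightness indicator d = sqrt(D / sum_l N_l), with D from Eq. (23)
    and points_per_line the list of point counts N_l per line."""
    return math.sqrt(D / sum(points_per_line))

def psnr(mse, n_bits=8):
    """PSNR = 10 * log10((2^n - 1)^2 / MSE) in dB for an n-bit image."""
    peak = (2 ** n_bits - 1) ** 2
    return 10.0 * math.log10(peak / mse)
```

For an 8-bit image, an MSE of 1 gray level corresponds to roughly 48.1 dB, so the 37.61 dB reported in Table 1 implies an MSE on the order of 11 gray levels.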

#### *4.2. Image Processing and Camera Calibration*

In this paper, the parameters of the equal-increment partition based DoF distortion model are calculated using straight lines in particular areas of the control field. To this end, a corner-control-based method for extracting line segments within a partition is proposed. As shown in Figure 6, the image processing procedure includes the following:


By combining the image processing results with the DoF distortion partition model, distortion parameters at any position of the DoF can be determined. To avoid the coupling effect between the distortion parameters and other parameters in the camera model, the camera's intrinsic and extrinsic parameters are preliminarily calibrated by Zhang's method. Then, we fix the distortion parameters and place the high-precision target in multiple spatial positions to optimize the intrinsic and extrinsic parameters. The cost function to be optimized is

$$E^{q}\_{\text{depth\\_dependent}}(\mathbf{R}\_{q}) = \sum\_{g=1}^{m\_{g}} \mathbf{H}^{-1}\left(u\_{0}, v\_{0}, f\_{x}, f\_{y}, {}^{g}K\_{i}, {}^{g}P\_{j}, \mathbf{R}\_{q}, \mathbf{T}\_{q}\right), \quad i = 1, 2;\ j = 1, 2 \tag{25}$$

where $E^{q}_{\text{depth\_dependent}}(\mathbf{R}_q)$ describes the cost function when the control field is in the *q*-th pose, $\mathbf{R}_q$ and $\mathbf{T}_q$ are the rotation and translation matrices of the *q*-th pose, and ${}^{g}K_i$ and ${}^{g}P_j$ are the *i*-th order radial and *j*-th order decentering distortion coefficients in the *g*-th partition of the *q*-th pose. By using the Levenberg–Marquardt (LM) algorithm, the optimal solution of the camera's intrinsic and extrinsic parameters can be obtained.

Through the above process, the monocular camera calibration can be realized. In practice, after estimating the 3D position $(X, Y, Z)$ of a spatial point, the partition where the point is located can be determined from its radial image position $\frac{\sqrt{X^2 + Y^2} \cdot f}{\rho \cdot Z}$. Then, the observed distortion can be corrected by choosing the proper distortion coefficients, thus realizing high-accuracy vision measurements.
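The partition lookup can be sketched as follows, taking $\rho$ as the pixel size so that the formula yields a radial distance in pixels. The ring radii and units in the example are illustrative, not values from the paper.

```python
import bisect
import math

# Select the distortion partition for a reconstructed 3D point by projecting
# its radial distance into the image: r = sqrt(X^2 + Y^2) * f / (rho * Z).

def partition_index(X, Y, Z, f, rho, ring_radii):
    """Return the 0-based index of the concentric ring containing the point's
    image; ring_radii are the sorted outer radii (pixels) of the inner rings."""
    r_pix = math.sqrt(X * X + Y * Y) * f / (rho * Z)
    return bisect.bisect(ring_radii, r_pix)
```

For instance, with f = 18 mm, ρ = 0.005 mm/pixel, and Z = 400 mm, a point 50 mm off-axis lands at a radius of 450 pixels, i.e., in the third ring of the boundaries [200, 400, 600, 800].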

**Figure 6.** Image processing procedures for linear conformation: (**a**) image of the control field; (**b**) corner detection; (**c**) edge point connection; (**d**) point reselection; (**e**) horizontal line extraction; (**f**) line detection results in different areas.

#### **5. Accuracy Verification Experiments of Both the Distortion Modeling and Calibration Method**

#### *5.1. Experimental Verification of the 2D Distortion Partitioning Method*



**Figure 7.** Experimental system for DoF distortion calibration: (**a**) system hardware; (**b**) control field.


First, the accuracy of the 2D distortion partitioning method is verified. The image of the control field at the focus distance is divided into five concentric rings by the equal-radius (Figure 8a–e) and equal-increment (Figure 9a–e) distortion partition models, respectively. In each partition, the corresponding lines (green ones) are selected to solve the distortion coefficients and correct the image distortion. For each of the two partitioning methods, five corrected images are obtained (i.e., Figures 8f–j and 9f–j). Here, we take Figures 8f and 9f as an example to illustrate the results of distortion correction. The distortion at the image edge predicted by the distortion coefficients of the first partition far exceeds the actual distortion there; after correction, the distortion is over-removed, resulting in residual distortion in the opposite direction.

**Figure 8.** Distortion calibration and correction results based on the equal-radius partition method (f = 18 mm): (**a**) partition 1; (**b**) partition 2; (**c**) partition 3; (**d**) partition 4; (**e**) partition 5; (**f**) distortion correction (partition 1); (**g**) distortion correction (partition 2); (**h**) distortion correction (partition 3); (**i**) distortion correction (partition 4); (**j**) distortion correction (partition 5); (**k**) difference (partition 1); (**l**) difference (partition 2); (**m**) difference (partition 3); (**n**) difference (partition 4); (**o**) difference (partition 5).

**Figure 9.** *Cont*.

**Figure 9.** Distortion calibration and correction results based on the proposed partition method (f = 18 mm): (**a**) partition 1; (**b**) partition 2; (**c**) partition 3; (**d**) partition 4; (**e**) partition 5; (**f**) distortion correction (partition 1); (**g**) distortion correction (partition 2); (**h**) distortion correction (partition 3); (**i**) distortion correction (partition 4); (**j**) distortion correction (partition 5); (**k**) difference (partition 1); (**l**) difference (partition 2); (**m**) difference (partition 3); (**n**) difference (partition 4); (**o**) difference (partition 5).

To compare the distortion correction effect of each partition in the image, we subtract the corrected image from the simulated undistorted image, yielding Figures 8k–o and 9k–o. Clearly, the smaller the gray value of the difference image, the closer the corrected image is to the ground truth and the better the distortion removal. As can be seen from the figures, the distortion correction results of each partition by the equal-radius partitioning method (Figure 8k–m) were not as good as those by the equal-increment partitioning method. Notably, in the fourth and fifth partitions located at the edge of the image, the green concentric rings in Figure 8n–o had larger gray values, while in Figure 9n–o the gray values of the pixels in the green concentric rings were approximately 0. This shows that the equal-increment partitioning method performed better at eliminating distortion.

Meanwhile, all the lines were also used together to solve and correct the image distortion, and the distortion correction effects with and without partitioning were compared using the aforementioned indexes (Section 4.1). As shown in Table 1, the undistorted images obtained by the two distortion partition methods had a good PSNR of up to 37.61 dB. Compared with the results obtained without partitioning, both partition methods showed a smaller straightness error in each partition. However, between the two partitioning methods, the maximum and average errors in the fourth and fifth partitions of the equal-radius partitioning method were at least 4 times and 2 times, respectively, those of the partitioning method proposed in this paper. That is to say, with the equal-increment partitioning method, each partition yields better distortion correction results. The enlarged image of the best distortion curve for each partition is shown in Figure 10, which validates the effectiveness and accuracy of the proposed partitioning method in 2D settings.

**Figure 10.** Distortion curves solved by the lines in each partition using the proposed partition method.


**Table 1.** Comparison of distortion correction of the two partition models.

#### *5.2. Accuracy Verification Experiments of DoF Distortion Partitioning Model and Camera Calibration*

In this section, the accuracy of the DoF distortion model and camera calibration is verified. The control field is driven to four different object planes within the DoF, two of which are at the limit positions of the front and rear DoF, while the other two are inside the DoF. The front object plane was divided into five areas with an equal distortion increment of 20.2 pixels. Then, based on the distortion parameters of the two object planes with known depths, the distortions in the other two object planes are calculated by the non-partition model, the proposed DoF distortion model with equal-radius partition, and the proposed DoF distortion model with equal-increment distortion partition, respectively. Thereafter, we manually adjusted the focus ring to focus the lens on the two object planes located at the limit positions of the front and rear DoF. Then, based on the calculated radial and decentering distortion coefficients on the two focal planes, Brown's model [12] with equal-radius partition is used to estimate the distortion parameters on the two planes within the DoF.

Furthermore, the results are compared with the distortion directly solved from the lines (the observed value) within the corresponding partition. To compare the accuracy of the different DoF distortion models, we took the in-plane point located in the common area (the second column of Table 2) partitioned by the two models at the same object distance (e.g., 400 mm in the first column of Table 2) as an example. As shown in Table 2, for Brown's model [12] with equal-radius partition, the maximum and average absolute differences between the calculated and observed values were 7.32 µm and 2.81 µm, respectively. These errors are smaller than those of the traditional Zhang's model, which considers neither the DoF nor distortion partitioning, but much larger than those of the two proposed DoF distortion models. The maximum and average absolute differences between the calculated and observed values of the equal-increment distortion partition based DoF distortion model were 1.53 µm and 0.88 µm, respectively. By contrast, the errors of the equal-radius partition based DoF distortion model were 4.64 µm and 1.94 µm, more than two times those of the proposed model. These results verify the accuracy of the DoF distortion partitioning model in 3D settings.


**Table 2.** Accuracy verification for the DoF distortion partition model.

*Sensors* **2020**, *20*, 5934

Images of circular markers with known precise spacing on the planar artifact are collected, and the calibration accuracy of the monocular camera is verified by the re-projection errors and the angular reconstruction errors, respectively. Specifically, the planar artifact is driven by the high-accuracy pitch axis to rotate through five positions, with 10° between adjacent positions. In each position, the pose matrix between the planar artifact (Figure 11a) and the calibrated camera is calculated by the OPnP algorithm [38] with the equal-radius and the equal-increment partitioning based DoF distortion models, respectively. Thereafter, 20 markers are projected back to the image via the estimated pose matrix, and the re-projection errors, i.e., the image distances between the projected and observed points, are calculated. As shown in Figure 11b, for the equal-radius DoF distortion partition model, the maximum and average re-projection errors over the five positions were 0.29 pixels and 0.17 pixels, respectively, while those of the proposed model were 0.11 pixels and 0.05 pixels. The angle between two adjacent positions of the artifact is reconstructed with the two models as well. As illustrated in Figure 11c, the 3D measurement accuracy of the system is assessed by comparing the reconstructed angle with the nominal angle. The results show that the maximum and average angular errors of the equal-radius based DoF distortion partition model were 0.48° and 0.30°, respectively, while those of the proposed model were 0.013° and 0.011°, which means that the angular reconstruction errors are effectively reduced. The above results comprehensively verify the accuracy of the DoF distortion partitioning model and the camera calibration method proposed in this paper.
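The re-projection error statistics used in Figure 11b are straightforward to compute once the markers have been projected back through the estimated pose; the projection step itself is outside this sketch, and the sample coordinates below are illustrative.

```python
import math

# Per-marker re-projection errors: image distance between each projected
# point and the corresponding observed point; max/mean summarize a position.

def reprojection_errors(projected, observed):
    """Return the pixel distance for each (projected, observed) point pair."""
    return [math.hypot(up - uo, vp - vo)
            for (up, vp), (uo, vo) in zip(projected, observed)]

proj = [(100.0, 100.0), (200.0, 150.0)]
obs = [(100.1, 100.0), (200.0, 150.2)]
errs = reprojection_errors(proj, obs)
max_err, mean_err = max(errs), sum(errs) / len(errs)
```

Aggregating `max_err` and `mean_err` over all 20 markers and all five artifact positions reproduces the kind of figures reported above (e.g., 0.11 and 0.05 pixels for the proposed model).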

**Figure 11.** *Cont*.

**Figure 11.** Camera calibration accuracy verification: (**a**) artifact; (**b**) re-projection error; (**c**) angular reconstruction error.

#### **6. Conclusions**

This paper has investigated methods for modeling and calibrating lens distortion in close-range photogrammetry (i.e., short object distance and small focal length). Our work finds the following:


The main limitation of the present study is that the number of partitions is not optimized to achieve higher calibration accuracy. Our future work will focus on this and extend our model to other optical systems with fisheye or catadioptric lenses.

**Author Contributions:** Conceptualization, W.L. and X.L.; formal analysis, X.L.; funding acquisition, W.L. and X.L.; investigation, X.L.; methodology, X.L.; project administration, W.L.; supervision, X.M. and X.Y. (Xiaokang Yin); writing—original draft, X.L.; writing—review & editing, X.Y. (Xin'an Yuan). All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Special National Key Research and Development Plan (No. 2016YFC0802303), National Natural Science Foundation of China (No. 52005513), and the Fundamental Research Funds for the Central Universities (No. 27RA2003015).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

(List of abbreviations and symbols present in this article)



#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

