**1. Introduction**

In recent years, vision-based human activity recognition (HAR) has developed rapidly, with many exciting achievements [1–3]. Camera calibration is the upstream task of vision-based HAR: it establishes the mapping between real space and image space. Its accuracy determines the performance of downstream tasks such as feature point recognition and 3D reconstruction, and thereby affects the final performance of vision-based HAR [4]. For instance, the fisheye camera, widely used in HAR tasks in monitoring and security, offers an ultra-wide-angle field of view, but objects at the edge of a fisheye image suffer great deformation and severe information distortion. If the camera's distortion is not accurately calibrated, the accuracy of subsequent algorithms is seriously degraded. Camera calibration is therefore of great significance to vision-based HAR, including daily activity recognition, self-training for sports exercises, gesture recognition, and person tracking [5].

Distortion calibration of the camera affects the accuracy of the other parameter estimates. As the field has developed, the degrees of freedom of distortion models have increased,
and there is now a considerable difference between polynomial distortion models and point-to-point distortion models. In 1992, Weng [6] summarized the camera distortion models, namely radial, decentering, and thin prism distortion, which describe the real distribution of distortion with polynomials and parameters. Polynomial distortion models are idealized and deviate from the actual imaging relationship of the camera, which limits the accuracy of calibration methods built on them. To achieve higher accuracy in distortion calibration, several general distortion models and corresponding calibration methods [7–13] have been proposed.

Since radial distortion is the dominant camera distortion, some researchers [7,8] developed a general radial distortion model that does not adopt the classical two-to-six-parameter radial distortion but a freer form of radial distortion. Inspired by their success, more general distortion models were developed [9–13] that describe lens distortion per pixel or by some form of interpolation. In this kind of model, since distorted points can be extracted directly, the key problem is determining the original position (in pixels or in space) of the distorted points. Sagawa et al. employed structured-light patterns to obtain a dense distortion sample; the camera is aligned opposite the target so that the feature points remain fixed [9]. Aubrey K. et al. set up a synthetic image plane and recorded distortion as the bias between real camera images and images projected onto the synthetic image plane [10]. Jin et al. assumed that distortion in the central area of the image plane is negligible and calculated the distortion of the surrounding area by cross-ratio invariance [13]. Based on a raxel model, Thomas S. et al.'s pipeline [11] achieved the highest accuracy but needs a large number of images. In our method, we designed a novel objective function that treats the distortion of each pixel as a constant quantity across different images and reprojects reference images using the optimization results to create "virtual photos", which determine the original positions of distorted points.

Our method is based on the central generic camera model, which assumes that all light rays pass through a single optical center during imaging. Since the rays diverge from a single point in this model, the order and spacing ratio of the rays remain unchanged, and the distortion rectification map does not vary with distance. Accordingly, there is good reason to believe that the distortion of a pixel is consistent across images taken with the same camera, which is the basis of our objective function.

Before the iteration, using an initial estimate of the parameters from Zhang's calibration method [14], we reproject reference images to create "virtual photos" and extract dense features between the "virtual photos" and the original images. Our objective function is the mean square error of the deformation between the "virtual photos" and the original images. This objective function removes the influence of distortion during parameter optimization and yields a more precise estimate of the camera parameters and of the target pose in each image.
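To make the shape of this objective concrete, here is a minimal sketch under our reading of the description above; `render_virtual_photo` and `dic_match` are hypothetical callables standing in for the reprojection step and the DIC matcher, not functions from any published code.

```python
import numpy as np

def reprojection_objective(params, image_pairs, render_virtual_photo, dic_match):
    """Sketch: mean square error of the DIC-measured deformation between each
    "virtual photo" (the reference pattern reprojected with the current
    parameter estimate) and the corresponding original image.

    render_virtual_photo and dic_match are hypothetical, injected callables.
    """
    errors = []
    for original, reference in image_pairs:
        # Reproject the reference pattern with the current estimate of the
        # camera parameters and target pose to form a "virtual photo".
        virtual = render_virtual_photo(reference, params)
        # Dense per-pixel displacement field between the two images (DIC).
        dx, dy = dic_match(virtual, original)
        errors.append(np.mean(dx ** 2 + dy ** 2))
    # Minimizing this residual removes the influence of distortion on the
    # parameter estimate, per the description above.
    return float(np.mean(errors))
```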

To describe deformation adequately in the objective function, our method requires dense features. Although active phase targets can provide dense features [9,10,15,16], they are inconvenient to use. Chen et al.'s work [17] verified the accuracy and stability of feature detection methods based on digital image correlation (DIC). In Gao et al.'s work [18], the result of DIC is used to assess the accuracy of camera distortion calibration. Inspired by these works, we incorporated a speckle pattern target and DIC into our camera calibration method; unlike Chen et al., however, we do not use a polynomial distortion model but a full-pixel distortion description.

Since the polynomial distortion model is only an approximation of the real distortion, the results of camera calibration methods based on it are affected by incomplete distortion estimation. Our method establishes a point-to-point correspondence between distorted pixels and rectified pixels, which describes the camera distortion more comprehensively and thus yields a more accurate estimate of the camera parameters. Compared with methods based on the raxel model, our method needs only dozens of images and does not require strict experimental conditions.

In our results, the distortion at each point is calculated as the average of the DIC results across multiple images, which together form a distortion rectification map that maps images taken by the camera to undistorted ones. Figure 1 displays a distortion rectification map obtained by our point-to-point distortion calibration method. Figure 2 illustrates the difference between Figure 1 and the distortion rectification map obtained by Zhang's method with a polynomial distortion model on the same set of calibration images, revealing free-form distortion that the polynomial distortion model cannot describe.

**Figure 1.** (**a**) Distortion rectification map of the point-to-point calibration method in the X direction; (**b**) distortion rectification map of the point-to-point calibration method in the Y direction.

**Figure 2.** (**a**) Distortion rectification map of the point-to-point calibration method subtracted from the distortion rectification map of Zhang's calibration method in the X direction; (**b**) distortion rectification map of the point-to-point calibration method subtracted from the distortion rectification map of Zhang's calibration method in the Y direction.
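A minimal sketch of the averaging step just described, assuming each calibration image yields one (dx, dy) pair of DIC displacement fields of shape (H, W), with pixels lacking a valid DIC result marked as NaN (names and conventions are ours):

```python
import numpy as np

def build_rectification_map(displacement_fields):
    """Average per-pixel DIC displacements over multiple images to form the
    point-to-point distortion rectification map, one matrix per direction.

    displacement_fields: list of (dx, dy) array pairs, one per image.
    """
    dx_stack = np.stack([dx for dx, _ in displacement_fields])
    dy_stack = np.stack([dy for _, dy in displacement_fields])
    # nanmean ignores pixels where DIC failed in a particular image.
    dx_map = np.nanmean(dx_stack, axis=0)  # X-direction map
    dy_map = np.nanmean(dy_stack, axis=0)  # Y-direction map
    return dx_map, dy_map
```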

The paper is organized as follows. Section 2 reviews related work. Section 3 introduces the camera model and lens distortion used in our method. Section 4 describes our point-to-point distortion calibration method. In Section 5, experiments are performed to verify the effectiveness of our method. In Section 6, we discuss issues not covered above. Finally, conclusions are drawn in Section 7.

#### **2. Related Work**

#### *2.1. Camera Model*

From special to general, camera models can be classified as perspective cameras, central generic cameras, and non-central generic cameras [19]. The perspective camera is a single-viewpoint camera described by the pinhole imaging model, in which the imaging process is a projective transformation; it includes finite projective cameras and affine cameras [20].

The central generic camera includes the wide-angle camera, the fisheye camera, and other cameras with refraction and reflection [19]; its imaging process cannot be described by a projective transformation, but it still has a single focal point. In the imaging process of such a camera, since the rays radiate from a single point, the order and spacing ratio of the rays remain unchanged, and the distortion rectification map remains constant with distance. That is why a distortion rectification map can describe the central generic camera's distortion. Following distortion rectification, the central generic camera is reduced to a camera that conforms to the pinhole imaging model.

The non-central generic camera is also referred to as a general camera. It lacks a single focal point, so the order and spacing ratio of the rays vary with distance, and a distortion rectification map cannot be used for distortion correction. Michael D. Grossberg and Shree K. Nayar of Columbia University first proposed the raxel model for a general camera [21], which uses a point *p* and a direction *q* to describe a ray entering the camera from the outside and striking the sensor. Subsequent works on general camera calibration have adopted the raxel model [11,19,22–24].

#### *2.2. Pattern Design and Feature Detection*

While a chessboard or circle pattern target is usually used in camera calibration, several methods for improving feature detection precision have been proposed [25–29]. Ha et al. discussed a triangle pattern target [30], in which the intersections of three triangles, approximated using a series of third-order polynomials, serve as control points. Active phase targets are also used for calibration [9,10,15,16], offering more freedom in feature placement and in defocused situations. Chen et al. utilized speckle patterns and extracted feature points with the DIC method [17]; their experiments demonstrated that calibrating with a speckle pattern produces a smaller reprojection error than calibrating with a chessboard or circle pattern.

#### *2.3. Digital Image Correlation Method*

Digital image correlation (DIC), first proposed by researchers at the University of South Carolina [31], is a method for measuring material deformation. In application, there are two kinds of DIC: (1) 2D-DIC, which is used for flat materials and requires the material to remain flat during measurement; and (2) stereo-DIC, which is used for three-dimensional materials and deformations and can handle more variable situations.

The core objective of DIC algorithms is to match points of interest (POIs) of the speckle pattern on the material surface across images, which usually involves two main steps: (1) obtaining an initial guess and (2) iterative optimization. For the first step, methods such as correlation criteria [32,33], fast Fourier transform-based cross-correlation (FFT-CC) [34], and the scale-invariant feature transform (SIFT) [35] provide a path-independent initial guess. For iterative optimization, Bruck et al. [36] proposed the forward additive Newton–Raphson (FA-NR) algorithm, which was later improved and widely used. As computing the gradient and the Hessian matrix during optimization is a noticeable burden, one feasible approach is to simplify the Hessian matrix under certain assumptions, converting the method into a forward additive Gauss–Newton (FA-GN) algorithm. Pan et al. introduced the inverse compositional Gauss–Newton (IC-GN) algorithm into DIC [37]; it maintains a constant Hessian matrix that can be pre-computed.
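To illustrate the FFT-CC initial-guess step, here is a minimal sketch of our own (not code from the cited works): it correlates a reference subset with a deformed subset in the Fourier domain and reads an integer-pixel displacement off the correlation peak.

```python
import numpy as np

def fftcc_initial_guess(ref_subset, def_subset):
    """Integer-pixel DIC initial guess via FFT-based cross-correlation.

    Both inputs are equally sized 2D grayscale subsets; returns the (du, dv)
    shift of the deformed subset relative to the reference subset.
    """
    f = ref_subset - ref_subset.mean()
    g = def_subset - def_subset.mean()
    # Cross-correlation via the convolution theorem.
    corr = np.fft.ifft2(np.fft.fft2(f).conj() * np.fft.fft2(g)).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Indices past N/2 correspond to negative (wrapped-around) shifts.
    dv, du = (p if p <= n // 2 else p - n for p, n in zip(peak, corr.shape))
    return du, dv
```

The result seeds the iterative optimization (e.g., IC-GN), which then refines the displacement to sub-pixel accuracy.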

#### **3. Model of Camera and Lens Distortion**

A camera can be regarded as a mapping between the 3D world and a 2D image. Our method was developed to address central generic camera calibration. To describe this 3D-to-2D mapping, we combine a pinhole camera model with a point-to-point lens distortion model.

#### *3.1. Pinhole Camera Model*

In the pinhole camera model, a point **Pw** in the 3D world is transformed into a point (u, v) in an image by Equation (1) [20]. **T** (Equation (2)) is a rigid-body transformation from point **Pw** in the world coordinate system to point (X, Y, Z) in the camera coordinate system, composed of the rotation matrix **R** and the translation vector **t**. **A** (Equation (3)) is the intrinsic parameter matrix that transforms a point in the image coordinate system (the normalized camera coordinate system) to the point (u, v) in the pixel coordinate system, where fx and fy are the focal lengths in pixels and cx and cy are the pixel coordinates of the principal point. To normalize the image plane, the formula is divided by Z.

$$
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \frac{1}{Z}\,\mathbf{A} \cdot \mathbf{d}(\mathbf{T} \cdot \mathbf{P}_{\mathbf{w}}) \tag{1}
$$

$$\mathbf{T} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ 0 & 1 \end{bmatrix} \tag{2}$$

$$\mathbf{A} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \tag{3}$$

The distortion **d** in Equation (1) describes the geometric deformation caused by the optical imaging system. In Zhang's method [14], distortion is applied on the normalized image plane using a polynomial representation [6]. In our method, however, for generality, distortion is defined as an unknown mapping.
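The following minimal sketch (function and parameter names are ours) implements the projection of Equations (1)–(3) with the distortion d taken as the identity, i.e., the projection of an ideal pinhole camera:

```python
import numpy as np

def project_pinhole(P_w, R, t, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates per Equations (1)-(3),
    with the distortion d taken as the identity mapping."""
    # Rigid-body transform T: world frame -> camera frame (Equation (2)).
    X, Y, Z = R @ P_w + t
    # Intrinsic matrix A (Equation (3)).
    A = np.array([[fx, 0.0, cx],
                  [0.0, fy, cy],
                  [0.0, 0.0, 1.0]])
    # Normalize by depth Z, then apply A (Equation (1)).
    u, v, _ = A @ np.array([X / Z, Y / Z, 1.0])
    return u, v

# Example: a point 2 m in front of the camera, slightly off-axis.
u, v = project_pinhole(np.array([0.1, -0.05, 2.0]), np.eye(3), np.zeros(3),
                       fx=800.0, fy=800.0, cx=320.0, cy=240.0)
```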

#### *3.2. Point-to-Point Lens Distortion Model*

This section illustrates the generality of the point-to-point lens distortion model and its representation. Since **A** is a linear transformation, we can modify Equation (1) to apply the distortion mapping in pixel coordinates.

$$
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{D}\left( \frac{1}{Z}\,\mathbf{A} \cdot \mathbf{T} \cdot \mathbf{P}_{\mathbf{w}} \right) \tag{4}
$$

By substituting **D** for **d**, the representation and rectification process of distortion can be simplified. The distortion calibration result obtained with this lens distortion model can be represented as a point-to-point distortion rectification map, which can describe the distortion of any central generic camera. Once the point-to-point distortion rectification mapping is obtained and applied, a central generic camera is reduced to a camera that conforms to the pinhole imaging model.

Figure 3 illustrates the mechanism of rectifying a camera with a point-to-point distortion rectification mapping. The point-to-point mapping consists of a mapping for the *X* direction and a mapping for the *Y* direction, stored as two matrices. Suppose a feature point lies at $(u_{de}, v_{de})$ in a deformed image, and the corresponding point with the same feature lies at $(u, v)$ in the reference image. Element $(u, v)$ of the *X*-direction mapping matrix stores the displacement $d^{x}_{u,v}$, in the *X* direction, of feature point $(u_{de}, v_{de})$'s location in the deformed image relative to feature point $(u, v)$'s location in the reference image; the *Y*-direction mapping matrix is defined identically. The location of feature point $(u_{de}, v_{de})$ in the deformed image can then be calculated from feature point $(u, v)$'s location in the reference image and element $(u, v)$ of the *X*- and *Y*-direction mapping matrices, as shown in Equation (5).

$$\begin{aligned} u_{de} &= u + d^{x}_{u,v} \\ v_{de} &= v + d^{y}_{u,v} \end{aligned} \tag{5}$$

For every point $(u, v)$ in the reference image, we obtain its pixel value by copying the value of the corresponding point $(u_{de}, v_{de})$. If the displacements $d^{x}_{u,v}$ and $d^{y}_{u,v}$ are not integers, bilinear interpolation is performed to obtain the value at $(u_{de}, v_{de})$. Going through every point $(u, v)$ and obtaining its value via Equation (5) and bilinear interpolation produces a complete distortion-corrected image.
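A minimal sketch of this rectification step, assuming the two mapping matrices are arrays indexed as [v, u] and using OpenCV's `remap` for the bilinear interpolation:

```python
import numpy as np
import cv2

def rectify_image(distorted, dx_map, dy_map):
    """Undistort an image with a point-to-point rectification map (Equation (5)).

    Each rectified pixel (u, v) copies the distorted image's value at
    (u + dx_map[v, u], v + dy_map[v, u]); remap performs the bilinear
    interpolation for non-integer sample locations.
    """
    h, w = dx_map.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float32),
                       np.arange(h, dtype=np.float32))
    map_x = u + dx_map.astype(np.float32)  # u_de = u + d^x_{u,v}
    map_y = v + dy_map.astype(np.float32)  # v_de = v + d^y_{u,v}
    return cv2.remap(distorted, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```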

**Figure 3.** Mechanism of point-to-point mapping.
