*Article* **Improving Census Transform by High-Pass with Haar Wavelet Transform and Edge Detection**

**Jiun-Jian Liaw <sup>1</sup>, Chuan-Pin Lu <sup>2,\*</sup>, Yung-Fa Huang <sup>1</sup>, Yu-Hsien Liao <sup>1</sup> and Shih-Cian Huang <sup>1</sup>**


Received: 29 March 2020; Accepted: 27 April 2020; Published: 29 April 2020

**Abstract:** One of the common methods for measuring distance is to use a camera and an image processing algorithm, analogous to the human eye and brain. Mechanical stereo vision uses two cameras to shoot the same object and analyzes the disparity between the two views. One of the most robust methods for calculating disparity is the well-known census transform, which suffers from the problem of conversion window selection. In this paper, three methods are proposed to improve the performance of the census transform. The first uses the low-pass band of the wavelet to reduce the computational load and the high-pass band of the wavelet to refine the disparity. The main idea of the second method is the adaptive selection of the conversion window size based on edge information. The third method applies the adaptive window size to the previously proposed sparse census transform. In the experiments, two indexes, the percentage of bad matching pixels (PoBMP) and the root mean squared (RMS) error, are used to evaluate the performance against known ground truth data. According to the results, the required computation can be reduced by the multiresolution feature of the wavelet transform. The accuracy is also improved by the modified disparity processing. Compared with previous methods, the number of operation points is reduced by the proposed adaptive window size method.

**Keywords:** census transform; sparse census transform; disparity; stereo vision

#### **1. Introduction**

The development of automatic equipment has always been one of the focuses in the field of computer science. Automatic equipment can move and explore on its own, which reduces the risk to personnel who would otherwise have to participate directly. In many automated devices, the distance between the device and the objects in the surrounding environment is an important parameter, for example when using vision to avoid obstacles. The corresponding action can be performed by determining the distance, whether this is using a robot arm to grab an object [1], automatic car driving to determine the road condition [2,3], or using a robot that can self-plan a path to shuttle through the environment [4]. All of these studies show that measuring distance is an essential part of achieving automation.

The methods of measuring distance are roughly divided into two types. The first uses rays or sound waves for direct measurement [5]. These methods emit laser, infrared, or ultrasonic signals toward the object and simultaneously record the time of transmission and the time of receiving the reflection. The distance to the object can then be calculated from the time difference between transmission and reception. The second type uses a camera and image processing or machine vision technologies [6]. These methods use a camera to capture an image of an object, then analyze the pixels in the image to measure the distance to the object. This type is further divided into two approaches: monocular vision with additional reference objects and two-eye stereoscopic vision.

The method of using rays or sound waves is more accurate and faster than the method of using a camera with an algorithm, but the equipment is expensive and the application is not easy to popularize. In recent years, with the growing popularity of digital cameras, techniques for obtaining distance by image processing have gradually received attention. Furthermore, as computers have become faster, the performance of two-eye stereoscopic processing has also improved. Even with an affordable consumer camera, a stereo vision system can be applied to measure the target distance [7].

When using image processing or machine vision technology to measure distance, the most time-consuming step is matching. Matching includes object recognition or feature extraction [8]. The problems of object recognition and feature extraction can be divided into hardware and algorithm domains. Among the hardware methods, a field-programmable gate array can be used to implement the spike-based method [9]. A circuit can also be designed to achieve low power consumption by combining an active continuous-time front-end logarithmic photoreceptor [10]. Among the algorithm methods, the visual information (such as an image) can be calculated by the address-event-representation [11] or by constructing stereo vision with cameras [12]. However, the devices for hardware matching are more expensive than consumer cameras. It is easier and more effective to use algorithms with general cameras to solve the matching problem.

Most common cameras record two-dimensional images, with only a horizontal axis and a vertical axis. Since the camera captures images in only two dimensions, the distance measurement that can be achieved is limited. To solve this problem, scholars have proposed mechanical stereo vision. This uses two cameras to shoot the same object from different positions and analyzes the distance between the camera and the object by the algorithms of image processing or machine vision. The axis perpendicular to the image plane and the two axes of the image plane constitute the three-dimensional relationship between the camera and the object [12].

When we use two cameras to observe the same object, the positions of the object on the two images will be slightly different. This difference is called disparity. A simple two-eye stereo vision model is shown in Figure 1, where *P* is the target, *CAML* and *CAMR* are the two cameras, *b* is the distance between two cameras, *f* is the focal length, *IL* and *IR* are the imaging planes, *dL* and *dR* are the distances between targets on the image planes and the centers of the images, *OL* and *OR* represent the center lines of the lens, and *Z* is the distance we are looking for. We can see that *Z* can be calculated by the relationship of similar triangles [13]:

$$\frac{b}{Z} = \frac{b - (d_L + d_R)}{Z - f} = \frac{d_L + d_R}{f} \tag{1}$$

The disparity is denoted as *dL* + *dR*, and it is the amount of horizontal displacement that is produced by the same object that is imaged by two cameras. Since both *f* and *b* are known, it can be seen that it is quite important to obtain the disparity in the stereo vision system. The key to obtaining the disparity is matching the same object in the two images [14,15].
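As an illustration, the depth recovery implied by Equation (1), Z = f·b/(d_L + d_R), can be sketched in a few lines of Python. This is a minimal sketch: the function name and the numbers in the example are our own, and NumPy is assumed.

```python
import numpy as np

def depth_from_disparity(disparity, focal_length, baseline):
    """Recover depth Z from disparity via Z = f * b / (d_L + d_R).

    disparity    : d_L + d_R, horizontal displacement in pixels
    focal_length : f, expressed in pixels
    baseline     : b, distance between the two cameras (same unit as Z)
    """
    disparity = np.asarray(disparity, dtype=float)
    # Guard against division by zero where no match was found.
    with np.errstate(divide="ignore"):
        z = focal_length * baseline / disparity
    return z

# Hypothetical example: f = 700 px, b = 0.12 m, disparity = 21 px
print(depth_from_disparity(21.0, 700.0, 0.12))  # 4.0 (meters)
```

Note that depth is inversely proportional to disparity, so small disparity errors on distant objects translate into large depth errors.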

As shown in the above description, the method for obtaining three-dimensional information in a mechanical stereo vision system is to analyze and obtain the disparity between the two images. The census transform (CT) is one of the most robust algorithms for calculating the disparity of two images [16]. When we use CT, the size of the conversion window directly affects the computational load and accuracy. A previous study confirmed that a larger conversion window makes the result of object matching more accurate [17,18]. However, once the conversion window exceeds a certain size, the improvement in matching performance becomes insignificant as the window grows further. Moreover, an oversized window not only consumes computational resources but also introduces more matching errors. Therefore, the size of the window in CT is one of the important keys to determining performance [19].

**Figure 1.** Schematic diagram of a three-dimensional model with disparity.

The calculation of CT at each pixel requires a large computational load and considerable memory. This makes it difficult for CT to be applied in real-time systems. Since an object in the image is bounded by its edge, which is a sudden change in intensity, the edge (the high-frequency information in the image) is an important image feature [20]. Edge detection and high-frequency information are therefore important tools in image processing. They have been applied in many real applications, such as oil spill detection for offshore drilling platforms [21], vehicle license plate location identification [22], pedestrian detection [23], and gesture recognition [24]. In a stereo vision system, a boundary can be used to identify whether a region is flat or textured. The boundaries in the image can be obtained by gradients. Changing the length and width of the window according to the vertical and horizontal gradients can be used to reduce the bad matching of CT [25]; however, the quality of the disparity and the operation loading are not discussed. The disparity quality can also be improved by matching with a variable window and post-processing with sub-pixel interpolation after CT [26], but this method does not adjust the window size when performing CT. Using the high-frequency sub-band (i.e., the edges) to improve the performance of CT is thus one of the feasible approaches. The wavelet transform is a multi-resolution analysis method: the image data can be transformed into different sub-bands according to the defined wavelet [17]. The Haar wavelet is a well-known method for analyzing frequency information from sub-bands [27].

In this paper, three methods are proposed to improve the performance of CT using edge information. The first method is named census transform with Haar wavelet (CTHW) and uses edge information extracted by a wavelet. Since the edge information provides more accurate object information, the high-passed data is used to modify the disparity. The second method is called adaptive window census transform (AWCT). AWCT determines whether the number of boundary pixels increases when the window is enlarged; the rate of increase of boundary pixels in the window is used to determine the suitable window size. Moreover, since sparse census transforms can enhance the performance of CT through designed mapping positions [28], the third method applies the sparse census transform to AWCT. AWCT and the adaptive window sparse census transform (AWSCT) are applied to avoid using an oversized window and to improve the performance.

#### **2. Related Methods**

#### *2.1. The CT Algorithm*

CT converts the grayscale intensity of each pixel into a description of its intensity relationship with its neighboring pixels. This relationship can be treated as a feature of the pixel and used to find the two most similar features in the left and right images using the Hamming distance. The positions of the most similar points can then be used to compute the disparity by Equation (1).

CT encodes an ordering over a local neighborhood (the conversion window) by comparing the center pixel with each of its neighborhood pixels. The relationship between the center point (denoted as *p*) and a neighbor point (denoted as *p*') is described by the conversion function [16,29]:

$$
\xi(p, p') = \begin{cases} 1, & \text{if } I(p) > I(p') \\ 0, & \text{otherwise} \end{cases} \tag{2}
$$

where *I*() is the intensity of the pixel. A conversion window is defined to select neighbor pixels. In Equation (2), *p* is located at the center of the window, and the other pixels located in this window are selected as *p*' in turn. Usually, the shape of the conversion window is square, and its size is user-defined. The CT at the pixel *p* can be written as

$$C(p_{xy}) = \underset{p_{ij} \in w}{\otimes} \xi(I(p_{xy}), I(p_{ij})) \tag{3}$$

where ⊗ is the concatenation operator and *w* is the conversion window. In stereo vision with CT, the two images (such as *IL* and *IR* in Figure 1) are transformed by CT, and the Hamming distance is applied to obtain the disparity between the two transformed images. The disparity can be determined by using winner-takes-all to find the minimum value among all possible disparity values [30].
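The transform of Equations (2) and (3) and the Hamming-distance comparison can be sketched as follows. This is a minimal NumPy sketch under our own naming; image borders are simply wrapped by `np.roll` for brevity, whereas a practical implementation would pad or exclude the border.

```python
import numpy as np

def census_transform(img, win=3):
    """Census transform: for each pixel, concatenate the bits
    xi(p, p') = 1 if I(p) > I(p') over a win x win window (Eqs. (2)-(3))."""
    out = np.zeros(img.shape, dtype=np.int64)
    r = win // 2
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue  # the center pixel is not compared with itself
            # shifted[y, x] == img[y + dy, x + dx] (borders wrap here)
            shifted = np.roll(np.roll(img, -dy, axis=0), -dx, axis=1)
            out = (out << 1) | (img > shifted).astype(np.int64)
    return out

def hamming(a, b):
    """Hamming distance between two census bit strings."""
    return bin(int(a) ^ int(b)).count("1")

img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]], dtype=np.uint8)
# Center pixel 5 beats neighbors 1,2,3,4 and loses to 6,7,8,9 -> 0b11110000
print(census_transform(img)[1, 1])  # 240
```

Matching then slides a window over each scanline of the right image, accumulating Hamming distances against the left image's census codes, and takes the disparity with the minimum cost (winner-takes-all).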

#### *2.2. The Haar Wavelet*

The wavelet transform decomposes a signal into small waves and performs signal processing and analysis at multiple resolutions. It is widely used in compression, transmission, and image analysis [31]. The Haar wavelet was proposed by Alfréd Haar in 1909 [27] and is the earliest and simplest type of wavelet transform [32]. Given two pixels (*p*1 and *p*2) in the image, the Haar wavelet produces the low band and high band by

$$\text{Low band} = (p_1 + p_2) / \sqrt{2} \tag{4}$$

and

$$\text{High band} = (p_1 - p_2) / \sqrt{2} \tag{5}$$

respectively. In practice, the Haar wavelet can be described as a transformation matrix [32]:

$$
tau = \frac{\sqrt{2}}{2} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \tag{6}
$$

In two dimensions, the image is first high-passed and low-passed in the *x*-direction with down-sampling, and each result is then high-passed and low-passed in the *y*-direction with down-sampling. Finally, we obtain four sub-bands: LL (horizontal low-band and vertical low-band), LH (horizontal low-band and vertical high-band), HL (horizontal high-band and vertical low-band), and HH (horizontal high-band and vertical high-band).
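One level of this 2-D Haar decomposition can be sketched as follows. This is a NumPy sketch under our own naming; even image dimensions are assumed, and the sub-band labels follow the LL/LH/HL/HH convention described above.

```python
import numpy as np

def haar2d(img):
    """One level of the 2-D Haar wavelet (Eqs. (4)-(5)) with down-sampling.
    Returns the four sub-bands LL, LH, HL, HH."""
    img = np.asarray(img, dtype=float)
    # Horizontal pass: pair adjacent columns -> low/high bands along x.
    lo = (img[:, 0::2] + img[:, 1::2]) / np.sqrt(2)
    hi = (img[:, 0::2] - img[:, 1::2]) / np.sqrt(2)
    # Vertical pass on each result: pair adjacent rows.
    ll = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2)  # low x, low y
    lh = (lo[0::2, :] - lo[1::2, :]) / np.sqrt(2)  # low x, high y
    hl = (hi[0::2, :] + hi[1::2, :]) / np.sqrt(2)  # high x, low y
    hh = (hi[0::2, :] - hi[1::2, :]) / np.sqrt(2)  # high x, high y
    return ll, lh, hl, hh

# A flat image has all its energy in LL; the three high bands are zero.
ll, lh, hl, hh = haar2d(np.ones((4, 4)))
print(ll[0, 0], lh[0, 0], hl[0, 0], hh[0, 0])  # 2.0 0.0 0.0 0.0
```

Each sub-band is half the width and half the height of the input, which is exactly the multiresolution property exploited later to reduce the computation of CT.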

#### *2.3. Edge Detection*

Boundary information can be extracted by edge detection and regarded as the result of high-pass filtering. The result of edge detection highlights the areas of the image where the pixel intensity changes significantly. In this study, the boundary information is used to classify the complexity in the vicinity of a pixel. If the grayscale intensity changes significantly, a smaller conversion window can be used; otherwise, if the grayscale intensity near the pixel does not change significantly, a larger conversion window must be used. In this paper, the Canny edge detection method is used for boundary detection [33,34]. First, the noise is filtered by a two-dimensional Gaussian filter. The Gaussian function can be described as

$$G(x,y) = e^{-\frac{x^2 + y^2}{2\sigma^2}}\tag{7}$$

where σ is the standard deviation of the Gaussian function, which can be regarded as a smoothing factor in the filtering. Convolving the Gaussian function with the image yields the amount of change in the *x* and *y* directions, denoted *gx* and *gy*. The gradient of the pixel values in the image can then be expressed as the gradient magnitude and the gradient direction, by

$$I_{gm}(x, y) = \sqrt{g_x^2 + g_y^2} \tag{8}$$

and

$$I_{g\theta}(x, y) = \tan^{-1} \left( \frac{g_y}{g_x} \right) \tag{9}$$

respectively. Since an edge corresponds to a large gradient, boundary points are detected as local maxima of the gradient magnitude along the gradient direction. Each pixel is compared with its neighbors along the gradient direction: if its magnitude is larger than that of those neighbors, the pixel is regarded as a boundary point; otherwise, it is a non-boundary point.
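The gradient magnitude and direction of Equations (8) and (9) can be sketched as follows. This is a NumPy sketch: `np.gradient` stands in for the convolution-based computation of *gx* and *gy* described above, and `arctan2` replaces tan⁻¹ so that the quadrant of the direction is preserved.

```python
import numpy as np

def gradient_mag_dir(img):
    """Gradient magnitude (Eq. (8)) and direction (Eq. (9)) of a
    grayscale image, using finite differences for g_x and g_y."""
    img = np.asarray(img, dtype=float)
    # np.gradient returns derivatives along axis 0 (rows) then axis 1 (cols).
    gy, gx = np.gradient(img)
    magnitude = np.sqrt(gx**2 + gy**2)
    direction = np.arctan2(gy, gx)  # keeps the sign/quadrant, unlike tan^-1
    return magnitude, direction

# A horizontal ramp: intensity grows by 1 per column, constant per row,
# so the gradient points along +x with magnitude 1 everywhere.
img = np.tile(np.arange(5.0), (5, 1))
mag, direction = gradient_mag_dir(img)
print(mag[2, 2], direction[2, 2])  # 1.0 0.0
```

In the proposed methods, this magnitude-and-direction pair feeds the non-maximum suppression step of Canny: a pixel survives only if its magnitude is a local maximum along its own gradient direction.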

### **3. Proposed Methods**
