**1. Introduction**

Image fusion can often analyze and extract the complementary information of multi-sensor data, and compose a robust or informative image, which can provide more complex and detailed target scene representation [1–3]. Since the fusion process of multi-sensor images is in accordance with human visual perception, it has an important research significance for object detection and target recognition in the areas of remote sensing image, medical imaging analysis, military target detection and video surveillance [4–6]. E.g., face recognition is one of the major applications of multi-sensor data, such as visible and thermal infrared images. Visible images can provide face texture details with high spatial resolution, while thermal infrared images are not likely to be interfered with by illumination variation or face disguise with high thermal contrast. Therefore, it is beneficial for face recognition to fuse the multi-sensor data, which can combine the advantages of texture detail and thermal radiation information in the multi-spectral face images.

However, image registration is a prerequisite for the success of multi-spectral image fusion, which is an essential and challenging step in the process of image fusion research [7]. Currently, multi-spectral images registration can generally be implemented in two ways: Hardware-based registration and software-based registration. In order to ensure that the multi-spectral images are strictly geometrically aligned, hardware-based registration can be realized based on the coaxial catadioptric optical system through the beam splitter [8]. But the sophisticated and cumbersome imaging sensor equipment for generating co-registered image pairs may not be practical in many face recognition scenarios, due to its high cost and low availability. In contrast, software-based registration can capture dual band images simultaneously and independently by utilizing different off-the-shelf low cost wavebands sensors. Since no additional hardware is needed, it may be more appropriate for practical detection and recognition scenarios compared to the hardware-based registration [9]. In general, software-based registration can be classified into two categories: Area-based and feature-based approaches [10]. Area-based methods attempt to deal directly with the image intensity values to detect, for instance, cross-correlation [11], Fourier transform and mutual information [12,13]. On the contrary, feature-based methods attempt to indirectly extract salient structures or features in the images, such as points of high curvature, corners, strong edges, intersections of lines, structural contours and silhouettes within the images [14–16]. For visible and thermal infrared image pairs usually involve quite different intensity values, and the area-based methods are good usually only on a selected small region of the images. Instead, salient structures with a strong edge could often represent a significant common feature of the heterogeneous images. In view of this, this paper focuses on the feature-based registration methods for the visible and thermal infrared face images.

Feature-based registration methods first extract different feature point sets of salient structures in the multi-spectral images, and then the registration problem can be simplified to determine the correct correspondence and the inherent spatial transformation between two point sets of extracted features. Unlike the face recognition problem [17,18], the multi-sensor face registration problem has not enough common features available to build a convolutional neural network (CNNs) framework to predict the classification model; a popular strategy is to regard the alignment of two point sets as probability density estimation, e.g., the Gaussian Mixture Model (GMM) [19–21] and Student's t Mixture Model (SMM) [22–24]. Because GMM formulations are intensively used, and the heavily tailed SMM is more robust and precise against noise and outliers than the classical GMM, it is a potential research direction to take advantage of SMM for image registration methods [25–27]. In general, both GMM and SMM probabilistic methods are intended to exploit global relationships in the point sets. However, the neighborhood structures among the feature points also contribute to the alignment of two point sets. Thus, another interesting point matching strategy would be aimed first at using local neighborhood structures as the feature descriptor to recover the feature point correspondences, e.g., Shape Context (SC) [28] and Inner-Distance Shape Context (IDSC) [29]. But for multi-spectral face images registration, the local neighborhood structures of feature points are not discriminative enough, because it will inevitably include a number of mismatching points. To address this issue, in this paper we modify the SMM probabilistic framework and take full advantage of the global and local structures during the multi-spectral face images registration process.

Our contribution in this paper includes the following three aspects: Firstly, in order to keep the global and local structures of feature point sets in the registration process, we use IDSC as the local shape descriptors of feature point sets, and treat the confidence of feature matching as the mixing proportion of the finite mixture model, which aims to describe the local shape and global geometrical relationship of feature point sets in a new SMM probabilistic framework. Secondly, in order to make the multi-spectral fusion image hold more apparent details of a visible image and the thermal radiation information of an infrared image, we propose a guided filtering and gradient preserving image fusion strategy for multi-spectral face images fusion. It can improve the robustness of face recognition techniques in varying illumination conditions. Thirdly, we construct a simple multi-spectral imaging system in order to acquire higher resolution multi-spectral face images from visible and thermal infrared cameras simultaneously.

#### **2. The Proposed Registration Method for Visible and Infrared Face Images**

For the software-based registration problem, a basic process of infrared and visible images fusion mainly includes the following procedures: Feature extraction and description, feature point sets registration, image transformation and interpolation and finally image fusion [1,30]. As is shown in Figure 1, the first step is to extract robust common salient features that might be preserved in the multi-spectral face images. Considering the fact that visible and infrared images are characterizations of two different modalities, we use edge maps to represent the common salient features of multi-spectral face images. After obtaining the salient features of multi-spectral images, the edge maps can usually be discretized as two feature point sets. Thus, the registration procedure is converted into determining the correct correspondence and estimating the spatial transformation between the feature point sets for multi-spectral face images. Furthermore, the other visible image points can be interpolated using a smoothing Thin-Plate Spline (TPS) [31] based on the pixel coordinates of the thermal infrared image. Finally, the multi-spectral image fusion process takes place in the output image, and the fused image can preserve both the apparent details of the visible image and the thermal radiation information of the infrared image, which will greatly improve the accuracy and efficiency of detection and recognition. In this paper, our attention mainly focuses on the following two aspects: Feature point sets registration and multi-spectral face images fusion.

**Figure 1.** Visible and infrared face images fusion basic process for software-based registration.

#### *2.1. Inner-Distance Shape Context for 2D Feature Descriptors*

In order to establish the initial correspondences between the two feature point sets of visible and infrared face images, we first use IDSC as the local feature descriptors, and solve the point sets matching problem through a bipartite graph method [32]. Compared to the SC feature matching method, the IDSC feature matching method uses inner-distance to replace the Euclidean distance, which is robust to articulation shape and part structure. For example, Figure 2 shows the schematic illustration of face silhouette feature descriptors based on IDSC, in the IDSC feature histograms from Figure 2c, the horizontal axis *n*<sup>θ</sup> = 12 denotes the numbers of orientation bins and the vertical axis *nd* = 5 denotes the numbers of logarithm distance bins.

**Figure 2.** Schematic illustration of face silhouette feature descriptors based on Inner-Distance Shape Context (IDSC). (**a**) Bellman-Ford shortest path graph built using face silhouette landmark points; (**b**,**c**) Four marked points and their corresponding IDSC feature histograms.

As can be seen from Figure 2c, the IDSC feature histograms for all marked points are quite different, and the excellent discriminative performance demonstrates the inner-distance's ability to capture local structures in the point sets. Consider a point *pi* in the point set *X* and a point *qj* in the point set *Y*. Let *Ci*,*<sup>j</sup>* denote the cost of matching *pi* to *qj*, *hX*,*pi* (*k*) and *hY*,*qj* (*k*) denote the normalized histogram of the relative coordinates of the remaining points, respectively, and *K* be the number of histogram bins. As the IDSC is represented based on the histogram, it can be measured by the χ<sup>2</sup> test statistic:

$$\mathcal{C}\_{i,j} = \mathcal{C}(p\_{i\prime}q\_{j}) = \frac{1}{2} \sum\_{k=1}^{K} \frac{\left[h\_{\mathbf{X},p\_{i}}(k) - h\_{\mathbf{Y},q\_{j}}(k)\right]^2}{h\_{\mathbf{X},p\_{i}}(k) + h\_{\mathbf{Y},q\_{j}}(k)},\tag{1}$$

In general, the smaller the matching cost *Ci*,*<sup>j</sup>* is, the more similar the local appearance at points *pi* and *qj* are. Once the cost matrix *C* for all correspondence points in the point set *X* and *Y* is obtained, the correspondences **Ω** between the two point sets are an instance of an assignment problem, which can be solved by the Hungarian method.

#### *2.2. Student's t Mixtures for Feature Point Set Registration*

In order to fully exploit the global structures in the feature point sets, we treat the alignment of two point sets as a probability density estimation of an SMM distribution with the IDSC local feature descriptors. Compared to the GMM distribution, the main advantage of SMM is that Student's t distribution has a heavier tail than the exponentially decaying tail of Gaussian distribution. As is shown in the Figure 3, each component of the SMM has an additional parameter υ called the degrees of freedom, which can change the shape of a Student's t distribution curve, and hence SMM provides a more robust model than the classical GMM.

**Figure 3.** The Student's t distribution for various degrees of freedom.

Given a point set *XN*×*<sup>D</sup>* = (*x*1, ... , *xN*) T, which can be considered as an observation datum, the other point set *YM*×*<sup>D</sup>* = (*y*1, ... , *yM*) <sup>T</sup> is treated as the SMM centroids, where *N* and *M* represent the number of points in *X* and *Y*, respectively, and *D* represents the dimension of each point. Then the goal of registration is to align the centroids point set *Y* to the observation data point set *X*. Typically, the point sets contain noise and outliers, which can be supposed to be a uniform distribution 1/*N*, and the weight of the uniform distribution is denoted as η ∈ [0, 1]. Let *wij* ∈ [0, 1] denote a mixing proportion corresponding to the *i* th component of the SMM for point *xj* ( *M i*=1 *wij*= 1), then the mixture probability density function can take the form:

$$f(\mathbf{x}|\boldsymbol{\upmu}) = \eta \frac{1}{N} + (1 - \eta) \sum\_{i=1}^{M} \alpha\_{i\boldsymbol{j}} f(x\_{\boldsymbol{j}} | y\_{i\boldsymbol{\upnu}} \boldsymbol{\upnu}\_{i\boldsymbol{\upnu}} \boldsymbol{\upnu}\_{i}),\tag{2}$$

where ψ = (ψ1, ψ2, ... , ψ*M*) is mixture parameter set with ψ*<sup>i</sup>* = (*wij*, *yi*, *<sup>i</sup>*, υ*i*), and *f* (*xj*|*yi*, *<sup>i</sup>*, υ*i*) is the Student's t distribution probability density function for the *i* th component of SMM; we express that:

$$f(\mathbf{x}\_{j}|y\_{i}, \Sigma\_{i}, \upsilon\_{i}) = \frac{\Gamma\left(\frac{\upsilon\_{i} + D}{2}\right) |\Sigma\_{i}|^{-\frac{1}{2}}}{(\pi \upsilon\_{i})^{\frac{1}{2}D} \Gamma\left(\frac{\upsilon\_{i}}{2}\right) \left[1 + \frac{d\left(x\_{j}, y\_{i}; \Sigma\_{i}\right)}{\upsilon\_{i}}\right]^{\frac{\upsilon\_{i} + D}{2}}} \tag{3}$$

where *d*(*xj*, *yi*; *<sup>i</sup>*) is the Mahalanobis squared distance between observation point *xj* and centroid point *yi*, while *<sup>i</sup>*, Γ and υ represent the covariance matrix, Gamma function and degrees of freedom, respectively.

### *2.3. Multi-Spectral Face Registration Using Global and Local Structures*

In order to estimate the mixture parameter ψ in the mixture probability density function (2), according to [22,23], we first rewrite the function (2) as a complete data logarithm likelihood function ln *LC*(ψ) <sup>=</sup> *<sup>N</sup> j*=1 ln *M i*=1 *wijf xj yi*, <sup>Σ</sup>*i*, <sup>υ</sup>*<sup>i</sup>* , and then use the Expectation Maximization (EM) to train the logarithm likelihood function. In general, the EM algorithm can be divided into the following two steps: Expectation step (E-step) and Maximization step (M-step), and the iterative process proceeds by alternating between the E- and M-steps until convergence.

E-step: We first use the current parameter ψ(*k*) to calculate the posterior probability τ*ij*(*k*), and according to the Bayesian Theorem, it can be expressed in terms of the observation data *xj* belonging to the *i* th component of the SMM

$$\pi\_{ij}^{(k)} = \frac{w\_{ij}^{(k)} f(x\_j | y\_i^{(k)}, \Sigma\_i^{(k)}, \nu\_i^{(k)})}{f(x\_j | \mathfrak{p}^{(k)})},\tag{4}$$

Next we need to calculate the *k*th iteration conditional expectation of the logarithm likelihood function ln *LC*(ψ), which can be calculated as follows:

$$\mathcal{Q}(\psi|\psi^{(k)}) = \mathcal{Q}\_1(w|\psi^{(k)}) + \mathcal{Q}\_2(\sigma|\psi^{(k)}) + \mathcal{Q}\_3(y, \Sigma|\psi^{(k)}),\tag{5}$$

where *Q*1, *Q*<sup>2</sup> and *Q*<sup>3</sup> are respectively written as follows:

$$\begin{split} \mathbb{Q}\_{1}\{\mathbf{w}|\boldsymbol{\Psi}^{(k)}\} &= \sum\_{j=1}^{N} \sum\_{i=1}^{M} \pi\_{ij}^{(k)} \ln w\_{ij} \\ \mathbb{Q}\_{2}\{\mathbf{w}|\boldsymbol{\Psi}^{(k)}\} &= \sum\_{j=1}^{N} \sum\_{i=1}^{M} \pi\_{ij}^{(k)} \Big{(} -\ln \Gamma\left(\frac{v\_{i}}{2}\right) + \frac{v\_{i}}{2} \ln\left(\frac{v\_{i}}{2}\right) + \frac{v\_{i}}{2} \Bigg{\Big{(}} \sum\_{j=1}^{N} \left(\ln u\_{ij}^{-(k)} - u\_{ij}^{-(k)}\right) + \gamma \Big{(}\frac{v\_{i}^{(k)} + D}{2}\Big{)} - \ln\left(\frac{v\_{i}^{(k)} + D}{2}\right) \Bigg{)} \\ \mathbb{Q}\_{3}\{\boldsymbol{y},\boldsymbol{\Sigma}\_{i}|\boldsymbol{\Psi}^{(k)}\} &= \sum\_{j=1}^{N} \sum\_{i=1}^{M} \pi\_{ij}^{(k)} \Big{(} -\frac{D}{2}\ln(2\pi) - \frac{1}{2}\ln|\boldsymbol{\Sigma}\_{i}| + \frac{D}{2}\ln u\_{ij}^{(k)} - \frac{u\_{ij}^{(k)}|\boldsymbol{x}\_{j} - \boldsymbol{y}\_{i}|^{2}}{2\Sigma\_{i}} \end{split}$$

with the gamma in *Q*<sup>2</sup> that denotes the Digamma function, and the *uij* (*k*) in *Q*<sup>3</sup> that represents the conditional expectation about additional missing data in the complete-data; it can be written as:

$$\mu\_{i\dot{j}}{}^{(k)} = \frac{\nu\_{i}{}^{(k)} + D}{\nu\_{i}{}^{(k)} + D \left(x\_{j}, y\_{i}{}^{(k)}; \Sigma\_{i}{}^{(k)}\right)} {}' \tag{6}$$

In addition, we use the displacement function ρ to denote the non-rigid transformation in the point set: *X* = *T*(*Y*, ρ) = *Y* + ρ(*Y*), and add the regularization term ϕ(ρ) to the *Q*3, which can enforce the smoothness of the displacement function in the alignment of two point sets. So this *Q*<sup>3</sup> term can be rewritten as:

$$\widetilde{\mathcal{Q}}\_3 = \mathcal{Q}\_3(\rho, \Sigma\_i^{(k+1)} | \psi^{(k)}) + \frac{\lambda}{2} q \varphi(\rho), \tag{7}$$

where λ is the regularization parameter and controlling the trade-off of two terms in Equation (7). By reproducing Kernel Hilbert Space and Fourier transformation, the displacement function ρ can be denoted as:

$$\rho(y\_j) = \sum\_{i=1}^{M} h\_i \mathcal{G}(y\_i, y\_j)\_{\prime} \mathcal{G}(y\_{i\prime} y\_j) = e^{-\frac{1}{2} \left\| \frac{y\_i - y\_j}{\overline{\beta}} \right\|^2},\tag{8}$$

where *G* is a Gaussian kernel matrix, β determines the width of the smoothing Gaussian filter, and the Equation (8) can be conveniently denoted in matrix form as: ρ*(Y)* = *GH*, where *HM*×*<sup>D</sup>* = (*h*1, ... , *hM*) is a matrix of coefficients. Thus, the Equation (7) can be expanded in the following matrix form:

$$\widetilde{\mathbf{Q}}\_3 = \sum\_{j=1}^N \sum\_{i=1}^M \tau\_{ij}^{(k)} \left( -\frac{D}{2} \ln(2\pi) - \frac{1}{2} \ln|\Sigma\_i| + \frac{D}{2} \ln u\_{ij}^{(k)} - \frac{\mu\_{ij}^{(k)} \|\mathbf{x}\_j - y\_i\|^2}{2\Sigma\_i} \right) + \frac{\lambda}{2} \text{tr}\Big( \left( \mathbf{H}^T \right)^{(k)} \mathbf{G} \mathbf{H}^{(k)} \Big), \tag{9}$$

M-step: On the M-step at the (*k* + 1)th iteration of the EM algorithm, considering that *Q*1, *Q*<sup>2</sup> and *Q*<sup>3</sup> in Equation (5) can be computed independently of each other, thus the maximization of the objective function *Q*1, *Q*<sup>2</sup> and *Q*<sup>3</sup> with respect to the parameters *w*, υ, , ρ can be operated separately. The mixing proportion for SMM is updated by our consideration of the first term *Q*1. Given the feature descriptors in Section 2.1, it can obtain the coarse correspondences **Ω** between the point sets *X* and *Y*. For an observation data *xj*, we define ι (0 ≤ ι ≤1) as a confidence by IDSC feature matching, and then the mixing proportion *wij* by incorporating the local structures among neighboring points can be updated by the following rule:

$$w\_{ij}^{\cdot,(k+1)} = \begin{cases} \sum\_{j=1}^{N} \frac{\pi\_{ij}(k)}{N}, & \exists \big(x\_j \to y\_i\big) \in \Omega\\ \iota\_{\prime} & \exists \big(x\_j \to y\_i\big) \in \Omega \end{cases},\tag{10}$$

If observation data *xj* does not have a corresponding point *yi* in the label **Ω**, the mixing proportion *wij* (*k*+1) is given by the average of the posterior probabilities τ*ij*(*k*) of the SMM component membership. If the observation datum *xj* corresponds to the centroid point *yi* in the label **Ω**, the *wij* (*k*+1) is given by a constant confidence ι.

*Appl. Sci.* **2019**, *9*, 4623

Note that in Equation (10) the mixing proportion *wij* does not only depend on prior assignment by local structures, but also depends upon the posterior probability of observation data *xj* belonging to the *i* th component of the SMM by global structures. In addition, we obtain degree of freedom υ by taking the corresponding derivative of *Q*<sup>2</sup> to zero, and the updated value υ(k+1) is a solution of the following equation:

$$1 - \mathfrak{p}\left(\frac{\upsilon\_i^{(k+1)}}{2}\right) + \ln\left(\frac{\upsilon\_i^{(k+1)}}{2}\right) + \frac{\sum\_{j=1}^{N} \tau\_{ij}^{(k)} \left(\ln u\_{ij}^{-(k)} - u\_{ij}^{-(k)}\right)}{\sum\_{j=1}^{N} \tau\_{ij}^{(k)}} + \mathfrak{p}\left(\frac{\upsilon\_i^{(k)} + D}{2}\right) - \ln\left(\frac{\upsilon\_i^{(k)} + D}{2}\right) = 0,\tag{11}$$

Then, to update the estimates of covariance matrices and displacement function ρ in Equation (7), we also need to take the derivative of *Q*3, and the update values can be denoted as:

$$\begin{aligned} \mathbf{H}^{(k+1)} &= \left( \operatorname{diag} \{ \mathbf{P}^{(k)} \mathbf{1} \} \mathbf{G} + \lambda \boldsymbol{\Sigma}\_{i}^{(k)} \mathbf{I} \right)^{-1} \left( \mathbf{P}^{(k)} \mathbf{X} - \operatorname{diag} \{ \mathbf{P}^{(k)} \mathbf{1} \} \mathbf{Y} \right), \mathbf{P}\_{ij}^{-} (k) = \tau\_{ij}^{-(k)} u\_{ij}^{(k)} \\ \boldsymbol{\Sigma}\_{i}^{-(k+1)} &= \frac{\sum\_{j=1}^{N} \tau\_{ij}^{-(k)} u\_{ij}^{(k)} \left\| \mathbf{x}\_{j} - \mathbf{T}(y\_{i}, \boldsymbol{\rho}) \right\|^{2}}{\sum\_{j=1}^{N} \tau\_{ij}^{-(k+1)}}, (i = 1, \dots, N) \end{aligned} \tag{12}$$

where **1** is a column vector of all ones, and *I* is an identity matrix, *diag*(•) denotes a diagonal matrix and *G*(*i*,•) denotes the column vector in the kernel matrix *G*. Furthermore, for our centroid point *yi* in the point set *Y*, the non-rigid transformation *T*(*Y*, ρ) = *Y* + ρ(*Y*) can be expressed by Equation (8) as follows:

$$T(y\_i, \rho) = y\_i + G(i, \cdot) H\_i^{(k+1)},\tag{13}$$

At the end, the iterative process alternates between E-steps and M-steps until satisfying the following convergence condition:

$$||\frac{L\_{\mathcal{C}}\left(\boldsymbol{\Psi}^{(k+1)}\right) - L\_{\mathcal{C}}\left(\boldsymbol{\Psi}^{(k)}\right)}{L\_{\mathcal{C}}\left(\boldsymbol{\Psi}^{(k+1)}\right)}||\leq\varepsilon,\tag{14}$$

where ε is a convergence threshold. After a number of repeatability testing, we use the following setting for the parameters η = 0.1, ι = 0.8, λ = 3, β = 2, ε = 10−<sup>5</sup> in our experiment. Furthermore, to register the visible and infrared images accordingly, the image transformation and interpolation is performed on the visible image. Since our multi-spectral face registration method is implemented by using global and local structures, then the registration method is to be named as Face Registration using the Global and Local Structure (FR-GLS) in the rest of this paper.

#### **3. The Proposed Fusion Strategies for Visible and Infrared Face Images**

After obtaining the accurate alignment of the visible and infrared images, the second problem is then to solve the multi-spectral images fusion. In order to make sure that the fused image can preserve both the apparent details of the visible image and thermal radiation information of the infrared image, we propose the fusion strategies based on Guided Filtering and Gradient preserving (GF-GP). Firstly, we use two-scale image decomposition and reconstruction with a guided filtering algorithm to get an initial fusion image, and then the output fusion image is optimized by combining the intensity distribution of the infrared image and the intensity variation of the visible image. According to [33], since each source image can be separated into a base layer containing the large-scale details, and the detail layer containing the small-scale details, the two-scale image fusion *FG* is given by the guided filtering algorithm as follows:

$$F\_{\mathbb{G}} = \overline{\mathcal{B}} + \overline{\mathcal{D}},\tag{15}$$

where the fused base layer *B* and the fused detail layer *D* can be denoted as a weighted average cost of the base *B<sup>n</sup>* and detail *B<sup>n</sup>* layers of different source images:

$$
\overline{B} = \sum\_{n=1}^{N} \mathsf{W}\_n^B \mathsf{B}\_{n\prime} \ \overline{D} = \sum\_{n=1}^{N} \mathsf{W}\_n^D \mathsf{D}\_{n\prime} \tag{16}
$$

where *W<sup>B</sup> <sup>n</sup>* and *W<sup>D</sup> <sup>n</sup>* represent the refined weight maps of the base *B<sup>n</sup>* and detail layers *D<sup>n</sup>* via guided image filtering respectively. In order to optimize the initial fusion image *FG*, here the visible, infrared and fused images are defined by *V*, *I* and *F*, respectively. Given that the thermal radiation information in the infrared image is typically characterized by the pixel intensity values, the pixel intensity values are quite different between the target and background. Thus, we constrain the fused image *F* to have the similar pixel intensity distribution with the infrared source image *I*, and the optimization problem can be formulated as:

$$
\delta\_1 = \|\mathbf{F} - \mathbf{I}\|\_\prime \tag{17}
$$

where • denotes the *l*<sup>1</sup> norm. Besides, given that the pixel intensity distribution in the same physical location might be discrepant for infrared and visible face images, they are manifestations of two different phenomena. Note that the detailed information of object edge and texture is mainly characterized by the pixel gradient values in the visible image. Hence, we constrain the fused image *F* to have the similar pixel gradient values with the visible source image *V*, the optimization problem can also be formulated as:

$$
\delta\_2 = \|\nabla \mathbf{F} - \nabla \mathbf{V}\|\_{\ast} \tag{18}
$$

where ∇ denotes the gradient operator. By combining the Equations (17) and (18), the optimization problem of the fusion image *F* can be formulated as minimizing the following objective function:

$$
\delta\_{\mathbf{F}} = \delta\_1 + a\delta\_2 = \|\mathbf{F} - \mathbf{I}\| + a\|\nabla \mathbf{F} - \nabla \mathbf{V}\|\_{\star} \tag{19}
$$

where α is a weighting factor, the optimal *F* could be found by using gradient descent strategy. Furthermore, it is important to note that if the α is small, the fusion image preserves the more thermal radiation information of the infrared image, otherwise, the fusion image preserves more edge and texture information of the visible image. We set the weighting coefficient α = 5 in our experiment, because it can achieve good subjective visual quality in most cases.

#### **4. Experimental Results**

To demonstrate the robustness and efficiency of our proposed registration method in solving the multi-spectral face image registration problem, subjective and objective evaluation experiments are conducted with the UTK-IRIS standard multi-spectral face database [34] and self-constructed multispectral face data sets. The experiments are performed on a desktop with 3.3 GHz Intel Core™ i5-4590 CPU, 8 GB memory and MATLAB Code.

### *4.1. Registration on Real Face Images with the UTK-IRIS Database*

In this section, the performance of face image registration with our FR-GLS method is evaluated by using the public UTK-IRIS Database. The images in the database have a spatial resolution of 320 × 240 pixels and the database contains individuals with various poses, facial expressions and illumination variations. We first give an intuitive impression of registration and fusion results, and then we provide a quantitative comparison with three typical methods such as Coherent Point Drift (CPD) [19], Regularized Gaussian Fields (RGF) [20] and SMM [25], based on the specified landmarks of visible and thermal infrared image pairs. The reasons for choosing these three comparison methods rest on the following two considerations. On the one hand, CPD and SMM use both global structures based on finite mixture models to parameterize the transformations. On the other hand, besides the global

relationships in the feature point sets, RGF uses local features descriptor such as SC to recover the accurate transformation.

However, our FR-GLS has two major advantages compared to RGF: (i) Unlike the local features descriptor used in RGF, we use a more robust local features descriptor such as IDSC to initialize the correspondences between the two shape features; (ii) In order to preserve the global and local structures of feature point sets during the registration process, a heavy tail distribution in our FR-GLS is used to replace the traditional Gaussian distribution in the RGF method. Furthermore, we use the IDSC descriptor to assign the mixing proportion *wij* of the SMM.

#### 4.1.1. Qualitative Evaluation

In order to obtain a qualitative evaluation of the registration results issued by the software-based registration process in Figure 1, several typical frame pairs of four individuals involving different poses and illumination changes are given in Figure 4a–d. Specifically, the original visible images and thermal infrared images are shown in the first and second row, respectively. Then the middle row shows the edge maps extracted by the canny edge detector. We use the red point set to denote the edge map of the thermal infrared images, and the blue point set corresponds to the edge map of the visible images. Subsequently, the visible image is aligned to the thermal infrared image by our FR-GLS method, and the registration performance is demonstrated by the checkerboard in the fourth row, where the aligned visible image is superimposed onto the corresponding thermal infrared image. We can see that the seams between two grids in the checkerboard are natural, and this demonstrates the feasibility and effectiveness of the proposed registration method qualitatively. Finally, the last row represents the fusion results of the visible and thermal IR images by our GF-GP method. We can see that the fused images in the last row contain both the apparent details of the visible image and the thermal radiation information of the infrared image, which can be clearly seen from noses, cheeks, ears, mouths and glasses. Therefore, it will be beneficial for the performance of follow-up face detection and recognition under the case of illumination changes or disguise.

#### 4.1.2. Quantitative Evaluation

In order to give a quantitative comparison of our FR-GLS and other typical methods, such as CPD [19], RGF [20] and SMM [25], we further utilize several visible and infrared image pairs of the four individuals in Figure 4, and manually selected several landmarks and their correspondences (about 20 pairs) in each visible and infrared image pairs as ground truth. For example, the landmarks involve the salient features like the cheeks, eyes, nose, mouth, eyebrows, ears and glasses, which are highly distinguished. Referring to the metric used in [20,21,35], the recall rate is proposed to be a qualitative indicator for evaluating the registration results on all landmarks set pairs of an individual. Here the recall rate, or true positive rate, is defined as the proportion of the true positive correspondences between the landmark pairs to the ground truth correspondences, and a true positive correspondence is counted when the pair falls within a given accuracy threshold of Euclidean distance. Here the Euclidean distance can be defined as 2-norm between a landmark in the aligned feature point set of visible images and the corresponding landmark in the feature point set of infrared images. The recall curves plot the recall rates under different threshold values.

Figure 5 shows the recall rates against different threshold values using the registration methods of CPD, SMM, RGF and FR-GLS, considering four individuals. Each individual contains about 100 hundred-image pairs corresponding to every spectrum. By the definition of recall, we can see that the FR-GLS and RGF methods for the global and local structures are significantly superior to the CPD and SMM methods for the global structures only based on finite mixture models. Moreover, the curves of our FR-GLS method are consistently above the curves of the RGF method for different accuracy thresholds and individuals, which could be attributable to the combination of the robust SMM and IDSC local feature descriptor.

**Figure 4.** Registration and fusion results of our method on four typical unregistered visible/infrared image pairs in the database of UTK-IRIS. (**a**) Charles; (**b**) Heo; (**c**) Meng; (**d**) Sharon.

**Figure 5.** Quantitative comparisons of multispectral face image pairs with four individuals in Figure 4. (**a**) Charles; (**b**) Heo; (**c**) Meng; (**d**) Sharon.

The average registration errors of each individual for different registration methods are summarized in Table 1. We again see that the FR-GLS and RGF methods with the global and local structures significantly outperform the CPD and SMM methods just with finite mixture models.

Moreover, our FR-GLS method can achieve consistently better performance compared to the RGF method. The average registration errors of this RGF method are about 2.40, 2.24, 2.24 and 1.92 pixels on the four individuals, respectively. By contrast, the average registration errors of our FR-GLS method are reduced to about 2.25, 2.06, 1.99 and 1.78 pixels, respectively. In addition, The SMM method with SMM has a slightly better performance than the CPD method with GMM, where the average registration errors are 2.88, 2.40, 2.56 and 2.13 pixels, respectively. Whereas, the average registration errors of the CPD method are up to 3.04, 2.58, 2.73 and 2.35 pixels, respectively. It can quantitatively verify the heavy-tailed property of SMM.


**Table 1.** The average registration errors comparison of CPD, SMM, RGF and our FR-GLS method on four typical individuals in Figure 4.
