A Light-Weight Self-Supervised Infrared Image Perception Enhancement Method

Xiao, Yifan; Zhang, Zhilong; Li, Zhouli

doi:10.3390/electronics13183695

Open AccessArticle

A Light-Weight Self-Supervised Infrared Image Perception Enhancement Method

by

Yifan Xiao

^1,2,

Zhilong Zhang

^1,* and

Zhouli Li

²

¹

National Laboratory of Automatic Target Recognition, National University of Defense Technology, Changsha 410073, China

²

College of Chemistry and Chemical Engineering, Xi’an Shiyou University, Xi’an 710065, China

^*

Author to whom correspondence should be addressed.

Electronics 2024, 13(18), 3695; https://doi.org/10.3390/electronics13183695

Submission received: 23 August 2024 / Revised: 6 September 2024 / Accepted: 8 September 2024 / Published: 18 September 2024

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

:

Convolutional Neural Networks (

C N N s

) have achieved remarkable results in the field of infrared image enhancement. However, the research on the visual perception mechanism and the objective evaluation indicators for enhanced infrared images is still not in-depth enough. To make the subjective and objective evaluation more consistent, this paper uses a perceptual metric to evaluate the enhancement effect of infrared images. The perceptual metric mimics the early conversion process of the human visual system and uses the normalized Laplacian pyramid distance (

N L P D

) between the enhanced image and the original scene radiance to evaluate the image enhancement effect. Based on this, this paper designs an infrared image-enhancement algorithm that is more conducive to human visual perception. The algorithm uses a lightweight Fully Convolutional Network (

F C N

), with

N L P D

as the similarity measure, and trains the network in a self-supervised manner by minimizing the

N L P D

between the enhanced image and the original scene radiance to achieve infrared image enhancement. The experimental results show that the infrared image enhancement method in this paper outperforms existing methods in terms of visual perception quality, and due to the use of a lightweight network, it is also the fastest enhancement method currently.

Keywords:

visual perception; normalized Laplacian pyramid; self-supervision; lightweight

1. Introduction

The realm of infrared image capture has attracted considerable interest in recent years, primarily fueled by its widespread applications in defense, surveillance, and various other sectors. Nonetheless, infrared images frequently encounter challenges such as low contrast and blurred details. Specifically, the low resolution of infrared images and the existence of low clouds can obstruct the detection of heat-emitting objects, limiting the efficacy of observing infrared targets and impeding the progress of infrared imaging applications. Therefore, achieving high-quality infrared images necessitates enhancement [1,2]. Enhancement techniques aim to improve image quality, enhance clarity, and boost visual impact, playing a crucial role in tasks such as object recognition [3], instance segmentation [4], tracking, and detection [5,6], among others. The existing methods for enhancing infrared images can be broadly classified into two groups: traditional methodologies and deep learning approaches.

Throughout the past few decades, classical methods such as histogram equalization [7,8,9] and gamma correction [10] have demonstrated efficacy in enhancing low-light images [11,12]. Moreover, a variety of classical techniques rooted in the Retinex theory [13] have been devised, integrating various prior regularization optimization models to separate the illumination and reflectance image layers’ structures [14,15,16]. However, these manually crafted constraints and priors lack adaptability, potentially resulting in outcomes marked by noticeable noise or influenced by excessive or insufficient enhancement, which could adversely impact human visual perception.

In recent years, propelled by rapid advancements in deep learning, an increasing number of researchers have harnessed this technology in the field of image enhancement [17]. These methods, based on distinct learning approaches, can be classified into supervised learning [18,19,20] and unsupervised learning [21,22]. However, the effectiveness of these deep learning techniques heavily relies on intricately designed architectures and meticulously curated training datasets. Consequently, while excelling in objective metrics, they often struggle to align subjective visual experiences with objective assessments in practical scenarios.

The essence of crafting a successful enhancement algorithm, whether through classical methodologies or deep learning approaches, lies in the capacity to extract and safeguard relevant information within the image, while simultaneously eliminating redundancy and noise, all while conforming to human perceptual standards. This foundational principle forms the core of our design philosophy. Historically, discrepancies have frequently arisen in the evaluation metrics utilized for image enhancement between subjective and objective assessments. While many techniques excel in objective evaluations, a discernible gap often persists when assessed against human perception subjectively. Therefore, bridging this disparity between subjective and objective evaluations necessitates the implementation of an efficient methodology for enhancing infrared perception.

With the objective of enhancing infrared perception, we have introduced a self-supervised algorithm. Departing from traditional supervised learning methods, our approach moves away from relying on actual data for supervised training. Instead, we utilize the estimation of the original scene’s radiation intensity range from the input image as the training ground truth. Moreover, we have developed a no-reference quality assessment metric based on perceptual criteria. This metric utilizes the

N L P D

between the image and the original scene’s radiation intensity to evaluate the perceptual enhancement impact of the image [23], simulating the initial transformations within the human visual system [24]. Expanding on this framework, we incorporated

N L P D

into a

C N N

to enhance the perceptual similarity between the image and the original scene’s radiation intensity by minimizing the perceptual loss metric. The

C N N

we implemented boasts a lightweight structure capable of preserving intricate image details while enabling swift online deployment. This design achieves commendable performance at a minimal computational cost.

Our contributions can be succinctly summarized as follows:

Owing to the specific heat distribution and object features contained in infrared images, a lightweight $F C N$ structure is designed to capture the key information in infrared images, such as hotspots, edges, and textures.
By training the model in a self-supervised manner, the proposed method overcomes the limitation of traditional supervised learning, which requires a large amount of ground truth data. This reduces the data cost and workload and improves the utilization rate of the available data.
By incorporating $N L P D$ into the loss function of the $C N N$ , the proposed method leverages the multi-scale image details extracted by the normalized Laplacian pyramid. This enables the enhancement model to achieve excellent results in infrared image perceptual enhancement and demonstrates robust and generalized performance.
Our method achieves excellent performance with a small computational cost and has the fastest running speed, making it suitable for a wider range of physical scenarios.

Through these advancements, our infrared image enhancement technique surpasses existing methods in both visual perception quality and operational speed, promising broad application prospects in various domains.

The rest of this paper is organized as follows: Section 2 reviews related works on traditional image enhancement and perceptual image enhancement. Section 3 details the proposed approach and its underlying principles. Section 4 describes the experimental results and analysis. Section 5 discusses the experimental findings. Finally, Section 6 summarizes the research insights and provides an outlook on future work.

2. Related Work

2.1. Traditional Image Enhancement Methods

Numerous conventional enhancement algorithms have been developed to produce high-contrast infrared images. Among these techniques, histogram equalization (

H E

) [25] emerges as a foundational method that redistributes the grayscale values of image pixels to achieve a more uniform distribution across the entire grayscale spectrum. This technique enhances the contrast and brightness distribution of the image, proving especially effective in visible light imagery. Nonetheless, in the context of noisy infrared images,

H E

often amplifies both the desired signal and the noise simultaneously, leading to suboptimal results. To mitigate this challenge, two variants of

H E

have been introduced:

Dynamic Histogram Equalization and Contrast Enhancement (

D H E C I

) [26]:

D H E C I

integrates dynamic histogram equalization with contrast enhancement to improve both contrast and detail information in images. This technique seeks to enhance the overall effect by incorporating dynamic adjustments and contrast enhancements, proving especially effective in preserving details and enhancing contrast, particularly in hyperspectral images. Contrast-Limited Adaptive Histogram Equalization (

C L A H E

) [27]:

C L A H E

is an adaptive

H E

technique that divides the image into multiple local regions and applies the HE algorithm to each region separately, avoiding potential issues of over-enhancement that global enhancement methods might face. This method performs exceptionally well in situations where images display uneven brightness distributions.

Moreover, Natural Picture Enhancement (

N P E

) [28] is designed to enhance the visual quality of natural scene images by refining contrast and brightness levels.

N P E

enriches image details and colors, enhancing visual appeal and clarity. Unlike

C L A H E

,

N P E

is a versatile enhancement technique suitable for a wide range of natural scene images. By manipulating parameters like

g a m m a

[29] (e.g.,

g a m m a

= 0.8), users can customize contrast and brightness adjustments to achieve desired image enhancement effects. However, these methods may inadvertently introduce increased detail blurring while suppressing noise, potentially compromising the overall quality of the image.

Additionally, classic methods based on wavelet transform [30] and low-light image enhancement (

L I M E

) [31] have been developed:

Wavelet Transform: This approach involves decomposing input images with infrared noise into multiple scales, denoising each scale by utilizing a soft threshold, and subsequently converting the processed wavelet coefficients back to the spatial domain. Low-Light Image Enhancement (

L I M E

):

L I M E

is specifically designed to improve images captured under low-light conditions. It focuses on enhancing brightness and clarity to enhance visibility in dim environments.

L I M E

effectively boosts the quality of low-light images by adjusting relevant parameters, thereby enhancing overall visibility and image quality. In contrast to wavelet-based denoising techniques,

L I M E

emphasizes enhancing visual effects rather than solely processing noise; while these methods are effective in preventing noise amplification, the extent of their enhancement impact may vary.

In recent years, propelled by the rapid progress of deep learning, a growing cohort of researchers has begun integrating it into the realm of image enhancement. Supervised learning approaches commonly leverage paired images to grasp the mapping relationship from suboptimal illumination images to those captured under normal illumination conditions. The MLLEN-IC network framework, as introduced by Fan et al. [32], incorporates the core unit of the Rse2Net constructed network to boost model efficacy through the extraction of multi-scale features. Yang et al. introduced the Deep Recursive Band Network (

D R B N

) for reconstructing improved normal illumination images from paired low-light images utilizing a linear band representation [33]. Furthermore, the URetinex-Net network framework, proposed by Wu et al., decomposes low-light images into reflection layers and illumination layers to amplify illumination effects [34]. However, training these techniques on paired datasets can potentially result in overfitting, limiting the model’s ability to generalize effectively. In a departure from the dependence on paired data, Jiang et al. introduced an unsupervised Generative Adversarial Network (

E n l i g h t e n G A N

) that operates without the need for paired low-light or normal-light image training. Instead, it regularizes based on information gleaned from inputs [35]. This approach of unpaired training regularization utilizing input-extracted information adeptly tackles various novel challenges in low-light image enhancement. Likewise, Liu et al. developed intrinsic exposure structures that delineate weak light images grounded on Retinex principles. They devised a lightweight and efficient low-light image enhancement network by uncovering low-light prior structures from a condensed search space through collaborative learning without reference [36]. Li et al. introduced Zero-Reference Deep Curve Estimation (

Z e r o - D c e P

), a method that enables training without paired data by employing a non-reference loss function, effectively addressing cutting-edge challenges in low-light image enhancement [37]. Nevertheless, the efficacy of these deep learning techniques is heavily contingent on intricately crafted architectures and thoughtfully chosen training data, frequently leading to subpar generalization in real-world settings.

2.2. Image Perception Enhancement Methods

When assessing image quality, metrics like

P S N R

[38],

S S I M

[39],

F M I

[40], and others frequently exhibit notable disparities in comparison to subjective evaluations, underscoring the intricacies of visual perceptual processes. Minimizing or even eradicating this gap poses a fundamental challenge for contemporary image enhancement methods.

Laparra et al. introduced perceptual distance, utilizing the

N L P D

to simulate the initial processing stages of the human visual system. By minimizing the perceived differences between generated images and the original scenes, image enhancement is accomplished, framing the task as a constrained optimization problem. However, due to the non-convex nature of

N L P D

and the high dimensionality of constrained optimization problems, the gradient-based iterative solving methods initially proposed frequently present formidable computational hurdles; while its image enhancement strategy aligns with human visual perception, the slow processing speed hampers its practical applicability in various scenarios.

S R C N N

[41] was one of the pioneering works to employ

C N N

for image super-resolution enhancement, leveraging the benefits of full convolution and a lightweight design. Nonetheless, relying solely on pixel errors between generated and real images as the objective function resulted in relatively blurry generated images. To enhance super-resolution performance further, Ledig et al. introduced

S R G A N

[42]. Generative Adversarial Networks (

G A N

) was the pioneer in utilizing

G A N

for image super-resolution processing, integrating two objective functions: pixel errors between generated and real images and perceptual loss between them. Perceptual loss captures variations at the deep feature level between generated and real images, guaranteeing that the generated images not only closely align with real images in terms of pixel values but also excel in terms of visual perception. This underscores the significant role of perceptual loss in augmenting the quality of generated images.

Building on the methodologies discussed earlier, this study enriches infrared images by employing a lightweight

F C N

guided by perceptual distance to reduce the perceptual gap between enhanced and original images for image enhancement. Unlike conventional supervised learning techniques, this method does not necessitate real data for training, offering a self-supervised approach.

3. Materials and Methods

The algorithm presented in this paper for infrared image perceptual enhancement treats the task as an image transformation challenge. It utilizes a lightweight

F C N

to transform the original image into an enhanced version, leveraging the

N L P D

between the enhanced and original images as the objective function, facilitating network training via a self-supervised strategy. The algorithm’s overall framework is depicted in Figure 1. Initially, the training dataset f is processed to extract inherent supervisory information as training labels S. Subsequently, f is passed through the lightweight

F C N

to derive a transformed image I that matches the size of f. Following this, the labels S and the transformed image I undergo normalized Laplace transforms to obtain their multiscale transforms

f (I)

and

f (S)

. Ultimately, the training process of

F C N

is supervised by minimizing the distance between

f (I)

and

f (S)

.

The actual effectiveness of the proposed infrared image perceptual enhancement algorithm is depicted in Figure 2. Compared to the original image and the method detailed in Section 2 of this paper, the image enhanced by the algorithm in this paper exhibits improved content and detail presentation, which is more conducive to human perception of the scene. The practical effect of the proposed infrared image perception enhancement algorithm is illustrated in Figure 2. Compared to the original image and methods detailed in Section 2 of this paper,

L I M E

and

N P E

exhibit blurred and unclear branches, with missing architectural texture details and overall low image contrast. Images generated by

E n g l i g h t e n G A N

show excessive noise, while

C L A H E

and

D H E C L

exhibit artifacts and insufficient texture description, impacting visual perception.

Z e r o D c e P

and

O u r s

proposed methods strike a good balance, effectively preserving crucial information and light details of targets like pedestrians, branches, and buildings. Among these methods, our proposed approach not only presents superior visual effects but also retains and enhances important target information and scene details.

3.1. Lightweight $F C N$

We employ an

F C N

to achieve the transformation from input images to enhanced images, with the advantage of maintaining consistency in size between the input and output images. To ensure computational efficiency and speed of the algorithm, we adopt a lightweight

F C N

, aiming to minimize the number of network layers and parameter size while preserving the network’s learning capabilities. The lightweight CNN network, with its simple yet effective three-layer architecture and the introduction of perceptual distance as a loss function, demonstrates excellent visual quality in infrared image enhancement. The images enhanced by this network not only exhibit more accurate details but also present a more natural overall perception. Therefore, we propose the network architecture depicted in Figure 3.

The network consists of three convolutional layers:

The first layer serves as the feature extraction layer, extracting overlapping image patches from the input image and mapping them to a high-dimensional space. This process is accomplished through convolutional filters. Formally, assuming the input image is denoted as f, the feature extraction operation can be expressed as follows:

F_{1} (f) = σ (λ_{1} * f + B_{1})

(1)

This layer takes the original image as input with 1 channel and produces image features with 64 channels. Each convolutional kernel has

9 \times 9

weight parameters and 1 bias parameter. Therefore, this layer consists of a total of

64 \times (9 \times 9 + 1) = 5248

parameters.

The second layer functions as the non-linear mapping layer, mapping high-dimensional feature representations to another set of high-dimensional feature representations. This non-linear mapping is crucial for capturing complex structures and patterns within image patches. Its expression is

F_{2} (F_{1} (f)) = σ (λ_{2} * F_{1} (f) + B_{2})

(2)

This layer takes image features as input with 64 channels and produces high-dimensional features with 32 channels. Each convolutional kernel has

9 \times 9

weight parameters and 1 bias parameter. Therefore, this layer consists of a total of

32 \times ((64 \times 7 \times 7) + 1) = 100,384

parameters.

The third layer serves as the image reconstruction layer, reconstructing the image from the high-dimensional features obtained through the non-linear mapping. Its expression is

I = λ_{3} * F_{2} (F_{1} (f)) + B_{3}

(3)

This layer takes high-dimensional features as input with 32 channels and produces the enhanced image with 1 channel. Each convolutional kernel has

7 \times 7

weight parameters and 1 bias parameter. Therefore, this layer consists of a total of

32 \times 7 \times 7 + 1 = 1569

parameters.

In summary, this network has a total of 107,201 parameters, making it a lightweight

F C N

.

In Equations (1)–(3), ∗ denotes the convolution operation,

σ ()

represents the activation function,

λ_{1}, λ_{2}, λ_{3}

are convolutional kernels, and

B_{1}, B_{2}, B_{3}

are bias terms. Here, I denotes the enhanced image output by this network.

Infrared images reflect the thermal radiation distribution of a scene.

C N N

can extract key information such as hotspots, edges, and textures from infrared images. Infrared images are typically blurry and contain various types of noise, which hinder human recognition of targets and perception of the environment. By utilizing our proposed lightweight

F C N

, it is possible to effectively suppress noise in infrared images, enhance image details, improve image clarity, and enhance perceptual effects.

3.2. Objective Function Based on Metric Perceptual Distance

We employ the

N L P D

to quantify the perceptual variances between images.

N L P D

is intricately linked to the early human visual system; it initiates by constructing the Laplacian pyramid of an image [43], executes local luminance subtraction across various scales, and performs amplitude segmentation via local amplitude to diminish redundancy compared to the original image pixels. Substantial research suggests that this metric aligns more closely with human perception [44,45].

Assuming S represents the infrared radiation intensity of the real scene,

f (S)

represents the perceptual result of humans towards S; I represents the displayed infrared image, and

f (I)

represents the perceptual result of humans towards I. The intensity range of real-world infrared radiation is typically very large, while the brightness range of a display is usually between 5 and 300

{(cd / m}^{2})

. This can lead to differences between the perceptual results

f (S)

and

f (I)

for humans.

N L P D

can reflect the differences between the perceptual results

f (S)

and

f (I)

.

3.2.1. $N L P$

Before subjecting the image S to

N L P

processing, preprocessing is essential. Initially, the image S undergoes a non-linear transformation grounded in visual biological principles, approximating the light response transformation in the photoreceptors of the retina [46], as follows:

X^{(0)} = S^{r}

(4)

Next, the preprocessed results undergo Gaussian pyramid transformation [47], as follows:

X^{(k + 1)} = D (L (X^{(k)})), k = 0, 1, 2, \dots, N

(5)

Then, the Laplacian pyramid

Z^{(k)}

[48] is obtained, as follows:

Z^{(k)} = X^{(k)} - L (U (X^{(k + 1)})), k = 0, 1, 2, \dots, N

(6)

Finally, the normalized Laplacian pyramid

Y^{(k)}

is obtained as follows:

Y^{(k)} = \frac{Z^{(k)}}{σ + p (| z^{k} |)}, k = 0, 1, 2, \dots, N

(7)

In the above expression,

D (\cdot)

and

U (\cdot)

represent linear downsampling and linear upsampling, respectively. L is a low-pass filter, k denotes the k-th level of the pyramid, the constant

σ

is a parameter for algorithm stability. P is a filter, commonly using the following template:

P = [\begin{matrix} 0.04 & 0.04 & 0.05 & 0.04 & 0.04 \\ 0.04 & 0.03 & 0.04 & 0.03 & 0.04 \\ 0.05 & 0.04 & 0.05 & 0.04 & 0.05 \\ 0.04 & 0.03 & 0.04 & 0.03 & 0.04 \\ 0.04 & 0.04 & 0.05 & 0.04 & 0.04 \end{matrix}]

(8)

3.2.2. Using $N L P D$ as the Loss Function

We use

N L P D

to describe the difference between the original image S and the rendered image I, and we represent its image enhancement as a constrained optimization problem, aiming to optimize the rendered image I by minimizing the perceptual difference between the rendered image I and the original image S, as follows:

\hat{I} = {arg}_{I} min D (S, I)

(9)

where

D (,)

is a measure of dissimilarity in human perception.

Perceptual difference is quantitatively represented by two parts, firstly, defining a non-linear perceptual transformation

f ()

that approximates the early processing of the human visual system. We apply this transformation to the original scene brightness S and the rendered image I, then measure the distance between

f (S)

and

f (I)

.

The determination of the non-linear perceptual transformation

f ()

:

From Equation (5), the combination of

N L P

coefficients

f (S)

and

f (I)

for each channel represents the response of the perceptual transformation, as follows:

f (S) = {Y^{(k)} : k = 1, \dots, N}

(10)

f (I) = {{\tilde{Y}}^{(k)} : k = 1, \dots, N}

(11)

Using this to measure the distance between

f (S)

and

f (I)

, as follows:

D (S, I) = {[\frac{1}{N} \sum_{k = 1}^{N} {(\frac{1}{{N_{c}}^{(k)}} \sum_{i = 1}^{N_{c}} {| Y_{i}^{(k)} - {\tilde{Y}}_{i}^{(k)} |}^{α})}^{\frac{β}{α}}]}^{\frac{1}{β}}

(12)

Therefore, we use Equation (12) as the

N L P D

to describe the difference between the original image S and the rendered image I, serving as a similarity measure and supervising our lightweight

C N N

. Here,

Y_{i}^{(k)}

and

{\tilde{Y}}^{(k)}

represent corresponding layers of the normalized Laplacian pyramid, and

α

and

β

are parameters optimized to match human perceptual assessments of image quality in image quality databases.

3.3. Self-Supervision

By using

N L P D

as a perceptual metric, we make a reasonable experimental estimation of the radiation intensity range of the real scene S for the displayed image I and linearly rescale the radiation intensity measurements. This resulting image is used as ground truth in the training model. This approach enables the lightweight

C N N

to train an infrared perceptual enhancement model through self-supervised learning. This process is specifically demonstrated as follows:

S = T {{[f]}^{μ}}

(13)

Here, f represents the input image. Taking

μ

as 2.2 indicates a power transformation for

g a m m a

correction, utilized to enhance the display effect of images or adapt to the characteristics of the display device. T linearly maps the

g a m m a

-corrected range of f to [1, 3000], which is the radiation intensity range subjectively estimated through numerous experiments by assessors.

4. Experimental Results and Analysis

To validate the proposed infrared image enhancement method, comprehensive experiments were carried out on various public datasets. This section commences by outlining the experimental setup, datasets used, and evaluation metrics employed. Subsequently, the efficacy of the proposed method is confirmed through ablation and comparative experiments, followed by assessing the processing efficiency.

4.1. Experimental Setup

The experiments were conducted on a computer equipped with an

N V I D I A

G e F o r c e

R T X

4070 Laptop GPU (NVIDIA, Santa Clara, CA, USA). Training samples were image blocks of size

640 \times 480

. The model underwent training for 1000 epochs with a batch size of 8. The Adam optimizer [49] was utilized with a learning rate of

10^{- 5}

. In Equation (4), set

γ = 1 / 2.6

; similarly, in Equation (8), set

σ = 0.17

, in Equation (12), setting the values of

α

and

β

to 2 and 0.6, respectively; and fixing the number of levels N in the Laplacian pyramid to 6, aims to best explain human perceptual ratings of distorted images in a common database [50]. Specifically, we select these parameters to maximize the correlation between the average scores given by human observers and the distances calculated by our metric.

4.2. Dataset

In this study, the experiments were conducted using four datasets:

H I T - U A V

[51],

M S R S

[52],

T N O

[53], and

V I F B

[54]. The training was carried out using 1083 infrared images from the

M S R S

dataset. For performance evaluation, 361 images from the

M S R S

dataset, 25 images from the

T N O

dataset, 290 images from the

H I T - U A V

dataset, and 21 images from the

V I F B

dataset (totaling 697 infrared images) were utilized as the test dataset.

The

H I T - U A V

dataset is a well-known dataset in the field of drone vision research. It offers images and video sequences captured by drones, specifically designed for tasks like object detection, tracking, recognition, and localization. This dataset serves as a valuable resource for researchers and practitioners working on drone vision-related applications.

The

M S R S

dataset predominantly focuses on traffic scenes, encompassing diverse objects such as cars, pedestrians, and bicycles in both daytime and nighttime settings. Moreover, an image-enhancement algorithm grounded in the dark channel prior is applied to enhance the contrast and signal-to-noise ratio of infrared images within this dataset.

The

T N O

dataset consists of 63 images showcasing a range of military and surveillance scenes at various resolutions. These images depict a diverse array of objects and targets set against different backgrounds, including rural and urban environments. This dataset offers a valuable collection for research in military and surveillance imaging applications.

The

V I F B

dataset comprises 21 infrared images sourced from the internet and various tracking datasets. These images span different resolutions and encompass a variety of environments and conditions, including indoor, outdoor, low-light, and overexposed scenarios. This dataset provides a diverse set of images useful for exploring various challenges and scenarios in infrared imaging applications.

We have selected some raw data from the aforementioned four datasets for presentation, as shown in Figure 4. Specifically, a1 to a5 represent data from the

H I T - U A V

dataset; b1 to b5 correspond to the

M S R S

dataset; c1 to c5 depict data from the

T N O

dataset; and d1 to d5 showcase the

V I F B

dataset.

4.3. Evaluation Metrics

We evaluate the proposed infrared image enhancement method using six metrics:

N I Q E

(Naturalness Image Quality Evaluator) [55],

E N

(Entropy) [56],

S S I M

,

A G

(Average Gradient) [57],

P S N R

, and

N L P D

.

N I Q E

is a metric crafted to evaluate the naturalness of an image by measuring how well the image quality aligns with human visual perception. A lower

N I Q E

value indicates that the image closely resembles a natural image. The metric is defined as

NIQE = \sqrt{{(μ_{1} - μ_{2})}^{T} Σ^{- 1} (μ_{1} - μ_{2})}

(14)

Among them,

μ_{1}

and

μ_{2}

represent the mean vectors of the evaluation image and the original natural image, while

Σ

denotes the covariance matrix.

E N

is a metric utilized to measure the information content in an image, serving as a gauge of the image’s complexity. A higher

E N

value signifies that the image contains a more extensive amount of information. This metric is defined as

EN = \sum_{i} P_{i} {log}_{2} P_{i}

(15)

Among them,

P_{i}

represents the probability of each gray level appearing in the image.

S S I M

compares the structural similarity between two images by taking into account factors such as contrast, brightness, and local patterns. The definition of

S S I M

is as follows:

SSIM (x, y) = \frac{(2 μ_{x} μ_{y} + C_{1})}{(μ_{x}^{2} + μ_{y}^{2} + C_{1})} \cdot \frac{(2 σ_{x y} + C_{2})}{(σ_{x}^{2} + σ_{y}^{2} + C_{2})}

(16)

In the

S S I M

formula, x and y represent the source image and the enhanced image within the sliding window, respectively.

σ_{x y}

denotes the image covariance, while

σ_{x}

and

σ_{y}

represent the standard deviations of the images.

μ_{x}

and

μ_{y}

stand for the mean values of the images, respectively.

C_{1}

and

C_{2}

are parameters included to ensure algorithm stability. The

S S I M

value ranges from 0 to 1, where 1 indicates complete similarity between the two images.

A G

is a metric employed to assess the sharpness of an image by calculating the average gradient of the image. A higher

A G

value suggests that the image is sharper. It is defined as

AG = \frac{1}{M N} \sum_{i = 1}^{M} \sum_{j = 1}^{N} \sqrt{\frac{{(\nabla F_{x} (i, j))}^{2} + {(\nabla F_{y} (i, j))}^{2}}{2}}

(17)

In this context,

\nabla F_{x}

and

\nabla F_{y}

symbolize the gradient of the image in the x and y directions, respectively, while M and N denote the total number of pixels in the image.

The quality of an image can be evaluated by computing the

P S N R

, which represents the ratio of the signal to noise in the image. A higher

P S N R

value signifies better image quality. The formula to calculate

P S N R

is as follows:

PSNR = 10 {log}_{10} (\frac{{MAX}^{2}}{MSE})

(18)

In the

P S N R

formula provided earlier,

M A X

represents the maximum pixel value in the image, while

M S E

(Mean Squared Error) denotes the average squared difference between the original image and the processed image. The calculation formula for

M S E

is

MSE = \frac{1}{m n} \sum_{i = 1}^{m} \sum_{j = 1}^{n} {(x (i, j) - y (i, j))}^{2}

(19)

In the context of the

P S N R

formula,

x (i, j)

and

y (i, j)

represent the pixel values of the original and processed images at coordinate

(i, j)

, and m and n stand for the width and height of the image, respectively.

N L P D

quantifies the disparity between the original scene and the enhanced image by evaluating the

N L P D

between them. A lower

N L P D

value signifies a higher quality of the processed image. It is defined as in Equation (12) above.

4.4. Ablation Study

In order to visually understand the effects of each major component of the perceptual loss metric, this section presents image renderings by selectively removing one of the three components of the perceptual loss metric. As shown in Figure 5, a1 to a4 display the original images; b1 to b4, c1 to c4, and d1 to d4, respectively, demonstrate the removal of one of the components of the perceptual loss metric: removal of the initial pointwise non-linearity (changing

γ

from 1/2.6 to 1 in Equation (4)), removal of the multi-scale decomposition (changing the number of layers N from 6 to 1 in the Laplacian pyramid), and removal of the split normalization (changing

σ

from 0.17 to 1 in Equation (7), and changing P to P = 0 in Equation (8)). On the other hand, e1 to e5 showcase images rendered using the complete perceptual loss metric.

After removing the initial pointwise non-linearity component, the overall effect of the images becomes noticeably blurred with inadequate texture description, leading to compromised visual perception. Images where the multi-scale decomposition and split normalization components are removed exhibit severe color distortion or inconsistency, resulting in the loss of details in objects such as pedestrians and trees. Conversely, images rendered using the complete perceptual loss metric alleviate these issues and yield the most visually appealing results. For a closer examination of details, please refer to the ablation experiments shown in Figure 5.

4.5. Comparative Test Analysis

To validate the efficacy of our proposed method, we conducted a comparative analysis of its infrared enhancement performance against several methods detailed in Section 2 of this paper:

C L A H E

,

D H E C L

,

E n g l i g h t e n G A N

,

H E

,

L I M E

,

N P E

, and

Z e r o D c e P

. To ensure fairness, default parameters provided by the respective authors were utilized for all comparison methods. These comparative experiments were carried out across four datasets,

H I T - U A V

,

M S R S

,

T N O

, and

V I F B

, encompassing qualitative evaluation, quantitative assessment, and subjective analysis. The qualitative evaluation involved visually inspecting the enhanced images, with specific targets and regions of interest highlighted using red and green boxes.

4.5.1. Results on the $H I T - U A V$ Dataset

In qualitative comparisons,

D H E C L

,

E n g l i g h t e n G A N

, and

L I M E

demonstrate significant limitations in preserving detailed textures and visual perception in the resultant images. In Figure 6a,

D H E C L

effectively enhances the aircraft visually but causes severe distortion in the ground near the aircraft.

L I M E

exhibits overall ground distortion and poor effects on multiple targets, leading to a significant loss of detail. In Figure 6a–c,

E n g l i g h t e n G A N

yields overall blurriness, accompanied by aliasing artifacts, inadequate texture description, and weak visual perception. Moreover, while

N P E

and

Z e r o D c e P

can effectively display most targets in the enhanced images, they still lack details when representing certain small targets, such as the street lamp in Figure 6b.

Z e r o D c e P

exhibits an overall high contrast in Figure 6, losing texture quality. Although many targets are clearly displayed, the enhancement effect on these targets is, at best, average.

H E

produces images with superior effects; however, it has some drawbacks. For instance, in the case of the aircraft and trees in Figure 6a, the street lamp in Figure 6b, and the wall in Figure 6c, the enhancement effects are slightly inferior to those generated by our proposed method. Our method effectively enhances and retains both the saliency information of targets and fine textures.

Quantitative Comparison: For a quantitative evaluation, 20 images from the

H I T - U A V

dataset were utilized to compare the proposed method against six other enhancement techniques. The average results on the

H I T - U A V

dataset are summarized in Table 1. Six metrics were employed to gauge the quality of the enhanced infrared images produced by the various methods. Our proposed method achieved the second-best results in the

A G

and

E N

evaluation metrics. Moreover, it notably secured the top position in the

N L P D

metric, significantly surpassing the second-best method. A higher

E N

value indicates richer information content in the images, while a higher

A G

value signifies enhanced image clarity. Conversely, a lower

N L P D

value suggests that the perceptual effect of the image aligns more closely with human perception.

4.5.2. Results on the $M S R S$ Dataset

Qualitative Comparison: Figure 7 showcases the enhanced images produced by various algorithms in diverse scenarios. In these images,

L I M E

and

N P E

present blurry tree branches with unclear structures, lacking texture details in buildings, inadequate texture description, and an overall low image contrast.

E n g l i g h t e n G A N

’s images demonstrate excessive noise, impacting visual perception negatively. Conversely,

C L A H E

,

D H E C L

,

Z e r o D c e P

, and our proposed method strike a good balance in various scenes, effectively preserving essential information about targets and lighting conditions like pedestrians, tree branches, and buildings. Among these methods, our proposed approach not only delivers superior visual effects but also preserves and enhances critical target information and scene details.

Quantitative Comparison: The proposed method has been quantitatively compared with six other enhancement methods on the

M S R S

dataset, and the average experimental results on the

M S R S

dataset are presented in Table 2. Our proposed method excels in the

A G

and

E N

metrics, securing the second-best results, and attains the top position in the

N L P D

and

N I Q E

metrics.

4.5.3. Results on the $T N O$ Dataset

Qualitative Comparison: The images generated by

E n g l i g h t e n G A N

and

Z e r o D c e P

exhibit excessively high contrast, leading to detail loss and poor perceptual effects. Specifically, in Figure 8a, distinguishing between the fence and the ground outside is challenging. In Figure 8b, the brightness of the aircraft is excessively high, leading to a diminished contrast between the base of the trees and the branches, resulting in an overall lack of image clarity. Furthermore, in Figure 8c, the tree branches appear blurry. The images generated by

H E

exhibit noticeable aliasing artifacts and insufficient texture description. Images produced by

L I M E

and

N P E

mildly enhance the original images, retaining complete details and textures, but the enhancement effect is relatively weak. Conversely,

D H E C L

, while preserving details and textures intact, offers superior visual perception enhancements in different aspects. Nonetheless, in terms of visual effects, our proposed method still attains the best results.

Quantitative Comparison: The proposed method has been quantitatively compared with six other enhancement methods using images from the

T N O

dataset. The average experimental results on the

T N O

dataset are detailed in Table 3. Our proposed method excels in the

N I Q E

and

A G

metrics and secures the top position in the

N L P D

and

E N

metrics.

4.5.4. Results on the $V I F B$ Dataset

Qualitative Comparison: The images produced by

H E

exhibit noticeable artifacts and lack sufficient texture description. For instance, in Figure 9a, the ground, in Figure 9b, the building, and in Figure 9c, the road suffer from these issues. Furthermore,

H E

overexposes the wall in Figure 9a, resulting in an overall low image quality. The images produced by

E n g l i g h t e n G A N

,

L I M E

,

N P E

, and

Z e r o D c e P

in Figure 9a are of poor quality, appearing blurry with unclear target boundaries and insufficient detail description. The contrast in the images generated by

E n g l i g h t e n G A N

and

Z e r o D c e P

in Figure 9a,b is excessively high, resulting in poor visual quality. The image enhancement effects of

L I M E

,

N P E

, and

Z e r o D c e P

in Figure 9b,c are average, as they struggle to enhance the details of their targets effectively.

D H E C I

and our proposed method effectively preserve and enhance the crucial information present in the original images, including pedestrians, branches, buildings, vehicles, zebra crossings, and more. Notably, our proposed method excels in capturing details and aligns closely with human perception, distinguishing it as a standout performer in the image enhancement process.

Quantitative Comparison: The proposed method has undergone a quantitative comparison with six other enhancement methods using the

V I F B

dataset. The average experimental results on the

V I F B

dataset are provided in Table 4. Our proposed method demonstrates strong performance across the

N I Q E

,

E N

, and

A G

metrics and secures the top position in terms of absolute performance in the

N L P D

indicator.

4.5.5. People’s Subjective Evaluation

In our study, we conducted a human subjective evaluation to compare the performance of our proposed infrared perceptual enhancement method with other methods. We randomly selected five images from each of the test sets

H I T - U A V

,

M S R S

,

T N O

, and

V I F B

, totaling 20 images. For each image, we applied seven different methods (

D H E C I

,

E n g l i g h t e n G A N

,

H E

,

L I M E

,

N P E

,

Z e r o D c e P

,

O u r s

) to enhance it. Subsequently, we presented these seven output images to eleven participants for evaluation. During the evaluation, participants were shown a randomly selected pair of images from the seven outputs and asked to determine which image exhibited better quality in each comparison.

The quality was evaluated based on the following criteria:

(1): Whether the image exhibited texture distortion;
(2): Whether the image contained visible noise;
(3): Whether the image contained over-exposed or under-exposed artifacts.

By having participants subjectively score the 7 methods, we were able to rank the methods from 1 to 7 based on their perceived quality. This ranking process was repeated for all 20 images, and the results are displayed in Figure 10.

In Figure 10, the seven histograms depict the distribution of overall scores given by participants for the seven enhancement methods across the 20 test images. A comparison of these histograms clearly indicates that our proposed method yielded the most favorable results, according to the human participants.

E n g l i g h t e n G A N

received the lowest scores, likely due to its tendency to cause overexposure and sometimes amplify noise.

H E

and

Z e r o D c e P

received varying scores, with

H E

and

Z e r o D c e P

being suitable for some scene images but less effective for others.

L I M E

and

N P E

received average scores, indicating moderate overall image enhancement effects and inadequate detailed descriptions.

4.5.6. Computational Efficiency Analysis

To evaluate processing efficiency, we conducted 10 runs of all images within the

H I T - U A V

,

M S R S

,

T N O

, and

V I F B

datasets using our proposed method and the other six comparison methods, following the previously mentioned configuration. Subsequently, we calculated the average processing time for each method, and the outcomes are summarized in Table 5. The data showcase that our proposed method achieved the fastest average processing speed in comparison to existing image enhancement techniques. This underscores the computational efficiency of our approach, a critical aspect for practical applications that necessitate real-time or near-real-time performance.

5. Discussion

Based on the experimental results and analysis, our proposed method has showcased superior subjective performance when compared to the seven comparison methods across the four test datasets, underscoring the effectiveness and robustness of our approach. While our method may not surpass others in certain objective metrics such as

S N R P

and

S S I M

, it excels in providing a more balanced enhancement of image quality across various aspects. In terms of other objective measures, our method demonstrates relatively better performance compared to the comparison techniques. Particularly noteworthy is its significant outperformance in the

N L P D

metric, which closely aligns with human perceptual assessment. Moreover, our method boasts the fastest processing speed among current infrared image enhancement techniques, highlighting its computational efficiency as a key advantage.

6. Conclusions

In this research paper, we introduced a novel, straightforward, and efficient approach based on an

F C N

for self-supervised perceptual enhancement of infrared images, aiming to enhance images in a manner that aligns closely with human perception. Our method incorporates the

N L P D

metric to evaluate enhancement effects, bridging the gap between objective metric assessments and visual perceptual mechanisms, effectively. Additionally, our approach has showcased the fastest computational efficiency in processing enhanced infrared images, compared to the methods examined in our study. For future work, we intend to delve into the realm of infrared and visible light image fusion using a perceptual loss metric. We also plan to broaden the scope of our method’s evaluation to encompass a wider array of application scenarios. Furthermore, we aim to explore how perceptual enhancement techniques can enhance the efficacy of various visual tasks such as object detection, tracking, and segmentation.

Author Contributions

Methodology, Z.Z. and Y.X.; writing advice, Z.Z. and Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data presented in the study are openly available at https://github.com/suojiashun/HIT-UAV-Infrared-Thermal-Dataset; accessed on 7 September 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Fan, Z.; Bi, D.; Xiong, L.; Ma, S.; He, L.; Ding, W. Dim infrared image enhancement based on convolutional neural network. Neurocomputing 2018, 272, 396–404. [Google Scholar] [CrossRef]
Kuang, X.; Sui, X.; Liu, Y.; Chen, Q.; Gu, G. Single infrared image enhancement using a deep convolutional neural network. Neurocomputing 2019, 332, 119–128. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Siddique, N.; Paheding, S.; Elkin, C.P.; Devabhaktuni, V. U-net and its variants for medical image segmentation: A review of theory and applications. IEEE Access 2021, 9, 82031–82057. [Google Scholar] [CrossRef]
Gao, Z.; Dai, J.; Xie, C. Dim and small target detection based on feature mapping neural networks. J. Vis. Commun. Image Represent. 2019, 62, 206–216. [Google Scholar] [CrossRef]
Han, J.; Moradi, S.; Faramarzi, I.; Zhang, H.; Zhao, Q.; Zhang, X.; Li, N. Infrared small target detection based on the weighted strengthened local contrast measure. IEEE Geosci. Remote. Sens. Lett. 2020, 18, 1670–1674. [Google Scholar] [CrossRef]
Yuan, Z.; Jia, L.; Wang, P.; Zhang, Z.; Li, Y.; Xia, M. Infrared Image Enhancement Based on Multiple Scale Retinex and Sequential Guided Image Filter. In Proceedings of the 2024 3rd Asia Conference on Algorithms, Computing and Machine Learning, Shanghai, China, 22–24 March 2024; pp. 196–201. [Google Scholar]
Shanmugavadivu, P.; Balasubramanian, K. Particle swarm optimized multi-objective histogram equalization for image enhancement. Opt. Laser Technol. 2014, 57, 243–251. [Google Scholar] [CrossRef]
Gupta, B.; Tiwari, M. A tool supported approach for brightness preserving contrast enhancement and mass segmentation of mammogram images using histogram modified grey relational analysis. Multidimens. Syst. Signal Process. 2017, 28, 1549–1567. [Google Scholar] [CrossRef]
Huang, S.C.; Cheng, F.C.; Chiu, Y.S. Efficient contrast enhancement using adaptive gamma correction with weighting distribution. IEEE Trans. Image Process. 2012, 22, 1032–1041. [Google Scholar] [CrossRef]
Lore, K.G.; Akintayo, A.; Sarkar, S. LLNet: A deep autoencoder approach to natural low-light image enhancement. Pattern Recognit. 2017, 61, 650–662. [Google Scholar] [CrossRef]
Wei, C.; Wang, W.; Yang, W.; Liu, J. Deep retinex decomposition for low-light enhancement. arXiv 2018, arXiv:1808.04560. [Google Scholar]
Petro, A.B.; Sbert, C.; Morel, J.M. Multiscale retinex. Image Process. Line 2014, 4, 71–88. [Google Scholar] [CrossRef]
Yang, W.; Wang, W.; Huang, H.; Wang, S.; Liu, J. Sparse gradient regularized deep retinex network for robust low-light image enhancement. IEEE Trans. Image Process. 2021, 30, 2072–2086. [Google Scholar] [CrossRef]
Li, M.; Liu, J.; Yang, W.; Sun, X.; Guo, Z. Structure-revealing low-light image enhancement via robust retinex model. IEEE Trans. Image Process. 2018, 27, 2828–2841. [Google Scholar] [CrossRef]
Levoy, M.; Hanrahan, P. Light field rendering. In Seminal Graphics Papers: Pushing the Boundaries; ACM Digital Library: New York, NY, USA, 2023; Volume 2, pp. 441–452. [Google Scholar]
Gong, X.; Chang, S.; Jiang, Y.; Wang, Z. Autogan: Neural architecture search for generative adversarial networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3224–3234. [Google Scholar]
Gong, C.; Tao, D.; Maybank, S.J.; Liu, W.; Kang, G.; Yang, J. Multi-modal curriculum learning for semi-supervised image classification. IEEE Trans. Image Process. 2016, 25, 3249–3260. [Google Scholar] [CrossRef]
Rani, V.; Nabi, S.T.; Kumar, M.; Mittal, A.; Kumar, K. Self-supervised learning: A succinct review. Arch. Comput. Methods Eng. 2023, 30, 2761–2775. [Google Scholar] [CrossRef] [PubMed]
Papandreou, G.; Chen, L.C.; Murphy, K.P.; Yuille, A.L. Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1742–1750. [Google Scholar]
Huang, X.; Liu, M.Y.; Belongie, S.; Kautz, J. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 172–189. [Google Scholar]
Liu, M.Y.; Breuel, T.; Kautz, J. Unsupervised image-to-image translation networks. In Proceedings of the NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Laparra, V.; Ballé, J.; Berardino, A.; Simoncelli, E.P. Perceptual image quality assessment using a normalized Laplacian pyramid. Electron. Imaging 2016, 28, 1–6. [Google Scholar] [CrossRef]
Laparra, V.; Berardino, A.; Ballé, J.; Simoncelli, E.P. Perceptually optimized image rendering. JOSA A 2017, 34, 1511–1525. [Google Scholar] [CrossRef] [PubMed]
Abdullah-Al-Wadud, M.; Kabir, M.H.; Dewan, M.A.A.; Chae, O. A dynamic histogram equalization for image contrast enhancement. IEEE Trans. Consum. Electron. 2007, 53, 593–600. [Google Scholar] [CrossRef]
Nakai, K.; Hoshi, Y.; Taguchi, A. Color image contrast enhacement method based on differential intensity/saturation gray-levels histograms. In Proceedings of the 2013 International Symposium on Intelligent Signal Processing and Communication Systems, Okinawa, Japan, 12–15 November 2013; pp. 445–449. [Google Scholar]
Reza, A.M. Realization of the contrast limited adaptive histogram equalization (CLAHE) for real-time image enhancement. J. VLSI Signal Process. Syst. Signal Image Video Technol. 2004, 38, 35–44. [Google Scholar] [CrossRef]
Wang, Y.; Cao, Y.; Zha, Z.J.; Zhang, J.; Xiong, Z.; Zhang, W.; Wu, F. Progressive retinex: Mutually reinforced illumination-noise perception network for low-light image enhancement. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2015–2023. [Google Scholar]
Guan, X.; Jian, S.; Hongda, P.; Zhiguo, Z.; Haibin, G. An image enhancement method based on gamma correction. In Proceedings of the 2009 Second International Symposium on Computational Intelligence and Design, Changsha, China, 12–14 December 2009; Volume 1, pp. 60–63. [Google Scholar]
Zhang, D.; Zhang, D. Wavelet transform. In Fundamentals of Image Data Mining: Analysis, Features, Classification and Retrieval; Springer: Berlin/Heidelberg, Germany, 2019; pp. 35–44. [Google Scholar]
Guo, X.; Li, Y.; Ling, H. LIME: Low-light image enhancement via illumination map estimation. IEEE Trans. Image Process. 2016, 26, 982–993. [Google Scholar] [CrossRef]
Fan, G.D.; Fan, B.; Gan, M.; Chen, G.Y.; Chen, C.P. Multiscale low-light image enhancement network with illumination constraint. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7403–7417. [Google Scholar] [CrossRef]
Cha, D.; Jeong, S.; Yoo, M.; Oh, J.; Han, D. Multi-input deep learning based FMCW radar signal classification. Electronics 2021, 10, 1144. [Google Scholar] [CrossRef]
Wu, W.; Weng, J.; Zhang, P.; Wang, X.; Yang, W.; Jiang, J. Uretinex-net: Retinex-based deep unfolding network for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5901–5910. [Google Scholar]
Jiang, Y.; Gong, X.; Liu, D.; Cheng, Y.; Fang, C.; Shen, X.; Yang, J.; Zhou, P.; Wang, Z. Enlightengan: Deep light enhancement without paired supervision. IEEE Trans. Image Process. 2021, 30, 2340–2349. [Google Scholar] [CrossRef] [PubMed]
Liu, R.; Ma, L.; Zhang, J.; Fan, X.; Luo, Z. Retinex-inspired unrolling with cooperative prior architecture search for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10561–10570. [Google Scholar]
Li, C.; Guo, C.; Loy, C.C. Learning to enhance low-light image via zero-reference deep curve estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4225–4238. [Google Scholar] [CrossRef] [PubMed]
Setiadi, D.R.I.M. PSNR vs. SSIM: Imperceptibility quality assessment for image steganography. Multimed. Tools Appl. 2021, 80, 8423–8444. [Google Scholar] [CrossRef]
Hore, A.; Ziou, D. Image quality metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369. [Google Scholar]
Walach, H.; Buchheld, N.; Buttenmüller, V.; Kleinknecht, N.; Schmidt, S. Measuring mindfulness—the Freiburg mindfulness inventory (FMI). Personal. Individ. Differ. 2006, 40, 1543–1555. [Google Scholar] [CrossRef]
Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference (Proceedings, Part II 14), Amsterdam, The Netherlands, 11–14 October 2016; pp. 391–407. [Google Scholar]
Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
Burt, P.J.; Adelson, E.H. The Laplacian pyramid as a compact image code. In Readings in Computer Vision; Elsevier: Amsterdam, The Netherlands, 1987; pp. 671–679. [Google Scholar]
Paris, S.; Hasinoff, S.W.; Kautz, J. Local Laplacian filters: Edge-aware image processing with a Laplacian pyramid. ACM Trans. Graph. 2011, 30, 68. [Google Scholar] [CrossRef]
Laparra, V.; Muñoz-Marí, J.; Malo, J. Divisive normalization image quality metric revisited. JOSA A 2010, 27, 852–864. [Google Scholar] [CrossRef]
Heeger, D.J. Normalization of cell responses in cat striate cortex. Vis. Neurosci. 1992, 9, 181–197. [Google Scholar] [CrossRef]
Lan, Z.; Lin, M.; Li, X.; Hauptmann, A.G.; Raj, B. Beyond gaussian pyramid: Multi-skip feature stacking for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 204–212. [Google Scholar]
Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.H. Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 624–632. [Google Scholar]
Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Ponomarenko, N.; Lukin, V.; Zelensky, A.; Egiazarian, K.; Carli, M.; Battisti, F. TID2008-a database for evaluation of full-reference visual quality assessment metrics. Adv. Mod. Radioelectron. 2009, 10, 30–45. [Google Scholar]
Suo, J.; Wang, T.; Zhang, X.; Chen, H.; Zhou, W.; Shi, W. HIT-UAV: A high-altitude infrared thermal dataset for Unmanned Aerial Vehicle-based object detection. Sci. Data 2023, 10, 227. [Google Scholar] [CrossRef] [PubMed]
Serp, J.; Allibert, M.; Beneš, O.; Delpech, S.; Feynberg, O.; Ghetta, V.; Heuer, D.; Holcomb, D.; Ignatiev, V.; Kloosterman, J.L.; et al. The molten salt reactor (MSR) in generation IV: Overview and perspectives. Prog. Nucl. Energy 2014, 77, 308–319. [Google Scholar] [CrossRef]
Kuenen, J.; Visschedijk, A.; Jozwicka, M.; Denier Van Der Gon, H. TNO-MACC_II emission inventory; a multi-year (2003–2009) consistent high-resolution European emission inventory for air quality modelling. Atmos. Chem. Phys. 2014, 14, 10963–10976. [Google Scholar] [CrossRef]
Zhang, X.; Ye, P.; Xiao, G. VIFB: A visible and infrared image fusion benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 104–105. [Google Scholar]
Zhang, L.; Zhang, L.; Bovik, A.C. A feature-enriched completely blind image quality evaluator. IEEE Trans. Image Process. 2015, 24, 2579–2591. [Google Scholar] [CrossRef]
Guttman, A. R-trees: A dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, Boston MA, USA, 18–21 June 1984; pp. 47–57. [Google Scholar]
Horn-von Hoegen, M.; Schmidt, T.; Meyer, G.; Winau, D.; Rieder, K. Lattice accommodation of low-index planes: Ag (111) on Si (001). Phys. Rev. B 1995, 52, 10764. [Google Scholar] [CrossRef]

Figure 1. Overall framework of the proposed infrared enhancement method.

Figure 2. The actual effect of infrared image perception enhancement, and highlighting specific targets and points of interest using red and green boxes.

Figure 3. Lightweight

F C N

structure.

Figure 3. Lightweight

F C N

structure.

Figure 4. Display some sample images from the datasets used.

Figure 5. Visual comparison of ablation studies, and highlighting specific targets and points of interest using red and green boxes.

Figure 6. Qualitative comparison of three images (a–c) from the

H I T - U A V

dataset with different algorithms. Provide detailed annotations using red and green boxes to emphasize important information.

Figure 6. Qualitative comparison of three images (a–c) from the

H I T - U A V

dataset with different algorithms. Provide detailed annotations using red and green boxes to emphasize important information.