1. Introduction
Color transfer is one of the key research topics in digital image processing, and its related technologies can be applied to image coloring and color reproduction, the digital restoration of old image archives, medical image coloring, art restoration, remote sensing image enhancement, infrared image enhancement, and other fields. In 2016, Google, Facebook, Adobe, Tencent, and many other internet companies announced research projects related to image stylization. Color transfer takes a reference image R and a target image T and synthesizes a new image T′ that retains the content of T while inheriting the color of R. Therefore, content preservation and semantically guided color transfer need to be addressed before the algorithm is designed. For content preservation, we need a way to change the color of the image without causing any geometric change. Reinhard [1] addressed this challenge with a global color transformation, but it could only handle simple color transfers because it could not model spatially varying effects. For color transfer, we should respect the semantics of the scene. In [2], a convolutional neural network (CNN) and a Markov random field (MRF) were adopted for regional matching, but unrelated regions of the style image were ignored, resulting in a large difference between the generated image and the expected style. Color transfer between images therefore remains a challenging research problem, and it is of great practical significance to optimize existing color transfer algorithms or to explore new methods. Gatys et al. [3] verified that VGG19-based style transfer can also perform color transfer, which inspires the idea of our work in this paper.
Color enhancement of images or videos can be divided into two categories: fully automatic color enhancement [4,5,6] and semi-automatic, user-interactive color enhancement. There are four modes of user interaction: image coloring based on color hints [7,8], image coloring based on reference images [1,9,10,11,12,13,14,15,16,17,18,19,20,21,22], image coloring based on palettes [23,24], and image coloring based on text [25,26]. Reference-based image coloring requires the user to provide a reference image as input; a neural network model then finds feature correspondences between the color reference image and the gray target image and copies and transfers the corresponding colors.
As early as 2001, Reinhard et al. [1] proposed a color transfer algorithm between color images that exploits the orthogonality of matrices. Color transfer subsequently attracted extensive attention, and many researchers have improved the original algorithm or proposed new color transfer algorithms. Inspired by these color transfer algorithms, Welsh et al. [9] achieved color transfer to grayscale images by matching the pixel brightness and texture information of the target and reference images. However, finding the best matching point requires traversing all pixels of the color image, which is quite time-consuming. To shorten this time and achieve more accurate local color transfer, Chang et al. [10] proposed to first classify images according to color and then match pixels of the target image to pixels of the reference image within the same category. Gupta et al. [11,12,13] proposed that users provide, or an algorithm retrieve from the internet, a reference image that is semantically similar to the target image, and then use feature matching to identify the correspondence between pixels in the two images, which shortens the running time. Since the effectiveness of features is often significantly affected by local image characteristics, these traditional methods tend to combine multiple low-level features to improve matching performance. However, low-level features cannot capture the semantic information of the image, so these methods produce poor color enhancement on images with complex textures.
Recent studies have used deep learning to learn the semantic information of images and establish pixel-level mapping relationships to achieve color enhancement. Li et al. [14] realized automatic coloring of grayscale images by using dictionary matching and sparse reconstruction, both performed at the superpixel level. Vondrick et al. [15] proposed a self-supervised model that learns to track targets by learning a video coloring task. Li et al. [16] proposed an image coloring method that automatically divides the image into uniform and non-uniform areas and selects an appropriate feature vector for each block of the target image to determine its color; the results are finally merged through an MRF to improve consistency. He et al. [17] proposed the first exemplar-based local coloring model, which first looks for a reference image in a database that is similar to the target image and then uses an end-to-end CNN to color the image. Xiao et al. [18] designed the network as a pyramid structure to facilitate the transfer of the color distribution from low layers to high layers, taking both semantic information and details into account during coloring; the model no longer produces a single fixed colorization. Fang et al. [19] proposed extracting multi-level features of each superpixel from the two images and determining the most appropriate color of each target superpixel by a variational method. Li et al. [20] proposed an automatic coloring method based on local texture matching; its innovations are cross-scale matching, an upper and lower color position distribution, and a confidence-weighted color propagation method that improves coloring at edges.
Color transfer between images maps the color of a reference image onto a target image, transforming the color of the target image from one distribution to another at minimum transformation cost; this is essentially an optimal transport problem. Our goal is to transform grayscale images, or images with little color, into color images with high contrast, clear details, and vivid colors that have better visual effects and improve the ability of human eyes to distinguish image details. In this paper, a color transfer algorithm between images based on a two-stage CNN is proposed, and the resulting images are shown in Figure 1. The image output by the neural network in the first stage has the content of T and the color attributes of R, with good visual quality. The image output by the neural network in the second stage contains more emotional factors, which can express content beyond the image itself and better meet the needs of users.
The contributions of this paper include: (1) A new deep neural network (DNN) model that can generate multi-target coloring results is proposed. The model can generate different color palettes according to different reference images and thus produce different coloring effects. (2) Color transfer is applied only in the color space, which suppresses image distortion and generates satisfactory coloring effects in various scenes. (3) Regardless of whether the reference image is related to the target image, the model delivers reasonable colors and prevents spill-over effects, such as the texture of a building being transferred to the sky.
2. Methods
The architecture of the entire neural network comprises two stages: the reference image-based color transfer (RICT) model and the palette-based emotional color enhancement (PECE) model, as shown in Figure 2. The network model in the first stage can color the target image on its own. The network model in the second stage adjusts the emotional color of the resulting image; that is, the user can modify the emotional tone of the image by editing the palette. The second stage thus emotionally enhances the coloring result of the first stage, turning the colored image into the emotional result that the user wants to express.
2.1. Reference Image-Based Color Transfer
Coloring is a skilled technical task because it is not easy to design a set of natural and harmonious color combinations and apply them properly to an image. The purpose of the RICT model is to extract the attribute features of the reference image and apply them to semantically related content in the target image. Before constructing the model, the following two problems need to be solved. First, compute the semantic correlation between the two images; this paper uses gray-VGG19 [27] to extract detailed image features and perform feature matching. Second, transfer colors based on similarity rules. The RICT network architecture is shown in Figure 2; the input is the target image T and the reference image R. The network uses the pre-trained VGG19 to extract deep feature maps of the two images and calculates the semantic similarity between them to obtain the bidirectional mapping functions $\phi_{T \to R}$ and $\phi_{R \to T}$. Finally, this information is used to propagate the correct colors to color blocks or pixels with the same semantics.
2.1.1. Feature Extraction and Feature Matching
As T and R have different visual appearances, it is difficult for the network to directly learn the mapping from T to R. However, the mapping can be decomposed into two steps: (1) an alignment mapping between points at the same position, and (2) a color mapping. Since T and T′ (and likewise R and R′) are aligned point-by-point at the same positions, the mapping $\phi_{T \to R}$ is defined from T (or T′) to R (or R′); similarly, $\phi_{R \to T}$ is a mapping from R (or R′) to T (or T′). Assuming that point p of image T is mapped to point q of image R, then $q = \phi_{T \to R}(p)$. At the same time, p and $\phi_{R \to T}(q)$ should be the same point, namely $\phi_{R \to T}(\phi_{T \to R}(p)) = p$. To strengthen the symmetry, the bidirectional constraint $\phi_{T \to R}(\phi_{R \to T}(q)) = q$ can also be imposed.
The output of the VGG-19 network is a pyramid with five layers of feature maps; each layer $i$ has the feature maps $F_{T}^{i}$ and $F_{R}^{i}$ of the two input images, together with the reconstructed maps $F_{T'}^{i}$ and $F_{R'}^{i}$. Starting from the highest level ($i = 5$), the first step is to calculate the fifth-layer mappings $\phi_{T \to R}^{5}$ and $\phi_{R \to T}^{5}$ by nearest-neighbor field search. The second step is to upsample these mappings to initialize the mappings of the next layer. As pooling is applied between the layers of the VGG network, $\phi^{i}$ is half the size of $\phi^{i-1}$. The computation then proceeds layer by layer until the first-layer mappings $\phi_{T \to R}^{1}$ and $\phi_{R \to T}^{1}$ and the corresponding feature maps are obtained.
For the four feature maps $F_{T}^{i}$, $F_{T'}^{i}$, $F_{R}^{i}$, and $F_{R'}^{i}$ of layer $i$, the mapping is defined as follows:

$$\phi_{T \to R}^{i}(p) = \arg\min_{q} \sum_{x \in N(p),\, y \in N(q)} \left( \left\| F_{T}^{i}(x) - F_{R}^{i}(y) \right\|^{2} + \left\| F_{T'}^{i}(x) - F_{R'}^{i}(y) \right\|^{2} \right)$$

where $N(p)$ is a small patch centered on point p; the patch size used in the search depends on the layer $i$.
When we obtain $\phi_{T \to R}^{i}$ and $\phi_{R \to T}^{i}$, we need to construct the feature maps of the $(i-1)$th layer. Among them, $F_{T'}^{i-1}$ is obtained from the reference features through transformation: we use the $i$th-layer mapping, obtain an $(i-1)$th-layer mapping after upsampling, and then apply it to the reference features to obtain the warped result used as $F_{T'}^{i-1}$. However, convolution, pooling, and other operations are performed between the $(i-1)$th layer and the $i$th layer, so the feature maps of the two layers are not aligned. To obtain $F_{T'}^{i-1}$, the warped reference features therefore need to be calculated and reconstructed in advance. If $\mathrm{CNN}_{i-1 \to i}(\cdot)$ is defined as the sub-network between layer $i-1$ and layer $i$, then the reconstructed feature $\hat{F}^{i-1}$ obviously has to satisfy that $\mathrm{CNN}_{i-1 \to i}(\hat{F}^{i-1})$ is as close as possible to the warped $i$th-layer feature; $\hat{F}^{i-1}$ can be obtained by minimizing this distance, and $F_{T'}^{i-1}$ can then be approximated.
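To make the feature extraction and matching step concrete, the following minimal sketch (not the authors' implementation) extracts a five-level VGG-19 feature pyramid with torchvision and computes a coarse nearest-neighbor mapping at the top level using per-location cosine similarity. The patch aggregation, bidirectional constraint, and layer-by-layer refinement described above are omitted, and the file names are placeholders.

```python
# Minimal sketch: VGG-19 feature pyramid + coarse nearest-neighbor field.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

RELU_X_1 = [1, 6, 11, 20, 29]   # relu1_1 ... relu5_1 in torchvision's VGG-19

def feature_pyramid(img_path, device="cpu"):
    """Return the list [F^1, ..., F^5] of VGG-19 relu*_1 feature maps."""
    vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval().to(device)
    pre = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    x = pre(Image.open(img_path).convert("RGB")).unsqueeze(0).to(device)
    feats, out = [], x
    with torch.no_grad():
        for idx, layer in enumerate(vgg):
            out = layer(out)
            if idx in RELU_X_1:
                feats.append(out)
            if idx == RELU_X_1[-1]:
                break
    return feats

def nn_field(feat_t, feat_r):
    """For each location of feat_t, the most similar location of feat_r (cosine similarity)."""
    _, c, h, w = feat_t.shape
    t = F.normalize(feat_t.reshape(c, -1).t(), dim=1)   # (h*w, c)
    r = F.normalize(feat_r.reshape(c, -1).t(), dim=1)
    sim = t @ r.t()                                      # similarity matrix
    return sim.argmax(dim=1).view(h, w)                  # coarse phi_{T->R} at this layer

if __name__ == "__main__":
    f_t = feature_pyramid("target.jpg")
    f_r = feature_pyramid("reference.jpg")
    phi5 = nn_field(f_t[-1], f_r[-1])                    # coarsest-level mapping
    print(phi5.shape)
```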
2.1.2. Color Transfer
Obviously, the larger the weight at a position, the more $F_{T'}^{i}$ should preserve the content structure of T and the less it should take from the detailed features of R′. We therefore define $F_{T'}^{i}$ as a weighted combination of the target features and the warped reference features:

$$F_{T'}^{i} = F_{T}^{i} \circ W^{i} + F_{R'}^{i}\big(\phi_{T \to R}^{i}\big) \circ \big(1 - W^{i}\big)$$

where $\circ$ denotes element-wise multiplication and the weight map $W^{i}$ is obtained by applying the sigmoid function to the normalized responses of $F_{T}^{i}$.
The mapping $\phi_{T \to R}^{i}$ must then be refined by fine adjustment of the features of the four images at layer $i$ through nearest-neighbor field search. In the nearest-neighbor field search of the $i$th layer, a random search is performed only within a certain range around the point p given by the mapping propagated from the layer above, so as to fine-tune it and obtain $\phi_{T \to R}^{i}$. For each layer, a fixed search-range radius is used (the radii differ between layers). Calculations are performed layer by layer until we obtain $\phi_{T \to R}^{1}$, which is then used directly as the pixel-level mapping, because there is no pooling between the first layer and the input layer and their spatial sizes are the same. When $\phi_{T \to R}$ is obtained, the colored image T′ can be obtained by keeping the luminance of T and taking the chrominance of R at the matched positions, i.e., $T'_{ab}(p) = R_{ab}\big(\phi_{T \to R}(p)\big)$, where the subscript $ab$ denotes the chrominance channels.
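As an illustration of this final warping step (an assumption about one possible implementation, not the paper's code), the sketch below keeps the luminance of T and copies the a/b chrominance channels of R at the positions given by the pixel-level mapping.

```python
# Illustrative sketch: chrominance transfer in Lab space using a dense mapping.
import cv2
import numpy as np

def transfer_chrominance(target_bgr, reference_bgr, phi):
    """phi: (H, W, 2) integer array of (row, col) positions in the reference
    image for every target pixel, e.g. the upsampled first-layer mapping."""
    t_lab = cv2.cvtColor(target_bgr, cv2.COLOR_BGR2LAB)
    r_lab = cv2.cvtColor(reference_bgr, cv2.COLOR_BGR2LAB)
    rows, cols = phi[..., 0], phi[..., 1]
    out = t_lab.copy()
    out[..., 1:] = r_lab[rows, cols, 1:]     # a, b channels taken from the reference
    return cv2.cvtColor(out, cv2.COLOR_LAB2BGR)

# usage: result = transfer_chrominance(T, R, phi) with phi resized to T's size
```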
For a backlit image I, the target image is first enhanced through the color estimation model (CEM) [28], which applies a global correction controlled by an adjustment parameter whose value is set according to the gray mean of the image I. By using the CEM to enhance the backlit image globally, the overall brightness of the input image can be improved and the color and detail information of the image can be restored.
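The exact CEM formula of [28] is not reproduced above, so the following sketch only illustrates the general idea of a gray-mean-driven global brightness correction; the gamma-style mapping and the target_mean parameter are illustrative assumptions, not the CEM equation itself.

```python
# Hedged sketch: global brightness correction for a backlit image,
# driven by the image's gray mean (a stand-in for the CEM of [28]).
import cv2
import numpy as np

def enhance_backlit(bgr, target_mean=0.5):
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    mean = float(gray.mean())                                   # gray mean of image I
    gamma = np.log(target_mean) / np.log(max(mean, 1e-6))       # <1 brightens dark images
    img = bgr.astype(np.float32) / 255.0
    return (np.clip(img ** gamma, 0.0, 1.0) * 255.0).astype(np.uint8)

# usage: enhanced = enhance_backlit(cv2.imread("backlit.jpg"))
```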
2.2. Palette-Based Emotional Color Enhancement
Images not only affect people on an emotional level but can also directly express people's emotions. An emotion can be expressed by multiple color combinations, and the same color combination can express different emotions; however, when the proportions of colors in two images are similar, the emotions displayed by the images will also be similar. The purpose of the PECE model is to modify the color palette of the resulting image to enhance its emotional tone based on the data of the reference image. Before constructing the model, the following three problems need to be solved. First, compute the emotional value of the resulting image and the reference image: a CNN extracts color, texture, and content features for image emotion classification, so that image emotion can be trained, learned, and simulated, addressing the problem of the subjective evaluation of image emotion. Second, obtain the color palettes of the resulting image and the reference image and count the proportion of each color in each image; the K-Means algorithm is used to re-cluster the foreground and background palettes into a "5 + 2" color palette. Third, perform color remapping. Throughout these tasks, the palette is presented as a color wheel that displays the specific colors, color names, and proportions, which has good artistic reference value.
2.2.1. Emotional Computation of Images
You et al. [29] proposed a progressive convolutional neural network (PCNN) model based on CNNs, which continuously learns the semantic features of the image to classify image emotion as positive or negative. This paper directly uses the PCNN to train and realize the classification of eight kinds of image emotions on the ArtPhoto dataset [30] and the FlickrEmotion dataset [31]. The main process is shown in Figure 3 and includes image preprocessing, feature extraction and feature selection, and classifier design and learning.
(1) Image Preprocessing
Segmentation, enhancement, and morphological processing of the input original image remove redundant information, filter noise, and enhance the informative features in the image. After preprocessing, each image is resized to a fixed resolution, which helps the classifier extract features from the input image.
(2) Feature Extraction and Feature Selection
Machajdik et al. [30] proposed, from the perspective of psychology and art theory, that color, texture, and composition can be used as features to express image emotions. This article uses a color histogram based on HSV blocks [32] to extract the color features of the image: the RGB value of each pixel is first converted to HSV, and the three HSV values are then weighted and summed to obtain a single value representing the color feature. Local binary pattern (LBP) features [33] are used to describe the local texture of the image, histogram of oriented gradients (HOG) features [34] are used to represent the appearance and shape of local objects, and Haar-like features (content features) [35] are used to recognize human faces. These features are combined into a 47-dimensional feature vector. There are many methods of feature selection, such as principal component analysis (PCA) [36], exhaustive search [37], heuristic search [38], and random search [39].
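As a rough illustration of this hand-crafted feature pipeline, the sketch below concatenates an HSV color histogram, an LBP texture histogram, and a HOG descriptor; the bin counts and cell sizes are illustrative and do not reproduce the paper's exact 47-dimensional layout or the Haar-like face features.

```python
# Hedged sketch: HSV + LBP + HOG emotion features (illustrative parameters).
import cv2
import numpy as np
from skimage.feature import local_binary_pattern, hog

def emotion_features(bgr):
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    # Coarse HSV color histogram (8 x 4 x 4 bins), normalized.
    color = cv2.calcHist([hsv], [0, 1, 2], None, [8, 4, 4],
                         [0, 180, 0, 256, 0, 256]).flatten()
    color /= color.sum() + 1e-8

    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    # Uniform LBP histogram as a local texture descriptor.
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    texture, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)

    # HOG descriptor for the appearance/shape of local objects.
    shape = hog(cv2.resize(gray, (128, 128)), orientations=9,
                pixels_per_cell=(32, 32), cells_per_block=(2, 2))

    return np.concatenate([color, texture, shape])

# usage: x = emotion_features(cv2.imread("photo.jpg")); x can then be reduced
# with PCA (e.g., sklearn.decomposition.PCA) before classification.
```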
(3) Loss Function
The loss function is key to the image emotion classifier and the learning effect of the model. For the CNN, the nonlinear error function is

$$E(k, \beta, W, b) = \frac{1}{2}\sum_{n}\big\| y_{n} - \hat{y}_{n} \big\|^{2}$$

where $k$, $\beta$, $W$, and $b$ in turn represent the convolution kernel parameters of the convolutional layers that are continuously optimized and updated during back propagation (BP), the weight coefficient of the down-sampling layer, the weights of the fully connected layer, and the bias values; $y_{n}$ represents the output value of the network, and $\hat{y}_{n}$ represents the corresponding expected output value. According to the final output of the loss function, the entire network is fine-tuned in the reverse direction, and the BP algorithm adjusts the parameter values in each layer so that the loss reaches its minimum and the final classification gets closer to the expected value.
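The following minimal sketch shows the corresponding training step for an eight-class emotion classifier: a squared-error loss between the network output and the expected one-hot label, minimized by back propagation. The tiny CNN is a stand-in, not the PCNN architecture.

```python
# Minimal sketch: squared-error loss + back propagation for 8 emotion classes.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(32, 8), nn.Softmax(dim=1),   # 8 emotion classes
)
opt = torch.optim.SGD(net.parameters(), lr=0.01)

def train_step(images, onehot_labels):
    """images: (N, 3, H, W) float tensor; onehot_labels: (N, 8) expected outputs."""
    opt.zero_grad()
    out = net(images)
    loss = 0.5 * ((out - onehot_labels) ** 2).sum(dim=1).mean()  # squared error E
    loss.backward()                                              # BP updates the parameters
    opt.step()
    return loss.item()
```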
2.2.2. Palette Generation
Considering that the color distribution is related to the objects in the scene, GrabCut [40] can be used to extract the foreground and background of the image, and K-Means [41] is then used to re-cluster the foreground and background palettes into a "K + 2" (default K = 5) color palette. The effect is shown in Figure 4.
(1) GrabCut
This involves selecting a rectangular box and using the color image inside the box as an input parameter of GrabCut, indicating that the pixels inside the box may belong to the foreground, while the part outside the box must belong to the background. This article uses a mask image as an input parameter of GrabCut to mark the foreground and background of the image.
(2) K-Means
K-Means is used to cluster the foreground and background of the image separately, and the two palettes are then merged in a 5:2 ratio to identify the theme colors of the image. The K-Means procedure first determines K objects as initial cluster centers by a histogram peak filtering method, then computes the distance between each object and each cluster center and assigns each object to the nearest center; each cluster center and the objects assigned to it represent one cluster. After each assignment, the cluster center of each cluster is recalculated from the objects currently in it. This process is repeated until a termination condition is met: no objects are reassigned to different clusters, no cluster centers change, or the sum of squared errors reaches a local minimum.
(3) Determine the Range of Color Blocks
The RGB color space of the image is converted to the HSV color space, and the hue range corresponding to each color is then looked up in an HSV reference table to count the number of pixels of each color and determine the range of the color blocks (for example, hue values from 156 to 180 correspond to red).
(4) Color Ratio Calculation
Dividing the number of pixels of a certain color in the image by the total number of pixels gives an approximate proportion, which is taken as that color's ratio in the image.
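A compact sketch of this palette-generation pipeline is given below, assuming GrabCut is initialized with a rectangle (a mask can be used instead, as noted above): the foreground is clustered into K colors and the background into two, and each color's ratio is the fraction of image pixels assigned to its cluster. The parameter values are illustrative.

```python
# Sketch: "K + 2" palette via GrabCut segmentation + K-Means clustering.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def palette(bgr, rect, k_fg=5, k_bg=2):
    mask = np.zeros(bgr.shape[:2], np.uint8)
    bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
    cv2.grabCut(bgr, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
    fg = np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))

    colors, ratios = [], []
    for pixels, k in [(bgr[fg], k_fg), (bgr[~fg], k_bg)]:
        km = KMeans(n_clusters=k, n_init=10).fit(pixels.reshape(-1, 3).astype(np.float32))
        counts = np.bincount(km.labels_, minlength=k)
        colors.append(km.cluster_centers_.astype(np.uint8))
        ratios.append(counts / bgr.shape[0] / bgr.shape[1])   # share of the whole image
    return np.vstack(colors), np.concatenate(ratios)          # 7 colors + their ratios

# usage: cols, props = palette(cv2.imread("result.jpg"), rect=(10, 10, 200, 200))
```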
2.2.3. Palette-Driven Color Remapping
Palette-driven color remapping modifies the brightness of pixels with the help of a brightness (luminance) transfer function, modifies the corresponding color values with the help of a color transfer function f, and finally combines color emotion with image coloring to produce an image with the desired emotional tone. The mapping function f must satisfy the interpolation property, the range property, continuity, a one-to-one relationship, and monotonicity of brightness [23].
(1) Brightness Transfer Function
A weighted combination of the two nearest palette entries is used. The first operation is to extract the brightness values corresponding to the modified palette and monotonically sort the corresponding colors in each palette. Let the original luminance values of the seven palette colors of the input image be $L_{1} \le L_{2} \le \dots \le L_{7}$, and let the luminance values of the seven palette colors after color editing be $L'_{1}, L'_{2}, \dots, L'_{7}$. The palette can be edited by the user at will, but one condition must always be maintained: the edited luminances must remain in the same monotonic order, i.e., $L'_{1} \le L'_{2} \le \dots \le L'_{7}$.
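A minimal sketch of such a monotone luminance transfer is shown below: a piecewise-linear interpolation between the original and edited palette luminances with pinned endpoints, which realizes the "weighted combination of the two nearest palette entries"; the exact weighting scheme of [23] is not reproduced.

```python
# Sketch: monotone piecewise-linear luminance remapping driven by the palette.
import numpy as np

def luminance_transfer(L_pixels, L_orig, L_edit):
    """L_pixels: array of pixel luminances; L_orig/L_edit: palette luminances."""
    xs = np.concatenate(([0.0], np.sort(L_orig), [100.0]))    # CIELAB L range endpoints
    ys = np.concatenate(([0.0], np.sort(L_edit), [100.0]))    # sorting keeps monotonicity
    return np.interp(L_pixels, xs, ys)

# usage: new_L = luminance_transfer(lab[..., 0], palette_L, edited_L)
```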
(2) Color Transfer Function
Single palette: Assume that only one color C is contained in the palette. For any color x, there is a transferred color $f(x)$; that is, the user can modify the color C to C′ through the color transfer function f. The function f needs to satisfy the one-to-one color transfer rule, which can be divided into two steps. The first step is to find $C_{b}$, the intersection of the ray from C in the direction of C′ with the color gamut boundary. The second step is to determine whether the parallel offset $x + (C' - C)$ lies in the color gamut. If it lies well inside the gamut, far from the gamut boundary, then $x_{b}$ is the position where the ray from x, parallel to the direction from C to C′, intersects the gamut boundary. If $x + (C' - C)$ falls outside the gamut boundary but is very close to it, then $x_{b}$ can be defined as the intersection with the gamut boundary of the ray from x toward that offset point. Finally, the transferred color $f(x)$ is taken along the direction from x to $x_{b}$, moved by a fraction of the distance that corresponds to the palette modification.

When x is far from the gamut boundary, the offset of $f(x)$ is proportional to the offset from C to C′, with a maximum ratio of 1; in this case, the colors of the image follow the palette change in parallel. When x is very close to the boundary, the offset is attenuated, which still achieves the desired palette change in these areas while keeping the colors inside the gamut.
K-color palette: When K colors are contained in the palette, the transfer function should be generalized to handle K theme colors. The strategy adopted is to define K transfer functions $f_{1}, \dots, f_{K}$, each of which is equivalent to the single-color function f above, and then mix them, weighted by proximity:

$$f(x) = \sum_{i=1}^{K} w_{i}(x)\, f_{i}(x)$$

Among them, the weight $w_{i}(x)$ depends on the distance between x and the palette color $C_{i}$, and $\sum_{i} w_{i}(x) = 1$. The weight coefficients are obtained by the least-squares method, with the scalar parameters constrained at the palette colors $C_{1}, \dots, C_{K}$ so that each weight equals 1 for its own palette color and 0 for the others. If the colors of five palette entries are changed at the same time, compatibility and coordination between the five colors are very important, and the color-enhanced image must not exhibit color penetration or color inconsistency. Therefore, after the color conversion equation is determined, all pixels in the input image are updated as $x' = f(x)$, where x is a pixel of the input image, x′ is the pixel after the color update, the transfer is built from the five-color combination selected in the database, and C denotes a color of the image palette.
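The sketch below illustrates the K-color mixing idea under simplifying assumptions: each palette edit contributes an offset, and the offsets are blended with normalized Gaussian proximity weights instead of the least-squares RBF weights of [23]; gamut handling is reduced to a simple clip.

```python
# Simplified sketch: K-color palette-driven recoloring with proximity weights.
import numpy as np

def recolor(pixels, palette, palette_edited, sigma=40.0):
    """pixels: (N, 3) colors; palette, palette_edited: (K, 3) original/edited colors."""
    diffs = pixels[:, None, :] - palette[None, :, :]             # (N, K, 3)
    w = np.exp(-np.sum(diffs ** 2, axis=2) / (2 * sigma ** 2))   # proximity weights (N, K)
    w /= w.sum(axis=1, keepdims=True) + 1e-8
    offsets = palette_edited - palette                            # per-color edits (K, 3)
    moved = pixels + w @ offsets                                  # weighted mix f(x)
    return np.clip(moved, 0, 255)                                 # crude gamut clamp

# usage: out = recolor(img.reshape(-1, 3).astype(float), pal, pal_new).reshape(img.shape)
```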
2.3. Objective Function and Network Training
The objective function must satisfy two conditions: first, the color of the most similar reference is applied preferentially; second, even if there is no reliable reference image, the network can still learn the inherent association between gray values and colors. This paper directly uses the loss function of [21] to train the network. It is a multi-task network with a chrominance branch and a perceptual branch, and both branches share the same network structure C and weights.
When training the chrominance branch, we input the luminance of the target image and the chrominance of the reference image, aligned by the computed mapping, into the network to generate the predicted chrominance. The prediction is colored based on the reference, and the chromaticity information should be restored when the correct sample and color are selected. The chromaticity error adopts the smooth $L_1$ distance between the predicted chrominance and the ground-truth chrominance; using smooth $L_1$ avoids obtaining an averaged (mean) solution in the ambiguous coloring problem.
When training the perceptual branch, we input the target luminance and the reference chrominance into the network to predict the chromaticity information. The perceptual error is the distance between the feature map obtained by passing the predicted result through the VGG relu5_1 layer and the feature map obtained by passing the ground-truth color image through the same layer. The perceptual error can eliminate semantic differences caused by incorrect coloring and makes the network robust to cases in which two different colors are both reasonable. The parameters are optimized as

$$\theta^{*} = \arg\min_{\theta}\big( \mathcal{L}_{chrom} + \lambda\, \mathcal{L}_{percep} \big)$$

Among them, the parameter $\lambda$ indicates the relative weight between the two branches, and it is set to 0.005.
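A hedged sketch of how this two-branch objective can be assembled is given below; the colorization network, the Lab-to-RGB helper, and the data tensors are placeholders, and the perceptual branch is simplified to compare the prediction with the ground-truth image at VGG relu5_1 with λ = 0.005.

```python
# Sketch: smooth-L1 chrominance loss + VGG relu5_1 perceptual loss (lambda = 0.005).
import torch
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:30].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def relu5_1(rgb):                        # feature map at VGG relu5_1
    return vgg(rgb)

def total_loss(net, t_l, r_ab_aligned, t_ab, t_rgb, lam=0.005):
    # Chroma branch: predict ab from target luminance + aligned reference chroma.
    pred_ab = net(t_l, r_ab_aligned)
    chroma = F.smooth_l1_loss(pred_ab, t_ab)

    # Perceptual branch: compare VGG features of the predicted and true images.
    pred_rgb = lab_to_rgb(t_l, pred_ab)          # assumed helper (not shown here)
    percep = F.mse_loss(relu5_1(pred_rgb), relu5_1(t_rgb))
    return chroma + lam * percep
```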
4. Conclusions
As an image enhancement technique, image coloring aims to improve coloring accuracy, visual quality, and precision in downstream image analysis. With the help of deep learning, this paper designs a color transfer method based on reference images. It has three main advantages: (1) Robust reference selection: even if the two images or local regions are unrelated, the model can still achieve a good result. (2) Flexible operation: unlike previous deep learning frameworks, the coloring result can still be controlled manually, and images and videos can also be colored fully automatically. (3) Good transferability: the model can be applied to the color restoration of faded photographs, medical image coloring, art restoration, remote sensing image enhancement, infrared image enhancement, and other fields. Our model also has three limitations: (1) Because of the perceptual loss function, it cannot generate particularly unusual or artist-designed colors. (2) The perceptual loss based on a classification network cannot penalize wrong colors in regions of low semantic importance, such as visually similar sand and grass textures; that is, the model cannot distinguish semantically weak regions with similar local textures. (3) When there is a significant difference in brightness between the images, the color of the resulting image is not very faithful to the reference image. To mitigate this issue, our model enforces brightness consistency before performing the color transfer.