Article

ARiRTN: A Novel Learning-Based Estimation Model for Regressing Illumination

by Ho-Hyoung Choi and Gi-Seok Kim
1 School of Dentistry, Advanced Dental Device Development Institute, Kyungpook National University, Daegu 41940, Republic of Korea
2 School of Logos College, Gyeongju University, Gyeongjusi 38065, Republic of Korea
* Author to whom correspondence should be addressed.
Sensors 2023, 23(20), 8558; https://doi.org/10.3390/s23208558
Submission received: 16 September 2023 / Revised: 3 October 2023 / Accepted: 16 October 2023 / Published: 18 October 2023
(This article belongs to the Section Sensing and Imaging)

Abstract

In computational color constancy, regressing illumination is one of the most common approaches to manifesting the original color appearance of an object in a real-life scene. However, this approach struggles with the challenge of accuracy arising from label vagueness, which is caused by unknown light sources, the different reflection characteristics of scene objects, and extrinsic factors such as the various types of imaging sensors. This article introduces a novel learning-based estimation model, an aggregate residual-in-residual transformation network (ARiRTN) architecture, built by combining the inception model with the residual network and embedding residual networks into a residual network. The proposed model has two parts: the feature-map group and the ARiRTN operator. In the ARiRTN operator, all splits perform transformations simultaneously, and the resulting outputs are concatenated into their respective cardinal groups. Moreover, the proposed architecture develops multiple homogeneous branches for high cardinality and an enlarged set of transformations, which extends the network in both width and depth. Experiments on the four most popular datasets in the field make a compelling case that complexity increases accuracy: combining the residual and inception networks helps reduce overfitting, gradient distortion, and vanishing-gradient problems, and thereby improves accuracy. Our experimental results demonstrate that the model outperforms its most advanced counterparts in accuracy and shows robust illuminant and camera invariance.

1. Introduction

Colors in a scene image tend to be biased by unknown light sources, the different reflection characteristics of scene objects, and the spectral sensitivity of diverse imaging sensors. Remarkably, the human visual perception system (HVPS) perceives colors as constant despite unexpected interactions between different light sources. Color constancy is a key attribute of the HVPS that enables the original color appearance of an object to be perceived consistently under any illuminant. This attribute has long attracted attention in the computational color constancy community because it serves as the underlying mechanism for a wide range of computer vision fields and applications.
In computer vision, color constancy primarily deals with estimating the illumination color of a scene and reproducing the canonical color of scene objects. A wide array of approaches [1,2,3,4,5,6] rely on estimation accuracy to regress the illuminant and use the simple but effective von Kries model [7] to render the scene image. A network is designed to learn a regression mapping from a consistent illuminant label or ground-truth dataset and thereby perform illumination estimation. To enable networks to perform the most accurate estimation possible, it is also critical to formulate the best possible hypothesis about the illuminant [4]. This is a tough task that requires coping with appearance contradiction and label vagueness. The color appearance of a captured scene object varies significantly with the sensitivity of the sensor and the illumination spectrum. To reduce such influences, networks are often trained as camera-specific predictors; however, this is deemed ineffective because of the data demands it imposes. Some approaches attempt to make camera-agnostic illuminant predictions and achieve robust performance, and other approaches, as in refs. [5,6,8,9], have been suggested to address appearance contradiction.
In the inception approaches [10,11,12,13], it is worth noting that theoretical complexity is the basis for building highly sophisticated architectures and thereby improving estimation accuracy. Inception networks have evolved over time [11,12]; behind them lies a split–transform–merge strategy. With a fixed set of receptive field sizes, the network blocks in the architecture perform transformations simultaneously, and the resulting outputs merge in a concatenating manner. These networks have made progress in estimation accuracy, which is attributable to the complexity of the architecture. With the number of receptive fields and their sizes tailored for transformation, the architecture handles data step by step. In this way, constructing ever more sophisticated architectures has brought incremental progress in accuracy, but not innovation. This raises the question of whether such networks can be applied to new or broader ranges of tasks and datasets.
To seek an answer and make meaningful enhancements to the color constancy system, this article introduces a novel learning-based estimation model, an aggregate residual-in-residual transformation network (ARiRTN) architecture, built by combining the inception model with the residual network and embedding residual networks into a residual network. The proposed model has two parts: the feature-map group and the ARiRTN operator. In the ARiRTN operator, all splits perform transformations simultaneously, and the resulting outputs are concatenated into their respective cardinal groups. The proposed architecture is designed to develop multiple homogeneous branches for high cardinality and an enlarged set of transformations, which extends the network in both width and depth.
This article makes three key contributions, as summarized below.
Creating a novel learning-based estimation model, an aggregate residual-in-residual transformation network (ARiRTN), by combining the inception model with the residual network and embedding residual networks into a residual network.
Experimenting and demonstrating the applicability of the inception model to new tasks and datasets.
Achieving next-level estimation accuracy, as verified by experiments on standard, public datasets, and making a meaningful contribution to the field of computer vision color constancy.
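As a concrete illustration of the rendering step mentioned in the introduction, once the illuminant has been regressed, the von Kries model [7] can be applied as a per-channel diagonal correction. The following is a minimal NumPy sketch under the assumption of a linear RGB image; the function and variable names are illustrative and not taken from the original implementation.

```python
import numpy as np

def von_kries_correct(image, illuminant):
    """Apply von Kries-style diagonal correction to a linear RGB image.

    image: float array of shape (H, W, 3), linear RGB in [0, 1].
    illuminant: length-3 estimate of the scene illuminant color (R, G, B).
    """
    illuminant = np.asarray(illuminant, dtype=np.float64)
    # Normalize gains so the green channel is left unchanged (a common convention).
    gains = illuminant[1] / illuminant
    corrected = image * gains[None, None, :]
    return np.clip(corrected, 0.0, 1.0)

# Example: a reddish color cast is removed by dividing out the estimated illuminant.
img = np.random.rand(4, 4, 3) * np.array([1.0, 0.8, 0.6])
balanced = von_kries_correct(img, np.array([1.0, 0.8, 0.6]))
```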

2. Previous Works

The Gray-World hypothesis is at the center of traditional color constancy approaches such as Gray-World (GW) [14] and its extended versions in refs. [15,16]. These approaches assume that a real-life scene has an achromatic mean reflectance under a neutral illuminant, and they use low-level statistics of scene reflectance to describe the achromatic scene color. A related line of work derives from the perfect-reflectance assumption [17,18] and has led to the White-Patch (WP) approaches. These methods feature fast computation and require only a small number of free parameters; however, they depend so heavily on their underlying hypotheses that they struggle in situations outside the conditions of those hypotheses. Some approaches use Bayesian theory [19] to calculate the posterior distribution for estimating the illuminant color and scene surfaces. Bayesian theory is used to compute the prior distribution of illuminant colors and surface reflectance, where the prior distribution is the analytical result of a multivariate truncated normal distribution of the weights of a linear approach. Other approaches [20,21] classify the illuminant color space using the Bayesian framework and train networks on the histogram frequencies of real-life scenes to generate the surface reflectance priors. For illumination estimation, the approach in ref. [20] uses a prior that is uniform across a subset of illuminant colors, whereas that of ref. [21] uses the empirical distribution of the training illuminant colors.
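To make the Gray-World hypothesis concrete, the following minimal NumPy sketch estimates the illuminant as the per-channel mean of a linear RGB image. It is a textbook illustration of GW [14] rather than the code used by any of the cited works.

```python
import numpy as np

def gray_world_estimate(image):
    """Estimate the illuminant color under the Gray-World hypothesis.

    image: float array of shape (H, W, 3), linear RGB.
    Returns a unit-norm RGB illuminant estimate.
    """
    mean_rgb = image.reshape(-1, 3).mean(axis=0)   # achromatic-mean assumption
    return mean_rgb / np.linalg.norm(mean_rgb)
```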
In fully supervised works, learning-based approaches [22,23] encompass combinational and direct methods, and their dependence on hand-crafted image features constrains their performance. Recently, color constancy approaches based on fully supervised convolutional neural networks (CNNs) have made remarkable progress in estimation accuracy. They use either local patches [23,24] or the entire image as input [6,25,26,27,28,29,30]. From a color classification perspective, some approaches, including convolutional color constancy [24] and its extended version, the fast Fourier color constancy approach [9], use a color space on which a histogram shift is used to verify image re-illumination. As a result, they achieve successful and efficient estimation of diverse illumination candidates. The approach in ref. [31] employs K-means clustering to gather illuminants from datasets and adopts a CNN to perform a classification task. Here, the input is a single pre-white-balancing image, the output is a K-class probability, and the K-means clustering predicts the illuminant class that is used to render the image.
Finally, the approach in ref. [32] adopts two CNNs for multi-device training: one carries out a sensor-independent linear transformation with a 3 × 3 receptive field and maps the RGB color images into a canonical color space, while the other provides the estimated illumination. This approach trains on a variety of datasets, excluding those captured by the test imaging device, and arrives at a successful result. Ref. [33] achieves imaging device invariance by using samples from diverse imaging devices and datasets in a meta-learning framework. The approach in ref. [34] assumes that standard RGB images gathered from websites are well white-balanced. These images undergo de-gamma correction for inverse tone mapping, and a CNN is used to pick achromatic pixels for illumination estimation. Because the images were taken with unknown imaging devices and processed with diverse ISP pipelines, they may already have been manipulated by unknown software. Nevertheless, this line of work delivers incremental progress rather than innovation. To take estimation accuracy to the next level, this article introduces a novel learning-based estimation model, an aggregate residual-in-residual transformation network (ARiRTN) architecture, built by combining the inception model with the residual network and embedding residual networks into a residual network. The proposed model has two parts: the feature-map group and the ARiRTN operator. In the ARiRTN operator, all splits perform transformations simultaneously, and the resulting outputs are concatenated into their respective cardinal groups. The proposed architecture is designed to develop multiple homogeneous branches for high cardinality and an enlarged set of transformations, which extends the network in both width and depth. The next section elaborates on the proposed approach in more detail.

3. The Proposed Method

Over the last several decades, the inception approach has demonstrated that carefully designed, more complex architectures can increase accuracy. Inception networks have evolved over time, and their key feature is a split–transform–merge strategy: the network blocks perform transformations simultaneously with a set of specialized receptive fields, and the resulting outputs merge in a concatenating manner. As a result, inception networks improve accuracy, which is attributable to their structural complexity. Inspired by the inception network, and to take estimation accuracy to the next level, a novel learning-based estimation model is introduced: an aggregate residual-in-residual transformation network (ARiRTN) architecture, built by combining the inception model with the residual network and embedding residual networks into a residual network. The proposed model has two parts: the feature-map group and the ARiRTN operator. The subsections that follow discuss the proposed architecture in detail.

3.1. Cardinal Groups of ARiRTN

Cardinal groups are formed by separating features into feature-map groups using a cardinality hyper-parameter $K$, as in ResNeXt [35]. A radix hyper-parameter $R$ expresses the number of splits within a cardinal group, so the total number of feature-map groups is $G = KR$. Supposing that the groups undergo a series of transformations $F_1, F_2, F_3, \dots, F_G$, each split is represented as $U_i = F_i(X)$ for $i \in \{1, 2, 3, \dots, G\}$. As in refs. [36,37], an integral representation of a cardinal group is obtained through an element-wise summation across its splits. Supposing that $\hat{U}^k \in \mathbb{R}^{H \times W \times C/K}$ for $k \in \{1, 2, 3, \dots, K\}$, with $H$, $W$, and $C$ referring to the output feature-map sizes, the $k$-th cardinal group is represented as

$$\hat{U}^k = \sum_{j=R(k-1)+1}^{Rk} U_j .$$

With channel-wise statistics embedded in the architecture, the global contextual information $s^k \in \mathbb{R}^{C/K}$ is obtained through global average pooling over the spatial dimensions. Hence, the $c$-th component is computed as follows [38]:

$$s_c^k = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \hat{U}_c^k(i, j) .$$

Each feature-map channel is then created as a weighted combination across the splits, and the weighted combination of each cardinal group $V^k \in \mathbb{R}^{H \times W \times C/K}$ is gathered via channel-wise soft attention. Letting $a_i^k(c)$ denote a soft assignment weight and letting the mapping $g_i^c$ decide the weight of each split for the $c$-th channel based on the global context $s^k$, the $c$-th channel is described as follows:

$$V_c^k = \sum_{i=1}^{R} a_i^k(c)\, U_{R(k-1)+i} ,$$

where

$$a_i^k(c) =
\begin{cases}
\dfrac{\exp\!\big(g_i^c(s^k)\big)}{\sum_{j=1}^{R} \exp\!\big(g_j^c(s^k)\big)} & \text{if } R > 1 , \\[2ex]
\dfrac{1}{1 + \exp\!\big(-g_i^c(s^k)\big)} & \text{if } R = 1 .
\end{cases}$$

The cardinal groups are then concatenated along the channel dimension as $V = \mathrm{concat}\{V^1, V^2, V^3, \dots, V^K\}$. Supposing that the input and output feature maps have the same shape, the proposed architecture ultimately generates the output $Y$ using a skip connection, described as $Y = V + X$. Further, a transformation $T$ can be adopted to modify the output as $Y = T(X) + V$.
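The following NumPy sketch mirrors the equations above for a single forward pass of the cardinal-group aggregation. The linear map W_g stands in for the learned attention mapping $g_i^c$, and all shapes, names, and the random-weight usage are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def softmax(x, axis=0):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cardinal_group_attention(U_group, W_g):
    """Aggregate the R splits of one cardinal group with channel-wise soft attention.

    U_group: array of shape (R, H, W, Cg) holding the R split outputs U_j.
    W_g:     array of shape (Cg, R * Cg), a stand-in for the learned mapping g.
    """
    R, H, W, Cg = U_group.shape
    U_hat = U_group.sum(axis=0)               # element-wise sum across splits
    s = U_hat.mean(axis=(0, 1))               # global average pooling, s^k in R^{Cg}
    logits = (s @ W_g).reshape(R, Cg)         # g_i^c(s^k) for every split i and channel c
    if R > 1:
        a = softmax(logits, axis=0)           # softmax across the splits
    else:
        a = 1.0 / (1.0 + np.exp(-logits))     # sigmoid gate when R == 1
    # Channel-wise weighted combination of the splits, V^k
    return (a[:, None, None, :] * U_group).sum(axis=0)

def arirtn_operator(X, splits, K, R, W_list):
    """Concatenate the K cardinal groups and add the skip connection Y = V + X.

    X:      input feature map, shape (H, W, C) with C divisible by K.
    splits: array of shape (K * R, H, W, C // K), the transformed splits F_i(X).
    W_list: list of K weight matrices, one per cardinal group.
    """
    groups = splits.reshape(K, R, *splits.shape[1:])
    V = np.concatenate(
        [cardinal_group_attention(groups[k], W_list[k]) for k in range(K)], axis=-1)
    return V + X                              # skip connection

# Toy usage with random tensors: K = 2 cardinal groups, R = 2 splits each.
H, W, C, K, R = 8, 8, 16, 2, 2
X = np.random.randn(H, W, C)
splits = np.random.randn(K * R, H, W, C // K)
W_list = [np.random.randn(C // K, R * (C // K)) for _ in range(K)]
Y = arirtn_operator(X, splits, K, R, W_list)
assert Y.shape == (H, W, C)
```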

3.2. Efficient Implementation of the Proposed ARiRTN Architecture

The previous subsection described the layout of a cardinality-major implementation, in which the feature-map groups with the same cardinal index are placed next to one another. Cardinality-major implementation is simple and intuitive, but it is challenging to modularize and accelerate with standard CNN operations. To address this challenge, a radix-major implementation is adopted for the proposed architecture. Figure 1 presents the proposed ARiRTN architecture with radix-major implementation. The feature map is separated into R × K groups, each carrying a cardinality index and a radix index, and the groups with the same radix index are placed next to one another. An add operation is conducted across all the splits. Finally, the feature-map groups are concatenated in the order of their cardinal numbers: feature-map groups with identical cardinality indexes merge through the concatenation operation, whereas those with different radix indexes do not. Following the global pooling layer, the K successive cardinal groups are summed to estimate the attention weights for each split, as shown in Figure 1.
Figure 2 illustrates the Bottleneck Residual Block (BR-Block) and the Dense Selective Kernel Block (DSKB) shown in Figure 1. The BR-Block is a variant of the residual block that uses a 1 × 1 convolution to create a bottleneck, intended to reduce the number of parameters and perform matrix multiplication. The DSKB, introduced in ref. [30], is composed of Selective Kernel Convolutional Blocks (SKCBs). The input of the l-th SKCB is made up of the feature maps of all its preceding SKCBs, which have undergone the split, fuse, and select operations. The SKCB has the advantage of adjusting the receptive field size to the changing intensity of the input stimuli. Accordingly, the proposed ARiRTN architecture is expected to achieve stability and robustness in regressing the illuminant. Furthermore, the architecture has potential for broader use in a variety of deep learning applications, as it keeps up with the latest network configuration trends.
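As an illustration of the 1 × 1 bottleneck idea, here is a minimal tf.keras sketch of a bottleneck residual block in the spirit of the BR-Block. The filter counts, normalization, and activation choices, as well as the helper name br_block, are assumptions for illustration and do not reproduce the exact configuration used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def br_block(x, bottleneck_filters, output_filters):
    """A minimal bottleneck residual block: 1x1 reduce -> 3x3 -> 1x1 expand + skip."""
    shortcut = x
    y = layers.Conv2D(bottleneck_filters, 1, padding="same")(x)   # 1x1 reduces channels
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(bottleneck_filters, 3, padding="same")(y)   # 3x3 spatial convolution
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(output_filters, 1, padding="same")(y)       # 1x1 restores channels
    y = layers.BatchNormalization()(y)
    if shortcut.shape[-1] != output_filters:
        # Project the shortcut when the channel counts differ.
        shortcut = layers.Conv2D(output_filters, 1, padding="same")(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))             # residual skip connection

# Example: build a tiny model around one such block for 227 x 227 inputs.
inputs = tf.keras.Input(shape=(227, 227, 3))
features = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
outputs = br_block(features, bottleneck_filters=16, output_filters=64)
model = tf.keras.Model(inputs, outputs)
```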

4. Experimental Results and Evaluations

This section discusses the experimental results and evaluations. The proposed ARiRTN architecture was evaluated on public, standard datasets containing a great number of diverse images taken under a multitude of illumination conditions: the Gehler and Shi illuminant dataset [21] of 568 images capturing a considerable variety of indoor and outdoor scenes; the Gray-ball dataset [39] of 11,340 illuminant images of diverse scenes; and the Cube+ illuminant dataset [40] of 1365 images of different scenes, with known illuminant colors and additional semantic data used to improve the training process towards greater estimation accuracy.
The proposed ARiRTN architecture runs on machine learning code in TensorFlow [41] and is operated on an NVIDIA TITAN RTX (24 GB). The total training time is 1 day and 11 h for 10 K epochs. In addition to resizing each image to 227 × 227 pixels, the network is set up with an input batch size of 16. The parameters are optimized through several experiments on the Gehler and Shi illuminant dataset. Figure 3 shows that the proposed ARiRTN architecture tends to converge to zero training loss. Here, with a weight decay of $5 \times 10^{-5}$ and a momentum of 0.9, several training loss curves are compared to determine the optimal initial training rate for the proposed ARiRTN architecture. As highlighted in the previous section, a prominent feature of the proposed architecture is its use of the BR-Block and DSKB in place of the CNN and Dense network, their counterparts in conventional network structures, both of which consist of 1 × 1 convolutional layers. The proposed ARiRTN architecture employs the BR-Block and DSKB to grow in complexity and increase in width and depth, and as a result it makes meaningful improvements in estimation accuracy. Figure 4 and Figure 5 depict the comparisons between the BR-Block and a CNN, and between the DSKB and a Dense network, by calculating and plotting their median and average angular errors on a logarithmic scale. The Gehler and Shi dataset is used in the comparative training and cross-validation experiments, and the median and average angular errors are recorded at intervals of 20 epochs.
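The angular errors reported throughout this section follow the standard recovery angular error, i.e., the angle between the estimated and ground-truth illuminant vectors. A minimal NumPy sketch is given below; the function name and the example values are illustrative only.

```python
import numpy as np

def angular_error_degrees(estimate, ground_truth):
    """Recovery angular error between two RGB illuminant vectors, in degrees."""
    e = np.asarray(estimate, dtype=np.float64)
    g = np.asarray(ground_truth, dtype=np.float64)
    cos_angle = np.dot(e, g) / (np.linalg.norm(e) * np.linalg.norm(g))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# Example: an estimate close to the ground truth yields a small error.
print(angular_error_degrees([0.9, 1.0, 0.8], [1.0, 1.0, 0.8]))  # roughly 3 degrees
```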
The next experiments use several standard datasets, namely the Cube+, Gray-ball, and MultiCam datasets, to compare the performance of the proposed ARiRTN architecture against its most advanced counterparts [42,43,44,45,46,47,48,49,50,51]. In recent decades, the CNN architecture has played an integral role in advanced computer vision tasks, including regressing illuminants. However, this approach has struggled with the challenge of accuracy arising from label vagueness caused by unknown light sources, different reflection characteristics of scene objects, and extrinsic factors such as various types of imaging sensors.
Recently, inception approaches have demonstrated that complexity increases accuracy by carefully designing architectures. Inception networks have evolved over time and their key feature is a split–transform–merge strategy. In the architecture, the network blocks perform transformation simultaneously with a set of specialized receptive fields, and the resulting outputs merge in a concatenating manner. As a result, inception networks bring improved accuracy, which is attributable to their structural complexity. Inspired by the inception network and to overcome the limitation of the conventional CNN architecture, a novel learning-based estimation model is introduced, an aggregate residual-in-residual transformation network (ARiRTN) architecture, by combining the inception model with the residual network and embedding residual networks into a residual network. The proposed model has two parts: the feature-map group and the ARiRTN operator.
Figure 6 displays the resulting images at each step delivered by the proposed ARiRTN architecture on the Gehler and Shi illuminant dataset: (a) the original input image, (b) the result of estimating the illuminant, (c) the ground truth image, and (d) the image after correcting the original, which ultimately renders the real-scene appearance without the undesired illuminant effect.
Table 1 summarizes the comparative analysis between multiple conventional approaches and the proposed ARiRTN architecture in terms of the mean, median, trimean, best 25%, and worst 25% angular errors. The experimental results highlight the strong performance of the proposed ARiRTN architecture compared to its latest counterparts.
Table 2 summarizes the test results that evaluate the proposed ARiRTN architecture against its conventional counterparts and highlights that the proposed architecture significantly outperforms them, topping the state-of-the-art approaches in terms of estimation accuracy. Table 1 and Table 2 demonstrate the robust illuminant invariance of the proposed ARiRTN architecture. Table 3 summarizes the test results that evaluate the proposed ARiRTN architecture against its conventional counterparts in terms of the angular errors of inter-camera estimation, using the MultiCam dataset, which consists of 1365 outdoor images captured with a Canon 550D camera. The results demonstrate that the proposed ARiRTN architecture surpasses its conventional counterparts in inter-camera estimation angular error, proving its robustness in terms of both illuminant and imaging-device invariance.
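The summary statistics used in Tables 1–3 (mean, median, Tukey's trimean, and the means of the best and worst 25% of errors) can be computed from a list of per-image angular errors as in the following sketch. This is a generic illustration of the standard definitions used in the color constancy literature, not the evaluation script behind the reported results.

```python
import numpy as np

def error_statistics(errors):
    """Standard color-constancy summary statistics for a list of angular errors."""
    e = np.sort(np.asarray(errors, dtype=np.float64))
    n = len(e)
    q1, q2, q3 = np.percentile(e, [25, 50, 75])
    return {
        "mean": e.mean(),
        "median": q2,
        "trimean": (q1 + 2 * q2 + q3) / 4.0,      # Tukey's trimean
        "best 25%": e[: max(1, n // 4)].mean(),   # mean of the lowest quartile
        "worst 25%": e[-max(1, n // 4):].mean(),  # mean of the highest quartile
    }

# Example with synthetic per-image errors.
stats = error_statistics(np.random.rayleigh(scale=1.5, size=500))
```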

5. Conclusions

In computational color constancy, regressing illumination is a classical approach to manifesting the original color appearance of an object in a real-life scene. However, this approach has struggled with the challenge of accuracy arising from label vagueness, which is caused by unknown light sources, different reflection characteristics of scene objects, and extrinsic factors such as various types of imaging sensors. This article has introduced a novel learning-based estimation model, an aggregate residual-in-residual transformation network (ARiRTN) architecture, built by combining the inception model with the residual network and embedding residual networks into a residual network. The proposed model has two parts: the feature-map group and the ARiRTN operator. In the ARiRTN operator, all splits perform transformations simultaneously, and the resulting outputs are concatenated into their respective cardinal groups. Moreover, the proposed architecture develops multiple homogeneous branches for high cardinality and an enlarged set of transformations, which extends the network in both width and depth. Comparative experiments were conducted using the four most popular datasets in the field: the Gehler and Shi dataset, the Cube+ dataset, the Gray-ball dataset, and the MultiCam dataset. The proposed architecture makes a compelling case that complexity increases accuracy, demonstrating clear progress in estimation accuracy compared with its previous counterparts. The combination of the residual and inception networks is shown to help reduce overfitting, gradient distortion, and vanishing-gradient problems, and thereby to improve accuracy. These experimental results show that the model outperforms its most advanced counterparts in accuracy while remaining robust to changes in illuminant and camera. Nevertheless, it remains meaningful and worthwhile to continue developing more advanced learning-based illuminant estimation models and to take color constancy to new heights.

Author Contributions

Conceptualization, H.-H.C.; methodology, H.-H.C. and G.-S.K.; software, H.-H.C.; formal analysis, H.-H.C. and G.-S.K.; investigation, H.-H.C.; resources, H.-H.C.; data curation, H.-H.C. and G.-S.K.; writing—original draft preparation, H.-H.C.; visualization, H.-H.C.; supervision, H.-H.C.; project administration, H.-H.C.; funding acquisition, H.-H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2023-00240987).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

http://colorconstancy.com/ (accessed on 2 October 2023).

Acknowledgments

The first author would like to thank the editor and the anonymous reviewers for their insightful comments, and also thanks his wife, Jiwon-Lee, a professional Korean–English interpreter and translator, for proofreading this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Qian, Y.; Kamarainen, J.-K.; Nikkanen, J.; Matas, J. On finding gray pixels. In Proceedings of the Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  2. Chen, K.; Jia, K.; Huttunen, H.; Matas, J.; Kämäräinen, J.-K. Cumulative attribute space regression for head pose estimation and color constancy. Pattern Recognit. 2019, 87, 29–37. [Google Scholar] [CrossRef]
  3. Cheng, D.; Price, B.; Cohen, S.; Brown, M.S. Effective learning-based illuminant estimation using simple features. In Proceedings of the Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  4. Bianco, S.; Cusano, C.; Schettini, R. Single and multiple illuminant estimation using convolutional neural networks. IEEE Trans. Image Process. 2017, 26, 4347–4362. [Google Scholar] [CrossRef] [PubMed]
  5. Shi, W.; Loy, C.C.; Tang, X. Deep specialized network for illuminant estimation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  6. Hu, Y.; Wang, B.; Lin, S. Fc4: Fully convolutional color constancy with confidence-weighted pooling. In Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  7. von Kries, J. Chromatic adaption. In Festschrift der Albrecht-Ludwigs-Universitat; Springer: Berlin/Heidelberg, Germany, 1902. [Google Scholar]
  8. Barron, J.T. Convolutional color constancy. In Proceedings of the International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  9. Barron, J.T.; Tsai, Y.-T. Fast fourier color constancy. In Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  10. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the CVPR, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  11. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the ICML, Lille, France, 6–11 July 2015. [Google Scholar]
  12. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  13. Szegedy, C.; Ioffe, S.; Vanhoucke, V. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the ICLR Workshop, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  14. Buchsbaum, G. A spatial processor model for object colour perception. J. Frankl. Inst. 1980, 310, 1–26. [Google Scholar] [CrossRef]
  15. Finlayson, G.D.; Trezzi, E. Shade of gray and color constancy. In Proceedings of the IS&T/SID Color Imaging Conference, Scottsdale, AZ, USA, 9–12 November 2004; pp. 37–41. [Google Scholar]
  16. van de Weijer, J.; Gevers, T.; Gijsenij, A. Edge-based color constancy. IEEE Trans. Image Process. 2007, 16, 2207–2214. [Google Scholar] [CrossRef] [PubMed]
  17. Land, E.H.; McCann, J.J. Lightness and retinex theory. J. Opt. Soc. Am. 1971, 61, 1–11. [Google Scholar] [CrossRef]
  18. Funt, B.V.; Shi, L. The rehabilitation of maxrgb. In Proceedings of the 18th Color and Imaging Conference, CIC 2010, San Antonio, TX, USA, 8–12 November 2010; pp. 256–259. [Google Scholar]
  19. Freeman, W.T.; Brainard, D.H. Bayesian decision theory, the maximum local mass estimate, and color constancy. In Proceedings of the Fifth International Conference on Computer Vision (ICCV 95), Massachusetts Institute of Technology, Cambridge, MA, USA, 20–23 June 1995; pp. 210–217. [Google Scholar]
  20. Rosenberg, C.R.; Minka, T.P.; Lad-sariya, A. Bayesian color constancy with non-gaussian models. In Proceedings of the Advances in Neural Information Processing Systems 16, Vancouver, BC, Canada, 8–13 December 2003; pp. 1595–1602. [Google Scholar]
  21. Gehler, P.V.; Rother, C.; Blake, A.; Minka, T.; Sharp, T. Bayesian color constancy revisited. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
  22. Funt, B.V.; Xiong, W. Estimating illumination chromaticity via support vector regression. In Proceedings of the Twelfth Color Imaging Conference: Color Science and Engineering Systems, Technologies, Applications, CIC 2004, Scottsdale, AZ, USA, 9–12 November 2004; pp. 47–52. [Google Scholar]
  23. Wang, N.; Xu, D.; Li, B. Edge-based color constancy via support vector regression. IEICE Trans. Inf. Syst. 2009, 92-D, 2279–2282. [Google Scholar] [CrossRef]
  24. Bianco, S.; Cusano, C.; Schettini, R. Color constancy using cnns. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2015, Boston, MA, USA, 7–12 June 2015; pp. 81–89. [Google Scholar]
  25. Lou, Z.; Gevers, T.; Hu, N.; Lucassen, M.P. Color constancy by deep learning. In Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK, 7–10 September 2015; pp. 76.1–76.12. [Google Scholar]
  26. Gong, H. Convolutional mean: A simple convolutional neural network for illuminant estimation. In Proceedings of the British Machine Vision Conference 2019, BMVC 2019, Cardiff, UK, 9–12 September 2019. [Google Scholar]
  27. Choi, H.-H.; Kang, H.-S.; Yun, B.-J. CNN-based illumination estimation with semantic information. Appl. Sci. 2020, 10, 4806. [Google Scholar] [CrossRef]
  28. Choi, H.-H.; Yun, B.-J. Deep learning-based computational color constancy with convoluted mixture of deep experts (CMoDE) fusion technique. IEEE Access 2020, 8, 188309–188320. [Google Scholar] [CrossRef]
  29. Choi, H.-H.; Yun, B.-J. Learning-based illuminant estimation model with a persistent memory residual network (PMRN) architecture. IEEE Access 2021, 9, 29960–29969. [Google Scholar] [CrossRef]
  30. Choi, H.-H. CVCC Model: Learning-Based Computer Vision Color Constancy with RiR-DSN Architecture. Sensors 2023, 23, 5341. [Google Scholar]
  31. Oh, S.-W.; Kim, S.-J. Approaching the computational color constancy as a classification problem through deep learning. Pattern Recognit. 2017, 61, 405–416. [Google Scholar] [CrossRef]
  32. Afifi, M.; Brown, M. Sensor-Independent Illumination Estimation for DNN Models. In Proceedings of the British Machine Vision Conference 2019, BMVC 2019, Cardiff, UK, 9–12 September 2019. [Google Scholar]
  33. McDonagh, S.; Parisot, S.; Li, Z.; Slabaugh, G.G. Meta-learning for few-shot camera-adaptive color constancy. arXiv 2018, arXiv:1811.11788. [Google Scholar]
  34. Bianco, S.; Cusano, C. Quasi-unsupervised color constancy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 15–20 June 2019; pp. 12212–12221. [Google Scholar]
  35. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
  36. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  37. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  38. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Lin, H.; Zhang, Z.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. ResNeSt: Split-Attention Networks. arXiv 2020, arXiv:2004.08955v2. [Google Scholar]
  39. Ciurea, F.; Funt, B. A large image database for color constancy research. In Proceedings of the Eleventh Color Imaging Conference: Color Science and Engineering Systems, Technologies, Applications, CIC 2003, Scottsdale, AZ, USA, 4–7 November 2003; pp. 160–164. [Google Scholar]
  40. Ershov, E.; Savchik, A.; Semenkov, I.; Banic, N.; Belokopytov, A.; Senshina, D.; Koscevic, K.; Subasic, M.; Loncaric, S. The Cube++ illumination Estimation Dataset. IEEE Access 2020, 8, 227511–227527. [Google Scholar] [CrossRef]
  41. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv 2016, arXiv:1603.04467. [Google Scholar]
  42. Land, E.H. The Retinex theory of color vision. Sci. Am. 1977, 237, 108–129. [Google Scholar] [CrossRef]
  43. Koscevic, K.; Subasic, M.; Loncaric, S. Iterative convolutional neural network-based illumination estimation. IEEE Access 2021, 9, 26755–26765. [Google Scholar] [CrossRef]
  44. Xiao, J.; Gu, S.; Zhang, L. Multi-domain learning for accurate and few-shot color constancy. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3258–3267. [Google Scholar]
  45. Domislović, I.; Vršnak, D.; Subašić, M.; Lončarić, S. One-Net: Convolutional color constancy simplified. Pattern Recognit. Lett. 2022, 159, 31–37. [Google Scholar] [CrossRef]
  46. Chen, Y.; Lin, W.; Zhang, C.; Chen, Z.; Xu, N.; Xie, J. Intra-and-inter-constraint-based video enhancement based on piecewise tone mapping. IEEE Trans. Circuits Syst. Video Technol. 2013, 23, 74–82. [Google Scholar] [CrossRef]
  47. Gijsenij, A.; Gevers, T. Color constancy using natural image statistics and scene semantics. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 687–698. [Google Scholar] [CrossRef]
  48. Barnard, K. Improvements to gamut mapping colour constancy algorithms. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2000; pp. 390–403. [Google Scholar]
  49. Chakrabarti, A.; Hirakawa, K.; Zickler, T. Color constancy with spatio-spectral statistics. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1509–1519. [Google Scholar] [CrossRef] [PubMed]
  50. Qian, Y.; Pertuz, S.; Nikkanen, J.; Ka, J.-K.; Matas, J. Revisiting Gray Pixel for Statistical Illumination Estimation. arXiv 2018, arXiv:1803.08326. [Google Scholar]
  51. Qiu, J.; Xu, H.; Ye, Z. Color Constancy by Reweighting Image Feature Maps. arXiv 2020, arXiv:1806.09248v3. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The proposed ARiRTN architecture with radix-major implementation; the feature-map groups, grouped by radix index and cardinality, are sitting next to one another.
Figure 2. Representation of (a) Bottleneck Residual Block (BR-Block) and (b) Dense Selective Kernel Block (DSKB) composed of Selective Kernel Convolutional Blocks (SKCBs) from Figure 1.
Figure 3. Comparison of initial training rates by calculating their train losses to find one that best fits the proposed architecture.
Figure 4. Performance comparison between the Bottleneck Residual Block (BR-Block) and a convolutional neural network (CNN) by calculating (a) median angular errors and (b) average angular errors.
Figure 5. Performance comparison between DSKB and Dense network by calculating (a) median angular errors and (b) average angular errors.
Figure 6. The resulting images at each step delivered by the proposed ARiRTN architecture: (a) the original input image, (b) the estimated illuminant image, (c) the ground truth image, and (d) the rendered image.
Table 1. Comparison of angular errors between multiple conventional approaches and the proposed ARiRTN architecture on the Cube+ dataset (lower values mean higher accuracy).
Method(s)                      Mean    Median   Trimean   Worst 25%   Best 25%
Statistics-Based Approach
  White-Point [42]             9.69    7.48     8.56      20.49       1.72
  Gray-World [14]              7.71    4.29     4.98      20.19       1.01
  Shade of Gray [15]           2.59    1.73     1.93      6.19        0.46
  1st Gray-Edge [16]           2.41    1.52     1.72      5.89        0.45
  2nd Gray-Edge [16]           2.50    1.59     1.78      6.08        0.48
Learning-Based Approach
  Fast Fourier CC [9]          1.38    0.74     0.89      3.67        0.19
  Sq-FC4 [6]                   1.35    0.93     1.01      3.24        0.30
  VGG-16 method [43]           1.34    0.83     0.97      3.20        0.28
  Multi-Domain LCC [44]        1.24    0.83     0.92      2.91        0.26
  One-net [45]                 1.21    0.72     0.83      3.05        0.21
  Ours                         1.11    0.58     0.79      2.50        0.17
The row labeled "Ours" reports the results of the proposed method.
Table 2. Comparison of angular errors between the proposed ARiRTN architecture and the conventional approaches on the Gray-ball dataset.
Method(s)                          Mean    Median   Trimean   Best 25%   Worst 25%
Support Vector Regression [46]     13.17   11.28    11.83     4.42       25.02
Bayesian approach [21]             6.77    4.70     5.00      -          -
Natural Image Statistics [47]      5.24    3.00     4.35      1.21       11.15
Effective Learning-based [3]       4.42    3.48     3.77      1.01       9.36
CNN-based [25]                     4.80    3.70     -         -          -
Ours                               2.85    1.53     1.63      0.42       5.95
The row labeled "Ours" reports the results of the proposed method.
Table 3. Comparison of inter-camera estimation angular errors between the proposed ARiRTN architecture and the conventional approaches with the MultiCam dataset.
Method(s)                          Mean    Median   Trimean   Best 25%   Worst 25%
Gray-World [14]                    4.57    3.63     3.85      1.04       9.64
Gamut mapping-based [48]           3.76    2.99     3.10      1.14       7.70
White-Point [51]                   3.64    2.84     2.95      1.17       7.48
1st Gray-Edge [16]                 3.21    2.51     2.65      0.93       6.61
2nd Gray-Edge [16]                 3.12    2.42     2.54      0.86       6.55
Bayesian approach [21]             3.04    2.28     2.40      0.67       6.69
Shade of Gray [15]                 2.93    2.24     2.41      0.66       6.31
Spatio-spectral statistics [49]    2.92    2.08     2.17      0.46       6.50
Revisiting gray pixel [50]         2.80    2.00     2.22      0.55       6.25
Quasi-unsupervised [37]            2.39    1.69     1.89      0.48       5.47
CNN-based [25]                     1.88    1.47     1.54      0.38       4.90
3-H [51]                           1.67    1.20     1.30      0.38       3.78
Fast Fourier CC [9]                1.55    1.22     1.23      0.32       3.66
Sq-FC4 [6]                         1.54    1.13     1.20      0.32       3.59
Ours                               1.43    1.09     1.02      0.28       3.40
The row labeled "Ours" reports the results of the proposed method.
