HLNet: A Unified Framework for Real-Time Segmentation and Facial Skin Tones Evaluation

Feng, Xinglong; Gao, Xianwen; Luo, Ling

doi:10.3390/sym12111812

by Xinglong Feng †, Xianwen Gao * and Ling Luo †

College of Information Science and Engineering, Northeastern University, Shenyang 110819, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Symmetry 2020, 12(11), 1812; https://doi.org/10.3390/sym12111812

Submission received: 21 October 2020 / Revised: 29 October 2020 / Accepted: 30 October 2020 / Published: 1 November 2020

(This article belongs to the Section Computer)


Abstract: Real-time semantic segmentation plays a crucial role in industrial applications such as autonomous driving and the beauty industry. Balancing speed against segmentation performance is a challenging problem. To address this task, this paper introduces an efficient convolutional neural network (CNN) architecture named HLNet for devices with limited resources. Based on high-quality design modules, HLNet better integrates high-dimensional and low-dimensional information while obtaining sufficient receptive fields, and it achieves remarkable results on three benchmark datasets. The accuracy of skin tone classification is usually unsatisfactory due to the influence of external environmental factors such as illumination and background impurities. Therefore, we use HLNet to obtain accurate face regions, and then use the color moment algorithm to extract their color features. Specifically, for a 224 × 224 input, HLNet achieves 78.39% mean IoU on the Figaro1k dataset at over 17 FPS in a CPU environment. We further use the masked color moments for skin tone grade evaluation, and a classification accuracy of approximately 80% demonstrates the feasibility of the proposed method.

Keywords: semantic segmentation; deep convolutional neural network; skin tone classification; color moment

1. Introduction

Augmented Reality (AR) technology has been widely used in various fields and has become a hot spot in recent years. Among its applications, automatic hair dyeing based on 2D color imaging, as shown in Figure 1, attracts the most attention; its prerequisite is the precise segmentation of the hair area. Early studies on hair segmentation focused primarily on hand-crafted features [1,2,3], which required professional skills and were labor-intensive. Moreover, the resulting models generally generalized poorly.

In recent years, the advent of deep convolutional neural networks (DCNNs) has improved the performance of many tasks, most notably semantic segmentation. Semantic segmentation is an advanced visual task whose goal is to assign a dense label to each image pixel. As one of its subtasks, hair segmentation has also received widespread attention. For example, Borza et al. [4] performed hair segmentation with the aid of a symmetrical UNet, whose output was subsequently refined using morphological knowledge. Wen et al. [5] proposed an end-to-end detection-segmentation system to implement detailed face labeling, including hair. Using a pyramid FCN to encode multi-level feature maps, this method effectively alleviates the imbalance of semantic categories. More recently, Luo et al. [6] designed a lightweight segmentation network that combines the advantages of multiple modules to effectively resolve the ambiguity of edge semantics. This method is also suitable for mobile devices.

Even so, the application of these methods is limited by the following factors. First, due to the diverse appearance of hair and its complex structural information, it is extremely difficult to process the edges accurately [7]. Although existing semantic segmentation methods [8,9,10] achieve relatively high segmentation performance for simple objects, they produce only coarse masks for hair. Secondly, the vast majority of networks require graphics processing units (GPUs) with powerful computing capabilities that most mobile devices do not have, which greatly limits their usage scenarios. Thirdly, taking into account runtime limitations, conditional random fields (CRFs) [11] are not suitable for processing edges (e.g., shredded hair), so it is necessary to find an alternative solution. Taking all these factors into consideration, real-time hair dyeing faces enormous challenges. On the other hand, e-commerce and digital interaction with clients allow people to buy their favorite products without leaving home, and a robust product recommendation function plays an important role here. Automatic assessment of skin tone levels makes it possible to personalize recommendations for beauty products. However, complex external factors such as illumination, shadows, and background impurities can affect the assessment, in which case even an experienced skin therapist can hardly judge the skin tone with the naked eye. This paper is dedicated to solving the aforementioned problems using machine learning and deep learning algorithms.

In this paper, we strive to balance performance and efficiency, and provide a much simpler and more compact alternative for our segmentation task. To get accurate segmentation results, local and global context information should be considered simultaneously. Based on this observation, we propose a spatial and context information fusion framework called HLNet, which integrates high-dimensional and low-dimensional feature maps in parallel. While increasing the receptive field, it effectively alleviates the insufficient extraction of shallow features. Moreover, inspired by BiSeNet [12], the Feature Fusion Module (FFM) is used to re-encode feature channels using context to improve the feature representation of a particular category. Extensive experiments confirm that our HLNet achieves a favorable trade-off between efficiency and accuracy. Considering that background illumination is not conducive to identifying skin, we extract features (a.k.a. masked color moments) based on the segmented face and the color moment algorithm [13]. The masked color moments are then fed into a Random Forest Classifier [14] to evaluate a person’s skin tone level. Furthermore, we verify the feasibility of the method on a manually labeled dataset.

In summary, our main contributions are as follows:

(1): We propose an efficient hair and face segmentation network that uses newly proposed modules to achieve real-time inference while guaranteeing performance.
(2): A module called InteractionModule is given, which exploits multi-dimensional feature interactions to mitigate the weakening of spatial information as the network becomes deeper and deeper.
(3): A novel skin color level evaluation algorithm is proposed and obtains accurate results on a manually labeled dataset.
(4): Our method achieves superior results on multiple benchmark datasets.

The rest of the paper is organized as follows. In Section 2, we review previous work on lightweight model design and edge post-processing. In Section 3, we describe the proposed method in detail. Section 4 provides the experimental data and parameter configuration as well as a manually annotated dataset. In Section 5, we report the experimental results. Concluding remarks and future work are given in Section 6.

2. Related Works

Real-time semantic segmentation. Since the pioneering work [8] based on deep learning, many high-quality backbones [15,16,17,18] have been derived. However, due to the requirements of computationally limited platforms (e.g., drones, autonomous driving and smartphones), researchers pay more attention to the efficiency of the networks than to performance alone. ENet [19] is the first lightweight network for real-time scene segmentation; it works in an end-to-end manner and does not apply any post-processing steps. Zhao et al. [20] introduced a cascade feature fusion unit to quickly achieve high-quality segmentation. Howard et al. [21] proposed a compact encoder module based on a streamlined architecture that uses depthwise separable convolutions to build light-weight deep neural networks. Poudel et al. [22] combined spatial detail at high resolution with deep features extracted at lower resolution, yielding beyond real-time speed. DFANet [23] starts from a single lightweight backbone and aggregates discriminative features through sub-network and sub-stage cascades, respectively. Recently, LEDNet [24] has been proposed, in which channel split and shuffle are used in each residual block to greatly reduce the computation cost while maintaining high segmentation accuracy.

Contextual information. Some details cannot be recovered during conventional up-sampling of the feature maps to restore the original image size. The design of skip connections [25] can alleviate this deficiency to some extent. Besides, Zhao et al. [17] proposed a pyramid pooling module that can aggregate context information from different regions to improve the ability to capture multi-scale information. Zhang et al. [26] designed a context encoding module to introduce global contextual information, which is used to capture the context semantics of the scene and selectively highlight the feature map associated with a particular category. Fu et al. [27] addressed the scene parsing task by capturing rich contextual dependencies based on spatial and channel attention mechanisms, which significantly improved the performance on numerous challenging datasets.

Post processing. Generally, the quality of the above segmentation methods is obviously rough and requires additional post-processing operations. Post-processing mechanisms are usually able to improve image edge detail and texture fidelity, while maintaining a high degree of consistency with global information. Chen et al. [28] proposed a CRF post processing method that overcomes poor localization in a non-end-to-end way. CRFasRNN [11] considers the CRF iterative reasoning process as an RNN operation in an end-to-end manner. To eliminate the excessive execution time of CRF, Levinshtein et al. [29] presented a hair matting method with real-time performance on mobile devices.

Our approach draws upon these strengths. Furthermore, for the subsequent skin tone grading task, we employ masked color moments, which are discussed in Section 3.2.

3. Methodology

3.1. High-To-Low Dimension Fusion Network

The proposed HLNet network is inspired by HRNet [30] which maintains high-resolution representation through the whole process by connecting high-to-low resolution convolutions in parallel. Figure 2 illustrates the overall framework of our model. We experimentally prune the model parameters to increase the speed without excessive performance degradation. Furthermore, the existing SOTA modules [12,22,31,32] are reasonably combined to further improve the performance of the network. Table 1 gives an overall description of the modules involved in the designed network. The model consists of different kinds of convolution modules, bilinear up-sampling, bottlenecks, and other feature maps communication modules. In the following part, we will expand the above modules in detail.

To preserve details as much as possible, the downsampling rate of the entire network is set to 1/8. Specifically, in the first three layers, we follow Fast-SCNN [22] and employ vanilla convolution and depthwise separable convolution for fast down-sampling in order to ensure low-level feature sharing. Depthwise separable convolution effectively reduces the number of model parameters while achieving a comparable representation ability. Each of these convolutions uses a stride of 2 and a kernel size of 3 × 3, followed by BN [33] and a ReLU activation function.
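The parameter saving of depthwise separable convolution can be checked with a few lines of arithmetic. The sketch below is only an illustration: the channel widths (32 in, 64 out) are hypothetical and are not taken from the paper, and bias terms are ignored.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weight count of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weight count of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 32 input channels, 64 output channels.
standard = conv_params(3, 32, 64)
separable = depthwise_separable_params(3, 32, 64)
print(standard, separable, round(standard / separable, 2))  # 18432 2336 7.89
```

For this hypothetical layer the separable variant needs roughly one eighth of the weights, which is the effect the paragraph above relies on.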

According to FCOS [34], the low-dimensional detail information of the feature map promotes the segmentation of small objects, so we strengthen the model’s ability to represent details by stacking low-dimensional layers. Moreover, the interaction of high-resolution and low-resolution information facilitates the learning of multi-scale representations. Drawing on these advantages, we propose an information interaction module (InteractionModule) that operates on feature maps of different resolutions. Conceptually, for the backbone $\phi_{n}^{i}(x)$, a stage process can be defined as $\phi_{n}^{i}$, where n and i represent the index and the width of the stage, respectively. The calculation process in the dotted rectangle can be formulated as:

$$\phi_{n}^{i} = \begin{cases} \mathrm{Conv}(\phi_{n-1}^{i}), & n = 4, \\ \sum_{i=1}^{M} \mathrm{Conv}(\phi_{n-1}^{i}), & n = 5, \\ \mathrm{Concat}(\phi_{n-1}^{i}, \ldots, \phi_{n-1}^{M}), & \text{otherwise} \end{cases} \qquad (1)$$

where M is 3. $\mathrm{Conv}$ and $\mathrm{Concat}$ denote the convolution operator and the stacking of feature maps along the channel dimension, respectively. MobileNet v2 [31] takes advantage of residual blocks and depthwise separable convolution, which greatly reduces the number of parameters while effectively avoiding gradient dispersion. The inverted residual block proposed by MobileNet v2 is utilized to improve the sparse parameter space by proper pruning. In particular, for $\phi_{n}^{i}$ ($i = 1, \ldots, M$), the corresponding parameters {k = 3, c = 64, t = 6, s = 1, n = 3} → {k = 3, c = 96, t = 6, s = 2, n = 3} → {k = 3, c = 128, t = 6, s = 4, n = 3} are given in order, where k, c, t, s and n denote the size of the convolution kernel, the number of feature map channels, the channel multiplication factor, the stride and the number of module repetitions, respectively. Next, feature maps of different scales are combined and exchanged by using a 1 × 1 convolution, strided convolution or upsampling. A 1 × 1 convolution can increase or decrease the dimension of the feature map without significantly increasing the number of parameters. In addition, the ReLU behind it increases the overall nonlinear fitting ability of the network. The last part of the InteractionModule is implemented by using $\mathrm{Concat}$ in order to aggregate multi-scale context features. Subsequently, following the FFM attention of BiSeNet [12], the model focuses more on channels that contain important features and suppresses those that are not important. Concretely, the input passes through a global pooling layer and two convolutional layers followed by ReLU and Sigmoid, respectively, and the result is multiplied element-wise with the input. To mitigate vanishing gradients during back-propagation, a skip connection is added between the input and the output. Then, to capture multi-scale context information, we also introduce a multi-receptive-field fusion block (DilatedGroup), whose dilation rates are set to 2, 4, and 8.
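The channel re-weighting step described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: global pooling is taken to be average pooling, the two 1 × 1 convolutions on a pooled vector are written as matrix products, and the function name and weight shapes are hypothetical.

```python
import numpy as np


def channel_attention(x, w1, b1, w2, b2):
    """FFM-style channel attention on one feature map.

    x: (H, W, C) features; w1: (C, C_mid); w2: (C_mid, C).
    Returns an array with the same shape as x.
    """
    pooled = x.mean(axis=(0, 1))                       # global average pooling -> (C,)
    hidden = np.maximum(pooled @ w1 + b1, 0.0)         # first 1x1 conv + ReLU
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))   # second 1x1 conv + Sigmoid
    return x * gate + x                                # re-weight channels, then skip connection


rng = np.random.default_rng(0)
features = rng.normal(size=(28, 28, 8))
out = channel_attention(
    features,
    rng.normal(size=(8, 4)), np.zeros(4),
    rng.normal(size=(4, 8)), np.zeros(8),
)
print(out.shape)  # (28, 28, 8)
```

Because the gate lies in (0, 1) and the input is added back, each output channel is the input channel scaled by a factor between 1 and 2, so no channel is suppressed to zero and gradients can always flow through the skip path.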

For simplicity, the decoder performs bilinear upsampling (a transposed convolution layer can cause gridding artifacts [29]) directly on the 28 × 28 feature map, followed by a 3 × 3 convolution so that the number of output channels matches the number of categories. Finally, a SoftMax layer is connected for dense classification.

In terms of loss function, we apply generalized dice loss (GDL) [35] to compensate for the segmentation performance of small objects, which is formulated as:

$$GDL_{loss} = 1 - \frac{2}{L} \frac{\sum_{l=1}^{L} \omega_{l} \cdot \sum_{n=1}^{N} r_{ln} p_{ln}}{\sum_{l=1}^{L} \omega_{l} \cdot \sum_{n=1}^{N} (r_{ln} + p_{ln})} \qquad (2)$$

$$\omega_{l} = \frac{1}{\left(\sum_{n=1}^{N} r_{ln}\right)^{2}} \qquad (3)$$

where p denotes the SoftMax output and r denotes the one-hot encoding of the ground truth. N and L represent the total number of pixels and categories, respectively. Equation (3) gives the expression of $\omega_{l}$, which is the category balance coefficient.
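A NumPy sketch of Equations (2) and (3) is given below for illustration. The function name, the flattened (N, L) input layout, and the small constant eps are assumptions. The flag scale_by_classes switches between the 2/L prefactor printed in Equation (2) and the plain prefactor 2 that is common in other statements of GDL; with 2/L a perfect prediction evaluates to 1 − 1/L, with 2 it evaluates to 0.

```python
import numpy as np


def generalized_dice_loss(r, p, scale_by_classes=True, eps=1e-7):
    """Generalized dice loss for flattened inputs.

    r: (N, L) one-hot ground truth; p: (N, L) SoftMax output.
    scale_by_classes=True uses the 2/L prefactor of Equation (2);
    False uses the prefactor 2 instead.
    """
    num_classes = r.shape[1]
    w = 1.0 / (r.sum(axis=0) ** 2 + eps)            # Equation (3); eps avoids division by zero
    numerator = np.sum(w * (r * p).sum(axis=0))     # weighted overlap per class
    denominator = np.sum(w * (r + p).sum(axis=0))   # weighted total mass per class
    factor = 2.0 / num_classes if scale_by_classes else 2.0
    return 1.0 - factor * numerator / denominator


# Toy example: 6 pixels, 3 classes, prediction equal to the ground truth.
truth = np.eye(3)[[0, 0, 1, 2, 2, 2]]
print(generalized_dice_loss(truth, truth, scale_by_classes=False))
print(generalized_dice_loss(truth, truth, scale_by_classes=True))
```

The inverse-square weights give a class with few pixels the same influence on the loss as a class with many, which is why the loss helps with small objects such as thin hair strands.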

To pursue perceptual consistency and reduce the running time, we adopt the Guided Filter [36,37] to achieve edge preservation and denoising. The Guided Filter can effectively suppress gradient-reversal artifacts and produce visually pleasing edge profiles. Given a guidance image I and a filtering input image P, our goal is to learn a local linear model that describes the relationship between I and the output image Q while seeking consistency between P and Q, similar to the role of Image Matting [38]. During the experiment, s, r, and ζ are empirically set to 4, 4, and 50, respectively.
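The local linear model behind the Guided Filter can be sketched for a single-channel image as below. This is a minimal illustration, not the post-processing code used in the experiments: it omits the subsampling of the fast variant, uses edge padding at the borders, and calls the regularizer eps; how eps relates to the ζ reported above is an assumption left open. Inputs are assumed to be float arrays.

```python
import numpy as np


def box_mean(x, r):
    """Mean over a (2r+1) x (2r+1) window, computed with an integral image."""
    k = 2 * r + 1
    padded = np.pad(x, r, mode="edge")
    integral = np.cumsum(np.cumsum(padded, axis=0), axis=1)
    integral = np.pad(integral, ((1, 0), (1, 0)))  # zero row and column at the top-left
    h, w = x.shape
    total = (integral[k:k + h, k:k + w] - integral[:h, k:k + w]
             - integral[k:k + h, :w] + integral[:h, :w])
    return total / (k * k)


def guided_filter(I, P, r=4, eps=1e-2):
    """Grayscale guided filter: Q = mean(a) * I + mean(b), with a, b fitted per window."""
    mean_I, mean_P = box_mean(I, r), box_mean(P, r)
    cov_IP = box_mean(I * P, r) - mean_I * mean_P
    var_I = box_mean(I * I, r) - mean_I ** 2
    a = cov_IP / (var_I + eps)      # slope of the local linear model
    b = mean_P - a * mean_I         # intercept of the local linear model
    return box_mean(a, r) * I + box_mean(b, r)


rng = np.random.default_rng(0)
guide = rng.random((32, 32))                               # stand-in for a grayscale input image
coarse_mask = (guide > 0.5).astype(float)                  # stand-in for a network mask
refined = guided_filter(guide, coarse_mask, r=4, eps=1e-2)
print(refined.shape)  # (32, 32)
```

In flat regions of the guide the slope a tends to zero and the output is a local average of P, while near strong edges of the guide a grows and the output follows the edge, which is the edge-preserving behavior described above.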

3.2. Facial Skin Tone Classification

The purpose of the second stage is to classify the facial skin tone. For Asian faces, we usually divide it into porcelain white, ivory white, medium, yellowish and black. Because the feature space of skin tone is small, DCNN-based methods are not well suited to feature extraction here. Therefore, after repeated experimental trial and error, we chose to extract the color moments of the image as the features and feed them into a classic machine learning algorithm. For facial skin tone in complex scenes, background lighting has a severe impact on the results, so we employ image morphology algorithms and pixel-level operations to remove background interference. Algorithm 1 summarizes the pseudo code of the extraction process. The pre-processed face image is used to extract the color moment features, which are then fed into a Random Forest Classifier [14] for learning. The color moments can be expressed as:

$$\mu_{i} = \frac{1}{N} \sum_{j=1}^{N} p_{i,j} \qquad (4)$$

$$\sigma_{i} = \left(\frac{1}{N} \sum_{j=1}^{N} (p_{i,j} - \mu_{i})^{2}\right)^{\frac{1}{2}} \qquad (5)$$

$$s_{i} = \left(\frac{1}{N} \sum_{j=1}^{N} |p_{i,j} - \mu_{i}|^{3}\right)^{\frac{1}{3}} \qquad (6)$$

where $p_{i,j}$ denotes the value of the j-th pixel in the i-th color channel, and N denotes the total number of pixels. The color feature is $F_{color} = [\mu_{U}, \sigma_{U}, s_{U}, \mu_{V}, \sigma_{V}, s_{V}, \mu_{Y}, \sigma_{Y}, s_{Y}]$, where U, V, and Y denote the channels of the image.
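As an illustration of Equations (4)–(6) restricted to the segmented face, the following NumPy sketch computes the nine-dimensional feature from an image and a binary mask. The function name and the boolean-mask interface are hypothetical, and the image is assumed to be already converted to the desired color space with channels stored in the order in which the features should appear.

```python
import numpy as np


def masked_color_moments(image, mask):
    """First three color moments per channel over masked pixels only.

    image: (H, W, 3) array in the chosen color space.
    mask:  (H, W) boolean array, True on face pixels (assumed non-empty).
    Returns [mean, std, skew] for each channel, concatenated in channel order.
    """
    features = []
    for c in range(image.shape[-1]):
        p = image[..., c][mask].astype(np.float64)   # face pixels of channel c
        mu = p.mean()                                # Equation (4)
        sigma = np.sqrt(np.mean((p - mu) ** 2))      # Equation (5)
        s = np.cbrt(np.mean(np.abs(p - mu) ** 3))    # Equation (6)
        features.extend([mu, sigma, s])
    return np.array(features)


# Toy example: only the top row is "face"; the bright bottom row is ignored.
toy = np.zeros((2, 2, 3))
toy[0, 1] = 2.0
toy[1, :] = 100.0
face = np.array([[True, True], [False, False]])
print(masked_color_moments(toy, face))
```

Restricting the statistics to the mask is what keeps background pixels, such as the bright bottom row in the toy example, from shifting the moments.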

Algorithm 1: Segmentation-based inference algorithm for smoothed facial region extraction

4. Experiments

4.1. Implementation Details

Our experiments are conducted using the Keras framework with a TensorFlow backend. Mini-batch stochastic gradient descent (SGD) is employed as the optimizer with a momentum of 0.98, a weight decay of $2 \times 10^{-5}$, and a batch size of 64. We adopt the widely used “poly” learning rate policy, in which the initial rate is multiplied by $\left(1 - \frac{iter}{total\_iter}\right)^{power}$ with power 0.9, and the initial learning rate is set to $2.5 \times 10^{-3}$. Data augmentation includes normalization, random rotation $\theta_{rotation} \in [-20, 20]$, random scale $\theta_{scale} \in [-20, 20]$, random horizontal flip and random shift $\theta_{shift} \in [-10, 10]$. For fair comparison, all the methods are run on a server equipped with a single NVIDIA GeForce GTX1080 Ti GPU. Code is available at: https://github.com/JACKYLUO1991/Face-skin-hair-segmentaiton-and-skin-color-evaluation.
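The “poly” policy amounts to a one-line schedule. The sketch below uses the initial rate and power stated above; the function name and the total of 1000 iterations in the example are hypothetical.

```python
def poly_lr(iteration: int, total_iters: int, base_lr: float = 2.5e-3, power: float = 0.9) -> float:
    """Learning rate under the poly policy: base_lr * (1 - iter / total_iter) ** power."""
    return base_lr * (1.0 - iteration / total_iters) ** power


# Hypothetical run of 1000 iterations: the rate decays smoothly from base_lr to 0.
for it in (0, 250, 500, 750, 1000):
    print(it, poly_lr(it, 1000))
```

With power below 1 the decay is slightly slower than linear for most of training and steepens toward the end.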

4.2. Datasets

Data is the soul of deep learning, because it determines the upper limit of an algorithm to some extent. To ensure the robustness of the algorithm, it is necessary to construct a dataset with human faces in extreme situations such as large angles, strong occlusions, and complex lighting changes.

4.2.1. Face and Hair Segmentation Datasets

Labeled Faces in the Wild (LFW). The LFW dataset [39] consists of more than 13,000 images collected from the Internet. We use its extended version (Part Labels) during the experiment, which was automatically labeled via a super-pixel segmentation algorithm. We adopt the same data split as [4]: 1500 images for training, 500 for validation, and 927 for testing.

Large-scale CelebFaces Attributes dataset (CelebA). The CelebA dataset [40] consists of more than 200 k celebrity images, each with multiple attributes. The main advantage of this dataset is that it combines large pose variations and background clutter, so the knowledge learned from it transfers more easily to the demands of actual products. In the experiment, we adopt the CelebHair version (http://www.cs.ubbcluj.ro/~dadi/face-hair-segm-database.html) of CelebA from [4], which includes 3556 images. We use the same configuration as the original paper, i.e., 20% for validation.

Figaro1k. For the last dataset, we employ Figaro1k [41], which is dedicated to hair segmentation. Note that the dataset was developed for general hair detection and many of its images do not include faces, which is not conducive to the subsequent experiments. We therefore follow the pre-processing in [7], leaving 171 images for experiments. To better take advantage of batch training, offline data augmentation is adopted to expand the images (×10).

4.2.2. Manually Annotated Dataset

An outstanding contribution of this work is a manually labeled facial skin tone rating dataset. In the process of labeling, three professionally trained makeup artists rated the facial skin tone using a voting mechanism. If all three annotators gave different ratings, the label was decided by a makeup artist with 5 or more years of experience. Our face data is collected from the web without conflicts of interest. The collected images are filtered by an off-the-shelf face detection library (i.e., MTCNN [42]) to remove images without detected faces, and the remaining ones are used for feature extraction and further machine learning. The numbers of samples in the five categories are 95, 95, 96, 93 and 94; samples are shown in Figure 3. Their statistical distributions are plotted in Figure 4.

4.3. Evaluation Metrics

All segmentation experiments are evaluated with the mean intersection-over-union (mIoU) criterion. The definition of mIoU is as follows:

$$mIoU = \frac{1}{k + 1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \qquad (7)$$

where k + 1 is the number of classes (including background), and $p_{ij}$ indicates the number of pixels that belong to category i but have been predicted as category j, so that $p_{ii}$ counts the correctly classified pixels of category i. For more metrics, please refer to [8].
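Equation (7) can be evaluated directly from a confusion matrix. The sketch below is illustrative: the function name is hypothetical, and it assumes that every class occurs at least once in the ground truth or the prediction, so that no denominator is zero.

```python
import numpy as np


def mean_iou(confusion):
    """mIoU from a (k+1) x (k+1) confusion matrix.

    confusion[i, j] is the number of pixels of true class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=np.float64)
    true_positive = np.diag(confusion)                                    # p_ii
    union = confusion.sum(axis=1) + confusion.sum(axis=0) - true_positive
    return float(np.mean(true_positive / union))


# Two-class toy example: IoU is 3/5 for class 0 and 5/7 for class 1.
print(mean_iou([[3, 1], [1, 5]]))
```

The row sum counts all pixels that truly belong to a class and the column sum counts all pixels predicted as that class; subtracting the diagonal once removes the double-counted intersection.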

5. Results and Discussion

5.1. Segmentation Results

In this section, we carry on the experiments to demonstrate the potential of our segmentation architecture in terms of accuracy and efficiency trade-off.

5.1.1. Overall Comparison

We use the four metrics introduced by FCN [8] to evaluate the performance of our algorithm. Subsequently, comparative experiments with the outstanding UNet variant [4] are carried out across different datasets. Unless otherwise stated, the input resolution is 224 × 224. Training continues for 200 epochs, after which the model saturates. Table 2 reports the quantitative results.

Experimental results show that our HLNet outperforms the trimmed U-Net (tU-Net) [4] by a large margin, except on the LFW dataset. One drawback of fast down-sampling is that feature extraction in the shallow layers is not sufficient. Since shallow features contribute to extracting texture and edge details, our HLNet is slightly worse than tU-Net on the LFW dataset (facial details in LFW are blurrier than in the other datasets).

From another perspective, considering latency, we reach 60 ms per image on an Intel Core i5-7500U CPU without any tricks, and no more than 10 ms on a GPU. Comparing tU-Net with HLNet (8 ms vs. 7.2 ± 0.3 ms) shows that the latter is more efficient while also performing better. This suggests that the framework can be further applied to edge and embedded devices with small memory and battery budgets. The qualitative results are shown in Figure 5. Post-processing employs the Guided Filter to achieve more realistic edge results.

5.1.2. Comparison with SOTA Lightweight Networks

In this subsection, we compare our algorithm with several state-of-the-art (SOTA) lightweight networks, including ENet [19], LEDNet [24], Fast-SCNN [22], MobileNet [21] and DFANet [23], on the CelebHair test set. For fair comparison, we re-implement the above networks under the same hardware configuration without any fine-tuning or fancy tuning techniques. It should be noted that our framework implementations differ slightly from the originals, so the results may be slightly different, but the overall performance deviation is within an acceptable range. Since ENet has a downsampling rate of 32, we resize all inputs to 256 × 256. In addition, we measure frames per second (FPS) in our CPU environment without any other running loads, averaged over 200 forward propagations.
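A timing harness of the kind described above might look like the following sketch. The function name is hypothetical, the warm-up passes are an added assumption (the text only specifies averaging over 200 forward propagations), and `forward` stands for any zero-argument callable that runs one inference.

```python
import time


def measure_fps(forward, n_runs: int = 200, n_warmup: int = 10) -> float:
    """Average frames per second of `forward` over n_runs calls."""
    for _ in range(n_warmup):          # warm-up calls are not timed (assumption)
        forward()
    start = time.perf_counter()
    for _ in range(n_runs):
        forward()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed


# Stand-in workload; in practice `forward` would wrap one model prediction.
fps = measure_fps(lambda: sum(range(10_000)), n_runs=200, n_warmup=10)
print(f"{fps:.1f} FPS")
```

Averaging over many passes smooths out scheduler noise, and excluding warm-up passes keeps one-time costs such as graph construction out of the measurement.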

From Figure 6 and Table 3, we can observe that our proposed method is more accurate than the other methods. Compared with the second-best ENet, our method improves mIoU by 0.35%, while the FPS is higher by half. Although DFANet has 2× fewer parameters and 11× fewer FLOPs than our HLNet, it delivers poor segmentation accuracy of 7.44% in terms of mIoU. We conjecture that this is due to DFANet’s overdependence on pre-trained lightweight backbones. From Figure 6c, it can be clearly observed that DFANet seriously misclassifies pixels. MobileNet’s situation is consistent with DFANet. In particular, our HLNet is 3.18% higher than Fast-SCNN in terms of accuracy, and the parameters are reduced by 0.4 M. Excessive depthwise separable convolutions affect the performance of Fast-SCNN: even though they reduce latency and computational complexity (FLOPs), they lead to insufficient generalization capability. Compare the last row of Figure 6g,h, which contains a second person (the one in the back) even though the ground truth does not mark that person. Benefiting from the rich context captured by the DilatedGroup, our method can roughly segment it. Moreover, compared with other methods, with the help of the introduced InteractionModule, HLNet has an advantage in the detail processing of multi-scale objects (e.g., the hairline). A more intuitive comparison of the different methods is shown in Figure 7. Overall, the experiments demonstrate that our HLNet achieves the best trade-off between accuracy and efficiency.

5.1.3. Ablation Study

We further conduct ablation experiments on the Figaro1k test set, following the same training strategy for fairness. We mainly evaluate the impact of the InteractionModule (IM) and DilatedGroup (DG) components on the results, as illustrated in Figure 8. For the baseline, the corresponding components are replaced by an IM without information exchange (connected using Upsampling and Concat) and by a 3 × 3 convolution with a dilation rate of 1.
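To make the difference between the rate-1 baseline and the DilatedGroup rates concrete, the effective kernel size of a dilated convolution can be computed with the standard formula k + (k − 1)(d − 1). The sketch below is a small illustration; the function name is hypothetical.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Rate 1 is the baseline; rates 2, 4 and 8 are those of the DilatedGroup.
for rate in (1, 2, 4, 8):
    print(rate, effective_kernel_size(3, rate))
# 1 3
# 2 5
# 4 9
# 8 17
```

A single 3 × 3 kernel thus spans up to 17 × 17 positions at rate 8 while keeping nine weights, which is how the DilatedGroup enlarges the receptive field without adding parameters.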

On the one hand, the IM module can capture multi-resolution patterns. On the other hand, the DG module fuses multi-scale features while increasing the receptive field. When we add the DG and IM modules separately, mIoU increases by 1.54% and 3.19%, respectively, relative to the baseline. When we apply the two modules at the same time, mIoU increases dramatically by 4.26%. These clear performance gains reflect the effectiveness of our proposed modules.

5.2. Facial Skin Tone Classification Results

In the second phase of the experiment, we construct comparative studies to compare the influence of different color spaces and different experimental protocols over the results.

As shown in Table 4, we report the accuracy of facial skin color classification. The best results are obtained using the YCrCb color space with the color moment backend, with an accuracy of 80%. It should be noted that before the data is fed into the classifier, it first needs to be oversampled to ensure that the number of samples is consistent across the different categories. We simply split the dataset 8:2 for training and testing, and then use the Random Forest Classifier for training. Figure 9 provides the confusion matrix for this configuration. As can be seen from it, the main errors are between adjacent categories, a situation that also troubles a trained professional makeup artist when labeling data.
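The class-balancing step can be sketched as simple random oversampling in NumPy. This is an illustration under assumptions: the paper does not state which oversampling method was used, the function name is hypothetical, and the random feature matrix only stands in for the nine-dimensional color moment features. The class counts are those reported in Section 4.2.2.

```python
import numpy as np


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)
        extra = rng.choice(idx, size=target - count, replace=True)  # resampled duplicates
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return features[keep], labels[keep]


# Class sizes from the manually annotated dataset; features are random placeholders.
y = np.repeat(np.arange(5), [95, 95, 96, 93, 94])
X = np.random.default_rng(1).random((y.size, 9))
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))  # [96 96 96 96 96]
```

Applying the oversampling only to the training portion after the 8:2 split would avoid duplicated samples appearing in both sets; whether the original experiment did so is not stated.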

The shortcoming of the experiment is the paucity of data. There are reasons to believe that with sufficient data, the accuracy will be further improved.

6. Conclusions

In this paper, we propose a fully convolutional network that leverages lightweight components such as the InteractionModule, depthwise separable convolution and the DilatedGroup to solve the real-time semantic segmentation problem while balancing speed and performance. We further apply it to hair and skin segmentation tasks, and extensive experiments confirm the effectiveness of the proposed method. Moreover, based on the segmented skin regions, we introduce color moments to extract color features and then classify the skin tones. A classification accuracy of 80% demonstrates the effectiveness of the proposed solution.

The aim of this work is to apply our algorithms to real-time coloring, face swapping, skin tone rating systems, and skin care product recommendations based on skin tone ratings in real-life scenarios. In our future work, we will investigate semi-supervised methods to address the lack of data volume.

Author Contributions

Conceptualization, X.F.; methodology, X.F. and L.L.; software, X.F. and L.L.; validation, X.F. and L.L.; formal analysis, X.F.; investigation, X.F.; resources, X.G.; data curation, L.L.; writing—original draft preparation, X.F.; writing—review and editing, X.G.; visualization, L.L.; supervision, X.G.; project administration, X.G.; funding acquisition, X.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant numbers 61573087 and 61573088.

Acknowledgments

This work was done while L.L. was an intern at Meidaojia Research, Beijing, China. We thank them for their support of this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Rousset, C.; Coulon, P.Y. Frequential and color analysis for hair mask segmentation. In Proceedings of the 2008 15th IEEE International Conference on Image Processing, San Diego, CA, USA, 12–15 October 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 2276–2279.
2. Shen, Y.; Peng, Z.; Zhang, Y. Image based hair segmentation algorithm for the application of automatic facial caricature synthesis. Sci. World J. 2014, 2014, 748634.
3. Abbas, Q.; Garcia, I.F.; Emre Celebi, M.; Ahmad, W. A feature-preserving hair removal algorithm for dermoscopy images. Skin Res. Technol. 2013, 19, e27–e36.
4. Borza, D.; Ileni, T.; Darabant, A. A deep learning approach to hair segmentation and color extraction from facial images. In Proceedings of the International Conference on Advanced Concepts for Intelligent Vision Systems, Poitiers, France, 24–27 September 2018; pp. 438–449.
5. Wen, S.; Dong, M.; Yang, Y.; Zhou, P.; Huang, T.; Chen, Y. End-to-end detection-segmentation system for face labeling. IEEE Trans. Emerg. Top. Comput. Intell. 2019.
6. Luo, L.; Xue, D.; Feng, X. EHANet: An Effective Hierarchical Aggregation Network for Face Parsing. Appl. Sci. 2020, 10, 3135.
7. Muhammad, U.R.; Svanera, M.; Leonardi, R.; Benini, S. Hair detection, segmentation, and hairstyle classification in the wild. Image Vis. Comput. 2018, 71, 25–37.
8. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651.
9. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241.
10. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. 2015. Available online: https://arxiv.org/abs/1511.07122 (accessed on 30 April 2016).
11. Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.; Su, Z.; Du, D.; Huang, C.; Torr, P.H. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1529–1537.
12. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 325–341.
13. Stricker, M.A.; Orengo, M. Similarity of color images. In Proceedings of the International Society for Optics and Photonics, San Jose, CA, USA, 23 March 1995; pp. 381–392.
14. Pal, M. Random forest classifier for remote sensing classification. Int. J. Remote Sens. 2005, 26, 217–222.
15. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. 2017. Available online: https://arxiv.xilesou.top/abs/1706.05587 (accessed on 5 December 2017).
16. Jégou, S.; Drozdzal, M.; Vazquez, D.; Romero, A.; Bengio, Y. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–27 July 2017; pp. 11–19.
17. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
18. Lin, G.; Milan, A.; Shen, C.; Reid, I. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1925–1934.
19. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. Enet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. 2016. Available online: https://arxiv.xilesou.top/abs/1606.02147 (accessed on 7 June 2016).
20. Zhao, H.; Qi, X.; Shen, X.; Shi, J.; Jia, J. Icnet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 405–420.
21. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications. 2017. Available online: https://arxiv.xilesou.top/abs/1704.04861 (accessed on 17 April 2017).
22. Poudel, R.P.; Liwicki, S.; Cipolla, R. Fast-SCNN: Fast Semantic Segmentation Network. 2019. Available online: https://arxiv.xilesou.top/abs/1902.04502 (accessed on 12 February 2019).
23. Li, H.; Xiong, P.; Fan, H.; Sun, J. Dfanet: Deep feature aggregation for real-time semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9522–9531.
24. Wang, Y.; Zhou, Q.; Liu, J.; Xiong, J.; Gao, G.; Wu, X.; Latecki, L.J. LEDNet: A Lightweight Encoder-Decoder Network for Real-Time Semantic Segmentation. 2019. Available online: https://arxiv.xilesou.top/abs/1905.02423 (accessed on 13 May 2019).
25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
26. Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 19–21 June 2018; pp. 7151–7160.
27. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3146–3154.
28. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected Crfs. 2014. Available online: https://arxiv.xilesou.top/abs/1412.7062 (accessed on 7 June 2016).
29. Levinshtein, A.; Chang, C.; Phung, E.; Kezele, I.; Guo, W.; Aarabi, P. Real-time deep hair matting on mobile devices. In Proceedings of the Conference on Computer and Robot Vision, Toronto, ON, Canada, 8–10 May 2018; pp. 1–7.
30. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. 2019. Available online: https://arxiv.xilesou.top/abs/1902.09212 (accessed on 25 February 2019).
31. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 19–21 June 2018; pp. 4510–4520.
32. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
33. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. 2015. Available online: https://arxiv.xilesou.top/abs/1502.03167 (accessed on 2 March 2015).
34. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. 2019. Available online: https://arxiv.xilesou.top/abs/1904.01355 (accessed on 20 August 2019).
35. Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Cardoso, M.J. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Québec City, QC, Canada, 14 September 2017; pp. 240–248.
36. He, K.; Sun, J.; Tang, X. Guided image filtering. In Proceedings of the European Conference on Computer Vision, Crete, Greece, 5–11 September 2010; pp. 1–14.
37. He, K.; Sun, J. Fast Guided Filter. 2015. Available online: https://arxiv.xilesou.top/abs/1505.00996 (accessed on 5 May 2015).
38. Levin, A.; Lischinski, D.; Weiss, Y. A closed-form solution to natural image matting. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 30, 228–242.
39. Kae, A.; Sohn, K.; Lee, H.; Learned-Miller, E. Augmenting CRFs with Boltzmann machine shape priors for image labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2019–2026.
40. Yang, S.; Luo, P.; Loy, C.C.; Tang, X. From facial parts responses to face detection: A deep learning approach. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 3676–3684.
41. Svanera, M.; Muhammad, U.R.; Leonardi, R.; Benini, S. Figaro, hair detection and segmentation in the wild. In Proceedings of the IEEE International Conference on Image Processing, Phoenix, AZ, USA, 25–28 September 2016; pp. 933–937.
42. Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503.

Figure 1. Automatic hair dyeing exemplar. (a) Input RGB image. (b) The guided filter output of our proposed algorithm. (c) Final dyed rendering.

Figure 2. An overview of our asymmetric encoder-decoder network. Blue, red and green represent the background, the recolored hair mask and the recolored face mask, respectively. In the dotted rectangle (the InteractionModule), arrows in different directions represent different operations. “+” and “C” represent the Add and Concatenate (abbreviated as Concat) operations, respectively.

Figure 3. Samples of manually annotated facial skin tone levels, with labels decided by a voting mechanism. (a–e) correspond to porcelain white, ivory white, medium, yellowish and black, respectively. To keep the classification criteria consistent, only images of Chinese actresses are used here. The images were crawled from the web, with no conflict of interest.

Figure 4. Pie chart visualization. (a) Viewpoint graph. According to the yaw angle $\theta$, samples are divided into small ($|\theta| < 15^{\circ}$), moderate ($15^{\circ} \leq |\theta| \leq 45^{\circ}$) and large ($|\theta| > 45^{\circ}$). (b) Occlusion graph, including ‘mild occlusion’ (<20%), ‘severe occlusion’ (>50%) and ‘moderate occlusion’ between the two.

Figure 5. Samples of hair and face segmentations on different datasets.

Figure 6. Qualitative comparison with other SOTA methods. (a–h) show the input images, the ground truth, and the segmentation outputs of DFANet [23], ENet [19], MobileNet [21], LEDNet [24], Fast-SCNN [22] and our HLNet, respectively. The difficulty of segmentation increases from top to bottom.

Figure 7. Results for running speed vs. performance of different methods.

Figure 8. Ablation Experiments on the Figaro1k test set.

Figure 9. Multi-classification confusion matrix.

Table 1. HLNet consists of an asymmetric encoder and decoder. The network is mainly composed of standard convolution (Conv2D), depthwise separable convolution (DwConv2D), inverted residual bottleneck blocks, a bilinear upsampling (UpSample2D) module and several custom modules.

Stage	Type	Output Size
Encoder	-	$224 \times 224 \times 3$
	Conv2D	$112 \times 112 \times 32$
	DwConv2D	$56 \times 56 \times 64$
	DwConv2D	$28 \times 28 \times 64$
	InteractionModule	$28 \times 28 \times 128$
	FFM	$28 \times 28 \times 64$
	DilatedGroup	$28 \times 28 \times 32$
Decoder	UpSample2D	$224 \times 224 \times 32$
	Conv2D	$224 \times 224 \times 3$
	SoftMax	$224 \times 224 \times 3$
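
As a sanity check on the output sizes in Table 1, the short sketch below traces the feature-map shape stage by stage. The downsampling and upsampling factors are assumptions inferred from the listed sizes (224 → 112 → 56 → 28 → 224); they are not stated in the table. The names `STAGES` and `trace_shapes` are illustrative only.

```python
# Shape trace for the stages listed in Table 1.
# Each entry: (stage type, downsample factor, upsample factor, output channels).
# The factors are assumptions inferred from the output sizes in the table.
STAGES = [
    ("Conv2D", 2, 1, 32),
    ("DwConv2D", 2, 1, 64),
    ("DwConv2D", 2, 1, 64),
    ("InteractionModule", 1, 1, 128),
    ("FFM", 1, 1, 64),
    ("DilatedGroup", 1, 1, 32),
    ("UpSample2D", 1, 8, 32),
    ("Conv2D", 1, 1, 3),
    ("SoftMax", 1, 1, 3),
]


def trace_shapes(height, width, stages=STAGES):
    """Return (stage type, height, width, channels) after every stage."""
    shapes = []
    for name, down, up, channels in stages:
        height = height * up // down
        width = width * up // down
        shapes.append((name, height, width, channels))
    return shapes


for name, h, w, c in trace_shapes(224, 224):
    print(f"{name:<18}{h} x {w} x {c}")
```

With a 224 × 224 input, the encoder works at 1/8 resolution (28 × 28) from the third stage onward, and the decoder restores the full resolution in a single upsampling step.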

Table 2. Segmentation performance on the LFW, CelebHair and Figaro1k test sets. “OC” denotes the number of output channels. All values are in %. The best value in each comparison is highlighted in bold.

Metric	LFW (OC = 3), U-Net	LFW (OC = 3), HLNet	CelebHair (OC = 3), U-Net	CelebHair (OC = 3), HLNet	Figaro1k (OC = 2), U-Net	Figaro1k (OC = 2), HLNet
mIoU	83.46	83.81	88.56	89.55	77.75	78.39
fwIoU	92.75	90.28	91.79	91.98	83.01	83.12
pixelAcc	95.83	94.69	95.54	96.08	90.28	90.73
mPixelAcc	88.84	90.35	93.61	94.49	84.72	84.93
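
The four metrics in Table 2 can all be derived from a single confusion matrix. The sketch below assumes the standard definitions used in the semantic segmentation literature; Section 4.3 remains the authoritative source for the exact formulas. The function name `segmentation_metrics` and the example matrix are hypothetical.

```python
def segmentation_metrics(conf):
    """Compute pixelAcc, mPixelAcc, mIoU and fwIoU from a confusion matrix.

    conf[i][j] is the number of pixels whose true class is i and whose
    predicted class is j. Assumes every class occurs in the ground truth.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    tp = [conf[i][i] for i in range(n)]                            # correct pixels per class
    gt = [sum(conf[i]) for i in range(n)]                          # ground-truth pixels per class
    pred = [sum(conf[i][j] for i in range(n)) for j in range(n)]   # predicted pixels per class
    iou = [tp[i] / (gt[i] + pred[i] - tp[i]) for i in range(n)]
    return {
        "pixelAcc": sum(tp) / total,
        "mPixelAcc": sum(tp[i] / gt[i] for i in range(n)) / n,
        "mIoU": sum(iou) / n,
        "fwIoU": sum(gt[i] * iou[i] for i in range(n)) / total,
    }


# Hypothetical 2-class example (e.g., background vs. hair), counts in pixels.
example = [[50, 10],
           [5, 35]]
for name, value in segmentation_metrics(example).items():
    print(f"{name}: {100 * value:.2f}%")
```

Because fwIoU weights each class by its pixel frequency, it is dominated by large regions such as the background, whereas mIoU treats all classes equally.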

Table 3. Comparison with SOTA approaches on the CelebHair test set in terms of segmentation accuracy and execution efficiency. “†” indicates fine-tuning from LFW. The 0.5 in MobileNet(0.5) denotes the contraction factor. “#Param” denotes the number of model parameters. Bold means better.

Model	#Param (M)	FPS	FLOPs (G)	mIoU (%)
ENet [19]	0.36	8.24	0.94	89.97
LEDNet [24]	2.3	6.44	3.28	88.63
Fast-SCNN [22]	1.6	20.35	0.41	87.14
MobileNet(0.5) + UNet [21]	0.37	5.80	0.75	86.08
DFANet [23]	0.42	17.72	0.08	82.88
HLNet (ours)	1.2	12.23	0.94	90.32
HLNet (ours) †	1.2	12.23	0.94	90.98

Table 4. Classification accuracy of different methods in different color spaces. PCA stands for principal component analysis. Bold means better.

Method	RGB	HSV	YCrCb
Histogram (8 bins)	75%	78%	73%
Histogram with PCA (256 bins)	77%	-	-
Color Moment	73%	77%	80%
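
To illustrate the “Color Moment” row of Table 4, the sketch below computes masked color moments in plain Python. It assumes the common formulation after Stricker and Orengo [13]: the mean, standard deviation and skewness (signed cube root of the third central moment) of each channel, taken only over pixels inside the face mask. The function name `masked_color_moments`, the flat pixel layout and the sample values are hypothetical, and the exact feature layout used in the paper may differ.

```python
import math


def masked_color_moments(pixels, mask):
    """Return [mean, std, skewness] per channel for the masked pixels.

    pixels: sequence of (c1, c2, c3) tuples, e.g. in YCrCb order.
    mask:   sequence of booleans of the same length; True marks face pixels.
    """
    selected = [p for p, keep in zip(pixels, mask) if keep]
    if not selected:
        raise ValueError("mask selects no pixels")
    n = len(selected)
    features = []
    for channel in zip(*selected):  # iterate over the three channels
        mean = sum(channel) / n
        second = sum((v - mean) ** 2 for v in channel) / n
        third = sum((v - mean) ** 3 for v in channel) / n
        std = math.sqrt(second)
        # Signed cube root keeps the direction of the asymmetry.
        skewness = math.copysign(abs(third) ** (1 / 3), third)
        features.extend([mean, std, skewness])
    return features


# Hypothetical data: three face pixels and one background pixel that is masked out.
pixels = [(10, 20, 30), (20, 20, 60), (30, 20, 90), (255, 255, 255)]
mask = [True, True, True, False]
print(masked_color_moments(pixels, mask))
```

The resulting 9-dimensional vector is the kind of compact feature that can be fed to a classifier such as a random forest [14]; restricting the statistics to masked pixels keeps background colors from influencing the skin tone estimate.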

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
