1. Introduction
Due to the natural process of technological evolution, demands for digital recognition based on digital image processing and computer vision appear ever more frequently, such as person and movement identification and human skin recognition, among others. Of these, skin segmentation is an important step in enabling several computer vision-based applications, such as facial expression recognition [1], detection of nudity [2,3] or child pornography [4], body motion tracking, gesture recognition [1], and skin disease diagnostics, among other human–computer interaction (HCI) applications [5].
Methods commonly used in the skin segmentation problem are, in general, situation- or application-specific [
6,
7]. In the case of face recognition, the methods segment skin and non-skin pixels based on face detection [
8], while in applications of medical interest (e.g., the abdominal region), they tend to solve the problem by assuming the presence of elements corresponding to the abdomen in the examined image [
9].
In these situations, the entire examination takes place in a controlled environment—the input image has previously known characteristics in terms of lighting conditions, capture equipment, and objects present. Several authors have suggested approaches that seek to segment skin pixels optimally for each particular situation [
10,
11]. For example, in an application related to biometrics, the response time for the user is fundamental, together with an acceptable accuracy for the expected result of this application [
12]. In contrast, in an application with medical purposes, processing speed tends to matter less, while accuracy becomes fundamental, since the analysis primarily aims to maximize the quality of the result rather than minimize prediction time. Therefore, considering the application and its requirements, one approach may be more suitable than another for segmenting skin and non-skin pixels.
Several problems arise—depending on the approach—when the input images follow no given pattern, either because of the difficulty of separating background elements from the objects of interest or because of image quality. The quality is affected by several factors, such as the image capture equipment, the illumination conditions during capture, and the sampling process during digitization [
6]. Therefore, this work aims to contribute to improving the image examination process in the forensic field, in which the input images may contain any of the above-mentioned aspects.
Furthermore—considering the possibility of sharing forensic examination target material through internet usage applications—these images suffer from data compression, which causes the loss of relevant image information, reducing its quality [
13]. During compression, quantization and sampling eliminate similar information among image pixels, which severely affects skin regions: the subtle pixel variations that form the texture of the skin end up being discarded by the compression process [
13]. Therefore, applications with a controlled environment, considering images with previously known quality standards, allow approaches and techniques based on skin texture recognition to present the expected results [
7,
14]. However, these methods do not deliver the same results when examining images that have the skin texture information corrupted by the sampling and quantization process or the hard compression processing performed by sharing applications.
Figure 1 presents the change in an image after it was shared through a message exchange application. In this example, part of the pixel information was lost; the input image was modified by the compression process. This loss degrades the result mainly in applications based on images with subtle textures or small skin regions. Compression methods—such as those based on the discrete cosine transform (DCT)—eliminate the coefficients with relatively small amplitudes, causing only minimal visible deformation of the image, yet they still incur a loss of image information, mainly related to texture [15]. Therefore, texture analysis is a non-viable method for forensic applications [
16].
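To make the effect concrete, the following minimal sketch (a toy illustration, not any specific codec) applies a 2D DCT to a synthetic 8 × 8 "skin patch", zeroes the coefficients below an arbitrary amplitude threshold, and reconstructs the block; the low-amplitude texture is exactly what disappears:

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(block):
    # 2D type-II DCT with orthonormal scaling
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(coeffs):
    # 2D inverse DCT, the inverse of dct2
    return idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho')

rng = np.random.default_rng(0)
# Synthetic 8x8 "skin patch": a smooth base tone plus faint texture
block = 180 + 3 * rng.standard_normal((8, 8))

coeffs = dct2(block)
# Lossy step: zero every coefficient below an illustrative threshold
quantized = np.where(np.abs(coeffs) < 2.0, 0.0, coeffs)
restored = idct2(quantized)

print('discarded coefficients:', int((quantized == 0).sum()), 'of 64')
print('max texture error:', float(np.abs(block - restored).max()))
```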
Therefore, this work brings a new approach to human skin segmentation in digital images. The contributions of this work are as follows:
- (1)
A segmentation model with the ability to classify and quantify the pixels of human skin present in an image according to the regions of the human body;
- (2)
The proposed model can segment and quantify the pixels of human skin per instance of the object of interest in the image (i.e., per person);
- (3)
The generation of a new dataset compiled through the collection of datasets present in the literature, thus generating a heterogeneous dataset;
- (4)
The proposed model allows image analysis, especially in forensic applications, to focus on regions of interest to the human body, reducing the possibility of false positives in its examination.
The new dataset is motivated by the existing challenges of skin pixel segmentation: lighting conditions, ethnic differences, image quality, capture equipment, and the varying exposure of regions containing skin.
In this work, we propose a solution for skin and non-skin pixel detection in digital images, even for images compromised by events related to capture and storage. In the following sections, we review the background on computer vision algorithms (
Section 2). In
Section 3, we describe the skin detection process. We present and discuss our results in
Section 4. Lastly, we conclude this work (
Section 5).
3. Skin Detection
3.1. Segmentation
Most approaches on the subject of skin segmentation are based on a particular color space [
2,
4,
21,
23] or on fusion components from different color spaces [
3,
7,
24], but they are evidently limited to color images only. In general, skin color occupies a limited range of values in different color spaces [
25], and detection consists of determining an optimal model for the thresholds belonging to the skin and non-skin groups [
7]. In addition, some studies, such as [
24,
25,
26], have tried to analyze other features, such as segmenting regions by texture type. However, this approach requires images with previously known characteristics, because the descriptors corresponding to the textures must be locatable, and hence the textures must be preserved.
A skin detection process mainly needs two decisions, as presented in
Figure 5. The first choice is to determine the appropriate color space in which it is easiest to discriminate skin pixels, which is a key factor in the design of a skin detector, considering the computational cost of transforming between color spaces [
27]. The second decision is to determine the skin segmentation model used to classify a given pixel as either skin or non-skin. The system must be intelligent and flexible enough to handle various aspects, such as different skin tones or lighting conditions, reducing the false negative rate (i.e., misclassifying a skin pixel as non-skin) [
8]. In addition, it should be sensitive enough to differences between classes—background objects with skin color—to reduce false positive rates (i.e., identifying a non-skin pixel as a skin pixel) [
25].
These choices aim to overcome the main challenges of skin recognition: the definition of an optimal model of skin and non-skin pixels for different skin types [
26] and diverse lighting conditions [
6,
8,
10]. Furthermore, the accuracy of skin recognition is affected by pixels of skin-like colors in the image background or clothing. These pixels generate false positives and hinder accurate skin recognition. This problem is emphasized when the recognition process is applied to small regions containing skin, such as a person’s hands, since the treatment enforced to suppress false positives can also suppress small true regions of skin [
25].
An example is shown in
Figure 6, where there are areas of clothing pixels that are confused with the values of regions belonging to skin tones in the color space. The main question in the recognition of skin and non-skin pixels is to determine an optimal model for this classification, being neither too rigid in the threshold values for skin tones nor too flexible, which would increase false positives [
10,
20].
In general, the proposed approaches seek to balance recognition accuracy and the required computational power, since human–computer interaction applications need low computational costs coupled with high accuracy [
3,
4]. The most common strategies for defining skin models can be separated into three approaches: (1) explicitly defined regions [
2,
3], where the boundaries are fixed by a set of rules, (2) parametric methods [
8,
28] and non-parametric methods [
10,
24], which are based on histograms applied (or not) to Gaussian models and elliptical boundaries, and (3) machine learning-based approaches which, through the training of a suitable neural network model, enable the identification of skin and non-skin pixels [
29].
According to Gonzalez and Woods [
30], the fundamental problem of segmentation is the division of an image into regions that satisfy certain conditions set to achieve the end goal. In skin segmentation, the problem becomes more complex because the goal is not exactly the detection of an object; it involves the examination of image features, such as color regions of a given color space in color images and intensity levels in non-color images. In general, the process of segmenting skin and non-skin pixels can be represented by

$$G(x, y) = \begin{cases} \text{skin}, & \text{if } f(x, y) \geq \tau, \\ \text{non-skin}, & \text{otherwise}, \end{cases}$$

where the conclusion $G$ for the pixel at $(x, y)$ is a comparison between the decision boundary $\tau$ and the result of the function $f$, which represents the function applied by the approach taken to solve the problem.
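Read operationally, this formulation is a per-pixel score followed by a boundary test. A minimal vectorized sketch, where the scoring function and the boundary value are illustrative placeholders rather than any published model:

```python
import numpy as np

def segment(image, score_fn, tau):
    """Generic decision rule G: pixels whose score f(x, y) reaches the
    boundary tau are labeled skin (True), otherwise non-skin (False)."""
    scores = score_fn(image)   # f applied to the whole image, shape (H, W)
    return scores >= tau       # G, a boolean skin mask

def red_dominance(img):
    # Illustrative f: normalized red dominance, a crude skin cue
    img = img.astype(np.float32) + 1e-6
    return img[..., 0] / img.sum(axis=-1)

# Usage (rgb_image is an (H, W, 3) uint8 array):
# mask = segment(rgb_image, red_dominance, tau=0.4)
```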
3.2. Challenges
Considering the classification of pixels in an image as skin or non-skin as a typical image segmentation problem, the adopted approach must have the ability to overcome some difficulties (low illumination, camera characteristics, complex background, subject movement, etc.) [
5]. Depending on the expected result for the final application of the skin identification process, some challenges may substantially affect this result. The following are the main ones:
- (1)
Lighting conditions: The illumination applied at the moment of image capture directly impacts the result, especially in approaches that use color space analysis to determine if the pixel belongs (or not) to a certain region classified as skin. The lighting variation in the environment where the image was captured creates one of the most difficult issues for this operation. Computer vision must have the ability to adjust to this variation, recognizing the same object in different lighting conditions in the same way that human vision can.
- (2)
Ethnicity: Ethnic variation in skin tone complicates the classification process, because enlarging the modeled skin color region increases the occurrence of false positives. To overcome this difficulty, some approaches build a skin pixel model from a prior detection of the face in the image and, from the resulting map, classify the image pixels as skin or non-skin. Although this method can work around some cases of skin tone variation, it is not effective when there is no face in the image or when several people appear in the same image. Other factors that may make this technique unfeasible are the increased computational consumption and processing time.
- (3)
Background: The accuracy of skin detection, in the process of segmentation, is severely affected when the background is complex and contains textures and colors similar to the skin region in the color space. The increase in false positives makes skin detection in certain images impractical. Some authors consider this situation a probabilistic problem and propose the definition of a skin probability map for each pixel of an image. Although the result can significantly reduce the occurrence of false positives, there is an increase in the need for computational power, which can make the approach unfeasible for certain applications.
- (4)
Characteristics of the scanned image: Applications in which the scanned images have variable characteristics see their results affected. Images with multiple objects, small objects, or objects viewed from different perspectives increase the difficulty of the segmentation process, which must be able to detect skin without relying on features such as skin texture. In computer vision, the ability to detect an object is usually tied to a set of previously known characteristics of the object, such as size, position, and quantity, which restricts the result. In some final applications, however, the condition of the objects in the image is not known in advance, and the computer vision system needs the same capacity as human vision: obtaining the result regardless of the characteristics of the objects in the examined image.
3.3. Color Space
Skin detection can be considered a binary pixel classifier with two classes: skin and non-skin [
31]. The classification achievement depends on an appropriate feature set for capturing the essential elements of these pixels [
8]. Each application needs an appropriate color model (RGB, HSV, or YCbCr, among others) [10], with the accuracy, the computational cost of the transformation, and the separability of skin-colored pixels as the principal decision criteria [8]. Colorimetry, computer graphics, and video signal transmission standards have given rise to many color spaces with different properties. We have observed that skin colors differ more in luminance intensity than in chrominance due to lighting variation. Therefore, it is common to apply a linear or nonlinear transform to the RGB color space, mapping the input into a color space with independent components, discarding the luminance component during classification, and preserving the chrominance components [6]. A color space is made up of two components—chrominance and luminance [6]—and can be subdivided into four groups [31,32] (a short conversion sketch follows the list):
- (1)
Basic—RGB, CIE-XYZ: The RGB color model consists of three color channels, where
R is red,
G is green, and
B is blue. Each channel has a value ranging from 0 to 255. This color space was originally developed for old cathode-ray tube (CRT) monitors [32]. Because the model mixes the luminance and chromatic components, it is not always the most suitable color model for classifying pixels as skin and non-skin, as variation in illumination greatly affects the accuracy of the classifier. The Commission Internationale de l’Eclairage (CIE) system describes color as a luminance component Y and two additional components X and Z. According to Kakumanu et al. [6], the CIE-XYZ values were constructed from psychophysical experiments and correspond to the color matching characteristics of the human visual system.
- (2)
Orthogonal—YCbCr, YDbDr, YPbPr, YUV, YIQ: The common goal among these color models is to represent the components as independently as possible, reducing redundancy between their channels, unlike in basic models [
4,
32]. The luminance and chrominance components are explicitly separated, favoring their use in skin detection applications and, in this case, discarding the luminance component [
6].
- (3)
Perceptive—HSI, HSV, HSL, TSL: The RGB color space does not directly describe the perceptual features of color, such as hue (
H), saturation (
S), and intensity (
I), and many nonlinear transformations are proposed to map RGB to perceptual features. The HSV space defines the property of a color that varies in the passage from red to green as the hue, the property of a color that varies in the passage from red to pink as the saturation, and the property that varies in the passage from black to white as intensity, brightness, luminance, or value [
6]. HSV can be a very good choice for skin detection methods because the transformation from RGB to HSV is invariant to high intensities in white lights, ambient light, and surface orientations relative to the light source.
- (4)
Uniform—CIE-Lab, CIE-Luv: According to [6], perceptual uniformity describes how two colors differ in appearance to a human observer. However, perceptual uniformity in these color spaces is obtained at the expense of heavy computational transformations. In these color spaces, the luminance (L) and chromaticity (ab or uv) values are obtained by a nonlinear mapping of the CIE-XYZ coordinates.
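As a concrete illustration of these transforms, the sketch below (assuming OpenCV and a placeholder input path) converts an image into an orthogonal space (YCbCr) and a perceptive space (HSV) and keeps only the chrominance-related channels, as is commonly done before classification:

```python
import cv2
import numpy as np

bgr = cv2.imread('input.jpg')  # placeholder path; OpenCV loads as BGR

# Orthogonal space: Y (luminance) is separated from Cr/Cb (chrominance)
ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
cr, cb = ycrcb[..., 1], ycrcb[..., 2]  # keep chrominance, drop Y

# Perceptive space: H (hue) and S (saturation) carry the color itself,
# while V (value) carries the brightness we want to be invariant to
hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
h, s = hsv[..., 0], hsv[..., 1]        # keep hue and saturation, drop V

# Illumination-insensitive channels stacked as per-pixel features
features = np.stack([cr, cb, h, s], axis=-1)
print(features.shape)
```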
3.4. Methods
Our approach is based on four tasks and draws on the following families of methods:
- (1)
Explicit boundaries: In this method, a small region in a given color space is selected, where the pixels that belong to that region are considered skin. This approach is one of the simplest and most widely used, although it has many limitations regarding the challenges already discussed in this article. Known in the literature as
thresholding, it seeks to determine a threshold for considering pixels belonging to the skin group. Many approaches improve their results by applying conversion techniques from the RGB color space to other color spaces where one can work with the chrominance and luminance values in separate ways. As an example, Basilio [
2] converted the RGB color space to YCbCr and applied a fixed threshold interval on the chrominance channels, while Benzaoui [15] used explicit boundaries defined directly over the RGB color model (an illustrative sketch of this family of rules follows the list).
- (2)
Histogram based: A skin and non-skin color model is obtained by training with a training dataset, where skin and non-skin pixels are identified. After obtaining the global histogram based on this dataset, the color space is divided into two classes of pixels (skin and non-skin). This approach is widely used [
1,
21] as it shows better results for varied condition images and needs low computational power for its execution. Buza et al. [
24] used in their work a histogram-based approach as the basis for their hybrid implementation, which ultimately classified skin and non-skin pixels using a
k-means clustering algorithm.
- (3)
Neural networks: Neural networks play an important role in research related to skin segmentation, especially the multilayer perceptron (MLP) model. Given a dataset of skin and non-skin samples and the determination of the learning parameters, the network adjusts its synaptic weights according to the expected training result. After training, the network can classify skin and non-skin pixels in an analyzed image. For example, to overcome the problems related to ethnic differences and capture conditions such as lighting and object arrangement, the author of [
29] used in his work a neural network architecture composed of convolutional layers followed by a deep learning layer for classification of skin and non-skin pixels [
5].
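To make the first two strategies concrete, the sketch below implements an explicit chrominance-interval rule and a toy histogram-based classifier. The interval bounds, bin count, and threshold are illustrative placeholders, not the values used in the works cited above:

```python
import cv2
import numpy as np

def explicit_rule_mask(bgr, cb_range=(80, 135), cr_range=(135, 180)):
    """Strategy 1 (explicit boundaries): a fixed, illustrative
    Cb/Cr interval decides skin vs. non-skin."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    cr, cb = ycrcb[..., 1], ycrcb[..., 2]
    return ((cb >= cb_range[0]) & (cb <= cb_range[1]) &
            (cr >= cr_range[0]) & (cr <= cr_range[1]))

def fit_histogram_model(skin_px, nonskin_px, bins=32):
    """Strategy 2 (histogram based): per-bin skin probability learned
    from labeled (N, 2) arrays of [Cr, Cb] pixel samples."""
    rng = [[0, 256], [0, 256]]
    h_skin, _, _ = np.histogram2d(skin_px[:, 0], skin_px[:, 1],
                                  bins=bins, range=rng)
    h_non, _, _ = np.histogram2d(nonskin_px[:, 0], nonskin_px[:, 1],
                                 bins=bins, range=rng)
    return h_skin / (h_skin + h_non + 1e-9)  # P(skin | bin)

def histogram_mask(bgr, prob, threshold=0.5, bins=32):
    # Quantize each pixel's Cr/Cb into a bin and look up P(skin | bin)
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    cr = (ycrcb[..., 1].astype(int) * bins) // 256
    cb = (ycrcb[..., 2].astype(int) * bins) // 256
    return prob[cr, cb] >= threshold
```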
The final task of the proposed solution is the segmentation of the skin and non-skin pixels present in the examined image for each previously segmented body part. In this step, the objective is to determine the proportion of visible skin relative to the total number of pixels of the examined part, thus providing an accurate indicator of the skin content in the image.
Because it was necessary to determine a color space for this process, we chose the representation that best captures color hue, especially for a human observer. Therefore, the H and S components (hue and saturation) from the HSV model were combined with the a and b chromaticity components from the CIE-Lab model. In this way, we adapted the skin tone components to reduce the impact of false positives in images with complex backgrounds containing skin-like elements, while removing the components responsible for pixel brightness intensity.
We applied the U-Net neural network model to determine the skin pixel map. For network training, we assembled a heterogeneous dataset based on the mixture of several datasets collected during the literature review. This was necessary because each dataset available in the literature on skin segmentation is directed at its respective application: an image dataset for a face recognition application contains only face examples, other datasets contain only abdomen examples, and so on. By collecting these datasets, we compiled a new, fully heterogeneous dataset for network training.
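A sketch of this channel mixing as we read the description above; the OpenCV conversions and the coarse normalization are our assumptions, since the exact scaling of the original pipeline is not specified:

```python
import cv2
import numpy as np

def hue_chroma_features(bgr):
    """Stack H and S (from HSV) with a and b (from CIE-Lab),
    dropping the brightness channels V and L as described above."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2Lab)
    h, s = hsv[..., 0], hsv[..., 1]  # hue-related components
    a, b = lab[..., 1], lab[..., 2]  # Lab chromaticity coordinates
    # Coarse normalization to [0, 1] (note: OpenCV stores 8-bit H as 0-179)
    return np.stack([h, s, a, b], axis=-1).astype(np.float32) / 255.0
```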
3.5. Dataset
The dataset of this project was collected, organized, and normalized, comprising the union of seven datasets gathered during the literature review. The images and their respective ground truths were converted to the same image format (JPEG) and randomly shuffled so that the dataset was diverse and similar images were not clustered together.
The collection was based on seven datasets identified during the literature review.
After the compilation, a new dataset was created, containing 8209 varied images with their respective ground truths. In this way, the dataset became diversified, as shown in
Figure 7, containing images of close-ups, body parts with exposed skin areas, and other images with complex backgrounds considering several lighting conditions.
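A sketch of the kind of compilation script described here, assuming each source dataset provides matching image/ground-truth folders; the directory layout and names are hypothetical:

```python
import random
from pathlib import Path
from PIL import Image

# Hypothetical layout: each source dataset has images/ and masks/ folders
SOURCES = [Path('datasets') / name for name in
           ['ds1', 'ds2', 'ds3', 'ds4', 'ds5', 'ds6', 'ds7']]
OUT = Path('compiled')
(OUT / 'images').mkdir(parents=True, exist_ok=True)
(OUT / 'masks').mkdir(parents=True, exist_ok=True)

pairs = []
for src in SOURCES:
    for img_path in sorted((src / 'images').iterdir()):
        mask_path = src / 'masks' / img_path.name
        if mask_path.exists():
            pairs.append((img_path, mask_path))

# Shuffle so similar images from one source are not clustered together
random.seed(42)
random.shuffle(pairs)

for i, (img_path, mask_path) in enumerate(pairs):
    # Normalize everything to the same format (JPEG), as in the text
    Image.open(img_path).convert('RGB').save(OUT / 'images' / f'{i:05d}.jpg')
    Image.open(mask_path).convert('L').save(OUT / 'masks' / f'{i:05d}.jpg')
```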
3.6. Convolutional Neural Network
U-Net is a neural network model proposed for semantic segmentation and consists of a contraction process followed by an expansion process [
38], as demonstrated in
Figure 8. It is a widely used architecture for image segmentation problems, especially in biomedical applications. In this proposal, it was trained based on the aforementioned
dataset, with the objective of segmenting skin and non-skin pixels. This convolutional neural network model is widely used and recommended in the literature, and many authors rely on it to implement their approaches. For example, Liu et al. [39] used the U-Net model for the task of segmenting overlapping chromosomes from non-overlapping ones, with excellent results.
U-Net was presented in 2015 and won several categories of the cell tracking challenge at the International Symposium on Biomedical Imaging (ISBI) that year; on that occasion, it was considered a breakthrough in deep learning. It builds on the concept of fully convolutional networks. The intention of U-Net is to capture both context and localization features, and the architecture accomplishes this by design: successive contracting layers (an encoder) are followed by upsampling operators in a decoder composed of expansion layers, yielding higher-resolution results for the input images.
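A much-reduced sketch of the U-Net idea in Keras; the depth, filter counts, and the four-channel input shape (matching the H, S, a, b features above) are illustrative assumptions, not the trained configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # Two 3x3 convolutions, the basic U-Net building block
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    return layers.Conv2D(filters, 3, padding='same', activation='relu')(x)

def build_unet(input_shape=(256, 256, 4), base=32):
    """Minimal U-Net: contracting path, bottleneck, and expanding path
    with skip connections; sigmoid head for a binary skin mask."""
    inputs = tf.keras.Input(shape=input_shape)

    # Contraction (encoder): conv blocks followed by downsampling
    c1 = conv_block(inputs, base)
    p1 = layers.MaxPooling2D()(c1)
    c2 = conv_block(p1, base * 2)
    p2 = layers.MaxPooling2D()(c2)

    b = conv_block(p2, base * 4)  # bottleneck

    # Expansion (decoder): upsampling plus skip connections
    u2 = layers.Conv2DTranspose(base * 2, 2, strides=2, padding='same')(b)
    c3 = conv_block(layers.Concatenate()([u2, c2]), base * 2)
    u1 = layers.Conv2DTranspose(base, 2, strides=2, padding='same')(c3)
    c4 = conv_block(layers.Concatenate()([u1, c1]), base)

    outputs = layers.Conv2D(1, 1, activation='sigmoid')(c4)  # skin probability
    return tf.keras.Model(inputs, outputs)
```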
For training the network, we considered 30 epochs, with steps covering all samples in each epoch. We defined this parameter by evaluating the tests and noticing that after 30 epochs, the result remained stable. To evaluate the network training process, the k-fold cross-validation technique was applied, which estimates the training quality over k distinct segments of the dataset. As exemplified in Figure 9, this technique evaluates the neural network model on distinct dataset splits and assigns the average of the evaluations as the final performance of the network.
Hence, using this technique, we divided the dataset into 10 folds (or segments), where each fold contained a random set of samples for training and another for testing, without repetition between the sets. In addition, for the training process, we defined that the network parameters should be adjusted according to a validation step at each epoch, using 15% of the samples in the training set. Thus, when the 30 epochs were completed, the samples reserved for testing—entirely unseen during training—were submitted to the trained model to obtain the final result for each fold.
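A sketch of this evaluation protocol with scikit-learn's KFold, reusing the build_unet sketch above; the data loading is a placeholder:

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholders for the real data: input feature stacks and binary masks
X = np.load('images.npy')
Y = np.load('masks.npy')

kf = KFold(n_splits=10, shuffle=True, random_state=42)
fold_scores = []

for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    model = build_unet()  # fresh model per fold (see the sketch above)
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    # 15% of the training samples are held out for per-epoch validation
    model.fit(X[train_idx], Y[train_idx], epochs=30,
              validation_split=0.15, verbose=0)
    # The test samples were never seen during training
    _, acc = model.evaluate(X[test_idx], Y[test_idx], verbose=0)
    fold_scores.append(acc)
    print(f'fold {fold}: accuracy = {acc:.4f}')

print('mean accuracy:', float(np.mean(fold_scores)))
```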
Figure 10 shows the training process, where the stability of training across the 10 different dataset arrangements can be observed.
Table 3 presents the accuracy results for the training phase. The train column corresponds to the training values, obtained from the validation step used to adjust the parameters at each completed epoch. The test column presents the results obtained on the set of images from the dataset reserved for evaluating the model after the training process. Considering the 10 arrangements produced by the
k-fold technique, the model results were stable and obtained an average accuracy of a little over 85%, representing a good result for the forensic classification aspect of this research. This paper avoided comparing the results with other works in the literature because a fair comparison between deep learning-based works for skin detection is difficult due to the unavailability of a common benchmark [
5].
4. Results
The set of tasks explained in the previous sections forms a system. An image submitted to the system is first processed by the task of recognizing and delimiting the regions of interest (i.e., the people present in the image), which, if they exist, are segmented from the image. Then, each segmented region of interest undergoes recognition and segmentation of the visible body parts. Each recognized body part is then processed by the next task, which segments the skin and non-skin pixels present in the analyzed body region. Finally, the image is reconstructed, identifying the mask of pixels recognized as skin, the number of instances of objects classified as persons, and the ratio of skin pixels to non-skin pixels—the proportion of visible skin—for each person recognized in the image. In
Figure 11, the macro diagram of the operation of the proposed solution is presented.
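A condensed sketch of the final accounting step, assuming the upstream stages produce boolean masks per person and per body part; all names here are hypothetical stand-ins for the stages in Figure 11:

```python
import numpy as np

def skin_statistics(person_part_masks, skin_masks):
    """For each person and body part, compute the ratio of skin pixels
    to the pixels of that part, i.e. the proportion of visible skin.

    person_part_masks: {person_id: {part_name: bool array (H, W)}}
    skin_masks:        {person_id: bool array (H, W)} from the U-Net
    """
    stats = {}
    for pid, parts in person_part_masks.items():
        skin = skin_masks[pid]
        stats[pid] = {}
        for part, part_mask in parts.items():
            part_pixels = int(part_mask.sum())
            skin_pixels = int((part_mask & skin).sum())
            # Ratio is relative to the part itself, not the whole image
            ratio = skin_pixels / part_pixels if part_pixels else 0.0
            stats[pid][part] = ratio
    return stats
```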
To evaluate the results, some images and their respective
ground truths were submitted to processing as described. In
Figure 12, the result obtained after processing is presented. In the “Predicted Mask” column, the neural network prediction for each human body part is displayed, reassembled according to the input image. In
Table 4, the data extracted from each processing step are presented, allowing a more precise estimate of the skin presence in the examined image. Analytically, in
Table 5, the individualized result per body part of each image in the experiment is presented. It is possible to verify the results obtained for each examined part, indicating the skin recognition capability for specific applications that target a particular part of the body.
In
Figure 13, another example is presented where more than one person was present in the input image. As can be seen, there were three people in the figure, and they were segmented to be examined individually by the process presented in the proposed solution. The results shown in
Figure 14 present this segmented examination. The first and second rows show the data for the first segmented person: the first row shows the segmentation of the body parts (face, arm, torso, and leg), and the second row shows the neural network prediction corresponding to the segmentation of the skin and non-skin pixels for each part.
Table 6 describes the pixel composition of the image, broken down by person and body part. For example, it is possible to verify that person c in
Figure 15 had a high index of skin presence in all exposed body parts except the leg region. With this information, applying the proposed solution may allow an image examination application to consider the image potentially relevant for further analysis, given the high index of skin pixels in specific regions of the human body. Although the focus of this work is not the privacy management of data or user information transmitted and received, we will use the definitions and criteria developed in accordance with [
40,
41,
42,
43,
44].
5. Conclusions
The proposed solution is efficient for applications that need accurate information about the proportion of visible skin in an image. The experiments in the previous sections demonstrated the solution’s ability to recognize the proportion of skin per human body part. Our method makes it possible to restrict the recognition of images with visible skin to the relevant human body part, as occurs in cases of forensic image examination. In addition, we also presented a new dataset of images for segmentation, generated from the collection of datasets identified in the literature review. In future work, we will consider the performance analysis of experiments and the evaluation of other specific neural network architectures for infrared segmentation and, likewise, for the segmentation phase of skin and non-skin pixels. An approach may present difficulties under certain image conditions, but this does not mean that this difficulty will harm the final application. In general, when the capture process can be controlled, the simplest approach may deliver the best result, considering other factors such as time, complexity, and processing power. Another relevant factor is the level of precision needed, since not all applications require high precision. For each demand of the final application, there is a suitable approach.
Our approach presents a method capable of processing images that do not come from a controlled or known environment; the lighting conditions, capture quality, and orientation of objects in the image do not interfere with the result. The main application of our model is for forensic purposes, where images need to be classified according to the proportion of skin present. Our model was able to determine the amount of skin present in an image relative to the real proportion of each identified person, rather than the total number of pixels in the image, thus enabling a more efficient classification of images of forensic interest. Another aspect is that the proportion of skin can be evaluated for each part of the human body individually, which may contribute to future applications that target only certain parts of the human body. Although this work focuses on overcoming the challenges of skin and non-skin segmentation in unfamiliar environments, some challenges and limitations persist. Grayscale images do not give the expected results, since the skin segmentation process is based on a color model and thus requires color images as input. Another limitation concerns images with very small portions of objects containing skin; since such objects are not of forensic interest, this does not affect the intended result, but the limit is worth highlighting. These limitations can be considered opportunities for future work, since they address important aspects of the entire process.