1. Introduction
Image aesthetic quality assessment (IAQA) is an important visual task: it represents a key criterion for visual content curation and lays the foundation for many multimedia applications such as image retrieval [1,2], photo enhancement [3], and image cropping and photo album creation [4,5,6]. The goal of IAQA is to design algorithms that automatically predict image aesthetic quality. This is a challenging task due to its fuzzy definition and its highly subjective nature. The aesthetic score of an image depends on several loosely determined factors, such as composition, color distribution, and technical quality. Many approaches for the aesthetic assessment of images with generic content are present in the literature [6,7,8]. However, psychology research [9] shows that certain kinds of content are more attractive than others. Professional photographers adopt different photographic techniques and have different aesthetic criteria in mind when taking different types of photos; it is therefore reasonable to design features specialized in modeling the aesthetic quality of different kinds of photos (e.g., [10,11,12]).
In this paper, we focus on the aesthetic quality assessment of images containing human faces. The reasons are twofold: (i) a large percentage of the images on social media sites and in media content repositories contains faces and self-portraits, or “selfies” [13,14]; (ii) the performance of generic-content aesthetic assessment methods [7] drops considerably when dealing with these types of images. The automatic estimation of the overall aesthetics of images containing faces is fundamental for a wide range of applications, for example to discriminate professional and amateur portraits on sharing platforms [15], to choose the most aesthetically pleasing picture to share on social media [16], to guide the capturing process on smart cameras [17], or to handle the automatic creation of photo albums [1]. The prediction of the overall aesthetics of an image containing faces results from the combination of several features encoding relevant information about the global image aesthetics adapted to facial pictures, as well as information related to facial expressions and high-level attributes (e.g., smile, age, gender, hair style). It should be clear that although facial beauty and face aesthetics are two related concepts, the first reflects the attractiveness of the subject’s face, while the second represents the attractiveness of the photo containing the subject’s face (see, for example, Figure 1).
Previously proposed methods for the aesthetic quality assessment of images containing faces can be grouped into those that treat the problem as a categorization into images with low or high aesthetic quality [18,19,20] and those that instead estimate a continuous aesthetic quality score [1,17,19].
Males et al. [18] exploited a support vector machine for aesthetic quality categorization trained on a combination of global features (e.g., contrast and hue distribution of the whole image) and local features (e.g., sharpness and blown-out highlights computed only on the facial region). Their experiments were carried out on a set of photos collected from Flickr and manually labeled by five people as being aesthetically appealing or not. In [20], a composition-based augmentation scheme was used to train a deep convolutional neural network (DCNN) on a portrait subset of the AVA dataset for binary aesthetic classification. Li et al. [21] evaluated the performance of several categories of aesthetics-related features, such as pose, face location, and photo composition, on their own dataset of photos with faces. Lienhard et al. [19,22] proposed a new database, called Human Faces Score (HFS), and developed a method based on the selection of low-level features extracted from several image regions for both the aesthetic quality categorization of portrait images (i.e., low or high) and continuous aesthetic score prediction. Recently, several works have proposed intelligent capture methods for taking good selfies based on hand-crafted features and face pose analysis [17,23].
In this paper, we propose a method for the aesthetic assessment of images containing faces. It involves the use of three convolutional neural networks (CNNs) to encode information regarding perceptual quality, global image aesthetics, and facial attributes. A mixed-coded genetic algorithm (GA) is trained to combine these features to explicitly predict the aesthetics of images containing faces. The mixed-coded GA is built to simultaneously address: (i) the selection of relevant features and (ii) the optimization of the weights characterizing the linear model that maps features to an aesthetic prediction. As far as we know, this is the only approach for estimating the aesthetic quality of images containing faces that takes into account the properties of the entire image, as well as aspects specific to the face, such as demographic attributes (gender, age, and ethnicity), mood (facial expressions), and visual attributes (e.g., hair style, clothing, face shape).
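As a concrete illustration of the mixed coding, the minimal sketch below (in Python, with names and dimensions of our own choosing rather than those of the actual implementation) shows how a binary selection mask and real-valued weights can jointly define the linear aesthetic predictor:

```python
import numpy as np

def predict_aesthetics(features, mask, weights, bias):
    """Linear aesthetic prediction over GA-selected features (illustrative sketch).

    features : (N, D) array of concatenated CNN descriptors
               (perceptual quality, global aesthetics, facial attributes)
    mask     : (D,) binary vector  -- the discrete part of the genome
    weights  : (D,) real vector    -- the continuous part of the genome
    bias     : scalar              -- also part of the continuous genome
    """
    # Features whose mask entry is 0 are excluded from the linear combination.
    return features @ (mask * weights) + bias
```

The GA thus evolves the mask (feature selection) and the weights and bias (linear model) within a single genome.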
The idea underlying this method was presented in [24]. In this paper, we revise this idea and, in particular, perform a deeper investigation of the fitness functions to be used for the optimization of the genetic algorithm. We also exploit a richer set of evaluation metrics to assess the aesthetics models more comprehensively. Moreover, a new set of experiments assessing the generalization ability of the best method is carried out.
The rest of the article is organized as follows: Section 2 details the proposed method; in Section 3, we present the experimental protocol and the considered metrics; Section 4 reports the results and the analysis of the achieved performance; and conclusions and comments are given in Section 5.
3. Experiments
In this section, the evaluation protocol, the considered databases, and the experimental setup are detailed.
3.1. Evaluation Protocol
For the experiments, the same evaluation procedure adopted in [19] was followed. In more detail, for each experiment, ten-fold cross-validation was performed by randomly dividing the dataset into ten disjoint subsets and repeating the experiment ten times, each time selecting a different subset for testing and the remaining nine for training. The division into ten disjoint sets was itself repeated ten times to avoid sampling bias.
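For reference, the following Python sketch illustrates one way to implement this repeated ten-fold protocol; the placeholder data, the use of scikit-learn, and the seeding scheme are our assumptions, with the model fitting elided:

```python
import numpy as np
from sklearn.model_selection import KFold

X, y = np.random.rand(500, 64), np.random.rand(500)  # placeholder features/scores

scores = []
for repetition in range(10):                     # 10 random divisions into folds
    kf = KFold(n_splits=10, shuffle=True, random_state=repetition)
    for train_idx, test_idx in kf.split(X):      # 10 disjoint test subsets
        X_tr, y_tr = X[train_idx], y[train_idx]  # nine folds for training
        X_te, y_te = X[test_idx], y[test_idx]    # one fold for testing
        # ... fit the model on (X_tr, y_tr), evaluate on (X_te, y_te),
        # and append the resulting metric to `scores` ...
```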
Classification performance was evaluated in terms of the Good Classification Rate (GCR) and F1 score. The GCR measures the ratio between the number of images correctly classified and the number of test images and is defined as $\mathrm{GCR} = 1 - \mathrm{CCE}$. The cross-category error (CCE) can be computed as follows:
$$\mathrm{CCE} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left( c_i \neq \hat{c}_i \right),$$
where $N$ is the number of samples, $c_i$ is the ground-truth class, and $\hat{c}_i$ is the predicted class for the $i$-th image; $\mathbb{1}(x) = 1$ if $x$ is true, and $0$ otherwise. The F1 score corresponds to:
$$F_1 = \frac{2\,TP}{2\,TP + FP + FN},$$
where $TP$ is the number of true positives, $FP$ is the number of false positives, and $FN$ is the number of false negatives.
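Both classification metrics are straightforward to compute. A minimal Python implementation consistent with the definitions above, assuming label 1 denotes the high-quality class, could look as follows:

```python
import numpy as np

def gcr(y_true, y_pred):
    """Good Classification Rate: fraction of correctly classified images."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(y_true == y_pred)

def f1(y_true, y_pred, positive=1):
    """F1 score for the binary low/high aesthetic quality task."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives
    return 2 * tp / (2 * tp + fp + fn)
```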
Regression performance was evaluated in terms of Pearson’s Linear Correlation Coefficient (PLCC) and Spearman’s Rank-Order Correlation Coefficient (SROCC). The PLCC measures the linear correlation between the actual and the predicted scores, and it is defined as follows:
$$\mathrm{PLCC} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}},$$
where $N$ is the number of samples, $x_i$ and $y_i$ are the sample points indexed with $i$, and finally, $\bar{x}$ and $\bar{y}$ are the means of each sample distribution. Instead, the SROCC estimates the monotonic relationship between the actual and the predicted scores, and it is calculated as follows:
$$\mathrm{SROCC} = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N (N^2 - 1)},$$
where $N$ is the number of samples and $d_i$ is the difference between the two ranks of each sample. The average of the considered metrics across the ten rounds is reported.
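In practice, both coefficients are available in SciPy; note that `scipy.stats.spearmanr` computes the coefficient from the ranks and therefore also handles ties, which the simplified formula above assumes are absent. The scores below are purely illustrative:

```python
from scipy.stats import pearsonr, spearmanr

y_true = [4.2, 5.1, 3.3, 6.0, 4.8]   # illustrative ground-truth MOS values
y_pred = [4.0, 5.5, 3.1, 5.7, 4.9]   # illustrative predicted scores

plcc, _ = pearsonr(y_true, y_pred)    # linear correlation
srocc, _ = spearmanr(y_true, y_pred)  # rank-order (monotonic) correlation
```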
3.2. Portrait Image Databases
In this section, the publicly available databases for the aesthetic assessment of images with faces are described. The databases consist of images containing people or groups of people gathered from online photo databases or photo sharing websites (e.g., Flickr, DPChallenge). Given that these photos were collected in real scenarios, they present a wide range of subjects, facial appearances, illumination, and imaging conditions.
The CUHKPQ [15] is a manually annotated database for binary image aesthetics categorization (high and low, respectively). It consists of 17,673 images organized into seven different categories. In this work, only images belonging to the “human” category are considered. There are 3148 photos of different sizes; the size of the faces varies between pixels and pixels. Some example images are shown in Figure 8a. Figure 8b shows that most of the sample images were annotated as being of low aesthetic quality.
The Human Faces Score (HFS) [22] database contains 250 photos of faces in the same pose, all with the same width of 240 pixels and a variable height. Specifically, seven images of each of 20 different people and 110 additional portrait images were collected. The face images of one subject are given in Figure 9a. The annotation of each image was obtained by having 25 human observers rate the image on a scale from 1 to 6 (the highest aesthetic quality) and then calculating the Mean Opinion Score (MOS). In Figure 9b, the histogram of the MOSs for the database is shown.
The Face Aesthetics Visual Analysis (FAVA) database is a subset of the large-scale AVA dataset [28] containing various images with faces, portrayed in near-frontal positions. The smallest face in the database has a size of pixels, while the largest has a size of pixels. Each picture is associated with a value between 1 and 10 (the highest quality) corresponding to the average of around 210 collected individual scores (Figure 10b displays the histogram of the MOSs). Samples are shown in Figure 10a.
The Flickr database was gathered from Flickr for general aesthetic assessment [1]. It consists of 500 images associated with a ground-truth score between 0 and 10, where 10 means high quality. The longest side of each photo is 1600 pixels, and the photos show a single face or a group of faces. The size of the smallest face in the database is pixels, while the largest face almost completely covers the surface of the image with a size of pixels. Following [19], only the biggest detected face is considered in each picture. Figure 11a shows samples from the database, while the distribution of the scores is reported in Figure 11b.
3.3. Experimental Setup
Binary aesthetic classification and aesthetic score regression were performed for each dataset presented previously.
For binary classification, the goal was to discriminate images into low- and high-quality aesthetics. To obtain the ground truth for the databases that provide MOSs (all except CUHKPQ, which already provides the low-/high-quality aesthetic labels), we followed the same protocol as in [19]. In this protocol, each dataset was first sorted by the Mean Opinion Score (MOS) values and then separated into two sets with the same number of samples, containing the images with the lowest and highest aesthetic scores, respectively.
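A minimal sketch of this labeling protocol, assuming label 1 for the high-quality half (the helper name is ours):

```python
import numpy as np

def mos_to_binary_labels(mos):
    """Sort by MOS and split into two equally sized halves (0 = low, 1 = high)."""
    mos = np.asarray(mos)
    order = np.argsort(mos)             # indices from lowest to highest MOS
    labels = np.zeros(len(mos), dtype=int)
    labels[order[len(mos) // 2:]] = 1   # top half -> high aesthetic quality
    return labels
```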
In all the experiments, the GA was trained with a population of 100 individuals initialized using the parameters (weights and bias) of a linear support vector machine (SVM) previously trained for aesthetic prediction, together with perturbed versions of those parameters. The learning parameters were empirically set differently for classification and regression. More precisely, for classification, the number of generations was 200, the crossover probability 80%, and the elitism (the percentage of individuals in the current generation that survive into the next generation) 7%. For regression, the number of generations was 250, the crossover probability 85%, and the elitism 10%.
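For illustration, the following sketch shows one plausible way to seed such a population from a linear SVM; the perturbation model (additive Gaussian noise) and its scale are our assumptions, not details specified above:

```python
import numpy as np
from sklearn.svm import LinearSVC

def init_population(X, y, pop_size=100, sigma=0.05, seed=0):
    """Seed the GA population with a trained linear SVM and perturbed copies."""
    rng = np.random.default_rng(seed)
    svm = LinearSVC().fit(X, y)
    # Concatenate weights and bias into a single continuous genome.
    base = np.concatenate([svm.coef_.ravel(), svm.intercept_])
    population = [base] + [
        base + rng.normal(0.0, sigma, size=base.shape)  # perturbed versions
        for _ in range(pop_size - 1)
    ]
    return np.stack(population)
```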