**Chocolate Quality Assessment Based on Chemical Fingerprinting Using Near Infra-red and Machine Learning Modeling**

**Thejani M. Gunaratne 1, Claudia Gonzalez Viejo 1, Nadeesha M. Gunaratne 1, Damir D. Torrico 1,2, Frank R. Dunshea 1 and Sigfredo Fuentes 1,\***


Received: 1 September 2019; Accepted: 18 September 2019; Published: 20 September 2019

**Abstract:** Chocolate is the most common confectionery product and among the most popular desserts and snacks across the globe. Its quality plays a major role in sensory evaluation and consumer acceptance. In this study, a rapid and non-destructive method was developed to predict the quality of chocolate based on physicochemical data and sensory properties, using chocolates formulated with the five basic tastes. Physicochemical parameters (pH, Brix, viscosity, and color) and sensory properties (basic taste intensities) of the chocolate samples were recorded. These data, together with results obtained from near-infrared spectroscopy, were used to develop two machine learning models to predict the physicochemical parameters (Model 1) and sensory descriptors (Model 2) of chocolate. The results show that the models developed had high accuracy, with *R* = 0.99 for Model 1 and *R* = 0.93 for Model 2. The models thus developed can be used as an alternative to consumer panels to determine the sensory properties of chocolate more accurately and at lower cost using the chemical parameters.

**Keywords:** sensory; physicochemical measurements; artificial neural networks; near infra-red spectroscopy

#### **1. Introduction**

Chocolate is a semisolid suspension of fine particles made from sugar, milk powder, milk fat, and cocoa in a continuous fat phase. The fruit of *Theobroma cacao* provides the cocoa solids and cocoa butter used for chocolate production. It is a confectionery product that evokes emotional stimuli upon consumption and activates the pleasure centers in the brain [1,2]. The main processing steps of chocolate manufacture are mixing, refining, conching, tempering, molding, and packaging. Particle size reduction takes place in the refining stage to obtain the optimum particle size, which is important for the texture of chocolate [2]. Viscosity, mouthfeel, and consistency are very important properties of chocolate and are governed by its rheological and textural characteristics; processing conditions and composition influence the rheological properties of food [3]. To achieve the preferred rheological behavior and an acceptable texture, production steps such as mixing, refining, conching, and tempering are important, as these processing parameters affect the viscosity of chocolate [4]. Viscosity is therefore considered an important physical parameter in producing high-quality chocolate and cocoa products, since it influences their textural properties [4]. Moreover, acidity, sweetness, bitterness, color intensity, hardness, and smoothness are the most important parameters affecting the sensory perception of chocolate [5].

Bitterness can be caused by different chemical compounds, such as quinine hydrochloride and 6-n-propyl-2-thiouracil (PROP), and one of its proposed functions is the identification and avoidance of poisons [6]. The sodium ion (Na+) can be used as a reference to measure saltiness, and it acts on the ion channels of the taste receptor cells. Acids are responsible for sourness, with the hydrogen ion (H+) activating the taste receptor. Compounds responsible for sweetness vary widely, from simple carbohydrates to amino acids and artificial sweeteners [6]. "Umami" is a Japanese term meaning "delicious" and is naturally found in foods like meats, tomatoes, and mushrooms. It was identified through studies of monosodium glutamate (MSG) and is considered a flavor enhancer [7].

Research and quality inspection of food products, such as sensory evaluation and the determination of physicochemical data, are time-consuming and labor-intensive and may require complex analytical techniques. Near infra-red (NIR) spectroscopy (NIRS) covers the 750–2500 nm wavelength range, corresponding to photon energies (hν) of 2.65 × 10<sup>−19</sup> to 7.96 × 10<sup>−20</sup> J, and is a promising technique that may overcome some of the drawbacks of traditional methods [8]. NIRS has several advantages, being a fast, non-destructive, non-invasive, and universally accepted technique [9]. Several studies have combined NIRS with physicochemical data to quantify various analytes in foods [10]. Recent studies on food products include the assessment of beer quality using computer vision, NIRS, and machine learning algorithms [11], the prediction of pH and total soluble solids in banana using NIRS [12], the prediction of canned black bean texture using visible/NIRS [13], and the discriminant analysis of pine nuts by NIRS [14]. Applications of NIRS to cocoa and chocolate manufacture include quality control of cocoa beans [8], determination of biochemical quality parameters in cocoa [15], rapid determination of sucrose content in chocolate mass [16], and assessment of raw cocoa beans to predict the sensory properties of chocolate [17].

In this study, NIRS was used to gain chemical fingerprinting of chocolate produced using the five basic tastes. Predictive models based on artificial neural networks (ANN) were developed using Matlab® R2018b (Mathworks Inc., Natick, MA, USA). Specific absorbance values of NIR spectra were used as inputs, while physicochemical data (pH, Brix, viscosity, and color) and sensory properties (basic taste intensity) of chocolate were used as targets. The objective of this study was to develop accurate models to predict the quality of chocolate based on chemical, physical, and sensory properties using NIRS and machine learning algorithms.

#### **2. Materials and Methods**

#### *2.1. Chocolate Samples*

For this study, five types of chocolate representing the basic tastes (bitter, salty, sour, sweet, and umami) were used. For the bitter sample, commercially available dark chocolate (70% cocoa) was used. The other four samples were produced in the sensory laboratory at The University of Melbourne, Australia, testing three concentrations for each taste. The final concentration of each taste was determined by tasting in a focus group discussion of sensory professionals at The University of Melbourne, who commented on the identification and intensity of each flavor; the concentration on which all participants agreed was used for final sample preparation. Accordingly, salt (4 g), citric acid (0.5 g), sucrose (6 g), and MSG (3.5 g) were each added to 100 g of melted compound milk cooking chocolate chips to produce the salty, sour, sweet, and umami samples, respectively. These chocolate samples were used for sensory analysis, NIRS, and physicochemical analysis.

#### *2.2. Participants for Sensory Sessions*

Panelists (*N* = 45) who volunteered to participate in the sensory assessment of chocolate samples with basic tastes were recruited via email invitations from The University of Melbourne, Australia. The panelists signed a consent form before participating in the sensory session and received incentives (chocolate and confectionery products) as an appreciation for participating in the study. The experimental procedure was approved by the Ethics Committee of the Faculty of Veterinary and Agricultural Sciences at The University of Melbourne, Australia (Ethics ID 1545786.2).

#### *2.3. Sensory Evaluation*

Sensory sessions were conducted in individual booths in the sensory laboratory at The University of Melbourne. Each booth consisted of an integrated camera system controlled by a Bio-sensory application (App) designed for Android tablets (Google; Open Handset Alliance, Mountain View, CA, USA), developed by the sensory group from the School of Agriculture and Food, Faculty of Veterinary and Agricultural Sciences, The University of Melbourne [18]. The intensities of bitterness, saltiness, sourness, sweetness, and umami taste (0 = low, 7.5 = medium, 15 = high) were assessed using a 15-cm unstructured continuous scale. The temperature of the serving/preparation room was 20 °C, while the temperature of the booths was controlled between 24 and 25 °C. Panelists were served the chocolate samples (two pieces of each taste; each square measuring 1 cm × 1 cm and weighing 7.3 g) on a tray in random order, labeled with specific 3-digit random numbers, for evaluation. Panelists were asked to cleanse their palate with crackers and water between all samples.

#### *2.4. pH, Total Soluble Solids (TSS) and Viscosity Measurements*

For pH, five readings were taken from each of five pieces of chocolate (25 readings in total per sample). Initially, 10 g of chocolate was ground, mixed with 100 mL of distilled water, and allowed to settle for 20 min [19]. The pH of the supernatant liquid was measured (25 readings per taste) using a calibrated bench-top meter (Sper Scientific Ltd., Scottsdale, AZ, USA) at room temperature (23–25 °C). The same method was modified to measure total soluble solids (Brix): the supernatant liquid obtained from mixing chocolate with distilled water was measured (25 readings per taste) using a digital refractometer HANNA HI 96801 (Hanna Instruments Inc., Woonsocket, RI, USA) at room temperature (23–25 °C). To determine viscosity, chocolate was melted by microwaving 100 g for 1 min at 800 W [20]. A Brookfield DV1 viscometer with an RV7 spindle at 50 RPM (AMETEK Brookfield, Middleboro, MA, USA) was used to measure the viscosity (six readings per sample) of the melted chocolate at 28–30 °C. About 12–15 min were needed to obtain the measurements.

#### *2.5. Near infra-red Spectroscopy and Color Measurements*

A handheld microPHAZIR™ RX Analyzer (Thermo Fisher Scientific, Waltham, MA, USA) was used to measure the NIR spectra of the chocolate samples. The instrument was calibrated using a white background before obtaining the measurements and again after every 10–15 samples. This device measures absorbance at wavelengths between 1600 and 2396 nm.

In this study, 24 pieces of each type of chocolate were used to measure the spectra. Three readings were obtained from the top surface and three from the bottom surface of each piece, providing 144 readings per sample. The absorbance readings obtained from the chocolates manufactured with the five basic tastes were averaged separately in order to plot them against wavelength. After comparing the available pre-treatments, the second derivative of the absorbance was used for the analysis because it gave the best representation of the spectra. The Unscrambler X ver. 10.3 (CAMO Software, Oslo, Norway) software was used to plot the absorbance values across all wavelengths for the five chocolate samples.
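The second-derivative pre-treatment described above was performed in The Unscrambler; an equivalent operation can be sketched in Python with a Savitzky–Golay filter (the exact smoothing window and polynomial order used by the authors are not stated, so the values below are illustrative assumptions on a simulated spectrum):

```python
import numpy as np
from scipy.signal import savgol_filter

# Simulated spectrum: 100 absorbance points over 1596-2396 nm (the range used in the study).
wavelengths = np.linspace(1596, 2396, 100)
absorbance = np.exp(-((wavelengths - 1940) / 120) ** 2)  # mock water-related peak near 1940 nm

# Second-derivative pre-treatment via Savitzky-Golay smoothing (11-point window,
# 2nd-order polynomial); window and order here are illustrative choices.
d2 = savgol_filter(absorbance, window_length=11, polyorder=2, deriv=2)

# The second derivative is negative at an absorbance peak, which is what makes
# this pre-treatment useful for resolving overlapping bands.
print(d2.shape)
```

In practice the same filter would be applied to each of the 144 replicate readings before averaging.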

The color parameters using the CIELab color scale were recorded using a StellarNet Inc. EPP2000 (EPP2000-UVN-SR) coupled with an SL1 Filter (StellarNet, Inc., Tampa, FL, USA). In the CIELab color scale, *L* represents lightness and ranges from 0 to 100, *a* depicts the red and green in the positive and negative values, respectively, and *b* represents yellow in the positive values and blue in the negative [21].

#### *2.6. Statistical Analysis and Machine Learning Modeling*

An analysis of variance (ANOVA) was conducted to assess significant differences (α = 0.05) in the physicochemical data between chocolates with different tastes using Minitab 2017 (Minitab Inc., State College, PA, USA). The mean values and standard deviation of the sensory, chemical, and physical parameters were also calculated and tabulated.
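The ANOVA step above was run in Minitab; a minimal Python sketch of the same one-way test, on mock viscosity readings (six per sample, as in this study, with deliberately separated group means), could look like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Mock viscosity readings for three of the chocolate samples; the group means
# are deliberately separated so the ANOVA detects a difference. These numbers
# are illustrative, not the study's data.
bitter = rng.normal(10.0, 0.3, 6)
salty = rng.normal(12.0, 0.3, 6)
sweet = rng.normal(14.0, 0.3, 6)

# One-way ANOVA across the sample groups at alpha = 0.05.
f_stat, p_value = stats.f_oneway(bitter, salty, sweet)
print(p_value < 0.05)
```

A significant result would then be followed by a post-hoc comparison such as Tukey's HSD, as reported in Table 1.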

A correlation matrix was also developed to identify the correlations between the sensory (basic taste intensities) and physicochemical properties (pH, Brix, color, and viscosity) of the chocolate samples using a customized code written in Matlab® R2018b (Mathworks Inc. Natick, MA, USA).
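The Matlab code used for the correlation matrix is not reproduced here; the core computation for each cell, a Pearson correlation kept only when significant, can be sketched in Python on mock data with a built-in negative relationship (standing in for, e.g., Brix versus bitterness):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Mock paired measurements; the negative relationship is constructed on purpose.
brix = rng.normal(20, 2, 30)
bitterness = -0.9 * brix + rng.normal(0, 0.5, 30)

# Pearson correlation; only coefficients with p < 0.05 would be displayed
# in the matrix, matching the study's significance criterion.
r, p = stats.pearsonr(brix, bitterness)
significant = p < 0.05
print(round(r, 2), significant)
```

Repeating this over all pairs of sensory and physicochemical variables yields the matrix shown in Figure 3.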

As shown in Figure 1, two machine learning models were developed by testing 17 different ANN training algorithms using an automated customized code written in Matlab® R2018b. The 17 algorithms comprised two backpropagation algorithms using Jacobian derivatives, 11 backpropagation algorithms using gradient derivatives, and four supervised weight or bias training algorithms. An ANN fitting model (Model 1) with three hidden neurons was developed using the normalized NIR readings (−1 to 1) of chocolate as inputs and the normalized values (−1 to 1) of the physicochemical data (pH, Brix, viscosity, and color in CIELab) as targets; the whole NIR spectrum (1596–2396 nm) was used to develop the model. The outputs of Model 1 were used as inputs for Model 2, with the sensory data (intensities of bitterness, saltiness, sourness, sweetness, and umami taste) as targets; Model 2 used 10 hidden neurons. Both models were developed using a random data division: 70% (*n* = 84) of the samples were used for training, 15% (*n* = 18) for validation with a mean squared error (MSE) performance algorithm, and 15% (*n* = 18) for testing using a default derivative function. Of the 17 algorithms (data not shown), the best models for both Model 1 and Model 2 corresponded to the Levenberg–Marquardt algorithm. The correlation coefficient (R), slope (s), MSE, and statistical significance (criterion: *p* < 0.05) were obtained; the p-value was calculated using CoStat ver. 6.45 (CoHort Software, Monterey, CA, USA). The performance of the models was evaluated based on the MSE, and the most accurate model was selected based on the R and MSE values for each stage (training, validation, testing, and overall) and the slope of the overall model.
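The two-stage architecture described above (spectra → physicochemical data → taste intensities) was built in Matlab with Levenberg–Marquardt training; scikit-learn has no Levenberg–Marquardt solver, so the sketch below uses `MLPRegressor` as an analogous feedforward network on mock data, keeping the paper's layer sizes (3 and 10 hidden neurons) and a 70/30 hold-out (the paper further splits the 30% into validation and test halves):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
# Mock data: 120 spectra x 100 NIR inputs, 6 physicochemical targets, 5 sensory targets.
X_nir = rng.uniform(-1, 1, (120, 100))
y_chem = np.tanh(X_nir @ rng.normal(size=(100, 6)))        # stand-in physicochemical targets
y_sens = np.tanh(y_chem @ rng.normal(size=(6, 5)))         # stand-in taste-intensity targets

X_tr, X_te, yc_tr, yc_te, ys_tr, ys_te = train_test_split(
    X_nir, y_chem, y_sens, test_size=0.3, random_state=0)

# Model 1: NIR spectra -> physicochemical data (3 hidden neurons, tanh, as in Figure 1a).
m1 = MLPRegressor(hidden_layer_sizes=(3,), activation="tanh",
                  max_iter=5000, random_state=0).fit(X_tr, yc_tr)

# Model 2: Model 1 outputs -> basic-taste intensities (10 hidden neurons, Figure 1b).
m2 = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh",
                  max_iter=5000, random_state=0).fit(m1.predict(X_tr), ys_tr)

taste_pred = m2.predict(m1.predict(X_te))
print(taste_pred.shape)  # one row per held-out sample, one column per basic taste
```

The chained structure is the key design choice: Model 2 never sees the spectra directly, so once Model 1 exists, taste intensities can be predicted from physicochemical measurements alone.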
Finally, normality tests (Jarque–Bera test and D'Agostino & Pearson test) were conducted on the values of the error histogram using a customized code written in Matlab® R2018b to identify whether the errors were normally distributed.
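Both normality tests named above are available in SciPy, so the Matlab step can be sketched as follows on mock prediction errors (`scipy.stats.normaltest` implements the D'Agostino & Pearson test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
errors = rng.normal(0, 0.05, 200)  # mock model prediction errors

# Jarque-Bera and D'Agostino & Pearson tests; p-values above 0.05 mean
# normality of the errors cannot be rejected.
jb_stat, jb_p = stats.jarque_bera(errors)
dp_stat, dp_p = stats.normaltest(errors)
print(jb_p, dp_p)
```

Normally distributed errors, as reported for both models here, are one sign that the networks are not systematically biased.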

**Figure 1.** Diagrams with a two-layer feedforward network and tan-sigmoid function in the hidden layer, and a linear transfer function in the output layer for (a) Model 1 constructed with 100 inputs from near-infrared readings, three neurons and six targets related to physicochemical data of chocolate, and (b) Model 2 developed using six inputs obtained from the output of Model 1, ten hidden neurons and five targets related to basic taste intensities of chocolate. For the hidden and output layers, *w* = weights and *b* = biases.

#### **3. Results**

#### *3.1. Sensory and Chemical Analysis*

The mean values and standard deviation of the results obtained from the sensory session (intensities of basic tastes), and physicochemical analysis are shown in Table 1. Significant differences (*p* < 0.001) were found in all the basic taste intensities (bitterness, saltiness, sourness, sweetness, and umami taste) for all five samples. According to the results of ANOVA conducted for the chemical parameters, there was a significant difference (*p* < 0.001) between the different chocolate samples in pH, Brix, and viscosity.

**Table 1.** Mean and standard deviation values of the sensory, chemical, and color data of the chocolate samples used for this study.


a–d Means with different letters for each parameter indicate significant differences (*p* < 0.05) according to Tukey's studentized range (HSD) test. Values are presented as mean ± standard deviation. Bitterness, saltiness, sourness, sweetness, and umami taste intensities were obtained from a 15-cm continuous scale.

#### *3.2. Near Infra-Red Spectroscopy*

Figure 2 shows the graphs drawn using the mean values from the replicates of each type of chocolate. Different colors are used to signify each type of chocolate: dark blue for bitter, red for salty, green for sour, light blue for sweet and brown for umami. As seen in this figure, the peak values for all the chocolate samples are within the range of approximately 1700–2350 nm, in which compounds such as water, carbohydrates, proteins, sucrose, lactose, and fat can be found [22].

#### *3.3. Multivariate Data Analysis*

The correlation matrix showing the values of the significant correlations (*p* < 0.05) between the chemical and sensory parameters of chocolate is shown in Figure 3. It can be observed that Brix was negatively correlated to bitterness (*R* = −0.95) and positively correlated to the *L* value (*R* = 0.95) while pH was negatively correlated to sourness (*R* = −0.91). There was also a negative correlation between bitterness and the *b* value (*R* = −0.91).

Table 2 shows a summary of the statistical data obtained from Models 1 and 2. Model 1 had a high correlation coefficient (*R* = 0.99) for the overall stage (Figure 4a). It also had the same performance values for the validation and testing stages (MSE = 0.01), which indicates no overfitting. The slope values of all stages were close to unity (*s* ~ 1). The overall correlation coefficient (*R*) of Model 2 was 0.93 (Figure 4b). It likewise had similar performance values for the validation and testing stages (MSE = 0.05) and slope values close to unity (*s* ~ 1) for all stages. Furthermore, the data were normally distributed (*p* = 1) in the error histograms of Models 1 and 2 according to the Jarque–Bera test and the D'Agostino & Pearson test.

**Figure 2.** Curves for chocolate samples showing the absorbance (Au) values (y-axis) for specific wavelength (nm) values (x-axis) in the near infra-red spectra for each chocolate sample with basic tastes. A total of 144 absorbance readings per sample were taken, and the curves were drawn using the mean values.

**Figure 3.** Results from the correlation matrix. Values shown in the boxes represent the significant correlations (*p* < 0.05). The color bar represents the correlation coefficient (r), with the blue side indicating positive correlations and the yellow side negative correlations. The x-axis and y-axis represent the descriptors.


**Table 2.** Statistical data showing the stage, number of samples, correlation coefficient (*R*), performance based on mean squared error (MSE), and slope for Model 1 and 2.

**Figure 4.** Results of artificial neural networks (ANN); (**a**) Model 1 using readings of the whole near-infrared wavelength range (1596–2396 nm) as inputs and physicochemical data as targets. (**b**) Model 2 using outputs of Model 1 as inputs and sensory responses (basic taste intensities) as targets. The observed values are shown on the x-axis and the estimated values on the y-axis.

#### **4. Discussion**

The sour chocolate sample showed the lowest pH value of all samples, due to the acidity imparted by the citric acid added to the chocolate. Furthermore, the bitter chocolate had the lowest Brix value, which is an indicator of the total soluble solids (sugars) of the food. These results are also reflected in the correlation matrix (Figure 3), which is discussed later in this paper. Moreover, the bitter chocolate showed the lowest *L* value of the color measurements, indicating that it was the darkest sample compared to the others, as 0 indicates black and 100 white on the *L* axis of the CIELab scale [23].

The NIR curves for all five chocolate samples exhibiting bitter, salty, sour, sweet, and umami taste profiles showed similar trends. The main ingredients of chocolate are cocoa liquor, cocoa butter, sugar, milk powder, and milk fat. These components contain peptide and carbohydrate bonds and hence contribute to the spectral peaks [24]. According to Figure 2, peaks were observed at 1728 nm and 1761 nm, which match the C–H bonds of carbohydrates and are similar to the profile of cocoa butter [24]. The peak around 1940 nm corresponds to the water content of the chocolate, which may come from the water contained in the ingredients used for chocolate production [22]. The NIR curves also showed peaks around 2100 nm and 2310 nm, which indicate the presence of protein, mainly from the milk powder and cocoa butter in the chocolate. The peak around 2080 nm, due to the absorbance of the O–H bond in sucrose, indicates the presence of sugar in the chocolate. Moreover, the peaks around 1759 nm, 2310 nm, and 2343 nm indicate the presence of fats in the chocolate [24]. In food science, qualitative methods play an important role in NIRS analysis based on physicochemical data [25,26]. Hence, these qualitative methods were used in this study to determine the compounds present in the chocolate samples.

Brix is an indicator of the total soluble solids (sugars) in a food product, with higher values indicating higher sweetness [27]. This was consistent with the correlation matrix presented in this study, which showed a negative correlation between Brix and sensory bitterness. The pH is an indicator of the acidity of a sample [27]; therefore, sourness decreases as pH increases towards basicity, and this was also in accordance with this study, as the correlation matrix showed a negative correlation between pH and sourness. Furthermore, bitterness showed a negative correlation with the *b* value, which indicates a reduction in yellow color; this is consistent with findings that brown color is associated with chocolate products [28]. Hence, the instrumental measurements were correlated with the sensory characteristics. Recently, NIRS applications have been extended by the development of many calibration models based on physicochemical data. In this study, two models with high accuracy were developed using NIRS results, physicochemical data, and sensory intensities to predict the quality of chocolate. Such models can be used for qualitative or quantitative analysis of food products; the main methods used for quantitative analysis are principal component regression, stepwise multiple linear regression, partial least squares (PLS) regression, and ANN [29–34]. Of these methods, ANN machine learning models were developed in the present study.

Chocolate has also been a target of research using NIRS for several purposes. Nutritional parameters have been measured in chocolate using near-infrared diffuse reflectance spectroscopy and neural networks [24]. Similar to the present study, that study also used ANN models as an alternative to time-consuming chemical methods of analysis. According to the results of the present study, the regression models developed using physicochemical data (pH, Brix, viscosity, and color in CIELab) showed high accuracy, with a high correlation coefficient (*R* = 0.99), meaning that good predictions of the chemical parameters can be obtained using the developed models.

Both models in this study showed similar MSE values for the validation and testing stages, which indicates that there was no overfitting [35,36]. Using Model 1, the physicochemical data (pH, Brix, viscosity, and color) of chocolate samples can be predicted from the NIR spectra, while the sensory properties (intensities of bitterness, saltiness, sourness, sweetness, and umami taste) can be predicted from these parameters using Model 2. Due to its high accuracy, Model 2 may be used as a fast screening method to determine the basic taste intensities of chocolate from the chemical and physical parameters (color). This does not require NIR measurements but only the chemical and physical parameters, making it cheaper than near infra-red spectroscopic measurements, which some researchers consider a high-cost technique [37,38]. Moreover, if an investment can be made in an NIR device, it may be used to obtain all the chemical fingerprinting, physicochemical, and sensory data, which may reduce the time and cost of analysis.

Predictive modeling allows a better understanding of food products through regression models relating physicochemical data to sensory properties. Research has also been conducted to assess beer chemometrics using non-linear methods by developing predictive models with PLS and ANN [39]; that study obtained better predictive models using ANN than PLS. This agrees with Cen and He [29], who stated that ANN gives better results than linear approaches for some non-linear data. Furthermore, PLS can only generate models for the training and validation stages, while ANN can generate all four stages, which may also be used to detect any signs of over- or underfitting [11].

NIRS has several benefits when used in food analysis. The spectral measurement takes only 15–90 s; hence, NIRS is considered a rapid technique [40]. Furthermore, several indexes may be analyzed from one spectral measurement, which is very beneficial when analyzing multiple parameters. The physical state of the sample also does not matter for NIRS, as samples can be tested directly in the sample chamber. NIRS is also a non-destructive method that is free of chemicals, which is another advantage of this technique [29]. There are a few disadvantages of NIRS, such as insensitivity to low concentrations (below 0.1%), the technique being an empirically based quantitative tool, and the necessity of careful observation for accurate results [41].

#### **5. Conclusions**

In this study, the usefulness of NIRS for the prediction of chocolate quality based on chemical, physical, and sensory parameters was demonstrated. NIRS and machine learning algorithms can be used to develop accurate prediction models to assess the quality of chocolate. The developed models can potentially be used as a rapid method to obtain physicochemical data and to screen the intensity of basic tastes in chocolate with a minimal amount of laboratory instruments and labor. Further studies may be conducted to improve the models using other physicochemical measurements and sensory properties that were not considered in this study.

**Author Contributions:** Conceptualization, T.M.G., C.G.V., D.D.T. and S.F.; Data curation, C.G.V., D.D.T. and S.F.; Formal analysis, T.M.G. and C.G.V.; Funding acquisition, F.R.D.; Investigation, T.M.G.; Methodology, T.M.G., C.G.V., N.M.G., D.D.T. and S.F.; Project administration, F.R.D. and S.F.; Supervision, D.D.T., F.R.D. and S.F.; Validation, C.G.V., D.D.T. and S.F.; Writing – original draft, T.M.G.; Writing – review & editing, C.G.V., N.M.G., D.D.T., F.R.D. and S.F.

**Funding:** This research was partially funded by the Australian Government through the Australian Research Council (Grant number IH120100053) 'Unlocking the Food Value Chain: Australian industry transformation for ASEAN markets.'

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

**Comparative Study of Several Machine Learning Algorithms for Classification of Unifloral Honeys**

**Fernando Mateo 1, \*, Andrea Tarazona 2 and Eva María Mateo 3**


**Abstract:** Unifloral honeys are in high demand among honey consumers, especially in Europe. To ensure that a honey belongs to a highly appreciated botanical class, the classical methodology is palynological analysis to identify and count pollen grains. Highly trained personnel are needed to perform this task, which complicates the characterization of honey botanical origins. Organoleptic assessment of honey by expert personnel helps to confirm such classification. In this study, the ability of different machine learning (ML) algorithms to correctly classify seven types of Spanish honeys of single botanical origins (rosemary, citrus, lavender, sunflower, eucalyptus, heather and forest honeydew) was investigated comparatively. The botanical origin of the samples was ascertained by pollen analysis complemented with organoleptic assessment. Physicochemical parameters such as electrical conductivity, pH, water content, carbohydrates and color of the unifloral honeys were used to build the dataset. The following ML algorithms were tested: penalized discriminant analysis (PDA), shrinkage discriminant analysis (SDA), high-dimensional discriminant analysis (HDDA), nearest shrunken centroids (PAM), partial least squares (PLS), C5.0 tree, extremely randomized trees (ET), weighted k-nearest neighbors (KKNN), artificial neural networks (ANN), random forest (RF), support vector machine (SVM) with linear and radial kernels, and extreme gradient boosting trees (XGBoost). The ML models were optimized by repeated 10-fold cross-validation, primarily on the basis of log-loss or accuracy metrics, and their performance was compared on a test set in order to select the best predicting model. Models built using PDA produced the best results in terms of overall accuracy on the test set. ANN, ET, RF and XGBoost models also provided good results, while SVM proved to be the worst.
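The model-comparison procedure described above (repeated 10-fold cross-validation over several classifiers on physicochemical features) can be sketched in Python with scikit-learn; the dataset, the two classifiers shown, and the CV settings below are illustrative stand-ins, not the study's actual data or full algorithm list:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Mock dataset standing in for the honey data: a handful of physicochemical
# features (conductivity, pH, water content, sugars, color) and 7 botanical classes.
X, y = make_classification(n_samples=350, n_features=8, n_informative=6,
                           n_classes=7, random_state=0)

# Repeated stratified 10-fold cross-validation, as described in the abstract.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
for name, model in [("RF", RandomForestClassifier(random_state=0)),
                    ("KKNN-like", KNeighborsClassifier(weights="distance"))]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(name, round(scores.mean(), 3))
```

The best cross-validated model would then be re-evaluated once on a held-out test set, mirroring the selection procedure in the abstract.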

**Keywords:** machine learning; unifloral honeys; botanical origin; physicochemical parameters; classification

#### **1. Introduction**

Honey is a natural food appreciated worldwide with high nutritional value that provides many health benefits [1,2]. Honey is defined by the European Union (EU) as "the natural sweet substance produced by *Apis mellifera* bees from the nectar of plants or from secretions of living parts of plants or excretions of plant-sucking insects or the living parts of plants, which the bees collect, transform by combining with specific substances of their own, deposit, dehydrate, store, and leave in honey combs to ripen and mature" [3]. The EU regulations concerning honey are included mainly in the 2001/110/EC Council Directive [4], further amended by the 2014/63/EU Directive [5]. The composition criteria for honey placed on the market or used in any product intended for human consumption are stated in Annex II of [4]. The Codex standard for honey, adopted by the Codex Alimentarius Commission in 1981 and revised in 1987 and 2001, has served as a basis for national legislation in some countries and has voluntary application [6].

**Citation:** Mateo, F.; Tarazona, A.; Mateo, E.M. Comparative Study of Several Machine Learning Algorithms for Classification of Unifloral Honeys. *Foods* **2021**, *10*, 1543. https://doi.org/10.3390/ foods10071543

Academic Editor: Sigfredo Fuentes

Received: 31 May 2021; Accepted: 29 June 2021; Published: 3 July 2021



Honey is a very rich food product that contains water, sugars (mainly fructose and glucose, but also di- and trisaccharides), hydroxymethylfurfural and other compounds at low levels, such as minerals, amino acids, proteins (including enzymes), aromatic acids, esters, aroma components and flavonoids [1,2].

Naturally, bees forage the flowers they can access. Hence, the honey produced mostly has a blend of flavors and is commonly sold on the market simply as honey or mixed-flower honey. However, when the nectar is taken predominantly from a single type of flower, the honey produced has characteristic organoleptic properties, adding to its commercial value. Many consumers appreciate these particular sensory properties very much, which increases these honeys' price with respect to other types of honey. Moreover, honeydew honeys are especially appreciated by consumers in Central Europe (Germany, Switzerland and Austria). Denominations of botanical origin are extensively used on the honey market as they offer consumers a choice among a variety of different typical products, with prices depending on local consumer preferences [7].

The existing international norms and regulations do not specify the characteristics of unifloral honeys, although limits for moisture content, sugar content or electrical conductivity differ for honeys of some botanical origins [2]. Authentication of food products is of great concern in the context of food safety and quality. In recent years, interest in honey authenticity in relation to botanical or geographical origin and adulteration has increased. Due to the huge variety of different floral sources normally attainable by bees for foraging, and to the great diversity within plant species, which is influenced by climatic and growing conditions, the parameters used for characterizing unifloral honeys do not exhibit typical values but are defined in rather large, often overlapping ranges [7].
The differences observed in honey composition depend on a variety of factors, such as the region, season, nectar source, beekeeping practices and harvest period [8].

Classically, the determination of the botanical origin of honeys has been performed by melissopalynological methods. The fundamentals of this methodology were established many years ago [9,10], and it is still in use today. Usually, honey is considered to come mainly from one plant if the pollen frequency of that plant is >45%. Pollen grains from anemophilous plants and plants with nectarless flowers are excluded from the calculation of the percentages. Moreover, pollen grains from some species are under- or over-represented in relation to the nectar their flowers yield. For unifloral honeys with under-represented pollen, the minimum percentage of the taxon that gives the honey its name ranges from 10% to 30%; for those with over-represented pollen, the minimum percentage can be 80–90%. This technique is useless when the honey has been filtered. In any case, interpretation of pollen analysis data may be difficult in some cases, and the counting and identification of pollen grains depend greatly on the skill and performance of the analyst [11]. Sensory properties (color, aroma, flavor) can help to ascribe a honey sample to a given botanical origin, but, due to subjectivity, well-trained personnel are needed. However, the sensory properties of a honey can vary with time and thermal treatment while maintaining the floral origin. Organoleptic properties, together with pollen analysis, have been considered key to the classification of unifloral honeys. Methods based on physicochemical properties of honey have been developed for the accurate classification of these honeys with the help of suitable statistical treatments [11]. Generally, no single parameter has proved useful to characterize the botanical origin of honey, except the methyl anthranilate content, which is characteristic of citrus honey [12,13]. 
Assayed parameters have been honey color (measured using CIE-1931 xyL or CIE-1976 LAB chromatic coordinates) [14], the carbohydrate profile [15–17], volatile organic compounds [16,18–22] or the amino acid profile [18,23].

Even when differences among honeys from distinct botanical sources are found using only a profile of a single class of compounds (sugars, amino acids, volatiles, etc.) or characteristics, a thorough characterization of the botanical origin of honeys is not achieved. Thus, sets of different parameters, either physicochemical or sensorial, or both, sometimes with the pollen spectrum and usually involving statistical (chemometric) techniques, such as cluster analysis, principal component analysis (PCA) and linear discriminant analysis (LDA), have been considered. Parameters tested with this aim include water

content, pH, acidity, electrical conductivity, some carbohydrates, color, volatile compounds, amino acids, phenolic compounds, mineral elements, etc. [11,22–26]. Even when the chemical composition of honey is associated with its botanical and geographical origin, some processes, such as heating, storing or the extraction techniques, can alter the initial volatile composition [22], which affects the volatile fingerprint of unifloral honeys and hence their organoleptic properties. Other classification approaches lie in the use of nondestructive techniques applied to honey samples. In this way, attenuated total reflectance Fourier-transform infrared spectroscopy (ATR-FTIR) of unifloral honeys is a technique that, after treatment by PCA and further treatment of the principal components by means of a machine learning (ML) algorithm such as support vector machine (SVM), proved useful for the characterization of honey origin [27]. The potential application of other spectroscopy techniques such as visible–near-infrared (VIS–NIR) hyperspectral imaging for the detection of honey flower origin using ML techniques has been reported [28]. PCA was used for dimensionality reduction before ML treatment using three ML algorithms, namely, radial basis function (RBF) network, SVM and random forest (RF), to predict honey floral origin. Furthermore, FT-Raman spectroscopy has been shown to be a simple, rapid and nondestructive technique that, in combination with proper PCA or LDA models, can be successfully adopted to identify the botanical origins of some honey types [29–31]. The same technique also proved useful for detecting adulterations of pure beeswax with paraffin or microcrystalline waxes [32]. Nuclear magnetic resonance (NMR) was used for the estimation of the botanical origin of honeys. Due to the complex nature of NMR data, multivariate analysis has been applied to extract the useful information [33]. 
The application of an electronic nose (E-nose) to parametrize the odor compounds in the form of numeric resistance and further treatment by k-nearest neighbor (k-NN) has been reported [34]. A commercial electronic tongue including seven potentiometric sensors has been applied for the classification of honeys. Botanical classification was performed by PCA, canonical correlation analysis (CCA) and artificial neural network (ANN) modeling on samples of acacia, chestnut and honeydew honeys [35]. The ML algorithms applied to the authentication of the botanical origin of honeys include ANN [35,36], classification and regression trees (CART) [37], k-NN, SVM and RF [27,28,34]. However, the use of ML techniques is still not widespread in honey research, and mixed approaches combining classical statistics with ML have been applied in some studies, as indicated in a recent review [38].

The aim of the present study was to carry out a comparative analysis of the application of several ML algorithms to identify the most useful ones for accurately classifying rosemary, citrus, lavender, sunflower, heather, eucalyptus and forest unifloral honeys harvested in Spain on the basis of some physicochemical properties (pH, moisture, electrical conductivity, sugars) and color. The classifier algorithms used for this goal were penalized discriminant analysis (PDA), high-dimensional discriminant analysis (HDDA), shrinkage discriminant analysis (SDA), nearest shrunken centroids (PAM), partial least squares (PLS), decision trees (C5.0 tree), extremely randomized trees or Extra Trees (ET), k-NN, SVM, RF and extreme gradient boosted tree (XGBoost).

#### **2. Materials and Methods**

#### *2.1. Honey Samples*

The analyzed Spanish honey samples were obtained from beekeepers and traders before processing. They belonged to seven unifloral origins, most of which are highly appreciated by consumers worldwide or in countries of Central Europe. They were rosemary (*Rosmarinus officinalis* L.), orange blossom (*Citrus* spp.), lavender (*Lavandula latifolia*, *L. angustifolia*, *L. vera*), sunflower (*Helianthus annuus* L.), heather/bell heather (Ericaceae, mainly *Erica* spp. and *Calluna vulgaris*), eucalyptus (*Eucalyptus globulus* and *E. camaldulensis*) and forest honeys. Honeys were harvested in different regions of Spain, excluding the Balearic and Canary Islands. The approximate coordinates of the areas related to honey harvest are longitude 3°19′ E–7° W and latitude 36° N–43° N. Orange blossom honey was harvested mainly in eastern Spain (Valencian Community) and some

provinces of Andalucía. Rosemary and lavender honeys were harvested mainly in central and southeastern Spain (Castilla-La Mancha, Castilla-León) during spring (rosemary) and summer (lavender). Heather and bell heather (in the following, heather) honey was harvested in many Spanish regions during March or September. Forest honey was mainly honeydew honey from holm oak (*Quercus ilex* L., *Q. rotundifolia* Lam., *Q. bellota* Desf.) grown in western Spain (Extremadura and Salamanca), and it was harvested during August/September or later. Eucalyptus honey was harvested mainly in provinces located in western Spain (Huelva, Extremadura) and northwestern Spain during September–October. Sunflower honey was collected during summer in southern/central Spain (Andalucía and Castilla-La Mancha). Samples were collected in different years from 2010 to 2014.

Samples were screened by microscopic analysis and sensory (color, aroma, taste) assessment as soon as they arrived at the laboratory. When analysis had to be delayed for more than four weeks, they were stored at −20 °C; otherwise, they were stored at 4–6 °C in the dark.

The samples were assessed microscopically for pollen and honeydew elements (HDE), which are mainly unicellular algae, fungal spores and hyphae. HDE/pollen (from nectariferous plants) ratios higher than three are required for honeydew honeys according to Louveaux et al. [10]. However, an HDE/pollen ratio of 1.5 ± 1.2 (0.3–4) was reported in 167 honeydew honeys from different places in Europe [13]. Studies performed in Spain have found rather low values for this index in oak honeydew honey [39]. Melissopalynology seems not to be useful for classification of Spanish oak honeydew honeys [40]. Other required parameters associated with honeydew honeys are electrical conductivity values >800 µS/cm [4,6] and pH values >4.3 [13], besides acceptable sensory assessment (dark amber color, characteristic taste, lack of crystallization tendency). It is known that in rosemary, lavender and citrus honeys, pollen from the flowers of these plants is not dominant. After screening, some samples were rejected because they could not be considered unifloral. The number of honey samples collected before the initial screening and the number of samples eventually selected for all physicochemical analyses and statistical treatments are indicated below, with the number of selected samples in parentheses: 27(13) from rosemary, 31(13) from heather, 35(16) from orange blossom, 33(16) from forest, 19(14) from lavender, 23(14) from eucalyptus and 33(14) from sunflower.

#### *2.2. Microscopical Analysis*

Microscopical analysis of honey sediment was performed according to the methods of melissopalynology [10] and the Spanish official methods of analysis for honey [41]. Slides were prepared without acetolysis. Briefly, graduated conical centrifuge tubes containing 10 g of homogenized honey dissolved in 20 mL of dilute sulfuric acid were centrifuged for 10 min at 2500 rpm. The supernatant was discarded, and the sediment was washed twice with 10 mL of distilled water and centrifuged. After discarding the supernatant, the sediment was homogenized, and an aliquot was placed on a glass slide, spread over an area of 4 cm<sup>2</sup>, dried at 40 °C and mounted with stained glycerin-gelatin. Pollen grains were identified by light microscopy with the aid of a non-acetolyzed pollen collection and microphotographs from specialized studies. Usually, 350–500 grains were counted, and they were classified into the following frequency classes: dominant pollen (>45% of the pollen grains counted); secondary pollen (16–45%); important minor pollen (3–15%); minor pollen (1–3%); and present (<1%). For forest honey, HDE were counted apart from pollen grains.
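The pollen frequency classes above can be captured in a small helper. This is purely an illustrative sketch (the study reports no code), and since the quoted ranges touch at their boundaries (e.g., 1–3% vs. 3–15%), the exact cutoff handling below is an assumption:

```python
def pollen_frequency_class(percentage: float) -> str:
    """Map a pollen-count percentage to the melissopalynological
    frequency classes given in the text. Boundary handling at 16, 3
    and 1 is an assumption, as the quoted ranges overlap slightly."""
    if percentage > 45:
        return "dominant pollen"
    elif percentage >= 16:
        return "secondary pollen"
    elif percentage >= 3:
        return "important minor pollen"
    elif percentage >= 1:
        return "minor pollen"
    return "present"
```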

#### *2.3. Electrical Conductivity*

Electrical conductivity was measured at 20.0 ± 0.1 °C in a 20% (*w*/*v*) solution of honey (dry matter basis) in deionized water with electrical conductivity of <1 µS/cm [41] using a Crison model 525 conductimeter (Crison Instruments, Barcelona, Spain). The cell was previously calibrated at 20.0 °C with a 0.01 M KCl solution. Measurements were carried out in quintuplicate.
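Preparing a 20% (w/v) solution on a dry matter basis means weighing an amount of honey whose dry matter equals 20 g per 100 mL, which depends on the moisture of each sample. A minimal sketch of that calculation (the function name, and this reading of the dry-matter-basis instruction, are assumptions, not taken from the text):

```python
def honey_mass_for_conductivity(moisture_percent: float,
                                volume_ml: float = 100.0) -> float:
    """Grams of honey to weigh so that the solution contains 20 g of
    honey dry matter per 100 mL (20% w/v on a dry matter basis)."""
    dry_fraction = 1.0 - moisture_percent / 100.0
    return 20.0 * (volume_ml / 100.0) / dry_fraction
```

For example, a honey with 17% moisture requires about 24.1 g per 100 mL rather than 20 g.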

#### *2.4. Water Content*

Water content was determined at 20.0 °C by refractometry according to [41]; this method matches the AOAC 969.38B method [42] cited in [6]. A Bellingham and Stanley standard model Abbe-type refractometer previously calibrated and connected to a thermostatic bath was used. The Chataway tables revised by Wedmore were used to convert refraction indices to percentage of water [41]. Measurements were carried out in triplicate.

#### *2.5. pH Measurement*

Measurements of pH were performed at 20.0 ± 0.1 °C in a 10% (*w*/*v*) solution of honey in freshly boiled distilled water using a Crison micropH 2000 pH-meter (Crison Instruments, Barcelona, Spain). The pH-meter was calibrated with buffers of pH 6.50 and 3.00 just before measurements, which were conducted in triplicate.

#### *2.6. Color*

Crystallized honey samples were liquefied by heating at 50–60 °C and then left to cool at room temperature. The color of liquid honeys was determined by measuring transmittances at 30 selected wavelengths on a Shimadzu UV-vis 240 spectrophotometer (Shimadzu Co., Tokyo, Japan). The x, y and L chromatic coordinates from the CIE-1931 (xyL) color system [43] were calculated from the tristimulus values [14]. Transmittance measurements were conducted in triplicate.

#### *2.7. Sugars*

Sugars were determined by gas chromatographic (GC) separation of the trimethylsilyl (TMS) derivatives (TMS oximes and TMS ethers in the case of non-reducing sugars) in an OV-17 packed column on a Perkin-Elmer Sigma 3 gas chromatograph equipped with a flame ionization detector (FID) (Perkin-Elmer Co., Norwalk, CT, USA) [15,41]. The sugars determined were fructose, glucose, sucrose, kojibiose, isomaltose and maltose. The maltose figure includes not only maltose but also nigerose and turanose due to peak overlapping. The fructose/glucose and glucose/water ratios were also calculated. The trisaccharides raffinose, erlose and melezitose were estimated, but due to uncertainty in determination or lack of detection, they were not used in classification algorithms. Analyses were conducted in triplicate.
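The fructose/glucose and glucose/water ratios used as model inputs follow directly from the measured contents. A trivial helper (hypothetical name; inputs assumed in g per 100 g of honey) might look like:

```python
def honey_sugar_ratios(fructose: float, glucose: float,
                       water: float) -> dict:
    """Derive the two ratio variables used as classifier inputs from
    contents expressed in g per 100 g of honey."""
    return {
        "fructose/glucose": fructose / glucose,
        "glucose/water": glucose / water,
    }
```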

#### *2.8. Classification Using Statistical Multivariate and Machine Learning Algorithms*

Once the dataset with the values for all the inputs was obtained, several algorithms were applied to compare their performance in classifying the samples as accurately as possible. First, in an exploratory analysis, a correlation matrix was obtained. Then, PCA, a well-known unsupervised statistical method, was used to examine the data; it identifies orthogonal directions of maximum variance in the dataset, in decreasing order, and projects the data onto a lower-dimensional space formed by a subset of the components with the highest variance. The orthogonal directions are principal components, which are linear combinations of the original input variables. It is a method for feature reduction that transforms the original independent variables into new axes. PCA helps to identify patterns in data and express them in such a way as to highlight their similarities and differences [27,38]. The k-means algorithm was also applied to the dataset to find clusters among the samples.
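The PCA step described here can be sketched in a few dozen lines. The study used R; the following stdlib-Python sketch centers and scales the variables and extracts the leading principal direction by power iteration on the covariance matrix, which is one simple way to obtain PC1 (a full PCA would deflate the matrix and iterate for further components):

```python
import math

def center_scale(X):
    """Center each column to zero mean and scale to unit sample variance."""
    n = len(X)
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in X]

def first_principal_component(X, iters=200):
    """Leading eigenvector of the covariance matrix via power iteration."""
    Z = center_scale(X)
    n, p = len(Z), len(Z[0])
    # covariance matrix of the standardized data (p x p)
    C = [[sum(Z[k][i] * Z[k][j] for k in range(n)) / (n - 1)
          for j in range(p)] for i in range(p)]
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

On two strongly correlated variables, PC1 loads both roughly equally, as expected.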

The first supervised classification approach was to randomly select most of the dataset (70%) for training, using 10-fold cross-validation to validate the models and, after finishing training and optimizing the parameters, to utilize the remaining dataset (30%) to test the ability of the best model to accurately classify the samples into their a priori labeled parent classes. Statistical multivariate algorithms applied to this goal were the following: penalized discriminant analysis (PDA) [44], which is a penalized version of Fisher linear discriminant analysis (LDA); the K-NN algorithm, which assumes the similarity between the new sample and available samples within a distance K and

places a new sample into the class that is most similar among the available classes. It is non-parametric, i.e., it does not make any assumption on the underlying data. In fact, the algorithm used was a weighted version of K-NN included in the KKNN method of "caret" [45,46]. Other classifiers tested were high-dimensional discriminant analysis (HDDA) [47], which is a model-based discriminant analysis method that assumes that each class of the dataset resides in a proper Gaussian subspace that is much smaller than the original one, and the function calculates the parameters of each subspace to predict the class of new observations; nearest shrunken centroids (NSC) [48,49], also known as prediction analysis for microarrays (PAM); C5.0 tree, which is a popular implementation of decision trees; partial least squares (PLS), a multivariate linear regression method that can deal with a large number of predictors, a small sample size and high collinearity among predictors and acts by forming linear combinations of the predictors in a supervised manner; extremely randomized trees (ET) [50]; and shrinkage discriminant analysis (SDA) [51], an algorithm that determines a ranking of predictors by computing CAT scores (correlation-adjusted t-scores) between the group centroids and the pooled mean. Variables were preprocessed and scaled before treatment. Several metrics were used for tuning the algorithm parameters during training/validation. 
They were logistic loss (log loss), also known as cross-entropy loss, which is a classification loss function that tends to a minimum (the lower limit is zero, but there is no upper limit) as model performance increases, meaning that the objective is to minimize the expected loss or risk; accuracy, which increases (the upper limit is 1) as the performance of the classifier increases, that is, when labeled samples are included in their a priori known classes; area under the curve (AUC), which computes the area under the receiver operating characteristic (ROC) curve and also increases (value range 0–1) with classifier performance; Cohen's kappa; sensitivity; specificity; and precision, among others. However, log loss was primarily used to measure performance and build the best model across training/cross-validation, except for algorithms where this function is not implemented.
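Log loss and accuracy as described here can be written compactly. A stdlib-Python sketch (the study used caret's implementations in R; the dict-based probability representation below is an assumption made for illustration):

```python
import math

def log_loss(y_true, probas, eps=1e-15):
    """Multiclass cross-entropy: mean negative log of the predicted
    probability of the true class. Lower is better; a perfect
    classifier approaches 0, and there is no upper limit."""
    total = 0.0
    for label, p in zip(y_true, probas):
        q = min(max(p[label], eps), 1.0 - eps)  # clip to avoid log(0)
        total -= math.log(q)
    return total / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of samples placed in their a priori known class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

A classifier that always assigns probability 0.5 to the true class scores log loss = ln 2 ≈ 0.693, while confident correct predictions drive the loss toward zero.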

Other ML algorithms included in the comparison were ANN (neuralnet library) [52], SVM with linear kernels (SVM library), which is a well-known classifier [53,54], RF (randomForest library) [55], XGBoost (XGBTree library) [56] and extremely randomized trees (ET) [50]. The software used was R and the "classification and regression training" (caret) package [46]. The confusion matrices express the number of samples accurately classified into their parent class or otherwise. To conduct a fair comparison of the different algorithms, the dataset partitions into training and test sets were identical.

#### **3. Results**

#### *3.1. Honey Dataset*

For the selected honey samples after microscopy analysis and sensory assessment, a dataset was built using the mean values of each determination. The dataset is summarized in Figures S1 and S2. After microscopic analysis, the pollen count was related to nectariferous plants. The box plots of percentages of pollen from the taxa that give the names to the studied unifloral honeys are shown in Figure S2g. Selected samples were as follows: Rosemary honeys that had 20–77% pollen from *R. officinalis* were considered acceptable, as this is known to be an under-represented pollen. Orange blossom or citrus honeys had a percentage of *Citrus* spp. pollen in the range 10–46%, except in one sample (80%). Citrus honeys are considered unifloral if the pollen of *Citrus* spp. is >10% because it is considered under-represented. Lavender honeys showed a percentage of *Lavandula latifolia* or *L. spica* in the range 15–68%. Pollen from *L. stoechas* was usually absent. Additionally, in this honey class, the pollen is considered under-represented. Sunflower honeys had pollen of *H. annuus* in the range 31–82%. Eucalyptus honeys contained 82–98% pollen of *Eucalyptus* spp. High counts in this case are usual because *Eucalyptus* pollen is over-represented. Heather honeys encompassed pollen from *Erica* spp. in the range 48–80% (Figure S2g). For forest honey, which is mainly honeydew honey, pollen from *Quercus* spp., although always present, is of no interest, as previously commented, because these flowers are non-nectariferous; nevertheless, the samples were always examined microscopically for HDE and the presence

of pollen from other taxa. HDE presence was scarce. Concerning organoleptic properties, rosemary and orange honeys displayed a light amber color and had a characteristic aroma and taste. Lavender and eucalyptus honeys were light amber but darker than orange or rosemary honeys and had a characteristic aroma and taste. Sunflower honeys had a characteristic bright golden-amber color with a yellow hue and a slightly tart aroma and crystallized easily, producing fine crystals. Heather honeys were amber/dark amber with a reddish hue and had a characteristic intense aroma and sour taste and a tendency to crystallize. Forest honeys were dark amber or darker, had an intense flavor, were slightly bitter and sour and remained liquid for months even in cool conditions.

Figures S1 and S2 show the large variability of the data. Some rosemary and citrus honeys had a high moisture percentage, and the lowest water levels were found in eucalyptus and forest honeys, while the remaining honey types exhibited intermediate water contents (Figure S1a). All lavender honeys were well below the 15% limit for sucrose established in the EU Council directive [4]. Forest and heather honeys showed the highest values of electrical conductivity, followed by eucalyptus and lavender/sunflower honeys, while rosemary and citrus displayed the lowest values for this parameter (Figure S1). The pH was also higher in forest and heather honeys than in the remaining honeys. The highest contents of fructose and glucose and the highest glucose/water ratio were found in sunflower honeys, which also had the lowest fructose/glucose ratio; forest honeys showed the lowest levels of both fructose and glucose. On the contrary, maltose, isomaltose and kojibiose contents and the fructose/glucose ratio reached the highest values in forest honeys (Figures S1 and S2). Concerning the color parameters, the largest x values were observed in heather honeys followed by forest honeys, and the minimum x values were observed in rosemary and citrus honeys. However, heather honeys had the lowest mean value for the y and L chromatic coordinates. The largest mean y value was exhibited by sunflower honeys, and the largest mean L value was observed in citrus honeys (Figure S2).

#### *3.2. Statistical and ML Algorithms*

A multivariate statistical study of the dataset was carried out initially. Variables were centered and scaled before statistical treatments. Correlations, PCA and data clustering were performed. A diagram including correlations between all the variables is shown in Figure 1. A score plot of the two principal components can be observed in Figure 2. PC1 and PC2 account for 45.27% and 20.46% of the variance, respectively (overall 66.73%). Heather and forest honeys spread along the positive side of PC1. Sunflower honeys extend on the negative side of PC1, but on the positive side of PC2. All citrus and most rosemary honeys are on the negative side of both PC1 and PC2. Eucalyptus honey samples spread on the positive side of PC2, and most of them are on the negative side of PC1, while most lavender honeys fall on the negative side of PC1, but they spread on both the positive and negative sides of PC2.

As another unsupervised way to explore the dataset, k-means clustering [57,58] was run to partition the data into a number of clusters using the library "factoextra" in R. All the input variables were taken into account. Two clusters of sizes 70 and 30 were obtained on the basis of the maximum average silhouette width (Figure 3). However, this number of clusters is an estimate and does not mean that only two clusters may exist. The mean values for the variables in each of these two clusters (corresponding to the centroids) are listed in Table S1. Cluster 2 is smaller than cluster 1 and has higher mean values of electrical conductivity, pH, disaccharides (except sucrose), fructose/glucose ratio and the x chromatic coordinate. When comparing Figures 2 and 3b, cluster 2 seems to encompass forest and heather honeys and cluster 1 the remaining honeys. Forcing the k-means clustering to display seven groups on a two-dimensional plot leads to highly overlapping clusters (Figure S3). The relative importance of the variables was tested using an RF model as a reference (Figure S4). The most important variables are electrical conductivity, the chromatic coordinates, water content, fructose and glucose. The least important variables are the glucose/water and fructose/glucose ratios. The Boruta

package [59] was applied, and it indicated that no variables needed to be removed, regardless of their relative importance.
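The k-means procedure itself is simple enough to sketch. The study used the R library "factoextra"; the following stdlib-Python version of Lloyd's algorithm is only an illustration of the underlying method (random initialization; the silhouette-based choice of k is not shown):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: assign each point to the nearest
    centroid, then move each centroid to the mean of its points."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # empty clusters keep their previous centroid
        new = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # converged
            break
        centroids = new
    labels = [min(range(k), key=lambda c: sum(
        (a - b) ** 2 for a, b in zip(p, centroids[c]))) for p in points]
    return centroids, labels
```

With well-separated groups, the algorithm recovers the grouping regardless of the random start.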

**Figure 1.** Correlation chart among the 14 predictor variables for the whole dataset. The number in each square is rounded to one figure. The color scale at the right indicates color meaning. Red color means positive correlation; blue color means negative correlation.

**Figure 2.** Principal component score plot based on the 14 variables of 100 honey samples according to the botanical origins.

Using the approach of supervised modeling, different classifier algorithms were applied to the dataset, which was divided into a training set (70%) and a test set (30%). Ten-fold cross-validation was applied during training with four repetitions. Various metrics (log loss, accuracy, AUC, kappa, sensitivity, specificity, precision, etc.) can be used during training to tune the key parameters of the algorithms in order to find the best ones. The absolute values of metrics vary when training is repeated. Among them, the log loss metric was usually chosen to select the optimal model using the smallest value. For KKNN or KNN (as weights were not relevant), the final value of the tuning parameters used for the optimized model was kmax (maximum number of neighbors) = 5 (Table 1).
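For intuition about the tuned k-NN classifier, a distance-weighted prediction can be sketched as follows. Note that this uses plain inverse-distance weights, not the kernel weighting of the KKNN implementation actually used in the study, and the function name is hypothetical:

```python
import math
from collections import defaultdict

def weighted_knn_predict(train_X, train_y, x, kmax=5):
    """Distance-weighted k-NN: the kmax nearest training samples vote,
    each with weight 1/(distance + epsilon), so closer neighbors
    count more (epsilon guards against division by zero)."""
    nearest = sorted(
        (math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y)
    )[:kmax]
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / (d + 1e-9)
    return max(votes, key=votes.get)
```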

**Figure 3.** Optimal number of clusters by k-means using the average silhouette width (**a**) and clustering of honey samples by k-means algorithm in two clusters where the two largest symbols are the centroids of each cluster (**b**).


**Table 1.** Model optimization using the classifier algorithms. Log loss values are means of 10-fold cross-validation.

KKNN: weighted k-nearest neighbors; PDA: penalized discriminant analysis; HDDA: high-dimensional discriminant analysis; SDA: shrinkage discriminant analysis; PAM: nearest shrunken centroids; PLS: partial least squares; ET: extremely randomized trees.

For the PDA algorithm, the optimal lambda value was 0.1. For HDDA, the best model had a threshold of 0.300, but this algorithm is not robust and other repetitions led to a different configuration; for SDA, the lowest log loss was obtained with lambda = 0.05, and for PAM, the best model had a threshold = 0.70615. This value can change slightly if the whole treatment is repeated. With PLS, log loss was also used to select the optimal model using the smallest value, and the final value selected for the model was as follows: number of components (ncomp) = 3 (Table 1).

The box plots for the three main metric parameters log loss, accuracy and kappa for eight classifiers can be observed in Figure 4.

**Figure 4.** Box plots of log loss, accuracy and kappa values for various machine learning (ML) algorithms after training with 10-fold cross-validation to obtain the best model using the training dataset. Black circles symbolize mean values. PLS: partial least squares; C5TREE: C5.0 tree; PAM: nearest shrunken centroids; KNN: weighted k-nearest neighbors; ET: extremely randomized trees; SDA: shrinkage discriminant analysis; PDA: penalized discriminant analysis; HDDA: high-dimensional discriminant analysis.

ANN (single-layer perceptron) was applied to the training set with 10-fold cross-validation. The training process evaluated from 1 to 20 hidden units (neurons) and weight decays from 0.1 to 0.5. After optimization of the tuning parameters to maximize the validation accuracy, the best model had 17 hidden units and weight decay = 0.1 (Figure 5). As can be observed, the variability of accuracy with more than five hidden units is low, ranging from 0.85 to 0.91. This means that repetitions of the treatments can produce different topologies with very similar accuracy.

**Figure 5.** Change in the artificial neural network (ANN) accuracy during training with 10-fold cross-validation with the number of hidden units (nodes) and weight decay.

The accuracy of the SVM with linear kernels (SVML) algorithm during training with 10-fold cross-validation was maximized with a cost value of C = 2.5 (Figure 6). The largest value of the accuracy (0.61) was relatively low. In an attempt to improve SVM, we tested SVM with radial basis function kernels (SVMR). The final values used for the SVM<sup>R</sup> model were sigma = 0.1 and C = 0.5 (Figure 6). The accuracy was 0.562, meaning it was not improved. However, the cost value of these algorithms varied considerably across repeated treatments, even when the same partition ratio was maintained.

Figure 7 shows the variation in RF accuracy throughout training with 10-fold cross-validation. The final values for the RF model were as follows: number of variables randomly sampled as candidates at each split (mtry) = 8; the number of trees (ntree) parameter was 500; the maximum accuracy was 0.9246.


**Figure 6.** Change in accuracy of (**a**) Support vector machine with linear kernel (SVML) and (**b**) Support vector machine with radial kernel (SVMR) during training with 10-fold cross-validation (CV) with the cost function.

**Figure 7.** Change in the accuracy of random forest (RF) models with the number of randomly selected predictors (mtry).

The XGBoost tree algorithm has many parameters to tune, although, usually, some of them are held constant. Figure 8 shows the variation in some tuning parameters during the training process. The largest accuracy was obtained with "subsample" = 0.5, "shrinkage (eta)" = 0.1 and "max tree depth" = 4. Other final values for the model were "nrounds" = 200, "gamma" = 0, "colsample\_bytree" = 0.8 and "min\_child\_weight" = 1.

**Figure 8.** Change in the accuracy of XGBoost algorithm during training with 10-fold cross-validation with the parameters "max tree depth", "shrinkage (eta)" and "subsample". Tuning parameters "nrounds", "gamma", "colsample\_bytree" and "min\_child\_weight" had constant values of 200, 0, 0.8 and 1, respectively.

The confusion matrices produced by all the ML models on the test set (30 samples) are listed in Tables 2–4. These matrices show the true botanical origin (Reference) in the columns and the predicted classification (Prediction by the models) in the rows. The ideal situation is to have all the samples located on the diagonal cells of the matrix, which would mean that the accuracy is 100%. The overall accuracies obtained with the PDA, SDA, ET, PLS and C5.0 tree algorithms were 90.00%, 86.67%, 86.67%, 73.33% and 76.67%, respectively (Table 2). The overall accuracies obtained with the KKNN, PAM, HDDA, ANN and RF algorithms were 83.33%, 83.33%, 83.33%, 86.67% and 80.00%, respectively (Table 3). The overall accuracies obtained with SVM with linear kernels (SVML), SVM with radial kernels (SVMR) and XGBoost were 66.67%, 60.00% and 90.00%, respectively (Table 4).
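The overall accuracy read off these confusion matrices is simply the diagonal sum divided by the total number of test samples (e.g., 27 of 30 correctly classified samples gives the 90.00% reported for PDA). A minimal helper, assuming a dict-of-dicts layout with prediction rows and reference columns mirroring the tables:

```python
def overall_accuracy(confusion):
    """Overall accuracy from a confusion matrix stored as
    {predicted_class: {reference_class: count}}: the sum of the
    diagonal cells over the total number of samples."""
    correct = sum(confusion[c].get(c, 0) for c in confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total
```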

**Table 2.** Confusion matrices of various classifier algorithms (PDA, SDA, ET, PLS, C5.0 tree) on the test set. The number of honey samples in this set was as follows: citrus (5), eucalyptus (4), forest (5), heather (4), lavender (4), rosemary (4) and sunflower (4).



**Table 3.** Confusion matrices of various classifier algorithms (KKNN, PAM, HDDA, ANN and RF) on the test set. The number of honey samples was the same as that indicated in Table 2.

An accuracy of 100% was not obtained with any model. The largest accuracy was provided by the models obtained with PDA and XGBoost (90%), followed by SDA, ET and ANN. The lowest accuracy was provided by SVM, especially SVMR, which failed to correctly classify all heather, lavender and rosemary honeys. All models correctly classified sunflower honeys, and most of them (11) correctly classified all forest honeys. Ten models correctly classified the four heather honeys. Seven models correctly classified all citrus honeys. Only XGBoost correctly classified all four rosemary honeys into their parent group; no model was able to correctly classify all the lavender honeys. Some lavender honeys were classified as sunflower by XGBoost, SVM, ANN, PAM, RF, HDDA, KKNN, C5.0 tree, PLS and PDA. Other lavender honeys were classified as eucalyptus honeys.

To test the robustness of the overall accuracies, the classifications were repeated three more times (with the samples in the training and test sets changing randomly) while maintaining the same splitting ratio (70/30). Box plots of the metrics (log loss, accuracy and kappa) of some optimized models are shown in Figure S5. The overall mean accuracies of all the models on the test sets after four repetitions of the whole process (training/10-fold cross-validation), including the results shown above (Tables 2–4), are summarized in Table 5.
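The repeated-split procedure amounts to drawing a fresh random 70/30 partition, retraining, and averaging the test accuracies. A minimal Python illustration follows; the study used R with the caret package, and the class-stratified sampling shown here (keeping class proportions similar in both sets) is an assumption, not a detail stated in the text:

```python
import random

def stratified_split(labels, test_frac=0.30, seed=None):
    """Return (train_idx, test_idx) with roughly the same class
    proportions in both sets (a 70/30 split, as in the paper)."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_frac))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

# class counts borrowed from the paper's dataset description
labels = ["citrus"] * 16 + ["lavender"] * 14 + ["forest"] * 16
train, test = stratified_split(labels, seed=1)
print(len(train), len(test))  # 32 14
```

Repeating the call with different seeds and averaging the resulting test accuracies reproduces the "four repetitions" protocol summarized in Table 5.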

**Table 4.** Confusion matrices of various classifier algorithms (SVML, SVMR and XGBoost) on the test set. The number of honey samples was the same as that indicated in Table 2.


**Table 5.** Overall accuracy of ML models for classification of honey samples in the test sets.


As deduced from the results in Table 5, the PDA algorithm had the highest mean overall accuracy on the test set (86.67%), followed by ANN (85.84%) and by ET, RF and XGBoost (84.17%). The worst performance came from SVML and SVMR (≤60%). The most stable algorithm was C5.0 tree.

When all samples are used for training with 10-fold cross-validation, without setting aside a test set, the results are much better for all the models. The training was performed similarly to the splitting case, using the same metrics (log loss, accuracy, kappa) to obtain the best models (Figure S6). This approach is sometimes found in the literature concerning honey classification, but overfitting is usually a problem. The overall accuracies in this case, according to the confusion matrices (Table S2), were 100% for ET, RF and XGBoost, 97% for PDA and ANN, 95% for C5.0 tree, 92% for SDA, 91% for PAM, 90% for KKNN, 87% for HDDA, 69% for PLS, 66% for SVMR and 56% for SVML.
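The index bookkeeping behind 10-fold cross-validation without a held-out test set can be sketched as follows (plain Python, illustrative only). Because every sample eventually contributes to training some fold's model, accuracies estimated this way tend to be optimistic compared with a truly independent test set, which is the overfitting concern raised above:

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

# 100 honey samples split into 10 folds of 10
splits = list(kfold_indices(100, k=10))
assert all(len(val) == 10 for _, val in splits)
```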

#### **4. Discussion**

A variety of factors can influence the variability observed in the dataset. For example, early harvest to increase the amount of honey, especially of citrus or rosemary, may lead to unripe, very clear products that do not meet legal requirements even though they can be unifloral. Beekeepers or traders can store these crops or blend early crops with late ones to obtain acceptable products. Early harvest can also affect the sugar content, because unripe citrus honey may contain more than 20% sucrose; such honeys cannot be placed directly on the market, although they may be blended with riper honeys of the same class to comply with regulations. Late harvest may affect these variables as well: a larger amount of pollen and more variability in the pollen spectrum are expected, because bees keep gathering nectar, honeydew and pollen from all available flowers, so it may be very difficult to obtain unifloral honeys. The longer honey remains inside the combs, the riper it is expected to be, with a very low amount of sucrose; the contents of fructose and glucose may vary slightly or increase, except in the case of honeydew honey, and a safe moisture level will be reached. Thus, beekeepers should strike a good balance between early and late harvest to obtain unifloral honeys of the best quality.

In the present study, different ML algorithms implemented in R with the caret package were applied to the same dataset of honeys belonging to seven classes, each of a single botanical origin, collected in Spain. The initial dataset was a matrix of 100 honey samples (rosemary 13, citrus 16, lavender 14, eucalyptus 14, heather 13, sunflower 14 and forest 16) and 14 physicochemical features (water content, electrical conductivity, pH, sugars and colorimetric coordinates). It was partitioned into a training set and a test set.

The first approach to analyzing the dataset was unsupervised. The data were analyzed by PCA and k-means; k-means clustering revealed two broad clusters with 70 and 30 samples, too broad to match the actual honey classes. All variables were used because Boruta considered all of them important. Then, supervised ML approaches were tested. The classifier algorithms applied were KKNN, PDA, PLS, PAM, HDDA, SDA, C5.0 tree, ET, ANN, SVML, SVMR, RF and XGBoost.
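As a reminder of what the unsupervised step computes, a plain Lloyd's-algorithm k-means can be sketched in a few lines of NumPy. The synthetic 2-D data below stand in for the 14 honey features, and this is an illustration only (the study presumably used an R implementation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means (Lloyd's algorithm): alternate assignment of
    samples to nearest centroids with centroid recomputation."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each sample to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster empties
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# two well-separated synthetic clusters in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
labels, centroids = kmeans(X, k=2)
```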

The metric used for optimizing most models was log loss. Other metrics, such as accuracy or kappa, were also calculated and used instead of log loss for ANN, RF and XGBoost. After training on the same randomly selected dataset (70 samples) and finding the optimal configuration using 10-fold cross-validation, the ability of all models to accurately classify the test samples into their parent classes was compared. All treatments were repeated four times under the same conditions, with the samples randomly redistributed between both sets each time. The performance was not constant among repetitions, so the mean accuracy was considered. The best results were provided by the PDA classifier, which on average classified the unifloral honeys in the test set into their parent types with 86.67% overall accuracy. Good results were also obtained with ANN, ET, HDDA, RF and XGBoost, while SVML and SVMR proved to be the worst. The honeys with the best chance of being correctly classified were sunflower, forest, heather, eucalyptus and citrus. Rosemary honeys were hard to classify correctly, and lavender honeys were the most difficult of all. A low number of samples in the test set can hinder good predictions, especially with samples that have a low percentage of pollen of the putative taxa or contain pollen from other nectariferous plants. This happens with lavender, rosemary and citrus honeys, whose pollen is under-represented [13], and it is a problem because these honeys are highly appreciated by consumers [60]. It has been reported that pollen analysis can be of limited usefulness for labeling lavender honeys, and analysis of volatiles should be considered as a complementary technique when samples show the characteristic organoleptic properties [21,61].
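Multiclass log loss, the metric used to select most of the optimal configurations, is the mean negative log probability that a model assigns to the true class; unlike accuracy, it rewards well-calibrated confidence. A minimal sketch with hypothetical class probabilities:

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Multiclass log loss: mean negative log probability assigned
    to the true class. Lower is better; 0 means perfect confidence."""
    total = 0.0
    for true_class, p in zip(y_true, probs):
        total -= math.log(max(p[true_class], eps))
    return total / len(y_true)

# three samples; probs[i] maps class -> predicted probability
probs = [
    {"citrus": 0.8, "forest": 0.1, "lavender": 0.1},
    {"citrus": 0.2, "forest": 0.7, "lavender": 0.1},
    {"citrus": 0.3, "forest": 0.3, "lavender": 0.4},
]
print(log_loss(["citrus", "forest", "lavender"], probs))  # ≈ 0.4987
```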

Application of supervised ML algorithms to the classification of unifloral honeys by botanical origin is an issue of interest. This classification is based on labeling the studied samples into classes according to melissopalynology and organoleptic properties. However, microscopic analysis is time-consuming, requires highly specialized personnel and is unable to detect seasonal variation in pollen amounts or fraudulently added pollen.

Former attempts at classification were conducted using multivariate statistical discriminant techniques applied to physicochemical features [11], with rather good results. Anjos et al. [36] investigated different ANN configurations to classify the botanical origin of 49 honey samples, using measurements of moisture, electrical conductivity, water activity, ash content, pH, free acidity, colorimetric coordinates and total phenol content as input variables. They concluded that the botanical origin of honey can be reliably and quickly determined from the colorimetric information and the electrical conductivity of the honey, which agrees with our results. Another report [24] presented the results obtained with a similar set of variables, extended with a large phenolic profile, to classify acacia, tilia (linden), sunflower, honeydew and polyfloral honeys of Romanian origin (50 samples), labeled by pollen analysis, into their parent classes using LDA and ANN as classifiers. LDA correctly classified 92.0% of the samples, and an ANN with two hidden layers classified 94.8% of the honey samples into their botanical origin. However, all samples from each class were used to reach these accuracy rates. In the present paper, a test set was used to calculate the percentage of correctly assigned origins, and when all the samples were used we obtained even higher accuracy rates. Popek et al. [37] were able to correctly classify nearly all their samples according to botanical origin using CART; they obtained good results using all 72 samples under treatment (9 samples × 8 classes: rape, acacia, heather, linden, buckwheat, honeydew, nectar-honeydew and multifloral honeys).

Authentication of honey origin using ML algorithms and nondestructive analytical techniques has also been reported. ATR-FTIR spectra of 130 Serbian samples belonging to acacia, linden and sunflower honeys were treated by SVM, and the predictability rate was high [27], although there were only three classes and the method of sample labeling was not described. In our treatment, SVM was not useful. Ciulu et al. [62] reported on the use of ATR-FTIR spectra processed by RF for the same aim. Eighty samples belonging to four different floral origins were considered: strawberry tree, asphodel, thistle and eucalyptus. Training an RF on the IR spectra achieved an average accuracy of 87% in a cross-validation setting, approximately the same accuracy rate obtained in our study with different variables. FT-Raman spectra combined with PCA or LDA have also proved useful to classify monofloral honeys with a high degree of accuracy [29–31].

Using NIR (850–2500 nm), classification of 119 Italian honey samples encompassing acacia, linden and chestnut unifloral honeys and multifloral honeys was attempted by applying PLS and SVM with linear kernels, with cross-validation, to the NIR spectra [63]. Pollen analysis was not used for labeling. SVM provided better classification scores than PLS, contrary to our case. An additional approach was to apply Boruta for feature selection, but the accuracy was not improved. The dataset was not split into a training/CV set and an independent test set, meaning that the confusion matrices included all samples, which obviously improves the success rate. Linden honeys failed to be correctly classified, which might be due to the low number of samples of that class. NIR was also the source of input variables to classify five types of Chinese unifloral honeys by application of Mahalanobis distance discriminant analysis (MD-DA) and a backpropagation artificial neural network (BP-ANN) [64]. With the MD-DA model, overall correct classification rates were 87.4% and 85.3% for the calibration and validation samples, respectively, while the ANN model achieved total correct classification rates of 90.9% and 89.3% for the calibration and validation sets, respectively. Pollen analysis was not employed for origin assignation. Minaei et al. [28] used VIS–NIR hyperspectral images of 52 samples of five classes of unifloral honeys and, after a reduction in dimensionality, applied RBF networks (a type of ANN with several distinctive features), RF and SVM for classification. The test set had 20 samples, and the remaining 32 samples were used for training. The RBF networks reached 92% accuracy, while SVM and RF returned accuracies of 84% and 89%, respectively. A problem related to this technique is the variability in color over time.

Other types of input variables within the group of nondestructive methodologies are based on sensors able to mimic organoleptic perceptions, such as the electronic nose (E-nose) [34,65] and the electronic tongue [35]. The E-nose generates signals corresponding to volatile and semivolatile compounds from honeys that, after being processed by ML algorithms, allow correct classifications to be made. Benedetti et al. [65] studied 70 samples ascribed to three unifloral origins, certified by pollen analysis. First, a PCA of the samples indicated the main components, and then an ANN was generated that, after optimization by cross-validation, accurately classified all samples of the test set. An electronic tongue was reported to correctly classify acacia, chestnut and honeydew honeys after application of an ANN to signals from the device [35].

Thus, the application of ML algorithms to the classification of unifloral honeys has been increasing in recent years, and it is expected to go on increasing in the future. However, a systematic comparison of the main ML algorithms for this goal, as presented in this study, has not been reported to date. The PDA algorithm was the best, but others such as ANN, SDA, RF, ET or HDDA can also perform accurate classifications based on the variables in the dataset. SVM performed poorly in all repetitions on the datasets. The failure to obtain higher accuracy rates is due to honey classes such as lavender, rosemary or citrus, whose characteristic pollen grains are under-represented. Good marker parameters should be found and used to improve the classification of these honeys, which have not been included in most studies using ML algorithms for prediction. To our knowledge, this is the first time that most of the algorithms compared in the present study (for example, PDA, HDDA, SDA, C5.0, ET, XGBoost) have been used for the classification of unifloral honeys. It is expected that the comparison of the performance of the ML algorithms applied here may be useful not only for research on honey classification by origin but also for research on other kinds of foods.

#### **5. Conclusions**

A comparison of 13 ML algorithms on a dataset of one hundred honeys harvested in Spain and belonging to seven unifloral classes was performed using 14 physicochemical parameters. The ML models were built by splitting the dataset into a training set (70%) and a test set (30%) and optimizing the configuration by 10-fold cross-validation using several metrics, mainly log loss. The optimized models were evaluated on the test set to record the overall and partial accuracies in classifying samples into their parent classes. The whole process was repeated three more times (four runs in total), and the results were averaged. The best accuracies were provided by the PDA algorithm (86.67%), followed by ANN (85.56%), SDA and RF (83.33%). The worst results were rendered by SVM with radial and linear kernels (53–60%). Most algorithms correctly classified forest, sunflower and heather honeys. Orange blossom and eucalyptus honey samples were partly misclassified by some models; rosemary honeys were partly misclassified by all models except XGBoost, while lavender honeys were the most difficult to assign to their parent group. Most of the algorithms studied here have not previously been applied to honey classification, and they are likely to be useful for this task in future research, for example with the inclusion of more unifloral honey types and a multifloral honey class. Moreover, other parameters (among them those obtained by nondestructive techniques such as FT-IR, FT-Raman or NIR spectroscopy) can be included in the datasets and tested to improve the accuracy of the classification task as much as possible.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10.3390/foods10071543/s1, Figure S1: Box plots of some variables in the honey dataset: (a) water content (%); (b) electrical conductivity (µs/cm); (c) pH; (d) fructose content (%); (e) glucose content (%); (f) sucrose content (%); (g) maltose content (%); (h) isomaltose content (%), Figure S2: Box plots of some variables in the honey dataset: (a) kojibiose content (%); (b) fructose/glucose ratio; (c) glucose/water ratio; (d) chromatic coordinate x; (e) chromatic coordinate y; (f) chromatic coordinate L; (g) percentage of pollen from the characteristic taxa giving the name to each unifloral honey (this variable is not included in the dataset), Figure S3: Clustering of the honey dataset in 7 clusters by k-means. The largest symbols correspond to the centroids of each cluster, Figure S4: Plot of the importance of the variables from the honey dataset, as estimated by an RF model, Figure S5: Box plots of metric values of several ML algorithms by optimization throughout training with 10-fold cross-validation of the honey dataset with splitting into training and test sets obtained in three additional repetitions: (a), (b) and (c) log loss of repetitions 2, 3 and 4; (d), (e) and (f) accuracy and kappa of repetitions 2, 3 and 4. Black circles symbolize mean values, Figure S6: Box plots of (a) log loss, (b) accuracy and kappa values of several ML algorithms throughout the training with 10-fold cross-validation of the honey dataset without partitioning into training and test sets. Black circles symbolize mean values, Table S1: Mean values for the predictor variables in the two clusters obtained by k-means, Table S2: Confusion matrices obtained with all honeys in the dataset without splitting into training and test sets using 10-fold cross-validation.

**Author Contributions:** Conceptualization, F.M. and E.M.M.; methodology, F.M., E.M.M. and A.T.; software, F.M.; validation, F.M.; formal analysis, F.M.; investigation, F.M., E.M.M. and A.T.; resources, F.M.; data curation, F.M.; writing—original draft preparation, F.M.; writing—review and editing, F.M. and E.M.M.; visualization, F.M.; supervision, F.M. and E.M.M.; project administration, E.M.M.; funding acquisition, E.M.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received co-funding from EUROPEAN REGIONAL DEVELOPMENT FUND (ERDF) and MINISTERIO DE ECONOMÍA Y COMPETITIVIDAD (Spanish Government) through project RTI2018-097593-B-C22.

**Data Availability Statement:** The data supporting the results can be found in Supplementary Materials or by petition to the corresponding author.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Sensory Descriptor Analysis of Whisky Lexicons through the Use of Deep Learning**

**Chreston Miller 1, \* , Leah Hamilton <sup>2</sup> and Jacob Lahne 2**


**Abstract:** This paper is concerned with extracting relevant terms from a text corpus on whisk(e)y. "Relevant" terms are usually contextually defined in their domain of use. Arguably, every domain has a specialized vocabulary used for describing things. For example, the field of Sensory Science, a subfield of Food Science, investigates human responses to food products and differentiates "descriptive" terms for flavors from "ordinary", non-descriptive language. Within the field, descriptors are generated through Descriptive Analysis, a method wherein a human panel of experts tastes multiple food products and defines descriptors. This process is both time-consuming and expensive. However, one could leverage existing data to identify and build a flavor language automatically. For example, there are thousands of professional and semi-professional reviews of whisk(e)y published on the internet, providing abundant descriptors interspersed with non-descriptive language. The aim, then, is to be able to automatically identify descriptive terms in unstructured reviews for later use in product flavor characterization. We created two systems to perform this task. The first is an interactive visual tool that can be used to tag examples of descriptive terms from thousands of whisky reviews. This creates a training dataset that we use to perform transfer learning using GloVe word embeddings and a Long Short-Term Memory deep learning model architecture. The result is a model that can accurately identify descriptors within a corpus of whisky review texts with a train/test accuracy of 99% and precision, recall, and F1-scores of 0.99. We tested for overfitting by comparing the training and validation loss for divergence. Our results show that the language structure for descriptive terms can be programmatically learned.

**Keywords:** natural language processing; deep learning; sensory science; flavor lexicon; long short-term memory

#### **1. Introduction**

#### *1.1. Flavor Language*

Flavor is a major factor motivating eating behavior and food choice, but due to the approximately 350 different receptors for aroma-active compounds and the low detection thresholds for many such compounds [1], it is not currently possible to predict a flavor experience from chemical data alone. On the other hand, English and most other languages do not have a systematic and unambiguous flavor vocabulary, so studying flavor by surveying humans is still challenging. Sensory scientists need a way of aligning the different sensory lexicons used by different tasters and stakeholders.

**Citation:** Miller, C.; Hamilton, L.; Lahne, J. Sensory Descriptor Analysis of Whisky Lexicons through the Use of Deep Learning. *Foods* **2021**, *10*, 1633. https://doi.org/10.3390/foods10071633

Academic Editor: Sigfredo Fuentes

Received: 8 June 2021; Accepted: 7 July 2021; Published: 14 July 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The earliest solution in Sensory Science was the Descriptive Analysis (DA) panel, a body of related methods that use the experiences and vocabularies of a small panel of participants to create a single aligned sensory lexicon for some category of products. To ensure alignment between panelists (namely, preventing needless synonyms and disagreement about definitions), every word is defined in reference to a physical standard. The largest time investment in DA (often taking weeks or months) is the hands-on training to identify appropriate descriptors and references, followed by creating experts in the newly-defined language, all of which occurs before the product analysis [2].

This standardized vocabulary is called a sensory or descriptive lexicon and comprises words or phrases called "descriptors": terms that can be used to describe the flavor, aroma, mouthfeel, taste, appearance, or other sensory attributes of the product set [3]. For highly studied categories, or in cases where flavor communication between groups is important, the lexicon itself may be a desired outcome [3,4]. Lexicons can provide a reference list of possible terms for the analysis of products by trained or untrained panelists; they can be used to communicate the sensory properties of products to consumers for marketing and product differentiation; and they can be used by connoisseurs and enthusiasts to define product categories, as prototypically documented in the wine world [3–5].

Smaller research operations and newer or less-studied categories, however, often prefer methods of flavor measurement that do not need a carefully crafted flavor lexicon. These "rapid" methods collect similarity measurements, allow consumers to use their own untrained vocabulary to describe product flavors, or both [6]. When flavor is described with colloquial language, there are likely to be individual differences and other problems like those encountered during DA training, but rapid methods deal with these problems after data collection rather than before.

The process of identifying meaningful descriptors from free-text product descriptions and combining synonymous terms is known as comment analysis (or free text analysis), and its adoption within Sensory Science has made it possible to utilize existing sources of descriptive text as sensory data. Comment analysis of existing text data has previously been used to produce or modify lexicons for rum [7], wine [8], and whisky [9] and to identify terms that drive liking or price in wine [10,11]. Like all descriptive lexicons, these will never be truly universal—a new product can have a new taste, or a new consumer can have a new perspective on the products—but lexicons are universalizable: they are "intersubjectivity engines" that allow structured communication about the subjective, perceptual qualities of foods, beverages, and other consumer products [5,12,13].

Because it requires human attention, the scale of DA is necessarily limited: the largest DA studies usually include no more than about 100 products. In comparison, there are thousands or even tens of thousands of whiskies currently on the market [14], hundreds of thousands of wines, and similar variety in other specialty food markets, like coffee, chocolate, beer, tea, and so on. All of these products are sold partly or entirely based on the value of their sensory attributes, and these attributes are described in various publications such as reviews and commentaries. While manual comment analysis requires fewer work hours than DA, it is still impractical for large datasets like the RateBeer or Yelp review corpora [15,16]. Computational methods are needed to fully leverage the power of existing sources of food-descriptive data.

#### *1.2. Natural Language Processing and Machine Learning*

The field of Natural Language Processing (NLP) has quickly matured in the last several years with the boom of deep learning techniques. A number of techniques have been developed in NLP for the identification of relevant terms in freeform text, but accomplishing this task in any given application is usually challenging due to the importance of domain-specific language and unique words not well-represented in language references like thesauruses.

Previous computational linguistic analysis of flavor in food descriptions has relied on having text with a high density of flavor descriptors [10,11], which is not always the case [8]. There are no published tools designed specifically for the identification of flavor descriptors. In this paper, we prototype and test such a tool using a Long Short-Term Memory (LSTM) deep neural network [17] trained on manually-annotated whisky reviews. The architecture used is based on a keyword extraction tool developed for identifying skills in resumes [18].

We believe that the flexibility of the LSTM architecture in capturing context in natural language will allow the identification of unique language lexicons from whisky reviews. Free comments (reviews) are the most natural way for humans to describe their food experiences, but they are very hard to analyze systematically, especially in volume. The use of LSTMs presents an alternative to hand-coding and increases the volume of data we can meaningfully deal with. The ability to take advantage of previously written reviews to build lexicons with minimal human intervention has much value. Our investigation is one use-case scenario for LSTMs, but not the only one: this kind of architecture (and related ones) could probably solve many problems in Sensory Science (including sentiment analysis and synonymy/similarity detection), a field that has so far seen very limited use of NLP.
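For readers unfamiliar with LSTMs, the gating mechanism that lets the network carry context across a token sequence can be sketched as a single cell step in NumPy. The 300-dimensional input matches the GloVe embeddings used in this paper, but the hidden size and random weights below are purely illustrative, not the trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the input, forget, output
    and candidate transformations; d is the hidden size."""
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # shape (4*d,)
    i = sigmoid(z[0*d:1*d])             # input gate
    f = sigmoid(z[1*d:2*d])             # forget gate
    o = sigmoid(z[2*d:3*d])             # output gate
    g = np.tanh(z[3*d:4*d])             # candidate cell state
    c = f * c_prev + i * g              # new cell state (the "memory")
    h = o * np.tanh(c)                  # new hidden state
    return h, c

rng = np.random.default_rng(0)
x_dim, d = 300, 64                      # 300-d embedding in, 64 hidden units
W = rng.normal(0, 0.1, (4 * d, x_dim))
U = rng.normal(0, 0.1, (4 * d, d))
b = np.zeros(4 * d)
h = c = np.zeros(d)
for x in rng.normal(size=(5, x_dim)):   # a 5-token context window
    h, c = lstm_step(x, h, c, W, U, b)
```

The forget gate `f` decides how much of the previous cell state survives each step, which is what lets the model relate a word to context several tokens away.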

#### *1.3. Objectives*

Our overarching aim is to be able to use preexisting reviews to create a flavor language by identifying and extracting the unique descriptors. Since the world of food and beverage descriptions is obviously a large and heterogeneous domain, we use whisk(e)y as a case study for this domain, as it is an important economic product [14] and one without an authoritative flavor language that comprehensively covers the many increasingly-relevant product styles [19]. This approach should, in principle, be applicable to any product with a large corpus of free-text description available, such as the flavor of other foods, textile feel, and perfume aroma.

To determine whether it is possible to separate flavor-descriptive terms from those with no sensory meaning in written descriptions of food experiences, such as those found on product review blogs, we created a data pipeline that uses NLP techniques to take freeform-text whisky reviews and extract descriptors. These descriptors can be used to create a lexicon for each whisky: a list of characteristic descriptors, or a descriptive representation. The results can be used to identify relationships between descriptors and allow us to begin understanding the flavor language of whiskies.

Therefore, our contributions are threefold:


#### **2. Materials and Methods**

The training of a neural network requires a training set of positive and negative examples—in this case, words that are descriptive of flavors as opposed to other words in product reviews that do not describe flavor. To minimize the burden of manual annotation, we developed an interactive visual tagging tool. Full-text reviews were preprocessed to provide a list of potentially descriptive word forms for human annotation as descriptive or non-descriptive, and then individual instances of these descriptive and non-descriptive words in context were used to train and test the proposed LSTM architecture.

#### *2.1. Data Collection*

The dataset we used to train our model contains a total of 8036 full-text English whisky reviews scraped from four websites: WhiskyAdvocate (WA; 4288 reviews), WhiskyCast (WC; 2309 reviews), The Whiskey Jug (WJ; 1095 reviews), and Breaking Bourbon (BB; 344 reviews). WA and WC reviews are from websites affiliated with a magazine and podcast, respectively, and written by professional reviewers scoring whiskies from around the world and providing tasting notes. BB and WJ are smaller "semi-professional" review websites more focused on American whiskey. BB and WA have multiple named reviewers writing the tasting notes for their websites, while WC and WJ each have a single named reviewer.

WC, WJ, and BB were scraped using beautifulsoup4 in Python v3.7, while WA was scraped using the rvest package v0.3.2 in the R Statistical Environment v3.5.3. A combination of GET-specific scrapers was used to collect the review-containing URLs from the various sites, and then the content was collected with site-specific scrapers. When possible, page formatting was used to collect metadata about the products being reviewed, such as the country of production or the proportion of alcohol by volume (ABV), as well as the review itself, but the metadata were not consistent across sites.

#### *2.2. Data Preparation*

After collection, each review was converted from full text (excluding the title and metadata elements such as the date of publication) into a list of potentially-descriptive word base forms called "lemmas" that occurred in each review using the workflow described in [19]. Briefly, the reviews were first tokenized, or converted into an ordered list of individual words and punctuation. Each token was tagged with a part-of-speech (POS) label such as "adjective" or "punctuation", and all tokens other than nouns, adjectives, and a small whitelist of verbs were removed. The remaining tokens were lemmatized, or converted into their base form (e.g., "drying" to "dry").
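The tokenize → POS-tag → filter → lemmatize workflow can be illustrated with a deliberately tiny stand-in: the hand-written POS and lemma dictionaries below replace spaCy's statistical tagger and lemmatizer, and the example sentence is hypothetical:

```python
# Toy POS tags and lemmas for a few tokens; a real pipeline would use
# spaCy's statistical tagger and lemmatizer instead.
POS = {"nose": "NOUN", "is": "VERB", "drying": "ADJ", "with": "ADP",
       "smoky": "ADJ", "notes": "NOUN", ",": "PUNCT", ".": "PUNCT"}
LEMMA = {"drying": "dry", "notes": "note"}
KEEP = {"NOUN", "ADJ"}   # plus a small verb whitelist in the actual workflow

def candidate_lemmas(text):
    """Tokenize, keep only nouns/adjectives, and reduce to base forms."""
    tokens = text.lower().replace(",", " ,").replace(".", " .").split()
    kept = [t for t in tokens if POS.get(t) in KEEP]
    return [LEMMA.get(t, t) for t in kept]

print(candidate_lemmas("Nose is drying, with smoky notes."))
# ['nose', 'dry', 'smoky', 'note']
```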

This was done using spaCy v2.1.8 in Python v3.7. The pretrained model en\_core\_web\_sm v2.1.0 was used as the basis for calculating predictions, and the R package cleanNLP v3.0 was used in R v3.5.3 to convert the data to a tabular CSV format.

These lemmas were used as the list of potential descriptors for manual annotation with our interactive visual tagging tool (described in the next section). The frequency of occurrence (as an adjective or noun) for each lemma was used to prioritize more common lemmas for annotation.

#### *2.3. Interactive Tagging Tool*

To create examples of descriptive and non-descriptive words in context for this study, human annotators used a browser-based interactive tagger tool based on a word cloud visualization, seen in Figure 1. The tool was built for this purpose in Javascript and HTML5 using jQuery v3.4.1. A CSV file of token frequencies is uploaded from the user's local storage, the most common terms are rendered into a word cloud using jQCloud v2.0.3, and the user assigns words to the descriptor (1) or non-descriptor (0) classes using one of three interaction modes. The central wordcloud display (Figure 1B) repopulates with progressively less common words as the user assigns words to classes, and the resulting corpus of labeled words is exported along with unlabeled words as a CSV file to the user's local storage. Up to 50 words can be displayed in the central panel at a time, based on the rendering algorithm described in [20]. Fancybox v3.5.7 is used to display tooltips.

With a low learning curve, the user is able to sift through the text in a timely manner and create a human-annotated list of positive and negative examples. The user is then able to save the corpus of labeled words to a comma separated value (CSV) file.

#### *2.4. Gold Standard Annotations*

We asked four annotators (A, B, C, and D) from Food Science to use the interactive tagger tool to create an annotated training set. The annotators were chosen based on their expertise in Sensory Science, a sub-field of Food Science. Annotators A and B were involved with annotating all the datasets, while C was a tiebreaker for datasets WA and WC and D was a tiebreaker for BB and WJ. A lemma was deemed a descriptor if it was tagged as such by two out of the three annotators; otherwise, the lemma was tagged as not being a descriptor. As such, the number of annotators was chosen so that a best two-out-of-three consensus could be achieved. This is important as it provides a more accurate set of labeled annotations and is a common practice in both corpus annotation in NLP [21] and in the analysis of freeform comments in sensory science survey research [22]. A total of 1794 lemmas (499 descriptive, 1295 non-descriptive) were tagged and used to create a training and test set. There were a total of 2638 unique descriptive/non-descriptive tokens tagged based on these lemmas (e.g., the lemma "fruit" could appear in the text as "fruity" or "fruits", i.e., a lemma could result in multiple tagged tokens). All individual occurrences of the tokens in context were used for training.

#### **Interactive Tagging Tool**

**Figure 1.** Interface for the Interactive Tagging Tool: (**A**) Non-descriptors (negative examples) are kept in the deletion history. (**B**) Then the most frequent words (up to 50) are shown in a word cloud format. (**C**) Confirmed descriptors (ones that are selected by the human operator) are stored in a confirmed terms list.

#### *2.5. Word Embeddings*

A word embedding is a representation of a word as a high-dimensional vector. The closer a pair of word vectors are in this space, the more conceptually "similar" or "related" the words are. For each potential descriptor (unigram), an input sequence was created from the word and its context. Each potential descriptor and context word was assigned a 300-dimensional GloVe [23] word embedding; GloVe embeddings with a vocabulary of 1.9 million tokens were used. Note that terms used in a domain-specific language, such as that of whisky tasting notes, are not commonly used by the lay person, which is a key consideration for domain-specific keyword extraction.

As illustrated in Figure 2, three words before and after the candidate were used as context (n = 3). If fewer than three context words were available, e.g., if the word was the first word of a sentence, a PAD (a filler value) was used to signal no available context. The PAD value is assigned the zero vector. These input sequences were used to train the model described in the next section.
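The sequence construction with PAD as the zero vector can be sketched as follows; the random vectors here are stand-ins for real GloVe embeddings, and the helper names are ours, not from the original implementation.

```python
import numpy as np

DIM = 300  # embedding dimensionality (GloVe 300d)
N = 3      # context words before and after the candidate

# Hypothetical stand-in for the GloVe lookup: random vectors per word,
# with PAD mapped to the zero vector.
rng = np.random.default_rng(0)
vocab = {}
def embed(word):
    if word == "PAD":
        return np.zeros(DIM)
    if word not in vocab:
        vocab[word] = rng.normal(size=DIM)
    return vocab[word]

def input_sequence(tokens, i, n=N):
    """Embed the candidate at position i with n context words on each
    side, padding where the sentence runs out."""
    window = []
    for j in range(i - n, i + n + 1):
        window.append(embed(tokens[j]) if 0 <= j < len(tokens) else embed("PAD"))
    return np.stack(window)  # shape: (2n + 1, DIM)

seq = input_sequence(["smoky", "peat", "notes"], 1)
print(seq.shape)               # (7, 300)
print(np.allclose(seq[0], 0))  # True: first slot is PAD (zero vector)
```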

#### *2.6. Descriptor Extractor*

We chose a uni-directional Long Short-Term Memory (LSTM) deep neural network architecture since it works well with the context of language. An LSTM is a Recurrent Neural Network (RNN) designed for modeling sequence data. LSTMs have a memory segment that can "remember" events over a certain span of time, so they are well suited to retaining context in language and the relationships between words. The context in which a descriptor is found is essential to identifying what is or is not a descriptor: how a word is used can be a deciding factor.

**Figure 2.** Conceptual example of an input sequence for training the model. The before and after context of a candidate word is extracted from a text review. This is then converted into a numerical representation using GloVe word embeddings for each context word and the candidate word with a PAD being the zero vector.

The architecture used was inspired by [18] and can be seen in Figure 3. There are two inputs, the context of a potential descriptor and the potential descriptor itself. Each is fed into a uni-directional LSTM of 256 units followed by a dense layer of 128 units. These two dense layers are concatenated and fed forward to a series of decreasing dense layers ending with a binary softmax output layer that decides whether the input word is a descriptor or not. The size of the context window is flexible, but we fixed the descriptor (word) to a unigram, as we observed that most descriptors are single words, and used a context of three words before and after the descriptor (n = 3).
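A minimal Keras sketch of this two-input architecture follows (assuming TensorFlow 2). The widths of the decreasing dense layers after concatenation (64, 32) are illustrative guesses, since only "a series of decreasing dense layers" is specified; the optimizer and loss match the training settings reported in the Results section.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

DIM, N = 300, 3  # GloVe dimensionality and context window

# Two inputs: the 2n context words and the single candidate word.
ctx_in = layers.Input(shape=(2 * N, DIM), name="context")
word_in = layers.Input(shape=(1, DIM), name="word")

# Each input feeds a uni-directional LSTM of 256 units, then a dense layer of 128.
ctx = layers.Dense(128, activation="relu")(layers.LSTM(256)(ctx_in))
word = layers.Dense(128, activation="relu")(layers.LSTM(256)(word_in))

# Concatenate and funnel through decreasing dense layers (widths assumed)
# to a binary softmax: descriptor vs. non-descriptor.
x = layers.Concatenate()([ctx, word])
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(32, activation="relu")(x)
out = layers.Dense(2, activation="softmax")(x)

model = Model([ctx_in, word_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
print(model.output_shape)  # (None, 2)
```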

#### **Descriptor Extractor architecture**

**Figure 3.** Deep Learning Architecture composed of two sets of input (context and a word), each feeding into a uni-directional LSTM. The rest of the architecture concatenates the LSTM layers and continues through merged dense layers to a softmax output.

#### **Input sequence construct**

We decided to use a traditional LSTM as a starting point for our keyword (descriptor) extractor. We wanted to see how well this model structure could perform before turning to more sophisticated model architectures in the future such as transformers [24]. As we will discuss in our results, the model architecture performed well.

The Descriptor Extractor was written in Python v3.6.9 with the deep learning architecture built using Keras v2.3.1 and Tensorflow v1.14.0. Comet.ml [25] was used to track different aspects of the model training, allowing us to provide detailed information presented in some of the figures in the Results section.

#### **3. Results**

Our experiments focused on testing our LSTM architecture as it was tailored to the problem space. For a comparison baseline, we chose parts-of-speech (POS) tagging since it closely reflects the characteristics of descriptors, which are generally adjectives and nouns. The POS approach is currently state-of-the-art for Sensory Science and therefore reflects a valid comparison [10,19].

Our first experiment combined the WA and WC datasets into one dataset for training and validation. We used an Adam optimizer with a learning rate of 0.0001, and our loss function was binary cross entropy with a batch size of 32. We had the BB and WJ datasets annotated in the same fashion as WA and WC so as to have a labeled test set. The results on the test set (accuracy/precision/recall/F1-score) can be seen in Table 1. The scores were lower than those on the training set, which prompted us to investigate why. We realized that an important difference between WA/WC and BB/WJ was that WA/WC were professionally written reviews, whereas BB/WJ were hobbyist reviews. Different writing styles between the two kinds of reviews likely drive this difference in performance.

**Table 1.** Results from the test set (BB/WJ combined) when using WA/WC combined as the training and validation data.


We then combined all the datasets (WA, WC, BB, WJ) into one dataset and performed a train/test split of 80/20, taking a different approach to constructing the split. After combining the datasets, we tokenized the reviews into words (tokens) and split on the tokens themselves instead of on a per-review basis. For each token we also recorded the specific review it occurred in, the sentence within the review, and its position in the sentence. Therefore, instead of searching all reviews for every location of, say, "vanilla" when building the training and test sets, we used the specifically tagged location of each instance and extracted its context (n = 3 words before and after the token). This isolated each instance to its respective training or test set. Combining the reviews to create the new split exposed the model to more variation in writing styles and hence produced a more robust classifier.

Given the labeled descriptors/non-descriptors (2638 unique), we identified around 250K instances of the labeled words. As mentioned, we used a randomly chosen 80/20 split: around 200K words for training and validation and around 50K for testing. For training, we removed punctuation but kept stop words, as they are part of the context. Twenty percent of the training data were used for validation. Each split contained a class ratio of 56% non-descriptors to 44% descriptors; hence, there was no class imbalance.
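The token-level split can be sketched as follows. The occurrence records are hypothetical, but carry the same fields described above: the token, the review it occurred in, the sentence within the review, the position in the sentence, and its label.

```python
import random

# Hypothetical occurrence records:
# (token, review_id, sentence_idx, position_in_sentence, label)
occurrences = [("vanilla", r, s, p, 1)
               for r in range(40) for s in range(3) for p in range(2)]
occurrences += [("bottle", r, s, p, 0)
                for r in range(50) for s in range(3) for p in range(2)]

# Shuffle and split 80/20 on the token occurrences themselves,
# not on whole reviews.
random.seed(42)
random.shuffle(occurrences)
cut = int(0.8 * len(occurrences))
train, test = occurrences[:cut], occurrences[cut:]
print(len(train), len(test))  # 432 108
```

Because each record pins down one specific occurrence, the n = 3 context window for that instance can be extracted directly from the recorded review, sentence, and position.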

It was unclear how many epochs to train for (an epoch is one pass through the entire training dataset). We noticed that the accuracy converged to near 100% quickly, within the first two epochs. To guard against overfitting, we ran the training for as many epochs as necessary until the loss stopped improving and then plotted the training loss versus the validation loss to see how the training was behaving. We trained the model in increments of 5% of the data, up to 100%, resulting in 20 training sessions. This was done to observe how the model behaved given different amounts of training data; in practice, if a model performs poorly, more training data may improve results. We investigated the plots for 5%, 50%, and 100% (Figures 4–6, respectively) and observed that the loss curves consistently crossed at roughly three epochs and then diverged (overfitting). This crossing is marked as "Epoch 2" in the figures, since epoch numbering in the figures starts at 0. Hence, we chose to train for three epochs.

**Training loss vs. validation loss for 5% of data**

**Figure 4.** Training Loss versus validation loss using 5% of the training data where the *x*-axis is the epoch and the *y*-axis is the loss.

**Figure 5.** Training Loss versus validation loss for using 50% of the training data where the *x*-axis is the epoch and the *y*-axis is the loss.

After reviewing the loss and accuracy for each incremental iteration, we decided to report results using 100% of the training data. The gain from using all the data was small, e.g., a loss of 0.00231 for 95% of the data versus 0.00238 for 100%, and the difference in accuracy was equally minimal. Since training with 100% of the data did not take long (around 5 min for three epochs using a desktop CPU), the extra training time was still worthwhile. Figures 7 and 8 illustrate the training loss and validation loss, respectively, in which each loss (*y*-axis) is plotted against the percentage of the data used in training (*x*-axis).

#### **Training loss vs. validation loss for 100% of data**

**Figure 6.** Training Loss versus validation loss for using 100% of the training data where the *x*-axis is the epoch and the *y*-axis is the loss.

#### **Loss per data percentage**

**Figure 7.** The loss for each percent of the training data used. *x*-axis is the specified percentage used and *y*-axis is the loss value.

**Figure 8.** The validation loss for each percent of the training data used. *x*-axis is the specified percentage used and *y*-axis is the loss values.

We then used a parallel coordinates plot (Figure 9) to illustrate how using more samples (a higher percentage of the data) increases the batch accuracy, which corresponds to a lower loss and a higher overall accuracy. Note that the axis ranges for batch accuracy, loss, and accuracy are quite small.

The results from training can be seen in Table 2, where the loss and accuracy are reported where available. Our model's training accuracy hovered at 99%, versus 51.3% for POS. The precision, recall, and F1-scores for the test set are presented in Table 3. One thing to note is that the recall is very high for POS. POS classification essentially says "all nouns and adjectives are descriptors"; since almost all descriptors ARE nouns and adjectives, there are very few false negatives. Because recall = true positives/(true positives + false negatives), the recall will be very high.
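The effect is easy to see numerically. The counts below are illustrative, not measured values: the POS baseline misses almost no true descriptors (few false negatives) but sweeps in many non-descriptive nouns and adjectives (many false positives), so recall is high while precision and F1 suffer.

```python
def metrics(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # denominator is TP + FN, not TP + TN
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical POS-baseline counts
p, r, f1 = metrics(tp=950, fp=900, fn=50)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.514 0.95 0.667
```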

**Figure 9.** From left to right: The relationships between the number of samples used for training, the batch accuracy, the training loss, and the final accuracy.


**Table 2.** Training results from combining the datasets and splitting train/test along tokens.

**Table 3.** Test results from combining the datasets and splitting train/test along tokens.


An illustration of a case where the LSTM model struggles can be seen in Figure 10. The orange underlined words were identified by both a human annotator and the LSTM model. The blue ones were not identified by the LSTM model but were by the human annotator. The primary differences tend to be that the LSTM model only identifies the more descriptive word in bi-gram descriptor phrases (e.g., "banana chips") and will classify uncommon words, especially proper nouns, as descriptive, albeit with a low probability (e.g., "Redbreast", prediction of 71%). The challenge with bi-grams is a focus of future work discussed later, and the low probabilities can be addressed by using a filter threshold.

Figure 10 also demonstrates the difficulty in creating a tagged gold standard corpus, as words like "copper" and "heavy" that are not usually flavor words were annotated by the LSTM model as non-descriptors. In certain contexts, as in the idiosyncratic text of this review, these words are arguably capable of describing flavor. In the majority of reviews, however, "copper" instead describes the color (not flavor) of the spirit. The difficulty that these kinds of rarely descriptive words present for rapid annotation is also a focus of future work for the tagging scheme described in this paper.

**Figure 10.** Prediction results for a random review. The orange underlined words were identified by the LSTM model. The blue ones were not identified.

We were also interested in viewing the relationships between words chosen as descriptors or non-descriptors. One approach is to visualize the GloVe word embeddings using a t-SNE plot, a technique for visualizing high-dimensional data in a two-dimensional space [26]. The result is a scatter plot in which the distance between words represents their "similarity" based on the word embeddings: the closer two words are, the more conceptually similar they are.

We first plot the t-SNE for the annotated words within the training set. This can be seen in Figure 11, with a descriptor shown as a brown "X" and a non-descriptor as a blue dot. The words labeled in the training dataset form distinct clusters. This indicates that the human annotations created a well-defined cluster space, supporting both the quality of the provided annotations and the ability of the embeddings to capture flavor language. The same can be said for the clusters of the test dataset (Figure 12). One may notice some descriptors/non-descriptors "speckled" across the opposing cluster, i.e., words labeled as descriptors found within the non-descriptor cluster. This demonstrates that even when words are similar in an embedding space, their contextual meaning can vary. Zooming in to an example of this (Figure 13), we see various terms that describe different aspects of bodies of water or climates. These words hold some form of similarity but are not deemed "equal" in a descriptive sensory sense.

In Figure 14, we see the embedding space for words that were predicted to be a descriptor (brown "X") or not a descriptor (blue dot). As can be seen, neatly defined clusters also emerge, along with some "speckles". This provides support that the trained model is able to segment the space into descriptors and non-descriptors and hence carve out sensory terms. Another observation is that in all the t-SNE embedding plots, the non-descriptors outnumber the descriptors, as was noted to be the case in the Related Works section [8]. The actual descriptors are the minority, which makes sense, as we observed that most words in the reviews are non-sensory.
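Projections of this kind can be reproduced in outline with scikit-learn. The random vectors below are stand-ins for the 300-dimensional GloVe embeddings of labeled words, drawn as two loose groups so the projection has some structure to recover; the group sizes and offsets are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-ins for 300-d GloVe vectors of labeled words: two loose groups.
descriptors = rng.normal(loc=0.0, scale=1.0, size=(15, 300))
non_descriptors = rng.normal(loc=3.0, scale=1.0, size=(25, 300))
X = np.vstack([descriptors, non_descriptors])

# Project to 2-D; perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(coords.shape)  # (40, 2)
# coords can then be drawn as a scatter plot with a different marker per
# label ("X" for descriptors, dots for non-descriptors), as in Figures 11-14.
```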

#### **Training data word relationships**

**Figure 11.** t-SNE representation of the training data where a blue dot represents a word labeled as a non-descriptor, and a brown "X" represents those that are descriptors.

**Test data word relationships**

**Figure 12.** t-SNE representation of the test data where a blue dot represents a word labeled as a non-descriptor and a brown "X" represents those that are descriptors.

#### **Closer look at word relationships**

**Figure 13.** A zoom-in of a t-SNE plot to exemplify the separation of words into sensory and non-sensory despite the similarity of their word embeddings. This shows that some words can hold a form of similarity but are not deemed "equal" in a descriptive sensory sense.

#### **Predicted data word relationships**

**Figure 14.** t-SNE representation of the words that were not annotated. A blue dot represents a word predicted as a non-descriptor, and a brown "X" represents those that are predicted as descriptors.

#### **4. Discussion**

We were able to build a deep learning model architecture based on LSTMs to provide descriptor identification within free-form text whisky reviews. Our results were very promising, with training, validation, and test accuracies around 99%. The precision, recall, and F1-scores were equally high. This is substantially higher than the current state of the art for Sensory Science. We were concerned about overfitting with such high scores, so we tracked the training and validation loss over many epochs, a common approach to detecting overfitting. The tracking showed overfitting after three epochs, so we stopped our training there.

We were successfully able to automatically separate flavor-descriptive terms from those with no sensory meaning in written descriptions of food experiences (reviews). Our LSTM architecture was able to capture the language constructs that dictate what is and is not a descriptor. We view this as one of our key contributions.

Another key contribution is the pipeline for data preparation, annotation, training, and testing not seen before in sensory science. This opens the door for researchers in Sensory Science and Food Science in general.

We also introduced a novel interactive word tagging tool for creating a set of human-labeled descriptor/non-descriptor words. With multiple annotators using the tool, the set of human-labeled words provided an excellent training set. This supports the possibility of performing the same annotation with datasets in other domains in order to facilitate the creation of a labeled set of words, and hence the training of a model for those domains.

Visualizing the results using t-SNE revealed some interesting results. First, the embedding space of human labeled words by the interactive word tagger was segmented fairly cleanly into two clusters: those words that are descriptors and those that are not. This supports that the annotations are of high quality. Similarly, the embedding space for predicted words from the test set (non-annotated words) reveals two fairly clean clusters for descriptors and non-descriptors. This provides support that the trained model is able to learn the language structure of sensory terms.

These contributions result in some interesting and novel implications. We trained a model that can identify descriptors in texts, leading to the ability to create a lexicon for a whisky. Lexicons allow a comparison between whiskies: a \$50 bottle of whisky can have a similar lexicon to that of a \$300 bottle. This allows the consumer to "experience" the \$300 whisky by trying the \$50 whisky. Lexicons of whiskies can also map distinct descriptors to that of different metadata of the whiskies, such as age, region of origin, ingredients, and price. This can provide a foundation to perform predictive analyses, such as Random Forests. By fitting a model to predict continuous variables (e.g., bottle price, quality score) or classify products (e.g., region of origin) based on the presence or absence of flavor terms in the bodies of reviews, we can identify which flavors or flavor terms drive the price and consumer liking of whiskies or differentiate between product categories.

#### **5. Limitations and Advantages**

While our approach has had much success, there are a few limitations. The interactive tagger currently does not present the context of a word when the word is shown to the user. The context can influence whether some words are descriptive or not (e.g., "maple" in "a sweet maple aftertaste" vs. "this new offering from Maple Leaf Spirits"). Furthermore, unigrams are used to train the model with a context window of n = 3. Although this context window is based on the suggestion from [18], experimentation with other windows could be beneficial for our domain. The use of unigrams is also a limitation as some descriptors are not unigrams but phrases of two or more words. Studying beyond unigrams would be an important direction for future work.

The LSTM developed in this work is most immediately applicable to situations where researchers have some structured data about products (e.g., price, hedonic liking scores, chemical data, ingredient concentrations) and freeform product descriptions but no structured descriptive data. The presence or absence of the descriptors in each product, or the number of participants who used them, can be used very similarly to check-all-that-apply data in sensory analysis [10,27,28]. Freeform textual data are easier to collect than trained descriptive panel measurements, meaning that the use of this tool could reduce the barrier to entry for studies looking to relate production variables to resulting sensory properties, or sensory properties to product liking and consumer behavior. The performance of the LSTM on descriptions of other foods (e.g., other spirits, wine, coffee, specialty meats and cheeses, casserole recipes) should be assessed before using the predicted descriptors for this kind of further analysis, but the interactive annotation tool developed in this work should reduce the amount of work necessary to tag small test sets and, if necessary, new training sets for these other domains.

#### **6. Conclusions**

In conclusion, we were successful in automatically detecting words as flavor-descriptive terms and separating them from non-sensory terms. Our deep learning architecture proved successful and opens the door for further research into descriptive analysis.

For future work, we would like to test generalization between food domains, e.g., applying the model to cocktail and coffee descriptions. Currently, we use only unigram descriptors/non-descriptors; we would like to expand to bi-grams and tri-grams, as some sensory descriptors are phrases rather than single terms (e.g., "wet dog", "red fruits"). Finally, we foresee the possibility of creating a word embedding dataset for the Food Sciences, i.e., an analogue to GloVe embeddings trained on sensory-specific descriptions.

**Author Contributions:** Conceptualization, C.M., L.H., and J.L.; methodology, C.M. and L.H.; software, C.M.; validation, C.M. and L.H.; formal analysis, C.M.; investigation, C.M.; resources, C.M., L.H., and J.L.; data curation, L.H.; writing—original draft preparation, C.M.; writing—review and editing, C.M., L.H., and J.L.; visualization, C.M. and L.H.; supervision, C.M.; project administration, C.M. and J.L.; funding acquisition, J.L. and C.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Institute for Creativity, Arts, and Technology (ICAT) Major SEAD grant at Virginia Tech. The grant has no funding number; however, it does have a project page: https://icat.vt.edu/projects/2019-2020/major/seeing-flavors.html (accessed on 21 May 2021). The APC was funded by the Virginia Tech Libraries Open Access Subvention Fund and by author Lahne's overhead funds.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy concerns and copyright.

**Acknowledgments:** We would like to thank Jonathan Bradely of the Virginia Tech University Libraries for providing the prototype of the interactive tagging system. We would also like to thank our external annotators Brenna Littleson and Amanda Stewart.

**Conflicts of Interest:** The authors declare no conflict of interest. The funder had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


### *Article* **Insights into Drivers of Liking for Avocado Pulp (***Persea americana***): Integration of Descriptive Variables and Predictive Modeling**

**Luis Martín Marín-Obispo 1,† , Raúl Villarreal-Lara 1,2,† , Dariana Graciela Rodríguez-Sánchez 1 , Armando Del Follo-Martínez 3 , María de la Cruz Espíndola Barquera 4 , Jesús Salvador Jaramillo-De la Garza 1 , Rocío I. Díaz de la Garza <sup>1</sup> and Carmen Hernández-Brenes 1, \***


**Abstract:** Trends in new food products focus on low-carbohydrate ingredients rich in healthy fats, proteins, and micronutrients; thus, avocado has gained worldwide attention. This study aimed to use predictive modeling to identify the potential sensory drivers of liking for avocado pulp by evaluating acceptability scores and sensory descriptive profiles of two commercial and five non-commercial cultivars. Macronutrient composition, instrumental texture, and color were also characterized. Trained panelists performed a descriptive profile of nineteen sensory attributes. Affective data from frequent avocado adult consumers (*n* = 116) were collected for predictive modeling of an external preference map (*R*<sup>2</sup> = 0.98), which provided insight into sensory descriptors that drove preference for particular avocado pulps. The descriptive map explained 67.6% of the variance in sensory profiles. Most accepted pulps were from Hass and Colin V-33; the latter had sweet and green flavor notes. Descriptive flavor attributes related to liking were global impact, oily, and creamy. Sensory drivers of texture liking included creamy/oily, lipid residue, firmness, and cohesiveness. Instrumental stickiness was disliked and inversely correlated to dry-matter and lipids (*r* = −0.87 and −0.79, respectively). Color differences (∆E\**ab*) also contributed to dislike. Sensory-guided selection of avocado fruits and ingredients can develop products with high acceptability in breeding and industrialization strategies.

**Keywords:** avocado; cultivars; preference mapping; sensory evaluation; sensory descriptive analysis; consumer science

#### **1. Introduction**

Avocado fruits are now of high economic value, and the food industry is thus showing remarkable interest in enhancing the production and processing of this crop [1,2]. According to the Statistical Division of the Food and Agriculture Organization of the United Nations (FAOSTAT), Mexico is the major producer and exporter of avocado worldwide. In 2018, Mexico's avocado production was 2,184,663 t, with a harvest area of 206,389 ha, representing a 2.17-billion-dollar market [3]. Nutritional characterization of avocado fruit identified many functional compounds, which include unsaturated fatty acids, vitamin E, tocopherols, ascorbic acid, B vitamins, carotenoids, potassium, phenols, antioxidants, phytosterols,

**Citation:** Marín-Obispo, L.M.; Villarreal-Lara, R.; Rodríguez-Sánchez, D.G.; Del Follo-Martínez, A.; Espíndola Barquera, M.d.l.C.; Jaramillo-De la Garza, J.S.; Díaz de la Garza, R.I.; Hernández-Brenes, C. Insights into Drivers of Liking for Avocado Pulp (*Persea americana*): Integration of Descriptive Variables and Predictive Modeling. *Foods* **2021**, *10*, 99. https:// doi.org/10.3390/foods10010099

Received: 2 December 2020 Accepted: 29 December 2020 Published: 6 January 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

acetogenins and their derivatives containing a furan ring (called avocatins or avofurans), terpenoid glycosides, flavonoids, and coumarins [4–12]. The fruit's flesh is pale green to bright yellow in color; it is smooth and buttery in consistency and has an exquisite flavor and aroma [13]. Although the fruit is low in carbohydrates, it is high in lipids, proteins, and minerals [12,14]. Previous consumer studies identified relevant sensory attributes that characterized avocado pulp for texture (firmness, creaminess, buttery, smoothness, and watery) and flavor (grassy, bland, nutty, and buttery) [15,16]. However, avocado cultivars are reported to differ in their chemical profiles, which can influence their sensory characteristics and, consequently, their acceptability. Although consumer liking of some avocado cultivars has been reported [17,18], a trained panel-guided identification of the sensory attributes that drive preference for one avocado cultivar over another had not been studied. The present work aimed to use predictive modeling to identify potential sensory drivers of liking for avocado pulp by evaluating the acceptability scores and sensory descriptive profiles of two commercial and five non-commercial cultivars. The study also characterized macronutrient composition, instrumental texture, and color as variables, to understand liking and identify rapid assessment tools related to consumer acceptability.

#### **2. Materials and Methods**

#### *2.1. Plant Material*

Avocado (*P. americana*) cultivars, shown in Figure 1, were hand-picked (6 October 2017) from the Fundación Salvador Sanchez Colin (CICTAMEX) experimental field, located in Coatepec Harinas, Estado de Mexico, Mexico (18°55′ N, 99°45′ W, 2240 m above sea level). Collected samples (10–12 kg of each cultivar) included five cultivars from the CICTAMEX collection and specimens from two commercially marketed cultivars (Hass and Fuerte). Fruits from all evaluated cultivars were harvested from the same orchard, from different locations within a single tree, selecting those at full physiological maturity (but unripe). Samples were air shipped in closed containers with activated charcoal. Upon arrival at the Centro de Biotecnologia-FEMSA (Tecnologico de Monterrey, Monterrey, NL, Mexico), all fruits were kept at 20 °C (85–90% relative humidity) for seven days to complete the ripening process, until reaching an optimal state for consumption. Information related to horticultural race and general phenotype description of the cultivars used in the present study is summarized in Table 1.

**Figure 1.** Avocado (*Persea americana*) fruits including commercial and non-commercial cultivars.


**Table 1.** Morphological traits and genotype relationships of avocado cultivar samples at commercial ripeness.

<sup>1</sup> Race: M (Mexican, *Persea americana var. drymifolia*), G (Guatemalan, *P. americana var. guatemalensis*). <sup>2</sup> Genotype assignation, parentage, and origin reported by López-López, Barrientos-Priego, & Ben Ya'acov [21]; Rodríguez-López, Hernández-Brenes, & Díaz De La Garza [22]; Rendón-Anaya et al. [2]. <sup>3</sup> Values represent median (interquartile range) or mean ± SE for non-parametric or parametric data, respectively (*n* = 3–5). Different letters within the same column indicate significant differences, according to Kruskal-Wallis or LSD post-hoc test, for non-parametric or parametric data, respectively (*p* < 0.05). <sup>4</sup> g of pulp/100 g of total fruit weight.

#### *2.2. Physicochemical Analyses*

The required number of avocado fruits of each cultivar was randomly selected to complete a sample composite of about 1.5 kg of pulp, based on their average weight and pulp content (Table 1). Avocado pulps were then manually pureed and vacuum-packaged in transparent nylon-polyethylene bags (Uline, Apodaca, NL, Mexico) containing approximately 250, 150, and 40 g of sample for use in proximate composition, instrumental texture, and color determinations, respectively. The packaging material had a thickness of 5 µm, a standard barrier oxygen transmission rate of 63 cm<sup>3</sup>/m<sup>2</sup> per 24 h at 23 °C and 0% relative humidity, and a moisture vapor transmission rate of 4.8 g/m<sup>2</sup> per 24 h at 37 °C and 90% relative humidity. Samples were stored at 4 °C, and physicochemical analyses were conducted within the following 24 h.

#### 2.2.1. Proximate Macronutrient Composition

Moisture, protein, lipid, ash, sugar, and crude fiber content in avocado pulps were determined in triplicate, following the standard methods from the Association of Official Analytical Chemists International [23]. Total carbohydrate content was calculated by difference and dietary fiber was also determined in triplicate, using the AOAC methods 997.08 and 999.03.

#### 2.2.2. Instrumental Texture Analyses

Instrumental texture determination of avocado puree samples was conducted using a TA-XTplus texture analyzer (Stable Micro Systems, Godalming, UK). Measurements were performed using the TTC Spreadability Rig (HDP/SR) fixture, consisting of a set of male and female acrylic cones with 90° angles. Avocado puree packages were conditioned at 25 °C for 20 min before analysis. Samples were then filled into the female (lower) cone with a spatula, pressed lightly to eliminate air pockets (visible through the cone), and the surface was flattened. The female cone was fixed on the base holder of the texture analyzer. The cheese-spread protocol (Texture Exponent software version 6.1.11.0; Stable Micro Systems, Godalming, UK) was used in the determinations, with a 5 kg load cell (test speed, 3.0 mm/s; post-test speed, 10.0 mm/s). The male cone traveled a recorded distance of 23 mm, from its start point 25 mm above the bottom of the female cone into the sample, stopping when the final gap between the two cones was precisely 2 mm. Textural data were recorded as force in grams (g) versus time (s), and the software calculated the following instrumental parameters as output variables: firmness (g), work of shear (g s), stickiness (g), and work of adhesion (g s). Determinations were performed at controlled room temperature (25 °C) with five replicates per sample.
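The four output variables can be derived from any force-time spreadability curve. The sketch below assumes the conventional definitions (firmness as peak positive force, work of shear as the area of the positive compression phase, stickiness as peak negative force, work of adhesion as the area of the negative withdrawal phase); the exact formulas applied by the Texture Exponent software are not stated in the text.

```python
import numpy as np

def spreadability_parameters(time_s, force_g):
    """Derive the four spreadability outputs from a force (g) vs. time (s) curve.

    Assumed conventions: firmness = peak positive force; work of shear =
    trapezoidal area of the positive (compression) phase; stickiness = peak
    negative force; work of adhesion = area of the negative (withdrawal) phase.
    """
    t = np.asarray(time_s, dtype=float)
    f = np.asarray(force_g, dtype=float)

    def trapz(y):  # trapezoidal area under y with respect to t
        return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(t)))

    return {
        "firmness_g": float(f.max()),
        "work_of_shear_gs": trapz(np.clip(f, 0, None)),
        "stickiness_g": float(f.min()),
        "work_of_adhesion_gs": trapz(np.clip(f, None, 0)),
    }

# Toy curve: compression peak followed by a withdrawal (negative) phase
params = spreadability_parameters([0, 1, 2, 3, 4], [0, 100, 200, -50, 0])
```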

#### 2.2.3. Instrumental Color Determinations

The instrumental color of avocado pulps from each cultivar was determined with a tristimulus Minolta CR-400 colorimeter (Konica Minolta Sensing Inc., Osaka, Japan), using a D75 illuminant at an observation angle of 10°. A standard white tile was used as a calibration reference. Readings for the *L*\* (lightness), *a*\* (red-green axis), and *b*\* (yellow-blue axis) CIE*Lab* coordinates were recorded in five replicates (*n =* 5) for each cultivar. Variations of *L*\*, *a*\*, *b*\* (∆*L*\*, ∆*a*\*, ∆*b*\*) and the total color difference (∆E*ab*\*) were calculated for each cultivar using the Hass commercial cultivar as a reference control. The following equation was used:

$$
\Delta E_{ab}^{*} = \sqrt{\left(L_1^{*} - L_0^{*}\right)^2 + \left(a_1^{*} - a_0^{*}\right)^2 + \left(b_1^{*} - b_0^{*}\right)^2} \tag{1}
$$

where *L*0\*, *a*0\*, and *b*0\* are the reference values for the control (Hass cultivar), and *L*1\*, *a*1\*, and *b*1\* are the values for each evaluated cultivar. Values of ∆E*ab*\* > 3.5 units were considered indicators that the instrumental color differences could be perceived by an average observer [24].
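Equation (1) and the 3.5-unit perceptibility threshold translate directly into code. A minimal sketch follows; the CIE*Lab* readings used are hypothetical, for illustration only.

```python
import math

# Threshold above which an average observer may perceive the difference [24]
PERCEPTIBLE_DE = 3.5

def delta_e_ab(lab_sample, lab_reference):
    """Total color difference (Equation (1)) between a sample and a reference."""
    L1, a1, b1 = lab_sample
    L0, a0, b0 = lab_reference
    return math.sqrt((L1 - L0) ** 2 + (a1 - a0) ** 2 + (b1 - b0) ** 2)

# Hypothetical readings: a pulp differing from the Hass control only in lightness
hass_control = (42.0, -8.0, 30.0)
sample = (50.0, -8.0, 30.0)
de = delta_e_ab(sample, hass_control)   # 8.0
perceivable = de > PERCEPTIBLE_DE       # True: the difference is perceivable
```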

#### *2.3. Sensory Analyses*

#### 2.3.1. Sample Preparation

Fruits were weighed, washed, soaked for 5 min in chlorinated water (200 ppm) for sanitization, and dried at room temperature for one hour. Sample preparation was initiated about 10 min before each sensory testing session (descriptive and affective tests); the pulp was manually separated from the peel and seed. Avocado fruits were randomly selected to obtain ~1 kg of pulp from each cultivar, based on their average weight and pulp content (Table 1). Pulps from each cultivar were hand-scooped and placed into plastic bags. The headspace was removed, and the pulp was then manually pureed within the bags for 5 min, until the color and texture were visually homogeneous. For sensory evaluations, avocado puree samples (20 g) were placed in disposable soufflé cups (30 mL), identified with random three-digit numbers, and presented in random order to the trained and untrained judges.

#### 2.3.2. Sensory Descriptive Profiling

Ten trained panelists from SensoLab Solutions SC, a sensory and consumer science laboratory center, with over 500 h of descriptive experience in a wide variety of foods, conducted descriptive sensory profiling of nineteen sensory attributes using a 15-cm free scale. These attributes were obtained previously during two consensus sessions, in which the panel as a group enlisted the most relevant attributes characterizing the studied samples. Five additional one-hour sessions were carried out to train the expert panel in the attributes obtained during the consensus. The ballot was designed using Fizz Forms (Biosystemes, Couternon, France). All trials were conducted in individual sensory booths with white lighting, and data were collected with FIZZ® Acquisition software version 2.50 (Biosystemes, Couternon, France). References and samples were rated using a 15-cm universal Spectrum™ line scale, with 0 cm representing "none" and 15 cm representing "strong" [25]. Samples were presented with a random three-digit code, in random, monadic sequential order, and evaluated in triplicate on four different days. Rinsing water and crackers were provided. No information about the test or samples was given to the panelists before or during the evaluations. Attribute definitions and references used in the evaluations are shown in Table 2.

**Table 2.** Sensory attributes, definitions, and references used in the descriptive analyses of avocado pulp.


<sup>1</sup> References were prepared approximately 24 h before a testing session, refrigerated overnight, and removed from the refrigerator 1 h before a testing session. Intensity based on a 15-point numerical scale, where 0 represents absence and 15 represents extremely strong. Avocado creole (unknown landrace) common in Northern Mexico was obtained from a local supermarket (Monterrey, NL, Mexico) and was used for calibration purposes, and reference intensities were established by the panel consensus.

#### 2.3.3. Consumer Evaluations

Affective data were collected from *n* = 116 frequent avocado adult consumers (consumption frequency > twice a week; 28% males, 72% females), aged 18–51 years (mean age 33.62 ± 12.73 years). Participants were recruited in advance and were instructed to avoid eating or drinking anything but water for at least two hours before the sensory evaluation. Consent forms provided participants with information on the avocado samples. They were also asked for their willingness to participate in the study, as part of a graduate research project from the Department of Bioengineering, School of Engineering and Sciences of Tecnologico de Monterrey, Campus Monterrey, Mexico (Ethics ID: CSERDBT-0001). Participants evaluated appearance, texture, flavor, and overall liking using a nine-point hedonic scale. The sessions were conducted at SensoLab Solutions SC, located at the Technology Transfer and Innovation Center of Tecnologico de Monterrey. Generally, 20–30 consumers participated in each evaluation session, and the study was conducted over three days. Participants received an economic incentive at the end of their participation.

#### *2.4. Statistical Analysis*

For physicochemical and instrumental determinations, the normality of the data was evaluated by the Shapiro-Wilk test. The mean ± SE or median (interquartile range) was reported for parametric and non-parametric data, respectively. Determinations on the composited avocado samples included proximate composition (*n* = 2), instrumental texture (*n* = 5), and instrumental color (*n* = 5), while fruit weight, length, and percent pulp content were determined from individual specimens of each cultivar (*n* = 3–5). Analysis of variance (ANOVA) and mean separations were conducted with Fisher's least significant difference (LSD) post-hoc tests for parametric variables, and Kruskal-Wallis with the Nemenyi post-hoc test for non-parametric variables. Significant differences were assessed at *p* < 0.05. Pearson product-moment correlations for each pair of variables were also calculated to assess the relationships between sensory and physicochemical data. Statistical analyses were performed using the JMP software version 15.0 (SAS Institute, Cary, NC, USA).
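The normality-gated choice between parametric and non-parametric comparisons described above can be sketched with SciPy. This is illustrative only: the original analyses were run in JMP, and the post-hoc separations (Fisher's LSD, Nemenyi) would follow a significant omnibus result.

```python
from scipy import stats

def compare_groups(groups, alpha=0.05):
    """Shapiro-Wilk normality check on each group, then one-way ANOVA if all
    groups pass, otherwise Kruskal-Wallis, mirroring the workflow in Section 2.4."""
    if all(stats.shapiro(g).pvalue > alpha for g in groups):
        stat, p = stats.f_oneway(*groups)
        return "ANOVA", stat, p
    stat, p = stats.kruskal(*groups)
    return "Kruskal-Wallis", stat, p

# Hypothetical lipid contents (%) for two cultivars, five replicates each
fuerte = [21.1, 21.5, 21.3, 21.2, 21.4]
hass = [17.5, 17.9, 17.7, 17.6, 17.8]
test_name, statistic, p_value = compare_groups([fuerte, hass])
```

Pairwise Pearson correlations between sensory and physicochemical variables would use `scipy.stats.pearsonr` in the same spirit.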

For the descriptive data, ANOVA was performed as a complete randomized design with product as a fixed effect and panelist as a random effect, using Fizz Calculations software version 2.50 (Biosystemes, Couternon, France). A post-hoc means comparison using Fisher's Protected LSD at a 95% confidence level was performed to determine significant differences [26,27]. For the liking analysis, one-way ANOVA was performed with Fisher's Protected LSD as a post-hoc test at the *p* < 0.05 level of significance, using the same software [28].

External preference mapping methodology was used to relate the preferences shown by the consumers to the descriptive sensory characteristics of the different avocado pulps. The first step consisted of mapping the pulps on the basis of their sensory descriptive characteristics. Principal Component Analysis (PCA) was used to construct a sensory descriptive biplot with all studied descriptive attributes (individuals run by principal means) on correlations (standardized). Components were retained if they explained at least 15% of the variance. The second step was the construction of the predictive models of the external preference maps using the consumer hedonic data, which was performed using Fizz Calculations software version 2.50 (Biosystemes, Couternon, France) and the XLSTAT software (XLSTAT 2020, Addinsoft, Germany). Consumer overall liking scores were regressed onto the product scores on the principal components of the sensory space included in the PCA biplot (obtained with the trained panel data), using a quadratic model. The quadratic surface model was selected since it corresponded to the complete model, which allowed interactions between characteristics to be taken into account.
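The two-step procedure can be sketched with scikit-learn as an illustrative substitute (the original maps were built with Fizz and XLSTAT): PCA on the standardized product-by-attribute matrix of descriptive means, then regression of overall liking onto the first two product scores with a full quadratic surface.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def external_preference_model(descriptive_means, overall_liking):
    """Step 1: PCA on standardized descriptive means (products x attributes).
    Step 2: regress mean consumer overall liking onto the first two product
    scores using the complete quadratic surface (F1, F2, F1^2, F2^2, F1*F2),
    so interactions between the components are taken into account."""
    scores = PCA(n_components=2).fit_transform(
        StandardScaler().fit_transform(np.asarray(descriptive_means, float)))
    f1, f2 = scores[:, 0], scores[:, 1]
    X = np.column_stack([f1, f2, f1 ** 2, f2 ** 2, f1 * f2])
    model = LinearRegression().fit(X, overall_liking)
    return model, scores

# Synthetic example: 7 products x 19 descriptive attributes, 7 mean liking scores
rng = np.random.default_rng(0)
means = rng.normal(size=(7, 19))
liking = rng.uniform(5.0, 7.5, size=7)
model, scores = external_preference_model(means, liking)
```

Predicted liking over a grid of (F1, F2) points would then yield the preference surface shown in maps such as Figure 3.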

#### **3. Results**

#### *3.1. Fruit Morphological Traits and Proximate Macronutrient Composition*

Morphological traits of the sampled cultivars, as previously reported by CICTAMEX [19,20,29], were confirmed in the present work; their characteristics are documented in Table 1 and can be visualized in Figure 1. All cultivars used in the study, including Hass and Fuerte, were hybrids or selections of the Mexican and Guatemalan races. The median fruit weight of the non-commercial cultivar Jimenez II (195 g) was the closest in value to the weights of the commercial cultivars Hass and Fuerte (198 and 237.2 g, respectively), followed by Fundacion II (190.1 g), while fruits from Colin V-33, Labor, and Ariete had average weights greater than 300 g.

Fruit length values for cultivars Colin V-33 and Jimenez II (11.8 and 10.7 cm, respectively) were not significantly different from those of the commercial cultivars Hass and Fuerte (9.8 and 10.5 cm, respectively). Ariete and Labor presented the fruits with the highest average lengths (13.6 and 15 cm, respectively), whereas Fundacion II fruits presented the lowest length values (7.4 cm). Percent pulp yields are considered relevant parameters for commercial applications, since they represent the edible portion of the fruit. Pulp yields (Table 1) followed a trend similar to that of fruit lengths: the Colin V-33 and Jimenez II yield values (73.1 and 79.5%, respectively) were not significantly different from those of the commercial cultivars, the longest cultivars Ariete and Labor had the highest median pulp yields (>76%), and Fundacion II (the shortest in length) had the lowest pulp yield (64.1%).

The proximate macronutrient composition, shown in Table 3, indicated that for five of the pulps, lipids were the primary macronutrient (>13%), closely followed by carbohydrates (>8%) and then proteins (1.3–2.2%). However, the carbohydrate contents of the Jimenez II and Hass pulps were slightly higher than or equal to their lipid contents, respectively. The moisture contents ranged between 54 and 73%, and the fiber and ash contents were less than 3%. Fuerte and Hass pulps contained significantly higher lipid contents (21.3 ± 0.2% and 17.7 ± 0.2%, respectively) and lower moisture levels (54.7 ± 0.2% and 57.7 ± 0.2%, respectively) than the other cultivars. Our results indicated that total lipid concentrations were inversely related to moisture contents (*r* = −0.88, *p* = 0.009). Additionally, carbohydrate and sugar levels were both strongly and inversely correlated with moisture concentrations (*r* = −0.95, *p* < 0.0012).

#### *3.2. Descriptive Sensory Analyses of Avocado Pulps*

Significant differences (LSD, *p* < 0.05) were observed for ten of the nineteen evaluated descriptors (Table 4). The sensory characteristics that differentiated the pulps the most were the attributes related to the lipids' flavor impact and texture. While Hass, Jimenez II, Fuerte, and Colin V-33 were characterized by having the highest global flavor impact, Ariete had the lowest (LSD, *p* < 0.05). Creamy flavor was perceived with the highest impact in Colin V-33, Fuerte, Hass, and Jimenez II (LSD, *p* < 0.05). In texture, Hass had the strongest creamy/oily perception (LSD, *p* < 0.05). Hass, Fuerte, and Jimenez II were also characterized by having the highest perceived firmness, cohesiveness, and lipidic residual, while Ariete, Fundacion II, and Labor had the lowest perception (LSD, *p* < 0.05). Additionally, the fiber strands attribute differed between samples (LSD, *p* < 0.05), being significantly higher in the Fundacion II cultivar.

A sensory PCA biplot was generated with all sensory descriptive attributes (Figure 2); the first two dimensions explained 67.6% of the variability in the descriptive profiles of the tested avocado pulps. The sensory attributes were related to flavor, texture, and chemical factor sensations. Sensory attributes that loaded on Component 1 included flavor attributes such as lipid complex, creamy, global impact, oily, and earthy; it also included the texture descriptors lipidic residual, firmness, cohesiveness, creamy/oily, and spoon print. Component 2 involved flavor descriptors such as green/grassy, sweet, fresh, and sour.

#### *3.3. Liking of Avocado Pulps by Consumers*

The commercial preference of consumers towards the Hass cultivar was evident, since it presented high liking scores ranging between 7.1 and 7.2 hedonic points, as shown in Table 5. The most surprising results were obtained for the non-commercial cultivar Colin V-33 (6.9–7.2 hedonic points), since it ranked in the top liking group for all parameters and showed no significant differences from Hass for appearance, texture, flavor, and overall liking. The least overall-liked cultivar was the non-commercial Fundacion II (5.6 points), and for liking of appearance, the commercial cultivar Fuerte (5.9 points) was also significantly lower (LSD, *p* < 0.05). Overall acceptability data for the rest of the pulps ranged between 6.2 and 6.5 hedonic points; these values were slightly, but statistically significantly (LSD, *p* < 0.05), lower than those of the most liked cultivars (Hass and Colin V-33).

**Figure 2.** Principal component analysis (PCA) biplot of components F1 and F2, explaining 68% of the variance in the sensory descriptive profiles of seven avocado cultivars. Avocado samples are shown in blue (•), vectors for sensory descriptive attributes are shown in red, and the descriptor names are shown in black. Sensory attributes were also classified with an abbreviation indicating whether they were related to flavor (F), sensory texture (T), or a sensory chemical sensation factor (C).


**Table 3.** Proximate macronutrient concentrations (g/100 g fresh weight (FW)) of seven avocado cultivars, including commercial and non-commercial samples.

\* Values represent mean ± SE (*n* = 2). Different letters within the same row indicate that the means are significantly different, according to the LSD test (*p* < 0.05).


**Table 4.** Intensity ratings of nineteen descriptive attributes for seven commercial and non-commercial avocado cultivars.

<sup>1</sup> Letters in parentheses indicate the attribute type, designated as flavor (F), texture (T), or chemical factor sensation (C). \* Values represent mean ± SE (10 trained panelists in triplicate, *n* = 30). Different letters within the same row indicate that the means are significantly different, according to the LSD test (*p* < 0.05).



**Table 5.** Consumer liking scores (nine-point hedonic scale) for the appearance, texture, flavor, and overall liking of seven avocado cultivars.

\* Values represent mean ± SE (*n* = 116). Different letters within the same row indicate that the means are significantly different, according to the LSD test (*p* < 0.05).


#### *3.4. Preference Mapping of Consumer Acceptability and Descriptive Sensory Attributes*

As previously mentioned, significant differences were observed in the liking of consumers for the seven studied cultivars. Overall acceptability scores for the seven pulps ranged from 5.6 to 7.2 on a nine-point hedonic scale and generated three distinct groups (LSD, *p* < 0.05). The external preference map shown in Figure 3 was obtained by modeling the overall acceptability scores over the descriptive map, where the best fit was obtained using a quadratic model (*R*<sup>2</sup> = 0.98). Furthermore, the overall liking scores were highly correlated with the appearance, texture, and flavor liking variables (*r* = 0.87, 0.96, and 0.96, respectively, *p* < 0.05).

**Figure 3.** Tridimensional (**A**) and bidimensional (**B**) external preference maps obtained by quadratic modeling of the overall liking of frequent avocado consumers (*n* = 116) for seven avocado cultivars, and placement of the affective data within the sensory descriptive space. The principal component biplot of the sensory descriptive data (components F1 and F2) used in the construction of the external preference map is shown in Figure 2. \* Scores in the bidimensional map (**B**) represent the predicted hedonic values.

External preference mapping of the affective data (Figure 3) was aligned with the descriptive map (Figure 2) to gain further understanding of the sensory attributes that drove the preference for the evaluated avocado pulps. According to the results, the sensory attributes responsible for driving liking (areas shown in circles in Figure 3A,B) included the flavor descriptors global impact and oily, and the texture attributes creamy/oily and firmness. The alignment of descriptive and affective data in the preference map (Figure 3 and Figure S1) also provided valuable insight into which pulps were most desirable for different percentages of consumers. Hass, Colin V-33, and Jimenez II samples fell in a region of the map characterized by high acceptability (90–100% of satisfied consumers), followed by a region of slightly lower but still high acceptability rates (60–80% of satisfied consumers), in which cultivar Fuerte was located. None of the avocado pulps were located in the mid-acceptability region (20–60% satisfaction), but Fundacion II, Labor, and Ariete were located in the least-liked region, in which consumer satisfaction percentages ranged from 0 to 20%.

#### *3.5. Proximate Macronutrient, Instrumental Texture, Instrumental Color, and Sensory Relationships*

The relationships between sensory affective and descriptive data, shown in Figure 3 and described in the previous section, were key to gaining insight into the potential sensory drivers of liking. A second PCA was also constructed to visualize how the differences in proximate macronutrient composition and instrumental texture were related to the sensory descriptive attributes of the avocado pulps (Figure 4). As previously discussed, cultivars located in the high-acceptability regions (Hass, Colin V-33, and Jimenez II) were associated with sensory attributes related to flavor, particularly flavors associated with lipid notes. Total carbohydrates and sugars appeared to be relevant variables for the differentiation of Hass, Colin V-33, and Jimenez II from the other pulps, as they loaded in the same PCA quadrant (Figure 4).

Table 3 shows the lipid to carbohydrate ratios for the pulps; this parameter indicated that as the lipid content increased relative to the carbohydrate content (as for the Ariete and Labor pulps), the balance of flavor sensory attributes in the PCA quadrant seemed to move away from the desirable intensities (Figure 3). However, as shown in Figure 4, the relationship between macronutrient composition and desirable sensory descriptive profiles was not simple. Cultivar Colin V-33, which was among the most liked by consumers, had lipid to carbohydrate ratios similar to those of the Fuerte cultivar and the least-liked Fundacion II cultivar (Table 3), but Colin V-33's sensory sweetness scores were significantly higher (Table 4).

The PCA biplot shown in Figure 4 also aided in the visualization of the chemical components (Table 3), sensory attributes (Table 4), and instrumental texture parameters (Table 6) that differentiated the avocado pulps. Most sensory texture attributes related to lipidic sensations in the mouth, as assessed by the trained panelists, loaded in the quadrant with the most desirable pulps (Hass, Jimenez II, and Colin V-33). Relevant sensory texture attributes included lipidic residual, firmness, creamy/oily, and spoon print. Some instrumental texture parameters were correlated with sensory texture descriptive attributes, such as cohesiveness, which was inversely correlated with instrumental stickiness (*r* = −0.75). Additionally, instrumental stickiness showed a significant (*p* = 0.01) and direct correlation with moisture contents (*r* = 0.88). As shown in Figure 4, both the stickiness and moisture vectors were characteristics associated with the least liked cultivar Labor.


**Figure 4.** Principal component analysis (PCA) biplot of components F1 and F2, explaining 69% of the variance in the sensory descriptive profiles, proximate macronutrient composition, and instrumental texture analysis of the seven avocado cultivars. Avocado samples are shown in blue (•), vectors for sensory descriptive attributes are shown in red, and the descriptor names are shown in black. Variables were also classified with an abbreviation indicating whether they were related to sensory flavor (F), sensory texture (T), sensory chemical sensation factor (C), proximate macronutrient composition (P), or instrumental texture analysis (I).



**Table 6.** Instrumental texture parameters of avocado puree samples from seven commercial and non-commercial cultivars.

\* Values represent median (interquartile range) or mean ± SE for non-parametric or parametric data, respectively (*n* = 5). Different letters within the same column indicate significant differences, according to the Kruskal-Wallis or LSD post-hoc test, respectively (*p* < 0.05).

The liking of the commercial Fuerte cultivar was difficult to interpret, since it ranked in the second-best group for overall liking (Table 5 and Figure S1); however, its chemical and texture characteristics differed from those of Hass, Jimenez II, and Colin V-33. In the PCA biplot space (Figure 4), the Fuerte cultivar was associated with the vectors for instrumental firmness, total lipids, and work of shear; the latter was inversely related to moisture (*r* = −0.77).

Data on the instrumental colorimetric parameters of the avocado pulps were expressed as ∆*L*\*, ∆*a*\*, and ∆*b*\* in relation to the Hass cultivar (as reference control). Instrumental color differences among the avocado cultivars are shown in Figure 5 and Table S2. Color variation values (∆E*ab*\*), also shown in Figure 5, were calculated as quantitative parameters integrating the ∆*L*\*, ∆*a*\*, and ∆*b*\* values. The results indicated that the least liked pulps, Labor and Ariete, presented higher ∆*L*\* values (∆*L*\* = +8.15 and ∆*L*\* = +7.20, respectively), indicating higher lightness than the Hass cultivar. Colin V-33, Fuerte, Fundacion II, and Jimenez II showed only minor variations in ∆*L*\*, denoting similarity to the Hass cultivar. In contrast, the ∆*a*\* and ∆*b*\* values, in reference to the Hass cultivar, were similar for Fuerte and the non-commercial cultivars, suggesting that the green and yellow chromaticity was similar among all pulps. Color differences (∆E*ab*\*) were the instrumental parameters that most differentiated the samples from Hass, and the values ranged from 1.20 to 8.31 (Figure 5).

#### **4. Discussion**

#### *4.1. Fruit Morphological Traits and Proximate Macronutrient Composition*

Cultivars developed by the CICTAMEX foundation (Ariete, Colin V-33, Fundacion II, Jimenez II, and Labor) were characterized morphologically and chemically, and compared to the commercial cultivars Hass and Fuerte (Tables 1 and 3). In agreement with our findings, Cajuste-Bontemps et al. [29] and Alemán-Reyes et al. [19] reported that fruits from the Colin V-33 cultivar presented pulp yields similar to those from the Hass and Fuerte cultivars (Table 1). However, Colin V-33 had significantly higher fruit weights; similarly, prior authors reported high fruit weights (>319 g) for the same cultivar and described this as an unfavorable commercial characteristic, since the fruit sizes in greatest demand range between 200 and 300 g [19,29].

Results from the proximate macronutrient analyses (Table 3) indicated that all sampled cultivars (Guatemalan × Mexican hybrids) were above the California Avocado Industry standard for Hass of a minimum dry matter percentage of 20.8% [30]. According to Yahia & Woolf [31], the 20.8% dry matter standard approximates a minimum oil content of 8%. Total lipid concentrations shown in Table 3 were found to be inversely correlated (*r* = −0.88) with moisture contents. Similarly, previous research reported that during avocado fruit development, moisture levels declined, with parallel increases in dry matter and lipid contents [32]. Other researchers focusing on the chemical characterization of avocado pulps observed that fruits with high moisture (and low dry matter) contained lower lipid levels, but also showed lower levels of other macronutrients such as carbohydrates, sugars, and proteins [1,30].

#### *4.2. Descriptive Sensory Analyses of Avocado Pulps*

In the present study, sensory descriptive analyses showed that the attributes that differentiated the pulps the most were related to the lipids' impact on flavor and texture descriptors (Figure 2). In agreement, a positive correlation between oil content and palatability (flavor), as a unique sensory attribute, was previously reported for different commercial cultivars, as determined by "super-critical tasters" who were very familiar with the avocado fruit and expected more of it than would the average consumer [33]. Other published sensory studies also performed the scaling of various sensory attributes in avocado samples with different chemical compositions, although not with trained panels. In a study conducted by Obenland et al. [15], avocado sensory attributes were defined by a consumer panel (*n* = 15–20), which also conducted affective testing of the same twelve avocado samples; the data were used for the selection of eight main sensory attributes associated with the samples. All avocado samples included in their study were from the Hass cultivar, grown in different locations and harvested in different years. The list of potential avocado descriptors was based on a previous study also conducted with the Hass cultivar [31]. Although this foundational work on the selection of the main attributes present in avocado, based on consumer evaluations, generated valuable knowledge [15], the attributes were determined using only the Hass cultivar, which might have limited the sensory description of other attributes not present or pronounced in that particular cultivar. The main avocado sensory attributes identified in their work included texture attributes (firm, creamy, buttery, smooth, and watery) and flavor attributes (grassy, bland, nutty, and buttery). Moreover, using consumer panels, other authors confirmed the presence of similar attributes in studies with Hass avocado fruit and with other cultivars reported as Hass hybrids [18,34]. In the aforementioned work, the sensory attributes scaled were limited to creamy texture (watery to creamy), richness (bland to rich), and grassy flavor (grassy to not grassy) on 15-cm line scales. It was not clear whether the 'richness' definition was evaluated as an overall attribute or defined for flavor or texture, but it was clearly correlated with the creamy texture attribute (*r* = 0.86) [15].

**Figure 5.** Instrumental colorimetric differences for commercial and non-commercial avocado cultivars in reference to the widely accepted Hass cultivar. Instrumental data for the Hass cultivar were used as the reference control to calculate each cultivar's deviations in the colorimetric parameters, which included ∆*L*\*, ∆*a*\*, ∆*b*\*, and the total color difference (∆E*ab*\*).

Literature on sensory studies conducted with avocado cultivars other than Hass and its hybrids was found to be very scarce [16,17], and, as previously mentioned, works conducted with trained descriptive panels on the evaluation of different avocado cultivars were not found. Among the few works that included various cultivars, Shaw et al. [17] conducted sensory hedonics on twenty-one avocado samples; many of them belonged to the West-Indian avocado race, characterized by lower oil contents but described as well adapted to subtropical regions. West-Indian hybrids are therefore commercially grown in Florida, USA [35]. Consumers who evaluated the pulps documented flavor sensory descriptors that included nutty, sweet, bitter, and mild. Formal descriptive sensory analyses with trained panels were published for Hass cultivar samples [8,36]. However, the aims of both studies differed from the identification of drivers of liking and focused on documenting the effects of emerging technologies and storage on sensory profiles. Salgado-Cervantes et al. [36], using an experienced sensory panel, identified twelve descriptors classified into visual appearance (homogeneity, shiny, and color), aroma (avocado, boiled vegetable, and nutty), flavor (bitter, fatty, and astringent), and texture (unctuous, grainy, and fibrous). These attributes were reported to be relevant sensory descriptors present in the control and flash vacuum-expansion processed avocado samples.

#### *4.3. Preference Mapping of Consumer Acceptability and Descriptive Sensory Attributes*

Most avocado fruits grown commercially in the world are from the Hass cultivar, since consumers are positively drawn to its taste and texture [18]. Our consumer acceptability results confirmed that Hass presented high liking scores (Table 5). Unexpectedly high liking scores were also obtained for the non-commercial cultivar Colin V-33, whose liking scores were not significantly different from those of Hass. Our results were in agreement with observations reported by López-López [37]. In that work, the authors conducted a small consumer acceptability study (*n* = 14) using three of the same cultivars evaluated herein (Hass, Fuerte, and Colin V-33). Their results indicated that hedonic scores for flavor, color, odor, and external appearance for cultivar Colin V-33 were not significantly different from those of Hass. Therefore, the observations from both independent studies confirmed that Colin V-33, a non-commercial cultivar, was highly liked by consumers.

In the present study, the external preference map was modeled using overall liking scores from consumers, followed by its placement over a previously constructed descriptive map (Figure 3). Overall liking scores were highly correlated with the appearance, texture, and flavor liking scores (*r* = 0.87, 0.96, and 0.96, respectively, *p* < 0.05). It is possible that consumers, being untrained assessors, were not able to accurately differentiate appearance, flavor, and texture, but they clearly indicated that the three attributes were relevant for consumer satisfaction with avocado fruit. Pereira et al. [16] made a similar observation for avocado samples: it is common in consumer research to observe correlations between overall liking scores and the liking scores for specific attributes, possibly because of the halo effect, since consumers are more focused on the general affective response and, when they like or dislike a sample, tend to give similar scores to all of its attributes. Obenland et al. [15] followed the changes in sensory attributes and hedonics during maturation of the Hass cultivar from different locations and harvest years. That study showed that liking declined when the texture descriptor for creaminess declined and the flavor descriptor for grassiness increased, indicating that both flavor and texture attributes contributed to liking. Our results confirmed and complemented these observations, since flavor and texture sensory attributes were found to be the drivers of liking; more specifically, flavor descriptors such as lipidic notes, sweetness, and some fresh/green notes, together with texture descriptors for firmness, creaminess, and lipidic residue. The present work also characterized bitter and astringent as sensory attributes present in the avocado pulps, in agreement with descriptive work on avocados conducted by Salgado-Cervantes et al. [36].
Statistical comparisons among pulps did not show marked differences in bitterness or astringency (Table 4), but in the preference maps (Figures 2 and 3) both attributes were associated with pulps that were penalized in liking (Labor and Fuerte). Sensory flavor is complex, and in the present work it was limited to the studied attributes; therefore, it is possible that those particular pulps conveyed sensations that require further descriptive work.
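The external preference mapping described above, projecting descriptive data onto principal components and then regressing consumer liking onto the resulting coordinates, can be sketched as follows. This is a minimal illustration with made-up attribute and liking values, not the study's data, and it fits a simple linear (vector) model rather than the quadratic model used for the map in Figure 3.

```python
import numpy as np

# Hypothetical descriptive scores: rows = 4 avocado pulps, cols = 3 attributes
# (e.g., creaminess, sweetness, bitterness) -- illustrative values only.
X = np.array([[6.1, 5.8, 2.0],
              [5.9, 6.0, 1.8],
              [3.2, 2.9, 4.5],
              [3.0, 3.1, 4.8]])
liking = np.array([7.2, 7.0, 5.1, 4.9])  # mean overall liking per pulp

# Step 1: PCA of the descriptive data via SVD of the centered matrix.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T  # sample coordinates on PC1-PC2

# Step 2: regress liking onto the PC coordinates (vector model; a quadratic
# external preference map would add squared and interaction terms).
A = np.column_stack([np.ones(len(liking)), scores])
coef, *_ = np.linalg.lstsq(A, liking, rcond=None)
fitted = A @ coef  # liking predicted from the descriptive space
```

Samples whose fitted liking is high sit in the high-acceptability region of the map, which is how cultivars such as Hass and Colin V-33 cluster with the attributes that drive preference.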

#### *4.4. Proximate Macronutrient, Instrumental Texture, Instrumental Color, and Sensory Relationships*

As previously discussed, total lipids are widely reported in the literature to be a desirable quality attribute in avocado fruit [30]. In this work, all studied cultivars were Guatemalan × Mexican hybrids, and their proximate composition indicated that all were above the California Avocado Industry standard for minimum dry matter percentage (20.8%, set for Hass) [30]. However, in the present work we observed that a balance between carbohydrates, sugars, and lipids appears to be relevant to the avocado sensory flavor profile (Figure 4), an observation also supported by slight direct correlations between flavor liking and the carbohydrate and sugar contents of the pulps (*r* = 0.65 and 0.61, respectively).

In their work with the Hass cultivar, Obenland et al. [15] concluded that carbohydrates, because of their low concentrations, might not influence acceptability. However, their observations could be limited by the use of that single cultivar. A prior study conducted with various cultivars, including different avocado races, focused on carbohydrates [17] and showed that the West-Indian cultivars contained higher levels of seven-carbon (C7) sugars, which are rare in nature but are present in avocado fruit. The C7 sugars D-mannoheptulose and perseitol were the main sugars present in some cultivars of the West-Indian race background. Furthermore, West-Indian race fruits contained higher concentrations of the C7 sugars than of glucose and fructose, and the consumer panel associated them with the sweet sensory attribute [17]. In the present work, the concentrations of individual C7 sugars were not measured; therefore, we were not able to confirm the prior authors' conclusion that when the Mexican race was present in the genetic background, C7 sugar levels tended to be low [17]. In this study, total sugar concentrations were significantly higher for some of the pulps with higher liking scores (Hass and Jimenez II); however, Colin V-33 had lower sugar levels but higher sweetness sensory scores. Perhaps further work on the characterization of individual sugar profiles, including C7 sugars, can provide further insight into the sensory observations. Flavor metabolites have also been reported to play relevant roles in the generation of desirable profiles, and they can be generated from both lipids and carbohydrates, particularly sugars. Lipid degradation products such as acetaldehyde, methyl acetate, and 2,4-heptadienal were associated with high preference values [15]; nonetheless, avocado sugar metabolites are less well characterized. Prior studies described the disappearance of C7 sugars during ripening [1,38] and suggested a potential role in the generation of flavor metabolites.

In addition to flavor, sensory texture and appearance attributes need to be considered among the potential drivers of liking of avocado pulp. Thus, in this study, sensory texture attributes clearly differentiated avocado pulps and were related to the descriptors of lipidic oral sensations. Similarly, other studies confirmed the relationship between texture attributes such as firm, creamy, and smooth and high hedonic scores [15,34]. Herein, instrumental texture measurements alone were poorly correlated with consumer liking, possibly because liking is a complex variable that is difficult to relate to individual instrumental parameters. Nevertheless, instrumental data were useful as an additional objective assessment of the characteristics of the avocado pulps evaluated. Sensory firmness was directly related to instrumental work of shear (*r* = 0.64), and inversely to instrumental stickiness (*r* = −0.69). Similarly, sensory cohesiveness (rated higher in the most liked pulps) was also inversely related to instrumental stickiness (*r* = −0.76). These correlations served to confirm that the trained panel assessments were in accordance with the texture lexicon definitions, since other authors reported similar sensory and instrumental texture relationships for semi-solid matrices [39].
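The sensory–instrumental relationships above are expressed as Pearson correlation coefficients. As a reference, the computation behind the reported *r* values can be sketched as follows; the numeric inputs are illustrative, not the study's data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Perfectly inverse rankings (e.g., cohesiveness vs. stickiness in the
# idealized case) give r = -1.
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0
```

Values near ±1 indicate strong linear association (as for sensory cohesiveness vs. moisture, *r* = −0.97), while values near 0 indicate the weak instrumental–liking relationships noted above.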

Interestingly, correlations between sensory texture attributes evaluated by the trained panel and proximate compositions were even stronger than those for instrumental texture measurements. For instance, sensory cohesiveness showed a significant (*p* = 0.0002) and strong inverse correlation with moisture content (*r* = −0.97). Additionally, sensory cohesiveness was strongly correlated with total carbohydrates (*r* = 0.96) and sugars (*r* = 0.95), and mildly correlated with lipids (*r* = 0.86), confirming the relevant relationship of both macronutrients to texture, in addition to flavor. Data from both sensory assessments (liking and descriptive) indicated that high moisture levels, and thus lower dry matter, lipids, carbohydrates, and sugars, moved the texture away from the desired sensations. Contrary to sensory cohesiveness, instrumental stickiness loaded in the same PCA quadrant as the less desirable traits (Figure 4) and was also directly correlated with moisture (*r* = 0.88) and inversely with lipid content (*r* = −0.79). Therefore, the results indicated that stickiness could be considered an interesting instrumental texture parameter, since it was also described by prior authors as an undesirable trait for semi-solid matrices, such as fat spreads [40].

Considering the lineage information shown in Table 1, it was observed that the non-commercial cultivars located in the high-acceptability regions (Colin V-33 and Jimenez II) had common lineages with at least one commercial cultivar (Hass or Fuerte), suggesting that the progenitors were already selected for the desirable traits, such as flavor and texture. Selection was possibly performed considering high dry matter and oil contents, since both chemical traits are known to drive acceptability [33]. Energy concentrations (kcal/100 g fresh weight (FW)), which serve as a combined measure of the contribution of lipids, carbohydrates, and proteins to the overall composition of the pulps, are also included in Table 3, and the results clearly indicated that the most liked cultivars contained the highest values (201.3–267.2 kcal/100 g FW). However, the results obtained for the Fuerte cultivar indicated that other factors could also influence liking. The Fuerte cultivar ranked in the second-best group for overall liking (Table 5 and Figure S1), although its dry matter, lipid, and caloric contents were the highest of all (Table 3). Sensory texture characteristics of Fuerte were not very different from those of Hass, Jimenez II, and Colin V-33, although the instrumental texture parameters indicated significantly higher values for its work of shear, work of adhesion, and stickiness (Table 6). It is also possible that its acceptability was slightly penalized because of its visual aspects, since its appearance liking score was significantly lower (5.9 on a nine-level hedonic scale, Table 5). Using Hass as a reference, the ∆*a*\* and ∆*b*\* values were similar for Fuerte and the non-commercial cultivars, but the ∆E*ab*\* values showed some differences among the pulps (Figure 5). Labor (∆E*ab*\* = 8.31) and Ariete (∆E*ab*\* = 7.25) pulps showed the highest color differences, while the Fuerte (∆E*ab*\* = 2.7) color variation was not as high.
Still, Fuerte differed slightly more from Hass than the more liked Colin V-33 (∆E*ab*\* = 1.2) and Jimenez II (∆E*ab*\* = 2.1) did. Perhaps that slight color difference was sufficient to penalize the liking of Fuerte for appearance. However, Ghidouche et al. [24] observed that ∆E*ab*\* values greater than 3.5 units were required for an average observer to perceive a color difference. Labor and Ariete pulps presented ∆E*ab*\* values greater than 3.5 from Hass (8.3 and 7.25, respectively), which might have partly influenced consumers' slight overall dislike.
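The color differences discussed above follow the standard CIE76 definition, ∆E*ab*\* = √(∆L\*² + ∆a\*² + ∆b\*²). A minimal sketch of the computation (the numeric inputs are illustrative, not the study's colorimetric measurements):

```python
import math

def delta_e_ab(lab_ref, lab_sample):
    """CIE76 color difference between two CIELAB (L*, a*, b*) triplets."""
    return math.sqrt(sum((r - s) ** 2 for r, s in zip(lab_ref, lab_sample)))

# A 3-4-5 triangle in the a*-b* plane gives deltaE*ab = 5.0
print(delta_e_ab((50.0, 0.0, 0.0), (50.0, 3.0, 4.0)))  # 5.0
```

Against a reference pulp (Hass in this study), a sample with ∆E*ab*\* above roughly 3.5 would be perceivably different in color to an average observer [24], which is the threshold that Labor and Ariete exceeded.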

#### **5. Conclusions**

For the first time, the development of highly detailed descriptive profiles of different commercial and non-commercial avocado cultivars generated new knowledge on the key sensory attributes that drove the liking for avocado pulp conveyed by consumers. Our results confirmed observations obtained from prior consumer evaluations, in which flavor and texture sensory attributes were concluded to be key for liking. Furthermore, the present study's sensory-driven strategy generated an external preference map that facilitated the identification of the sensory descriptors that influenced overall liking. In general, consumers tended to prefer avocados with a strong global impact, creamy and oily flavor attributes, and other relevant sensory texture attributes that grouped in the region of Hass (one of the most preferred pulps). A non-commercial cultivar, Colin V-33, presented sweet and green notes that also appeared to drive preference. Therefore, the results indicated that the drivers of liking for avocado pulp include specific lipid flavor notes, sweetness, green notes, and the textures of creaminess/oiliness, lipid residue, firmness, and cohesiveness. Earthy and bitter notes, the absence of fibers, and a balanced green color also complemented specific cultivars' preferences. The role of avocado sugars in flavor remains to be further explored, since the fruit contains unique carbohydrates. The present work also generated new knowledge and ideas on the possible drivers of disliking, such as stickiness, differences in color, and possibly other unexplored flavors and chemical sensations that remain to be characterized. Overall, the preference map generated valuable information that can be used by avocado breeders and processors as sensory-guided insight to develop and select cultivars with high acceptability for their commercialization strategies.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/2304-8158/10/1/99/s1, Figure S1. Bidimensional external preference map obtained by quadratic modeling of the overall liking of frequent avocado consumers (*n* = 116) for seven avocado cultivars, and placement of affective data within the sensory descriptive space. Table S1. Instrumental colorimetric values of seven commercial and non-commercial avocado cultivars. Table S2. Differences in instrumental colorimetric parameters of commercial and non-commercial avocado cultivars in reference to the values obtained for cultivar Hass.

**Author Contributions:** Conceptualization, R.V.-L., D.G.R.-S., M.d.l.C.E.B., R.I.D.d.l.G., and C.H.-B.; Methodology, L.M.M.-O., R.V.-L., D.G.R.-S., M.d.l.C.E.B., A.D.F.-M., J.S.J.-D.l.G., R.I.D.d.l.G., and C.H.-B.; Conceptualization, C.H.-B., A.D.F.-M., R.V.-L., R.I.D.d.l.G., and D.G.R.-S.; Software, L.M.M.-O., R.V.-L., and C.H.-B.; Validation, R.V.-L., D.G.R.-S., and C.H.-B.; Formal analysis, L.M.M.-O., R.V.-L., and J.S.J.-D.l.G.; Investigation, L.M.M.-O., R.V.-L., and D.G.R.-S.; Resources, M.d.l.C.E.B.; Data curation, L.M.M.-O., R.V.-L., J.S.J.-D.l.G., R.I.D.d.l.G., and C.H.-B.; Writing—original draft preparation, L.M.M.-O. and C.H.-B.; Writing—review and editing, L.M.M.-O., R.V.-L., D.G.R.-S., M.d.l.C.E.B., A.D.F.-M., R.I.D.d.l.G., and C.H.-B.; Visualization, L.M.M.-O., R.V.-L., D.G.R.-S., and C.H.-B.; Supervision, D.G.R.-S.; Project administration, D.G.R.-S. and C.H.-B.; Funding acquisition, R.V.-L. and C.H.-B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Tecnologico de Monterrey Translational-Omics (GIEE) Research Group and by the Mexican National Council for Research and Technology (CONACyT) doctoral scholarships for Luis Martín Marín-Obispo (No. 445941), Raul Villarreal-Lara (No. 359813), and Salvador Jaramillo-De la Garza (No. 743316). Descriptive tests were kindly funded by SensoLab Solutions, SC through the SensoLab Pro Bono Research Initiative.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Ethics Committee for Sensory Evaluation Studies of the Department of Bioengineering, School of Engineering and Sciences of Tecnologico de Monterrey, Campus Monterrey, Mexico (Ethics ID: CSERDBT-0001, approved on 02/12/2018).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Data is contained within the article or supplementary material.

**Acknowledgments:** We are grateful to Fundacion Salvador Sanchez Colin-CICTAMEX for providing the avocado cultivars used in this study. A special acknowledgment to CIT2, the Technology Park and Technology Transfer Center for technology-based startups and spinoffs from Tecnologico de Monterrey, for access to their facilities.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


