**Effects of Varying the Color, Aroma, Bitter, and Sweet Levels of a Grapefruit-Like Model Beverage on the Sensory Properties and Liking of the Consumer**

#### **Andries G. S. Gous <sup>1</sup>, Valérie L. Almli <sup>2</sup>, Vinet Coetzee <sup>3</sup> and Henrietta L. de Kock <sup>1,\*</sup>**


Received: 14 December 2018; Accepted: 11 February 2019; Published: 22 February 2019

**Abstract:** Color, aroma, and sweet and bitter tastes contribute to the sensory perception of grapefruit juice. Consumers differ in their liking of grapefruit, partly because of the bitter taste that characterizes the fruit. The objective was to determine the effect of varying the color (red or yellow), aroma (two levels), bitterness (three levels), and sweetness (three levels) of a grapefruit-like model beverage on consumers' liking and perception of its sensory properties. The sensory profiles of thirty-six grapefruit-like beverages, created on the basis of a factorial design, were described. Consumers rated their liking of the color, aroma, and flavor of the twelve most diverse beverages. Bitter and sweet levels of the beverages had a significant effect on the flavor and aftertaste attributes. Aroma concentration had a significant effect on the majority of the sensory attributes. Color had a significant effect on the perception of some of the aroma attributes, as well as on grapefruit flavor intensity. Consumers liked the red beverages more than the yellow ones, and the low-aroma beverages more than the high-aroma ones. Consumers preferred the low bitter/high sweet beverages. Pungent and grapefruit aromas were negative drivers of aroma liking. Sweet and citrus flavors were positive drivers, and sour and bitter flavors negative drivers, of flavor liking of the tested beverages.

**Keywords:** grapefruit; sensory; consumer; bitter; naringin; sweet; aroma; color; hedonic

#### **1. Introduction**

The sensory properties of grapefruit (*Citrus* × *paradisi*) are distinctive, characterizing components of the fruit and play a key role in whether consumers choose to consume the fruit and its products, e.g., juice. Grapefruit is a rich source of vitamin C and of health-promoting citrus flavonoids and limonoids, and has beneficial antioxidant and anti-inflammatory properties [1,2]. Its appearance, aroma, flavor, and mouthfeel properties contribute to the sensory perception of the fruit.

Consumers differ widely in their liking or disliking of grapefruit, and part of this individual preference is attributed to liking or disliking of the bitter taste that characterizes the fruit. Excessive bitterness of the juice is considered an important problem in commercial grapefruit juice production [3]. Naringin and limonin are mainly responsible for the bitter taste commonly associated with grapefruit [4]. The consumption of fresh grapefruit and grapefruit products has been declining [5], and plant breeders are working on ways to select for desirable sensory traits. A better understanding of the impact of the different sensory modalities contributing to the sensory perception of grapefruit products (e.g., juice) might assist product developers to optimize formulations and improve uptake of the products among consumers, thereby maintaining or enhancing profitability for the role-players along the grapefruit value chain.

Flavor perception is complex, due to the simultaneous stimulation of a number of senses. It is the result of processes that respond to sensory signals from the activation of multiple sensory modalities, including smell (retronasal olfaction), mouthfeel (somatosensation), and taste (gustation), and to some extent also sight. When different senses are stimulated concurrently and perceptually interact with each other, the perceived flavor is the result of cross-modal sensory interaction [6]. Cross-modal interactions can change the intensity and perceived character of individual tastes and aromas, and even the overall flavor [7].

The present study aimed, broadly, to determine the relations between the stimulus components of a model beverage (formulated to be similar to grapefruit juice) and their effects on the perceived sensory properties and hedonic responses. We factorially combined, in the same acidified neutral base, each of three levels of bitter naringin (low, medium, and high) with each of three levels of sweet sucrose (low, medium, and high), two levels of grapefruit aroma (low and high), and two color variants (red and yellow). We hypothesized that the perceived bitterness of the model grapefruit-like beverage would drive consumers' dislike of the beverage, but that bitterness perception would be a function of cross-modal color–taste, aroma–taste, and sweet–bitter taste interactions.

The color of the natural juice extracted from grapefruit depends on the variety used and ranges from greenish-yellow to pale yellow, pink, and light red [8]. We hypothesized that a rose-red grapefruit-like beverage would be perceived as sweeter than a pale yellow one. Previously, it was reported [9] that a red color decreased the perceived bitter taste intensity of a caffeine-in-water solution, while yellow and green colors had no effect. The color of food and drinks impacts the subsequent perception of taste, flavor, and overall sensory perception. Several studies have reported that the color of a solution greatly impacts the ability to identify its flavor and also affects liking responses [7].

We hypothesized that a beverage with a high, compared to a low, grapefruit aroma would suppress bitter taste perception and enhance sweetness perception [10]. A recent study [10] reported that lemon extract, sucrose, and citric acid, when presented separately and also together, affected the perception of sweet, sour, and citrus flavors. The aroma of a product can influence the perception of basic tastes, and vice versa [11–15].

It is well known that sucrose and other sweet-tasting compounds can suppress bitterness; this is applied in practice when bitter-tasting coffee or tea is sweetened with sugar. Here the expectation was that sweetness would suppress bitterness, but an enhancement effect on volatile aroma and flavor compounds was also expected. When sucrose was added to fruit juices, not only were the perceived bitterness and sourness reduced and the sweet taste intensity increased, but the sweet aroma intensity rating also changed [16].

#### **2. Materials and Methods**

#### *2.1. Preparation of the Grapefruit-Like Beverages*

Thirty-six grapefruit-like beverages (Table 1) were manufactured following a factorial design, with deflavored, clarified, deionized, and acidified apple juice as base, with the addition of naringin (three bitter levels), sucrose (three sweet levels), a grapefruit aroma compound mixture (two intensity levels) consisting of caryophyllene, citral, nootkatone, aldehyde C8 (octanal), aldehyde C9 (nonanal), and aldehyde C10 (decanal), and two colorants (red or yellow). The naringin additions were intended to reflect a low level, an intermediate level, and a high level, based on the typical content in grapefruit juice (218–340 mg/kg) [17]. The low level of sweetness was based on the industry minimum requirement for export purposes, with incrementally higher levels added to reflect medium and high sweetness. The aroma compound mixture and the levels used were selected in consultation with a flavorant supplier. The typical grapefruit juice color was imitated using artificial colorants. The red color was a 0.001% solution blend of 30% sunset yellow and 70% ponceau red. The yellow color consisted of 0.0125% quinoline yellow. Standard preparation and mixing procedures were used for all added stimuli to ensure uniformity. The grapefruit-like beverages were filled into 250 mL plastic bottles with lids, for easy handling and uniformity, and were kept frozen at −18 °C until use. The beverages were defrosted overnight at ambient temperature and kept at 14 °C until served. A summary of the physico-chemical characterization of the 36 grapefruit-like beverages is presented in the Supplementary Material (Table S1).
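To make the design explicit, the full 3 × 3 × 2 × 2 factorial (bitter × sweet × aroma × color) can be enumerated as in the sketch below. The four-letter sample codes follow the convention used in Table 1; the single-letter level labels are illustrative assumptions:

```python
from itertools import product

# Factor levels, coded as in the paper's four-letter sample codes
# (1st letter = bitter, 2nd = sweet, 3rd = aroma, 4th = color).
bitter = ["L", "M", "H"]   # naringin: low, medium, high
sweet = ["L", "M", "H"]    # sucrose: low, medium, high
aroma = ["L", "H"]         # aroma blend: low, high
color = ["Y", "R"]         # colorant: yellow, red

# Full factorial design: every combination, 3 * 3 * 2 * 2 = 36 beverages.
design = ["".join(combo) for combo in product(bitter, sweet, aroma, color)]

print(len(design))    # 36
print(design[:4])     # ['LLLY', 'LLLR', 'LLHY', 'LLHR']
```

Every sample code that appears later in the text (e.g., HLLY, LHHR) is one element of this enumeration.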


**Table 1.** Factorial design for the 36 grapefruit-like beverages.

<sup>1</sup> Code: 1st letter = bitter level (High, Medium, or Low); 2nd letter = sweet level (High, Medium, or Low); 3rd letter = aroma level (High or Low); 4th letter = color (Red or Yellow). Samples in bold italics were used for consumer evaluation. <sup>2</sup> Aroma blend = Caryophyllene, citral, nootkatone, aldehyde C8 (octanal), aldehyde C9 (nonanal), aldehyde C10 (decanal). <sup>3</sup> Red color = 0.001% solution (30% Sunset yellow and 70% Ponceau red); Yellow color = 0.0125% Quinoline yellow.

#### *2.2. Descriptive Sensory Analysis*

The sensory profiles of the beverages were described by a sixteen-member trained sensory panel with one to two years of descriptive sensory analysis experience. The specific training for attribute and methodology development for the evaluation of the beverages consisted of two sessions of 2 h each, using the generic descriptive analysis method [18]. A total of 21 attributes were generated to characterize the aroma, flavor, and aftertaste of the grapefruit-like beverages (Table 2). Beverage samples (±30 mL) were served at ±14 °C, in 125 mL polystyrene cups with plastic lids, marked with randomly selected three-digit numbers. Samples were evaluated in duplicate, 12 beverages per 2 h session per day, over a total of six sessions. The presentation order of samples per day for the different panelists followed a Williams Latin square design. Reference standards were available during training and evaluation sessions.


**Table 2.** Definitions of attributes used for describing the aroma, flavor, and aftertaste of the grapefruit-like beverages.

Panel performance was monitored to test reproducibility and consistency of the panel ratings using PanelCheck 1.3.2 (www.panelcheck.com; Nofima, Ås, Norway).

The attributes were evaluated on a structured horizontal line scale (10 cm) with descriptors at the scale ends ranging from 'not intense' (at the left end of the scale, 0 cm) to 'very intense' (at the right end of the scale, 10 cm). Data was captured using Compusense® five release 4.6 software (Compusense Inc., Guelph, ON, Canada).

#### *2.3. Consumer Evaluation*

Ninety-six young South African female consumers aged 18–24 years were recruited by trained fieldworkers. Each consumer completed an online screening survey and was invited to participate if in a self-reported good state of health and not limited by any food intolerance(s) and/or allergies. Participants were briefed and gave written consent before evaluating the beverages. Participants were requested not to eat, drink (except for water), or smoke for at least 1 h prior to the session.

The consumers (*n* = 90) evaluated liking of the color, aroma, and flavor of the 12 most diverse beverages (selected on the basis of composition) (Table 1) using the Simplified Labeled Affective Magnitude (SLAM) scale [19], a 10 cm line scale labelled with the descriptors 'greatest imaginable dislike' (at 0 cm) and 'greatest imaginable like' (at 10 cm). Sample preparation and presentation were the same as for the trained panel. The 12 samples were evaluated in one session, and the order of presentation to the different consumers followed a Williams design.
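A Williams design balances first-order carryover: each beverage appears once in every serving position, and each beverage immediately follows every other one equally often across consumers. A minimal sketch of the textbook construction for an even number of treatments, here the 12 consumer samples, is shown below; this illustrates the design principle and is not necessarily the authors' exact randomization:

```python
def williams_design(n):
    """Williams Latin square for an even number of treatments n.
    Each treatment appears once in every serving position, and every
    ordered pair of adjacent treatments occurs exactly once across
    rows (first-order carryover balance)."""
    assert n % 2 == 0, "the basic construction requires an even n"
    # Base order: 0, 1, n-1, 2, n-2, 3, n-3, ...
    base, lo, hi = [0], 1, n - 1
    for k in range(1, n):
        if k % 2 == 1:
            base.append(lo)
            lo += 1
        else:
            base.append(hi)
            hi -= 1
    # The remaining rows are cyclic shifts of the base order.
    return [[(b + i) % n for b in base] for i in range(n)]

orders = williams_design(12)  # serving orders for the 12 beverages
print(orders[0])              # [0, 1, 11, 2, 10, 3, 9, 4, 8, 5, 7, 6]
```

With 90 consumers and 12 distinct rows, each row would simply be reused across consumers (roughly 7 or 8 consumers per row).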

Data was captured using Compusense® five release 4.6 software (Compusense Inc., Guelph, ON, Canada).

Ethical approval for this study was obtained from the Faculty of Natural and Agricultural Sciences Ethics Committee at the University of Pretoria (EC 130827-088).

#### *2.4. Statistical Analysis*

An analysis of variance (ANOVA) model, fitted using PROC GLM in SAS v9.4 (SAS Institute Inc., Cary, NC, USA), was used to determine the main effects of panelist, bitter level, sweet level, aroma level, and color type, together with the respective two-way interactions, on the sensory attributes of the beverages. Tukey's HSD test (*p* = 0.05) was used to compare beverages for attributes that showed significant effects. Principal component analysis (PCA), using XLSTAT 2014 (Addinsoft, Paris, France), was applied to the correlation matrix of the sensory panel mean ratings for all attributes of all grapefruit-like beverages.
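As a worked illustration of the F-test that PROC GLM performs for each term, the sketch below computes a one-way ANOVA F-statistic for a single factor (e.g., bitter level) from invented ratings; the actual model also included panelist, sweet, aroma, color, and the two-way interactions:

```python
# Hypothetical bitter-flavor intensity ratings (0-10 line scale) at the
# three naringin levels; the numbers are illustrative, not the study's data.
groups = {
    "low": [2.1, 1.8, 2.4, 2.0],
    "medium": [4.5, 4.9, 4.2, 4.7],
    "high": [6.8, 7.1, 6.5, 7.0],
}

ratings = [x for g in groups.values() for x in g]
grand_mean = sum(ratings) / len(ratings)

# Between-group (treatment) and within-group (error) sums of squares.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in groups.values())
ss_within = sum((x - sum(g) / len(g)) ** 2
                for g in groups.values() for x in g)

df_between = len(groups) - 1            # k - 1 = 2
df_within = len(ratings) - len(groups)  # N - k = 9
F = (ss_between / df_between) / (ss_within / df_within)
print(round(F, 1))  # a large F signals a strong bitter-level effect
```

In the full two-way-interaction model, the same ratio is formed for each term against the residual mean square, after partitioning the total sum of squares across all terms.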

Consumer liking of the color, aroma, and flavor of the 12 most diverse beverages was analyzed by a three-way ANOVA model, including the effects of color, aroma level, and tastants (bitter and sweet levels in three combinations). Means were compared using Fisher's least significant difference test at *p* < 0.05. Data were analyzed using GenStat® (VSN International Ltd., Hertfordshire, UK). Consumer liking ratings (y) for color, aroma, and flavor of the beverages were modeled as a function of the descriptive sensory attributes (x), using three separate partial least squares (PLS) regression models. Preliminary models were run with all sensory attributes and their squared terms. Variable importance in projection (VIP), which measures how important a variable is in terms of modeling the liking attributes, was used to select a smaller number of linear and squared terms for the final model. The VIP values summarize the overall contribution of each X-variable to the PLS model, summed over all components and weighted according to the Y variation accounted for by each component. Only those linear terms with a VIP greater than 0.8, as well as the five squared terms with the highest contribution, were retained. The PLS models were used to determine the positive and negative drivers of color, aroma, and flavor liking, and also to predict consumer liking of the 24 samples that were profiled by the descriptive sensory panel but not evaluated by the consumers. The SIMCA-P package (Umetrics, Umeå, Sweden) was used for the PLS modeling.
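The VIP statistic used for this variable selection can be written down compactly. The sketch below implements the standard VIP formula from a fitted PLS model's X-weights, scores, and y-loadings (the toy numbers are invented for illustration). A useful sanity check is that the VIP² values always average to 1 across variables, which is why 0.8 is a common retention cutoff:

```python
import math

def vip_scores(W, T, Q):
    """Variable importance in projection (VIP) from a fitted PLS model.
    W: per-component X-weight vectors (A lists of length p)
    T: per-component score vectors   (A lists of length n)
    Q: per-component y-loadings      (A scalars)
    VIP_j = sqrt(p * sum_a ssy_a * (w_aj / ||w_a||)^2 / sum_a ssy_a),
    where ssy_a = q_a^2 * (t_a . t_a) is the Y-variance explained by
    component a."""
    p = len(W[0])
    ssy = [q * q * sum(t_i * t_i for t_i in t) for q, t in zip(Q, T)]
    total = sum(ssy)
    vips = []
    for j in range(p):
        s = sum(ssy_a * (w[j] ** 2 / sum(wk ** 2 for wk in w))
                for ssy_a, w in zip(ssy, W))
        vips.append(math.sqrt(p * s / total))
    return vips

# Hypothetical 2-component model with 3 X-variables:
W = [[0.8, 0.6, 0.0], [0.0, 0.6, 0.8]]    # X-weights per component
T = [[1.0, -1.0, 0.5], [0.3, 0.2, -0.5]]  # X-scores per component
Q = [0.9, 0.4]                            # y-loadings per component
v = vip_scores(W, T, Q)
print([round(x, 2) for x in v])
```

Variables with VIP above the 0.8 threshold (here the first two toy variables) would be retained in the refined model, mirroring the selection rule described above.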

#### **3. Results**

#### *3.1. Descriptive Sensory Profiles of the Grapefruit-Like Beverages*

Table 3 presents a summary of the main effects (color, aroma, bitter, and sweet) and the two-way interaction ANOVA effects (provided in Supplementary Tables S2–S4) on the sensory attributes of the grapefruit-like beverages, as evaluated by the trained sensory panel. Means for each of the samples represent the average of duplicate ratings by 16 panelists. The color of the grapefruit-like beverages had a significant effect on the perception of some aroma and flavor properties. The overall aroma intensity; the grapefruit, deteriorated/rotten, muddy/moldy, fruity, and sweet aromas; and the grapefruit flavor of the red-colored beverages were perceived as significantly (*p* < 0.05) more intense than those of the yellow-colored beverages.

The level of aroma added had a significant effect on the majority of the sensory attributes, namely, overall aroma intensity and citrus, grapefruit, chemical, muddy/moldy, fruity, green/grassy, peely/peel oil, soapy, pungent, woody/spicy, and sweet aroma, with the lowest intensities perceived in the beverages with the low aroma level added. Aroma level had a significant effect on the bitter, astringent, and citrus flavor, and the bitter aftertaste perception, with the highest bitter and astringent flavor and bitter aftertaste being perceived in the beverages with a low aroma level and the highest citrus flavor being perceived in the beverages with a high aroma level.

Varying the naringin content (bitter level) of the beverages did not have any significant effect on any of the aroma attributes. It did, however, have a significant effect on the intensities of overall flavor and the astringent flavor, with the highest values observed for beverages with medium and high naringin concentrations. The naringin level had a significant effect on the intensities of sweet, sour, bitter, and grapefruit flavor, and the bitter aftertaste perception. The highest sweetness, but lowest sourness and grapefruit flavors were perceived in the beverages with low and medium naringin concentrations. Intensity of bitter flavor and bitter aftertaste followed the level of bitter compound addition.

Sweetness level contributed by sucrose had a significant effect on the perception of the many sensory properties of the grapefruit-like beverages. Significantly higher soapy aroma was perceived in the beverages with low and medium levels of sucrose, compared to a high sucrose addition. Sucrose level in the beverages had a significant effect on sour, sweet, bitter, astringent, and grapefruit flavor, and the bitter aftertaste intensities. Sour, bitter, astringent, and bitter aftertaste intensities decreased as the sweet level increased, while sweetness increased. A less intense grapefruit flavor was perceived in the high sweet level beverages, compared to the low and medium sweet levels.


**Table 3.** Summary of the main effects of color, aroma, bitter, and sweet levels on the sensory attributes of the grapefruit-like beverages.

Very few two-way interactions were significant. Detailed tables for the significant interaction effects are presented in the Supplementary Material (Tables S2–S4). The bitter level × aroma level interaction effect (Table S2) was significant for chemical aroma intensity, overall flavor intensity, bitter flavor, and bitter aftertaste. A trend was observed whereby the chemical aroma was more intensely perceived in the high-aroma beverages, although only significantly so in the low and high bitter samples, and not in the medium bitter samples. The overall flavor intensity was significantly, but only slightly, lower in the high aroma/medium bitter sample compared to the low aroma/medium bitter sample. Aroma level did not affect overall flavor perception at the low or high bitter levels. Bitter flavor and bitter aftertaste were notably less intense in the high-aroma samples compared to the low-aroma samples, but only significantly so for the medium and high bitter level beverages.

The bitter level × color type interaction effect (Table S2) was significant for the bitter aftertaste intensity; however, the bitter aftertaste was essentially driven more by the bitter level than by the color type. The bitter level × sweet level interaction effect (Table S3) was significant only for the pungent aroma: a significantly lower pungent aroma was noted for the medium sweet than for the low sweet beverages at the medium bitter level.

The aroma level × color interaction effect (Table S3) was significant only for bitter flavor intensity. At the low aroma level, no difference in bitter flavor intensity was found between the two colors; at the high aroma level, however, the yellow beverage was perceived as significantly more bitter.

The sweet level × aroma level interaction effect (Table S4) was not significant for any of the sensory aroma attributes. The sweet level × color interaction effect (Table S4) was significant for astringent and citrus flavor perception. While no significant differences were found between the red and yellow beverages at the medium sweet level, the red beverage was perceived as significantly more astringent at the low and high sweet levels. A similar effect was found for the citrus flavor, although the red beverages had a more intense citrus flavor only at the low sweet level.

The multivariate differentiation of the beverages is presented in Figure 1 as a PCA map over a two-dimensional space. The first and second principal components (F1 and F2) explained 37% and 35%, respectively, of the variance across the samples. F1 clearly separated the beverages based on the intensity of overall aroma, peely/peel oil aroma, citrus aroma, sweet aroma, and pungent aroma. Beverages that were more intense in terms of these attributes are located on the right of the plot; note that all of them have an H as third letter, therefore a high aroma level. The beverages with lower intensities are located on the left of the plot and notably have an L as third letter, therefore a low aroma level. F2 separated the beverages based on 'taste' perception, i.e., the naringin (bitter)–sucrose (sweet) levels. Beverages with high and medium bitter levels and a low sweet level are positioned at the top, and beverages with a low bitter level and medium and high sweet levels at the bottom. Beverages at the top, namely HLLY and HLLR with a high level of naringin and MLHY with a medium level, were characterized by more intense astringent, sour, and bitter tastes, as well as higher grapefruit and overall flavor intensities. Beverages with a low naringin level (e.g., LHHR, at the bottom) were characterized by a more intense sweet taste. The attributes citrus flavor, chemical aroma, and muddy/moldy aroma, in the middle of the plot, did not discriminate the beverages on the first two PCs.

**Figure 1.** Principal Component Analysis (PCA) of the sensory profiles of the 36 grapefruit-like beverages. The vectors indicate the loadings for sensory attributes while the position of the sample codes indicate the score values. The four-letter codes indicate levels of naringin (1st letter: L = Low, M = Medium, or H = High), sucrose (2nd letter: L = Low, M = Medium, or H = High), aroma (3rd letter: L = Low, or H = High) and color (4th letter: R = red or Y = yellow). Sensory attributes 1AT = Aftertaste, 2Fl = Flavor, 3Ar = Aroma. Beverages in green font were selected for the consumer tests.

#### *3.2. Consumer Evaluation of the Grapefruit-Like Beverages*

The effects of color, aroma level, and bitter/sweet levels of the grapefruit-like beverages on mean liking ratings for the color, aroma, and flavor, as evaluated by the consumers, are presented in Table 4. Two-way interaction effects were not significant.



**Table 4.** Mean consumer liking ratings for the color, aroma, and flavor of the grapefruit-like beverages.

<sup>1</sup> Values across the design variable levels. NS = not significant, \* *p* ≤ 0.05, \*\* *p* ≤ 0.01, \*\*\* *p* ≤ 0.001. <sup>2</sup> Red = 0.001% solution (30% Sunset yellow and 70% Ponceau red); Yellow = 0.0125% Quinoline yellow. <sup>3</sup> Aroma blend = caryophyllene, citral, nootkatone, aldehyde C8 (octanal), aldehyde C9 (nonanal), and aldehyde C10 (decanal).

The standardized PLS regression coefficients for the attributes included in the prediction models are presented in Table 5. PLS regression (PLSR) models were used to predict liking of the color, aroma, and flavor of the 36 beverages, including the beverages that were not evaluated by consumers (Table 6). Expected errors of prediction for the models were low, ranging from ±1.288 for the aroma model to ±2.458 for the color model and ±2.678 for the flavor model, with a 95% confidence interval, indicating reliable prediction of the liking variables.

**Table 5.** Standardized partial least squares (PLS) regression coefficients summarizing the relationship between the predictors (X, descriptive sensory attributes) and the responses (Y, consumer liking variables). Only selected important variables (main effects and squared effects, noted as '2') from the refined models are shown.


**Table 6.** Partial least square regression (PLSR) model predicted liking ratings for color, aroma, and flavor of the grapefruit-like beverages.



**Table 6.** *Cont.*

<sup>1</sup> Refer to Table 1 for number. <sup>2</sup> Code: 1st letter = bitter level (High, Medium, or Low); 2nd letter = sweet level (High, Medium, or Low); 3rd letter = aroma level (High or Low); 4th letter = color (Red or Yellow). Samples in bold italic were used for consumer evaluation. <sup>3</sup> Values are means (± standard deviation); Observed means in a column with different letters are significantly different (*p* < 0.05).

The color of the red grapefruit-like beverages was, on average, rated slightly higher in liking than that of the yellow ones (*p* < 0.05) (Table 4). Whether the beverage was colored yellow or red did not affect the liking of the aroma or the flavor. The predicted mean liking of the color for the highest and lowest liked of the 36 beverages, however, differed by a maximum of only 12.2 scale units (Table 6). Notably, no significant sensory attribute drivers were found for liking of the color of the grapefruit-like beverages (Table 5).

Liking of the aroma of the beverages with a low added-aroma level was higher (*p* < 0.05) than for those with a high added-aroma level (Table 4). Aroma level did not have an effect on the liking of the color of the beverage, nor did it affect the liking of the flavor. The predicted mean liking of the aroma for the highest and the lowest liked beverages differed by 16.5 scale units (Table 6). Positive attribute drivers for liking of the aroma of the grapefruit-like beverages were the squared term of fruity aroma (noted fruity aroma2), citrus flavor, and sweet flavor, while negative drivers were sweet aroma2, sweet flavor2, and pungent aroma (Table 5).

As expected, the levels of the gustatory flavorants, naringin and sucrose, did not affect the liking of the color of the beverages (Table 4). Surprisingly, the non-volatile taste level did have a significant effect (*p* < 0.05) on the liking of the aroma of the beverages: the aroma of the most bitter/least sweet beverages was liked significantly less than that of the other two taste combination levels. Not surprisingly, liking of the flavor of the beverages decreased significantly (*p* < 0.001) as the bitter level increased and the sweet level decreased. Predicted mean ratings for liking of the flavor of the highest and the lowest liked of the 36 beverages differed by 27.5 scale units. Positive drivers for liking of the flavor of the grapefruit-like beverages were sweet taste, the squared term for chemical aroma (noted as 'chemical aroma2'), and citrus flavor intensities, while the negative drivers were the intensities of soapy aroma, bitter aftertaste, and sour taste (Table 5).

#### **4. Discussion**

The research studied the effect of varying the bitterness, sweetness, color, and aroma intensity of grapefruit-like beverages on the cross-modal perception of sensory properties and its effects on consumer liking. A model grapefruit-like beverage standard formulation was created, and a sensory lexicon with a total of 21 attributes and definitions was generated to characterize the aroma, flavor, and aftertaste of the grapefruit-like model beverage with variations in color, aroma, and gustatory flavorant levels.

The color hue of the grapefruit-like beverage affected the perception and description of the aroma and flavor sensory properties, as evaluated by the trained panelists. The color of the beverages, and the rose-red hue in particular, had a significant enhancing effect on the perception of overall aroma intensity and of grapefruit, deteriorated/rotten, muddy/moldy, fruity, and sweet aroma intensities. This also corresponded to the consumer liking: the red beverages were liked more than the yellow ones. The cross-modal effect of beverage color on aroma and flavor, however, did not lead to significant differences in the liking of the aroma or the flavor of the red and yellow beverages. The difference in the methodology followed and in the cognitive tasks employed by the two panels might be the reason. When the consumers evaluated their liking of the color of the beverages, based solely on appearance, a slight but significant preference for the red-colored beverages was noted. This preference was driven solely by visual cues, since the consumers had not yet smelled or tasted the beverages. After smelling and tasting the beverages, it is likely that opinion and preference might have changed, based on the cross-modal color–aroma/flavor sensory interaction demonstrated by the trained panel results in this study. Considering that the consumers sequentially evaluated their liking of the color, then the aroma (orthonasally), and lastly the flavor (after consumption) of each sample, it cannot be excluded that some form of learning, anticipation, and association might have occurred over the evaluation of the sequence of twelve samples, of which 50% were red and 50% were yellow.

A study [20] reported that a red color decreased the perception of the bitter taste of a water solution. Coloring a clear bitter solution red decreased the perception of bitter taste, while the addition of yellow and green coloring had no such effect [9]. Other researchers [21] suggested that the color-induced olfactory enhancement observed when odorous solutions are smelled orthonasally might be the result of a conditioned olfactory percept caused by the color. Conditioned expectations predict that certain colors would be strongly associated with particular flavors, e.g., red with cherry, orange with orange, and green with lime [22]; yellow with lemon, blue with spearmint, and red with strawberry, raspberry, and cherry [23]. In South Africa, the location of the study, both yellow and red/pink grapefruit are marketed. The Star Ruby variety, with a red color, is the most planted (84%) grapefruit variety in South Africa, followed by the white variety Marsh (16%), whose juice is pale yellow [24]. In another study [25], altering the green and yellow colors of lemon- and lime-flavored sucrose solutions was found to have an impact on the perceived sweetness ratings. In a further study, color–odor solution pairings were rated as having more intense odors with color cues than without, regardless of the appropriateness of the color–odor pairing [21]. This cross-modal effect presumably results from the color cue setting up an expectation concerning the likely identity and intensity of a food or drink's taste or flavor [20]. No significant sensory attribute drivers for liking of the color of the grapefruit-like beverages were identified, since the trained sensory panel did not evaluate appearance attributes.

The aroma level added to the model beverage had a significant enhancing effect on the majority of the aroma and flavor sensory attributes. The enhancement of overall aroma and characteristic aroma qualities, including citrus flavor, as a function of the level of aroma added, was expected and confirmed. When the consumers evaluated their liking of the aroma of the beverages, based solely on orthonasal inspection, the beverages with low aroma were, surprisingly, slightly preferred over those with high aroma. It is possible that the higher aroma level was more distinctive and clearly reminiscent of grapefruit, and possibly evoked a stronger cue for those disliking grapefruit. An interesting and unexpected finding was the apparent suppression of bitter and astringent gustatory sensations due to a higher load of olfactory stimuli (high aroma level). Previous studies have found that aroma–taste interactions can result in complicated changes in the perceived flavor. The addition of an aroma can, e.g., elevate the bitter-detection threshold [26,27]. The perceived intensity of tastes in solutions was increased by volatile compounds, especially when there was a logical association between them, such as between sweetness and fruitiness [28]. Apple and strawberry aromas evoked both sweetness and sourness. One study found that tasteless aromas, namely green tea and coffee, predominantly evoked bitterness, while vanilla aroma predominantly evoked sweetness [29]. The grapefruit aroma consisted of a blend of caryophyllene, citral, nootkatone, and various aldehydes: octanal, nonanal, and decanal. No study could be found that specifically indicated that any of these compounds evoke bitterness. Nootkatone at above-threshold concentrations was, however, reported as tasting bitter [30]. Consumption of a beverage results in the simultaneous perception of aroma and taste, coupled with tactile sensations, all of which contribute to an overall impression of flavor. Compounds that stimulate taste perception (e.g., naringin contributing a bitter taste) can increase the apparent intensity of aromas; in this study, the grapefruit flavor was enhanced by the naringin addition. The aroma compound mixture (containing a citral component) of the grapefruit-like beverages had an enhancing effect on the citrus aroma intensity. An additive effect of sweet components with citral or limonene volatiles having a 'citrus'-like aroma was reported by [31], but was not observed in this study. The suppression of bitterness in the high-aroma beverages, however, did not affect the liking of the flavor, since no difference was found in the liking of the flavor of beverages with low or high aroma levels. Positive drivers for liking of the aroma of the grapefruit-like beverages were fruity aroma2, citrus flavor, and sweet flavor, while negative drivers were sweet aroma2, sweet flavor2, and pungent aroma.

The low bitter/high sweet beverages were preferred over the high bitter/low sweet samples. A study [32] reported that with an increase in the °Brix/acidity ratio of reconstituted grapefruit juice, consumer perception of sweetness increased while bitterness and aroma intensity decreased. Some bitterness in processed grapefruit products is acceptable to consumers, but excessive bitterness is one of the major consumer objections to such products [28,31]; this was confirmed in this study. The variation in sensitivity of individual consumers to bitter compounds in grapefruit beverages could be explored further to identify whether subgroups have different preferences. As expected, varying the naringin concentration, and thus the bitterness, of the grapefruit-like beverages did not have a significant effect on any of the aroma attributes. Similarly, [32] reported that consumers did not find any difference in aroma with increased levels of naringin in processed grapefruit juice. However, the bitterness level of the grapefruit-like beverages had a significant effect on the flavor attributes (astringent, sweet, sour, bitter, and grapefruit flavors, and the bitter aftertaste). The same study [32] also reported that an increase of limonin (also a bitter compound) in processed grapefruit juice increased the perceived bitterness and tartness, while decreasing the sweetness.

In a previous study, an increase in °Brix through added sucrose enhanced the perceived sweetness and decreased the sour, bitter, astringent, and grapefruit flavors, as well as the bitter aftertaste. When sucrose was added to fruit juices, not only were the perceived levels of bitterness and sourness reduced (as was also found in this research), but the sweet aroma intensity rating also changed [16] (although this was not found here). Sucrose was also reported to mask the bitter taste of sinigrin, goitrin, and quinine [33]. In the complex beverage model, increasing sucrose did not have the often-reported enhancing effect on perceived fruity aroma. Increasing the sugar concentration of blueberry and cranberry fruit juices increased their fruitiness (evaluated by sipping), even though no difference in aroma was perceived by sniffing alone [16]. Sucrose in the mouth significantly enhanced the "citrus" ratings, compared to when citral was inhaled alone [12]. Similarly, increases in the intensity of different 'fruity' aromas were perceived in a multichannel flavor delivery system [34], model dairy desserts [35], and custard desserts [36] when sweetness was increased with sucrose. The sweet level also affected the soapy aroma of the grapefruit-like beverages. The reason for this effect is unclear; it is possible that the aroma blend contributed a slight soapy aroma.

The effects of aroma level and color on the perceived sensory attributes, as observed in this study, are evidence of cross-modal sensory interactions. It was anticipated that the intensity and character of the aroma of a grapefruit juice would increase the perception of citrus flavor, a positive driver of grapefruit flavor liking, and reduce the negative attributes, i.e., the bitter and astringent flavors and the bitter aftertaste. Positive drivers for liking of the flavor of the grapefruit-like beverages were the sweet taste, chemical aroma, and citrus flavor intensities, while negative drivers were the intensities of soapy aroma, bitter aftertaste, and sour taste.

#### **5. Conclusions**

This study indicated that aroma, bitterness, and sweetness levels, as well as product color (hue), influence the perception of grapefruit-like beverages and their hedonic value. A grapefruit-like beverage model was created, and a lexicon was developed to describe the sensory properties arising from the cross-modal interaction of the stimulus components of the model beverage. From the descriptive sensory profiles, prediction models for liking of the color, aroma, and flavor of grapefruit-like beverages were developed. In a next phase, the models should be applied to a wide range of grapefruit juice samples to determine their validity and reliability in real juices. The models can then be optimized for application in grapefruit quality-control and product-development programs.
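The modeling procedure behind these liking predictions is not detailed in this excerpt. As a hedged illustration only, the general idea of regressing consumer liking on descriptive attribute intensities can be sketched with ordinary least squares; every attribute name, weight, and value below is synthetic, chosen only to mirror the direction of the drivers reported above:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical descriptive-panel attribute means for 12 beverages:
# columns = sweet, citrus, sour, bitter flavor intensities (0-10 scale)
X = rng.uniform(0, 10, size=(12, 4))
# Simulated consumer liking: sweet and citrus act as positive drivers,
# sour and bitter as negative drivers (directions taken from the text)
true_w = np.array([0.5, 0.4, -0.3, -0.6])
liking = 5.0 + X @ true_w + rng.normal(0, 0.2, size=12)

# Ordinary least-squares fit of liking on the attribute intensities
A = np.column_stack([np.ones(12), X])  # prepend an intercept column
coef, *_ = np.linalg.lstsq(A, liking, rcond=None)
print("driver coefficients:", np.round(coef[1:], 2))
```

The signs of the fitted coefficients identify positive and negative drivers; a real analysis would typically use partial least squares or external preference mapping rather than plain OLS.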

**Supplementary Materials:** The following are available online at www.mdpi.com/xxx/s1. Table S1: Physico-chemical characterization (means ± standard deviation) of the 36 grapefruit-like beverages. Table S2: Summary of sensory attribute mean values ¹ [± standard error of means (SEM)] and significance of bitter × aroma and bitter × color two-way ANOVA interactions of the model grapefruit-like beverages as evaluated by a trained sensory panel (*n* = 16). Table S3: Summary of sensory attribute mean values ¹ [± standard error of means (SEM)] and significance of bitter × sweet and aroma × color two-way ANOVA interactions of the model grapefruit-like beverages as evaluated by a trained sensory panel (*n* = 16). Table S4: Summary of sensory attribute mean values ¹ [± standard error of means (SEM)] and significance of sweet × aroma and sweet × color two-way ANOVA interactions of the model grapefruit-like beverages as evaluated by a trained sensory panel (*n* = 16).

**Author Contributions:** Conceptualization, A.G.S.G.; Formal analysis, A.G.S.G. and V.L.A.; Funding acquisition, A.G.S.G. and H.L.d.K.; Investigation, A.G.S.G.; Methodology, A.G.S.G.; Resources, A.G.S.G.; Supervision, V.C. and H.L.d.K.; Writing—original draft, A.G.S.G., V.L.A., V.C., and H.L.d.K.; Writing—review & editing, H.L.d.K.

**Funding:** This work is based on research supported, in part, by the National Research Foundation of South Africa, Grant Number: 76905. V.L.A. is thankful for funding from the Norwegian Agriculture and Food Industry Research Funds, FoodSMaCK strategic program.

**Acknowledgments:** The technical and research support of Karien Kotze, Melanie Richards-Dennil, Marise Kinnear, and Leandri de Kock is acknowledged.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results. Any opinion, finding, and conclusion or recommendation expressed in this material is that of the author(s) and the NRF does not accept any liability in this regard.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **The Influence of Water Composition on Flavor and Nutrient Extraction in Green and Black Tea**

#### **Melanie Franks, Peter Lawrence, Alireza Abbaspourrad and Robin Dando \***

Department of Food Science, Cornell University, Ithaca, NY 14850, USA; mf755@cornell.edu (M.F.); petegcms@gmail.com (P.L.); alireza@cornell.edu (A.A.)

**\*** Correspondence: robin.dando@cornell.edu

Received: 19 November 2018; Accepted: 18 December 2018; Published: 3 January 2019

**Abstract:** Tea is made from the processed leaves of the *Camellia sinensis* plant, a tropical and subtropical evergreen native to Asia. Behind water, tea is the most consumed beverage in the world. Factors that affect tea brewing include brewing temperature, vessel, and time, the water-to-leaf ratio, and, in some reports, the composition of the water used. In this project, we tested whether the water used to brew tea was sufficient to influence perceived flavor for the everyday tea drinker. Black and green tea were brewed with bottled, tap, and deionized water, with brewing temperature, vessel, time, and the water-to-leaf ratio matched. The samples were analyzed with a human consumer sensory panel, as well as instrumentally for color, turbidity, and epigallocatechin gallate (EGCG) content. Results showed that the type of water used to brew tea drastically affected the sensory properties of green tea (and mildly those of black tea), likely driven by a much greater degree of extraction of bitter catechins in teas brewed with the more purified bottled or deionized water. For the everyday tea drinker who drinks green tea for health, the capability to double the EGCG content of tea simply by brewing with bottled or deionized water represents a clear advantage. Conversely, those drinking tea for flavor may benefit from instead brewing tea with tap water.

**Keywords:** taste; sensory evaluation; tea; EGCG; hedonics

#### **1. Introduction**

#### *1.1. Tea and Tea Processing*

Tea is a beverage steeped in culture and history. Valued for its taste and caffeine content as well as its numerous health properties [1], tea has been consumed for centuries [2]. Behind water, tea is the most consumed beverage in the world [3]. The botanical name for the plant producing tea is *Camellia sinensis* (L.) Kuntze. There are many other plants used for infusions, such as rooibos and chamomile; however, these are not strictly teas. Instead, they are classified under the category of tisanes or herbal infusions. The main difference between various styles of tea is the level of oxidation of the leaf during processing. Green and white teas are unoxidized, oolongs vary in their level of oxidation, and black tea leaves are fully oxidized. A cup of tea is made from processed fresh tea leaves. Biochemical changes that occur during processing help reduce the bitter taste of fresh tea leaves. Processing the tea leaves lowers water content to aid in shelf stability, deactivates enzymes, and adds sweetness and a myriad of colors to the cup. Physically, the leaf transforms from sturdy and crisp to limp and pliable during withering. Chemically, caffeine content increases, hydrolysis of hydrophobic carbohydrates begins, non-gallated catechins and aroma compounds form, and the levels of chlorophyll and various enzymes increase [4]. For black teas, after withering, the leaves are purposefully crushed to speed oxidation. This step is what gives black tea its defining quality, whereby enzymatic oxidation converts catechins into theaflavins and thearubigins. These polyphenols give black tea its reddish-brown coloration [5].

#### *1.2. Tea Flavanols*

The main polyphenols found in tea are flavonoids. Flavonoids are a group of bioactive compounds synthesized during plant metabolism. Flavonoids are found in fruits and vegetables, prominently in spinach, apples, and blueberries, as well as in beverages like tea and wine. Previous health-related research on tea has largely focused on the flavonoid group. Flavonoids contain two six-carbon rings linked by a three-carbon unit, also known as a chalcone structure [6]. Catechins (also referred to as flavanols) are bioactive compounds that form a subclass of flavonoids and, in tea, are the main secondary metabolites. The main catechins in tea are catechin, epicatechin, epicatechin gallate, epigallocatechin, epigallocatechin-3-gallate, and gallocatechin. Catechin content in tea differs by tea type or style. Catechins in green tea are relatively stable, since they do not undergo oxidation during processing, and give green tea its characteristic bitterness and astringency. In black tea, the catechins are largely oxidized to theaflavins and thearubigins [6], which reduces catechin content by around 85% compared to green tea [7], leaving the tea darker and less bitter.

#### *1.3. Tea and Water*

After tea leaves are harvested and processed, the final product is ready to consume. However, unlike many other beverages, the final processing step is left to the consumer. A high-quality tea that has gone through many labor-intensive steps can be ruined in an instant by improper brewing. Factors that alter the taste of the brewed cup are brewing temperature, time, vessel, the water-to-leaf ratio, and the water composition [8,9]. This study focuses on the water used to brew tea, specifically how water quality influences the sensory and chemical qualities of black and green tea. Taste is a key factor in consumer acceptance of water [10]; however, water is often not a top priority when making tea, despite its critical role as the vehicle for the infusion. References to the importance of water in brewing tea can be found as early as 758 AD, in *The Classic of Tea* by Lu Yu [11]. Lu Yu was an orphan during the Tang Dynasty, raised by an abbot in the Dragon Cloud Monastery. He authored an efficient 7000-character book detailing how to harvest, process, and brew tea, including what types of water are suitable for tea, as well as the proper tools and utensils. Lu Yu felt that tea made from mountain streams was ideal, river water was sufficient, and well water was inferior [3]. In a more recent book by Kuroda & Hara [12], tap water is recommended as the most suitable water for making tea, although the specific recommendations are that the water should be free of odors and low in magnesium and calcium.

Previous work suggests that tap water can influence the amount of tea flavanols extracted from green tea, compared to brewing with purified water [13]. Tap water has a variable mineral balance, inconsistent between regions and over time. "Hard" water is high in minerals such as calcium and magnesium. Tea infusions are particularly affected by calcium, with previous studies showing that the levels of theaflavins and caffeine extracted decrease at high calcium levels [14]. Magnesium and calcium can also promote two undesirable outcomes of tea brewing: tea cream and scum formation. Tea cream is the precipitate that forms as the tea cools, caused by the reaction between caffeine and tea flavanols, while tea scum is a film that forms on the infusion surface, composed of calcium, hydrogen carbonates, and other organic material. This film occurs due to calcium carbonate triggering oxidation of organic compounds [9]. It has also been demonstrated that catechin extraction can be increased in white tea by brewing with purified water [15].

#### *1.4. Tea Flavor*

Between 25% and 35% of the fresh tea leaf is composed of phenolic compounds, with 80% of these being flavanols [16]. Both phenolic compounds and alkaloids such as caffeine contribute to the bitter taste in tea, though the catechins are thought to be the main contributors to bitterness [17]. Glucose, fructose, sucrose, and arabinose in tea account for its sweet taste. Free amino acids make up about 1% to 3% of the dry leaf and, in green tea, may yield an umami characteristic [16]. Astringency, albeit not a taste, is a common oral sensation in tea, thought to arise from its catechin content [18]. Despite tea being consumed for several thousand years, there are few consumer sensory studies of tea flavor, with researchers more often favoring evaluation by trained or expert panels. The goal of this project was to test whether the water source used to brew tea (tap, bottled, or deionized) influenced flavor or liking for the everyday tea drinker, using both black and green tea. Tea samples were analyzed with a human consumer sensory panel as well as with a number of instrumental methods.

#### **2. Materials and Methods**

#### *2.1. Mineral Analysis of Water Samples*

Ithaca city tap water, Poland Spring™ bottled water (Nestle Waters, Paris, France), and the deionized water used for the study were tested by the Community Science Institute, Inc. (Ithaca, NY, USA), assaying calcium, iron, magnesium, sodium, and copper content. Methods followed those recommended by the Environmental Protection Agency (EPA). Briefly, iron, magnesium, and sodium were measured spectrochemically (EPA protocol 200.2, Rv. 2.8) and with inductively coupled plasma-atomic emission spectrometry (EPA 200.7, Rv 4.4), while copper was measured using inductively coupled plasma mass spectrometry (EPA 200.8, Rv 5.4). Calcium and residual chlorine were measured colorimetrically, using an EDTA titration for calcium (SM 3500-Ca B) and a LaMotte test kit for chlorine (LaMotte DPD-1R, LaMotte Co., Maryland, USA).

#### *2.2. Preparation of Tea Infusions*

Two high-quality loose-leaf teas, a Zhejiang green and a Mao Feng black tea, were purchased from In Pursuit of Tea (New York, NY, USA). Both teas are from the Zhejiang Province in China, a highly regarded tea region, and both were produced on the same farm. Green tea was brewed in tap (GT), bottled (GB), and deionized (GD) water, with black tea similarly denoted as tap (BT), bottled (BB), and deionized (BD). For the green tea samples, 2.5 g of tea was weighed into pre-warmed Gaiwan tea brewing vessels (Figure S1), and 125 mL of water at 80 °C was added to the vessel. The green tea infusion was brewed for three minutes and then strained through a fine mesh strainer. Black tea samples were brewed at 100 °C for 5 min (more typical for black tea preparation) and strained. Samples were then either cooled to room temperature for instrumental analysis or served fresh in pre-heated cups for sensory analysis (see Section 2.6 below).

#### *2.3. Colorimetry*

Analysis of tea color was performed with a Hunter Lab UltraScan VIS colorimeter (Reston, VA, USA). L (light vs dark), a (red vs green), and b (yellow vs blue) values were recorded for each sample with each of the samples measured in triplicate.

#### *2.4. Turbidity*

The turbidity of each sample was measured in triplicate with use of a HACH 2100P portable turbidity meter (Loveland, CO, USA), with measurements recorded in Nephelometric Turbidity Units (NTU). The samples were held at a 90° angle to the incident beam using single detection. The turbidity standards used were 0.1 NTU, 20 NTU, and 100 NTU.

#### *2.5. Analysis of EGCG*

Epigallocatechin gallate (EGCG) in the tea infusions was measured using high-performance liquid chromatography (HPLC), following the methods of Wang and Helliwell [13]. Samples were run using an Agilent 1100 HPLC system (Santa Clara, CA, USA) with a DAD detector. Separations were carried out using a Waters Cortecs (Milford, MA, USA) C18 (4.6 mm × 100 mm) column with an isocratic solvent system consisting of 90% 0.01% phosphoric acid in Millipore water (*v*/*v*) and 10% methanol at a flow rate of 0.6 mL/min. The column was held at a constant temperature of 30 °C. The DAD detector was set to 210 nm. The sample injection volume was 10 μL and the total run time was 20 min. All samples were filtered just before being loaded onto the HPLC using a 0.22 μm polyvinylidene fluoride (PVDF) filter from Celltreat (Pepperell, MA, USA). Quantification was performed using an external standard curve built from purified EGCG purchased from Sigma Aldrich (St Louis, MO, USA). Identification of EGCG in tea samples was based on the retention time of the pure standard (10.26 min).
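Quantification against an external standard curve means fitting a line of peak area against known standard concentrations and inverting it for unknowns. A minimal sketch of that arithmetic follows; the concentration and peak-area numbers are invented for illustration, not the study's calibration data:

```python
import numpy as np

# Hypothetical EGCG standard curve: known concentrations (ppm)
# of the purified standard vs measured peak areas (illustrative units)
std_conc = np.array([50.0, 100.0, 200.0, 400.0, 800.0])
std_area = np.array([12.1, 24.6, 48.9, 98.2, 196.5])

# Least-squares fit of area = slope * concentration + intercept
slope, intercept = np.polyfit(std_conc, std_area, 1)

def quantify(peak_area):
    """Invert the calibration line to estimate concentration in ppm."""
    return (peak_area - intercept) / slope

sample_area = 73.5  # peak area of an unknown sample
print(f"EGCG ~ {quantify(sample_area):.0f} ppm")
```

In practice the curve is checked for linearity (R²) over the working range before unknowns are interpolated, and samples falling outside the range are diluted and re-run.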

#### *2.6. Sensory Evaluation*

All human study procedures were approved by the Cornell University Institutional Review Board for Human Participants, with all methods performed in accordance with relevant guidelines and regulations. A total of 103 panelists were recruited from the local community, pre-screened for their tea drinking behavior, and all gave informed consent. All participants in the study drank tea three to five times a week or more, and drank both green and black tea. Panelists either habitually consumed tea with no milk or sugar added or stated no dislike of tea served in this manner. Participants knew that the study involved tea but were unaware of the true objective of the research. The session took approximately 45 min, with panelists compensated for their time. The panelists answered questions about samples in individual booths, using Red Jade sensory evaluation software (Curion, Deerfield, IL, USA). The samples were delivered monadically in a counterbalanced full-block design, with panelists receiving either the 3 green tea samples or the 3 black tea samples first. Each tea sample was evaluated for overall liking, appearance liking, and flavor liking on 9-point scales, and then rated on the generalized Labeled Magnitude Scale (gLMS) for sweetness, bitterness, sourness, astringency, vegetal quality (green tea only), and earthiness (black tea only). All panelists were briefly trained on how to use the gLMS before beginning the tasting [19]. The color of the tea was also evaluated by panelists with a color-matching sheet (Figure S2), from which they chose the closest match for each tea sample. Teas were freshly brewed every 30 min. A total of 10 g of tea was brewed with 500 mL of water, at 80 °C for green tea and 100 °C for black tea. All infusions were kept warm in pre-heated, insulated carafes until the panelist was ready for the sample. Samples were served in pre-heated (80 °C) white ceramic Gung Fu cha teacups (see Figure 1 below) labeled with random 3-digit codes. After each sample, panelists were instructed to cleanse their palate with water and unsalted crackers to avoid fatigue and to dispel any lingering bitterness or astringency. At the end of the questionnaire, panelists were asked a series of demographic questions and for information on their tea drinking habits.
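The gLMS is a quasi-logarithmic 0–100 scale anchored by verbal descriptors; the anchor positions used here are those listed in the Figure 3 caption. A small sketch of how a numeric rating maps back to its nearest verbal anchor (a hypothetical helper, not part of the study's software):

```python
# gLMS verbal anchors and their positions on the 0-100 scale,
# as listed in the Figure 3 caption of this article
GLMS_ANCHORS = {
    "no sensation": 0.0,
    "barely detectable": 1.4,
    "weak": 6.0,
    "moderate": 17.0,
    "strong": 34.7,
    "very strong": 52.5,
    "strongest imaginable sensation of any kind": 100.0,
}

def nearest_anchor(rating: float) -> str:
    """Return the verbal anchor closest to a numeric gLMS rating."""
    return min(GLMS_ANCHORS, key=lambda k: abs(GLMS_ANCHORS[k] - rating))

print(nearest_anchor(20.0))  # a rating of 20 sits closest to "moderate"
```

The uneven spacing of the anchors is why panelists need brief training: equal visual distances on the scale do not correspond to equal perceptual steps.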

#### *2.7. Statistical Analysis*

Data were analyzed with repeated-measures analyses of variance (ANOVA) and post-hoc Tukey's tests using GraphPad Prism 5.0 (GraphPad Software, La Jolla, CA, USA). Separate ANOVAs were used for the green and black tea samples, since large differences in their taste and chemical properties have been shown previously. Statistical significance was inferred at *p* < 0.05. Multivariate analysis was performed using XLSTAT (Addinsoft, Paris, France), whereby two separate Principal Components Analyses were run on the sensory and instrumental data, and the two datasets were then combined in a Multiple Factor Analysis.
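The authors ran their repeated-measures ANOVAs in GraphPad Prism; as a hedged sketch of the same idea (not the authors' code), a one-way repeated-measures ANOVA can be computed directly from the sums of squares. The panel size of 103 matches the text, but all ratings below are simulated:

```python
import numpy as np
from scipy import stats

def rm_anova_oneway(data):
    """One-way repeated-measures ANOVA.
    data: array of shape (n_subjects, k_conditions) of ratings."""
    n, k = data.shape
    grand = data.mean()
    # Partition total SS into condition, subject, and error components
    ss_cond = n * ((data.mean(axis=0) - grand) ** 2).sum()
    ss_subj = k * ((data.mean(axis=1) - grand) ** 2).sum()
    ss_total = ((data - grand) ** 2).sum()
    ss_err = ss_total - ss_cond - ss_subj
    df_cond, df_err = k - 1, (k - 1) * (n - 1)
    F = (ss_cond / df_cond) / (ss_err / df_err)
    p = stats.f.sf(F, df_cond, df_err)
    return F, p

rng = np.random.default_rng(0)
n_panelists = 103
# Hypothetical mean liking offsets for tap, bottled, deionized green tea
offsets = np.array([0.7, 0.0, 0.2])
baseline = rng.normal(6.0, 1.0, size=(n_panelists, 1))  # per-subject level
ratings = baseline + offsets + rng.normal(0, 0.8, size=(n_panelists, 3))

F, p = rm_anova_oneway(ratings)
print(f"F = {F:.2f}, p = {p:.3g}")
```

With a significant omnibus F, pairwise comparisons (Tukey's HSD, as in the paper) would then identify which water types differ.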

**Figure 1.** (**A**) Image of black and green tea samples brewed in tap, bottled, or deionized water. For both green and black tea, infusions appear darker and cloudier when brewed from tap water compared to the teas brewed in DI or bottled water. (**B**) Turbidity measurements (NTU) for each tea infusion, showing the average of three replicates with SEM. (**C–E**) Colorimeter readings from tea infusions; L, a, and b values displayed with individual readings as dots, lines denoting the average, and SEM. Samples are denoted as green tea brewed in tap (GT), bottled (GB), and deionized (GD) water, and black tea brewed in tap (BT), bottled (BB), and deionized (BD) water. Green tea samples are represented in green, black tea in dark red.

#### **3. Results and Discussion**

#### *3.1. Water Analysis*

Deionized, tap, and bottled water samples were tested for calcium, magnesium, copper, iron, residual chlorine, and sodium (Table 1). The amount of calcium, magnesium, and sodium in tap water was far greater than that in bottled or deionized water.


**Table 1.** Mineral analysis of the different water types in mg/L.

#### *3.2. Turbidity and Color*

Figure 1A shows the appearance of tea samples brewed with the three different water types. Teas brewed in tap water appear cloudier and darker in color than teas brewed in bottled or deionized (DI) water, for both green and black teas. Turbidity measurements in green tea (Figure 1B, *p* < 0.001) showed that GT was more turbid than both GB (95% CI = 133.3 to 156.7) and GD (95% CI = 135.7 to 159.1), with no difference between GB and GD. In black tea, the turbidity of BT was also higher (*p* < 0.001) than both BB (95% CI = 57.66 to 103.9) and BD (95% CI = 58.81 to 105.1), with no difference between BB and BD. High concentrations of calcium or magnesium in water can cause cloudiness and tea scum in tea infusions, as well as possibly influencing the tea's sensory properties [9,18]. Since both calcium and magnesium were higher in the tap water used in this project, this was likely the cause of the observed increase in turbidity.

Both the green (*p* = 0.016) and black (*p* = 0.023) tea infusions significantly differed in lightness. Green tea brewed in tap water exhibited lower L values compared to the same tea brewed in bottled (95% CI = −9.992 to −1.288) or DI (95% CI = −8.952 to −0.2476) water, with BT similarly lower than BB (95% CI = −15.14 to −0.7051) or BD (95% CI = −15.13 to −0.6918). The a values for green (*p* < 0.001) but not black (*p* = 0.425) tea differed significantly between samples, with all pairs of green teas differing (95% CI for GT vs GB = 2.042 to 2.458; GT vs GD = 1.269 to 1.685; GB vs GD = −0.9814 to −0.5652). The b values for both green (*p* < 0.001) and black (*p* = 0.001) teas varied significantly between treatments, with the tap-water samples differing from the others: GT was higher than GB (95% CI = 5.661 to 12.80) and GD (95% CI = 8.401 to 15.540), with BT higher than BB (95% CI = −14.94 to −4.711) or BD (95% CI = −15.33 to −5.105).

#### *3.3. EGCG Content*

The amount of EGCG in black tea is customarily lower than that found in green tea, since the majority of the catechins in black tea are converted to theaflavins and thearubigins [5]. The small amount of EGCG in the black tea infusions did not vary with water type (*p* = 0.250, Figure 2C,D). Conversely, with green tea (natively much higher in EGCG), there was a significant difference between infusions (*p* < 0.001), with green tea brewed in bottled water (95% CI = −6350 to −3984) and in deionized water (95% CI = −5890 to −3524) having around double the amount of EGCG of green tea brewed in tap water (Figure 2A,B), despite being brewed from the same leaves, at the same strength, time, and temperature, in identical vessels. Green teas brewed from bottled or deionized water achieved around the same level of EGCG extraction (95% CI = −723.0 to 1643). Such dramatically inferior EGCG extraction in tap water is important to green tea consumers, many of whom consume green tea for its perceived health benefits [20]. EGCG is the most abundant catechin in green tea [21], as well as one of the most bitter tasting [22]. That green tea acceptance has been linked to bitter taste genes [23], and that bitterness in tea is largely a product of EGCG content [24], implies that the greater extraction of bitter catechins in bottled or deionized water may lead to healthier yet less palatable tea infusions.

#### *3.4. Sensory Testing of Tea Samples*

There was no significant difference in the panelists' overall (*p* = 0.646) or flavor (*p* = 0.553) liking of the black tea samples (Figure 3A,C). Panelists did find significant differences in appearance liking between the samples (Figure 3B, *p* = 0.0345), likely a reflection of the color differences between the black tea infusions evident in Figure 1A. However, this trend was not strong enough to produce differences between sample pairs in post-hoc Tukey's tests. Panelists also evaluated various flavor attributes of the black tea infusions. No differences were evident with water type for astringency, bitterness, sourness, or sweetness (Figure 3D–F,H, all *p* > 0.05). However, panelists did find a difference in earthy flavor (Figure 3G, *p* = 0.025), specifically between the tea brewed in bottled water and that brewed in tap water (95% CI = −7.339 to −0.5252). While the panel perceived black tea brewed in tap water to be earthier, this had little effect on liking, which suggests that water may not be a critical factor in determining liking of black tea.

**Figure 2.** (**A**) Chromatogram illustrative of the HPLC spectrum from green tea. EGCG peak at arrow. Y axis in milli-absorbance units. (**B**) Total EGCG content for green tea in ppm, brewed in tap (GT), bottled (GB), and deionized (GD) water. Display shows the mean of three readings plus SEM. (**C**) Chromatogram illustrative of the HPLC spectrum from black tea. EGCG peak at arrow. (**D**) Total EGCG content for black tea in ppm. Samples denoted as black tea brewed in tap (BT), bottled (BB), and deionized (BD) water.

**Figure 3.** Consumer perception of black tea brewed in tap (BT), bottled (BB), and deionized (BD) water. (**A**) Overall liking of samples, from dislike extremely (1) to like extremely (9). (**B**) Appearance liking of samples, from dislike extremely (1) to like extremely (9). (**C**) Flavor liking of samples, from dislike extremely (1) to like extremely (9). (**D**) Perceived sweetness of samples, rated on gLMS, scale descriptors no sensation (0.0), barely detectable (1.4), weak (6.0), moderate (17.0), strong (34.7), very strong (52.5), and strongest imaginable sensation of any kind (100.0). (**E**) Bitterness, scale as in D. (**F**) Sourness, scale as in D. (**G**) Earthy flavor, scale as in D. (**H**) Astringency, scale as in D. Bars display mean rating of panel (*n* = 103) plus SEM. \* indicates *p* < 0.05.

For green tea samples, the effects of water were clearer. Panelists rated their overall liking (Figure 4A, *p* < 0.001) of green tea samples as differing across water treatments, with the tap clearly higher than bottled water (95% CI = −1.138 to −0.2993), with tap vs. deionized water approaching significance (95% CI = −0.04054 to 0.7978). Interestingly, this reduction in liking seemed to be driven by the panel's liking of the sample's flavor (Figure 4C, *p* = 0.001), and not its appearance (Figure 4B, *p* = 0.099). In investigating changes to the green tea's flavor properties, panelist found no significant difference in astringency, sourness, or vegetal flavor (Figure 4F–H, all *p* > 0.05). However, the panel judged the green tea samples brewed with tap water to be far less bitter (Figure 4E, *p* < 0.001) than both the sample brewed with bottled (95% CI = 0.6244 to 6.502) or with deionized water (95% CI = −9.162 to −3.285). Since only around half the amount of EGCG was extracted in green tea brewed from tap water compared to the other samples, and EGCG is experienced as highly bitter, this would result in less bitter tea infusion when brewing with tap water. Since bitterness is closely linked to liking tea regardless of ethnicity or tea drinking habits [25,26], this likely drove the increase in liking of green tea brewed in tap water. The GT sample was also experienced as sweeter by the panel (Figure 4D, *p* = 0.012), which was likely due to mixture suppression [27,28] of sweetness in samples with more bitter catechins. EGCG has been noted to extract more efficiently from green tea with purer water [29] and with higher conductivity (thus higher impurity) water producing poorer catechin extraction [30]. Rossetti and colleagues [31] measured the detection threshold of EGCG (perceived to be bitter and astringent) to be 183 mg/L (at 37 ◦C). 
Despite the fact that bitterness may be somewhat depressed by temperature [32], the bitterness of green tea in our study would be clear in the samples' flavor profile. Thus, doubling the EGCG content of tea in bottled or deionized water (compared to tap) was likely the driving factor behind reduced liking of these samples in consumer testing. Since black tea has fewer catechins than green tea due to the oxidation process in manufacturing, the type of water used seems less important to the everyday tea drinker.

As well as instrumental measurement of color changes in the tea samples and assessment of appearance liking, we were also interested in whether the variation in color between samples was visible to the human eye. Panelists used a color-matching chart for both black and green tea samples (see Figure S2), divided into eight color segments for green teas and eight more for black teas. The panelists could clearly discern differences between samples of both black (*p* < 0.001, chi-square 39.91) and green tea (*p* < 0.001, chi-square 43.87), although this influenced neither their overall liking of the samples nor their liking of the samples' appearance, which suggests that flavor is more critical than appearance in determining liking of tea infusions. It is clear, however, that consumer perception of beverages can be altered by their color and appearance [33], and thus some of the effects observed may have been due to the cross-modal influence of the differently colored tea samples.

Some work exists concerning the influence of various brewing conditions on the sensory properties of tea. Liu et al. [34] found that the optimal conditions for acceptance, at least in a small expert panel, were brewing for 5.7 min at 82 °C, with tea of around 1100 μm in particle size, at a 70 mL/g ratio of water to tea. In instant green tea preparations, increasing the calcium concentration of the brewing solution was found to weaken the bitter taste purportedly contributed by EGCG [18], which is in good agreement with our observations. However, the influence on the sweetness of infusions (attributed in part to theanine) was not seen in our work, possibly due to the around 4 mg/100 mg sucrose found in that group's instant tea preparations. A study of hot- and cold-brewed tea infusions of varying strength by Lin et al. [2] proposed a link between higher EGCG and EGC (epigallocatechin) levels and lower sensory appeal, attributed to the higher bitterness and astringency of these samples. In a small group of trained panelists, sensory differences were reported in green tea brewed with various water types [30]: mineral water produced tea with lower EGCG levels than tap water, purified water, or mountain spring water, and perceived bitterness tracked EGCG levels. However, unlike in our results, the samples in that report were liked more when bitterness (and EGCG) was higher. A similar result was reported by Zhang et al. [15], whereby EGCG levels in green tea extractions varied with water quality. Sensory reports of taste quality were higher for the high-EGCG samples, though no report was made of panel size or makeup. Such differences are likely attributable to the difference in palate between a small group of experts in China and a large panel of tea consumers in the US. Alternatively, those regularly consuming diets high in salty [35], sweet [36], or umami [37] stimuli have shown some reduced ability to perceive these stimuli, possibly due to receptor regulation in taste [38]. Thus, it is possible that regular consumers of very bitter tea experience its taste in a fundamentally different manner.

**Figure 4.** Consumer perception of green tea brewed in tap (GT), bottled (GB), and deionized (GD) water. (**A**) Overall liking of samples, from dislike extremely (1) to like extremely (9). (**B**) Appearance liking of samples, from dislike extremely (1) to like extremely (9). (**C**) Flavor liking of samples, from dislike extremely (1) to like extremely (9). (**D**) Perceived sweetness of samples, rated on gLMS, scale descriptors no sensation (0.0), barely detectable (1.4), weak (6.0), moderate (17.0), strong (34.7), very strong (52.5), and strongest imaginable sensation of any kind (100.0). (**E**) Bitterness, scale as in D. (**F**) Sourness, scale as in D. (**G**) Earthy flavor, scale as in D. (**H**) Astringency, scale as in D. Bars display mean rating of panel (*n* = 103) plus SEM. \* indicates *p* < 0.05. \*\* indicates *p* < 0.01. \*\*\* indicates *p* < 0.001.

Following sensory testing, panelists participated in a survey of their attitudes toward tea. When asked their primary motivation for drinking black tea, only 7% of panelists cited its healthful properties, while 84% favored taste or flavor, with a small number of respondents citing other reasons. However, when asked their primary motivation for drinking green tea, 26% cited its health benefits and 67% taste or flavor, again with a small number citing other reasons. This suggests that the ability to almost double the EGCG content of green tea would be of great interest to many green tea consumers.

#### *3.5. Multivariate Analysis*

The data were further analyzed with Principal Components Analysis (PCA) and Multiple Factor Analysis (MFA). Scree plots revealed that both the sensory and the instrumental data could be represented well on two axes, with the first two factors accounting for 88.5% and 98% of the variance, respectively. The sample of green tea brewed with tap water was located close to both dimensions of overall and flavor liking, which, in turn, were negatively correlated with bitterness (Figure 5A). In plots of instrumental results, sample pairs GD and GB, as well as BD and BB, plotted almost exactly on top of one another (Figure 5B). For both black and green tea, the tap-brewed sample was the clear outlier. Samples GD and GB plotted close to the axes representing phenolics, EGCG, and the colorimetric L-value. MFA plots combining both sensory and instrumental data showed similar patterns (Figure 5C), with sample GT lying in the direction of overall and flavor liking, and anti-parallel to that of bitterness.
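The scree-plot criterion described above amounts to inspecting the proportion of variance carried by each principal component. A minimal sketch of that computation, using NumPy and a hypothetical randomly generated sample-by-attribute matrix (the study's sensory and instrumental matrices are not reproduced here):

```python
import numpy as np

def pca_variance_explained(X):
    """Fraction of total variance carried by each principal component."""
    Xc = X - X.mean(axis=0)                       # center each column (attribute)
    # SVD of the centered data yields the principal axes; singular values
    # squared are proportional to the variance along each component
    _, s, _ = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2
    return var / var.sum()

# Hypothetical 6 samples x 4 sensory attributes
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
ratios = pca_variance_explained(X)
print(ratios)  # descending proportions that sum to 1
```

Summing the first two entries of `ratios` gives the "variance accounted for by the first two factors" reported in the text.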

**Figure 5.** Multivariate analysis of tea samples. (**A**) Principal components analysis of sensory data. Samples shown in black, original axes in red, variance from new factors in parentheses. (**B**) Principal components analysis of instrumental data. Samples shown in black, original axes in blue, variance from new factors in parentheses. (**C**) Multiple factor analysis of sensory and instrumental data. Samples shown in black, sensory axes in red, instrumental axes in blue, variance from new factors in parentheses. Samples denoted as green tea brewed in tap (GT), bottled (GB), and deionized (GD) water, black tea brewed in tap (BT), bottled (BB), and deionized (BD) water.

#### **4. Conclusions**

Tea is the world's most consumed beverage besides water. This project sought to better understand whether the type of water used to brew tea matters to the everyday tea drinker. Through instrumental analysis of green and black tea brewed in tap, bottled, and deionized water, we demonstrated differences in color, turbidity, and the amount of EGCG extracted from tea leaves depending on the water type. The high mineral content of the tap water used in this study led to inferior extraction of catechins in green tea and thus produced an infusion that was less bitter, and also perceived as sweeter, than the same tea brewed in bottled or deionized water, with an accompanying higher degree of liking for green tea brewed in this manner. For tea drinkers consuming green tea for either its flavor or its health benefits, our results highlight that the type of water used to brew tea is clearly important, and suggest that those seeking greater health benefits should use a more purified water source to brew green tea, while those more concerned with flavor may prefer water from the tap.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2072-6643/11/1/80/ s1, Figure S1: Traditional Gaiwan brewing vessel, Figure S2: 8-option color matching diagram provided to consumer panel.

**Author Contributions:** Conceptualization, M.F. and R.D.; formal analysis, M.F., P.L., A.A., and R.D.; investigation, M.F. and P.L.; writing—original draft preparation, M.F. and R.D.; writing—review and editing, M.F., P.L., A.A., and R.D.; project administration, R.D.

**Funding:** This research received no external funding.

**Acknowledgments:** Thanks to Alina Stelick and members of the Dando lab for support in consumer testing, and In Pursuit of Tea for initial consultation.

**Conflicts of Interest:** The authors declare no competing interests.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Estimation of Olfactory Sensitivity Using a Bayesian Adaptive Method**

#### **Richard Höchenberger 1,2 and Kathrin Ohla 1,2,\***


Received: 24 April 2019; Accepted: 28 May 2019; Published: 5 June 2019

**Abstract:** The ability to smell is crucial for most species as it enables the detection of environmental threats like smoke, fosters social interactions, and contributes to the sensory evaluation of food and eating behavior. The high prevalence of smell disturbances throughout the life span calls for a continuous effort to improve tools for quick and reliable assessment of olfactory function. Odor-dispensing pens, called Sniffin' Sticks, are an established method to deliver olfactory stimuli during diagnostic evaluation. We tested the suitability of a Bayesian adaptive algorithm (QUEST) to estimate olfactory sensitivity using Sniffin' Sticks by comparing QUEST sensitivity thresholds with those obtained using a procedure based on an established standard staircase protocol. Thresholds were measured twice with both procedures in two sessions (Test and Retest). Overall, both procedures exhibited considerable overlap, with QUEST displaying slightly higher test-retest correlations, less variability between measurements, and reduced testing duration. Notably, participants were more frequently presented with the highest concentration during QUEST, which may foster adaptation and habituation effects. We conclude that further research is required to better understand and optimize the procedure for assessment of olfactory performance.

**Keywords:** smell sensitivity; olfaction; threshold; staircase; QUEST

#### **1. Introduction**

The appreciation of food involves all senses: sight, smell, taste, touch, and also hearing. While the sight of a cup of coffee may indicate its availability, it is typically its smell that makes it appealing and that triggers an appetite in most people. During consumption, the smell or aroma is perceived again retronasally and supported by its pleasant temperature and a bitter taste. These largely parallel sensations occur automatically and only raise awareness when one or more senses are disturbed. That said, the sense of smell has been shown to influence food choice and eating behavior [1], and its impairment has even been associated with a higher risk for diet-related diseases like diabetes [2]. Moreover, olfactory stimuli can evoke emotional states, are linked to memory storage and retrieval, and as such also serve as important cues for the rapid detection of potentially dangerous situations and threats (see, e.g., [3,4]). Given that the estimated prevalence of smell impairment is 3.5% in the United States [5], continuous efforts are made toward an efficient and precise assessment of olfactory function.

The Sniffin' Sticks test suite (Burghart, Wedel, Germany; [6]) is an established tool in the assessment of olfactory function. It consists of three tests involving sets of impregnated felt-tip pens: odor detection threshold (T), odor discrimination (D), and odor identification (I). Each test produces a score in the range from 1 to 16 (T) or from 0 to 16 (D and I) as a performance measure. Overall olfactory function is assessed by summing all three test results, yielding the *TDI score.* Comparison of individual TDI scores to the comprehensive set of available normative data (e.g., [7,8]) facilitates the interpretation of test scores and allows olfactory impairment to be diagnosed reliably. Notably, threshold, discrimination, and identification measure different facets of olfactory function [9]. The threshold, however, has been found to explain a larger portion of the variability in TDI scores than the other two measures [10]. Moreover, the discrimination and identification tests follow relatively simple test protocols in which all stimuli are presented only once and in a predefined order. The threshold test, in comparison, is of a more complex nature, and its method therefore offers the greatest potential for improvement. It follows a so-called adaptive method, specifically a "transformed" one-up/two-down staircase procedure [11]. The procedure first assesses a starting concentration and then moves on to the "actual" threshold estimation, during which fixed step widths are used: for each incorrect answer, the stimulus concentration is increased by one step; after two consecutive correct answers, the stimulus concentration is decreased by one step [6].

Since the one-up/two-down staircase was first conceived, several new approaches to threshold estimation, including Bayesian methods, have been published. Bayesian methods estimate parameters of the psychometric function (e.g., threshold or slope) using Bayesian inference: based on prior assumptions about the true parameter value, the stimulus concentration to be presented next is selected such that the expected information gain (about the parameter) is maximized. The first published Bayesian adaptive psychometric method is the QUEST procedure [12], which is still popular today. QUEST has two distinct properties that set it apart from the staircase described above. Firstly, it always considers the entire response history and is not solely based on the past one or two trials to select the optimal stimulus concentration to be presented next. Secondly, QUEST is not tied to a fixed step width, allowing it to traverse through a large range of concentrations more quickly.

In a clinical setting, at the otorhinolaryngologist's (ear-nose-throat, ENT) practice or at the bedside in the hospital, shorter testing times are always beneficial, as they reduce strain on patients and free up time for other parts of diagnostics and treatment. Reduced testing time also matters when working with healthy participants, e.g., in a psychophysical lab or in large cohort studies, where it spares resources and allows for a larger number of measurements in a given time.

QUEST has been shown to converge reliably and quickly in gustatory threshold estimation [13,14]. Inspired by these results, we set out to design and test a QUEST-based procedure for olfactory threshold estimation and to compare its performance with that of the established staircase method.

#### **2. Materials and Methods**

#### *2.1. Participants*

Thirty-six participants (32 women; median age: 29.5 years, age range: 19–61 years) completed the study. The influence of gender on olfactory performance has been investigated in previous studies. The results typically showed no gender differences (e.g., [15], several hundred participants; [7], >3000 participants, no main effect) or only rather small ones with negligible diagnostic and real-world relevance (e.g., [8], >9000 participants). We therefore did not enforce a gender balance in our sample. Due to a technical error, the identification test data were not recorded for one participant (female, 26 years old). All participants were non-smokers and reported being healthy and not having suffered from an infectious rhinitis for at least two weeks before testing. The study conformed to the revised Declaration of Helsinki and was approved by the ethical board of the German Society of Psychology (DGPs).

#### *2.2. Stimuli*

Stimuli were so-called Sniffin' Sticks (Burghart, Wedel, Germany; [6]), felt-tip pens filled with an odorant. The Sniffin' Sticks test battery consists of three subtests: an odor threshold test, an odor discrimination test, and an odor identification test. The threshold test comprises 48 pens. Sixteen pens were filled with different concentrations of 2-phenylethanol (rose-like smell) ranging from 4% to approx. 1.22 × 10⁻⁴% (a geometric sequence with a common ratio of 2, so the first pen contained a 4% dilution, the second 2%, the third 1%, and so on), dissolved in 4% propylene glycol, an odorless solvent. Note that in this test, the 1st pen contained the highest and the 16th pen the lowest odorant concentration. The remaining 32 pens contained 4% propylene glycol and served as blanks. The pens were arranged in triplets such that each triplet contained one pen with odorant and two blanks. The discrimination test comprised 48 pens that were filled with 16 different odorants at supra-threshold concentrations. The pens were arranged in triplets such that two pens contained the same and one pen a different odorant. The identification test comprised 16 pens filled with different odorants at supra-threshold concentrations.
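The geometric dilution series is easy to reproduce; a minimal sketch (pen numbering as in the text, concentrations in percent):

```python
def pen_concentration(pen):
    """Percent 2-phenylethanol in pen no. `pen` (1 = strongest, 16 = weakest):
    a geometric series starting at 4% and halving at each step."""
    return 4.0 * 0.5 ** (pen - 1)

concentrations = [pen_concentration(n) for n in range(1, 17)]
print(concentrations[0])   # 4.0
print(concentrations[-1])  # ≈ 1.22e-4, the lowest concentration quoted in the text
```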

#### *2.3. Procedure*

#### 2.3.1. Experimental Sessions

Participants were invited for two experimental sessions – the Test and Retest session for the odor threshold. To ensure similar testing conditions across sessions, participants were instructed to refrain from eating and drinking anything but water 30 min before visiting the laboratory. Further, both sessions were scheduled at approximately the same time of day, and took place with a median inter-session interval of 3.0 days (SD = 2.6, range: 0.9–8.9 days); only four participants had an inter-session interval of more than 7.0 days. In each session, olfactory detection thresholds were determined using two distinct algorithms, staircase and QUEST, described below. The order of algorithms was balanced across participants and kept constant for Test and Retest within each participant. Additionally, odor discrimination and odor identification ability were measured at the end of one session following the standard Sniffin' Sticks protocol (Burghart, Wedel, Germany).

#### 2.3.2. Stimulus Presentation

Testing took place in a well-ventilated testing room and was performed by the same experimenter, who refrained from using any fragrant products (e.g., soap, lotion, perfume) and wore odorless cotton gloves when presenting the stimuli. At the beginning of each test session, participants were blindfolded. To present a stimulus, the experimenter removed the cap from the pen, held the tip of the pen in front of the participant's nose, approx. 2 cm from the nostrils, and asked the participant to take a sniff. For the threshold test, participants were informed that the odorant may be presented in very low concentrations, and that only one of the three pens presented in each trial contained the odorant, while the others contained the solvent exclusively. The task was to "indicate which of the three pens smells different from the others", and participants had to provide a response even when unsure. Participants were familiarized with the odorant by presenting pen no. 1 (highest concentration) before testing commenced.

A similar procedure was used for the discrimination test: participants were blindfolded and presented with a triplet of pens containing clearly perceivable odorants. Each triplet consisted of two pens with the same and one pen with a different odorant. Again, participants were to indicate the pen that smelled different from the others. During threshold and discrimination testing, a stimulus triplet was presented on each trial, which lasted approx. 30 s and included the presentation of three pens (approx. 3 s each) and a pause of 20 s. These tests yield a probability of 1/3 of guessing correctly.

For the identification task, the blindfold was removed and participants smelled one pen at a time. They were to identify the odor by pointing to the matching word on a response sheet with four written response options. The interval between pens was approx. 30 s. The probability of guessing correctly in this task was 1/4.

#### 2.3.3. Staircase

Following the standard protocol (as detailed in the test manual; see also [16]), the order of presentation within the triplets varied from trial to trial. In the first trial, the odor pen was presented first; in the second trial, it was presented between the two blanks; and in the third, after the two blanks. After the third trial, this sequence was repeated.

We first determined the starting concentration. Beginning with the presentation of triplet no. 16 or 15 (balanced across participants), participants had to indicate which of the pens smelled different. Concentration was increased in steps of two (e.g., from pen 16 to 14) for each incorrect response. Once participants provided a correct response, the same triplet was presented again. If the response was incorrect, the concentration was increased again by two steps as before. However, if the triplet was correctly identified a second time, that dilution step served as the starting concentration.

Contrary to the standard protocol, where testing would then continue without interruption, our participants were granted a short break of approx. 1 min before the actual threshold estimation started with the presentation of the triplet containing the starting concentration. The threshold was determined in a one-up/two-down staircase procedure: odor concentration was increased by one step after each incorrect response (one-up), and decreased by one step after two consecutive correct responses at the same concentration (two-down). This kind of staircase targets a threshold of 70.71% correct responses ([11]; but cf. [17], who found small deviations from this value). That is, if presented repeatedly with a stimulus at threshold intensity, participants would be able to correctly identify it in about 71 out of 100 cases. The probability of providing *two consecutive* correct responses purely by guessing is 1/3 × 1/3 = 1/9. The procedure finished after seven reversal points were reached. The final threshold estimate was the mean of the last four reversal concentrations. This procedure is referred to simply as staircase throughout this manuscript.
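The one-up/two-down rule and the reversal-based estimate can be sketched as a short simulation. This is illustrative only, not the testing software: the deterministic observer and the start/true pen numbers are our assumptions, and the starting-concentration phase and triplet ordering are omitted.

```python
def simulate_staircase(start_pen, true_pen, n_reversals=7):
    """One-up/two-down staircase on pen numbers (1 = highest concentration).

    The simulated observer answers correctly whenever the presented pen
    is at least as concentrated as `true_pen` (an idealization)."""
    pen, correct_streak, direction = start_pen, 0, None
    reversals = []
    while len(reversals) < n_reversals:
        correct = pen <= true_pen               # deterministic observer
        if correct:
            correct_streak += 1
            if correct_streak == 2:             # two-down: weaken the stimulus
                if direction == -1:             # direction change = reversal
                    reversals.append(pen)
                direction = +1
                pen = min(pen + 1, 16)
                correct_streak = 0
        else:
            correct_streak = 0
            if direction == +1:                 # direction change = reversal
                reversals.append(pen)
            direction = -1                      # one-up: strengthen the stimulus
            pen = max(pen - 1, 1)
    return sum(reversals[-4:]) / 4              # mean of last four reversals

print(simulate_staircase(start_pen=9, true_pen=6))  # → 6.5
```

With an ideal observer the staircase oscillates around the true pen, and the mean of the last four reversals lands between the two pens bracketing the threshold.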

#### 2.3.4. QUEST

QUEST requires setting parameters that describe the assumed psychometric function linking stimulus intensity and expected response behavior. We assumed a sigmoid psychometric function of the Weibull family, as proposed by [12] (albeit in a slightly different parametrization) and as used for gustatory testing [13], with a slope *β* = 3.5, a lower asymptote *γ* = 1/3 (the chance of a correct response just by guessing), and a parameter *λ* = 0.01 to account for lapses (response errors due to momentary fluctuations of attention):

$$\Psi(x) = \lambda \gamma + (1 - \lambda)\left[1 - (1 - \gamma)\exp\left(-10^{\beta(x + T)}\right)\right],$$

Here, the presented concentration is denoted as *x*, and the assumed threshold as *T*. This yielded a function extending from 0.33 to 0.99 in units of "proportion of correct responses". The granularity of the concentration grid was set to 0.01. All parameters of this function were held constant, except for the threshold, which was the parameter of interest to be estimated in the course of the procedure. The prior estimate of the threshold was a normal distribution with a standard deviation of 20, centered on the concentration of pen no. 7, which served as the starting concentration. The algorithm was set to target the threshold at 80% correct responses, which is slightly higher than the threshold target of the staircase procedure but had proven to produce good results both in pilot testing and in gustatory threshold estimation [13,14]. Unlike in the staircase procedure, where the order of pen presentation varied systematically from triplet to triplet, triplets were presented in random order during the QUEST procedure.
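The psychometric function above can be evaluated directly; a small sketch (parameter defaults as given in the text) confirms that it spans roughly 0.33 at very low intensities to about 0.99 at very high ones:

```python
import math

def weibull_psi(x, T, beta=3.5, gamma=1/3, lam=0.01):
    """Weibull psychometric function as parametrized in the text.

    x: presented concentration (log units), T: assumed threshold."""
    return lam * gamma + (1 - lam) * (1 - (1 - gamma) * math.exp(-10 ** (beta * (x + T))))

print(round(weibull_psi(-10, T=0), 3))  # 0.333 -- guessing level at very low intensity
print(round(weibull_psi(10, T=0), 3))   # 0.993 -- ceiling, lowered slightly by the lapse rate
```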

Notably, QUEST updates its knowledge of the expected threshold after each response and proposes the concentration for the next trial such that the expected information gain about the "true" threshold is maximized. As the set of concentrations was discrete and limited to 16, QUEST might propose concentrations other than those contained in the test set. In this case, the software selects the triplet with the concentration closest to the one proposed. In contrast to the staircase, where the concentration was always decreased or increased by a single step once the starting concentration had been determined, the step width was not fixed in QUEST. For example, QUEST might step up three concentrations in one trial, step down two in the next, and present the exact same concentration again in the following trial. Whenever the same concentration had been presented on two consecutive trials, the concentration for the next trial was decreased if both responses were correct, and increased if both responses were incorrect. QUEST might also suggest presenting concentrations outside the range of available dilution steps. Therefore, we set up the algorithm such that, whenever the presentation of a pen below 1 or above 16 was suggested, we would instead present pen no. 1 or 16, respectively. QUEST would be informed about the actually presented pen and incorporate this information into the threshold estimate. Note, however, that final threshold estimates outside the concentration range could still occur occasionally and needed to be dealt with accordingly; see the data cleaning paragraph in the next section for details.

The procedure ended after 20 trials. The final threshold estimate is the mean of the posterior probability density function of the threshold parameter. We will refer to this procedure as "QUEST".
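The core loop — update a posterior over candidate thresholds after each response, then place the next stimulus where the posterior currently points — can be sketched with a grid-based toy version. This is illustrative only, not the study's implementation: it uses a simple logistic function with a 1/3 guessing rate in place of the Weibull form, places each trial at the posterior mean rather than maximizing expected information gain, and simulates the observer with a seeded random generator.

```python
import numpy as np

def run_quest(true_T, n_trials=20, seed=1):
    """Toy QUEST-style Bayesian threshold estimation on a fixed grid."""
    def p_correct(x, T):
        # probability of a correct triplet response at intensity x, threshold T
        return 1 / 3 + (2 / 3) / (1.0 + np.exp(-(x - T)))

    rng = np.random.default_rng(seed)
    grid = np.linspace(1.0, 16.0, 301)               # candidate thresholds
    log_post = -0.5 * ((grid - 8.0) / 4.0) ** 2      # log of a Gaussian prior
    x = 8.0                                          # starting intensity
    for _ in range(n_trials):
        correct = rng.random() < p_correct(x, true_T)      # simulated observer
        like = p_correct(x, grid) if correct else 1.0 - p_correct(x, grid)
        log_post += np.log(like)                     # Bayes' rule in log space
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        x = float(np.sum(grid * post))               # next trial at posterior mean
    return x                                         # final estimate = posterior mean

print(round(run_quest(true_T=6.0), 2))
```

Because every trial re-uses the full response history, the procedure can traverse several concentration steps at once, mirroring the variable step width described above.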

#### 2.3.5. Analysis

#### Odor Discrimination and Identification

The discrimination and identification tests comprised 16 trials each. For each test, the number of correct responses was summed up, resulting in a test score which can range from 0 to 16. Together with the staircase threshold, which yielded values from 1 to 16, the sum of all three test results formed a cumulative score: the TDI score.

#### Data Cleaning

When a participant reached one of the most extreme concentrations (i.e., pen no. 1 or 16) and provided a response that would, theoretically, require us to present a concentration outside the stimulus set, the staircase procedure cannot be safely assumed to yield a reliable threshold estimate anymore. For example, if a participant fails to identify the highest concentration (pen no. 1), the staircase procedure would accordingly demand to present a hypothetical pen no. 0, which obviously does not exist. Since our sole termination criterion was "seven reversals", we would repeatedly present pen no. 1 until a correct identification allowed the procedure to move up to pen no. 2 again. The resulting threshold estimate, then, would systematically overestimate this participant's sensitivity. Therefore, we set the threshold values of staircase runs in which participants could not identify pen no. 1 at least once to *T* = 1 after the run was completed, following [7] (but cf. [16], who suggest setting the value to *T* = 0 instead). This was the case in five of the 72 staircase threshold measurements (two during Test, three during Retest; five participants affected). Conversely, had a participant correctly identified the lowest concentration (pen no. 16), the staircase procedure would have required the presentation of a hypothetical pen no. 17, in which case we would have assigned a threshold value of *T* = 16; however, this situation did not occur in the present study after the starting concentration had been determined.

For QUEST, pen no. 1 was not correctly identified at least once in 12 of the 72 measurements, concerning 11 participants; no participant reached and correctly identified pen no. 16. QUEST yielded final threshold estimates *T* < 1 in 11 measurements (8 during Test, 3 during Retest; 10 participants affected). Similarly to the data cleaning procedure for the staircase, we assigned threshold *T* = 1 in these cases. Notably, this again concerned 3 of the 5 participants for whom we had assigned *T* = 1 in a staircase experiment.

#### Test–Retest Reliability

To establish test–retest reliability, we first compared the means of Test and Retest thresholds for each procedure. Q–Q plots and Shapiro–Wilk tests revealed that thresholds were not normally distributed for the QUEST Test session (*W* = 0.90, *p* < 0.01); we therefore compared the means using non-parametric Wilcoxon signed-rank tests. We then correlated Test and Retest threshold estimates via Spearman's rank correlation (Spearman's rho, denoted as *ρ*) to estimate the degree of monotonic relationship between measurements. Ordinary least squares (OLS) models were used to fit regression lines to provide a better understanding of the nature of the relationship between the threshold estimates (i.e., whether Test thresholds could predict Retest thresholds). Q–Q plots and Shapiro–Wilk tests showed that the regression residuals were normally distributed (all *p* > 0.05) and thus satisfied an important requirement for OLS regression.
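The paired comparison and the rank correlation can each be reproduced in one SciPy call; the threshold values below are hypothetical, invented purely to illustrate the usage (the study's data are not reproduced here):

```python
import numpy as np
from scipy import stats

# Hypothetical Test/Retest threshold estimates (pen units) for ten participants
test   = np.array([5.2, 7.1, 6.3, 8.0, 4.9, 6.8, 7.4, 5.5, 6.0, 7.9])
retest = np.array([5.6, 6.8, 6.1, 8.3, 5.2, 7.0, 7.1, 5.9, 6.4, 7.5])

# Non-parametric paired comparison of the session means
w_stat, w_p = stats.wilcoxon(test, retest)

# Spearman's rho quantifies the monotonic Test-Retest relationship
rho, rho_p = stats.spearmanr(test, retest)
print(round(rho, 3))
```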

Although correlation and regression analyses are widely used to assess test–retest reliability and to compare methods, it has been argued that these measures may in fact be inappropriate (see, e.g., [18–20]). Instead, analyses that focus on the *differences* between, not the agreement of, measurements should be preferred. A possible approach is to calculate the mean difference d̄ and the standard deviation of the differences between two measurements to derive *limits of agreement*, d̄ ± 1.96 × SD [18]. These limits correspond to the 95% confidence interval: in 95 out of 100 comparisons, the difference between two measurements can be expected to fall into this range. Narrower limits of agreement indicate better agreement between two measurements. The related repeatability coefficient (RC) is simply 1.96 × SD, and its interpretation is very similar to that of the limits of agreement: only 5% of absolute measurement differences will exceed this value, and a smaller RC indicates better agreement. (It should be noted that an alternative method for calculating the repeatability coefficient has been suggested, based on the within-participant standard deviation *s*<sub>w</sub> [20]. The results we obtained from these calculations were similar to those based on the standard deviation of the measurement differences. Because the latter are directly visualized in the Bland–Altman plot by the limits of agreement, i.e., mean difference ± 1.96 × SD, we opted to report only these values.)

If the differences between two measurements are plotted over the mean of the measurements, and d̄ and the limits of agreement are added as horizontal lines, the resulting plot is called a Bland–Altman plot (sometimes also referred to as a Tukey mean-difference plot). It can be used to quickly visually inspect how well measurements can be reproduced, specifically what systematic bias (d̄) and what variability or "spread" of measurement differences to expect. Accordingly, we assessed the RC and the limits of agreement, and produced Bland–Altman plots for both methods, staircase and QUEST, to gain more insight into the repeatability (or lack thereof) of measurements for each method. The use of these analyses requires the measurement differences to be normally distributed, which we confirmed using Q–Q plots and Shapiro–Wilk tests (the latter failed to reject the null hypothesis of normal distributions; all *p* > 0.05). Confidence intervals for the limits of agreement were calculated using the "exact paired" method [21].
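A minimal sketch of the limits-of-agreement computation (NumPy only; the paired values are hypothetical):

```python
import numpy as np

def limits_of_agreement(m1, m2):
    """95% limits of agreement and repeatability coefficient (RC = 1.96 x SD)
    from paired measurements, following Bland and Altman."""
    d = np.asarray(m1, float) - np.asarray(m2, float)
    d_bar = d.mean()                     # systematic bias between measurements
    rc = 1.96 * d.std(ddof=1)            # repeatability coefficient
    return d_bar - rc, d_bar + rc, rc

# Hypothetical Test vs. Retest thresholds for five participants
low, high, rc = limits_of_agreement([5.0, 6.5, 7.0, 4.5, 6.0],
                                    [5.5, 6.0, 7.5, 4.0, 6.5])
print(round(rc, 3))
```

In a Bland–Altman plot, `d_bar`, `low`, and `high` are exactly the three horizontal lines drawn over the scatter of differences.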

Lastly, to test whether the duration of the inter-session interval might be a confounding factor in the threshold estimates, we also calculated the Spearman correlation between inter-session intervals and differences between Test and Retest thresholds.

#### Comparison between Procedures

To compare the threshold estimates across procedures, we averaged Test and Retest thresholds for each participant within a procedure and, similarly to the analysis of reliability, compared the means with a Wilcoxon signed-rank test, followed by the calculation of Spearman's *ρ* and the fit of a regression line using an OLS model. The regression residuals were normally distributed according to a Q–Q plot and a Shapiro–Wilk test (*W* = 0.96, *p* = 0.26), satisfying the normality assumption of errors on which OLS regression critically relies.

Additionally, we estimated the 95% limits of agreement from the differences between the within-participant session means for the two procedures, and generated Bland–Altman plots. The measurement differences were normally distributed according to a Q–Q plot and a Shapiro–Wilk test (*W* = 0.96, *p* = 0.30). As in the investigation of test–retest reliability, we assessed confidence intervals of the limits of agreement via the "exact paired" method [21].

Because the limits of agreement derived from session means might actually be too narrow, as within-participant variability is removed by averaging measurements across sessions [20], we calculated adjusted limits of agreement from the variance of the between-subject differences, $\sigma_d^2$, which in turn can be calculated as $\sigma_d^2 = s_{\bar{d}}^2 + 0.5\,s_{xw}^2 + 0.5\,s_{yw}^2$. Here, $s_{\bar{d}}^2$ is the variance of the differences between the session means; and $s_{xw}^2$ and $s_{yw}^2$ are the within-participant variances of methods *x* and *y*, respectively (staircase and QUEST in our case). The limits of agreement can then be calculated as $\bar{d} \pm 1.96 \times \sigma_d$, with $\bar{d}$ being the mean difference between the session means of both procedures. Again, the interpretation of these limits is straightforward: 95% of the differences between staircase and QUEST measurements can be expected to fall into this interval, and narrower limits indicate a better agreement across the measurement results produced by both procedures. Finally, we derived 95% confidence intervals for these limits ([20], Section 5.1, Equation (5.10)).
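A sketch of these adjusted limits, assuming two sessions per method and hypothetical data (the helper `adjusted_limits` is illustrative, not the authors' code; the within-participant variances are estimated from the session differences, which assumes no systematic session effect):

```python
import numpy as np

def adjusted_limits(x1, x2, y1, y2):
    """Limits of agreement corrected for within-participant variability.

    x1, x2: Test and Retest measurements of method x (e.g., staircase);
    y1, y2: the same for method y (e.g., QUEST). One value per participant.
    """
    x1, x2, y1, y2 = map(np.asarray, (x1, x2, y1, y2))
    d_bar = (x1 + x2) / 2 - (y1 + y2) / 2   # differences of session means
    s2_dbar = d_bar.var(ddof=1)             # variance of those differences
    # With two sessions, the variance of the session difference equals
    # twice the within-participant variance, hence the division by 2.
    s2_xw = (x1 - x2).var(ddof=1) / 2
    s2_yw = (y1 - y2).var(ddof=1) / 2
    var_d = s2_dbar + 0.5 * s2_xw + 0.5 * s2_yw
    mean_d = d_bar.mean()
    sd = np.sqrt(var_d)
    return mean_d - 1.96 * sd, mean_d + 1.96 * sd

# Hypothetical data: two sessions each for methods x and y
lo, hi = adjusted_limits(
    x1=[6.0, 7.0, 5.0, 8.0, 6.5, 7.0], x2=[7.0, 6.5, 5.5, 9.0, 6.0, 7.5],
    y1=[5.0, 6.0, 4.0, 7.0, 5.5, 6.0], y2=[6.0, 5.5, 4.5, 8.0, 5.0, 7.0])
```

Because the within-participant terms are non-negative, the adjusted limits are always at least as wide as limits based on session means alone.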

#### Software

The experiments were run via PsychoPy 1.85.4 [22,23] running on Python 2.7.14 (https://www.python.org) installed via the Miniconda distribution (https://conda.io/miniconda.html) on Windows 7 (Microsoft Corp., Redmond, WA, USA). All analyses were carried out with Python 3.7.1, running on macOS 10.14.2 (Apple Inc., Cupertino, CA, USA). We used the following Python packages: correlation coefficients, Bland–Altman plots, and Q–Q plots were derived via pingouin 0.2.2 [24]; confidence intervals for the Bland–Altman plots were calculated with pyCompare 1.2.3 (https://github.com/jaketmp/pyCompare); Shapiro–Wilk statistics were calculated with SciPy 1.2.1 [25,26]; linear regression models were estimated using statsmodels 0.9.0 [27]; and box plots and correlation plots were created with seaborn 0.9.0 (https://seaborn.pydata.org) and matplotlib 3.0.2 [28].

#### **3. Results**

#### *3.1. Odor Discrimination and Identification*

The average test score was 13.3 (SD = 1.5, range: 11–16; *N* = 35) for odor discrimination, and 13.0 (SD = 1.6, range: 11–16; *N* = 36) for odor identification. When summed with the staircase threshold estimates from the Test and Retest sessions, we observed TDI scores of 33.34 (SD = 3.8; range: 26.5–43) and 33.64 (SD = 3.8; range: 26.75–41.75), respectively. Individual as well as cumulative scores indicate a below-average ability to smell (roughly around the 25th percentile) in our sample compared to recent normative data from over 9000 subjects [8].
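For clarity, the TDI composite is simply the per-participant sum of the threshold (T), discrimination (D), and identification (I) subtest scores (a trivial sketch with values close to the sample averages reported above; the function name is ours):

```python
def tdi_score(threshold, discrimination, identification):
    """Composite TDI score of the Sniffin' Sticks battery: the sum of the
    threshold (T), discrimination (D), and identification (I) scores."""
    return threshold + discrimination + identification

# Hypothetical participant near the sample means reported above
score = tdi_score(threshold=7.0, discrimination=13.3, identification=13.0)
```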

#### *3.2. Starting Concentrations*

The average starting concentration corresponded to pen no. 9.9 (SD = 4.2, range: 1–16) for the Test and pen no. 9.6 (SD = 4.1, range: 1–16) for the Retest session of the staircase. The average difference in starting concentrations between sessions was 4.9 pen numbers (SD = 4.0, range: 0–15). In comparison, we used a slightly higher, fixed starting concentration of pen no. 7 for QUEST.

#### *3.3. Test Duration*

The average number of trials needed to complete the staircase measurements was 23.6 (SD = 4.8, range: 13–41), which translates to approx. 11.5 min and is 2 min longer than QUEST, which per our parameters always lasted 9.5 min (20 trials). Test duration varied slightly between staircase sessions and was 24.4 trials (SD = 4.2, range: 16–34) for the Test and 22.9 trials (SD = 5.4, range: 13–41) for the Retest session. Please note that the number of trials and the testing duration for the staircase are based on the time required to reach seven reversal points after the starting concentration had been determined, thereby deviating from the "standard" procedure, which treats the starting concentration as the first reversal.
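The dependence of test duration on the reversal criterion can be illustrated with a toy simulation of a one-up/two-down staircase (a simplified sketch, not the exact Sniffin' Sticks protocol: the simulated observer responds correctly with a constant probability regardless of concentration, and pen levels are not clamped to the 1–16 range):

```python
import random

def staircase_trials(p_correct=0.71, n_reversals=7, seed=0):
    """Count trials until a one-up/two-down staircase accumulates the
    requested number of reversals. Toy model: the observer answers
    correctly with constant probability p_correct, independent of level."""
    rng = random.Random(seed)
    level, direction = 8, 0          # arbitrary start; 0 = no direction yet
    consecutive_correct, reversals, trials = 0, 0, 0
    while reversals < n_reversals:
        trials += 1
        if rng.random() < p_correct:
            consecutive_correct += 1
            if consecutive_correct == 2:   # two correct in a row: step down
                consecutive_correct = 0
                if direction == +1:        # direction change = reversal
                    reversals += 1
                direction = -1
                level -= 1
        else:                              # one incorrect: step up
            consecutive_correct = 0
            if direction == -1:
                reversals += 1
            direction = +1
            level += 1
    return trials
```

Because a downward step requires two trials, runs with many reversals quickly accumulate trials, which is why the reported staircase durations exceed the fixed 20 trials of QUEST.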

#### *3.4. Test-Retest Reliability*

The mean Test thresholds did not differ from the mean Retest thresholds for the staircase (*M*Test = 6.9, SDTest = 3.1; *M*Retest = 7.2, SDRetest = 3.2; *W* = 268.0, *p* = 0.19). For QUEST, on the other hand, mean test and retest thresholds differed significantly, with slightly higher sensitivity (higher *T* unit) in the Retest (*M*Test = 5.2, SDTest = 3.8; *M*Retest = 6.2, SDRetest = 3.4; *W* = 201.5, *p* < 0.01; see Figure 1).

**Figure 1.** Threshold estimates for the staircase and QUEST procedures during Test and Retest sessions. Each dot represents one participant. Horizontal lines show the median values, and whisker lengths represent 1.5 × inter-quartile range.

The Test and Retest thresholds correlated significantly for both procedures, with QUEST demonstrating a stronger relationship between measurements than the staircase (staircase: *ρ*(34) = 0.49, *p* < 0.01; QUEST: *ρ*(34) = 0.66, *p* < 0.001; Figure 2A).

As already pointed out, correlation gives an indication of the strength of the monotonic relationship between values, but only provides limited information on their agreement. We therefore calculated the repeatability coefficient RC and created Bland–Altman plots to generate a better understanding of the measurement differences. The RC predicts that two measurements (Test and Retest) will differ by the value of RC or less for 95% of participants. We found that RC was about 16% smaller for QUEST than for the staircase (RCStaircase = 6.44, RCQUEST = 5.43), suggesting a slightly better agreement between Test and Retest measurements for the QUEST procedure. Accordingly, the Bland–Altman plot (Figure 2B) showed narrower limits of agreement for QUEST (staircase: −6.79 [−8.89, −5.63] and 6.09 [4.93, 8.18]; QUEST: −6.42 [−8.18, −5.44] and 4.44 [3.46, 6.29]; 95% CIs in brackets). The mean of the differences between measurements was relatively small and deviated less than 1 *T* unit from zero (the "ideal" difference) for both methods (*M*Δ*T*,Staircase = −0.35 [−1.43, 0.72]; *M*Δ*T*,QUEST = −0.99 [−1.89, −0.08]). This systematic negative shift indicates that participants, on average, reached higher *T* units in the second session than in the first. The differences between Test and Retest measurements fell outside the respective limits of agreement for three (staircase) and two (QUEST) participants, which is close to the expected proportion of 5% of outliers (3/36 = 8.3%; 2/36 = 5.6%) and supports the appropriateness of the estimated limits. Considering the confidence intervals of the limits of agreement, an equal number of measurement differences (four) fell outside the predicted range for both procedures.

To test whether the time between Test and Retest sessions might be linked to the observed differences between Test and Retest threshold estimates, we computed correlations between those measures. We found no relationship for either method (staircase: *ρ*(34) = −0.12, *p* = 0.50; QUEST: *ρ*(34) = 0.03, *p* = 0.85).

**Figure 2.** (**A**) Correlation between Test and Retest threshold estimates for the staircase and QUEST procedures. (**B**) Bland–Altman plots showing the mean differences between Test and Retest, and the 95% limits of agreement, calculated as mean ± 1.96 × SD. The shaded areas represent the 95% confidence intervals (CIs) of the mean and of the limits of agreement. Each dot represents one participant.

#### *3.5. Comparison between Procedures*

Although the threshold estimates, averaged across sessions, were significantly higher for the staircase than for QUEST (*M*staircase = 7.0, SDstaircase = 2.7; *M*QUEST = 5.7, SDQUEST = 3.3; *W* = 101.0, *p* < 0.001; Figure 3A), we found a strong correlation between the procedures (*ρ*(34) = 0.80, *p* < 0.001; Figure 3B). The regression slope was close to 1, providing an indication of agreement across procedures. The Bland–Altman plot based on the session means (Figure 3C) shows a systematic difference between the procedures; specifically, QUEST thresholds were, on average, 1.38 [0.78, 1.97] *T* units smaller than the staircase estimates (95% CIs in brackets). The limits of agreement reached from −2.20 [−3.37, −1.56] to 4.95 [4.31, 6.12], meaning the difference between the two procedures can be expected to fall into this range for 95% of measurements. The observed difference between staircase and QUEST fell outside the limits of agreement for only one participant (1/36 = 2.8%); when considering the CIs of the limits, three participants fell outside the expected range (3/36 = 8.3%).

The corrected limits of agreement, taking into account individual measurements (as opposed to session means only), were −4.20 [−23.6, 15.3] and 6.96 [−12.5, 26.4], and thus substantially wider than the uncorrected limits. The large confidence intervals, which extend even beyond the concentration range, reflect the relatively large within-participant variability across sessions in both threshold procedures.

**Figure 3.** (**A**) Mean threshold estimates, averaged across Test and Retest sessions, for the staircase and QUEST procedures. Horizontal lines show the median values, and whisker lengths represent 1.5 × inter-quartile range. (**B**) Correlation between mean staircase and QUEST threshold estimates. (**C**) Bland–Altman plot showing the mean difference between session means of the two procedures, and the 95% limits of agreement, calculated as mean ± 1.96 × SD. The shaded areas represent the 95% confidence intervals (CIs) of the mean and of the limits of agreement. Each dot represents one participant.

#### **4. Discussion**

In the present study, we used a QUEST-based algorithm to estimate olfactory detection thresholds for 2-phenylethanol, with the aim of providing a reliable test result at a reduced testing time, as recently demonstrated for taste thresholds [13]. The results were compared to a slightly modified version of the widely used testing protocol based on a one-up/two-down staircase procedure [6,7,9,15,16].

Test–retest reliability was assessed using multiple approaches. Comparison of Test and Retest thresholds revealed a small yet significant mean difference for QUEST: threshold estimates during retest were higher than in the test, indicating an increase in participants' sensitivity. A similar effect was reported in a previous study [6]. However, with a mean difference of approx. 1 *T* unit or pen number, the practical relevance of this effect is debatable, even more so when considering the large variability of measurement results within individual participants.

Following the common practice of establishing test–retest reliability of olfactory thresholds (see, e.g., [6,9,29]), we calculated correlations between Test and Retest sessions. The correlation coefficient for QUEST (*ρ* = 0.66) indicated good, though not excellent, test–retest reliability. Reliability of the staircase procedure was only moderate (*ρ* = 0.49) and lower than reported in previous studies for *n*-butanol (*r* = 0.61; [6]) and 2-phenylethanol (*r* = 0.92; [9]) thresholds.

To acknowledge previous criticism of correlation analysis, which captures the strength of association between measurements but not their agreement [18–20], we calculated repeatability coefficients and generated Bland–Altman plots for the analysis of session differences. Repeatability was higher for QUEST than for the staircase; however, the measurement results of both procedures varied considerably across sessions for many participants. This inter-session variability is further substantiated by the differences in starting concentrations assessed for the staircase, which varied by up to 15 pen numbers in the most extreme case. The effect was not universal: some participants performed better in the Test than in the Retest session, whereas for others performance dropped across sessions or remained almost unchanged. Since both sessions had been scheduled within a relatively short time period and all measurements were performed by the same experimenter, the measurement variability can mostly be attributed to variability within participants themselves.

The comparison of the staircase and QUEST procedures via the session means of each participant showed that the staircase yielded slightly higher pen numbers (i.e., lower thresholds) than QUEST. This was expected, as the procedures were assumed to converge at approx. 71% and 80% correct responses, respectively. We found a strong correlation between the session means of the procedures (*ρ* = 0.80), and regression analysis showed an almost perfect linear relationship, which some would interpret as a good agreement between QUEST and staircase results. The 95% limits of agreement, taking into account the within-participant variability, showed a large expected deviation between the procedures (range: QUEST thresholds almost 7 *T* units smaller or more than 4 *T* units greater than staircase results), with the corresponding CIs of those boundaries even exceeding the concentration range. This result is indicative of the large variability we found within participants in both procedures. The limits of agreement based on the within-participant session means were much narrower, as variability is greatly reduced through averaging.

A potential source of variability might be guessing. In fact, the probability of responding correctly merely by guessing is 1/3.

In a series of simulations, it was shown that with an increasing number of trials, the frequency of correct guesses can become unacceptably high, potentially leading to increased variability in the threshold estimates [30]. The author determined that, for a staircase procedure like the one in our study, the expected proportion of such false-positive responses exceeds 5% from the 23rd trial onwards. For our staircase experiments, the average number of trials was 23.6; the procedure finished after 23 or more trials for 24 of the 36 participants in the Test, and for 20 participants in the Retest session. Therefore, the large variability between Test and Retest threshold estimates in the staircase could, at least partially, be ascribed to correct guesses "contaminating" the procedure. However, QUEST, which always finished after 20 trials, only had slightly better test–retest reliability according to the repeatability coefficient, suggesting that the largest portion of test–retest variability in our investigation was probably not caused by (too) long trial sequences and related false-positive responses alone.
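A Monte-Carlo sketch (not a reproduction of the simulations in [30], and simplified in that it ignores the staircase dynamics) illustrates how the chance that pure guessing produces at least one pair of consecutive correct responses, and thereby a spurious "down" step, grows with trial count:

```python
import random

def p_false_down_step(n_trials, n_sims=20000, p_guess=1/3, seed=0):
    """Monte-Carlo estimate of the probability that pure guessing yields
    at least two consecutive correct responses within n_trials, i.e.,
    that an observer who cannot smell the target at all would trigger
    a 'down' step at least once."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        streak = 0
        for _ in range(n_trials):
            if rng.random() < p_guess:      # correct guess (p = 1/3)
                streak += 1
                if streak == 2:             # two in a row: spurious step
                    hits += 1
                    break
            else:
                streak = 0
        # else: no consecutive pair occurred in this simulated run
    return hits / n_sims
```

For example, `p_false_down_step(5)` is markedly smaller than `p_false_down_step(20)`, mirroring the argument that longer runs accumulate more false-positive responses.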

Surprisingly, a number of participants were unable to correctly identify pen no. 1 at least on one occasion, and this effect was more pronounced during QUEST compared to the staircase. It seems plausible that the variable step size used by QUEST made it possible to approach even the extreme concentration ranges quickly, whereas the staircase requires a longer sequence of incorrect responses to reach pen no. 1.

Despite the careful selection of healthy participants who reported no smell impairment, olfactory performance was lower than recently reported in a sample comprising over 9000 participants [8]. This coincidental finding highlights the need for a comprehensive smell screening before enrollment. To what extent olfactory function contributed to the present results and limits their generalizability remains to be explored.

All QUEST runs completed after 20 trials. The procedure could be further optimized by introducing a dynamic stopping rule; for example, [13] set the algorithm to terminate once the threshold estimate had reached a certain degree of confidence. Such a rule can reduce testing time, as a run may finish in fewer than 20 trials, and should be considered in future studies. Although the reduction or omission of a minimum trial number bears the potential to reduce testing time further, it first needs to be shown that the algorithm performs well under these conditions and, most importantly, large-scale studies need to show whether such a reduced or faster protocol is appropriate for assessing odor sensitivity in participants with olfactory abilities at the extremes (particularly insensitive or sensitive individuals).

Inspection of the data showed that some staircase runs had not fully converged although seven reversal points were reached. In these cases, participants exhibited a somewhat "fluctuating" response behavior (or threshold) that caused the procedure to move in the direction of higher concentrations throughout the experiment (see Figure A1 in the appendix and supplementary data for an example). QUEST proved to behave more consistently, at least in some cases, by either converging to a threshold or by reaching pen no. 1, which would then sometimes not be identified correctly. These interesting differences between procedures require further investigation to fully understand their cause and influence on threshold estimates and, ultimately, diagnostics.

#### **5. Conclusions**

The present study compared the reliability of olfactory threshold estimates obtained with two different algorithms: a one-up/two-down staircase and a QUEST-based procedure. The measurement results of both procedures showed considerable overlap. QUEST thresholds were more stable across sessions than staircase thresholds, as indicated by a smaller variability of test–retest differences and a higher correlation between session estimates. QUEST also offered a slightly reduced testing time, which may be further minimized through a variable stopping criterion. Yet, QUEST tended to present the highest concentration, pen no. 1, more quickly than the staircase, which may induce more rapid adaptation and habituation during the procedure and, eventually, produce biased results. Further research is needed to better understand the possible advantages and drawbacks of the QUEST procedure compared to the staircase testing protocol.

#### **6. Data and Software Availability**

The data analyzed in this paper along with graphical representations of each individual threshold run are available from https://doi.org/10.5281/zenodo.2548620. The authors provide a hosted service for running the presented experiments online at https://sensory-testing.org; the sources of this online implementation can be retrieved from https://github.com/hoechenberger/webtaste.

**Author Contributions:** conceptualization, R.H. and K.O.; programming, analysis, and visualization, R.H.; interpretation and writing, R.H. and K.O.; supervision and project administration, K.O.

**Funding:** The implementation of the online interface was supported by Wikimedia Deutschland, Stifterverband, and Volkswagen Foundation through an Open Science Fellowship granted to R.H.

**Acknowledgments:** The authors would like to thank Andrea Katschak for data collection.

**Conflicts of Interest:** The authors declare no conflict of interest. The funding agents had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Appendix A**

Example threshold runs of the same participant: while the QUEST runs converged, the staircase runs did not fully converge although seven reversal points were reached. Intriguingly, the staircase nevertheless provided more consistent results (more similar thresholds across runs) than QUEST. We speculate that this participant exhibited a fluctuating response behavior during the staircase procedure.

**Figure A1.** Comparison of threshold estimation runs of the same participant during test and retest sessions for QUEST (**A**) and the staircase (**B**).

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Nutrients* Editorial Office E-mail: nutrients@mdpi.com www.mdpi.com/journal/nutrients
