*2.4. Statistical Analysis*

For processing sensory data from the assessors of each panel (by each respective panel leader), the IOC Excel spreadsheet was applied according to the official methodology ([6,8] and later modifications). Sensory data from each panel were then processed by the UNIBO panel leader and, after application of the proposed decision tree, the coefficient of variation (CV) was calculated [26] (dataset). A limit for the CV, based on its frequency distribution, was also proposed to check the level of variability.

#### *Foods* **2020**, *9*, 355

The CV frequency distribution was also expressed as cumulative probability by the *t*-test (Student's test distribution).

For control of the performance of the panel, estimation of both precision and trueness of panels was performed according to IOC guidelines [25]. The estimation of the precision of panels was made during the procedure of replicate analysis by the calculation of both normalized error (En) and repeatability number (rN), whereas control of panel trueness was obtained by z-score estimation.

#### **3. Results and Discussion**

#### *3.1. The Decision Tree*

For the "quantitative panel test", it was very important to classify samples to reach agreement (among the six panels involved) on sensory characteristics (in terms of intensity of positive/negative attributes), thus providing useful information for instrumental analysis.

For this specific objective, the classification of samples based on the evaluation data provided by the six panels was elaborated by applying a decision tree (Figure 1), a new tool for categorization of VOOs.

**Figure 1.** Decision tree adopted for statistical processing of sensory results provided by the six panels. mpd = main perceived defect. \* Mean value calculated on the median values obtained by OLEUM panels for mpd and the fruity attribute.

The adopted decision tree is based on the agreement (more than 50% of panels) on the category and on the median of the intensity of the main perceived defect (mpd) and/or of the fruity attribute. If one of these two agreements was not reached, a first or second type of misalignment occurred, and the sample was not classified. Following the flow of the decision tree, the UNIBO panel leader first checked whether the sensory data provided by at least four out of six panels defined the sample as belonging to the same quality grade; if yes, agreement on the mpd was also checked, while in the negative case, formative reassessment was required.

If the desired agreement was met for both criteria, it was possible to proceed with calculation of the mean of the medians (provided by each panel) for classifying the samples. The coefficient of variation (CV%) was applied and considered satisfactory if ≤35% (adequate level of variability). The adoption of 35% as upper limit of CV was selected by observing the frequency distribution of all CV% values registered for the mpd and fruity attribute for the set of samples analyzed. The frequency distribution was also expressed as cumulative probability (*p* = 0.74) applying the *t*-test (Figure 2).


**Figure 2.** Control of the level of variability of values obtained by application of the decision tree based on the frequency distribution of CV%. CV% = variability of the median values with respect to the mean value. The frequency distribution was also expressed as cumulative probability by *t*-test (Student's test distribution).
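The flow of the decision tree can be illustrated with a minimal Python sketch. It is not the procedure actually implemented by the UNIBO panel leader: the function name `classify_sample`, the input layout, the use of the sample standard deviation for CV%, and the choice to average only the medians of panels that named the agreed attribute are all assumptions made for illustration. Only the thresholds (at least four of six panels, CV% ≤ 35) come from the text.

```python
from collections import Counter
from statistics import mean, stdev

CV_LIMIT = 35.0  # upper limit for CV% given in the text
MIN_AGREE = 4    # "at least four out of six panels"


def classify_sample(panels):
    """Sketch of the decision tree for one sample.

    panels: list of (category, attribute, median) tuples, one per panel
    (hypothetical layout). 'attribute' is the main perceived defect, or
    "fruity" for oils without defects.
    """
    # Step 1: agreement on the quality grade.
    category, n_cat = Counter(p[0] for p in panels).most_common(1)[0]
    if n_cat < MIN_AGREE:
        return {"classified": False, "reason": "no agreement on category"}

    # Step 2: agreement on the main perceived defect (or fruity attribute).
    attribute, n_attr = Counter(p[1] for p in panels).most_common(1)[0]
    if n_attr < MIN_AGREE:
        return {"classified": False, "reason": "no agreement on attribute"}

    # Step 3: mean of the panel medians and its CV%.
    # Assumption: only panels that named the agreed attribute contribute,
    # and the sample standard deviation (n - 1) is used.
    medians = [p[2] for p in panels if p[1] == attribute]
    m = mean(medians)
    cv = 100.0 * stdev(medians) / m
    return {
        "classified": True,
        "category": category,
        "attribute": attribute,
        "mean_of_medians": m,
        "cv_percent": cv,
        "cv_ok": cv <= CV_LIMIT,
    }


# Hypothetical medians from six panels for one sample.
example = [
    ("V", "rancid", 2.5), ("V", "rancid", 3.0), ("V", "rancid", 2.0),
    ("V", "rancid", 2.8), ("L", "rancid", 3.6), ("V", "musty", 2.2),
]
result = classify_sample(example)
print(result)
```

In this invented example five panels agree on the category and five on the defect, so the sample is classified and its CV% falls below the 35% limit; a three-to-three split on the category would instead return an unclassified sample that would go to formative reassessment.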

Cerretani and co-workers investigated the relationship between sensory and chemical composition of VOOs to assess correlations between sensory attributes and minor components [27]; in this study, sensory attributes were assessed by four panels (two Italian and two Spanish) employing a total of 59 tasters, and the median values for each VOO evaluated by panels were used as the final input for statistical analysis. In our work, the mean of the medians provided by each panel was considered. The median is the middle value of an ordered set containing an odd number of values, or the mean of the two middle values of an ordered set containing an even number of values. It is, therefore, a robust tool since it is not strongly influenced by outliers; considering that it was already applied by each panel individually, the mean of the medians was also considered more appropriate for comparison of results of panels and for monitoring performance.
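The robustness of the median can be seen in a short Python illustration. The scores below are invented for the purpose and do not come from the dataset; they represent eight assessors of one hypothetical panel, one of whom gives an outlying intensity.

```python
from statistics import mean, median

# Hypothetical intensity scores from eight assessors; the last is an outlier.
scores = [2.0, 2.3, 2.5, 2.6, 2.8, 3.0, 3.1, 9.0]

panel_median = median(scores)  # mean of the two middle values (2.6 and 2.8)
panel_mean = mean(scores)      # pulled upwards by the outlier

print(f"median = {panel_median:.2f}, mean = {panel_mean:.2f}")
```

The median stays close to the bulk of the scores, whereas the arithmetic mean is shifted by the single outlying value.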

The decision tree was applied to the entire set of 334 oils and, in case of misalignments, samples were reassessed in a sensory session (formative reassessment) where each panel was provided with the available IOC reference materials and certified oils evaluated by at least three accredited panels (sent by the UNIBO panel) to improve the identification of any defects and assessment of their intensity. The reassessments were done in a blind way (no information related to the type of misalignment was provided to panel leaders), again applying the organoleptic assessment method, but without open discussion of the attributes between assessors.

During the first year of the project, 176 of 180 oils were classified, and only four misalignments occurred (Table S1a–d); in summary, 152 of 180 samples were immediately classified, and 28 samples were reassessed since first- and/or second-type misalignments occurred (14 samples for each type of misalignment). At the end of formative reassessment, 176 samples were classified (54 EV, 76 V, and 48 L), but classification was not possible for four samples (UN\_10, UP\_14, EU\_29, and UN\_32) since agreement among four of six panels was not reached.
Specifically, disagreement on the category (V/L) was obtained for UN\_10 and UP\_14, but for both, fusty-muddy sediment and rancid were perceived by at least four of six panels, indicating that these are borderline samples; on the other hand, for samples EU\_29 and UN\_32, an agreement on the category (V) was reached, but not on the identity of the mpd, due to the presence of more than one defect (fusty-muddy sediment, musty, winey, frostbitten olives, and rancid were indicated for EU\_29; fusty-muddy sediment, frostbitten olives, and rancid were indicated for UN\_32), none of which was perceived by at least 50% of the panels.

The sensory evaluation of oils from the second sampling (2017/2018 oil campaign), as well as the application of the decision tree, allowed the classification of this set (154 oils) as follows: 69 classified as EV, 51 classified as V, 33 classified as L; one sample was not classified due to an anomalous lemon smell (ZRS\_1) and was therefore excluded from the set (Table S2a–d). For 17/154 oils, misalignments of the first or second type occurred (15 and 2, respectively) but, after formative reassessment, all samples were classified by OLEUM panels.

A recent comparative study [28] on a panel test made by nine IOC recognized panels (five from Italy, two from Spain, one from Greece, and one from Slovenia) and chemical analysis of commercial olive oils (16 samples) reported that the sensory methodology works well in case of extremely good olive oils, but not for common commercial ones, and therefore it should be applied only for Protected Designation of Origin (PDO) and other peculiar EVs. Results from the present work, carried out on a large set of commercial VOOs, are in disagreement with those of Circi et al. [28]. The panel test is an official method that has been used to assess improvement in the quality of VOOs since 1991 up to now and provides information on sensory characteristics (intensities of fruity, bitter, and pungent; presence of more than one defect) that are difficult to obtain using a single instrumental approach. The strict application of IOC guidelines for training and quality control of panels, together with some improvements in the training of a sensory panel, such as the availability of new reference materials that are stable and reproducible, is crucial to increase the reliability of a method that uses a group of assessors as an analytical tool.


#### *3.2. The Panel's Performance*

The UNIBO panel, responsible for statistical elaboration of the sensory results, in agreement with the guidelines of IOC document T.28 revised in 2018 [25], summarized the z-score (satisfactory, questionable or unsatisfactory results) for each subgroup of samples from each year and sent it to panel leaders to help them in monitoring the performance of their own panel and to adopt any corrective actions.

The same method adopted by the IOC during its proficiency test (IOC z-score) was applied; it was calculated using: (i) the median (Me) of the predominant defect (the intensity of the predominant defect was considered regardless of the type of defect, which could differ between the six panels) and/or the fruity attribute detected by each panel; (ii) the great median (assigned value, GM) calculated as the median of the medians for the predominant defect or for the fruity attribute (detected by all panels as consensus value); (iii) the standard deviation (σ\_obj) of the scores calculated from IOC historical data (±0.7). A slightly modified version of this method (OLEUM z-score) was also adopted; the only difference from the previous one was, in case of V and L categories, the use of the median (Me) of the defect identified as predominant by consensus of the panels (even if it was not the predominant defect for each panel).

Therefore, the intensity, and also the type of the mpd, was considered in the OLEUM version of the z-score to obtain a reliable dataset for comparison with instrumental data (e.g., in OLEUM for developing screening methods based on the analysis of volatile compounds). The detailed formulas of both the methods used to calculate the z-score are shown in Figure 3.

**Figure 3.** Formulas of the two methods used to calculate the z-score (IOC and OLEUM).
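Based on the three quantities described above (panel median Me, great median GM, and the objective standard deviation of 0.7), the two z-score variants can be sketched in Python as follows. The sketch assumes the usual form z = (Me − GM) / σ; the exact formulas are those of Figure 3, which is not reproduced here. The function name `z_scores` and all input values are hypothetical.

```python
from statistics import median

SIGMA_OBJ = 0.7  # standard deviation from IOC historical data, as stated in the text


def z_scores(panel_medians, sigma=SIGMA_OBJ):
    """Return one z-score per panel, assuming z = (Me - GM) / sigma.

    GM (great median) is the median of the panel medians.
    """
    gm = median(panel_medians)
    return [(me - gm) / sigma for me in panel_medians]


# IOC variant: each panel contributes the median of its own predominant
# defect, whatever that defect is (hypothetical values, six panels).
ioc_medians = [2.5, 3.0, 2.0, 2.8, 3.6, 2.2]

# OLEUM variant: each panel contributes its median for the defect that is
# predominant by consensus. Here panel 6 perceived another defect as
# predominant and scored the consensus defect at 1.5 (hypothetical).
oleum_medians = [2.5, 3.0, 2.0, 2.8, 3.6, 1.5]

ioc_z = z_scores(ioc_medians)
oleum_z = z_scores(oleum_medians)
print([round(z, 2) for z in ioc_z])
print([round(z, 2) for z in oleum_z])
```

The two variants share the same computation and differ only in which median each panel contributes, which is why a panel's OLEUM z-score can differ from its IOC z-score for the same sample.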

Results of the z-score estimation were illustrated by quality control charts, as part of internal quality control. Some examples of panel performance evaluation are reported in Figures 4 and 5; the vertical axis represents the z-score and the horizontal one identifies the sample codes.


**Figure 4.** Example of z-score graph for estimation of panel performance, calculated on 60 samples from the subgroup of the first sampling year (180 samples). Criteria of acceptance: |z| ≤ 2, performance was satisfactory; 2 < |z| ≤ 3, performance was questionable; |z| > 3, performance was considered unsatisfactory. The z-scores were calculated for median of the main perceived defect (for V and L category) and for the median of fruity attribute (for V and EV category).

The z-score has positive or negative values and was calculated for both fruity (for EV and V category) and negative sensory attributes (for V and L category); the central value is zero, the warning limits for the index are ±2, and the action limits are ±3. The interpretation is the same for both the methods applied (IOC and OLEUM): if |z| ≤ 2, performance was satisfactory; if 2 < |z| ≤ 3, performance was questionable; finally, if |z| > 3, performance was considered unsatisfactory. Each panel leader, observing this chart, had to define any corrective and/or preventive actions to be taken if a result fell outside the limits or if several consecutive results fell on the same side (positive or negative) of the central value (bias) [25]. The results obtained verified that the approach using the z-score represents a very useful tool to evaluate the trueness of the panel over time.
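The acceptance criteria and the check for consecutive results on the same side of zero can be expressed as a short Python sketch. The thresholds are those given above; the function names and the z-score series are hypothetical, and since the text does not state how many consecutive results count as "several", the sketch only reports the longest run and leaves the judgement to the panel leader.

```python
from collections import Counter


def rate(z):
    """Map a z-score to the performance classes used in the control charts."""
    if abs(z) <= 2:
        return "satisfactory"
    if abs(z) <= 3:
        return "questionable"
    return "unsatisfactory"


def longest_same_side_run(zs):
    """Length of the longest run of consecutive z-scores with the same sign."""
    longest = run = 0
    prev_side = 0
    for z in zs:
        side = (z > 0) - (z < 0)  # +1, -1, or 0 for a z-score of exactly zero
        if side != 0 and side == prev_side:
            run += 1
        else:
            run = 1 if side != 0 else 0
        prev_side = side
        longest = max(longest, run)
    return longest


# Hypothetical z-scores of one panel, in sample order.
zs = [0.5, 1.2, 2.4, 3.1, -0.3, -2.0]
tally = Counter(rate(z) for z in zs)
run = longest_same_side_run(zs)
print(dict(tally), "longest same-side run:", run)
```

A tally of this kind corresponds to the counts of satisfactory, questionable, and unsatisfactory results reported for each panel, while a long same-side run points to a systematic tendency to score above or below the great median.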

An example of panel performance reported in Figure 4 showed that, in the case of OLEUM z-score for the mpd (V and L), the panel obtained 25 of 48 satisfactory results, 12 questionable, and 11 unsatisfactory, whereas in the case of IOC z-score, 23 of 48 satisfactory, 14 questionable and 11 unsatisfactory results were obtained. In the case of IOC z-score for fruity attribute (V and EV), the panel obtained 29 of 42 satisfactory results, 7 questionable, and 6 unsatisfactory. These results highlight a trend of the panel to more frequently use higher values of the scale for the intensity of mpd or fruity attribute than the GM value (median of the medians of six panels); moreover, in some cases, the presence of a z-score lower than −2 indicated the lack of intensity recognition of the mpd or of the fruity attribute. The second example (Figure 5) showed that for the mpd (V and L), the panel obtained 18 of 19 satisfactory results and 1 questionable result for OLEUM z-score, while obtaining 17 of 19 satisfactory results and 2 questionable results for IOC z-score.

**Figure 5.** Example of z-score graph for estimation of panel performance, calculated on 38 samples from the third subgroup of the second sampling year (154 samples). Criteria of acceptance: |z| ≤ 2, performance was satisfactory; 2 < |z| ≤ 3, performance was questionable; |z| > 3, performance was considered unsatisfactory. The z-scores were calculated for median of the main perceived defect (for V and L category) and median of fruity attribute (for V and EV category).

In the case of IOC z-score for the fruity attribute (V and EV), the panel obtained 25 of 26 satisfactory results and 1 questionable result. Overall, the panel showed good performance, although the verification of samples in which the z-score is questionable, using both the panel results and those provided by all panels (by the application of the decision tree), was suggested in the feedback sent to the panel leader. The estimation of z-score was consistent in evaluating the performance of sensory laboratories over time. Its application in this study showed a progressive, greater convergence of results passing from the first to the second sampling and allowed identification of the critical aspects of the performance of each panel and definition of suitable actions for improvement.

In addition to the z-score estimation, during the second year of sampling, the control of the panel's precision was also performed by using replicate analysis. The repeatability of panels was controlled by comparing the medians obtained on three samples in duplicate and determining whether the results are homogeneous and, therefore, statistically acceptable.

Specifically, three pairs of identical samples were sent to the panels with different codes (blind conditions) (UN\_44 = UN\_55, UN\_59 = UN\_60, and UN\_66 = UN\_69) and the level of agreement between intensity values expressed for the same sample during independent evaluations was estimated by calculating the repeatability number (rN) and normalized error (En), whose acceptability limits are ≤2 and ≤1, respectively [25] (Table 1).

**Table 1.** Values of repeatability number (rN), normalized error (En) of each panel for the predominant defect (d) or fruity attribute (f) and suggested limits for these parameters, calculated on the three pairs of samples (UN\_44/UN\_55, UN\_59/UN\_60, UN\_66/UN\_69) evaluated in the replicate analysis (blind conditions).


In general, the panels showed good repeatability. In the case of the first pair of samples (UN\_44 = UN\_55, category V), the values of both parameters (En and rN) were below the suggested limits for good performance; for the second pair of samples (UN\_59 = UN\_60, category V), the least satisfactory performances were achieved: three panels (2, 3 and 6) showed values above these limits, highlighting the need for additional training to improve performance. Finally, for the third replicated sample (UN\_66 = UN\_69, category EV), only one panel registered values above the limits, due to different intensities of the fruity attribute in the two sessions, and therefore was not considered repeatable. These indices are based on the evaluation of the correct intensity of the mpd or fruity attribute (and therefore the product quality grade) by each panel and do not take into account the type of defect; results from the application of the decision tree were consistent for the correct classification of samples, but not for the mpd (UN\_44 fusty-muddy sediment, UN\_55 rancid, UN\_59 brine, UN\_60 winey). The inconsistency in the nature of the mpd was probably due to the presence of more than one defect in the sample with similar intensities; in addition, brine and winey are often perceived together.
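For orientation, the normalized error applied to a duplicate pair can be sketched in Python using its general metrological definition, En = |x1 − x2| / sqrt(U1² + U2²), with a result considered acceptable when En ≤ 1. This is only an assumed form: the exact expression and the way the expanded uncertainties U1 and U2 are obtained are those of the IOC guidelines [25] and are not reproduced in the text. The function name and all numerical values below are hypothetical.

```python
from math import sqrt

EN_LIMIT = 1.0  # acceptability limit for En given in the text


def normalized_error(me1, me2, u1, u2):
    """General normalized error between two medians of the same sample.

    me1, me2: medians from the two blind evaluations.
    u1, u2: expanded uncertainties of the two medians (assumed inputs).
    """
    return abs(me1 - me2) / sqrt(u1 ** 2 + u2 ** 2)


# Hypothetical duplicate evaluation of one sample by one panel.
en = normalized_error(me1=3.0, me2=2.4, u1=0.5, u2=0.5)
print(f"En = {en:.2f}, acceptable: {en <= EN_LIMIT}")
```

Because the index compares intensities only, two evaluations that name different defects but assign similar intensities would still pass, which matches the observation that these indices do not take the type of defect into account.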
