#### 3.2.6. Panelist·Session Interaction

A significant panelist·session interaction means that one or more panelists do not grade the products consistently from one session to another. There were several significant panelist·session interactions. Among the descriptors that contributed to discrimination, mushroom, oak barrel, cheesy smell, sourness, chewiness, bitterness, and saltiness had significant interactions (Table 1). The contribution of each panelist to this interaction can also be evaluated through their respective coefficients, estimated as described above. Figure 4 shows examples.

The panelists that contributed most to the between-session differences in scores, by descriptor, were: A13 for skin red, flesh red, and flesh green. Regarding other descriptors, A12 contributed markedly to vinegar, and A5 to natural fruity/floral, alcohol, and earthy soil (data not shown). However, most of the panelists had homogeneous contributions for most of the descriptors (skin green, skin sheen, flesh yellow, or briny, Figure 4). Moreover, no panelist showed a systematic trend across all descriptors, with only a few exceptions, like A12 for skin sheen and flesh red or A7 for mushroom (Figure 4). Consequently, the interaction was mainly due to the contribution of a reduced number of panelists (frequently only one), with limited influence on the panel repeatability.
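The interaction coefficients discussed here can be estimated as the cell means minus the additive (grand-mean, panelist, and session) effects. A minimal sketch in Python with simulated scores (the panelist count, session count, and values are illustrative, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated mean scores for one descriptor: 5 panelists x 2 sessions
means = rng.uniform(2, 8, (5, 2))

# Interaction coefficients: cell mean minus grand, panelist, and session effects
grand = means.mean()
panelist_eff = means.mean(axis=1) - grand
session_eff = means.mean(axis=0) - grand
interaction = means - grand - panelist_eff[:, None] - session_eff[None, :]

# A panelist with large |coefficients| drives the panelist x session interaction
print(np.abs(interaction).max(axis=1).round(2))
```

By construction the coefficients sum to zero across panelists and across sessions, so a single large coefficient pinpoints the panelist responsible for the interaction.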

The panelist·session interaction may also be presented as a plot of the mean per session against the mean over all sessions, according to panelists (Figure 5). Ideally, the points should fall on a line, regardless of session. In general, the panelists followed a similar trend over sessions (Figure 5 for some descriptors), with only occasional exceptions, like panelist A6 for rancid. Other cases involved panelists A4, A12, and A8 for bitterness, due to the abnormally low scores they gave (data not shown).
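The quantities behind this Figure 5-style diagnostic can be obtained with a few lines of pandas; the panelists, sessions, and scores below are simulated for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated scores for one descriptor: 3 panelists x 2 sessions x 4 samples
df = pd.DataFrame({
    "panelist": np.repeat(["A1", "A2", "A3"], 8),
    "session": np.tile(np.repeat([1, 2], 4), 3),
    "score": rng.uniform(0, 10, 24).round(1),
})

# Mean per session for each panelist ...
session_means = df.groupby(["panelist", "session"])["score"].mean().unstack("session")
# ... against the mean over all sessions
overall_means = df.groupby("panelist")["score"].mean()

# A repeatable panelist has session means close to the overall mean;
# a large gap flags a panelist x session interaction (cf. A6 for rancid)
gap = session_means.sub(overall_means, axis=0).abs().max(axis=1)
print(gap.round(2))
```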

**Figure 3.** Panel performance. Sample·session interaction. Mean per session of panelists, according to samples, over the sample means of the whole sessions for significant descriptors: (**A**) saltiness, and (**B**) metallic.

**Figure 4.** Panel performance. Panelist·session interaction. Contribution (coefficients) of panelists to the interaction for selected descriptors (skin red, skin green, skin sheen, flesh red, flesh yellow, flesh green, briny, and mushroom).

**Figure 5.** Panel performance. Panelist·session interaction. Means per session according to panelists over means of the whole sessions for selected descriptors (ripeness, buttery, metallic, and rancid).

Finally, the plot of the different coefficients over sessions is the most common evaluation of the panelist·session interaction (Figure 6, for flesh red as an example). In this case, the problems that could be observed are, again, a different ranking in successive sessions or a different amplitude of scale over sessions. In Figure 6, panelist A13 assigned an excessively high score in the first session, while in the second session the score was low. Additionally, the amplitude of the scale for this descriptor was wider in the first session than in the second. In saltiness, the situation was different: A12 had a very low contribution (coefficient), but the scale amplitude was similar among sessions; in firmness and fibrousness, panelist A13 was the only one who gave an excessively high score and, consequently, had a high contribution to the interaction, whereas the contribution to saltiness was low. Therefore, the detailed analysis of this interaction allowed for detecting some weaknesses in panel performance and a lack of coherence in some panelists. Hence, personalized training would be advisable.

**Figure 6.** Panel performance. Panelist·session interaction. Detail of the coefficients through the three sessions for the flesh red descriptor.

#### *3.3. Panelist Performance*

When a panelist can discriminate among samples and is repeatable and reproducible (that is, scores the same product consistently and agrees with the rest of the panel), the panelist is considered reliable according to Rossi [18]. There are several techniques for evaluating these panelist performance parameters. Tomic et al. [20] developed a series of graphs for easy visualisation of sensory profiling data for performance. Kermit and Lengard Almli [19] mentioned consonance analysis with PCA, the full ANOVA model and notation, assessor sensitivity, assessor reproducibility, or the agreement test as appropriate for evaluating assessor and panel performance. Lanza and Amoruso [17] mention the repeatability index (RIt) and deviation index (DIt) to evaluate how assessors perform against themselves over time and their performance with respect to the whole panel, respectively. In this work, the diverse tools proposed by Husson et al. [31] for studying the panelists' work will be particularly followed.

#### 3.3.1. Discrimination Power of Each Panelist

The individual efficiency of panelists was evaluated with the model: score = sample + session. The *p*-values (Table 2) associated with the F-test of the sample effect for each panelist are, then, the appropriate parameter for measuring this discrimination power. Their values, with rows and columns sorted by the median estimated over them (Table 2), showed that most of the panelists were able to discriminate the black ripe table olive samples based on several of the descriptors developed by Lee et al. [9] and used later by López-López et al. [10]. Their efficiencies, in decreasing order, were: A14, A4, A2, A3, A6, A5, A8, A1, A12, A13, and A7, while only A11, A10, and A9 had no discriminant power (Table 2). Skin green was the only descriptor that received an overall significant median; however, mouth coating, flesh red, briny, flesh green, or skin red were among the attributes most differently perceived in the samples (Table 2). On the contrary, soapy smell/medicinal, fishy smell, cheesy smell, alcohol, or metallic were among the most similarly perceived; however, this does not necessarily mean that the panelists were unable to differentiate samples, but rather that these attributes were present at very low intensity or even completely absent (Table 2).

There is controversy over the *p*-value that could be used as a cut-off level to consider a panelist acceptable. Stone et al. [32] proposed *p* ≥ 0.5, but the problem was that there were so many *p*-values below 0.5 when evaluating tea that almost any laboratory would retain them. Powers [33] pointed out that the real question was establishing the number of attributes with significant performance necessary for a judge to be an acceptable assessor. However, no agreement on this aspect was achieved. In this work, in general, the panelists were not systematically excellent in all descriptors, but most of them were good at some descriptors (significant *p*-value), and their overall performance was reasonable; however, according to these results, panelists A11, A10, and A9 should be candidates for further training or even removal from the panel if their performance does not sufficiently improve. Kermit and Lengard Almli [19] also identified an assessor with further need for training in the attributes pea flavor, sweetness, fruity, and off flavor.
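For a balanced design with one evaluation per sample per session, the per-panelist model score = sample + session and the F-test of the sample effect can be computed directly from the sums of squares. A sketch with simulated scores (sample and session counts are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# One panelist's scores for a single descriptor: 4 samples x 3 sessions
samples, sessions = 4, 3
scores = rng.uniform(0, 10, (samples, sessions))
scores += np.array([0.0, 1.5, 3.0, 4.5])[:, None]  # build in a sample effect

# Two-way additive ANOVA decomposition (one observation per cell)
grand = scores.mean()
ss_sample = sessions * ((scores.mean(axis=1) - grand) ** 2).sum()
ss_session = samples * ((scores.mean(axis=0) - grand) ** 2).sum()
ss_total = ((scores - grand) ** 2).sum()
ss_resid = ss_total - ss_sample - ss_session

df_sample, df_session = samples - 1, sessions - 1
df_resid = df_sample * df_session

# F-test for the sample effect: this p-value measures the panelist's
# discrimination power (cf. Table 2)
f_sample = (ss_sample / df_sample) / (ss_resid / df_resid)
p_sample = stats.f.sf(f_sample, df_sample, df_resid)
print(f"F = {f_sample:.2f}, p = {p_sample:.4f}")
```

A small *p*-value for the sample effect means the panelist separates the samples well beyond the session-to-session noise.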

#### 3.3.2. Panelist Repeatability

The panelists' repeatability is the ability to consistently score the same product for a given attribute [18] and was evaluated by the standard deviation (SD) of the measurements of a descriptor from each panelist on each sample. It was considered that, when the residual of the ANOVA model for each panelist and descriptor (Table 3) was ≤1.96 (the 95% limit), the panelist scored the samples in a narrow range through the successive sessions, and only panelists with residuals above this limit scored differently between sessions. In this work, no panelist systematically graded the descriptors differently from one session to another (SD ≥ 1.96, in bold); however, several of them showed residuals above the limit for one to several descriptors, although not by a large margin. Therefore, in general, the panelists showed acceptable repeatability.
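This repeatability check can be sketched as follows; the data are simulated, the 1.96 threshold follows the text, and pooling the per-sample SDs into a single figure is one reasonable summary, not necessarily the authors' exact computation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Repeated scores of one panelist for one descriptor: 4 samples x 3 sessions
scores = rng.normal(loc=[[2], [4], [6], [8]], scale=0.5, size=(4, 3))

# Standard deviation across sessions for each sample, then pooled over samples
per_sample_sd = scores.std(axis=1, ddof=1)
pooled_sd = np.sqrt((per_sample_sd ** 2).mean())

# Decision rule from the text: SD <= 1.96 means the panelist scored the
# samples within a narrow range across the successive sessions
repeatable = pooled_sd <= 1.96
print(f"pooled SD = {pooled_sd:.2f}, repeatable = {repeatable}")
```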

#### 3.3.3. Panelist Reproducibility

The panelist agreement with the panel, associated with reproducibility [18], was assessed by the correlation between the panelists' scores and the adjusted means of the panel (estimated by the ANOVA model) according to descriptors.

The procedure is similar to that used by Nyambaka et al. [30] to study the sensory changes in dehydrated cowpea leaves. The data are presented in a table in which both panelists (in columns) and descriptors (in rows) are sorted from the highest to the lowest marginal median (Table 4). The panelists in agreement with the panel (significant correlation, in black) were, in descending order of their medians, A6, A8, A14, A5, A1, A7, A13, A10, A9, A3, A2, A4, A12, and A11, while the negative correlations (in black and italic) were distributed more or less evenly, indicating disagreement with the panel (divergent behaviour). The inconsistency of some panelists when evaluating cowpea leaves was attributed to particular preferences of the assessors [30] and could also be possible in table olives for some attributes, like firmness or fibrousness.
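The correlation-based agreement check can be sketched as follows; the panel and scores are simulated, and the plain per-sample mean stands in for the ANOVA-adjusted mean used in the paper:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Panel of 5 panelists scoring 6 samples on one descriptor
panel_scores = rng.uniform(0, 10, (5, 6))

# Panel mean per sample (proxy for the ANOVA-adjusted mean)
panel_mean = panel_scores.mean(axis=0)

# Reproducibility: correlation of each panelist with the panel mean; a
# significant positive r means agreement, a negative r divergent behaviour
for i, scores in enumerate(panel_scores, start=1):
    r, p = stats.pearsonr(scores, panel_mean)
    print(f"A{i}: r = {r:+.2f}, p = {p:.3f}")
```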

**Table 3.** Panelist repeatability as assessed by the ANOVA residuals according to descriptors.

Overall, the descriptors with the best agreement between panelists and panel, sorted by the median, were (in decreasing order of relationship) skin green, skin sheen, flesh red, firmness, flesh green, fibrousness, flesh yellow, and moisture release (Table 4). They were also among the descriptors with the most discriminant power. On the contrary, those with more discrepancies among the panelists were residual, artificial fruity/floral, metallic, rancid, sourness, or soapy smell/medicinal (Table 4), all of them with no discriminant influence.

These results show that the overall behaviour of the panelists was reasonable, although there was still some margin for improvement in their performance, particularly regarding those panelists with a strongly opposed correlation to the mean of the panel. Alternatively, these could be candidates for eventual rejection.

Lanza and Amoruso [17] used line plots per attribute and the deviation index (DIt) to evaluate the agreement between panelists and the whole panel. Their results are in line with those described above, since they also found some panelists who clearly deviated from the consensus. According to these authors, this type of result helps the panel leader to identify repeatability problems of specific assessors as compared to the whole panel and to correct the deviation by the corresponding training.

#### *3.4. Multivariate Study of Panelists and Panel*

#### 3.4.1. Clustering

A first multivariate approach to the similarity among panelists was achieved by hierarchical clustering analysis based on the scores given by each of them to the sample descriptors. The study was performed in XLSTAT, using Ward's aggregation criterion [28]. Three groups of panelists were formed when comparing the panelists' behaviour (Figure 7A). The greatest dissimilarity was found between the group formed by A4 and A6 and the other panelists. The dissimilarity within the groups of the other panelists was appreciably lower, leading to three groups. Two of them were composed of four and seven panelists, while the third included only panelist A8, who had a peculiar behaviour. Therefore, in this case, the cluster analysis, which considers the overall panelist performance, showed that the panelists followed a somewhat similar trend when evaluating the black ripe olive samples, but did not reveal their peculiarities. In line with this result, hierarchical classification is more commonly applied for the classification of products or for studying the association among descriptors. Francois et al. [28] used this technique for assessing the astringency of different beers, while Pense-Lheritier et al. [29] applied it to link the sensory changes induced by the addition of drugs to different beverages. Alasalvar et al. [6] found similarity among the flavor of natural and roasted Turkish hazelnut cultivars. Clustering was also used to segregate different consumer segments according to their overall liking scores [34].
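An equivalent of the XLSTAT analysis can be sketched with SciPy's hierarchical clustering (the panelist profiles are simulated; the three-group cut mirrors the text):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)

# Each row: one panelist's profile (mean scores over samples x descriptors,
# flattened into a 10-dimensional vector; values are simulated)
panelists = [f"A{i}" for i in range(1, 8)]
profiles = rng.uniform(0, 10, (7, 10))
profiles[5:] += 4.0  # shift the last two panelists to mimic a distinct group

# Ward's aggregation criterion, as used in XLSTAT in the text
Z = linkage(profiles, method="ward")

# Cut the dendrogram into (at most) three groups of panelists
groups = fcluster(Z, t=3, criterion="maxclust")
print(dict(zip(panelists, groups)))
```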

#### 3.4.2. Panelist Reproducibility

The multivariate study of the agreement among panelists and the whole panel [18], using bootstrapping, was made in SensoMineR by considering the results of a virtual panel obtained by taking successive samples (500 simulations) from the real data and applying Principal Component Analysis. Only two eigenvalues ≥1 were found, and they accounted for ~42% and 26% of the variance, respectively. The analysis was made using the panellipse.session function. The resampling technique has been described in detail elsewhere [31].
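The virtual-panel idea can be sketched with NumPy: resample panelists with replacement, average to obtain bootstrap consensus configurations, and run a PCA (via SVD) on the consensus. The dimensions and data below are illustrative, and this is a simplification of SensoMineR's procedure [31], not the package's implementation:

```python
import numpy as np

rng = np.random.default_rng(5)

# Real panel: 14 panelists x 6 samples x 8 descriptors (simulated means)
scores = rng.uniform(0, 10, (14, 6, 8))

# Virtual panels: resample panelists with replacement (500 simulations),
# giving one bootstrap consensus (samples x descriptors) per simulation
n_boot = 500
boot_consensus = np.empty((n_boot, 6, 8))
for b in range(n_boot):
    idx = rng.integers(0, 14, size=14)  # a virtual panel of 14 resampled panelists
    boot_consensus[b] = scores[idx].mean(axis=0)

# PCA (via SVD) on the real consensus; eigenvalues = explained variances
consensus = scores.mean(axis=0)               # 6 samples x 8 descriptors
X = consensus - consensus.mean(axis=0)        # centre the descriptors
U, s, Vt = np.linalg.svd(X, full_matrices=False)
eigvals = s ** 2 / (X.shape[0] - 1)
explained = eigvals / eigvals.sum()
print(explained[:2].round(2))                 # share of variance on PC1-PC2
```

The bootstrap consensus configurations are what get projected onto the PCs to judge how stable the product map is under panel resampling.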

The closeness of the whole panel's and the panelists' answers was evaluated by projecting them onto the first two PCs. A PCA on the consensus allows for visualizing the strength of the consensus and the global discrimination of the products; besides, the identification of treatments shows the observed differences between the products [35]. In this work, the distance from each panelist to the position of the corresponding sample assessed the agreement between the whole panel (square symbols, with different colours for the samples) and the panelists (acronyms, associated with samples by circle symbols in the same colours) (Figure 7B).

**Figure 7.** Panelist performance as assessed by multivariate analysis. (**A**) Clustering of panelists according to their performance. (**B**) Projection of panelists' loads (individual description) and samples' scores onto the first two Principal Components.

PC1 was highly efficient for segregating samples from Manzanilla (on the left) and Hojiblanca (on the right) and could be associated with cultivar, while PC2 was able to distinguish samples as a function of growing area and storage. In general, the projections of panelists for each sample were situated around that of the whole panel (sample associated with the same colour), although some of them were far from their respective samples. The discrepant panelists were (as identified by the corresponding acronyms) the same already mentioned in previous sections, mainly: A12 and A8 for HL2; A8 for HA2; A13, A12, A8, and A6 for HA1; A12, A7, A9, A6, A3, and A2 for MAL2; A12, A7, A6, and A2 for ML2; A13, A11, A9, A8, A7, A5, and A1 for MAL1; and A12, A8, A7, A6, and A2 for ML1. The panelist who most often scored the samples differently was A12, followed by A8, A7, and A6. Lower discrepancies were observed for A2, A9, A13, A3, and A5. However, these represent just a few cases of divergence, while most of the panelists' scores are jointly distributed around their corresponding samples. Additionally, the panelists had greater ability (closeness to the sample average) to evaluate long-stored Hojiblanca samples (HL2 and HA2) than any other sample. In conclusion, this plot has identified the panelists who will require particular training, although the performance of the others will also benefit from training. Our results are in agreement with those presented by Tomic et al. [21], who also found underperforming panelists and emphasized the need for a detailed study of their behavior using the established statistical methods for the evaluation. Lanza and Amoruso [17] studied the performance of panelists against the whole panel using eggshell plots, concluding that there were also a few panelists who ranked some of the descriptors quite differently from the consensus, while there was good agreement for others, like hardness.

#### 3.4.3. Panel Repeatability

Study by Variables Projection on the Correlation Circle According to Sessions

The analysis was carried out using the virtual panel described above [31]. A first assessment of the panel repeatability was obtained by projecting the descriptors (only the more relevant ones, contribution >0.20) onto the first two PCs according to sessions. Close positions of a descriptor in the correlation circle for the different sessions indicate good repeatability. The panel was particularly repeatable among sessions for some descriptors, like skin green, astringency, flesh green, moisture release, fibrousness, flesh red, skin sheen, or flesh yellow. However, others showed appreciable displacements from one session to another, like fishy smell/ocean, saltiness, or chewiness (Figure 8A). The interpretation of the relationships among variables is not straightforward due to these oscillations in the variables' projections. Nevertheless, it is possible to establish overall associations, mainly among those variables with high repeatability among sessions. For example, firmness, fibrousness, or chewiness are opposed to moisture release, ripeness, or flesh green. Additionally, those black ripe olives with high astringency could also present flesh yellow or skin green notes, but low vinegar or ripeness scores.

Galán Soldevilla et al. [14] associated bitter, sour, and wood with *Green*, *Cured*, and *Traditional Aloreña de Málaga* table olives, respectively. In black ripe olives, discrimination among the samples from different origins was mainly based on the 2nd and 3rd PCs, which were the components linked to aroma and flavour characteristics; however, the more linear behaviour of panelists was related to a textural dimension strongly connected to PC1 [9]. Kinesthetic sensations were also critical for the segregation between defective and non-defective samples by PCA [12].

#### Study by Sample Projections According to Sessions

The analysis was also carried out using the virtual panel described above. In this case, the median scores of the virtual panel's perception of the samples (the same as for the real panel) were projected onto the plane of the first two PCs according to sessions. Subsequently, the 95% closest points of the generated cloud were used to draw the confidence ellipses (*p*-value = 0.05), which were built according to the procedure described by Husson et al. [31] (Figure 8B). The repeatability of the panel across sessions can be assessed by the displacement of the sample centres. In general, the separation between the sample centres due to session was limited, indicating good panel agreement between sessions, which is also corroborated by the overlapping of their confidence ellipses. Incidentally, the plot also indicates that the long-stored fruits showed lower dispersion across sessions than the just-processed fruits (one-month storage).
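The ellipse construction (keep the 95% of bootstrap points closest to the sample centre, then derive the ellipse from their covariance) can be sketched as follows; the cloud is simulated and the covariance values are arbitrary, so this only illustrates the geometry, not the Husson et al. [31] implementation:

```python
import numpy as np

rng = np.random.default_rng(6)

# Bootstrap cloud: 500 projections of one sample onto the first two PCs
cloud = rng.multivariate_normal([1.0, -0.5], [[0.30, 0.05], [0.05, 0.15]], 500)

# Keep the 95% of points closest to the sample centre, as in the text
centre = cloud.mean(axis=0)
d = np.linalg.norm(cloud - centre, axis=1)
kept = cloud[d <= np.quantile(d, 0.95)]

# Ellipse axes from the covariance of the retained points: overlapping
# ellipses across sessions indicate good panel repeatability
eigval, eigvec = np.linalg.eigh(np.cov(kept.T))
half_axes = np.sqrt(eigval)
print(f"centre = {centre.round(2)}, half-axes = {half_axes.round(2)}")
```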

**Figure 8.** Panel repeatability as assessed by multivariate analysis, using bootstrapping. (**A**) Projection of the descriptors' loadings on the correlation circle onto the first two Principal Components, and (**B**) projection of the samples' scores and confidence ellipses according to sessions onto the first two Principal Components.

#### **4. Conclusions**

Usually, the study of panel performance is a preliminary, but often superficial, task during the sensory evaluation of products. However, a detailed investigation of panel and panelist performance is a convenient tool for uncovering the details of their evaluation. In this work, such a study allowed for the assessment of the panel performance as a whole, as well as the detection of the panelists with the lowest discriminant power and of those who interpreted the scale differently from the panel and, therefore, require further training; it even revealed that the stored black ripe olive products were perceived more similarly by the panelists over sessions. Besides, the study identified the descriptors that were hard to evaluate (skin green, vinegar, bitterness, or natural fruity/floral). Therefore, panelists would require particular training on them or, if the appropriate level of discrimination were not reached, be replaced by others with higher sensitivity. In summary, the work has confirmed that such studies are an essential tool for appropriate panel control and training, which should be a permanent concern of the panel leader.

*Foods* **2019**, *8*, 562

**Author Contributions:** Conceptualization: A.L.-L. and A.G.-F.; Methodology: A.L.-L. and A.H.S.-G.; Software: A.G.-F.; Validation: A.L.-L. and A.G.-F.; Formal analysis: A.G.-F. and A.L.-L.; Investigation: A.C.-D., A.H.S.-G., A.M., A.L.-L.; Resources: A.L.-L., A.H.S.-G. and A.M.; Data curation: A.L.-L. and A.G.-F.; Writing—original draft preparation: A.L.-L. and A.G.-F.; Writing—review and editing: A.L.-L. and A.G.-F.; Visualization: A.L.-L. and A.G.-F.; Supervision: A.L.-L.; Project administration: A.M. and A.L.-L.; Funding acquisition: A.M. and A.L.-L.

**Funding:** This research was funded in part by the Ministry of Economy and Competitiveness from the Spanish government through Project AGL2014-54048-R, partially financed by the European Regional Development Fund (ERDF).

**Acknowledgments:** We thank Elena Nogales Hernández for her technical assistance.

**Conflicts of Interest:** The authors declare no conflict of interest.
