*3.2. Evaluation Comparison between Experts and the Developed Algorithm*

The subjective ratings from three experts were averaged and regarded as the gold standard to validate the developed DTW-based algorithm. Figure 6 shows the scatter plot of the algorithm's final performance scores against the averaged experts' ratings for 21 subjects. The Pearson correlation coefficient (r) was 0.86 (*t* = 7.45, *p* < 0.001), indicating a strong positive linear relationship between the scores from the two evaluation methods. Further analysis of the performance scores showed that, under many circumstances, the algorithm overestimated performance relative to the experts' ratings. Therefore, we calibrated the algorithm's performance scores using the fitted equation from linear regression (see Figure 6), so that the algorithm would generate evaluation scores similar to those of the domain experts for practical applications. Figure 7 further demonstrates that the calibrated performance scores from the algorithm were comparable to the experts' ratings. The score difference between the two evaluation methods had a mean of 9.5 and a standard deviation of 7.0 (maximum: 21.6, minimum: 0.1).

**Figure 6.** Linear relationship between algorithm scores and experts' ratings.

**Figure 7.** Comparison between algorithm evaluation (after calibration) and experts' evaluation.
