*5.2. Scalability*

To further assess the scalability and generalization capacity of the proposed method, we designed a second experiment that measured how predictions improve as more subjects are incorporated into the training set. As an example, Figure 5 shows the results obtained on the three repositories for the different approaches when a Naive Bayes classifier is used to predict valence. The methods compared were *z*-score normalization, the proposed data transformation, and the subject-based scaling proposed in [33,65], in which features are scaled to the range [0, 1]. The latter method is labeled max-min in the figure.

In this plot, the classification accuracy reported for a given number of training subjects *p* is the average over as many trials as there are subjects in the dataset. In each trial, a different subject was held out, and all of that subject's samples formed the test set. The training set comprised all samples from *p* subjects other than the test subject, chosen at random but kept fixed across the different algorithms to allow a fair comparison.
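The evaluation protocol just described can be sketched as below. This is an illustrative reconstruction under our own assumptions (a flat array of subject identifiers, one split per held-out subject); the same random choice of training subjects would then be reused for every normalization method being compared.

```python
import numpy as np

def scalability_splits(subject_ids, p, rng):
    # For each subject in turn: hold out all of that subject's samples
    # as the test set, and draw p other subjects at random to form the
    # training set. Returns (train_indices, test_indices) pairs.
    subjects = np.unique(subject_ids)
    splits = []
    for test_subj in subjects:
        others = subjects[subjects != test_subj]
        train_subjs = rng.choice(others, size=p, replace=False)
        train_idx = np.flatnonzero(np.isin(subject_ids, train_subjs))
        test_idx = np.flatnonzero(subject_ids == test_subj)
        splits.append((train_idx, test_idx))
    return splits
```

Accuracy at a given *p* is then the mean test accuracy over all returned splits, which matches the averaging scheme used for Figure 5.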

Both the proposed normalization and the subject-based max-min scaling used in [33,65] improved as more subjects were used for learning. In contrast, *z*-score normalization did not seem to benefit from an increasing number of training subjects. The superior performance of the proposed data transformation is evident in all databases. It appears as a positive trend, implying that reliability increases as more users are incorporated into the model, and it further supports the validity of inter-subject models when they are combined with a suitable transformation function that takes individual traits into account.


**Figure 5.** Classification accuracy as the number of training users is increased: (**a**) DEAP; (**b**) MAHNOB-HCI; and (**c**) DREAMER.
