4.2. Profile Visualisations
One of the novel aspects of this dataset is the inclusion of profile information that was gathered about the raters. In this section, we visualise the demographics of the 277 participants. Figure 3 gives an overview of the proportion of participants that fall into each profile category. The profile information was binned as seen in the figure. This binning will be useful in the later sections on emotion prediction models. Section 5.1 further describes how the profile information is treated and utilised for emotion prediction.
The first three feature bars of Figure 3 show common demographic profile features, e.g., age. As depicted in Figure 4, most of the participants are in their twenties and thirties. The age feature was divided into 4 bins. Participants below 25 years of age are considered youth, those between 26 and 35 years of age young adults, those between 36 and 50 years of age adults, and those above 51 elders. After binning, 52.7% of participants are youth, 24.2% are young adults, 13.4% are adults and 9.7% are elders. The second bar in Figure 3 shows that there are 57.0% male and 43.0% female participants. The third bar depicts the country of residence of participants. The majority of the participants are from the USA (52.7%) and India (42.6%) as the MTurk task was released to these two countries. The remaining 4.7% includes participants from Great Britain (1.4%), Italy (0.7%), South Africa (0.4%), Russia (0.4%), Indonesia (0.4%), Armenia (0.4%), American Samoa (0.4%), Romania (0.4%) and Brazil (0.4%).
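The age binning described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this work: the text leaves the treatment of ages 25 and 51 open, so the bin edges below (25 counted as youth, 51 as elder) are an assumption, and the function names and example ages are hypothetical.

```python
from collections import Counter

# Assumed upper edges (inclusive). The treatment of the boundary ages 25 and 51
# is not specified in the text, so these edges are an assumption.
AGE_BINS = [
    (25, "youth"),        # age <= 25
    (35, "young adult"),  # 26 to 35
    (50, "adult"),        # 36 to 50
]


def age_group(age):
    """Map an age in years to one of the four profile groups."""
    for upper, label in AGE_BINS:
        if age <= upper:
            return label
    return "elder"


def group_shares(ages):
    """Percentage of participants per age group, rounded to one decimal."""
    counts = Counter(age_group(a) for a in ages)
    return {label: round(100 * n / len(ages), 1) for label, n in counts.items()}


# Hypothetical ages for illustration only (not taken from the dataset).
example_ages = [19, 23, 25, 28, 34, 41, 50, 51, 67, 22]
print(group_shares(example_ages))
```

Changing the entries of `AGE_BINS` is enough to try alternative bin configurations.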
The fourth feature bar, labelled ‘enculturation’ in Figure 3, depicts a less common feature: the musical enculturation of participants, that is, which country’s music participants identify with most. As one might expect, the division looks very similar to the bar above it, implying that country of residence and country of musical enculturation are related. The percentages of each country are as follows: USA (52.7%), India (40.8%), Great Britain (1.8%), Italy (0.7%), Japan (0.4%), Ecuador (0.4%), Mexico (0.4%), South Africa (0.4%), Russia (0.4%), Armenia (0.4%), Colombia (0.4%), American Samoa (0.4%), New Zealand (0.4%), United Arab Emirates (0.4%) and Brazil (0.4%).
The fifth and sixth feature bars of Figure 3 pertain to the listening preferences of participants. For the fifth feature, ‘language’, since participants are mostly from the USA and India, it is unsurprising that songs with English (72.2%) and Tamil (18.1%) lyrics are the favourite of most participants. The remaining 9.7% include the languages Malayalam (3.2%), Hindi (2.5%), Italian (0.7%), Telugu (0.7%), Armenian (0.7%), Japanese (0.7%), Korean (0.4%), German (0.4%) and Bengali (0.4%). As for the preferred genre of participants, shown in the sixth bar named ‘genre’, Rock (31.8%) had the highest percentage, followed by Classical (14.1%) and Pop (13.7%) music. Many other genres were grouped as ‘other’ in the bar chart as they were small in comparison. They include Rhythm and Blues (8.3%), Indie Rock (6.9%), Country (6.9%), Jazz (5.4%), Electronic dance music (2.2%), Metal (2.2%), Electro (1.4%), Techno (1.1%) and Dubstep (0.7%).
The seventh to ninth feature bars in Figure 3 represent the musical experience of participants. The seventh feature, labelled ‘instrument’, represents the proportion of participants that are actively playing at least one instrument: 45.8% of participants indicated that they were. The eighth bar, named ‘training’, depicts the proportion of participants that have received formal musical training. A total of 57.4% of participants indicated that they received formal musical training, while 42.6% indicated that they never received formal music training. The two features do not coincide: 11 participants (4.0%) actively play an instrument without having received formal training and are presumably self-taught, while 43 participants (15.5%) have received formal training but are not actively playing an instrument (see Section 4.3). The ninth bar, labelled ‘duration’, reflects the number of years participants have received formal training. The 42.6% of participants who had not received formal training are included in the ninth feature bar as the participants who have received 0 years of training. A total of 5.8% of participants underwent 1 year of training, 15.9% had 2 years of training, 13.0% had 3 years, 5.8% had 4 years, and 7.9% had 5 years of training. Overall, 48.4% of participants received between 1 and 5 years of formal musical training. The remaining 9.0% of participants indicated having 6 or more years of training, the largest value being 31 years of training.
The tenth feature bar of Figure 3 represents the proportion of MTurk master to non-master participants. A total of 46.2% of participants are master participants while 53.8% of participants are non-master participants. It is noteworthy that 128 master participants were retained from the original 172 master participants after our preprocessing, while only 149 non-master participants were retained from the original 280 non-master participants. The retention percentage of 74.4% for master participants, as compared to only 53.2% for non-master participants, suggests that master participants are indeed more reliable than non-master participants.
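The retention figures above follow directly from the reported counts. The short Python check below reproduces them; the variable and function names are illustrative only.

```python
# Counts reported in the text: participants before and after preprocessing.
recruited = {"master": 172, "non-master": 280}
retained = {"master": 128, "non-master": 149}


def percentage(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Retention rate within each MTurk group.
retention = {g: percentage(retained[g], recruited[g]) for g in retained}

# Share of each group among all retained participants.
total_retained = sum(retained.values())
share = {g: percentage(retained[g], total_retained) for g in retained}

print(retention)       # retention rate per group, in percent
print(total_retained)  # total number of retained participants
print(share)           # composition of the final participant pool, in percent
```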
We should note that the profile binning or grouping for non-boolean profile types was arbitrarily determined in this work. For example, as the age of participants was largely skewed towards young adults, the two younger groups cover smaller age ranges while the two older groups cover larger age ranges. In future research, one could experiment with different configurations, or further testing may be performed to determine more representative age bins that show a difference in perceived emotion from the music. The same can be said for the preferred genre profile type. In this study, the participants mainly preferred rock and classical songs. Some of the favoured genres were not represented by many participants, and were grouped under ‘other’. Perhaps with better representation, more significant differences between genres would be revealed.
4.3. Statistical Differences in Affect Ratings between Profile Groups
We analysed the collected data in order to determine whether there are significant differences in terms of valence and arousal annotations from participants of various demographic groups. As statistical testing requires independent samples, the dynamic affect labels were averaged to a single value per participant per song. Additionally, because the valence and arousal ratings were not normally distributed, a non-parametric test was used. The non-parametric Kruskal–Wallis test [67] was used for each of the 10 profile features to identify whether statistically significant differences exist between the emotion ratings of the different profile groups. Table 3 shows the results of the Kruskal–Wallis tests. The p-values lower than the threshold value of 0.05 are marked in bold so as to highlight that there is a significant difference in emotion ratings between profile groups of that profile feature.
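The two steps described above, averaging the dynamic labels to one value per participant per song and then applying the Kruskal–Wallis test, can be sketched as follows. This is a minimal standard-library illustration of the textbook test with tie correction and a chi-square approximation for the p-value, not the code used in this work; the function names and the `(participant, song, value)` input layout are assumptions.

```python
import math
from collections import Counter, defaultdict
from statistics import mean


def average_per_participant_song(samples):
    """Collapse (participant, song, value) dynamic labels to one mean per pair."""
    buckets = defaultdict(list)
    for participant, song, value in samples:
        buckets[(participant, song)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def chi2_sf(x, df):
    """Upper tail probability of the chi-square distribution, integer df >= 1."""
    s = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-s)
    else:
        a, q = 0.5, math.erfc(math.sqrt(s))
    # Recurrence Q(a + 1, s) = Q(a, s) + s**a * exp(-s) / Gamma(a + 1).
    while a < df / 2:
        q += s ** a * math.exp(-s) / math.gamma(a + 1)
        a += 1
    return q


def kruskal_wallis(groups):
    """Return (H, p) for a list of groups; assumes not all values are identical."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    h /= 1 - ties / (n ** 3 - n)
    return h, chi2_sf(h, len(groups) - 1)
```

For example, `kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]])` compares three hypothetical groups of averaged ratings; a p-value below 0.05 would be marked as significant under the threshold used here.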
For the profile and affect type pairs that have p-values below 0.05, we carried out Dunn’s test [68] as a post hoc test, with Bonferroni correction [69]. The resulting p-values of Dunn’s test indicate which profile groups are statistically different. For each bold value in Table 3, we report below the profile groups that are significantly different, along with their p-values. Our findings are in line with Schedl et al. [40], who found differences in music perception only for some user groups.
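The post hoc step can be sketched in the same way. The Python code below is a minimal standard-library illustration of Dunn's pairwise test on pooled ranks, not the code used in this work. Two details are assumptions, since the text does not specify them: the p-values are two-sided, and the Bonferroni correction is applied by multiplying each p-value by the number of pairwise comparisons (capped at 1). The function names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def dunn_bonferroni(groups):
    """Return {(i, j): (z, adjusted_p)} for every pair of group indices."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Mean pooled rank of each group.
    mean_ranks, start = [], 0
    for g in groups:
        mean_ranks.append(sum(ranks[start:start + len(g)]) / len(g))
        start += len(g)

    # Variance term shared by all pairs, including the correction for ties.
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values()) / (12 * (n - 1))
    base_var = n * (n + 1) / 12 - tie_term

    pairs = list(combinations(range(len(groups)), 2))
    results = {}
    for i, j in pairs:
        se = math.sqrt(base_var * (1 / len(groups[i]) + 1 / len(groups[j])))
        z = (mean_ranks[i] - mean_ranks[j]) / se
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
        # Bonferroni (assumed form): multiply by the number of comparisons.
        results[(i, j)] = (z, min(1.0, p * len(pairs)))
    return results
```

With three hypothetical groups such as `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]`, only the pair of the lowest and highest groups falls below 0.05 after correction, which mirrors how a significant Kruskal–Wallis result can trace back to a subset of the group pairs.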
A statistical difference was found between valence ratings provided by youth and adult raters () as well as youth and elder raters (). The data suggest that the youth group tends to give higher valence ratings than the two other groups. Valence ratings from the young-adult age group seem to lie in between the other groups, suggesting that the perceived valence of music may decrease with age.
Both valence and arousal ratings from raters from a different country of residence showed a significant difference. For valence, however, the post hoc test p-values were larger than after Bonferroni correction. In particular, between the USA and India participants, the p-value was , which is close to the threshold for significance. As for arousal, a significant difference was found between USA and India (). Participants residing in India had a larger proportion of ratings that were near the origin , for both affect types, as compared to participants residing in the USA. In general, the ratings from participants residing in the USA were more evenly spread out as well, while ratings from participants residing in India were skewed towards the positive end of both affect types.
With regard to raters with a different country of music enculturation, a significant difference between the USA and India groups was found for both valence () and arousal (), and between the USA and other countries, only for valence (). It is worth noting that there is only one participant representing the ‘other’ group, hence we did not take this value into consideration for the analysis. Furthermore, as there is a large overlap between the participant groups for country of residence and country of music enculturation, similar observations of the data can be made.
For listeners with a different preferred genre of music, we see a statistically significant difference in terms of valence ratings. Interestingly, the differences are between the classical genre and each of the other groups, namely between classical and rock (), classical and pop (), and classical and other (). Compared to the other three groups, participants who prefer classical music mostly rated valence closer to the origin. Other groups tended to give higher positive valence ratings.
Participants who actively play an instrument compared to participants who do not, have a larger proportion of ratings near the origin . With regard to valence (), participants who do not actively play an instrument had the most ratings near valence. As for arousal (), other than the larger proportion of ratings near the origin by participants who do not actively play an instrument, both groups are generally skewed towards more ratings in the positive arousal quadrant rather than the negative quadrant.
The distributions for participants who have received formal training and those who have not (the eighth profile feature, ‘training’) closely resemble those of participants who actively play an instrument and those who do not (the seventh profile feature, ‘instrument’). This is observed despite the fact that there are 43 participants who have received formal musical training but are not actively playing an instrument, and another 11 participants who are actively playing an instrument but have not received formal training. A statistically significant difference is found between these two groups, for both valence () and arousal (). This makes sense, as most participants who play an instrument learned to do so through formal musical training.
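The overlap between the ‘instrument’ and ‘training’ features can be made explicit by reconstructing the 2×2 cross-tabulation from the figures reported in this paper. In the Python sketch below, the counts 277, 43 and 11 and the percentages 45.8% and 57.4% come from the text; the remaining cells are derived by rounding and subtraction and should be read as a consistency check rather than as reported values.

```python
TOTAL = 277                # retained participants
TRAINED_NOT_PLAYING = 43   # formal training, not actively playing
PLAYING_NOT_TRAINED = 11   # actively playing, no formal training

# Marginal counts recovered from the reported percentages (derived, rounded).
playing = round(0.458 * TOTAL)
trained = round(0.574 * TOTAL)

# Participants who both play an instrument and received formal training.
both = playing - PLAYING_NOT_TRAINED

# Participants who neither play nor received formal training.
neither = TOTAL - both - TRAINED_NOT_PLAYING - PLAYING_NOT_TRAINED

# Share of active players who received formal training, in percent.
trained_share_of_players = round(100 * both / playing, 1)

print(playing, trained, both, neither, trained_share_of_players)
```

Both routes to the overlapping cell (players minus untrained players, and trained participants minus non-players) give the same count, so the reported figures are mutually consistent.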
The group of participants who have not received formal training coincides with the group of participants who have received 0 years of musical training. With regard to arousal ratings, the group of participants with 1 to 5 years of training is significantly different from the other two groups: 0 years () and more than 5 years of training (). As for valence, a significant difference was found between the group of participants with 1 to 5 years of training and the group with 0 years of training (). The lack of a significant difference between the group with 0 years and the group with more than 5 years of training suggests that the duration of training may not have an obvious impact on the perceived affect. The statistical difference noted in both affect types may be due to the large proportion of ratings near the origin given by the group with 1 to 5 years of training, which is not found in the other two groups.
Lastly, the ratings of master MTurk participants showed a statistical difference from those of non-master MTurk participants where arousal is concerned (). Non-master participants had a large proportion of ratings near the origin, while master participants showed a tendency to rate with higher arousal values. Though the same is observed in valence, the difference between the two groups is not substantial enough to be significant. This peak of values near the origin is observed in many of the aforementioned profile types, which suggests that perhaps those groups have more non-master participants. This is found to be the case for country of residence and enculturation, where approximately 73% of the India group are also non-master. It is also possible that there is a subset of non-master participants that causes this peak. Alternatively, since the mouse pointer is positioned at the origin when the experiment begins, it is possible that non-master participants move their mouse less, or respond later.
The significant differences found above confirm the importance of capturing profile information in a dataset of valence and arousal ratings of music. In the next section, we predict valence and arousal ratings from the audio and profile information captured in this newly proposed dataset, and thus provide a baseline model. The significant differences found between various groups for the different profile types suggest that affect prediction may be improved and refined by feeding the model this information; this is what we will test in our experiments.