1. Introduction
The growing popularity of streaming media services, and the need to match the quality of those services to users’ needs as effectively as possible, make it necessary to better understand the factors that affect how recipients subjectively perceive the presented video content. Research in this area is part of the Quality of Experience (QoE) knowledge area [1,2,3].
A subjective assessment of video quality is the result of a number of factors influencing the viewer’s perception. Among these, the literature [4] mentions, e.g., human, system, and context factors. Current research on the impact of various factors on the subjective assessment of video quality covers aspects such as:
Technological: related to coding, compression, transmission, image presentation, etc.;
Social: regarding the social context of an observation (e.g., in a group vs. alone);
Environmental: related to the environment (e.g., air temperature, noise level, etc.);
Human: concerning a number of features differentiating the recipients of the content;
Content-related: concerning the subject matter and characteristics of the presented video streams.
Figure 1 shows basic types of factors influencing subjective perception of multimedia streaming.
It should also be noted that the subjective assessment of a given phenomenon is inextricably linked to its human perception, embedded in the context of the method of presenting and making available a research sample for a given phenomenon. Cognitive sciences identify an indisputable influence of the arrangement of presented research samples on the subjective perception and evaluation of a given phenomenon.
A number of publications in the scientific literature indicate a significant impact of the arrangement of research samples on the results of subjective research in fields such as psychology, behavioral analysis, legal sciences, and medicine.
In the field of legal sciences, publication [5] indicates that the order in which the speeches of attorneys at law were evaluated had a statistically significant effect. The optimal sequence of stimulus presentation is described in the literature in the context of teaching visual–auditory conditional differentiation [6], where slower acquisition of content at the first research attempt was reported. Another experiment [7], in the field of behavioral analysis and concerned with estimating the durations of phenomena occurring in sequences, showed that the second of a pair of durations is usually overestimated relative to the first. From the point of view of audio–visual sciences, the influence of the order of content presentation was described in [8], where an experiment showed that the perceived attractiveness of images presented in parallel was greater. Publication [9] showed that the order of stimulus presentation has a large impact on how stimuli are evaluated. After applying the forward conditioning technique (the pairing of two stimuli such that the conditioned stimulus is presented before the unconditioned stimulus), the evaluation of previously neutral stimuli shifted in a positive direction. The impact of the sequence of stimulus presentation on decision making has also been identified in the field of purchasing decisions, where theories of consumer behavior emphasize the key importance of the first stimulus and the reference point. In [10], an experiment confirms that the first presented alternative is preferred, although the effects of the presentation order are not the same for all purchased items.
The influence of the arrangement of research stimuli is also considered in the field of multimedia QoE, where publications have addressed the influence of the arrangement and sequence of a research sample on subjective assessment results. The analysis of affective images was described in [11], with the conclusion that unpleasant pictures presented at the end were assessed less negatively than unpleasant pictures presented at the beginning; the order of presentation therefore had an impact on the recipient. In turn, [12] described the impact of the Peak-End Effect, known in psychology, on the assessment of video quality; this effect was identified for videos of poor quality but not for videos of good quality.
Examples cited from various areas of cognitive science indicate that an effect of order or frequency of the presentation of research stimuli on subjective assessment or actions resulting from human perception is indisputable. The effect of learning, improvement, or vice versa—weariness or impatience with the presented content or its sequence—should also be reflected in the area of Quality of Experience, in particular, in the context of analyzing the impact of factors such as the structure of research stimuli presentation on subjective assessment of multimedia content.
Thereby, it is reasonable to undertake research aimed at further verification of whether and how the phenomena identified in various fields of cognitive science, and related to the impact of the presentation structure of research stimuli on the experiments’ results, also refer to the subjective assessment of multimedia content in the area of Quality of Experience.
The overall subject of this publication is an analysis of the results of the conducted experiment with regard to the potential impact on subjective assessment of the micro-structure of content presentation, i.e., the relative arrangement of content (sequence order, frequency, and multiplicity of views) and the quality background (the quality of the preceding video sequence), for 2D videos of varying technical quality.
The assumed contribution of this publication refers to research aspects in the field of multimedia Quality of Experience, directly resulting from the conducted research, i.e.,
Demonstrating the relationship between factors of the research stimuli presentation structure (number of views, quality of the preceding video, and content variability) and the obtained values of subjective video quality assessment;
Referring obtained results in the field of multimedia QoE to other areas of cognitive sciences (psychology, economic sciences, legal sciences, medicine, etc.), utilizing the aspect of subjective assessment in the research activities;
Determining future research directions for new research aspects, justifying extending conducted research or research regarding impact factors in addition to those presented in this publication.
2. Research Topic
The aim was to find an answer to the following research question: is the subjective assessment of video sequence quality affected by factors such as:
Objective quality of a given video: a measure of the objective technical quality of a video sequence, expressed as a numerical value on a scale of 0–100, obtained using the VMAF (Video Multimethod Assessment Fusion) metric [13].
Number of times a given video is displayed (fatigue/habituation effect): a value specifying how many times a given video sequence is presented to the tester. Under the experiment conditions, the number of views takes values from the set {1, 2, 3}.
Qualitative background, i.e., objective quality of the preceding video: the VMAF value, as defined above, of the video immediately preceding a given video sequence.
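The two presentation-structure factors listed above (number of views and quality of the preceding video) can be derived from the order in which videos were shown to a tester. The following is a minimal sketch, assuming a session is available as an ordered list of (PVS identifier, VMAF) pairs; the record layout, the function name, and the identifiers are hypothetical and are not taken from the test platform.

```python
from collections import defaultdict

def annotate_session(session):
    """Annotate one tester's ordered session with presentation-structure factors.

    `session` is a list of (pvs_id, vmaf) tuples in display order
    (a hypothetical layout, assumed for illustration only).
    """
    views = defaultdict(int)  # how many times each PVS has been shown so far
    previous_vmaf = None      # VMAF of the immediately preceding video
    rows = []
    for pvs_id, vmaf in session:
        views[pvs_id] += 1
        rows.append({
            "pvs_id": pvs_id,
            "vmaf": vmaf,
            "view_number": views[pvs_id],     # 1, 2, or 3 in the REPEAT group
            "preceding_vmaf": previous_vmaf,  # None for the first video shown
        })
        previous_vmaf = vmaf
    return rows

# Made-up session fragment, not data from the experiment.
example = [("src01_A", 90.0), ("src02_E", 10.0), ("src01_A", 90.0)]
annotated = annotate_session(example)
```

The first video of a session has no predecessor, so its quality background is left undefined rather than assigned a default value.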
The analysis of the impact of the structure and layout of the presentation of research stimuli (video sequences) was made in relation to the objective quality of the video sequence (a system-type factor). An additional research assumption was the neutrality of the experiment in relation to the other factors identified in the field of Quality of Experience, including system factors other than the objective technical quality, environmental factors, and the human factor. Research samples were not differentiated in relation to these factors, and the homogeneity of the experiment conditions was maintained. Referring to the research question, three initial research hypotheses were adopted:
Hypothesis 1. A direct relationship between the subjective evaluation of video quality and its measured objective quality value for each objective quality level exists; i.e., for each objective quality level, video sequences with a higher VMAF value correspond to statistically significantly higher subjective evaluations.
Hypothesis 2. The relationship between the number of views of a video sequence and its subjective rating exists; i.e., the average subjective rating for the first viewing of a given video sequence is statistically significantly different from the average subjective rating for the third viewing of a given sequence.
Hypothesis 3. The relationship between the objective quality of the video immediately preceding a given video sequence and its subjective rating exists; i.e., the average subjective rating for a video whose direct predecessor was a video of higher objective quality differs statistically significantly from the average subjective rating for a video whose direct predecessor was a video of lower objective quality.
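Hypotheses 2 and 3 both compare the average subjective rating between two sets of ratings. As an illustration only, the sketch below uses a two-sided permutation test on the difference of means; this is an assumed choice of test, not necessarily the procedure used in the study, and it treats the two samples as independent, which ignores the pairing of ratings within a tester. The ratings shown are invented for the example.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the absolute difference of means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabelling of the pooled ratings
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)

# Illustrative ACR ratings only; these are NOT data from the experiment.
first_view = [5, 5, 4, 5, 4, 5, 5, 4]
third_view = [4, 4, 3, 4, 4, 3, 4, 4]
p_value = permutation_test(first_view, third_view)
```

The same function could be applied to Hypothesis 3 by passing the ratings of videos preceded by a higher-quality video and those preceded by a lower-quality video.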
3. Experiment Description
3.1. Experiment Design
The experiment was carried out in accordance with Recommendation ITU-T P.913, which covers methods for the subjective assessment of video quality, audio quality, and audiovisual quality of Internet video and distribution quality television in any environment [14]. The aim of the experiment was to obtain subjective ratings issued by testers for a set of Processed Video Sequences (PVSs) displayed in the appropriate order and multiplicity. In the experiment, video presentation sessions were conducted for testers who were divided into two experimental groups:
The REGULAR group, in which each tester viewed each PVS only once;
The REPEAT group, in which each tester viewed each PVS three times.
The groups of testers were fully disjoint; i.e., each tester participated in only one experimental group. The testers’ task was to subjectively judge the perceived quality (QoE) of each watched video sequence and assign it a rating on a 5-point Absolute Category Rating (ACR) scale: 5 (excellent), 4 (good), 3 (fair), 2 (poor), and 1 (bad) [15]. Testers assessed the quality after viewing each PVS separately, under the same homogeneous technical and environmental conditions. In the REGULAR group, each tester was shown a total of 170 PVSs, consisting of all 34 unique SRC sequences selected for the experiment at 5 levels of quality degradation, presented in random order so that each SRC sequence at a specific level of quality degradation was displayed only once. In the REPEAT group, the number of unique SRC sequences was limited to 12 (out of the 34 available in the experiment), and each tester observed the same predetermined set of videos (i.e., the 12 videos were not drawn at random from the 34 independently for each tester). Each tester in this group was shown a total of 180 videos, consisting of a randomized combination of 60 unique PVSs (12 SRC sequences at 5 quality degradation levels), so that each video sequence was shown to the tester 3 times in random order. The final result of the experiment is a set of ratings, separate for each experimental group (REGULAR/REPEAT), with admissible discrete values from the set {1, 2, 3, 4, 5}, associated with additional data such as the tester ID, the video ID, the declared quality/degradation level of the video, and the date and time at which the rating was registered.
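The session sizes described above follow directly from the design: 34 × 5 = 170 presentations in the REGULAR group and 12 × 5 × 3 = 180 in the REPEAT group. A minimal sketch of how such randomized playlists could be built is given below; the SRC identifiers, the choice of the first 12 sequences as the REPEAT subset, and the use of a plain shuffle (with no constraint on repeated videos appearing back to back) are assumptions made for illustration.

```python
import random

QUALITY_GROUPS = ["A", "B", "C", "D", "E"]

def build_playlist(src_ids, repeats, rng):
    """Return a shuffled list of (src_id, quality_group) pairs,
    with every pair appearing exactly `repeats` times."""
    pvs = [(src, q) for src in src_ids for q in QUALITY_GROUPS]
    playlist = pvs * repeats
    rng.shuffle(playlist)
    return playlist

rng = random.Random(42)
all_src = [f"src{i:02d}" for i in range(1, 35)]  # 34 hypothetical SRC identifiers
repeat_src = all_src[:12]                        # assumed fixed subset of 12 SRCs

regular_playlist = build_playlist(all_src, repeats=1, rng=rng)
repeat_playlist = build_playlist(repeat_src, repeats=3, rng=rng)
print(len(regular_playlist), len(repeat_playlist))  # 170 180
```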
3.2. Research Dataset
The set of video sequences used in this experiment was selected from sequences made publicly available in the Netflix, CableLabs, SJTU Media Lab, and Xiph.org Video Test Media databases. The research data consisted of full HD video sequences in the MPEG-4 standard [16], with a resolution of 1080p (1920 × 1080 pixels), a playback speed of 60 frames/s, and a variable bit rate across individual videos. The PVSs (Processed Video Sequences) selected for the experiment were created by processing 34 unique SRC (Source Reference Circuit) video sequences, undifferentiated in terms of characteristics; generating corresponding videos for each sequence at five levels of quality degradation; and assigning the videos to the appropriate quality groups. The selection of videos in particular groups is presented in Figure 2.
For each SRC sequence, files were generated at bitrate levels ranging from 100,000 bps to 21,000,000 bps.
The selection of research databases also offers potential for further analyses. For further research work, it is reasonable to select datasets parameterized in terms of the studied features. A number of video databases with various distortion characteristics (compression, transmission errors, frame rates, spatial and temporal resolution, etc.) have been described in the literature, along with methods for their evaluation. Publications [17,18] describe available databases for User-Generated Content (UGC) live videos, Professionally Generated Content (PGC), and Occupationally Generated Content (OGC) videos. An approach to creating pre-processed transcoded video databases has also been described [19]. From the point of view of the experiment’s purpose, database selection is not critical, because the focus is on the mutual relations between the research stimuli in the sequence (the number of repetitions, predecessors, and order) and not on the quality and characteristics of the video sequence itself. Future research continuing this experiment can address database selection, including, e.g., the types of distortions in individual video sequences.
The popular objective video quality metric VMAF (Video Multimethod Assessment Fusion) was used to generate the final set of PVSs used in the experiment [13]. It is a full-reference video quality metric that predicts video quality based on the reference and distorted video sequences. To predict video quality, VMAF uses image quality metrics such as:
Visual Information Fidelity (VIF): reflects information fidelity loss at four different spatial scales;
Detail Loss Metric (DLM): measures detail loss and damage that distracts the viewer;
Mean Co-Located Pixel Difference (MCPD): measures the temporal difference between frames on the luminance component.
The above features are combined using regression based on a Support Vector Machine (SVM) supervised learning model. The result is a single output score ranging from 0 to 100 for each video frame, with 100 meaning quality identical to the reference video. These scores are then temporally pooled across the entire video sequence using the arithmetic mean to produce an overall differential mean opinion score (DMOS) for that video. For each of the 34 selected SRC sequences, one PVS was selected for each quality group (from A to E), namely the one whose objective VMAF score was closest to 90 (Group A), 70 (Group B), 50 (Group C), 30 (Group D), or 10 (Group E). Thus, 34 videos in each of the 5 quality groups were selected for the experiment, giving a total of 170 PVSs in the target set. Each PVS lasted 10 s.
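The two steps described above, arithmetic-mean pooling of per-frame scores and selection of the encode closest to each target VMAF value, can be sketched as follows. The encode names and scores are made up for illustration, and the per-frame scores are assumed to have been computed beforehand by a VMAF tool; the sketch does not call any VMAF library.

```python
from statistics import mean

# Target VMAF values of the five quality groups, as stated in the text.
TARGETS = {"A": 90, "B": 70, "C": 50, "D": 30, "E": 10}

def pooled_vmaf(frame_scores):
    """Arithmetic-mean temporal pooling of per-frame VMAF scores."""
    return mean(frame_scores)

def select_pvs(candidates):
    """For each quality group, pick the encode whose pooled VMAF is closest
    to the group's target.

    `candidates` maps a hypothetical encode name to its pooled VMAF score.
    """
    return {
        group: min(candidates, key=lambda name: abs(candidates[name] - target))
        for group, target in TARGETS.items()
    }

# Made-up pooled scores for the encodes of a single SRC sequence.
candidates = {
    "enc_100k": 8.7, "enc_400k": 31.2, "enc_1M": 52.4,
    "enc_3M": 68.9, "enc_8M": 88.1, "enc_21M": 96.5,
}
selection = select_pvs(candidates)
```

Repeating this selection for each of the 34 SRC sequences yields the 170 PVSs of the target set.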
The full-reference VMAF metric, which uses both the original and the degraded video sequences to parameterize technical video quality, was chosen because objective quality had to be determined as simply as possible for the adopted assessment method with five different quality levels of the assessed videos. In this case, the level of objective technical quality has only an auxiliary role, subordinate to the main purpose of the experiment: the technical parameters of the videos were not analyzed, and the levels of objective quality were used only to determine whether dependencies in the assessment of subjective quality occur to a similar extent at different quality levels. Therefore, the focus was on selecting video sequences differentiated from each other using one standardized metric parameterizing the levels of objective quality. However, regarding future research directions resulting from this experiment, it is reasonable to verify how strongly the obtained subjective assessment results correlate with the levels of objective technical quality obtained with various types of objective video quality metrics, including “blind”, no-reference video quality metrics. The metrics in [20], which require a previously created dataset, were used to evaluate degraded video. The possible use of no-reference metrics concerns a number of research areas described in the literature, such as the NAVE metric for autoencoders [21], NR-GVQM for gaming [22], or the H.264/AVC-based bitstream no-reference video quality metric employing multiway Partial Least Squares Regression (PLSR) [23]. Hybrid models, which use both full-reference and no-reference feature extraction to assess objective technical quality, have also been published [24]. The selection of hybrid or no-reference video quality metrics in planned future research will be the subject of a separate analysis beyond the scope of this publication.
3.3. Data Collection Interface
The experiment was carried out in the computer laboratory of the Institute of Telecommunications, AGH University of Krakow, using the laboratory’s standard computer and network equipment. Data were collected using dedicated software, a test platform created for the purpose of this experiment and made available on the AGH server. After a tester logged in with their ID, the test platform automatically assigned the tester to one of the two experimental groups (REGULAR or REPEAT), which determined the pool of video sequences displayed. Immediately after starting the experiment, the tester was shown the first PVS, with a duration of 10 s, selected at random from the pool appropriate for the assigned group. An exemplary video screen is presented in Figure 3.
Immediately after playback of the sequence was completed, a rating screen appeared, containing a single-choice list of possible ratings according to the ACR scale. The video assessment panel is presented in Figure 4. The Polish phrase “Podaj Twoją ocenę obejrzanego filmu” translates into English as “Insert your rating for the video you watched”.
After evaluation, another randomly selected video from the available pool was displayed, and the process was repeated until the PVS collection was exhausted. Testers’ evaluations were saved in the result file on the server.
3.4. Subjects
The pool of testers was selected from the population of AGH students interested in multimedia. The group of testers was treated as homogeneous because the selection did not take into account gender or any other characteristic differentiating testers apart from belonging to the group of AGH students. No analyses of the differentiation of tester characteristics were planned in the experiment. The experiment involved 35 testers (7 women and 28 men), divided into two groups: the REGULAR group consisted of 12 testers, and the REPEAT group consisted of 23 testers. Testers were assigned to groups based on the parity (even or odd) of the last digit of the ID used to log into the test platform.
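The parity-based assignment can be expressed in a few lines. The sketch below is illustrative only: the text does not state which parity corresponds to which group, so the mapping used here (even to REGULAR, odd to REPEAT) is an assumption, as are the function name and the ID format.

```python
def assign_group(tester_id: str) -> str:
    """Assign a tester to an experimental group from the parity of the
    last digit of the login ID.

    Assumption: even -> REGULAR, odd -> REPEAT. The actual mapping used by
    the test platform is not given in the text.
    """
    last_digit = int(tester_id[-1])
    return "REGULAR" if last_digit % 2 == 0 else "REPEAT"

# Hypothetical IDs, for illustration only.
groups = {tid: assign_group(tid) for tid in ["1024", "1031", "2047"]}
```

A deterministic rule of this kind does not guarantee equally sized groups, which is consistent with the unequal group sizes reported above.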
Due to the unequal representation of women and men in the research sample, the sample is not representative from the point of view of gender balance in society. Since the gender aspect was not differentiated during the research, it is not possible to determine the impact of this aspect in the obtained results. However, it is planned to continue the presented research, taking into account the gender parameter as one of the aspects differentiating subjects. Based on the results of future research, it will be possible to attempt to determine the impact of imbalance in the selection of representatives of both genders on the obtained results.
5. Discussion
As a result of the conducted experiment, the following have been shown at the adopted level of statistical significance:
For all levels of objective quality groups—a correlation of the average subjective assessment for the presented video sequences (MOS) with their objective quality (for Hypothesis 1);
For the highest level of objective quality group—the dependence of the subjective quality assessment for the presented video sequences on their display number (for Hypothesis 2);
For the highest level of objective quality group—the dependence of the subjective quality assessment on the objective quality of the previously displayed video sequence; the average assessment if the preceding video belonged to the highest quality group is higher than if the preceding video was of very low quality (for Hypothesis 3).
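The first finding above relates the Mean Opinion Score (MOS) of each quality group to its objective quality. As a rough illustration of that consistency check, the sketch below computes the MOS per group and the Pearson correlation with the target VMAF values of the groups. The ratings are invented for the example, and the use of Pearson correlation on group-level means is an assumption rather than the procedure reported in the study.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean

def mos_by_group(ratings):
    """Mean Opinion Score per quality group from (group, rating) pairs."""
    buckets = defaultdict(list)
    for group, rating in ratings:
        buckets[group].append(rating)
    return {group: mean(values) for group, values in buckets.items()}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Illustrative ACR ratings only; these are NOT data from the experiment.
ratings = [("A", 5), ("A", 4), ("B", 4), ("B", 4), ("C", 3),
           ("C", 3), ("D", 2), ("D", 3), ("E", 1), ("E", 1)]
targets = {"A": 90, "B": 70, "C": 50, "D": 30, "E": 10}  # target VMAF per group

mos = mos_by_group(ratings)
groups = sorted(mos)
r = pearson([targets[g] for g in groups], [mos[g] for g in groups])
```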
Further discussion of the obtained results should focus on conclusions concerning Hypotheses 2 and 3, i.e., the impact of the number of views of a given video and the impact of the preceding video’s objective quality on the subjective quality assessment. Hypothesis 1, concerning the agreement of the technical objective assessment with the average values of the subjective assessments, is of an auxiliary nature, as the correlation has mainly been used to verify data consistency and to confirm that the obtained results retain their substantive and logical sense.
The regularity observed in the experiment is the deterioration of subjective ratings for videos from the highest quality group with successive views in a research sample. Videos from group A are of the highest quality, i.e., close to the source quality. The deterioration of the average rating in subsequent views of the same video would indicate that the tester “appreciates” the video’s quality less with each subsequent presentation. The deteriorating ratings of the highest quality videos are also consistent with the state of knowledge developed within cognitive sciences, which indicates a more favorable assessment of a high-quality stimulus that the tester sees for the first time [26]. An option presented first is ultimately chosen more often by a tester, so it is also treated more favorably than a similar stimulus presented later.
It should also be noted that no such effect was obtained for videos from the other quality groups, as no statistically significant differences in the average ratings of these videos were found. A regularity worth noting is the improvement in average ratings with successive views for videos from the D quality group, i.e., the second-lowest group of objective quality. The obtained results could suggest that with successive views of the same video, the average rating becomes “averaged”, i.e., it moves toward the middle of the scale, as videos of the highest quality are rated worse and videos of low quality are rated better. It should be noted here that videos from the lowest quality group E are characterized by a VMAF value of around 10. It is assumed that the lowest acceptable VMAF value that should be analyzed is 20, so the D quality group is the lowest quality group for which reliable results can be obtained. The obtained results are in line with research described in the subject literature. Publication [11] indicates that “unpleasant” images displayed at the end of an experiment are perceived as less “unpleasant” than when they are displayed at the beginning. This corresponds to presenting the tester with a low-quality video that is not pleasant to watch, whose subjective rating increases with subsequent views.
It is also worth referring to the results published in [12], which describe the so-called “Peak-End Effect” in relation to the overall QoE ratings obtained after participants watched a sequence of videos. The Peak-End Effect is a regularity in which a subject evaluates an overall experience largely on the basis of its most intense moment and its final moments. Other information beyond the peak and end of the experience is not lost, but it is not used. This applies to both positive and negative impressions. Admittedly, this effect cannot be directly translated into the results obtained in the experiment, because it applies to the final evaluation given after watching a whole video consisting of sequences of various quality; indirectly, however, the evaluator’s accumulating impressions may translate into the subjective evaluation of individual video sequences.
The experiment demonstrated a higher variance of subjective ratings for the highest quality group than for the lowest quality group, which indicates that the subjects focused more on the negative assessment of very low quality videos than on the appreciation of high quality videos. This regularity corresponds with the theses expressed in publication [27], according to which the negative assessment of events equidistant from the permissible boundary values is stronger than the positive assessment. This constitutes a kind of “negative bias”, in which negative evaluations are “weighted” more heavily than positive ones. Also worth discussing is the randomization of the order and structure of multimedia content presentation as a way of mitigating the effects identified above and, as a result, avoiding the so-called order bias. In publication [28], the authors identified the effect of testers “learning” the rating scale in the course of a test and, to mitigate this bias, proposed using another comparison method (Pair Comparison) instead of the ACR methodology used in this experiment.