1. Introduction
Reliable evaluation of the quality of online video services, including video-on-demand, is crucial because such services are expanding rapidly and gaining widespread popularity, as seen with YouTube, Netflix, and other streaming platforms. Streaming technology is also evolving from the traditional non-adaptive form to adaptive streaming. In particular, internet video delivery is moving from connection-oriented video transport protocols, such as the Real-Time Messaging Protocol (RTMP), to adaptive streaming over HTTP.
The use of surveillance cameras has increased significantly, leading to challenges in storing, transmitting, and analyzing video data. As a result, there is a great need for a reliable system to manage large amounts of video data. Such a system should have efficient video compression, stable storage, and high-bandwidth ethernet or internet transmission capabilities. In modern cities, distributed cameras capture video from various scenarios. Video sensor systems can provide valuable data for improved traffic planning and management. Future intelligent road technologies will rely heavily on the quality and quantity of such data. Video-based detection systems are an essential component of intelligent transport systems and offer a flexible means of acquiring data. They are also evolving rapidly, which makes them highly promising.
We must bear in mind that quality is a crucial factor in many industries. In the video industry, various databases are designed to assess video quality using objective or subjective metrics. We have compared various databases produced by different institutions. The Ultra Video Group [
1] shot 16 4K quality video sequences, capturing different spatio-temporal information and covering all four quadrants. They used well-known sequences, such as Beauty, Bosphorus, Jockey, or ReadySetGo, as a good basis for testing. The SJTU database [
2] contains ten 4K video sequences, some of which could be used in the smart city monitoring industry, such as Bund Nightscape, Marathon, Runners, Traffic and Building, and Traffic Flow. However, this database is not designed specifically for this purpose. In [
3], a database of video sequences suitable for mobile viewing was described, exploring the impact of adaptive streaming for mobile phones, where the quality can vary depending on the connection. In [
4], a database representing different distortions was created and presented, taking into account not only temporal and spatial information, but also colorfulness, blurriness, or contrast.
The authors proposed a database called LIVE-NFLX-II in the paper [
5]. This database contains subjective QoE responses for various design dimensions, such as different bit rate adaptation algorithms, network conditions, and video content. The content characteristics cover a wide range, including natural and animated video content, fast and slow scenes, light and dark scenes, and low and high texture scenes. In [
6], the authors proposed and built a new database, the LIVE-Qualcomm mobile in-capture video quality database, which contains a total of 208 videos that model six common in-capture distortions. The coding structure, syntax, various tools, and settings relevant to the coding efficiency have been described in [
7]. In the paper, the perception of compression as well as spatial and temporal information was further investigated. The authors compiled an extensive database of video sequences, whose quality was subjectively evaluated.
To enhance the accuracy of no-reference (NR) video quality prediction, a comprehensive video quality assessment database was developed in [
8]. This database comprises 585 videos featuring unique content, captured by a diverse group of users with varying levels of complexity and authentic distortions. The subjective video quality ratings were determined through crowdsourcing. To effectively analyze and utilize data in areas such as smart cities and smart traffic, it is crucial to expand the existing databases. This means adding new sequences that solely contain snapshots of the city, traffic, or other relevant information.
2. Related Work
Quality of Experience (QoE) is highly dependent on Quality of Service (QoS) parameters, because factors such as latency, jitter, and packet loss also matter in video traffic. Although such factors are easily measurable, QoE itself cannot be easily quantified. Currently, one of the most popular network services is live video streaming, which is growing rapidly [
9,
10]. In [
11], a detailed quantitative analysis of video quality degradation in a homogeneous HEVC video transcoder was presented, along with an analysis of the origin of these degradations and the impact of the quantization step on the transcoding. The differences between video transcoding and direct compression were also described. The authors also found a dependency between the quality degradation caused by transcoding and changes in the transcoded bit rate.
In [
12], the authors compared the available 4K video sequences. The compression standards H.264 (AVC), H.265 (HEVC), and VP9 were compared. Video sequences were examined using objective metrics like PSNR, SSIM, MS-SSIM, and VMAF. Recently, many experts and researchers have provided quality performance analyses of well-known video codecs such as H.264/AVC, H.265/HEVC, H.266/VVC, and AV1. The authors in [
13] performed an analysis between the HEVC and VVC codecs for test sequences with a resolution ranging from 480p up to ultra HD (UHD) resolution using the Peak Signal-to-Noise Ratio (PSNR) objective metric. In paper [
14], the rate distortion analysis of the same codecs using the PSNR, Structural Similarity Index (SSIM), and Video Multi-Method Assessment Fusion (VMAF) quality metrics was provided. The authors in [
15,
16] assessed the video quality of the HEVC, VVC, and AV1 compression standards for test sequences with resolutions varying from 240p to UHD/4K, and in [
17,
18] at full HD (FHD) and ultra HD (UHD) resolutions, respectively. The compression efficiency was calculated using the PSNR objective metric. In [
17,
18], for quality evaluation, the Multi-Scale Structural Similarity Index (MS-SSIM) method was used. Paper [
19] presents a comparative performance assessment of five video codecs—HEVC, VVC, AV1, EVC, and VP9. The experimental evaluation was performed on three video datasets with three different resolutions—768 × 432, 560 × 488, and 3840 × 2160 (UHD). Paper [
20] deals with an objective performance evaluation of the HEVC, JEM, AV1, and VP9 codecs using the PSNR metric. A large test set of 28 video sequences with different resolutions varying from 240p to ultra HD (UHD) was generated. Paper [
21] examines the compression performance of three codecs, namely HEVC, VVC, and AV1, measured with the PSNR and SSIM objective video quality metrics. In paper [
22], the authors compared the coding performance of HEVC, EVC, VVC, and AV1 in terms of computational complexity.
In [
23], the authors proposed a new methodology for video quality assessment using the just-noticeable difference (JND). The publication focuses on describing the process of subjective tests. In [
24], the authors presented an empirical study of the impact of packet-loss-related errors on television viewing. They compared different delivery platforms and technologies. Video sequences and delivery quality information obtained from the service provider were used in the experiments. The sequence length, content, and connection type were compared. In [
25], 16 types of metrics were compared for quality assessment. Packet loss was simulated in the video encoding and the losses were then hidden using different techniques to conceal the errors. The purpose was to show that the subjective quality of a video cannot be predicted from the visual quality of the frame alone when some hidden error occurs. In [
26], a new objective indicator, the pixel loss rate (XLR), was proposed. It evaluates the packet loss rate during video streaming. This method achieved comparable results with fully benchmarked metrics and a very high correlation with MOS. In [
27], the authors provided an overview of packet loss in Wi-Fi networks, mainly for real-time multimedia.
In [
28], an optimal packet classification method was proposed to classify packets that were given a different priority when the transmission conditions deteriorated. The network transmits the segments with the highest priority concerning perception quality when constrained network conditions occur. The results showed that the proposed method can achieve higher MOS scores compared to non-selective packet discarding. The authors in [
29] stated that highly textured video content is difficult to compress because a trade-off between the bit rate and perceived video quality is necessary. Based on this, they introduced a video texture dataset that was generated using a development environment. It was named the BVI-SynTex video dataset and was created from 196 video sequences grouped into three different texture types. It contains five-second full HD video scenes with a frame rate of 60 fps and a depth of 8 bits.
Video analysis in smart cities is very useful when deploying various systems and investigating their performance. A description of the technologies used for video analysis in smart cities can be found in [
30]. With the help of this analysis, various situations have been studied, including traffic control and monitoring, security, and entertainment. In [
31], the authors evaluate the transmission of multimedia data at a certain throughput setting. They then evaluate the performance and describe the benefits of real-time transfer. This publication describes video surveillance in smart cities and multimedia data transmission with cloud utilization. They discuss the impact of network connectivity on the transmission of multimedia data over a network. An algorithm dealing with image and video processing was presented by the authors in [
32]. This solution suppresses noise to achieve accuracy in the traffic scene. This knowledge was then used in a smart lighting experiment in a smart city system.
The transmission of streaming video in automotive ad-hoc networks is addressed by the authors in [
33]. These scenarios investigate the conditions that affect the quality of streaming, which is simulated in the NS3 environment. The publication [
34] discusses the influence of the resolution, number of subjects per frame, and frame rate on the performance of metrics for object recognition in video sequences. The authors used videos taken from cameras placed at intersections, which captured the scene from above. Changing the cropping method or changing the frame width was described. The classification of municipal waste using an automated system was proposed in paper [
35]. The suggested model classified the waste into multiple categories using convolutional neural networks.
A blind video quality assessment (BVQA) method based on a DNN and intended for in-the-wild video content was proposed in [
36]. Transfer learning methods with spatial and temporal information were used, drawing on the authors' knowledge of image quality assessment (IQA). They used the DNN to account for the motion perception of video sequences for spatial features. The method was evaluated on six different VQA databases. The authors of the research paper [
4] have developed a database of videos. This database is sampled and subjectively annotated and is intended to display authentic distortions. To ensure that the dataset was diverse in terms of content and multidimensional quality, six attributes were computed. These attributes included spatial information, temporal information, blur, contrast, colorfulness, and VNIQE. The paper introduces a new VQA database called KoNViD-1k.
In paper [
37], the authors propose an Enhanced Quality Adaptation Scheme for DASH (EQASH). The proposed scheme adjusts the quality of the segments not only based on the network and playback buffer status but also based on the VBR characteristics of the contents. The proposed scheme also reduces the latency by employing the new server push feature in HTTP 2.0. According to a study [
38], a video playback schedule that has a minimum number of low-resolution video segments provides the best QoE. The paper presents the M-Low linear time scheduling algorithm, which adjusts the video resolution and optimizes the QoE indices in the DASH streaming service. The authors of the study describe several QoE metrics, including the minimization of resolution switching events, freeze-free playback, the maximization of the video playback bit rate, and the minimization of low-resolution video segments.
The introduction of smart cities has revolutionized the way that we live, work, and commute. Although the papers presented in the Introduction [
1,
2,
3,
4,
5,
6,
7,
8] have created various databases, none of them covers smart city content directly. Our review of the literature indicates that this topic is highly relevant and that a database focused on smart transportation and smart cities would be highly beneficial. The primary objective of this work is to create the first database of this kind. The database will contain video sequences that capture smart transportation and smart cities. Additionally, the reference sequences will be transcoded into different quality settings. The transcoded sequences will be evaluated subjectively and objectively to ensure that they meet the desired quality standards. This will give researchers, developers, and stakeholders access to high-quality sequences that can be used in various applications related to smart cities.
3. Motivation
The concept of a smart city has emerged in the last decade, as a combination of ideas on how information and communication technologies can enhance the functioning of cities. The goal is to improve efficiency and competitiveness and provide new ways to address poverty and social deprivation. The main idea is to coordinate and integrate technologies that can bring new opportunities to improve quality of life. A smart city can take various forms, such as a virtual, digital, or information city. These perspectives emphasize the role of information and communication technologies in the future operation of cities.
The concept of Quality of Experience was introduced as an alternative to Quality of Service (QoS) to design more satisfactory systems and services by considering human perception and experience. Our research focuses on QoS as well as QoE and their interconnection using a mapping function, followed by prediction. We test the impact of various quality parameters, such as the resolution, bit rate, and compression standard, on the resulting quality. In the case of QoE, we use subjective metrics for evaluation, while, for QoS, objective metrics are used. We also simulate the impact of packet loss, delay, or network congestion using a simulation tool to understand their effects on quality.
Based on the results and evaluations obtained, we recommend an appropriate choice of parameters that will guarantee the maximum quality for the end user while ensuring bandwidth efficiency for the provider. By combining these parameters, we can set the variable bit rate (VBR) to stream the video as efficiently as possible. In a classical streaming scenario, the video is viewed at one specific resolution, which is fixed before each session starts, over a connection-oriented transport layer protocol. Adaptive streaming, on the other hand, involves encoding the video at multiple discrete bit rates. Each bitstream, or video at a specific resolution, is then divided into sub-segments or chunks, each a few seconds long (typically around 2–15 s). For optimal video quality during playback, the end user’s connection conditions and download speed must be taken into consideration. VBR encoding can lead to inconsistencies in video block size, which can cause frequent rebuffering and reduce the user’s QoE, especially when the network bandwidth is limited and fluctuating.
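As a rough illustration of the adaptive streaming principle described above, the following sketch selects, for the next chunk, the highest encoded bit rate that fits within a measured throughput. The function name `pick_bitrate`, the safety factor of 0.8, and the example ladder of 5, 10, and 15 Mbps are assumptions made purely for illustration; real adaptive players use considerably more elaborate logic.

```python
def pick_bitrate(ladder_mbps, throughput_mbps, safety=0.8):
    """Return the highest bit rate from the ladder that fits the throughput.

    ladder_mbps: available encoded bit rates in Mbps (hypothetical ladder).
    throughput_mbps: recently measured download speed in Mbps.
    safety: fraction of the throughput we allow ourselves to use
            (an assumed margin against bandwidth fluctuations).
    """
    if not ladder_mbps:
        raise ValueError("the bit rate ladder must not be empty")
    budget = throughput_mbps * safety
    # Keep only the renditions that fit into the budget.
    candidates = [rate for rate in sorted(ladder_mbps) if rate <= budget]
    # If nothing fits, fall back to the lowest rendition rather than stalling.
    return candidates[-1] if candidates else min(ladder_mbps)


if __name__ == "__main__":
    for speed in (3, 14, 100):
        print(speed, "Mbps ->", pick_bitrate([5, 10, 15], speed), "Mbps")
```

The fallback to the lowest rendition reflects one possible design choice: playing at reduced quality is usually preferable to interrupting playback.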
In this publication, we discuss the impact of various quality settings, such as the codec used, resolution, and bit rate, on the overall quality. Both objective and subjective metrics are used to determine the quality. Quality and appropriately set parameters are also important in the field of smart cities and traffic. The rapid expansion of cities in recent years has resulted in urban problems such as traffic congestion, public safety concerns, and crime monitoring. Smart city technologies leverage data sensing and big data analytics to gather information on human activities from entire cities. These data are analyzed to provide intelligent services for public applications.
4. Methodology
Video quality analysis focuses on packet loss in the network depending on the codec used, which causes artifacts in the video. We use QoE metrics to determine user satisfaction boundaries and, most importantly, the application of such QoS tools in the network to guarantee the minimum QoE expected by the user. The use of the internet as an environment for multimedia delivery is quite common today, but it is not entirely guaranteed that the user will receive, in such an environment, a service with the desired quality. This makes QoE monitoring and the search for links between QoS and QoE all the more important today.
It is essential to evaluate the performance of systems for the sending of information from one source to another (data link) and ensure efficient information transfer. When evaluating the transmission quality of IPTV services, we focus on user satisfaction with the quality of media content. It is generally assumed that high performance and transmission quality result in high user satisfaction with the service. From a human perceptual point of view, quality is determined by the perceived composition, which involves a process of perception and judgment. During this process, the perceiver compares the perceived events with a previously unknown reference. The nature of the perceived composition may not necessarily be a stable characteristic of the object, as the reference may influence what is currently perceived. Quality is usually relative and occurs as an event in a particular spatial, temporal, and functional context.
Objective quality assessment is an automated process, as opposed to subjective assessment, which requires human involvement. Objective video quality models fall into three types, classified by how much information about the original signal is available alongside the received signal: full-reference (FR), reduced-reference (RR), and no-reference (NR) methods. In our evaluation, we use FR objective methods (SSIM, MS-SSIM, PSNR, and VMAF). A more detailed description can be found in our previous publications [
39,
40]. As a subjective metric, we use the Absolute Category Rating (ACR) method, which uses no reference: the video is rated only on the basis of the sequence seen, not by comparison with a reference. In a real environment, the end user likewise receives only the signal from the service provider and cannot compare it with the original. The quality is expressed on a five-point MOS scale. The standard [
41] provides a methodology for the subjective assessment of the quality of voice and video services from the end user’s perspective. This metric summarizes ratings that are averaged on a scale from 1, which is the worst quality, to 5, which represents excellent quality. For more information, see our publication [
40].
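The averaging behind the MOS can be sketched in a few lines. This is a minimal illustration assuming integer ACR ratings from 1 to 5; the function name `mos` and the sample ratings are hypothetical and are not taken from our data.

```python
from statistics import mean


def mos(ratings):
    """Mean Opinion Score: arithmetic mean of ACR ratings on the 1..5 scale."""
    ratings = list(ratings)
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ACR ratings must be integers from 1 to 5")
    return mean(ratings)


if __name__ == "__main__":
    # Hypothetical ratings given by four raters to one sequence.
    print(mos([5, 4, 4, 3]))
```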
5. Methods of Proposed Model
Our primary goal is to create video sequences in ultra HD 4K resolution, which will contain various shots that map the traffic and the city. The created database of video sequences will cover both static and dynamic sequences. The created video sequences will then be transcoded to the necessary quality settings and objectively and subjectively rated. Furthermore, they will be accessible for subjective evaluation by another group. Each sequence will be identifiable by different parameters, e.g., spatial information (SI) and temporal information (TI). Subsequently, using a neural network, an appropriate bit rate can be allocated to each video sequence to achieve the desired quality.
To begin with, we had to take numerous shots, from which we selected reference video sequences. These were chosen to cover as much space as possible in the SI and TI quadrants. Their description can be found in
Section 6.1. The next step was to encode these reference sequences into a combination of full HD and ultra HD resolutions, using the H.264 (AVC) and H.265 (HEVC) compression standards and bit rates of 5, 10, and 15 Mbps using FFmpeg. FFmpeg is a collection of programs and libraries that enable the processing, editing, or playing of multimedia files. It is operated via the command line. In our case, the multimedia content had to be first encoded into a defined codec, which is a compression algorithm. Then, it was decoded to enable its use. With transcoding, it is possible to convert multimedia files to a different file container or codec, or to use different frame rates.
The selection and evaluation processes are illustrated in
Figure 1. The encoding process can be found in
Section 6.2. After transcoding, the sequences are characterized again using SI and TI information. The sequences are evaluated using objective metrics such as SSIM, MS-SSIM, PSNR, and VMAF (see
Section 6.3 for a description). The subjective metric ACR evaluation is described in
Section 6.4.
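Of the objective metrics listed above, PSNR is the simplest to state explicitly: it is 10·log10(MAX² / MSE), where MSE is the mean squared error between the reference and the distorted frame. The sketch below assumes 8-bit luminance frames held as NumPy arrays (so MAX = 255); the function name `psnr` is our own choice for illustration, and SSIM, MS-SSIM, and VMAF require dedicated implementations that are not shown here.

```python
import numpy as np


def psnr(reference, distorted, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two frames of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    dist = np.asarray(distorted, dtype=np.float64)
    if ref.shape != dist.shape:
        raise ValueError("frames must have the same shape")
    mse = np.mean((ref - dist) ** 2)
    if mse == 0:
        # Identical frames: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))


if __name__ == "__main__":
    ref = np.zeros((4, 4))
    print(psnr(ref, ref + 1))  # every pixel differs by one grey level
```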
6. Results
In this section, we describe how the database was created, the encoding of the resulting video sequences, their characteristics, and then the objective and subjective evaluation of each sequence.
6.1. Description of the Dataset
The video sequences were captured using a DJI Mavic Air 2 drone. Unfortunately, this device does not allow the shooting of video sequences in uncompressed .yuv format. However, UHD (3840 × 2160) format is available. The parameters that were chosen for shooting can be found in
Table 1. However, 4K with a resolution of 4096 × 2160 is not yet used commercially and, therefore, UHD is preferred due to its 16:9 image ratio. This is why a UHD resolution was used for our recording. The aim of this work is to create a database of 4K video sequences that cover scenes from traffic monitoring and cityscapes. These sequences will be encoded to the necessary quality parameters and rated either objectively or subjectively. The video sequences that we have created offer a wide variety of dynamicity, whether in the dynamics of the objects in the video or the dynamics of the camera.
The following video sequences have been created, focusing on transport:
Dynamic road traffic—dynamic camera motion—frequent traffic at higher vehicle speeds (name in our database: Sc1);
Dynamic road traffic—static camera motion (Sc2);
Parking lot—dynamic camera motion—less dynamic movement of cars in a parking lot (Sc8);
Parking lot—static camera motion (Sc3);
Road traffic—busy traffic at lower vehicle speeds (Sc4);
Traffic roundabout—dynamic camera motion—traffic on a roundabout (Sc5);
Traffic roundabout with a parking lot—a dynamic part of the scene with slow movement in the parking lot (Sc6);
Traffic roundabout—static camera motion—traffic on a roundabout (Sc10);
Train station—train leaving the station (Sc7);
Dynamic train—train in dynamic motion (Sc9);
Trolleybus—trolleybus arriving at a public transport stop;
Dynamic trolleybus—trolleybus in dynamic driving;
The university town—university town (movement of people);
Waving flags—flags flying in the university town.
A preview of the reference sequences can be found in
Figure 2.
Each video sequence was evaluated based on its SI and TI values. The resulting parameter value is the maximum value across all frames. The temporal information is derived from changes in the brightness of the same pixels in successive frames, while the spatial information is obtained from the Sobel filter for the luminance component and the subsequent calculation of the standard deviation of the pixels. More details can be found in our publication [
40].
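The SI and TI computation described above can be sketched as follows. This is a minimal illustration, assuming the luminance frames are available as 2-D NumPy arrays; the function names `sobel_magnitude` and `si_ti` and the handling of the frame border (only pixels with a full 3 × 3 neighbourhood are used) are assumptions and may differ from the tool actually used.

```python
import numpy as np


def sobel_magnitude(frame):
    """Sobel gradient magnitude of a 2-D luminance frame (border pixels dropped)."""
    f = np.asarray(frame, dtype=np.float64)
    # Horizontal gradient: right column minus left column, weighted 1-2-1.
    gx = (f[:-2, 2:] + 2 * f[1:-1, 2:] + f[2:, 2:]) \
        - (f[:-2, :-2] + 2 * f[1:-1, :-2] + f[2:, :-2])
    # Vertical gradient: bottom row minus top row, weighted 1-2-1.
    gy = (f[2:, :-2] + 2 * f[2:, 1:-1] + f[2:, 2:]) \
        - (f[:-2, :-2] + 2 * f[:-2, 1:-1] + f[:-2, 2:])
    return np.hypot(gx, gy)


def si_ti(frames):
    """Return (SI, TI) for a sequence of luminance frames.

    SI: maximum over frames of the standard deviation of the Sobel-filtered frame.
    TI: maximum over frames of the standard deviation of the difference
        between co-located pixels in successive frames.
    """
    frames = [np.asarray(f, dtype=np.float64) for f in frames]
    if not frames:
        raise ValueError("at least one frame is required")
    si = max(float(sobel_magnitude(f).std()) for f in frames)
    ti = max((float((b - a).std()) for a, b in zip(frames, frames[1:])),
             default=0.0)
    return si, ti


if __name__ == "__main__":
    a = np.zeros((8, 8))
    b = np.zeros((8, 8))
    b[:, 4:] = 10  # a vertical edge appears in the second frame
    print(si_ti([a, b]))
```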
Table 2 shows the characterization of the reference sequences based on the SI and TI values, which highlights the diversity of the individual sequences in terms of spatial and temporal information.
6.2. Encoding of the Reference Video Sequences
The first ten reference sequences from the list above were further encoded to the full HD (1920 × 1080) resolution and H.264/AVC compression standard. These are labeled “Sc_x” so that we can name them for each variation. These reference sequences were selected precisely based on the characterization of temporal and spatial information. The quality of the encoded content is determined by the amount of data lost during the compression and decompression of the content. In real broadcasting, a bit rate of 10 Mbps is often used for HD resolution, and some stations use bit rates up to around 15 Mbps. This bit rate is also taken into account for UHD deployments. Therefore, each sequence in both resolutions and compression standards has been encoded with bit rate values of 5, 10, and 15 Mbps.
Table 3 shows the parameters that we used in encoding the video sequences. This combination produced twelve variations for each one, which means 120 sequences. We evaluated each with objective metrics. Seven of them were also evaluated subjectively (Sc1–5, Sc9–10). The seven sequences for subjective evaluation were selected to comply with the recommendations [
42]. By combining seven sequence types with twelve coding variations, we could evaluate one group of raters continuously without a long pause. Had we selected more sequence types, we would have needed to split the subjective evaluation for each group of evaluators, which would have required more time. When selecting these seven video sequences, we also considered their spatial and temporal information.
The created database will be available to the general scientific public. The created video sequences can be further used for the needs of the analysis of the appropriate qualitative setting in order to provide the highest possible quality while saving as many network resources as possible. Thus, it will be possible to further work with the database, to shoot new sequences, which will then be evaluated, either by objective or subjective tests. This will give a detailed view of the performance of streaming video over IP-based networks. Video sequences offer the possibility to test which other parameters can characterize a given sequence, or how individual video parameters affect the quality. It will also be possible to see at which bit rate each scene achieves the highest end user satisfaction in terms of quality and thus define boundaries for each scene based on selected content information. A suitable bit rate would be assigned for each boundary so that it satisfies the highest quality at the individual resolution. This will allow technologies and applications in the smart cities and smart traffic sector to use the available resources efficiently.
We encoded the created reference sequences with changing quality parameters using FFmpeg. A coding example for 15 Mbps is as follows:
ffmpeg -i input_sequence -vf scale=resolution -c:v codec -b:v 15000k -maxrate 15000k -bufsize 15000k -an -pix_fmt yuv420p -framerate 50 SeqName.ts
A description of the individual parameters used in the command is as follows:
-i is used to import video from the selected file;
-vf scale is used to specify the resolution of the video; in our case, this parameter was set to full HD (1920 × 1080) or UHD (3840 × 2160);
-c:v is used to select the video codec; we used two codecs: H.264/AVC, which is written as libx264, and H.265/HEVC, which is written as libx265;
-b:v is used to select the bit rate; we varied this parameter at 5, 10, and 15 Mbps;
-maxrate is used to set the maximum bit rate tolerance; it requires bufsize in the settings;
-bufsize is used to set the buffer size;
-an is a parameter that removes the audio track from the video;
-pix_fmt is the parameter used to select the pixel format (chroma subsampling);
-framerate is used to set the number of frames per second.
The last parameter is the video output, where we set the video name and its format.
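The twelve coding variations per reference sequence (two resolutions, two codecs, three bit rates) can be generated programmatically. The sketch below only builds the argument lists and does not run FFmpeg; the options are copied from the command shown above, while the function name `build_commands`, the output naming scheme, and the input file name are hypothetical.

```python
from itertools import product

# Values taken from the text; the short tags are our own labels.
RESOLUTIONS = {"FHD": "1920:1080", "UHD": "3840:2160"}
CODECS = {"H264": "libx264", "H265": "libx265"}
BITRATES_MBPS = (5, 10, 15)


def build_commands(source, name):
    """Build one FFmpeg argument list per resolution/codec/bit rate combination."""
    commands = []
    for (res_tag, scale), (codec_tag, encoder), mbps in product(
            RESOLUTIONS.items(), CODECS.items(), BITRATES_MBPS):
        rate = f"{mbps * 1000}k"
        output = f"{name}_{res_tag}_{codec_tag}_{mbps}M.ts"  # hypothetical naming
        commands.append([
            "ffmpeg", "-i", source,
            "-vf", f"scale={scale}",
            "-c:v", encoder,
            "-b:v", rate, "-maxrate", rate, "-bufsize", rate,
            "-an", "-pix_fmt", "yuv420p",
            "-framerate", "50",  # option kept as written in the command above
            output,
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_commands("Sc1_reference.mp4", "Sc1"):
        print(" ".join(cmd))
```

Each list could then be passed to `subprocess.run`; applied to ten reference sequences, this would yield the 120 encoded sequences mentioned earlier.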
The output sequence has been encoded into a .ts container so that we can test the impact of packet loss in the future. Programs such as Media Info and Bitrate Viewer can be used to check the parameters of the transcoded files. Media Info displays all the parameters and settings of the video, while Bitrate Viewer displays the bit rate over time.
We included a five-second pause between consecutive sequences so that the evaluators do not overthink the rating and it remains spontaneous. During the pause, the video shows a grey background, which neither distorts the image nor draws the eyes of the evaluators. Text inserted into the grey background indicates the current rating step, so that the raters know which part of the evaluation they are in.
To ensure an accurate evaluation, a maximum of three people participated simultaneously, and they had a direct and undistorted view of the TV set. The video sequences were evaluated on a Toshiba 55QA4163DG TV set placed 1.1 m from the raters, in compliance with the standard [
42]. The distance between the viewer and the monitor should be 1.5 times the height of the monitor. Each evaluator had a questionnaire in which they recorded their evaluation of each sequence. A total of 30 human raters participated in the evaluation, rating 84 video sequences. The MOS rating scale of 1 to 5 was used, where 1 represents the worst quality and 5 the best.
In the following sections, we will analyze the outcomes of both the objective (see
Section 6.3) and subjective (see
Section 6.4) evaluations. Please note that the results presented here are based on selected samples only, while all other numerical or graphical data can be obtained upon request. Moreover, we are currently working on creating a website where the entire database will be published and available for free.
6.3. Objective Quality Evaluation
In the case of objective evaluation, we selected one video sequence to present the results, namely the traffic roundabout with a parking lot (Sc6). For this sequence, we present the frame-by-frame evaluation for the individual objective metrics (SSIM, MS-SSIM, PSNR, and VMAF) at a 15 Mbps bit rate in both resolutions and codecs. The results are normalized to the range [0, 1], which makes it possible to compare the overall correlation of the evaluated metrics.
The results for the ultra HD resolution for the H.265 (HEVC) codec can be seen in
Figure 3 and for the H.264 (AVC) codec in
Figure 4. The full HD resolution can be viewed in
Figure 5 for the H.265 (HEVC) compression standard and in
Figure 6 for the H.264 (AVC) compression standard. With such a high bit rate, the H.265 compression standard achieves better results compared to H.264 for both resolutions.
The full HD resolution achieves a better rating. With an increasing bit rate, the difference is smaller. When comparing
Figure 3 and
Figure 4, we can conclude that the ratings correlate with each other and there are noticeable equal rating shifts in both compression standards. At full HD resolution, we can observe a larger variation between the compression standard H.265 in
Figure 5 and for H.264 in
Figure 6. Mapping the results of the different objective metrics confirms the high correlation between the methods used and makes the comparison of these metrics more accessible to readers. We can also see that the VMAF scores oscillate more than the results of the other metrics.
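The agreement between two per-frame metric curves can be quantified with a correlation coefficient. The sketch below uses the Pearson coefficient, which is an assumption on our part, since no specific coefficient is named above; the function name `pearson` and the per-frame values are hypothetical and serve only to show the calculation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length of at least two")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (norm_x * norm_y)


if __name__ == "__main__":
    # Hypothetical normalized per-frame scores of two metrics.
    metric_a = [0.91, 0.93, 0.90, 0.95, 0.92]
    metric_b = [0.88, 0.91, 0.86, 0.94, 0.90]
    print(round(pearson(metric_a, metric_b), 3))
```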
We present the final results for each sequence (Sc1–Sc10) in the form of mean values of the VMAF and PSNR metrics for the 15 Mbps bit rate. As expected, the H.265 codec achieves better results, and we can also see an improvement in the results with an increasing bit rate value. The results of the objective evaluation of all sequences for the ultra HD resolution in combination with the H.265 (HEVC) codec are shown in Figure 7 and for the H.264 (AVC) codec in Figure 8. The results for the other sequences also confirm that the H.265 (HEVC) compression standard has a better rating. For some sequences, the difference is more pronounced, which is due to the dynamics of the scene.
At full HD resolution, the differences between H.265 (HEVC), which can be seen in Figure 9, and H.264 (AVC), shown in Figure 10, are smaller. In both cases, the full HD resolution achieves higher values than the ultra HD resolution.
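The per-sequence mean values discussed above can be obtained by averaging the frame-by-frame scores of each sequence. The following minimal sketch assumes that the per-frame scores are already available as lists keyed by sequence label; the function name `mean_scores` and the sample values are hypothetical and are not measurements from the database.

```python
from statistics import mean
from typing import Dict, List


def mean_scores(per_frame: Dict[str, List[float]]) -> Dict[str, float]:
    """Map each sequence label (e.g. "Sc1") to the mean of its per-frame scores."""
    return {label: mean(values) for label, values in per_frame.items()}


# Hypothetical per-frame VMAF values for two sequences.
example = {
    "Sc1": [90.0, 92.0, 94.0],
    "Sc2": [80.0, 85.0, 90.0],
}
print(mean_scores(example))
```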
6.4. Subjective Quality Evaluation
In this section, we present the results of the subjective evaluation of seven reference sequences. These sequences were re-encoded with various quality parameters. We calculated the average ratings from 30 users for each type of coded sequence from the references Sc1 (dynamic road traffic, dynamic camera motion) and Sc2 (dynamic road traffic, static camera motion), as shown in Figure 11a. The results for Sc3 (parking lot, static camera motion) and Sc4 (road traffic) can be seen in Figure 11b, while the results for Sc5 (traffic roundabout, dynamic camera motion) and Sc10 (traffic roundabout, static camera motion) are presented in Figure 11c.
Figure 8.
Mean values of VMAF and PSNR for UHD, H.264, 15 Mbps.
Figure 9.
Mean values of VMAF and PSNR for full HD, H.265, 15 Mbps.
Figure 10.
Mean values of VMAF and PSNR for full HD, H.264, 15 Mbps.
Figure 11.
Average values of subjective evaluation. (a) Subjective evaluation of Sc1 and Sc2. (b) Subjective evaluation of Sc3 and Sc4. (c) Subjective evaluation of Sc5 and Sc10.
In Table 4, one can find the complete results of the transcoded sequences from the dynamic train (Sc9) reference sequence. Table 4 includes the average result as well as the exact number of occurrences for each MOS scale value.
6.5. Correlation between Objective and Subjective Assessments
There are various metrics to express the correlation between subjective and objective assessments. The two most commonly used statistical measures are the Root Mean Square Error (RMSE) and Pearson’s correlation coefficient. A correlation value greater than approximately 0.8 is usually considered to indicate a strong relationship. To measure the correlation, we used three sequences (Sc1: dynamic road traffic, Sc9: dynamic train, and Sc10: traffic roundabout) in UHD resolution for comparison. The results show that there is a strong correlation between the subjective evaluation by the respondents and the objective evaluation. One can see the correlation between these evaluations in Table 5.
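To make the two measures concrete, the following minimal sketch computes Pearson’s correlation coefficient and the RMSE for paired scores. It uses the standard textbook definitions; the function names and the sample values are hypothetical, and the RMSE is only meaningful once the objective scores have been mapped to the same scale as the subjective ratings, which is assumed here and not described in the text.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Covariance term and the two standard-deviation terms (unnormalized).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root mean square error between predicted and observed scores."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two equally long, non-empty lists")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


# Hypothetical values: objective scores mapped to the MOS scale vs. MOS.
objective_on_mos_scale = [2.0, 3.0, 4.0, 4.5]
mos = [2.2, 3.1, 3.8, 4.6]
print(pearson(objective_on_mos_scale, mos))
print(rmse(objective_on_mos_scale, mos))
```

A coefficient close to 1 together with a small RMSE indicates that the objective metric tracks the subjective ratings closely.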
7. Discussion
When monitoring smart city footage, we need to consider the purpose and location of the capture. Depending on the importance of the captured part of the city, we can define the necessary quality of the recording. If we need to address security, we can use high-resolution security cameras such as Internet Protocol cameras (IP cameras), which can produce a 4K resolution or better. However, when monitoring a certain event, checking the traffic, or monitoring a location with a static background, we do not need the best-resolution video. In this case, wireless cameras can be used, although the quality of the viewed footage may not match that of the captured scene. It may be limited by an insufficient Wi-Fi signal or by a lower-resolution monitor or display on which the footage is viewed. The selection of an individual system for deployment involves several important aspects, and our recommendations can help to determine appropriate quality parameters. We can define sufficient quality for different types of video sequences based on the deployment requirements. To achieve this, we created a large set of video sequences, some of which had to be recorded multiple times due to poor weather conditions or image interference. The final shots were of high quality, with different object dynamics in the scenes and dynamic camera movement.
We have created a database of 4K video sequences that cover scenes from traffic or city monitoring. Our goal is to expand this database with video sequences shot with different devices, such as classic cameras, drones, mobile phones, and GoPro cameras. This will help us to determine whether the quality is also affected by the camera on which the video sequences are shot. In the future, we plan to extend the encoded sequences with the H.266 and AV1 codecs and bit rates of 1, 3, 7, and 20 Mbps, to compare the ratings of other combinations of quality parameters. We are also considering using other metrics for objective evaluation and a larger sample for subjective evaluation.
We are looking for partners who can provide us with video sequences to improve our monitoring system. Our team is interested in collaborating with the city of Zilina to identify video sequences that could be used to enhance the system. We are also interested in using some of their own recordings. Furthermore, we are looking for a reliable security systems company to partner with and expand our database in the future. In addition, we are interested in working with partners who can help us to film 8K sequences and expand our laboratory with 8K imaging units to perform subjective tests. Although we have reached out to other universities in Slovakia and the Czech Republic, the possibilities are currently limited.
The reference sequences of our database are available at [43]. All of them, as well as the encoded sequences, can be downloaded by researchers from the server using the File Transfer Protocol (FTP). The FTP server at IP address 158.193.214.161 is configured to allow passwordless access via any FTP client. Once connected, the user has access to both reference and transcoded sequences. The “reference sequences” section contains the names defined in the description of the dataset, while the “encoded sequences” section contains sub-sequences for each resolution (full HD, ultra HD). The names of the transcoded sequences follow the key “original sequence name_compression standard_bitrate”. A test web page is currently being finalized; it will contain these sequences, their descriptions, and a contact form where users can leave comments or advice. Until the website is launched, interested parties can contact the researchers by email for more information or to provide feedback.
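The naming key for the transcoded sequences lends itself to simple automated parsing. The following minimal sketch splits a name into its three parts, assuming that the underscore is the separator and that any file extension has already been removed. The helper `parse_encoded_name` and the example name `Sc6_H265_15` are hypothetical; the exact spelling of the codec and bit rate fields on the server is not specified in the text.

```python
from typing import NamedTuple


class EncodedName(NamedTuple):
    sequence: str  # original sequence name
    codec: str     # compression standard
    bitrate: str   # bit rate, kept as text because its format is not specified


def parse_encoded_name(name: str) -> EncodedName:
    """Split 'original sequence name_compression standard_bitrate' into parts."""
    # Split from the right so that underscores inside the sequence name survive.
    parts = name.rsplit("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"name does not match the expected key: {name!r}")
    return EncodedName(*parts)


# Hypothetical example name, used only to illustrate the key.
print(parse_encoded_name("Sc6_H265_15"))
```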
Modern technology is rapidly developing in all areas of society. However, the potential advantages and disadvantages of these technologies are often not sufficiently discussed. Although they can make our lives easier and more efficient, they can also have a negative impact on social relationships. An example is the use of industrial cameras in public spaces. CCTV cameras are used in public spaces primarily for monitoring and crime prevention. However, this type of surveillance raises human rights concerns that are often overlooked in discussions about the use of modern technology. CCTV is intended for places where increased security and public surveillance are needed, and smart technologies are used to create a safer environment. Our video recordings do not target individuals or their personal belongings, but are instead used for research purposes. Anyone who downloads sequences from our server agrees to this statement.
8. Conclusions
The purpose of this research was to create 4K UHD video sequences to capture traffic conditions in the city and monitor specific areas. The footage was intended to be used to analyze quality requirements and provide recommendations for the implementation of technologies such as smart cities or smart traffic. To begin, we determined the types of video sequences that could be applicable in the smart cities or traffic sector. We selected video sequences that provided slower but also more dynamic shots, as well as video sequences where the camera movement was both static and dynamic, changing the characteristics of the footage. We identified individual video scenes through spatial and temporal information, knowing that camera movement also affects these values, producing a different type of video sequence. For transportation, we chose Zilina’s available means of public transportation, specifically the trolleybus coming and going from the public transport stop, as well as its dynamic driving. We also recorded the traffic situation at lower and higher speeds, including busy roads, roundabouts, and parked vehicles. We focused on rail transport as well, recording slower trains arriving or leaving the station and faster-moving trains. Selecting video sequences for smart cities was more difficult, as we needed to cover different dynamics. We chose a sequence that monitored the movement of people in a university town and flags flying as a demonstration of an object that could be recorded. Monitoring systems record various situations, whether in the context of security or sensing different situations, where the system helps to evaluate the appropriate response.
We used both objective and subjective methods to evaluate the tests conducted and, based on the measurements obtained, we plan to propose a QoS model for the estimation of triple-play services in our future work. Our next focus is to assess the quality of video data delivery in various scenarios by simulating different values of packet loss and delay in the network. The results of this study will help us to determine whether it is better for video quality to receive packets in the incorrect order or to lose them entirely.
We plan to expand our database to include video sequences recorded with various devices, including mobile phones, GoPro cameras, and conventional 4K cameras. This comprehensive database will allow us to compare the resulting quality of the videos captured by different devices. These comparisons will help to improve streaming services. We will also develop a prediction model that can calculate the resulting video quality based on the network’s state and behavior. This model can be used by ISPs during the network architecture design process.