4.1. Experimental Setup
All encoding operations were conducted on personal computers with an Intel i7-10700 eight-core 2.90-GHz processor and a 64-bit Windows 10 operating system, with hyper-threading and turbo modes turned off. The experiments were performed without GPUs to limit the complexity of the ML models. The ML models were trained and tested in Jupyter Notebook, and Visual Studio 2017 was used for model conversion and for the C++ experiments.
The model performances were evaluated using the TensorFlow [29] and scikit-learn [30] libraries. TensorFlow is an open-source software library for ML and artificial intelligence; it can be used for a range of tasks but focuses particularly on the training and inference of deep neural networks, and was developed by the Google Brain team for internal Google use in research and production. Scikit-learn is a free ML library for the Python programming language. It includes various classification, regression, and clustering algorithms, including SVM, random forest (RF), gradient boosting, and k-means, and is designed to interoperate with numerical and scientific Python libraries such as NumPy and SciPy.
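As a minimal illustration of the scikit-learn workflow used throughout this section (the data and label rule below are synthetic placeholders, not the actual CU features), a binary split-decision classifier can be trained as follows:

```python
# Minimal sketch of training a binary split-decision classifier with
# scikit-learn; the 13 features and the label rule are synthetic
# placeholders, not the paper's actual CU data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 13))                  # 13 features per CU (as in Section 4.3)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # synthetic "TT split" label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(max_depth=7, random_state=0).fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.3f}")
```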
4.3. Performance of ML Models for Accurate TT-Split Prediction
Table 4 and Table 5 present the performance results (accuracy and training time) of the ML models established for the TT_H and TT_V split directions, respectively.
The ML models used in the TT-split decision stage were DT, RF, and multi-layer perceptron (MLP) [9]. We first established three DT models with different maximum depths (max depth = 5, 6, and 7) and then three RF models with different numbers of trees (5, 6, and 7). Finally, we constructed a fully connected neural network with 13 input nodes, 30 hidden nodes, and 1 output node (the MLP model) and set the number of epochs to 2000 or 3000. The number of hidden nodes was set to 30 so that the proposed method could be evaluated at the same accuracy as the existing method [11].
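The three model families can be instantiated in scikit-learn as sketched below; the hyperparameters follow the values stated above, while everything else is an illustrative default rather than the paper's exact configuration:

```python
# Sketch of the three model families evaluated for the TT-split decision.
# max_depth, n_estimators, and the 30-node hidden layer follow the text;
# all other settings are illustrative defaults.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

dt_models = [DecisionTreeClassifier(max_depth=d) for d in (5, 6, 7)]
rf_models = [RandomForestClassifier(n_estimators=n) for n in (5, 6, 7)]
# 13 input features -> one hidden layer of 30 nodes -> 1 (binary) output;
# max_iter plays the role of the epoch budget here.
mlp_model = MLPClassifier(hidden_layer_sizes=(30,), max_iter=2000)
```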
The results show that the DT models achieved higher accuracy with less training time than the other models. Among them, the DT model with max depth = 7 achieved the highest accuracy and a short total training time for TT_H decisions. This model was therefore selected to determine whether a TT split is required in the TT-partitioning decision stage.
4.4. Performance of the Proposed Object-Cooperated TT Partitioning Decision Method
We now compare the performance of the existing method, which inputs only context-based features, with that of the proposed method, which additionally inputs object features.
Table 6 and Table 7 display the accuracy of the methods per sequence for the horizontal and vertical TT-split directions, respectively, on the 0th frame of 22 sequences at various video resolutions [25].
For the existing method, we evaluated a DT model (max depth = 7) trained using only the context-based features. For the proposed object-cooperated TT-partitioning decision method, the DT model (max depth = 7) was trained using 13 features comprising the 11 context-based features and two additional object features obtained via object detection with YOLOv5.
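A plausible sketch of deriving two object features from YOLOv5-style bounding boxes is given below; the exact feature definitions (an object count and an object-area ratio) are our assumption, not the paper's stated formulas:

```python
# Hypothetical sketch: derive two object features (object count and
# object-area ratio) from YOLOv5-style (x1, y1, x2, y2) bounding boxes.
# The paper's exact feature definitions are not reproduced here.
def object_features(boxes, frame_w, frame_h):
    """Return (number of objects, fraction of frame area covered)."""
    num_objects = len(boxes)
    covered = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes)
    object_ratio = covered / (frame_w * frame_h)  # overlaps counted twice
    return num_objects, object_ratio

# Example: two detections in a 1920x1080 frame.
print(object_features([(0, 0, 960, 540), (960, 540, 1920, 1080)], 1920, 1080))
# -> (2, 0.5)
```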
As shown in Table 6, our proposed object-cooperated method exhibits higher accuracy than the DT-based method [9] in the worst cases (video sequences with an accuracy of less than 80%). Table 7 shows that, in the worst cases, the proposed method also improves the accuracy of five out of seven sequences, confirming its effectiveness.
4.5. Complexity-Reduction Performance of the MLP-Based and Proposed Methods
Table 8 compares the performances of the existing and proposed methods with respect to the encoding-time saving (TS) and BDBR. To demonstrate that the proposed method is flexible to the needs of different applications, we adjusted the classification threshold of the DT models for TT_H and TT_V, evaluating the proposed method with thresholds of 0.5 and 0.75. The best TS result was obtained with the proposed method at a threshold of 0.5, which reduced the encoding time by 60%, on average, compared with the anchor (VTM4.0). Ranked by TS, the methods are, in descending order, the proposed method (threshold of 0.5), the MLP-based method [11], and the proposed method (threshold of 0.75). We thus confirm that our proposed method substantially reduces the encoding complexity of VVC. Meanwhile, the BDBR of the proposed method with a threshold of 0.5 increased by 0.56%, which is 0.01% higher than that obtained using the previously reported model [11]. However, the BDBR of the proposed method with a threshold of 0.75 increased by only 0.11% relative to the anchor. Thus, our proposed method achieved a moderate trade-off between encoding complexity and coding efficiency.
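The threshold-style adjustment described above can be sketched as follows; the rule shown (skipping the TT-split evaluation only when the classifier's predicted no-split probability exceeds the threshold) is an assumption used for illustration, not the paper's exact decision rule:

```python
# Sketch of a tunable split-decision rule (illustrative assumption):
# the TT-split evaluation is skipped only when the classifier is
# confident enough that no TT split is needed.  Raising the threshold
# skips fewer evaluations (smaller time saving, smaller coding loss).
def skip_tt_split(no_split_prob, threshold=0.5):
    return no_split_prob >= threshold

probs = [0.55, 0.60, 0.80, 0.90]  # hypothetical per-CU predictions
for th in (0.5, 0.75):
    skipped = sum(skip_tt_split(p, th) for p in probs)
    print(f"threshold={th}: skip {skipped}/{len(probs)} TT-split evaluations")
```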
The per-sequence results show that the proposed methods (thresholds of 0.5 and 0.75) outperformed the method reported in [11] in terms of TS and BDBR, respectively. The largest reduction in encoding time was 57%, achieved using our proposed method (threshold of 0.5) on the RaceHorses (832 × 480) and Johnny sequences. On the RaceHorses sequence, at resolutions of 832 × 480 and 416 × 240, the existing MLP-based method reduced the encoding time by 61% and 62%, respectively. Comparing the best results, our proposed method (threshold of 0.5) improved on the MLP-based method by 4% and 5% in terms of TS, respectively.
Table 9 shows the relationship between the bitrate and the average number of objects and object ratio when the classification threshold of the DT model is set to 0.5. The average number of objects and the object ratio were determined by object detection on frames of the JVET test sequences. From these results, we verified the assumption that object features can provide hints about the characteristics of a video. Across the various JVET test sequences [25], it was confirmed that sequences with a low object ratio or a small number of objects outperform the other sequences in terms of bitrate. For example, the BQSquare and PartyScene sequences show a low average object ratio and the best bitrates. The MLP-based method [11] could not be compared here because it uses no object features.
Figure 6 and Figure 7 show the decoded images of the models yielding the best TS results in Table 8 on the RaceHorses (832 × 480) and RaceHorses (416 × 240) sequences for QPs of 22 and 37, respectively. The image-quality degradations of the proposed method, the MLP-based method [11], and VTM4.0 were not noticeably different.
Meanwhile, Figure 8 and Figure 9 show the decoded images of the models yielding the worst TS results in Table 8 for QPs of 22 and 37, respectively. On the RitualDance and Cactus sequences, where the proposed method (threshold of 0.5) delivered its poorest performance (68% and 71%, respectively), the encoding times of the existing method increased by 72% and 73%, respectively. Comparing the worst results, our proposed method (threshold of 0.5) improved on the MLP-based method by 4% and 2% in terms of TS, respectively. Moreover, increasing the QP from 22 to 37 caused no significant difference in the image-quality degradation among the proposed method, the MLP-based method [11], and VTM4.0.