4.3. Embedding Capacity
The embedding capacity of each video using the proposed method is recorded in
Table 4 and
Table 5 for the LDP and AI configurations, respectively. Generally, video sequences with complex texture or fast-moving scenes (e.g.,
PartyScene and
BasketballDrill), which are predicted by using more small PBs, can offer more room for data embedding, and vice versa (e.g.,
RushHour). In the best-case scenario, the sequence
PartyScene embeds 516.9 kbps with an average
bit rate overhead for the LDP configuration, while it embeds 1578.6 kbps with an average
bit rate overhead for the AI configuration. On the other hand, the sequence
RushHour, which contains smooth regions of different sizes and a mostly static background (save for the air ripples caused by heat), offers the smallest embedding capacity. The embedding capacity for this video sequence is expected to be small because most slices in the video are coded using large PBs.
Results also suggest that when is set to a larger value, higher embedding capacity is achieved. This trend is most apparent for the sequence RushHour because, when increases, more split blocks are coded instead of the block sizes suggested by RDO. Hence, a trade-off between embedding capacity and bit rate overhead can be achieved by tuning . It is observed that AI generally achieves a higher embedding capacity than LDP. The largest difference is observed in the sequence FourPeople, followed by BasketballDrill, BlueSky, PartyScene, RaceHorses and RushHour. Notably, for the sequence FourPeople, AI achieves, on average, ∼5.9× more embedding capacity for in comparison to LDP. This is because the sequence FourPeople contains static background scenes. On the other hand, for the sequence RushHour, AI achieves 2.2∼4.3× more embedding capacity in comparison to LDP for .
4.4. Comparison with Conventional Methods
For the purposes of fair comparison and analysis, the experiment environment and data set should be exactly the same, viz., the same video encoder settings and the same video test sequences. However, we face three major challenges: (a) different video test sequences are used in the literature; (b) most experiments reported in the literature were conducted using earlier video coding standards; and (c) incomplete information is available about the parameter settings used for conducting the experiments. These challenges prevent us from conducting a thorough, fair and meaningful experiment using the conventional methods. Despite these challenges, we still provide a comparative analysis based on the available information to complete the discussion.
Specifically, for performance comparison, we define the embedding cost as the number of bits spent on the video test sequence for encoding one payload bit, where and are the numbers of bits spent on coding (bit stream size) the original and processed video sequences, respectively, and is the number of payload bits embedded into the video sequence of interest. Based on the results, for , on average, and bits are required to embed 1 bit for the LDP and AI configurations, respectively.
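The exact symbols in the definition above were lost, but it can be read as an overhead-per-bit measure. Below is a minimal sketch, under our own assumption that the embedding cost is the extra coded bits of the processed sequence over the original, divided by the number of embedded payload bits (the function name and this interpretation are ours, not necessarily the paper's exact formula):

```python
def embedding_cost(bits_original: int, bits_processed: int, payload_bits: int) -> float:
    """Extra coded bits spent per embedded payload bit (assumed definition).

    bits_original  -- bit stream size of the unmodified sequence
    bits_processed -- bit stream size after data embedding
    payload_bits   -- number of payload bits carried by the sequence
    """
    if payload_bits <= 0:
        raise ValueError("payload_bits must be positive")
    return (bits_processed - bits_original) / payload_bits
```

For instance, if embedding inflates the bit stream from 1000 to 1600 bits while carrying 300 payload bits, the cost works out to 2 bits per payload bit.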
In comparison, to embed one payload bit into the video, Aly et al.'s method for MPEG-2 video requires
bits, while Noorkami et al.’s method for H.264 video requires
bits [
39]. On the other hand, Mareen et al.’s embedding method implemented by using HEVC requires
bits [
15]. When
is increased, more bits are spent to embed each payload bit because split blocks might be coded (instead of a bigger block), which requires additional signaling to the decoder. In other words, the increase in bit stream size is due to data embedding (i.e., a sub-optimal mode is selected, hence a larger prediction error) as well as additional signaling to the decoder. For example, for
, the embedding cost falls in the range of
and
for the LDP and AI configurations, respectively. Similarly, for
, the ranges are
and
for the LDP and AI configurations, respectively.
Table 6 and
Table 7 record the embedding capacity for the proposed and conventional methods [
15,
20,
21,
30] using the same QP settings. It is observed that the proposed method (which uses multiple syntax elements) achieves higher embedding capacity in comparison to the conventional methods considered (which use a single syntax element). The embedding capacity is doubled for LDP when both the MVP and the intra prediction mode are utilized for data embedding. It is also observed that higher embedding capacity is achieved for the AI configuration. This is because more prediction modes are manipulated for data embedding in the proposed method.
A relative functional comparison is conducted by considering the conventional methods proposed for MPEG-2 [
19], H.264/AVC [
39,
40], H.264/SVC [
25,
31], and HEVC [
15,
16,
20,
22,
23,
24,
41].
Table 8 compares the aforementioned methods in terms of embedding venue, applicability to intra/inter picture prediction, embedding capacity, video quality and bit rate overhead. Here, a relative comparison is performed because the results and values reported by the respective authors in the existing works are collected by using different test videos, video coding standards, parameter settings, etc. The labels of high (H), moderate (M) and low (L) are context-dependent and they are relative in nature. Firstly, for embedding capacity, we assign one of the three labels depending on the number of available syntax elements in a video. Specifically, more syntax elements provide more venues for data embedding, and vice versa [
42]. While classical venues such as coefficients offer high embedding capacity, more recently identified venues (e.g., MVD in HEVC) generally offer less embedding capacity. Nonetheless, the proposed technique has the flexibility to embed different amounts of payload by adjusting
. When the required embedding capacity is low,
can be adopted so that the bit rate overhead and distortion can be reduced. On the other hand, more embedding capacity could be achieved by using a larger
so that each CTU can be split into smaller blocks. Note that the total number of smaller blocks in a CTU is determined by the RDC, and it is guided by
so that block-splitting will not be performed when the cost is higher than
.
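The threshold-guided block splitting described above can be sketched as a simple decision rule. This is a hypothetical reconstruction: the parameter name `beta` and the comparison against a scaled RD cost are our assumptions, since the actual symbol and rule were lost from this text:

```python
def decide_split(rd_cost_no_split: float, rd_cost_split: float, beta: float) -> bool:
    """Hypothetical RDC-guided split decision.

    Splitting a block creates more (smaller) PBs and hence more embedding
    venues, but at a higher RD cost. The tunable knob `beta` allows the
    split only while its cost stays within a factor of the unsplit cost,
    so block-splitting is not performed when the cost is too high.
    """
    return rd_cost_split <= beta * rd_cost_no_split
```

With a small `beta` the encoder keeps the RDO-suggested block sizes (low overhead, low capacity); a larger `beta` admits more splits (higher capacity, higher bit rate overhead), matching the trade-off discussed above.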
Next, we analyze the impact of data embedding on the quality of the video. It is observed that the quality of the video may be affected depending on when (i.e., at which stage) data are embedded within the video encoding process, as well as on the embedding capacity. For example, when data embedding takes place before the computation of the prediction error/residual, such as when using the intra prediction mode or MV, the residual is obtained by using the new predicted value; therefore, the quality of the video can be maintained, although the residual is slightly larger. On the other hand, manipulating the coding block structure may affect the selection of the optimal prediction block size and prediction mode, which indirectly causes a higher prediction error. Therefore, the quality of the video is affected. Similarly, manipulating the transformed coefficients causes the entire block of pixel values to change. The perceived quality of these two methods is then further affected by the quantization process applied to the prediction residuals. Similar findings are observed in related works; for example, the perceptual quality degradation for the intra prediction mode and merge mode is less than
dB [
15,
16,
40] while MVD [
19] is less than
dB. Since the quality degradation is less than 1 dB, which is negligible, we label the quality as 'H'. In addition, it is observed that the quality degradation for coefficient-based methods [
25,
39] and coding block structure-based methods [
22,
23,
24] falls in the range of
dB and
dB, respectively. Since the quality is slightly inferior in comparison to the former, these methods are labeled 'M'.
Subsequently, we analyze the impact of data embedding on the bit rate based on these works. Generally, when more bits are embedded into a video sequence, the bit rate overhead is higher. Besides that, more bits are required to code the non-optimal block structure or predicted values, as well as the additional syntax elements resulting from data embedding. Therefore, when more changes to the syntax elements are made, or more syntax elements are coded due to data embedding, the bit rate overhead increases accordingly. These findings are consistent with our observations. Specifically, it is observed that the number of MVDs is relatively low as compared to that of other syntax elements, and the bit rate overhead incurred is less than
[
A coding block and its neighboring blocks are usually correlated with each other, and they often contain the same moving object with similar motion. Hence, the MV obtained after mapping the merging block to the payload bit is likely to be maintained. Therefore, the bit rate overhead after manipulating the merge mode can be maintained at a relatively low value (e.g.,
as reported in [
43]). Since the bit rate overhead is maintained at a minimum level, these methods are labeled as ‘L’. On the other hand, the bit rate overhead caused by using the intra prediction mode to embed data in [
15] is capped at
, while the bit rate overhead caused by manipulating coding block structure falls in the range of [5, 5.5] in Yang et al.’s method [
24]. In addition, the bit rate overhead caused by manipulating coefficient falls in the range of [1.4, 5.0] in Noorkami et al.’s method [
39] and Buhari et al.’s method [
25]. Since the bit rate overhead for these methods is relatively higher in comparison to the former methods, they are labeled as ‘M’.
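As an illustration of why merge-mode embedding can keep the bit rate overhead low, the sketch below maps a payload bit to the parity of the coded merge index, so a near-optimal candidate usually still carries the bit. This parity rule is our own simplified example, not the exact mapping used in the cited works:

```python
def pick_merge_candidate(rd_ordered_candidates, payload_bit):
    """Return the index of the merge candidate chosen to carry `payload_bit`.

    `rd_ordered_candidates` is assumed to be sorted by ascending RD cost.
    The encoder picks the cheapest candidate whose index parity equals the
    payload bit, so the decoder can recover the bit from the coded merge
    index. Neighboring candidates have similar motion, so the cost penalty
    of moving one position down the list is typically small.
    """
    for idx in range(len(rd_ordered_candidates)):
        if idx % 2 == payload_bit:
            return idx
    return None  # no suitable candidate: the bit is not embedded here
```

When the bit matches the parity of the best candidate (index 0), the optimal choice is kept unchanged, which is why the overhead stays low on average.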
In general, all considered methods embed data during the encoding process. However, depending on the eventual mode of encoding of a block (i.e., MB in H.264 or CU in HEVC/SHVC), most existing methods may or may not be able to embed data. For example, an intra picture prediction-based method will not be able to embed data into a block when the inter picture prediction mode turns out to be more cost effective (hence the block is coded using the inter prediction mode), and vice versa. On the other hand, the proposed method is able to embed data into both intra and inter predicted blocks. In other words, one mode is eventually chosen (i.e., intra or inter), and in either case, data can be embedded by the proposed method. Similar to the proposed method, Tew et al.'s method [
22] can embed data in each CU since block size is exploited, while Shanableh [
23] offers the same capability by exploiting the block-splitting flag.
Since technology constantly advances to newer standards, we also evaluate the feasibility of applying the conventional data embedding methods to the SHVC standard. The MVD in [
19] is not feasible for data embedding in EL when an inter-layer reference picture is utilized in EL, because all MVs are set to
zero to fulfil the requirement of bit stream conformance in EL(s) [
44]. On the other hand, when the fast encoder mode setting is enabled, only blocks with size
are coded, hence any method that is based on block partitioning (e.g., [
23]) can only be applied to BL, unless the fast encoder mode is disabled. Likewise, although a coefficient-based method can be easily adopted in SHVC, the propagation of errors should be handled carefully to avoid poor perceptual quality for inter-layer predicted pictures.
All in all, in comparison to AVC-based and HEVC-based methods, the achievable embedding capacity in scalable coded video by the proposed method is encouraging because the syntax elements in all coded layers can be exploited for data embedding purposes. Furthermore, some existing techniques can be combined with our work to achieve better trade-off among embedding capacity, quality, and bit rate. For example, the PB embedding technique can be applied to BL while the MVD, coefficient and merge mode can be applied to all layers to achieve a higher combined embedding capacity. This trade-off will be further investigated as part of our future work.
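The suggested combination can be sketched as a simple per-layer capacity budget, where PB-based embedding contributes only in the base layer while the other venues contribute in every coded layer. All names and figures below are hypothetical placeholders, not measured values:

```python
def combined_capacity(bl_pb_kbps: float, per_layer_kbps: dict) -> float:
    """Total embedding capacity (kbps) of the combined scheme.

    bl_pb_kbps     -- capacity of PB-based embedding, applied to BL only
    per_layer_kbps -- {layer: {venue: kbps}} for venues usable in all
                      layers (e.g., MVD, coefficient, merge mode)
    """
    all_layer_total = sum(
        sum(venues.values()) for venues in per_layer_kbps.values()
    )
    return bl_pb_kbps + all_layer_total

# Example with made-up figures: PB in BL plus MVD/coefficient venues
# spread across BL and one enhancement layer.
total = combined_capacity(
    100.0,
    {"BL": {"mvd": 10.0, "coef": 20.0}, "EL0": {"mvd": 5.0}},
)
```

This additive view is only a first-order estimate; in practice the venues interact (e.g., forced splits change which merge candidates and coefficients exist), which is precisely the trade-off left for future work.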