#### *2.7. Aspects of Authenticity*

In [8], much attention is paid to the authenticity of MIDI files. There, velocities that increase or decrease with the pitch of the played tones are preserved. This earns the method a top score with regard to authenticity.

It is assumed that MIDI music is sequenced either by some automatic procedure, in which case all notes are given a constant velocity throughout the whole piece, or by being played by somebody, possibly at half speed, in which case the velocities span a broad interval and relatively few velocity values are exactly the same. The predecessor of the proposed method, Velody [15], therefore suffers a great deal in terms of authenticity upon inspection of the actual velocity values, since these velocities take only two values, in roughly equal proportions. Thus, close inspection of the velocities by a meticulous steganalyst would reveal a clear deviation from both stereotype patterns (a single constant velocity throughout the piece, or a wide variety of velocities).
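The two stereotype patterns, and the giveaway of the original Velody, can be illustrated with a small classifier over a song's velocity values (a hypothetical sketch of a steganalyst's check; the thresholds are our assumptions, not taken from any published tool):

```python
from collections import Counter

def classify_velocity_pattern(velocities):
    """Classify a note-velocity sequence into one of the stereotype
    patterns described in the text.

    Returns one of:
      "sequenced"  - a single constant velocity (automatic sequencing)
      "played"     - many distinct values over a broad interval
      "suspicious" - e.g. exactly two values in roughly equal proportion,
                     as produced by the original Velody method
    """
    counts = Counter(velocities)
    if len(counts) == 1:
        return "sequenced"
    if len(counts) == 2:
        a, b = counts.values()
        # Two velocity values in roughly equal proportion is the
        # giveaway of the original Velody method.
        if 0.4 <= a / (a + b) <= 0.6:
            return "suspicious"
    # Many distinct values spread over a broad range looks human-played
    # (the thresholds 8 and 16 are illustrative assumptions).
    if len(counts) >= 8 and max(counts) - min(counts) >= 16:
        return "played"
    return "suspicious"
```

A file matching neither stereotype pattern would deviate from both expected profiles and thus attract a steganalyst's attention.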

#### *2.8. Hypothesis Test of Audibility*

A steganography method has a serious drawback if it leaves audible marks in the music. To this end, a hypothesis test was carried out in which people listened to 10 pairs of musical pieces selected from a database of classical MIDI music (please refer to Appendix A for a description of the database). The songs selected were


The experiment was presented on a web page (see Figure 2) where all 10 pairs of musical pieces were playable via buttons embedded in the page. In each pair, one file was called File 1 and the other File 2; one was the original song and the other was the same song modified through Velody 2 with a message hidden inside. For each pair, the listener was instructed to guess, using radio buttons, which of the files contained steganography. Each pair consisted of two MIDI songs converted to FLAC format so that they would play consistently on different computers. Thus, the test was devoted solely to finding audible differences between the original song and the corresponding stego song; it did not involve inspection of event lists or any other kind of analysis. The number of respondents who contributed to this experiment was 30. At the top of the page, apart from stating their name, each participant had to enter a 7-character code, which ensured that they had been given instructions about what to do, reduced the risk of the same person taking the test multiple times, and made it possible to differentiate between groups of respondents in retrospect.


**Figure 2.** The experiment was carried out by having people listen to pairs of MIDI songs, played by pressing buttons in a web form, and indicate by means of radio buttons which of the two alternatives included steganography.

#### *2.9. Relevance of Power of a Hypothesis Test*

If there were a detectable difference in the songs after steganography had been performed, this would show up as an increased probability *π* of correctly guessing which song included steganography, i.e., *π* > 1/2. If, on the other hand, there were no signs of steganographic manipulation at all, the probability *π* would equal 1/2. Now, a hypothesis test can only prove the alternative, which in this case is *π* > 1/2, as opposed to the null hypothesis *π* = 1/2. The claim that there is no audible effect of Velody can never be proved by a hypothesis test; if the null is accepted, this just means that an effect could not be proved.

However, one might hypothesize that if there were audible effects due to the steganography method, these effects would have to be so large that they resulted in a rejection of the null hypothesis. A larger number of respondents would increase the power of the test, i.e., the ability to prove an effect even if *π* were only a little larger than 1/2.
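The relation between power, the deviation of *π* from 1/2 and the number of guesses can be made concrete with a short calculation, using only the Python standard library (a sketch; the one-sided 5% significance level is our assumption, since the text does not state one):

```python
from math import comb

def binom_sf(n, p, c):
    """P(S >= c) for S ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

def power(n, pi, alpha=0.05):
    """Power of the one-sided binomial test of H0: pi = 1/2 against
    H1: pi > 1/2, with n guesses in total (respondents x questions)."""
    # Smallest critical value c with P(S >= c | H0) <= alpha.
    c = next(c for c in range(n + 1) if binom_sf(n, 0.5, c) <= alpha)
    # Probability of rejecting H0 when the true probability is pi.
    return binom_sf(n, pi, c)
```

With 30 respondents and 10 questions (300 guesses in total), the power at *π* = 0.6 exceeds 0.95, whereas with only 100 guesses it is considerably lower.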

#### *2.10. Robustness*

Another aspect of importance is robustness, as considered by, e.g., Lang et al. [13]. Currently, the proposed method, Velody 2, is not implemented with any support for improving robustness. Embedding the hidden message with redundancy in the carrier would not contribute to robustness, since the MIDI file would not play, and there would not even be an event list, if as much as one bit failed. However, one could simply send the stego MIDI file with redundancy, i.e., send the same stego MIDI file multiple times. In addition, the receiver could be assumed to have several alternative addresses, so the transmission could be made to many different addresses. This way of increasing robustness is claimed to arouse minimal suspicion. Still, this is recommended behaviour rather than a part of the Velody 2 method.

#### **3. Results**

The results are divided into those regarding security aspects, mainly steganalysis resilience aspects, such as whether the steganography leaves audible revealing footprints or other changes which may catch the attention of an alert steganalyst, and those regarding capacity aspects, such as embedding capacity, the number of bits per event and the file-size change rate, following the definitions by Liu and Wu [7].

#### *3.1. Steganalysis Resilience*

The Velody 2 method proposed in this paper is based on a slight extension of a velocity LSB embedding algorithm, but with an embedding strategy that tries to mimic the output of the "Humanization" feature available in MIDI tools such as Ableton. The strategy does not inflate the file size and has a minuscule performance effect. Moreover, a statistical experiment has shown that the performance effect introduced is most likely not detectable by a human listener. A summary of these methods and their resilience in these respects is given in Table 1.
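The idea of combining velocity LSB embedding with humanization-style randomization can be illustrated with a minimal sketch. This is an illustration of the general technique only, not the published Velody 2 algorithm; the `spread` parameter and the clamping bounds are our assumptions:

```python
import random

def embed(velocities, bits, spread=8, seed=None):
    """Embed one message bit per note in the velocity LSBs, while
    scattering velocities within a narrow interval around the original
    value, mimicking the 'Humanize' feature of MIDI tools."""
    rng = random.Random(seed)
    out = list(velocities)
    for i, bit in enumerate(bits):
        base = velocities[i]
        # Pick a random velocity near the original, clamped so the
        # result stays a valid note-on velocity (2..127) after LSB forcing.
        v = min(126, max(2, base + rng.randint(-spread, spread)))
        # Force the least significant bit to carry the message bit.
        out[i] = (v & ~1) | bit
    return out

def extract(velocities, n_bits):
    """Recover the embedded bits from the velocity LSBs."""
    return [v & 1 for v in velocities[:n_bits]]
```

Extracting the LSBs of the stego velocities recovers the message, while each velocity stays within a narrow interval around its original value, so the result resembles humanized output rather than an artificial pattern.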

**Table 1.** Summary of properties regarding resilience to steganalysis and to what extent these are satisfied by the different methods considered.


From this table, the method of Inoue, Suzuki and Matsumoto [3] comes out best, since it has no performance effect at all and leaves a minimal amount of artificial traces while still not contributing to inflation. Other good methods are Liu and Wu [7], Wu, Hsiang and Chen [8], Yamamoto and Iwakiri [5] and Velody 2, the proposed method, which are considered to suffer from only one of the shortcomings. Depending on which of these properties is more important, these methods could be preferable in different situations.

In attempting to make a steganography method resilient, it is important to leave as few traces as possible of manipulation of the carrier when performing the data hiding.

For instance, drastic changes of properties such as the Mean Absolute Error (MAE), the Peak Signal-to-Noise Ratio (PSNR) or the file-size change rate (*Fr*) in the information-hiding process are shortcomings of a method with respect to steganalysis resilience. In Table 2, values of these quantities are given for a few methods. Here, it turns out that Velody 2 (the proposed method) and the method of Inoue, Suzuki and Matsumoto [3] are preferable with respect to the file-size change rate. Values of MAE and PSNR could not be compared, since no such values were found for the other methods in the literature.

**Table 2.** Table of Mean Absolute Error (MAE), Peak Signal-to-Noise Ratio (PSNR) and file-size change-rates (*Fr*) for the methods considered. The file-size change-rates are as defined by Liu and Wu [7] and briefly explained in the text. Values within parentheses are standard errors.


<sup>1</sup> According to Liu and Wu [7]. <sup>2</sup> This is the WSNR, which is different from PSNR but still a variant of a signal-to-noise ratio.

#### *3.2. Audibility*

There were 30 respondents to the audibility form, which consisted of telling, for 10 different MIDI songs, which of two alternatives of the same song contained steganography. The 10 songs were as listed in Section 2.8, and the results of this experiment are illustrated in the bar charts in Figure 3.

**Figure 3.** Barplots of the distribution of guesses divided into the 10 songs in the audibility experiment. A total of 30 respondents guessed, for each of the 10 pairs of songs, which one was steganography. The bars show the number of incorrect guesses in turquoise and the number of correct guesses in orange for each pair of songs.

We tested the hypothesis that the suggested method left audible traces using a test of whether the probability *π* = *P* (a respondent can tell the stego MIDI file apart from the carrier MIDI file) exceeds 0.5, against the null hypothesis that *π* = 0.5 (corresponding to the respondent choosing one of the alternatives entirely at random).

Letting each guess be coded as 0 if it is wrong and 1 if it is right, a binomial test is an obvious choice, since the sum *S* of correct guesses over all pairs of MIDI songs and all respondents is a sum of 0-1 variables which, assuming independence between songs and respondents and that all guesses are correct with equal probability *π*, is binomially distributed with parameters *N* and *π*. With 30 respondents signing up for the experiment and 10 pairs of songs, *N* = 300. In total, 153 correct guesses gave the *p*-value of the binomial test

$$P(S > 153 \mid H\_0) = \sum\_{k=154}^{N} \binom{N}{k} \left(\tfrac{1}{2}\right)^{k} \left(\tfrac{1}{2}\right)^{N-k} \Big|\_{N=300} = 0.3431$$

which is well above any reasonable level of significance, i.e., there are no indications that one stands a better chance of guessing which of the songs is steganography after listening to both the MIDI carrier and the stego MIDI file.
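This *p*-value can be reproduced with a few lines of code, using only the Python standard library (*N* = 300 guesses of which 153 were correct, as stated above):

```python
from math import comb

def binom_pvalue(n, s_observed, p=0.5):
    """One-sided p-value P(S > s_observed | H0) for S ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(s_observed + 1, n + 1))

# 30 respondents x 10 song pairs = 300 guesses, 153 of them correct;
# this evaluates to approximately 0.343.
p_value = binom_pvalue(300, 153)
```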

The more common test for this kind of question is a *χ*²-test. Letting each possible number of correct guesses per respondent constitute a class, the number of respondents in each class was calculated. Under the assumption of independence between guesses, and that each song was correctly guessed to be steganography with probability *π*, the total number *X* of correct guesses for one respondent could be subjected to a *χ*²-test of whether *X* is binomially distributed with parameters *n* = 10 and *π* = 0.5 or not. After merging classes so that the expected number of observations in each class exceeded 2, there were 5 classes, and the test statistic turned out to be 5.8252, rendering a high *p*-value of 0.8171.
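The class merging can be reconstructed from the Bin(10, 0.5) probabilities alone. The sketch below (merging inward from the tails is our assumption about the procedure) recovers the 5 classes mentioned above, each with an expected count above 2:

```python
from math import comb

# Expected number of respondents (out of 30) with exactly k correct
# guesses out of 10, under H0: X ~ Bin(10, 0.5).
expected = [30 * comb(10, k) / 2**10 for k in range(11)]

def merge_tails(expected, min_expected=2.0):
    """Merge adjacent classes from both ends until every class has an
    expected count above the threshold."""
    classes = list(expected)
    # Merge from the low end.
    while len(classes) > 1 and classes[0] <= min_expected:
        classes[0:2] = [classes[0] + classes[1]]
    # Merge from the high end.
    while len(classes) > 1 and classes[-1] <= min_expected:
        classes[-2:] = [classes[-2] + classes[-1]]
    return classes
```

Applied to the expected counts above, this merges the tail classes {0, 1, 2, 3} and {7, 8, 9, 10}, leaving 5 classes in total, consistent with the text.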

So, what does it mean that the null hypothesis *X* ∈ *Bin*(10, 0.5) cannot be rejected? Certainly, it does not prove that *X* ∈ *Bin*(10, 0.5) and that *π* = 0.5, which would correspond to respondents being unable to tell the stego MIDI file apart from the carrier file; it only means that no deviation of the distribution of *X* from *Bin*(10, 0.5) could be found. How large would that deviation have had to be for the hypothesis test to detect it? That question is answered by looking at the power of the test, as illustrated in Figure 4. From these curves, it is clear that for deviations of *π* of about 0.08 from the null hypothesis value 0.5, the power is clearly greater than 0.95, meaning that the test would most likely have shown a significant difference in this case. To detect even smaller deviations from 0.5 with that great a power, a larger sample size is needed.

**Figure 4.** To the left: Power curve of the binomial test of deviation from the value 0.5 of the parameter *π* = *P* (a respondent can tell the stego MIDI file apart from the carrier MIDI file). To the right: Power curve of the *χ*²-test of deviation from the binomial distribution with parameters *n* = 10 (since there are 10 questions in the experiment) and *π*. For both tests, the power depends on the sample size, i.e., the number of respondents in this case. Here, the sample size was 30, as indicated by the red curves; had it been 10 the curves would have been as indicated in green, and had it been 100, as indicated in blue.

#### *3.3. Capacity*

For the evaluation of steganography methods, Liu and Wu [7] define several variables related to capacity: the number of bits that can be encoded into the carrier MIDI file, referred to as the embedding capacity *Nc*; the total number of embedded bits divided by the size of the carrier MIDI file before encoding, referred to as the embedding rate *Er*; the total number of embedded bits divided by the total number of events in the carrier MIDI file, referred to as the number of bits/event *Ne*; and the absolute change in size of the carrier MIDI file before and after encoding divided by the size before encoding, referred to as the file-size change rate *Fr*. Following their initiative, the suggested method is evaluated according to these key performance indicators and compared to the corresponding values of other methods. It is assumed that the full embedding capacity is used for encoding. See Table 3 for a comparison of a selection of steganography methods. As Table 3 shows, the Velody 2 strategy does not limit the number of available events until *Ne* approaches 6 bits/event, and even then the reduction is small.
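Written out directly from these definitions, the measures can be computed as follows (a sketch with our own variable names; file sizes in bytes):

```python
def capacity_metrics(n_embedded_bits, carrier_size_bytes,
                     stego_size_bytes, n_events):
    """Compute the capacity measures defined by Liu and Wu:
    embedding rate Er, bits per event Ne and file-size change rate Fr."""
    er = n_embedded_bits / carrier_size_bytes
    ne = n_embedded_bits / n_events
    fr = abs(stego_size_bytes - carrier_size_bytes) / carrier_size_bytes
    return {"Er": er, "Ne": ne, "Fr": fr}
```

For a method that, like Velody 2, leaves the file size unchanged, *Fr* evaluates to 0.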


**Table 3.** Table of averages of capacity properties: the number of embedded secret bits *Nb*, number of bits per event *Ne* and embedding rate *Er* as defined by Liu and Wu [7] and briefly explained in the text. Values within parentheses are standard errors.

<sup>1</sup> According to Liu and Wu [7]. <sup>2</sup> According to Inoue, Suzuki and Matsumoto [3].

It can also be seen that *Er* is high compared to other techniques. While the Velody 2 strategy most likely does not outperform the work of Wu, Hsiang and Chen [8] when it comes to the deviation in average velocity, this is of little practical consequence, since according to the statistical experiment this seems hard to detect when listening, and since there exist tools that create exactly this type of deviation as a part of the music production process. The suggested strategy of Liu and Wu [7] achieves a fairly good *Ne*, but this comes at the cost of a low *Er* value due to the file-size expansion introduced by the strategy. The method proposed by Inoue, Suzuki and Matsumoto [3] achieves a good *Er* value with no inflation, performance effects, or obvious artefacts. However, the optimal performance of this elegant strategy is still outperformed by the average performance of the Velody 2 method.

#### **4. Discussion and Conclusions**

A novel MIDI steganography method, called Velody 2, has been presented. Its capacity turns out to be on par with the highest-capacity methods available in the literature, while still leaving few traces of manipulation in terms of the Mean Absolute Error (MAE) and the Peak Signal-to-Noise Ratio (PSNR). It also causes no inflation, and an experiment was carried out verifying that there are no signs of audible traces.

With many methods suggested in the literature, the MIDI code shows artificial patterns that would be unlikely, or even impossible, to produce by generating the MIDI file merely by automatic sequencing from sheet music, or by entering a MIDI song by playing on a keyboard and possibly modifying it slightly afterwards. Examples of such artefacts are extra data not normally occurring in a MIDI song (as is the case with the padding in Liu and Wu [7]); simultaneous note events (so-called simulnotes) that may occur in any order without sounding different or causing inflation, but where the sequencer always puts these events in a certain order, so that deviation from that pattern should arouse suspicion (in Inoue, Suzuki and Matsumoto [3]); and only two values of velocities (as in Vaske, Weckstén and Järpe [15]). Such artificial giveaways are perfect signals to a steganalyst searching for indications of suspect MIDI steganography.

In the suggested method, velocities are scattered randomly within a narrow interval to be specified. This could just as well have been the result of playing the piece on a keyboard, or of humanization using MIDI software that supports randomization of the velocities. Of course, the mean level is constant and not drifting, as it is in a majority of MIDI music. Still, for harpsichord and organ music this poses no problem at all, and even if this is a strong restriction, MIDI music is abundant within this subgroup. An audibility hypothesis test was carried out to see if there were audible traces in the stego MIDI files compared to the carrier files. The *p*-values here were 0.3431 (binomial test) and 0.8171 (chi-square test), meaning that no signs of steganography could be detected. If the music is entered by playing the piece on a keyboard, this is also likely to cause the starting times and durations of notes to fluctuate slightly. Thus, if the velocities are scattered while starting times and durations are not, this might be considered an unrealistic artefact of the method. However, this is not at all unrealistic, taking into account that the composer could well have quantized the notes after having made the keyboard recording, a very common facility in many MIDI sequencer programs. In that case, the notes would be perfectly aligned with the measures and bars, but the velocity differences would remain.

Another aspect of footprint is the file-size change rate (*Fr*). For Velody 2 this is 0, i.e., there is no change of the file size at all. This makes it optimal in this respect, together with the method of Inoue, Suzuki and Matsumoto [3], and slightly better than Wu, Hsiang and Chen [8]. In addition, statistical estimators such as the Mean Absolute Error (MAE) and the Peak Signal-to-Noise Ratio (PSNR) of the various kinds of MIDI events in the music may be interesting from a steganalysis perspective. When having to check a vast amount of MIDI music for suspect features, a steganalyst is unlikely to be able to go through each MIDI song's event list to check for revealing footprints such as those mentioned above, unless this can be fully automated. Instead, the search is likely to build on summarizing characteristics such as the MAE and PSNR of different MIDI events, which could be retrieved systematically in an automated process. Thus, steganography methods which stand out in such a listing are likely to be scrutinized for further indications of steganography. For the proposed method, averages of MAE, ranging from 6.27 to 25.37, and PSNR, ranging from 12.62 to 24.64, were calculated. Corresponding values for other methods would have to be calculated to compare methods; this, however, remains a task for future studies.
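For two equal-length sequences of event values, such as velocities, the MAE and PSNR can be computed as follows (a sketch; using the MIDI maximum velocity 127 as the peak value is our choice):

```python
from math import log10

def mae(carrier, stego):
    """Mean Absolute Error between carrier and stego event values."""
    return sum(abs(c - s) for c, s in zip(carrier, stego)) / len(carrier)

def psnr(carrier, stego, peak=127):
    """Peak Signal-to-Noise Ratio in dB, with the MIDI velocity
    maximum 127 as the assumed peak value."""
    mse = sum((c - s) ** 2 for c, s in zip(carrier, stego)) / len(carrier)
    if mse == 0:
        return float("inf")  # identical sequences: no noise at all
    return 10 * log10(peak**2 / mse)
```

Such summary statistics are exactly the kind of characteristics an automated steganalysis sweep could extract from large collections of MIDI files.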

It could be argued that a steganography method whose output has the reversibility property, i.e., where the carrier MIDI file can be restored from the stego MIDI file, most likely does not, by definition, provide plausible deniability. The reason is that, since the carrier MIDI file can be recreated from the stego MIDI file, there has to be extra information in the file that can be removed.

Velody 2 was not developed with robustness in mind and therefore does not include any steps to improve its robustness. Such development remains a possibility for future studies.

To further investigate resilience to steganalysis, the steganography methods could be subjected to the most common steganalysis tools. This has been done for audio steganography [17] and possibly other kinds of steganography [18], and investigating how successful the procedures in these papers are is an important step towards properly establishing the ability of steganography methods to withstand the attempts made by steganalysts. Suggestions for improvements of future experiments include increasing the number of respondents, as well as increasing the share of respondents who have training in playing and listening to music. The experiment itself could be improved by generating a large pool of carrier MIDI files and related stego MIDI files, from which each experiment randomly generates a unique set of questions. This would reduce the opportunity for collusion leading to test bias.

**Author Contributions:** Formal analysis, M.W.; Investigation, E.J. and M.W.; Methodology, E.J.; Software, E.J. and M.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** The research leading to the results reported in this work received funding from the Knowledge Foundation in the framework of SafeSmart Safety of Connected Intelligent Vehicles in Smart Cities Synergy project (2019–2023), grant number F2019/151.

**Acknowledgments:** The authors wish to extend their sincere gratitude to all respondents in the experiment which provided data for the audibility test; their names are listed in Appendix A below.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

Data and source code are available at https://github.com/wecksten/Velody-2.

The respondents in the audibility experiment were: E. Spennare, T. Holtzberg, P. Wärnestål, M. Dougherty, A. Galozy, A. Olsson, M.R. Bouguelia, J. Johansson, O. Andersson, M.A. Rasool, O. Engelbrektsson, F. Johansson, S. Nilsson, M. Blom, T. Svane, J. Elmlund, A. Alabdallah, S. Lindberg, M. Cooney, A. Stefanescu, L. Wandel, K. Eldemark, M. Menezes, E. Gustafsson, N. Benamer, K. Raats and V. Prgomet. Thank you all! Without you the experiment could not have been carried out.

### **References**

