Sonic Watermarking Method for Ensuring the Integrity of Audio Recordings
Abstract
1. Introduction
- Methods for authenticity checking: these aim to determine the date and time of a recording, the device that was used to make it, etc.
- Passive methods: these investigate the artifacts caused by editing operations, such as double compression [3], reverberation changes or, in the case of copy-move operations, the consequences of the editing (e.g., finding that certain words are uttered identically in the recording, which is improbable in reality). The great advantage of these methods is that they can be used to analyse any recording. However, the degree of certainty of the result is not high enough to consider them reliable. For example, double compression occurs when audio recordings are edited, but its presence cannot show whether the meaning of the message was modified;
- Active methods: these rely on the insertion of auxiliary data during the recording. The data are extracted in the integrity check process and compared with the reference data that were inserted when the recording was captured. If the extracted data match the reference, it can be concluded that the integrity of the recording was preserved. These methods offer a much greater degree of certainty about the result, but only recordings that contain the auxiliary data can be checked.
2. Principles for Developing a Suitable Watermark
- The watermark should be embedded at the time of recording: embedding the watermark in a recording at a later moment can be considered a malicious operation and contested. This principle imposes computational complexity limits on the embedding method because it must work in real time;
- The watermark should be captured by all the recording devices that are near one another: all the participants in a discussion can carry their own recording device and, if the watermark is embedded only in some of the recordings, a dispute about which recording is authentic can start;
- The embedding of the watermark should not be noticeable: if people know that a watermarking process is running, they could be reserved in their declarations and the flow of the conversation could be affected. Thus, the process should be kept secret and the embedding operations should not raise suspicion;
- The watermark should be secure: only authorized persons should be able to extract the watermark.
3. Materials and Methods
3.1. Signals Involved in the System
3.1.1. The Speech Signal
3.1.2. The Ticking Sound
3.1.3. The Chirp Signals
3.2. Method for Generating the Sonic Watermark
3.2.1. The Data Input Blocks
- The ENF measurement system: provides the primary data for the sonic watermark. The proposed system receives the value of the ENF averaged over one second, every second. The measurement should be made with fine resolution, at least Hz. Systems that offer Hz exist [6]. In the design process of the proposed system, the resources were allocated according to these finer measurements. This approach allows the system to work also with ENF measurement devices that offer coarser measurements ( Hz resolution). According to the power grid specifications in [24], the ENF varies in the range of 50 ± 0.2 Hz. If the ENF is measured with a resolution equal to 0.001 Hz, 401 values are obtained (from 49.8 Hz to 50.2 Hz with a 0.001 Hz step size), which can be binary encoded using 9 bits. The ENF measurement is not the main subject of this paper, which is why those systems are not detailed here. They are well documented in [25]. For this work, the ENF values measured every second during the year 2015 were analyzed, because they were made available by [6]. Their histogram was computed and is illustrated in Figure 6. It can be observed that the data follow a normal distribution with the estimated mean equal to 50 Hz, and the standard deviation equal to Hz.
- The room identification number (room ID): represents the secondary information carried by the sonic watermark. The room ID is an integer set once, when a sonic watermark system is installed in a room. It can be used later to determine precisely the room in which the investigated conversation took place. Because no two rooms should have the same identifier, this number should be assigned automatically by an external room management system (e.g., a server that holds a database of the rooms in which sonic watermarking systems are installed). The room ID is used as the seed of a pseudorandom binary sequence (PRBS) generator. The generator delivers one bit per second.
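The two data inputs can be illustrated with a short sketch. The first function reproduces the arithmetic above (401 levels, 9 bits). The second is a stand-in PRBS generator: the text does not specify the generator, so the 16-bit linear feedback shift register, its taps, and the names `enf_code_bits` and `prbs_bits` are assumptions made only for illustration.

```python
def enf_code_bits(f_min_hz=49.8, f_max_hz=50.2, step_hz=0.001):
    """Return (number of ENF levels, bits needed to index them)."""
    levels = round((f_max_hz - f_min_hz) / step_hz) + 1
    bits = (levels - 1).bit_length()  # bits for indices 0 .. levels-1
    return levels, bits


def prbs_bits(room_id, count):
    """Hypothetical PRBS: 16-bit Fibonacci LFSR seeded with the room ID.

    Assumption: the feedback taps (x^16 + x^14 + x^13 + x^11 + 1) are a
    common maximal-length choice, not the generator used by the authors.
    """
    state = room_id & 0xFFFF
    if state == 0:
        raise ValueError("an LFSR seed must be non-zero")
    out = []
    for _ in range(count):
        out.append(state & 1)  # one output bit per second
        feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (feedback << 15)
    return out


levels, bits = enf_code_bits()
print(levels, bits)          # 401 9
print(prbs_bits(1234, 10))   # same room ID always yields the same bits
```

Because the sequence depends only on the seed, a verifier that knows the room ID can regenerate the expected bit stream and compare it with the one recovered from a recording.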
3.2.2. The Audio Signal Generator Blocks
- The ticking sounds player delivers sounds that imitate the ticking of a mechanical clock at time intervals equal to one second. The sounds are recorded from a real mechanical clock and then played in an infinite loop. Care must be taken when preparing the loop because it should have a duration equal to an integer number of seconds. In this way the sounds will seem natural and will not draw the attention of the people around them, even if listened to for long periods. The longer the recording is, the more natural it will sound, and the less likely it is to raise a listener's suspicion when played on repeat. The ticking sounds are used to mask the chirp signals that carry the ENF information.
- The chirp signals generator: synthesizes the chirp signals that encode, through their presence or absence, the value of the ENF received from the ENF measurement system. It is the most important block in the design and is described in detail below. This block is designed to maximize the probability of detecting the chirp signals when checking the integrity of an audio file that has the proposed sonic watermark embedded in it. The following principles were followed when the parameters of the chirp signals were designed:
- The chirp signals should not be removed by the audio recording devices. Therefore, they should be placed inside the audio bandwidth, between 20 Hz and 20 kHz. The farther away from these limits they are placed, the higher the chance that they will not be removed.
- The chirp signals should be able to be played using small speakers, because the overall dimensions of the sonic watermark generator should not exceed the dimensions of a table clock, given the scenario explained at the beginning of Section 3. Therefore, the chirp signals should be placed at high frequencies so speakers of small dimensions can play them.
- The chirp signals should not share bandwidth with other signals. It was shown in [23] that noise can cause false detections when matched filters are used as chirp signal detectors. The filter matched to a certain signal is given by the time-reversed and delayed version of that signal. According to [21,22], the upper limit of the speech signals' bandwidth is 8 kHz. Therefore, the chirp signals should be placed at frequencies above 8 kHz so that they do not overlap the bandwidth of the speech signals, minimizing the occurrence of false detections caused by the speech signals.
- The duration of the chirp signals should be similar to the duration of the ticking sounds so simultaneous auditory masking principles can be used. Therefore, a duration of 45 ms was used.
- The presence of the chirp signals should not be perceived by the human auditory system (HAS). Therefore, auditory masking principles should be used to minimize the probability that people detect them. The masker of the chirp signals is the ticking sound, which is naturally quiet. In these conditions, non-simultaneous auditory masking (i.e., forward masking) cannot be used because the masking threshold decreases very quickly after the masker signal disappears and it is lower than the level of the masker [26]. This leaves simultaneous masking, which exploits the organization of the HAS as an array of overlapping band-pass filters (i.e., auditory filters). The bandwidth of each filter is called the critical band [27] and it increases as the central frequency increases. The masking effect is stronger if the masker occupies most of the critical band and the masked signal only a small portion of it. This is likely to happen in the case of the proposed sonic watermark because the ticking sound (i.e., the masker) has a wide bandwidth, as shown in Section 3.1.2. This is an argument for using narrow-bandwidth chirp signals, placed at high frequencies where the critical bands are wider. To improve the auditory masking performance even more, only one chirp signal should be placed in each critical band. Auditory filters can be characterized using the equivalent rectangular bandwidth (ERB). In this convenient approach, the filters are treated as rectangular band-pass filters [28,29,30]. The ERB can be computed, for young listeners and moderate sound levels, according to [31], with the following formula:
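As a hedged illustration of this computation, the sketch below evaluates two published ERB approximations: the polynomial fit of Moore and Glasberg (1983) and the linear fit of Glasberg and Moore (1990), both of which appear in the reference list. Which of the two the authors applied is an assumption left open here, and the function names are hypothetical.

```python
def erb_polynomial_hz(f_hz):
    """ERB in Hz, Moore and Glasberg (1983) polynomial; f is used in kHz."""
    f_khz = f_hz / 1000.0
    return 6.23 * f_khz ** 2 + 93.39 * f_khz + 28.52


def erb_linear_hz(f_hz):
    """ERB in Hz, Glasberg and Moore (1990) linear approximation."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)


# Critical bands widen with centre frequency, which is the argument
# for placing narrow chirps above the speech band (over 8 kHz).
for f in (1000.0, 8000.0, 12000.0):
    print(f, round(erb_polynomial_hz(f), 1), round(erb_linear_hz(f), 1))
```

Both approximations grow with frequency, so a chirp of fixed bandwidth occupies a smaller fraction of the auditory filter the higher it is placed.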
3.2.3. The Signal Processing Blocks
- The SNR improvement block: consists of a filter bank with nine bandstop filters used to increase the chirp-to-tick sound SNR. It was detailed in [23] that noise in the same bandwidth as the chirp signals can cause false chirp detections if a matched filter followed by a threshold comparator is used as the detector. To minimize the probability of false chirp detections, the ticking sounds are filtered using the nine bandstop filters to remove their spectral content at the frequencies occupied by the chirp signals. Because the stop bands of those filters are very narrow compared to the corresponding critical bands, the filtering effect is inaudible. The magnitude response of this signal processing block is illustrated in Figure 8.
- The controlled delay line: is used to encode auxiliary information by changing the temporal distance between the ticking sounds by a small amount. The auxiliary information that is encoded is the room identification number. The input data of this block are the bits of the PRBS generator whose seed is equal to the room ID. The bits of the binary sequence switch the controlled delay line on or off (“1” → delay line active, “0” → delay line bypassed). When active, the block delays the input signal by ms. The signal at the input of this block is a mixture of the filtered ticking sound and the chirp signals corresponding to the current ENF value. The state of the delay line should be changed between ticks, to avoid audible artifacts. An example of the operation of the delay line is shown in Figure 9, along with the bits of the PRBS generator that are used to control it.
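The effect of the controlled delay line on the tick timing can be sketched as follows. The delay value is not given in this text, so the 5 ms used below is a placeholder assumption, as are the function names; `recover_bits` is only an illustrative inverse for a clean, uncut tick sequence, not the extraction method of Section 3.3.2.

```python
def tick_times(bits, delay_s, period_s=1.0):
    """Emission time of each tick: a "1" bit activates the delay line."""
    return [k * period_s + (delay_s if b else 0.0) for k, b in enumerate(bits)]


def recover_bits(times, delay_s, period_s=1.0):
    """Illustrative inverse: compare each tick with its nominal grid time."""
    return [1 if (t - k * period_s) > delay_s / 2 else 0
            for k, t in enumerate(times)]


DELAY_S = 0.005  # hypothetical placeholder; the actual delay is not stated here
bits = [1, 0, 1, 1, 0]
times = tick_times(bits, DELAY_S)
print(times)
print(recover_bits(times, DELAY_S) == bits)  # True
```

The sketch shows why the state should change between ticks: each tick is either wholly delayed or wholly undelayed, so only the interval between ticks carries the bit.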
3.3. Method for Extracting the Watermark
3.3.1. Extracting the Chirp Signals and the ENF Information
- The system can be remotely controlled to emit all the chirp signals with the next ticking sound, overriding the measured ENF value for that second and encoding a value equal to 50.255 Hz, which has a very low probability of occurrence in a real power grid. This produces maximum values at the output of the matched filters in the receiver. The threshold can be set slightly lower than the maximum value. It was determined through experiments that threshold values 5 dB lower than each maximum can be used.
- The detector can exploit the property that maxima in the output signal of the matched filters can occur at most once per second. The global maximum value of each signal can be found, and the threshold set 5 dB lower than it. After this, the temporal distances between the peaks that exceed the set threshold can be evaluated. If these distances are smaller than 1 s, it follows that the respective chirp signal was never sent and that the detected peaks are caused by noise and speech signals, because the threshold was set too low. In this case, all the bits corresponding to that chirp signal should be set to “0”.
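The second strategy can be sketched as a small check on the output of one matched filter. This is a minimal illustration under stated assumptions: the 5 dB margin is applied to amplitudes, neighbouring above-threshold samples are merged into a single peak, and a tolerance of 0.9 s is used instead of exactly 1 s to allow for the small deviations introduced by the delay line. The function name and parameters are hypothetical.

```python
def chirp_was_sent(mf_output, fs_hz, margin_db=5.0, min_spacing_s=0.9):
    """Return False when above-threshold peaks are closer than about 1 s."""
    mags = [abs(v) for v in mf_output]
    peak = max(mags)
    if peak == 0.0:
        return False
    # Threshold set margin_db below the global maximum (amplitude scale).
    threshold = peak * 10.0 ** (-margin_db / 20.0)

    peak_indices = []
    best = None  # index of the largest sample in the current run
    for i, m in enumerate(mags):
        if m >= threshold:
            if best is None or m > mags[best]:
                best = i
        elif best is not None:
            peak_indices.append(best)
            best = None
    if best is not None:
        peak_indices.append(best)

    gaps = [(b - a) / fs_hz for a, b in zip(peak_indices, peak_indices[1:])]
    return all(g >= min_spacing_s for g in gaps)


fs = 100  # hypothetical low sample rate, for illustration only
good = [0.05] * 300
for i in (50, 150, 250):
    good[i] = 1.0            # one peak per second
bad = [0.05] * 300
bad[50], bad[80], bad[120] = 1.0, 0.9, 0.8  # peaks far too close together
print(chirp_was_sent(good, fs), chirp_was_sent(bad, fs))  # True False
```

When the function returns False, all the bits associated with that chirp signal would be set to "0", as described above.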
3.3.2. Extracting the Pseudorandom Binary Sequence
3.4. Methods for Checking the Integrity of the Watermarked Audio Recordings
3.4.1. Identifying a Cut Region Larger Than One Second
3.4.2. Identifying a Cut Smaller Than One Second
3.5. The Effects of the Propagation of the Sonic Watermark Through the Room
4. Results and Discussion
- The watermark extraction performance as a function of two power ratios (PR): the PR between the voice signal and the sonic watermark, varied while the PR between the watermark's components is kept constant, and the PR between the watermark's components, varied while the PR between the voice signal and the sonic watermark is kept constant. This experiment was conducted in the two studied acoustic environments: a meeting room and a lecture room;
- The performance of detecting forgeries done using audio cut operations.
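For readers who want to reproduce such a sweep, a power ratio between two signals can be set as sketched below. The definition used (ratio of mean squared values, expressed in dB) is a standard one and is assumed here rather than taken from the text; the function names are hypothetical.

```python
import math


def mean_power(x):
    """Mean squared value of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def power_ratio_db(a, b):
    """Power ratio between signals a and b, in dB."""
    return 10.0 * math.log10(mean_power(a) / mean_power(b))


def scale_to_power_ratio(voice, watermark, target_db):
    """Scale the watermark so that PR(voice, watermark) equals target_db."""
    current_db = power_ratio_db(voice, watermark)
    gain = 10.0 ** ((current_db - target_db) / 20.0)
    return [gain * v for v in watermark]


voice = [1.0, -1.0, 1.0, -1.0]
watermark = [0.5, -0.5, 0.5, -0.5]
scaled = scale_to_power_ratio(voice, watermark, 20.0)
print(round(power_ratio_db(voice, scaled), 6))  # 20.0
```

The same helper can be applied between the ticking sound and the chirp signals to hold the ratio between the watermark's components constant while the voice-to-watermark ratio is varied.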
4.1. Watermark Extraction Performance
4.2. The Performance of Detecting Cut Operations
4.3. Subjective Tests and Investigation on the Effects of Non-Linear Distortions of Real Speakers
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Rodríguez, D.P.N.; Apolinario, J.A.; Biscainho, L.W.P. Audio authenticity: Detecting ENF discontinuity with high precision phase analysis. IEEE Trans. Inf. Forensics Secur. 2010, 5, 534–543.
- Cooper, A. Detecting Butt-spliced Edits in Forensic Digital Audio Recordings. In Proceedings of the 39th International Conference: Audio Forensics: Practices and Challenges, Hillerod, Denmark, 17–19 June 2010.
- Luo, D.; Yang, R.; Huang, J. Detecting Double Compressed AMR Audio Using Deep Learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 2669–2673.
- Grigoraș, C. Digital Audio Recording Analysis: The Electric Network Frequency (ENF) Criterion. Int. J. Speech Lang. Law 2005, 12, 63–76.
- Kajstura, M.; Trawinska, A.; Hebenstreit, J. Application of the Electrical Network Frequency (ENF) Criterion–A case of a Digital Recording. Forensic Sci. Int. 2005, 155, 165–171.
- Measurement of the Mains Frequency. Available online: www.mainsfrequency.com (accessed on 23 March 2020).
- Huijbregtse, M.; Geradts, Z. Using the ENF Criterion for Determining the Time of Recording of Short Digital Audio Recordings. In Proceedings of the IWCF ′09: Proceedings of the 3rd International Workshop on Computational Forensics, The Hague, The Netherlands, 13–14 August 2009; pp. 116–124.
- The Hum that Helps to Fight Crime. Available online: www.bbc.com/news/science-environment-20629671 (accessed on 7 May 2020).
- Power Grid Fluctuations Hidden in Audio Recordings Proved a Powerful Tool for Police Forensics. Available online: https://phys.org/news/2018-02-power-grid-fluctuations-hidden-audio.html (accessed on 7 May 2020).
- Asmara, R.A.; Agustina, R.; Hidayatulloh. Comparison of Discrete Cosine Transforms (DCT), Discrete Fourier Transforms (DFT), and Discrete Wavelet Transforms (DWT) in Digital Image Watermarking. Int. J. Adv. Comput. Sci. Appl. 2017, 8, 245–249.
- Dhar, P.K.; Shimamure, T. Blind audio watermarking in transform domain based on singular value decomposition and exponential-log operations. Radioengineering 2017, 26, 552–561.
- Lei, W.N.; Chang, L.C. Robust and high-quality time-domain audio watermarking based on low-frequency amplitude modification. IEEE Trans. Multimed. 2006, 8, 46–59.
- Erfani, Y.; Siahpoush, S. Robust audio watermarking using improved TS echo hiding. Digit. Signal Process. 2009, 19, 809–814.
- Basia, P.; Pitas, I.; Nikolaidis, N. Robust audio watermarking in the time domain. IEEE Trans. Multimed. 1998, 3, 232–241.
- Natgunanathan, I.; Xiang, Y.; Hua, G.; Beliakov, G.; Yearwood, J. Patchwork-Based multi-layer audio watermarking. IEEE Trans. Audio Speech Lang. Process. 2017, 25, 2176–2187.
- Xiang, S.J.; Li, Z.H. Reversible audio data hiding algorithm using noncausal prediction of alterable orders. EURASIP J. Audio Speech Music Process. 2017, 4.
- Hu, H.T.; Hsu, L.Y.; Chou, H.H. Variable-dimensional vector modulation for perceptual-based DWT blind audio watermarking with adjustable payload capacity. Digit. Signal Process. 2014, 31, 115–123.
- Hua, G.; Huang, J.; Shi, Y.Q.; Goh, J.; Thing, V.L. Twenty years of digital audio watermarking: A comprehensive review. Elsevier Signal Process. 2016, 128, 222–242.
- Nita, V.A.; Ciobanu, A. Tic-Tac, Forgery Time Has Run-Up! Live Acoustic Watermarking For Integrity Check in Forensic Applications. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15−20 April 2018; pp. 1977–1981.
- Dobre, R.A.; Preda, R.O.; Marcu, A.E. TIC-TAC Based Live Acoustic Watermarking With Improved Forgery Detection Performances. In Proceedings of the 2019 IEEE 25th International Symposium for Design and Technology in Electronic Packaging (SIITME), Cluj-Napoca, Romania, 23−26 October 2019; pp. 408–412.
- Byrne, D.; Dillon, H.; Tran, K. An International Comparison of Long-term Average Speech Spectra. J. Acoust. Soc. Am. 1994, 96, 2108–2120.
- 7 kHz Audio Coding within 64 kbit/s. ITU-T Rec. G.722. 1988. Available online: https://www.itu.int/rec/dologin_pub.asp?lang=f&id=T-REC-G.722-198811-S!!PDF-E&type=items (accessed on 12 May 2020).
- Othman, M.A.B.; Belz, J.; Farhang-Boroujeny, B. Matched Filter Bank for Detection of Linear Frequency Modulated Chirp Signals. IEEE Trans. Aerosp. Electron. Syst. 2017, 53, 41–54.
- ENTSO-E Policy 1: Load Frequency Control and Performance, Chapter A. Available online: https://erranet.org/wp-content/uploads/2017/02/Policy_1_final.pdf (accessed on 12 May 2020).
- Zhang, Y.; Penn, M.; Tao, X.; Lang, C.; Yanzhu, Y.; Zhongyu, W.; Zhiyong, Y.; Lei, W.; Jason, B.; Jon, B.; et al. Wide-area Frequency Monitoring Network (FNET) Architecture and Applications. IEEE Trans. Smart Grid 2010, 1, 159–167.
- Spanias, A.; Painter, T.; Atti, V. Audio Signal Processing and Coding; John Wiley & Sons: Hoboken, NJ, USA, 2007.
- Fletcher, H. Auditory Patterns. Rev. Mod. Phys. 1940, 12, 47–65.
- Gelfand, S.A. Hearing: An Introduction to Psychological and Physiological Acoustics, 6th ed.; CRC Press, Taylor & Francis Group: New York, NY, USA, 2018.
- Munkong, R.; Biing-Hwang, J. Auditory perception and cognition. IEEE Signal Process. Mag. 2008, 25, 98–117.
- Eggermont, J. Hearing Loss, 1st ed.; Academic Press: Cambridge, MA, USA, 2017.
- Moore, B.C.J.; Glasberg, B.R. Suggested Formulae for Calculating Auditory-filter Bandwidths and Excitation Patterns. J. Acoust. Soc. Am. 1983, 74, 750–753.
- Glasberg, B.R.; Moore, B.C.J. Derivation of Auditory Filter Shapes from Notched-noise Data. Hear. Res. 1990, 47, 103–138.
- Charles, M.B.; Cook, E. Radar Signals: An Introduction to Theory and Application, 1st ed.; Academic Press: Cambridge, MA, USA, 1967.
- Achim, H. Processing of SAR Data: Fundamentals, Signal Processing, Interferometry, 1st ed.; Springer: Berlin, Germany, 2004.
- Esquef, P.A.A.; Apolinario, J.A.; Biscainho, L.W.P. Edit Detection in Speech Recordings via Instantaneous Electric Network Frequency Variations. IEEE Trans. Inf. Forensics Secur. 2014, 9, 2314–2326.
- Gay, S.L.; Benesty, J. Acoustic Signal Processing for Telecommunication; Kluwer Academic Publisher: Boston, MA, USA, 2000.
- Benesty, J.; Paleologu, C.; Gänsler, T.; Ciochină, S. A Perspective on Stereophonic Acoustic Echo Cancellation; Springer-Verlag Berlin Heidelberg: Berlin, Germany, 2011; ISBN 978-642-22573-4.
- Paleologu, C.; Benesty, J.; Ciochină, S. Linear System Identification Based on a Kronecker Product Decomposition. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 11, 1793–1808.
- Karjalainen, M.; Antsalo, P.; Mäkivirta, A.; Välimäki, V. Perception of Temporal Decay of Low-frequency Room Modes. In Proceedings of the AES 116th Convention, Berlin, Germany, 8−11 May 2004; pp. 1–8.
- Zhang, J.; Shen, Y.; Jiang, B.; Li, Y. Sound Absorption Characterization of Natural Materials and Sandwich Structure Composites. Aerospace 2018, 5, 75.
- Freesound. Available online: www.freesound.org/ (accessed on 23 March 2020).
- Aachen Impulse Response Database. Available online: www.iks.rwth-aachen.de/en/research/tools-downloads/databases/aachen-impulse-response-database (accessed on 23 March 2020).
- Carnegie Mellon University. Available online: http://www.speech.cs.cmu.edu/cmu_arctic/cmu_us_bdl_arctic/wav/ (accessed on 23 March 2020).
Chirp Signal Identifier | Frequency Range (kHz) | Chirp Signal Identifier | Frequency Range (kHz) |
---|---|---|---|
Cut Duration | ||
---|---|---|
0 | 0 | |
0 | 1 | |
1 | 0 | |
1 | 1 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Dobre, R.-A.; Preda, R.-O.; Vlădescu, M. Sonic Watermarking Method for Ensuring the Integrity of Audio Recordings. Appl. Sci. 2020, 10, 3367. https://doi.org/10.3390/app10103367