Applied Sciences
  • Communication
  • Open Access

8 June 2023

Escaping Printable Encoded Streams to Embed Out-of-Band Data

and
Dipartimento di Informatica, Università degli Studi di Torino, Corso Svizzera 185, 10149 Torino, Italy
*
Author to whom correspondence should be addressed.

Abstract

In this paper, we propose to exploit the unused configurations of a printable encoding such as Base41, Base45 or Base85 to create a side channel that can store extra data such as error detection or correction codes, integrity verification and authentication information or application defined data. After introducing the encoding of binary octet strings in printable form, we present some case studies that show possible applications of the unused configurations.

1. Introduction

Printable string encoding of binary data is a well-known and widespread technique used by systems designed to manage only printable characters. In other words, it is an encapsulation method that lets systems able to treat only printable strings process any possible bit string transparently.
Typical examples are old mail servers that cannot directly accept binary data and the QR-code representation of the European Union Digital COVID Certificate.
To allow the storage and transmission of binary data with printable strings, many encodings have been proposed: in Section 3, some of these works will be recalled, emphasizing the use of the legal strings and computing the space of unused configurations that allow the embedding of additional information.
In general, the approach is to define an alphabet made of symbols that are used to compose strings, each one associated with a binary configuration to be encoded. For example, Base41 [1,2] uses three symbols from an alphabet of 41 printable characters to encode a pair of octets: in this case the possible printable strings are $41^3 = 68{,}921$, while the configurations of a pair of octets are $2^{16} = 65{,}536$, leaving $41^3 - 2^{16} = 3385$ free printable strings that, in general, are considered “illegal” but may be employed to store additional data such as a Cyclic Redundancy Check (CRC) value, a cryptographic hash, a Message Authentication Code (MAC), or a digital signature.
The main contributions of this paper are as follows:
  • Showing the redundancy present in some printable encodings;
  • Developing a methodology to embed extra data in a printable encoded stream;
  • Showing the effectiveness of embedding these data for security purposes;
  • Showing some other applications that leverage unused printable strings.
The following Section 2 will establish a uniform nomenclature and notation to be used throughout the paper, after which Section 3 will present some printable encoding methods and applications. Section 4 will give details on some of the encodings that allow the embedding of extra (i.e., payload) data and will present some methods for performing this operation. Section 5 will present a numerical analysis for some embodiments of the case studies introduced in Section 4 and Section 6 will discuss some conclusions on the proposed methodology and its applications.

2. Nomenclature and Notation

In this section, we briefly recall some nomenclature and notation to have a uniform and clear definition and representation of the entities involved in this paper.
A symbol or character is a graphical representation of an abstract or real entity or concept.
An alphabet is an ordered, finite size collection of distinct symbols.
A set is a collection of distinct items, or elements, that in the present context will be symbols or characters.
A sequence or string will refer to an ordered collection of symbols from an alphabet. In particular, a sequence of symbols from an alphabet having cardinality 2 is called a binary string. A string of characters is written within double quotes, e.g., “ABC” represents the string of the first three symbols of the Latin alphabet (capital letters).
To refer to instances of the previous entities, variables, or their properties, we denote them with the following rules:
  • Alphabet: uppercase boldface italic letter, e.g., $\boldsymbol{A}$;
  • Sequence of symbols from an alphabet: uppercase letter, e.g., $S$, $S_1$;
  • Set: uppercase calligraphic letter, e.g., $\mathcal{B}$;
  • Cardinality of a set: the function card counting the number of elements in a set, e.g., $\operatorname{card}(\mathcal{B})$;
  • Floor operation: the operator returning the largest integer not greater than its argument, e.g., $\lfloor \pi \rfloor = 3$;
  • Bijection between sets: the symbol ⇔, e.g., $\mathcal{A} \Leftrightarrow \mathcal{B}$;
  • Constant or single scalar value: lowercase italic letter, e.g., v.
Throughout this document, BaseYY will denote an encoding method based on an alphabet of YY symbols: for example, Base41 refers to an encoding method based on 41 symbols.

4. Printable Encodings and Case Studies of Payload Data Embedding

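Consider the set $\mathcal{B}$ of all binary strings of length $n$ bits; thus, $\operatorname{card}(\mathcal{B}) = 2^n$. Furthermore, having an alphabet $\boldsymbol{A}$ of $t$ printable symbols, compute the value $v$ such that:
$$v = \min \{\, k \mid 2^n \le t^k \,\} \qquad (1)$$
and define the set $\mathcal{S}$ of all sequences of $v$ symbols from the alphabet $\boldsymbol{A}$: obviously, $\operatorname{card}(\mathcal{S}) = t^v$.
Using $2^n$ different sequences from $\mathcal{S}$, it is possible to encode all the bit strings in $\mathcal{B}$ using only symbols from $\boldsymbol{A}$. It follows that there will be a subset $\mathcal{E}$ of $\mathcal{S}$ ($\mathcal{E} \subseteq \mathcal{S}$), whose elements are in one-to-one correspondence with the binary strings of $\mathcal{B}$, that is, there is a bijection between $\mathcal{E}$ and $\mathcal{B}$, $\mathcal{E} \Leftrightarrow \mathcal{B}$.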
Table 1 reports the characterizing values for some printable encodings.
Table 1. Some printable encodings with their main parameters and number of unused sequences.
The set $\mathcal{U} = \mathcal{S} \setminus \mathcal{E}$, which contains the unused sequences of $\mathcal{S}$, will have $\operatorname{card}(\mathcal{U}) = t^v - 2^n$. From Table 1, it may be observed that this set $\mathcal{U}$ is non-empty for Base41 [1], Base45 [5], Base85 [9], and Base91 [11].
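The quantities defined above can be recomputed with a short script. The following is a minimal sketch, not the authors' code: the function name `encoding_parameters` is hypothetical, and the block sizes (16 bits for Base41 and Base45, 32 bits for Base85) are taken from the descriptions given elsewhere in the text. Base64 with 24-bit blocks is included only as a contrasting case with no unused sequences.

```python
def encoding_parameters(n_bits: int, t: int) -> tuple[int, int, int]:
    """Return (v, card_S, card_U) for n-bit blocks over an alphabet of t symbols."""
    v = 1
    while t ** v < 2 ** n_bits:      # smallest v with 2^n <= t^v, as in Equation (1)
        v += 1
    card_s = t ** v                  # all sequences of v symbols
    card_u = card_s - 2 ** n_bits    # sequences left unused by the encoding
    return v, card_s, card_u


for name, n_bits, t in [("Base41", 16, 41), ("Base45", 16, 45),
                        ("Base64", 24, 64), ("Base85", 32, 85)]:
    v, card_s, card_u = encoding_parameters(n_bits, t)
    print(f"{name}: v={v}, card(S)={card_s}, card(U)={card_u}")
```

For Base41, Base45, and Base85 the computed number of unused sequences agrees with the counts quoted in the text (3385, 25,589, and 142,085,829).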
As previously said, in [6], the sequences in $\mathcal{U}$ are employed for reversibly embedding data into a Base45 or Base85 encoded stream.
Here, we propose a general framework for exploiting the unused sequences in several contexts, allowing applications to choose the most appropriate setting for their own purposes. Therefore, every application must define the meaning assigned to every unused sequence and how to process it. Suppose we encode binary sequences of $n$ bits with $v$ symbols belonging to an alphabet $\boldsymbol{A}$ ($v$ is determined as in Equation (1)). If $\mathcal{U} \neq \emptyset$ (see, for example, the encodings with a non-zero value in the last column of Table 1), an application selects a set of sequences $\mathcal{Z} \subseteq \mathcal{U}$ and assigns a meaning to every sequence $S \in \mathcal{Z}$. The semantics of each sequence must be known to, and agreed upon by, both the encoder and the decoder so that the encoded data are transmitted and extracted correctly.
As will be shown later on, a sequence $S \in \mathcal{Z}$ may represent:
  • A string of bits encoding the whole or part of a Cyclic Redundancy Check (CRC) code;
  • A prefix indicating that a fixed number of following sequences encode a CRC, a Message Authentication Code, or a digital signature;
  • One or more bits to be transmitted separately from the data encoded by the sequences belonging to $\mathcal{E}$;
  • A separator to split portions of the data stream encoded by the sequences in $\mathcal{E}$;
  • An identifier specifying the characteristics of a portion of following sequences;
  • A context defining the meaning of the following sequences $S \in \mathcal{Z}$.
For instance, an application that uses Base41 printable encodings can decide that the sequence “zxx” is a prefix indicating that the next two sequences represent a 32-bit CRC. Note that different applications can assign different meanings to the same sequence from $\mathcal{Z}$.
The next subsections will present some possible embodiments using the previously introduced representations.

4.1. Error Detection and Correction Information Embedding

The stream of printable encoded data may be stuffed with sequences belonging to $\mathcal{U}$ that encode a Cyclic Redundancy Check (CRC) [17] of a portion of data that has to be controlled for errors.
It is possible to encode a CRC of length $l \le \lfloor \log_2 (t^v - 2^n) \rfloor$ bits using a subset $\mathcal{C}$ of $2^l$ sequences in $\mathcal{U}$, associating every CRC binary string of length $l$ to one sequence in $\mathcal{C}$ (Figure 1a). In this case, the proposed framework is instantiated with $\mathcal{Z} = \mathcal{C}$.
Figure 1. Simple schemes to show possible encodings of CRC codes (gray arrows show the sequences covered by the CRC). (a) CRC encoded in a single unused sequence. (b) CRC encoded in multiple unused sequences. (c) CRC encoded in an unused sequence and multiple legal sequences.
The maximum values of $l$ for the encodings in Table 1 are 11 for Base41, 14 for Base45, and 27 for Base85. Longer CRC codes may be stuffed by simply concatenating more unused sequences (Figure 1b), in which case again $\mathcal{Z} = \mathcal{C}$, or by using a single unused sequence $S_c$, with $\mathcal{Z} = \{S_c\}$, as a preamble for a fixed number of legal sequences belonging to $\mathcal{E}$, each carrying $n$ bits of the CRC (Figure 1c). See [18] for a comprehensive list of CRC polynomials.
Example 1.
Considering the Base41 encoding [1], an implementation of Figure 1a is to employ 2048 of the 3385 unused sequences available to stuff CRCs of length $l = \log_2 2048 = 11$ bits computed on the previous bit string for error detection.
Example 2.
Using the same Base41 encoding [1], an implementation of Figure 1b is to employ 2048 of the 3385 unused sequences available and concatenate three of them to stuff CRCs of length $3l = 3 \log_2 2048 = 33$ bits computed on the previous bit string for error detection.
Example 3.
A possible implementation of Figure 1c with Base45 [5] is to employ one of the 25,589 unused sequences available (see Table 1) to specify that the following two sequences belonging to $\mathcal{E}$ (each one encoding 16 bits) will encode a $2 \times 16 = 32$ bit CRC.
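The scheme of Figure 1a (Example 1) may be sketched as follows. Several choices here are assumptions made only for illustration and are not taken from the Base41 specification: the 41-symbol `ALPHABET`, the positional mapping in which triples with value below $2^{16}$ are the legal ones, the choice of the first 2048 unused values to carry the CRC, and the CRC polynomial `0x385` with a zero initial value.

```python
# Hypothetical 41-symbol alphabet; the real Base41 alphabet and digit order may differ.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmno"
LEGAL = 2 ** 16      # assumed: triple values 0..65535 encode a pair of octets
CRC_BITS = 11        # floor(log2(3385)) = 11 bits fit in one unused triple


def to_triple(value: int) -> str:
    """Write a value in 0..41^3-1 as three symbols, most significant first."""
    assert 0 <= value < 41 ** 3
    d2, rest = divmod(value, 41 * 41)
    d1, d0 = divmod(rest, 41)
    return ALPHABET[d2] + ALPHABET[d1] + ALPHABET[d0]


def from_triple(triple: str) -> int:
    value = 0
    for ch in triple:
        value = value * 41 + ALPHABET.index(ch)
    return value


def crc11(data: bytes, poly: int = 0x385) -> int:
    """Bitwise 11-bit CRC, MSB first, zero initial value (illustrative parameters)."""
    reg = 0
    for byte in data:
        for bit in range(7, -1, -1):
            feedback = ((reg >> (CRC_BITS - 1)) & 1) ^ ((byte >> bit) & 1)
            reg = (reg << 1) & (2 ** CRC_BITS - 1)
            if feedback:
                reg ^= poly
    return reg


def encode_block(data: bytes) -> str:
    """Encode an even-length block and append one unused triple carrying its CRC."""
    assert len(data) % 2 == 0
    triples = [to_triple(int.from_bytes(data[i:i + 2], "big"))
               for i in range(0, len(data), 2)]
    triples.append(to_triple(LEGAL + crc11(data)))   # value in the unused range
    return "".join(triples)


def decode_block(text: str) -> tuple[bytes, bool]:
    """Return the decoded data and whether the embedded CRC matches it."""
    values = [from_triple(text[i:i + 3]) for i in range(0, len(text), 3)]
    data = b"".join(v.to_bytes(2, "big") for v in values if v < LEGAL)
    crcs = [v - LEGAL for v in values if v >= LEGAL]
    return data, crcs == [crc11(data)]


encoded = encode_block(b"printable!")
print(encoded, decode_block(encoded))
```

Because the CRC triple has a value outside the legal range, the decoder can recognize it without any length field; a decoder unaware of the convention would simply reject it as an illegal sequence.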

4.2. Integrity Information, Message Authentication Code, and Digital Signature Embedding

The printable encoded data may be stuffed and/or terminated with security information such as a cryptographic hash, a Message Authentication Code (MAC), or a signature covering the whole or a portion of the encoded data. Due to the bit length of these binary strings, it is more efficient to employ three unused sequences $S_h$, $S_m$, $S_{ds}$ from $\mathcal{U}$ to specify the type of security information (hash, MAC, and signature, respectively) encoded in the following sequences and then use a fixed number of sequences in $\mathcal{E}$ to store the hash, the MAC, or the signature (Figure 2). In this case, $\mathcal{Z} = \{S_h, S_m, S_{ds}\}$.
Figure 2. Simple scheme to show possible encodings of security information for data protection (gray arrows show the sequences covered by the hash, MAC, or digital signature).
Example 4.
As shown in Figure 2, a single unused sequence $S_{h1}$ of the Base41 encoding [1] may be employed to specify that the following eight sequences belonging to $\mathcal{E}$ (each one encoding 16 bits) will store an $8 \times 16 = 128$ bit hash, such as MD5 [19]. Furthermore, another unused sequence $S_{h2}$ of the Base41 encoding can be utilized to indicate that the following sixteen sequences belonging to $\mathcal{E}$ (each one representing 16 bits) will encode a $16 \times 16 = 256$ bit hash such as SHA3-256 [20]. In this case, $\mathcal{Z} = \{S_{h1}, S_{h2}\}$.
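Example 4 can be illustrated at the level of sequence indices, leaving the concrete symbols aside. In this sketch a stream is modeled as a list of integers in the range $0 \ldots 41^3 - 1$, where values below $2^{16}$ stand for legal sequences; the indices chosen for `S_H1` and `S_H2`, and the function names, are hypothetical. The hashes are computed with Python's standard `hashlib` module.

```python
import hashlib

LEGAL = 2 ** 16                    # indices 0..65535 model the legal sequences
S_H1, S_H2 = LEGAL, LEGAL + 1      # hypothetical unused indices used as hash prefixes
HASHES = {S_H1: ("md5", 8), S_H2: ("sha3_256", 16)}   # prefix -> (algorithm, sequences)


def pack(data: bytes) -> list[int]:
    """Map an even-length octet string to legal sequence indices (16 bits each)."""
    assert len(data) % 2 == 0
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def unpack(seqs: list[int]) -> bytes:
    return b"".join(s.to_bytes(2, "big") for s in seqs)


def protect(data: bytes, prefix: int) -> list[int]:
    """Append the prefix and the hash of the data, the hash stored in legal sequences."""
    algorithm, _ = HASHES[prefix]
    digest = hashlib.new(algorithm, data).digest()
    return pack(data) + [prefix] + pack(digest)


def verify(stream: list[int]) -> tuple[bytes, bool]:
    """Locate the hash prefix, then recompute and compare the hash of the data."""
    pos = next(i for i, s in enumerate(stream) if s in HASHES)
    algorithm, count = HASHES[stream[pos]]
    data = unpack(stream[:pos])
    digest = unpack(stream[pos + 1:pos + 1 + count])
    return data, hashlib.new(algorithm, data).digest() == digest


stream = protect(b"hello!", S_H2)
print(len(stream), verify(stream))
```

A plain hash only detects accidental changes; protecting against deliberate modification would require a MAC or a signature, stored after its own prefix in the same way.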

4.3. Secondary Data Channel

It is possible to create a second data channel that carries information, such as a watermark, using the sequences in the previously defined set $\mathcal{C}$ ($\mathcal{Z} = \mathcal{C}$): every sequence represents $l$ bits of information and may be interleaved anywhere in the encoded data stream, being recognizable and distinguishable from data transformed in printable form (Figure 3).
Figure 3. Secondary channel information interleaved in printable encoded data.
Example 5.
Suppose we wish to store extra data in a Base85 [9] encoded stream. Exploiting the 142,085,829 unused sequences (see Table 1), it is possible to encode $l = \lfloor \log_2 142{,}085{,}829 \rfloor = 27$ bits with an unused sequence of five characters. These can be inserted anywhere in the normal flow of Base85 sequences, creating a secondary channel that, for example, can carry RGB colors (expressed with 8 bits per channel for a total of 24 bits).
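A minimal sketch of Example 5 follows. It assumes Ascii85-style groups (five digits in base 85, most significant first, written with the characters '!' to 'u'), no 'z' abbreviation for all-zero groups, and data whose length is a multiple of four octets. The choice of placing an RGB value $c$ at group value $2^{32} + c$, and the names `embed` and `extract`, are illustrative assumptions rather than the authors' assignment.

```python
BASE, GROUP_LEN, FIRST_CHAR = 85, 5, 33   # Ascii85-style digits '!'..'u'
LEGAL = 2 ** 32                           # group values below this encode four octets


def group(value: int) -> str:
    """Write a value in 0..85^5-1 as five characters, most significant first."""
    digits = []
    for _ in range(GROUP_LEN):
        value, digit = divmod(value, BASE)
        digits.append(chr(FIRST_CHAR + digit))
    return "".join(reversed(digits))


def value_of(chars: str) -> int:
    value = 0
    for ch in chars:
        value = value * BASE + (ord(ch) - FIRST_CHAR)
    return value


def embed(data: bytes, pixels: list[tuple[int, int, int]]) -> str:
    """Interleave one RGB group after each data group; leftovers go at the end."""
    assert len(data) % 4 == 0
    extra = [group(LEGAL + (r << 16 | g << 8 | b)) for r, g, b in pixels]
    out = []
    for i in range(0, len(data), 4):
        out.append(group(int.from_bytes(data[i:i + 4], "big")))
        if extra:
            out.append(extra.pop(0))
    return "".join(out + extra)


def extract(text: str) -> tuple[bytes, list[tuple[int, int, int]]]:
    """Split a stream into primary data and the secondary-channel RGB values."""
    data, pixels = bytearray(), []
    for i in range(0, len(text), GROUP_LEN):
        value = value_of(text[i:i + GROUP_LEN])
        if value < LEGAL:
            data += value.to_bytes(4, "big")
        else:                               # unused group: secondary channel
            rgb = value - LEGAL
            pixels.append((rgb >> 16, (rgb >> 8) & 0xFF, rgb & 0xFF))
    return bytes(data), pixels


stream = embed(b"Man is d", [(255, 128, 0), (1, 2, 3), (9, 9, 9)])
print(stream, extract(stream))
```

Since only 24 of the 27 available bits are used per group, the remaining unused values stay free for other purposes, such as the prefixes described in the previous subsections.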

4.4. Parameter Separation

A printable encoding may also be employed to encode parameters passed to a function in a context where binary data cannot be directly transmitted, for example, in the query string of a Web address. To separate the various encoded parameters, it is possible to use a single sequence $S_d$ belonging to the previously defined set $\mathcal{U}$ and another sequence $S_t$ from the same set to indicate the end of the parameters (Figure 4). The framework is instantiated with $\mathcal{Z} = \{S_d, S_t\}$.
Figure 4. Possible encoding of parameters with separators $S_d$ and $S_t$.
Another possibility is to identify the data types of the various parameters employing sequences from the set $\mathcal{U}$ (Figure 5): for example, it is possible to use a sequence $S_i \in \mathcal{U}$ to identify an integer, another sequence $S_f \in \mathcal{U}$ to specify a float, then $S_o \in \mathcal{U}$ to specify an octet string, $S_p \in \mathcal{U}$ to express a binary pointer, and two sequences $S_{rs}, S_{re} \in \mathcal{U}$ to indicate the beginning and the end of a record made of fields in turn identified with these delimiters (with a possible recursive structure). The parameter list can be terminated with the sequence $S_t$ from the same set $\mathcal{U}$. In this case, the framework is instantiated with $\mathcal{Z} = \{S_i, S_f, S_o, S_p, S_{rs}, S_{re}, S_t\}$.
Figure 5. Identifying types of parameters with separators.
Moreover, the encodings proposed in Section 4.1 and Section 4.2 may be used as an additional data protection feature for the parameters, taking care to choose $S_i$, $S_f$, $S_o$, $S_p$, $S_{rs}$, $S_{re}$, $S_d$, and $S_t$ among the sequences in $\mathcal{U}$ that encode neither a CRC (Figure 1) nor a type of hash, MAC, or digital signature (Figure 2). The proposed framework has $\mathcal{Z} = \{S_i, S_f, S_o, S_p, S_{rs}, S_{re}, S_t, S_c, S_h, S_m, S_{ds}\}$.
Example 6.
Assume that a program running on a Web server needs a (variable) set of parameters in binary form. In this case, the various data can be encoded with Base41 [1] and sent as a query string to the program, separating the various parameters with a single sequence from $\mathcal{U}$ and terminating the parameter list with another sequence in $\mathcal{U}$. At the receiving side, the program can split the data using the separator and recover the original binary values by decoding the Base41 strings.
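The separator-based scheme of Example 6 may be sketched as follows. As before, the 41-symbol `ALPHABET`, the positional mapping of triples, and the two unused values picked for the separator `S_D` and the terminator `S_T` are hypothetical; parameters are assumed to be non-empty lists of even-length octet strings.

```python
# Hypothetical 41-symbol alphabet; the real Base41 alphabet and digit order may differ.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmno"
LEGAL = 2 ** 16      # assumed: triple values 0..65535 encode a pair of octets


def to_triple(value: int) -> str:
    d2, rest = divmod(value, 41 * 41)
    d1, d0 = divmod(rest, 41)
    return ALPHABET[d2] + ALPHABET[d1] + ALPHABET[d0]


def from_triple(triple: str) -> int:
    value = 0
    for ch in triple:
        value = value * 41 + ALPHABET.index(ch)
    return value


S_D = to_triple(LEGAL)        # hypothetical unused triple: parameter separator
S_T = to_triple(LEGAL + 1)    # hypothetical unused triple: end of parameter list


def encode_params(params: list[bytes]) -> str:
    """Encode each parameter, join them with S_D, and close the list with S_T."""
    parts = []
    for p in params:
        assert len(p) % 2 == 0
        parts.append("".join(to_triple(int.from_bytes(p[i:i + 2], "big"))
                             for i in range(0, len(p), 2)))
    return S_D.join(parts) + S_T


def decode_params(text: str) -> list[bytes]:
    """Split the stream at S_D, stop at S_T, and decode the legal triples."""
    params, current = [], bytearray()
    for i in range(0, len(text), 3):
        triple = text[i:i + 3]
        if triple == S_T:
            params.append(bytes(current))
            return params
        if triple == S_D:
            params.append(bytes(current))
            current = bytearray()
        else:
            current += from_triple(triple).to_bytes(2, "big")
    raise ValueError("missing terminator sequence")


query = encode_params([b"\x00\x29", b"\xff\xff", b"BASE"])
print(query, decode_params(query))
```

Scanning in steps of three symbols matters: a separator is recognized only when it is aligned on a triple boundary, so the same three characters straddling two legal triples cannot be mistaken for it.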

5. Discussion and Results

In this section, we perform some numerical computations on some possible practical applications of the proposed method to printable encoded streams.

5.1. CRC Embedding

In the first run of tests, we considered adding an 11-bit CRC to blocks of data encoded in printable form with Base41 [1]. The method adds three octets to the Base41 encoding of the block; thus, if the block has size $n$ octets ($8n$ bits), then the Base41 encoding inflates it to $1.5n$ octets, and adding the CRC leads to $1.5n + 3$ octets with an overload of $\frac{3}{1.5n + 3} \times 100\%$. On the other hand, an 11-bit CRC on a block of $8n$ bits represents an overload of $\frac{11}{8n + 11} \times 100\%$. Analogous formulas can be derived for a 14-bit CRC employing Base45 unused sequences.
We computed the overload for blocks of 128, 256, 512, and 1024 bits (16, 32, 64, and 128 octets, respectively). Table 2 shows the resulting overloads for CRCs embedded into Base41 and Base45 encodings as proposed, comparing them with the classical overload incurred when appending a CRC of 11 and 14 bits, respectively, to the binary data. From these data, it may be seen that the increase in overload is quite limited and acceptable for application-level error detection and data protection against unintentional modifications.
Table 2. Computation and comparison of CRC overloads for Base41 and Base45 encoded CRCs (11 and 14 bits, respectively).
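The two overload formulas can be evaluated directly for the block sizes considered. This is a minimal sketch with hypothetical function names; it assumes that the Base45 case uses the same embedded formula as Base41, since both encodings map two octets to three characters and the CRC occupies one three-character sequence.

```python
def embedded_overload(n_octets: int) -> float:
    """Overload (%) of one three-character CRC sequence added to 1.5 n octets."""
    return 3 / (1.5 * n_octets + 3) * 100


def classical_overload(n_octets: int, crc_bits: int) -> float:
    """Overload (%) of a CRC of crc_bits bits appended to a block of 8 n bits."""
    return crc_bits / (8 * n_octets + crc_bits) * 100


for n in (16, 32, 64, 128):
    print(f"{8 * n:5d} bits: embedded {embedded_overload(n):5.2f}%  "
          f"classical 11-bit {classical_overload(n, 11):5.2f}%  "
          f"classical 14-bit {classical_overload(n, 14):5.2f}%")
```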

5.2. Hash Embedding

Let us now examine a Base41 or a Base45 encoding: three printable characters encode two octets (apart from a single octet encoded when the stream length is not even). An MD5 hash [19] has a length of 16 octets, and thus, 3 + 3 × 16 / 2 = 27 octets may encode a file MD5 hash. Furthermore, a SHA-1 hash [21] has a length of 20 octets, and thus, 3 + 3 × 20 / 2 = 33 octets may encode a file SHA-1 hash.
Considering Base41 [1], we may assign the unused sequence $S_{MD5}$ = “zzM” to indicate that the following 24 characters encode an MD5 hash and the unused sequence $S_{SHA1}$ = “zzS” to indicate that the following 30 characters encode a SHA-1 hash (in this embodiment, the framework is instantiated with $\mathcal{Z} = \{S_{MD5}, S_{SHA1}\}$). The impact on the size of the resulting encoding is, in both cases, only three octets due to the escaping sequence (in this case, “zzM” or “zzS”).

5.3. Extra Data Attachment

As a practical instance of Example 5, let us consider the use of Base85 to represent the pixels of an RGB color image to be appended to an Ascii85 encoded stream. Building $\mathcal{Z}$ with $2^{24} = 16{,}777{,}216$ of the 142,085,829 unused sequences, so that $\operatorname{card}(\mathcal{Z}) = 16{,}777{,}216$, it is possible to encode the pixels of the image in printable form: if the image dimensions are $320 \times 240 = 76{,}800$ pixels, then the size of the uncompressed image (inflated with a ratio 5:3) will be $76{,}800 \times 5 = 384{,}000$ characters (octets).

5.4. Client-Server Parameter Passing

Let us consider passing a variable number of parameters from a Web client to a server. As a concrete example, suppose we convey a 16-bit integer valued 41, a 16-bit integer valued 65,535, and a character string valued “BASE”. Having built $\mathcal{Z}$ with the unused sequences “xBA”, “xBB”, “xBC”, “xBD”, “xBF”, “xBG”, “xBH”, and “xBJ” to represent $S_i$, $S_f$, $S_o$, $S_p$, $S_{rs}$, $S_{re}$, $S_d$, and $S_t$, respectively, and performing an encoding that follows the proposal shown in Figure 5, the resulting printable stream will be:
xBA    ABA    xBA    vzV    xBC    MDk    Qiv    xBJ
Corresponding to:
$S_i$    41    $S_i$    65,535    $S_o$    “BA”    “SE”    $S_t$
It is obvious that the resulting Base41 string can be immediately and unambiguously decoded by a procedure aware of the Base41 encoding symbols assignment and expecting the corresponding parameters.
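The typed parameter list above can be mirrored at the level of sequence indices. In this sketch a stream is a list of integers in which values below $2^{16}$ stand for legal Base41 sequences, and the indices assigned to `S_I`, `S_O`, and `S_T` are hypothetical stand-ins for “xBA”, “xBC”, and “xBJ”; only integers and even-length octet strings are handled.

```python
LEGAL = 2 ** 16                                  # indices 0..65535 model legal sequences
S_I, S_O, S_T = LEGAL, LEGAL + 2, LEGAL + 7      # hypothetical unused indices


def encode(params: list) -> list[int]:
    """Prefix each parameter with its type identifier and terminate with S_T."""
    stream = []
    for p in params:
        if isinstance(p, int):                   # 16-bit integer: one legal sequence
            assert 0 <= p < LEGAL
            stream += [S_I, p]
        else:                                    # octet string: one sequence per pair
            assert len(p) % 2 == 0
            stream.append(S_O)
            stream += [int.from_bytes(p[i:i + 2], "big") for i in range(0, len(p), 2)]
    stream.append(S_T)
    return stream


def decode(stream: list[int]) -> list:
    """Read identifiers and rebuild the parameters until S_T is found."""
    params, i = [], 0
    while stream[i] != S_T:
        tag = stream[i]
        i += 1
        if tag == S_I:
            params.append(stream[i])
            i += 1
        elif tag == S_O:
            chunk = bytearray()
            while stream[i] < LEGAL:             # octets run until the next identifier
                chunk += stream[i].to_bytes(2, "big")
                i += 1
            params.append(bytes(chunk))
        else:
            raise ValueError(f"unexpected sequence index {tag}")
    return params


stream = encode([41, 65535, b"BASE"])
print(stream, decode(stream))
```

The octet string needs no explicit length because every identifier lies outside the legal range, so the decoder can stop reading octets at the next unused sequence.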
One disadvantage is that the insertion of extra data in the encoding increases the size of the processed stream, which may limit its applicability on low-capacity links or small-capacity devices.
Concerning the security issues of the proposed framework, it should be pointed out that any printable encoding presents the same security issues, being just an encoding. We merely present a way in which an application can make use of unused configurations to insert extra information in the encoding. When this extra information is a MAC or a signature, the encoded data are protected against modification attacks. Transferring the resulting encoding in a secure way is out of the scope of the present work, and mainly relies on the use of proper security measures (for instance, cryptography and secure protocols such as HTTPS, SSL, and TLS).

6. Conclusions

In this paper, we proposed to exploit the unused configurations of some printable encodings to carry extra information that may be employed for the following:
  • Error detection and correction of data at application level;
  • Carrying a cryptographic hash, a Message Authentication Code, or a digital signature for integrity protection and origin authentication;
  • Carrying extra payload data building a secondary communication channel;
  • Transferring parameters in (remote) function calls.
The data hiding method in [6,7] is a special case of the proposed framework (case c in the list). In particular, the sequences in $\mathcal{Z} = \mathcal{U}$ carry a watermark bit valued ‘1’, and each of them is associated with one, and only one, sequence of $\mathcal{E}$ carrying a ‘0’-valued watermark bit.
As we already implemented the payload embedding in [6,7], we plan to develop functions that allow the storage/extraction of integrity/security data and the secure communication of parameters in function calls in Web browsers.

Author Contributions

All authors contributed equally to all aspects of this paper. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported by the Italian Ministero dell’Università e della Ricerca.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data sharing not applicable.

Conflicts of Interest

The authors declare that they have no conflict of interest.

References

  1. Botta, M.; Cavagnino, D. Base41: A proposal for printable encoding of bit strings. Eng. Rep. 2023, 5, e12606.
  2. Botta, M.; Cavagnino, D. Base41: A Method for Bit String Encoding in Printable Form. 2023. Available online: https://watermarking.di.unito.it/base41.html (accessed on 2 May 2023).
  3. Josefsson, S. RFC 4648; The Base16, Base32, and Base64 Data Encodings; RFC Editor: Phoenix, AZ, USA, 2006.
  4. Veljkovic, S. Base41. 2014. Available online: https://github.com/sveljko/base41 (accessed on 27 March 2023).
  5. Fältström, P.; Ljunggren, F.; van Gulik, D.W. RFC 9285; The Base45 Data Encoding; RFC Editor: Phoenix, AZ, USA, 2022.
  6. Botta, M.; Cavagnino, D. A Framework for Reversible Data Embedding into Base45 and Other Non-Base64 Encoded Strings. Appl. Sci. 2022, 12, 241.
  7. Botta, M.; Cavagnino, D. Improving data embedding capacity into Base45 encoded strings. Eng. Rep. 2023, e12622.
  8. Elz, R. RFC 1924; A Compact Representation of IPv6 Addresses; RFC Editor: Phoenix, AZ, USA, 1996.
  9. Adobe Systems Incorporated. PostScript Language Reference, 3rd ed.; Addison-Wesley Longman Publishing Co., Inc.: Boston, MA, USA, 1999.
  10. Henke, J. basE91 Encoding. 2006. Available online: https://base91.sourceforge.net/ (accessed on 28 April 2023).
  11. He, D.; Sun, Y.; Jia, Z.; Yu, X.; Guo, W.; He, W.; Qi, C.; Lu, X. A Proposal of Substitute for Base85/64–Base91. In Proceedings of the SUMMER 8th International Conference on Computing, Communications and Control Technologies: CCCT, 2010, Orlando, FL, USA, 29 June–2 July 2010.
  12. Por, L.Y.; Wong, K.; Chee, K.O. UniSpaCh: A text-based data hiding method using Unicode space characters. J. Syst. Softw. 2012, 85, 1075–1082.
  13. Liu, T.Y.; Tsai, W.H. A New Steganographic Method for Data Hiding in Microsoft Word Documents by a Change Tracking Technique. IEEE Trans. Inf. Forensics Secur. 2007, 2, 24–30.
  14. Ali, A.E. A New Text Steganography Method By Using Non-Printing Unicode Characters. Eng. Tech. J. 2010, 28, 72–83.
  15. Aman, M.; Khan, A.; Ahmad, B.; Kouser, S. A hybrid text steganography approach utilizing Unicode space characters and zero-width character. Int. J. Inf. Technol. Secur. 2017, 9, 85–100.
  16. Borges, P.V.K.; Mayer, J.; Izquierdo, E. Robust and Transparent Color Modulation for Text Data Hiding. IEEE Trans. Multimed. 2008, 10, 1479–1489.
  17. Peterson, W.W.; Brown, D.T. Cyclic Codes for Error Detection. Proc. IRE 1961, 49, 228–235.
  18. Koopman, P. Best CRC Polynomials. 2018. Available online: https://users.ece.cmu.edu/~koopman/crc/ (accessed on 2 May 2023).
  19. Rivest, R.L. RFC 1321; The MD5 Message-Digest Algorithm; RFC Editor: Phoenix, AZ, USA, 1992.
  20. Dworkin, M. SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2015.
  21. FIPS Pub 180-1; Secure Hash Standard. National Institute of Standards and Technology: Gaithersburg, MD, USA, 1995.
