Improving data embedding capacity into Base45 encoded strings

In the context of data hiding, the reversible embedding of information into data encoded as printable strings can provide a way to store extra bits for various applications. A previous work demonstrated how to leverage the unused configurations of the Base45 and Base85 encodings to store data in files and QR-codes. In this paper we show that adapting the encoding to the byte distribution statistics of various data and file types significantly improves the payload for Base45 encoded data such as QR-codes.


INTRODUCTION
Data hiding is a wide family of algorithms developed for embedding data into digital objects for various purposes. Embedding can be performed simply to exploit redundancy in digital objects to carry extra information, to protect an object from modification or illegal use, or even to transmit information secretly. Digital watermarking and steganography are different categories of data hiding [1]. Data encoding in printable strings is an operation performed in many systems, like Base64 encoding in e-mails or Base45 encoding in QR-codes. Researchers have developed various methods for encoding data in printable strings, each with its own rationale, properties, advantages and disadvantages. Among them we recall Base16, Base32, and Base64 [2], Base85, used in References 3,4, and Base45 [5], which has been applied in the QR-code representation of the European Union Digital COVID Certificate.
Reference 6 presents a framework for reversibly embedding additional data (a watermark) in encodings whose base admits extra configurations that are considered illegal (i.e., unused). For example, Base85 uses five characters from an alphabet of 85 symbols to encode a sequence of 32 bits: thus $85^5 - 2^{32} = 142{,}085{,}829$ Base85 strings are considered illegal, and Reference 6 proposes to use them for storing extra data.
In this paper, we start from the work in Reference 6 and improve the Base45 encoding to increase the payload capacity by leveraging the statistical distribution of the input data. In Reference 6 we presented a possible application for storing digital signatures in QR-codes. Digital signatures normally run from 448 to 4096 bits, and an increase in the available space would allow larger, and thus more secure, signatures to be stored in a QR-code.
Moreover, the proposed improvement to the framework of Reference 6 allows increasing the data hiding payload in all printable encodings that have unused strings. In particular, all applications employing a Base45 encoding will benefit from the possibility of storing more payload bits.
This paper is structured as follows: Section 2 recalls some related works; Section 3 introduces the nomenclature and notation used in the paper; Section 4 presents the original framework of Reference 6 on which the present paper builds, making the present work self-contained. Section 5 details the proposed improvement based on a statistical analysis of some file types, and extensive experimental results are shown in Section 6. Conclusions are drawn and future works are outlined in the last section.

RELATED WORKS
This paper is developed in the context of text encoding of binary data. In many applications managing generic binary data is not possible and only the subset of printable strings is allowed: for this reason, many binary-to-text encodings have been developed over the years. In this section, we briefly recall the most widely known ones, referring to Reference 6 and to each encoding specification for a wider and more detailed explanation. In addition, Section 4 presents a concise description of the framework introduced in Reference 6, instantiating it for Base85 and Base45: the latter will be used to show the proposed improvement based on the data statistics.
One of the most widely known text encodings for binary data is Base64 [2]: from a binary stream it transforms each group of 24 bits into a sequence of four characters from an alphabet of 64 printable symbols. The same document [2] also describes the Base16 and Base32 encodings for analogous purposes.
Base85 is used in Reference 3 (where it is called Ascii85) to encode binary data as strings over an alphabet of 85 symbols, namely the ASCII characters from "!" (code 33) to "u" (code 117): with this encoding, 32 bits are stored in a sequence of five symbols. In addition, IPv6 addresses can be represented with a Base85 encoding for a compact display [4].
Base45 is described in the Internet Draft [5] and is proposed for QR-code encoding of binary data. Its alphabet is composed of 45 symbols, namely the 10 decimal digits, the 26 letters of the English alphabet, the space and the eight special characters "$%*+-./:". Two bytes of data are encoded in a sequence of three Base45 symbols (the encoding of a single byte produces two Base45 symbols).
Table 1 reports, for the main bases for printable text encoding of binary data (we skip Base16 and Base32 because, for the purposes of our discussion, they have properties analogous to Base64): the base (first column), the number of binary configurations encoded (second column), the number of possible sequences of symbols in the corresponding alphabet (third column), and the difference between the values in the third and second columns (final column). These last values express how many sequences remain unused in encoding binary data and are, for this reason, considered illegal. The framework proposed in Reference 6 leverages these illegal configurations for embedding watermark data.
Many other printable string encodings of binary data have been proposed in the literature and are used in various applications. To cite some, we recall Base58 [7,8], Base62 [9,10], and Base91 [11,12], and refer the interested reader to these papers as a starting point. Moreover, routines for converting to and from Base36, which uses the 10 digits and the 26 letters of the English alphabet, are available in many programming libraries, for example, the Python class int [13].
Even though our algorithm has not been developed for steganography, we recall some algorithms that hide data in textual files. Data embedding in text is proposed as a steganographic method in Reference 14, where the UNICODE zero width joiner and non-joiner characters (ZWJ and ZWNJ) are used to embed bits or English alphabet letters in a UNICODE encoded text. Another steganographic method, developed to hide data in Microsoft Word documents, is presented in Reference 15: the proposal uses the change-tracking feature of a document to embed a secret message.
To our knowledge, Reference 6 is the first paper that performs data hiding by taking advantage of the unused code words of a printable string encoding. The present paper builds upon that idea and improves it by:
• proposing the use of encoding tables based on statistics for every file type, allowing a payload increase of up to 50%;
• analyzing various file types and showing which of them benefit most from the proposed improvement;
• indirectly showing a methodology to analyze new file types for a possible application of the proposed algorithm to them.
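The counts described for Table 1 can be recomputed directly from the group sizes given in the text. The following is a minimal sketch; the function name `unused_sequences` and the parameter names `t` (alphabet size), `u` (symbols per group) and `n` (bits per group) are chosen here for illustration only.

```python
# Recompute, for each encoding, the number of symbol sequences that never
# encode a binary string. Tuples are (alphabet size t, symbols per group u,
# bits per group n), taken from the descriptions in the text.
ENCODINGS = {
    "Base64": (64, 4, 24),
    "Base85": (85, 5, 32),
    "Base45": (45, 3, 16),
}


def unused_sequences(t: int, u: int, n: int) -> int:
    """Number of u-symbol sequences not needed to encode n-bit strings."""
    return t**u - 2**n


for name, (t, u, n) in ENCODINGS.items():
    print(
        f"{name}: {2**n:,} binary configurations, "
        f"{t**u:,} sequences, {unused_sequences(t, u, n):,} unused"
    )
```

Base64 leaves no unused sequence, whereas Base85 and Base45 leave 142,085,829 and 25,589 respectively.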

NOTATION
The notation used in the paper obeys the conventions in the subsequent list and Table 2:
• an element of an alphabet is called a symbol;
• an ordered concatenation of symbols is called a sequence;
• a sequence of n symbols from the alphabet Ω = {0, 1} is called an n-bit binary string;
• an instance of a sequence of symbols is written between single quotes.

ORIGINAL FRAMEWORK
In this section we summarize the framework proposed in Reference 6: the method reversibly embeds a watermark bit string into a printable encoding like Base45 or Base85. Suppose we have an alphabet Ψ of t (printable) symbols used to encode binary strings of n bits: this encoding shall use sequences of u symbols, where $t^u \geq 2^n$. Let us define a function that transforms a binary number X of n bits into the corresponding value S in base t, that is, into a sequence of u symbols from Ψ. Among the $t^u$ possible sequences, $r = t^u - 2^n$ will be unused and considered illegal in the normal encodings seen so far (e.g., References 3-5). The redundancy [16] induced by these r sequences is exploited in the framework of Reference 6 to reversibly embed watermark bits in the stream of symbols from Ψ by making use of illegal sequences.
In detail, the $t^u$ sequences can be split into:
• a set containing the r unused sequences, which we call W;
• a set of $2^n$ sequences, where each sequence represents a different string of n bits; this set may be partitioned into:
  ⚬ a set V composed of r sequences;
  ⚬ the remaining $2^n - r$ sequences.
A sequence from V can be used when we want to encode a watermark bit value 0 and a sequence from W when we want to encode a watermark bit value 1.
A one-to-one correspondence between the sequences in V and W is defined by a mapping function. Since this function is bijective, the two sets V and W are isomorphic with respect to it and can therefore always be used interchangeably.
The encoding procedure works as follows: given an n-bit string X, a sequence S of u symbols from Ψ is obtained by applying the encoding function to X. If S belongs to the set V, then a watermark bit can be embedded: if the watermark bit is 0 the sequence is left as is, otherwise the corresponding sequence in W (obtained through the mapping) is used in place of S. If S does not belong to the set V, no watermark bit is embedded.
A procedure for encoding written in pseudo-code is shown in the following Algorithm 1:

Algorithm 1. Pseudo-code of the encoding algorithm
For each binary string X of n bits:
    Encode X with the corresponding sequence S of u symbols from Ψ;
    If S ∈ V then
        Get the next watermark bit b;
        If b = 1 then replace S with the corresponding sequence in W;
    Output S.
The decoding of the n-bit binary strings and of the watermark from sequences of u symbols in Ψ is performed by inverting the encoding operations, according to Algorithm 2 (written in pseudo-code). For every sequence, if it belongs to the set V then it stores a watermark bit of value 0; otherwise, if the sequence belongs to the set W, it stores a watermark bit of value 1, and in this latter case the sequence is restored to the corresponding one in V. After this watermark extraction step, the sequence can be decoded to the original binary string.
If the binary stream to be encoded has maximum entropy, that is, if we assume a uniform distribution of the $2^n$ binary strings, then the average payload (in bits per input binary string, bpbs) is $r / 2^n = (t^u - 2^n) / 2^n$. On the other hand, the average payload per output symbol (in bits per symbol, bps) is $r / (u \cdot 2^n)$.
As an example of the application of this framework, we present the Base45 encoding extended according to Reference 6 using Map 2. Figures 1 and 2 show a graphical representation of this encoding/decoding mechanism.
Referring to these figures, three Base45 symbols (t = 45, u = 3) can represent $t^u = 45^3 = 91{,}125$ sequences. When encoding binary strings of n = 16 bits, only 65,536 sequences are used. The remaining 91,125 − 65,536 = 25,589 sequences of three Base45 symbols are unused. Thus, the idea proposed in Reference 6 is to build a one-to-one mapping between the set W of unused sequences (having cardinality 25,589) and a set V (also having cardinality 25,589) of sequences chosen among those used to represent the $2^{16}$ binary strings. Map 2 of Reference 6 defines the set V as the first 25,589 sequences of three Base45 symbols.
After this preparation phase, it is possible to start converting a sequence of words, each 16 bits long, into Base45 printable form. Moreover, using the mapping it is possible to embed extra bits from a watermark (payload). The conversion works as follows:
• if a 16-bit string is not converted to a sequence in V, then its Base45 conversion is output and the process restarts for the next 16-bit string (e.g., the green bit string in Figure 1, mapped to the blue Base45 sequence);
• otherwise, if the 16-bit string (e.g., the pink bit string in Figure 1) is converted to a sequence belonging to V, then one watermark (payload) bit b can be embedded: if b = 0 the Base45 sequence is left as is (orange Base45 sequence in Figure 1); if b = 1 the Base45 sequence is mapped, through the one-to-one mapping, to the corresponding Base45 sequence in W (red Base45 sequence in Figure 1) and the obtained sequence is output. After that, the process restarts for the next 16-bit string.
The decoding (Figure 2) restores the original sequence of 16-bit strings and extracts the embedded watermark. In fact, if the three Base45 symbols belong to V then a watermark bit b = 0 is extracted; otherwise, if the three Base45 symbols belong to W, a watermark bit b = 1 is obtained and the original Base45 sequence is restored through the inverse mapping (which exists because the mapping is a bijection). In any case the 16-bit string is restored from the Base45 sequence which, now, is one of the first 65,536 sequences.
FIGURE 2 Graphical representation of the framework proposed in Reference 6 applied to Base45 decoding
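The Base45 embedding and extraction steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Reference 6: the function names are hypothetical; a sequence is identified with its numeric value, so "the first 25,589 sequences" are taken to be the values 0 to 25,588; the bijection from V to W is assumed to be "add 2^16"; the symbol order (least significant symbol first) follows the published Base45 specification; and the decoder is assumed to know the watermark length.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"  # 45 symbols
T, U, N = 45, 3, 16
USED = 2**N          # 65,536 sequences represent 16-bit strings
R = T**U - USED      # 25,589 unused sequences (the set W)


def index_to_symbols(i: int) -> str:
    """Write a value in [0, 45^3) as three Base45 symbols, least significant first."""
    c, rest = i % T, i // T
    d, e = rest % T, rest // T
    return ALPHABET[c] + ALPHABET[d] + ALPHABET[e]


def symbols_to_index(s: str) -> int:
    """Inverse of index_to_symbols."""
    c, d, e = (ALPHABET.index(ch) for ch in s)
    return c + d * T + e * T * T


def embed(words, bits):
    """Encode 16-bit words, embedding one watermark bit per word that falls in V.

    Returns the Base45 text and the number of watermark bits embedded.
    """
    out, used = [], 0
    for x in words:
        i = x
        if x < R and used < len(bits):   # x is in V and bits remain
            if bits[used] == 1:
                i = x + USED             # assumed mapping V -> W
            used += 1
        out.append(index_to_symbols(i))
    return "".join(out), used


def extract(text):
    """Recover the 16-bit words and every bit readable from the text."""
    words, bits = [], []
    for k in range(0, len(text), U):
        i = symbols_to_index(text[k:k + U])
        if i >= USED:        # sequence in W: bit 1, restore the V sequence
            bits.append(1)
            i -= USED
        elif i < R:          # sequence in V: bit 0
            bits.append(0)
        words.append(i)
    return words, bits


if __name__ == "__main__":
    words = [0x0041, 0x4142, 0xFFFF, 100]
    text, used = embed(words, [1, 0])
    restored, bits = extract(text)
    print(repr(text), restored == words, bits[:used])
```

Once the watermark bits are exhausted, words in V are emitted unchanged and the decoder reads them as 0 bits, which is why the sketch truncates the extracted bits to the known watermark length.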

PROPOSED WORK
As stated in the previous discussion, the average payloads were computed assuming a uniform distribution of the n-bit binary strings: that is, the probability of finding any given string in the input stream is $1/2^n$. This is the only reasonable assumption when no knowledge about the type of input stream is available, and it implies that the choice of the sequences that compose the set V can be arbitrary. In fact, in Reference 6 three possibilities (called mappings in that paper) were tested, namely:
• the sequences corresponding to the r binary strings $\lfloor 2^n / (t^u - 2^n) \rfloor$ positions apart, starting from $\{0\}^n$;
• the sequences corresponding to the first r binary strings;
• the sequences corresponding to the last r binary strings.
In Reference 6 slightly different performances among the mappings were found for different file formats encoded with Base45 and Base85, but none of the mappings could be considered the best for all file types. This is due to the different distributions of the binary strings that compose different file types. From this observation it is possible to conclude that building the set V with the sequences associated with the most probable binary strings in the input stream would increase the payload capability of the algorithm.
The idea proposed in this paper is to analyze different data representations and file formats, building a statistical distribution of the n-bit binary strings that compose them, and to use these statistics to build the set V with the sequences corresponding to the r most probable binary strings. Thus, we will have a set V for each file or stream format, allowing for an increased payload. The statistics for each file format will specify the $r = t^u - 2^n$ most probable binary strings and, consequently, the associated sequences. Considering the cardinality of V for the encodings in Table 1, it follows that Base45 has a reasonable size (25,589 elements), whilst Base85 would require too large a vector (142,085,829 elements) to record the most probable binary strings.
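The selection of the most probable 16-bit strings can be sketched as below. The names `words16` and `build_v` are hypothetical, and reading the input as big-endian 16-bit words (dropping a trailing odd byte) is an assumption made here; the text does not specify these details.

```python
from collections import Counter

R = 45**3 - 2**16  # 25,589 entries in the set V for Base45


def words16(data: bytes):
    """Yield consecutive 16-bit words (big-endian); a trailing odd byte is ignored."""
    for k in range(0, len(data) - 1, 2):
        yield (data[k] << 8) | data[k + 1]


def build_v(training_files, r: int = R):
    """Return the r most frequent 16-bit words seen in the training data.

    training_files is an iterable of bytes objects. If fewer than r distinct
    words are observed, the table is padded with unseen words in numeric order.
    """
    counts = Counter()
    for data in training_files:
        counts.update(words16(data))
    ranked = [w for w, _ in counts.most_common()]
    seen = set(ranked)
    ranked.extend(w for w in range(2**16) if w not in seen)
    return ranked[:r]


if __name__ == "__main__":
    table = build_v([b"\x00\x01\x00\x01\xff\xff"], r=2)
    print(table)
```

One possible way to obtain the bijection with W from such a table, again an assumption rather than something stated in the text, is to pair the k-th entry of the table with the k-th unused sequence.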
As a last consideration, Base64 has no redundancy thus it cannot be used for this purpose. According to Reference 6, Base62 could be another candidate for this application, but we prefer Base45 because comparisons with previous experiments are straightforward.
In the next section we apply the proposed improvement to the Base45 encoding computing the 25,589 most probable 16 bit binary strings for various file formats using many files and we show the improvement in the payload capacity.

EXPERIMENTAL RESULTS
In this section we present the results of an extensive set of experiments aimed at testing the proposed method against the original algorithm proposed in Reference 6, applied to Base45 encoded files. The experiments were run on a laptop with an Intel(R) Core(TM) i7-1165G7 2.80 GHz processor and 16 GB of RAM. The software implementing the described algorithms was written in C.
We considered eight file types, namely:
• JPEG, TIFF, PNG for images; all images were 768 × 576 pixels at 24 bpp, the JPEG format used quality 80 with 2 × 2 subsampling, TIFF images were uncompressed and PNG images were compressed;
• MP3 for audio;
• ZIP, BZ2, GZ for compressed data;
• PDF for formatted documents; the PDF version of the files we encoded ranged from 1.3 to 1.7.
In the context of images, we started from a set of 500 images, each one saved in the mentioned formats. For each format, we performed a 5-fold cross-validation, partitioning the images in each fold into two subsets (in a ratio of 80:20):
• one composed of 400 images, used to build the statistical distribution of the $2^{16}$ binary strings from which to select the 25,589 having the highest probabilities;
• a second one made of 100 images, used to compute the payload of the method.
Then, the results from the folds are averaged and reported in Table 4. For each image file format, the best payload from Reference 6, the averaged payload of the proposed method and the averaged cumulative probability of the 25,589 sequences are shown. For comparison, consider that in the case of a uniform distribution any 25,589 sequences account for a cumulative probability of $p_U = 25{,}589/65{,}536 \approx 0.39046$, which corresponds to an average payload per output symbol of 0.130152 bps (see Table 3).
The percentage improvements for the three image file formats can be derived from the values in Table 4. The new values of payload per output symbol (third column of Table 4) are in line with the intrinsic redundancy present in the three image encodings, summarized in the last (fourth) column of Table 4.
In the case of audio, we used 3000 MP3 files, each of size 500,000 bytes. Also in this case we performed a 5-fold cross-validation with a splitting proportion of 80:20, using the first set to calculate the statistical distribution and the other to assess the performance.
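The relation between the cumulative probability of the set V and the payload per output symbol can be checked with a short sketch. The function name `payload_bps` and the big-endian reading of 16-bit words are assumptions made here for illustration; the uniform baseline reproduces the values quoted in the text.

```python
U = 3                  # Base45 symbols per 16-bit word
R = 45**3 - 2**16      # 25,589 sequences in V


def payload_bps(test_files, v_table, u: int = U):
    """Estimate cumulative probability of V and payload per output symbol.

    test_files is an iterable of bytes objects; v_table lists the 16-bit
    words whose sequences form V. Every word that falls in V carries one bit.
    """
    v = set(v_table)
    total = hits = 0
    for data in test_files:
        for k in range(0, len(data) - 1, 2):
            word = (data[k] << 8) | data[k + 1]
            total += 1
            hits += word in v
    cumulative = hits / total if total else 0.0
    return cumulative, cumulative / u


# Uniform baseline: any 25,589 words have the same cumulative probability.
uniform_probability = R / 2**16
uniform_bps = uniform_probability / U
print(f"{uniform_probability:.5f} {uniform_bps:.6f}")  # 0.39046 0.130152
```

A table built from training files would be passed as `v_table`, with the held-out files of the fold as `test_files`.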
As in the case of images we report the results for the best payload of Reference 6 in the second column of Table 5, while the third column of the same table gives the average payload, over the 5 folds, of the proposed method. The last column shows the cumulative probability of the most probable 25,589 sequences used in computing the statistics.
As can be seen from the data in Table 5, the improvement over the method proposed in Reference 6 is (0.16989 − 0.13212)∕0.13212 = 28.6%, which is a meaningful gain for MP3 files.
For the compressed file formats, we considered 7000 files of 500,000 bytes in ZIP format, 1100 files of 100,000 bytes in BZ2 format, and 2000 files of 500,000 bytes in GZ format. Analogously, we performed a 5-fold cross-validation by splitting each set in folds of two sets (in ratio of 80:20): the first is used as training to compute the statistics, and the second set is used to test the performances. Table 6 reports the best payload from Reference 6 for each compressed format and the corresponding payload obtained with the proposed method for the files in the test set, averaged over the 5 folds. As before, the last column of Table 6 contains the cumulative probability of the 25,589 binary strings in the files used for building the distribution statistics.
The percentage improvements for the three compressed file formats can be derived from the values in Table 6. The values of payload per output symbol in the third column of Table 6 are coherent with the redundancy in the three compression formats, summarized in the last column of Table 6.
For the PDF file format, we used 2000 files each one having size 500,000 bytes. Even in this case we performed a 5-fold cross-validation by splitting each set in folds of two sets in ratio of 80:20.
The best payload from the three methods in Reference 6 is reported in the second column of Table 7; the averaged payload of the proposed method and the averaged cumulative probability of the 25,589 sequences are shown in the third and fourth columns, respectively. From Table 7 the percentage improvement for the PDF file format is (0.16998 − 0.13893)∕0.13893 = 22.3%. Also in this case, the results show a significant increment in payload when the Base45 symbols are paired with the most probable 16-bit strings.
For all file types the payload improvement is statistically significant (t-test p-value <0.05).
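The relative improvements quoted above for MP3 and PDF follow from a one-line computation on the payload values reported in the text. The helper name `improvement_percent` is introduced here only for illustration.

```python
def improvement_percent(baseline: float, proposed: float) -> float:
    """Relative payload gain of the proposed method over the baseline, in percent."""
    return 100.0 * (proposed - baseline) / baseline


# Payload values (bps) as quoted in the text for Tables 5 and 7.
print(f"MP3: {improvement_percent(0.13212, 0.16989):.1f}%")  # MP3: 28.6%
print(f"PDF: {improvement_percent(0.13893, 0.16998):.1f}%")  # PDF: 22.3%
```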

CONCLUSIONS AND FUTURE WORK
This paper has presented an application of the framework defined in Reference 6 for the reversible embedding of watermark data into information encoded as Base45 printable strings. It was shown how a careful choice of the sequences carrying payload data can improve the payload capability of the method by exploiting the redundancy in the file format. Various file types were examined and the statistics of their binary strings computed: from these it was also possible to understand which file types are more suitable for carrying extra data when encoded in Base45 format. An extensive set of experiments on many files quantified the payload capability of the proposed method.
An obvious limitation of the proposed method is that the choice of an encoding table not containing the most probable sequences in the input stream leads to a reduced payload. Moreover, encoder and decoder must share the same mapping table that, for efficiency, must be used for multiple data of the same type (ideally, for a kind of application context).
A future work is an analysis of the possibility to embed the encoding/decoding mapping in the file itself: a (compressed) mapping computed for the specific file is embedded (using a predefined mapping) as first part of the payload and used to encode/decode the remaining payload. This requires investigating advantages and disadvantages of the method, in particular the file size required to gain payload against the space required to embed the mapping.
In the present work, only one watermark bit is embedded in each sequence from V. Another line of future development is to study when it is convenient to store more than one bit in each sequence. To do so, a statistical analysis of the watermark might be needed in order to find the most frequent bit pairs, triplets, and so on, and to define a mapping function that optimizes the number of embeddable watermark bits. This might require an optimization algorithm to build an optimal mapping table that allows encoding more than one watermark bit in the most frequent symbols and the corresponding printable Base45 sequences.
A possible application of the proposed method is storing digital signatures into QR-codes: increasing the payload of the method using a mapping table leveraging the symbols with higher frequencies allows employing longer and more secure signatures.