A universal data transfer technique over voice channels of cellular mobile communication networks

A universal technique for transmitting data over voice (DoV) links that can be applied to various cellular networks is proposed. In the scheme, a generic modulation model is established based on waveform symbol mapping, where the data to be transmitted are mapped into finite waveform symbols of a codebook generated offline and demodulated with soft decision results at the receiving end. To make the proposed scheme applicable to various kinds of cellular networks, first, sinusoidal signals are selected to synthesise waveform symbols due to their stable transmission characteristics over various voice channels. Then, an analytical method based on Surface Packing is proposed to optimise modulation codebooks. Finally, the target demodulation codebook is obtained through learning from the modulation codebook online. Simulation results show that the proposed scheme performs well over voice channels with different vocoders, achieving a low symbol error rate. Compared with previous schemes designed for some specific vocoders in the global system for mobile communication (GSM), the proposed scheme can extend the application scope of DoV to cross-network scenarios consisting of GSM, universal mobile telecommunications system (UMTS) and long-term evolution (LTE).


This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2020 The Authors. IET Communications published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology

INTRODUCTION
Cellular mobile networks, especially the fifth generation (5G) networks, have been developing and evolving rapidly over the past several decades. However, the main networks in service in some countries and regions are still second generation (2G) and third generation (3G) networks, whose data transmission efficiency is far inferior to that of more advanced networks. Unlike data transmission, which is provided mainly in modern networks, real-time voice communication is a basic service in all kinds of cellular mobile networks and therefore covers a wider range of service areas. Besides, compared to the best-effort nature of data transmission services, voice services have much higher priority and timeliness. [1] Therefore, when a secret call is required, an encryption scheme based on the voice service can be applied to a wider range of regions and networks and is more reliable than one based on the data service. In many cellular mobile communication networks, voice encryption is only supported during the over-the-air transmission stage and is controlled by network operators, [2][3][4][5] so the voice is transmitted in plain mode in the core networks. As a result, secret calls need to be implemented by end-to-end voice encryption schemes. [6] But during the over-the-air transmission stage, only signals with speech characteristics can pass through the vocoders located at the mobile terminals and the base stations. [1] Therefore, encrypted voice data, which lose their speech characteristics, need to be transmitted over the voice channels of cellular mobile networks by the data over voice (DoV) technique. [6] In a regular implementation of DoV, data are modulated into speech-like signals, transmitted transparently over the voice channels of cellular mobile networks in camouflage, and demodulated at the receiving end. For convenience, the voice channels of cellular mobile networks are referred to as voice channels in the following.
Apart from voice encryption, the DoV technique can be further applied to the field of instant messaging, where it can be regarded as an emergency supplement to cellular network data links. The pan-European eCall vehicle emergency call system [7] extends the E112 system based on the voice channels of GSM networks and automatically transmits data to the public safety answering point in an emergency. The WikiWalk project [8] could provide wizards for pedestrians by sending data commands over the voice channels. The Federal Communications Commission (FCC), an agency of the US government, requires that the emergency transmission of Baudot codes be provided on all 911 emergency calls. [9] The applicable fields of the DoV technique also include ATM and POS terminals with high real-time communication requirements in the financial field. However, it should be clear that research on DoV is not intended to compete with the data services of cellular mobile communication networks, but only to act as an emergency supplement.
To allow greater channel capacity, vocoders are used in the voice channels to perform compression coding on the speech signals transmitted through them. The source signals are encoded into speech parameters at the sending end and reconstructed by the vocoder at the receiving end according to those parameters. [7] These reconstructed speech signals sound like the original ones, but have inevitable waveform distortions. [7] At the same time, in order to improve the voice channel efficiency, techniques such as voice activity detection (VAD), discontinuous transmission (DTX), noise cancellation and echo cancellation are widely employed in vocoders, [1,6] which further aggravate the signals' waveform distortions or even suppress their transmission. Thus, there are two important issues to address in order to implement DoV. The first is to design speech-like signals that can be transmitted over voice channels with only slight distortions, and the second is to deal effectively with the waveform distortions of the speech-like signals and demodulate the data accurately.
Given that multiple generations and types of cellular mobile networks coexist in many countries and regions, we focus on designing a universal DoV technique, aiming to provide technical support for transmitting encrypted voice and for emergency DoV channels. We make the following contributions.
• We design a modulation model with soft decision in demodulation. Based on the maximum likelihood rule, the received signals are demodulated into the most likely multiple symbols with their corresponding probability values, which supports reducing the error rate by channel coding.
• We propose a method to optimise modulation codebooks offline based on Surface Packing. First, the codebook's minimum distance and the Surface Packing are defined. Then, the analytical relationship between the optimal minimum distance of a codebook and the parameters used to generate the codebook is presented. Based on these analytical relationships, the Surface Packing distributes the waveform symbols in a codebook evenly on the surface of a unit sphere in a multi-dimensional Euclidean space, which keeps the received symbols as separable as possible. Compared with the previous heuristic-based optimisation algorithms, the computational cost of this scheme is much smaller.
• We design a learning algorithm to obtain the demodulation codebook online. In the algorithm, the waveform symbols of the demodulation codebook are built from the stable outputs of their corresponding symbols in the modulation codebook used in operation. This algorithm allows the same modulation codebook to be used over different vocoders, which is a prerequisite for applying the proposed DoV scheme to various types of cellular mobile networks.
It should be noted that the standard modem scheme over public switched telephone networks (PSTNs) can achieve error-free transmission rates up to 56,000 bits per second (bps), [10] while the rate of the proposed scheme is only about 500-2,000 bps with higher bit error rates (BERs). However, Katugampala et al. [5] pointed out that the standard data modem over PSTNs cannot be employed for transmitting data over the lossy voice compression channels of cellular mobile networks.
The remainder of the paper is organised as follows. In Section 2, the relevant works on DoV are introduced. In Section 3, a universal modulation model with soft decision in demodulation is presented. The codebook generation and optimisation are described in Section 4. The simulation setup and results are provided in Section 5. We conclude the paper in Section 6.

RELATED WORKS
Generally, the methods for implementing DoV could be classified into three categories, which are parameter mapping, parameter modulation and waveform symbol mapping.

Parameter mapping
The parameter mapping method generates speech-like signals dynamically by mapping the data to be transmitted into parameters, which are extracted at the receiving end. Katugampala et al. [5,11,12] mapped data to pitch, energy and linear spectral frequencies (LSFs) to implement DoV, while Rashidi et al. [13][14][15] used parameters such as speech formants, pitch and phases. The speech-like signals synthesised by such methods were prone to phase and amplitude mutations at the splices, which introduced high-frequency harmonics and led to a high symbol error rate (SER) in demodulation. Kotnik et al. [16] selected two sets of LSFs to synthesise new LSFs by mapping the data to be transmitted bit-by-bit, and then used a predefined signal to excite the filters composed of the synthesised LSFs to generate speech-like signals. However, the signals generated by this scheme do not necessarily possess speech features. Kondoz et al. [17] mapped data to pulse positions based on linear algebraic code-excited speech coding and data interleaving techniques, and Werner et al. [18] mapped data to bipolar pulses. These two schemes could achieve high data transfer rates, but the polar pulses contain many harmonics, which leads to a high BER. Boloursaz et al. [19] reduced the BER by repetition coding and clustering, at the cost of a lower data transmission rate.
Similar to the above-mentioned methods, some speech processing technologies from the field of speech recognition provide possible alternative methods for implementing DoV. Lojka et al. [20] developed an acoustic event detection system with a modified Viterbi decoder operating over hidden Markov models (HMMs). The feature mechanism of the system can be employed to train the acoustic models and extract feature vectors from the audio signals during demodulation for DoV.
Vavrek et al. [21] proposed a weighted fast sequential dynamic time warping (WFSDTW) algorithm. The acoustic modelling method of the scheme can be employed to build phonetic units, which could be transmitted over cellular mobile networks and the synthetic phonetic units could be identified by WFSDTW at the receiving end.

Parameter modulation
The parameter modulation method is similar to conventional data modulation: the data to be transmitted are modulated onto the amplitudes, frequencies and phases of the carrier signals and demodulated at the receiving end. Dhananjay et al. [22] modulated data onto a combination of sinusoidal signals at two frequencies, which can achieve self-synchronisation at the receiving end. However, when the data transfer rate is higher than 1000 bps, high carrier frequencies are needed, which cause severe waveform distortions and a high BER. Chmayssani et al. [23] used frequency shift keying (FSK) and quadrature amplitude modulation (QAM) to implement DoV, respectively, and concluded that the VAD and DTX had great impacts on the performance of QAM. Ali et al. [8] described a modulation scheme based on M-ary frequency shift keying (M-FSK), but the data transfer rate is no more than 800 bps. Sheikh et al. [24] implemented a modulator based on quadrature phase shift keying (QPSK). Xu [25] proposed two schemes based on binary phase shift keying (BPSK) and QPSK, respectively. As the channel utilisation is low in these schemes, the BER is very high when the data are transferred at a high rate.

Waveform symbol mapping
In the waveform symbol mapping method, the data to be transmitted are mapped into speech-like waveform symbols of a codebook generated offline. LaDue et al. [26] first proposed the waveform symbol mapping method for GSM and used the cooperative genetic algorithm (GA) to perform codebook optimisation. However, the optimisation algorithm converges slowly and requires large amount of calculations. Sapozhnykov et al. [27] introduced a pattern search (PS) algorithm to improve the speed of codebook optimisation. But specific codebook optimisations for different vocoders are still needed in their schemes. Shahbazi et al. [28,29] concluded that the waveform symbols generated by LaDue's scheme did not necessarily possess speech features, so they generated codebooks based on TIMIT, which is a real speech library. In Shahbazi's schemes, pitch modifications were required to make the codebook symbols have the same pitch. Mashhadi et al. [30] pointed out that the pitch modification might adversely affect the transmission of the speech-like signals and removed them from codebook generation. In the schemes using real speech libraries, the signals synthesised from the codebook symbols do not necessarily have speech characteristics as a whole.
Compared to the DoVs with the waveform symbol mapping method, those based on the parameter mapping or parameter modulation method have higher BERs and require more complex calculations in the frequency domain. Therefore, the waveform symbol mapping method is chosen to design a DoV scheme for cellular mobile communication networks.
In addition, most of the schemes are optimised for some specific vocoders in GSM in previous research on DoV. When these schemes are applied to various kinds of cellular networks, the speech-like signals will suffer from severe waveform distortions, which will lead to high SERs. To improve this situation, we propose a scheme suitable to various types of cellular networks.

PROPOSED MODULATION MODEL
To accurately describe the general voice channel's transmission characteristics, we abstract it into a channel composed of a discrete memoryless component and a memory component. For a speech symbol $y_j$ with $n$ samples in length, its output $y'_j$ over such a voice channel is represented by

$$y'_j = \Phi(y_j) + \Psi(y'_{j-1}, \ldots, y'_{j-p}), \tag{1}$$

where $\Phi(y_j)$ denotes the stable output of $y_j$ over the memoryless component, which is only related to $y_j$, while $\Psi(\cdot)$ denotes the dynamic output of $y_j$ over the long-time memory component, which is related to the $p$ symbols received prior to $y'_j$. Using the waveform symbol mapping method, a universal DoV modulation model combined with soft decision in demodulation, as shown in Figure 1, is established based on (1). Here $i$ denotes the data to be transmitted, $C^M_{m \times n}$ denotes the modulation codebook generated offline, which contains $m$ speech-like waveform symbols of $n$ samples each, and $C^D_{m \times n}$ denotes the demodulation codebook, whose acquisition is described in detail in Section 4. The recorder records the latest $p$ received symbols, while the compensator compensates for the distortion of the received signals. $\{\hat{i}_k : p_k\}$ denotes the demodulation output set with soft decision, where $p_k$ represents the probability of demodulating the received signal $y''$ to $\hat{i}_k$.
In operation, the modulator selects the symbol $s_i$ in the codebook $C^M_{m \times n}$ indexed by the data $i$ and sends it as $y_j$. The compensator compensates the output $y'_j$ of $y_j$ based on the previous $p$ symbols recorded by the recorder, and its output is denoted by $y''_j$. The demodulator gives the probabilities of demodulating $y''_j$ into the most probable $v$ waveform symbols in $C^D_{m \times n}$ according to the maximum likelihood rule. For the symbol $y_j$ in Figure 1, its output $y''_j$ over a voice channel is demodulated by

$$\hat{i}_1, \ldots, \hat{i}_v = \arg\operatorname{top}_i\big(P(s^D_i \mid y''_j),\, v\big), \tag{2}$$

where $s^D_i \in C^D_{m \times n}$ and $\arg\operatorname{top}_i(x, v)$ denotes getting the indexes of the top $v$ elements of the vector $x$ in descending order of their values.
The symbols $s^D_i$ and $y''_j$ can be regarded as two $n$-dimensional vectors, whose similarity is measured by the cosine similarity. Therefore, $P(s^D_i \mid y''_j)$ can be equivalently calculated by

$$P(s^D_i \mid y''_j) = \frac{s^D_i \cdot y''_j}{\|s^D_i\| \, \|y''_j\|}, \tag{3}$$

where $\| \cdot \|$ denotes the Euclidean norm.
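The soft-decision step described above can be sketched as follows. This is a minimal illustration that assumes the cosine similarity is used directly as the score; the function names and the three-symbol toy codebook are hypothetical and do not come from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def soft_demodulate(received, demod_codebook, v):
    """Return the v most similar codebook indexes with their scores, best first."""
    scored = [(cosine_similarity(symbol, received), index)
              for index, symbol in enumerate(demod_codebook)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(index, score) for score, index in scored[:v]]


# Toy demodulation codebook (illustrative only): three orthogonal symbols.
codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A distorted copy of symbol 0 that leaks some energy towards symbol 1.
candidates = soft_demodulate([0.9, 0.3, 0.1], codebook, v=2)
```

The first entry of `candidates` plays the role of the hard decision, while the remaining entries are the alternatives that a channel decoder can fall back on.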
In the study of phonetics, the p-order all-pole model is generally used as the speech generation model. Therefore, the speech compensation could be performed by (4), where $\alpha_i$ denotes the influence coefficient of the $i$th previous sample on the current sample. The values of $\alpha_i$ are determined by an autoregressive model, which Wan et al. [31] described in detail. It should be clear that the compensation and demodulation operations in the modulation model need to be performed on the basis of accurate signal synchronisation. Due to the distortion of the speech-like signals, it is difficult, especially in scenarios involving different cellular networks, to achieve precise synchronisation by the traditional synchronisation methods. Here the focus is on the general techniques of modulation and demodulation, and the signal synchronisation is studied as a separate work. [32] In that work, we noted that Kotnik et al. [16] employed chirp signals as the synchronisation signals for their DoV scheme. However, such a scheme was prone to missing synchronisation because the vocoders in cross-network scenarios weaken the correlation of the synchronisation signals. Therefore, we proposed a synchronisation scheme combining time domain analysis with a fast correlation method in [32].

CODEBOOK GENERATION AND OPTIMISATION
Definition 1. The minimum value of the Euclidean distances between any two symbols in a codebook $C$ is defined as the minimum distance of $C$, denoted as $d(C)$.
Definition 2. For a codebook $C$ and a symbol $s \notin C$, the minimum distance between $s$ and $C$, denoted as $d(s, C)$, is the minimum value of the Euclidean distances between $s$ and all of the symbols in $C$.

Definition 3. The maximum cosine similarity of a codebook $C$, denoted as $\cos(C)$, is the maximum value of the cosine similarities between any two symbols in $C$.

Definition 5. For $m$ points and a unit sphere in a multi-dimensional space, the process of distributing these points uniformly on the surface of this sphere is defined as Surface Packing.

Theorem 1. If the Euclidean norm of every symbol in codebook $C$ is 1, then the maximisation of $d(C)$ is equivalent to the minimisation of $\cos(C)$.
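Proof. Take two symbols $s_i$ and $s_j$ in $C$, where $s_i = (S_{i1}, S_{i2}, \ldots, S_{in})$ and $s_j = (S_{j1}, S_{j2}, \ldots, S_{jn})$. Since $\|s_i\| = \|s_j\| = 1$, the cosine similarity between $s_i$ and $s_j$ can be calculated by

$$\cos(s_i, s_j) = \frac{\sum_{k=1}^{n} S_{ik} S_{jk}}{\|s_i\| \, \|s_j\|} = \sum_{k=1}^{n} S_{ik} S_{jk}.$$

The Euclidean distance between $s_i$ and $s_j$ is calculated by

$$d(s_i, s_j) = \sqrt{\sum_{k=1}^{n} (S_{ik} - S_{jk})^2} = \sqrt{2 - 2\cos(s_i, s_j)}. \tag{7}$$

According to (7), $d(s_i, s_j)$ decreases monotonically as $\cos(s_i, s_j)$ increases, so the conclusion of Theorem 1 holds. □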
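The relationship used in the proof can also be checked numerically. The sketch below draws a few random unit vectors (the dimensions and the seed are arbitrary choices for illustration), verifies that the squared distance equals 2 minus twice the cosine similarity, and confirms that the closest pair is also the most similar pair.

```python
import math
import random


def normalise(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine(a, b):
    """Cosine similarity of two unit vectors (their dot product)."""
    return sum(x * y for x, y in zip(a, b))


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)
symbols = [normalise([rng.uniform(-1.0, 1.0) for _ in range(8)]) for _ in range(6)]
pairs = [(i, j) for i in range(len(symbols)) for j in range(i + 1, len(symbols))]

# Largest deviation from d^2 = 2 - 2*cos over all pairs.
max_identity_error = max(
    abs(distance(symbols[i], symbols[j]) ** 2 - (2.0 - 2.0 * cosine(symbols[i], symbols[j])))
    for i, j in pairs
)
# The pair that sets d(C) is the same pair that sets cos(C).
closest_pair = min(pairs, key=lambda p: distance(symbols[p[0]], symbols[p[1]]))
most_similar_pair = max(pairs, key=lambda p: cosine(symbols[p[0]], symbols[p[1]]))
```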

Generation of waveform symbols
Since sinusoidal signals can be stably transmitted over various types of voice channels, we choose a series of sinusoidal signals with equal frequency intervals in the voice frequencies to synthesise the waveform symbols. The analog form $e_i(t)$ of the waveform symbols in the codebook $C^M_{m \times n}$ is represented by

$$e_i(t) = \sum_{j=1}^{q} a_{ij} \sin(2\pi f_j t), \tag{8}$$

where $q$ stands for the number of subcarriers, $f_j = f_0 + (j - 1) \cdot \Delta f$ represents the frequency of the $j$th subcarrier, and $a_{ij} \in [-1, 1]$ denotes the coefficient of the $j$th subcarrier in the $i$th waveform symbol. We use $f_0$ and $\Delta f$ to indicate the starting frequency of the subcarriers and the interval between subcarrier frequencies, respectively. Let the sampling frequency be $f_s$; then the sample values $x_i = (X_{i0}, X_{i1}, \ldots, X_{i(n-1)})$ of $e_i(t)$ starting from $t = 0$ are given by

$$X_{ik} = e_i(k / f_s) = \sum_{j=1}^{q} a_{ij} \sin(2\pi f_j k / f_s), \quad k = 0, 1, \ldots, n - 1. \tag{9}$$

The waveform symbol $s_i$ is obtained by normalising $x_i$ as

$$s_i = \frac{x_i}{\|x_i\|}. \tag{10}$$
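A minimal sketch of this symbol generation is given below. It assumes zero-phase sine subcarriers, and the coefficient values and frequencies are hypothetical choices for illustration rather than parameters from the paper.

```python
import math


def make_waveform_symbol(coeffs, f0, delta_f, fs, n):
    """Sample a sum of sinusoidal subcarriers n times at rate fs and normalise it.

    coeffs[j] is the coefficient of the subcarrier at f0 + j * delta_f
    (j is 0-based here, whereas the text counts subcarriers from 1).
    Zero-phase sine subcarriers are an assumption of this sketch.
    """
    samples = []
    for k in range(n):
        t = k / fs
        samples.append(sum(a * math.sin(2.0 * math.pi * (f0 + j * delta_f) * t)
                           for j, a in enumerate(coeffs)))
    norm = math.sqrt(sum(x * x for x in samples))
    if norm == 0.0:
        raise ValueError("an all-zero waveform cannot be normalised")
    return [x / norm for x in samples]


# Hypothetical parameters: 8 kHz sampling, 32 samples per symbol,
# three subcarriers at 500, 750 and 1000 Hz.
symbol = make_waveform_symbol([1.0, -0.5, 0.25], f0=500.0, delta_f=250.0, fs=8000.0, n=32)
symbol_norm = math.sqrt(sum(x * x for x in symbol))
```

After normalisation every symbol lies on the unit sphere, which is what allows the codebook to be treated as a set of points on a sphere surface in the optimisation that follows.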

Optimisation for modulation codebook
The performance of a DoV scheme is mainly evaluated by the data transfer rate and the BER. For a codebook $C^M_{m \times n}$, the data transfer rate $r$ is calculated by

$$r = \frac{f_s}{n} \log_2 m. \tag{11}$$

Since $f_s$ is fixed, $r$ depends only on the codebook sizes $m$ and $n$; therefore, we take minimising the BER at each data transfer rate as the goal of codebook optimisation. When $\hat{i}_1$ is used as the hard decision result in demodulation as shown in (2), the BER in probability, $P_e$, is calculated by (12), where $P_{ei}$ denotes the probability of demodulation misjudgment of symbol $s_i$ after being transmitted. LaDue et al. [26] considered that the output signal $s'_i$ of $s_i$ over a voice channel conforms to the Gaussian distribution, while Kazemi et al. [33] assumed that $s'_i$ conforms to the Weibull distribution and the chi-square distribution. Regardless of which distribution is most realistic, $s'_i$ is scattered around its statistical mean. If the distribution is in a limited range, $P_e$ takes its minimum value when the proposition shown in Definition 4 has its optimal value.
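As a small worked example of the rate calculation, the sketch below assumes that each symbol of $n$ samples carries $\log_2 m$ bits, so the rate is the symbol rate $f_s / n$ times the bits per symbol. The function name is illustrative.

```python
import math


def data_rate_bps(m, n, fs):
    """Data transfer rate in bits per second for m symbols of n samples at sampling rate fs."""
    symbols_per_second = fs / n
    bits_per_symbol = math.log2(m)
    return symbols_per_second * bits_per_symbol


# m = 16 symbols of n = 32 samples at 8 kHz: 250 symbols/s times 4 bits/symbol.
rate = data_rate_bps(m=16, n=32, fs=8000.0)
```

Under this assumption, the rate can be raised either by enlarging the codebook (larger $m$) or by shortening the symbols (smaller $n$).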
According to Shannon, [34] the waveform symbols in a codebook $C_{m \times n}$ can be regarded as $m$ points on the surface of a unit sphere in the $n$-dimensional Euclidean space,

$$\mathbb{S}^{n-1} = \{x \in \mathbb{R}^n : \|x\| = 1\}, \tag{13}$$

where the centre of $\mathbb{S}^{n-1}$ is located at the origin of the space $\mathbb{R}^n$. If there is no constraint on $s_i$, the essence of the proposition shown in Definition 4 is a Surface Packing problem as shown in Definition 5. According to Leopardi, [35] the solution is to divide the spherical surface $\mathbb{S}^{n-1}$ evenly into $m$ parts and locate each $s_i$ at the centre of one area. Leopardi [35] studied this problem in depth in his doctoral thesis. As an illustrative example, Figure 2 shows the surface of a unit sphere in three-dimensional space divided into 32 equal-area regions according to his scheme.
Due to the constraints on the symbols shown in (9), the proposition of Definition 4 cannot be solved directly using Leopardi's scheme. Instead, we regard the subcarrier coefficients $a_{ij}$ of $s_i$ as a point $a_i = (a_{i1}, a_{i2}, \ldots, a_{iq})$ in the space $\mathbb{R}^q$, normalised as in (10). Optimising the modulation codebook based on Surface Packing can then be divided into two steps: determining the optimal frequency interval and searching for the optimal number of subcarriers.

Determining the optimal frequency interval
Given m and n, and ensuring that f j is within the speech frequency range of [300, 3400]Hz by initialising f 0 and q, the optimal deployment point a * i can be obtained by the q-dimensional Surface Packing, and then the codebook C and cos(C ) can be calculated respectively. Taking minimising cos(C ) as the goal, a one-dimensional searching for the frequency interval Δ f can be carried out by Algorithm 1.
Based on the calculations of Algorithm 1, the relationship between Δ f and cos(C ) is shown in Figure 3, and it can be seen that cos(C ) takes its minimum values when Δ f = 0.5 ⋅ k ⋅ f s ∕n (k = 1, 2, 3, …).
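One possible intuition for this result, offered here as an illustration rather than as the paper's argument, is that zero-phase sine subcarriers sampled over $n$ points become mutually orthogonal when their frequencies sit on a grid of spacing $0.5 \cdot f_s / n$. The sketch below compares an on-grid spacing with an arbitrary off-grid spacing; the parameter values and function names are hypothetical.

```python
import math


def sampled_sine(freq, fs, n):
    """n samples of a zero-phase sine at the given frequency."""
    return [math.sin(2.0 * math.pi * freq * k / fs) for k in range(n)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def max_abs_cosine(f0, delta_f, q, fs, n):
    """Largest absolute cosine similarity between any two of q sampled subcarriers."""
    carriers = [sampled_sine(f0 + j * delta_f, fs, n) for j in range(q)]
    return max(abs(cosine_similarity(carriers[i], carriers[j]))
               for i in range(q) for j in range(i + 1, q))


fs, n, q = 8000.0, 32, 4
f0 = fs / n
on_grid = max_abs_cosine(f0, 0.5 * fs / n, q, fs, n)     # spacing of 0.5 * fs / n
off_grid = max_abs_cosine(f0, 0.375 * fs / n, q, fs, n)  # arbitrary off-grid spacing
```

With the on-grid spacing the subcarriers are orthogonal up to rounding error, whereas the off-grid spacing leaves a clearly non-zero similarity between neighbouring subcarriers.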

Searching for the optimal number of subcarriers
Based on the relationship between $\Delta f$ and $\cos(C)$ from Figure 3, we construct Algorithm 2 to seek the optimal number of subcarriers $q^*$ by making $\cos(C)$ obtain its minimum value.

2: while $f_q < 3400$ Hz do
3:   Calculating $a^*_i$ according to Surface Packing, and calculating $C$ and $\cos(C)$ from $a^*_i$.
4:   if tmp < $\cos(C)$ then
8: end while
According to Algorithm 2, the relationship between $q$ and $\cos(C)$ is shown in Figure 4. It can be seen that $\cos(C)$ takes its minimum values when $q \geq m/2$, and that the minimum value of $\cos(C)$ is 0 when $f_0 = f_s / n$. It should be clear that, when $q < m/2$, the value of $\cos(C)$ in Figure 4 is not necessarily the optimal value under that condition.
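One configuration consistent with a minimum of $\cos(C) = 0$ at $q = m/2$ is a codebook made of $m/2$ mutually orthogonal subcarriers together with their negations. This is an inference offered for illustration, not the paper's Surface Packing procedure, and the parameter values below are hypothetical.

```python
import math


def unit_sine(freq, fs, n):
    """n samples of a zero-phase sine, scaled to unit Euclidean norm."""
    samples = [math.sin(2.0 * math.pi * freq * k / fs) for k in range(n)]
    norm = math.sqrt(sum(x * x for x in samples))
    return [x / norm for x in samples]


def max_cosine_similarity(codebook):
    """cos(C): the largest cosine similarity between two distinct unit-norm symbols."""
    return max(sum(x * y for x, y in zip(codebook[i], codebook[j]))
               for i in range(len(codebook)) for j in range(i + 1, len(codebook)))


fs, n, m = 8000.0, 20, 16
q = m // 2
f0, delta_f = fs / n, 0.5 * fs / n  # 400 Hz start, 200 Hz spacing
carriers = [unit_sine(f0 + j * delta_f, fs, n) for j in range(q)]
# Each coefficient vector is plus or minus one coordinate axis of the q-dimensional space.
codebook = carriers + [[-x for x in carrier] for carrier in carriers]
cos_c = max_cosine_similarity(codebook)
```

In this construction every pair of symbols is either orthogonal (similarity 0) or antipodal (similarity -1), so the maximum cosine similarity is 0 up to rounding error.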
At various data transfer rates, the optimal parameters used to generate codebooks are listed in Table 1.
Taking the parameters $m = 16$ and $n = 32$ listed in Table 1 as an illustrative example, some waveform symbols of the codebook generated with these parameters are shown in Figure 5.

Demodulation codebook acquisition via online learning
For a waveform symbol $y$, its stable output $\Phi(y)$ over a voice channel can be calculated by

$$\Phi(y) = E\big(y' - \Psi(\cdot)\big) \tag{14}$$

according to (1), where $\Psi(\cdot)$ is the dynamic output contributed by the memory component. Suppose that in practice the previous $p$ waveform symbols of $y$ conform to a uniform random distribution; if the amount of transmission of $y$ is very large, then $E(\Psi(\cdot)) \to 0$. Therefore, (14) can be simplified as

$$\Phi(y) \approx \frac{1}{N_t} \sum_{j=1}^{N_t} y'_j, \tag{15}$$

where $N_t$ denotes the amount of transmission of $y$, and the addition of the $y'_j$ is performed on vectors. According to (15), the demodulation codebook can be obtained by online learning, and a demodulation codebook learning model is established as shown in Figure 6. Here $D = \{d_j \in [1, m] \mid j = 1, 2, \ldots\}$ denotes a predefined random number set, and $y_j$ denotes a copy of the $k$th symbol ($k = d_j$) in the codebook $C^M_{m \times n}$. In operation, the modulator takes $d_j$ sequentially from $D$ and maps it into $y_j$ at the sending end. The codebook renovator learns and updates the demodulation codebook $C^D_{m \times n}$ based on the received signals $y'_j$ and $D$. Each symbol is sent $N_t$ times, and the total amount of data to be sent during the learning process is $m \cdot N_t$. Based on (15), a demodulation codebook learning algorithm is given in Algorithm 3.

FIGURE 4
The relationship between q and cos(C)

It can be known from (15) that in order to obtain the accurate stable output of the waveform symbols in the modulation codebook, it takes a long time for learning to satisfy the condition of N t → ∞, but this is not feasible in practice. Therefore, a tradeoff between the stable accuracy of the demodulation codebook and the learning time is required.
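The averaging idea behind the learning step can be sketched as follows. This is not the paper's Algorithm 3, which is not reproduced here. It assumes that both ends share the training index sequence and that symbol boundaries are already synchronised, and it uses 0-based indexes and a tiny hand-made set of received vectors in place of real vocoder output.

```python
def learn_demodulation_codebook(training_indexes, received_symbols, m, n):
    """Average the received copies of each modulation symbol to estimate its stable output."""
    sums = [[0.0] * n for _ in range(m)]
    counts = [0] * m
    for index, received in zip(training_indexes, received_symbols):
        counts[index] += 1
        for k in range(n):
            sums[index][k] += received[k]
    codebook = []
    for index in range(m):
        if counts[index] == 0:
            raise ValueError(f"symbol {index} never appeared in the training sequence")
        codebook.append([total / counts[index] for total in sums[index]])
    return codebook


# Illustrative data: each symbol is sent twice, and the received copies
# deviate from a common stable output in opposite directions.
training_indexes = [0, 1, 0, 1]
received_symbols = [[0.6, 0.1], [-0.1, 0.4], [0.4, -0.1], [0.1, 0.6]]
demod_codebook = learn_demodulation_codebook(training_indexes, received_symbols, m=2, n=2)
```

Increasing the number of repetitions per symbol reduces the residual of the memory-dependent term at the cost of a longer learning phase, which is the trade-off noted above.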

SIMULATION AND ANALYSIS
In order to test the performance of the proposed scheme, simulation scenarios are constructed by combining various vocoders and their coding rates.

Simulation settings
According to the proposed DoV modulation model, a simulation platform is built in MATLAB with two pairs of vocoders employed, as shown in Figure 7. Here $D$ is generated randomly in $[1, m]$; its front part is used for demodulation codebook learning and its latter part for the SER test. $C^M_{m \times n}$ is generated offline and $C^D_{m \times n}$ is obtained online via Algorithm 3. Each pair of vocoders contains an encoder and a decoder. The channel encoder and decoder make use of the soft decision results in demodulation to improve the demodulation accuracy.
The vocoders currently employed by various types of cellular mobile networks are shown in Table 2.
Since HR is rarely used currently and the ANSI C source code of the vocoders except EVRC and EVRC-B is available from ETSI, we choose FR, EFR, AMR and EVS to construct the simulation scenarios. The sampling frequency of EVS is set to 8 kHz.
Given the parameters shown in Table 1, when Δ f = k ⋅ f s ∕n(k = 1, 2, …), the generated waveform symbols are phasecontinuous, which can reduce the waveform distortion of the speech-like signals. Therefore, the optimal parameters employed to generate codebooks are set at each data transfer rate as shown in Table 3.
In the channel coding scheme, b symbols are grouped into a group, which could represent (b ⋅ R)-bit data. The first (b ⋅ R − c ) bits are set to the data to be transmitted, and the last c bits are set to the checksum of the first (b ⋅ R − c ) bits. In the simulation, b is set to 5 and c is set to 2. It needs to be clear that this channel coding scheme is just a simple demonstration of using the soft decision. Some error correction mechanisms can be used in a dedicated channel coding scheme.
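A minimal sketch of how such a grouped checksum could exploit the soft decision is shown below. Several details are assumptions made for illustration: $R$ is taken to be the number of bits per symbol, the checksum is taken to be the number of ones in the data bits modulo $2^c$ (the source does not specify the checksum function), and candidate combinations are ranked by the sum of their scores.

```python
import itertools


def checksum(data_bits, c):
    """Hypothetical checksum: number of ones in the data bits, modulo 2**c."""
    return sum(data_bits) % (1 << c)


def symbols_to_bits(symbols, r):
    """Concatenate the r-bit binary representations of the symbol indexes."""
    bits = []
    for symbol in symbols:
        bits.extend((symbol >> shift) & 1 for shift in range(r - 1, -1, -1))
    return bits


def group_is_valid(symbols, r, c):
    """True if the last c bits of the group equal the checksum of the preceding bits."""
    bits = symbols_to_bits(symbols, r)
    data_bits, check_bits = bits[:-c], bits[-c:]
    check_value = 0
    for bit in check_bits:
        check_value = (check_value << 1) | bit
    return checksum(data_bits, c) == check_value


def decode_group(candidates, r, c):
    """Pick the checksum-consistent combination of candidates with the highest total score.

    candidates holds, for each symbol position, a list of (index, score) pairs, best first.
    Falls back to the hard decision if no combination passes the checksum.
    """
    best = None
    for combo in itertools.product(*candidates):
        symbols = [index for index, _ in combo]
        if not group_is_valid(symbols, r, c):
            continue
        total = sum(score for _, score in combo)
        if best is None or total > best[0]:
            best = (total, symbols)
    return best[1] if best else [position[0][0] for position in candidates]


# b = 5 symbols, R = 4 bits per symbol (m = 16), c = 2 check bits, v = 3 candidates.
# The transmitted group is [3, 5, 0, 9, 7]; the hard decision on the second symbol is wrong.
candidates = [
    [(3, 0.95), (2, 0.30), (7, 0.20)],
    [(4, 0.80), (5, 0.78), (1, 0.20)],
    [(0, 0.90), (8, 0.25), (1, 0.10)],
    [(9, 0.92), (11, 0.30), (1, 0.15)],
    [(7, 0.88), (6, 0.35), (5, 0.20)],
]
decoded = decode_group(candidates, r=4, c=2)
```

In this example the hard decision fails the checksum, and the second-ranked candidate for the second symbol restores a consistent group. The exhaustive search over $v^b$ combinations is only practical for small groups such as $b = 5$ and $v = 3$.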
Among other parameters, f s is set to 8 kHz, v is set to 3, N t is set to 5, and 1,000,000 symbols will be sent in each test.

Result analysis
When the vocoder pair A and B are set to the same vocoder with the same channel coding rate, the test results of SER at each data transfer rate are shown in Table 4 and Figure 8. Table 4 and Figure 8 show that the SER are correlated to the data transfer rate and the channel coding rate in the following ways.
• The SER increases as the data transfer rate increases. On the one hand, when the data transfer rate is increased by increasing the number of symbols in the codebooks, the mutual Euclidean distances between the symbols become smaller, which reduces the symbols' separability. On the other hand, when the data transfer rate is increased by reducing the symbol length, more harmonics outside the speech spectrum are generated, which exacerbates the waveform distortions of the speech-like signals.
• The SER increases as the channel coding rate decreases. At a lower channel coding rate, a higher compression rate is selected in the vocoders in operation, which leads to a higher information loss rate and more severe distortions of the speech-like signals. In practical scenarios, the AMR and EVS vocoders will negotiate a lower coding rate in environments with poor signals.

Where 0.0000 indicates that the SER is less than $1.0 \times 10^{-6}$.

FIGURE 8
The relationship between SER and data transfer rate under the same vocoder with the same channel coding rate condition
It can also be seen from Figure 8 that the SER at the data transfer rate of 1200 bps is not higher than that at 1000 bps when other conditions are the same. Therefore, in the range of 800-1200 bps, the data transfer rate of 1200 bps is preferred.
When the vocoder pair A and B are set to different types of vocoders, the test results of SER at each data transfer rate are shown in Table 5 and Figure 9, which show that in a simulated environment composed of different vocoders, the SER is higher than that of the same vocoder, and there is a high correlation between the vocoders and their channel coding rates used in both parties. If the channel coding rates of both parties are high, the SER could be low. Any party using a lower channel coding rate will result in a higher SER.

Comparison with previous schemes
In terms of codebook optimisation, different algorithms are proposed in [26,27] and [30]. Among them, the GA algorithm proposed in [26] takes 25,000 iterations to generate the codebook symbols, the pattern search algorithm introduced in [27] needs 8000 iterations, while the algorithm used in [30] still needed a large amount of calculations with a roughly exponential convergence rate. The optimisation algorithm proposed is an analytical method, which has a better performance in the computational cost than the other three.
In terms of performance in SER, as shown in Table 6, the results show that the proposed scheme is superior to [27] over the FR vocoders, and is equivalent to [27] and [28] over the EFR vocoders. Under the AMR 12.2 vocoders conditions, the proposed scheme is slightly inferior to [30], equivalent to [27] and better than [25].
In terms of adaptability in cross-network scenarios, the schemes in previous studies are designed for some specific types of vocoders, which makes it difficult for them to adapt to different networks. The proposed scheme takes adaptability as an important feature, and the simulation results show that it achieves excellent performance in scenarios composed of various types of vocoders.

Where 0.0000 indicates that the SER is less than $1.0 \times 10^{-6}$.

FIGURE 9
The relationship between SER and data transfer rate under different types of vocoders condition

CONCLUSION
This paper studied transmitting digital data through cellular mobile network voice channels. A universal implementation scheme for DoV was constructed based on the waveform symbol mapping method. In order to make the proposed scheme applicable to various cellular mobile networks, we selected sinusoidal signals to synthesise waveform symbols due to their stable transmission characteristics over various cellular networks. On the other hand, we designed a learning algorithm to obtain demodulation codebooks online, which avoided optimising codebooks for different networks or vocoders.
To achieve high performance in SER, we first proposed an analytical algorithm based on Surface Packing to optimise modulation codebooks offline. The proposed algorithm distributes the waveform symbols evenly on the surface of a multi-dimensional unit sphere, which ensures that each symbol is as far away as possible from the other symbols in the codebook. Then, we used soft decision in demodulation, which supports reducing the SER by channel coding.
Simulation results show that the SER of the proposed scheme is not higher than $6 \times 10^{-5}$ at the data transfer rate of 1200 bps between any two vocoders of EFR, AMR1220 and EVS2440. Therefore, the proposed scheme performs well in cross-network scenarios.