Transcoding Unicode Characters with AVX-512 Instructions

Intel includes in its recent processors a powerful set of instructions capable of processing 512-bit registers with a single instruction (AVX-512). Some of these instructions have no equivalent in earlier instruction sets. We leverage these instructions to efficiently transcode strings between the most common formats: UTF-8 and UTF-16. With our novel algorithms, we are often twice as fast as the previous best solutions. For example, we transcode Chinese text from UTF-8 to UTF-16 at more than 5 GiB/s using fewer than 2 CPU instructions per character. To ensure reproducibility, we make our software freely available as an open source library. Our library is part of the popular Node.js JavaScript runtime.

IBM mainframes based on z/Architecture provide special-purpose instructions named "CONVERT UTF-8 TO UTF-16" and "CONVERT UTF-16 TO UTF-8" for translation between the two encodings [3]. By virtue of being implemented in hardware, these exceed 10 GiB/s processing speed for typical inputs. While commodity processors currently lack such dedicated instructions, they can benefit from single-instruction-multiple-data (SIMD) instructions.
Unlike conventional instructions, which operate on a single machine word (e.g., 64 bits), SIMD instructions operate on larger registers (128 bits, 256 bits, …) representing vectors of numbers. A single SIMD instruction may add eight pairs of 16-bit words at once. We can transcode gigabytes of text per second [2] by a deliberate use of conventional SIMD instructions (e.g., ARM NEON, SSE, AVX2).
In recent years, Intel introduced new SIMD instruction sets operating over registers as wide as 512 bits. If Intel had merely doubled the width of the registers, there would be little need for further work on our part. However, our experience suggests that to fully benefit from AVX-512 instructions, we need to use adapted algorithms [4]. Indeed, while AVX-512 instructions benefit from wider registers, Intel has also added many more instructions than what is typically found in SIMD instruction sets. There is also a slightly different model: AVX-512 instructions may consume or generate masks in mask registers, which have no equivalent in prior commodity instruction sets. In AVX-512, a mask is conceptually an array of 8, 16, 32, or 64 bits corresponding to vectors of 8, 16, 32, or 64 elements.
We present novel transcoding functions using AVX-512 instructions. On average, we are roughly twice as fast as the previous fastest functions [2] on commodity processors.

| UNICODE AND ITS ENCODINGS
Unicode is a standard based on the Universal Character Set (UCS). An extension to ASCII, UCS is a character set whose characters (called universal characters) have code points numbered from U+0000 to U+10FFFF (decimal 1 114 111). These code points are organized into 17 planes of 65 536 characters each, with the first plane U+0000-U+FFFF being called the Basic Multilingual Plane (BMP). Code points in the range 0xd800-0xdfff are reserved for surrogates used in the UTF-16 encoding and do not represent universal characters.
Unlike simpler character sets like ASCII, universal characters are seldom stored directly as integers, as such a storage format is wasteful and incompatible with existing byte-oriented environments. Instead, several Unicode Transformation Formats (UTF) are employed to store and process universal characters, depending on the use case at hand. A Unicode Transformation Format transforms each universal character into a sequence of integers, with the size of the integer depending on the format. Popular Unicode Transformation Formats include:

UTF-32 representing each universal character as a 32-bit integer. Mainly used as an internal representation.

UTF-16 representing each universal character as one or two 16-bit integers [5]. All code-point values up to U+FFFF are stored directly as 2-byte integer values. Otherwise we use surrogate pairs: two consecutive 2-byte values, each storing 10 bits of the code point. Used by Java, Windows NT, databases, binary protocols, and others.

UTF-8 representing each universal character as 1-4 bytes [6]. An extension to ASCII, UTF-8 is by far the most popular text encoding on the World Wide Web.
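To make the three transformation formats concrete, here is a minimal scalar sketch in C; the helper names (encode_utf32, encode_utf16, encode_utf8) are ours for illustration and are not part of the paper's library:

```c
#include <stddef.h>
#include <stdint.h>

/* UTF-32: each code point is stored directly as one 32-bit unit. */
static size_t encode_utf32(uint32_t cp, uint32_t out[1]) {
    out[0] = cp;
    return 1;
}

/* UTF-16: BMP code points are stored directly; all others become
   a surrogate pair of two 16-bit units. */
static size_t encode_utf16(uint32_t cp, uint16_t out[2]) {
    if (cp < 0x10000) { out[0] = (uint16_t)cp; return 1; }
    cp -= 0x10000;                               /* surrogate plane shift */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: upper 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: lower 10 bits */
    return 2;
}

/* UTF-8: 1-4 bytes depending on the magnitude of the code point. */
static size_t encode_utf8(uint32_t cp, uint8_t out[4]) {
    if (cp < 0x80) { out[0] = (uint8_t)cp; return 1; }  /* ASCII */
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}
```

For example, U+1F600 occupies one UTF-32 unit, the surrogate pair D83D DE00 in UTF-16, and the four bytes F0 9F 98 80 in UTF-8.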
Though our software work covers many cases (from UTF-8 to UTF-16 or UTF-32, from UTF-16 to UTF-8 or UTF-32, and so forth), we study the two most difficult cases: from UTF-8 to UTF-16 and back.
Multi-byte words in computers representing numerical values can be stored in either little-endian or big-endian format, depending on whether the first byte is the least significant or the most significant. Unicode Transformation Formats representing characters in units larger than bytes are subject to endianness. If the endianness is not known from the context, it can be given by adding an LE or BE suffix to the name of the Unicode Transformation Format, giving e.g. UTF-16BE or UTF-32LE. We can reverse the order of the bytes (between big and little endian) at high speed: e.g., using one instruction per 64 bytes. For simplicity, we present our results on UTF-8 and UTF-16LE.

| UTF-16
When the Universal Character Set was initially defined, it was meant to be a 16-bit character set with UTF-16 being its natural encoding, representing each universal character in one 16-bit word. It was later realized that 65 536 code points are insufficient to represent the writing systems of the world's many cultures, especially when having to account for over 50 000 Chinese, Japanese, and Korean ideographs. UCS was therefore extended past the Basic Multilingual Plane to code points up to U+10FFFF, and UTF-16 was retrofitted with a surrogate mechanism to permit representation of these newly added characters.
UTF-16 is a versatile Unicode Transformation Format as it permits (absent surrogates) easy processing of text in many popular languages, while not being as memory-hungry as UTF-32. By convention, a byte-order mark (BOM, U+FEFF) may be prepended to UTF-16 text; read with the wrong endianness, it appears as U+FFFE, which is not a valid character and can thus be assumed to be in wrong byte order, permitting automatic byte-order detection in many situations. Our algorithms do not make use of this convention and strictly assume UTF-16LE throughout. A BOM is neither generated, nor checked for, nor stripped.
As illustrated in the "UTF-16" column of Fig. 1, code points in the Basic Multilingual Plane are represented as themselves. Code points outside of this plane have 0x10000 subtracted from them (the surrogate plane shift), yielding a 20-bit number. This number is split into two 10-bit halves. The high half is tagged with 0xd800, yielding a high surrogate.
Likewise, the low half is tagged with 0xdc00, yielding a low surrogate. The character is then encoded by giving its high surrogate, directly followed by its low surrogate. It is for this purpose that code points in the range 0xd800-0xdfff do not represent universal characters.
Decoding UTF-16 is a matter of joining the bits of surrogate pairs, leaving Basic-Multilingual-Plane characters unchanged. Care must be taken to validate that each high surrogate is succeeded by a low surrogate and vice versa.
With this sequencing requirement ensured, all UTF-16 sequences are valid and have a 1:1 mapping to code points.
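The decoding rule above, including the surrogate plane shift and the sequencing check, can be sketched in scalar C; decode_utf16 is a hypothetical helper, not the paper's SIMD implementation:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-16 code unit or surrogate pair starting at s[0].
   On success, stores the number of units consumed and returns the
   code point; returns -1 on a mismatched or truncated surrogate. */
static int32_t decode_utf16(const uint16_t *s, size_t len, size_t *consumed) {
    uint16_t w = s[0];
    if (w < 0xD800 || w > 0xDFFF) {           /* Basic Multilingual Plane */
        *consumed = 1;
        return (int32_t)w;
    }
    if (w > 0xDBFF) return -1;                /* lone low surrogate */
    if (len < 2 || s[1] < 0xDC00 || s[1] > 0xDFFF)
        return -1;                            /* high surrogate not followed by low */
    *consumed = 2;                            /* join the two 10-bit halves */
    return 0x10000 + (((int32_t)(w - 0xD800) << 10) | (s[1] - 0xDC00));
}
```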

| UTF-8
The most popular Unicode Transformation Format is UTF-8, representing each universal character as a sequence of 1-4 bytes. Replacing the earlier UTF-1, the format was designed to be backwards-compatible with ASCII while also being safe for use in UNIX file names, and comes with many other desirable features. Under many circumstances, UTF-8 text can be processed as if it were a conventional ASCII-based 8-bit encoding like those of the ISO-8859 family.
This includes common applications like concatenation, substring search, field-splitting (with ASCII characters or UTF-8 strings for separators), and collation, rendering it the most popular UTF.
UTF-8 can be seen as an extension to ASCII, where each ASCII character (U+00-U+7F) is represented as itself, with other characters being represented by sequences of bytes in the range 0x80-0xf4 (cf. Fig. 1): a lead byte (0xc2-0xf4) indicates the length of the sequence in its tag bits, followed by 1-3 continuation bytes (0x80-0xbf), making the encoding stateless and self-synchronizing.
The details are summarized in the "UTF-8" column of Fig. 1: The bits of the code point are numbered A-W starting at the least significant bit. For each of the four possible cases (the ASCII/1-byte case, the 2-byte case, the 3-byte case, and the 4-byte case), the bits of the code point are copied into the lead and continuation bytes as indicated in the figure. Tag bits are applied (underlined in Fig. 1) to distinguish ASCII, lead, and continuation bytes.
For many universal characters, more than one encoding seems to be possible according to the figure. However, only the shortest possible encoding for each character is permitted, ensuring uniqueness of the encoding. While 4-byte sequences could encode code points in excess of U+10FFFF, such sequences are not legal either. The bytes 0xc0, 0xc1, and 0xf5-0xff are thus not used by UTF-8.
Decoding UTF-8 begins by looking at the tag bits to tell the start and length of each sequence. Then, the code point is assembled from the payload of these bytes. A critical part in decoding UTF-8 is validation, especially against overly-long sequences and illegal code points (surrogates, code points greater than U+10FFFF). In the algorithm presented in § 6 we demonstrate how decoding UTF-8 with comprehensive validation and then reencoding it into UTF-16 can be implemented efficiently, leveraging AVX-512 instructions.
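A scalar sketch shows what full validation entails (shortest-form, surrogate, and range checks); the helper below is illustrative only and unrelated to the vectorized algorithm of § 6:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode and fully validate one UTF-8 sequence starting at s[0].
   Returns the code point (storing the byte count in *consumed),
   or -1 on any encoding error. */
static int32_t decode_utf8(const uint8_t *s, size_t len, size_t *consumed) {
    static const uint32_t min_cp[4] = { 0x00, 0x80, 0x800, 0x10000 }; /* shortest-form limits */
    uint8_t b = s[0];
    int n;                                    /* number of continuation bytes */
    uint32_t cp;
    if (b < 0x80) { *consumed = 1; return b; }        /* ASCII */
    else if (b < 0xC0) return -1;                     /* stray continuation byte */
    else if (b < 0xE0) { n = 1; cp = b & 0x1F; }      /* 2-byte lead */
    else if (b < 0xF0) { n = 2; cp = b & 0x0F; }      /* 3-byte lead */
    else if (b < 0xF8) { n = 3; cp = b & 0x07; }      /* 4-byte lead */
    else return -1;                                   /* 0xf8-0xff never occur */
    if (len < (size_t)n + 1) return -1;               /* truncated sequence */
    for (int i = 1; i <= n; i++) {
        if ((s[i] & 0xC0) != 0x80) return -1;         /* missing continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min_cp[n]) return -1;                    /* overlong encoding */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return -1;
    *consumed = (size_t)n + 1;
    return (int32_t)cp;
}
```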

| RELATED WORK
There are relatively few academic publications on Unicode string processing using SIMD instructions. Cameron [7] proposed a UTF-8 to UTF-16 transcoder using SIMD instructions operating on bit streams. A bit stream is a transposition of the character input. For example, from 128 bytes of data, we produce eight 128-bit registers, with the first register containing the most significant bit of each input byte and the last register containing the least significant bit of each input byte. The transcoding from UTF-8 to UTF-16 is done in this bit-stream form, with a final phase where unused bytes are removed.

| NOTATIONAL CONVENTIONS
In the algorithms described below, all logical symbols refer to bitwise logic. Comparisons are performed between corresponding elements of vectors, yielding a bit mask of those elements for which the comparison holds. All arithmetic operations, shifts, and comparisons are performed on unsigned numbers. The width of the number depends on the vector used.
As a general convention, scalars, vectors of bytes, and masks derived from them are indicated with lowercase letters. Vectors of 16- or 32-bit words are indicated with uppercase letters. The symbol n is the number of bytes in a vector; for AVX-512 it is n = 64. This convention permits us to explain the algorithms in terms of AVX-512 instructions while giving generic formulae potentially applicable to other future instruction sets.
The operator precedence follows C precedence rules. Table 2 gives a list of the symbols used, in decreasing order of precedence.

| Mask Operations
Masks are conceptually arrays of bits (containing between 8 and 64 bits) meant to be used in conjunction with vectors having the same number of elements. For example, byte masks (noted m1, m234, …) may contain 64 bits if they correspond to vectors of 64 bytes. We also have word masks (e.g., M3) containing 16 bits when they correspond to 512-bit vectors of 32-bit values. See Appendices A and B for detailed lists of our masks and other variables. We operate on masks as if they were unsigned integer values: m+3 = m4 ≪ 3 means that the whole mask m4 is shifted to the left by three places to give m+3; the 64 individual mask bits are always either 0 or 1. The logical operations or (∨), and (∧), and not (¬) are applied bitwise. We have that m = 0 sets all bits to zero whereas m = ¬0 sets all bits to one.
In practice, the processor has several instructions dedicated to AVX-512 mask registers (e.g., kshiftrd, kandq, korb). Mask registers can be converted back and forth to general-purpose registers as needed, with the caveat that the conversion from mask registers to general-purpose registers may have a high latency (e.g., 3 cycles).

| Vector Operations
When operating on vectors, equations have to be read as "SIMD formulae" applying element by element. For example, we write w = m ? a + b : c to mean "each element of w is set to the sum of the corresponding elements in a and b if the corresponding bit is set in m, or to c otherwise." With an explicit index i = 0 … n − 1, the previous expression could be written as w[i] = m[i] ? a[i] + b[i] : c[i]. We believe that the presentation as "SIMD formulae" is easier to understand and prefer it where possible. Explicit indices are only used when permutations are involved. For example, we write w[i] = v[p[i]] to mean "w is v permuted by the index vector p." Conversions from one element size to another are not explicitly written out; watch the letter case of the variables used to see when this happens. All such conversions are zero-extensions or truncations.
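Read as plain C, the two formula styles correspond to element-wise loops like the following scalar models (byte elements chosen for concreteness; the function names are ours):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reading of the SIMD formula  w = m ? a + b : c :
   element i of w is a[i] + b[i] when bit i of mask m is set,
   and c[i] otherwise. */
static void blend_add(uint64_t m, const uint8_t *a, const uint8_t *b,
                      const uint8_t *c, uint8_t *w, size_t n) {
    for (size_t i = 0; i < n; i++)
        w[i] = ((m >> i) & 1) ? (uint8_t)(a[i] + b[i]) : c[i];
}

/* Scalar reading of a permutation  w[i] = v[p[i]]  (cf. vpermb):
   w is v permuted by the index vector p. */
static void permute(const uint8_t *v, const uint8_t *p, uint8_t *w, size_t n) {
    for (size_t i = 0; i < n; i++)
        w[i] = v[p[i]];
}
```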

Remark
The conventional binary notation presents the least significant bits last. When working with masks and vectors, these least significant bits correspond to the first elements of the vectors. This discrepancy in the order is a source of confusion, but it is difficult to avoid. Intel intrinsic functions reflect this confusion by providing two sets of functions to create new vectors, _mm512_set_* and _mm512_setr_*, depending on the preferred order [10].

| Special Functions
We use several special bit-manipulation functions corresponding to instructions available on contemporary x86 computers:

ctz The count trailing zeroes operation ctz(a) counts the number of trailing (least significant) zero bits in a, i.e., how often a can be divided by 2 until leaving an odd number. It corresponds to the bsf/tzcnt instructions of the x86 instruction set. Our algorithms never invoke ctz(0).
width The bit width operation width(a) counts the number of bits needed to represent a; that is, width(a) = ⌊log₂ a⌋ + 1. This operation is efficiently implemented on many architectures through the count leading zeroes operation (x86 instructions bsr/lzcnt). Our algorithms never invoke width(0).
popcount The population count operation popcount(a) computes the number of bits set in a.

compress The compress vector operation compress(m, v) is the only vector operation among our special functions. It performs the same operation as the parallel extract operation pext, but instead of extracting bits from a bit field, it extracts elements from a vector. This corresponds to the vpcompressb instruction on recent x86 processors. For the visualization, we give the mask m = 0xcd with the least significant bit on the left to make the operation easier to see. The least significant mask bit decides whether to keep the first vector element, and so on until the most significant mask bit decides whether to keep the last vector element:

v               12 34 56 78 9a bc de f0
m                1  0  1  1  0  0  1  1
kept elements   12 -- 56 78 -- -- de f0
compress(m, v)  12 56 78 de f0 00 00 00

Observe how we reversed the bit order of the mask m to match the natural vector order: its usual binary representation is 11001101.
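On GCC or Clang, the scalar special functions map directly onto compiler builtins, and compress(m, v) can be modeled with a loop; this is a scalar sketch of vpcompressb, not the vector instruction itself:

```c
#include <stddef.h>
#include <stdint.h>

static int ctz64(uint64_t a)      { return __builtin_ctzll(a); }      /* bsf/tzcnt; requires a != 0 */
static int width64(uint64_t a)    { return 64 - __builtin_clzll(a); } /* bits needed; requires a != 0 */
static int popcount64(uint64_t a) { return __builtin_popcountll(a); } /* popcnt */

/* Scalar model of compress(m, v): keep element i when bit i of m is set,
   pack the kept elements to the front, zero the rest. */
static void compress_bytes(uint64_t m, const uint8_t *v, uint8_t *out, size_t n) {
    size_t j = 0;
    for (size_t i = 0; i < n; i++)
        if ((m >> i) & 1)
            out[j++] = v[i];
    while (j < n)
        out[j++] = 0;
}
```

Running compress_bytes with m = 0xcd on the vector from the visualization above reproduces the result shown there.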

| AVX-512
Our algorithms are based on the AVX-512 family of instruction-set extensions to the Intel 64 instruction-set architecture [11]. An extension to the AVX family of instruction-set extensions, AVX-512 provides a comprehensive set of SIMD instructions operating on vectors of 16, 32, or 64 bytes organized into bytes or words of 16, 32, or 64 bits.
A register file of 32 vector registers zmm0-zmm31 complemented by 8 mask registers k0-k7 is provided.
AVX-512 instructions are generally non-destructive, writing their output into a separate operand from their inputs.
In most AVX-512 instructions, one operand is permitted to be a memory operand with the remaining operands being register or immediate operands.This is usually the first input operand, but for some instructions it may also be the output operand.
The AVX-512 instruction set is split into a set of extensions. Each extension adds new instructions to the Intel 64 architecture, enhancing the capabilities of AVX-512. Depending on the microarchitecture used, not all AVX-512 extensions might be available. Table 3 gives a list of the AVX-512 instructions used and the extensions they hail from. In the following, we list those AVX-512 extensions needed to execute the algorithms described in this paper:

AVX-512BW
The byte/word extension extends the AVX-512F instructions to vectors of bytes and 16-bit words.

AVX-512VBMI
The vector byte manipulation instructions extension adds instructions to permute and manipulate bytes.

AVX-512VBMI2
The vector byte manipulation instructions 2 extension adds compress/expand support and doublewidth shifts for bytes and 16-bit words.
The first generation of Intel 64 processors supporting all required AVX-512 extensions are those code named Icelake, based on the microarchitecture code named Sunny Cove. By emulating vpcompressb through other instructions, it is likely possible to adapt the algorithms to processors as early as the generation code named Cannon Lake, albeit at a significant reduction in performance.

TABLE 3 Selected AVX-512 instructions (excerpt): ktestd/q (BW) — test bitwise and/and-not of masks for all-zero; kortestd/q (BW) — test bitwise or of masks for all-zero/all-one.

| Masking
The output of most vector instructions is subject to masking, a novel feature of AVX-512. A mask register k1-k7 is applied to the output operand, specifying either merge masking or zero masking. With merge masking, only those vector elements indicated by bits set in the mask register are modified in the output operand. The other vector elements remain unchanged. With zero masking, vector elements for which the bits in the mask register are clear are zeroed out.
Masking on register operands is free for most instructions, though merge masking introduces an input dependency on the old value of the output operand.
Masking on memory operands enables memory fault suppression for most instructions.This means that the CPU does not signal memory faults for masked-out vector elements, permitting masked out elements to extend into unmapped or non-writable pages.This suppression affects both input and output memory operands.
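The two masking modes can be modeled in scalar C as follows; this is an illustrative sketch with bytes standing in for vector elements:

```c
#include <stddef.h>
#include <stdint.h>

/* Merge masking: only elements whose mask bit is set are written;
   the remaining destination elements keep their old values. */
static void mask_merge(uint64_t k, const uint8_t *src, uint8_t *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        if ((k >> i) & 1)
            dst[i] = src[i];
}

/* Zero masking: elements whose mask bit is clear are zeroed out. */
static void mask_zero(uint64_t k, const uint8_t *src, uint8_t *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = ((k >> i) & 1) ? src[i] : 0;
}
```

The input dependency of merge masking is visible in the model: mask_merge reads the old contents of dst, while mask_zero does not.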

| Microarchitectural Details
To simplify the implementation of AVX-512 on microarchitectures designed to execute the older SSE and AVX families of instruction-set extensions, most SIMD instructions operate within lanes of 16 bytes. That is, in many ways, it is as if the 64-byte vector registers were made of four nearly independent 16-byte subregisters. Instructions that process data across lanes (such as vpermb or vpcompressb) exist, but can typically execute on fewer execution units and take longer to execute than instructions that do not. We thus want to avoid cross-lane operations if feasible.
On current Intel microarchitectures including Sunny Cove (Icelake), Cypress Cove (Rocket Lake), and Willow Cove (Tiger Lake), most AVX-512 instructions can execute on execution ports 0, 1, and 5. Instructions that do not cross lanes usually execute in a single cycle; instructions that do take 3 or more cycles. Some instructions are restricted in the ports they can execute on: shifts can only execute on ports 0/1; permutations and other cross-lane instructions, as well as comparisons into masks, can only execute on port 5. Instructions operating on masks (i.e., those whose mnemonics start with k) are restricted to either port 0 or port 5, depending on the instruction [12, 13].
In addition to these restrictions, ports 0 and 1 support a vector length of only 32 bytes while port 5 supports the whole 64 bytes. Instructions operating on a vector length of 64 bytes are executed either on port 5 or on ports 0/1 joined together, occupying both ports for one cycle simultaneously. Thus, there are effectively only two ports available to execute instructions with a 64-byte vector length. While 32-byte vectors are processed at 3 vectors of 32 bytes (i.e., 6 lanes) per cycle, 64-byte vectors are processed at only 2 vectors of 64 bytes (i.e., 8 lanes) per cycle, leading to a theoretical speedup by a factor of 4/3 (or 33 %) for 64-byte vectors over 32-byte vectors in an otherwise identical algorithm. This stands in contrast to the factor-2 (100 %) speedup one would naïvely expect from doubling the vector length.
It is vital for the performance of AVX-512 code to keep track of which ports instructions execute on, rearranging or editing the code such that both port 0/1 and port 5 can execute instructions at the same time [14].Through the use of microarchitectural simulation [15] in the design of the algorithms, good port utilization has been ensured.

| TRANSCODING FROM UTF-8 TO UTF-16
We transcode UTF-8 to UTF-16 by gathering the bytes that make up each character from the last byte of each character to its first byte. This exploits the similarity in bit arrangement between the four cases (ASCII, 2-byte, 3-byte, and 4-byte) highlighted in Fig. 1a. The bytes of each UTF-8 sequence are isolated from the input string, liberated of their tag bits, shifted into position, and finally summed up into a code point.
Using the exact correspondence between 4-byte UTF-8 characters and characters represented as surrogate pairs in UTF-16, we treat 4-byte characters as an overlapping pair of a 3-byte sequence and a 2-byte sequence that is later fixed up into a high and a low surrogate. This saves us extra code for extracting the fourth-last byte of each sequence and avoids the costly use of 32-bit words for intermediate results.
To illustrate this idea, consider the example of translating the Unicode characters U+40 (@), U+A7 (§), and U+2208 (∈). The algorithm can be roughly described with the following plan of attack:

1. Read a vector of 64 bytes.

2. Classify each byte according to whether it is an ASCII byte, continuation byte, 2-byte lead byte, 3-byte lead byte, or 4-byte lead byte.
3. Construct a mask indicating the last byte of each UTF-8 sequence. For 4-byte characters, the third byte is indicated, too, treating them as a 3-byte sequence for the high surrogate and a 2-byte sequence for the low surrogate.
4. Use the mask to gather the last, 2nd-last, and 3rd-last byte of each sequence.
5. Strip tag bits, shift bits into place, and or them into UTF-16 words.
6. Postprocess surrogates by shifting their bits into place, and applying tag bits and the surrogate plane shift.
7. Write the resulting bytes to the output, incrementing the input and output pointers by the number of bytes consumed/generated.
8. Repeat until the end of input or an encoding error is encountered.
Apart from this general plan, there are also fast paths for the cases of (a) ASCII characters only, (b) ASCII and 2-byte sequences only, and (c) 1-3-byte sequences only.
Validation is performed throughout the transcoding process, as explained in § 6.4. In comparison to previous algorithms, it is simplified by advancing the input only by complete UTF-8 sequences; if the input is correct UTF-8, each vector of input thus begins with a complete sequence.

| Classification and Masks
After reading a vector of bytes from the input buffer, the characters in it are classified according to the range they fall into. Various masks are then built from this classification. In the following explanations, we follow the convention from § 4, where names of the form m… refer to masks about the input vector w_in while names of the form M… refer to masks about the output vector.
These two kinds of masks are connected through the pext and pdep operations, relating the end bytes of the decoded UTF-8 sequences to the UTF-16 words they correspond to and vice versa.
The first set of masks is derived directly from w_in, classifying the input into ASCII bytes (m1: w_in < 0x80), 2/3/4-byte sequence lead bytes (m234: 0xc0 ≤ w_in), 3/4-byte sequence lead bytes (m34: 0xe0 ≤ w_in), and 4-byte sequence lead bytes (m4: 0xf0 ≤ w_in). From these we then derive a mask m1234 indicating the presence of any kind of lead byte. All other bytes (¬m1234) are continuation bytes.
Then we construct the important mask m_end identifying the last bytes of each sequence to be decoded. These are the last bytes of each UTF-8 sequence as well as the third byte of each 4-byte sequence. Working backwards from these last bytes, we later use this mask to gather the last, second-last, and third-last bytes of each sequence.
The key insight in constructing m_end is that, as each UTF-8 sequence is followed by another UTF-8 sequence, we can find the positions of the last bytes as those preceding the lead bytes of the next sequence (m1234 ≫ 1). The third byte of each 4-byte sequence is added by first computing the fourth byte of each sequence (m+3 = m4 ≪ 3) and then shifting the result to the right by one (m+3 ≫ 1). An unfortunate consequence of defining m_end by going backwards from the lead bytes of the next characters is that we only catch the last character of the vector when it is followed by an incomplete character whose lead byte we can shift to the right. For 4-byte sequences right at the end of the vector, this leads to us only detecting the third-last byte of the character. Oring in m+3 at the end fixes this problem for the 4-byte case, giving m_end = ((m1234 ∨ m+3) ≫ 1) ∨ m+3.
For the other cases, the only effect of this process is that if w_in does not end in a partial character, decodes to no more than 32 words of UTF-16, and the last character is not a 4-byte sequence, we process one character less in the current iteration than possible. However, the minor performance impact of hitting this edge case is more than outweighed by not spending extra time computing the mask correctly. To visualize the various masks, consider the strings "x∇𝔓" and "ε≤±1" with a vector length of n = 8 bytes:

w_in                 78 e2 88 87 f0 9d 94 93    ce b5 e2 89 a4 c2 b1 31
m1 = w_in < 0x80      1  0  0  0  0  0  0  0     0  0  0  0  0  0  0  1
m234 = 0xc0 ≤ w_in    0  1  0  0  1  0  0  0     1  0  1  0  0  1  0  0
m34 = 0xe0 ≤ w_in     0  1  0  0  1  0  0  0     0  0  1  0  0  0  0  0
m4 = 0xf0 ≤ w_in      0  0  0  0  1  0  0  0     0  0  0  0  0  0  0  0
m_end                 1  0  0  1  0  0  1  1     0  1  0  0  1  0  1  0

In this example, masks are arrays of eight bits corresponding to eight-byte sequences; in our actual implementation, we use 64-bit masks. Note in particular how m_end accounts for the last character in the left string (being a 4-byte character), but not in the right string, where it is an ASCII character. Also note how the character 𝔓 has two end bits, being treated as a 3-byte sequence overlapping a 2-byte sequence.
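The classification masks and m_end can be modeled with ordinary 64-bit integers, bit i referring to byte i of the input; this scalar sketch follows the formulas above and is not the paper's AVX-512 code:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t m1234;   /* any lead byte (ASCII or lead of a multi-byte sequence) */
    uint64_t m234;    /* 2/3/4-byte sequence lead bytes (0xc0 <= b) */
    uint64_t m34;     /* 3/4-byte sequence lead bytes   (0xe0 <= b) */
    uint64_t m4;      /* 4-byte sequence lead bytes     (0xf0 <= b) */
    uint64_t m_end;   /* last byte of each sequence, plus the third byte of 4-byte sequences */
} utf8_masks;

static utf8_masks classify(const uint8_t *in, size_t n) {
    utf8_masks r = {0, 0, 0, 0, 0};
    for (size_t i = 0; i < n; i++) {
        uint8_t b = in[i];
        if (b >= 0xC0) r.m234 |= 1ULL << i;
        if (b >= 0xE0) r.m34  |= 1ULL << i;
        if (b >= 0xF0) r.m4   |= 1ULL << i;
        if (b < 0x80 || b >= 0xC0) r.m1234 |= 1ULL << i;
    }
    uint64_t m_plus3 = r.m4 << 3;  /* fourth byte of each 4-byte sequence */
    /* Last bytes sit just before the next lead byte; oring in m_plus3
       also catches a 4-byte sequence that ends the vector. */
    r.m_end = ((r.m1234 | m_plus3) >> 1) | m_plus3;
    return r;
}
```

On the left example string above ("x", NABLA, MATHEMATICAL FRAKTUR CAPITAL P) the model reproduces the masks shown in the visualization.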

| Assembling Characters
With these masks in hand, we can strip off the tag bits and assemble characters. The UTF-8 tag bits are stripped off by clearing the two most significant bits of each non-ASCII byte in w_in, giving w_stripped. The tag bits of 3/4-byte lead bytes are not completely removed by this step; this is sufficient for our purposes as these tag bits get shifted out later on.
Characters are assembled by selecting from w_stripped the last (W_end), second-last (W−1), and third-last bytes (W−2) of each sequence, zero-extending them to 16 bits and joining their bits into a UTF-16 word. We do this by first preparing a permutation vector P that holds, for each word in the output vector, the index of the last byte of the corresponding sequence. This vector is prepared by compressing (vpcompressb) a byte vector holding an identity permutation (0, 1, …, 63) subject to m_end. The compressed vector is then zero-extended (vpmovzxbw) to 16-bit words, keeping its first n/2 elements. We only generate one vector of UTF-16 words per iteration, representing at most 32 characters. When the input contains ASCII characters, it is possible for m_end to contain more than 32 set bits. Bits set in m_end past the 32nd bit are discarded during the processing: P contains only n/2 (or 32) elements. With P in hand, we can load the last byte of each sequence with a single permutation instruction (vpermb). By decrementing the entries of P, we produce index vectors corresponding to the second-last and third-last bytes of each sequence. To avoid loading the third-last byte of a 1/2-byte sequence or the second-last byte of an ASCII sequence, we mask w_stripped with masks that clear out bytes before ASCII characters resp. those that do not start a 3/4-byte sequence, accounting for possible wraparound. We then obtain our vectors as desired. The last, second-last, and third-last bytes are shifted into place and ored together such that the bits A-W are contiguous, giving W_sum. High surrogates always precede low surrogates in W_out.
(Footnote: If a perfect m_end mask is desired, one can instead shift the first byte of each sequence to the position of its last byte, where m2 = m234 ∧ ¬m34 indicates 2-byte sequence lead bytes. The mask is then post-processed by clearing the third byte of a 4-byte sequence starting in the third-last byte of w_in, as only complete sequences can be processed.)
Surrogates are fixed up by shifting high surrogates into position and applying the surrogate plane shift and tag bits, giving W_out (Eq. 25). For illustration purposes, we provide C code implementing Eq. 25 using Intel intrinsic functions: see Fig. 2.
The vector W_out holds the UTF-16LE encoded characters we want to write out. There is a final issue: the 64 bytes of UTF-8 data in the input may correspond to anywhere from 21 to 64 words of output, of which the first up to 32 words are processed. If a surrogate pair happened to straddle the end of W_out, we would discard the corresponding low surrogate and produce an incorrect result. So once again, special care must be taken to omit the 32nd word of output if it is a high surrogate. We do so by computing a mask M_out of the elements of W_out, excluding the last element if it happens to be a high surrogate. We introduce a variable b which is set to all ones (b = ¬0) except at the end of the input (cf. § 6.3). By depositing the mask M_out into the last bytes of each sequence, we obtain a mask m_processed holding the locations of the last byte of each sequence that has been processed into a word in W_out.

__m512i mask_d7c0d7c0 = _mm512_set1_epi32(0xd7c0d7c0);
__m512i mask_dc00dc00 = _mm512_set1_epi32(0xdc00dc00);
// ...
// Mlo, Mhi and Wsum have been computed; we compute Wout.
__m512i lo_surr_mask = _mm512_maskz_mov_epi16(Mlo, mask_dc00dc00);
__m512i shifted4_Wsum = _mm512_srli_epi16(Wsum, 4);
__m512i tagged_lo_surrogates = _mm512_or_si512(Wsum, lo_surr_mask);
__m512i Wout = _mm512_mask_add_epi16(tagged_lo_surrogates, Mhi, shifted4_Wsum, mask_d7c0d7c0);

FIGURE 2 C code using Intel intrinsic functions equivalent to Eq. 25.
With this mask, we can compute the number of bytes of input processed, n_in = width(m_processed), and the number of words of output produced, n_out = popcount(M_out). The first n_out words of the output vector are then deposited into the output buffer, the input and output buffers are advanced by n_in and n_out, and we continue with the next iteration.
To visualize the generation of m_processed, consider the example string "±1=𝒪" with a vector length of n = 8 bytes. As the character 𝒪 (UTF-16: D835 DCAA) straddles the end of the vector's output capacity, it cannot be processed in the current iteration. Depositing the bits of M_out through m_end, we obtain

m_processed      0  1  1  1  0  0  0  0
bytes processed  C2 B1 31 3D -- -- -- --

and advance the buffers by n_in = 4 bytes and n_out = 3 words respectively. The bytes corresponding to 𝒪 will be processed again in the next iteration.

| Processing the Tail
The final bit of input with fewer than 64 bytes remaining (the tail) is handled through the variable b. This variable holds a mask of those bytes in w_in we are permitted to process. Initially we set b = ¬0, permitting all bytes to be processed.
When the end of the input with ℓ < n bytes remaining to be processed is reached, we set b to a mask of the first ℓ bytes, b = (1 ≪ ℓ) − 1. The tail of the input is read zero-masked by b, padding it with NUL bytes. Then, a final iteration of the main loop is performed, processing only the bytes accounted for in b.
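The tail logic can be sketched as follows; tail_mask and masked_read are hypothetical scalar stand-ins for the masked vector load guarded by b:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* b masks the first `remaining` bytes when fewer than 64 remain. */
static uint64_t tail_mask(size_t remaining) {
    return remaining >= 64 ? ~0ULL : (1ULL << remaining) - 1;
}

/* The tail is read zero-masked by b, padding the 64-byte vector with
   NUL bytes (modeled here with memset + a bounded copy). */
static void masked_read(const uint8_t *src, size_t remaining, uint8_t out[64]) {
    memset(out, 0, 64);                        /* NUL padding */
    size_t len = remaining < 64 ? remaining : 64;
    memcpy(out, src, len);                     /* only the bytes allowed by b */
}
```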

| Input Validation
Throughout the transcoding process, we check the input for encoding errors and abort transcoding if any such error occurs. Aborting is done by determining the location of the encoding error and setting the remaining input length ℓ to the number of bytes preceding the first error. We then clear all input bytes starting at the first erroneous byte and jump to the tail-handling code from § 6.3, effectively restarting the current iteration as the "final" iteration.
Having discussed how to continue after an error has occurred, we shall now direct our attention to the kinds of errors we have to check for. A UTF-8 encoded document must conform to the following rules:

1. Bytes 0xf5-0xff must not occur.

2. Lead and continuation bytes must match: each byte in the range 0xc0-0xdf must be followed by one continuation byte, each byte in 0xe0-0xef by two continuation bytes, and each byte in 0xf0-0xf4 by three continuation bytes.
3. Continuation bytes may not otherwise occur.

4. The decoded character must be larger than U+7F for 2-byte sequences, larger than U+7FF for 3-byte sequences, and larger than U+FFFF for 4-byte sequences.

5. The character must be no greater than U+10FFFF.

6. The character must not be in the range U+D800-U+DFFF.
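For reference, the six conditions can be restated as a byte-at-a-time checker. This is not the SIMD algorithm, merely a scalar restatement of the rules (the function name is ours); it returns the length of the valid prefix, mirroring how the algorithm sets ℓ on error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference check of rules 1-6; returns the number of valid bytes
 * before the first error, or len if the whole input is valid. */
static size_t utf8_valid_prefix(const uint8_t *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t b = s[i];
        size_t n; uint32_t cp, min;
        if (b < 0x80)      { n = 1; cp = b;        min = 0; }
        else if (b < 0xc0) return i;                 /* rule 3: stray continuation   */
        else if (b < 0xe0) { n = 2; cp = b & 0x1f; min = 0x80; }
        else if (b < 0xf0) { n = 3; cp = b & 0x0f; min = 0x800; }
        else if (b < 0xf5) { n = 4; cp = b & 0x07; min = 0x10000; }
        else return i;                               /* rule 1: 0xf5-0xff            */
        if (i + n > len) return i;                   /* truncated sequence           */
        for (size_t j = 1; j < n; j++) {
            if ((s[i + j] & 0xc0) != 0x80) return i; /* rule 2: missing continuation */
            cp = cp << 6 | (s[i + j] & 0x3f);
        }
        if (cp < min) return i;                      /* rule 4: overlong encoding    */
        if (cp > 0x10ffff) return i;                 /* rule 5: beyond U+10FFFF      */
        if (cp >= 0xd800 && cp <= 0xdfff) return i;  /* rule 6: surrogate            */
        i += n;
    }
    return i;
}
```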
We check for these rules throughout the algorithm, mostly reusing masks we already have to compute for other steps of the code. Three checks are performed in total.

Overlong 2-byte sequences
Right at the beginning, we check whether any of the bytes 0xc0 or 0xc1 occur. The presence of these bytes indicates a 2-byte sequence encoding a code point below U+80, violating condition 4. The first invalid input byte is the first 0xc0 or 0xc1 byte found.

Mismatched continuation bytes
After computing the various classification masks, we check that conditions 2 and 3 hold. As each byte of UTF-8 is either a lead or a continuation byte, we check this by computing where continuation bytes should be (m_c) and comparing this with where lead bytes are not. We compute m_c from the locations of the second (m_+1), third (m_+2), and fourth (m_+3, see Eq. 11) bytes of each sequence. Conveniently, this check also fails if the input starts with continuation bytes, violating the invariant established earlier. We do not catch a UTF-8 sequence straddling the end of the vector; such a sequence is checked properly in the next iteration once additional bytes have been fed in.
If this check fails, we must distinguish two cases to determine the location of the first encoding error. If the first mismatch of m_c and m_1234 is due to a continuation byte present where there should not be one, the first invalid byte is that byte. Otherwise, a continuation byte is missing where there should be one, and the corresponding lead byte is the first invalid byte. This byte can be found by masking m_1234 to all bits preceding the mismatch and then finding the last (most significant) bit in it, corresponding to the lead byte that is missing a continuation byte.
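The most-significant-bit search can be sketched in portable C. Names are ours; the masks m_1234 (lead-byte positions) and the mismatch mask are assumed to be 64-bit values with one bit per input byte, and we assume a GCC/Clang-style count-leading-zeros builtin.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the lead byte missing its continuation: mask m1234 to the bits
 * strictly below the first (lowest) mismatch bit, then take the most
 * significant remaining bit. Bit i stands for byte i of the chunk.
 * Assumes at least one lead byte precedes the mismatch. */
static int first_bad_lead(uint64_t m1234, uint64_t mismatch) {
    uint64_t first = mismatch & -mismatch;  /* isolate lowest mismatch bit   */
    uint64_t below = m1234 & (first - 1);   /* lead bytes before the mismatch */
    return 63 - __builtin_clzll(below);     /* most significant one          */
}
```

For example, with lead bytes at positions 0 and 3 and a mismatch at position 5, the offending lead byte is the one at position 3.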

Encodings out of range
Finally, we check that the code points encoded by 3- and 4-byte sequences are in range (conditions 4 and 5) and that 3-byte sequences do not encode surrogates (condition 6). The algorithm treats input bytes in the range 0xf5-0xff as lead bytes of 4-byte sequences. Such sequences encode code points well in excess of U+10FFFF, allowing us to verify condition 1 as a side effect with no extra code.
We augment our existing mask set with a mask indicating the locations of 3-byte sequence start bytes in w_in. Shifting the mask to indicate the last byte of each 3-byte sequence, extracting through m_end, and truncating to n/2 bits, we obtain a mask M_3 indicating which words in W_out correspond to 3-byte sequences. We then use M_3 to check if any 3-byte sequences encode code points below U+800, indicating violations of condition 4.
Then we check for surrogates: words in M_3 must not encode surrogates, while words in M_hi must encode high surrogates (condition 6). A word in M_hi produces a high surrogate if and only if the code point it encodes is in the range U+10000-U+10FFFF (conditions 1, 4, and 5). The resulting masks indicate violations of these conditions, and the check succeeds if no offending words are found. If an offending word is found, the first invalid byte is the start byte of the corresponding sequence. As the error can never occur in a low surrogate, we can find its location by projecting it back onto the locations of the first and fourth bytes of every sequence.

| Fast Paths
Three fast paths are provided, speeding up common cases. The first two are programmed such that they cannot be triggered in the "final" iterations for the tail or in case of an encoding error, allowing us to omit the handling of b in their length computations for a further performance increase.

ASCII only
If the first 32 bytes of input are all ASCII bytes, we process them by zero-extending (vpmovzxbw) the first 32 bytes to 16-bit words. The number of bytes processed is always 32, and the number of words written out is always 32, shortening the dependency chain to the next iteration. No validation is needed in this case, as ASCII bytes are always valid.
Only the first 32 bytes are considered before embarking on the fast path, as the default path does not process more than 32 characters in any case. Hence, while checking for all 64 bytes to be ASCII would allow for slightly faster processing in the all-ASCII case, performance for documents with short runs of ASCII characters amidst other characters (e.g. HTML documents) would suffer significantly, outweighing the benefits.

1/2 byte only
In the absence of 3- and 4-byte sequences (m_34 = 0), we employ a simplified variant of the algorithm. While following the same operating principles as the main algorithm, we can take some shortcuts in the proven absence of 3- and 4-byte sequences. First, the computation of some masks is greatly simplified, with most masks being entirely irrelevant for this path. We then employ a simplified scheme to compute W_out: instead of masking out tag bits, we subtract 0xc2 from the lead byte of each 2-byte sequence to cancel out the tag bits of both lead and continuation byte. Instead of first building a permutation vector P and then using it to permute the input bytes into place, we directly compress the bytes into position (vpcompressb) and then zero-extend to 16-bit words (vpmovzxbw), giving W_end = compress(m_end, w_in), with W_−1 obtained analogously. The vectors W_end and W_−1 must be merged by addition instead of bitwise or to correctly cancel out the tag bits. The operation on 2-byte characters can be visualized as follows; the subtraction of 0xc2 is shown separately to illustrate the idea:

      W_end          0000 0000 10FE DCBA
    + W_−1 ≪ 6       0011 0LKJ HG00 0000
    − 0xc2 ≪ 6       0011 0000 1000 0000

We want to increment the input pointer quickly, without a long chain of operations. We find it advantageous to always process half a vector (32 bytes, or 33 bytes to include a final continuation byte) of input data per iteration, as in the ASCII-only fast path. While this approach usually processes less data than first determining the maximum number of input bytes we can process, being able to load the next data sooner is more important. We avoid accessing the SIMD masks to determine whether we advance by 32 or 33 bytes.
Thus we process 32 bytes per iteration unless a 2-byte sequence straddles the middle of the vector, in which case we process that extra byte, too.
The output buffer is advanced by the number of characters starting in the first 32 bytes. As for validation, the checks for "encodings out of range" are omitted. The check for "mismatched continuation bytes" is simplified, as continuation bytes must always directly follow 2-byte sequence lead bytes. The combination of all these simplifications yields a code path of roughly half the latency of the standard code path.
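The tag-cancellation subtraction can be checked in scalar code. The function below (name ours, not from the library) reproduces the 0xc2 trick for a single 2-byte sequence: ((lead − 0xc2) ≪ 6) + cont cancels the tag bits of both the lead byte (110xxxxx) and the continuation byte (10xxxxxx) in one operation.

```c
#include <assert.h>
#include <stdint.h>

/* Decode one 2-byte UTF-8 sequence by cancelling both tags at once:
 * (lead<<6) + cont - (0xc2<<6) equals the code point because the lead
 * tag contributes 0xc0<<6 and the continuation tag contributes 0x80,
 * and 0xc2<<6 = (0xc0<<6) + (0x02<<6) = (0xc0<<6) + 0x80. */
static uint16_t decode2(uint8_t lead, uint8_t cont) {
    return (uint16_t)(((lead - 0xc2) << 6) + cont);
}
```

For example, the sequence C2 B1 decodes to U+00B1 (±) and DF BF to U+07FF, the largest 2-byte code point.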

1/2/3 byte only
In the absence of 4-byte sequences (m_4 = 0), all characters are in the Basic Multilingual Plane. In this common case, we can slightly simplify the main routine. We have that m_+3 = m_4 ≪ 3 is zero. Consequently, we can simplify the definitions of m_c and m_end. The separate computation of W_out and M_out is eliminated: as no surrogates are present, we can omit the surrogate post-processing and need not account for surrogate pairs straddling the end of the vector. Instead, we directly get W_out = W_sum and a correspondingly simplified M_out. Finally, the validation check for out-of-range encodings is slightly simpler: as surrogates cannot occur, we can drop the M_4s term from Eq. 48.

| TRANSCODING FROM UTF-16 TO UTF-8
As explained in § 2.1, UTF-16 encodes characters in the Basic Multilingual Plane (U+0000-U+FFFF) in one 16-bit word and all others in two words as a surrogate pair. To encode a code point as a surrogate pair, 0x10000 is subtracted from the character code to obtain a 20-bit binary number. The most significant 10 bits are added to 0xD800 to form the high surrogate, which is followed by the less significant 10 bits added to 0xDC00, producing the corresponding low surrogate.

1. Read a vector of 16-bit words.

3. Zero-extend each 16-bit word to a 32-bit word and join low and high surrogates.

4. Shuffle the bits within each 32-bit word into the right positions and apply tag bits according to the type of character, producing UTF-8 sequences padded with null bytes.

5. Compress this vector, squeezing out the padding bytes.

6. Write the byte string to the output buffer and proceed to the next iteration.
Apart from this general plan, we also have fast code paths for the three cases of (a) ASCII characters only, (b) all characters in U+0000-U+07FF, and (c) no surrogates, complementing the default code path (d) with surrogates present. Which code path to take is decided based on the characters in the current 62-byte chunk of input. We expect that most text inputs consistently rely on the same code paths. Thus the branches corresponding to the various fast paths are easy to predict, and we expect that they may provide a significant performance boost.
We would now like to explain the steps of this plan of attack in detail. The steps are interlinked, with information produced in each step being reused in subsequent steps. Additionally, the classification masks are reused for input validation.
First, 32 words (i.e. 64 bytes) of input are loaded from memory into W_in. Of these words, 31 are encoded in the current iteration, with the last word serving as a lookahead for surrogate processing (§ 7.2). The mask L indicates the position of the lookahead word in W_in.

| Classification and Fast Paths
We first need to find out which UTF-8 cases the characters in our input correspond to. Comparing the 16-bit words in the input vector with 0x0080 and 0x0800, we produce masks telling us whether non-ASCII (i.e. 2-, 3-, or 4-byte) characters are present and which characters are ASCII or 2-byte characters. ASCII characters in the lookahead are ignored to simplify some later parts of the algorithm. Based on this information, we can then embark on a code path suitable for this chunk of input.

ASCII only
If all input words represent ASCII characters (M_234 = 0), we handle the input in an ASCII-only fast path: the vector is truncated to bytes (vpmovwb) and deposited into the output buffer, advancing it by 31 bytes. Though we could advance by 32 bytes, we want the algorithm to proceed with a constant stride through memory irrespective of the content.

Default path
If some 3- or 4-byte characters are present (M_12 ∨ L ≠ ¬0), we check for surrogates. We do this by masking the words with 0xfc00 and then checking whether the result equals 0xd800 (high surrogate, M_hi) or 0xdc00 (low surrogate, M_lo). If surrogates are found to be present (M_hi ∨ M_lo ≠ 0), we proceed to § 7.2 to handle them. Otherwise we skip that step, set W_joined to W_in zero-extended from 16-bit to 32-bit elements (vpmovzxwd), and go directly to § 7.3.

1/2 byte only
In the third and final case, we know that the input is a mix of ASCII and 2-byte characters. We process this case by shuffling the bits of the 2-byte characters into position. The most significant two bits of each byte are cleared and tag bits are applied. Throughout this process, ASCII characters are left unchanged. The words of W_out before the lookahead are then bytewise compared with 0x0800, producing a mask holding binary 01 for ASCII characters, 11 for 2-byte characters, and 00 for the lookahead. With this mask, we finally compress W_out into a UTF-8 stream and write it to the output.
The output buffer pointer is advanced by the number of bytes of output produced, which is one byte for each word of input (sans lookahead) plus another byte for each 2-byte character.

| Surrogates
When surrogates are present in the input, the bits of each low surrogate have to be merged into those of the corresponding high surrogate, yielding the code point of the character to be encoded.
First, W_in is zero-extended to 32 bits per element. A vector W_lo, holding for each high surrogate in W_in its corresponding low surrogate, is produced by rotating W_in to the right by one element.
Then the surrogates are joined by subtracting the tag bits (0xd800 for the high surrogate, 0xdc00 for the low surrogate), undoing the surrogate plane shift for the high surrogate, shifting the bits of the high surrogate into place, and adding the two together. By pulling out the constants representing the tag bits and the plane shift, these additions and subtractions can be combined into a single operation using 32-bit unsigned arithmetic. With the surrogate pairs decoded, we can then proceed to § 7.3 to encode into UTF-8. The vector elements corresponding to low surrogates are ignored for the rest of the algorithm.
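The folded constant can be verified in scalar code. Combining ((hi − 0xd800) ≪ 10) + (lo − 0xdc00) + 0x10000 into a single operation yields the bias (0xd800 ≪ 10) + 0xdc00 − 0x10000 = 0x35fdc00; the sketch below (function name ours) is not the vectorized implementation, only the per-pair arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Join a surrogate pair with one addition and one constant bias:
 * the tag bits and the plane shift are folded into 0x35fdc00. */
static uint32_t join_surrogates(uint32_t hi, uint32_t lo) {
    return (hi << 10) + lo - 0x35fdc00u;
}
```

For example, the pair D835 DCAA joins to U+1D4AA, and the extreme pairs D800 DC00 and DBFF DFFF join to U+10000 and U+10FFFF respectively.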

| Encoding into UTF-8
When we reach this step, we have transformed W_in into a vector W_joined of 32-bit integers holding the code points of the characters in the input. We would now like to encode these code points into UTF-8, producing 1-4 bytes of output per code point.
Consider Fig. 1a: for the 2-, 3-, and 4-byte cases, the bits A-W making up the code point always appear in the same positions. This suggests using the same encoding procedure for the 2-, 3-, and 4-byte cases with merely different tag bits applied at the end. ASCII characters are handled with a shift into position.
The encoding procedure is based on the vpmultishiftqb instruction introduced with the VBMI instruction-set extension. Given a vector of 64-bit words and, for each such word, a vector of eight byte-sized indices, the instruction uses these indices to pick eight 8-bit chunks of data (8 consecutive bits) from the corresponding source word.
By choosing these indices such that they do not cross a 32-bit boundary, we can effectively use the instruction to select four 8-bit chunks out of each 32-bit word.
Applying the index vector (18, 12, 6, 0) to each 32-bit word of W_joined, we obtain W_shifted with each bit shifted into the right position and some bits left over. To fix up the left-over bits, we mask with 0x3f3f3f3f, reusing the mask from the 2-byte fast path. Then the appropriate tag bits W_tag are applied:

    case     W_shifted masked with 0x3f3f3f3f          tag bits
    2-byte   00FE DCBA 000L KJHG 0000 0000 0000 0000   0x80c00000
    3-byte   00FE DCBA 00ML KJHG 0000 RQPN 0000 0000   0x8080e000
    4-byte   00FE DCBA 00ML KJHG 00ts RQPN 0000 0wvu   0x808080f0

Finally, the ASCII case is handled by shifting the ASCII words into position and merging these shifted characters into the output of the other cases. We end up with UTF-8 encoded characters in W_out. Each character occupies a 32-bit word and is padded to 4 bytes with 0x00 bytes. Input words corresponding to low surrogates have been passed through, being decoded into junk content. We get rid of the padding and the low-surrogate junk by preparing a mask of bytes we want to keep and compressing out the unwanted bytes using the vpcompressb instruction.
In this mask, we want to keep the most significant byte of each 32-bit word and all non-zero bytes, except for processed low surrogates. These seemingly complex requirements can be met in two steps by first building a comparison vector and then taking all bytes that are not lower than it. For low-surrogate bytes and the lookahead, the comparison byte is 0xff, which cannot occur in W_out. For the most significant byte of all other words, it is 0x00, which admits every byte. For the other bytes, it is 0x01, admitting only non-zero bytes. With this mask, we compress W_out, write it to the output buffer, and advance the output pointer by the number of bytes produced. Due to the little-endian orientation of the x64 architecture, the bytes of each UTF-8 sequence end up in the right order: within each 32-bit word, they are written from the least significant byte to the most significant byte.
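The encoding step for 3- and 4-byte characters can be modelled in scalar code. This sketch (names ours) combines the multishift and the 0x3f3f3f3f mask by picking 6-bit chunks at shifts (18, 12, 6, 0), applies the tag constants from the table above, and drops zero padding bytes from the little end in place of vpcompressb. It assumes valid input and only handles code points at or above U+0800.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the multishift encoding step for 3- and 4-byte
 * sequences: chunks at shifts (18, 12, 6, 0) fill the output word from
 * the least to the most significant byte; tag bits are applied per
 * class; zero bytes (padding) are squeezed out as vpcompressb would. */
static size_t encode3or4(uint32_t cp, uint8_t out[4]) {
    uint32_t tag = cp < 0x10000 ? 0x8080e000u : 0x808080f0u;
    uint32_t w = ((cp >> 18) & 0x3f)
               | (((cp >> 12) & 0x3f) << 8)
               | (((cp >>  6) & 0x3f) << 16)
               | ((cp & 0x3f) << 24);
    w |= tag;
    size_t n = 0;
    for (int i = 0; i < 4; i++) {            /* least significant byte first */
        uint8_t b = (uint8_t)(w >> 8 * i);
        if (b != 0) out[n++] = b;            /* drop the zero padding */
    }
    return n;
}
```

For example, U+20AC (€) produces E2 82 AC and U+1D4AA (𝒪) produces F0 9D 92 AA, in correct byte order.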

| Validation
In contrast to the UTF-8 to UTF-16 procedure, validation of UTF-16 input is less involved. We merely have to check for the correct sequencing of surrogates: every high surrogate must be followed by a low surrogate, and every low surrogate must be preceded by a high surrogate.
As this validation only pertains to surrogates, it is skipped in their absence, i.e. in all fast paths; input strings without surrogates are always valid.
To aid in this process, we only process 31 words of input in each iteration, permitting a lookahead into the first word of the next iteration. We also keep track of a surrogate carry c indicating whether the first word in W_in was preceded by a high surrogate. This carry allows us to decide whether a low surrogate in W_in[0] is to be ignored (c = 1) or is a sequencing error (c = 0). Correct sequencing is checked by concatenating M_hi with c and shifting it to the positions of the corresponding low surrogates M_lo. The input is valid if each high surrogate corresponds to a low surrogate. The carry for the next iteration is computed as the presence of a high surrogate in the vector element right before the lookahead. In the absence of surrogates, i.e. in the fast paths, the carry is cleared (c_out = 0).
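The sequencing rule, including the role of the carry, can be restated as a scalar checker (function name ours); the variable c plays the part of the carry between iterations, here threaded through a single loop over the whole buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the sequencing check: c records whether the previous
 * word was a high surrogate. A word may be a low surrogate if and only
 * if c is set; the input must not end on an unpaired high surrogate. */
static int utf16_surrogates_ok(const uint16_t *w, size_t n) {
    int c = 0;
    for (size_t i = 0; i < n; i++) {
        int hi = (w[i] & 0xfc00) == 0xd800;
        int lo = (w[i] & 0xfc00) == 0xdc00;
        if (c != lo) return 0;   /* low surrogate iff preceded by high */
        c = hi;
    }
    return c == 0;               /* no trailing high surrogate */
}
```

An unpaired high surrogate, a stray low surrogate, or a string ending on a high surrogate all fail; a well-formed pair such as D835 DCAA passes.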
We believe that this approach facilitates better portability and integration into existing software.
Our library is organized in different kernels that are automatically selected at runtime based on the features of the CPU, a process sometimes called runtime dispatching. During benchmarking, we can manually select the different kernels. As the names suggest, the AVX2 kernel relies on AVX2 instructions (32-byte vector length) while the AVX-512 kernel, using our new functions, relies on AVX-512 instructions with a 64-byte vector length. Our new functions are part of the AVX-512 kernel; the AVX2 kernel represents the results presented by Lemire and Muła [2].
For benchmarking, we use Ubuntu 22.04 on a non-virtual (metal) server from Amazon Web Services (c6i.metal).
These servers have 32-core Intel Xeon 8375C (Ice Lake) processors with 41 MiB of L3 cache, along with 48 kB of L1 data cache and 1.25 MiB of L2 cache per core. The base clock frequency is 2.9 GHz, with a maximal frequency of 3.5 GHz. They have 256 GiB of main memory (DDR4, 3200 MHz). The benchmarks are single-threaded, and we exclude disk and network accesses from our tests. The software is written in C++ and compiled with the Clang 14 C++ compiler from the LLVM project using the default CMake settings for a release build: -O3 -DNDEBUG.
We could use several threads. For example, we could split the input into segments and compute the expected transcoded size of each segment before transcoding each segment in its own thread. However, merely joining a thread under Linux can require tens of microseconds of waiting from the main thread. At the high speed of our functions, this penalty is equivalent to the time required to process hundreds of kilobytes of data. We could use faster synchronization techniques (e.g., spin locks and thread pools), but at the expense of complexity and power efficiency.
We expect that multicore parallelism is only warranted for large inputs, in the megabyte or gigabyte range. Future work might consider such cases.

| Setup
We benchmark the transcoding of data files between UTF-8 and UTF-16 in memory. We repeat the task 10 000 times, measuring the time of each conversion: the C++ library reports a precision of 1 ns for the std::chrono::steady_clock measurements on our test system [17]. The distribution of timings has a long tail akin to a log-normal distribution: most values are close to the minimum. We verify automatically that the difference between the minimum and the average timing is small (less than 1 %).
AVX-512-capable Intel processors prior to the Ice Lake and Rocket Lake families would systematically reduce their frequency when using 512-bit instructions, a process that Intel referred to as licensing. Such 512-bit licensing is no longer present in more recent processors [18]. However, the processor frequency may fluctuate based on power consumption and heat production, as is generally the case with Intel processors. We expect 512-bit instructions to use more power, and thus to run at a slightly lower frequency. Irrespective of power usage, Intel processors initially execute 512-bit instructions at a reduced speed (e.g., 4× slower) for a few microseconds. We assume that our functions with 512-bit instructions are part of a binary executable compiled with optimizations for 512-bit-capable processors, so that this temporary effect is uncommon, perhaps occurring only once.
We are interested in the steady-state performance of our functions: we therefore always benchmark each function twice, the first time to warm up the processor so that 512-bit instructions execute at full speed and the processor has had a chance to decode the instructions. Furthermore, we may sometimes benchmark a function relying on 512-bit instructions followed by a conventional function: to ensure that the latter is not penalized by the power usage of the former, we pause for a millisecond when switching the benchmarked function.
We report performance results in characters per second. A given string has the same number of characters irrespective of the format (UTF-8, UTF-16). We also report speeds in gigabytes per second by taking the size of the input and dividing by the time elapsed. We focus on little-endian UTF-16, but our software supports big-endian UTF-16 as well.

In Fig. 4, we present the measured transcoding speed for various small prefixes of the Arabic files. We find that as long as the input has hundreds of characters, we can reach and exceed a billion characters decoded per second.

Historically, some processors could only read and write data when the memory address was a multiple of the data size. Older Intel processors could read and write at any address, but with a severe penalty for unaligned memory accesses. On recent processors (e.g., Intel's Sandy Bridge microarchitecture launched in 2011), there is reportedly no measurable performance penalty for reading or writing misaligned memory operands [14]. However, there might be indirect penalties (e.g., accessing more cache lines). In the hope of achieving better performance, we could require that our memory buffers start at an address divisible by 512 bits. However, we expect that the performance of the transcoding functions is generally unaffected by memory alignment on our test system. Thus our benchmarking code does not align the memory in any particular manner, relying instead on the default behavior of the memory allocator.
FIGURE 4 Validating transcoding speed in billions of characters per second for prefixes of various lengths of the Arabic files using our techniques.

To test the effect of the memory alignment of the input, we transcoded the same data shifted by 0 to 512 bytes inside a buffer. Using one of our UTF-8 files (Arabic), we measured a difference of 2 % between the fastest and slowest alignment when using our fast (AVX-512) transcoder. We get a similar result if we transcode from a fixed input to offset locations inside a destination buffer. Our results suggest that memory alignment is likely not a significant factor.

| CONCLUSION
It is not a priori obvious that character transcoding is amenable to SIMD processing. Earlier work achieved high speeds, but it required kilobytes of lookup tables [2]. Our work indicates that the AVX-512 instruction-set extensions enable high speed for tasks such as character transcoding, without lookup tables and using few instructions. It suggests that some features of the AVX-512 instruction-set extensions might serve as a reference for future instruction-set extensions. In particular, we find masked SIMD instructions (move, load, store, compress) with byte-level granularity useful.
Both Intel and AMD support AVX-512 instructions. They also both offer specialized compilers tuned for their processors. Future work could compare the performance of our routines on more varied Intel and AMD processors (e.g., Intel Rocket Lake and Sapphire Rapids, AMD Zen 4), using specialized compilers (e.g., from Intel and AMD) and hand-tuned assembly. We could also extend our benchmarks to cover a wider range of strings.
    expression        description
    ¬a                bitwise complement of a
    ctz(a)            number of trailing zeroes in a
    width(a)          number of bits needed to represent a
    popcount(a)       number of bits set in a
    pext(a, b)        the bits given in a extracted from b
    pdep(a, b)        b deposited into the bits given in a
    compress(m, v)    vector v compressed by mask m
    a + b             sum of a and b
    a ≪ b             a logically shifted to the left by b places
    a ≫ b             a logically shifted to the right by b places
    a = b             mask indicating elements of a equal to those of b
    a ∧ b             bitwise and of a and b
    a ∨ b             bitwise or of a and b
    a ⊕ b             bitwise exclusive-or of a and b
    a ? b : c         ternary operator; equal to a ∧ b ∨ ¬a ∧ c

popcount
The population count popcount(a) is best understood as the sum of the bits of a. It corresponds to the popcnt instruction of the x86 instruction set.

pext
The parallel extract operation pext(a, b) takes a bit mask a indicating a possibly non-consecutive bit field and extracts those bits from b, packing them into popcount(a) bits. This corresponds to the pext instruction on recent x86 processors. The operation is perhaps best understood with a diagram:

    a            1010111011000100
    b            1000101011110001
    bit field    1-0-101-11---0--
    pext(a, b)   0000000010101110

pdep
The parallel deposit operation pdep(a, b) takes a bit mask a indicating a possibly non-consecutive bit field and deposits the bits from b into this field. It performs the opposite operation to pext and corresponds to the pdep instruction on recent x86 processors. We can likewise visualize its operation through a diagram:

    a            1010111011000100
    b            1011010010101110
    bit field    1-0-101-11---0--
    pdep(a, b)   1000101011000000

(Figure caption, beginning truncated:) …U+2208 (∈), and U+1D4AA (𝒪) from UTF-8 to UTF-16: four characters demonstrate the behavior of the algorithm on the four UTF-8 cases, representing ASCII, 2-byte, 3-byte, and 4-byte respectively. Observe especially how the code sequence F0 9D 92 AA for 𝒪 is split into two overlapping sequences F0 9D 92 and 92 AA. The first of these two is translated into the high surrogate D835, with the second one becoming the low surrogate DCAA.
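Portable bit-by-bit emulations (function names ours) reproduce the diagrams above and may help when experimenting on machines without the BMI2 pext/pdep instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Emulate pext: gather the bits of b selected by mask a into the low
 * bits of the result, preserving their order. */
static uint64_t pext64(uint64_t a, uint64_t b) {
    uint64_t r = 0, k = 0;
    for (uint64_t bit = 1; bit != 0; bit <<= 1)
        if (a & bit) {
            if (b & bit) r |= (uint64_t)1 << k;
            k++;
        }
    return r;
}

/* Emulate pdep: scatter the low bits of b into the positions selected
 * by mask a, in order from the least significant selected position. */
static uint64_t pdep64(uint64_t a, uint64_t b) {
    uint64_t r = 0, k = 0;
    for (uint64_t bit = 1; bit != 0; bit <<= 1)
        if (a & bit) {
            if (b >> k & 1) r |= bit;
            k++;
        }
    return r;
}
```

With a = 1010111011000100 and b = 1000101011110001, pext64 yields 0000000010101110, matching the first diagram; the pdep64 diagram checks out likewise.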
UTF-8 encodes Unicode characters in the range U+0000-U+007F in one byte, characters in the range U+0080-U+07FF in two bytes, characters in the range U+0800-U+FFFF in three bytes, and all other characters in four bytes. Characters encoded in one UTF-16 word thus correspond to characters encoded in 1-3 bytes of UTF-8, and characters encoded in two UTF-16 words correspond to characters encoded in 4 bytes of UTF-8. This suggests the plan of attack for transcoding UTF-16 to UTF-8 given in § 7.

TABLE 1
Types of UTF-8 bytes with tag bits underlined.

UTF-16 is widely used in databases and binary file formats and is the preferred internal text representation on Windows NT. Nevertheless, with the advent and growing popularity of characters outside the Basic Multilingual Plane, UTF-16 has been steadily declining in use. Despite big-endian byte order being prescribed for UTF-16, the little-endian variant UTF-16LE is more commonly encountered under the influence of x86's little-endian orientation. A common convention to deal with this ambiguity is to prefix UTF-16 encoded documents with the byte order mark (BOM) U+FEFF. Its byte-swapped counterpart U+FFFE is a reserved noncharacter and should not occur in Unicode text. If a UTF-16 encoded document begins with U+FFFE,

TABLE 2
Summary of notation.

Inoue et al. [8] presented a limited UTF-8 to UTF-16 transcoder which lacked validation and could not handle 4-byte UTF-8 characters. They rely on a 105 KiB lookup table. Lemire and Muła [2] presented a generic approach that does full UTF-8 to UTF-16 and UTF-16 to UTF-8 transcoding, with validation. Their UTF-8 to UTF-16 transcoding function is similar in principle to the strategy used by Inoue et al. [8] in that they rely on the presence of instructions to quickly permute bytes within a register in an arbitrary manner, based on a lookup table. The accelerated UTF-8 to UTF-16 transcoding algorithm processes up to 12 input UTF-8 bytes at a time. Given the input bytes, it finds the beginning of each character, forming a 12-bit word which is used as a key into a 1024-entry table. Each entry in the table contains the number of UTF-8 bytes to consume and an index into another table where we find shuffle masks. The tables use about 11 KiB. The shuffle masks are applied to the 12 input bytes to form a vector register that can be transformed efficiently. This 12-byte routine works within 64-byte blocks.
This representation is close to the UTF-16LE format, with only the surrogate cases diverging. To address this difference, we first identify the locations of surrogates in W_out. Sequences in w_in corresponding to low surrogates end at the fourth bytes of 4-byte sequences. By extracting the locations of these through m_end into the space of W_out, we obtain the locations of the low surrogates.

TABLE 5
Validating transcoding speeds in gigabytes of input per second over the lipsum datasets; the last row is the harmonic mean of the column. The last column (AVX-512) presents the results from our new algorithms.

While the speeds of our AVX-512 function in gigacharacters per second vary by multiples (from 4.3 with Arabic to 1.0 with Emoji), the gaps are much less significant in gigabytes per second (from 7.6 with Arabic to 4.2 with Emoji).

Table 6
presents the number of instructions retired per character, measured using the hardware performance counters provided by Intel. In the worst case (for the Emoji files), the new AVX-512 kernel still requires fewer instructions per character than the alternatives. We find that the AVX-512 kernel is associated with a lower number of instructions retired per cycle, especially so when transcoding from UTF-8. Correspondingly, we expect a lower number of 64-byte instructions to be retired per cycle compared to 32-byte instructions due to the microarchitecture of the Intel CPUs (§ 5.2). The utf8lut library, when transcoding from UTF-8, requires fewer instructions than our AVX2 kernel, but it is associated with few instructions per cycle. Hence, the utf8lut library is generally slower than our AVX2 kernel despite relying on the same instruction set. The utf8lut library relies on a 2 MiB table for UTF-8 to UTF-16 transcoding, as opposed to a small table (11 KiB) for our AVX2 kernel and no table at all for our AVX-512 kernel. A large table may cause the CPU to wait for loads to complete and increases overall cache pressure.

TABLE 6
TABLE 7 CPU instructions retired per cycle when transcoding with validation. The last column (AVX-512) presents the results from our new algorithms.

A | UTF-8 TO UTF-16: SUMMARY OF VARIABLES
Variables pertaining to the fast paths are not listed.