Technology Beats Algorithms (in Exact String Matching)

More than 120 algorithms have been developed for exact string matching during the last 40 years. We show by experiments that the naïve algorithm exploiting SIMD instructions of modern CPUs (with symbols compared in a special order) is the fastest one for patterns of length up to about 50 symbols and extremely good for longer patterns and small alphabets. The algorithm compares 16 or 32 characters in parallel by applying SSE2 or AVX2 instructions, respectively. Moreover, it uses loop peeling to further speed up the searching phase. We tried several orders of comparison of pattern symbols, and the increasing order of their probabilities in the text was the best.


Introduction
Exact string matching is one of the oldest tasks in computer science. The need for it arose when computers began processing text. At that time documents were short and there were not many of them. Now we are overwhelmed by amounts of data of various kinds. String matching is a crucial task in finding information, and its speed is extremely important.
The exact string matching task is defined as counting or reporting all the locations of a given pattern p of length m = |p| in a given text t of length n = |t|, assuming m ≪ n, where p and t are strings over a finite alphabet Σ. The first solutions were to build and run a deterministic finite automaton [1] (running in space O(m|Σ|) and time O(n)), the Knuth-Morris-Pratt automaton [11] (running in space O(m) and time O(n)), and the Boyer-Moore algorithm [3] (running in best case time Ω(n/m) and worst case time O(mn)). There are numerous variations of the Boyer-Moore algorithm, e.g. [9,18,16,10]. In total, more than 120 exact string matching algorithms [6] have been developed since 1970.
Modern processors allow computation on vectors of length 16 bytes in the case of SSE2 and 32 bytes in the case of AVX2. The instructions operate on such vectors stored in special registers XMM0-XMM15 (SSE2) and YMM0-YMM15 (AVX2). As one instruction is performed on all data in these long vectors, this is considered SIMD (Single Instruction, Multiple Data) computation.

Algorithms

Naïve Approach
In the naïve approach (shown as Algorithm 1) the pattern p is checked against each position in the text t, which leads to running time O(mn) and space O(1). However, it is not bad in practice for large alphabets, as it performs only 1.08 comparisons [10] on average per character of t for English text. The variable found in Algorithm 1 is not strictly necessary. It is present in order to have a connection to the SIMD version to be introduced.
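A minimal C sketch of the counting version of the naïve algorithm (the function name is ours, and 0-based indexing replaces the paper's 1-based one):

```c
#include <stddef.h>

/* Counting version of the naive search (a sketch of Algorithm 1):
 * check the pattern p (length m) against every position of the
 * text t (length n) and count the matches. O(mn) worst case. */
size_t naive_count(const char *t, size_t n, const char *p, size_t m)
{
    size_t count = 0;
    for (size_t i = 0; i + m <= n; i++) {
        size_t j = 0;
        while (j < m && t[i + j] == p[j])   /* compare left to right */
            j++;
        if (j == m)                          /* full match at position i */
            count++;
    }
    return count;
}
```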
Like the testing environment of Hume & Sunday [10] and the SMART library [8], we consider the counting version of exact string matching. It can be easily transformed into the reporting version by printing position i in line 11.
[Algorithm 1: Naïve-search]

Using SIMD instructions (shown in Algorithm 2) we can compare α bytes in parallel, where α = 16 in the case of SSE2 or α = 32 in the case of AVX2, and 'AND' represents the bit-parallel 'and'. This allows a huge speedup.
For a given position i in the text t, the idea is to compare the pattern p with the α substrings t[i..i + m − 1], t[i + 1..i + m], . . ., t[i + α − 1..i + α + m − 2]. To this end, we use a primitive SIMDcompare(t, i, p, j, α) which, given a position i in t and j in p, compares the strings S1 = t[i + j − 1..i + j + α − 2] and S2 = p[j]^α and returns an α-bit integer such that the k-th bit is set iff S1[k] = S2[k]. In other words, the output integer encodes the result of the j-th symbol comparison for all the α substrings. For example, consider the α leftmost substrings of length m of t, corresponding to i = 1.
Fig. 1: Example of comparisons for the text t = aabcd · · · and pattern p = abcd using the SIMD-Naïve-search algorithm (alignment of the pattern vector and the vector found to the text t).
[Algorithm 2: SIMD-Naïve-search]

For j = 1, the function compares t[1..α] with p[1]^α, i.e., the first symbol of the substrings against p[1]. For j = 2, the function compares t[2..α + 1] with p[2]^α, i.e., the second symbol against p[2]. Let found be the bitwise and of the integers SIMDcompare(t, i, p, j, α), for j = 1, . . ., m. Clearly, t[i + k..i + k + m − 1] = p iff the k-th bit of found is set. We compute found iteratively, until we either compare the last symbol of p or no substring has a partial match (i.e., the vector found becomes zero). Then the text position is advanced by α and the process is repeated starting at position i + α. For a given i, the number of occurrences of p is equal to the number of bits set in found and is computed using a popcount instruction. Reporting all matches in line 11 would add an O(s) time overhead, as O(s) instructions are needed to extract the positions of the bits set in found, where s is the number of occurrences found.
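The found-vector iteration described above can be sketched in plain C as a scalar simulation for α = 16 (no intrinsics; the function names are ours, 0-based indexing replaces the paper's 1-based one, and we show only the check of one block of α alignments):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar simulation of the SIMDcompare primitive for alpha = 16:
 * bit k of the result is set iff the alignment starting at t[i+k]
 * still matches at pattern position j, i.e. t[i+k+j] == p[j]. */
static uint32_t compare16(const char *t, size_t i, const char *p, size_t j)
{
    uint32_t mask = 0;
    for (int k = 0; k < 16; k++)
        if (t[i + k + j] == p[j])
            mask |= 1u << k;
    return mask;
}

/* Count occurrences of p (length m) among the 16 alignments starting
 * at text position i, following the inner loop of Algorithm 2: AND the
 * per-position masks into found until it is zero or p is exhausted.
 * Assumes t has at least i + 15 + m readable bytes. */
static int count_block16(const char *t, size_t i, const char *p, size_t m)
{
    uint32_t found = 0xFFFFu;                 /* all 16 alignments alive */
    for (size_t j = 0; j < m && found; j++)
        found &= compare16(t, i, p, j);       /* drop mismatching ones */
    return __builtin_popcount(found);         /* matches in this block */
}
```

In the full algorithm this block check is repeated with i advanced by α = 16 each round, accumulating the popcounts.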
The 16-byte version of function SIMDcompare is implemented with SSE2 intrinsic functions as follows:

SIMDcompare(x, y, 16)
    x_ptr = _mm_loadu_si128(x)
    y_ptr = _mm_loadu_si128(s(y,16))
    return _mm_movemask_epi8(_mm_cmpeq_epi8(x_ptr, y_ptr))

Here s(y,16) is the starting address of 16 copies of y. The instruction _mm_loadu_si128(x) loads 16 bytes (= 128 bits) starting from x to a SIMD register. The instruction _mm_cmpeq_epi8 compares two registers bytewise and the instruction _mm_movemask_epi8 extracts the comparison result as a 16-bit integer. For the 32-byte version, the corresponding AVX2 intrinsic functions are used. For both versions the SSE4 instruction _mm_popcnt_u32 is utilized for popcount.
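As a compilable sketch of the primitive above (0-based indexing and the function name are ours; _mm_set1_epi8 stands in for the explicit buffer of 16 copies denoted s(y,16)):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* 16-byte SIMDcompare: compare t[i+j .. i+j+15] against 16 copies of
 * p[j] and return the 16-bit comparison mask; bit k reflects byte k.
 * Requires an x86 CPU with SSE2; t must have 16 readable bytes at i+j. */
static uint32_t simd_compare16(const char *t, size_t i, const char *p, size_t j)
{
    __m128i text  = _mm_loadu_si128((const __m128i *)(t + i + j));
    __m128i chars = _mm_set1_epi8(p[j]);   /* broadcast p[j] 16 times */
    return (uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(text, chars));
}
```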

Frequency Involved
In order to identify nonmatching positions in the text as fast as possible, individual characters of the pattern are compared to the corresponding positions in the text in the order given by their frequency in standard text.First, the least frequent symbol is compared, then the second least frequent symbol, etc.
Therefore the text type should be considered, and the frequencies of symbols for that text type should be computed in advance from a relevant corpus of texts of the same type. Hume and Sunday [10] use this strategy in the context of the Boyer-Moore algorithm.
[Algorithm 3: Freq-SIMD-Naïve-search]

Algorithm 3 shows the naïve approach enriched by frequency consideration. A function π gives the order in which the symbols of the pattern should be compared (i.e., p[π(1)], p[π(2)], . . ., p[π(m)]) to the corresponding symbols in the text. An array for the function π is computed in O(m log m) time using a standard sorting algorithm on the frequencies of symbols in p.
Hume and Sunday [10] call this strategy optimal match, although it is not necessarily optimal. For example, the pattern 'qui' is tested in the order 'q'-'u'-'i', but the order 'q'-'i'-'u' is clearly better in practice because 'q' and 'u' often appear together. Külekci [12] compares optimal match with more advanced strategies based on frequencies of discontinuous q-grams with conditional probabilities. His experiments show that frequency is beneficial in the case of texts over large alphabets, like texts of natural language. Computing all possible frequencies of q-grams is rather complicated and the possible speed-up over optimal match is likely marginal. Thus we consider only simple frequencies of individual symbols.
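A minimal sketch of computing the order π by sorting pattern positions on symbol frequencies (the frequency table, helper names, and use of qsort are our assumptions; the paper only specifies an O(m log m) sort):

```c
#include <stddef.h>
#include <stdlib.h>

/* Shared state for the qsort comparator (C's qsort has no context arg). */
static const double *g_freq;   /* per-byte frequency table, from a corpus */
static const char   *g_pat;    /* the pattern being ordered */

static int cmp_by_freq(const void *a, const void *b)
{
    double fa = g_freq[(unsigned char)g_pat[*(const size_t *)a]];
    double fb = g_freq[(unsigned char)g_pat[*(const size_t *)b]];
    return (fa > fb) - (fa < fb);          /* ascending frequency */
}

/* Fill pi[0..m-1] with pattern positions in increasing symbol
 * frequency, so pi[0] is the position to be compared first. */
void build_order(const char *p, size_t m, const double freq[256], size_t *pi)
{
    for (size_t i = 0; i < m; i++)
        pi[i] = i;
    g_freq = freq;
    g_pat  = p;
    qsort(pi, m, sizeof *pi, cmp_by_freq);
}
```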

Loop Peeling
The guard test [10,15] is a widely used technique to speed up string matching. The idea is to test a certain pattern position before entering a checking loop. Instead of a single guard test, two or even three tests have been used [14,17]. The guard test is a representative of a general optimization technique called loop peeling, where a number of iterations is moved in front of the loop. As a result, the loop becomes faster because of fewer loop tests. Moreover, loop peeling makes it possible to precompute certain values used in the moved iterations. For example, p[π(1)] is explicitly known. In some cases, loop peeling may even double the speed of a string matching algorithm applying SIMD computation, as observed by Chhabra et al. [4].
In the following, we call the number of moved iterations the peeling factor r. We assume that the first loop test is done after r iterations. Thus our approach differs from the multiple guard test, where checking stops after the first mismatch; in our approach all r iterations are performed.
Loop peeling for r = 2 is shown in Algorithm 4. The first two comparisons of characters are performed regardless of the result of the first comparison (in line 4).
If we consider string matching in English texts with a pattern picked randomly from the text, it is less probable that all α comparisons fail at the same time in the first iteration than that some alignment survives, so a test after one iteration rarely pays off. Therefore it is advantageous to use the value r = 2 for English.
In theory, r = 3 would be good for DNA: every iteration nullifies roughly 3/4 of the remaining set bits of the bit-vector found. However, we achieved the best running time in practice with r = 5.
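Loop peeling with r = 2 might look as follows on a scalar simulation of the block check (a sketch; the function names, 0-based indexing, and the pi array are ours, and the surrounding advance-by-α loop is omitted):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar stand-in for SIMDcompare with alpha = 16: bit k is set iff
 * the alignment starting at t[i+k] matches p at pattern position j. */
static uint32_t cmp16(const char *t, size_t i, const char *p, size_t j)
{
    uint32_t mask = 0;
    for (int k = 0; k < 16; k++)
        if (t[i + k + j] == p[j])
            mask |= 1u << k;
    return mask;
}

/* Block check with peeling factor r = 2: the first two pattern
 * positions (in the comparison order pi) are checked unconditionally,
 * with no zero test in between; the loop test starts only afterwards.
 * Assumes m >= 2 and t has at least i + 15 + m readable bytes. */
int peeled_block16(const char *t, size_t i, const char *p, size_t m,
                   const size_t *pi)
{
    uint32_t found = cmp16(t, i, p, pi[0]);   /* peeled iteration 1 */
    found &= cmp16(t, i, p, pi[1]);           /* peeled iteration 2 */
    for (size_t j = 2; j < m && found; j++)   /* remaining positions */
        found &= cmp16(t, i, p, pi[j]);
    return __builtin_popcount(found);
}
```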

Alternative Checking Orders
If the computation of character frequencies is considered inappropriate, there are other possibilities to speed up checking. In natural languages adjacent characters have a positive correlation. To break correlations one can use a fixed order π_h of pattern positions. In letter-based languages, the space character is the most frequent character. We can transform π_h into a slightly better scheme π_hs by first moving all the spaces to the end and then processing the remaining positions as for π_h.

Experiments
We selected four files of different types and alphabet sizes to run experiments on: bible.txt (Fig. 2, Table 1) and E.coli.txt (Fig. 4, Table 3) taken from the Canterbury Corpus [2], Dostoevsky-TheDouble.txt (Fig. 3, Table 2), the novel The Double by Dostoevsky in Czech taken from Project Gutenberg, and protein-hs.txt (Fig. 5, Table 4) taken from the Protein Corpus [13]. The file Dostoevsky-TheDouble.txt is a concatenation of five copies of the original file, to get a file length similar to the other files.
We compared the methods Naive16 and Naive32, which process 16 and 32 bytes with one SIMD instruction, respectively. Naive16-freq and Naive32-freq are their variants where the comparison order is given by nondecreasing probability of pattern symbols (Section 2.2). Naive16-fixed and Naive32-fixed are the variants where the comparison order is fixed (Section 2.4). Our methods were compared with the fastest exact string matching algorithms known so far [7]: SBNDM2, SBNDM4 [5] and EPSM [7], taken from the SMART library [8].
The experiments were run on GNU/Linux 3.18.12, on an x86_64 Intel Core i7-4770 CPU at 3.40 GHz with 16 GB RAM. The computer had no other workload, and user time was measured using the POSIX function getrusage(). The average of 100 running times is reported. The accuracy of the results is about ±2%.
The experiments show, for both SSE2 and AVX2 instructions, that for natural text (bible.txt) the scheme π_h with a fixed order of comparisons improves the speed of SIMD-Naïve-search, and the speed is further improved by considering frequencies of symbols in the text. In the case of natural text with a larger alphabet (Dostoevsky-TheDouble.txt) the scheme π_h improves the speed only for AVX2 instructions. The comparison order based on real frequencies of symbols is the best for both SSE2 and AVX2 instructions. In the case of small alphabets (E.coli.txt, protein-hs.txt) the order of comparison of symbols does not play any role (except for protein-hs.txt and SSE2 instructions).
For files with a large alphabet (bible.txt, Dostoevsky-TheDouble.txt) the peeling factor r = 3 gave the best results for all our algorithms except Naive16-freq and Naive32-freq, where r = 2 was the best. The smaller the alphabet is, the less selective the bigrams or trigrams are. For the file protein-hs.txt, r = 3 was still good, but for DNA sequences over four symbols, r = 5 turned out to be the best. We also tested Naïve-search. In every run it was naturally considerably slower than SIMD-Naïve-search. Frequency order and loop peeling can also be applied to Naïve-search. However, in our experiments the speed-up was smaller than in the case of SIMD-Naïve-search.

Concluding remarks
Despite the many algorithms developed for exact string matching, their running times are in general outperformed by the AVX2 technology. The implementation of the naïve search algorithm (Freq-SIMD-Naïve-search) which uses AVX2 instructions, applies loop peeling, and compares symbols in the order of increasing frequency is the best choice in general. However, the previous algorithms EPSM and SBNDM4 have an advantage for small alphabets and long patterns. Short patterns of 20 characters or less are the objects of most searches in practice, and our algorithm is especially good for such patterns. For texts with expected equiprobable symbols (as in DNA or protein strings), our algorithm naturally works well without the frequency order of symbol comparisons. Our algorithm is considerably simpler than its SIMD-based competitor EPSM, which is a combination of six algorithms.