Parsing Millions of URLs per Second

URLs are fundamental elements of web applications. By applying vector algorithms, we built a fast standard-compliant C++ implementation. Our parser uses three times fewer instructions than competing parsers following the WHATWG standard (e.g., Servo's rust-url) and up to eight times fewer instructions than the popular curl parser. The Node.js environment adopted our C++ library. In our tests on realistic data, a recent Node.js version (20.0) with our parser is four to five times faster than the last version with the legacy URL parser.

needed. For example, the input string https://你好你好.在/./a/../b/./c should be normalized to the string https://xn--6qqa088eba.xn--3ds/b/c where https: represents the protocol, xn--6qqa088eba.xn--3ds is the host, and /b/c is the path. We may also need to parse a URL string relative to another string. For example, given the base string http://example.org/foo/bar, the relative string http:/example.com/ leads to the final URL http://example.org/example.com/. We should also be able to modify the various components of a URL (protocol, host, username, etc.). To illustrate the complexity, our C++ software library implementing the WHATWG URL standard (and little else) has approximately 20 000 lines of code.
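The dot-segment removal illustrated above (resolving /./ and /../) can be sketched in a few lines of C++. This is a simplified illustration of the path normalization rule, not the library's actual implementation; the function name and structure are ours.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Simplified path normalization: resolves "." and ".." segments as the
// WHATWG standard requires for non-opaque paths. Illustration only.
std::string normalize_path(std::string_view path) {
    std::vector<std::string_view> segments;
    size_t start = 1; // skip the leading '/'
    while (start <= path.size()) {
        size_t end = path.find('/', start);
        if (end == std::string_view::npos) end = path.size();
        std::string_view segment = path.substr(start, end - start);
        if (segment == "..") {
            // ".." removes the previous segment, if any.
            if (!segments.empty()) segments.pop_back();
        } else if (segment != ".") {
            segments.push_back(segment); // "." segments are simply dropped
        }
        start = end + 1;
    }
    std::string result;
    for (auto s : segments) { result += '/'; result += std::string(s); }
    return result.empty() ? "/" : result;
}
```

Applied to the path of the example above, normalize_path("/./a/../b/./c") yields "/b/c".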
The WHATWG URL standard follows the robustness principle (Postel's law): be conservative in what you send, be liberal in what you accept. Parsing URLs using the WHATWG URL standard can be more challenging than using the earlier standard (RFC 3986). For example, consider the string https://\tlemire.me/en/ where \t is the tabulation character. The WHATWG URL standard requires us to ignore the tabulation characters. A conventional URL parser following RFC 3986 (e.g., curl1) would reject such a string.
There are many components that impact the performance of a web application, but URL parsing is practically always required. URL parsing is relatively expensive: parsing a single URL may take 4 µs on average in a system like Node.js. In our tests, the popular curl library can parse about half a million URLs per second, yet a fast C++ number parser (converting ASCII number strings to binary floating-point numbers) can process more than 50 million numbers per second [5]. Thus we can parse almost 100 floating-point numbers in the time it takes curl to parse a single URL.
We think that popular systems such as Node.js should be able to parse several million URLs per second on modern systems without sacrificing correctness or safety. We present our work on the efficient implementation of the current WHATWG specification. Our implementation is freely available.2 We provide benchmarks and comparisons with other fast and popular URL parsers in C, C++, and Rust, whether they follow RFC 3986 [1] (curl and Boost.URL) or the WHATWG URL standard (Servo rust-url). We review various strategies that are efficient when parsing strings.
Our work has been integrated into the popular Node.js JavaScript runtime environment over several versions, concluding with a final integration in Node.js version 20. We are therefore able to run JavaScript benchmarks before the inclusion of our fast parser (e.g., Node.js version 18) and after its complete integration (e.g., Node.js version 20).
Though many factors contribute to improved performance, we estimate that the large performance gains in URL parsing are mostly the result of our work.

| RELATED WORK
Much of the academic research regarding URLs relates to security issues. For example, Ajmani et al. [3] as well as Reynolds et al. [4] test a wide range of popular URL parsers: they find many differences and discuss the security implications of these differences. In our work, we sought to provide complete and rigorous support for the WHATWG URL specification.
To our knowledge, there is no related work on the production of high-performance URL parsers. However, there is related work regarding the high-performance parsing of web formats. Park et al. [6] show that we can improve the performance of web applications by parsing JavaScript concurrently. XML parsing has received much attention: e.g., Van Engelen proposes fast XML parsing with deterministic finite state automata [7], and Kostoulas et al. achieve higher performance.

1 curl stands for "command line tool and library for transferring data with URLs"; it is sometimes capitalized as cURL, though the official documentation and website use the lowercase name: curl.
2 https://www.github.com/ada-url/ada

A URL string normally begins with a protocol (also called a scheme). The WHATWG URL standard designates six special protocols: ftp, file, http, https, ws, and wss. The protocol string is terminated by the colon character (':'). The protocol might be followed by a host. In such cases, the protocol-terminating colon character is followed by two slash characters '//'. A host might be preceded by credentials. Credentials in a URL define the username with an optional password, split with the ':' character, e.g., postgresql://username:password@localhost:5432. To have credentials, a URL string must not have the protocol file and it must have a non-empty host. A host begins with a hostname string, optionally followed by the colon character (':') and a port number string. Thus, given the URL string data://example.com:8080/pathname?search, the host is example.com:8080 whereas the hostname is example.com. URL strings with a special protocol must contain a host, whereas the host is optional for other types of URL strings. Host names may be domain names, IPv4 addresses, or IPv6 addresses.
• For non-ASCII domain names, we must follow RFC 3490 [13], which involves converting Unicode to punycode [14] and checking that various rules are satisfied.
• The IPv4 address is a 32-bit unsigned integer that identifies a network address. The WHATWG URL specification considers both 192.168.1.1 and 192.0x00A80101 as valid and equivalent IPv4 addresses. The normalized URL string is made of four decimal integers (192.168.1.1).
• The IPv6 address is a 128-bit unsigned integer that identifies a network address. It is represented as a list of eight 16-bit unsigned integers, also known as IPv6 pieces. We surround IPv6 addresses with square brackets: e.g., http://[c141:ffff:0:ffff:ffff:ffff:ffff:ffff].
• Port numbers are represented by 16-bit unsigned integers, with a maximum value of 65535. Special protocols have default ports: e.g., the http protocol has default port 80. Default ports are omitted in the normalized string. The file protocol cannot have a port. It is also disallowed to have a port without a hostname.
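The IPv4 flexibility described above (decimal, octal, and hexadecimal parts, with the last part supplying the remaining low-order bytes) can be sketched as follows. This is a hedged reconstruction for illustration: the real parser performs additional validation (e.g., trailing dots), and the helper names are ours.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Simplified WHATWG-style IPv4 parsing: each dot-separated part may be
// decimal, octal (leading "0"), or hexadecimal (leading "0x"); the last
// part fills the remaining low-order bytes. Illustration only.
std::optional<uint32_t> parse_ipv4(std::string_view input) {
    std::vector<uint64_t> parts;
    size_t start = 0;
    while (start <= input.size()) {
        size_t end = input.find('.', start);
        if (end == std::string_view::npos) end = input.size();
        std::string_view piece = input.substr(start, end - start);
        if (piece.empty()) return std::nullopt;
        int base = 10;
        if (piece.size() > 1 && piece[0] == '0') {
            if (piece[1] == 'x' || piece[1] == 'X') { base = 16; piece.remove_prefix(2); }
            else { base = 8; piece.remove_prefix(1); }
        }
        uint64_t value = 0;
        for (char c : piece) {
            int digit;
            if (c >= '0' && c <= '9') digit = c - '0';
            else if (base == 16 && c >= 'a' && c <= 'f') digit = c - 'a' + 10;
            else if (base == 16 && c >= 'A' && c <= 'F') digit = c - 'A' + 10;
            else return std::nullopt;
            if (digit >= base) return std::nullopt;
            value = value * base + digit;
            if (value > UINT32_MAX) return std::nullopt;
        }
        parts.push_back(value);
        start = end + 1;
    }
    if (parts.empty() || parts.size() > 4) return std::nullopt;
    uint64_t last = parts.back(); // covers the remaining low-order bytes
    if (last >= (1ULL << (8 * (5 - parts.size())))) return std::nullopt;
    uint32_t address = static_cast<uint32_t>(last);
    for (size_t i = 0; i + 1 < parts.size(); i++) {
        if (parts[i] > 255) return std::nullopt;
        address |= static_cast<uint32_t>(parts[i]) << (8 * (3 - i));
    }
    return address;
}

// Serialize to the normalized dotted-decimal form.
std::string serialize_ipv4(uint32_t address) {
    std::string out;
    for (int shift = 24; shift >= 0; shift -= 8) {
        out += std::to_string((address >> shift) & 0xFF);
        if (shift > 0) out += '.';
    }
    return out;
}
```

Both 192.168.1.1 and 192.0x00A80101 parse to the same 32-bit integer and serialize to 192.168.1.1.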
It is possible for a URL to have no host (and thus no credentials), in which case the protocol string is not followed by two slashes '//': e.g., non-spec:/.//p. The standard distinguishes between an empty host (e.g., protocol:///mypath) and a missing host (e.g., protocol:/mypath).
A URL string may contain a pathname after the protocol and (optional) host. If there is no host then the pathname is opaque: e.g., the URL mailto:john@doe.com has the opaque pathname john@doe.com. Otherwise a URL pathname starts with '/'. If the host is empty, there might be a sequence of three slash characters: e.g., file:///file.txt.
The pathname is always optional. If the pathname contains non-ASCII characters, they are percent-encoded: treating the characters as UTF-8 bytes, we replace each non-ASCII byte with a '%' character followed by a two-character hexadecimal code. For example, the character é is replaced by the sequence %C3%A9.
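The byte-wise percent-encoding just described can be sketched as follows. Note that this only handles the non-ASCII case discussed above; the standard also percent-encodes certain ASCII characters depending on the component.

```cpp
#include <string>
#include <string_view>

// Percent-encode every non-ASCII byte of a UTF-8 string: each such byte
// becomes '%' followed by its two-digit hexadecimal code. Sketch only.
std::string percent_encode_non_ascii(std::string_view input) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : input) {
        if (c < 0x80) {
            out += static_cast<char>(c); // ASCII bytes pass through
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0xF];
        }
    }
    return out;
}
```

The character é (UTF-8 bytes 0xC3 0xA9) becomes %C3%A9, as in the example above.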
We may then have a search component (also called a query). A URL search component is represented by either null or an ASCII string, and starts with the character '?'. It is usual for the search component to contain a sequence of key-value pairs separated by the ampersand character '&' and linked by the equal sign '=': ?a=b&c=d. The search component is percent-encoded as needed. Similarly, we may have a hash component (also called a fragment). The URL hash is the URL part that starts with the '#' character. It may also be percent-encoded.
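Splitting a search component into its key-value pairs can be sketched as follows. This is a minimal illustration: full URLSearchParams handling also percent-decodes the pairs and, for form-encoded data, treats '+' as a space.

```cpp
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Split a search component such as "?a=b&c=d" into key-value pairs.
std::vector<std::pair<std::string, std::string>>
parse_search(std::string_view search) {
    std::vector<std::pair<std::string, std::string>> pairs;
    if (!search.empty() && search.front() == '?') search.remove_prefix(1);
    size_t start = 0;
    while (start < search.size()) {
        size_t end = search.find('&', start);
        if (end == std::string_view::npos) end = search.size();
        std::string_view pair = search.substr(start, end - start);
        size_t eq = pair.find('=');
        if (eq == std::string_view::npos)
            pairs.emplace_back(std::string(pair), ""); // key without value
        else
            pairs.emplace_back(std::string(pair.substr(0, eq)),
                               std::string(pair.substr(eq + 1)));
        start = end + 1;
    }
    return pairs;
}
```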

| FAST PARSING
The WHATWG URL standard is specified as an algorithm following a state machine; see Fig. 1. URL parsing begins in the Scheme Start state. The algorithm consumes one character at a time, and changes state according to state-specific rules. In certain scenarios, the URL state machine reverses the iteration and goes back, re-iterating over the same character more than once.
We wrote our parser in C++ by initially following the finite-state design. However, the byte-by-byte processing implied by the standard is a poor choice for performance. Thus we adapted the design so that once we enter a state, we fully consume the relevant component of the URL string, as much as possible.
The standard also suggests that each component is parsed into a separate string instance. Though we optionally support this design, our default is to parse into a single string which constitutes the normalized string at the end of the parsing. We call the result an url_aggregator because the components are aggregated during parsing into a single buffer. Having a single buffer has several performance benefits:
• At the beginning of the parsing, we allocate a buffer that has the size of the input string, rounded up to the next power of two. Usually there is no need for further memory allocation or copying. By allocating less memory, we reduce the probability of incurring expensive cache misses.
• When querying for string components, or for the normalized string, there is no need to generate and allocate a new string instance. We may simply return an immutable view on the underlying buffer. This is made convenient by the introduction of the string_view class in C++17, but it is also convenient in other programming languages: Rust has string slices (str), Java has CharSequence, C# has ReadOnlySpan<char>, and so forth.
We expect that most components consumed from input URL strings do not need to be modified, and they may be copied as is. We optimized our code for this scenario by integrating tests leading to fast paths. For example, we must remove tabulation and newline characters from input strings, since they are ignored during the processing. However, most input strings do not contain tabulation and newline characters. Thus we use a fast scanning function to verify that there are no such characters.
Common processors (Intel, AMD, ARM, POWER) support single-instruction-multiple-data (SIMD) instructions. SIMD instructions operate on several words at once, unlike regular instructions. Though different processors support different SIMD instruction sets, there is some common ground. The 64-bit processors from Intel and AMD (x64) are required to support SSE2 instructions, while 64-bit ARM processors (Apple, Qualcomm, etc.) support NEON instructions. We can use these instructions through intrinsic functions in C and C++: these special functions often provide functionality similar to a given instruction (e.g., a NEON addition), without using assembly.
Our SSE2 scanning routine uses a running variable that initially contains 16 zero bytes. During each iteration, we load 16 bytes of data from the input, and we compare the newly loaded bytes with each of three registers filled with the characters \r, \n, and \t respectively. We combine the results with a bitwise-OR operation. If one of the three characters (\r, \n, or \t) appeared in the input, then at least one element of the running register will be non-zero. We have a final iteration for the case where the input does not contain a multiple of 16 bytes: in this case, we copy the last section of the input to a 16-byte array on the stack and load 16 bytes from this array. At the end, and only at the end, we check whether one of the elements of the running variable is non-zero with the pmovmskb instruction and a branch. Thus our code always consumes the entire input: we proceed in this manner because we expect inputs to almost never contain the characters \r, \n, and \t. We prefer to save instructions and reduce the number of branches in the common case when the three characters are absent, at the expense of more expensive processing when one of the three characters is present. In this sense, our approach is optimistic: we assume that, most times, our input is as expected, and that special cases (e.g., the presence of \r, \n, or \t within the URL string) are rare. We also have an equivalent function in NEON, as well as a fallback function for other processors. Both SSE2 and NEON instructions are a standard component of the x64 and aarch64 (64-bit ARM) instruction sets. The compiler routinely compiles C++ code to these instructions (SSE2 and NEON) and they are part of the standard libraries. We detect the target family of processors at compile time. Effectively, the routine compares each input character with the carriage-return, newline, and tabulation characters. When such a character is found, we use a slow path where a temporary buffer is allocated. We write a version of the input string to the temporary buffer while omitting these characters. We find in practice that this slow path is rarely needed.
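The scanning routine just described can be sketched with SSE2 intrinsics as follows. This is our illustrative reconstruction, not the library's exact code; it accumulates comparison results in a running register and branches only once, at the end, and it includes a scalar fallback for processors without SSE2.

```cpp
#include <cstring>
#include <string_view>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Optimistic check for '\r', '\n' or '\t' in the input: always consume
// the whole string, OR the comparison results into a running register,
// and test it with a single branch at the very end.
bool has_tab_or_newline(std::string_view input) {
#if defined(__SSE2__)
    const __m128i cr = _mm_set1_epi8('\r');
    const __m128i lf = _mm_set1_epi8('\n');
    const __m128i tab = _mm_set1_epi8('\t');
    __m128i running = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 16 <= input.size(); i += 16) {
        __m128i word = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(input.data() + i));
        running = _mm_or_si128(running,
            _mm_or_si128(_mm_cmpeq_epi8(word, cr),
                _mm_or_si128(_mm_cmpeq_epi8(word, lf),
                             _mm_cmpeq_epi8(word, tab))));
    }
    if (i < input.size()) {
        // Tail: copy the last bytes into a zeroed 16-byte stack buffer.
        char buffer[16]{};
        std::memcpy(buffer, input.data() + i, input.size() - i);
        __m128i word =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(buffer));
        running = _mm_or_si128(running,
            _mm_or_si128(_mm_cmpeq_epi8(word, cr),
                _mm_or_si128(_mm_cmpeq_epi8(word, lf),
                             _mm_cmpeq_epi8(word, tab))));
    }
    return _mm_movemask_epi8(running) != 0; // single branch, at the end
#else
    // Scalar fallback for processors without SSE2.
    for (char c : input) {
        if (c == '\r' || c == '\n' || c == '\t') return true;
    }
    return false;
#endif
}
```

When this function returns false (the common case), the input can be used as-is; otherwise a slow path copies the input while omitting the offending characters.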
Most URL strings begin with a protocol string (e.g., file or https). We must recognize a limited set of special protocols specified by the WHATWG URL standard. We identify the first occurrence of the colon character ':' and seek to quickly recognize the protocol. We expect most protocol strings to be special in practice: it is uncommon for the protocol not to be one of http, https, ws, wss, ftp or file. We designed a perfect hash function [15] (see Fig. 3).
The function first checks whether the string is empty, a special case. If it is not empty, we use as a hash function twice the length of the string plus the integer value of the first byte of the string. We select only the three least significant bits of the result, thus generating a value between 0 and 7 inclusively. For valid special protocols, the hash function returns a value between 0 and 6 inclusively. It is a perfect hash function: special protocols are mapped to distinct integer values. We can verify that the string http is mapped to 0, the string https to 2, and so forth. We look up the result in a table (http, , https, ws, ftp, wss, file): the function compares the input with the content of the table, so no false positive is possible. Based solely on the length of the protocol string and the first character, we can distinguish any one of the special protocols. At several steps during the processing, the standard requires us to check the protocol. If we merely store a string value representing the protocol, then we may need to do a string-to-string comparison each time. Instead, for example, we can verify whether we have the file protocol by comparing the protocol type with the integer value 6: an integer-to-integer comparison may compile to a single instruction, unlike a string comparison.

std::string_view is_special_list[] = {"http", " ", "https", "ws", "ftp", "wss", "file", " "};
F I G U R E 3 Analysis of the protocol string
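A minimal C++ sketch of this perfect hash, reconstructed from the description above (the library's actual code differs in details):

```cpp
#include <cstdint>
#include <string_view>

// Table from Fig. 3: slot indices are the hash values, with slots 1 and 7
// unused (" " placeholders).
static const std::string_view is_special_list[] = {
    "http", " ", "https", "ws", "ftp", "wss", "file", " "};

// Returns the protocol type (0..6) for a special protocol, or -1.
// Hash: twice the length plus the first byte, keeping the low 3 bits.
int special_protocol_type(std::string_view protocol) {
    if (protocol.empty()) return -1; // special case: empty string
    size_t hash = (2 * protocol.size() +
                   static_cast<unsigned char>(protocol[0])) & 7;
    // Compare with the table entry: no false positive is possible.
    return protocol == is_special_list[hash] ? static_cast<int>(hash) : -1;
}
```

For instance, http hashes to (2×4 + 104) & 7 = 0 and file to (2×4 + 102) & 7 = 6, so checking for the file protocol is a comparison with the integer 6.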
Most URL strings have a host string that must be processed. In the majority of cases, the host string requires no further processing: it is a lower-case ASCII string. We use the function of Fig. 4 to identify problematic characteristics.
Effectively, it is a stream of table lookups combined with bitwise OR operations. Each character is viewed as a byte value (between 0 and 255) and the 256-byte table contains the value 0, 1, or 2 depending on whether the character is a forbidden character (value 1), an upper-case letter (value 2) or a valid character (value 0). The result of the function is zero if the input is a lower-case ASCII string; it is 2 if the input is otherwise valid but contains upper-case letters. If the result of the function is 1 or 3, then the input contains invalid characters. Though we could use SIMD instructions for this purpose, hostnames are relatively short. When the host string contains non-ASCII characters, we fall back on a relatively extensive normalization process which may include punycode encoding [13,14]. About half of our C++ source code (or 10 000 lines) is dedicated to this normalization: thankfully it is rarely needed in practice. We also include a fast routine to detect IPv6 addresses when the host string begins with the bracket '['. We also check for IPv4 addresses by scanning for digits and the dot character '.'. As soon as an IPv6 or IPv4 address is found, we normalize it using a specialized routine.
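The host classification can be sketched as follows, in the spirit of Fig. 4. The exact forbidden-character set is abridged here (an assumption on our part; the standard forbids additional code points), and we mark non-ASCII bytes for the slow path as well.

```cpp
#include <array>
#include <cstdint>
#include <string_view>

// Build the 256-byte classification table: 1 marks a forbidden host
// character (abridged set, an assumption), 2 an upper-case ASCII letter,
// 0 anything else. Non-ASCII bytes are routed to the slow path.
static std::array<uint8_t, 256> make_host_table() {
    std::array<uint8_t, 256> table{};
    const std::string_view forbidden("\0\t\n\r #/:<>?@[\\]^|", 17);
    for (unsigned char c : forbidden) table[c] = 1;
    for (int c = 'A'; c <= 'Z'; c++) table[c] = 2;
    for (int c = 128; c < 256; c++) table[c] = 1; // non-ASCII: slow path
    return table;
}

// Stream of table lookups combined with bitwise OR: returns 0 for a
// lower-case ASCII host, 2 if upper-case letters must be lowered, and
// 1 or 3 when the slow path is required.
uint8_t classify_host(std::string_view host) {
    static const std::array<uint8_t, 256> table = make_host_table();
    uint8_t accumulator = 0;
    for (unsigned char c : host) accumulator |= table[c];
    return accumulator;
}
```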
We must then process the rest of the URL string, including the path, the search, and the hash substrings. These components may sometimes require percent-encoding. To avoid unnecessary percent-encoding, we search through each substring for the first character that might require percent-encoding; when none is found, we can skip percent-encoding entirely. Otherwise, we proceed with percent-encoding from that character onward. We classify characters needing percent-encoding using fast table lookups.
• The second bit is set whenever the backslash character is present.
• The third bit is set whenever a dot character is present.
• The fourth bit is set whenever the percent character is present.
We call the result of the function a path signature. We could use SIMD instructions for the computation of the path signature (it would be beneficial for long paths), but our signature routine is already efficient.
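The path signature can be sketched as follows. The bit assignments for the backslash, dot, and percent characters follow the text above; the meaning of the first bit is our assumption (we take it to flag characters that may require percent-encoding), and the exact character set used for it is illustrative.

```cpp
#include <cstdint>
#include <string_view>

// One pass over the path records which special characters occur, so that
// later stages can skip work entirely when a bit is unset. Sketch only.
uint8_t path_signature(std::string_view path) {
    uint8_t signature = 0;
    for (unsigned char c : path) {
        // First bit (assumption): character may require percent-encoding.
        if (c >= 0x80 || c <= 0x20 || c == '"' || c == '<' || c == '>' ||
            c == '`' || c == '#' || c == '?' || c == '{' || c == '}')
            signature |= 1;
        if (c == '\\') signature |= 2; // second bit: backslash present
        if (c == '.') signature |= 4;  // third bit: dot present
        if (c == '%') signature |= 8;  // fourth bit: percent present
    }
    return signature;
}
```

A signature of zero means the path needs no rewriting at all: no dot-segment removal, no backslash conversion, no percent handling.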
As we parse the input strings, we store the components (e.g., protocol, hostname) in a single buffer that becomes our normalized string. To record the location of the components, we use a convention similar to other parsers (e.g., Servo rust-url): counting the normalized string length, we only need nine integers to characterize a parsed URL. See Fig. 6. In our actual implementation compiled with GCC 12 under Linux, we use 80 bytes per URL (not counting the dynamic memory allocation), of which 32 bytes are used by the std::string instance that we use as our buffer.

F I G U R E 6 Offsets into the normalized string https://user:pass@example.com:1234/foo/bar?baz#quux: protocol_end, username_end, host_start, host_end, port, pathname_start, search_start, hash_start
Though our memory usage could be further optimized, it is clear that storing multiple std::string instances would use much more memory.
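The layout can be sketched as follows: eight integers plus the length of the normalized buffer give the nine integers mentioned above. Field names follow Fig. 6, but the sentinel value for absent components and the exact field types are our assumptions, not necessarily those of the library.

```cpp
#include <cstdint>
#include <string>

// Eight integers locating the components within the normalized buffer;
// together with the buffer length, nine integers characterize a URL.
struct url_components {
    static constexpr uint32_t omitted = uint32_t(-1); // assumed sentinel
    uint32_t protocol_end = 0;   // offset just past the protocol's ':'
    uint32_t username_end = 0;
    uint32_t host_start = 0;
    uint32_t host_end = 0;
    uint32_t port = omitted;     // port value, or omitted when absent
    uint32_t pathname_start = 0;
    uint32_t search_start = omitted;
    uint32_t hash_start = omitted;
};

// All components live in one normalized buffer: querying a component can
// return a view into this buffer rather than a fresh allocation.
struct url_aggregator {
    std::string buffer;
    url_components components;
};
```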

| JavaScript Integration
In a system like Node.js, calling C++ from JavaScript can be relatively expensive. Indeed, creating a new JavaScript string instance from C++ data can be a costly operation. With our design where we have a single normalized string, we just need to additionally pass some integer offsets to indicate the position of the components in the string. We also provide JavaScript with the protocol type as an integer, which allows us (for example) to check that we have a file URL with a single integer comparison. Components such as the protocol, hostname, pathname, search, and hash are computed as needed as substrings of the normalized string from within JavaScript. In effect, we reduce as much as possible the need to copy strings between C++ and JavaScript, relying instead on integer values.

| BENCHMARKS
For C++ benchmarking, we use release 2.4.1 of the Ada library. Our implementation is safe and correct in the sense that it has undergone thorough testing, including extensive tests with random inputs (fuzzing).
To directly compare our C++ implementation, we use the following competitors:
• A high-quality WHATWG URL C++ library published as open-source software by Misevičius.3
Some URL parsers are embedded in larger systems in a manner that is difficult to remove or isolate. We found other URL parsers, but we believe that the standalone parsers we have selected are representative of the state of the art: all of them are well maintained, reasonably fast, and well documented.
Our benchmark code consumes the URLs taken from large datasets: we ask each parser to normalize the strings.
We use Google Benchmarks to derive accurate timings.We also add additional code to capture CPU performance counters (cycles and instructions retired).
We gathered a collection of realistic URLs for benchmarking purposes and we make them freely available.5
• The wikipedia 100k dataset contains 100 000 URLs from a snapshot of all Wikipedia articles as URLs (collected March 6th, 2023).
• The top 100 dataset contains 100 031 URLs found in a crawl of the top 100 most popular websites on the Internet.
It contains some invalid URLs: 26 URLs are invalid according to the WHATWG URL specification. The curl parser finds 130 invalid URLs whereas the Boost.URL parser identifies 201 invalid URLs. We make freely available the JavaScript software we used to construct this dataset.6 Fig. 7 presents two histograms regarding this dataset.
The first histogram shows that the size of the host ranges roughly between 10 and 30 bytes, with some outliers. The total size of the URL string ranges between a few bytes and hundreds of bytes. Most URLs use between 50 and 100 bytes.
• The Linux files dataset contains all files from a Linux system as URLs (169 312 URLs).
• The userbait dataset contains 11 430 URLs from a phishing benchmark.7
In some experiments, we also include another dataset: the kasztp dataset is made of 48 009 URLs from a URL shortener benchmark.8 When they are not ASCII, all URLs are processed as UTF-8 strings. The conversion from UTF-16 inputs to UTF-8 would take little computation [16]. We assume that all inputs are valid Unicode; validation would similarly require little computation [17].
We run our benchmarks on the two systems presented in Table 2. The AMD server runs Ubuntu 22.04 whereas the Apple processor is in a standard MacBook Air (2022). We monitor the effective frequency and find that the MacBook Air remains at 3.0 GHz whereas the AMD server maintains 3.4 GHz. We find little variation in the effective frequency between tests (within 1%). Our benchmark should not be interpreted as an assessment of the performance of ARM versus x64, or of AMD versus Apple. We use different hardware systems, released at different times, to arrive at a more robust comparison of the software.

| JavaScript runtime environments
We also benchmark URL parsing within JavaScript runtime environments. We use Node.js, which runs JavaScript on servers using the Google v8 JavaScript engine; it contains additional code written in C++ and JavaScript. Apart from the popular Node.js runtime environment, we selected two similar environments. Deno resembles Node.js in that it also relies on the v8 JavaScript engine; it is written in Rust instead of C++. Bun is another JavaScript environment, but it replaces Google v8 with WebKit's JavaScript engine (upon which Apple Safari is based). Bun is also written in part with C++ and Zig. For URL parsing, Bun relies on WebKit's internal C++ code whereas Deno uses rust-url.9 We use Deno (version 1.32.5), Bun (version 0.5.9), and Node.js (versions 18.15.0 and 20.1.0). We choose Node.js version 18.15 because more recent versions of Node.js include some of our URL-parsing software. All systems run the same scripts, parsing the URLs from the top 100 dataset. We use mitata (version 0.1.6) as the benchmarking framework in JavaScript.10 We make our script available.11 Fig. 9 gives the number of millions of URLs processed per second for different datasets and different JavaScript systems. Node.js 20, with our Ada URL parser, has the best performance. However, Bun also provides excellent performance, especially on the Linux files dataset where it comes close to Node.js 20. Roughly speaking, compared to the C++ benchmarks (Fig.
8), the speeds are about half: Node.js 20 is consistently faster than 1.5 million URLs per second on the AMD system, and faster than 2.5 million URLs per second on the Apple system. This suggests that about half of the processing is tied to the JavaScript system, some of it spent in C++ and the rest in JavaScript. The most important difference is between Node.js 20 and Node.js 18 (which lacked Ada): Node.js 20 is four times faster on the Apple system and five times faster on the AMD system. We believe that this is essentially attributable to the replacement of the legacy URL parser by Ada. Node.js went from having the worst URL parsing performance to the best, compared with Bun and Deno.
To gain further evidence that the better performance in Node.js is largely due to our work, we used profiling.
Specifically, we ran the Node.js benchmarks using version 20 (with Ada) and version 18 (without Ada) under the Linux perf record command. The command gathers profiling data of a Linux application at 4000 Hz (by default). We then used the perf report command to identify the most time-consuming functions. For Node 18, we find that the most time-consuming function related to URL parsing is node::url::URL::Parse: it takes an estimated 4.7 s during the entire benchmark (all data files included). For Node 20, we find that the most time-consuming function related to URL parsing is ada::parser::parse_url<ada::url_aggregator>. It takes an estimated 1.4 s, between three and four times less than the equivalent function in Node 18.
Node.js 20 is more than twice as fast as Deno in our experiments: this is consistent with the fact that Node.js 20 relies on the Ada C++ library whereas Deno relies on rust-url, a significantly slower software library (see Fig. 8).

| JavaScript server
To better assess the real-world performance impact of our URL parser, we wrote an http server (Fig. 10). The server supports two different paths:
• simple returns the string contained in the body of the query;
• href parses the string as a URL and returns its normalized form.
We query the server with a benchmarking tool12: multiple requests are issued during 10 seconds, using 10 threads. We use as a test query the JSON document { "url": "https://www.google.com/hello-world?query=search#value" }. We present the results from the Apple system in Table 5. The margin of error on the average number of requests per second is small (1%) during our tests. Our results suggest that Node.js 20.1 might be slightly faster than Node.js 18.15 on trivial requests (simple), by up to 2%. However, we find that Node.js 20.1 is faster by ≈ 10% compared to Node.js 18.15 for the requests that involve URL parsing (href): it suggests that URL parsing could have been a performance bottleneck in Node.js 18.15, prior to the integration of our URL parsing library. We find it interesting that when using Node.js 20, there is little difference between the simple and the href benchmarks (≈ 3%), which suggests that URL parsing may no longer be a performance bottleneck.

| CONCLUSION
We developed and released a new URL parser that provides full compliance with the WHATWG URL specification. It replaced the legacy Node.js parser, multiplying the performance of URL parsing in Node.js. We believe that our good results can be explained in large part by the following strategies: (1) reduce the number of memory allocations to a minimum, using a single buffer if possible; (2) implement fast functions to check for common fast paths; (3) replace strings with simpler types such as integers whenever possible. Our work suggests that there is still much room for performance improvements in the software used to build web applications.

Fig. 8 gives the number of millions of URLs processed per second for different datasets and different software libraries. Our parser (Ada) dominates, being often twice as fast as other parsers. It is consistently faster than 3 million URLs per second.

Table 3 presents the collected performance counters while running the C++ benchmark, while Table 4 has the performance counters for the Apple system. Ada requires consistently fewer instructions than the other parsers. For example, on the top 100 dataset, it required 2200 instructions per URL (AMD) and 2400 instructions per URL (Apple), compared to 18 000 and 19 000 for curl: Ada required eight times fewer instructions.
TA B L E 5 http performance for Apple M2 system