Serializing strings has long been a pain point for developers and, in the context
of this article, it is a bottleneck in URL operations. Recently, with the help of
Daniel Lemire, we conducted extensive research to reduce
the cost of string serialization in URL parsing operations in Node.js core.
This resulted in a series of optimizations that addressed the issue and led to
Ada v2.0.0. By implementing these techniques, we were able to improve
the performance of URL parsing and formatting, as well as reduce memory usage
and improve overall runtime stability. In this article, we will delve into
the challenges we encountered while optimizing these bottlenecks in Node.js
core, and explain the techniques we used to achieve significant
performance improvements.
Quick Recap
What is the purpose of serialization?
Serialization enables us to save the state of an object and recreate the object
in a new location. In the context of this article, serialization is required
to pass data between the C++ and JavaScript layers.
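As a minimal illustration of that definition, the sketch below round-trips a plain object through JSON text. JSON is used here only because it is built into JavaScript; it is an assumption made for illustration and not the mechanism Node.js uses between C++ and JavaScript. The urlState object and its values are hypothetical.

```javascript
// Hypothetical state we want to move to "a new location".
const urlState = { href: 'https://www.yagiz.co/', port: '', pathname: '/' };

// Serialize: flatten the in-memory object into a string.
const wire = JSON.stringify(urlState);

// Deserialize: rebuild an equivalent, but distinct, object from the string.
const restored = JSON.parse(wire);

console.log(restored.href === urlState.href, restored === urlState);
```

Both steps cost time and memory proportional to the size of the data, which is why the rest of the article tries to move as little as possible across the boundary.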
How does C++ code communicate with JavaScript code in Node.js?
Node.js exposes C++ classes to the JavaScript layer using V8 through an
interface called internalBinding, where each subsystem of Node.js registers
its own bindings. An example implementation of how node:buffer registers
a certain function is available below.
src/node_buffer.cc
void Initialize(Local<Object> target,
                Local<Value> unused,
                Local<Context> context,
                void* priv) {
  // Expose the C++ function to JavaScript under the name "setBufferPrototype".
  SetMethod(context, target, "setBufferPrototype", SetBufferPrototype);
}

void RegisterExternalReferences(ExternalReferenceRegistry* registry) {
  registry->Register(SetBufferPrototype);
}

NODE_BINDING_CONTEXT_AWARE_INTERNAL(buffer, node::Buffer::Initialize)
NODE_BINDING_EXTERNAL_REFERENCE(buffer, node::Buffer::RegisterExternalReferences)
The internals of how internalBinding is created, maintained, and used are out
of the scope of this article. For more information, please refer to the
GitHub discussion I’ve created called Communication steps between JS and C++.
Problem Definition
Here is a quick overview of the implementation provided by Node.js v19.8.0.
The code below is a simplified version of the actual implementation, and
does not include the base URL parameter.
Whenever a user calls new URL inside Node.js, the following class is instantiated.
This class is a wrapper for calling the actual implementation in C++ provided
by the Ada URL parser. The following code is available on GitHub.
Node.js URL class constructor
const { parse } = internalBinding('url');

class URL {
  #context = new URLContext();

  constructor(input) {
    input = `${input}`;
    if (!parse(input, this.#onParseComplete)) {
      throw new ERR_INVALID_URL(input);
    }
  }
}
The parse method takes two parameters: the input and a completion callback.
This is mostly done to avoid the overhead of creating a new object on each
function call.
For example, the following code is slow due to the serialization cost of objects:
Object serialization example
const { parse } = internalBinding('url');
const url = 'https://www.yagiz.co';
const { isValid, ...rest } = parse(url);

if (isValid) {
  console.log(`parsed href is ${rest.href}`);
}
In the example above, the parse function returns a boolean isValid and
other properties of the parsed URL. However, the parse function returns
these properties regardless of the isValid flag. This means that
the structure of the rest object is unknown at compile time, and
V8 has to optimize it with limited knowledge of
the executed code block. This is a very common problem with JIT (just-in-time)
compilers.
Let’s dive into the details of how the parse function is implemented. The URL
constructor by default calls a C++ function called parse, which is
defined inside src/node_url.cc. The parse function is defined as follows:
Parse
void Parse(const FunctionCallbackInfo<Value>& args) {
  CHECK_GE(args.Length(), 2);
  CHECK(args[0]->IsString());    // input
  CHECK(args[1]->IsFunction());  // complete callback

  // The completion callback is the second argument (index 1).
  Local<Function> success_callback_ = args[1].As<Function>();

  Environment* env = Environment::GetCurrent(args);
  HandleScope handle_scope(env->isolate());
  Context::Scope context_scope(env->context());

  Utf8Value input(env->isolate(), args[0]);
  ada::result out = ada::parse(input.ToStringView());

  if (!out) {
    return args.GetReturnValue().Set(false);
  }

  // Convert every URL component to a V8 string and hand them to JavaScript.
  auto argv = GetCallbackArgs(env, out);
  USE(success_callback_->Call(
      env->context(), args.This(), argv.size(), argv.data()));
  args.GetReturnValue().Set(true);
}
Whenever the parse function is called, it must be given an
input parameter, which is a string, and a callback function that passes
the values back to the JavaScript layer. This is a smart way of avoiding
the serialization cost of objects, and it is also a very common pattern
in Node.js core. Unfortunately, this pattern turns the function into
one with a side effect: instead of returning the result, it has to invoke
the callback, which mutates state on the JavaScript side.
Here’s an example of how the callback arguments are built to return the result
to the JavaScript layer:
GetCallbackArgs
auto GetCallbackArgs(Environment* env, const ada::result& url) {
  Local<Context> context = env->context();
  Isolate* isolate = env->isolate();

  auto js_string = [&](std::string_view sv) {
    return ToV8Value(context, sv, isolate).ToLocalChecked();
  };

  return std::array{
      js_string(url->get_href()),
      js_string(url->get_origin()),
      js_string(url->get_protocol()),
      js_string(url->get_hostname()),
      js_string(url->get_pathname()),
      js_string(url->get_search()),
      js_string(url->get_username()),
      js_string(url->get_password()),
      js_string(url->get_port()),
      js_string(url->get_hash()),
  };
}
In order to process and save this data on the JavaScript layer,
preferably in the URL class, the JavaScript layer needed a heavy
function to update the current context of the URL:
Simplified version of the URL class onParseComplete function
#onParseComplete = (href, origin, protocol, hostname, pathname,
                    search, username, password, port, hash) => {
  this.#context.href = href;
  this.#context.origin = origin;
  this.#context.protocol = protocol;
  this.#context.hostname = hostname;
  this.#context.pathname = pathname;
  this.#context.search = search;
  this.#context.username = username;
  this.#context.password = password;
  this.#context.port = port;
  this.#context.hash = hash;
};
As you may have noticed, this implementation is not very efficient. It requires
sharing knowledge of the callback function parameters, by index, between
JavaScript and C++. On top of this being a bad practice, there is a lot of
room for improvement in terms of performance. The bridge from
C++ to JavaScript is not very efficient, leading to bottlenecks when used
in hot paths.
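The positional contract can be mimicked in plain JavaScript. In this sketch, fakeNativeParse is a hypothetical stand-in for the C++ binding, and the global WHATWG URL class plays the role of the parser; neither is the actual Node.js internal code. It only shows that both sides must agree on argument order, because nothing names the values as they cross the boundary.

```javascript
// Hypothetical stand-in for the native `parse` binding (not Node.js internals).
function fakeNativeParse(input, onComplete) {
  let url;
  try {
    url = new URL(input); // the global URL class plays the role of the C++ parser
  } catch {
    return false; // invalid input: the callback is never invoked
  }
  // Values are handed over by position only, in the order used by GetCallbackArgs.
  onComplete(url.href, url.origin, url.protocol, url.hostname, url.pathname,
             url.search, url.username, url.password, url.port, url.hash);
  return true;
}

// A receiver that follows the agreed order.
const context = {};
const ok = fakeNativeParse('https://www.yagiz.co', (href, origin, protocol) => {
  context.href = href;
  context.origin = origin;
  context.protocol = protocol;
});

// A receiver that swaps two parameters silently stores the wrong values.
const swapped = {};
fakeNativeParse('https://www.yagiz.co', (origin, href) => {
  swapped.origin = origin; // actually receives the href
  swapped.href = href;     // actually receives the origin
});

console.log(ok, context, swapped);
```

No error is raised in the second call, which is what makes an index-based contract between two languages fragile.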
This wasn’t a problem until now, because the true performance cost of this
function lay in the fact that the URL parser was slow. However, with the
Ada URL parser, the bottleneck moved
to the serialization of the result.
As you know, a URL contains a lot of properties, and href is the only
attribute that contains all of the other properties, which makes it the identifier
of the URL.
URL properties
> new URL('https://www.yagiz.co')
URL {
  href: 'https://www.yagiz.co/',
  origin: 'https://www.yagiz.co',
  protocol: 'https:',
  username: '',
  password: '',
  host: 'www.yagiz.co',
  hostname: 'www.yagiz.co',
  port: '',
  pathname: '/',
  search: '',
  searchParams: URLSearchParams {},
  hash: ''
}
As you might notice, origin, protocol, host, hostname and others are all
substrings of href. However, the solution is not as simple as slicing href, because
the origin might differ from what appears in the URL, and the hostname can vary
with different pathname values. There are lots of edge cases that need
to be resolved if we are going to take this approach.
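The substring observation, and one case where it breaks, can be checked with the public URL class. This is a small sketch with a hypothetical helper, isSubstringOfHref, and a file URL chosen as an example of an edge case; it is not taken from the Node.js or Ada sources.

```javascript
// Hypothetical helper: does the given property appear verbatim inside href?
function isSubstringOfHref(url, property) {
  return url.href.includes(url[property]);
}

const simple = new URL('https://www.yagiz.co');
const simpleResults = ['origin', 'protocol', 'host', 'hostname', 'pathname']
  .map((property) => isSubstringOfHref(simple, property));

// For file URLs the origin serializes as the string 'null',
// which does not appear anywhere in the href.
const file = new URL('file:///tmp/example.txt');
const fileOrigin = file.origin;
const fileOriginInHref = isSubstringOfHref(file, 'origin');

console.log(simpleResults, fileOrigin, fileOriginInHref);
```

Properties such as origin therefore still need their own logic, even when the remaining components can be sliced out of href.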
The Solution
With Ada URL Parser v2.0.0, we adopted an approach that is common in the
industry for storing URL properties. The idea is to store the
href, and use offsets to represent the URL properties. This way, we can
access the URL properties without knowing the business logic behind
“How to parse a URL?”.
This solution comes with another advantage on top of solving the serialization
cost. The parsing becomes faster, because we don’t need to create a separate
string for each URL property. We can reserve and allocate a string with a
guessed size, and use the offsets to construct the href while parsing the URL.
This reduces memory allocations and the time spent on parsing the URL.
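The bookkeeping side of this idea can be sketched in a few lines. The buildHref function and its offset names below are hypothetical and only handle a URL without credentials or port; JavaScript strings also cannot be pre-reserved the way a C++ std::string can, so only the recording of offsets while the href grows is illustrated.

```javascript
// Hypothetical builder: append each piece to a single string and record
// the position reached after (or before) each component.
function buildHref({ protocol, hostname, pathname, search, hash }) {
  let href = '';
  const offsets = {};

  href += protocol;                    // e.g. "https:"
  offsets.protocolEnd = href.length;

  href += '//';
  offsets.hostStart = href.length;
  href += hostname;
  offsets.hostEnd = href.length;

  offsets.pathnameStart = href.length;
  href += pathname;

  offsets.searchStart = href.length;
  href += search;

  offsets.hashStart = href.length;
  href += hash;

  return { href, offsets };
}

const built = buildHref({
  protocol: 'https:',
  hostname: 'www.yagiz.co',
  pathname: '/',
  search: '',
  hash: '',
});

console.log(built.href, built.offsets);
```

Only one string is produced, and each component is described by a pair of integers instead of its own allocation.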
URL Components
Here’s a quick recap from the Ada v2.0 article:
URL Components structure
https://user:pass@example.com:1234/foo/bar?baz#quux
      |     |    |          | ^^^^|       |   |
      |     |    |          | |   |       |   `----- hash_start
      |     |    |          | |   |       `--------- search_start
      |     |    |          | |   `----------------- pathname_start
      |     |    |          | `--------------------- port
      |     |    |          `----------------------- host_end
      |     |    `---------------------------------- host_start
      |     `--------------------------------------- username_end
      `--------------------------------------------- protocol_end
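To make the diagram concrete, here is a minimal sketch that slices each component out of the href using offsets. The offset values were counted by hand for this one URL and the slicing rules are assumptions made for illustration; the exact conventions inside Ada may differ, and the sketch assumes every component is present.

```javascript
const href = 'https://user:pass@example.com:1234/foo/bar?baz#quux';

// Hypothetical offsets for the URL above, counted by hand (0-based indexes).
const offsets = {
  protocolEnd: 6,    // index just past "https:"
  usernameEnd: 12,   // index of the ":" between username and password
  hostStart: 17,     // index of the "@" before the host
  hostEnd: 29,       // index of the ":" before the port
  pathnameStart: 34, // index of the first "/" of the path
  searchStart: 42,   // index of "?"
  hashStart: 46,     // index of "#"
};

function componentsFromOffsets(href, o) {
  // Skip the "@" separator when credentials are present.
  const hostBegin = href[o.hostStart] === '@' ? o.hostStart + 1 : o.hostStart;
  return {
    protocol: href.slice(0, o.protocolEnd),
    username: href.slice(o.protocolEnd + 2, o.usernameEnd), // skip "//"
    password: href.slice(o.usernameEnd + 1, o.hostStart),   // skip ":"
    hostname: href.slice(hostBegin, o.hostEnd),
    port: href.slice(o.hostEnd + 1, o.pathnameStart),       // skip ":"
    pathname: href.slice(o.pathnameStart, o.searchStart),
    search: href.slice(o.searchStart, o.hashStart),
    hash: href.slice(o.hashStart),
  };
}

const parts = componentsFromOffsets(href, offsets);
console.log(parts);
```

With this layout, only the href and a handful of integers need to cross the C++ to JavaScript boundary; every getter becomes a slice of a string that already exists.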
The structure of the URL class stayed the same, but with a few caveats.
On the C++ side, we created a class c