During instance initialization, we build two sorts of arrays eagerly:
- We create an "anyfunc" (a `VMCallerCheckedAnyfunc`) for every function
in an instance.
- We initialize every element of a funcref table with an initializer to
a pointer to one of these anyfuncs.
Most instances will not touch (via call_indirect or table.get) all
funcref table elements. And most anyfuncs will never be referenced,
because most functions are never placed in tables or used with
`ref.func`. Thus, both of these initialization tasks are quite wasteful.
Profiling shows that a significant fraction of the remaining
instance-initialization time after our other recent optimizations is
going into these two tasks.
This PR implements two basic ideas:
- The anyfunc array can be lazily initialized as long as we retain the
information needed to do so. For now, in this PR, we just recreate the
anyfunc whenever a pointer is taken to it, because doing so is fast
enough; in the future we could keep some state to know whether the
anyfunc has been written yet and skip this work if redundant.
This technique allows us to leave the anyfunc array as uninitialized
memory, which can be a significant savings. Filling it with
initialized anyfuncs is very expensive, but even zeroing it is
expensive: e.g. in a large module, it can be >500KB.
- A funcref table can be lazily initialized as long as we retain a link
to its corresponding instance and function index for each element. A
zero in a table element means "uninitialized", and a slowpath does the
initialization.
Funcref tables are a little tricky because funcrefs can be null. We need
to distinguish "element was initially non-null, but user stored explicit
null later" from "element never touched" (ie the lazy init should not
blow away an explicitly stored null). We solve this by stealing the LSB
from every funcref (anyfunc pointer): when the LSB is set, the funcref
is initialized and we don't hit the lazy-init slowpath. We insert the
bit on storing to the table and mask it off after loading.
We do have to set up a precomputed array of `FuncIndex`s for the table
in order for this to work. We do this as part of the module compilation.
This PR also refactors the way that the runtime crate gains access to
information computed during module compilation.
Performance effect measured with in-tree benches/instantiation.rs, using
SpiderMonkey built for WASI, and with memfd enabled:
```
BEFORE:
sequential/default/spidermonkey.wasm
time: [68.569 us 68.696 us 68.856 us]
sequential/pooling/spidermonkey.wasm
time: [69.406 us 69.435 us 69.465 us]
parallel/default/spidermonkey.wasm: with 1 background thread
time: [69.444 us 69.470 us 69.497 us]
parallel/default/spidermonkey.wasm: with 16 background threads
time: [183.72 us 184.31 us 184.89 us]
parallel/pooling/spidermonkey.wasm: with 1 background thread
time: [69.018 us 69.070 us 69.136 us]
parallel/pooling/spidermonkey.wasm: with 16 background threads
time: [326.81 us 337.32 us 347.01 us]
WITH THIS PR:
sequential/default/spidermonkey.wasm
time: [6.7821 us 6.8096 us 6.8397 us]
change: [-90.245% -90.193% -90.142%] (p = 0.00 < 0.05)
Performance has improved.
sequential/pooling/spidermonkey.wasm
time: [3.0410 us 3.0558 us 3.0724 us]
change: [-95.566% -95.552% -95.537%] (p = 0.00 < 0.05)
Performance has improved.
parallel/default/spidermonkey.wasm: with 1 background thread
time: [7.2643 us 7.2689 us 7.2735 us]
change: [-89.541% -89.533% -89.525%] (p = 0.00 < 0.05)
Performance has improved.
parallel/default/spidermonkey.wasm: with 16 background threads
time: [147.36 us 148.99 us 150.74 us]
change: [-18.997% -18.081% -17.285%] (p = 0.00 < 0.05)
Performance has improved.
parallel/pooling/spidermonkey.wasm: with 1 background thread
time: [3.1009 us 3.1021 us 3.1033 us]
change: [-95.517% -95.511% -95.506%] (p = 0.00 < 0.05)
Performance has improved.
parallel/pooling/spidermonkey.wasm: with 16 background threads
time: [49.449 us 50.475 us 51.540 us]
change: [-85.423% -84.964% -84.465%] (p = 0.00 < 0.05)
Performance has improved.
```
So an improvement of something like 80-95% for a very large module (7420
functions in its one funcref table, 31928 functions total).
Cranelift Code Generator
A Bytecode Alliance project
Cranelift is a low-level retargetable code generator. It translates a target-independent intermediate representation into executable machine code.
For more information, see the documentation.
For an example of how to use the JIT, see the JIT Demo, which implements a toy language.
For an example of how to use Cranelift to run WebAssembly code, see Wasmtime, which implements a standalone, embeddable, VM using Cranelift.
Status
Cranelift currently supports enough functionality to run a wide variety of programs, including all the functionality needed to execute WebAssembly MVP functions, although it needs to be used within an external WebAssembly embedding to be part of a complete WebAssembly implementation.
The x86-64 backend is currently the most complete and stable; other architectures are in various stages of development. Cranelift currently supports both the System V AMD64 ABI calling convention used on many platforms and the Windows x64 calling convention. The performance of code produced by Cranelift is not yet impressive, though we have plans to fix that.
The core codegen crates have minimal dependencies, support no_std mode (see below), and do not require any host floating-point support, and do not use callstack recursion.
Cranelift does not yet perform mitigations for Spectre or related security issues, though it may do so in the future. It does not currently make any security-relevant instruction timing guarantees. It has seen a fair amount of testing and fuzzing, although more work is needed before it would be ready for a production use case.
Cranelift's APIs are not yet stable.
Cranelift currently requires Rust 1.37 or later to build.
Contributing
If you're interested in contributing to Cranelift: thank you! We have a contributing guide which will help you getting involved in the Cranelift project.
Planned uses
Cranelift is designed to be a code generator for WebAssembly, but it is general enough to be useful elsewhere too. The initial planned uses that affected its design are:
- WebAssembly compiler for the SpiderMonkey engine in Firefox.
- Backend for the IonMonkey JavaScript JIT compiler in Firefox.
- Debug build backend for the Rust compiler.
- Wasmtime non-Web wasm engine.
Building Cranelift
Cranelift uses a conventional Cargo build process.
Cranelift consists of a collection of crates, and uses a Cargo
Workspace,
so for some cargo commands, such as cargo test, the --all is needed
to tell cargo to visit all of the crates.
test-all.sh at the top level is a script which runs all the cargo
tests and also performs code format, lint, and documentation checks.
Building with no_std
The following crates support `no_std`, although they do depend on liballoc:
- cranelift-entity
- cranelift-bforest
- cranelift-codegen
- cranelift-frontend
- cranelift-native
- cranelift-wasm
- cranelift-module
- cranelift-preopt
- cranelift
To use no_std mode, disable the std feature and enable the core feature. This currently requires nightly rust.
For example, to build `cranelift-codegen`:
cd cranelift-codegen
cargo build --no-default-features --features core
Or, when using cranelift-codegen as a dependency (in Cargo.toml):
[dependency.cranelift-codegen]
...
default-features = false
features = ["core"]
no_std support is currently "best effort". We won't try to break it, and we'll accept patches fixing problems, however we don't expect all developers to build and test no_std when submitting patches. Accordingly, the ./test-all.sh script does not test no_std.
There is a separate ./test-no_std.sh script that tests the no_std support in packages which support it.
It's important to note that cranelift still needs liballoc to compile. Thus, whatever environment is used must implement an allocator.
Also, to allow the use of HashMaps with no_std, an external crate called hashmap_core is pulled in (via the core feature). This is mostly the same as std::collections::HashMap, except that it doesn't have DOS protection. Just something to think about.
Log configuration
Cranelift uses the log crate to log messages at various levels. It doesn't
specify any maximal logging level, so embedders can choose what it should be;
however, this can have an impact of Cranelift's code size. You can use log
features to reduce the maximum logging level. For instance if you want to limit
the level of logging to warn messages and above in release mode:
[dependency.log]
...
features = ["release_max_level_warn"]
Editor Support
Editor support for working with Cranelift IR (clif) files: