* x64: Refactor `Amode` computation in ISLE

  This commit replaces the previous computation of `Amode` with a different set of rules that are intended to achieve the same purpose but are structured differently.

  The motivation for this commit becomes more relevant in the next commit, where `lea` may be used to lower the `iadd` instruction on x64. Doing so caused a stack overflow in the test suite during the compilation phase of a wasm module, specifically in the `amode_add` function. This function is recursively defined in terms of itself and recurses as deep as the deepest `iadd` chain in a program. A particular test in our test suite has a 10k-long chain of `iadd`, which ended up causing a stack overflow in debug mode.

  The stack overflow happens because the `amode_add` helper in ISLE unconditionally peels away all the `iadd` nodes and looks at every one of them, even if most end up in intermediate registers along the way. Given that structure I couldn't find a way to easily abort the recursion. The new `to_amode` helper is structured in a similar fashion but instead only recurses far enough to fold items into the final `Amode`, rather than recursing through items which themselves don't end up in the `Amode`. Put another way: previously the `amode_add` helper might emit `x64_add` instructions, but it no longer does that.

  The goal of this commit is to preserve all the original `Amode` optimizations, however. For some inputs it relies more on the egraph optimizations having run: if an `iadd` chain is 10k deep, it doesn't try to find a constant buried 9k levels down to fold into the `Amode`. The hope is that, with the egraph pass having already run, constants have been shuffled to the right most of the time and folded together where possible.

* x64: Add `lea`-based lowering for `iadd`

  This commit adds a rule to lower `iadd` to `lea` for 32- and 64-bit addition. The theoretical benefit of `lea` over the `add` instruction is that `lea` can act as a 3-operand instruction which doesn't destructively modify one of its operands. Additionally, `lea` can fold in other components such as constant additions and shifts.

  In practice, however, if `lea` is used unconditionally to lower `iadd`, it loses 10% performance on a local `meshoptimizer` benchmark. My best guess as to what's going on is that my CPU's dedicated address-computation units are all overloaded while the ALUs sit basically idle in a memory-intensive loop. Previously, when the ALU was used for `add` and the address units for stores/loads, things in theory pipelined better (most of this is me shooting in the dark).

  To prevent this performance loss, I've updated the lowering of `iadd` to use either `lea` or `add` depending on how "complicated" the `Amode` is. Simple forms like `a + b` or `a + $imm` continue to use `add` (along with the extra `mov` into the result that this may require). More complicated forms like `a + b + $imm` or `a + b << c + $imm` use `lea`, since it can remove the need for extra instructions. Locally, at least, this fixes the performance loss relative to unconditionally using `lea`.

  One note is that this adds an `OperandSize` argument to the `MInst::LoadEffectiveAddress` variant to add an encoding for 32-bit `lea` in addition to the preexisting 64-bit encoding.

* Conditionally use `lea` based on regalloc
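As a rough illustration of both ideas, here is a standalone Rust sketch over a toy expression type. It is purely hypothetical: the real change is a set of ISLE rules over CLIF values, and names such as `Expr`, `put_in_reg`, and `prefer_lea` are stand-ins invented for this example, not Cranelift APIs.

    // Hypothetical toy model of the ideas above; the real implementation is
    // a set of ISLE rules over CLIF values, not this Rust code.

    enum Expr {
        Add(Box<Expr>, Box<Expr>),
        Const(i32),
        Shl(Box<Expr>, u8),
        Var(&'static str), // anything else: it just lives in a register
    }

    // The shape of an x64 addressing mode: base + (index << shift) + offset.
    #[derive(Debug)]
    struct Amode {
        base: String,
        index: Option<(String, u8)>,
        offset: i32,
    }

    // Stand-in for "lower this operand into a register"; in the compiler this
    // would emit whatever instructions the subtree needs.
    fn put_in_reg(e: &Expr) -> String {
        match e {
            Expr::Var(name) => name.to_string(),
            _ => "%tmp".to_string(),
        }
    }

    // Fold an `Add` tree into an `Amode`. Recursion only continues when an
    // operand folds into the final addressing mode (a constant merged into the
    // displacement); any other operand is put in a register as-is, so a long
    // chain of register-register `iadd`s is not peeled apart node by node.
    fn to_amode(e: &Expr, offset: i32) -> Amode {
        match e {
            Expr::Add(a, b) => match (&**a, &**b) {
                (Expr::Const(c), rest) | (rest, Expr::Const(c)) => {
                    to_amode(rest, offset.wrapping_add(*c))
                }
                (other, Expr::Shl(idx, shift)) | (Expr::Shl(idx, shift), other) => Amode {
                    base: put_in_reg(other),
                    index: Some((put_in_reg(idx), *shift)),
                    offset,
                },
                (a, b) => Amode {
                    base: put_in_reg(a),
                    index: Some((put_in_reg(b), 0)),
                    offset,
                },
            },
            _ => Amode {
                base: put_in_reg(e),
                index: None,
                offset,
            },
        }
    }

    // Rough version of the heuristic from the second commit: simple shapes such
    // as `a + b` or `a + $imm` keep using `add`, while shapes that fold more
    // work (a scaled index, or both an index and a displacement) get an `lea`.
    fn prefer_lea(amode: &Amode) -> bool {
        match amode {
            Amode { index: None, .. } => false,                    // a + $imm
            Amode { index: Some((_, 0)), offset: 0, .. } => false, // a + b
            _ => true, // a + b + $imm, a + (b << c) + $imm, ...
        }
    }

    fn main() {
        // (a + (b << 3)) + 16  =>  base = a, index = b << 3, offset = 16
        let e = Expr::Add(
            Box::new(Expr::Add(
                Box::new(Expr::Var("a")),
                Box::new(Expr::Shl(Box::new(Expr::Var("b")), 3)),
            )),
            Box::new(Expr::Const(16)),
        );
        let amode = to_amode(&e, 0);
        println!("{:?} -> lea? {}", amode, prefer_lea(&amode));
    }

The key property is that recursion only continues through operands that actually fold into the final addressing mode; a deep chain of register-register `iadd`s bottoms out immediately instead of being peeled apart node by node, and `prefer_lea` only chooses `lea` when the folded form does more work than a two-operand `add` could.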
Cranelift Code Generator
A Bytecode Alliance project
Cranelift is a low-level retargetable code generator. It translates a target-independent intermediate representation into executable machine code.
For more information, see the documentation.
For an example of how to use the JIT, see the JIT Demo, which implements a toy language.
For an example of how to use Cranelift to run WebAssembly code, see Wasmtime, which implements a standalone, embeddable VM using Cranelift.
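As a condensed illustration of the JIT flow shown in the JIT Demo mentioned above, the sketch below builds a function that adds two 32-bit integers and then calls it, using the cranelift, cranelift-jit, and cranelift-module crates. Treat it as a rough outline rather than authoritative: exact signatures (for instance, which of these calls return a Result) shift between Cranelift versions, and this sketch assumes a recent release.

    use cranelift::prelude::*;
    use cranelift_jit::{JITBuilder, JITModule};
    use cranelift_module::{default_libcall_names, Linkage, Module};

    fn main() {
        // Create a JIT module targeting the host machine.
        let builder = JITBuilder::new(default_libcall_names()).unwrap();
        let mut module = JITModule::new(builder);

        // Describe a function `add(i32, i32) -> i32`.
        let mut ctx = module.make_context();
        ctx.func.signature.params.push(AbiParam::new(types::I32));
        ctx.func.signature.params.push(AbiParam::new(types::I32));
        ctx.func.signature.returns.push(AbiParam::new(types::I32));

        // Fill in its body: one block that returns the sum of the parameters.
        let mut fb_ctx = FunctionBuilderContext::new();
        let mut fb = FunctionBuilder::new(&mut ctx.func, &mut fb_ctx);
        let block = fb.create_block();
        fb.append_block_params_for_function_params(block);
        fb.switch_to_block(block);
        let (a, b) = (fb.block_params(block)[0], fb.block_params(block)[1]);
        let sum = fb.ins().iadd(a, b);
        fb.ins().return_(&[sum]);
        fb.seal_all_blocks();
        fb.finalize();

        // Compile it and fetch a pointer to the generated code.
        let id = module
            .declare_function("add", Linkage::Export, &ctx.func.signature)
            .unwrap();
        module.define_function(id, &mut ctx).unwrap();
        module.clear_context(&mut ctx);
        module.finalize_definitions().unwrap();
        let code = module.get_finalized_function(id);

        let add = unsafe { std::mem::transmute::<*const u8, extern "C" fn(i32, i32) -> i32>(code) };
        assert_eq!(add(2, 3), 5);
    }

The same pattern scales up to real front ends: translate your language into Cranelift IR with FunctionBuilder, hand each function to the module, then finalize and fetch the generated code.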
Status
Cranelift currently supports enough functionality to run a wide variety of programs, including all the functionality needed to execute WebAssembly (MVP and various extensions like SIMD), although it needs to be used within an external WebAssembly embedding such as Wasmtime to be part of a complete WebAssembly implementation. It is also usable as a backend for non-WebAssembly use cases: for example, there is an effort to build a Rust compiler backend using Cranelift.
Cranelift is production-ready, and is used in production in several places, all within the context of Wasmtime. It is carefully fuzzed as part of Wasmtime with differential comparison against V8 and the executable Wasm spec, and the register allocator is separately fuzzed with symbolic verification. There is an active effort to formally verify Cranelift's instruction-selection backends. We take security seriously and have a security policy as part of the Bytecode Alliance.
Cranelift has three backends: x86-64, aarch64 (aka ARM64), and s390x (aka IBM Z). All three backends fully support enough functionality for Wasm MVP, and x86-64 and aarch64 fully support SIMD as well. On x86-64, Cranelift supports both the System V AMD64 ABI calling convention used on many platforms and the Windows x64 calling convention. On aarch64, Cranelift supports the standard Linux calling convention and also has specific support for macOS (i.e., M1 / Apple Silicon).
Cranelift's code quality is within range of browser JIT engines' optimizing tiers. A recent paper includes third-party benchmarks of Cranelift, driven by Wasmtime, against V8 and an LLVM-based Wasm engine, WAVM (Fig 22). Cranelift's generated code runs ~2% slower than V8's (TurboFan) and ~14% slower than WAVM's (LLVM). In the same paper, Cranelift's compilation speed is measured as approximately an order of magnitude faster than WAVM's (LLVM). We continue to work to improve both measures.
The core codegen crates have minimal dependencies and are carefully written to handle malicious or arbitrary compiler input: in particular, they do not use callstack recursion.
Cranelift performs some basic mitigations for Spectre attacks on heap bounds checks, table bounds checks, and indirect branch bounds checks; see #1032 for more.
Cranelift's APIs are not yet considered stable, though we do follow semantic-versioning (semver) with minor-version patch releases.
As a policy, Cranelift requires the latest stable Rust to build, and is tested as such, but we can incorporate fixes for compilation with older Rust versions on a best-effort basis.
Contributing
If you're interested in contributing to Cranelift: thank you! We have a contributing guide which will help you get involved in the Cranelift project.
Planned uses
Cranelift is designed to be a code generator for WebAssembly, but it is general enough to be useful elsewhere too. The initial planned uses that affected its design were:
- Wasmtime non-Web wasm engine.
- Debug build backend for the Rust compiler.
- WebAssembly compiler for the SpiderMonkey engine in Firefox (currently not planned anymore; SpiderMonkey team may re-assess in the future).
- Backend for the IonMonkey JavaScript JIT compiler in Firefox (currently not planned anymore; SpiderMonkey team may re-assess in the future).
Building Cranelift
Cranelift uses a conventional Cargo build process.
Cranelift consists of a collection of crates, and uses a Cargo
Workspace,
so for some cargo commands, such as cargo test, the --all flag is needed
to tell cargo to visit all of the crates (for example, cargo test --all).
test-all.sh at the top level is a script which runs all the cargo
tests and also performs code format, lint, and documentation checks.
Log configuration
Cranelift uses the log crate to log messages at various levels. It doesn't
specify any maximum logging level, so embedders can choose what it should be;
however, this can have an impact on Cranelift's code size. You can use log
features to reduce the maximum logging level. For instance, if you want to limit
logging to warn messages and above in release mode:
[dependencies.log]
...
features = ["release_max_level_warn"]
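As an illustration of what this feature does (a property of the log crate itself, not of Cranelift): log macros below the configured maximum expand to no-ops at compile time, so the corresponding messages and the code to format them are removed from release builds.

    // Nothing here is Cranelift-specific. With the `release_max_level_warn`
    // feature enabled on the `log` dependency, the `debug!` call below
    // compiles to a no-op in release builds, while `warn!` is kept.
    fn example() {
        log::warn!("kept in release builds");
        log::debug!("statically removed in release builds");
    }

    fn main() {
        // Without a logger installed (e.g. env_logger), enabled messages are
        // simply discarded at runtime; it is the compile-time cap that
        // affects code size.
        example();
    }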
Editor Support
Editor support for working with Cranelift IR (clif) files: