When adding some optimization rules for `icmp` in the egraph
infrastructure, we ended up creating a path to legal CLIF but with
patterns unsupported by three of our four backends: specifically,
`select_spectre_guard` with a general truthy input, rather than an
`icmp`.
In #5206 we discussed replacing `select_spectre_guard` with something
more specific, and that could still be a long-term solution here, but
doing so now would interfere with ongoing refactoring of heap access
lowering, so I've opted not to do so. (In that issue I was concerned
about complexity and didn't see the need, but with this fuzzbug I'm
starting to feel a bit differently; maybe we should remove this
non-orthogonal op in the long run.)
Fixes #5417.
This adds support for `.wat` tests in `cranelift-filetest`. The test runner
translates the WAT to Wasm and then uses `cranelift-wasm` to translate the Wasm
to CLIF.
These tests are always precise output tests. The test expectations can be
updated by running tests with the `CRANELIFT_TEST_BLESS=1` environment variable
set, similar to our compile precise output tests. The test's expected output is
contained in the last comment in the test file.
The tests allow for configuring the kinds of heaps used to implement Wasm linear
memory via TOML in a `;;!` comment at the start of the test.
To get ISA and Cranelift flags parsing available in the filetests crate, I had
to move the `parse_sets_and_triple` helper from the `cranelift-tools` binary
crate to the `cranelift-reader` crate, where I think it logically
fits.
Additionally, I had to make some more bits of `cranelift-wasm`'s dummy
environment `pub` so that I could properly wrap and compose it with the
environment used for the `.wat` tests. I don't think this is a big deal, but if
we eventually want to clean this stuff up, we can probably remove the dummy
environments completely, remove `translate_module`, and fold them into these new
test environments and test runner (since Wasmtime isn't using those things
anyways).
* Fix optimization rules for narrow types: wrap i8 results to 8 bits.
This fixes #5405.
In the egraph mid-end's optimization rules, we were rewriting e.g. imuls
of two iconsts to an iconst of the result, but without masking off the
high bits (beyond the result type's width). This was producing iconsts
with set high bits beyond their types' width, which is not legal.
In addition, this PR adds some optimizations to the algebraic rules to
recognize e.g. `x == x` (and all other integer comparison operators) and
resolve to 1 or 0 as appropriate.
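For illustration, here is a minimal sketch of the masking idea; the function
name and shape are made up, and the real fix lives in the ISLE
constant-propagation rules:
```
// Fold `imul` of two constants, then mask the product down to the result
// type's width so that, e.g., an i8 result never carries set bits above bit 7.
fn fold_imul_const(ty_bits: u32, k1: u64, k2: u64) -> u64 {
    let product = k1.wrapping_mul(k2);
    if ty_bits >= 64 {
        product
    } else {
        product & ((1u64 << ty_bits) - 1)
    }
}

fn main() {
    // 0x40 * 4 = 0x100; as an i8, the legal iconst is 0, not 0x100.
    assert_eq!(fold_imul_const(8, 0x40, 4), 0);
}
```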
* Review feedback.
* Review feedback, again.
All instructions using the CPU flags types (IFLAGS/FFLAGS) were already
removed. This patch completes the cleanup by removing all remaining
instructions that define values of CPU flags types, as well as the
types themselves.
Specifically, the following features are removed:
- The IFLAGS and FFLAGS types and the SpecialType category.
- Special handling of IFLAGS and FFLAGS in machinst/isle.rs and
machinst/lower.rs.
- The ifcmp, ifcmp_imm, ffcmp, iadd_ifcin, iadd_ifcout, iadd_ifcarry,
isub_ifbin, isub_ifbout, and isub_ifborrow instructions.
- The writes_cpu_flags instruction property.
- The flags verifier pass.
- Flags handling in the interpreter.
All of these features are currently unused; no functional change
intended by this patch.
This addresses https://github.com/bytecodealliance/wasmtime/issues/3249.
I copied the `bmask` precise-output tests from x64 and used
`CRANELIFT_TEST_BLESS=1` to generate this test.
I don't know aarch64 well enough to decide if this output is correct.
However, for I128 it is identical to the previous I128-only
precise-output tests, and the existing runtests for bmask pass on
aarch64, so I think it's likely correct.
Enable regalloc2's SSA verifier in debug builds to check for any outstanding reuse of virtual registers in def constraints. As fuzzing enables debug_assertions, this will enable the SSA verifier when fuzzing as well.
* egraph support: rewrite to work in terms of CLIF data structures.
This work rewrites the "egraph"-based optimization framework in
Cranelift to operate on aegraphs (acyclic egraphs) represented in the
CLIF itself rather than as a separate data structure to which and from
which we translate the CLIF.
The basic idea is to add a new kind of value, a "union", that is like an
alias but refers to two other values rather than one. This allows us to
represent an eclass of enodes (values) as a tree. The union node allows
for a value to have *multiple representations*: either constituent value
could be used, and (in well-formed CLIF produced by correct
optimization rules) they must be equivalent.
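As a rough sketch of the shape this takes (illustrative only, not the actual
cranelift-codegen definitions), a value's definition gains a union case
alongside instruction results and block parameters:
```
// Illustrative stand-ins for the real entity references.
struct Inst(u32);
struct Block(u32);
#[derive(Clone, Copy)]
struct Value(u32);

// A value is defined by an instruction result, a block parameter, or (new
// here) a "union" of two equivalent values; chains of unions form the tree
// that represents an eclass of enodes.
enum ValueDef {
    Result(Inst, usize),   // the nth result of an instruction
    Param(Block, usize),   // the nth parameter of a block
    Union(Value, Value),   // two interchangeable representations of one value
}
```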
Like the old egraph infrastructure, we take advantage of acyclicity and
eager rule application to do optimization in a single pass. Like before,
we integrate GVN (during the optimization pass) and LICM (during
elaboration).
Unlike the old egraph infrastructure, everything stays in the
DataFlowGraph. "Pure" enodes are represented as instructions that have
values attached, but that are not placed into the function layout. When
entering "egraph" form, we remove them from the layout while optimizing.
When leaving "egraph" form, during elaboration, we can place an
instruction back into the layout the first time we elaborate the enode;
if we elaborate it more than once, we clone the instruction.
The implementation performs two passes overall:
- One, a forward pass in RPO (to see defs before uses), that (i) removes
"pure" instructions from the layout and (ii) optimizes as it goes. As
before, we eagerly optimize, so we form the entire union of optimized
forms of a value before we see any uses of that value. This lets us
rewrite uses to use the most "up-to-date" form of the value and
canonicalize and optimize that form.
The eager rewriting and acyclic representation make each other work
(we could not eagerly rewrite if there were cycles; and acyclicity
does not miss optimization opportunities only because the first time
we introduce a value, we immediately produce its "best" form). This
design choice is also what allows us to avoid the "parent pointers"
and fixpoint loop of traditional egraphs.
This forward optimization pass keeps a scoped hashmap to "intern"
nodes (thus performing GVN; a small sketch of this interning idea
follows after this list), and also interleaves on a per-instruction
level with alias analysis. The interleaving with alias analysis allows
alias analysis to see the most optimized form of each address (so it
can see equivalences), and allows the next value to see any
equivalences (reuses of loads or stored values) that alias analysis
uncovers.
- Two, a forward pass in domtree preorder, that "elaborates" pure enodes
back into the layout, possibly in multiple places if needed. This
tracks the loop nest and hoists nodes as needed, performing LICM as it
goes. Note that by doing this in forward order, we avoid the
"fixpoint" that traditional LICM needs: we hoist a def before its
uses, so when we place a node, we place it in the right place the
first time rather than moving later.
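The GVN interning mentioned in the first pass above can be sketched roughly as
follows; the types are hypothetical stand-ins, and the real pass uses a
*scoped* map (entries expire when leaving the dominating block) rather than a
flat one:
```
use std::collections::HashMap;

// Hypothetical stand-ins for the real Cranelift types, for illustration only.
type Value = u32;
type NodeKey = (&'static str, Vec<Value>); // opcode name plus argument values

// GVN by interning: the first time a pure node is seen, remember its result
// value; a later structurally identical node reuses that value instead of
// introducing a fresh def.
fn intern(map: &mut HashMap<NodeKey, Value>, key: NodeKey, fresh: Value) -> Value {
    *map.entry(key).or_insert(fresh)
}

fn main() {
    let mut map = HashMap::new();
    let v1 = intern(&mut map, ("iadd", vec![1, 2]), 10);
    let v2 = intern(&mut map, ("iadd", vec![1, 2]), 11); // same node: GVN'd
    assert_eq!(v1, v2);
}
```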
This PR replaces the old (a)egraph implementation. It removes both the
cranelift-egraph crate and the logic in cranelift-codegen that uses it.
On `spidermonkey.wasm` running a simple recursive Fibonacci
microbenchmark, this work shows 5.5% compile-time reduction and 7.7%
runtime improvement (speedup).
Most of this implementation was done in (very productive) pair
programming sessions with Jamey Sharp, thus:
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
* Review feedback.
* Review feedback.
* Review feedback.
* Bugfix: cprop rule: `(x + k1) - k2` becomes `x - (k2 - k1)`, not `x - (k1 - k2)`.
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
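A quick numeric check of the corrected identity, using wrapping arithmetic as
in two's-complement CLIF semantics:
```
// (x + k1) - k2 == x - (k2 - k1), under wrapping arithmetic.
fn main() {
    let (x, k1, k2): (i32, i32, i32) = (7, 100, 3);
    assert_eq!(
        x.wrapping_add(k1).wrapping_sub(k2),
        x.wrapping_sub(k2.wrapping_sub(k1))
    );
}
```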
This PR fixes two bugs in the riscv64 backend, where branch instructions were emitted in the middle of a basic block:
- Constant emission, where the constants are inlined into the vcode and are jumped over at runtime.
- The BrTableCheck pseudo-instruction, which was always emitted before a BrTable instruction and would handle jumping to the default label.
The first bug was resolved by introducing two new pseudo-instructions, LoadConst32 and LoadConst64. Both of these instructions serve to delay the constant's encoding until emission time, after regalloc2 has run.
The second bug was fixed by removing the BrTableCheck instruction. As it was always emitted directly before BrTable, it was easier to remove it and merge the two into a single instruction.
Rework the compilation of amodes in the aarch64 backend to stop reusing registers and instead generate fresh virtual registers for intermediates. This resolves some SSA checker violations with the aarch64 backend, and as a nice side-effect removes some unnecessary movs in the generated code.
As loading constants on aarch64 can take up to 4 instructions, we need to plumb through some additional registers. Rather than pass a fixed list of registers in, pass an allocation function.
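A minimal sketch of the "pass an allocation function" idea, with made-up names
rather than the real backend API:
```
// The helper asks for fresh virtual registers through a closure instead of
// being handed a fixed list, since a constant load may need several temps.
fn with_temps<F: FnMut() -> u32>(needed: usize, mut alloc_tmp: F) -> Vec<u32> {
    (0..needed).map(|_| alloc_tmp()).collect()
}

fn main() {
    let mut next: u32 = 100; // pretend virtual-register counter
    let temps = with_temps(4, || { next += 1; next });
    assert_eq!(temps, vec![101, 102, 103, 104]);
}
```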
* Cranelift: implement `heap_{load,store}` instruction legalization
This does not remove `heap_addr` yet, but it does factor out the common
bounds-check-and-compute-the-native-address functionality that is shared between
all of `heap_{addr,load,store}`.
Finally, this adds a missing optimization for when we can dedupe explicit bounds
checks for static memories and Spectre mitigations.
* Cranelift: Enable `heap_load_store_*` run tests on all targets
* Turn off probestack by default in Cranelift
The probestack feature is not implemented for the aarch64 and s390x
backends, and keeping it on by default requires those backends to carry
stub implementations. Turning off probestack by default allows the
s390x and aarch64 backends to panic with an error message instead of
providing a false sense of security. When the probestack option is
implemented for all backends, it may be reasonable to enable it by
default again.
* aarch64: Improve codegen for AMode fallback
Currently the final fallback for finalizing an `AMode` will generate
both a constant-loading instruction as well as an `add` instruction to
the base register into the same temporary. This commit improves the
codegen by removing the `add` instruction and folding the final add into
the finalized `AMode`. This changes the `extendop` used, but both
registers are 64-bit, so the result shouldn't be affected by the
extending operation.
* aarch64: Implement inline stack probes
This commit implements inline stack probes for the aarch64 backend in
Cranelift. The support here is modeled after the x64 support where
unrolled probes are used up to a particular threshold after which a loop
is generated. The instructions here are similar in spirit to x64 except
that unlike x64 the stack pointer isn't modified during the unrolled
loop to avoid needing to re-adjust it back up at the end of the loop.
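Roughly, the strategy choice looks like the following hedged sketch; the page
size and unroll threshold here are illustrative, not the backend's actual
tuning:
```
const PAGE_SIZE: u32 = 4096;
const UNROLL_THRESHOLD: u32 = 3;

enum ProbePlan {
    Unrolled(u32), // number of probes to emit inline
    Loop(u32),     // number of pages the emitted loop walks
}

// Small frames get one probe per page, unrolled; larger frames get a loop.
fn plan_probes(frame_size: u32) -> ProbePlan {
    let pages = (frame_size + PAGE_SIZE - 1) / PAGE_SIZE;
    if pages <= UNROLL_THRESHOLD {
        ProbePlan::Unrolled(pages)
    } else {
        ProbePlan::Loop(pages)
    }
}
```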
* Enable inline probestack for AArch64 and Riscv64
This commit enables inline probestacks for the AArch64 and Riscv64
architectures in the same manner that x86_64 has it enabled now. Some
more testing was also added, since on Unix platforms we should now be
guaranteed that Rust's stack overflow message is printed.
* Enable probestack for aarch64 in cranelift-fuzzgen
* Address review comments
* Remove implicit stack overflow traps from x64 backend
This commit removes implicit `StackOverflow` traps inserted by the x64
backend for stack-based operations. This was historically required when
stack overflow was detected with page faults but Wasmtime no longer
requires that since it's not suitable for wasm modules which call host
functions. Additionally no other backend implements this form of
implicit trap-code additions so this is intended to synchronize the
behavior of all the backends.
This fixes a previously added aarch64 test to properly abort the process
instead of the trap accidentally being caught by Wasmtime.
* Fix a style issue
Was missing some '$' characters and so was comparing string literals against
string literals instead of variable values against string literals. Regenerated
tests to fix them and add missing tests.
* Cranelift: consider heap's guard pages when legalizing `heap_addr`
Fixes #5328
* Update comment to align more directly with implementation
* Add legalization tests for `heap_addr` and offset guard pages
Fix shadowing identified in #5322 for imul and swiden_high/swiden_low/uwiden_high/uwiden_low combinations in the x64 backend, and remove some redundant rules from the aarch64 dynamic neon ruleset. Additionally, add tests to the x64 backend showing that the imul specializations are firing.
* Cranelift: Define `heap_load` and `heap_store` instructions
* Cranelift: Implement interpreter support for `heap_load` and `heap_store`
* Cranelift: Add a suite of runtests for `heap_{load,store}`
There are so many knobs we can twist for heaps and I wanted to exhaustively test
all of them, so I wrote a script to generate the tests. I've checked in the
script in case we want to make any changes in the future, but I don't think it
is worth adding this to CI to check that scripts are up to date or anything like
that.
* Review feedback
Remove some unnecessary moves in the x64 gen_memcpy implementation -- the call instruction that's generated will already constrain the args to those registers.
* cranelift: Cleanup `fdemote`/`fpromote` tests
* cranelift: Fix `fdemote`/`fpromote` instruction docs
The verifier fails if the input and output types are the same
for these instructions.
* cranelift: Fix `fdemote`/`fpromote` in the interpreter
* fuzzgen: Add `fdemote`/`fpromote`
We can encode more constants into 12-bit immediates if we do the following
rewrite for comparisons with odd constants:
A >= B + 1
==> A - 1 >= B
==> A > B
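A quick sanity check of the rewrite on small signed values, away from overflow:
```
// `a >= b + 1` holds exactly when `a > b` for integers.
fn main() {
    for a in -4i32..4 {
        for b in -4i32..4 {
            assert_eq!(a >= b + 1, a > b);
        }
    }
}
```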
* Cranelift: Make `heap_addr` return calculated `base + index + offset`
Rather than return just the `base + index`.
(Note: I've chosen to use the nomenclature "index" for the dynamic operand and
"offset" for the static immediate.)
This move the addition of the `offset` into `heap_addr`, instead of leaving it
for the subsequent memory operation, so that we can Spectre-guard the full
address, and not allow speculative execution to read the first 4GiB of memory.
Before this commit, we were effectively doing
load(spectre_guard(base + index) + offset)
Now we are effectively doing
load(spectre_guard(base + index + offset))
Finally, this also corrects `heap_addr`'s documented semantics to say that it
returns an address that will trap on access if `index + offset + access_size` is
out of bounds for the given heap, rather than saying that the `heap_addr` itself
will trap. This matches the implemented behavior for static memories, and after
https://github.com/bytecodealliance/wasmtime/pull/5190 lands (which is blocked
on this commit) will also match the implemented behavior for dynamic memories.
* Update heap_addr docs
* Factor out `offset + size` to a helper
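A plausible shape for such a helper (illustrative; the exact Cranelift code may
differ) is to widen before adding so the sum cannot wrap:
```
// `offset + access_size` computed in u64 so a u32 offset can never overflow.
fn offset_plus_size(offset: u32, size: u8) -> u64 {
    offset as u64 + size as u64
}

fn main() {
    assert_eq!(offset_plus_size(u32::MAX, 8), 0x1_0000_0007);
}
```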
Modify return pseudo-instructions to have pairs of registers: virtual and real. This allows us to constrain the virtual registers to the real ones specified by the abi, instead of directly emitting moves to those real registers.
Add a MemFlags operand to the bitcast instruction, where only the
`big` and `little` flags are accepted. These define the lane order
to be used when casting between types of different lane counts.
Update all users to pass an appropriate MemFlags argument.
Implement lane swaps where necessary in the s390x back-end.
This is the final part necessary to fix
https://github.com/bytecodealliance/wasmtime/issues/4566.
The sample program in cranelift/filetests/src/function_runner.rs
would abort with an mprotect failure under certain circumstances,
see https://github.com/bytecodealliance/wasmtime/pull/4453#issuecomment-1303803222
Root cause was that enabling PROT_EXEC on the main process heap
may be prohibited, depending on Linux distro and version.
This only shows up in the doc test sample program because the main
clif-util is multi-threaded and therefore allocations will happen
on glibc's per-thread heap, which is allocated via mmap, and not
the main process heap.
Work around the problem by enabling the "selinux-fix" feature of
the cranelift-jit crate dependency in the filetests. Note that
this didn't compile out of the box, so a separate fix is also
required and provided as part of this PR.
Going forward, it would be preferable to always use mmap to allocate
the backing memory for JITted code.
* cranelift: improve syscall error/oom handling in JIT module
The JIT module has several places where it `expect`s or `panic`s
on syscall or allocator errors. For example, `mmap` and `mprotect`
can fail if Linux `vm.max_map_count` is not high enough, and some
users may wish to handle this error rather than immediately
crashing.
This commit plumbs these errors upward as new `ModuleError`
types, so that callers of jit module functions like
`finalize_definitions` and `define_function` can handle them
(or just `unwrap()`, as desired).
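A hedged usage sketch, assuming the post-change API in which these functions
return a `Result`:
```
use cranelift_jit::JITModule;

// A caller can now surface an mmap/mprotect failure (for example when
// vm.max_map_count is too low) instead of panicking inside the JIT module.
fn finish(module: &mut JITModule) -> Result<(), String> {
    module
        .finalize_definitions()
        .map_err(|e| format!("JIT finalization failed: {e}"))
}
```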
* cranelift: Remove ModuleError::Syscall variant
Syscall errors can just be folded into the generic Backend error,
which is an anyhow::Error
* cranelift-jit: return io::ErrorKind::OutOfMemory for alloc failure
Just using `io::Error::last_os_error()` is not correct as global
allocator impls are not required to set errno
The simplifier was performing an optimization to replace bitselect
with vselect if all bytes of the condition mask could be shown
to be all ones or all zeros.
This optimization only ever made any difference in codegen on the
x64 target. Therefore, move this optimization to the x64 back-end
and perform it in ISLE instead. Resulting codegen should be
unchanged, with slightly improved compile time.
This also eliminates a few endian-dependent bitcast operations.
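The condition the optimization relies on can be sketched as a simple per-byte
check (illustrative only, not the ISLE rule itself):
```
// Every byte of the mask is known to be 0x00 or 0xff, so selecting
// bit-by-bit is the same as selecting whole lanes.
fn mask_is_lanewise(mask_bytes: &[u8]) -> bool {
    mask_bytes.iter().all(|&b| b == 0x00 || b == 0xff)
}

fn main() {
    assert!(mask_is_lanewise(&[0xff, 0xff, 0x00, 0x00]));
    assert!(!mask_is_lanewise(&[0x0f, 0xff, 0x00, 0x00]));
}
```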
* fuzzgen: Request only one variable for bswap
This was included by accident. Bswap only has one input, instead of two.
* cranelift: Add `bswap.i128` support
Adds support only for x86, AArch64, and S390X.
RISC-V does not yet have `bswap`.
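Conceptually, the lowering decomposes the 128-bit swap into two 64-bit swaps
plus a swap of the halves; a quick check of that decomposition:
```
// Byteswap each 64-bit half and exchange the halves.
fn bswap128(x: u128) -> u128 {
    let lo = (x as u64).swap_bytes();
    let hi = ((x >> 64) as u64).swap_bytes();
    ((lo as u128) << 64) | hi as u128
}

fn main() {
    let x = 0x0102_0304_0506_0708_090a_0b0c_0d0e_0f10u128;
    assert_eq!(bswap128(x), x.swap_bytes());
}
```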
This branch removes the trapif and trapff instructions, in favor of using an explicit comparison and trapnz. This moves us closer to removing iflags and fflags, but introduces the need to implement instructions like iadd_cout in the x64 and aarch64 backends.
- Allow bitcast for vectors with differing lane widths
- Remove raw_bitcast IR instruction
- Change all users of raw_bitcast to bitcast
- Implement support for no-op bitcast cases across backends
This implements the second step of the plan outlined here:
https://github.com/bytecodealliance/wasmtime/issues/4566#issuecomment-1234819394
Adds Bswap to the Cranelift IR. Implements the Bswap instruction
in the x64 and aarch64 codegen backends. Cranelift users can now:
```
builder.ins().bswap(value)
```
to get a native byteswap instruction.
* x64: implements the 32- and 64-bit bswap instruction, following
the pattern set by similar unary instructions (Neg and Not): it
only operates on a dst register, but is parameterized with both
a src and dst, which are expected to be the same register.
As the x64 bswap instruction only handles 32- or 64-bit registers,
the 16-bit swap is implemented as a rotate left by 8.
Updated the x64 RexFlags type to support emitting single-operand
instructions like bswap.
* aarch64: Bswap gets emitted as aarch64 rev16, rev32,
or rev64 instruction as appropriate.
* s390x: Bswap was already supported in backend, just had to add
a bit of plumbing
* For completeness, added bswap to the interpreter as well.
* added filetests and runtests for each ISA
* added bswap to fuzzgen, thanks to afonso360 for the code there
* 128-bit swaps are not yet implemented, that can be done later
* aarch64: Fix incorrect masking for small types on bmask
`bmask` was accidentally relying on the uppermost bits of the register
for small types.
This was found by fuzzgen when it generated a shift left followed by
a bmask: the shift left moved bits out of the range of the input
type (i8), but these bits are not automatically cleared since they
remain inside the 32 bits of the register.
That caused issues when the bmask tried to compare the whole register
instead of just the bottom bits. The solution here is to mask the upper
bits for small types.
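A small demonstration of why the masking matters when an i8 value lives in a
wider register:
```
fn main() {
    let reg: u32 = 0x80;        // i8 value 0x80 held in a 32-bit register
    let shifted = reg << 1;     // i8 result is 0x00, but the register holds 0x100
    assert_ne!(shifted, 0);          // whole-register compare: wrongly truthy
    assert_eq!(shifted & 0xff, 0);   // masked to the i8 width: correctly zero
}
```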
* aarch64: Emit a 32-bit cmp on bmask
This fixes an issue where bmask was accidentally comparing the
upper bits of the register by always using a 64-bit cmp.
* riscv: Mask high bits in bmask
* riscv: Add compile tests for br{z,nz}
* riscv: Use shifts to mask 32-bit values
This produces less code than an AND, since that version needs to
load an immediate constant from memory.
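The shift trick in isolation: shifting left then right by 32 clears the upper
half of a 64-bit register without materializing a 0xffff_ffff constant:
```
fn main() {
    let x: u64 = 0xdead_beef_1234_5678;
    assert_eq!((x << 32) >> 32, x & 0xffff_ffff);
}
```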
* cranelift: Update test input to hexadecimal values
This makes it a bit more clear what is being tested.
* riscv: Use addiw for masking 32-bit values
Co-authored-by: Trevor Elliott <telliott@fastly.com>
* aarch64: Update bmask rule priority
Co-authored-by: Trevor Elliott <telliott@fastly.com>