* Restrict the types for isplit and iconcat to match backends
* Admit unimplemented bitwidths to isplit/iconcat
* Modify the NarrowInt type instead of shadowing it
* Fix filetest failures
* x64: Fix vbroadcastss with AVX2 and without AVX
This commit fixes a corner case in the emission of the
`vbroadcasts{s,d}` instructions. The memory-to-xmm form of these
instructions was available with the AVX instruction set, but the
xmm-to-xmm form of these instructions wasn't available until AVX2.
The instruction requirement for these are listed as AVX but the lowering
rules are appropriately annotated to use either AVX2 or AVX when
appropriate.
While this should work in practice this didn't work for the assertion
about enabled features for each instruction. The `vbroadcastss`
instruction was listed as requiring AVX but could get emitted when AVX2
was enabled (due to the reg-to-reg form being available). This caused an
issue for the fuzzer where AVX2 was enabled but AVX was disabled.
One possible fix would be to add more opcodes, one for reg-to-reg and
one for mem-to-reg. That seemed like somewhat overkill for a pretty
niche situation that shouldn't actually come up in practice anywhere.
Instead this commit changes all the `has_avx` accessors to the
`use_avx_simd` predicate already available in the target flags. The
`use_avx2_simd` predicate was then updated to additionally require
`has_avx`, so if AVX2 is enabled and AVX is disabled then the
`vbroadcastss` instruction won't get emitted any more.
Closes#6059
* Pass `enable_simd` on a few more files
* x64: Take SIGFPE signals for divide traps
Prior to this commit Wasmtime would configure `avoid_div_traps=true`
unconditionally for Cranelift. This, for the division-based
instructions, would change emitted code to explicitly trap on trap
conditions instead of letting the `div` x86 instruction trap.
There's no specific reason for Wasmtime, however, to specifically avoid
traps in the `div` instruction. This means that the extra generated
branches on x86 aren't necessary since the `div` and `idiv` instructions
already trap for similar conditions as wasm requires.
This commit instead disables the `avoid_div_traps` setting for
Wasmtime's usage of Cranelift. Subsequently the codegen rules were
updated slightly:
* When `avoid_div_traps=true`, traps are no longer emitted for `div`
instructions.
* The `udiv`/`urem` instructions now list their trap as divide-by-zero
instead of integer overflow.
* The lowering for `sdiv` was updated to still explicitly check for zero
but the integer overflow case is deferred to the instruction itself.
* The lowering of `srem` no longer checks for zero and the listed trap
for the `div` instruction is a divide-by-zero.
This means that the codegen for `udiv` and `urem` no longer have any
branches. The codegen for `sdiv` removes one branch but keeps the
zero-check to differentiate the two kinds of traps. The codegen for
`srem` removes one branch but keeps the -1 check since the semantics of
`srem` mismatch with the semantics of `idiv` with a -1 divisor
(specifically for INT_MIN).
This is unlikely to have really all that much of a speedup but was
something I noticed during #6008 which seemed like it'd be good to clean
up. Plus Wasmtime's signal handling was already set up to catch
`SIGFPE`, it was just never firing.
* Remove the `avoid_div_traps` cranelift setting
With no known users currently removing this should be possible and helps
simplify the x64 backend.
* x64: GC more support for avoid_div_traps
Remove the `validate_sdiv_divisor*` pseudo-instructions and clean up
some of the ISLE rules now that `div` is allowed to itself trap
unconditionally.
* x64: Store div trap code in instruction itself
* Keep divisors in registers, not in memory
Don't accidentally fold multiple traps together
* Handle EXC_ARITHMETIC on macos
* Update emit tests
* Update winch and tests
* Add narrower and wider constraints to the instruction DSL
* Add docs to narrower/wider operands
* Update cranelift/codegen/meta/src/cdsl/instructions.rs
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Fix assertion message
* Simplify upper bounds for the wider constraint
* Remove additional unnecessary cases in the verifier
* Remove unused variables
* Remove changes to is_ctrl_typevar_candidate
These changes were only necessary when the type returned by an
instruction was a variable constrained by narrow or widen. As we have
switched to requiring that constraints must appear on argument types and
not return types, these changes were not longer necessary.
---------
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* riscv64: Fix typo in extensions
* riscv64: Move converters to top of file
* riscv64: Group up all imm12 rules
* riscv64: Move zero_reg helpers to Physical Regs section
* riscv64: Move helpers away from `clz` lowerings
These were in the middle of the `clz` rules and are kind of distracting
* riscv64: Move `cls` rules next to `ctz`/`clz`
* cranelift: Move `u8_and` / `u32_add` to Primitive Arithmetic section
* riscv64: Mark some imm12 constructors as pure
* cranelift: Move `s32_add_fallible` next to `u32_add`
* riscv64: Fix Typo
* Change CLIF `shuffle` to validate lane indices
Previously the CLIF `shuffle` instruction did not perform any validation
on the lane shuffle mask and specified that out-of-bounds lanes always
returned 0 as the value. This behavior though is not required by
WebAssembly which validates that lane indices are always in-bounds.
Additionally since these are static immediates even other code
generators should be able to verify that the immediates are in-bounds.
As a result this commit updates the definition of the `shuffle`
instruction to specify that all byte immediates must be in-bounds in the
range of [0, 32). The verifier has been updated and some test cases have
been removed that were testing this functionality.
Closes#5989
* Only generate valid shuffle immediates in fuzzer
We've adopted this pattern in Cranelift's instruction definitions where
we let-bind some calls to `Operand::new` and then later use them in one
or more calls to `Inst::new`.
That pattern has two problems:
- It puts the type of each operand somewhere potentially far removed
from the instruction in which it's used.
- We let-bind the same name for many different operands, compounding the
first problem by making it harder to find _which_ definition is used.
So instead this commit removes all let-bindings for operand definitions
and constructs a new `Operand` every time.
Constructing an `Operand` at every use means we duplicate some
documentation strings, but not all that many of them as it turns out.
I've left the let-bound type-sets alone, so those are currently still
shared across many instructions. They have some of the same problems and
should be reviewed as well.
* Enable the native target by default in winch
Match cranelift-codegen's build script where if no architecture is
explicitly enabled then the host architecture is implicitly enabled.
* Refactor Cranelift's ISA builder to share more with Winch
This commit refactors the `Builder` type to have a type parameter
representing the finished ISA with Cranelift and Winch having their own
typedefs for `Builder` to represent their own builders. The intention is
to use this shared functionality to produce more shared code between the
two codegen backends.
* Moving compiler shared components to a separate crate
* Restore native flag inference in compiler building
This fixes an oversight from the previous commits to use
`cranelift-native` to infer flags for the native host when using default
settings with Wasmtime.
* Move `Compiler::page_size_align` into wasmtime-environ
The `cranelift-codegen` crate doesn't need this and winch wants the same
implementation, so shuffle it around so everyone has access to it.
* Fill out `Compiler::{flags, isa_flags}` for Winch
These are easy enough to plumb through with some shared code for
Wasmtime.
* Plumb the `is_branch_protection_enabled` flag for Winch
Just forwarding an isa-specific setting accessor.
* Moving executable creation to shared compiler crate
* Adding builder back in and removing from shared crate
* Refactoring the shared pieces for the `CompilerBuilder`
I decided to move a couple things around from Alex's initial changes.
Instead of having the shared builder do everything, I went back to
having each compiler have a distinct builder implementation. I
refactored most of the flag setting logic into a single shared location,
so we can still reduce the amount of code duplication.
With them being separate, we don't need to maintain things like
`LinkOpts` which Winch doesn't currently use. We also have an avenue to
error when certain flags are sent to Winch if we don't support them. I'm
hoping this will make things more maintainable as we build out Winch.
I'm still unsure about keeping everything shared in a single crate
(`cranelift_shared`). It's starting to feel like this crate is doing too
much, which makes it difficult to name. There does seem to be a need for
two distinct abstraction: creating the final executable and the handling
of shared/ISA flags when building the compiler. I could make them into
two separate crates, but there doesn't seem to be enough there yet to
justify it.
* Documentation updates, and renaming the finish method
* Adding back in a default temporarily to pass tests, and removing some unused imports
* Fixing winch tests with wrong method name
* Removing unused imports from codegen shared crate
* Apply documentation formatting updates
Co-authored-by: Saúl Cabrera <saulecabrera@gmail.com>
* Adding back in cranelift_native flag inferring
* Adding new shared crate to publish list
* Adding write feature to pass cargo check
---------
Co-authored-by: Alex Crichton <alex@alexcrichton.com>
Co-authored-by: Saúl Cabrera <saulecabrera@gmail.com>
* Remove the Cranelift `vselect` instruction
This instruction is documented as selecting lanes based on the "truthy"
value of the condition lane, but the current status of the
implementation of this instruction is:
* x64 - uses the high bit for `f32x4` and `f64x2` and otherwise uses the
high bit of each byte doing a byte-wise lane select rather than
whatever the controlling type is.
* AArch64 - this is the same as `bitselect` which is a bit-wise
selection rather than a lane-wise selection.
* s390x - this is the same as AArch64, a bit-wise selection rather than
lane-wise.
* interpreter - the interpreter implements the documented semantics of
selecting based on "truthy" values.
Coupled with the status of the implementation is the fact that this
instruction is not used by WebAssembly SIMD today either. The only use
of this instruction in Cranelift is the nan-canonicalization pass. By
moving nan-canonicalization to `bitselect`, since that has the desired
semantics, there's no longer any need for `vselect`.
Given this situation this commit subsqeuently removes `vselect` and all
usage of it throughout Cranelift.
Closes#5917
* Review comments
* Bring back vselect opts as bitselect opts
* Clean up vselect usage in the interpreter
* Move bitcast in nan canonicalization
* Add a comment about float optimization
* Initial support for the Relaxed SIMD proposal
This commit adds initial scaffolding and support for the Relaxed SIMD
proposal for WebAssembly. Codegen support is supported on the x64 and
AArch64 backends on this time.
The purpose of this commit is to get all the boilerplate out of the way
in terms of plumbing through a new feature, adding tests, etc. The tests
are copied from the upstream repository at this time while the
WebAssembly/testsuite repository hasn't been updated.
A summary of changes made in this commit are:
* Lowerings for all relaxed simd opcodes have been added, currently all
exhibiting deterministic behavior. This means that few lowerings are
optimal on the x86 backend, but on the AArch64 backend, for example,
all lowerings should be optimal.
* Support is added to codegen to, eventually, conditionally generate
different code based on input codegen flags. This is intended to
enable codegen to more efficient instructions on x86 by default, for
example, while still allowing embedders to force
architecture-independent semantics and behavior. One good example of
this is the `f32x4.relaxed_fmadd` instruction which when deterministic
forces the `fma` instruction, but otherwise if the backend doesn't
have support for `fma` then intermediate operations are performed
instead.
* Lowerings of `iadd_pairwise` for `i16x8` and `i32x4` were added to the
x86 backend as they're now exercised by the deterministic lowerings of
relaxed simd instructions.
* Sample codegen tests for added for x86 and aarch64 for some relaxed
simd instructions.
* Wasmtime embedder support for the relaxed-simd proposal and forcing
determinism have been added to `Config` and the CLI.
* Support has been added to the `*.wast` runtime execution for the
`(either ...)` matcher used in the relaxed-simd proposal.
* Tests for relaxed-simd are run both with a default `Engine` as well as
a "force deterministic" `Engine` to test both configurations.
* All tests from the upstream repository were copied into Wasmtime.
These tests should be deleted when WebAssembly/testsuite is updated.
* x64: Add x86-specific lowerings for relaxed simd
This commit builds on the prior commit and adds an array of `x86_*`
instructions to Cranelift which have semantics that match their
corresponding x86 equivalents. Translation for relaxed simd is then
additionally updated to conditionally generate different CLIF for
relaxed simd instructions depending on whether the target is x86 or not.
This means that for AArch64 no changes are made but for x86 most relaxed
instructions now lower to some x86-equivalent with slightly different
semantics than the "deterministic" lowering.
* Add libcall support for fma to Wasmtime
This will be required to implement the `f32x4.relaxed_madd` instruction
(and others) when an x86 host doesn't specify the `has_fma` feature.
* Ignore relaxed-simd tests on s390x and riscv64
* Enable relaxed-simd tests on s390x
* Update cranelift/codegen/meta/src/shared/instructions.rs
Co-authored-by: Andrew Brown <andrew.brown@intel.com>
* Add a FIXME from review
* Add notes about deterministic semantics
* Don't default `has_native_fma` to `true`
* Review comments and rebase fixes
---------
Co-authored-by: Andrew Brown <andrew.brown@intel.com>
* Revert "egraphs: disable GVN of effectful idempotent ops (temporarily). (#5808)"
This reverts commit c7e2571866.
* egraphs: fix handling of effectful-but-idempotent ops and GVN.
This PR addresses #5796: currently, ops that are effectful, i.e., remain
in the side-effecting skeleton (which we keep in the `Layout` while the
egraph exists), but are idempotent and thus mergeable by a GVN pass, are
not handled properly.
GVN is still possible on effectful but idempotent ops precisely because
our GVN does not create partial redundancies: it removes an instruction
only when it is dominated by an identical instruction. An isntruction
will not be "hoisted" to a point where it could execute in the optimized
code but not in the original.
However, there are really two parts to the egraph implementation that
produce this effect: the deduplication on insertion into the egraph, and
the elaboration with a scoped hashmap. The deduplication lets us give a
single name (value ID) to all copies of an identical instruction, and
then elaboration will re-create duplicates if GVN should not hoist or
merge some of them.
Because deduplication need not worry about dominance or scopes, we use a
simple (non-scoped) hashmap to dedup/intern ops as "egraph nodes".
When we added support for GVN'ing effectful but idempotent ops (#5594),
we kept the use of this simple dedup'ing hashmap, but these ops do not
get elaborated; instead they stay in the side-effecting skeleton. Thus,
we inadvertently created potential for weird code-motion effects.
The proposal in #5796 would solve this in a clean way by treating these
ops as pure again, and keeping them out of the skeleton, instead putting
"force" pseudo-ops in the skeleton. However, this is a little more
complex than I would like, and I've realized that @jameysharp's earlier
suggestion is much simpler: we can keep an actual scoped hashmap
separately just for the effectful-but-idempotent ops, and use it to GVN
while we build the egraph. In effect, we're fusing a separate GVN pass
with the egraph pass (but letting it interact corecursively with
egraph rewrites. This is in principle similar to how we keep a separate
map for loads and fuse this pass with the egraph rewrite pass as well.
Note that we can use a `ScopedHashMap` here without the "context" (as
needed by `CtxHashMap`) because, as noted by @jameysharp, in practice
the ops we want to GVN have all their args inline. Equality on the
`InstructinoData` itself is conservative: two insts whose struct
contents compare shallowly equal are definitely identical, but identical
insts in a deep-equality sense may not compare shallowly equal, due to
list indirection. This is fine for GVN, because it is still sound to
skip any given GVN opportunity (and keep the original instructions).
Fixes#5796.
* Add comments from review.
This instruction is only defined with i8x16 inputs and outputs so
there's no need for a type variable, so shadow the otherwise-generic `a`
result with a concrete i8x16 type.
This was added for the wasm SIMD proposal but I've been poking around at
this recently and the instruction can instead be represented by its
component parts with the same semantics I believe. This commit removes
the instruction and instead represents it with the existing
`iadd_pairwise` instruction (among others) and updates backends to with
new pattern matches to have the same codegen as before.
This interestingly entirely removed the codegen rule with no replacement
on the AArch64 backend as the existing rules all existed to produce the
same codegen.
For instructions with no results (such as branches and stores) or
instructions with multiple results (such as add with carry), we have
assertions checking that an optimization rule doesn't try to match on
or construct such instructions.
When we generate terms for matching or constructing instructions, the
terms for these instructions are guaranteed to panic if they're ever
used. So let's just not generate them.
In the future we may wish to generate terms with different types for
these instructions, to make them usable in ISLE rules for optimization
that fall outside our current egraph constraints.
Rework br_table to use BlockCall, allowing us to avoid adding new nodes during ssa construction to hold block arguments. Additionally, many places where we previously matched on InstructionData to extract branch destinations can be replaced with a use of branch_destination or branch_destination_mut.
When investigating #5716, I found that rematerialization of a `call`, in
addition to blowing up for other reasons, caused aliasing of the varargs
list (the `EntityList` in the `ListPool`), such that editing the args of
the second copy of the call instruction inadvertently updated the first
as well.
This PR modifies `DataFlowGraph::clone_inst` so that it always clones
the varargs list if present. This shouldn't have any functional impact
on Cranelift today, because we don't rematerialize any instructions with
varargs; but it's important to get it right to avoid a bug later!
* Cranelift: Introduce the `tail` calling convention
This is an unstable-ABI calling convention that we will eventually use to
support Wasm tail calls.
Co-Authored-By: Jamey Sharp <jsharp@fastly.com>
* Cranelift: Introduce the `return_call` and `return_call_indirect` instructions
These will be used to implement tail calls for Wasm and any other language
targeting CLIF. The `return_call_indirect` instruction differs from the Wasm
instruction of the same name by taking a native address callee rather than a
Wasm function index.
Co-Authored-By: Jamey Sharp <jsharp@fastly.com>
* Cranelift: Implement verification rules for `return_call[_indirect]`
They must:
* have the same return types between the caller and callee,
* have the same calling convention between caller and callee,
* and that calling convention must support tail calls.
Co-Authored-By: Jamey Sharp <jsharp@fastly.com>
* cargo fmt
---------
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
Remove the boolean parameters from the instruction builder functions, as they were only ever used with true. Additionally, change the returns and branches functions to imply terminates_block.
Add a conditional branch instruction with two targets: brif. This instruction will eventually replace brz and brnz, as it encompasses the behavior of both.
This PR also changes the InstructionData layout for instruction formats that hold BlockCall values, taking the same approach we use for Value arguments. This allows branch_destination to return a slice to the BlockCall values held in the instruction, rather than requiring that we pattern match on InstructionData to fetch the then/else blocks.
Function generation for fuzzing has been updated to generate uses of brif, and I've run the cranelift-fuzzgen target locally for hours without triggering any new failures.
This PR follows up on #5382 and #5391, which rebuilt the egraph-based optimization framework to be more performant, by enabling it by default.
Based on performance results in #5382 (my measurements on SpiderMonkey and bjorn3's independent confirmation with cg_clif), it seems that this is reasonable to enable. Now that we have been fuzzing compiler configurations with egraph opts (#5388) for 6 weeks, having fixed a few fuzzbugs that came up (#5409, #5420, #5438) and subsequently received no further reports from OSS-Fuzz, I believe it is stable enough to rely on.
This PR enables `use_egraphs`, and also normalizes its meaning: previously it forced optimization (it basically meant "turn on the egraph optimization machinery"), now it runs egraph opts if the opt level indicates (it means "use egraphs to optimize if we are going to optimize"). The conditionals in the top-level pass driver are a little subtle, but will get simpler once we can remove the non-egraph path (which we plan to do eventually!).
Fixes#5181.
Add a new type BlockCall that represents the pair of a block name with arguments to be passed to it. (The mnemonic here is that it looks a bit like a function call.) Rework the implementation of jump, brz, and brnz to use BlockCall instead of storing the block arguments as varargs in the instruction's ValueList.
To ensure that we're processing block arguments from BlockCall values in instructions, three new functions have been introduced on DataFlowGraph that both sets of arguments:
inst_values - returns an iterator that traverses values in the instruction and block arguments
map_inst_values - applies a function to each value in the instruction and block arguments
overwrite_inst_values - overwrite all values in an instruction and block arguments with values from the iterator
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Switch duplicate loads w/ dynamic memories test to `min_size = 0`
This test was accidentally hitting a special case for bounds checks for when we
know that `offset + access_size < min_size` and can skip some steps. This
commit changes the `min_size` of the memory to zero so that we are forced to do
fully general bounds checks.
* Cranelift: Mark `uadd_overflow_trap` as okay for GVN
Although this improves our test sequence for duplicate loads with dynamic
memories, it unfortunately doesn't have any effect on sightglass benchmarks:
```
instantiation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[34448 35607.23 37158] gvn_uadd_overflow_trap.so
[34566 35734.05 36585] main.so
instantiation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[44101 60449.62 92712] gvn_uadd_overflow_trap.so
[44011 60436.37 92690] main.so
instantiation :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[35595 36675.72 38153] gvn_uadd_overflow_trap.so
[35440 36670.42 37993] main.so
compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[17370195 17405125.62 17471222] gvn_uadd_overflow_trap.so
[17369324 17404859.43 17470725] main.so
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[7055720520 7055886880.32 7056265930] gvn_uadd_overflow_trap.so
[7055719554 7055843809.33 7056193289] main.so
compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[683589861 683767276.00 684098366] gvn_uadd_overflow_trap.so
[683590024 683767998.02 684097885] main.so
execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[46436883 46437135.10 46437823] gvn_uadd_overflow_trap.so
[46436883 46437087.67 46437785] main.so
compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[126522461 126565812.58 126647044] gvn_uadd_overflow_trap.so
[126522176 126565757.75 126647522] main.so
execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[653010531 653010533.03 653010544] gvn_uadd_overflow_trap.so
[653010531 653010533.18 653010537] main.so
```
* cranelift-codegen-meta: Rename `side_effects_okay_for_gvn` to `side_effects_idempotent`
* cranelift-filetests: Ensure there is a trailing newline for blessed Wasm tests
* Cranelift: Make spectre guards GVN-able
While these instructions have a side effect that is otherwise invisible to the
optimizer, the side effect in question is idempotent, so it can be de-duplicated
by GVN.
* Cranelift: Run redundant load replacement and GVN twice
This allows us to actually replace redundant Wasm loads with dynamic memories.
While this improves our hand-crafted test sequences, it doesn't seem to have any
improvement on sightglass benchmarks run with dynamic memories, however it also
isn't a hit to compilation times, so seems generally good to land anyways:
```
$ cargo run --release -- benchmark -e ~/scratch/once.so -e ~/scratch/twice.so -m insts-retired --processes 20 --iterations-per-process 3 --engine-flags="--static-memory-maximum-size 0" -- benchmarks/default.suite
compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[683595240 683768610.53 684097577] once.so
[683597068 700115966.83 1664907164] twice.so
instantiation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[44107 60411.07 92785] once.so
[44138 59552.32 92097] twice.so
compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[17369916 17404839.78 17471458] once.so
[17369935 17625713.87 30700150] twice.so
compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[126523640 126566170.80 126648265] once.so
[126523076 127174580.30 163145149] twice.so
instantiation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[34569 35686.25 36513] once.so
[34651 35749.97 36953] twice.so
instantiation :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[35146 36639.10 37707] once.so
[34472 36580.82 38431] twice.so
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[7055720115 7055841324.82 7056180024] once.so
[7055717681 7055877095.85 7056225217] twice.so
execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[46436881 46437081.28 46437691] once.so
[46436883 46437127.68 46437766] twice.so
execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[653010530 653010533.27 653010539] once.so
[653010531 653010532.95 653010538] twice.so
```
* cranelift: Add `iabs.i128` runtest
* riscv64: Fix incorrect extension in iabs
When lowering iabs, we were accidentally comparing the unextended value
this caused the instruction to misbehave with certain top bits.
This commit also adds a zbb lowering that does not use jumps.
We have some operations defined on DataFlowGraph purely to work around borrow-checker issues with InstructionData and other data on DataFlowGraph. Part of the problem is that indexing the DFG directly hides the fact that we're only indexing the insts field of the DFG.
This PR makes the insts field of the DFG public, but wraps it in a newtype that only allows indexing. This means that the borrow checker is better able to tell when operations on memory held by the DFG won't conflict, which comes up frequently when mutating ValueLists held by InstructionData.
* cranelift-wasm: translate Wasm loads into lower-level CLIF operations
Rather than using `heap_{load,store,addr}`.
* cranelift: Remove the `heap_{addr,load,store}` instructions
These are now legalized in the `cranelift-wasm` frontend.
* cranelift: Remove the `ir::Heap` entity from CLIF
* Port basic memory operation tests to .wat filetests
* Remove test for verifying CLIF heaps
* Remove `heap_addr` from replace_branching_instructions_and_cfg_predecessors.clif test
* Remove `heap_addr` from readonly.clif test
* Remove `heap_addr` from `table_addr.clif` test
* Remove `heap_addr` from the simd-fvpromote_low.clif test
* Remove `heap_addr` from simd-fvdemote.clif test
* Remove `heap_addr` from the load-op-store.clif test
* Remove the CLIF heap runtest
* Remove `heap_addr` from the global_value.clif test
* Remove `heap_addr` from fpromote.clif runtests
* Remove `heap_addr` from fdemote.clif runtests
* Remove `heap_addr` from memory.clif parser test
* Remove `heap_addr` from reject_load_readonly.clif test
* Remove `heap_addr` from reject_load_notrap.clif test
* Remove `heap_addr` from load_readonly_notrap.clif test
* Remove `static-heap-without-guard-pages.clif` test
Will be subsumed when we port `make-heap-load-store-tests.sh` to generating
`.wat` tests.
* Remove `static-heap-with-guard-pages.clif` test
Will be subsumed when we port `make-heap-load-store-tests.sh` over to `.wat`
tests.
* Remove more heap tests
These will be subsumed by porting `make-heap-load-store-tests.sh` over to `.wat`
tests.
* Remove `heap_addr` from `simple-alias.clif` test
* Remove `heap_addr` from partial-redundancy.clif test
* Remove `heap_addr` from multiple-blocks.clif test
* Remove `heap_addr` from fence.clif test
* Remove `heap_addr` from extends.clif test
* Remove runtests that rely on heaps
Heaps are not a thing in CLIF or the interpreter anymore
* Add generated load/store `.wat` tests
* Enable memory-related wasm features in `.wat` tests
* Remove CLIF heap from fcmp-mem-bug.clif test
* Add a mode for compiling `.wat` all the way to assembly in filetests
* Also generate WAT to assembly tests in `make-load-store-tests.sh`
* cargo fmt
* Reinstate `f{de,pro}mote.clif` tests without the heap bits
* Remove undefined doc link
* Remove outdated SVG and dot file from docs
* Add docs about `None` returns for base address computation helpers
* Factor out `env.heap_access_spectre_mitigation()` to a local
* Expand docs for `FuncEnvironment::heaps` trait method
* Restore f{de,pro}mote+load clif runtests with stack memory
All instructions using the CPU flags types (IFLAGS/FFLAGS) were already
removed. This patch completes the cleanup by removing all remaining
instructions that define values of CPU flags types, as well as the
types themselves.
Specifically, the following features are removed:
- The IFLAGS and FFLAGS types and the SpecialType category.
- Special handling of IFLAGS and FFLAGS in machinst/isle.rs and
machinst/lower.rs.
- The ifcmp, ifcmp_imm, ffcmp, iadd_ifcin, iadd_ifcout, iadd_ifcarry,
isub_ifbin, isub_ifbout, and isub_ifborrow instructions.
- The writes_cpu_flags instruction property.
- The flags verifier pass.
- Flags handling in the interpreter.
All of these features are currently unused; no functional change
intended by this patch.
This addresses https://github.com/bytecodealliance/wasmtime/issues/3249.
* egraph support: rewrite to work in terms of CLIF data structures.
This work rewrites the "egraph"-based optimization framework in
Cranelift to operate on aegraphs (acyclic egraphs) represented in the
CLIF itself rather than as a separate data structure to which and from
which we translate the CLIF.
The basic idea is to add a new kind of value, a "union", that is like an
alias but refers to two other values rather than one. This allows us to
represent an eclass of enodes (values) as a tree. The union node allows
for a value to have *multiple representations*: either constituent value
could be used, and (in well-formed CLIF produced by correct
optimization rules) they must be equivalent.
Like the old egraph infrastructure, we take advantage of acyclicity and
eager rule application to do optimization in a single pass. Like before,
we integrate GVN (during the optimization pass) and LICM (during
elaboration).
Unlike the old egraph infrastructure, everything stays in the
DataFlowGraph. "Pure" enodes are represented as instructions that have
values attached, but that are not placed into the function layout. When
entering "egraph" form, we remove them from the layout while optimizing.
When leaving "egraph" form, during elaboration, we can place an
instruction back into the layout the first time we elaborate the enode;
if we elaborate it more than once, we clone the instruction.
The implementation performs two passes overall:
- One, a forward pass in RPO (to see defs before uses), that (i) removes
"pure" instructions from the layout and (ii) optimizes as it goes. As
before, we eagerly optimize, so we form the entire union of optimized
forms of a value before we see any uses of that value. This lets us
rewrite uses to use the most "up-to-date" form of the value and
canonicalize and optimize that form.
The eager rewriting and acyclic representation make each other work
(we could not eagerly rewrite if there were cycles; and acyclicity
does not miss optimization opportunities only because the first time
we introduce a value, we immediately produce its "best" form). This
design choice is also what allows us to avoid the "parent pointers"
and fixpoint loop of traditional egraphs.
This forward optimization pass keeps a scoped hashmap to "intern"
nodes (thus performing GVN), and also interleaves on a per-instruction
level with alias analysis. The interleaving with alias analysis allows
alias analysis to see the most optimized form of each address (so it
can see equivalences), and allows the next value to see any
equivalences (reuses of loads or stored values) that alias analysis
uncovers.
- Two, a forward pass in domtree preorder, that "elaborates" pure enodes
back into the layout, possibly in multiple places if needed. This
tracks the loop nest and hoists nodes as needed, performing LICM as it
goes. Note that by doing this in forward order, we avoid the
"fixpoint" that traditional LICM needs: we hoist a def before its
uses, so when we place a node, we place it in the right place the
first time rather than moving later.
This PR replaces the old (a)egraph implementation. It removes both the
cranelift-egraph crate and the logic in cranelift-codegen that uses it.
On `spidermonkey.wasm` running a simple recursive Fibonacci
microbenchmark, this work shows 5.5% compile-time reduction and 7.7%
runtime improvement (speedup).
Most of this implementation was done in (very productive) pair
programming sessions with Jamey Sharp, thus:
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
* Review feedback.
* Review feedback.
* Review feedback.
* Bugfix: cprop rule: `(x + k1) - k2` becomes `x - (k2 - k1)`, not `x - (k1 - k2)`.
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
* Turn off probestack by default in Cranelift
The probestack feature is not implemented for the aarch64 and s390x
backends and currently the on-by-default status requires the aarch64 and
s390x implementations to be a stub. Turning off probestack by default
allows the s390x and aarch64 backends to panic with an error message to
avoid providing a false sense of security. When the probestack option is
implemented for all backends, however, it may be reasonable to
re-enable.
* aarch64: Improve codegen for AMode fallback
Currently the final fallback for finalizing an `AMode` will generate
both a constant-loading instruction as well as an `add` instruction to
the base register into the same temporary. This commit improves the
codegen by removing the `add` instruction and folding the final add into
the finalized `AMode`. This changes the `extendop` used but both
registers are 64-bit so shouldn't be affected by the extending
operation.
* aarch64: Implement inline stack probes
This commit implements inline stack probes for the aarch64 backend in
Cranelift. The support here is modeled after the x64 support where
unrolled probes are used up to a particular threshold after which a loop
is generated. The instructions here are similar in spirit to x64 except
that unlike x64 the stack pointer isn't modified during the unrolled
loop to avoid needing to re-adjust it back up at the end of the loop.
* Enable inline probestack for AArch64 and Riscv64
This commit enables inline probestacks for the AArch64 and Riscv64
architectures in the same manner that x86_64 has it enabled now. Some
more testing was additionally added since on Unix platforms we should be
guaranteed that Rust's stack overflow message is now printed too.
* Enable probestack for aarch64 in cranelift-fuzzgen
* Address review comments
* Remove implicit stack overflow traps from x64 backend
This commit removes implicit `StackOverflow` traps inserted by the x64
backend for stack-based operations. This was historically required when
stack overflow was detected with page faults but Wasmtime no longer
requires that since it's not suitable for wasm modules which call host
functions. Additionally no other backend implements this form of
implicit trap-code additions so this is intended to synchronize the
behavior of all the backends.
This fixes a test added prior for aarch64 to properly abort the process
instead of accidentally being caught by Wasmtime.
* Fix a style issue
* Cranelift: Define `heap_load` and `heap_store` instructions
* Cranelift: Implement interpreter support for `heap_load` and `heap_store`
* Cranelift: Add a suite runtests for `heap_{load,store}`
There are so many knobs we can twist for heaps and I wanted to exhaustively test
all of them, so I wrote a script to generate the tests. I've checked in the
script in case we want to make any changes in the future, but I don't think it
is worth adding this to CI to check that scripts are up to date or anything like
that.
* Review feedback