Most of these optimizations are in the egraph `cprop.isle` rules now,
making a separate crate unnecessary.
Also I think the `udiv` optimizations here are straight-up wrong (doing
signed instead of unsigned division, and panicking instead of preserving
traps on division by zero) so I'm guessing this crate isn't seriously
used anywhere.
At the least, bjorn3 confirms that cg_clif doesn't use this, and I've
verified that Wasmtime doesn't either.
Closes#1090.
Improve the generated code for unordered floating point comparisons by negating the comparison and inveritng the branches. This allows us to pick the unordered versions, which generate significantly better code.
Add a conditional branch instruction with two targets: brif. This instruction will eventually replace brz and brnz, as it encompasses the behavior of both.
This PR also changes the InstructionData layout for instruction formats that hold BlockCall values, taking the same approach we use for Value arguments. This allows branch_destination to return a slice to the BlockCall values held in the instruction, rather than requiring that we pattern match on InstructionData to fetch the then/else blocks.
Function generation for fuzzing has been updated to generate uses of brif, and I've run the cranelift-fuzzgen target locally for hours without triggering any new failures.
Ideally these pairs of CLIF instructions should emit a single x86
instruction, but they don't today. This test will tell us if somebody
fixes that.
Similar tests might make sense for imul/umulhi as well as signed
versions, but I haven't tried that.
This PR follows up on #5382 and #5391, which rebuilt the egraph-based optimization framework to be more performant, by enabling it by default.
Based on performance results in #5382 (my measurements on SpiderMonkey and bjorn3's independent confirmation with cg_clif), it seems that this is reasonable to enable. Now that we have been fuzzing compiler configurations with egraph opts (#5388) for 6 weeks, having fixed a few fuzzbugs that came up (#5409, #5420, #5438) and subsequently received no further reports from OSS-Fuzz, I believe it is stable enough to rely on.
This PR enables `use_egraphs`, and also normalizes its meaning: previously it forced optimization (it basically meant "turn on the egraph optimization machinery"), now it runs egraph opts if the opt level indicates (it means "use egraphs to optimize if we are going to optimize"). The conditionals in the top-level pass driver are a little subtle, but will get simpler once we can remove the non-egraph path (which we plan to do eventually!).
Fixes#5181.
* Support mergeable-but-side-effectful (idempotent) operations in general in the egraph's GVN.
This mirrors the similar change made in #5534.
* Add tests for egraph case.
* Adding in the foundations for Winch `filetests`
This commit adds two new crates into the Winch workspace:
`filetests` and `test-macros`. The intent is to mimic the
structure of Cranelift `filetests`, but in a simpler way.
* Updates to documentation
This commits adds a high level document to outline how to test Winch
through the `winch-tools` utility. It also updates some inline
documentation which gets propagated to the CLI.
* Updating test-macro to use a glob instead of only a flat directory
* add clif-util compile option to output object file
* switch from a box to a borrow
* update objectmodule tests to use borrowed isa
* put targetisa into an arc
* Switch duplicate loads w/ dynamic memories test to `min_size = 0`
This test was accidentally hitting a special case for bounds checks for when we
know that `offset + access_size < min_size` and can skip some steps. This
commit changes the `min_size` of the memory to zero so that we are forced to do
fully general bounds checks.
* Cranelift: Mark `uadd_overflow_trap` as okay for GVN
Although this improves our test sequence for duplicate loads with dynamic
memories, it unfortunately doesn't have any effect on sightglass benchmarks:
```
instantiation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[34448 35607.23 37158] gvn_uadd_overflow_trap.so
[34566 35734.05 36585] main.so
instantiation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[44101 60449.62 92712] gvn_uadd_overflow_trap.so
[44011 60436.37 92690] main.so
instantiation :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[35595 36675.72 38153] gvn_uadd_overflow_trap.so
[35440 36670.42 37993] main.so
compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[17370195 17405125.62 17471222] gvn_uadd_overflow_trap.so
[17369324 17404859.43 17470725] main.so
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[7055720520 7055886880.32 7056265930] gvn_uadd_overflow_trap.so
[7055719554 7055843809.33 7056193289] main.so
compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[683589861 683767276.00 684098366] gvn_uadd_overflow_trap.so
[683590024 683767998.02 684097885] main.so
execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[46436883 46437135.10 46437823] gvn_uadd_overflow_trap.so
[46436883 46437087.67 46437785] main.so
compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[126522461 126565812.58 126647044] gvn_uadd_overflow_trap.so
[126522176 126565757.75 126647522] main.so
execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[653010531 653010533.03 653010544] gvn_uadd_overflow_trap.so
[653010531 653010533.18 653010537] main.so
```
* cranelift-codegen-meta: Rename `side_effects_okay_for_gvn` to `side_effects_idempotent`
* cranelift-filetests: Ensure there is a trailing newline for blessed Wasm tests
* Cranelift: Make spectre guards GVN-able
While these instructions have a side effect that is otherwise invisible to the
optimizer, the side effect in question is idempotent, so it can be de-duplicated
by GVN.
* Cranelift: Run redundant load replacement and GVN twice
This allows us to actually replace redundant Wasm loads with dynamic memories.
While this improves our hand-crafted test sequences, it doesn't seem to have any
improvement on sightglass benchmarks run with dynamic memories, however it also
isn't a hit to compilation times, so seems generally good to land anyways:
```
$ cargo run --release -- benchmark -e ~/scratch/once.so -e ~/scratch/twice.so -m insts-retired --processes 20 --iterations-per-process 3 --engine-flags="--static-memory-maximum-size 0" -- benchmarks/default.suite
compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[683595240 683768610.53 684097577] once.so
[683597068 700115966.83 1664907164] twice.so
instantiation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[44107 60411.07 92785] once.so
[44138 59552.32 92097] twice.so
compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[17369916 17404839.78 17471458] once.so
[17369935 17625713.87 30700150] twice.so
compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[126523640 126566170.80 126648265] once.so
[126523076 127174580.30 163145149] twice.so
instantiation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[34569 35686.25 36513] once.so
[34651 35749.97 36953] twice.so
instantiation :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[35146 36639.10 37707] once.so
[34472 36580.82 38431] twice.so
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[7055720115 7055841324.82 7056180024] once.so
[7055717681 7055877095.85 7056225217] twice.so
execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[46436881 46437081.28 46437691] once.so
[46436883 46437127.68 46437766] twice.so
execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[653010530 653010533.27 653010539] once.so
[653010531 653010532.95 653010538] twice.so
```
Rework the compilation of fcmp in the riscv64 backend to be in ISLE, removing the need for the dedicated Fcmp instruction. This change is motivated by #5500, which showed that the riscv64 backend was generating branch instructions in the middle of a basic block.
We can't remove lower_br_fcmp quite yet as it's used in a few places in the emit module, but it's now no longer reachable from the ISLE lowerings.
Fixes#5500
* cranelift: Add `iabs.i128` runtest
* riscv64: Fix incorrect extension in iabs
When lowering iabs, we were accidentally comparing the unextended value
this caused the instruction to misbehave with certain top bits.
This commit also adds a zbb lowering that does not use jumps.
When lowering `select+icmp` we have an optimization that allows us to
avoid materializing the icmp result.
We were accidentally not masking the high bits for i8 and i16 in this case.
Issue #5498 reported this as an illegal instruction but what was happening
there was that the invalid select caused a division by zero.
We had a off-by-one bounds check error when checking if we should
jump to the default block in a br-table. Instead of always jumping
to the default block when we have a jump table with 0 targets we
would try to compute an offset past the end of the table.
This sometimes would not crash, but it would crash if the there was
no block after the br_table, thus adding a cold block would cause a
segfault.
The actual fix is quite simple, do not count the default block
as a jump table entry when computing the limits.
This commit also does a bunch of cleanup and adding some comments
to the br_table emission code.
In #5031, we removed `bool` types from CLIF, using integers instead for
"truthy" values. This greatly simplified the IR, and was generally an
improvement.
However, because x86's `SETcc` instruction sets only the low 8 bits of a
register, we chose to use `i8` types as the result of `icmp` and `fcmp`,
to avoid the need for a masking operation when materializing the result.
Unfortunately this means that uses of truthy values often now have
`uextend` operations, especially when coming from Wasm (where truthy
values are naturally `i32`-typed). For example, where we previously had
`(brz (icmp ...))`, we now have `(brz (uextend (icmp ...)))`.
It's arguable whether or not we should switch to `i32` truthy values --
in most cases we can avoid materializing a value that's immediately used
for a branch or select, so a mask would in most cases be unnecessary,
and it would be a win at the IR level -- but irrespective of that, this
change *did* regress our generated code quality: our backends had
patterns for e.g. `(brz (icmp ...))` but not with the `uextend`, so we
were *always* materializing truthy values. Many blocks thus ended with
"cmp; setcc; cmp; test; branch" rather than "cmp; branch".
In #5391 we noticed this and fixed it on x64, but it was a general
problem on aarch64 and riscv64 as well. This PR introduces a
`maybe_uextend` extractor that "looks through" uextends, and uses it
where we consume truthy values, thus fixing the regression. This PR
also adds compile filetests to ensure we don't regress again.
The riscv64 backend has not been updated here because doing so appears
to trigger another issue in its branch handling; fixing that is TBD.
Fixes#5199.
Fixes#5200.
Fixes#5452.
Fixes#5453.
On riscv64, there is apparently an autoconversion from `ValueRegs` to
`Reg` that takes just the low register [0], and removing this conversion
causes 48 errors. As a result of this, `select` with an `i128` condition
was silently miscompiling, testing only the low 64 bits. We should
remove this autoconversion to ensure we aren't missing any other silent
truncations, but for now this PR just adds the explicit `I128` logic for
`select` / `select_spectre_guard`.
[0]
d9fdbfd50e/cranelift/codegen/src/isa/riscv64/inst.isle (L1762)
* cranelift-wasm: translate Wasm loads into lower-level CLIF operations
Rather than using `heap_{load,store,addr}`.
* cranelift: Remove the `heap_{addr,load,store}` instructions
These are now legalized in the `cranelift-wasm` frontend.
* cranelift: Remove the `ir::Heap` entity from CLIF
* Port basic memory operation tests to .wat filetests
* Remove test for verifying CLIF heaps
* Remove `heap_addr` from replace_branching_instructions_and_cfg_predecessors.clif test
* Remove `heap_addr` from readonly.clif test
* Remove `heap_addr` from `table_addr.clif` test
* Remove `heap_addr` from the simd-fvpromote_low.clif test
* Remove `heap_addr` from simd-fvdemote.clif test
* Remove `heap_addr` from the load-op-store.clif test
* Remove the CLIF heap runtest
* Remove `heap_addr` from the global_value.clif test
* Remove `heap_addr` from fpromote.clif runtests
* Remove `heap_addr` from fdemote.clif runtests
* Remove `heap_addr` from memory.clif parser test
* Remove `heap_addr` from reject_load_readonly.clif test
* Remove `heap_addr` from reject_load_notrap.clif test
* Remove `heap_addr` from load_readonly_notrap.clif test
* Remove `static-heap-without-guard-pages.clif` test
Will be subsumed when we port `make-heap-load-store-tests.sh` to generating
`.wat` tests.
* Remove `static-heap-with-guard-pages.clif` test
Will be subsumed when we port `make-heap-load-store-tests.sh` over to `.wat`
tests.
* Remove more heap tests
These will be subsumed by porting `make-heap-load-store-tests.sh` over to `.wat`
tests.
* Remove `heap_addr` from `simple-alias.clif` test
* Remove `heap_addr` from partial-redundancy.clif test
* Remove `heap_addr` from multiple-blocks.clif test
* Remove `heap_addr` from fence.clif test
* Remove `heap_addr` from extends.clif test
* Remove runtests that rely on heaps
Heaps are not a thing in CLIF or the interpreter anymore
* Add generated load/store `.wat` tests
* Enable memory-related wasm features in `.wat` tests
* Remove CLIF heap from fcmp-mem-bug.clif test
* Add a mode for compiling `.wat` all the way to assembly in filetests
* Also generate WAT to assembly tests in `make-load-store-tests.sh`
* cargo fmt
* Reinstate `f{de,pro}mote.clif` tests without the heap bits
* Remove undefined doc link
* Remove outdated SVG and dot file from docs
* Add docs about `None` returns for base address computation helpers
* Factor out `env.heap_access_spectre_mitigation()` to a local
* Expand docs for `FuncEnvironment::heaps` trait method
* Restore f{de,pro}mote+load clif runtests with stack memory
* cranelift-filetest: Add the ability to test `.wat` to assembly
* Make the load/store test case generator script use `.wat` tests
And generate tests that exercise both Wasm-to-CLIF lowering and Wasm all the way
to assembly.
* Remove old versions of generated load/store tests
* Add new generated load/store tests
* Fix filename reference in script
The casloop_emit function in the s390x backend was using the fixed non-allocatable register %r0 directly with move instructions, which produced a panic in the regalloc2 checker (#5425). This PR changes the casloop_result function to use mov_preg instead of copy_reg to fetch the result, as it's not viewed by regalloc2 as a move.
Fixes#5425
* aarch64: constant generation cleanup
Add support for MOVZ and MOVN generation via ISLE.
Handle f32const, f64const, and nop instructions via ISLE.
No longer call Inst::gen_constant from lower.rs.
* riscv64: constant generation cleanup
Handle f32const, f64const, and nop instructions via ISLE.
* s390x: constant generation cleanup
Fix rule priorities for "imm" term.
Only handle 32-bit stack offsets; no longer use load_constant64.
* x64: constant generation cleanup
No longer call Inst::gen_constant from lower.rs or abi.rs.
* Refactor LowerBackend::lower to return InstOutput
No longer write to the per-insn output registers; instead, return
an InstOutput vector of temp registers holding the outputs.
This will allow calling LowerBackend::lower multiple times for
the same instruction, e.g. to rematerialize constants.
When emitting the primary copy of the instruction during lowering,
writing to the per-insn registers is now done in lower_clif_block.
As a result, the ISLE lower_common routine is no longer needed.
In addition, the InsnOutput type and all code related to it
can be removed as well.
* Refactor IsleContext to hold a LowerBackend reference
Remove the "triple", "flags", and "isa_flags" fields that are
copied from LowerBackend to each IsleContext, and instead just
hold a reference to LowerBackend in IsleContext.
This will allow calling LowerBackend::lower from within callbacks
in src/machinst/isle.rs, e.g. to rematerialize constants.
To avoid having to pass LowerBackend references through multiple
functions, eliminate the lower_insn_to_regs subroutines in those
targets that still have them, and just inline into the main
lower routine. This also eliminates lower_inst.rs on aarch64
and riscv64.
Replace all accesses to the removed IsleContext fields by going
through the LowerBackend reference.
* Remove MachInst::gen_constant
This addresses the problem described in issue
https://github.com/bytecodealliance/wasmtime/issues/4426
that targets currently have to duplicate code to emit
constants between the ISLE logic and the gen_constant
callback.
After the various cleanups in earlier patches in this series,
the only remaining user of get_constant is put_value_in_regs
in Lower. This can now be removed, and instead constant
rematerialization can be performed in the put_in_regs ISLE
callback by simply directly calling LowerBackend::lower
on the instruction defining the constant (using a different
output register).
Since the check for egraph mode is now no longer performed in
put_value_in_regs, the Lower::flags member becomes obsolete.
Care needs to be taken that other calls directly to the
Lower::put_value_in_regs routine now handle the fact that
no more rematerialization is performed. All such calls in
target code already historically handle constants themselves.
The remaining call site in the ISLE gen_call_common helper
can be redirected to the ISLE put_in_regs callback.
The existing target implementations of gen_constant are then
unused and can be removed. (In some target there may still
be further opportunities to remove duplication between ISLE
and some local Rust code - this can be left to future patches.)
This extractor had a side-effect of invoking `put_in_regs`, which is not
supposed to be invoked until the pattern-matching commits to evaluating
a rule right-hand side (i.e., cannot backtrack). In this case the
side-effect was mostly benign (in theory it could have caused additional
values to be computed needlessly), but in general we should be careful
to keep side-effects out of the left-hand side to enable further
optimizations and work on islec.
The implicit conversion from `Value` to `Reg` turns out to be enough to
make the rules in question work, so we can simply remove the use of the
extractor in this case.
Share a zero value in the translation of ushr for i128. This increases the lifetime of the value by a few instructions, and reduces the number of registers used in the translation by one, which seems like an acceptable trade-off.
When adding some optimization rules for `icmp` in the egraph
infrastructure, we ended up creating a path to legal CLIF but with
patterns unsupported by three of our four backends: specifically,
`select_spectre_guard` with a general truthy input, rather than an
`icmp`.
In #5206 we discussed replacing `select_spectre_guard` with something
more specific, and that could still be a long-term solution here, but
doing so now would interfere with ongoing refactoring of heap access
lowering, so I've opted not to do so. (In that issue I was concerned
about complexity and didn't see the need but with this fuzzbug I'm
starting to feel a bit differently; maybe we should remove this
non-orthogonal op in the long run.)
Fixes#5417.
This adds support for `.wat` tests in `cranelift-filetest`. The test runner
translates the WAT to Wasm and then uses `cranelift-wasm` to translate the Wasm
to CLIF.
These tests are always precise output tests. The test expectations can be
updated by running tests with the `CRANELIFT_TEST_BLESS=1` environment variable
set, similar to our compile precise output tests. The test's expected output is
contained in the last comment in the test file.
The tests allow for configuring the kinds of heaps used to implement Wasm linear
memory via TOML in a `;;!` comment at the start of the test.
To get ISA and Cranelift flags parsing available in the filetests crate, I had
to move the `parse_sets_and_triple` helper from the `cranelift-tools` binary
crate to the `cranelift-reader` crate, where I think it logically
fits.
Additionally, I had to make some more bits of `cranelift-wasm`'s dummy
environment `pub` so that I could properly wrap and compose it with the
environment used for the `.wat` tests. I don't think this is a big deal, but if
we eventually want to clean this stuff up, we can probably remove the dummy
environments completely, remove `translate_module`, and fold them into these new
test environments and test runner (since Wasmtime isn't using those things
anyways).
* Fix optimization rules for narrow types: wrap i8 results to 8 bits.
This fixes#5405.
In the egraph mid-end's optimization rules, we were rewriting e.g. imuls
of two iconsts to an iconst of the result, but without masking off the
high bits (beyond the result type's width). This was producing iconsts
with set high bits beyond their types' width, which is not legal.
In addition, this PR adds some optimizations to the algebraic rules to
recognize e.g. `x == x` (and all other integer comparison operators) and
resolve to 1 or 0 as appropriate.
* Review feedback.
* Review feedback, again.
All instructions using the CPU flags types (IFLAGS/FFLAGS) were already
removed. This patch completes the cleanup by removing all remaining
instructions that define values of CPU flags types, as well as the
types themselves.
Specifically, the following features are removed:
- The IFLAGS and FFLAGS types and the SpecialType category.
- Special handling of IFLAGS and FFLAGS in machinst/isle.rs and
machinst/lower.rs.
- The ifcmp, ifcmp_imm, ffcmp, iadd_ifcin, iadd_ifcout, iadd_ifcarry,
isub_ifbin, isub_ifbout, and isub_ifborrow instructions.
- The writes_cpu_flags instruction property.
- The flags verifier pass.
- Flags handling in the interpreter.
All of these features are currently unused; no functional change
intended by this patch.
This addresses https://github.com/bytecodealliance/wasmtime/issues/3249.
I copied the `bmask` precise-output tests from x64 and used
CRANELIFT_TEST_BLESS=1 to generate this test.
I don't know aarch64 well enough to decide if this output is correct.
However, for I128 it is identical to the previous I128-only
precise-output tests, and the existing runtests for bmask pass on
aarch64, so I think it's likely correct.
Enable regalloc2's SSA verifier in debug builds to check for any outstanding reuse of virtual registers in def constraints. As fuzzing enables debug_assertions, this will enable the SSA verifier when fuzzing as well.