Add a display method to BlockCall that returns a value implementing `std::fmt::Display`. Rework the display code in the write module of cranelift-codegen to use this method instead.
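A minimal sketch of the shape this could take, assuming a hypothetical `DisplayBlockCall` wrapper over the block and its argument values; the actual type, fields, and method signature in cranelift-codegen may differ:

```rust
use core::fmt;
use cranelift_codegen::ir::{Block, Value};

/// Hypothetical wrapper returned by a `BlockCall::display`-style method;
/// holding the block and its argument values lets callers format lazily.
pub struct DisplayBlockCall<'a> {
    block: Block,
    args: &'a [Value],
}

impl fmt::Display for DisplayBlockCall<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Prints e.g. `block3(v1, v2)`, or just `block3` with no arguments.
        write!(f, "{}", self.block)?;
        if !self.args.is_empty() {
            write!(f, "(")?;
            for (i, arg) in self.args.iter().enumerate() {
                if i > 0 {
                    write!(f, ", ")?;
                }
                write!(f, "{}", arg)?;
            }
            write!(f, ")")?;
        }
        Ok(())
    }
}
```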
Souper requires an `i1` condition value; we don't, and instead implicitly check
against 0. We were truncating conditions, but should actually be comparing
against `0`.
Most of these optimizations are in the egraph `cprop.isle` rules now,
making a separate crate unnecessary.
Also I think the `udiv` optimizations here are straight-up wrong (doing
signed instead of unsigned division, and panicking instead of preserving
traps on division by zero) so I'm guessing this crate isn't seriously
used anywhere.
At the least, bjorn3 confirms that cg_clif doesn't use this, and I've
verified that Wasmtime doesn't either.
Closes #1090.
Improve the generated code for unordered floating-point comparisons by negating the comparison and inverting the branches. This allows us to pick the unordered versions, which generate significantly better code.
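A rough sketch of the transform, using `FloatCC::complement` plus hypothetical surrounding names (the real change lives in the backend lowering):

```rust
use cranelift_codegen::ir::condcodes::{CondCode, FloatCC};

/// Rewrite a conditional branch on an ordered float comparison into a branch
/// on the complemented (unordered) condition with the targets swapped, so the
/// backend can pick the cheaper unordered comparison instructions.
/// `then_target`/`else_target` are illustrative stand-ins for block targets.
fn negate_fcmp_branch<T>(cond: FloatCC, then_target: T, else_target: T) -> (FloatCC, T, T) {
    // e.g. `gt` becomes `ule` ("unordered or less-than-or-equal"); the branch
    // then falls through to the original then-target when the negated
    // condition is false.
    (cond.complement(), else_target, then_target)
}
```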
Add a conditional branch instruction with two targets: brif. This instruction will eventually replace brz and brnz, as it encompasses the behavior of both.
This PR also changes the InstructionData layout for instruction formats that hold BlockCall values, taking the same approach we use for Value arguments. This allows branch_destination to return a slice of the BlockCall values held in the instruction, rather than requiring that we pattern match on InstructionData to fetch the then/else blocks.
Function generation for fuzzing has been updated to generate uses of brif, and I've run the cranelift-fuzzgen target locally for hours without triggering any new failures.
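A hedged sketch of the layout idea, using illustrative stand-in types (the real `InstructionData` variants and field names in cranelift-codegen differ in detail):

```rust
/// Stand-in for cranelift-codegen's block-call reference; see the real
/// `BlockCall` type for details.
#[derive(Clone, Copy, Debug)]
struct BlockCall(u32);

/// Simplified stand-in for `InstructionData`: formats that hold block calls
/// keep them in a fixed-size array, just as Value arguments are handled.
enum InstructionData {
    Brif { blocks: [BlockCall; 2] },
    Jump { destination: BlockCall },
    Other,
}

/// Because both targets of `brif` are stored contiguously,
/// `branch_destination` can hand back a slice instead of forcing callers to
/// pattern match out the then/else blocks.
fn branch_destination(data: &InstructionData) -> &[BlockCall] {
    match data {
        InstructionData::Brif { blocks } => blocks,
        InstructionData::Jump { destination } => core::slice::from_ref(destination),
        InstructionData::Other => &[],
    }
}
```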
This PR follows up on #5382 and #5391, which rebuilt the egraph-based optimization framework to be more performant, by enabling it by default.
Based on performance results in #5382 (my measurements on SpiderMonkey and bjorn3's independent confirmation with cg_clif), it seems that this is reasonable to enable. Now that we have been fuzzing compiler configurations with egraph opts (#5388) for 6 weeks, having fixed a few fuzzbugs that came up (#5409, #5420, #5438) and subsequently received no further reports from OSS-Fuzz, I believe it is stable enough to rely on.
This PR enables `use_egraphs`, and also normalizes its meaning: previously it forced optimization (it basically meant "turn on the egraph optimization machinery"); now it runs egraph opts only when the opt level calls for optimization (it means "use egraphs to optimize if we are going to optimize"). The conditionals in the top-level pass driver are a little subtle, but will get simpler once we can remove the non-egraph path (which we plan to do eventually!).
Fixes #5181.
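A simplified sketch of the new meaning, with a local stand-in for the opt-level setting (the real pass driver's conditionals are more involved):

```rust
/// Local stand-in mirroring Cranelift's `OptLevel` setting for this sketch.
#[derive(PartialEq)]
enum OptLevel {
    None,
    Speed,
    SpeedAndSize,
}

/// Previously `use_egraphs` forced optimization on by itself; now it only
/// selects the egraph machinery when the opt level already asks to optimize.
fn should_run_egraph_opts(use_egraphs: bool, opt_level: OptLevel) -> bool {
    use_egraphs && opt_level != OptLevel::None
}
```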
* Support mergeable-but-side-effectful (idempotent) operations in general in the egraph's GVN.
This mirrors the similar change made in #5534.
* Add tests for egraph case.
Add a new type BlockCall that represents the pair of a block name with arguments to be passed to it. (The mnemonic here is that it looks a bit like a function call.) Rework the implementation of jump, brz, and brnz to use BlockCall instead of storing the block arguments as varargs in the instruction's ValueList.
To ensure that we're processing block arguments from BlockCall values in instructions, three new functions have been introduced on DataFlowGraph that cover both sets of arguments (a short usage sketch follows the list):
inst_values - returns an iterator that traverses values in the instruction and block arguments
map_inst_values - applies a function to each value in the instruction and block arguments
overwrite_inst_values - overwrites all values in the instruction and block arguments with values from the given iterator
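As a hedged usage sketch of the middle helper (the exact signature on `DataFlowGraph` may differ slightly), here is how one value could be rewritten everywhere it appears in an instruction, including inside its BlockCall arguments:

```rust
use cranelift_codegen::ir::{DataFlowGraph, Inst, Value};

/// Replace every use of `old` with `new` in `inst`, covering both the
/// instruction's own arguments and the arguments of its BlockCall targets.
fn replace_value_in_inst(dfg: &mut DataFlowGraph, inst: Inst, old: Value, new: Value) {
    dfg.map_inst_values(inst, |value| if value == old { new } else { value });
}
```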
Co-authored-by: Jamey Sharp <jamey@minilop.net>
This is a cleanup to help prepare for #5464.
Most of the diff is inlining the closure for `mark_all_uses_as_multiple`,
which was only called once. That avoids some borrow-checker challenges.
The key change is that the former `push_args_on_stack` closure no longer
actually pushes the iterator on the stack, but just returns it. That
way, the closure doesn't need the name of the stack's type, and it can
also be reused in the debug_assert.
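A generic sketch of that pattern (the names here are purely illustrative, not the ones in the PR):

```rust
/// Instead of a closure that pushes arguments onto a concrete stack type,
/// return the iterator and let the caller decide what to do with it.
fn args_to_visit(args: &[u32]) -> impl Iterator<Item = u32> + '_ {
    args.iter().copied()
}

fn visit_all(args: &[u32]) {
    let mut stack: Vec<u32> = Vec::new();
    // The same helper now feeds both a debug assertion and the real push,
    // without the helper having to name the stack's type.
    debug_assert!(args_to_visit(args).all(|a| a != u32::MAX));
    stack.extend(args_to_visit(args));
    while let Some(_arg) = stack.pop() {
        // ... process each argument ...
    }
}
```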
* add clif-util compile option to output object file
* switch from a Box to a borrow
* update ObjectModule tests to use a borrowed ISA
* put TargetIsa into an Arc
Fuzz additional targets in the cranelift-icache target. The list of targets fuzzed is controlled by the targets enabled in fuzz/Cargo.toml.
This PR also reworks how instruction disabling is done in the function generator, moving the deny-list into a function so the decision is made at runtime instead of compile time.
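A sketch of the runtime shape this could take (function and opcode names below are hypothetical):

```rust
/// Decide at runtime whether the function generator may emit an opcode for
/// the target being fuzzed, replacing a compile-time deny-list.
fn is_opcode_allowed(target_triple: &str, opcode: &str) -> bool {
    match (target_triple, opcode) {
        // Hypothetical entry: skip an opcode a particular backend can't lower yet.
        ("riscv64", "tls_value") => false,
        _ => true,
    }
}
```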
* Switch duplicate loads w/ dynamic memories test to `min_size = 0`
This test was accidentally hitting a special case for bounds checks that applies when we
know that `offset + access_size < min_size` and lets us skip some steps. This
commit changes the `min_size` of the memory to zero so that we are forced to do
fully general bounds checks.
* Cranelift: Mark `uadd_overflow_trap` as okay for GVN
Although this improves our test sequence for duplicate loads with dynamic
memories, it unfortunately doesn't have any effect on sightglass benchmarks:
```
instantiation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[34448 35607.23 37158] gvn_uadd_overflow_trap.so
[34566 35734.05 36585] main.so
instantiation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[44101 60449.62 92712] gvn_uadd_overflow_trap.so
[44011 60436.37 92690] main.so
instantiation :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[35595 36675.72 38153] gvn_uadd_overflow_trap.so
[35440 36670.42 37993] main.so
compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[17370195 17405125.62 17471222] gvn_uadd_overflow_trap.so
[17369324 17404859.43 17470725] main.so
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[7055720520 7055886880.32 7056265930] gvn_uadd_overflow_trap.so
[7055719554 7055843809.33 7056193289] main.so
compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[683589861 683767276.00 684098366] gvn_uadd_overflow_trap.so
[683590024 683767998.02 684097885] main.so
execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[46436883 46437135.10 46437823] gvn_uadd_overflow_trap.so
[46436883 46437087.67 46437785] main.so
compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[126522461 126565812.58 126647044] gvn_uadd_overflow_trap.so
[126522176 126565757.75 126647522] main.so
execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[653010531 653010533.03 653010544] gvn_uadd_overflow_trap.so
[653010531 653010533.18 653010537] main.so
```
* cranelift-codegen-meta: Rename `side_effects_okay_for_gvn` to `side_effects_idempotent`
* cranelift-filetests: Ensure there is a trailing newline for blessed Wasm tests
Remove the lower_br_fcmp function from the riscv64 backend. This PR only affects the emit implementation for FloatRound, replacing the uses of lower_br_fcmp with direct uses of FpuRRR and CondBr.
Any changes in behavior here should be already covered by the runtests for ceil, floor, trunc, and nearest.
Remove the MInst::TrapFf instruction in the riscv64 backend. It was only used in two places in the emit case for FloatRound, and was easily replaced with a combination of FpuRRR and TrapIf.
* Cranelift: Make spectre guards GVN-able
While these instructions have a side effect that is otherwise invisible to the
optimizer, the side effect in question is idempotent, so it can be de-duplicated
by GVN.
* Cranelift: Run redundant load replacement and GVN twice
This allows us to actually replace redundant Wasm loads with dynamic memories.
While this improves our hand-crafted test sequences, it doesn't seem to yield any
improvement on sightglass benchmarks run with dynamic memories; however, it also
isn't a hit to compilation times, so it seems generally good to land anyway:
```
$ cargo run --release -- benchmark -e ~/scratch/once.so -e ~/scratch/twice.so -m insts-retired --processes 20 --iterations-per-process 3 --engine-flags="--static-memory-maximum-size 0" -- benchmarks/default.suite
compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[683595240 683768610.53 684097577] once.so
[683597068 700115966.83 1664907164] twice.so
instantiation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[44107 60411.07 92785] once.so
[44138 59552.32 92097] twice.so
compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[17369916 17404839.78 17471458] once.so
[17369935 17625713.87 30700150] twice.so
compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[126523640 126566170.80 126648265] once.so
[126523076 127174580.30 163145149] twice.so
instantiation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[34569 35686.25 36513] once.so
[34651 35749.97 36953] twice.so
instantiation :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[35146 36639.10 37707] once.so
[34472 36580.82 38431] twice.so
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
No difference in performance.
[7055720115 7055841324.82 7056180024] once.so
[7055717681 7055877095.85 7056225217] twice.so
execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
No difference in performance.
[46436881 46437081.28 46437691] once.so
[46436883 46437127.68 46437766] twice.so
execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm
No difference in performance.
[653010530 653010533.27 653010539] once.so
[653010531 653010532.95 653010538] twice.so
```
Rework the compilation of fcmp in the riscv64 backend to be in ISLE, removing the need for the dedicated Fcmp instruction. This change is motivated by #5500, which showed that the riscv64 backend was generating branch instructions in the middle of a basic block.
We can't remove lower_br_fcmp quite yet as it's used in a few places in the emit module, but it's now no longer reachable from the ISLE lowerings.
Fixes #5500.
* cranelift: Add `iabs.i128` runtest
* riscv64: Fix incorrect extension in iabs
When lowering iabs, we were accidentally comparing the unextended value;
this caused the instruction to misbehave when certain top bits were set.
This commit also adds a zbb lowering that does not use jumps.
When lowering `select+icmp` we have an optimization that allows us to
avoid materializing the icmp result.
We were accidentally not masking the high bits for i8 and i16 in this case.
Issue #5498 reported this as an illegal instruction, but what was actually happening
there was that the invalid select result caused a division by zero.
We had an off-by-one error in the bounds check that decides whether we should
jump to the default block in a br_table. Instead of always jumping
to the default block when the jump table has 0 targets, we
would try to compute an offset past the end of the table.
This sometimes would not crash, but it would crash if there was
no block after the br_table; thus, adding a cold block could cause a
segfault.
The actual fix is quite simple: do not count the default block
as a jump table entry when computing the limits.
This commit also does a bunch of cleanup and adds some comments
to the br_table emission code.
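A minimal sketch of the corrected bound computation, with illustrative names (the real emission code works on machine-code offsets):

```rust
/// Pick the target of a br_table: the default block is *not* a table entry,
/// so the in-range test must use only the number of real entries. Counting
/// the default as an entry made an empty table index one slot past the end.
fn br_table_target(index: usize, targets: &[usize], default: usize) -> usize {
    if index < targets.len() {
        targets[index]
    } else {
        default
    }
}
```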
In #5031, we removed `bool` types from CLIF, using integers instead for
"truthy" values. This greatly simplified the IR, and was generally an
improvement.
However, because x86's `SETcc` instruction sets only the low 8 bits of a
register, we chose to use `i8` types as the result of `icmp` and `fcmp`,
to avoid the need for a masking operation when materializing the result.
Unfortunately this means that uses of truthy values often now have
`uextend` operations, especially when coming from Wasm (where truthy
values are naturally `i32`-typed). For example, where we previously had
`(brz (icmp ...))`, we now have `(brz (uextend (icmp ...)))`.
It's arguable whether or not we should switch to `i32` truthy values --
in most cases we can avoid materializing a value that's immediately used
for a branch or select, so a mask would in most cases be unnecessary,
and it would be a win at the IR level -- but irrespective of that, this
change *did* regress our generated code quality: our backends had
patterns for e.g. `(brz (icmp ...))` but not with the `uextend`, so we
were *always* materializing truthy values. Many blocks thus ended with
"cmp; setcc; cmp; test; branch" rather than "cmp; branch".
In #5391 we noticed this and fixed it on x64, but it was a general
problem on aarch64 and riscv64 as well. This PR introduces a
`maybe_uextend` extractor that "looks through" uextends, and uses it
where we consume truthy values, thus fixing the regression. This PR
also adds compile filetests to ensure we don't regress again.
The riscv64 backend has not been updated here because doing so appears
to trigger another issue in its branch handling; fixing that is TBD.
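The extractor itself is written in ISLE; the idea, expressed as a rough Rust sketch (accessor details may differ across versions), is to look through a `uextend` when examining a truthy value:

```rust
use cranelift_codegen::ir::{DataFlowGraph, InstructionData, Opcode, Value, ValueDef};

/// If `value` is the result of a `uextend`, return the pre-extension value;
/// otherwise return `value` unchanged. Lowering patterns that consume truthy
/// values can then match `(brz (icmp ...))` with or without the `uextend`.
fn maybe_uextend(dfg: &DataFlowGraph, value: Value) -> Value {
    if let ValueDef::Result(inst, 0) = dfg.value_def(value) {
        if let InstructionData::Unary {
            opcode: Opcode::Uextend,
            arg,
        } = dfg.insts[inst]
        {
            return arg;
        }
    }
    value
}
```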
Assert that we never see real registers as arguments to move instructions in VCodeBuilder::collect_operands.
Also fix a bug in the riscv64 backend that was discovered by these assertions: the lowerings of get_stack_pointer and get_frame_pointer were using physical registers 8 and 2 directly. The solution was similar to other backends: add a move instruction specifically for moving out of physical registers, whose source operand is opaque to regalloc2.
This change adds support for immediate-to-memory moves in x64, which
are needed by Winch for zeroing local slots.
This change follows the guideline in `isa/x64/inst/emit` and uses
other instructions (immediate-to-register moves) as a base for the
test cases.
The instruction encoding expectation was derived by assembling each
instruction and inspecting the assembly with `objdump`.
Fixes #5199.
Fixes #5200.
Fixes #5452.
Fixes #5453.
On riscv64, there is apparently an autoconversion from `ValueRegs` to
`Reg` that takes just the low register [0], and removing this conversion
causes 48 errors. As a result of this, `select` with an `i128` condition
was silently miscompiling, testing only the low 64 bits. We should
remove this autoconversion to ensure we aren't missing any other silent
truncations, but for now this PR just adds the explicit `I128` logic for
`select` / `select_spectre_guard`.
[0]
d9fdbfd50e/cranelift/codegen/src/isa/riscv64/inst.isle (L1762)
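For reference, the difference in semantics, reduced to plain integers (illustrative only):

```rust
/// A truthy i128 condition must consider both 64-bit halves; using only the
/// low register (what the silent `ValueRegs` -> `Reg` conversion did) would
/// report values like `1 << 64` as zero.
fn i128_is_nonzero(lo: u64, hi: u64) -> bool {
    (lo | hi) != 0
}
```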