* x64: Add rudimentary support for some AVX instructions
I was poking around Spidermonkey's wasm backend and saw that the various
assembler functions used are all `v*`-prefixed which look like they're
intended for use with AVX instructions. I looked at Cranelift and it
currently doesn't have support for many AVX-based instructions, so I
figured I'd take a crack at it!
The support added here is a bit of a mishmash when viewed alone, but my
general goal was to take a single instruction from the SIMD proposal for
WebAssembly and migrate all of its component instructions to AVX. I, by
random chance, picked a pretty complicated instruction of `f32x4.min`.
This wasm instruction is implemented on x64 with 4 unique SSE
instructions and ended up being a pretty good candidate.
Further digging about AVX-vs-SSE shows that there should be two major
benefits to using AVX over SSE:
* Primarily AVX instructions largely use a three-operand form where two
input registers are operated with and an output register is also
specified. This is in contrast to SSE's predominant
one-register-is-input-but-also-output pattern. This should help free
up the register allocator a bit and additionally remove the need for
movement between registers.
* As #4767 notes the memory-based operations of VEX-encoded instructions
(aka AVX instructions) do not have strict alignment requirements which
means we would be able to sink loads and stores into individual
instructions instead of having separate instructions.
So I set out on my journey to implement the instructions used by
`f32x4.min`. The first few were fairly easy. The machinst backends are
already of the shape "take these inputs and compute the output" where
the x86 requirement of a register being both input and output is
postprocessed in. This means that the `inst.isle` creation helpers for
SSE instructions were already of the correct form to use AVX. I chose to
add new `rule` branches for the instruction creation helpers, for
example `x64_andnps`. The new `rule` conditionally only runs if AVX is
enabled and emits an AVX instruction instead of an SSE instruction for
achieving the same goal. This means that no lowerings of clif
instructions were modified, instead just new instructions are being
generated.
The VEX encoding was previously not heavily used in Cranelift. The only
current user are the FMA-style instructions that Cranelift has at this
time. These FMA instructions have one extra operand than `vandnps`, for
example, so I split the existing `XmmRmRVex` into a few more variants to
fit the shape of the instructions that needed generating for
`f32x4.min`. This was accompanied then with more AVX opcode definitions,
more emission support, etc.
Upon implementing all of this it turned out that the test suite was
failing on my machine due to the memory-operand encodings of VEX
instructions not being supported. I didn't explicitly add those in
myself but some preexisting RIP-relative addressing was leaking into the
new instructions with existing tests. I opted to go ahead and fill out
the memory addressing modes of VEX encoding to get the tests passing
again.
All-in-all this PR adds new instructions to the x64 backend for a number
of AVX instructions, updates 5 existing instruction producers to use AVX
instructions conditionally, implements VEX memory operands, and adds
some simple tests for the new output of `f32x4.min`. The existing
runtest for `f32x.min` caught a few intermediate bugs along the way and
I additionally added a plain `target x86_64` to that runtest to ensure
that it executes with and without AVX to test the various lowerings.
I'll also note that this, and future support, should be well-fuzzed
through Wasmtime's fuzzing which may explicitly disable AVX support
despite the machine having access to AVX, so non-AVX lowerings should be
well-tested into the future.
It's also worth mentioning that I am not an AVX or VEX or x64 expert.
Implementing the memory operand part for VEX was the hardest part of
this PR and while I think it should be good someone else should
definitely double-check me. Additionally I haven't added many
instructions to the x64 backend yet so I may have missed obvious places
to tests or such, so am happy to follow-up with anything to be more
thorough if necessary.
Finally I should note that this is just the tip of the iceberg when it
comes to AVX. My hope is to get some of the idioms sorted out to make it
easier for future PRs to add one-off instruction lowerings or such.
* Review feedback
Rework br_table to use BlockCall, allowing us to avoid adding new nodes during ssa construction to hold block arguments. Additionally, many places where we previously matched on InstructionData to extract branch destinations can be replaced with a use of branch_destination or branch_destination_mut.
* cranelift: Add `adrp` encoding to AArch64 backend
* cranelift: Support GOT Symbol References in AArch64
* cranelift: Add MachO GOT relocations
* cranelift: Do not mark the GOT PageOffset12 MachO relocation as relative
This updates the signatures of the `xmm_rm_r` helper function and then
updates existing users and migrates other users to the helper now that
the type information is no longer required.
* Add (bnot (bxor x y)) lowerings for s390x/aarch64
I originally thought that s390x's original lowering in #5709, but as was
rightfully pointed out `(bnot (bxor x y))` is equivalent to
`(bxor x (bnot y))` so the special lowering for one should apply as a
special lowering for the other. For the s390x and aarch64 backend that
have already have a fused lowering of the bxor/bnot add a lowering
additionally for the bnot/bxor combination.
* Add bnot(bxor(..)) tests for s390x 128-bit sizes
* Remove trailing whitespace in `lower.isle` files
* Legalize the `band_not` instruction into simpler form
This commit legalizes the `band_not` instruction into `band`-of-`bnot`,
or two instructions. This is intended to assist with egraph-based
optimizations where the `band_not` instruction doesn't have to be
specifically included in other bit-operation-patterns.
Lowerings of the `band_not` instruction have been moved to a
specialization of the `band` instruction.
* Legalize `bor_not` into components
Same as prior commit, but for the `bor_not` instruction.
* Legalize bxor_not into bxor-of-bnot
Same as prior commits. I think this also ended up fixing a bug in the
s390x backend where `bxor_not x y` was actually translated as `bnot
(bxor x y)` by accident given the test update changes.
* Simplify not-fused operands for riscv64
Looks like some delegated-to rules have special-cases for "if this
feature is enabled use the fused instruction" so move the clause for
testing the feature up to the lowering phase to help trigger other rules
if the feature isn't enabled. This should make the riscv64 backend more
consistent with how other backends are implemented.
* Remove B{and,or,xor}Not from cost of egraph metrics
These shouldn't ever reach egraphs now that they're legalized away.
* Add an egraph optimization for `x^-1 => ~x`
This adds a simplification node to translate xor-against-minus-1 to a
`bnot` instruction. This helps trigger various other optimizations in
the egraph implementation and also various backend lowering rules for
instructions. This is chiefly useful as wasm doesn't have a `bnot`
equivalent, so it's encoded as `x^-1`.
* Add a wasm test for end-to-end bitwise lowerings
Test that end-to-end various optimizations are being applied for input
wasm modules.
* Specifically don't self-update rustup on CI
I forget why this was here originally, but this is failing on Windows
CI. In general there's no need to update rustup, so leave it as-is.
* Cleanup some aarch64 lowering rules
Previously a 32/64 split was necessary due to the `ALUOp` being different
but that's been refactored away no so there's no longer any need for
duplicate rules.
* Narrow a x64 lowering rule
This previously made more sense when it was `band_not` and rarely used,
but be more specific in the type-filter on this rule that it's only
applicable to SIMD types with lanes.
* Simplify xor-against-minus-1 rule
No need to have the commutative version since constants are already
shuffled right for egraphs
* Optimize band-of-bnot when bnot is on the left
Use some more rules in the egraph algebraic optimizations to
canonicalize band/bor/bxor with a `bnot` operand to put the operand on
the right. That way the lowerings in the backends only have to list the
rule once, with the operand on the right, to optimize both styles of
input.
* Add commutative lowering rules
* Update cranelift/codegen/src/isa/x64/lower.isle
Co-authored-by: Jamey Sharp <jamey@minilop.net>
---------
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Cranelift: Introduce the `tail` calling convention
This is an unstable-ABI calling convention that we will eventually use to
support Wasm tail calls.
Co-Authored-By: Jamey Sharp <jsharp@fastly.com>
* Cranelift: Introduce the `return_call` and `return_call_indirect` instructions
These will be used to implement tail calls for Wasm and any other language
targeting CLIF. The `return_call_indirect` instruction differs from the Wasm
instruction of the same name by taking a native address callee rather than a
Wasm function index.
Co-Authored-By: Jamey Sharp <jsharp@fastly.com>
* Cranelift: Implement verification rules for `return_call[_indirect]`
They must:
* have the same return types between the caller and callee,
* have the same calling convention between caller and callee,
* and that calling convention must support tail calls.
Co-Authored-By: Jamey Sharp <jsharp@fastly.com>
* cargo fmt
---------
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
Improve the generated code for unordered floating point comparisons by negating the comparison and inveritng the branches. This allows us to pick the unordered versions, which generate significantly better code.
Add a conditional branch instruction with two targets: brif. This instruction will eventually replace brz and brnz, as it encompasses the behavior of both.
This PR also changes the InstructionData layout for instruction formats that hold BlockCall values, taking the same approach we use for Value arguments. This allows branch_destination to return a slice to the BlockCall values held in the instruction, rather than requiring that we pattern match on InstructionData to fetch the then/else blocks.
Function generation for fuzzing has been updated to generate uses of brif, and I've run the cranelift-fuzzgen target locally for hours without triggering any new failures.
Add a new type BlockCall that represents the pair of a block name with arguments to be passed to it. (The mnemonic here is that it looks a bit like a function call.) Rework the implementation of jump, brz, and brnz to use BlockCall instead of storing the block arguments as varargs in the instruction's ValueList.
To ensure that we're processing block arguments from BlockCall values in instructions, three new functions have been introduced on DataFlowGraph that both sets of arguments:
inst_values - returns an iterator that traverses values in the instruction and block arguments
map_inst_values - applies a function to each value in the instruction and block arguments
overwrite_inst_values - overwrite all values in an instruction and block arguments with values from the iterator
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* add clif-util compile option to output object file
* switch from a box to a borrow
* update objectmodule tests to use borrowed isa
* put targetisa into an arc
Fuzz additional targets in the cranelift-icache target. The list of targets fuzzed is controlled by the targets enabled in fuzz/Cargo.toml.
This PR also reworks how instruction disabling is done in function generator, moving the deny-list to a function to make the decision at runtime instead of compile time.
Remove the lower_br_fcmp function from the riscv64 backend. This PR only affects the emit implementation for FloatRound, replacing the uses of lower_br_fcmp with direct uses of FpuRRR and CondBr.
Any changes in behavior here should be already covered by the runtests for ceil, floor, trunc, and nearest.
Remove the MInst::TrapFf instruction in the riscv64 backend. It was only used in two places in the emit case for FloatRound, and was easily replaced with a combination of FpuRRR and TrapIf.
Rework the compilation of fcmp in the riscv64 backend to be in ISLE, removing the need for the dedicated Fcmp instruction. This change is motivated by #5500, which showed that the riscv64 backend was generating branch instructions in the middle of a basic block.
We can't remove lower_br_fcmp quite yet as it's used in a few places in the emit module, but it's now no longer reachable from the ISLE lowerings.
Fixes#5500
* cranelift: Add `iabs.i128` runtest
* riscv64: Fix incorrect extension in iabs
When lowering iabs, we were accidentally comparing the unextended value
this caused the instruction to misbehave with certain top bits.
This commit also adds a zbb lowering that does not use jumps.
When lowering `select+icmp` we have an optimization that allows us to
avoid materializing the icmp result.
We were accidentally not masking the high bits for i8 and i16 in this case.
Issue #5498 reported this as an illegal instruction but what was happening
there was that the invalid select caused a division by zero.
We had a off-by-one bounds check error when checking if we should
jump to the default block in a br-table. Instead of always jumping
to the default block when we have a jump table with 0 targets we
would try to compute an offset past the end of the table.
This sometimes would not crash, but it would crash if the there was
no block after the br_table, thus adding a cold block would cause a
segfault.
The actual fix is quite simple, do not count the default block
as a jump table entry when computing the limits.
This commit also does a bunch of cleanup and adding some comments
to the br_table emission code.
In #5031, we removed `bool` types from CLIF, using integers instead for
"truthy" values. This greatly simplified the IR, and was generally an
improvement.
However, because x86's `SETcc` instruction sets only the low 8 bits of a
register, we chose to use `i8` types as the result of `icmp` and `fcmp`,
to avoid the need for a masking operation when materializing the result.
Unfortunately this means that uses of truthy values often now have
`uextend` operations, especially when coming from Wasm (where truthy
values are naturally `i32`-typed). For example, where we previously had
`(brz (icmp ...))`, we now have `(brz (uextend (icmp ...)))`.
It's arguable whether or not we should switch to `i32` truthy values --
in most cases we can avoid materializing a value that's immediately used
for a branch or select, so a mask would in most cases be unnecessary,
and it would be a win at the IR level -- but irrespective of that, this
change *did* regress our generated code quality: our backends had
patterns for e.g. `(brz (icmp ...))` but not with the `uextend`, so we
were *always* materializing truthy values. Many blocks thus ended with
"cmp; setcc; cmp; test; branch" rather than "cmp; branch".
In #5391 we noticed this and fixed it on x64, but it was a general
problem on aarch64 and riscv64 as well. This PR introduces a
`maybe_uextend` extractor that "looks through" uextends, and uses it
where we consume truthy values, thus fixing the regression. This PR
also adds compile filetests to ensure we don't regress again.
The riscv64 backend has not been updated here because doing so appears
to trigger another issue in its branch handling; fixing that is TBD.
Assert that we never see real registers as arguments to move instructions in VCodeBuilder::collect_operands.
Also fix a bug in the riscv64 backend that was discovered by these assertions: the lowerings of get_stack_pointer and get_frame_pointer were using physical registers 8 and 2 directly. The solution was similar to other backends: add a move instruction specifically for moving out of physical registers, whose source operand is opaque to regalloc2.
This change adds support for immediate to memory moves in x64 which
are needed by Winch for zeroing local slots.
This change follows the guideline in `isa/x64/inst/emit` and uses
other instructions (immediate to register moves) as a base for the
test cases.
The instruction encoding expectation was derived by assembling each
instruction and inspecting the assembly with `objdump`.
Fixes#5199.
Fixes#5200.
Fixes#5452.
Fixes#5453.
On riscv64, there is apparently an autoconversion from `ValueRegs` to
`Reg` that takes just the low register [0], and removing this conversion
causes 48 errors. As a result of this, `select` with an `i128` condition
was silently miscompiling, testing only the low 64 bits. We should
remove this autoconversion to ensure we aren't missing any other silent
truncations, but for now this PR just adds the explicit `I128` logic for
`select` / `select_spectre_guard`.
[0]
d9fdbfd50e/cranelift/codegen/src/isa/riscv64/inst.isle (L1762)
The casloop_emit function in the s390x backend was using the fixed non-allocatable register %r0 directly with move instructions, which produced a panic in the regalloc2 checker (#5425). This PR changes the casloop_result function to use mov_preg instead of copy_reg to fetch the result, as it's not viewed by regalloc2 as a move.
Fixes#5425
* Refactor lower_branch to have Unit result
Branches cannot have any output, so it is more straightforward
to have the ISLE term return Unit instead of InstOutput.
Also provide a new `emit_side_effect` term to simplify
implementation of `lower_branch` rules with Unit result.
* Simplify LowerBackend interface
Move all remaining asserts from the LowerBackend::lower and
::lower_branch_group into the common call site.
Change return value of ::lower to Option<InstOutput>, and
return value of ::lower_branch_group to Option<()> to match
ISLE term signature.
Only pass the first branch into ::lower_branch_group and
rename it to ::lower_branch.
As a result of all those changes, LowerBackend routines
now consists solely to calls to the corresponding ISLE
routines.