* x64: Add VEX Instruction Encoder
This uses a similar builder pattern to the EVEX Encoder.
Does not yet support memory accesses.
* x64: Add FMA Flag
* x64: Implement SIMD `fma`
* x64: Use 4 register Vex Inst
* x64: Reorder VEX pretty print args
* x64: port `atomic_rmw` to ISLE
This change ports `atomic_rmw` to ISLE for the x64 backend. It does not
change the lowering in any way, though it seems possible that the fixed
regs need not be as fixed and that there are opportunities for single
instruction lowerings. It does rename `inst_common::AtomicRmwOp` to
`MachAtomicRmwOp` to disambiguate with the IR enum with the same name.
* x64: remove remaining hardcoded register constraints for `atomic_rmw`
* x64: use `SyntheticAmode` in `AtomicRmwSeq`
* review: add missing reg collector for amode
* review: collect memory registers in the 'late' phase
This PR refactors the x64 backend address-mode lowering to use an
incremental-build approach, where it considers each node in a tree of
`iadd`s that feed into a load/store address and, at each step, builds
the best possible `Amode`. It will combine an arbitrary number of
constant offsets (an extension beyond the current rules), and can
capture a left-shifted (scaled) index in any position of the tree
(another extension).
This doesn't have any measurable performance improvement on our Wasm
benchmarks in Sightglass, unfortunately, because the IR lowered from
wasm32 will do address computation in 32 bits and then `uextend` it to
add to the 64-bit heap base. We can't quite lift the 32-bit adds to 64
bits because this loses the wraparound semantics.
(We could label adds as "expected not to overflow", and allow *those* to
be lifted to 64 bit operations; wasm32 heap address computation should
fit this. This is `add nuw` (no unsigned wrap) in LLVM IR terms. That's
likely my next step.)
Nevertheless, (i) this generalizes the cases we can handle, which should
be a good thing, all other things being equal (and in this case, no
compile time impact was measured); and (ii) might benefit non-Wasm
frontends.
This PR switches Cranelift over to the new register allocator, regalloc2.
See [this document](https://gist.github.com/cfallin/08553421a91f150254fe878f67301801)
for a summary of the design changes. This switchover has implications for
core VCode/MachInst types and the lowering pass.
Overall, this change brings improvements to both compile time and speed of
generated code (runtime), as reported in #3942:
```
Benchmark Compilation (wallclock) Execution (wallclock)
blake3-scalar 25% faster 28% faster
blake3-simd no diff no diff
meshoptimizer 19% faster 17% faster
pulldown-cmark 17% faster no diff
bz2 15% faster no diff
SpiderMonkey, 21% faster 2% faster
fib(30)
clang.wasm 42% faster N/A
```
We already defined the `Gpr` newtype and used it in a few places, and we already
defined the `Xmm` newtype and used it extensively. This finishes the transition
to using the newtypes extensively in lowering by making use of `Gpr` in more
places.
Fixes#3685
This primary motivation of this large commit (apologies for its size!) is to
introduce `Gpr` and `Xmm` newtypes over `Reg`. This should help catch
difficult-to-diagnose register class mixup bugs in x64 lowerings.
But having a newtype for `Gpr` and `Xmm` themselves isn't enough to catch all of
our operand-with-wrong-register-class bugs, because about 50% of operands on x64
aren't just a register, but a register or memory address or even an
immediate! So we have `{Gpr,Xmm}Mem[Imm]` newtypes as well.
Unfortunately, `GprMem` et al can't be `enum`s and are therefore a little bit
noisier to work with from ISLE. They need to maintain the invariant that their
registers really are of the claimed register class, so they need to encapsulate
the inner data. If they exposed the underlying `enum` variants, then anyone
could just change register classes or construct a `GprMem` that holds an XMM
register, defeating the whole point of these newtypes. So when working with
these newtypes from ISLE, we rely on external constructors like `(gpr_to_gpr_mem
my_gpr)` instead of `(GprMem.Gpr my_gpr)`.
A bit of extra lines of code are included to add support for register mapping
for all of these newtypes as well. Ultimately this is all a bit wordier than I'd
hoped it would be when I first started authoring this commit, but I think it is
all worth it nonetheless!
In the process of adding these newtypes, I didn't want to have to update both
the ISLE `extern` type definition of `MInst` and the Rust definition, so I move
the definition fully into ISLE, similar as aarch64.
Finally, this process isn't complete. I've introduced the newtypes here, and
I've made most XMM-using instructions switch from `Reg` to `Xmm`, as well as
register class-converting instructions, but I haven't moved all of the GPR-using
instructions over to the newtypes yet. I figured this commit was big enough as
it was, and I can continue the adoption of these newtypes in follow up commits.
Part of #3685.
* aarch64: Initial work to transition backend to ISLE
This commit is what is hoped to be the initial commit towards migrating
the aarch64 backend to ISLE. There's seemingly a lot of changes here but
it's intended to largely be code motion. The current thinking is to
closely follow the x64 backend for how all this is handled and
organized.
Major changes in this PR are:
* The `Inst` enum is now defined in ISLE. This avoids having to define
it in two places (once in Rust and once in ISLE). I've preserved all
the comments in the ISLE and otherwise this isn't actually a
functional change from the Rust perspective, it's still the same enum
according to Rust.
* Lots of little enums and things were moved to ISLE as well. As with
`Inst` their definitions didn't change, only where they're defined.
This will give future ISLE PRs access to all these operations.
* Initial code for lowering `iconst`, `null`, and `bconst` are
implemented. Ironically none of this is actually used right now
because constant lowering is handled in `put_input_in_regs` which
specially handles constants. Nonetheless I wanted to get at least
something simple working which shows off how to special case various
things that are specific to AArch64. In a future PR I plan to hook up
const-lowering in ISLE to this path so even though
`iconst`-the-clif-instruction is never lowered this should use the
const lowering defined in ISLE rather than elsewhere in the backend
(eventually leading to the deletion of the non-ISLE lowering).
* The `IsleContext` skeleton is created and set up for future additions.
* Some code for ISLE that's shared across all backends now lives in
`isle_prelude_methods!()` and is deduplicated between the AArch64
backend and the x64 backend.
* Register mapping is tweaked to do the same thing for AArch64 that it
does for x64. Namely mapping virtual registers is supported instead of
just virtual to machine registers.
My main goal with this PR was to get AArch64 into a place where new
instructions can be added with relative ease. Additionally I'm hoping to
figure out as part of this change how much to share for ISLE between
AArch64 and x64 (and other backends).
* Don't use priorities with rules
* Update .gitattributes with concise syntax
* Deduplicate some type definitions
* Rebuild ISLE
* Move isa::isle to machinst::isle
This was my first attempt at transitioning code to ISLE to originally
fix#3327 but that fix has since landed on `main`, so this is instead
now just porting a few operations to ISLE.
Closes#3336
On the build side, this commit introduces two things:
1. The automatic generation of various ISLE definitions for working with
CLIF. Specifically, it generates extern type definitions for clif opcodes and
the clif instruction data `enum`, as well as extractors for matching each clif
instructions. This happens inside the `cranelift-codegen-meta` crate.
2. The compilation of ISLE DSL sources to Rust code, that can be included in the
main `cranelift-codegen` compilation.
Next, this commit introduces the integration glue code required to get
ISLE-generated Rust code hooked up in clif-to-x64 lowering. When lowering a clif
instruction, we first try to use the ISLE code path. If it succeeds, then we are
done lowering this instruction. If it fails, then we proceed along the existing
hand-written code path for lowering.
Finally, this commit ports many lowering rules over from hand-written,
open-coded Rust to ISLE.
In the process of supporting ISLE, this commit also makes the x64 `Inst` capable
of expressing SSA by supporting 3-operand forms for all of the existing
instructions that only have a 2-operand form encoding:
dst = src1 op src2
Rather than only the typical x86-64 2-operand form:
dst = dst op src
This allows `MachInst` to be in SSA form, since `dst` and `src1` are
disentangled.
("3-operand" and "2-operand" are a little bit of a misnomer since not all
operations are binary operations, but we do the same thing for, e.g., unary
operations by disentangling the sole operand from the result.)
There are two motivations for this change:
1. To allow ISLE lowering code to have value-equivalence semantics. We want ISLE
lowering to translate a CLIF expression that evaluates to some value into a
`MachInst` expression that evaluates to the same value. We want both the
lowering itself and the resulting `MachInst` to be pure and referentially
transparent. This is both a nice paradigm for compiler writers that are
authoring and maintaining lowering rules and is a prerequisite to any sort of
formal verification of our lowering rules in the future.
2. Better align `MachInst` with `regalloc2`'s API, which requires that the input
be in SSA form.
* Add support for x64 packed promote low
* Add support for x64 packed floating point demote
* Update vector promote low and demote by adding constraints
Also does some renaming and minor refactoring
When shuffling values from two different registers, the x64 lowering for
`i8x16.shuffle` must first shuffle each register separately and then OR
the results with SSE instructions. With `VPERMI2B`, available in
AVX512VL + AVX512VBMI, this can be done in a single instruction after
the shuffle mask has been moved into the destination register. This
change uses `VPERMI2B` for that case when the CPU supports it.
When AVX512VL or AVX512BITALG are available, Wasm SIMD's `popcnt`
instruction can be lowered to a single x64 instruction, `VPOPCNTB`,
instead of 8+ instructions.
When AVX512VL and AVX512F are available, use a single instruction
(`VCVTUDQ2PS`) instead of a length 9-instruction sequence. This
optimization is a port from the legacy x86 backend.
Previously, the x64 backend's ABI code would generate a sign-extending
load when loading a less-than-64-bit integer from a spillslot. This is
incorrect: e.g., for i32s > 0x80000000, this would result in all high
bits set.
This interacts poorly with another optimization. Normally, the invariant
is that the high bits of a register holding a value of a certain type,
beyond that type's bits, are undefined. However, as an optimization, we
recognize and use the fact that on x86-64, 32-bit instructions zero the
upper 32 bits. This allows us to elide a 32-to-64-bit zero-extend op
(turning it into just a move, which can then sometimes disappear
entirely due to register coalescing).
If a spill and reload happen between the production of a 32-bit value
from an instruction known to zero the upper bits and its use, then we
will rely on zero upper bits that might actually be set by a
sign-extend. This will result in incorrect execution.
As a fix, we stick to a simple invariant: we always spill and reload a
full 64 bits when handling integer registers on x64. This ensures that
no bits are mangled.
This change implements `vselect` using SSE4.1's `BLENDVPS`, `BLENDVPD`,
and `PBLENDVB`. `vselect` is a lane-selecting instruction that is used
by
[simple_preopt.rs](fa1faf5d22/cranelift/codegen/src/simple_preopt.rs (L947-L999))
to lower `bitselect` to a single x86 instruction when the condition mask
is known to be boolean (all 1s or 0s, e.g., from a conversion). This is
better than `bitselect` in general, which lowers to 4-5 instructions.
The old backend had the `vselect` lowering; this simply introduces it to
the new backend.
This adds the machinery to encode the VPMULLQ instruction which is
available in AVX512VL and AVX512DQ. When these feature sets are
available, we use this instruction instead of a lengthy 12-instruction
sequence.
* x64: add EVEX encoding mechanism
Also, includes an empty stub module for the VEX encoding.
* x64: lower abs.i64x2 to VPABSQ when available
* x64: refactor EVEX encodings to use `EvexInstruction`
This change replaces the `encode_evex` function with a builder-style struct, `EvexInstruction`. This approach clarifies the code, adds documentation, and results in slight speedups when benchmarked.
* x64: rename encoding CodeSink to ByteSink
Because there are instructions that are present in more than one ISA feature set, we need to see if any of the ISA requirements match before emitting. This change includes the `VPABSQ` instruction as an example, which is present in both `AVX512F` and `AVX512VL`.
* Use stable Rust on CI to test the x64 backend
This commit leverages the newly-released 1.51.0 compiler to test the
new backend on Windows and Linux with a stable compiler instead of a
nightly compiler. This isolates the nightly build to just the nightly
documentation generation and fuzzing, both of which rely on nightly for
the best results right now.
* Use updated stable in book build job
* Run rustfmt for new stable
* Silence new warnings for wasi-nn
* Allow some dead code in the x64 backend
Looks like new rustc is better about emitting some dead-code warnings
* Update rust in peepmatic job
* Fix a test in the pooling allocator
* Remove `package.metdata.docs.rs` temporarily
Needs resolution of https://github.com/rust-lang/cargo/pull/9300 first
* Fix a warning in a wasi-nn example
This adds support for the "fastcall" ABI, which is the native C/C++ ABI
on Windows platforms on x86-64. It is similar to but not exactly like
System V; primarily, its argument register assignments are different,
and it requires stack shadow space.
Note that this also adjusts the handling of multi-register values in the
shared ABI implementation, and with this change, adjusts handling of
`i128`s on *both* Fastcall/x64 *and* SysV/x64 platforms. This was done
to align with actual behavior by the "rustc ABI" on both platforms, as
mapped out experimentally (Compiler Explorer link in comments). This
behavior is gated under the `enable_llvm_abi_extensions` flag.
Note also that this does *not* add x64 unwind info on Windows. That will
come in a future PR (but is planned!).
This instruction has a single instruction lowering in AVX512F/VL and a three instruction lowering in AVX but neither is currently supported in the x64 backend. To implement this, we instead subtract the vector from 0 and use a blending instruction to pick the lanes containing the absolute value.
This is in preparation for refactoring all x64::Inst arms to use OperandSize.
Current uses of OperandSize fall into two categories:
1. XMM operations which require 32/64 bit operands
2. Immediates which only care about 64-bit or not.
Adds assertions to existing Inst constructors to check that they are passed valid sizes.
This change also removes the implicit widening of 1 and 2 byte values to 4 bytes. from_bytes() is only used by category 2, so removing this behavior will not change any visible behavior.
Overall this change should be a no-op.
This implements all of the ops on I128 that are implemented by the
legacy x86 backend, and includes all that are required by at least one
major use-case (cg_clif rustc backend).
The sequences are open-coded where necessary; for e.g. the bit
operations, this can be somewhat complex, but these sequences have been
tested carefully. This PR also includes a drive-by fix of clz/ctz for 8-
and 16-bit cases where they were incorrect previously.
Also includes ridealong fixes developed while bringing up cg_clif
support, because they are difficult to completely separate due to
other refactors that occurred in this PR:
- fix REX prefix logic for some 8-bit instructions.
When using an 8-bit register in 64-bit mode on x86-64, the REX prefix
semantics are somewhat subtle: without the REX prefix, register numbers
4--7 correspond to the second-to-lowest byte of the first four registers
(AH, CH, BH, DH), whereas with the REX prefix, these register numbers
correspond to the usual encoding (SPL, BPL, SIL, DIL). We could always
emit a REX byte for instructions with 8-bit cases (this is harmless even
if unneeded), but this would unnecessarily inflate code size; instead,
the usual approach is to emit it only for these registers.
This logic was present in some cases but missing for some other
instructions: divide, not, negate, shifts.
Fixes#2508.
- avoid unaligned SSE loads on some f64 ops.
The implementations of several FP ops, such as fabs/fneg, used SSE
instructions. This is not a problem per-se, except that load-op merging
did not take *alignment* into account. Specifically, if an op on an f64
loaded from memory happened to merge that load, and the instruction into
which it was merged was an SSE instruction, then the SSE instruction
imposes stricter (128-bit) alignment requirements than the load.f64 did.
This PR simply forces any instruction lowerings that could use SSE
instructions to implement non-SIMD operations to take inputs in
registers only, and avoid load-op merging.
Fixes#2507.
- two bugfixes exposed by cg_clif: urem/srem.i8, select.b1.
- urem/srem.i8: the 8-bit form of the DIV instruction on x86-64 places
the remainder in AH, not RDX, different from all the other width-forms
of this instruction.
- select.b1: we were not recognizing selects of boolean values as
integer-typed operations, so we were generating XMM moves instead (!).
Previously, `select` and `brz`/`brnz` instructions, when given a `b1`
boolean argument, would test whether that boolean argument was nonzero,
rather than whether its LSB was nonzero. Since our invariant for mapping
CLIF state to machine state is that bits beyond the width of a value are
undefined, the proper lowering is to test only the LSB.
(aarch64 does not have the same issue because its `Extend` pseudoinst
already properly handles masking of b1 values when a zero-extend is
requested, as it is for select/brz/brnz.)
Found by Nathan Ringo on Zulip [1] (thanks!).
[1]
https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/bnot.20on.20b1s
This end result was previously enacted by carrying a `SourceLoc` on
every load/store, which was somewhat cumbersome, and only indirectly
encoded metadata about a memory reference (can it trap) by its presence
or absence. We have a type for this -- `MemFlags` -- that tells us
everything we might want to know about a load or store, and we should
plumb it through to code emission instead.
This PR attaches a `MemFlags` to an `Amode` on x64, and puts it on load
and store `Inst` variants on aarch64. These two choices seem to factor
things out in the nicest way: there are relatively few load/store insts
on aarch64 but many addressing modes, while the opposite is true on x64.