Remove the boolean types from cranelift, and the associated instructions breduce, bextend, bconst, and bint. Standardize on using 1/0 for the return value from instructions that produce scalar boolean results, and -1/0 for boolean vector elements.
Fixes#3205
Co-authored-by: Afonso Bordado <afonso360@users.noreply.github.com>
Co-authored-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Co-authored-by: Chris Fallin <chris@cfallin.org>
* Cranelift: use regalloc2 constraints on caller side of ABI code.
This PR updates the shared ABI code and backends to use register-operand
constraints rather than explicit pinned-vreg moves for register
arguments and return values.
The s390x backend was not updated, because it has its own implementation
of ABI code. Ideally we could converge back to the code shared by x64
and aarch64 (which didn't exist when s390x ported calls to ISLE, so the
current situation is underestandable, to be clear!). I'll leave this for
future work.
This PR exposed several places where regalloc2 needed to be a bit more
flexible with constraints; it requires regalloc2#74 to be merged and
pulled in.
* Update to regalloc2 0.3.3.
In addition to version bump, this required removing two asserts as
`SpillSlot`s no longer carry their class (so we can't assert that they
have the correct class).
* Review comments.
* Filetest updates.
* Add cargo-vet audit for regalloc2 0.3.2 -> 0.3.3 upgrade.
* Update to regalloc2 0.4.0.
* ABI: implement register arguments with constraints.
Currently, Cranelift's ABI code emits a sequence of moves from physical
registers into vregs at the top of the function body, one for every
register-carried argument.
For a number of reasons, we want to move to operand constraints instead,
and remove the use of explicitly-named "pinned vregs"; this allows for
better regalloc in theory, as it removes the need to "reverse-engineer"
the sequence of moves.
This PR alters the ABI code so that it generates a single "args"
pseudo-instruction as the first instruction in the function body. This
pseudo-inst defs all register arguments, and constrains them to the
appropriate registers at the def-point. Subsequently the regalloc can
move them wherever it needs to.
Some care was taken not to have this pseudo-inst show up in
post-regalloc disassemblies, but the change did cause a general regalloc
"shift" in many tests, so the precise-output updates are a bit noisy.
Sorry about that!
A subsequent PR will handle the other half of the ABI code, namely, the
callsite case, with a similar preg-to-constraint conversion.
* Update based on review feedback.
* Review feedback.
This is a cherry-pick of a long-ago commit, 2d46637. The original
message reads:
> Now that `SyntheticAmode` can refer to constants, there is no longer a
> need for a separate instruction format--standard load instructions will
> work.
Since then, the transition to ISLE and the use of `XmmLoadConst` in many
more places makes this change a larger diff than the original. The basic
idea is the same, though: the extra indirection of `Inst::XMmLoadConst`
is removed and replaced by a direct use of `VCodeConstant` as a
`SyntheticAmode`. This has no effect on codegen, but the CLIF output is
now clearer in that the actual instruction is displayed (e.g., `movdqu`)
instead of a made-up instruction (`load_const`).
This slipped through the regalloc2 operand code update in #4811: the
CvtFloatToUintSeq pseudo-instruction actually clobbers its source. It
was marked as a "mod" operand in the original and I mistakenly
converted it to a "use" as I had not seen the actual clobber. The
instruction now takes an extra temp and makes a copy of `src` in the
appropriate place.
Fixes#4840.
Add a new pseudo-instruction, XmmUnaryRmRImm, to handle instructions like roundss that only use their first register argument for the instruction's result. This has the added benefit of allowing the isle wrappers for those instructions to take an XmmMem argument, allowing for more cases where loads may be merged.
* x64: clean up regalloc-related semantics on several instructions.
This PR removes all uses of "modify" operands on instructions in the x64
backend, and also removes all uses of "pinned vregs", or vregs that are
explicitly tied to particular physical registers. In place of both of
these mechanisms, which are legacies of the old regalloc design and
supported via compatibility code, the backend now uses operand
constraints. This is more flexible as it allows the regalloc to see the
liveranges and constraints without "reverse-engineering" move instructions.
Eventually, after removing all such uses (including in other backends
and by the ABI code), we can remove the compatibility code in regalloc2,
significantly simplifying its liverange-construction frontend and
thus allowing for higher confidence in correctness as well as possibly a
bit more compilation speed.
Curiously, there are a few extra move instructions now; they are likely
poor splitting decisions and I can try to chase these down later.
* Fix cranelift-codegen tests.
* Review feedback.
Lower extractlane, scalar_to_vector and splat in ISLE.
This PR also makes some changes to the SinkableLoad api
* change the return type of sink_load to RegMem as there are more functions available for dealing with RegMem
* add reg_mem_to_reg_mem_imm and register it as an automatic conversion
Lower `shuffle` and `swizzle` in ISLE.
This PR surfaced a bug with the lowering of `shuffle` when avx512vl and avx512vbmi are enabled: we use `vpermi2b` as the implementation, but panic if the immediate shuffle mask contains any out-of-bounds values. The behavior when the avx512 extensions are not present is that out-of-bounds values are turned into `0` in the result.
I've resolved this by detecting when the shuffle immediate has out-of-bounds indices in the avx512-enabled lowering, and generating an additional mask to zero out the lanes where those indices occur. This brings the avx512 case into line with the semantics of the `shuffle` op: 94bcbe8446/cranelift/codegen/meta/src/shared/instructions.rs (L1495-L1498)
Lower stack_addr, udiv, sdiv, urem, srem, umulhi, and smulhi in ISLE.
For udiv, sdiv, urem, and srem I opted to move the original lowering into an extern constructor, as the interactions with rax and rdx for the div instruction didn't seem meaningful to implement in ISLE. However, I'm happy to revisit this choice and move more of the embedding into ISLE.
* Cranelift: Remove `ABICallee` trait
It has only one implementation: the `ABICalleeImpl` struct. By using that
directly we can avoid unnecessary layers of generics and abstractions as well as
a couple `Box`es that were previously putting the single implementation into a
`Box<dyn>`.
* Cranelift: Rename `ABICalleeImpl` to `AbiCallee`
* Fix comments as per review
* Rename `AbiCallee` to `Callee`
This is the implementation of https://github.com/bytecodealliance/wasmtime/issues/4155, using the "inverted API" approach suggested by @cfallin (thanks!) in Cranelift, and trait object to provide a backend for an all-included experience in Wasmtime.
After the suggestion of Chris, `Function` has been split into mostly two parts:
- on the one hand, `FunctionStencil` contains all the fields required during compilation, and that act as a compilation cache key: if two function stencils are the same, then the result of their compilation (`CompiledCodeBase<Stencil>`) will be the same. This makes caching trivial, as the only thing to cache is the `FunctionStencil`.
- on the other hand, `FunctionParameters` contain the... function parameters that are required to finalize the result of compilation into a `CompiledCode` (aka `CompiledCodeBase<Final>`) with proper final relocations etc., by applying fixups and so on.
Most changes are here to accomodate those requirements, in particular that `FunctionStencil` should be `Hash`able to be used as a key in the cache:
- most source locations are now relative to a base source location in the function, and as such they're encoded as `RelSourceLoc` in the `FunctionStencil`. This required changes so that there's no need to explicitly mark a `SourceLoc` as the base source location, it's automatically detected instead the first time a non-default `SourceLoc` is set.
- user-defined external names in the `FunctionStencil` (aka before this patch `ExternalName::User { namespace, index }`) are now references into an external table of `UserExternalNameRef -> UserExternalName`, present in the `FunctionParameters`, and must be explicitly declared using `Function::declare_imported_user_function`.
- some refactorings have been made for function names:
- `ExternalName` was used as the type for a `Function`'s name; while it thus allowed `ExternalName::Libcall` in this place, this would have been quite confusing to use it there. Instead, a new enum `UserFuncName` is introduced for this name, that's either a user-defined function name (the above `UserExternalName`) or a test case name.
- The future of `ExternalName` is likely to become a full reference into the `FunctionParameters`'s mapping, instead of being "either a handle for user-defined external names, or the thing itself for other variants". I'm running out of time to do this, and this is not trivial as it implies touching ISLE which I'm less familiar with.
The cache computes a sha256 hash of the `FunctionStencil`, and uses this as the cache key. No equality check (using `PartialEq`) is performed in addition to the hash being the same, as we hope that this is sufficient data to avoid collisions.
A basic fuzz target has been introduced that tries to do the bare minimum:
- check that a function successfully compiled and cached will be also successfully reloaded from the cache, and returns the exact same function.
- check that a trivial modification in the external mapping of `UserExternalNameRef -> UserExternalName` hits the cache, and that other modifications don't hit the cache.
- This last check is less efficient and less likely to happen, so probably should be rethought a bit.
Thanks to both @alexcrichton and @cfallin for your very useful feedback on Zulip.
Some numbers show that for a large wasm module we're using internally, this is a 20% compile-time speedup, because so many `FunctionStencil`s are the same, even within a single module. For a group of modules that have a lot of code in common, we get hit rates up to 70% when they're used together. When a single function changes in a wasm module, every other function is reloaded; that's still slower than I expect (between 10% and 50% of the overall compile time), so there's likely room for improvement.
Fixes#4155.
* Add a test for the existing behavior of fcvt_from_unit
* Migrate the I8, I16, I32 cases of fcvt_from_uint
* Implement the I64 case of fcvt_from_uint
* Add a test for the existing behavior of fcvt_from_uint.f64x2
* Migrate fcvt_from_uint.f64x2 to ISLE
* Lower the last case of `fcvt_from_uint`
* Add a test for `fcvt_from_uint`
* Finish lowering fcmp_from_uint
* Format
* Cranelift: Add instructions for getting the current stack/frame pointers and return address
This is the initial part of https://github.com/bytecodealliance/wasmtime/issues/4535
* x64: Remove `Amode::RbpOffset` and use `Amode::ImmReg` instead
We just special case getting operands from `Amode`s now.
* Fix s390x `get_return_address`; require `preserve_frame_pointers=true`
* Assert that `Amode::ImmRegRegShift` doesn't use rbp/rsp
* Handle non-allocatable registers in Amode::with_allocs
* Use "stack" instead of "r15" on s390x
* r14 is an allocatable register on s390x, so it shouldn't be used with `MovPReg`
* Cranellift: remove Baldrdash support and related features.
As noted in Mozilla's bugzilla bug 1781425 [1], the SpiderMonkey team
has recently determined that their current form of integration with
Cranelift is too hard to maintain, and they have chosen to remove it
from their codebase. If and when they decide to build updated support
for Cranelift, they will adopt different approaches to several details
of the integration.
In the meantime, after discussion with the SpiderMonkey folks, they
agree that it makes sense to remove the bits of Cranelift that exist
to support the integration ("Baldrdash"), as they will not need
them. Many of these bits are difficult-to-maintain special cases that
are not actually tested in Cranelift proper: for example, the
Baldrdash integration required Cranelift to emit function bodies
without prologues/epilogues, and instead communicate very precise
information about the expected frame size and layout, then stitched
together something post-facto. This was brittle and caused a lot of
incidental complexity ("fallthrough returns", the resulting special
logic in block-ordering); this is just one example. As another
example, one particular Baldrdash ABI variant processed stack args in
reverse order, so our ABI code had to support both traversal
orders. We had a number of other Baldrdash-specific settings as well
that did various special things.
This PR removes Baldrdash ABI support, the `fallthrough_return`
instruction, and pulls some threads to remove now-unused bits as a
result of those two, with the understanding that the SpiderMonkey folks
will build new functionality as needed in the future and we can perhaps
find cleaner abstractions to make it all work.
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1781425
* Review feedback.
* Fix (?) DWARF debug tests: add `--disable-cache` to wasmtime invocations.
The debugger tests invoke `wasmtime` from within each test case under
the control of a debugger (gdb or lldb). Some of these tests started to
inexplicably fail in CI with unrelated changes, and the failures were
only inconsistently reproducible locally. It seems to be cache related:
if we disable cached compilation on the nested `wasmtime` invocations,
the tests consistently pass.
* Review feedback.
* x64: Add VEX Instruction Encoder
This uses a similar builder pattern to the EVEX Encoder.
Does not yet support memory accesses.
* x64: Add FMA Flag
* x64: Implement SIMD `fma`
* x64: Use 4 register Vex Inst
* x64: Reorder VEX pretty print args
* x64: port `atomic_rmw` to ISLE
This change ports `atomic_rmw` to ISLE for the x64 backend. It does not
change the lowering in any way, though it seems possible that the fixed
regs need not be as fixed and that there are opportunities for single
instruction lowerings. It does rename `inst_common::AtomicRmwOp` to
`MachAtomicRmwOp` to disambiguate with the IR enum with the same name.
* x64: remove remaining hardcoded register constraints for `atomic_rmw`
* x64: use `SyntheticAmode` in `AtomicRmwSeq`
* review: add missing reg collector for amode
* review: collect memory registers in the 'late' phase
- Handle call instructions' clobbers with the clobbers API, using RA2's
clobbers bitmask (bytecodealliance/regalloc2#58) rather than clobbers
list;
- Pull in changes from bytecodealliance/regalloc2#59 for much more sane
edge-case behavior w.r.t. liverange splitting.
`idiv` on x86-64 only reads `rdx`/`edx`/`dx`/`dl` for divides with width
greater than 8 bits; for an 8-bit divide, it reads the whole 16-bit
divisor from `ax`, as our CISC ancestors intended. This PR fixes the
metadata to avoid a regalloc panic (due to undefined `rdx`) in this
case. Does not affect Wasmtime or other Wasm-frontend embedders.
This commit fixes a bug in the previous codegen for the `select`
instruction when the operations of the `select` were of the `v128` type.
Previously teh `XmmCmove` instruction only stored an `OperandSize` of 32
or 64 for a 64 or 32-bit move, but this was also used for these 128-bit
types which meant that when used the wrong move instruction was
generated. The fix applied here is to store the whole `Type` being moved
so the 128-bit variant can be selected as well.
* Remove unused srcloc in MachReloc
* Remove unused srcloc in MachTrap
* Use `into_iter` on array in bench code to suppress a warning
* Remove unused srcloc in MachCallSite
The pretty-printing had swapped dst and src2; this was introduced when
we moved to RA2 (sorry about that! IMHO we should do something to
automate the mapping between regalloc arg collection and pretty
printing/emission).
`src2` comes at the end because it has a variable number of register
mentions; this is in line with how many of the other inst formats work.
Actual emitted code was never incorrect, just the pretty-printing.
Updated test golden outputs look correct to me now, including the one
that we saw was incorrect in #3945.
Currently, we have partial Spectre mitigation: we protect heap accesses
with dynamic bounds checks. Specifically, we guard against errant
accesses on the misspeculated path beyond the bounds-check conditional
branch by adding a conditional move that is also dependent on the
bounds-check condition. This data dependency on the condition is not
speculated and thus will always pick the "safe" value (in the heap case,
a NULL address) on the misspeculated path, until the pipeline flushes
and recovers onto the correct path.
This PR uses the same technique both for table accesses -- used to
implement Wasm tables -- and for jumptables, used to implement Wasm
`br_table` instructions.
In the case of Wasm tables, the cmove picks the table base address on
the misspeculated path. This is equivalent to reading the first table
entry. This prevents loads of arbitrary data addresses on the
misspeculated path.
In the case of `br_table`, the cmove picks index 0 on the misspeculated
path. This is safer than allowing a branch to an address loaded from an
index under misspeculation (i.e., it preserves control-flow integrity
even under misspeculation).
The table mitigation is controlled by a Cranelift setting, on by
default. The br_table mitigation is always on, because it is part of the
single lowering pseudoinstruction. In both cases, the impact should be
minimal: a single extra cmove in a (relatively) rarely-used operation.
The table mitigation is architecture-independent (happens during
legalization); the br_table mitigation has been implemented for both x64
and aarch64. (I don't know enough about s390x to implement this
confidently there, but would happily review a PR to do the same on that
platform.)
x64 backend: add lowerings with load-op-store fusion.
These lowerings use the `OP [mem], reg` forms (or in AT&T syntax, `OP
%reg, (mem)`) -- i.e., x86 instructions that load from memory, perform
an ALU operation, and store the result, all in one instruction. Using
these instruction forms, we can merge three CLIF ops together: a load,
an arithmetic operation, and a store.
Previously, the block successor accumulation and the blockparam branch
arg setup were decoupled. The lowering backend implicitly specified
the order of successor edges via its `MachTerminator` enum on the last
instruction in the block, while the `Lower` toplevel
machine-independent driver set up blockparam branch args in the edge
order seen in CLIF.
In some cases, these orders did not match -- for example, when the
conditional branch depended on an FP condition that was implemented by
swapping taken/not-taken edges and inverting the condition code.
This PR refactors the successor handling to be centralized in `Lower`
rather than flow through the terminator `MachInst`, and adds a
successor block and its blockparam args at the same time, ensuring the
orders match.
This PR switches Cranelift over to the new register allocator, regalloc2.
See [this document](https://gist.github.com/cfallin/08553421a91f150254fe878f67301801)
for a summary of the design changes. This switchover has implications for
core VCode/MachInst types and the lowering pass.
Overall, this change brings improvements to both compile time and speed of
generated code (runtime), as reported in #3942:
```
Benchmark Compilation (wallclock) Execution (wallclock)
blake3-scalar 25% faster 28% faster
blake3-simd no diff no diff
meshoptimizer 19% faster 17% faster
pulldown-cmark 17% faster no diff
bz2 15% faster no diff
SpiderMonkey, 21% faster 2% faster
fib(30)
clang.wasm 42% faster N/A
```
* x64: port `select` using an FP comparison to ISLE
This change includes quite a few interlocking parts, required mainly by
the current x64 conventions in ISLE:
- it adds a way to emit a `cmove` with multiple OR-ing conditions;
because x64 ISLE cannot currently safely emit a comparison followed
by several jumps, this adds `MachInst::CmoveOr` and
`MachInst::XmmCmoveOr` macro instructions. Unfortunately, these macro
instructions hide the multi-instruction sequence in `lower.isle`
- to properly keep track of what instructions consume and produce
flags, @cfallin added a way to pass around variants of
`ConsumesFlags` and `ProducesFlags`--these changes affect all
backends
- then, to lower the `fcmp + select` CLIF, this change adds several
`cmove*_from_values` helpers that perform all of the awkward
conversions between `Value`, `ValueReg`, `Reg`, and `Gpr/Xmm`; one
upside is that now these lowerings have much-improved documentation
explaining why the various `FloatCC` and `CC` choices are made the
the way they are.
Co-authored-by: Chris Fallin <chris@cfallin.org>
We already defined the `Gpr` newtype and used it in a few places, and we already
defined the `Xmm` newtype and used it extensively. This finishes the transition
to using the newtypes extensively in lowering by making use of `Gpr` in more
places.
Fixes#3685
This primary motivation of this large commit (apologies for its size!) is to
introduce `Gpr` and `Xmm` newtypes over `Reg`. This should help catch
difficult-to-diagnose register class mixup bugs in x64 lowerings.
But having a newtype for `Gpr` and `Xmm` themselves isn't enough to catch all of
our operand-with-wrong-register-class bugs, because about 50% of operands on x64
aren't just a register, but a register or memory address or even an
immediate! So we have `{Gpr,Xmm}Mem[Imm]` newtypes as well.
Unfortunately, `GprMem` et al can't be `enum`s and are therefore a little bit
noisier to work with from ISLE. They need to maintain the invariant that their
registers really are of the claimed register class, so they need to encapsulate
the inner data. If they exposed the underlying `enum` variants, then anyone
could just change register classes or construct a `GprMem` that holds an XMM
register, defeating the whole point of these newtypes. So when working with
these newtypes from ISLE, we rely on external constructors like `(gpr_to_gpr_mem
my_gpr)` instead of `(GprMem.Gpr my_gpr)`.
A bit of extra lines of code are included to add support for register mapping
for all of these newtypes as well. Ultimately this is all a bit wordier than I'd
hoped it would be when I first started authoring this commit, but I think it is
all worth it nonetheless!
In the process of adding these newtypes, I didn't want to have to update both
the ISLE `extern` type definition of `MInst` and the Rust definition, so I move
the definition fully into ISLE, similar as aarch64.
Finally, this process isn't complete. I've introduced the newtypes here, and
I've made most XMM-using instructions switch from `Reg` to `Xmm`, as well as
register class-converting instructions, but I haven't moved all of the GPR-using
instructions over to the newtypes yet. I figured this commit was big enough as
it was, and I can continue the adoption of these newtypes in follow up commits.
Part of #3685.