* aarch64: constant generation cleanup
Add support for MOVZ and MOVN generation via ISLE.
Handle f32const, f64const, and nop instructions via ISLE.
No longer call Inst::gen_constant from lower.rs.
* riscv64: constant generation cleanup
Handle f32const, f64const, and nop instructions via ISLE.
* s390x: constant generation cleanup
Fix rule priorities for "imm" term.
Only handle 32-bit stack offsets; no longer use load_constant64.
* x64: constant generation cleanup
No longer call Inst::gen_constant from lower.rs or abi.rs.
* Refactor LowerBackend::lower to return InstOutput
No longer write to the per-insn output registers; instead, return
an InstOutput vector of temp registers holding the outputs.
This will allow calling LowerBackend::lower multiple times for
the same instruction, e.g. to rematerialize constants.
When emitting the primary copy of the instruction during lowering,
writing to the per-insn registers is now done in lower_clif_block.
As a result, the ISLE lower_common routine is no longer needed.
In addition, the InsnOutput type and all code related to it
can be removed as well.
* Refactor IsleContext to hold a LowerBackend reference
Remove the "triple", "flags", and "isa_flags" fields that are
copied from LowerBackend to each IsleContext, and instead just
hold a reference to LowerBackend in IsleContext.
This will allow calling LowerBackend::lower from within callbacks
in src/machinst/isle.rs, e.g. to rematerialize constants.
To avoid having to pass LowerBackend references through multiple
functions, eliminate the lower_insn_to_regs subroutines in those
targets that still have them, and just inline into the main
lower routine. This also eliminates lower_inst.rs on aarch64
and riscv64.
Replace all accesses to the removed IsleContext fields by going
through the LowerBackend reference.
* Remove MachInst::gen_constant
This addresses the problem described in issue
https://github.com/bytecodealliance/wasmtime/issues/4426
that targets currently have to duplicate code to emit
constants between the ISLE logic and the gen_constant
callback.
After the various cleanups in earlier patches in this series,
the only remaining user of get_constant is put_value_in_regs
in Lower. This can now be removed, and instead constant
rematerialization can be performed in the put_in_regs ISLE
callback by simply directly calling LowerBackend::lower
on the instruction defining the constant (using a different
output register).
Since the check for egraph mode is now no longer performed in
put_value_in_regs, the Lower::flags member becomes obsolete.
Care needs to be taken that other calls directly to the
Lower::put_value_in_regs routine now handle the fact that
no more rematerialization is performed. All such calls in
target code already historically handle constants themselves.
The remaining call site in the ISLE gen_call_common helper
can be redirected to the ISLE put_in_regs callback.
The existing target implementations of gen_constant are then
unused and can be removed. (In some target there may still
be further opportunities to remove duplication between ISLE
and some local Rust code - this can be left to future patches.)
This commit prepares the x64 pieces from cranelift codegen to be consumed by
Winch for binary emission. This change doesn't introduce or modifies
functionality it makes the necessary pieces for binary emission public.
This change also improves documentation where applicable.
All instructions using the CPU flags types (IFLAGS/FFLAGS) were already
removed. This patch completes the cleanup by removing all remaining
instructions that define values of CPU flags types, as well as the
types themselves.
Specifically, the following features are removed:
- The IFLAGS and FFLAGS types and the SpecialType category.
- Special handling of IFLAGS and FFLAGS in machinst/isle.rs and
machinst/lower.rs.
- The ifcmp, ifcmp_imm, ffcmp, iadd_ifcin, iadd_ifcout, iadd_ifcarry,
isub_ifbin, isub_ifbout, and isub_ifborrow instructions.
- The writes_cpu_flags instruction property.
- The flags verifier pass.
- Flags handling in the interpreter.
All of these features are currently unused; no functional change
intended by this patch.
This addresses https://github.com/bytecodealliance/wasmtime/issues/3249.
Enable regalloc2's SSA verifier in debug builds to check for any outstanding reuse of virtual registers in def constraints. As fuzzing enables debug_assertions, this will enable the SSA verifier when fuzzing as well.
Introduce a temporary for an intermediate value in the lowering of div in the x64 backend. Additionally, add a src argument to the shift_r smart constructor, which is why the diff got larger than just the div lowering.
Avoid naming %rcx as written by the CoffTlsGetAddr pseudo-instruction in the x64 backend, and instead emit a fixed-def constraint for a fresh VReg and %rcx.
Modify return pseudo-instructions to have pairs of registers: virtual and real. This allows us to constrain the virtual registers to the real ones specified by the abi, instead of directly emitting moves to those real registers.
Adds Bswap to the Cranelift IR. Implements the Bswap instruction
in the x64 and aarch64 codegen backends. Cranelift users can now:
```
builder.ins().bswap(value)
```
to get a native byteswap instruction.
* x64: implements the 32- and 64-bit bswap instruction, following
the pattern set by similar unary instrutions (Neg and Not) - it
only operates on a dst register, but is parameterized with both
a src and dst which are expected to be the same register.
As x64 bswap instruction is only for 32- or 64-bit registers,
the 16-bit swap is implemented as a rotate left by 8.
Updated x64 RexFlags type to support emitting for single-operand
instructions like bswap
* aarch64: Bswap gets emitted as aarch64 rev16, rev32,
or rev64 instruction as appropriate.
* s390x: Bswap was already supported in backend, just had to add
a bit of plumbing
* For completeness, added bswap to the interpreter as well.
* added filetests and runtests for each ISA
* added bswap to fuzzgen, thanks to afonso360 for the code there
* 128-bit swaps are not yet implemented, that can be done later
Remove the boolean types from cranelift, and the associated instructions breduce, bextend, bconst, and bint. Standardize on using 1/0 for the return value from instructions that produce scalar boolean results, and -1/0 for boolean vector elements.
Fixes#3205
Co-authored-by: Afonso Bordado <afonso360@users.noreply.github.com>
Co-authored-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Co-authored-by: Chris Fallin <chris@cfallin.org>
* Cranelift: use regalloc2 constraints on caller side of ABI code.
This PR updates the shared ABI code and backends to use register-operand
constraints rather than explicit pinned-vreg moves for register
arguments and return values.
The s390x backend was not updated, because it has its own implementation
of ABI code. Ideally we could converge back to the code shared by x64
and aarch64 (which didn't exist when s390x ported calls to ISLE, so the
current situation is underestandable, to be clear!). I'll leave this for
future work.
This PR exposed several places where regalloc2 needed to be a bit more
flexible with constraints; it requires regalloc2#74 to be merged and
pulled in.
* Update to regalloc2 0.3.3.
In addition to version bump, this required removing two asserts as
`SpillSlot`s no longer carry their class (so we can't assert that they
have the correct class).
* Review comments.
* Filetest updates.
* Add cargo-vet audit for regalloc2 0.3.2 -> 0.3.3 upgrade.
* Update to regalloc2 0.4.0.
* ABI: implement register arguments with constraints.
Currently, Cranelift's ABI code emits a sequence of moves from physical
registers into vregs at the top of the function body, one for every
register-carried argument.
For a number of reasons, we want to move to operand constraints instead,
and remove the use of explicitly-named "pinned vregs"; this allows for
better regalloc in theory, as it removes the need to "reverse-engineer"
the sequence of moves.
This PR alters the ABI code so that it generates a single "args"
pseudo-instruction as the first instruction in the function body. This
pseudo-inst defs all register arguments, and constrains them to the
appropriate registers at the def-point. Subsequently the regalloc can
move them wherever it needs to.
Some care was taken not to have this pseudo-inst show up in
post-regalloc disassemblies, but the change did cause a general regalloc
"shift" in many tests, so the precise-output updates are a bit noisy.
Sorry about that!
A subsequent PR will handle the other half of the ABI code, namely, the
callsite case, with a similar preg-to-constraint conversion.
* Update based on review feedback.
* Review feedback.
This is a cherry-pick of a long-ago commit, 2d46637. The original
message reads:
> Now that `SyntheticAmode` can refer to constants, there is no longer a
> need for a separate instruction format--standard load instructions will
> work.
Since then, the transition to ISLE and the use of `XmmLoadConst` in many
more places makes this change a larger diff than the original. The basic
idea is the same, though: the extra indirection of `Inst::XMmLoadConst`
is removed and replaced by a direct use of `VCodeConstant` as a
`SyntheticAmode`. This has no effect on codegen, but the CLIF output is
now clearer in that the actual instruction is displayed (e.g., `movdqu`)
instead of a made-up instruction (`load_const`).
This slipped through the regalloc2 operand code update in #4811: the
CvtFloatToUintSeq pseudo-instruction actually clobbers its source. It
was marked as a "mod" operand in the original and I mistakenly
converted it to a "use" as I had not seen the actual clobber. The
instruction now takes an extra temp and makes a copy of `src` in the
appropriate place.
Fixes#4840.
Add a new pseudo-instruction, XmmUnaryRmRImm, to handle instructions like roundss that only use their first register argument for the instruction's result. This has the added benefit of allowing the isle wrappers for those instructions to take an XmmMem argument, allowing for more cases where loads may be merged.
* x64: clean up regalloc-related semantics on several instructions.
This PR removes all uses of "modify" operands on instructions in the x64
backend, and also removes all uses of "pinned vregs", or vregs that are
explicitly tied to particular physical registers. In place of both of
these mechanisms, which are legacies of the old regalloc design and
supported via compatibility code, the backend now uses operand
constraints. This is more flexible as it allows the regalloc to see the
liveranges and constraints without "reverse-engineering" move instructions.
Eventually, after removing all such uses (including in other backends
and by the ABI code), we can remove the compatibility code in regalloc2,
significantly simplifying its liverange-construction frontend and
thus allowing for higher confidence in correctness as well as possibly a
bit more compilation speed.
Curiously, there are a few extra move instructions now; they are likely
poor splitting decisions and I can try to chase these down later.
* Fix cranelift-codegen tests.
* Review feedback.
Lower extractlane, scalar_to_vector and splat in ISLE.
This PR also makes some changes to the SinkableLoad api
* change the return type of sink_load to RegMem as there are more functions available for dealing with RegMem
* add reg_mem_to_reg_mem_imm and register it as an automatic conversion
Lower `shuffle` and `swizzle` in ISLE.
This PR surfaced a bug with the lowering of `shuffle` when avx512vl and avx512vbmi are enabled: we use `vpermi2b` as the implementation, but panic if the immediate shuffle mask contains any out-of-bounds values. The behavior when the avx512 extensions are not present is that out-of-bounds values are turned into `0` in the result.
I've resolved this by detecting when the shuffle immediate has out-of-bounds indices in the avx512-enabled lowering, and generating an additional mask to zero out the lanes where those indices occur. This brings the avx512 case into line with the semantics of the `shuffle` op: 94bcbe8446/cranelift/codegen/meta/src/shared/instructions.rs (L1495-L1498)
Lower stack_addr, udiv, sdiv, urem, srem, umulhi, and smulhi in ISLE.
For udiv, sdiv, urem, and srem I opted to move the original lowering into an extern constructor, as the interactions with rax and rdx for the div instruction didn't seem meaningful to implement in ISLE. However, I'm happy to revisit this choice and move more of the embedding into ISLE.
* Cranelift: Remove `ABICallee` trait
It has only one implementation: the `ABICalleeImpl` struct. By using that
directly we can avoid unnecessary layers of generics and abstractions as well as
a couple `Box`es that were previously putting the single implementation into a
`Box<dyn>`.
* Cranelift: Rename `ABICalleeImpl` to `AbiCallee`
* Fix comments as per review
* Rename `AbiCallee` to `Callee`
This is the implementation of https://github.com/bytecodealliance/wasmtime/issues/4155, using the "inverted API" approach suggested by @cfallin (thanks!) in Cranelift, and trait object to provide a backend for an all-included experience in Wasmtime.
After the suggestion of Chris, `Function` has been split into mostly two parts:
- on the one hand, `FunctionStencil` contains all the fields required during compilation, and that act as a compilation cache key: if two function stencils are the same, then the result of their compilation (`CompiledCodeBase<Stencil>`) will be the same. This makes caching trivial, as the only thing to cache is the `FunctionStencil`.
- on the other hand, `FunctionParameters` contain the... function parameters that are required to finalize the result of compilation into a `CompiledCode` (aka `CompiledCodeBase<Final>`) with proper final relocations etc., by applying fixups and so on.
Most changes are here to accomodate those requirements, in particular that `FunctionStencil` should be `Hash`able to be used as a key in the cache:
- most source locations are now relative to a base source location in the function, and as such they're encoded as `RelSourceLoc` in the `FunctionStencil`. This required changes so that there's no need to explicitly mark a `SourceLoc` as the base source location, it's automatically detected instead the first time a non-default `SourceLoc` is set.
- user-defined external names in the `FunctionStencil` (aka before this patch `ExternalName::User { namespace, index }`) are now references into an external table of `UserExternalNameRef -> UserExternalName`, present in the `FunctionParameters`, and must be explicitly declared using `Function::declare_imported_user_function`.
- some refactorings have been made for function names:
- `ExternalName` was used as the type for a `Function`'s name; while it thus allowed `ExternalName::Libcall` in this place, this would have been quite confusing to use it there. Instead, a new enum `UserFuncName` is introduced for this name, that's either a user-defined function name (the above `UserExternalName`) or a test case name.
- The future of `ExternalName` is likely to become a full reference into the `FunctionParameters`'s mapping, instead of being "either a handle for user-defined external names, or the thing itself for other variants". I'm running out of time to do this, and this is not trivial as it implies touching ISLE which I'm less familiar with.
The cache computes a sha256 hash of the `FunctionStencil`, and uses this as the cache key. No equality check (using `PartialEq`) is performed in addition to the hash being the same, as we hope that this is sufficient data to avoid collisions.
A basic fuzz target has been introduced that tries to do the bare minimum:
- check that a function successfully compiled and cached will be also successfully reloaded from the cache, and returns the exact same function.
- check that a trivial modification in the external mapping of `UserExternalNameRef -> UserExternalName` hits the cache, and that other modifications don't hit the cache.
- This last check is less efficient and less likely to happen, so probably should be rethought a bit.
Thanks to both @alexcrichton and @cfallin for your very useful feedback on Zulip.
Some numbers show that for a large wasm module we're using internally, this is a 20% compile-time speedup, because so many `FunctionStencil`s are the same, even within a single module. For a group of modules that have a lot of code in common, we get hit rates up to 70% when they're used together. When a single function changes in a wasm module, every other function is reloaded; that's still slower than I expect (between 10% and 50% of the overall compile time), so there's likely room for improvement.
Fixes#4155.
* Add a test for the existing behavior of fcvt_from_unit
* Migrate the I8, I16, I32 cases of fcvt_from_uint
* Implement the I64 case of fcvt_from_uint
* Add a test for the existing behavior of fcvt_from_uint.f64x2
* Migrate fcvt_from_uint.f64x2 to ISLE
* Lower the last case of `fcvt_from_uint`
* Add a test for `fcvt_from_uint`
* Finish lowering fcmp_from_uint
* Format
* Cranelift: Add instructions for getting the current stack/frame pointers and return address
This is the initial part of https://github.com/bytecodealliance/wasmtime/issues/4535
* x64: Remove `Amode::RbpOffset` and use `Amode::ImmReg` instead
We just special case getting operands from `Amode`s now.
* Fix s390x `get_return_address`; require `preserve_frame_pointers=true`
* Assert that `Amode::ImmRegRegShift` doesn't use rbp/rsp
* Handle non-allocatable registers in Amode::with_allocs
* Use "stack" instead of "r15" on s390x
* r14 is an allocatable register on s390x, so it shouldn't be used with `MovPReg`
* Cranellift: remove Baldrdash support and related features.
As noted in Mozilla's bugzilla bug 1781425 [1], the SpiderMonkey team
has recently determined that their current form of integration with
Cranelift is too hard to maintain, and they have chosen to remove it
from their codebase. If and when they decide to build updated support
for Cranelift, they will adopt different approaches to several details
of the integration.
In the meantime, after discussion with the SpiderMonkey folks, they
agree that it makes sense to remove the bits of Cranelift that exist
to support the integration ("Baldrdash"), as they will not need
them. Many of these bits are difficult-to-maintain special cases that
are not actually tested in Cranelift proper: for example, the
Baldrdash integration required Cranelift to emit function bodies
without prologues/epilogues, and instead communicate very precise
information about the expected frame size and layout, then stitched
together something post-facto. This was brittle and caused a lot of
incidental complexity ("fallthrough returns", the resulting special
logic in block-ordering); this is just one example. As another
example, one particular Baldrdash ABI variant processed stack args in
reverse order, so our ABI code had to support both traversal
orders. We had a number of other Baldrdash-specific settings as well
that did various special things.
This PR removes Baldrdash ABI support, the `fallthrough_return`
instruction, and pulls some threads to remove now-unused bits as a
result of those two, with the understanding that the SpiderMonkey folks
will build new functionality as needed in the future and we can perhaps
find cleaner abstractions to make it all work.
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1781425
* Review feedback.
* Fix (?) DWARF debug tests: add `--disable-cache` to wasmtime invocations.
The debugger tests invoke `wasmtime` from within each test case under
the control of a debugger (gdb or lldb). Some of these tests started to
inexplicably fail in CI with unrelated changes, and the failures were
only inconsistently reproducible locally. It seems to be cache related:
if we disable cached compilation on the nested `wasmtime` invocations,
the tests consistently pass.
* Review feedback.
* x64: Add VEX Instruction Encoder
This uses a similar builder pattern to the EVEX Encoder.
Does not yet support memory accesses.
* x64: Add FMA Flag
* x64: Implement SIMD `fma`
* x64: Use 4 register Vex Inst
* x64: Reorder VEX pretty print args
* x64: port `atomic_rmw` to ISLE
This change ports `atomic_rmw` to ISLE for the x64 backend. It does not
change the lowering in any way, though it seems possible that the fixed
regs need not be as fixed and that there are opportunities for single
instruction lowerings. It does rename `inst_common::AtomicRmwOp` to
`MachAtomicRmwOp` to disambiguate with the IR enum with the same name.
* x64: remove remaining hardcoded register constraints for `atomic_rmw`
* x64: use `SyntheticAmode` in `AtomicRmwSeq`
* review: add missing reg collector for amode
* review: collect memory registers in the 'late' phase
- Handle call instructions' clobbers with the clobbers API, using RA2's
clobbers bitmask (bytecodealliance/regalloc2#58) rather than clobbers
list;
- Pull in changes from bytecodealliance/regalloc2#59 for much more sane
edge-case behavior w.r.t. liverange splitting.
`idiv` on x86-64 only reads `rdx`/`edx`/`dx`/`dl` for divides with width
greater than 8 bits; for an 8-bit divide, it reads the whole 16-bit
divisor from `ax`, as our CISC ancestors intended. This PR fixes the
metadata to avoid a regalloc panic (due to undefined `rdx`) in this
case. Does not affect Wasmtime or other Wasm-frontend embedders.
This commit fixes a bug in the previous codegen for the `select`
instruction when the operations of the `select` were of the `v128` type.
Previously teh `XmmCmove` instruction only stored an `OperandSize` of 32
or 64 for a 64 or 32-bit move, but this was also used for these 128-bit
types which meant that when used the wrong move instruction was
generated. The fix applied here is to store the whole `Type` being moved
so the 128-bit variant can be selected as well.