* x64: Add VEX Instruction Encoder
This uses a similar builder pattern to the EVEX Encoder.
Does not yet support memory accesses.
* x64: Add FMA Flag
* x64: Implement SIMD `fma`
* x64: Use 4 register Vex Inst
* x64: Reorder VEX pretty print args
* Allow 64-bit vectors and implement for interpreter
The AArch64 backend already supports 64-bit vectors; this simply allows
instructions to make use of that.
Implemented support for 64-bit vectors within the interpreter to allow
interpret runtests to use them.
Copyright (c) 2022 Arm Limited
* Disable 64-bit SIMD `iaddpairwise` tests on s390x
Copyright (c) 2022 Arm Limited
* [AArch64] Port SIMD narrowing to ISLE
Fvdemote, snarrow, unarrow and uunarrow.
Also refactor the aarch64 instructions descriptions to parameterize
on ScalarSize instead of using different opcodes.
The zero_value pure constructor has been introduced and used by the
integer narrow operations and it replaces, and extends, the compare
zero patterns.
Copright (c) 2022, Arm Limited.
* use short 'if' patterns
This enables more runtests to be executed on s390x. Doing so
uncovered a two back-end bugs, which are fixed as well:
- The result of cls was always off by one.
- The result of popcnt.i16 has uninitialized high bits.
In addition, I found a bug in the load-op-store.clif test case:
v3 = heap_addr.i64 heap0, v1, 4
v4 = iconst.i64 42
store.i32 v4, v3
This was clearly intended to perform a 32-bit store, but
actually performs a 64-bit store (it seems the type annotation
of the store opcode is ignored, and the type of the operand
is used instead). That bug did not show any noticable symptoms
on little-endian architectures, but broke on big-endian.
* cranelift: Restrict `br_table` to `i32` indices
In #4498 it was proposed that we should only accept `i32` indices
to `br_table`. The rationale for this is that larger types lead the
users to a false sense of flexibility (since we don't support jump
tables larger than u32's), and narrower types are not well tested
paths that would be safer if we removed them.
* cranelift: Reduce directly from i128 to i32 in Switch
Converted the existing implementations for the following opcodes to ISLE
on AArch64:
- `sqrt`
- `fneg`
- `fabs`
- `fpromote`
- `fdemote`
- `ceil`
- `floor`
- `trunc`
- `nearest`
Copyright (c) 2022 Arm Limited
On s390x, we do not have a frame pointer that can be used to chain
stack frames for easy unwinding. Instead, our ABI defines a stack
"backchain" mechanism that can be used to the same effect.
This PR uses that backchain mechanism to implement the new
preserve_frame_pointers flags introduced here:
https://github.com/bytecodealliance/wasmtime/pull/4469
Converted the existing implementations for the following Opcodes to ISLE on AArch64:
- `fadd`
- `fsub`
- `fmul`
- `fdiv`
- `fmin`
- `fmax`
- `fmin_pseudo`
- `fmax_pseudo`
Copyright (c) 2022 Arm Limited
This adds full support for all Cranelift SIMD instructions
to the s390x target. Everything is matched fully via ISLE.
In addition to adding support for many new instructions,
and the lower.isle code to match all SIMD IR patterns,
this patch also adds ABI support for vector types.
In particular, we now need to handle the fact that
vector registers 8 .. 15 are partially callee-saved,
i.e. the high parts of those registers (which correspond
to the old floating-poing registers) are callee-saved,
but the low parts are not. This is the exact same situation
that we already have on AArch64, and so this patch uses the
same solution (the is_included_in_clobbers callback).
The bulk of the changes are platform-specific, but there are
a few exceptions:
- Added ISLE extractors for the Immediate and Constant types,
to enable matching the vconst and swizzle instructions.
- Added a missing accessor for call_conv to ABISig.
- Fixed endian conversion for vector types in data_value.rs
to enable their use in runtests on the big-endian platforms.
- Enabled (nearly) all SIMD runtests on s390x. [ Two test cases
remain disabled due to vector shift count semantics, see below. ]
- Enabled all Wasmtime SIMD tests on s390x.
There are three minor issues, called out via FIXMEs below,
which should be addressed in the future, but should not be
blockers to getting this patch merged. I've opened the
following issues to track them:
- Vector shift count semantics
https://github.com/bytecodealliance/wasmtime/issues/4424
- is_included_in_clobbers vs. link register
https://github.com/bytecodealliance/wasmtime/issues/4425
- gen_constant callback
https://github.com/bytecodealliance/wasmtime/issues/4426
All tests, including all newly enabled SIMD tests, pass
on both z14 and z15 architectures.
* Implement `iabs` in ISLE (AArch64)
Converts the existing implementation of `iabs` for AArch64 into ISLE,
and fixes support for `iabs` on scalar values.
Copyright (c) 2022 Arm Limited.
* Improve scalar `iabs` implementation.
Also introduces `CSNeg` instruction.
Copyright (c) 2022 Arm Limited
* Convert `scalar_to_vector` to ISLE (AArch64)
Converted the exisiting implementation of `scalar_to_vector` for AArch64 to
ISLE.
Copyright (c) 2022 Arm Limited
* Add support for floats and fix FpuExtend
- Added rules to cover `f32 -> f32x4` and `f64 -> f64x2` for
`scalar_to_vector`
- Added tests for `scalar_to_vector` on floats.
- Corrected an invalid instruction emitted by `FpuExtend` on 64-bit
values.
Copyright (c) 2022 Arm Limited
Introduce a new concept in the IR that allows a producer to create
dynamic vector types. An IR function can now contain global value(s)
that represent a dynamic scaling factor, for a given fixed-width
vector type. A dynamic type is then created by 'multiplying' the
corresponding global value with a fixed-width type. These new types
can be used just like the existing types and the type system has a
set of hard-coded dynamic types, such as I32X4XN, which the user
defined types map onto. The dynamic types are also used explicitly
to create dynamic stack slots, which have no set size like their
existing counterparts. New IR instructions are added to access these
new stack entities.
Currently, during codegen, the dynamic scaling factor has to be
lowered to a constant so the dynamic slots do eventually have a
compile-time known size, as do spill slots.
The current lowering for aarch64 just targets Neon, using a dynamic
scale of 1.
Copyright (c) 2022, Arm Limited.
@yuyang-ok reported via zulip that i128 overflow tests were:
1. different from the interpreter implementation
2. wrong on some of the test cases
This fixes both the tests and the aarch64 implementation and adds the
interpreter to the testsuite.
* x64: port `atomic_rmw` to ISLE
This change ports `atomic_rmw` to ISLE for the x64 backend. It does not
change the lowering in any way, though it seems possible that the fixed
regs need not be as fixed and that there are opportunities for single
instruction lowerings. It does rename `inst_common::AtomicRmwOp` to
`MachAtomicRmwOp` to disambiguate with the IR enum with the same name.
* x64: remove remaining hardcoded register constraints for `atomic_rmw`
* x64: use `SyntheticAmode` in `AtomicRmwSeq`
* review: add missing reg collector for amode
* review: collect memory registers in the 'late' phase
This defines the full set of 32 128-bit vector registers on s390x.
(Note that the VRs overlap the existing FPRs.) In addition, this
adds support to use all 32 vector registers to implement floating-
point operations, by using vector floating-point instructions with
the 'W' bit set to operate only on the first element.
This part of the vector instruction set mostly matches the old FP
instruction set, with two exceptions:
- There is no vector version of the COPY SIGN instruction. Instead,
now use a VECTOR SELECT with an appropriate bit mask to implement
the fcopysign operation.
- There are no vector version of the float <-> int conversion
instructions where source and target differ in bit size. Use
appropriate multiple conversion steps instead. This also requires
use of explicit checking to implement correct overflow handling.
As a side effect, this version now also implements the i8 / i16
variants of all conversions, which had been missing so far.
For all operations except those two above, we continue to use the
old FP instruction if applicable (i.e. if all operands happen to
have been allocated to the original FP register set), and use the
vector instruction otherwise.
Move from passing and returning u8 and u16 values to u32 in many of
the functions. This removes a number of type conversions and gives
a small compilation time speedup, around ~0.7% on my aarch64 machine.
Copyright (c) 2022, Arm Limited.
Now that lowering is fully done in ISLE, clean up some code remnants
in lower.rs. In particular, move code to lower/isle.rs where
possible, and inline lower_insn_to_regs into its caller and simplify.
This adds infrastructure to allow implementing call and return
instructions in ISLE, and migrates the s390x back-end.
To implement ABI details, this patch creates public accessors
for `ABISig` and makes them accessible in ISLE. All actual
code generation is then done in ISLE rules, following the
information provided by that signature.
[ Note that the s390x back end never requires multiple slots for
a single argument - the infrastructure to handle this should
already be present, however. ]
To implement loops in ISLE rules, this patch uses regular tail
recursion, employing a `Range` data structure holding a range
of integers to be looped over.
- Handle call instructions' clobbers with the clobbers API, using RA2's
clobbers bitmask (bytecodealliance/regalloc2#58) rather than clobbers
list;
- Pull in changes from bytecodealliance/regalloc2#59 for much more sane
edge-case behavior w.r.t. liverange splitting.
The previous `cls` code was producing wrong results when fed with a -1 i8.
The fix here is to sign extend instead of zero extending since we want
to keep the sign bit as one in order for it to be counted correctly
in the cls instruction
This also merges the interpreter only tests now that aarch64
correctly supports this instruction
`idiv` on x86-64 only reads `rdx`/`edx`/`dx`/`dl` for divides with width
greater than 8 bits; for an 8-bit divide, it reads the whole 16-bit
divisor from `ax`, as our CISC ancestors intended. This PR fixes the
metadata to avoid a regalloc panic (due to undefined `rdx`) in this
case. Does not affect Wasmtime or other Wasm-frontend embedders.
This commit fixes a mistake in the `Swizzle` opcode implementation in
the x64 backend of Cranelift. Previously an input register was casted to
a writable register and then modified, which I believe instructions are
not supposed to do. This was discovered as part of my investigation
into #4315.
This commit fixes a bug in the previous codegen for the `select`
instruction when the operations of the `select` were of the `v128` type.
Previously teh `XmmCmove` instruction only stored an `OperandSize` of 32
or 64 for a 64 or 32-bit move, but this was also used for these 128-bit
types which meant that when used the wrong move instruction was
generated. The fix applied here is to store the whole `Type` being moved
so the 128-bit variant can be selected as well.
Now the fiber implementation on AArch64 authenticates function
return addresses and includes the relevant BTI instructions, except
on macOS.
Also, change the locations of the saved FP and LR registers on the
fiber stack to make them compliant with the Procedure Call Standard
for the Arm 64-bit Architecture.
Copyright (c) 2022, Arm Limited.
The current lowering helper for `cmpxchg` returns the literal RealReg
`rax` as its result. However, this breaks a number of invariants, and
eventually causes a regalloc panic if used as a blockparam arg (pinned
vregs cannot be used in this way).
In general we have to return regular vregs, not a RealReg, as results of
instructions during lowering. However #4223 added a helper for
`x64_cmpxchg` that returns a literal `rax`.
Fortunately we can do the right thing here by just giving a fresh vreg
to the instruction; the regalloc constraints mean that this vreg is
constrained to `rax` at the instruction (at its def/late point), so the
generator of the instruction need not worry about `rax` here.
If an address expression is given to `to_amode` that is completely
constant (no registers at all), then it will produce an `Amode` that has
the resulting constant as an offset, and `(invalid_reg)` as the base.
This is a side-effect of the way we build up the amode step-by-step --
we're waiting to see a register and plug it into the base field. If we
never get a reg though, we need to generate a constant zero into a
register and use that as the base. This PR adds a `finalize_amode`
helper to do just that.
Fixes#4234.