This changes the following:
mov x0, #4
ldr x0, [x1, #4]
Into:
ldr x0, [x1]
I noticed this pattern (but with #0), in a benchmark.
Copyright (c) 2020, Arm Limited.
* this requires upgrading to wasmparser 0.67.0.
* There are no CLIF side changes because the CLIF `select` instruction is
polymorphic enough.
* on aarch64, there is unfortunately no conditional-move (csel) instruction on
vectors. This patch adds a synthetic instruction `VecCSel` which *does*
behave like that. At emit time, this is emitted as an if-then-else diamond
(4 insns).
* aarch64 implementation is otherwise straightforwards.
In existing MachInst backends, many instructions -- any that can trap or
result in a relocation -- carry `SourceLoc` values in order to propagate
the location-in-original-source to use to describe resulting traps or
relocation errors.
This is quite tedious, and also error-prone: it is likely that the
necessary plumbing will be missed in some cases, and in any case, it's
unnecessarily verbose.
This PR factors out the `SourceLoc` handling so that it is tracked
during emission as part of the `EmitState`, and plumbed through
automatically by the machine-independent framework. Instruction emission
code that directly emits trap or relocation records can query the
current location as necessary. Then we only need to ensure that memory
references and trap instructions, at their (one) emission point rather
than their (many) lowering/generation points, are wired up correctly.
This does have the side-effect that some loads and stores that do not
correspond directly to user code's heap accesses will have unnecessary
but harmless trap metadata. For example, the load that fetches a code
offset from a jump table will have a 'heap out of bounds' trap record
attached to it; but because it is bounds-checked, and will never
actually trap if the lowering is correct, this should be harmless. The
simplicity improvement here seemed more worthwhile to me than plumbing
through a "corresponds to user-level load/store" bit, because the latter
is a bit complex when we allow for op merging.
Closes#2290: though it does not implement a full "metadata" scheme as
described in that issue, this seems simpler overall.
* Make cranelift_codegen::isa::unwind::input public
* Move UnwindCode's common offset field out of the structure
* Make MachCompileResult::unwind_info more generic
* Record initial stack pointer offset
This patch implements, for aarch64, the following wasm SIMD extensions.
v128.load32_zero and v128.load64_zero instructions
https://github.com/WebAssembly/simd/pull/237
The changes are straightforward:
* no new CLIF instructions. They are translated into an existing CLIF scalar
load followed by a CLIF `scalar_to_vector`.
* the comment/specification for CLIF `scalar_to_vector` has been changed to
match the actual intended semantics, per consulation with Andrew Brown.
* translation from `scalar_to_vector` to aarch64 `fmov` instruction. This
has been generalised slightly so as to allow both 32- and 64-bit transfers.
* special-case zero in `lower_constant_f128` in order to avoid a
potentially slow call to `Inst::load_fp_constant128`.
* Once "Allow loads to merge into other operations during instruction
selection in MachInst backends"
(https://github.com/bytecodealliance/wasmtime/issues/2340) lands,
we can use that functionality to pattern match the two-CLIF pair and
emit a single AArch64 instruction.
* A simple filetest has been added.
There is no comprehensive testcase in this commit, because that is a separate
repo. The implementation has been tested, nevertheless.
When performing a function call, the platform ABI may require space
on the stack to hold outgoing arguments and/or return values.
Currently, this is supported via decrementing the stack pointer
before the call and incrementing it afterwards, using the
emit_stack_pre_adjust and emit_stack_post_adjust methods of
ABICaller. However, on some platforms it would be preferable
to just allocate enough space for any call done in the function
in the caller's prologue instead.
This patch adds support to allow back-ends to choose that method.
Instead of calling emit_stack_pre/post_adjust around a call, they
simply call a new accumulate_outgoing_args_size method of
ABICaller instead. This will pass on the required size to the
ABICallee structure of the calling function, which will accumulate
the maximum size required for all function calls.
That accumulated size is then passed to the gen_clobber_save
and gen_clobber_restore functions so they can include the size
in the stack allocation / deallocation that already happens in
the prologue / epilogue code.
This patch implements, for aarch64, the following wasm SIMD extensions
i32x4.dot_i16x8_s instruction
https://github.com/WebAssembly/simd/pull/127
It also updates dependencies as follows, in order that the new instruction can
be parsed, decoded, etc:
wat to 1.0.27
wast to 26.0.1
wasmparser to 0.65.0
wasmprinter to 0.2.12
The changes are straightforward:
* new CLIF instruction `widening_pairwise_dot_product_s`
* translation from wasm into `widening_pairwise_dot_product_s`
* new AArch64 instructions `smull`, `smull2` (part of the `VecRRR` group)
* translation from `widening_pairwise_dot_product_s` to `smull ; smull2 ; addv`
There is no testcase in this commit, because that is a separate repo. The
implementation has been tested, nevertheless.
The ABI common code currently passes the fixed frame size to
the gen_clobber_save back-end routine, which is required to
emit code to allocate the required stack space in the prologue.
Similarly, the back-end needs to emit code to de-allocate the
stack in the epilogue. However, at this point the back-end
does not have access to that fixed frame size value any more.
With targets that use a frame pointer, this does not matter,
since de-allocation can be done simply by assigning the frame
pointer back to the stack pointer. However, on targets that
do not use a frame pointer, the frame size is required.
To allow back-ends that option, this patch changes ABI common
code to pass the fixed frame size to get_clobber_restore as
well (the same value as is passed to get_clobber_save).
The common gen_prologue code currently assumes that the stack
pointer has to be aligned to twice the word size. While this
is true for many ABIs, it does not hold universally.
This patch adds a new callback stack_align that back-ends can
provide to define the specific stack alignment required by the
ABI on that platform.
In particular, introduce initial support for the MOVI and MVNI
instructions, with 8-bit elements. Also, treat vector constants
as 32- or 64-bit floating-point numbers, if their value allows
it, by relying on the architectural zero extension. Finally,
stop generating literal loads for 32-bit constants.
Copyright (c) 2020, Arm Limited.
This patch implements, for aarch64, the following wasm SIMD extensions
Floating-point rounding instructions
https://github.com/WebAssembly/simd/pull/232
Pseudo-Minimum and Pseudo-Maximum instructions
https://github.com/WebAssembly/simd/pull/122
The changes are straightforward:
* `build.rs`: the relevant tests have been enabled
* `cranelift/codegen/meta/src/shared/instructions.rs`: new CLIF instructions
`fmin_pseudo` and `fmax_pseudo`. The wasm rounding instructions do not need
any new CLIF instructions.
* `cranelift/wasm/src/code_translator.rs`: translation into CLIF; this is
pretty much the same as any other unary or binary vector instruction (for
the rounding and the pmin/max respectively)
* `cranelift/codegen/src/isa/aarch64/lower_inst.rs`:
- `fmin_pseudo` and `fmax_pseudo` are converted into a two instruction
sequence, `fcmpgt` followed by `bsl`
- the CLIF rounding instructions are converted to a suitable vector
`frint{n,z,p,m}` instruction.
* `cranelift/codegen/src/isa/aarch64/inst/mod.rs`: minor extension of `pub
enum VecMisc2` to handle the rounding operations. And corresponding `emit`
cases.
The `bitmask.{8x16,16x8,32x4}` instructions do not map neatly to any single
AArch64 SIMD instruction, and instead need a sequence of around ten
instructions. Because of this, this patch is somewhat longer and more complex
than it would be for (eg) x64.
Main changes are:
* the relevant testsuite test (`simd_boolean.wast`) has been enabled on aarch64.
* at the CLIF level, add a new instruction `vhigh_bits`, into which these wasm
instructions are to be translated.
* in the wasm->CLIF translation (code_translator.rs), translate into
`vhigh_bits`. This is straightforward.
* in the CLIF->AArch64 translation (lower_inst.rs), translate `vhigh_bits`
into equivalent sequences of AArch64 instructions. There is a different
sequence for each of the `{8x16, 16x8, 32x4}` variants.
All other changes are AArch64-specific, and add instruction definitions needed
by the previous step:
* Add two new families of AArch64 instructions: `VecShiftImm` (vector shift by
immediate) and `VecExtract` (effectively a double-length vector shift)
* To the existing AArch64 family `VecRRR`, add a `zip1` variant. To the
`VecLanesOp` family add an `addv` variant.
* Add supporting code for the above changes to AArch64 instructions:
- getting the register uses (`aarch64_get_regs`)
- mapping the registers (`aarch64_map_regs`)
- printing instructions
- emitting instructions (`impl MachInstEmit for Inst`). The handling of
`VecShiftImm` is a bit complex.
- emission tests for new instructions and variants.
It corresponds to WebAssembly's `load*_splat` operations, which
were previously represented as a combination of `Load` and `Splat`
instructions. However, there are architectures such as Armv8-A
that have a single machine instruction equivalent to the Wasm
operations. In order to generate it, it is necessary to merge the
`Load` and the `Splat` in the backend, which is not possible
because the load may have side effects. The new IR instruction
works around this limitation.
The AArch64 backend leverages the new instruction to improve code
generation.
Copyright (c) 2020, Arm Limited.
A new associated type Info is added to MachInstEmit, which is the
immutable counterpart to State. It can't easily be constructed from an
ABICallee, since it would require adding an associated type to the
latter, and making so leaks the associated type in a lot of places in
the code base and makes the code harder to read. Instead, the EmitInfo
state can simply be passed to the `Vcode::emit` function directly.
This change abstracts away (from the perspective of the new backend) how immediate values are stored in InstructionData. It gathers large immediates from necessary places (e.g. constant pool) and delegates to `InstructionData::imm_value` for the rest. This refactor only touches original users of `LowerCtx::get_immediate` but a future change could do the same for any place the new backend is accessing InstructionData directly to retrieve immediates.
This PR updates the AArch64 ABI implementation so that it (i) properly
respects that v8-v15 inclusive have callee-save lower halves, and
caller-save upper halves, by conservatively approximating (to full
registers) in the appropriate directions when generating prologue
caller-saves and when informing the regalloc of clobbered regs across
callsites.
In order to prevent saving all of these vector registers in the prologue
of every non-leaf function due to the above approximation, this also
makes use of a new regalloc.rs feature to exclude call instructions'
writes from the clobber set returned by register allocation. This is
safe whenever the caller and callee have the same ABI (because anything
the callee could clobber, the caller is allowed to clobber as well
without saving it in the prologue).
Fixes#2254.
This also passes `fixed_frame_storage_size` (previously `total_sp_adjust`)
into `gen_clobber_save` so that it can be combined with other stack
adjustments.
Copyright (c) 2020, Arm Limited.
As part of a Wasm JIT update, SpiderMonkey is changing its internal
WebAssembly function ABI. The new ABI's frame format includes "caller
TLS" and "callee TLS" slots. The details of where these come from are
not important; from Cranelift's point of view, the only relevant
requirement is that we have two on-stack args that are always present
(offsetting other on-stack args), and that we define special argument
purposes so that we can supply values for these slots.
Note that this adds a *new* ABI (a variant of the Baldrdash ABI) because
we do not want to tightly couple the landing of this PR to the landing
of the changes in SpiderMonkey; it's better if both the old and new
behavior remain available in Cranelift, so SpiderMonkey can continue to
vendor Cranelift even if it does not land (or backs out) the ABI change.
Furthermore, note that this needs to be a Cranelift-level change (i.e.
cannot be done purely from the translator environment implementation)
because the special TLS arguments must always go on the stack, which
would not otherwise happen with the usual argument-placement logic; and
there is no primitive to push a value directly in CLIF code (the notion
of a stack frame is a lower-level concept).
This commit adds arm32 code generation for some IR insts.
Floating-point instructions are not supported, because regalloc
does not allow to represent overlapping register classes,
which are needed by VFP/Neon.
There is also no support for big-endianness, I64 and I128 types.
Previously, in #2128, we factored out a common "vanilla 64-bit ABI"
implementation from the AArch64 ABI code, with the idea that this should
be largely compatible with x64. This PR alters the new x64 backend to
make use of the shared infrastructure, removing the duplication that
existed previously. The generated code is nearly (not exactly) the same;
the only difference relates to how the clobber-save region is padded in
the prologue.
This also changes some register allocations in the aarch64 code because
call support in the shared ABI infra now passes a temp vreg in, rather
than requiring use of a fixed, non-allocable temp; tests have been
updated, and the runtime behavior is unchanged.
This commit performs a small cleanup in the AArch64 backend - after
the MAdd and MSub variants have been extracted, the ALUOp enum is
used purely for binary integer operations.
Also, Inst::Mov has been renamed to Inst::Mov64 for consistency.
Copyright (c) 2020, Arm Limited.
Baldrdash's API requires that there is at most one result in a register,
across all the possible register classes: in particular, it's not
possible to return an i64 value in a register while returning an v128
value in another register.
This patch adds a notion of "remaining register values", so this is
properly taking into account when choosing whether a return value may be
put into a register or not.
It does this by providing an implementation of the CLIF instructions `AtomicRmw`, `AtomicCas`,
`AtomicLoad`, `AtomicStore` and `Fence`.
The translation is straightforward. `AtomicCas` is translated into x64 `cmpxchg`, `AtomicLoad`
becomes a normal load because x64-TSO provides adequate sequencing, `AtomicStore` becomes a
normal store followed by `mfence`, and `Fence` becomes `mfence`. `AtomicRmw` is the only
complex case: it becomes a normal load, followed by a loop which computes an updated value,
tries to `cmpxchg` it back to memory, and repeats if necessary.
This is a minimum-effort initial implementation. `AtomicRmw` could be implemented more
efficiently using LOCK-prefixed integer read-modify-write instructions in the case where the old
value in memory is not required. Subsequent work could add that, if required.
The x64 emitter has been updated to emit the new instructions, obviously. The `LegacyPrefix`
mechanism has been revised to handle multiple prefix bytes, not just one, since it is now
sometimes necessary to emit both 0x66 (Operand Size Override) and F0 (Lock).
In the aarch64 implementation of atomics, there has been some minor renaming for the sake of
clarity, and for consistency with this x64 implementation.