When dealing with params that need to be split, we follow the
arch64 ABI and split the value in two, and make sure that start that
argument in an even numbered xN register.
The apple ABI does not require this, so on those platforms, we start
params anywhere.
This commit enables Cranelift's AArch64 backend to generate code
for instruction set extensions (previously only the base Armv8-A
architecture was supported); also, it makes it possible to detect
the extensions supported by the host when JIT compiling. The new
functionality is applied to the IR instruction `AtomicCas`.
Copyright (c) 2021, Arm Limited.
The StructReturn ABI is fairly simple at the codegen/isel level: we only
need to take care to return the sret pointer as one of the return values
if that wasn't specified in the initial function signature.
Struct arguments are a little more complex. A struct argument is stored
as a chunk of memory in the stack-args space. However, the CLIF
semantics are slightly special: on the caller side, the parameter passed
in is a pointer to an arbitrary memory block, and we must memcpy this
data to the on-stack struct-argument; and on the callee side, we provide
a pointer to the passed-in struct-argument as the CLIF block param
value.
This is necessary to support various ABIs other than Wasm, such as that
of Rust (with the cg_clif codegen backend).
This will allow for support for `I128` values everywhere, and `I64`
values on 32-bit targets (e.g., ARM32 and x86-32). It does not alter the
machine backends to build such support; it just adds the framework for
the MachInst backends to *reason* about a `Value` residing in more than
one register.
The current code doesn't correctly handle the case where `ExtendOp::UXTW` has
as source, a constant-producing insn that produces a negative (32-bit) value.
Then the value is incorrectly sign-extended to 64 bits (in fact, this has
already been done by `ctx.get_constant(insn)`), whereas it needs to be zero
extended. The obvious fix, done here, is just to force bits 63:32 of the
extension to zero, hence zero-extending it.
This PR updates the "coloring" scheme that accounts for side-effects in
the MachInst lowering logic. As a result, the new backends will now be
able to merge effectful operations (such as memory loads) *into* other
operations; previously, only the other way (pure ops merged into
effectful ops) was possible. This will allow, for example, a load+ALU-op
combination, as is common on x86. It should even allow a load + ALU-op +
store sequence to merge into one lowered instruction.
The scheme arose from many fruitful discussions with @julian-seward1
(thanks!); significant credit is due to him for the insights here.
The first insight is that given the right basic conditions, i.e. that
the root instruction is the only use of an effectful instruction's
result, all we need is that the "color" of the effectful instruction is
*one less* than the color of the current instruction. It's easier to
think about colors on the program points between instructions: if the
color coming *out* of the first (effectful def) instruction and *in* to
the second (effectful or effect-free use) instruction are the same, then
they can merge. Basically the color denotes a version of global state;
if the same, then no other effectful ops happened in the meantime.
The second insight is that we can keep state as we scan, tracking the
"current color", and *update* this when we sink (merge) an op. Hence
when we sink a load into another op, we effectively *re-color* every
instruction it moved over; this may allow further sinks.
Consider the example (and assume that we consider loads effectful in
order to conservatively ensure a strong memory model; otherwise, replace
with other effectful value-producing insts):
```
v0 = load x
v1 = load y
v2 = add v0, 1
v3 = add v1, 1
```
Scanning from bottom to top, we first see the add producing `v3` and we
can sink the load producing `v1` into it, producing a load + ALU-op
machine instruction. This is legal because `v1` moves over only `v2`,
which is a pure instruction. Consider, though, `v2`: under a simple
scheme that has no other context, `v0` could not sink to `v2` because it
would move over `v1`, another load. But because we already sunk `v1`
down to `v3`, we are free to sink `v0` to `v2`; the update of the
"current color" during the scan allows this.
This PR also cleans up the `LowerCtx` interface a bit at the same time:
whereas previously it always gave some subset of (constant, mergeable
inst, register) directly from `LowerCtx::get_input()`, it now returns
zero or more of (constant, mergable inst) from
`LowerCtx::maybe_get_input_as_source_or_const()`, and returns the
register only from `LowerCtx::put_input_in_reg()`. This removes the need
to explicitly denote uses of the register, so it's a little safer.
Note that this PR does not actually make use of the new ability to merge
loads into other ops; that will come in future PRs, especially to
optimize the `x64` backend by using direct-memory operands.
This refactors the handling of Inst::Extend and simplifies the lowering
of Bextend and Bmask, which allows the use of SBFX instructions for
extensions from 1-bit booleans. Other extensions use aliases of BFM,
and the code was changed to reflect that, rather than hard coding bit
patterns. Also ImmLogic is now implemented, so another hard coded
instruction can be removed.
As part of looking at boolean handling, `normalize_boolean_result` was
changed to `materialize_boolean_result`, such that it can use either
CSET or CSETM. Using CSETM saves an instruction (previously CSET + SUB)
for booleans bigger than 1-bit.
Copyright (c) 2020, Arm Limited.
`lucetc` currently *almost*, but not quite, works with the new x64
backend; the only missing piece is support for the particular
instructions emitted as part of its prologue stack-check.
We do not normally see `brff`, `brif`, or `ifcmp_sp` in CLIF generated by
`cranelift-wasm` without the old-backend legalization rules, so these
were not supported in the new x64 backend as they were not necessary for
Wasm MVP support. Using them resulted in an `unimplemented!()` panic.
This PR adds support for `brff` and `brif` analogously to how AArch64
implements them, by pattern-matching the `ifcmp` / `ffcmp` directly.
Then `ifcmp_sp` is a straightforward variant of `ifcmp`.
Along the way, this also removes the notion of "fallthrough block" from
the branch-group lowering method; instead, `fallthrough` instructions
are handled as normal branches to their explicitly-provided targets,
which (in the original CLIF) match the fallthrough block. The reason for
this is that the block reordering done as part of lowering can change
the fallthrough block. We were not using `fallthrough` instructions in
the output produced by `cranelift-wasm`, so this, too, was not
previously caught.
With these changes, the `lucetc` crate in Lucet passes all tests with
the `x64` feature-flag added to its `cranelift-codegen` dependency.
This changes the following:
mov x0, #4
ldr x0, [x1, #4]
Into:
ldr x0, [x1]
I noticed this pattern (but with #0), in a benchmark.
Copyright (c) 2020, Arm Limited.
This patch implements, for aarch64, the following wasm SIMD extensions.
v128.load32_zero and v128.load64_zero instructions
https://github.com/WebAssembly/simd/pull/237
The changes are straightforward:
* no new CLIF instructions. They are translated into an existing CLIF scalar
load followed by a CLIF `scalar_to_vector`.
* the comment/specification for CLIF `scalar_to_vector` has been changed to
match the actual intended semantics, per consulation with Andrew Brown.
* translation from `scalar_to_vector` to aarch64 `fmov` instruction. This
has been generalised slightly so as to allow both 32- and 64-bit transfers.
* special-case zero in `lower_constant_f128` in order to avoid a
potentially slow call to `Inst::load_fp_constant128`.
* Once "Allow loads to merge into other operations during instruction
selection in MachInst backends"
(https://github.com/bytecodealliance/wasmtime/issues/2340) lands,
we can use that functionality to pattern match the two-CLIF pair and
emit a single AArch64 instruction.
* A simple filetest has been added.
There is no comprehensive testcase in this commit, because that is a separate
repo. The implementation has been tested, nevertheless.
In particular, introduce initial support for the MOVI and MVNI
instructions, with 8-bit elements. Also, treat vector constants
as 32- or 64-bit floating-point numbers, if their value allows
it, by relying on the architectural zero extension. Finally,
stop generating literal loads for 32-bit constants.
Copyright (c) 2020, Arm Limited.
This change abstracts away (from the perspective of the new backend) how immediate values are stored in InstructionData. It gathers large immediates from necessary places (e.g. constant pool) and delegates to `InstructionData::imm_value` for the rest. This refactor only touches original users of `LowerCtx::get_immediate` but a future change could do the same for any place the new backend is accessing InstructionData directly to retrieve immediates.
It does this by providing an implementation of the CLIF instructions `AtomicRmw`, `AtomicCas`,
`AtomicLoad`, `AtomicStore` and `Fence`.
The translation is straightforward. `AtomicCas` is translated into x64 `cmpxchg`, `AtomicLoad`
becomes a normal load because x64-TSO provides adequate sequencing, `AtomicStore` becomes a
normal store followed by `mfence`, and `Fence` becomes `mfence`. `AtomicRmw` is the only
complex case: it becomes a normal load, followed by a loop which computes an updated value,
tries to `cmpxchg` it back to memory, and repeats if necessary.
This is a minimum-effort initial implementation. `AtomicRmw` could be implemented more
efficiently using LOCK-prefixed integer read-modify-write instructions in the case where the old
value in memory is not required. Subsequent work could add that, if required.
The x64 emitter has been updated to emit the new instructions, obviously. The `LegacyPrefix`
mechanism has been revised to handle multiple prefix bytes, not just one, since it is now
sometimes necessary to emit both 0x66 (Operand Size Override) and F0 (Lock).
In the aarch64 implementation of atomics, there has been some minor renaming for the sake of
clarity, and for consistency with this x64 implementation.
We have observed that the ABI implementations for AArch64 and x64 are
very similar; in fact, x64's implementation started as a modified copy
of AArch64's implementation. This is an artifact of both a similar ABI
(both machines pass args and return values in registers first, then the
stack, and both machines give considerable freedom with stack-frame
layout) and a too-low-level ABI abstraction in the existing design. For
machines that fit the mainstream or most common ABI-design idioms, we
should be able to do much better.
This commit factors AArch64 into machine-specific and
machine-independent parts, but does not yet modify x64; that will come
next.
This should be completely neutral with respect to compile time and
generated code performance.
The implementation is pretty straightforward. Wasm atomic instructions fall
into 5 groups
* atomic read-modify-write
* atomic compare-and-swap
* atomic loads
* atomic stores
* fences
and the implementation mirrors that structure, at both the CLIF and AArch64
levels.
At the CLIF level, there are five new instructions, one for each group. Some
comments about these:
* for those that take addresses (all except fences), the address is contained
entirely in a single `Value`; there is no offset field as there is with
normal loads and stores. Wasm atomics require alignment checks, and
removing the offset makes implementation of those checks a bit simpler.
* atomic loads and stores get their own instructions, rather than reusing the
existing load and store instructions, for two reasons:
- per above comment, makes alignment checking simpler
- reuse of existing loads and stores would require extension of `MemFlags`
to indicate atomicity, which sounds semantically unclean. For example,
then *any* instruction carrying `MemFlags` could be marked as atomic, even
in cases where it is meaningless or ambiguous.
* I tried to specify, in comments, the behaviour of these instructions as
tightly as I could. Unfortunately there is no way (per my limited CLIF
knowledge) to enforce the constraint that they may only be used on I8, I16,
I32 and I64 types, and in particular not on floating point or vector types.
The translation from Wasm to CLIF, in `code_translator.rs` is unremarkable.
At the AArch64 level, there are also five new instructions, one for each
group. All of them except `::Fence` contain multiple real machine
instructions. Atomic r-m-w and atomic c-a-s are emitted as the usual
load-linked store-conditional loops, guarded at both ends by memory fences.
Atomic loads and stores are emitted as a load preceded by a fence, and a store
followed by a fence, respectively. The amount of fencing may be overkill, but
it reflects exactly what the SM Wasm baseline compiler for AArch64 does.
One reason to implement r-m-w and c-a-s as a single insn which is expanded
only at emission time is that we must be very careful what instructions we
allow in between the load-linked and store-conditional. In particular, we
cannot allow *any* extra memory transactions in there, since -- particularly
on low-end hardware -- that might cause the transaction to fail, hence
deadlocking the generated code. That implies that we can't present the LL/SC
loop to the register allocator as its constituent instructions, since it might
insert spills anywhere. Hence we must present it as a single indivisible
unit, as we do here. It also has the benefit of reducing the total amount of
work the RA has to do.
The only other notable feature of the r-m-w and c-a-s translations into
AArch64 code, is that they both need a scratch register internally. Rather
than faking one up by claiming, in `get_regs` that it modifies an extra
scratch register, and having to have a dummy initialisation of it, these new
instructions (`::LLSC` and `::CAS`) simply use fixed registers in the range
x24-x28. We rely on the RA's ability to coalesce V<-->R copies to make the
cost of the resulting extra copies zero or almost zero. x24-x28 are chosen so
as to be call-clobbered, hence their use is less likely to interfere with long
live ranges that span calls.
One subtlety regarding the use of completely fixed input and output registers
is that we must be careful how the surrounding copy from/to of the arg/result
registers is done. In particular, it is not safe to simply emit copies in
some arbitrary order if one of the arg registers is a real reg. For that
reason, the arguments are first moved into virtual regs if they are not
already there, using a new method `<LowerCtx for Lower>::ensure_in_vreg`.
Again, we rely on coalescing to turn them into no-ops in the common case.
There is also a ridealong fix for the AArch64 lowering case for
`Opcode::Trapif | Opcode::Trapff`, which removes a bug in which two trap insns
in a row were generated.
In the patch as submitted there are 6 "FIXME JRS" comments, which mark things
which I believe to be correct, but for which I would appreciate a second
opinion. Unless otherwise directed, I will remove them for the final commit
but leave the associated code/comments unchanged.
Previously, our pattern-matching for generating load/store addresses was
somewhat limited. For example, it could not use a register-extend
address mode to handle the following CLIF:
```
v2760 = uextend.i64 v985
v2761 = load.i64 notrap aligned readonly v1
v1018 = iadd v2761, v2760
store v1017, v1018
```
This PR adds more general support for address expressions made up of
additions and extensions. In particular, it pattern-matches a tree of
64-bit `iadd`s, optionally with `uextend`/`sextend` from 32-bit values
at the leaves, to collect the list of all addends that form the address.
It also collects all offsets at leaves, combining them.
It applies a series of heuristics to make the best use of the
available addressing modes, filling the load/store itself with as many
64-bit registers, zero/sign-extended 32-bit registers, and/or an offset,
then computing the rest with add instructions as necessary. It attempts
to make use of immediate forms (add-immediate or subtract-immediate)
whenever possible, and also uses the built-in extend operators on add
instructions when possible. There are certainly cases where this is not
optimal (i.e., does not generate the strictly shortest sequence of
instructions), but it should be good enough for most code.
Using `perf stat` to measure instruction count (runtime only, on
wasmtime, after populating the cache to avoid measuring compilation),
this impacts `bz2` as follows:
```
pre:
1006.410425 task-clock (msec) # 1.000 CPUs utilized
113 context-switches # 0.112 K/sec
1 cpu-migrations # 0.001 K/sec
5,036 page-faults # 0.005 M/sec
3,221,547,476 cycles # 3.201 GHz
4,000,670,104 instructions # 1.24 insn per cycle
<not supported> branches
27,958,613 branch-misses
1.006071348 seconds time elapsed
post:
963.499525 task-clock (msec) # 0.997 CPUs utilized
117 context-switches # 0.121 K/sec
0 cpu-migrations # 0.000 K/sec
5,081 page-faults # 0.005 M/sec
3,039,687,673 cycles # 3.155 GHz
3,837,761,690 instructions # 1.26 insn per cycle
<not supported> branches
28,254,585 branch-misses
0.966072682 seconds time elapsed
```
In other words, this reduces instruction count by 4.1% on `bz2`.
We often see patterns like:
```
mov w2, #0xffff_ffff // uses ORR with logical immediate form
add w0, w1, w2
```
which is just `w0 := w1 - 1`. It would be much better to recognize when
the inverse of an immediate will fit in a 12-bit immediate field if the
immediate itself does not, and flip add to subtract (and vice versa), so
we can instead generate:
```
sub w0, w1, #1
```
We see this pattern in e.g. `bz2`, where this commit makes the following
difference (counting instructions with `perf stat`, filling in the
wasmtime cache first then running again to get just runtime):
pre:
```
992.762250 task-clock (msec) # 0.998 CPUs utilized
109 context-switches # 0.110 K/sec
0 cpu-migrations # 0.000 K/sec
5,035 page-faults # 0.005 M/sec
3,224,119,134 cycles # 3.248 GHz
4,000,521,171 instructions # 1.24 insn per cycle
<not supported> branches
27,573,755 branch-misses
0.995072322 seconds time elapsed
```
post:
```
993.853850 task-clock (msec) # 0.998 CPUs utilized
123 context-switches # 0.124 K/sec
1 cpu-migrations # 0.001 K/sec
5,072 page-faults # 0.005 M/sec
3,201,278,337 cycles # 3.221 GHz
3,917,061,340 instructions # 1.22 insn per cycle
<not supported> branches
28,410,633 branch-misses
0.996008047 seconds time elapsed
```
In other words, a 2.1% reduction in instruction count on `bz2`.
It seems that this is actually the correct behavior for bool types wider
than `b1`; some of the vector instruction optimizations depend on bool
lanes representing false and true as all-zeroes and all-ones
respectively. For `b8`..`b64`, this results in an extra negation after a
`cset` when a bool is produced by an `icmp`/`fcmp`, but the most common
case (`b1`) is unaffected, because an all-ones one-bit value is just
`1`.
An example of this assumption can be seen here:
399ee0a54c/cranelift/codegen/src/simple_preopt.rs (L956)
Thanks to Joey Gouly of ARM for noting this issue while implementing
SIMD support, and digging into the source (finding the above example) to
determine the correct behavior.
As per Carlo Kok on Zulip #cranelift, this breaks builds with stable
Rust pre-1.43, as `core::u8::MAX` was only stabilized then. We'd like to
support older versions if we can easily do so.
This PR also adds `cranelift-tools` to the crates checked on CI with
Rust 1.41.0, which pulls in all backends (including `aarch64`).
We had previously fixed a bug in which constant shift amounts should be
masked to modulo the number of bits in the operand; however, we did not
fix the analogous case for shifts incorporated into the second register
argument of ALU instructions that support integrated shifts. This
failure to mask resulted in illegal instructions being generated, e.g.
in https://bugzilla.mozilla.org/show_bug.cgi?id=1653502. This PR fixes
the issue by masking the amount, as the shift semantics require.
This commit adds the inital support to allow reftypes to flow through
the program when targetting aarch64. It also adds a fix to the
`ModuleTranslationState` needed to send R32/R64 types over from the
SpiderMonkey embedding.
This commit does not include any support for safepoints in aarch64
or the `MachInst` infrastructure; that is in the next commit.
This commit also makes a drive-by improvement to `Bint`, avoiding an
unneeded zero-extension op when the extended value comes directly from a
conditional-set (which produces a full-width 0 or 1).
The failure to mask the amount triggered a panic due to a subtraction
overflow check; see
https://bugzilla.mozilla.org/show_bug.cgi?id=1649432. Attempting to
shift by an out-of-range amount should be defined to shift by an amount
mod the operand size (i.e., masked to 5 bits for 32-bit shifts, or 6
bits for 64-bit shifts).
From discussion with Julian and Ben, this PR makes a few documentation-
and naming-level changes (no functionality change):
- Document that the `LowerCtx`-provided output register can be used as a
scratch register during the lowered instruction sequence before
placing the final result in it.
- Rename `input_to_*` helpers in the AArch64 backend to
`put_input_in_*`, emphasizing that these are side-effecting helpers
that potentially generate code (e.g., sign/zero-extensions) to ensure
an input value is in a register.
When a load/store instruction needs an address of the form `v0 +
uextend(v1)` or `v0 + sextend(v1)` (or the commuted forms thereof), we
currently generate a separate zero/sign-extend operation and then use a
plain `[rA, rB]` addressing mode. This patch extends `lower_address()`
to look at both addends of an address if it has two addends and a zero
offset, recognize extension operations, and incorporate them directly
into a `[rA, rB, UXTW]` or `[rA, rB, SXTW]` form. This should improve
our performence on WebAssembly workloads, at least, because we often see
a 64-bit linear memory base indexed by a 32-bit (Wasm) pointer value.
- Properly mask constant values down to appropriate width when
generating a constant value directly in aarch64 backend. This was a
miscompilation introduced in the new-isel refactor. In combination
with failure to respect NarrowValueMode, this resulted in a very
subtle bug when an `i32` constant was used in bit-twiddling logic.
- Add support for `iadd_ifcout` in aarch64 backend as used in explicit
heap-check mode. With this change, we no longer fail heap-related
tests with the huge-heap-region mode disabled.
- Remove a panic that was occurring in some tests that are currently
ignored on aarch64, by simply returning empty/default information in
`value_label` functionality rather than touching unimplemented APIs.
This is not a bugfix per-se, but removes confusing panic messages from
`cargo test` output that might otherwise mislead.
This patch includes:
- A complete rework of the way that CLIF blocks and edge blocks are
lowered into VCode blocks. The new mechanism in `BlockLoweringOrder`
computes RPO over the CFG, but with a twist: it merges edge blocks intto
heads or tails of original CLIF blocks wherever possible, and it does
this without ever actually materializing the full nodes-plus-edges
graph first. The backend driver lowers blocks in final order so
there's no need to reshuffle later.
- A new `MachBuffer` that replaces the `MachSection`. This is a special
version of a code-sink that is far more than a humble `Vec<u8>`. In
particular, it keeps a record of label definitions and label uses,
with a machine-pluggable `LabelUse` trait that defines various types
of fixups (basically internal relocations).
Importantly, it implements some simple peephole-style branch rewrites
*inline in the emission pass*, without any separate traversals over
the code to use fallthroughs, swap taken/not-taken arms, etc. It
tracks branches at the tail of the buffer and can (i) remove blocks
that are just unconditional branches (by redirecting the label), (ii)
understand a conditional/unconditional pair and swap the conditional
polarity when it's helpful; and (iii) remove branches that branch to
the fallthrough PC.
The `MachBuffer` also implements branch-island support. On
architectures like AArch64, this is needed to allow conditional
branches within plausibly-attainable ranges (+/- 1MB on AArch64
specifically). It also does this inline while streaming through the
emission, without any sort of fixpoint algorithm or later moving of
code, by simply tracking outstanding references and "deadlines" and
emitting an island just-in-time when we're in danger of going out of
range.
- A rework of the instruction selector driver. This is largely following
the same algorithm as before, but is cleaned up significantly, in
particular in the API: the machine backend can ask for an input arg
and get any of three forms (constant, register, producing
instruction), indicating it needs the register or can merge the
constant or producing instruction as appropriate. This new driver
takes special care to emit constants right at use-sites (and at phi
inputs), minimizing their live-ranges, and also special-cases the
"pinned register" to avoid superfluous moves.
Overall, on `bz2.wasm`, the results are:
wasmtime full run (compile + runtime) of bz2:
baseline: 9774M insns, 9742M cycles, 3.918s
w/ changes: 7012M insns, 6888M cycles, 2.958s (24.5% faster, 28.3% fewer insns)
clif-util wasm compile bz2:
baseline: 2633M insns, 3278M cycles, 1.034s
w/ changes: 2366M insns, 2920M cycles, 0.923s (10.7% faster, 10.1% fewer insns)
All numbers are averages of two runs on an Ampere eMAG.