It does this by providing an implementation of the CLIF instructions `AtomicRmw`, `AtomicCas`,
`AtomicLoad`, `AtomicStore` and `Fence`.
The translation is straightforward. `AtomicCas` is translated into x64 `cmpxchg`, `AtomicLoad`
becomes a normal load because x64-TSO provides adequate sequencing, `AtomicStore` becomes a
normal store followed by `mfence`, and `Fence` becomes `mfence`. `AtomicRmw` is the only
complex case: it becomes a normal load, followed by a loop which computes an updated value,
tries to `cmpxchg` it back to memory, and repeats if necessary.
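For illustration only, the shape of that expansion can be written with Rust atomics (this is not the emitted code; `apply_op` is a placeholder I've introduced for whichever r-m-w operation the CLIF instruction specifies):
```
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative sketch only: what the emitted load + cmpxchg loop does,
// written with Rust atomics. `apply_op` stands in for the RMW operation.
fn atomic_rmw_via_cas<F>(mem: &AtomicU64, operand: u64, apply_op: F) -> u64
where
    F: Fn(u64, u64) -> u64,
{
    // normal load of the current value
    let mut old = mem.load(Ordering::SeqCst);
    loop {
        let new = apply_op(old, operand); // compute the updated value
        // try to cmpxchg it back to memory; repeat if another thread raced us
        match mem.compare_exchange(old, new, Ordering::SeqCst, Ordering::SeqCst) {
            Ok(_) => return old, // AtomicRmw yields the previous memory value
            Err(current) => old = current,
        }
    }
}
```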
This is a minimum-effort initial implementation. `AtomicRmw` could be implemented more
efficiently using LOCK-prefixed integer read-modify-write instructions in the case where the old
value in memory is not required. Subsequent work could add that, if required.
The x64 emitter has been updated to emit the new instructions, obviously. The `LegacyPrefix`
mechanism has been revised to handle multiple prefix bytes, not just one, since it is now
sometimes necessary to emit both 0x66 (Operand Size Override) and 0xF0 (Lock).
In the aarch64 implementation of atomics, there has been some minor renaming for the sake of
clarity, and for consistency with this x64 implementation.
We have observed that the ABI implementations for AArch64 and x64 are
very similar; in fact, x64's implementation started as a modified copy
of AArch64's implementation. This is an artifact of both a similar ABI
(both machines pass args and return values in registers first, then the
stack, and both machines give considerable freedom with stack-frame
layout) and a too-low-level ABI abstraction in the existing design. For
machines that fit the mainstream or most common ABI-design idioms, we
should be able to do much better.
This commit factors AArch64 into machine-specific and
machine-independent parts, but does not yet modify x64; that will come
next.
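As a rough sketch of the kind of split this aims at (the trait and its methods below are illustrative names of mine, not the actual abstraction in the tree), the machine-independent driver asks a small per-machine trait only the questions that genuinely differ between targets:
```
// Hypothetical sketch of a machine-specific ABI description; the real
// abstraction may differ in names and detail.
trait AbiMachineSpec {
    /// Registers used for integer arguments, in order.
    fn int_arg_regs() -> &'static [&'static str];
    /// Registers used for integer return values, in order.
    fn int_ret_regs() -> &'static [&'static str];
    /// Required stack alignment at call sites, in bytes.
    fn stack_align() -> u32;
}

struct Aarch64;

impl AbiMachineSpec for Aarch64 {
    fn int_arg_regs() -> &'static [&'static str] {
        &["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7"]
    }
    fn int_ret_regs() -> &'static [&'static str] {
        &["x0", "x1"]
    }
    fn stack_align() -> u32 {
        16
    }
}

// The machine-independent part assigns args to registers first, then the
// stack, without knowing which machine it is targeting.
fn assign_int_args<M: AbiMachineSpec>(num_args: usize) -> (usize, usize) {
    let in_regs = num_args.min(M::int_arg_regs().len());
    (in_regs, num_args - in_regs) // (register args, stack args)
}
```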
This should be completely neutral with respect to compile time and
generated code performance.
The implementation is pretty straightforward. Wasm atomic instructions fall
into 5 groups:
* atomic read-modify-write
* atomic compare-and-swap
* atomic loads
* atomic stores
* fences
The implementation mirrors that structure at both the CLIF and AArch64
levels.
At the CLIF level, there are five new instructions, one for each group. Some
comments about these:
* for those that take addresses (all except fences), the address is contained
entirely in a single `Value`; there is no offset field as there is with
normal loads and stores. Wasm atomics require alignment checks, and
removing the offset makes the implementation of those checks a bit simpler
(see the sketch after this list).
* atomic loads and stores get their own instructions, rather than reusing the
existing load and store instructions, for two reasons:
- per above comment, makes alignment checking simpler
- reuse of existing loads and stores would require extension of `MemFlags`
to indicate atomicity, which sounds semantically unclean. For example,
then *any* instruction carrying `MemFlags` could be marked as atomic, even
in cases where it is meaningless or ambiguous.
* I tried to specify, in comments, the behaviour of these instructions as
tightly as I could. Unfortunately there is no way (per my limited CLIF
knowledge) to enforce the constraint that they may only be used on I8, I16,
I32 and I64 types, and in particular not on floating point or vector types.
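Here is the alignment-check sketch referred to in the first bullet above; it is my own illustration of the Wasm rule, not code from this patch: the effective address must be a multiple of the access size, otherwise the access traps.
```
// Hedged illustration of a Wasm-style atomic alignment check; with no offset
// field on the CLIF instruction, the check applies directly to the single
// address `Value`.
fn check_atomic_alignment(addr: u64, access_bytes: u64) -> Result<(), &'static str> {
    debug_assert!(access_bytes.is_power_of_two());
    if addr & (access_bytes - 1) == 0 {
        Ok(())
    } else {
        Err("unaligned atomic access: trap")
    }
}
```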
The translation from Wasm to CLIF, in `code_translator.rs`, is unremarkable.
At the AArch64 level, there are also five new instructions, one for each
group. All of them except `::Fence` contain multiple real machine
instructions. Atomic r-m-w and atomic c-a-s are emitted as the usual
load-linked store-conditional loops, guarded at both ends by memory fences.
Atomic loads and stores are emitted as a load preceded by a fence, and a store
followed by a fence, respectively. The amount of fencing may be overkill, but
it reflects exactly what the SM Wasm baseline compiler for AArch64 does.
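For concreteness, this is roughly the shape of the sequence the r-m-w pseudo-instruction expands to at emission time, written as a Rust helper returning assembly strings. The register choices and exact mnemonics here are illustrative, not necessarily what the emitter produces byte-for-byte:
```
// Hedged sketch: the LL/SC loop for atomic r-m-w, bracketed by fences.
// Register choices (x24-x28) follow the fixed scratch range described below.
fn atomic_rmw_sequence(op: &str) -> Vec<String> {
    vec![
        "dmb ish".to_string(),              // leading fence
        "again:".to_string(),
        "ldxr x27, [x25]".to_string(),      // load-exclusive the old value
        format!("{} x28, x27, x26", op),    // compute the updated value
        "stxr w24, x28, [x25]".to_string(), // store-exclusive; w24 == 0 on success
        "cbnz w24, again".to_string(),      // retry if the exclusive store failed
        "dmb ish".to_string(),              // trailing fence
    ]
}
```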
One reason to implement r-m-w and c-a-s as a single insn which is expanded
only at emission time is that we must be very careful what instructions we
allow in between the load-linked and store-conditional. In particular, we
cannot allow *any* extra memory transactions in there, since -- particularly
on low-end hardware -- that might cause the transaction to fail, hence
deadlocking the generated code. That implies that we can't present the LL/SC
loop to the register allocator as its constituent instructions, since it might
insert spills anywhere. Hence we must present it as a single indivisible
unit, as we do here. It also has the benefit of reducing the total amount of
work the RA has to do.
The only other notable feature of the r-m-w and c-a-s translations into
AArch64 code is that they both need a scratch register internally. Rather
than faking one up by claiming, in `get_regs`, that it modifies an extra
scratch register, and having to have a dummy initialisation of it, these new
instructions (`::LLSC` and `::CAS`) simply use fixed registers in the range
x24-x28. We rely on the RA's ability to coalesce V<-->R copies to make the
cost of the resulting extra copies zero or almost zero. x24-x28 are chosen so
as to be call-clobbered, hence their use is less likely to interfere with long
live ranges that span calls.
One subtlety regarding the use of completely fixed input and output registers
is that we must be careful how the surrounding copies from and to the
arg/result registers are done. In particular, it is not safe to simply emit
copies in
some arbitrary order if one of the arg registers is a real reg. For that
reason, the arguments are first moved into virtual regs if they are not
already there, using a new method `<LowerCtx for Lower>::ensure_in_vreg`.
Again, we rely on coalescing to turn them into no-ops in the common case.
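A minimal model of the `ensure_in_vreg` idea; the types and fields here are made up for illustration, and the real `LowerCtx` API is richer:
```
// Hedged sketch: inputs already in virtual registers are used as-is; inputs in
// real (fixed) registers are first copied into a fresh vreg, so that the later
// fixed-register constraints of the pseudo-instruction cannot clash.
#[derive(Clone, Copy, PartialEq)]
enum Reg {
    Virtual(u32),
    Real(u8),
}

struct LowerState {
    next_vreg: u32,
    copies: Vec<(Reg, Reg)>, // (dst, src) moves emitted during lowering
}

impl LowerState {
    fn ensure_in_vreg(&mut self, r: Reg) -> Reg {
        match r {
            Reg::Virtual(_) => r, // already safe to constrain later
            Reg::Real(_) => {
                let v = Reg::Virtual(self.next_vreg);
                self.next_vreg += 1;
                // the register allocator can usually coalesce this copy away
                self.copies.push((v, r));
                v
            }
        }
    }
}
```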
There is also a ride-along fix for the AArch64 lowering case for
`Opcode::Trapif | Opcode::Trapff`, which removes a bug in which two trap insns
in a row were generated.
In the patch as submitted there are 6 "FIXME JRS" comments, which mark things
which I believe to be correct, but for which I would appreciate a second
opinion. Unless otherwise directed, I will remove them for the final commit
but leave the associated code/comments unchanged.
We often see patterns like:
```
mov w2, #0xffff_ffff // uses ORR with logical immediate form
add w0, w1, w2
```
which is just `w0 := w1 - 1`. It would be much better to recognize when
the negation of an immediate will fit in a 12-bit immediate field even when
the immediate itself does not, and flip add to subtract (and vice versa), so
we can instead generate:
```
sub w0, w1, #1
```
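A sketch of the recognition step might look as follows; it ignores the shifted-by-12 form of the A64 add/sub immediate, and the helper name is mine:
```
// Hedged sketch: decide whether `x + imm` can be encoded as an add-immediate
// or, by negating, as a sub-immediate, instead of materializing a constant.
fn add_imm_encoding(imm: i64) -> Option<(&'static str, u64)> {
    let fits12 = |v: u64| v < (1 << 12);
    if imm >= 0 && fits12(imm as u64) {
        Some(("add", imm as u64))
    } else if imm < 0 && fits12(imm.unsigned_abs()) {
        // flip add -> sub (and vice versa) when the negated immediate fits
        Some(("sub", imm.unsigned_abs()))
    } else {
        None // fall back to loading the constant into a register
    }
}
```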
We see this pattern in e.g. `bz2`, where this commit makes the following
difference (counting instructions with `perf stat`, filling in the
wasmtime cache first then running again to get just runtime):
pre:
```
992.762250 task-clock (msec) # 0.998 CPUs utilized
109 context-switches # 0.110 K/sec
0 cpu-migrations # 0.000 K/sec
5,035 page-faults # 0.005 M/sec
3,224,119,134 cycles # 3.248 GHz
4,000,521,171 instructions # 1.24 insn per cycle
<not supported> branches
27,573,755 branch-misses
0.995072322 seconds time elapsed
```
post:
```
993.853850 task-clock (msec) # 0.998 CPUs utilized
123 context-switches # 0.124 K/sec
1 cpu-migrations # 0.001 K/sec
5,072 page-faults # 0.005 M/sec
3,201,278,337 cycles # 3.221 GHz
3,917,061,340 instructions # 1.22 insn per cycle
<not supported> branches
28,410,633 branch-misses
0.996008047 seconds time elapsed
```
In other words, a 2.1% reduction in instruction count on `bz2`.
It seems that representing true as all-ones is actually the correct behavior
for bool types wider than `b1`; some of the vector instruction optimizations
depend on bool lanes representing false and true as all-zeroes and all-ones
respectively. For `b8`..`b64`, this results in an extra negation after a
`cset` when a bool is produced by an `icmp`/`fcmp`, but the most common
case (`b1`) is unaffected, because an all-ones one-bit value is just
`1`.
An example of this assumption can be seen here:
399ee0a54c/cranelift/codegen/src/simple_preopt.rs (L956)
Thanks to Joey Gouly of ARM for noting this issue while implementing
SIMD support, and digging into the source (finding the above example) to
determine the correct behavior.
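A small illustration (my own, not code from the patch) of why `b1` needs no extra work: the all-ones true value of an N-bit boolean collapses to `1` when N is 1.
```
// Hedged illustration: the all-ones "true" value for an N-bit boolean type.
fn all_ones(bits: u32) -> u64 {
    if bits >= 64 {
        u64::MAX
    } else {
        (1u64 << bits) - 1
    }
}

fn main() {
    assert_eq!(all_ones(1), 1);        // b1: true is just 1, no negation needed
    assert_eq!(all_ones(8), 0xff);     // b8: needs the extra negation after cset
    assert_eq!(all_ones(64), u64::MAX);
}
```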
This is implemented the same as Bitselect, as the controlling vector
is a boolean vector. A boolean vector in Cranelift has elements
that are either 0 or all ones, so it can be used to select elements
lane-wise.
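To illustrate why that works, here is a scalar model of a single lane (my own example): with a mask lane that is all zeroes or all ones, a bitwise select picks the whole lane from one operand or the other.
```
// Hedged per-lane model of bitselect with a boolean-vector mask.
fn bitselect_lane(mask: u64, a: u64, b: u64) -> u64 {
    (mask & a) | (!mask & b) // mask == !0 picks a; mask == 0 picks b
}
```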
Copyright (c) 2020, Arm Limited.
Previously, we simply compared the input bool to 0, which forced the
value into a register (usually via a cmp and cset), zero-extended it,
etc. This patch performs the same pattern-matching that branches do to
directly perform the cmp and use its flag results with the csel.
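Roughly, the lowering decision becomes something like the following; the enum and helper are illustrative only, not the real lowering API:
```
// Hedged sketch: if the select's condition comes straight from a compare,
// emit that compare and feed its flags into csel; otherwise fall back to
// materializing the boolean and testing it against zero as before.
enum CondSource {
    Icmp { cond: &'static str },
    Fcmp { cond: &'static str },
    Other,
}

fn select_condition(src: CondSource) -> &'static str {
    match src {
        CondSource::Icmp { cond } | CondSource::Fcmp { cond } => cond,
        CondSource::Other => "ne", // after comparing the materialized bool to 0
    }
}
```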
On the `bz2` benchmark, the runtime is affected as follows (measuring
with `perf stat`, using wasmtime with its cache enabled, and taking the
second run after the first compiles and populates the cache):
pre:
```
1117.232000 task-clock (msec) # 1.000 CPUs utilized
133 context-switches # 0.119 K/sec
1 cpu-migrations # 0.001 K/sec
5,041 page-faults # 0.005 M/sec
3,511,615,100 cycles # 3.143 GHz
4,272,427,772 instructions # 1.22 insn per cycle
<not supported> branches
27,980,906 branch-misses
1.117299838 seconds time elapsed
```
post:
```
1003.738075 task-clock (msec) # 1.000 CPUs utilized
121 context-switches # 0.121 K/sec
0 cpu-migrations # 0.000 K/sec
5,052 page-faults # 0.005 M/sec
3,224,875,393 cycles # 3.213 GHz
4,000,838,686 instructions # 1.24 insn per cycle
<not supported> branches
27,928,232 branch-misses
1.003440004 seconds time elapsed
```
In other words, with this change, on `bz2`, we see a 6.3% reduction in
executed instructions.
This commit adds support for generating stackmaps at safepoints to the
new backend framework and to the AArch64 backend in particular. It has
been tested to work with SpiderMonkey.
This commit adds the initial support to allow reftypes to flow through
the program when targeting aarch64. It also adds a fix to the
`ModuleTranslationState` needed to send R32/R64 types over from the
SpiderMonkey embedding.
This commit does not include any support for safepoints in aarch64
or the `MachInst` infrastructure; that is in the next commit.
This commit also makes a drive-by improvement to `Bint`, avoiding an
unneeded zero-extension op when the extended value comes directly from a
conditional-set (which produces a full-width 0 or 1).
The main issue with the InstSize enum was that it was used both for
GPR and SIMD & FP operands, even though machine instructions do not
mix them in general (a destination register, for instance, is either a
GPR or a SIMD & FP register, not both). As a result it had methods such
as sf_bit() that made sense only for one type of operand.
Another issue was that the enum name did not reflect its purpose
accurately - it was meant to represent an instruction operand size,
not an instruction size, which is fixed in A64 (always 4 bytes).
Now the enum is split into one for GPR operands and another for
scalar SIMD & FP operands.
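A sketch of what the split looks like; the names below reflect the intent and may not match the tree exactly:
```
/// Operand size for GPR (integer) operands, where the sf bit is meaningful.
enum OperandSize {
    Size32,
    Size64,
}

impl OperandSize {
    /// The `sf` bit in A64 encodings: 0 selects W (32-bit), 1 selects X (64-bit).
    fn sf_bit(&self) -> u32 {
        match self {
            OperandSize::Size32 => 0,
            OperandSize::Size64 => 1,
        }
    }
}

/// Operand size for scalar SIMD & FP operands; no sf_bit() here, since it
/// would be meaningless for these registers.
enum ScalarSize {
    Size8,
    Size16,
    Size32,
    Size64,
    Size128,
}
```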
Copyright (c) 2020, Arm Limited.
In discussions with @bnjbvr, it came up that generating `OneWayCondBr`s
with explicit, hardcoded PC-offsets as part of lowered instruction
sequences is actually unsafe, because the register allocator *might*
insert a spill or reload into the middle of our sequence. We were
careful about this in some cases but somehow missed that it was a
general restriction. Conceptually, all inter-instruction references
should be via labels at the VCode level; explicit offsets are only ever
known at emission time, and resolved by the `MachBuffer`.
To allow for conditional trap checks without modifying the CFG (as seen
by regalloc) during lowering, this PR instead adds a `TrapIf`
pseudo-instruction that conditionally skips a single embedded trap
instruction. It lowers to the same `condbr label ; trap ; label: ...`
sequence, but without the hardcoded branch-target offset in the lowering
code.
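A rough model of the pseudo-instruction and its emission (label handling simplified, names and mnemonics illustrative):
```
// Hedged sketch: TrapIf carries a condition and a trap code, and only at
// emission time does it turn into "inverted conditional branch over a trap".
// No PC offset is ever hardcoded during lowering.
struct TrapIf {
    cond: &'static str, // e.g. "eq", "ne", "hs", ...
    trap_code: u32,
}

fn emit_trap_if(t: &TrapIf, label_idx: &mut u32, out: &mut Vec<String>) {
    let skip = *label_idx;
    *label_idx += 1;
    out.push(format!("b.{} .Lskip{}", invert(t.cond), skip)); // branch over the trap
    out.push(format!("udf #{}", t.trap_code));                // the embedded trap
    out.push(format!(".Lskip{}:", skip));
}

fn invert(cond: &'static str) -> &'static str {
    match cond {
        "eq" => "ne",
        "ne" => "eq",
        "hs" => "lo",
        "lo" => "hs",
        _ => cond, // simplified; a real table covers all conditions
    }
}
```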
The failure to mask the amount triggered a panic due to a subtraction
overflow check; see
https://bugzilla.mozilla.org/show_bug.cgi?id=1649432. Attempting to
shift by an out-of-range amount should be defined to shift by an amount
mod the operand size (i.e., masked to 5 bits for 32-bit shifts, or 6
bits for 64-bit shifts).
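Concretely, the defined behaviour is the following (my own scalar illustration):
```
// Hedged illustration: shift amounts are taken modulo the operand width, i.e.
// masked to 5 bits for 32-bit shifts and 6 bits for 64-bit shifts, so an
// out-of-range amount can never trigger an overflow panic.
fn ishl_32(x: u32, amount: u32) -> u32 {
    x << (amount & 31)
}

fn ishl_64(x: u64, amount: u64) -> u64 {
    x << (amount & 63)
}
```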
This PR adds a conditional move following a heap bounds check through
which the address to be accessed flows. This conditional move ensures
that even if the branch is mispredicted (the access is actually out of
bounds, but speculation goes down the in-bounds path), the actually accessed
address is zero (a NULL pointer) rather than the out-of-bounds address.
The mitigation is controlled by a flag that is off by default, but can
be set by the embedding. Note that in order to turn it on by default,
we would need to add conditional-move support to the current x86
backend; this does not appear to be present. Once the deprecated
backend is removed in favor of the new backend, IMHO we should turn
this flag on by default.
Note that the mitigation is unnecessary when we use the "huge heap"
technique on 64-bit systems, in which we allocate a range of virtual
address space such that no 32-bit offset can reach other data. Hence,
this only affects small-heap configurations.
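In pseudo-Rust, the guarded address computation looks like this; the `if` below stands in for the actual conditional move, and the names are mine:
```
// Hedged sketch: even if the bounds-check branch is mispredicted, the address
// actually fed to the memory access on the speculative path is 0 (NULL), not
// the out-of-bounds address, because the select is a data dependency
// (cmov/csel), not a branch.
fn guarded_heap_addr(heap_base: usize, offset: usize, bound: usize) -> usize {
    let oob = offset >= bound; // the existing bounds-check condition
    let addr = heap_base.wrapping_add(offset);
    if oob { 0 } else { addr } // emitted as a conditional move, not a branch
}
```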
From discussion with Julian and Ben, this PR makes a few documentation-
and naming-level changes (no functionality change):
- Document that the `LowerCtx`-provided output register can be used as a
scratch register during the lowered instruction sequence before
placing the final result in it.
- Rename `input_to_*` helpers in the AArch64 backend to
`put_input_in_*`, emphasizing that these are side-effecting helpers
that potentially generate code (e.g., sign/zero-extensions) to ensure
an input value is in a register.
This is useful to allow resumable_trap to happen in loop
headers, for instance. This is the correct way to implement interrupt
checks in SpiderMonkey, which are effectively resumable traps. The previous
implementation used traps, which is wrong, since traps semantically
can't be resumed afterwards.