In the Baldrdash (SpiderMonkey) embedding, we must take care to
zero-extend all function arguments to callees in integer registers when
the types are narrower than 64 bits. This is because, unlike the native
SysV ABI, the Baldrdash ABI expects high bits to be cleared. Not doing
so leads to difficult-to-trace errors where high bits falsely tag an
int32 as e.g. an object pointer, leading to potential security issues.
Previously, our pattern-matching for generating load/store addresses was
somewhat limited. For example, it could not use a register-extend
address mode to handle the following CLIF:
```
v2760 = uextend.i64 v985
v2761 = load.i64 notrap aligned readonly v1
v1018 = iadd v2761, v2760
store v1017, v1018
```
This PR adds more general support for address expressions made up of
additions and extensions. In particular, it pattern-matches a tree of
64-bit `iadd`s, optionally with `uextend`/`sextend` from 32-bit values
at the leaves, to collect the list of all addends that form the address.
It also collects all offsets at leaves, combining them.
It applies a series of heuristics to make the best use of the
available addressing modes, filling the load/store itself with as many
64-bit registers, zero/sign-extended 32-bit registers, and/or an offset,
then computing the rest with add instructions as necessary. It attempts
to make use of immediate forms (add-immediate or subtract-immediate)
whenever possible, and also uses the built-in extend operators on add
instructions when possible. There are certainly cases where this is not
optimal (i.e., does not generate the strictly shortest sequence of
instructions), but it should be good enough for most code.
Using `perf stat` to measure instruction count (runtime only, on
wasmtime, after populating the cache to avoid measuring compilation),
this impacts `bz2` as follows:
```
pre:
1006.410425 task-clock (msec) # 1.000 CPUs utilized
113 context-switches # 0.112 K/sec
1 cpu-migrations # 0.001 K/sec
5,036 page-faults # 0.005 M/sec
3,221,547,476 cycles # 3.201 GHz
4,000,670,104 instructions # 1.24 insn per cycle
<not supported> branches
27,958,613 branch-misses
1.006071348 seconds time elapsed
post:
963.499525 task-clock (msec) # 0.997 CPUs utilized
117 context-switches # 0.121 K/sec
0 cpu-migrations # 0.000 K/sec
5,081 page-faults # 0.005 M/sec
3,039,687,673 cycles # 3.155 GHz
3,837,761,690 instructions # 1.26 insn per cycle
<not supported> branches
28,254,585 branch-misses
0.966072682 seconds time elapsed
```
In other words, this reduces instruction count by 4.1% on `bz2`.
We often see patterns like:
```
mov w2, #0xffff_ffff // uses ORR with logical immediate form
add w0, w1, w2
```
which is just `w0 := w1 - 1`. It would be much better to recognize when
the inverse of an immediate will fit in a 12-bit immediate field if the
immediate itself does not, and flip add to subtract (and vice versa), so
we can instead generate:
```
sub w0, w1, #1
```
We see this pattern in e.g. `bz2`, where this commit makes the following
difference (counting instructions with `perf stat`, filling in the
wasmtime cache first then running again to get just runtime):
pre:
```
992.762250 task-clock (msec) # 0.998 CPUs utilized
109 context-switches # 0.110 K/sec
0 cpu-migrations # 0.000 K/sec
5,035 page-faults # 0.005 M/sec
3,224,119,134 cycles # 3.248 GHz
4,000,521,171 instructions # 1.24 insn per cycle
<not supported> branches
27,573,755 branch-misses
0.995072322 seconds time elapsed
```
post:
```
993.853850 task-clock (msec) # 0.998 CPUs utilized
123 context-switches # 0.124 K/sec
1 cpu-migrations # 0.001 K/sec
5,072 page-faults # 0.005 M/sec
3,201,278,337 cycles # 3.221 GHz
3,917,061,340 instructions # 1.22 insn per cycle
<not supported> branches
28,410,633 branch-misses
0.996008047 seconds time elapsed
```
In other words, a 2.1% reduction in instruction count on `bz2`.
It seems that this is actually the correct behavior for bool types wider
than `b1`; some of the vector instruction optimizations depend on bool
lanes representing false and true as all-zeroes and all-ones
respectively. For `b8`..`b64`, this results in an extra negation after a
`cset` when a bool is produced by an `icmp`/`fcmp`, but the most common
case (`b1`) is unaffected, because an all-ones one-bit value is just
`1`.
An example of this assumption can be seen here:
399ee0a54c/cranelift/codegen/src/simple_preopt.rs (L956)
Thanks to Joey Gouly of ARM for noting this issue while implementing
SIMD support, and digging into the source (finding the above example) to
determine the correct behavior.
This is implemented the same as Bitselect, as the controlling vector
is a boolean vector. A boolean vector in cranelift has elements
that are either 0 or all 1s, so it can be used to select elements
lane wise.
Copyright (c) 2020, Arm Limited.
As per Carlo Kok on Zulip #cranelift, this breaks builds with stable
Rust pre-1.43, as `core::u8::MAX` was only stabilized then. We'd like to
support older versions if we can easily do so.
This PR also adds `cranelift-tools` to the crates checked on CI with
Rust 1.41.0, which pulls in all backends (including `aarch64`).
Previously, we simply compared the input bool to 0, which forced the
value into a register (usually via a cmp and cset), zero-extended it,
etc. This patch performs the same pattern-matching that branches do to
directly perform the cmp and use its flag results with the csel.
On the `bz2` benchmark, the runtime is affected as follows (measuring
with `perf stat`, using wasmtime with its cache enabled, and taking the
second run after the first compiles and populates the cache):
pre:
1117.232000 task-clock (msec) # 1.000 CPUs utilized
133 context-switches # 0.119 K/sec
1 cpu-migrations # 0.001 K/sec
5,041 page-faults # 0.005 M/sec
3,511,615,100 cycles # 3.143 GHz
4,272,427,772 instructions # 1.22 insn per cycle
<not supported> branches
27,980,906 branch-misses
1.117299838 seconds time elapsed
post:
1003.738075 task-clock (msec) # 1.000 CPUs utilized
121 context-switches # 0.121 K/sec
0 cpu-migrations # 0.000 K/sec
5,052 page-faults # 0.005 M/sec
3,224,875,393 cycles # 3.213 GHz
4,000,838,686 instructions # 1.24 insn per cycle
<not supported> branches
27,928,232 branch-misses
1.003440004 seconds time elapsed
In other words, with this change, on `bz2`, we see a 6.3% reduction in
executed instructions.
We had previously fixed a bug in which constant shift amounts should be
masked to modulo the number of bits in the operand; however, we did not
fix the analogous case for shifts incorporated into the second register
argument of ALU instructions that support integrated shifts. This
failure to mask resulted in illegal instructions being generated, e.g.
in https://bugzilla.mozilla.org/show_bug.cgi?id=1653502. This PR fixes
the issue by masking the amount, as the shift semantics require.
This commit adds support for generating stackmaps at safepoints to the
new backend framework and to the AArch64 backend in particular. It has
been tested to work with SpiderMonkey.
This commit adds the inital support to allow reftypes to flow through
the program when targetting aarch64. It also adds a fix to the
`ModuleTranslationState` needed to send R32/R64 types over from the
SpiderMonkey embedding.
This commit does not include any support for safepoints in aarch64
or the `MachInst` infrastructure; that is in the next commit.
This commit also makes a drive-by improvement to `Bint`, avoiding an
unneeded zero-extension op when the extended value comes directly from a
conditional-set (which produces a full-width 0 or 1).
The main issue with the InstSize enum was that it was used both for
GPR and SIMD & FP operands, even though machine instructions do not
mix them in general (as in a destination register is either a GPR
or not). As a result it had methods such as sf_bit() that made
sense only for one type of operand.
Another issue was that the enum name was not reflecting its purpose
accurately - it was meant to represent an instruction operand size,
not an instruction size, which is fixed in A64 (always 4 bytes).
Now the enum is split into one for GPR operands and another for
scalar SIMD & FP operands.
Copyright (c) 2020, Arm Limited.
* Switch CI back to nightly channel
I think all upstream issues are now fixed so we should be good to switch
back to nightly from our previously pinned version.
* Fix doc warnings
The ARM book says that the immr field should contain (-count % 64); the
existing code was approximating this with (64 - count), which is not
correct for a zero count.
- put the division in the synthetic instruction as well,
- put the branch table check in the inst's emission code,
- replace OneWayCondJmp by TrapIf vcode instruction,
- add comments describing code generated by the synthetic instructions
In discussions with @bnjbvr, it came up that generating `OneWayCondBr`s
with explicit, hardcoded PC-offsets as part of lowered instruction
sequences is actually unsafe, because the register allocator *might*
insert a spill or reload into the middle of our sequence. We were
careful about this in some cases but somehow missed that it was a
general restriction. Conceptually, all inter-instruction references
should be via labels at the VCode level; explicit offsets are only ever
known at emission time, and resolved by the `MachBuffer`.
To allow for conditional trap checks without modifying the CFG (as seen
by regalloc) during lowering, this PR instead adds a `TrapIf`
pseudo-instruction that conditionally skips a single embedded trap
instruction. It lowers to the same `condbr label ; trap ; label: ...`
sequence, but without the hardcoded branch-target offset in the lowering
code.
The failure to mask the amount triggered a panic due to a subtraction
overflow check; see
https://bugzilla.mozilla.org/show_bug.cgi?id=1649432. Attempting to
shift by an out-of-range amount should be defined to shift by an amount
mod the operand size (i.e., masked to 5 bits for 32-bit shifts, or 6
bits for 64-bit shifts).