Previously, we simply compared the input bool to 0, which forced the
value into a register (usually via a cmp and cset), zero-extended it,
etc. This patch performs the same pattern-matching that branches do to
directly perform the cmp and use its flag results with the csel.
On the `bz2` benchmark, the runtime is affected as follows (measuring
with `perf stat`, using wasmtime with its cache enabled, and taking the
second run after the first compiles and populates the cache):
pre:
1117.232000 task-clock (msec) # 1.000 CPUs utilized
133 context-switches # 0.119 K/sec
1 cpu-migrations # 0.001 K/sec
5,041 page-faults # 0.005 M/sec
3,511,615,100 cycles # 3.143 GHz
4,272,427,772 instructions # 1.22 insn per cycle
<not supported> branches
27,980,906 branch-misses
1.117299838 seconds time elapsed
post:
1003.738075 task-clock (msec) # 1.000 CPUs utilized
121 context-switches # 0.121 K/sec
0 cpu-migrations # 0.000 K/sec
5,052 page-faults # 0.005 M/sec
3,224,875,393 cycles # 3.213 GHz
4,000,838,686 instructions # 1.24 insn per cycle
<not supported> branches
27,928,232 branch-misses
1.003440004 seconds time elapsed
In other words, with this change, on `bz2`, we see a 6.3% reduction in
executed instructions.
This is tricky: the control flow implicitly implied by the operand makes
it so that the output register may be undefined, if we mark it only as a
"def". Make it a "mod" instead, which matches our usage in the codebase,
and will make it crash if the output operand isn't unconditionally
defined before the instruction.
Certain operations (e.g. widening) will have operands with types like `NxM` but will return results with types like `(N*2)x(M/2)` (double the lane width, halve the number of lanes; maintain the same number of vector bits). This is equivalent to applying two `DerivedFunction`s to the type: `DerivedFunction::DoubleWidth` then `DerivedFunction::HalfVector`. Since there is no easy way to apply multiple `DerivedFunction`s (e.g. most of the logic is one-level deep, 1d5a678124/cranelift/codegen/meta/src/gen_inst.rs (L618-L621)), I added `DerivedFunction::MergeLanes` to do the necessary type conversion.
Before this patch, running the x64 new backend would require both
compiling with --features experimental_x64 and running with
`use_new_backend`.
This patches changes this behavior so that the runtime flag is not
needed anymore: using the feature flag will enforce usage of the new
backend everywhere, making using and testing it much simpler:
cargo run --features experimental_x64 ;; other CLI options/flags
This also gives a hint at what the meta language generation would look
like after switching to the new backend.
Compiling only with the x64 codegen flag gives a nice compile time speedup.
This commit adds support for generating stackmaps at safepoints to the
new backend framework and to the AArch64 backend in particular. It has
been tested to work with SpiderMonkey.
This commit adds the inital support to allow reftypes to flow through
the program when targetting aarch64. It also adds a fix to the
`ModuleTranslationState` needed to send R32/R64 types over from the
SpiderMonkey embedding.
This commit does not include any support for safepoints in aarch64
or the `MachInst` infrastructure; that is in the next commit.
This commit also makes a drive-by improvement to `Bint`, avoiding an
unneeded zero-extension op when the extended value comes directly from a
conditional-set (which produces a full-width 0 or 1).
The main issue with the InstSize enum was that it was used both for
GPR and SIMD & FP operands, even though machine instructions do not
mix them in general (as in a destination register is either a GPR
or not). As a result it had methods such as sf_bit() that made
sense only for one type of operand.
Another issue was that the enum name was not reflecting its purpose
accurately - it was meant to represent an instruction operand size,
not an instruction size, which is fixed in A64 (always 4 bytes).
Now the enum is split into one for GPR operands and another for
scalar SIMD & FP operands.
Copyright (c) 2020, Arm Limited.