* ensure that all const assignments are placed at the end of the sequence.
This minimises live ranges.
* for the non-const assignments, ignore self-assignments. This can
dramatically reduce the total number of moves generated, because any
self-assignments trigger the overlap-case handling, hence invoking the
double-copy behaviour in cases where it's not necessary.
It's worth pointing out that self-assignments are common, and are not due to
deficiencies in CLIR optimisation. Rather, they occur whenever a loop back
edge doesn't modify *all* loop-carried values. This can easily happen if
the loop has multiple "early" back-edges -- "continues" in C parlance. Eg:
loop_header(a, b, c, d, e, f):
...
a_new = ...
b_new = ...
if (..) goto loop_header(a_new, b_new, c, d, e, f)
...
c_new = ...
d_new = ...
if (..) goto loop_header(a_new, b_new, c_new, d_new, e, f)
etc
For functions with many live values, this can dramatically reduce the number
of spill moves we throw into the register allocator.
In terms of compilation costs, this ranges from neutral for functions which
spill not at all, or minimally (joey_small, joey_med) to a 7.1% reduction in
insn count.
In terms of run costs, for one spill-heavy test (bz2 w/ custom timing harness),
instruction counts are reduced by 4.3%, data reads by 12.3% and data writes
by 18.5%. Note those last two figures include all reads and writes made by the
generated code, not just spills/reloads, so the proportional reduction in
spill/reload traffic must be greater.
- Properly mask constant values down to appropriate width when
generating a constant value directly in aarch64 backend. This was a
miscompilation introduced in the new-isel refactor. In combination
with failure to respect NarrowValueMode, this resulted in a very
subtle bug when an `i32` constant was used in bit-twiddling logic.
- Add support for `iadd_ifcout` in aarch64 backend as used in explicit
heap-check mode. With this change, we no longer fail heap-related
tests with the huge-heap-region mode disabled.
- Remove a panic that was occurring in some tests that are currently
ignored on aarch64, by simply returning empty/default information in
`value_label` functionality rather than touching unimplemented APIs.
This is not a bugfix per-se, but removes confusing panic messages from
`cargo test` output that might otherwise mislead.
This avoids the set uniqueness (hashing) test, reduces memory
churn when re-mapping virtual register onto real registers, and is
generally more memory-efficient.
This patch fixes a subtle bug that occurred in the MachBuffer branch
optimization: in tracking labels at the current buffer tail using a
sorted-by-offset array, the code did not update this array properly when
redirecting labels. As a result, the dead-branch removal was unsafe,
because not every label pointing to a branch is guaranteed to be
redirected properly first.
Discovered while doing performance testing: bz2 silently took a wrong
branch and exited compression early. (Eek!)
To address this problem, this patch adopts a slightly simpler data
structure: we only track the labels *at the current buffer tail*, and
*at the start of each branch*, and we're careful to update these
appropriately to maintain the invariants. I'm pretty confident that this
is correct now, but we should (still) fuzz it a bunch, because wrong
control flow scares me a nonzero amount. I should probably also actually
write out a formal proof that these data-structure updates are correct.
The optimizations are important for performance (removing useless empty
blocks, and taking advantage of any fallthrough opportunities at all),
so I don't think we would want to drop them entirely.
This patch includes:
- A complete rework of the way that CLIF blocks and edge blocks are
lowered into VCode blocks. The new mechanism in `BlockLoweringOrder`
computes RPO over the CFG, but with a twist: it merges edge blocks intto
heads or tails of original CLIF blocks wherever possible, and it does
this without ever actually materializing the full nodes-plus-edges
graph first. The backend driver lowers blocks in final order so
there's no need to reshuffle later.
- A new `MachBuffer` that replaces the `MachSection`. This is a special
version of a code-sink that is far more than a humble `Vec<u8>`. In
particular, it keeps a record of label definitions and label uses,
with a machine-pluggable `LabelUse` trait that defines various types
of fixups (basically internal relocations).
Importantly, it implements some simple peephole-style branch rewrites
*inline in the emission pass*, without any separate traversals over
the code to use fallthroughs, swap taken/not-taken arms, etc. It
tracks branches at the tail of the buffer and can (i) remove blocks
that are just unconditional branches (by redirecting the label), (ii)
understand a conditional/unconditional pair and swap the conditional
polarity when it's helpful; and (iii) remove branches that branch to
the fallthrough PC.
The `MachBuffer` also implements branch-island support. On
architectures like AArch64, this is needed to allow conditional
branches within plausibly-attainable ranges (+/- 1MB on AArch64
specifically). It also does this inline while streaming through the
emission, without any sort of fixpoint algorithm or later moving of
code, by simply tracking outstanding references and "deadlines" and
emitting an island just-in-time when we're in danger of going out of
range.
- A rework of the instruction selector driver. This is largely following
the same algorithm as before, but is cleaned up significantly, in
particular in the API: the machine backend can ask for an input arg
and get any of three forms (constant, register, producing
instruction), indicating it needs the register or can merge the
constant or producing instruction as appropriate. This new driver
takes special care to emit constants right at use-sites (and at phi
inputs), minimizing their live-ranges, and also special-cases the
"pinned register" to avoid superfluous moves.
Overall, on `bz2.wasm`, the results are:
wasmtime full run (compile + runtime) of bz2:
baseline: 9774M insns, 9742M cycles, 3.918s
w/ changes: 7012M insns, 6888M cycles, 2.958s (24.5% faster, 28.3% fewer insns)
clif-util wasm compile bz2:
baseline: 2633M insns, 3278M cycles, 1.034s
w/ changes: 2366M insns, 2920M cycles, 0.923s (10.7% faster, 10.1% fewer insns)
All numbers are averages of two runs on an Ampere eMAG.
compilation.
This saves ~0.14% instruction count, ~0.18% allocated bytes, and ~1.5%
allocated blocks on a `clif-util wasm` compilation of `bz2.wasm` for
aarch64.
This PR changes the aarch64 ABI implementation to use positive offsets
from SP, rather than negative offsets from FP, to refer to spill slots
and stack-local storage. This allows for better addressing-mode options,
and hence slightly better code: e.g., the unsigned scaled 12-bit offset
mode can be used to reach anywhere in a 32KB frame without extra
address-construction instructions, whereas negative offsets are limited
to a signed 9-bit unscaled mode (-256 bytes).
To enable this, the PR introduces a notion of "nominal SP offsets" as a
virtual addressing mode, lowered during the emission pass. The offsets
are relative to "SP after adjusting downward to allocate stack/spill
slots", but before pushing clobbers. This allows the addressing-mode
expressions to be generated before register allocation (or during it,
for spill/reload sequences).
To convert these offsets into *true* offsets from SP, we need to track
how much further SP is moved downward, and compensate for this. We do so
with "virtual SP offset adjustment" pseudo-instructions: these are seen
by the emission pass, and result in no instruction (0 byte output), but
update state that is now threaded through each instruction emission in
turn. In this way, we can push e.g. stack args for a call and adjust
the virtual SP offset, allowing reloads from nominal-SP-relative
spillslots while we do the argument setup with "real SP offsets" at the
same time.
Previously, every call was lowered on AArch64 to a `call` instruction, which
takes a signed 26-bit PC-relative offset. Including the 2-bit left shift, this
gives a range of +/- 128 MB. Longer-distance offsets would cause an impossible
relocation record to be emitted (or rather, a record that a more sophisticated
linker would fix up by inserting a shim/veneer).
This commit adds a notion of "relocation distance" in the MachInst backends,
and provides this information for every call target and symbol reference. The
intent is that backends on architectures like AArch64, where there are different
offset sizes / addressing strategies to choose from, can either emit a regular
call or a load-64-bit-constant / call-indirect sequence, as necessary. This
avoids the need to implement complex linking behavior.
The MachInst driver code provides this information based on the "colocated" bit
in the CLIF symbol references, which appears to have been designed for this
purpose, or at least a similar one. Combined with the `use_colocated_libcalls`
setting, this allows client code to ensure that library calls can link to
library code at any location in the address space.
Separately, the `simplejit` example did not handle `Arm64Call`; rather than doing
so, it appears all that is necessary to get its tests to pass is to set the
`use_colocated_libcalls` flag to false, to make use of the above change. This
fixes the `libcall_function` unit-test in this crate.
Previously, the SourceLoc information transferred in `VCode` only
included PC-spans for non-default SourceLocs. I realized that the
invariant we're supposed to keep here is that every PC is covered; if no
source information, just use `SourceLoc::default()`.
This was spurred by @bjorn3's comment in #1575 (thanks!).
This PR updates Cranelift to use the new version of regalloc.rs
(bytecodealliance/regalloc.rs#55) that provides dense vreg->rreg maps to
the `map_reg()` function for each instruction, rather than the earlier
hashmap-based approach.
In one test (regex-rs.wasm), this PR results in a 15% reduction in
memory allocations (1245MB -> 1060MB) as measured by DHAT on `clif-util
wasm` runs.
This change adds SourceLoc information per instruction in a `VCode<Inst>`
container, and keeps this information up-to-date across register allocation
and branch reordering. The information is initially collected during
instruction lowering, eventually collected on the MachSection, and finally
provided to the environment that wraps the codegen crate for wasmtime.
- Undo temporary changes to default features (`all-arch`) and a
signal-handler test.
- Remove `SIGTRAP` handler: no longer needed now that we've found an
"undefined opcode" option on ARM64.
- Rename pp.rs to pretty_print.rs in machinst/.
- Only use empty stack-probe on non-x86. As per a comment in
rust-lang/compiler-builtins [1], LLVM only supports stack probes on
x86 and x86-64. Thus, on any other CPU architecture, we cannot refer
to `__rust_probestack`, because it does not exist.
- Rename arm64 to aarch64.
- Use `target` directive in vcode filetests.
- Run the flags verifier, but without encinfo, when using new backends.
- Clean up warning overrides.
- Fix up use of casts: use u32::from(x) and siblings when possible,
u32::try_from(x).unwrap() when not, to avoid silent truncation.
- Take immutable `Function` borrows as input; we don't actually
mutate the input IR.
- Lots of other miscellaneous cleanups.
[1] cae3e6ea23/src/probestack.rs (L39)
This patch adds the MachInst, or Machine Instruction, infrastructure.
This is the machine-independent portion of the new backend design. It
contains the implementation of the "vcode" (virtual-registerized code)
container, the top-level lowering algorithm and compilation pipeline,
and the trait definitions that the machine backends will fill in.
This backend infrastructure is included in the compilation of the
`codegen` crate, but it is not yet tied into the public APIs; that patch
will come last, after all the other pieces are filled in.
This patch contains code written by Julian Seward <jseward@acm.org> and
Benjamin Bouvier <public@benj.me>, originally developed on a side-branch
before rebasing and condensing into this patch series. See the `arm64`
branch at `https://github.com/cfallin/wasmtime` for original development
history.
Co-authored-by: Julian Seward <jseward@acm.org>
Co-authored-by: Benjamin Bouvier <public@benj.me>