A branch is considered side-effecting and so updates the instruction
color (which is our way of computing how far instructions can sink).
However, in the lowering loop, we did not update current instruction
color when scanning backward across branches, which are side-effecting.
As a result, the color was stale and fewer load-op merges were permitted
than are actually possible.
Note that this would not have resulted in any correctness issues, as the
stale color is too high (so no merges are permitted that should have
been disallowed).
Fixes#2562.
This will allow for support for `I128` values everywhere, and `I64`
values on 32-bit targets (e.g., ARM32 and x86-32). It does not alter the
machine backends to build such support; it just adds the framework for
the MachInst backends to *reason* about a `Value` residing in more than
one register.
This fixes a subtle corner case exposed during fuzzing. If we have a bit
of CLIF like:
```
v0 = load.i64 ...
v1 = iadd.i64 v0, ...
v2 = do_other_thing v1
v3 = load.i64 v1
```
and if this is lowered using a machine backend that can merge loads into
ALU ops, *and* that has an addressing mode that can look through add
ops, then the following can happen:
1. We lower the load at `v3`. This looks backward at the address
operand tree and finds that `v1` is `v0` plus other things; it has an
addressing mode that can add `v0`'s register and the other things
directly; so it calls `put_value_in_reg(v0)` and uses its register in
the amode. At this point, the add producing `v1` has no references,
so it will not (yet) be codegen'd.
2. We lower `do_other_thing`, which puts `v1` in a register and uses it.
the `iadd` now has a reference.
3. We reach the `iadd` and, because it has a reference, lower it. Our
machine has the ability to merge a load into an ALU operation.
Crucially, *we think the load at `v0` is mergeable* because it has
only one user, the add at `v1` (!). So we merge it.
4. We reach the `load` at `v0` and because it has been merged into the
`iadd`, we do not separately codegen it. The register that holds `v0`
is thus never written, and the use of this register by the final load
(Step 1) will see an undefined value.
The logic error here is that in the presence of pattern matching that
looks through pure ops, we can end up with multiple uses of a value that
originally had a single use (because we allow lookthrough of pure ops in
all cases). In other words, the multiple-use-ness of `v1` "passes
through" in some sense to `v0`. However, the load sinking logic is not
aware of this.
The fix, I think, is pretty simple: we disallow an effectful instruction
from sinking/merging if it already has some other use when we look back
at it.
If we disallowed lookthrough of *any* op that had multiple uses, even
pure ones, then we would avoid this scenario; but earlier experiments
showed that to have a non-negligible performance impact, so (given that
we've worked out the logic above) I think this complexity is worth it.
This PR updates the "coloring" scheme that accounts for side-effects in
the MachInst lowering logic. As a result, the new backends will now be
able to merge effectful operations (such as memory loads) *into* other
operations; previously, only the other way (pure ops merged into
effectful ops) was possible. This will allow, for example, a load+ALU-op
combination, as is common on x86. It should even allow a load + ALU-op +
store sequence to merge into one lowered instruction.
The scheme arose from many fruitful discussions with @julian-seward1
(thanks!); significant credit is due to him for the insights here.
The first insight is that given the right basic conditions, i.e. that
the root instruction is the only use of an effectful instruction's
result, all we need is that the "color" of the effectful instruction is
*one less* than the color of the current instruction. It's easier to
think about colors on the program points between instructions: if the
color coming *out* of the first (effectful def) instruction and *in* to
the second (effectful or effect-free use) instruction are the same, then
they can merge. Basically the color denotes a version of global state;
if the same, then no other effectful ops happened in the meantime.
The second insight is that we can keep state as we scan, tracking the
"current color", and *update* this when we sink (merge) an op. Hence
when we sink a load into another op, we effectively *re-color* every
instruction it moved over; this may allow further sinks.
Consider the example (and assume that we consider loads effectful in
order to conservatively ensure a strong memory model; otherwise, replace
with other effectful value-producing insts):
```
v0 = load x
v1 = load y
v2 = add v0, 1
v3 = add v1, 1
```
Scanning from bottom to top, we first see the add producing `v3` and we
can sink the load producing `v1` into it, producing a load + ALU-op
machine instruction. This is legal because `v1` moves over only `v2`,
which is a pure instruction. Consider, though, `v2`: under a simple
scheme that has no other context, `v0` could not sink to `v2` because it
would move over `v1`, another load. But because we already sunk `v1`
down to `v3`, we are free to sink `v0` to `v2`; the update of the
"current color" during the scan allows this.
This PR also cleans up the `LowerCtx` interface a bit at the same time:
whereas previously it always gave some subset of (constant, mergeable
inst, register) directly from `LowerCtx::get_input()`, it now returns
zero or more of (constant, mergable inst) from
`LowerCtx::maybe_get_input_as_source_or_const()`, and returns the
register only from `LowerCtx::put_input_in_reg()`. This removes the need
to explicitly denote uses of the register, so it's a little safer.
Note that this PR does not actually make use of the new ability to merge
loads into other ops; that will come in future PRs, especially to
optimize the `x64` backend by using direct-memory operands.
`lucetc` currently *almost*, but not quite, works with the new x64
backend; the only missing piece is support for the particular
instructions emitted as part of its prologue stack-check.
We do not normally see `brff`, `brif`, or `ifcmp_sp` in CLIF generated by
`cranelift-wasm` without the old-backend legalization rules, so these
were not supported in the new x64 backend as they were not necessary for
Wasm MVP support. Using them resulted in an `unimplemented!()` panic.
This PR adds support for `brff` and `brif` analogously to how AArch64
implements them, by pattern-matching the `ifcmp` / `ffcmp` directly.
Then `ifcmp_sp` is a straightforward variant of `ifcmp`.
Along the way, this also removes the notion of "fallthrough block" from
the branch-group lowering method; instead, `fallthrough` instructions
are handled as normal branches to their explicitly-provided targets,
which (in the original CLIF) match the fallthrough block. The reason for
this is that the block reordering done as part of lowering can change
the fallthrough block. We were not using `fallthrough` instructions in
the output produced by `cranelift-wasm`, so this, too, was not
previously caught.
With these changes, the `lucetc` crate in Lucet passes all tests with
the `x64` feature-flag added to its `cranelift-codegen` dependency.
This approach suffers from memory-size bloat during compile time due to the desire to de-duplicate the constants emitted and reduce runtime memory-size. As a first step, though, this provides an end-to-end mechanism for constants to be emitted in the MachBuffer islands.
When performing a function call, the platform ABI may require space
on the stack to hold outgoing arguments and/or return values.
Currently, this is supported via decrementing the stack pointer
before the call and incrementing it afterwards, using the
emit_stack_pre_adjust and emit_stack_post_adjust methods of
ABICaller. However, on some platforms it would be preferable
to just allocate enough space for any call done in the function
in the caller's prologue instead.
This patch adds support to allow back-ends to choose that method.
Instead of calling emit_stack_pre/post_adjust around a call, they
simply call a new accumulate_outgoing_args_size method of
ABICaller instead. This will pass on the required size to the
ABICallee structure of the calling function, which will accumulate
the maximum size required for all function calls.
That accumulated size is then passed to the gen_clobber_save
and gen_clobber_restore functions so they can include the size
in the stack allocation / deallocation that already happens in
the prologue / epilogue code.
A new associated type Info is added to MachInstEmit, which is the
immutable counterpart to State. It can't easily be constructed from an
ABICallee, since it would require adding an associated type to the
latter, and making so leaks the associated type in a lot of places in
the code base and makes the code harder to read. Instead, the EmitInfo
state can simply be passed to the `Vcode::emit` function directly.
This change abstracts away (from the perspective of the new backend) how immediate values are stored in InstructionData. It gathers large immediates from necessary places (e.g. constant pool) and delegates to `InstructionData::imm_value` for the rest. This refactor only touches original users of `LowerCtx::get_immediate` but a future change could do the same for any place the new backend is accessing InstructionData directly to retrieve immediates.
As part of a Wasm JIT update, SpiderMonkey is changing its internal
WebAssembly function ABI. The new ABI's frame format includes "caller
TLS" and "callee TLS" slots. The details of where these come from are
not important; from Cranelift's point of view, the only relevant
requirement is that we have two on-stack args that are always present
(offsetting other on-stack args), and that we define special argument
purposes so that we can supply values for these slots.
Note that this adds a *new* ABI (a variant of the Baldrdash ABI) because
we do not want to tightly couple the landing of this PR to the landing
of the changes in SpiderMonkey; it's better if both the old and new
behavior remain available in Cranelift, so SpiderMonkey can continue to
vendor Cranelift even if it does not land (or backs out) the ABI change.
Furthermore, note that this needs to be a Cranelift-level change (i.e.
cannot be done purely from the translator environment implementation)
because the special TLS arguments must always go on the stack, which
would not otherwise happen with the usual argument-placement logic; and
there is no primitive to push a value directly in CLIF code (the notion
of a stack frame is a lower-level concept).
Previously, in #2128, we factored out a common "vanilla 64-bit ABI"
implementation from the AArch64 ABI code, with the idea that this should
be largely compatible with x64. This PR alters the new x64 backend to
make use of the shared infrastructure, removing the duplication that
existed previously. The generated code is nearly (not exactly) the same;
the only difference relates to how the clobber-save region is padded in
the prologue.
This also changes some register allocations in the aarch64 code because
call support in the shared ABI infra now passes a temp vreg in, rather
than requiring use of a fixed, non-allocable temp; tests have been
updated, and the runtime behavior is unchanged.
The pinned register hack didn't work because the GetPinnedReg is marked
as having side-effects, so that GVN wouldn't try to common it out.
This commit tweaks the function used during lowering to vcode, so that
the GetPinnedReg opcode is specially handled. It's a bit lame, but it
makes the hack work again.
Also, use_input needs to be a no-op for real registers.
The implementation is pretty straightforward. Wasm atomic instructions fall
into 5 groups
* atomic read-modify-write
* atomic compare-and-swap
* atomic loads
* atomic stores
* fences
and the implementation mirrors that structure, at both the CLIF and AArch64
levels.
At the CLIF level, there are five new instructions, one for each group. Some
comments about these:
* for those that take addresses (all except fences), the address is contained
entirely in a single `Value`; there is no offset field as there is with
normal loads and stores. Wasm atomics require alignment checks, and
removing the offset makes implementation of those checks a bit simpler.
* atomic loads and stores get their own instructions, rather than reusing the
existing load and store instructions, for two reasons:
- per above comment, makes alignment checking simpler
- reuse of existing loads and stores would require extension of `MemFlags`
to indicate atomicity, which sounds semantically unclean. For example,
then *any* instruction carrying `MemFlags` could be marked as atomic, even
in cases where it is meaningless or ambiguous.
* I tried to specify, in comments, the behaviour of these instructions as
tightly as I could. Unfortunately there is no way (per my limited CLIF
knowledge) to enforce the constraint that they may only be used on I8, I16,
I32 and I64 types, and in particular not on floating point or vector types.
The translation from Wasm to CLIF, in `code_translator.rs` is unremarkable.
At the AArch64 level, there are also five new instructions, one for each
group. All of them except `::Fence` contain multiple real machine
instructions. Atomic r-m-w and atomic c-a-s are emitted as the usual
load-linked store-conditional loops, guarded at both ends by memory fences.
Atomic loads and stores are emitted as a load preceded by a fence, and a store
followed by a fence, respectively. The amount of fencing may be overkill, but
it reflects exactly what the SM Wasm baseline compiler for AArch64 does.
One reason to implement r-m-w and c-a-s as a single insn which is expanded
only at emission time is that we must be very careful what instructions we
allow in between the load-linked and store-conditional. In particular, we
cannot allow *any* extra memory transactions in there, since -- particularly
on low-end hardware -- that might cause the transaction to fail, hence
deadlocking the generated code. That implies that we can't present the LL/SC
loop to the register allocator as its constituent instructions, since it might
insert spills anywhere. Hence we must present it as a single indivisible
unit, as we do here. It also has the benefit of reducing the total amount of
work the RA has to do.
The only other notable feature of the r-m-w and c-a-s translations into
AArch64 code, is that they both need a scratch register internally. Rather
than faking one up by claiming, in `get_regs` that it modifies an extra
scratch register, and having to have a dummy initialisation of it, these new
instructions (`::LLSC` and `::CAS`) simply use fixed registers in the range
x24-x28. We rely on the RA's ability to coalesce V<-->R copies to make the
cost of the resulting extra copies zero or almost zero. x24-x28 are chosen so
as to be call-clobbered, hence their use is less likely to interfere with long
live ranges that span calls.
One subtlety regarding the use of completely fixed input and output registers
is that we must be careful how the surrounding copy from/to of the arg/result
registers is done. In particular, it is not safe to simply emit copies in
some arbitrary order if one of the arg registers is a real reg. For that
reason, the arguments are first moved into virtual regs if they are not
already there, using a new method `<LowerCtx for Lower>::ensure_in_vreg`.
Again, we rely on coalescing to turn them into no-ops in the common case.
There is also a ridealong fix for the AArch64 lowering case for
`Opcode::Trapif | Opcode::Trapff`, which removes a bug in which two trap insns
in a row were generated.
In the patch as submitted there are 6 "FIXME JRS" comments, which mark things
which I believe to be correct, but for which I would appreciate a second
opinion. Unless otherwise directed, I will remove them for the final commit
but leave the associated code/comments unchanged.
In the Baldrdash (SpiderMonkey) embedding, we must take care to
zero-extend all function arguments to callees in integer registers when
the types are narrower than 64 bits. This is because, unlike the native
SysV ABI, the Baldrdash ABI expects high bits to be cleared. Not doing
so leads to difficult-to-trace errors where high bits falsely tag an
int32 as e.g. an object pointer, leading to potential security issues.
This commit adds support for generating stackmaps at safepoints to the
new backend framework and to the AArch64 backend in particular. It has
been tested to work with SpiderMonkey.
From discussion with Julian and Ben, this PR makes a few documentation-
and naming-level changes (no functionality change):
- Document that the `LowerCtx`-provided output register can be used as a
scratch register during the lowered instruction sequence before
placing the final result in it.
- Rename `input_to_*` helpers in the AArch64 backend to
`put_input_in_*`, emphasizing that these are side-effecting helpers
that potentially generate code (e.g., sign/zero-extensions) to ensure
an input value is in a register.
* ensure that all const assignments are placed at the end of the sequence.
This minimises live ranges.
* for the non-const assignments, ignore self-assignments. This can
dramatically reduce the total number of moves generated, because any
self-assignments trigger the overlap-case handling, hence invoking the
double-copy behaviour in cases where it's not necessary.
It's worth pointing out that self-assignments are common, and are not due to
deficiencies in CLIR optimisation. Rather, they occur whenever a loop back
edge doesn't modify *all* loop-carried values. This can easily happen if
the loop has multiple "early" back-edges -- "continues" in C parlance. Eg:
loop_header(a, b, c, d, e, f):
...
a_new = ...
b_new = ...
if (..) goto loop_header(a_new, b_new, c, d, e, f)
...
c_new = ...
d_new = ...
if (..) goto loop_header(a_new, b_new, c_new, d_new, e, f)
etc
For functions with many live values, this can dramatically reduce the number
of spill moves we throw into the register allocator.
In terms of compilation costs, this ranges from neutral for functions which
spill not at all, or minimally (joey_small, joey_med) to a 7.1% reduction in
insn count.
In terms of run costs, for one spill-heavy test (bz2 w/ custom timing harness),
instruction counts are reduced by 4.3%, data reads by 12.3% and data writes
by 18.5%. Note those last two figures include all reads and writes made by the
generated code, not just spills/reloads, so the proportional reduction in
spill/reload traffic must be greater.
- Properly mask constant values down to appropriate width when
generating a constant value directly in aarch64 backend. This was a
miscompilation introduced in the new-isel refactor. In combination
with failure to respect NarrowValueMode, this resulted in a very
subtle bug when an `i32` constant was used in bit-twiddling logic.
- Add support for `iadd_ifcout` in aarch64 backend as used in explicit
heap-check mode. With this change, we no longer fail heap-related
tests with the huge-heap-region mode disabled.
- Remove a panic that was occurring in some tests that are currently
ignored on aarch64, by simply returning empty/default information in
`value_label` functionality rather than touching unimplemented APIs.
This is not a bugfix per-se, but removes confusing panic messages from
`cargo test` output that might otherwise mislead.
This patch includes:
- A complete rework of the way that CLIF blocks and edge blocks are
lowered into VCode blocks. The new mechanism in `BlockLoweringOrder`
computes RPO over the CFG, but with a twist: it merges edge blocks intto
heads or tails of original CLIF blocks wherever possible, and it does
this without ever actually materializing the full nodes-plus-edges
graph first. The backend driver lowers blocks in final order so
there's no need to reshuffle later.
- A new `MachBuffer` that replaces the `MachSection`. This is a special
version of a code-sink that is far more than a humble `Vec<u8>`. In
particular, it keeps a record of label definitions and label uses,
with a machine-pluggable `LabelUse` trait that defines various types
of fixups (basically internal relocations).
Importantly, it implements some simple peephole-style branch rewrites
*inline in the emission pass*, without any separate traversals over
the code to use fallthroughs, swap taken/not-taken arms, etc. It
tracks branches at the tail of the buffer and can (i) remove blocks
that are just unconditional branches (by redirecting the label), (ii)
understand a conditional/unconditional pair and swap the conditional
polarity when it's helpful; and (iii) remove branches that branch to
the fallthrough PC.
The `MachBuffer` also implements branch-island support. On
architectures like AArch64, this is needed to allow conditional
branches within plausibly-attainable ranges (+/- 1MB on AArch64
specifically). It also does this inline while streaming through the
emission, without any sort of fixpoint algorithm or later moving of
code, by simply tracking outstanding references and "deadlines" and
emitting an island just-in-time when we're in danger of going out of
range.
- A rework of the instruction selector driver. This is largely following
the same algorithm as before, but is cleaned up significantly, in
particular in the API: the machine backend can ask for an input arg
and get any of three forms (constant, register, producing
instruction), indicating it needs the register or can merge the
constant or producing instruction as appropriate. This new driver
takes special care to emit constants right at use-sites (and at phi
inputs), minimizing their live-ranges, and also special-cases the
"pinned register" to avoid superfluous moves.
Overall, on `bz2.wasm`, the results are:
wasmtime full run (compile + runtime) of bz2:
baseline: 9774M insns, 9742M cycles, 3.918s
w/ changes: 7012M insns, 6888M cycles, 2.958s (24.5% faster, 28.3% fewer insns)
clif-util wasm compile bz2:
baseline: 2633M insns, 3278M cycles, 1.034s
w/ changes: 2366M insns, 2920M cycles, 0.923s (10.7% faster, 10.1% fewer insns)
All numbers are averages of two runs on an Ampere eMAG.
Previously, every call was lowered on AArch64 to a `call` instruction, which
takes a signed 26-bit PC-relative offset. Including the 2-bit left shift, this
gives a range of +/- 128 MB. Longer-distance offsets would cause an impossible
relocation record to be emitted (or rather, a record that a more sophisticated
linker would fix up by inserting a shim/veneer).
This commit adds a notion of "relocation distance" in the MachInst backends,
and provides this information for every call target and symbol reference. The
intent is that backends on architectures like AArch64, where there are different
offset sizes / addressing strategies to choose from, can either emit a regular
call or a load-64-bit-constant / call-indirect sequence, as necessary. This
avoids the need to implement complex linking behavior.
The MachInst driver code provides this information based on the "colocated" bit
in the CLIF symbol references, which appears to have been designed for this
purpose, or at least a similar one. Combined with the `use_colocated_libcalls`
setting, this allows client code to ensure that library calls can link to
library code at any location in the address space.
Separately, the `simplejit` example did not handle `Arm64Call`; rather than doing
so, it appears all that is necessary to get its tests to pass is to set the
`use_colocated_libcalls` flag to false, to make use of the above change. This
fixes the `libcall_function` unit-test in this crate.
This change adds SourceLoc information per instruction in a `VCode<Inst>`
container, and keeps this information up-to-date across register allocation
and branch reordering. The information is initially collected during
instruction lowering, eventually collected on the MachSection, and finally
provided to the environment that wraps the codegen crate for wasmtime.
- Undo temporary changes to default features (`all-arch`) and a
signal-handler test.
- Remove `SIGTRAP` handler: no longer needed now that we've found an
"undefined opcode" option on ARM64.
- Rename pp.rs to pretty_print.rs in machinst/.
- Only use empty stack-probe on non-x86. As per a comment in
rust-lang/compiler-builtins [1], LLVM only supports stack probes on
x86 and x86-64. Thus, on any other CPU architecture, we cannot refer
to `__rust_probestack`, because it does not exist.
- Rename arm64 to aarch64.
- Use `target` directive in vcode filetests.
- Run the flags verifier, but without encinfo, when using new backends.
- Clean up warning overrides.
- Fix up use of casts: use u32::from(x) and siblings when possible,
u32::try_from(x).unwrap() when not, to avoid silent truncation.
- Take immutable `Function` borrows as input; we don't actually
mutate the input IR.
- Lots of other miscellaneous cleanups.
[1] cae3e6ea23/src/probestack.rs (L39)
This patch adds the MachInst, or Machine Instruction, infrastructure.
This is the machine-independent portion of the new backend design. It
contains the implementation of the "vcode" (virtual-registerized code)
container, the top-level lowering algorithm and compilation pipeline,
and the trait definitions that the machine backends will fill in.
This backend infrastructure is included in the compilation of the
`codegen` crate, but it is not yet tied into the public APIs; that patch
will come last, after all the other pieces are filled in.
This patch contains code written by Julian Seward <jseward@acm.org> and
Benjamin Bouvier <public@benj.me>, originally developed on a side-branch
before rebasing and condensing into this patch series. See the `arm64`
branch at `https://github.com/cfallin/wasmtime` for original development
history.
Co-authored-by: Julian Seward <jseward@acm.org>
Co-authored-by: Benjamin Bouvier <public@benj.me>