This commit disables the `MachBuffer::check_label_branch_invariants`
debug check on the fuzzers due to it causing timeouts with the test case
from #3441. Fuzzing leads to a 20-30x slowdown of executed code and
locally the fuzz time it takes to instantiate #3441 drops from 3 minutes
to 6 seconds disabling this function. Note that this should still be
executed during our testing on CI since it's still enabled for debug
assertions.
Fixes#3468.
If a program has many instances of the pattern "goto next; next:" in a
row (i.e., no-op branches to the fallthrough address), the branch
simplification in `MachBuffer` would remove them all, as expected.
However, in order to work correctly, the algorithm needs to track all
labels that alias the current buffer tail, so that they can be adjusted
later if another branch chomp occurs.
When many thousands of this branch-to-next pattern occur, many thousands
of labels will reference the current buffer tail, and this list of
thousands of labels will be shuffled between the branch metadata struct
and the "labels at tail" struct as branches are appended and then
chomped immediately.
It's possible that with smarter data structure design, we could somehow
share the list of labels -- e.g., a single array of all labels, in order
they are bound, with ranges of indices in this array used to represent
lists of labels (actually, that seems like a better design in general);
but let's leave that to future optimization work.
For now, we can avoid the quadratic behavior by just "giving up" if the
list is too long; it's always valid to not optimize a branch. It is very
unlikely that the "normal" case will have more than 100 "goto next"
branches in a row, so this should not have any perf impact; if it does,
we will leave 1 out of every 100 such branches un-optimized in a long
sequence of thousands.
This takes total compilation time down on my machine from ~300ms to
~72ms for the `foo.wasm` case in #3441. For reference, the old backend
(now removed), built from arbitrarily-chosen-1-year-old commit
`c7fcc344`, takes 158ms, so we're ~twice as fast, which is what I would
expect.
This also paves the way for unifying TargetIsa and MachBackend, since now they map one to one. In theory the two traits could be merged, which would be nice to limit the number of total concepts. Also they have quite different responsibilities, so it might be fine to keep them separate.
Interestingly, this PR started as removing RegInfo from the TargetIsa trait since the adapter returned a dummy value there. From the fallout, noticed that all Display implementations didn't needed an ISA anymore (since these were only used to render ISA specific registers). Also the whole family of RegInfo / ValueLoc / RegUnit was exclusively used for the old backend, and these could be removed. Notably, some IR instructions needed to be removed, because they were using RegUnit too: this was the oddball of regfill / regmove / regspill / copy_special, which were IR instructions inserted by the old regalloc. Fare thee well!
- Add relocation handling needed after PR #3275
- Fix incorrect handling of signed constants detected by PR #3056 test
- Fix LabelUse max pos/neg ranges; fix overflow in buffers.rs
- Disable fuzzing tests that require pre-built v8 binaries
- Disable cranelift test that depends on i128
- Temporarily disable memory64 tests
* Use relative `call` instructions between wasm functions
This commit is a relatively major change to the way that Wasmtime
generates code for Wasm modules and how functions call each other.
Prior to this commit all function calls between functions, even if they
were defined in the same module, were done indirectly through a
register. To implement this the backend would emit an absolute 8-byte
relocation near all function calls, load that address into a register,
and then call it. While this technique is simple to implement and easy
to get right, it has two primary downsides associated with it:
* Function calls are always indirect which means they are more difficult
to predict, resulting in worse performance.
* Generating a relocation-per-function call requires expensive
relocation resolution at module-load time, which can be a large
contributing factor to how long it takes to load a precompiled module.
To fix these issues, while also somewhat compromising on the previously
simple implementation technique, this commit switches wasm calls within
a module to using the `colocated` flag enabled in Cranelift-speak, which
basically means that a relative call instruction is used with a
relocation that's resolved relative to the pc of the call instruction
itself.
When switching the `colocated` flag to `true` this commit is also then
able to move much of the relocation resolution from `wasmtime_jit::link`
into `wasmtime_cranelift::obj` during object-construction time. This
frontloads all relocation work which means that there's actually no
relocations related to function calls in the final image, solving both
of our points above.
The main gotcha in implementing this technique is that there are
hardware limitations to relative function calls which mean we can't
simply blindly use them. AArch64, for example, can only go +/- 64 MB
from the `bl` instruction to the target, which means that if the
function we're calling is a greater distance away then we would fail to
resolve that relocation. On x86_64 the limits are +/- 2GB which are much
larger, but theoretically still feasible to hit. Consequently the main
increase in implementation complexity is fixing this issue.
This issue is actually already present in Cranelift itself, and is
internally one of the invariants handled by the `MachBuffer` type. When
generating a function relative jumps between basic blocks have similar
restrictions. This commit adds new methods for the `MachBackend` trait
and updates the implementation of `MachBuffer` to account for all these
new branches. Specifically the changes to `MachBuffer` are:
* For AAarch64 the `LabelUse::Branch26` value now supports veneers, and
AArch64 calls use this to resolve relocations.
* The `emit_island` function has been rewritten internally to handle
some cases which previously didn't come up before, such as:
* When emitting an island the deadline is now recalculated, where
previously it was always set to infinitely in the future. This was ok
prior since only a `Branch19` supported veneers and once it was
promoted no veneers were supported, so without multiple layers of
promotion the lack of a new deadline was ok.
* When emitting an island all pending fixups had veneers forced if
their branch target wasn't known yet. This was generally ok for
19-bit fixups since the only kind getting a veneer was a 19-bit
fixup, but with mixed kinds it's a bit odd to force veneers for a
26-bit fixup just because a nearby 19-bit fixup needed a veneer.
Instead fixups are now re-enqueued unless they're known to be
out-of-bounds. This may run the risk of generating more islands for
19-bit branches but it should also reduce the number of islands for
between-function calls.
* Otherwise the internal logic was tweaked to ideally be a bit more
simple, but that's a pretty subjective criteria in compilers...
I've added some simple testing of this for now. A synthetic compiler
option was create to simply add padded 0s between functions and test
cases implement various forms of calls that at least need veneers. A
test is also included for x86_64, but it is unfortunately pretty slow
because it requires generating 2GB of output. I'm hoping for now it's
not too bad, but we can disable the test if it's prohibitive and
otherwise just comment the necessary portions to be sure to run the
ignored test if these parts of the code have changed.
The final end-result of this commit is that for a large module I'm
working with the number of relocations dropped to zero, meaning that
nothing actually needs to be done to the text section when it's loaded
into memory (yay!). I haven't run final benchmarks yet but this is the
last remaining source of significant slowdown when loading modules,
after I land a number of other PRs both active and ones that I only have
locally for now.
* Fix arm32
* Review comments
* Replace some cfg(debug) with cfg(debug_assertions)
Cargo nor rustc ever sets `cfg(debug)` automatically, so it's expected
that these usages were intended to be `cfg(debug_assertions)`.
* Fix MachBuffer debug-assertion invariant checks.
We should only check invariants when we expect them to be true --
specifically, before the branch-simplification algorithm runs. At other
times, they may be temporarily violated: e.g., after
`add_{cond,uncond}_branch()` but before emitting the branch bytes. This
is the expected sequence, and the rest of the code is consistent with
that.
Some of the checks also were not quite right (w.r.t. the written
invariants); specifically, we should not check validity of a label's
offset when the label has been aliased to another label.
It seems that this is an unfortunate consequence of leftover
debug-assertions that weren't actually being run, so weren't kept
up-to-date. Should no longer happen now that we actually check these!
Co-authored-by: Chris Fallin <chris@cfallin.org>
The truncate_last_branch removes an instruction that had already
been added to the buffer, and must update various bookkeeping.
However, updating the "srclocs" field is incorrect: if there is
a srclocs entry that spans both the removed branch *and some
previous instruction*, that whole srclocs entry is removed,
which makes those previous instructions now uncovered by any
srclocs record. This can cause subsequent problems e.g. if
one of those instructions traps.
Fixed by just truncating instead of fully removing the srclocs
record in this case.
This is sometimes useful when performing analyses on the generated
machine code: for example, some kinds of code verifiers will want to do
a control-flow analysis, and it is much easier to do this if one does
not have to recover the CFG from the machine code (doing so requires
heavyweight analysis when indirect branches are involved). If one trusts
the control-flow lowering and only needs to verify other properties of
the code, this can be very useful.
Our previous implementation of unwind infrastructure was somewhat
complex and brittle: it parsed generated instructions in order to
reverse-engineer unwind info from prologues. It also relied on some
fragile linkage to communicate instruction-layout information that VCode
was not designed to provide.
A much simpler, more reliable, and easier-to-reason-about approach is to
embed unwind directives as pseudo-instructions in the prologue as we
generate it. That way, we can say what we mean and just emit it
directly.
The usual reasoning that leads to the reverse-engineering approach is
that metadata is hard to keep in sync across optimization passes; but
here, (i) prologues are generated at the very end of the pipeline, and
(ii) if we ever do a post-prologue-gen optimization, we can treat unwind
directives as black boxes with unknown side-effects, just as we do for
some other pseudo-instructions today.
It turns out that it was easier to just build this for both x64 and
aarch64 (since they share a factored-out ABI implementation), and wire
up the platform-specific unwind-info generation for Windows and SystemV.
Now we have simpler unwind on all platforms and we can delete the old
unwind infra as soon as we remove the old backend.
There were a few consequences to supporting Fastcall unwind in
particular that led to a refactor of the common ABI. Windows only
supports naming clobbered-register save locations within 240 bytes of
the frame-pointer register, whatever one chooses that to be (RSP or
RBP). We had previously saved clobbers below the fixed frame (and below
nominal-SP). The 240-byte range has to include the old RBP too, so we're
forced to place clobbers at the top of the frame, just below saved
RBP/RIP. This is fine; we always keep a frame pointer anyway because we
use it to refer to stack args. It does mean that offsets of fixed-frame
slots (spillslots, stackslots) from RBP are no longer known before we do
regalloc, so if we ever want to index these off of RBP rather than
nominal-SP because we add support for `alloca` (dynamic frame growth),
then we'll need a "nominal-BP" mode that is resolved after regalloc and
clobber-save code is generated. I added a comment to this effect in
`abi_impl.rs`.
The above refactor touched both x64 and aarch64 because of shared code.
This had a further effect in that the old aarch64 prologue generation
subtracted from `sp` once to allocate space, then used stores to `[sp,
offset]` to save clobbers. Unfortunately the offset only has 7-bit
range, so if there are enough clobbered registers (and there can be --
aarch64 has 384 bytes of registers; at least one unit test hits this)
the stores/loads will be out-of-range. I really don't want to synthesize
large-offset sequences here; better to go back to the simpler
pre-index/post-index `stp r1, r2, [sp, #-16]` form that works just like
a "push". It's likely not much worse microarchitecturally (dependence
chain on SP, but oh well) and it actually saves an instruction if
there's no other frame to allocate. As a further advantage, it's much
simpler to understand; simpler is usually better.
This PR adds the new backend on Windows to CI as well.
If an instruction has more than one trap record associated with it (for
example: a divide instruction that has participated in load-op fusion,
so we have both a heap-out-of-bounds trap record due to its load and a
divide-by-zero trap record due to its divide op), the current MachBuffer
code would emit only one of the trap records to the sink.
Separately, divide instructions probably shouldn't merge loads, because
the two separate possible traps at one location might be confusing for
some embedders (certainly in Lucet). Divide seems to be the only case in
our current codegen where such merging might occur. This PR changes the
lowering to always force the divisor into a register.
Finally, while working out why trap records were not appearing, I had
noticed that `isa::x64::emit_std_enc_mem()` was only emitting heap-OOB
trap metadata for loads/stores when it had a srcloc. This PR ensures
that the metadata is emitted even when the srcloc is empty.
Note that none of the above presents a security or correctness problem;
trap metadata only affects the status that we return to the embedder
when a Wasm program terminates with a trap.
- Sort by generated-code offset to maintain invariant and avoid gimli
panic.
- Fix srcloc interaction with branch peephole optimization in
MachBuffer: if a srcloc range overlaps with a branch that is
truncated, remove that srcloc range.
These issues were found while fuzzing the new backend (#2453); I suspect
that they arise with the new backend because we can sink instructions
(e.g. loads or extends) in more interesting ways than before, but I'm
not entirely sure.
Test coverage will be via the fuzz corpus once #2453 lands.
In existing MachInst backends, many instructions -- any that can trap or
result in a relocation -- carry `SourceLoc` values in order to propagate
the location-in-original-source to use to describe resulting traps or
relocation errors.
This is quite tedious, and also error-prone: it is likely that the
necessary plumbing will be missed in some cases, and in any case, it's
unnecessarily verbose.
This PR factors out the `SourceLoc` handling so that it is tracked
during emission as part of the `EmitState`, and plumbed through
automatically by the machine-independent framework. Instruction emission
code that directly emits trap or relocation records can query the
current location as necessary. Then we only need to ensure that memory
references and trap instructions, at their (one) emission point rather
than their (many) lowering/generation points, are wired up correctly.
This does have the side-effect that some loads and stores that do not
correspond directly to user code's heap accesses will have unnecessary
but harmless trap metadata. For example, the load that fetches a code
offset from a jump table will have a 'heap out of bounds' trap record
attached to it; but because it is bounds-checked, and will never
actually trap if the lowering is correct, this should be harmless. The
simplicity improvement here seemed more worthwhile to me than plumbing
through a "corresponds to user-level load/store" bit, because the latter
is a bit complex when we allow for op merging.
Closes#2290: though it does not implement a full "metadata" scheme as
described in that issue, this seems simpler overall.
This approach suffers from memory-size bloat during compile time due to the desire to de-duplicate the constants emitted and reduce runtime memory-size. As a first step, though, this provides an end-to-end mechanism for constants to be emitted in the MachBuffer islands.
A new associated type Info is added to MachInstEmit, which is the
immutable counterpart to State. It can't easily be constructed from an
ABICallee, since it would require adding an associated type to the
latter, and making so leaks the associated type in a lot of places in
the code base and makes the code harder to read. Instead, the EmitInfo
state can simply be passed to the `Vcode::emit` function directly.
When one branch target label in a MachBuffer is redirected to another,
we eventually fix up branches targetting the first to refer to the
redirected target instead. Separately, we have a branch-folding
optimization that, when an unconditional branch occurs as the only
instruction in a block (right at a label) and the previous instruction
is also an unconditional branch (hence no fallthrough), we can elide
that block entirely and redirect the label. Finally, we prevented
infinite loops when resolving label aliases by chasing only one alias
deep.
Unfortunately, these three facts interacted poorly, and this is a result
of our correctness arguments assuming a fully-general "redirect" that
was not limited to one indirection level. In particular, we could have
some label A that redirected to B, then remove the block at B because it
is just a single branch to C, redirecting B to C. A would still redirect
to B, though, without chasing to C, and hence a branch to B would fall
through to the unrelated block that came after block B.
Thanks to @bnjbvr for finding this bug while debugging the x64 backend
and reducing a failure to the function in issue #2082. (This is a very
subtle bug and it seems to have been quite difficult to chase; my
apologies!)
The fix is to (i) chase redirects arbitrarily deep, but also (ii) ensure
that we do not form a cycle of redirects. The latter is done by very
carefully checking the existing fully-resolved target of the label we
are about to redirect *to*; if it resolves back to the branch that
is causing this redirect, then we avoid making the alias. The comments
in this patch make a slightly more detailed argument why this should be
correct.
Unfortunately we cannot directly test the CLIF that @bnjbvr reduced
because we don't have a way to assert anything about the machine-code
that comes after the branch folding and emission. However, the dedicated
unit tests in this patch replicate an equivalent folding case, and also
test that we handle branch cycles properly (as argued above).
Fixes#2082.
This PR adds a bit more granularity to the output of e.g. `clif-util
compile -T`, indicating how much time is spent in VCode lowering and
various other new-backend-specific tasks.
This commit adds support for generating stackmaps at safepoints to the
new backend framework and to the AArch64 backend in particular. It has
been tested to work with SpiderMonkey.
In discussions with @bnjbvr, it came up that generating `OneWayCondBr`s
with explicit, hardcoded PC-offsets as part of lowered instruction
sequences is actually unsafe, because the register allocator *might*
insert a spill or reload into the middle of our sequence. We were
careful about this in some cases but somehow missed that it was a
general restriction. Conceptually, all inter-instruction references
should be via labels at the VCode level; explicit offsets are only ever
known at emission time, and resolved by the `MachBuffer`.
To allow for conditional trap checks without modifying the CFG (as seen
by regalloc) during lowering, this PR instead adds a `TrapIf`
pseudo-instruction that conditionally skips a single embedded trap
instruction. It lowers to the same `condbr label ; trap ; label: ...`
sequence, but without the hardcoded branch-target offset in the lowering
code.
Also add configuration to CI to fail doc generation if any links are
broken. Unfortunately we can't blanket deny all warnings in rustdoc
since some are unconditional warnings, but for now this is hopefully
good enough.
Closes#1947
This patch fixes a subtle bug that occurred in the MachBuffer branch
optimization: in tracking labels at the current buffer tail using a
sorted-by-offset array, the code did not update this array properly when
redirecting labels. As a result, the dead-branch removal was unsafe,
because not every label pointing to a branch is guaranteed to be
redirected properly first.
Discovered while doing performance testing: bz2 silently took a wrong
branch and exited compression early. (Eek!)
To address this problem, this patch adopts a slightly simpler data
structure: we only track the labels *at the current buffer tail*, and
*at the start of each branch*, and we're careful to update these
appropriately to maintain the invariants. I'm pretty confident that this
is correct now, but we should (still) fuzz it a bunch, because wrong
control flow scares me a nonzero amount. I should probably also actually
write out a formal proof that these data-structure updates are correct.
The optimizations are important for performance (removing useless empty
blocks, and taking advantage of any fallthrough opportunities at all),
so I don't think we would want to drop them entirely.
This patch includes:
- A complete rework of the way that CLIF blocks and edge blocks are
lowered into VCode blocks. The new mechanism in `BlockLoweringOrder`
computes RPO over the CFG, but with a twist: it merges edge blocks intto
heads or tails of original CLIF blocks wherever possible, and it does
this without ever actually materializing the full nodes-plus-edges
graph first. The backend driver lowers blocks in final order so
there's no need to reshuffle later.
- A new `MachBuffer` that replaces the `MachSection`. This is a special
version of a code-sink that is far more than a humble `Vec<u8>`. In
particular, it keeps a record of label definitions and label uses,
with a machine-pluggable `LabelUse` trait that defines various types
of fixups (basically internal relocations).
Importantly, it implements some simple peephole-style branch rewrites
*inline in the emission pass*, without any separate traversals over
the code to use fallthroughs, swap taken/not-taken arms, etc. It
tracks branches at the tail of the buffer and can (i) remove blocks
that are just unconditional branches (by redirecting the label), (ii)
understand a conditional/unconditional pair and swap the conditional
polarity when it's helpful; and (iii) remove branches that branch to
the fallthrough PC.
The `MachBuffer` also implements branch-island support. On
architectures like AArch64, this is needed to allow conditional
branches within plausibly-attainable ranges (+/- 1MB on AArch64
specifically). It also does this inline while streaming through the
emission, without any sort of fixpoint algorithm or later moving of
code, by simply tracking outstanding references and "deadlines" and
emitting an island just-in-time when we're in danger of going out of
range.
- A rework of the instruction selector driver. This is largely following
the same algorithm as before, but is cleaned up significantly, in
particular in the API: the machine backend can ask for an input arg
and get any of three forms (constant, register, producing
instruction), indicating it needs the register or can merge the
constant or producing instruction as appropriate. This new driver
takes special care to emit constants right at use-sites (and at phi
inputs), minimizing their live-ranges, and also special-cases the
"pinned register" to avoid superfluous moves.
Overall, on `bz2.wasm`, the results are:
wasmtime full run (compile + runtime) of bz2:
baseline: 9774M insns, 9742M cycles, 3.918s
w/ changes: 7012M insns, 6888M cycles, 2.958s (24.5% faster, 28.3% fewer insns)
clif-util wasm compile bz2:
baseline: 2633M insns, 3278M cycles, 1.034s
w/ changes: 2366M insns, 2920M cycles, 0.923s (10.7% faster, 10.1% fewer insns)
All numbers are averages of two runs on an Ampere eMAG.