Commit Graph

61 Commits

Julian Seward
25e31739a6 Implement Wasm Atomics for Cranelift/newBE/aarch64.
The implementation is pretty straightforward.  Wasm atomic instructions fall
into 5 groups

* atomic read-modify-write
* atomic compare-and-swap
* atomic loads
* atomic stores
* fences

and the implementation mirrors that structure, at both the CLIF and AArch64
levels.

At the CLIF level, there are five new instructions, one for each group.  Some
comments about these:

* for those that take addresses (all except fences), the address is contained
  entirely in a single `Value`; there is no offset field as there is with
  normal loads and stores.  Wasm atomics require alignment checks, and
  removing the offset makes implementation of those checks a bit simpler.

* atomic loads and stores get their own instructions, rather than reusing the
  existing load and store instructions, for two reasons:

  - per above comment, makes alignment checking simpler

  - reuse of existing loads and stores would require extension of `MemFlags`
    to indicate atomicity, which sounds semantically unclean.  For example,
    then *any* instruction carrying `MemFlags` could be marked as atomic, even
    in cases where it is meaningless or ambiguous.

* I tried to specify, in comments, the behaviour of these instructions as
  tightly as I could.  Unfortunately there is no way (per my limited CLIF
  knowledge) to enforce the constraint that they may only be used on I8, I16,
  I32 and I64 types, and in particular not on floating point or vector types.
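
To make the grouping concrete, here is a minimal Rust-style sketch of the five
instruction groups; the type and field names are illustrative assumptions, not
the actual CLIF data structures:

    // Illustrative only: one variant per atomic instruction group.
    #[derive(Clone, Copy)]
    struct Value(u32); // stand-in for a CLIF SSA value

    #[derive(Clone, Copy)]
    enum AtomicRmwOp { Add, Sub, And, Or, Xor, Xchg }

    enum AtomicInst {
        // Atomic read-modify-write: returns the old value at `addr`.
        Rmw { op: AtomicRmwOp, addr: Value, operand: Value },
        // Atomic compare-and-swap: stores `replacement` iff current value == `expected`.
        Cas { addr: Value, expected: Value, replacement: Value },
        // Atomic load/store: the address is a single `Value`, with no static
        // offset field as ordinary loads/stores have.
        Load { addr: Value },
        Store { addr: Value, value: Value },
        // Full memory fence.
        Fence,
    }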

The translation from Wasm to CLIF, in `code_translator.rs`, is unremarkable.

At the AArch64 level, there are also five new instructions, one for each
group.  All of them except `::Fence` contain multiple real machine
instructions.  Atomic r-m-w and atomic c-a-s are emitted as the usual
load-linked store-conditional loops, guarded at both ends by memory fences.
Atomic loads and stores are emitted as a load preceded by a fence, and a store
followed by a fence, respectively.  The amount of fencing may be overkill, but
it reflects exactly what the SM Wasm baseline compiler for AArch64 does.

One reason to implement r-m-w and c-a-s as a single insn which is expanded
only at emission time is that we must be very careful what instructions we
allow in between the load-linked and store-conditional.  In particular, we
cannot allow *any* extra memory transactions in there, since -- particularly
on low-end hardware -- that might cause the transaction to fail, hence
deadlocking the generated code.  That implies that we can't present the LL/SC
loop to the register allocator as its constituent instructions, since it might
insert spills anywhere.  Hence we must present it as a single indivisible
unit, as we do here.  It also has the benefit of reducing the total amount of
work the RA has to do.
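
As a rough illustration of that indivisibility, the sketch below shows the shape
of the sequence a single atomic r-m-w pseudo-instruction might expand to at
emission time; the exact instruction choice, operation, and register assignments
within x24-x28 are assumptions, but the point is that the whole loop is produced
in one emission step with no gap into which regalloc could insert a spill:

    // Sketch only: emit the entire LL/SC retry loop as one unit.
    fn emit_atomic_rmw(sink: &mut Vec<String>) {
        sink.push("dmb ish".to_string());              // leading fence
        sink.push("again:".to_string());               // retry point (emission-local)
        sink.push("ldxr x27, [x25]".to_string());      // load-linked from the address
        sink.push("add  x28, x27, x26".to_string());   // the r-m-w operation (here: add)
        sink.push("stxr w24, x28, [x25]".to_string()); // store-conditional; w24 = status
        sink.push("cbnz w24, again".to_string());      // retry if the SC failed
        sink.push("dmb ish".to_string());              // trailing fence
    }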

The only other notable feature of the r-m-w and c-a-s translations into
AArch64 code is that they both need a scratch register internally.  Rather
than faking one up by claiming, in `get_regs`, that it modifies an extra
scratch register, and having to have a dummy initialisation of it, these new
instructions (`::LLSC` and `::CAS`) simply use fixed registers in the range
x24-x28.  We rely on the RA's ability to coalesce V<-->R copies to make the
cost of the resulting extra copies zero or almost zero.  x24-x28 are chosen so
as to be call-clobbered, hence their use is less likely to interfere with long
live ranges that span calls.

One subtlety regarding the use of completely fixed input and output registers
is that we must be careful how the surrounding copy from/to of the arg/result
registers is done.  In particular, it is not safe to simply emit copies in
some arbitrary order if one of the arg registers is a real reg.  For that
reason, the arguments are first moved into virtual regs if they are not
already there, using a new method `<LowerCtx for Lower>::ensure_in_vreg`.
Again, we rely on coalescing to turn them into no-ops in the common case.
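
A minimal sketch of the idea behind `ensure_in_vreg` (simplified types, not the
actual `LowerCtx` implementation): a value already in a vreg is returned
unchanged, while a real register first gets copied into a fresh vreg, so that
the later copies into the fixed x24-x28 registers can be emitted in any order:

    #[derive(Clone, Copy, PartialEq)]
    enum Reg { Real(u8), Virtual(u32) }

    struct LowerSketch {
        next_vreg: u32,
        copies: Vec<(Reg, Reg)>, // (dst, src) moves to emit before the atomic op
    }

    impl LowerSketch {
        fn ensure_in_vreg(&mut self, r: Reg) -> Reg {
            match r {
                Reg::Virtual(_) => r, // already a vreg: nothing to do
                Reg::Real(_) => {
                    // Allocate a fresh virtual register and record a copy into it.
                    let v = Reg::Virtual(self.next_vreg);
                    self.next_vreg += 1;
                    self.copies.push((v, r));
                    v
                }
            }
        }
    }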

There is also a ridealong fix for the AArch64 lowering case for
`Opcode::Trapif | Opcode::Trapff`, which removes a bug in which two trap insns
in a row were generated.

In the patch as submitted there are 6 "FIXME JRS" comments, which mark things
which I believe to be correct, but for which I would appreciate a second
opinion.  Unless otherwise directed, I will remove them for the final commit
but leave the associated code/comments unchanged.
2020-08-04 09:35:50 +02:00
Chris Fallin
9a9b5015d0 Merge pull request #2081 from cfallin/aarch64-baldrdash-fix
Aarch64: fix narrow integer-register extension with Baldrdash ABI.
2020-07-31 12:13:38 -07:00
Chris Fallin
dd09865611 Fix MachBuffer branch handling with redirect chains.
When one branch target label in a MachBuffer is redirected to another,
we eventually fix up branches targeting the first to refer to the
redirected target instead. Separately, we have a branch-folding
optimization that, when an unconditional branch occurs as the only
instruction in a block (right at a label) and the previous instruction
is also an unconditional branch (hence no fallthrough), we can elide
that block entirely and redirect the label. Finally, we prevented
infinite loops when resolving label aliases by chasing only one alias
deep.

Unfortunately, these three facts interacted poorly, and this is a result
of our correctness arguments assuming a fully-general "redirect" that
was not limited to one indirection level. In particular, we could have
some label A that redirected to B, then remove the block at B because it
is just a single branch to C, redirecting B to C. A would still redirect
to B, though, without chasing to C, and hence a branch to B would fall
through to the unrelated block that came after block B.

Thanks to @bnjbvr for finding this bug while debugging the x64 backend
and reducing a failure to the function in issue #2082. (This is a very
subtle bug and it seems to have been quite difficult to chase; my
apologies!)

The fix is to (i) chase redirects arbitrarily deep, but also (ii) ensure
that we do not form a cycle of redirects. The latter is done by very
carefully checking the existing fully-resolved target of the label we
are about to redirect *to*; if it resolves back to the branch that
is causing this redirect, then we avoid making the alias. The comments
in this patch make a slightly more detailed argument why this should be
correct.
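
The core of the fix can be pictured with a small sketch (assumed data layout,
not the actual MachBuffer code): aliases are chased to their fully-resolved
target, and a redirect is only installed when that target does not lead back to
the label being redirected:

    fn resolve(aliases: &[Option<usize>], mut label: usize) -> usize {
        // Follow the alias chain to its end; the cycle check in `redirect`
        // keeps this loop finite.
        while let Some(next) = aliases[label] {
            label = next;
        }
        label
    }

    fn redirect(aliases: &mut [Option<usize>], from: usize, to: usize) {
        // Refuse to install a redirect whose fully-resolved target leads back
        // to `from`: that would create a cycle of aliases.
        if resolve(aliases, to) != from {
            aliases[from] = Some(to);
        }
    }

    fn main() {
        // Label A (0) redirects to B (1); B's block is then folded and B
        // redirects to C (2). Chasing arbitrarily deep resolves A to C.
        let mut aliases = vec![None; 3];
        redirect(&mut aliases, 0, 1);
        redirect(&mut aliases, 1, 2);
        assert_eq!(resolve(&aliases, 0), 2);
    }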

Unfortunately we cannot directly test the CLIF that @bnjbvr reduced
because we don't have a way to assert anything about the machine-code
that comes after the branch folding and emission. However, the dedicated
unit tests in this patch replicate an equivalent folding case, and also
test that we handle branch cycles properly (as argued above).

Fixes #2082.
2020-07-31 19:52:30 +02:00
Chris Fallin
1fbdf169b5 Aarch64: fix narrow integer-register extension with Baldrdash ABI.
In the Baldrdash (SpiderMonkey) embedding, we must take care to
zero-extend all function arguments to callees in integer registers when
the types are narrower than 64 bits. This is because, unlike the native
SysV ABI, the Baldrdash ABI expects high bits to be cleared. Not doing
so leads to difficult-to-trace errors where high bits falsely tag an
int32 as e.g. an object pointer, leading to potential security issues.
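
The required behaviour amounts to clearing every bit above the argument's type
width; a small illustration (not the backend code itself):

    fn zero_extend_to_u64(value: u64, from_bits: u32) -> u64 {
        assert!(from_bits >= 1 && from_bits <= 64);
        if from_bits == 64 {
            value
        } else {
            // Clear everything above `from_bits`, as the Baldrdash ABI expects.
            value & ((1u64 << from_bits) - 1)
        }
    }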
2020-07-31 10:19:13 -07:00
Benjamin Bouvier
35d9ab19b7 Review fixes; 2020-07-24 19:29:12 +02:00
Benjamin Bouvier
ad4a2f919f machinst x64: implement support for reference types; 2020-07-24 19:29:12 +02:00
Benjamin Bouvier
694af3aec2 machinst x64: implement float Floor/Ceil/Trunc/Nearest as VM calls; 2020-07-24 19:29:12 +02:00
Chris Fallin
2b9fefe89a Add timing for several new-backend stages.
This PR adds a bit more granularity to the output of e.g. `clif-util
compile -T`, indicating how much time is spent in VCode lowering and
various other new-backend-specific tasks.
2020-07-23 09:54:39 -07:00
Benjamin Bouvier
ead8a835c4 machinst x64: add more FP support 2020-07-17 15:56:44 +02:00
Chris Fallin
26529006e0 Address review comments. 2020-07-14 10:17:29 -07:00
Chris Fallin
08353fcc14 Reftypes part two: add support for stackmaps.
This commit adds support for generating stackmaps at safepoints to the
new backend framework and to the AArch64 backend in particular. It has
been tested to work with SpiderMonkey.
2020-07-14 10:17:27 -07:00
Chris Fallin
b7ecad1d74 AArch64: avoid branches with explicit offsets at lowering stage.
In discussions with @bnjbvr, it came up that generating `OneWayCondBr`s
with explicit, hardcoded PC-offsets as part of lowered instruction
sequences is actually unsafe, because the register allocator *might*
insert a spill or reload into the middle of our sequence. We were
careful about this in some cases but somehow missed that it was a
general restriction. Conceptually, all inter-instruction references
should be via labels at the VCode level; explicit offsets are only ever
known at emission time, and resolved by the `MachBuffer`.

To allow for conditional trap checks without modifying the CFG (as seen
by regalloc) during lowering, this PR instead adds a `TrapIf`
pseudo-instruction that conditionally skips a single embedded trap
instruction. It lowers to the same `condbr label ; trap ; label: ...`
sequence, but without the hardcoded branch-target offset in the lowering
code.
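
A sketch of the emitted shape (the concrete condition syntax and trap encoding
are assumptions; the point is that the skip target is a label resolved by the
MachBuffer, never a hardcoded PC offset in the lowering code):

    fn emit_trap_if(sink: &mut Vec<String>, skip_cond: &str, skip_label: &str) {
        // Conditionally branch over the embedded trap...
        sink.push(format!("b.{} {}", skip_cond, skip_label));
        // ...otherwise fall through into it.
        sink.push("udf #0".to_string()); // trap
        // The label is bound here and resolved at emission time.
        sink.push(format!("{}:", skip_label));
    }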
2020-07-02 11:02:27 -07:00
Alex Crichton
0acd2072c2 Fix doc warnings and link failures (#1948)
Also add configuration to CI to fail doc generation if any links are
broken. Unfortunately we can't blanket deny all warnings in rustdoc
since some are unconditional warnings, but for now this is hopefully
good enough.

Closes #1947
2020-06-30 13:01:49 -05:00
Benjamin Bouvier
c9a3f05afd machinst x64: implement calls and int cmp/store/loads;
This makes it possible to run a simple recursive fibonacci function in
wasmtime.
2020-06-25 16:20:33 +02:00
Chris Fallin
492000e945 MachInst isel and aarch64 backend: docs / clarity improvements.
From discussion with Julian and Ben, this PR makes a few documentation-
and naming-level changes (no functionality change):

- Document that the `LowerCtx`-provided output register can be used as a
  scratch register during the lowered instruction sequence before
  placing the final result in it.

- Rename `input_to_*` helpers in the AArch64 backend to
  `put_input_in_*`, emphasizing that these are side-effecting helpers
  that potentially generate code (e.g., sign/zero-extensions) to ensure
  an input value is in a register.
2020-06-18 12:18:50 -07:00
Benjamin Bouvier
ef5de04d32 machinst/x64: teach regalloc what FP instructions are moves;
and cosmetic changes after #1665 landed.
2020-06-15 16:39:08 +02:00
Johnnie Birch
48f0b10c7a Add initial scalar FP operations (addss, subss, etc) to x64 backend.
Adds support for addss and subss. This is the first lowering for
SSE floating-point ALU operations and some move operations. The changes here do
some renaming of data structures and add a couple of new ones
to support SSE-specific operations. The work done here will likely
evolve as needed to support an efficient, intuitive, and consistent
framework.
2020-06-10 18:36:57 +02:00
Benjamin Bouvier
5d01603390 mach backend: allow snapshotting IR graphs with the SNAPSHOT_REGALLOC env variable;
This also requires the serde feature, which isn't enabled by default,
thus it must be passed as a command-line argument to cargo.
2020-06-10 18:23:04 +02:00
Benjamin Bouvier
46093f6119 Bump regalloc.rs to 0.0.26;
And adapt to regalloc.rs API change to provide the exact number of vregs.
2020-06-10 18:23:04 +02:00
Chris Fallin
02ae1b4464 Merge pull request #1846 from julian-seward1/better-phis
Rewrite `lower_edge` to produce better phi-translations:
2020-06-09 09:56:52 -07:00
Julian Seward
6d25759c8e Rewrite lower_edge to produce better phi-translations:
* ensure that all const assignments are placed at the end of the sequence.
  This minimises live ranges.

* for the non-const assignments, ignore self-assignments.  This can
  dramatically reduce the total number of moves generated, because any
  self-assignments trigger the overlap-case handling, hence invoking the
  double-copy behaviour in cases where it's not necessary.
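
A simplified sketch of these two rules (assumed types, not the actual
`lower_edge` code): self-assignments are dropped, register-to-register moves
come first, and constant assignments are appended at the end:

    #[derive(Clone, Copy, PartialEq)]
    enum Src { Reg(u32), Const(i64) }

    fn plan_edge_moves(assignments: Vec<(u32, Src)>) -> Vec<(u32, Src)> {
        let mut reg_moves = Vec::new();
        let mut const_moves = Vec::new();
        for (dst, src) in assignments {
            match src {
                // Self-assignments are dropped entirely, avoiding the
                // overlap handling and its double-copy behaviour.
                Src::Reg(r) if r == dst => continue,
                // Register-to-register moves go first...
                Src::Reg(_) => reg_moves.push((dst, src)),
                // ...constant assignments last, keeping their live ranges short.
                Src::Const(_) => const_moves.push((dst, src)),
            }
        }
        reg_moves.extend(const_moves);
        reg_moves
    }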

It's worth pointing out that self-assignments are common, and are not due to
deficiencies in CLIF optimisation.  Rather, they occur whenever a loop back
edge doesn't modify *all* loop-carried values.  This can easily happen if
the loop has multiple "early" back-edges -- "continues" in C parlance.  Eg:

   loop_header(a, b, c, d, e, f):
      ...
      a_new = ...
      b_new = ...
      if (..) goto loop_header(a_new, b_new, c, d, e, f)

      ...
      c_new = ...
      d_new = ...
      if (..) goto loop_header(a_new, b_new, c_new, d_new, e, f)

      etc

For functions with many live values, this can dramatically reduce the number
of spill moves we throw into the register allocator.

In terms of compilation costs, this ranges from neutral, for functions which do
not spill at all or spill only minimally (joey_small, joey_med), to a 7.1% reduction in
insn count.

In terms of run costs, for one spill-heavy test (bz2 w/ custom timing harness),
instruction counts are reduced by 4.3%, data reads by 12.3% and data writes
by 18.5%.  Note those last two figures include all reads and writes made by the
generated code, not just spills/reloads, so the proportional reduction in
spill/reload traffic must be greater.
2020-06-09 10:36:32 +02:00
Chris Fallin
fc2a6f273b Three fixes to various SpiderMonkey-related issues:
- Properly mask constant values down to appropriate width when
  generating a constant value directly in aarch64 backend. This was a
  miscompilation introduced in the new-isel refactor. In combination
  with failure to respect NarrowValueMode, this resulted in a very
  subtle bug when an `i32` constant was used in bit-twiddling logic.

- Add support for `iadd_ifcout` in aarch64 backend as used in explicit
  heap-check mode. With this change, we no longer fail heap-related
  tests with the huge-heap-region mode disabled.

- Remove a panic that was occurring in some tests that are currently
  ignored on aarch64, by simply returning empty/default information in
  `value_label` functionality rather than touching unimplemented APIs.
  This is not a bugfix per se, but removes confusing panic messages from
  `cargo test` output that might otherwise mislead.
2020-06-08 13:02:00 -07:00
Andrew Brown
40f31375a5 Add TargetIsa::as_any for downcasting to specific ISA implementations
This is necessary when we would like to check specific ISA flags, e.g.
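
The underlying pattern is the standard `std::any::Any` downcast; a hedged
sketch, where the concrete ISA type and flag are invented examples:

    use std::any::Any;

    trait TargetIsa {
        fn as_any(&self) -> &dyn Any;
    }

    struct X64Isa { has_avx: bool } // illustrative concrete ISA

    impl TargetIsa for X64Isa {
        fn as_any(&self) -> &dyn Any { self }
    }

    fn wants_avx(isa: &dyn TargetIsa) -> bool {
        // Downcast the trait object to the concrete ISA to read its flag.
        isa.as_any()
            .downcast_ref::<X64Isa>()
            .map_or(false, |x64| x64.has_avx)
    }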
2020-06-03 16:27:57 -07:00
Chris Fallin
fe97659813 Address review comments. 2020-06-03 13:31:34 -07:00
Chris Fallin
615362068f Multi-value return support. 2020-06-03 13:31:34 -07:00
Benjamin Bouvier
e227608510 mach backend: use vectors instead of sets to remember set of uses/defs for calls;
This avoids the set uniqueness (hashing) test, reduces memory
churn when re-mapping virtual registers onto real registers, and is
generally more memory-efficient.
2020-06-02 16:29:05 +02:00
Anton Kirilov
8a928830ac Enable the wast::Cranelift::spec::simd::simd_store test for AArch64
Copyright (c) 2020, Arm Limited.
2020-05-24 22:53:07 +01:00
Chris Fallin
c9e3b71c39 Merge pull request #1729 from cfallin/machinst-branch-opt
Fix MachBuffer branch optimization.
2020-05-20 14:43:57 -07:00
Chris Fallin
13e12908a6 MachBuffer branch opts: comments approximating a semi-formal correctness proof. 2020-05-20 14:12:19 -07:00
Chris Fallin
80ab154d04 Update from review comments. 2020-05-20 12:35:36 -07:00
Benjamin Bouvier
1f620e1b46 cranelift: bump regalloc.rs to 0.0.24 and adapt to latest API changes; 2020-05-20 15:37:15 +02:00
Chris Fallin
e11094b28b Fix MachBuffer branch optimization.
This patch fixes a subtle bug that occurred in the MachBuffer branch
optimization: in tracking labels at the current buffer tail using a
sorted-by-offset array, the code did not update this array properly when
redirecting labels. As a result, the dead-branch removal was unsafe,
because not every label pointing to a branch is guaranteed to be
redirected properly first.

Discovered while doing performance testing: bz2 silently took a wrong
branch and exited compression early. (Eek!)

To address this problem, this patch adopts a slightly simpler data
structure: we only track the labels *at the current buffer tail*, and
*at the start of each branch*, and we're careful to update these
appropriately to maintain the invariants. I'm pretty confident that this
is correct now, but we should (still) fuzz it a bunch, because wrong
control flow scares me a nonzero amount. I should probably also actually
write out a formal proof that these data-structure updates are correct.
The optimizations are important for performance (removing useless empty
blocks, and taking advantage of any fallthrough opportunities at all),
so I don't think we would want to drop them entirely.
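
A much-simplified sketch of the tracking described above (field names are
assumptions modelled on the description, not the real MachBuffer): only labels
bound at the current tail and branches ending exactly at the tail are recorded,
and both must be kept in sync whenever a label is bound:

    struct BranchRecord { start: usize, labels_at_this_branch: Vec<usize> }

    struct BufferState {
        cur_offset: usize,
        labels_at_tail: Vec<usize>,         // labels currently bound at the tail
        latest_branches: Vec<BranchRecord>, // branches ending at the tail
        label_offsets: Vec<usize>,
    }

    impl BufferState {
        fn bind_label(&mut self, label: usize) {
            // Binding a label at the tail must also update `labels_at_tail`;
            // forgetting this is exactly the class of bug described above.
            self.label_offsets[label] = self.cur_offset;
            self.labels_at_tail.push(label);
        }
    }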
2020-05-19 18:09:18 -07:00
Chris Fallin
bdd2873c8c Address review comments. 2020-05-18 16:25:26 -07:00
Chris Fallin
72e6be9342 Rework of MachInst isel, branch fixups and lowering, and block ordering.
This patch includes:

- A complete rework of the way that CLIF blocks and edge blocks are
  lowered into VCode blocks. The new mechanism in `BlockLoweringOrder`
  computes RPO over the CFG, but with a twist: it merges edge blocks into
  heads or tails of original CLIF blocks wherever possible, and it does
  this without ever actually materializing the full nodes-plus-edges
  graph first. The backend driver lowers blocks in final order so
  there's no need to reshuffle later.

- A new `MachBuffer` that replaces the `MachSection`. This is a special
  version of a code-sink that is far more than a humble `Vec<u8>`. In
  particular, it keeps a record of label definitions and label uses,
  with a machine-pluggable `LabelUse` trait that defines various types
  of fixups (basically internal relocations).

  Importantly, it implements some simple peephole-style branch rewrites
  *inline in the emission pass*, without any separate traversals over
  the code to use fallthroughs, swap taken/not-taken arms, etc. It
  tracks branches at the tail of the buffer and can (i) remove blocks
  that are just unconditional branches (by redirecting the label), (ii)
  understand a conditional/unconditional pair and swap the conditional
  polarity when it's helpful; and (iii) remove branches that branch to
  the fallthrough PC.

  The `MachBuffer` also implements branch-island support. On
  architectures like AArch64, this is needed to allow conditional
  branches within plausibly-attainable ranges (+/- 1MB on AArch64
  specifically). It also does this inline while streaming through the
  emission, without any sort of fixpoint algorithm or later moving of
  code, by simply tracking outstanding references and "deadlines" and
  emitting an island just-in-time when we're in danger of going out of
  range.

- A rework of the instruction selector driver. This is largely following
  the same algorithm as before, but is cleaned up significantly, in
  particular in the API: the machine backend can ask for an input arg
  and get any of three forms (constant, register, producing
  instruction), indicating it needs the register or can merge the
  constant or producing instruction as appropriate. This new driver
  takes special care to emit constants right at use-sites (and at phi
  inputs), minimizing their live-ranges, and also special-cases the
  "pinned register" to avoid superfluous moves.

Overall, on `bz2.wasm`, the results are:

    wasmtime full run (compile + runtime) of bz2:

    baseline:   9774M insns, 9742M cycles, 3.918s
    w/ changes: 7012M insns, 6888M cycles, 2.958s  (24.5% faster, 28.3% fewer insns)

    clif-util wasm compile bz2:

    baseline:   2633M insns, 3278M cycles, 1.034s
    w/ changes: 2366M insns, 2920M cycles, 0.923s  (10.7% faster, 10.1% fewer insns)

    All numbers are averages of two runs on an Ampere eMAG.
2020-05-16 23:08:22 -07:00
Benjamin Bouvier
5987cf5cda machinst: add a linear-scan checked variant too; 2020-05-13 10:56:32 +02:00
Chris Fallin
17cef9140c MachInst backend: don't reallocate RealRegUniverses for each function
compilation.

This saves ~0.14% instruction count, ~0.18% allocated bytes, and ~1.5%
allocated blocks on a `clif-util wasm` compilation of `bz2.wasm` for
aarch64.
2020-05-08 15:35:16 -07:00
Benjamin Bouvier
528d3c1355 machinst: Steal the used/defs Sets when emitting a call in ABICall; 2020-05-07 12:24:02 +02:00
Benjamin Bouvier
19d8a7f1fb machinst: Reuse memory across loop iterations in lowering; 2020-05-07 12:24:02 +02:00
Benjamin Bouvier
b24b711c16 machinst: Reduce the number of vec allocations for edge blocks; 2020-05-07 12:24:02 +02:00
Benjamin Bouvier
9215b610ef machinst: Avoid a lot of short-lived allocations in ABICall; 2020-05-07 12:24:02 +02:00
Benjamin Bouvier
4f919c6460 machinst: bump regalloc to 0.0.23 and return a slice on the successor indexes, in block_succs; 2020-05-07 12:24:02 +02:00
Julian Seward
48521393ae Update to regalloc.rs version 0.22. 2020-05-06 20:16:31 +02:00
Chris Fallin
6d73fdb70a Merge pull request #1607 from cfallin/aarch64-stack-frame
Rework aarch64 stack frame implementation to use positive offsets.
2020-05-06 10:29:30 -07:00
Chris Fallin
a66724aafd Rework aarch64 stack frame implementation.
This PR changes the aarch64 ABI implementation to use positive offsets
from SP, rather than negative offsets from FP, to refer to spill slots
and stack-local storage. This allows for better addressing-mode options,
and hence slightly better code: e.g., the unsigned scaled 12-bit offset
mode can be used to reach anywhere in a 32KB frame without extra
address-construction instructions, whereas negative offsets are limited
to a signed 9-bit unscaled mode (-256 bytes).

To enable this, the PR introduces a notion of "nominal SP offsets" as a
virtual addressing mode, lowered during the emission pass. The offsets
are relative to "SP after adjusting downward to allocate stack/spill
slots", but before pushing clobbers. This allows the addressing-mode
expressions to be generated before register allocation (or during it,
for spill/reload sequences).

To convert these offsets into *true* offsets from SP, we need to track
how much further SP is moved downward, and compensate for this. We do so
with "virtual SP offset adjustment" pseudo-instructions: these are seen
by the emission pass, and result in no instruction (0 byte output), but
update state that is now threaded through each instruction emission in
turn. In this way, we can push e.g. stack args for a call and adjust
the virtual SP offset, allowing reloads from nominal-SP-relative
spillslots while we do the argument setup with "real SP offsets" at the
same time.
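
A simplified illustration of the mechanism (assumed names, not the actual
emission-state code): the pseudo-instruction emits no bytes and only updates a
running adjustment, which is added back in when a nominal-SP-relative offset is
converted into a true offset from SP:

    struct EmitState {
        virtual_sp_offset: i64, // extra bytes pushed below the nominal SP point
    }

    impl EmitState {
        // A "virtual SP offset adjustment" pseudo-instruction emits zero bytes;
        // it only updates this running state during the emission pass.
        fn adjust_virtual_sp(&mut self, delta: i64) {
            self.virtual_sp_offset += delta;
        }

        // Convert a nominal-SP-relative offset into a true offset from the
        // current SP, compensating for anything pushed since the nominal point.
        fn real_sp_offset(&self, nominal_offset: i64) -> i64 {
            nominal_offset + self.virtual_sp_offset
        }
    }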
2020-05-06 09:23:55 -07:00
Benjamin Bouvier
1d90751ba9 machinst: Avoid a full instructions traversal of all the blocks when computing the final block ordering; 2020-05-06 15:13:25 +02:00
Chris Fallin
e39b4aba1c Fix long-range (non-colocated) aarch64 calls to not use Arm64Call reloc, and fix simplejit to use it.
Previously, every call was lowered on AArch64 to a `call` instruction, which
takes a signed 26-bit PC-relative offset. Including the 2-bit left shift, this
gives a range of +/- 128 MB. Longer-distance offsets would cause an impossible
relocation record to be emitted (or rather, a record that a more sophisticated
linker would fix up by inserting a shim/veneer).

This commit adds a notion of "relocation distance" in the MachInst backends,
and provides this information for every call target and symbol reference. The
intent is that backends on architectures like AArch64, where there are different
offset sizes / addressing strategies to choose from, can either emit a regular
call or a load-64-bit-constant / call-indirect sequence, as necessary. This
avoids the need to implement complex linking behavior.
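
A hedged sketch of that choice (the scratch register and exact materialisation
sequence are assumptions): near targets use the direct 26-bit form, while far
or unknown-distance targets load the absolute address and call indirectly:

    enum RelocDistance { Near, Far }

    fn emit_call(sink: &mut Vec<String>, target: &str, dist: RelocDistance) {
        match dist {
            // Near targets fit the signed 26-bit (+/- 128 MB) PC-relative form.
            RelocDistance::Near => sink.push(format!("bl {}", target)),
            // Far targets: materialise the absolute address, then call
            // indirectly; no shim/veneer is needed from the linker.
            RelocDistance::Far => {
                sink.push(format!("ldr x16, ={}", target)); // literal-pool pseudo-load
                sink.push("blr x16".to_string());
            }
        }
    }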

The MachInst driver code provides this information based on the "colocated" bit
in the CLIF symbol references, which appears to have been designed for this
purpose, or at least a similar one. Combined with the `use_colocated_libcalls`
setting, this allows client code to ensure that library calls can link to
library code at any location in the address space.

Separately, the `simplejit` example did not handle `Arm64Call`; rather than doing
so, it appears all that is necessary to get its tests to pass is to set the
`use_colocated_libcalls` flag to false, to make use of the above change. This
fixes the `libcall_function` unit-test in this crate.
2020-05-05 09:55:12 -07:00
Chris Fallin
8393412c40 Merge pull request #1632 from cfallin/aarch64-fix-srclocs
MachInst backend: attach SourceLoc span information to all ranges.
2020-04-30 16:13:55 -07:00
Chris Fallin
964c6087bd MachInst backend: attach SourceLoc span information to all ranges.
Previously, the SourceLoc information transferred in `VCode` only
included PC-spans for non-default SourceLocs. I realized that the
invariant we're supposed to keep here is that every PC is covered; if no
source information, just use `SourceLoc::default()`.

This was spurred by @bjorn3's comment in #1575 (thanks!).
2020-04-30 15:40:55 -07:00
Chris Fallin
be6f060abf Use new regalloc.rs version with dense vreg->rreg maps.
This PR updates Cranelift to use the new version of regalloc.rs
(bytecodealliance/regalloc.rs#55) that provides dense vreg->rreg maps to
the `map_reg()` function for each instruction, rather than the earlier
hashmap-based approach.

In one test (regex-rs.wasm), this PR results in a 15% reduction in
memory allocations (1245MB -> 1060MB) as measured by DHAT on `clif-util
wasm` runs.
2020-04-29 10:42:25 -07:00
Benjamin Bouvier
698dc9c401 Fixes #1619: Properly bubble up errors when seeing an unexpected type during lowering. 2020-04-29 10:26:22 +02:00