Commit Graph

37 Commits

Author SHA1 Message Date
Benjamin Bouvier
9d5be00de4 Address review comments
- put the division in the synthetic instruction as well,
- put the branch table check in the inst's emission code,
- replace OneWayCondJmp by TrapIf vcode instruction,
- add comments describing code generated by the synthetic instructions
2020-07-03 14:33:52 +02:00
Chris Fallin
b700646c93 Merge pull request #1962 from cfallin/aarch64-lowering-condbr
AArch64: avoid branches with explicit offsets at lowering stage.
2020-07-02 14:05:40 -07:00
Chris Fallin
b7ecad1d74 AArch64: avoid branches with explicit offsets at lowering stage.
In discussions with @bnjbvr, it came up that generating `OneWayCondBr`s
with explicit, hardcoded PC-offsets as part of lowered instruction
sequences is actually unsafe, because the register allocator *might*
insert a spill or reload into the middle of our sequence. We were
careful about this in some cases but somehow missed that it was a
general restriction. Conceptually, all inter-instruction references
should be via labels at the VCode level; explicit offsets are only ever
known at emission time, and resolved by the `MachBuffer`.

To allow for conditional trap checks without modifying the CFG (as seen
by regalloc) during lowering, this PR instead adds a `TrapIf`
pseudo-instruction that conditionally skips a single embedded trap
instruction. It lowers to the same `condbr label ; trap ; label: ...`
sequence, but without the hardcoded branch-target offset in the lowering
code.
2020-07-02 11:02:27 -07:00
Joey Gouly
62e7b7f838 arm64: Implement basic SIMD arithmetic
Copyright (c) 2020, Arm Limited.
2020-07-02 13:17:33 +01:00
Anton Kirilov
90bafae1dc AArch64: Implement SIMD floating-point comparisons
Copyright (c) 2020, Arm Limited.
2020-06-18 11:07:52 +01:00
Joey Gouly
0f462330e0 arm64: Implement AllTrue and AnyTrue
This enables the simd_boolean WASM SIMD spec test.

Copyright (c) 2020, Arm Limited.
2020-06-17 15:40:51 +01:00
Chris Fallin
6286ca7310 AArch64: make use of reg-reg-extend amode.
When a load/store instruction needs an address of the form `v0 +
uextend(v1)` or `v0 + sextend(v1)` (or the commuted forms thereof), we
currently generate a separate zero/sign-extend operation and then use a
plain `[rA, rB]` addressing mode. This patch extends `lower_address()`
to look at both addends of an address if it has two addends and a zero
offset, recognize extension operations, and incorporate them directly
into a `[rA, rB, UXTW]` or `[rA, rB, SXTW]` form. This should improve
our performence on WebAssembly workloads, at least, because we often see
a 64-bit linear memory base indexed by a 32-bit (Wasm) pointer value.
2020-06-12 10:40:54 -07:00
Joey Gouly
544c5dece5 arm64: Implement SIMD bitwise operations
Copyright (c) 2020, Arm Limited.
2020-06-11 10:58:23 -07:00
Anton Kirilov
d941034c2e Enable the wast::Cranelift::spec::simd::simd_load_splat test for AArch64
Copyright (c) 2020, Arm Limited.
2020-06-10 15:01:37 +01:00
Chris Fallin
ac87ed12bd Merge pull request #1847 from akirilov-arm/simd_load_extend
Enable the wast::Cranelift::spec::simd::simd_load_extend test for AArch64
2020-06-09 12:29:06 -07:00
Joey Gouly
df2b031b6a arm64: Implement Icmp for I16X8 and I32X4
Copyright (c) 2020, Arm Limited.
2020-06-09 11:07:43 -07:00
Anton Kirilov
7ac19af498 Enable the wast::Cranelift::spec::simd::simd_load_extend test for AArch64
Copyright (c) 2020, Arm Limited.
2020-06-09 18:05:38 +01:00
Anton Kirilov
51a551fb39 Implement vector element extensions for AArch64
This commit also includes load and extend operations. Both are
prerequisites for enabling further SIMD spec tests.

Copyright (c) 2020, Arm Limited.
2020-06-09 12:28:49 +01:00
Chris Fallin
fe97659813 Address review comments. 2020-06-03 13:31:34 -07:00
Chris Fallin
9fec933056 Merge pull request #1801 from jgouly/cmp-rebase
arm64: add support for I8X16 ICmp
2020-06-02 09:35:41 -07:00
Joey Gouly
90a421193f arm64: add support for I8X16 ICmp
Copyright (c) 2020, Arm Limited.
2020-06-02 16:58:09 +01:00
Benjamin Bouvier
67c7a3ed19 mach backend: reduce the size of the Inst enum down to 32 bytes; 2020-06-02 16:29:05 +02:00
Benjamin Bouvier
cfa0527794 mach backend: have mem_finalize return a SmallVec;
This avoids a spurious reallocation of the SmallVec containing the
load_constants result to a Vec, which appeared in dhat profiles.
2020-06-02 16:29:05 +02:00
Anton Kirilov
8a928830ac Enable the wast::Cranelift::spec::simd::simd_store test for AArch64
Copyright (c) 2020, Arm Limited.
2020-05-24 22:53:07 +01:00
Chris Fallin
73537e72c0 Merge pull request #1732 from jgouly/copysign-fpu
arm64: Use FPU instrctions for Fcopysign
2020-05-22 17:25:33 -07:00
Joey Gouly
02c3f238f8 arm64: Use FPU instrctions for Fcopysign
Copyright (c) 2020, Arm Limited.
2020-05-21 18:14:12 +01:00
Chris Fallin
e11094b28b Fix MachBuffer branch optimization.
This patch fixes a subtle bug that occurred in the MachBuffer branch
optimization: in tracking labels at the current buffer tail using a
sorted-by-offset array, the code did not update this array properly when
redirecting labels. As a result, the dead-branch removal was unsafe,
because not every label pointing to a branch is guaranteed to be
redirected properly first.

Discovered while doing performance testing: bz2 silently took a wrong
branch and exited compression early. (Eek!)

To address this problem, this patch adopts a slightly simpler data
structure: we only track the labels *at the current buffer tail*, and
*at the start of each branch*, and we're careful to update these
appropriately to maintain the invariants. I'm pretty confident that this
is correct now, but we should (still) fuzz it a bunch, because wrong
control flow scares me a nonzero amount. I should probably also actually
write out a formal proof that these data-structure updates are correct.
The optimizations are important for performance (removing useless empty
blocks, and taking advantage of any fallthrough opportunities at all),
so I don't think we would want to drop them entirely.
2020-05-19 18:09:18 -07:00
Chris Fallin
bdd2873c8c Address review comments. 2020-05-18 16:25:26 -07:00
Chris Fallin
72e6be9342 Rework of MachInst isel, branch fixups and lowering, and block ordering.
This patch includes:

- A complete rework of the way that CLIF blocks and edge blocks are
  lowered into VCode blocks. The new mechanism in `BlockLoweringOrder`
  computes RPO over the CFG, but with a twist: it merges edge blocks intto
  heads or tails of original CLIF blocks wherever possible, and it does
  this without ever actually materializing the full nodes-plus-edges
  graph first. The backend driver lowers blocks in final order so
  there's no need to reshuffle later.

- A new `MachBuffer` that replaces the `MachSection`. This is a special
  version of a code-sink that is far more than a humble `Vec<u8>`. In
  particular, it keeps a record of label definitions and label uses,
  with a machine-pluggable `LabelUse` trait that defines various types
  of fixups (basically internal relocations).

  Importantly, it implements some simple peephole-style branch rewrites
  *inline in the emission pass*, without any separate traversals over
  the code to use fallthroughs, swap taken/not-taken arms, etc. It
  tracks branches at the tail of the buffer and can (i) remove blocks
  that are just unconditional branches (by redirecting the label), (ii)
  understand a conditional/unconditional pair and swap the conditional
  polarity when it's helpful; and (iii) remove branches that branch to
  the fallthrough PC.

  The `MachBuffer` also implements branch-island support. On
  architectures like AArch64, this is needed to allow conditional
  branches within plausibly-attainable ranges (+/- 1MB on AArch64
  specifically). It also does this inline while streaming through the
  emission, without any sort of fixpoint algorithm or later moving of
  code, by simply tracking outstanding references and "deadlines" and
  emitting an island just-in-time when we're in danger of going out of
  range.

- A rework of the instruction selector driver. This is largely following
  the same algorithm as before, but is cleaned up significantly, in
  particular in the API: the machine backend can ask for an input arg
  and get any of three forms (constant, register, producing
  instruction), indicating it needs the register or can merge the
  constant or producing instruction as appropriate. This new driver
  takes special care to emit constants right at use-sites (and at phi
  inputs), minimizing their live-ranges, and also special-cases the
  "pinned register" to avoid superfluous moves.

Overall, on `bz2.wasm`, the results are:

    wasmtime full run (compile + runtime) of bz2:

    baseline:   9774M insns, 9742M cycles, 3.918s
    w/ changes: 7012M insns, 6888M cycles, 2.958s  (24.5% faster, 28.3% fewer insns)

    clif-util wasm compile bz2:

    baseline:   2633M insns, 3278M cycles, 1.034s
    w/ changes: 2366M insns, 2920M cycles, 0.923s  (10.7% faster, 10.1% fewer insns)

    All numbers are averages of two runs on an Ampere eMAG.
2020-05-16 23:08:22 -07:00
Joey Gouly
f418b7a700 Reduce arm64 Inst enum size
This reduces the size of the Inst enum from 112 bytes to 48 bytes.

Using DHAT on a regex-rs.wasm benchmark, `valgrind --tool=dhat clif-util compile --target aarch64`

The total number of allocated bytes, drops by around 170 MB.
At t-gmax drops by 3 MB.

Using `perf stat clif-util compile --target aarch64`, the instructions count dropped by 0.6%. Cache misses dropped by 6%. Cycles dropped by 2.3%.
2020-05-14 15:45:55 +01:00
Chris Fallin
a66724aafd Rework aarch64 stack frame implementation.
This PR changes the aarch64 ABI implementation to use positive offsets
from SP, rather than negative offsets from FP, to refer to spill slots
and stack-local storage. This allows for better addressing-mode options,
and hence slightly better code: e.g., the unsigned scaled 12-bit offset
mode can be used to reach anywhere in a 32KB frame without extra
address-construction instructions, whereas negative offsets are limited
to a signed 9-bit unscaled mode (-256 bytes).

To enable this, the PR introduces a notion of "nominal SP offsets" as a
virtual addressing mode, lowered during the emission pass. The offsets
are relative to "SP after adjusting downward to allocate stack/spill
slots", but before pushing clobbers. This allows the addressing-mode
expressions to be generated before register allocation (or during it,
for spill/reload sequences).

To convert these offsets into *true* offsets from SP, we need to track
how much further SP is moved downward, and compensate for this. We do so
with "virtual SP offset adjustment" pseudo-instructions: these are seen
by the emission pass, and result in no instruction (0 byte output), but
update state that is now threaded through each instruction emission in
turn. In this way, we can push e.g. stack args for a call and adjust
the virtual SP offset, allowing reloads from nominal-SP-relative
spillslots while we do the argument setup with "real SP offsets" at the
same time.
2020-05-06 09:23:55 -07:00
Benjamin Bouvier
b7cfd39b53 aarch64: split emit tests into its own file;
This is done to satisfy a check done on the maximal file's size when
vendoring Rust source code into Mozilla central's repository.
2020-04-30 13:50:45 +02:00
Alex Crichton
74eda8090c Implement stack limit checks for AArch64 (#1573)
This commit implements the stack limit checks in cranelift for the
AArch64 backend. This gets the `stack_limit` argument purpose as well as
a function's global `stack_limit` directive working for the AArch64
backend. I've tested this locally on some hardware and in an emulator
and it looks to be working for basic tests, but I've never really done
AArch64 before so some scrutiny on the instructions would be most
welcome!
2020-04-24 15:01:57 -05:00
Benjamin Bouvier
0b13d8c848 aarch64: copy SP whenever it's involved in an address lowering with an explicit add; 2020-04-24 17:41:14 +02:00
Chris Fallin
dacadc8a34 Fix aarch64 load trap info: HeapOutOfBounds, not OutOfBounds.
This halfway solves a test failure: when temporarily disabling another
assert that is triggered by lack of debug info, this causes the
`custom_trap_handler` test to pass.
2020-04-21 15:30:58 -07:00
Benjamin Bouvier
a7ca37e493 Honour the emit_all_ones_funcaddrs() settings when creating unpatched locations; 2020-04-21 17:22:53 +02:00
Chris Fallin
297d64b2c0 Merge pull request #1530 from bnjbvr/bbouvier-arm64-fixes
Pending arm64 fixes for Spidermonkey integration
2020-04-21 08:08:09 -07:00
Joey Gouly
3638f8a764 arm64: Add support for CCmp
Also add a test for SUBS/ADDS with XZR, as CMP/CMN are aliases.

Copyright (c) 2020, Arm Limited.
2020-04-21 12:19:07 +02:00
Benjamin Bouvier
241c164e25 Implement pinned register usage through set_pinned_reg/get_pinned_reg; 2020-04-21 12:12:56 +02:00
bjorn3
259de864e4 Reuse rd as tmp reg in LoadAddr 2020-04-18 13:24:06 +02:00
bjorn3
1bee1af755 Implement stack_addr for AArch64 2020-04-18 13:24:06 +02:00
Chris Fallin
48cf2c2f50 Address review comments:
- Undo temporary changes to default features (`all-arch`) and a
  signal-handler test.
- Remove `SIGTRAP` handler: no longer needed now that we've found an
  "undefined opcode" option on ARM64.
- Rename pp.rs to pretty_print.rs in machinst/.
- Only use empty stack-probe on non-x86. As per a comment in
  rust-lang/compiler-builtins [1], LLVM only supports stack probes on
  x86 and x86-64. Thus, on any other CPU architecture, we cannot refer
  to `__rust_probestack`, because it does not exist.
- Rename arm64 to aarch64.
- Use `target` directive in vcode filetests.
- Run the flags verifier, but without encinfo, when using new backends.
- Clean up warning overrides.
- Fix up use of casts: use u32::from(x) and siblings when possible,
  u32::try_from(x).unwrap() when not, to avoid silent truncation.
- Take immutable `Function` borrows as input; we don't actually
  mutate the input IR.
- Lots of other miscellaneous cleanups.

[1] cae3e6ea23/src/probestack.rs (L39)
2020-04-15 17:21:28 -07:00