Commit Graph

16 Commits

Author SHA1 Message Date
Anton Kirilov
b895ac0e40 AArch64: Implement SIMD conversions
Copyright (c) 2020, Arm Limited.
2020-08-21 18:03:50 +01:00
Joey Gouly
a518c10141 arm64: Implement SIMD i64x2 multiply
Copyright (c) 2020, Arm Limited.
2020-08-20 13:26:03 +01:00
Chris Fallin
5cf3fba3da Refactor AArch64 ABI support to extract common bits for shared impl with x64.
We have observed that the ABI implementations for AArch64 and x64 are
very similar; in fact, x64's implementation started as a modified copy
of AArch64's implementation. This is an artifact of both a similar ABI
(both machines pass args and return values in registers first, then the
stack, and both machines give considerable freedom with stack-frame
layout) and a too-low-level ABI abstraction in the existing design. For
machines that fit the mainstream or most common ABI-design idioms, we
should be able to do much better.

This commit factors AArch64 into machine-specific and
machine-independent parts, but does not yet modify x64; that will come
next.

This should be completely neutral with respect to compile time and
generated code performance.
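
A rough sketch of the intended shape (names here are assumed for
illustration, not quoted from the patch): the machine-independent part
drives a small machine-specific trait.

    // Hypothetical sketch: shared ABI logic parameterized over a
    // machine-specific trait.
    trait ABIMachineSpec {
        /// Machine word width in bits.
        fn word_bits() -> u32;
        /// Which register, if any, carries the i-th integer argument.
        fn int_arg_reg(i: usize) -> Option<u8>;
    }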
2020-08-14 16:27:39 -07:00
Julian Seward
25e31739a6 Implement Wasm Atomics for Cranelift/newBE/aarch64.
The implementation is pretty straightforward.  Wasm atomic instructions fall
into five groups:

* atomic read-modify-write
* atomic compare-and-swap
* atomic loads
* atomic stores
* fences

and the implementation mirrors that structure, at both the CLIF and AArch64
levels.

At the CLIF level, there are five new instructions, one for each group.  Some
comments about these:

* for those that take addresses (all except fences), the address is contained
  entirely in a single `Value`; there is no offset field as there is with
  normal loads and stores.  Wasm atomics require alignment checks, and
  removing the offset makes implementation of those checks a bit simpler.

* atomic loads and stores get their own instructions, rather than reusing the
  existing load and store instructions, for two reasons:

  - per above comment, makes alignment checking simpler

  - reuse of existing loads and stores would require extension of `MemFlags`
    to indicate atomicity, which sounds semantically unclean.  For example,
    then *any* instruction carrying `MemFlags` could be marked as atomic, even
    in cases where it is meaningless or ambiguous.

* I tried to specify, in comments, the behaviour of these instructions as
  tightly as I could.  Unfortunately there is no way (per my limited CLIF
  knowledge) to enforce the constraint that they may only be used on I8, I16,
  I32 and I64 types, and in particular not on floating point or vector types.

The translation from Wasm to CLIF, in `code_translator.rs` is unremarkable.

At the AArch64 level, there are also five new instructions, one for each
group.  All of them except `::Fence` contain multiple real machine
instructions.  Atomic r-m-w and atomic c-a-s are emitted as the usual
load-linked store-conditional loops, guarded at both ends by memory fences.
Atomic loads and stores are emitted as a load preceded by a fence, and a store
followed by a fence, respectively.  The amount of fencing may be overkill, but
it reflects exactly what the SM Wasm baseline compiler for AArch64 does.

One reason to implement r-m-w and c-a-s as a single insn which is expanded
only at emission time is that we must be very careful what instructions we
allow in between the load-linked and store-conditional.  In particular, we
cannot allow *any* extra memory transactions in there, since -- particularly
on low-end hardware -- that might cause the store-conditional to fail every
time around the loop, livelocking the generated code.  That implies that we
can't present the LL/SC
loop to the register allocator as its constituent instructions, since it might
insert spills anywhere.  Hence we must present it as a single indivisible
unit, as we do here.  It also has the benefit of reducing the total amount of
work the RA has to do.

The only other notable feature of the r-m-w and c-a-s translations into
AArch64 code, is that they both need a scratch register internally.  Rather
than faking one up by claiming, in `get_regs`, that it modifies an extra
scratch register, and having to have a dummy initialisation of it, these new
instructions (`::LLSC` and `::CAS`) simply use fixed registers in the range
x24-x28.  We rely on the RA's ability to coalesce V<-->R copies to make the
cost of the resulting extra copies zero or almost zero.  x24-x28 are chosen so
as to be call-clobbered, hence their use is less likely to interfere with long
live ranges that span calls.
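
A minimal sketch of the shape this takes (simplified; only the variant
names `::LLSC` and `::CAS` come from the text above):

    // Each atomic op is a single indivisible VCode instruction; the
    // LL/SC loop it expands to exists only at emission time, so the
    // register allocator can never spill between the load-linked and
    // the store-conditional. Operands and scratch values live in fixed
    // registers in the x24-x28 range.
    enum Inst {
        LLSC { /* r-m-w op, access size; fixed x24-x28 operands */ },
        CAS { /* access size; fixed x24-x28 operands */ },
        // ... the rest of the AArch64 instruction set ...
    }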

One subtlety regarding the use of completely fixed input and output registers
is that we must be careful how the surrounding copies from the arg registers
and to the result registers are done.  In particular, it is not safe to simply
emit the copies in some arbitrary order if one of the arg registers is a real
reg.  For that
reason, the arguments are first moved into virtual regs if they are not
already there, using a new method `<LowerCtx for Lower>::ensure_in_vreg`.
Again, we rely on coalescing to turn them into no-ops in the common case.
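
A sketch of that lowering order (stub types; only `ensure_in_vreg` is
named above, and its exact signature is assumed):

    struct LowerCtx;
    #[derive(Clone, Copy)]
    struct Value(u32);

    impl LowerCtx {
        fn ensure_in_vreg(&mut self, v: Value) -> Value { v /* stub */ }
        fn mov_to_fixed(&mut self, _reg: &str, _v: Value) { /* stub */ }
    }

    fn lower_cas(ctx: &mut LowerCtx, addr: Value, expected: Value) {
        // Move the args into vregs first (a no-op if they already are),
        // so that the copies into the fixed input registers below cannot
        // clobber one another regardless of emission order.
        let addr = ctx.ensure_in_vreg(addr);
        let expected = ctx.ensure_in_vreg(expected);
        ctx.mov_to_fixed("x25", addr);
        ctx.mov_to_fixed("x26", expected);
    }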

There is also a ridealong fix for the AArch64 lowering case for
`Opcode::Trapif | Opcode::Trapff`, which removes a bug in which two trap insns
in a row were generated.

In the patch as submitted there are 6 "FIXME JRS" comments, which mark things
which I believe to be correct, but for which I would appreciate a second
opinion.  Unless otherwise directed, I will remove them for the final commit
but leave the associated code/comments unchanged.
2020-08-04 09:35:50 +02:00
Chris Fallin
9bd9c628aa Aarch64: mask shift-amounts incorporated into reg-reg-shift ALU insts.
We had previously fixed a bug by masking constant shift amounts modulo the
number of bits in the operand; however, we did not fix the analogous case
for shifts incorporated into the second register
argument of ALU instructions that support integrated shifts.  This
failure to mask resulted in illegal instructions being generated, e.g.
in https://bugzilla.mozilla.org/show_bug.cgi?id=1653502. This PR fixes
the issue by masking the amount, as the shift semantics require.
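
The rule the fix applies, as a minimal sketch (not the patch's exact
code):

    // A shift amount incorporated into a reg-reg-shift ALU instruction
    // must be reduced modulo the operand width (32 or 64) before being
    // encoded, exactly as for standalone constant shifts.
    fn mask_shift_amount(amt: u64, operand_bits: u64) -> u64 {
        debug_assert!(operand_bits == 32 || operand_bits == 64);
        amt & (operand_bits - 1)
    }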
2020-07-17 14:55:23 -07:00
Anton Kirilov
95b0b05af2 AArch64: Introduce an enum to specify vector instruction operand sizes
Copyright (c) 2020, Arm Limited.
2020-07-14 21:37:44 +01:00
Anton Kirilov
79dfac5514 Refactor the InstSize enum in the AArch64 backend
The main issue with the InstSize enum was that it was used for both
GPR and SIMD & FP operands, even though machine instructions do not
mix them in general (a destination register, say, is either a GPR or
it is not). As a result, it had methods such as sf_bit() that made
sense only for one type of operand.

Another issue was that the enum's name did not reflect its purpose
accurately: it was meant to represent an instruction operand size,
not an instruction size, which is fixed in A64 (always 4 bytes).

Now the enum is split into one for GPR operands and another for
scalar SIMD & FP operands.
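
Schematically, the split looks like this (a sketch; the exact
definitions may differ from the patch):

    // One enum for GPR operand sizes (the property the sf bit selects)
    // and a separate one for scalar SIMD & FP operand sizes.
    enum OperandSize { Size32, Size64 }
    enum ScalarSize { Size8, Size16, Size32, Size64, Size128 }

    impl OperandSize {
        // Meaningful only for GPR operands, which is the point of the
        // split.
        fn sf_bit(&self) -> u32 {
            match self { OperandSize::Size32 => 0, OperandSize::Size64 => 1 }
        }
    }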

Copyright (c) 2020, Arm Limited.
2020-07-14 15:04:35 +01:00
Chris Fallin
6286ca7310 AArch64: make use of reg-reg-extend amode.
When a load/store instruction needs an address of the form `v0 +
uextend(v1)` or `v0 + sextend(v1)` (or the commuted forms thereof), we
currently generate a separate zero/sign-extend operation and then use a
plain `[rA, rB]` addressing mode. This patch extends `lower_address()`
to look at both addends of an address if it has two addends and a zero
offset, recognize extension operations, and incorporate them directly
into a `[rA, rB, UXTW]` or `[rA, rB, SXTW]` form. This should improve
our performance on WebAssembly workloads, at least, because we often see
a 64-bit linear memory base indexed by a 32-bit (Wasm) pointer value.
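
A sketch of the pattern being recognized (stub types, simplified):

    #[derive(Clone, Copy)]
    struct Reg(u8);
    enum ExtendOp { UXTW, SXTW }
    enum AMode {
        RegReg(Reg, Reg),                // [rA, rB]
        RegExtended(Reg, Reg, ExtendOp), // [rA, rB, UXTW|SXTW]
    }

    // When the address computes `base + uextend(index)` or
    // `base + sextend(index)` with a zero offset, fold the extend into
    // the addressing mode rather than emitting a separate instruction.
    fn amode_for(base: Reg, index: Reg, ext: Option<ExtendOp>) -> AMode {
        match ext {
            Some(op) => AMode::RegExtended(base, index, op),
            None => AMode::RegReg(base, index),
        }
    }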
2020-06-12 10:40:54 -07:00
Chris Fallin
fe97659813 Address review comments. 2020-06-03 13:31:34 -07:00
Benjamin Bouvier
67c7a3ed19 mach backend: reduce the size of the Inst enum down to 32 bytes; 2020-06-02 16:29:05 +02:00
Anton Kirilov
8a928830ac Enable the wast::Cranelift::spec::simd::simd_store test for AArch64
Copyright (c) 2020, Arm Limited.
2020-05-24 22:53:07 +01:00
Chris Fallin
72e6be9342 Rework of MachInst isel, branch fixups and lowering, and block ordering.
This patch includes:

- A complete rework of the way that CLIF blocks and edge blocks are
  lowered into VCode blocks. The new mechanism in `BlockLoweringOrder`
  computes RPO over the CFG, but with a twist: it merges edge blocks into
  heads or tails of original CLIF blocks wherever possible, and it does
  this without ever actually materializing the full nodes-plus-edges
  graph first. The backend driver lowers blocks in final order so
  there's no need to reshuffle later.

- A new `MachBuffer` that replaces the `MachSection`. This is a special
  version of a code-sink that is far more than a humble `Vec<u8>`. In
  particular, it keeps a record of label definitions and label uses,
  with a machine-pluggable `LabelUse` trait that defines various types
  of fixups (basically internal relocations).

  Importantly, it implements some simple peephole-style branch rewrites
  *inline in the emission pass*, without any separate traversals over
  the code to use fallthroughs, swap taken/not-taken arms, etc. It
  tracks branches at the tail of the buffer and can (i) remove blocks
  that are just unconditional branches (by redirecting the label); (ii)
  understand a conditional/unconditional pair and swap the conditional
  polarity when it's helpful; and (iii) remove branches that branch to
  the fallthrough PC.

  The `MachBuffer` also implements branch-island support. On
  architectures like AArch64, this is needed to allow conditional
  branches within plausibly-attainable ranges (+/- 1MB on AArch64
  specifically). It also does this inline while streaming through the
  emission, without any sort of fixpoint algorithm or later moving of
  code, by simply tracking outstanding references and "deadlines" and
  emitting an island just-in-time when we're in danger of going out of
  range (see the sketch after this list).

- A rework of the instruction selector driver. This is largely following
  the same algorithm as before, but is cleaned up significantly, in
  particular in the API: the machine backend can ask for an input arg
  and get any of three forms (constant, register, producing
  instruction), indicating it needs the register or can merge the
  constant or producing instruction as appropriate. This new driver
  takes special care to emit constants right at use-sites (and at phi
  inputs), minimizing their live-ranges, and also special-cases the
  "pinned register" to avoid superfluous moves.

Overall, on `bz2.wasm`, the results are:

    wasmtime full run (compile + runtime) of bz2:

    baseline:   9774M insns, 9742M cycles, 3.918s
    w/ changes: 7012M insns, 6888M cycles, 2.958s  (24.5% faster, 28.3% fewer insns)

    clif-util wasm compile bz2:

    baseline:   2633M insns, 3278M cycles, 1.034s
    w/ changes: 2366M insns, 2920M cycles, 0.923s  (10.7% faster, 10.1% fewer insns)

    All numbers are averages of two runs on an Ampere eMAG.
2020-05-16 23:08:22 -07:00
Chris Fallin
a66724aafd Rework aarch64 stack frame implementation.
This PR changes the aarch64 ABI implementation to use positive offsets
from SP, rather than negative offsets from FP, to refer to spill slots
and stack-local storage. This allows for better addressing-mode options,
and hence slightly better code: e.g., the unsigned scaled 12-bit offset
mode can be used to reach anywhere in a 32KB frame without extra
address-construction instructions, whereas negative offsets are limited
to a signed 9-bit unscaled mode (-256 bytes).

To enable this, the PR introduces a notion of "nominal SP offsets" as a
virtual addressing mode, lowered during the emission pass. The offsets
are relative to "SP after adjusting downward to allocate stack/spill
slots", but before pushing clobbers. This allows the addressing-mode
expressions to be generated before register allocation (or during it,
for spill/reload sequences).

To convert these offsets into *true* offsets from SP, we need to track
how much further SP is moved downward, and compensate for this. We do so
with "virtual SP offset adjustment" pseudo-instructions: these are seen
by the emission pass, and result in no instruction (0 byte output), but
update state that is now threaded through each instruction emission in
turn. In this way, we can push e.g. stack args for a call and adjust
the virtual SP offset, allowing reloads from nominal-SP-relative
spillslots while we do the argument setup with "real SP offsets" at the
same time.
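
The emission-time translation, as a minimal sketch (names hypothetical):

    // A nominal-SP-relative offset plus the running "virtual SP offset
    // adjustment" yields the true offset from the current SP; the
    // pseudo-instructions described above simply bump this state and
    // emit no bytes.
    struct EmitState {
        virtual_sp_offset: i64, // grows when, e.g., stack args are pushed
    }

    fn real_sp_offset(nominal: i64, state: &EmitState) -> i64 {
        nominal + state.virtual_sp_offset
    }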
2020-05-06 09:23:55 -07:00
Chris Fallin
8f462db645 Fix ImmLogic.invert(), and with it, fcopysign and float_misc test.
Previously, `fcopysign` was mysteriously failing to pass the
`float_misc` spec test. This was tracked down to bad logical-immediate
masks used to separate the sign and not-sign bits. In particular, the
masks for the and-not operations were wrong. The `invert()` function on
an `ImmLogic` immediate, it turns out, assumed every immediate would be
used by a 64-bit instruction; `ImmLogic` immediates are subtly different
for 32-bit instructions. This change tracks the instruction size (32 or
64 bits) intended for use with each such immediate, and passes it back
into `maybe_from_u64` when computing the inverted immediate.

Addresses several of the failures (`float_misc`, `f32_bitwise`) for
 #1521 (test failures) and presumably helps #1519 (SpiderMonkey
integration).
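
The underlying issue, in plain-integer form (illustration only):

    fn main() {
        let sign_mask: u32 = 0x8000_0000; // f32 sign bit
        // Inverting within the 32-bit operand size gives the intended
        // not-sign mask:
        assert_eq!(!sign_mask, 0x7fff_ffff);
        // Inverting as if for a 64-bit instruction gives a completely
        // different bit pattern, hence the need to track the size:
        assert_eq!(!(sign_mask as u64), 0xffff_ffff_7fff_ffff);
    }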
2020-04-23 13:06:13 -07:00
Joey Gouly
3638f8a764 arm64: Add support for CCmp
Also add a test for SUBS/ADDS with XZR, as CMP/CMN are aliases.

Copyright (c) 2020, Arm Limited.
2020-04-21 12:19:07 +02:00
Chris Fallin
48cf2c2f50 Address review comments:
- Undo temporary changes to default features (`all-arch`) and a
  signal-handler test.
- Remove `SIGTRAP` handler: no longer needed now that we've found an
  "undefined opcode" option on ARM64.
- Rename pp.rs to pretty_print.rs in machinst/.
- Only use empty stack-probe on non-x86. As per a comment in
  rust-lang/compiler-builtins [1], LLVM only supports stack probes on
  x86 and x86-64. Thus, on any other CPU architecture, we cannot refer
  to `__rust_probestack`, because it does not exist.
- Rename arm64 to aarch64.
- Use `target` directive in vcode filetests.
- Run the flags verifier, but without encinfo, when using new backends.
- Clean up warning overrides.
- Fix up use of casts: use u32::from(x) and siblings when possible,
  u32::try_from(x).unwrap() when not, to avoid silent truncation (see the
  example after this list).
- Take immutable `Function` borrows as input; we don't actually
  mutate the input IR.
- Lots of other miscellaneous cleanups.

[1] cae3e6ea23/src/probestack.rs (L39)
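
The cast policy in one example (illustration only, not code from the
patch):

    use std::convert::TryFrom;

    fn main() {
        let small: u16 = 0x1234;
        assert_eq!(u32::from(small), 0x1234); // lossless: use From
        let big: u64 = 0x1_0000_0000;
        assert!(u32::try_from(big).is_err()); // would truncate: fail loudly
    }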
2020-04-15 17:21:28 -07:00