Commit Graph

37 Commits

Author SHA1 Message Date
Andrew Brown
057c93b64e Add unarrow instruction with x86 implementation
Adds a shared `unarrow` instruction in order to lower the Wasm SIMD specification's unsigned narrowing (see https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md#integer-to-integer-narrowing). Additionally, this commit implements the instruction for x86 using PACKUSWB and PACKUSDW for the applicable encodings.
2020-07-02 09:35:45 -07:00
Andrew Brown
65e6de2344 Replace x86_packss with snarrow
Since the Wasm specification contains narrowing instructions (see https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md#integer-to-integer-narrowing) that lower to PACKSS*, the x86-specific instruction is not necessary in the CLIF IR.
2020-07-02 09:35:45 -07:00
Chris Fallin
0a59a321bd Merge pull request #1954 from cfallin/b1649432
AArch64: fix shift ops: mask shift amount.
2020-07-01 09:33:29 -07:00
Chris Fallin
a351fa52b5 Merge pull request #1930 from cfallin/spectre-heap
Spectre mitigation on heap access overflow checks.
2020-07-01 09:23:04 -07:00
Chris Fallin
533f1c8d8b Aarch64: fix shift ops: mask shift amount.
The failure to mask the amount triggered a panic due to a subtraction
overflow check; see
https://bugzilla.mozilla.org/show_bug.cgi?id=1649432. Attempting to
shift by an out-of-range amount should be defined to shift by an amount
mod the operand size (i.e., masked to 5 bits for 32-bit shifts, or 6
bits for 64-bit shifts).
2020-07-01 08:57:56 -07:00
Chris Fallin
e694fb1312 Spectre mitigation on heap access overflow checks.
This PR adds a conditional move following a heap bounds check through
which the address to be accessed flows. This conditional move ensures
that even if the branch is mispredicted (access is actually out of
bounds, but speculation goes down in-bounds path), the acually accessed
address is zero (a NULL pointer) rather than the out-of-bounds address.

The mitigation is controlled by a flag that is off by default, but can
be set by the embedding. Note that in order to turn it on by default,
we would need to add conditional-move support to the current x86
backend; this does not appear to be present. Once the deprecated
backend is removed in favor of the new backend, IMHO we should turn
this flag on by default.

Note that the mitigation is unneccessary when we use the "huge heap"
technique on 64-bit systems, in which we allocate a range of virtual
address space such that no 32-bit offset can reach other data. Hence,
this only affects small-heap configurations.
2020-07-01 08:36:09 -07:00
Andrew Brown
26bdf9c333 Add iabs instructions.rs
This is necessary for implementing SIMD [abs](https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md#lane-wise-integer-absolute-value).
2020-06-30 14:00:17 -07:00
Chris Fallin
492000e945 MachInst isel and aarch64 backend: docs / clarity improvements.
From discussion with Julian and Ben, this PR makes a few documentation-
and naming-level changes (no functionality change):

- Document that the `LowerCtx`-provided output register can be used as a
  scratch register during the lowered instruction sequence before
  placing the final result in it.

- Rename `input_to_*` helpers in the AArch64 backend to
  `put_input_in_*`, emphasizing that these are side-effecting helpers
  that potentially generate code (e.g., sign/zero-extensions) to ensure
  an input value is in a register.
2020-06-18 12:18:50 -07:00
Anton Kirilov
90bafae1dc AArch64: Implement SIMD floating-point comparisons
Copyright (c) 2020, Arm Limited.
2020-06-18 11:07:52 +01:00
Joey Gouly
0f462330e0 arm64: Implement AllTrue and AnyTrue
This enables the simd_boolean WASM SIMD spec test.

Copyright (c) 2020, Arm Limited.
2020-06-17 15:40:51 +01:00
Benjamin Bouvier
dad56a2488 cranelift: add a new resumable_trapnz instruction;
This is useful to have to allow resumable_trap to happen in loop
headers, for instance. This is the correct way to implement interrupt
checks in Spidermonkey, which are effectively resumable traps. Previous
implementation was using traps, which is wrong, since traps semantically
can't be resumed after.
2020-06-15 12:04:28 +02:00
Andrew Brown
772ce73f7f Add x86_pblendw instruction
This instruction is necessary for lowering `fcvt_from_uint`.
2020-06-12 15:06:22 -07:00
Andrew Brown
546fc9ddf1 Add x86_vcvtudq2ps instruction
This instruction converts i32x4 to f32x4 in several AVX512 feature sets.
2020-06-12 15:06:22 -07:00
Joey Gouly
544c5dece5 arm64: Implement SIMD bitwise operations
Copyright (c) 2020, Arm Limited.
2020-06-11 10:58:23 -07:00
Anton Kirilov
d941034c2e Enable the wast::Cranelift::spec::simd::simd_load_splat test for AArch64
Copyright (c) 2020, Arm Limited.
2020-06-10 15:01:37 +01:00
Chris Fallin
ac87ed12bd Merge pull request #1847 from akirilov-arm/simd_load_extend
Enable the wast::Cranelift::spec::simd::simd_load_extend test for AArch64
2020-06-09 12:29:06 -07:00
Joey Gouly
df2b031b6a arm64: Implement Icmp for I16X8 and I32X4
Copyright (c) 2020, Arm Limited.
2020-06-09 11:07:43 -07:00
Anton Kirilov
7ac19af498 Enable the wast::Cranelift::spec::simd::simd_load_extend test for AArch64
Copyright (c) 2020, Arm Limited.
2020-06-09 18:05:38 +01:00
Anton Kirilov
51a551fb39 Implement vector element extensions for AArch64
This commit also includes load and extend operations. Both are
prerequisites for enabling further SIMD spec tests.

Copyright (c) 2020, Arm Limited.
2020-06-09 12:28:49 +01:00
Chris Fallin
fc2a6f273b Three fixes to various SpiderMonkey-related issues:
- Properly mask constant values down to appropriate width when
  generating a constant value directly in aarch64 backend. This was a
  miscompilation introduced in the new-isel refactor. In combination
  with failure to respect NarrowValueMode, this resulted in a very
  subtle bug when an `i32` constant was used in bit-twiddling logic.

- Add support for `iadd_ifcout` in aarch64 backend as used in explicit
  heap-check mode. With this change, we no longer fail heap-related
  tests with the huge-heap-region mode disabled.

- Remove a panic that was occurring in some tests that are currently
  ignored on aarch64, by simply returning empty/default information in
  `value_label` functionality rather than touching unimplemented APIs.
  This is not a bugfix per-se, but removes confusing panic messages from
  `cargo test` output that might otherwise mislead.
2020-06-08 13:02:00 -07:00
Andrew Brown
df171f01b5 Add x86_pmuludq
This instruction multiplies the lower 32 bits of two 64x2 unsigned integers into an i64x2; this is necessary for lowering Wasm's i64x2.mul.
2020-06-03 16:27:57 -07:00
Andrew Brown
9ba9fd0f64 Add x86-specific instruction for i64x2 multiplication
Without this special instruction, legalizing to the AVX512 instruction AND the SSE instruction sequence is impossible. This extra instruction would be rendered unnecessary by the x64 backend.
2020-06-03 16:27:57 -07:00
Chris Fallin
fe97659813 Address review comments. 2020-06-03 13:31:34 -07:00
Chris Fallin
9fec933056 Merge pull request #1801 from jgouly/cmp-rebase
arm64: add support for I8X16 ICmp
2020-06-02 09:35:41 -07:00
Joey Gouly
90a421193f arm64: add support for I8X16 ICmp
Copyright (c) 2020, Arm Limited.
2020-06-02 16:58:09 +01:00
Benjamin Bouvier
67c7a3ed19 mach backend: reduce the size of the Inst enum down to 32 bytes; 2020-06-02 16:29:05 +02:00
Anton Kirilov
8a928830ac Enable the wast::Cranelift::spec::simd::simd_store test for AArch64
Copyright (c) 2020, Arm Limited.
2020-05-24 22:53:07 +01:00
Joey Gouly
02c3f238f8 arm64: Use FPU instrctions for Fcopysign
Copyright (c) 2020, Arm Limited.
2020-05-21 18:14:12 +01:00
Chris Fallin
bdd2873c8c Address review comments. 2020-05-18 16:25:26 -07:00
Chris Fallin
72e6be9342 Rework of MachInst isel, branch fixups and lowering, and block ordering.
This patch includes:

- A complete rework of the way that CLIF blocks and edge blocks are
  lowered into VCode blocks. The new mechanism in `BlockLoweringOrder`
  computes RPO over the CFG, but with a twist: it merges edge blocks intto
  heads or tails of original CLIF blocks wherever possible, and it does
  this without ever actually materializing the full nodes-plus-edges
  graph first. The backend driver lowers blocks in final order so
  there's no need to reshuffle later.

- A new `MachBuffer` that replaces the `MachSection`. This is a special
  version of a code-sink that is far more than a humble `Vec<u8>`. In
  particular, it keeps a record of label definitions and label uses,
  with a machine-pluggable `LabelUse` trait that defines various types
  of fixups (basically internal relocations).

  Importantly, it implements some simple peephole-style branch rewrites
  *inline in the emission pass*, without any separate traversals over
  the code to use fallthroughs, swap taken/not-taken arms, etc. It
  tracks branches at the tail of the buffer and can (i) remove blocks
  that are just unconditional branches (by redirecting the label), (ii)
  understand a conditional/unconditional pair and swap the conditional
  polarity when it's helpful; and (iii) remove branches that branch to
  the fallthrough PC.

  The `MachBuffer` also implements branch-island support. On
  architectures like AArch64, this is needed to allow conditional
  branches within plausibly-attainable ranges (+/- 1MB on AArch64
  specifically). It also does this inline while streaming through the
  emission, without any sort of fixpoint algorithm or later moving of
  code, by simply tracking outstanding references and "deadlines" and
  emitting an island just-in-time when we're in danger of going out of
  range.

- A rework of the instruction selector driver. This is largely following
  the same algorithm as before, but is cleaned up significantly, in
  particular in the API: the machine backend can ask for an input arg
  and get any of three forms (constant, register, producing
  instruction), indicating it needs the register or can merge the
  constant or producing instruction as appropriate. This new driver
  takes special care to emit constants right at use-sites (and at phi
  inputs), minimizing their live-ranges, and also special-cases the
  "pinned register" to avoid superfluous moves.

Overall, on `bz2.wasm`, the results are:

    wasmtime full run (compile + runtime) of bz2:

    baseline:   9774M insns, 9742M cycles, 3.918s
    w/ changes: 7012M insns, 6888M cycles, 2.958s  (24.5% faster, 28.3% fewer insns)

    clif-util wasm compile bz2:

    baseline:   2633M insns, 3278M cycles, 1.034s
    w/ changes: 2366M insns, 2920M cycles, 0.923s  (10.7% faster, 10.1% fewer insns)

    All numbers are averages of two runs on an Ampere eMAG.
2020-05-16 23:08:22 -07:00
Joey Gouly
f418b7a700 Reduce arm64 Inst enum size
This reduces the size of the Inst enum from 112 bytes to 48 bytes.

Using DHAT on a regex-rs.wasm benchmark, `valgrind --tool=dhat clif-util compile --target aarch64`

The total number of allocated bytes, drops by around 170 MB.
At t-gmax drops by 3 MB.

Using `perf stat clif-util compile --target aarch64`, the instructions count dropped by 0.6%. Cache misses dropped by 6%. Cycles dropped by 2.3%.
2020-05-14 15:45:55 +01:00
Benjamin Bouvier
528d3c1355 machinst: Steal the used/defs Sets when emitting a call in ABICall; 2020-05-07 12:24:02 +02:00
Benjamin Bouvier
9215b610ef machinst: Avoid a lot of short-lived allocations in ABICall; 2020-05-07 12:24:02 +02:00
Chris Fallin
a66724aafd Rework aarch64 stack frame implementation.
This PR changes the aarch64 ABI implementation to use positive offsets
from SP, rather than negative offsets from FP, to refer to spill slots
and stack-local storage. This allows for better addressing-mode options,
and hence slightly better code: e.g., the unsigned scaled 12-bit offset
mode can be used to reach anywhere in a 32KB frame without extra
address-construction instructions, whereas negative offsets are limited
to a signed 9-bit unscaled mode (-256 bytes).

To enable this, the PR introduces a notion of "nominal SP offsets" as a
virtual addressing mode, lowered during the emission pass. The offsets
are relative to "SP after adjusting downward to allocate stack/spill
slots", but before pushing clobbers. This allows the addressing-mode
expressions to be generated before register allocation (or during it,
for spill/reload sequences).

To convert these offsets into *true* offsets from SP, we need to track
how much further SP is moved downward, and compensate for this. We do so
with "virtual SP offset adjustment" pseudo-instructions: these are seen
by the emission pass, and result in no instruction (0 byte output), but
update state that is now threaded through each instruction emission in
turn. In this way, we can push e.g. stack args for a call and adjust
the virtual SP offset, allowing reloads from nominal-SP-relative
spillslots while we do the argument setup with "real SP offsets" at the
same time.
2020-05-06 09:23:55 -07:00
Chris Fallin
e39b4aba1c Fix long-range (non-colocated) aarch64 calls to not use Arm64Call reloc, and fix simplejit to use it.
Previously, every call was lowered on AArch64 to a `call` instruction, which
takes a signed 26-bit PC-relative offset. Including the 2-bit left shift, this
gives a range of +/- 128 MB. Longer-distance offsets would cause an impossible
relocation record to be emitted (or rather, a record that a more sophisticated
linker would fix up by inserting a shim/veneer).

This commit adds a notion of "relocation distance" in the MachInst backends,
and provides this information for every call target and symbol reference. The
intent is that backends on architectures like AArch64, where there are different
offset sizes / addressing strategies to choose from, can either emit a regular
call or a load-64-bit-constant / call-indirect sequence, as necessary. This
avoids the need to implement complex linking behavior.

The MachInst driver code provides this information based on the "colocated" bit
in the CLIF symbol references, which appears to have been designed for this
purpose, or at least a similar one. Combined with the `use_colocated_libcalls`
setting, this allows client code to ensure that library calls can link to
library code at any location in the address space.

Separately, the `simplejit` example did not handle `Arm64Call`; rather than doing
so, it appears all that is necessary to get its tests to pass is to set the
`use_colocated_libcalls` flag to false, to make use of the above change. This
fixes the `libcall_function` unit-test in this crate.
2020-05-05 09:55:12 -07:00
Andrew Brown
a312506262 Add x86 complex encodings for SIMD load-extend instructions 2020-04-30 11:38:01 -07:00
Benjamin Bouvier
4c066b1c73 codegen: split lower.rs into multiple files;
This splits off lower.rs into two files: lower.rs keeps all the utility
functions, while lower_inst.rs contains the (gigantic!) function
lowering a single Cranelift instruction into vcode.

This is done to satisfy a check done on the maximal file's size when
vendoring Rust source code into Mozilla central's repository.
2020-04-30 13:50:45 +02:00