Commit Graph

744 Commits

Author SHA1 Message Date
Nick Fitzgerald
05bf9ea3f3 Rename "Stackmap" to "StackMap"
And "stackmap" to "stack_map".

This commit is purely mechanical.
2020-08-07 10:08:44 -07:00
Anton Kirilov
1ec6930005 Enable the spec::simd::simd_lane test for AArch64
Copyright (c) 2020, Arm Limited.
2020-08-06 11:14:15 +01:00
Chris Fallin
1fbdf169b5 Aarch64: fix narrow integer-register extension with Baldrdash ABI.
In the Baldrdash (SpiderMonkey) embedding, we must take care to
zero-extend all function arguments to callees in integer registers when
the types are narrower than 64 bits. This is because, unlike the native
SysV ABI, the Baldrdash ABI expects high bits to be cleared. Not doing
so leads to difficult-to-trace errors where high bits falsely tag an
int32 as e.g. an object pointer, leading to potential security issues.
2020-07-31 10:19:13 -07:00
Chris Fallin
8fd92093a4 Merge pull request #2061 from cfallin/aarch64-amode
Aarch64 codegen quality: support more general add+extend address computations.
2020-07-27 13:48:55 -07:00
Chris Fallin
f9b98f0ddc Aarch64 codegen quality: support more general add+extend computations.
Previously, our pattern-matching for generating load/store addresses was
somewhat limited. For example, it could not use a register-extend
address mode to handle the following CLIF:

```
   v2760 = uextend.i64 v985
   v2761 = load.i64 notrap aligned readonly v1
   v1018 = iadd v2761, v2760
   store v1017, v1018
```

This PR adds more general support for address expressions made up of
additions and extensions. In particular, it pattern-matches a tree of
64-bit `iadd`s, optionally with `uextend`/`sextend` from 32-bit values
at the leaves, to collect the list of all addends that form the address.
It also collects all offsets at leaves, combining them.
It applies a series of heuristics to make the best use of the
available addressing modes, filling the load/store itself with as many
64-bit registers, zero/sign-extended 32-bit registers, and/or an offset,
then computing the rest with add instructions as necessary. It attempts
to make use of immediate forms (add-immediate or subtract-immediate)
whenever possible, and also uses the built-in extend operators on add
instructions when possible. There are certainly cases where this is not
optimal (i.e., does not generate the strictly shortest sequence of
instructions), but it should be good enough for most code.

Using `perf stat` to measure instruction count (runtime only, on
wasmtime, after populating the cache to avoid measuring compilation),
this impacts `bz2` as follows:

```
pre:

       1006.410425      task-clock (msec)         #    1.000 CPUs utilized
               113      context-switches          #    0.112 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,036      page-faults               #    0.005 M/sec
     3,221,547,476      cycles                    #    3.201 GHz
     4,000,670,104      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,958,613      branch-misses

       1.006071348 seconds time elapsed

post:

        963.499525      task-clock (msec)         #    0.997 CPUs utilized
               117      context-switches          #    0.121 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,081      page-faults               #    0.005 M/sec
     3,039,687,673      cycles                    #    3.155 GHz
     3,837,761,690      instructions              #    1.26  insn per cycle
   <not supported>      branches
        28,254,585      branch-misses

       0.966072682 seconds time elapsed
```

In other words, this reduces instruction count by 4.1% on `bz2`.
2020-07-27 13:10:50 -07:00
Chris Fallin
1b80860f1f Aarch64 codegen quality: handle add-negative-imm as subtract.
We often see patterns like:

```
    mov w2, #0xffff_ffff   // uses ORR with logical immediate form
    add w0, w1, w2
```

which is just `w0 := w1 - 1`. It would be much better to recognize when
the inverse of an immediate will fit in a 12-bit immediate field if the
immediate itself does not, and flip add to subtract (and vice versa), so
we can instead generate:

```
    sub w0, w1, #1
```

We see this pattern in e.g. `bz2`, where this commit makes the following
difference (counting instructions with `perf stat`, filling in the
wasmtime cache first then running again to get just runtime):

pre:

```
        992.762250      task-clock (msec)         #    0.998 CPUs utilized
               109      context-switches          #    0.110 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,035      page-faults               #    0.005 M/sec
     3,224,119,134      cycles                    #    3.248 GHz
     4,000,521,171      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,573,755      branch-misses

       0.995072322 seconds time elapsed
```

post:

```
        993.853850      task-clock (msec)         #    0.998 CPUs utilized
               123      context-switches          #    0.124 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,072      page-faults               #    0.005 M/sec
     3,201,278,337      cycles                    #    3.221 GHz
     3,917,061,340      instructions              #    1.22  insn per cycle
   <not supported>      branches
        28,410,633      branch-misses

       0.996008047 seconds time elapsed
```

In other words, a 2.1% reduction in instruction count on `bz2`.
2020-07-24 11:41:33 -07:00
Chris Fallin
1b3b2dbfd0 Merge pull request #2043 from cfallin/csel-opt
Aarch64: handle csel with icmp/fcmp source without materializing the bool.
2020-07-18 19:33:47 -07:00
Chris Fallin
21dac670f0 Aarch64: handle csel with icmp/fcmp source without materializing the bool.
Previously, we simply compared the input bool to 0, which forced the
value into a register (usually via a cmp and cset), zero-extended it,
etc. This patch performs the same pattern-matching that branches do to
directly perform the cmp and use its flag results with the csel.

On the `bz2` benchmark, the runtime is affected as follows (measuring
with `perf stat`, using wasmtime with its cache enabled, and taking the
second run after the first compiles and populates the cache):

pre:

       1117.232000      task-clock (msec)         #    1.000 CPUs utilized
               133      context-switches          #    0.119 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,041      page-faults               #    0.005 M/sec
     3,511,615,100      cycles                    #    3.143 GHz
     4,272,427,772      instructions              #    1.22  insn per cycle
   <not supported>      branches
        27,980,906      branch-misses

       1.117299838 seconds time elapsed

post:

       1003.738075      task-clock (msec)         #    1.000 CPUs utilized
               121      context-switches          #    0.121 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,052      page-faults               #    0.005 M/sec
     3,224,875,393      cycles                    #    3.213 GHz
     4,000,838,686      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,928,232      branch-misses

       1.003440004 seconds time elapsed

In other words, with this change, on `bz2`, we see a 6.3% reduction in
executed instructions.
2020-07-17 21:10:21 -07:00
Chris Fallin
9bd9c628aa Aarch64: mask shift-amounts incorporated into reg-reg-shift ALU insts.
We had previously fixed a bug in which constant shift amounts should be
masked to modulo the number of bits in the operand; however, we did not
fix the analogous case for shifts incorporated into the second register
argument of ALU instructions that support integrated shifts.  This
failure to mask resulted in illegal instructions being generated, e.g.
in https://bugzilla.mozilla.org/show_bug.cgi?id=1653502. This PR fixes
the issue by masking the amount, as the shift semantics require.
2020-07-17 14:55:23 -07:00
bjorn3
5c5a30f76c Fix review comments 2020-07-17 12:03:17 +02:00
bjorn3
7b7b1f4997 Rename sarg__ to sarg_t 2020-07-17 12:03:17 +02:00
bjorn3
4431ac1108 Implement SystemV struct argument passing 2020-07-17 12:03:17 +02:00
Chris Fallin
5e0268a542 Merge pull request #2034 from cfallin/update-regalloc
Update to regalloc.rs 0.0.28.
2020-07-16 11:36:11 -07:00
Alex Crichton
63d5b91930 Wasmtime 0.19.0 and Cranelift 0.66.0 (#2027)
This commit updates Wasmtime's version to 0.19.0, Cranelift's version to
0.66.0, and updates the release notes as well.
2020-07-16 12:46:21 -05:00
Chris Fallin
756e8b8ea2 Update to regalloc.rs 0.0.28.
This version of regalloc.rs includes several bugfixes for
reference-types support used by the new backend framework and the
aarch64 backend (bytecodealliance/regalloc.rs#85 and
bytecodealliance/regalloc.rs#86).
2020-07-16 09:42:09 -07:00
Andrew Brown
f0b083c6ad Legalize [u|s]widen_high for x86
Use `x86_palignr` and `[u|s]widen_low` for legalizing this instruction.
2020-07-15 11:32:08 -07:00
Andrew Brown
c8ddf8a34c Encode [u|s]widen_low for x86 2020-07-15 11:32:08 -07:00
Andrew Brown
fafef7db77 Add x86_palignr instructions
This instruction is necessary for implementing `[s|u]widen_high`.
2020-07-15 11:32:08 -07:00
Chris Fallin
26529006e0 Address review comments. 2020-07-14 10:17:29 -07:00
Chris Fallin
08353fcc14 Reftypes part two: add support for stackmaps.
This commit adds support for generating stackmaps at safepoints to the
new backend framework and to the AArch64 backend in particular. It has
been tested to work with SpiderMonkey.
2020-07-14 10:17:27 -07:00
Chris Fallin
b93e8c296d Initial reftype support in aarch64, modulo safepoints.
This commit adds the inital support to allow reftypes to flow through
the program when targetting aarch64. It also adds a fix to the
`ModuleTranslationState` needed to send R32/R64 types over from the
SpiderMonkey embedding.

This commit does not include any support for safepoints in aarch64
or the `MachInst` infrastructure; that is in the next commit.

This commit also makes a drive-by improvement to `Bint`, avoiding an
unneeded zero-extension op when the extended value comes directly from a
conditional-set (which produces a full-width 0 or 1).
2020-07-14 10:14:18 -07:00
Alex Crichton
85ffc8f595 Switch CI back to nightly channel (#2014)
* Switch CI back to nightly channel

I think all upstream issues are now fixed so we should be good to switch
back to nightly from our previously pinned version.

* Fix doc warnings
2020-07-13 18:40:47 -05:00
Andrew Brown
c5a69cee9f Add x86 legalization for fcvt_to_uint_sat.i32x4
This converts an `f32x4` into an `i32x4` (unsigned) with rounding by using a long sequence of SSE4.1 compatible instructions.
2020-07-08 10:20:01 -07:00
Peter Huene
d6ae72abe6 Merge pull request #1983 from peterhuene/fix-unwind-info
Remove 'set frame pointer' unwind code from Windows x64 unwind.
2020-07-06 22:26:41 -07:00
Peter Huene
3a33749404 Remove 'set frame pointer' unwind code from Windows x64 unwind.
This commit removes the "set frame pointer" unwind code and frame
pointer information from Windows x64 unwind information.

In Windows x64 unwind information, a "frame pointer" is actually the
*base address* of the static part of the local frame and would be at some
negative offset to RSP upon establishing the frame pointer.

Currently Cranelift uses a "traditional" notion of a frame pointer, one
that is the highest address in the local frame (i.e. pointing at the
previous frame pointer on the stack).

Windows x64 unwind doesn't describe such frame pointers and only needs
one described if the frame contains a dynamic stack allocation.

Fixes #1967.
2020-07-06 14:22:57 -07:00
Chris Fallin
b700646c93 Merge pull request #1962 from cfallin/aarch64-lowering-condbr
AArch64: avoid branches with explicit offsets at lowering stage.
2020-07-02 14:05:40 -07:00
Chris Fallin
b7ecad1d74 AArch64: avoid branches with explicit offsets at lowering stage.
In discussions with @bnjbvr, it came up that generating `OneWayCondBr`s
with explicit, hardcoded PC-offsets as part of lowered instruction
sequences is actually unsafe, because the register allocator *might*
insert a spill or reload into the middle of our sequence. We were
careful about this in some cases but somehow missed that it was a
general restriction. Conceptually, all inter-instruction references
should be via labels at the VCode level; explicit offsets are only ever
known at emission time, and resolved by the `MachBuffer`.

To allow for conditional trap checks without modifying the CFG (as seen
by regalloc) during lowering, this PR instead adds a `TrapIf`
pseudo-instruction that conditionally skips a single embedded trap
instruction. It lowers to the same `condbr label ; trap ; label: ...`
sequence, but without the hardcoded branch-target offset in the lowering
code.
2020-07-02 11:02:27 -07:00
Andrew Brown
057c93b64e Add unarrow instruction with x86 implementation
Adds a shared `unarrow` instruction in order to lower the Wasm SIMD specification's unsigned narrowing (see https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md#integer-to-integer-narrowing). Additionally, this commit implements the instruction for x86 using PACKUSWB and PACKUSDW for the applicable encodings.
2020-07-02 09:35:45 -07:00
Andrew Brown
65e6de2344 Replace x86_packss with snarrow
Since the Wasm specification contains narrowing instructions (see https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md#integer-to-integer-narrowing) that lower to PACKSS*, the x86-specific instruction is not necessary in the CLIF IR.
2020-07-02 09:35:45 -07:00
Chris Fallin
0a59a321bd Merge pull request #1954 from cfallin/b1649432
AArch64: fix shift ops: mask shift amount.
2020-07-01 09:33:29 -07:00
Chris Fallin
a351fa52b5 Merge pull request #1930 from cfallin/spectre-heap
Spectre mitigation on heap access overflow checks.
2020-07-01 09:23:04 -07:00
Chris Fallin
533f1c8d8b Aarch64: fix shift ops: mask shift amount.
The failure to mask the amount triggered a panic due to a subtraction
overflow check; see
https://bugzilla.mozilla.org/show_bug.cgi?id=1649432. Attempting to
shift by an out-of-range amount should be defined to shift by an amount
mod the operand size (i.e., masked to 5 bits for 32-bit shifts, or 6
bits for 64-bit shifts).
2020-07-01 08:57:56 -07:00
Chris Fallin
e694fb1312 Spectre mitigation on heap access overflow checks.
This PR adds a conditional move following a heap bounds check through
which the address to be accessed flows. This conditional move ensures
that even if the branch is mispredicted (access is actually out of
bounds, but speculation goes down in-bounds path), the acually accessed
address is zero (a NULL pointer) rather than the out-of-bounds address.

The mitigation is controlled by a flag that is off by default, but can
be set by the embedding. Note that in order to turn it on by default,
we would need to add conditional-move support to the current x86
backend; this does not appear to be present. Once the deprecated
backend is removed in favor of the new backend, IMHO we should turn
this flag on by default.

Note that the mitigation is unneccessary when we use the "huge heap"
technique on 64-bit systems, in which we allocate a range of virtual
address space such that no 32-bit offset can reach other data. Hence,
this only affects small-heap configurations.
2020-07-01 08:36:09 -07:00
Andrew Brown
737cf1d605 Implement iabs for x86 SIMD
This only covers the types necessary for implementing the Wasm SIMD spec--`i8x16`, `i16x8`, `i32x4`.
2020-06-30 14:00:17 -07:00
Alex Crichton
0acd2072c2 Fix doc warnings and link failures (#1948)
Also add configuration to CI to fail doc generation if any links are
broken. Unfortunately we can't blanket deny all warnings in rustdoc
since some are unconditional warnings, but for now this is hopefully
good enough.

Closes #1947
2020-06-30 13:01:49 -05:00
Andrew Brown
c9d573d841 Provide spec-compliant legalization for SIMD floating point min/max 2020-06-25 14:48:16 -07:00
Benjamin Bouvier
60ac091afe Remove unused dependencies in Cranelift; 2020-06-22 08:45:09 -07:00
Benjamin Bouvier
4f6a002f70 cranelift-filetests: tell when a run test is skipped (fixes #1558); 2020-06-18 16:11:21 -07:00
Andrew Brown
3675f95bb2 Legalize fcvt_to_sint_sat.i32x4 on x86
Use a lengthy sequence involving CVTTPS2DQ to quiet NaNs and saturate overflow.
2020-06-18 11:39:38 -07:00
Andrew Brown
01d34e71b9 Add x86 legalization for fcvt_from_uint.f32x4
This converts an `i32x4` into an `f32x4` with some rounding either by using an AVX512VL/F instruction--VCVTUDQ2PS--or a long sequence of SSE4.1 compatible instructions.
2020-06-12 15:06:22 -07:00
Andrew Brown
772ce73f7f Add x86_pblendw instruction
This instruction is necessary for lowering `fcvt_from_uint`.
2020-06-12 15:06:22 -07:00
Andrew Brown
546fc9ddf1 Add x86_vcvtudq2ps instruction
This instruction converts i32x4 to f32x4 in several AVX512 feature sets.
2020-06-12 15:06:22 -07:00
Chris Fallin
6286ca7310 AArch64: make use of reg-reg-extend amode.
When a load/store instruction needs an address of the form `v0 +
uextend(v1)` or `v0 + sextend(v1)` (or the commuted forms thereof), we
currently generate a separate zero/sign-extend operation and then use a
plain `[rA, rB]` addressing mode. This patch extends `lower_address()`
to look at both addends of an address if it has two addends and a zero
offset, recognize extension operations, and incorporate them directly
into a `[rA, rB, UXTW]` or `[rA, rB, SXTW]` form. This should improve
our performence on WebAssembly workloads, at least, because we often see
a 64-bit linear memory base indexed by a 32-bit (Wasm) pointer value.
2020-06-12 10:40:54 -07:00
Dan Gohman
caa87048ab Wasmtime 0.18.0 and Cranelift 0.65.0. 2020-06-11 17:49:56 -07:00
Chris Fallin
6ba165be01 Merge pull request #1858 from cfallin/fix-scale-b1
Bugfix: scaled addressing mode: round B1 up to one byte.
2020-06-11 11:16:07 -07:00
Chris Fallin
47402316e0 Add test case: b1-typed spillslot access using UImm12 addressing mode. 2020-06-11 10:27:39 -07:00
Benjamin Bouvier
46093f6119 Bump regalloc.rs to 0.0.26;
And adapt to regalloc.rs API change to provide the exact number of vregs.
2020-06-10 18:23:04 +02:00
Nick Fitzgerald
fb9f39ce17 Merge pull request #1824 from fitzgen/test-stack-maps
cranelift: Better document and test stack maps
2020-06-08 15:58:20 -07:00
Nick Fitzgerald
6aac4c891e cranelift: Better document and test stack maps 2020-06-08 15:05:20 -07:00
Chris Fallin
e3d89c8a92 Merge pull request #1825 from cfallin/spidermonkey-fixes
Three fixes to various SpiderMonkey-related issues
2020-06-08 13:54:13 -07:00