Commit Graph

188 Commits

Author SHA1 Message Date
Joey Gouly
df2b031b6a arm64: Implement Icmp for I16X8 and I32X4
Copyright (c) 2020, Arm Limited.
2020-06-09 11:07:43 -07:00
Anton Kirilov
51a551fb39 Implement vector element extensions for AArch64
This commit also includes load and extend operations. Both are
prerequisites for enabling further SIMD spec tests.

Copyright (c) 2020, Arm Limited.
2020-06-09 12:28:49 +01:00
Chris Fallin
e3d89c8a92 Merge pull request #1825 from cfallin/spidermonkey-fixes
Three fixes to various SpiderMonkey-related issues
2020-06-08 13:54:13 -07:00
Chris Fallin
fc2a6f273b Three fixes to various SpiderMonkey-related issues:
- Properly mask constant values down to appropriate width when
  generating a constant value directly in aarch64 backend. This was a
  miscompilation introduced in the new-isel refactor. In combination
  with failure to respect NarrowValueMode, this resulted in a very
  subtle bug when an `i32` constant was used in bit-twiddling logic.

- Add support for `iadd_ifcout` in aarch64 backend as used in explicit
  heap-check mode. With this change, we no longer fail heap-related
  tests with the huge-heap-region mode disabled.

- Remove a panic that was occurring in some tests that are currently
  ignored on aarch64, by simply returning empty/default information in
  `value_label` functionality rather than touching unimplemented APIs.
  This is not a bugfix per-se, but removes confusing panic messages from
  `cargo test` output that might otherwise mislead.
2020-06-08 13:02:00 -07:00
whitequark
bc555468a7 cranelift: add i64.{ishl,ushr,ashr} libcalls.
These libcalls are useful for 32-bit platforms.

On x86_32 in particular, commit 4ec16fa0 added support for legalizing
64-bit shifts through SIMD operations. However, that legalization
requires SIMD to be enabled and SSE 4.1 to be supported, which is not
acceptable as a hard requirement.
2020-06-05 12:13:49 -07:00
Yury Delendik
6f37204f82 Upgrade gimli to 0.21 (#1819)
* Use gimli 0.21

* rm CFI w Expression

* Don't write .debug_frame twice
2020-06-04 14:34:05 -05:00
Andrew Brown
1ea09088be Add x86 legalization for imul.i64x2 for non-AVX CPUs
The `convert_i64x2_imul` custom legalization checks the ISA flags for AVX512DQ or AVX512VL support and legalizes `imul.i64x2` to an `x86_pmullq` in this case; if not, it uses a lengthy SSE2-compatible instruction sequence.
2020-06-03 16:27:57 -07:00
Andrew Brown
df171f01b5 Add x86_pmuludq
This instruction multiplies the lower 32 bits of two 64x2 unsigned integers into an i64x2; this is necessary for lowering Wasm's i64x2.mul.
2020-06-03 16:27:57 -07:00
Andrew Brown
40f31375a5 Add TargetIsa::as_any for downcasting to specific ISA implementations
This is necessary when we would like to check specific ISA flags, e.g.
2020-06-03 16:27:57 -07:00
Andrew Brown
9ba9fd0f64 Add x86-specific instruction for i64x2 multiplication
Without this special instruction, legalizing to the AVX512 instruction AND the SSE instruction sequence is impossible. This extra instruction would be rendered unnecessary by the x64 backend.
2020-06-03 16:27:57 -07:00
Chris Fallin
fe97659813 Address review comments. 2020-06-03 13:31:34 -07:00
Chris Fallin
615362068f Multi-value return support. 2020-06-03 13:31:34 -07:00
Chris Fallin
9fec933056 Merge pull request #1801 from jgouly/cmp-rebase
arm64: add support for I8X16 ICmp
2020-06-02 09:35:41 -07:00
Joey Gouly
90a421193f arm64: add support for I8X16 ICmp
Copyright (c) 2020, Arm Limited.
2020-06-02 16:58:09 +01:00
Benjamin Bouvier
67c7a3ed19 mach backend: reduce the size of the Inst enum down to 32 bytes; 2020-06-02 16:29:05 +02:00
Benjamin Bouvier
e227608510 mach backend: use vectors instead of sets to remember set of uses/defs for calls;
This avoids the set uniqueness (hashing) test, reduces memory
churn when re-mapping virtual register onto real registers, and is
generally more memory-efficient.
2020-06-02 16:29:05 +02:00
Benjamin Bouvier
cfa0527794 mach backend: have mem_finalize return a SmallVec;
This avoids a spurious reallocation of the SmallVec containing the
load_constants result to a Vec, which appeared in dhat profiles.
2020-06-02 16:29:05 +02:00
Nick Fitzgerald
7c68a10ed6 Merge pull request #1670 from teapotd/win64-pass-by-ref
Implement passing arguments by ref for win64 ABI
2020-06-01 11:13:30 -07:00
Andrew Brown
0dd77d36f8 Rename BinaryImm format to BinaryImm64 2020-05-29 19:56:27 -07:00
Andrew Brown
a27a079d65 Replace ExtractLane format with BinaryImm8
Like https://github.com/bytecodealliance/wasmtime/pull/1762, this change the name of the `ExtractLane` format to the more-general `BinaryImm8` and renames its immediate argument from `lane` to `imm`.
2020-05-29 19:56:27 -07:00
Andrew Brown
7d6e94b952 Replace InsertLane format with TernaryImm8
The InsertLane format has an ordering (`value().imm().value()`) and immediate name (`"lane"`) that make it awkward to use for other instructions. This changes the ordering (`value().value().imm()`) and uses the default name (`"imm"`) throughout the codebase.
2020-05-29 19:56:27 -07:00
teapotd
e430984ac4 Improve bitselect codegen with knowledge of operand origin (#1783)
* Encode vselect using BLEND instructions on x86

* Legalize vselect to bitselect

* Optimize bitselect to vselect for some operands

* Add run tests for bitselect-vselect optimization

* Address review feedback
2020-05-29 19:53:11 -07:00
teapotd
759cc3e751 Implement passing arguments by ref for win64 ABI 2020-05-29 20:12:41 +02:00
Nick Fitzgerald
94380bf2b7 Merge pull request #1510 from teapotd/abi-i128-fix
Always check if struct-return parameter is needed
2020-05-29 10:02:16 -07:00
teapotd
0f55bb4b8d Always check if struct-return parameter is needed 2020-05-25 20:03:24 +02:00
Anton Kirilov
8a928830ac Enable the wast::Cranelift::spec::simd::simd_store test for AArch64
Copyright (c) 2020, Arm Limited.
2020-05-24 22:53:07 +01:00
Chris Fallin
73537e72c0 Merge pull request #1732 from jgouly/copysign-fpu
arm64: Use FPU instrctions for Fcopysign
2020-05-22 17:25:33 -07:00
Peter Huene
ce5f3e153b Only update XMM save unwind operation offsets when using a FP.
This commit prevents updating the XMM save unwind operation offsets when a
frame pointer is not used, even though currently Cranelift always uses a
frame pointer.

This will prevent incorrect unwind information in the future when we start
omitting frame pointers.
2020-05-21 16:46:30 -07:00
Peter Huene
2cd5ed1880 Address code review feedback. 2020-05-21 15:57:11 -07:00
Joey Gouly
02c3f238f8 arm64: Use FPU instrctions for Fcopysign
Copyright (c) 2020, Arm Limited.
2020-05-21 18:14:12 +01:00
Peter Huene
78c3091e84 Fix FPR saving and shadow space allocation for Windows x64.
This commit fixes both how FPR callee-saved registers are saved and how the
shadow space allocation occurs when laying out the stack for Windows x64
calling convention.

Importantly, this commit removes the compiler limitation of stack size for
Windows x64 that was imposed because FPR saves previously couldn't always be
represented in the unwind information.

The FPR saves are now performed without using stack slots, much like how the
callee-saved GPRs are saved. The total CSR space is given to `layout_stack` so
that it is included in the frame size and to offset the layout of spills and
explicit slots.

The FPR saves are now done via an RSP offset (post adjustment) and they always
follow the GPR saves on the stack. A simpler calculation can now be made to
determine the proper offsets of the FPR saves for representing the unwind
information.

Additionally, the shadow space is no longer treated as an incoming argument,
but an explicit stack slot that gets laid out at the lowest address possible in
the local frame. This prevents `layout_stack` from putting a spill or explicit
slot in this reserved space. In the future, `layout_stack` should take
advantage of the *caller-provided* shadow space for spills, but this commit does
not attempt to address that.

The shadow space is now omitted from the local frame for leaf functions.

Fixes #1728.
Fixes #1587.
Fixes #1475.
2020-05-20 15:37:30 -07:00
Chris Fallin
c9e3b71c39 Merge pull request #1729 from cfallin/machinst-branch-opt
Fix MachBuffer branch optimization.
2020-05-20 14:43:57 -07:00
Benjamin Bouvier
1f620e1b46 cranelift: bump regalloc.rs to 0.0.24 and adapt to latest API changes; 2020-05-20 15:37:15 +02:00
Chris Fallin
e11094b28b Fix MachBuffer branch optimization.
This patch fixes a subtle bug that occurred in the MachBuffer branch
optimization: in tracking labels at the current buffer tail using a
sorted-by-offset array, the code did not update this array properly when
redirecting labels. As a result, the dead-branch removal was unsafe,
because not every label pointing to a branch is guaranteed to be
redirected properly first.

Discovered while doing performance testing: bz2 silently took a wrong
branch and exited compression early. (Eek!)

To address this problem, this patch adopts a slightly simpler data
structure: we only track the labels *at the current buffer tail*, and
*at the start of each branch*, and we're careful to update these
appropriately to maintain the invariants. I'm pretty confident that this
is correct now, but we should (still) fuzz it a bunch, because wrong
control flow scares me a nonzero amount. I should probably also actually
write out a formal proof that these data-structure updates are correct.
The optimizations are important for performance (removing useless empty
blocks, and taking advantage of any fallthrough opportunities at all),
so I don't think we would want to drop them entirely.
2020-05-19 18:09:18 -07:00
Chris Fallin
bdd2873c8c Address review comments. 2020-05-18 16:25:26 -07:00
Chris Fallin
687aca00fe Update x64 backend to use new lowering APIs. 2020-05-18 16:25:15 -07:00
Chris Fallin
72e6be9342 Rework of MachInst isel, branch fixups and lowering, and block ordering.
This patch includes:

- A complete rework of the way that CLIF blocks and edge blocks are
  lowered into VCode blocks. The new mechanism in `BlockLoweringOrder`
  computes RPO over the CFG, but with a twist: it merges edge blocks intto
  heads or tails of original CLIF blocks wherever possible, and it does
  this without ever actually materializing the full nodes-plus-edges
  graph first. The backend driver lowers blocks in final order so
  there's no need to reshuffle later.

- A new `MachBuffer` that replaces the `MachSection`. This is a special
  version of a code-sink that is far more than a humble `Vec<u8>`. In
  particular, it keeps a record of label definitions and label uses,
  with a machine-pluggable `LabelUse` trait that defines various types
  of fixups (basically internal relocations).

  Importantly, it implements some simple peephole-style branch rewrites
  *inline in the emission pass*, without any separate traversals over
  the code to use fallthroughs, swap taken/not-taken arms, etc. It
  tracks branches at the tail of the buffer and can (i) remove blocks
  that are just unconditional branches (by redirecting the label), (ii)
  understand a conditional/unconditional pair and swap the conditional
  polarity when it's helpful; and (iii) remove branches that branch to
  the fallthrough PC.

  The `MachBuffer` also implements branch-island support. On
  architectures like AArch64, this is needed to allow conditional
  branches within plausibly-attainable ranges (+/- 1MB on AArch64
  specifically). It also does this inline while streaming through the
  emission, without any sort of fixpoint algorithm or later moving of
  code, by simply tracking outstanding references and "deadlines" and
  emitting an island just-in-time when we're in danger of going out of
  range.

- A rework of the instruction selector driver. This is largely following
  the same algorithm as before, but is cleaned up significantly, in
  particular in the API: the machine backend can ask for an input arg
  and get any of three forms (constant, register, producing
  instruction), indicating it needs the register or can merge the
  constant or producing instruction as appropriate. This new driver
  takes special care to emit constants right at use-sites (and at phi
  inputs), minimizing their live-ranges, and also special-cases the
  "pinned register" to avoid superfluous moves.

Overall, on `bz2.wasm`, the results are:

    wasmtime full run (compile + runtime) of bz2:

    baseline:   9774M insns, 9742M cycles, 3.918s
    w/ changes: 7012M insns, 6888M cycles, 2.958s  (24.5% faster, 28.3% fewer insns)

    clif-util wasm compile bz2:

    baseline:   2633M insns, 3278M cycles, 1.034s
    w/ changes: 2366M insns, 2920M cycles, 0.923s  (10.7% faster, 10.1% fewer insns)

    All numbers are averages of two runs on an Ampere eMAG.
2020-05-16 23:08:22 -07:00
Joey Gouly
f418b7a700 Reduce arm64 Inst enum size
This reduces the size of the Inst enum from 112 bytes to 48 bytes.

Using DHAT on a regex-rs.wasm benchmark, `valgrind --tool=dhat clif-util compile --target aarch64`

The total number of allocated bytes, drops by around 170 MB.
At t-gmax drops by 3 MB.

Using `perf stat clif-util compile --target aarch64`, the instructions count dropped by 0.6%. Cache misses dropped by 6%. Cycles dropped by 2.3%.
2020-05-14 15:45:55 +01:00
Benjamin Bouvier
07c55fa50f aarch64: suggest a scratch register that's not caller-saved;
If the scratch register is caller-saved, then it might appear in fixed
ranges because of call clobbers. Instead, use a register that's not
caller-saved and has no predefined use in the ABI.
2020-05-13 10:56:32 +02:00
Chris Fallin
ee2f861fdd Merge pull request #1674 from cfallin/machinst-reg-universe-opt
MachInst backend: don't reallocate RealRegUniverses for each function compilation.
2020-05-09 14:10:26 -07:00
whitequark
4ec16fa057 Legalize 64 bit shifts on x86_32 using PSLLQ/PSRLQ.
Co-authored-by: iximeow <git@iximeow.net>
2020-05-09 03:28:19 -07:00
whitequark
2331403741 Extend X86 ABI to cover stack overflow checking on X86-32.
In stark contrast with every reasonable architecture, X86-32 does not
pass any parameters in registers. Because of that we have to resort
to reading arguments from stack without being able to use the stack
slot machinery.

(This wouldn't have been avoidable even by pinning a register because
there is a trampoline in wasmtime with the C ABI that Cranelift needs
to be able to call.)
2020-05-09 03:27:06 -07:00
Chris Fallin
17cef9140c MachInst backend: don't reallocate RealRegUniverses for each function
compilation.

This saves ~0.14% instruction count, ~0.18% allocated bytes, and ~1.5%
allocated blocks on a `clif-util wasm` compilation of `bz2.wasm` for
aarch64.
2020-05-08 15:35:16 -07:00
Benjamin Bouvier
528d3c1355 machinst: Steal the used/defs Sets when emitting a call in ABICall; 2020-05-07 12:24:02 +02:00
Benjamin Bouvier
9215b610ef machinst: Avoid a lot of short-lived allocations in ABICall; 2020-05-07 12:24:02 +02:00
Chris Fallin
a66724aafd Rework aarch64 stack frame implementation.
This PR changes the aarch64 ABI implementation to use positive offsets
from SP, rather than negative offsets from FP, to refer to spill slots
and stack-local storage. This allows for better addressing-mode options,
and hence slightly better code: e.g., the unsigned scaled 12-bit offset
mode can be used to reach anywhere in a 32KB frame without extra
address-construction instructions, whereas negative offsets are limited
to a signed 9-bit unscaled mode (-256 bytes).

To enable this, the PR introduces a notion of "nominal SP offsets" as a
virtual addressing mode, lowered during the emission pass. The offsets
are relative to "SP after adjusting downward to allocate stack/spill
slots", but before pushing clobbers. This allows the addressing-mode
expressions to be generated before register allocation (or during it,
for spill/reload sequences).

To convert these offsets into *true* offsets from SP, we need to track
how much further SP is moved downward, and compensate for this. We do so
with "virtual SP offset adjustment" pseudo-instructions: these are seen
by the emission pass, and result in no instruction (0 byte output), but
update state that is now threaded through each instruction emission in
turn. In this way, we can push e.g. stack args for a call and adjust
the virtual SP offset, allowing reloads from nominal-SP-relative
spillslots while we do the argument setup with "real SP offsets" at the
same time.
2020-05-06 09:23:55 -07:00
Chris Fallin
59039df001 Merge pull request #1570 from cfallin/fix-long-range-aarch64-call
Fix long-range (non-colocated) aarch64 calls to not use Arm64Call reloc, and fix simplejit to use new long-distance call.
2020-05-05 10:45:55 -07:00
Chris Fallin
e39b4aba1c Fix long-range (non-colocated) aarch64 calls to not use Arm64Call reloc, and fix simplejit to use it.
Previously, every call was lowered on AArch64 to a `call` instruction, which
takes a signed 26-bit PC-relative offset. Including the 2-bit left shift, this
gives a range of +/- 128 MB. Longer-distance offsets would cause an impossible
relocation record to be emitted (or rather, a record that a more sophisticated
linker would fix up by inserting a shim/veneer).

This commit adds a notion of "relocation distance" in the MachInst backends,
and provides this information for every call target and symbol reference. The
intent is that backends on architectures like AArch64, where there are different
offset sizes / addressing strategies to choose from, can either emit a regular
call or a load-64-bit-constant / call-indirect sequence, as necessary. This
avoids the need to implement complex linking behavior.

The MachInst driver code provides this information based on the "colocated" bit
in the CLIF symbol references, which appears to have been designed for this
purpose, or at least a similar one. Combined with the `use_colocated_libcalls`
setting, this allows client code to ensure that library calls can link to
library code at any location in the address space.

Separately, the `simplejit` example did not handle `Arm64Call`; rather than doing
so, it appears all that is necessary to get its tests to pass is to set the
`use_colocated_libcalls` flag to false, to make use of the above change. This
fixes the `libcall_function` unit-test in this crate.
2020-05-05 09:55:12 -07:00
Benjamin Bouvier
fa54422854 Add a work-in-progress backend for x86_64 using the new instruction selection;
Most of the work is credited to Julian Seward.

Co-authored-by: Julian Seward <jseward@acm.org>
Co-authored-by: Chris Fallin <cfallin@mozilla.com>
2020-05-05 16:35:41 +02:00
Andrew Brown
a312506262 Add x86 complex encodings for SIMD load-extend instructions 2020-04-30 11:38:01 -07:00