Commit Graph

1806 Commits

Author SHA1 Message Date
Chris Fallin
e11094b28b Fix MachBuffer branch optimization.
This patch fixes a subtle bug that occurred in the MachBuffer branch
optimization: in tracking labels at the current buffer tail using a
sorted-by-offset array, the code did not update this array properly when
redirecting labels. As a result, the dead-branch removal was unsafe,
because not every label pointing to a branch is guaranteed to be
redirected properly first.

Discovered while doing performance testing: bz2 silently took a wrong
branch and exited compression early. (Eek!)

To address this problem, this patch adopts a slightly simpler data
structure: we only track the labels *at the current buffer tail*, and
*at the start of each branch*, and we're careful to update these
appropriately to maintain the invariants. I'm pretty confident that this
is correct now, but we should (still) fuzz it a bunch, because wrong
control flow scares me a nonzero amount. I should probably also actually
write out a formal proof that these data-structure updates are correct.
The optimizations are important for performance (removing useless empty
blocks, and taking advantage of any fallthrough opportunities at all),
so I don't think we would want to drop them entirely.
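
A minimal sketch of the simplified tracking state described above, with all
names assumed for illustration (they are not necessarily the actual MachBuffer
internals):

    // Hypothetical sketch: labels are tracked only at the current buffer
    // tail and at the start of each still-optimizable branch.
    type CodeOffset = u32;
    type MachLabel = u32;

    struct BranchInfo {
        start: CodeOffset,                     // offset of the branch's first byte
        target: MachLabel,                     // label the branch jumps to
        labels_at_this_branch: Vec<MachLabel>, // labels bound exactly at `start`
    }

    struct LabelTracking {
        labels_at_tail: Vec<MachLabel>,   // labels bound at the current tail
        latest_branches: Vec<BranchInfo>, // branches still eligible for rewriting
    }

The invariant to preserve is that truncating a branch or redirecting a label
updates `labels_at_tail` and the per-branch label lists together, so no label
can silently keep pointing into removed bytes.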
2020-05-19 18:09:18 -07:00
Chris Fallin
bdd2873c8c Address review comments. 2020-05-18 16:25:26 -07:00
Chris Fallin
687aca00fe Update x64 backend to use new lowering APIs. 2020-05-18 16:25:15 -07:00
Chris Fallin
72e6be9342 Rework of MachInst isel, branch fixups and lowering, and block ordering.
This patch includes:

- A complete rework of the way that CLIF blocks and edge blocks are
  lowered into VCode blocks. The new mechanism in `BlockLoweringOrder`
  computes RPO over the CFG, but with a twist: it merges edge blocks into
  heads or tails of original CLIF blocks wherever possible, and it does
  this without ever actually materializing the full nodes-plus-edges
  graph first. The backend driver lowers blocks in final order so
  there's no need to reshuffle later.

- A new `MachBuffer` that replaces the `MachSection`. This is a special
  version of a code-sink that is far more than a humble `Vec<u8>`. In
  particular, it keeps a record of label definitions and label uses,
  with a machine-pluggable `LabelUse` trait that defines various types
  of fixups (basically internal relocations).

  Importantly, it implements some simple peephole-style branch rewrites
  *inline in the emission pass*, without any separate traversals over
  the code to use fallthroughs, swap taken/not-taken arms, etc. It
  tracks branches at the tail of the buffer and can (i) remove blocks
  that are just unconditional branches (by redirecting the label), (ii)
  understand a conditional/unconditional pair and swap the conditional
  polarity when it's helpful, and (iii) remove branches that branch to
  the fallthrough PC.

  The `MachBuffer` also implements branch-island support. On
  architectures like AArch64, this is needed to allow conditional
  branches within plausibly-attainable ranges (+/- 1MB on AArch64
  specifically). It also does this inline while streaming through the
  emission, without any sort of fixpoint algorithm or later moving of
  code, by simply tracking outstanding references and "deadlines" and
  emitting an island just-in-time when we're in danger of going out of
  range (see the sketch after this list).

- A rework of the instruction selector driver. This is largely following
  the same algorithm as before, but is cleaned up significantly, in
  particular in the API: the machine backend can ask for an input arg
  and get any of three forms (constant, register, producing
  instruction), indicating it needs the register or can merge the
  constant or producing instruction as appropriate. This new driver
  takes special care to emit constants right at use-sites (and at phi
  inputs), minimizing their live-ranges, and also special-cases the
  "pinned register" to avoid superfluous moves.

Overall, on `bz2.wasm`, the results are:

    wasmtime full run (compile + runtime) of bz2:

    baseline:   9774M insns, 9742M cycles, 3.918s
    w/ changes: 7012M insns, 6888M cycles, 2.958s  (24.5% faster, 28.3% fewer insns)

    clif-util wasm compile bz2:

    baseline:   2633M insns, 3278M cycles, 1.034s
    w/ changes: 2366M insns, 2920M cycles, 0.923s  (10.7% faster, 10.1% fewer insns)

    All numbers are averages of two runs on an Ampere eMAG.
2020-05-16 23:08:22 -07:00
Chris Fallin
df4028749e Merge pull request #1699 from jgouly/inst-size
Reduce arm64 Inst enum size
2020-05-14 13:44:46 -07:00
Nick Fitzgerald
fb7a690efc Merge pull request #1687 from fitzgen/sign-extend-immediates
cranelift: Sign extend `Imm64` immediates
2020-05-14 10:09:53 -07:00
Nick Fitzgerald
c093dee79e cranelift: Let lifetime elision elide lifetimes 2020-05-14 07:52:23 -07:00
Nick Fitzgerald
22a070ed4f peepmatic: Apply some review suggestions from @bjorn3 2020-05-14 07:52:23 -07:00
Nick Fitzgerald
52c6ece5f3 peepmatic: Make peepmatic optional to enable
Rather than outright replacing parts of our existing peephole optimization
passes, this makes peepmatic an optional cargo feature. This lets us take a
conservative approach to enabling peepmatic everywhere, while still getting it
in-tree and making it easier to collaborate on improving it quickly.
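
A minimal sketch of the feature gating, assuming a hypothetical feature name
and pass entry points (not the actual crate wiring):

    // Hypothetical: dispatch to the peepmatic-generated matcher only when
    // the optional cargo feature is enabled; otherwise keep the existing
    // hand-written pass. All names here are illustrative.
    struct OptContext; // stand-in for the real cursor/instruction state

    #[cfg(feature = "enable-peepmatic")]
    fn simple_preopt_pass(ctx: &mut OptContext) {
        peepmatic_pass(ctx) // matcher generated from the DSL
    }

    #[cfg(feature = "enable-peepmatic")]
    fn peepmatic_pass(_ctx: &mut OptContext) { /* generated code */ }

    #[cfg(not(feature = "enable-peepmatic"))]
    fn simple_preopt_pass(ctx: &mut OptContext) {
        handwritten_peephole_pass(ctx) // existing simple_preopt logic
    }

    #[cfg(not(feature = "enable-peepmatic"))]
    fn handwritten_peephole_pass(_ctx: &mut OptContext) { /* existing pass */ }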
2020-05-14 07:52:23 -07:00
Nick Fitzgerald
6e135b3aea peepmatic: Fix a failed assertion due to extra iterations after fixed point
After replacing an instruction with an alias to an earlier value, trying to
further optimize that value is unnecessary, since we've already processed it;
it was also triggering an assertion.
2020-05-14 07:52:23 -07:00
Nick Fitzgerald
210b036320 peepmatic: Represent various id types with u16
These ids end up in the automaton, so making them smaller should give us better
data cache locality and also smaller serialized sizes.
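
For illustration, the kind of change involved, with an assumed type name (not
necessarily the actual peepmatic ids):

    // Hypothetical: a newtype id backed by u16 instead of a wider integer,
    // so automaton nodes pack tighter in memory and when serialized.
    #[derive(Clone, Copy, PartialEq, Eq, Hash)]
    struct MatchOpId(u16);

    impl MatchOpId {
        fn new(raw: usize) -> Self {
            assert!(raw <= u16::MAX as usize, "too many ids for a u16");
            MatchOpId(raw as u16)
        }
    }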
2020-05-14 07:52:23 -07:00
Nick Fitzgerald
469104c4d3 peepmatic: Make the results of match operations smaller and more cache friendly 2020-05-14 07:52:23 -07:00
Nick Fitzgerald
9a1f8038b7 peepmatic: Do not transplant instructions whose results are potentially used elsewhere 2020-05-14 07:52:23 -07:00
Nick Fitzgerald
090d1c2d32 cranelift: Port most of simple_preopt.rs over to the peepmatic DSL
This ports all of the identity, no-op, simplification, and
canonicalization-related optimizations over from hand-coded Rust to the
`peepmatic` DSL. This
does not handle the branch-to-branch optimizations or most of the
divide-by-constant optimizations.
2020-05-14 07:52:23 -07:00
Joey Gouly
f418b7a700 Reduce arm64 Inst enum size
This reduces the size of the Inst enum from 112 bytes to 48 bytes.

Measured with DHAT on a regex-rs.wasm benchmark (`valgrind --tool=dhat clif-util compile --target aarch64`):

The total number of allocated bytes drops by around 170 MB, and bytes at
t-gmax drop by 3 MB.

Using `perf stat clif-util compile --target aarch64`, the instruction count dropped by 0.6%, cache misses dropped by 6%, and cycles dropped by 2.3%.
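
The usual technique for this kind of enum shrinking is to box the payloads of
rare, large variants; a hedged sketch with illustrative variant names (not the
actual arm64 `Inst` definition):

    // Hypothetical: an enum is as big as its largest variant, so moving an
    // oversized payload behind a Box shrinks every Inst value.
    enum Inst {
        AluRRR { rd: u8, rn: u8, rm: u8 }, // small, common variant stays inline
        LoadExtName(Box<LoadExtNameInfo>), // rare, large payload is boxed
    }

    struct LoadExtNameInfo {
        name: String,
        offset: i64,
    }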
2020-05-14 15:45:55 +01:00
Dan Gohman
fb0b9e3ae6 Change proc_exit to unwind the stack rather than exiting the host process. (#1646)
* Remove Cranelift's OutOfBounds trap, which is no longer used.

* Change proc_exit to unwind instead of exiting the host process.

This implements the semantics in https://github.com/WebAssembly/WASI/pull/235.

Fixes #783.
Fixes #993.

* Fix exit-status tests on Windows.

* Revert the wiggle changes and re-introduce the wasi-common implementations.

* Move `wasi_proc_exit` into the wasmtime-wasi crate.

* Revert the spec_testsuite change.

* Remove the old proc_exit implementations.

* Make `TrapReason` an implementation detail.

* Allow exit status 2 on Windows too.

* Fix a documentation link.

* Really fix a documentation link.
2020-05-13 15:59:43 -07:00
Benjamin Bouvier
5987cf5cda machinst: add a linear-scan checked variant too; 2020-05-13 10:56:32 +02:00
Benjamin Bouvier
07c55fa50f aarch64: suggest a scratch register that's not caller-saved;
If the scratch register is caller-saved, then it might appear in fixed
ranges because of call clobbers. Instead, use a register that's not
caller-saved and has no predefined use in the ABI.
2020-05-13 10:56:32 +02:00
Nick Fitzgerald
9b867b09c7 cranelift: Sign extend Imm64 immediates
When an instruction has an `Imm64` immediate, but operates on values of a
narrower width, we need to sign extend the value.

Fixes #1095
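
A minimal sketch of the extension rule, assuming `bits` is the width of the
controlling type (this is not the actual Cranelift helper):

    // Hypothetical: reinterpret the low `bits` of an Imm64 as signed by
    // shifting the sign bit to the top and arithmetically shifting back.
    fn sign_extend_from(imm: i64, bits: u32) -> i64 {
        debug_assert!((1..=64).contains(&bits));
        let shift = 64 - bits;
        (imm << shift) >> shift // arithmetic shift replicates the sign bit
    }

For example, `sign_extend_from(0xFF, 8)` yields `-1` rather than `255`.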
2020-05-12 15:44:48 -07:00
Chris Fallin
ee2f861fdd Merge pull request #1674 from cfallin/machinst-reg-universe-opt
MachInst backend: don't reallocate RealRegUniverses for each function compilation.
2020-05-09 14:10:26 -07:00
whitequark
4ec16fa057 Legalize 64 bit shifts on x86_32 using PSLLQ/PSRLQ.
Co-authored-by: iximeow <git@iximeow.net>
2020-05-09 03:28:19 -07:00
whitequark
2331403741 Extend X86 ABI to cover stack overflow checking on X86-32.
In stark contrast with every reasonable architecture, X86-32 does not
pass any parameters in registers. Because of that we have to resort
to reading arguments from the stack without being able to use the stack
slot machinery.

(This wouldn't have been avoidable even by pinning a register because
there is a trampoline in wasmtime with the C ABI that Cranelift needs
to be able to call.)
2020-05-09 03:27:06 -07:00
Chris Fallin
17cef9140c MachInst backend: don't reallocate RealRegUniverses for each function
compilation.

This saves ~0.14% instruction count, ~0.18% allocated bytes, and ~1.5%
allocated blocks on a `clif-util wasm` compilation of `bz2.wasm` for
aarch64.
2020-05-08 15:35:16 -07:00
Julian Seward
0bc0503f3f Add a transformation pass which removes phi nodes for which it can demonstrate
that only one value ever flows. It has been observed to improve generated-code
run times by up to 8%. Compilation cost increases by about 0.6%, but savings of
up to 7% in total cost have been observed; in other words, it can be a
significant overall win in terms of compilation time.
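
A self-contained toy of the underlying check, using a simplified representation
rather than Cranelift's actual IR types:

    // Hypothetical: a block parameter (phi) fed the same value along every
    // incoming edge can be aliased to that value and removed.
    #[derive(Clone, Copy, PartialEq, Eq, Debug)]
    struct Value(u32);

    /// `incoming` holds one value per predecessor edge of the parameter;
    /// returns the unique value if only one ever flows.
    fn single_flowing_value(incoming: &[Value]) -> Option<Value> {
        let (first, rest) = incoming.split_first()?;
        if rest.iter().all(|v| v == first) {
            Some(*first) // phi is removable
        } else {
            None
        }
    }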
2020-05-08 09:41:16 +02:00
Benjamin Bouvier
528d3c1355 machinst: Steal the used/defs Sets when emitting a call in ABICall; 2020-05-07 12:24:02 +02:00
Benjamin Bouvier
19d8a7f1fb machinst: Reuse memory across loop iterations in lowering; 2020-05-07 12:24:02 +02:00
Benjamin Bouvier
b24b711c16 machinst: Reduce the number of vec allocations for edge blocks; 2020-05-07 12:24:02 +02:00
Benjamin Bouvier
9215b610ef machinst: Avoid a lot of short-lived allocations in ABICall; 2020-05-07 12:24:02 +02:00
Benjamin Bouvier
4f919c6460 machinst: bump regalloc to 0.0.23 and return a slice of the successor indexes in block_succs; 2020-05-07 12:24:02 +02:00
Julian Seward
48521393ae Update to regalloc.rs version 0.22. 2020-05-06 20:16:31 +02:00
Chris Fallin
6d73fdb70a Merge pull request #1607 from cfallin/aarch64-stack-frame
Rework aarch64 stack frame implementation to use positive offsets.
2020-05-06 10:29:30 -07:00
Chris Fallin
a66724aafd Rework aarch64 stack frame implementation.
This PR changes the aarch64 ABI implementation to use positive offsets
from SP, rather than negative offsets from FP, to refer to spill slots
and stack-local storage. This allows for better addressing-mode options,
and hence slightly better code: e.g., the unsigned scaled 12-bit offset
mode can be used to reach anywhere in a 32KB frame without extra
address-construction instructions, whereas negative offsets are limited
to a signed 9-bit unscaled mode (-256 bytes).

To enable this, the PR introduces a notion of "nominal SP offsets" as a
virtual addressing mode, lowered during the emission pass. The offsets
are relative to "SP after adjusting downward to allocate stack/spill
slots", but before pushing clobbers. This allows the addressing-mode
expressions to be generated before register allocation (or during it,
for spill/reload sequences).

To convert these offsets into *true* offsets from SP, we need to track
how much further SP is moved downward, and compensate for this. We do so
with "virtual SP offset adjustment" pseudo-instructions: these are seen
by the emission pass, and result in no instruction (0 byte output), but
update state that is now threaded through each instruction emission in
turn. In this way, we can push e.g. stack args for a call and adjust
the virtual SP offset, allowing reloads from nominal-SP-relative
spillslots while we do the argument setup with "real SP offsets" at the
same time.
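
A hedged sketch of the bookkeeping (names are illustrative, not the actual
emission-state types):

    // Hypothetical: emission threads a running "virtual SP offset"; a
    // nominal-SP-relative address becomes a true SP offset by adding
    // however far SP has since moved below the nominal point.
    struct EmitState {
        virtual_sp_offset: i64, // bytes SP currently sits below nominal SP
    }

    fn true_sp_offset(nominal_off: i64, state: &EmitState) -> i64 {
        nominal_off + state.virtual_sp_offset
    }

    // A virtual-SP-adjustment pseudo-instruction emits zero bytes; it only
    // bumps the running offset (e.g. around stack-argument setup for a call).
    fn emit_virtual_sp_adj(amount: i64, state: &mut EmitState) {
        state.virtual_sp_offset += amount;
    }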
2020-05-06 09:23:55 -07:00
Benjamin Bouvier
1d90751ba9 machinst: Avoid a full instruction traversal of all the blocks when computing the final block ordering; 2020-05-06 15:13:25 +02:00
Chris Fallin
59039df001 Merge pull request #1570 from cfallin/fix-long-range-aarch64-call
Fix long-range (non-colocated) aarch64 calls to not use Arm64Call reloc, and fix simplejit to use new long-distance call.
2020-05-05 10:45:55 -07:00
Chris Fallin
e39b4aba1c Fix long-range (non-colocated) aarch64 calls to not use Arm64Call reloc, and fix simplejit to use it.
Previously, every call was lowered on AArch64 to a `call` instruction, which
takes a signed 26-bit PC-relative offset. Including the 2-bit left shift, this
gives a range of +/- 128 MB. Longer-distance offsets would cause an impossible
relocation record to be emitted (or rather, a record that a more sophisticated
linker would fix up by inserting a shim/veneer).

This commit adds a notion of "relocation distance" in the MachInst backends,
and provides this information for every call target and symbol reference. The
intent is that backends on architectures like AArch64, where there are different
offset sizes / addressing strategies to choose from, can either emit a regular
call or a load-64-bit-constant / call-indirect sequence, as necessary. This
avoids the need to implement complex linking behavior.

The MachInst driver code provides this information based on the "colocated" bit
in the CLIF symbol references, which appears to have been designed for this
purpose, or at least a similar one. Combined with the `use_colocated_libcalls`
setting, this allows client code to ensure that library calls can link to
library code at any location in the address space.

Separately, the `simplejit` example did not handle `Arm64Call`; rather than doing
so, it appears all that is necessary to get its tests to pass is to set the
`use_colocated_libcalls` flag to false, to make use of the above change. This
fixes the `libcall_function` unit-test in this crate.
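
A minimal sketch of the resulting choice of call sequence ("relocation
distance" is the commit's concept; the other names are assumptions):

    // Hypothetical: near (colocated) targets fit a single 26-bit
    // PC-relative call (+/- 128 MB on AArch64); far targets get a
    // load-64-bit-constant + indirect-call sequence with no range limit.
    enum RelocDistance {
        Near,
        Far,
    }

    enum CallSeq {
        Direct,               // one `bl` with an Arm64Call-style relocation
        LoadConstAndIndirect, // materialize the absolute address, then `blr`
    }

    fn choose_call_seq(dist: RelocDistance) -> CallSeq {
        match dist {
            RelocDistance::Near => CallSeq::Direct,
            RelocDistance::Far => CallSeq::LoadConstAndIndirect,
        }
    }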
2020-05-05 09:55:12 -07:00
Benjamin Bouvier
fa54422854 Add a work-in-progress backend for x86_64 using the new instruction selection;
Most of the work is credited to Julian Seward.

Co-authored-by: Julian Seward <jseward@acm.org>
Co-authored-by: Chris Fallin <cfallin@mozilla.com>
2020-05-05 16:35:41 +02:00
Chris Fallin
8393412c40 Merge pull request #1632 from cfallin/aarch64-fix-srclocs
MachInst backend: attach SourceLoc span information to all ranges.
2020-04-30 16:13:55 -07:00
Chris Fallin
964c6087bd MachInst backend: attach SourceLoc span information to all ranges.
Previously, the SourceLoc information transferred in `VCode` only
included PC-spans for non-default SourceLocs. I realized that the
invariant we're supposed to keep here is that every PC is covered; if there is
no source information, just use `SourceLoc::default()`.

This was spurred by @bjorn3's comment in #1575 (thanks!).
2020-04-30 15:40:55 -07:00
Andrew Brown
49622bde58 Use complex load-extend instructions in optimize_complex_addresses; fixes #1186 2020-04-30 11:38:01 -07:00
Andrew Brown
a312506262 Add x86 complex encodings for SIMD load-extend instructions 2020-04-30 11:38:01 -07:00
Yury Delendik
1873c0ae46 Fix value label ranges resolution (#1572)
There was a bug in how value labels were resolved, which caused some DWARF expressions not to be transformed, e.g. those residing in registers.

*    Implements FIXME in expression.rs
*    Move TargetIsa from CompiledExpression structure
*    Fix expression format for GDB
*    Add tests for parsing
*    Proper logic in ValueLabelRangesBuilder::process_label
*    Tests for ValueLabelRangesBuilder
*    Refactor build_with_locals to return Iterator instead of Vec<_>
*    Misc comments and magical numbers
2020-04-30 08:07:55 -05:00
Benjamin Bouvier
b7cfd39b53 aarch64: split emit tests into their own file;
This is done to satisfy a check on the maximum file size when vendoring
Rust source code into Mozilla central's repository.
2020-04-30 13:50:45 +02:00
Benjamin Bouvier
4c066b1c73 codegen: split lower.rs into multiple files;
This splits lower.rs into two files: lower.rs keeps all the utility
functions, while lower_inst.rs contains the (gigantic!) function that
lowers a single Cranelift instruction into vcode.

This is done to satisfy a check on the maximum file size when vendoring
Rust source code into Mozilla central's repository.
2020-04-30 13:50:45 +02:00
Alex Crichton
363cd2d20f Expose memory-related options in Config (#1513)
* Expose memory-related options in `Config`

This commit was initially motivated by looking more into #1501, but it
ended up ballooning a bit after finding a few issues. The high-level
items in this commit are:

* New configuration options via `wasmtime::Config` are exposed to
  configure the tunable limits of how memories are allocated and such.
* The `MemoryCreator` trait has been updated to accurately reflect the
  required allocation characteristics that JIT code expects.
* A bug has been fixed in the cranelift wasm code generation where bounds
  checks weren't performed accurately if no guard page was present.

The new `Config` methods allow tuning the memory allocation
characteristics of wasmtime. Currently 64-bit platforms will reserve 6GB
chunks of memory for each linear memory, but by tweaking various config
options you can change how this is allocated, perhaps at the cost of
slower JIT code since it needs more bounds checks. The methods are
intended to be pretty thoroughly documented as to the effect they have
on the JIT code and what values you may wish to select. These new
methods have been added to the spectest fuzzer to ensure that various
configuration values for these methods don't affect correctness.

The `MemoryCreator` trait previously only allocated memories with a
`MemoryType`, but this didn't actually reflect the guarantees that JIT
code expected. JIT code is generated with an assumption about the
minimum size of the guard region, as well as whether memory is static or
dynamic (whether the base pointer can be relocated). These properties
must be upheld by custom allocation engines for JIT code to perform
correctly, so extra parameters have been added to
`MemoryCreator::new_memory` to reflect this.

Finally the fuzzing with `Config` turned up an issue where, if no guard
pages were present, the wasm code wouldn't correctly bounds-check memory
accesses. The issue here was that with a guard page we only need to
bounds-check the first byte of access, but without a guard page we need
to bounds-check the last byte of access. This meant that the code
generation needed to account for the size of the memory operation
(load/store) and use this as the offset-to-check in the no-guard-page
scenario. I've attempted to make the various comments in cranelift a bit
more exhaustive too to hopefully make it a bit clearer for future
readers!
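
A minimal sketch of that bounds-check rule in plain arithmetic (not the
generated CLIF itself):

    // Hypothetical: with a guard region, checking the first byte suffices
    // because the guard catches the tail of the access; without one, the
    // last byte accessed must be checked against the memory size.
    fn access_in_bounds(addr: u64, access_size: u64, mem_size: u64, has_guard: bool) -> bool {
        if has_guard {
            addr < mem_size
        } else {
            addr.checked_add(access_size)
                .map_or(false, |end| end <= mem_size)
        }
    }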

Closes #1501

* Review comments

* Update a comment
2020-04-29 17:10:00 -07:00
Chris Fallin
be6f060abf Use new regalloc.rs version with dense vreg->rreg maps.
This PR updates Cranelift to use the new version of regalloc.rs
(bytecodealliance/regalloc.rs#55) that provides dense vreg->rreg maps to
the `map_reg()` function for each instruction, rather than the earlier
hashmap-based approach.

In one test (regex-rs.wasm), this PR results in a 15% reduction in
memory allocations (1245MB -> 1060MB) as measured by DHAT on `clif-util
wasm` runs.
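
For illustration, the shape of a dense map (types are assumptions, not the
regalloc.rs API):

    // Hypothetical: virtual registers are small integers, so a Vec indexed
    // by vreg number replaces a HashMap, avoiding per-lookup hashing and
    // per-entry heap overhead.
    struct VRegMap {
        rregs: Vec<Option<u8>>, // indexed by vreg number; None = unmapped
    }

    impl VRegMap {
        fn lookup(&self, vreg: usize) -> Option<u8> {
            self.rregs.get(vreg).copied().flatten()
        }
    }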
2020-04-29 10:42:25 -07:00
Benjamin Bouvier
767bcaab29 aarch64: redefine is_move now that regalloc.rs bug has been fixed; 2020-04-29 13:38:30 +02:00
Benjamin Bouvier
698dc9c401 Fixes #1619: Properly bubble up errors when seeing an unexpected type during lowering. 2020-04-29 10:26:22 +02:00
Gabor Greif
d9d69299bb A few typofixes (#1623)
* a few typofixes

* more tyops
2020-04-28 19:18:05 -05:00
Chris Fallin
b691770faa MachInst backend: pass through SourceLoc information.
This change adds SourceLoc information per instruction in a `VCode<Inst>`
container, and keeps this information up-to-date across register allocation
and branch reordering. The information is initially collected during
instruction lowering, eventually collected on the MachSection, and finally
provided to the environment that wraps the codegen crate for wasmtime.
2020-04-24 13:18:01 -07:00
Alex Crichton
74eda8090c Implement stack limit checks for AArch64 (#1573)
This commit implements the stack limit checks in cranelift for the
AArch64 backend. This gets the `stack_limit` argument purpose as well as
a function's global `stack_limit` directive working for the AArch64
backend. I've tested this locally on some hardware and in an emulator
and it looks to be working for basic tests, but I've never really done
AArch64 before so some scrutiny on the instructions would be most
welcome!
2020-04-24 15:01:57 -05:00