Commit Graph

34 Commits

Author SHA1 Message Date
bjorn3
a48a60f958 Remove reloc_external from CodeSink
And introduce MachBufferFinalized::relocs() in the place.
2022-01-11 16:54:27 +01:00
bjorn3
63e2360346 Remove trap from CodeSink
And introduce MachBufferFinalized::traps() in the place.
2022-01-11 16:42:52 +01:00
bjorn3
38aaa6e1da Remove add_call_site from CodeSink and RelocSink
And introduce MachBufferFinalized::call_sites() in the place.
2022-01-11 16:32:57 +01:00
Alex Crichton
1141169ff8 aarch64: Initial work to transition backend to ISLE (#3541)
* aarch64: Initial work to transition backend to ISLE

This commit is what is hoped to be the initial commit towards migrating
the aarch64 backend to ISLE. There's seemingly a lot of changes here but
it's intended to largely be code motion. The current thinking is to
closely follow the x64 backend for how all this is handled and
organized.

Major changes in this PR are:

* The `Inst` enum is now defined in ISLE. This avoids having to define
  it in two places (once in Rust and once in ISLE). I've preserved all
  the comments in the ISLE and otherwise this isn't actually a
  functional change from the Rust perspective, it's still the same enum
  according to Rust.

* Lots of little enums and things were moved to ISLE as well. As with
  `Inst` their definitions didn't change, only where they're defined.
  This will give future ISLE PRs access to all these operations.

* Initial code for lowering `iconst`, `null`, and `bconst` are
  implemented. Ironically none of this is actually used right now
  because constant lowering is handled in `put_input_in_regs` which
  specially handles constants. Nonetheless I wanted to get at least
  something simple working which shows off how to special case various
  things that are specific to AArch64. In a future PR I plan to hook up
  const-lowering in ISLE to this path so even though
  `iconst`-the-clif-instruction is never lowered this should use the
  const lowering defined in ISLE rather than elsewhere in the backend
  (eventually leading to the deletion of the non-ISLE lowering).

* The `IsleContext` skeleton is created and set up for future additions.

* Some code for ISLE that's shared across all backends now lives in
  `isle_prelude_methods!()` and is deduplicated between the AArch64
  backend and the x64 backend.

* Register mapping is tweaked to do the same thing for AArch64 that it
  does for x64. Namely mapping virtual registers is supported instead of
  just virtual to machine registers.

My main goal with this PR was to get AArch64 into a place where new
instructions can be added with relative ease. Additionally I'm hoping to
figure out as part of this change how much to share for ISLE between
AArch64 and x64 (and other backends).

* Don't use priorities with rules

* Update .gitattributes with concise syntax

* Deduplicate some type definitions

* Rebuild ISLE

* Move isa::isle to machinst::isle
2021-11-18 10:38:16 -06:00
Nick Fitzgerald
d2d0a0f36b Remove Peepmatic!!!
Peepmatic was an early attempt at a DSL for peephole optimizations, with the
idea that maybe sometime in the future we could user it for instruction
selection as well. It didn't really pan out, however:

* Peepmatic wasn't quite flexible enough, and adding new operators or snippets
  of code implemented externally in Rust was a bit of a pain.

* The performance was never competitive with the hand-written peephole
  optimizers. It was *very* size efficient, but that came at the cost of
  run-time efficiency. Everything was table-based and interpreted, rather than
  generating any Rust code.

Ultimately, because of these reasons, we never turned Peepmatic on by default.

These days, we just landed the ISLE domain-specific language, and it is better
suited than Peepmatic for all the things that Peepmatic was originally designed
to do. It is more flexible and easy to integrate with external Rust code. It is
has better time efficiency, meeting or even beating hand-written code. I think a
small part of the reason why ISLE excels in these things is because its design
was informed by Peepmatic's failures. I still plan on continuing Peepmatic's
mission to make Cranelift's peephole optimizer passes generated from DSL rewrite
rules, but using ISLE instead of Peepmatic.

Thank you Peepmatic, rest in peace!
2021-11-17 13:04:17 -08:00
Chris Fallin
5c2a629871 Merge pull request #2455 from Hywan/feat-cranelift-codegen-re-export-gimli
feat(cranelift-codegen) Re-export `gimli` when `unwind` feature is enabled
2021-10-11 13:09:16 -07:00
bjorn3
54293a5929 Remove predicates module
It is dead code now
2021-10-10 15:25:29 +02:00
Benjamin Bouvier
bae4ec6427 Remove ancient register allocation (#3401) 2021-09-30 21:27:23 +02:00
Alex Crichton
1532516a36 Use relative call instructions between wasm functions (#3275)
* Use relative `call` instructions between wasm functions

This commit is a relatively major change to the way that Wasmtime
generates code for Wasm modules and how functions call each other.
Prior to this commit all function calls between functions, even if they
were defined in the same module, were done indirectly through a
register. To implement this the backend would emit an absolute 8-byte
relocation near all function calls, load that address into a register,
and then call it. While this technique is simple to implement and easy
to get right, it has two primary downsides associated with it:

* Function calls are always indirect which means they are more difficult
  to predict, resulting in worse performance.

* Generating a relocation-per-function call requires expensive
  relocation resolution at module-load time, which can be a large
  contributing factor to how long it takes to load a precompiled module.

To fix these issues, while also somewhat compromising on the previously
simple implementation technique, this commit switches wasm calls within
a module to using the `colocated` flag enabled in Cranelift-speak, which
basically means that a relative call instruction is used with a
relocation that's resolved relative to the pc of the call instruction
itself.

When switching the `colocated` flag to `true` this commit is also then
able to move much of the relocation resolution from `wasmtime_jit::link`
into `wasmtime_cranelift::obj` during object-construction time. This
frontloads all relocation work which means that there's actually no
relocations related to function calls in the final image, solving both
of our points above.

The main gotcha in implementing this technique is that there are
hardware limitations to relative function calls which mean we can't
simply blindly use them. AArch64, for example, can only go +/- 64 MB
from the `bl` instruction to the target, which means that if the
function we're calling is a greater distance away then we would fail to
resolve that relocation. On x86_64 the limits are +/- 2GB which are much
larger, but theoretically still feasible to hit. Consequently the main
increase in implementation complexity is fixing this issue.

This issue is actually already present in Cranelift itself, and is
internally one of the invariants handled by the `MachBuffer` type. When
generating a function relative jumps between basic blocks have similar
restrictions. This commit adds new methods for the `MachBackend` trait
and updates the implementation of `MachBuffer` to account for all these
new branches. Specifically the changes to `MachBuffer` are:

* For AAarch64 the `LabelUse::Branch26` value now supports veneers, and
  AArch64 calls use this to resolve relocations.

* The `emit_island` function has been rewritten internally to handle
  some cases which previously didn't come up before, such as:

  * When emitting an island the deadline is now recalculated, where
    previously it was always set to infinitely in the future. This was ok
    prior since only a `Branch19` supported veneers and once it was
    promoted no veneers were supported, so without multiple layers of
    promotion the lack of a new deadline was ok.

  * When emitting an island all pending fixups had veneers forced if
    their branch target wasn't known yet. This was generally ok for
    19-bit fixups since the only kind getting a veneer was a 19-bit
    fixup, but with mixed kinds it's a bit odd to force veneers for a
    26-bit fixup just because a nearby 19-bit fixup needed a veneer.
    Instead fixups are now re-enqueued unless they're known to be
    out-of-bounds. This may run the risk of generating more islands for
    19-bit branches but it should also reduce the number of islands for
    between-function calls.

  * Otherwise the internal logic was tweaked to ideally be a bit more
    simple, but that's a pretty subjective criteria in compilers...

I've added some simple testing of this for now. A synthetic compiler
option was create to simply add padded 0s between functions and test
cases implement various forms of calls that at least need veneers. A
test is also included for x86_64, but it is unfortunately pretty slow
because it requires generating 2GB of output. I'm hoping for now it's
not too bad, but we can disable the test if it's prohibitive and
otherwise just comment the necessary portions to be sure to run the
ignored test if these parts of the code have changed.

The final end-result of this commit is that for a large module I'm
working with the number of relocations dropped to zero, meaning that
nothing actually needs to be done to the text section when it's loaded
into memory (yay!). I haven't run final benchmarks yet but this is the
last remaining source of significant slowdown when loading modules,
after I land a number of other PRs both active and ones that I only have
locally for now.

* Fix arm32

* Review comments
2021-09-01 13:27:38 -05:00
Benjamin Bouvier
91c65d739f Remove unused code in machinst 2021-07-02 18:09:33 +02:00
Benjamin Bouvier
50aa645769 cranelift: use a deferred display wrapper for logging the vcode's IR 2021-04-16 10:27:19 +02:00
bjorn3
0693b7dade Include git rev in the version number 2021-02-18 13:01:01 +01:00
Ivan Enderlin
2ab83eec7a feat(cranelift-codegen) Re-export gimli when unwind feature is enabled.
When one wants to manipulate the unwind information, the exact version
of `gimli` must be used both by the user and `cranelift-codegen`. It
makes the update procedure less obvious.

This patch proposes to re-export `gimli` when the `unwind` feature is
turned on.
2020-11-26 17:26:55 +01:00
Andrew Brown
c9e8889d47 Update clippy annotation to use latest version (#2375) 2020-11-09 09:24:59 -06:00
Andrew Brown
6f6f79ef2b refactor: move DataValue from cranelift-reader to cranelift-codegen
This is no change to functionality; the move is necessary in order to return InstructionData immediates in a structure way (see next commit).
2020-10-07 12:17:17 -07:00
Nick Fitzgerald
3a6dd832c0 Harvest left-hand side superoptimization candidates.
Given a clif function, harvest all its integer subexpressions, so that they can
be fed into [Souper](https://github.com/google/souper) as candidates for
superoptimization. For some of these candidates, Souper will successfully
synthesize a right-hand side that is equivalent but has lower cost than the
left-hand side. Then, we can combine these left- and right-hand sides into a
complete optimization, and add it to our peephole passes.

To harvest the expression that produced a given value `x`, we do a post-order
traversal of the dataflow graph starting from `x`. As we do this traversal, we
maintain a map from clif values to their translated Souper values. We stop
traversing when we reach anything that can't be translated into Souper IR: a
memory load, a float-to-int conversion, a block parameter, etc. For values
produced by these instructions, we create a Souper `var`, which is an input
variable to the optimization. For instructions that have a direct mapping into
Souper IR, we get the Souper version of each of its operands and then create the
Souper version of the instruction itself. It should now be clear why we do a
post-order traversal: we need an instruction's translated operands in order to
translate the instruction itself. Once this instruction is translated, we update
the clif-to-souper map with this new translation so that any other instruction
that uses this result as an operand has access to the translated value. When the
traversal is complete we return the translation of `x` as the root of left-hand
side candidate.
2020-09-14 16:27:47 -07:00
Johnnie Birch
48f0b10c7a Add initial scalar FP operations (addss, subss, etc) to x64 backend.
Adds support for addss and subss. This is the first lowering for
sse floating point alu and some move operations. The changes here do
some renaming of data structures and adds a couple of new ones
to support sse specific operations. The work done here will likely
evolve as needed to support an efficient, inituative, and consistent
framework.
2020-06-10 18:36:57 +02:00
Chris Fallin
72e6be9342 Rework of MachInst isel, branch fixups and lowering, and block ordering.
This patch includes:

- A complete rework of the way that CLIF blocks and edge blocks are
  lowered into VCode blocks. The new mechanism in `BlockLoweringOrder`
  computes RPO over the CFG, but with a twist: it merges edge blocks intto
  heads or tails of original CLIF blocks wherever possible, and it does
  this without ever actually materializing the full nodes-plus-edges
  graph first. The backend driver lowers blocks in final order so
  there's no need to reshuffle later.

- A new `MachBuffer` that replaces the `MachSection`. This is a special
  version of a code-sink that is far more than a humble `Vec<u8>`. In
  particular, it keeps a record of label definitions and label uses,
  with a machine-pluggable `LabelUse` trait that defines various types
  of fixups (basically internal relocations).

  Importantly, it implements some simple peephole-style branch rewrites
  *inline in the emission pass*, without any separate traversals over
  the code to use fallthroughs, swap taken/not-taken arms, etc. It
  tracks branches at the tail of the buffer and can (i) remove blocks
  that are just unconditional branches (by redirecting the label), (ii)
  understand a conditional/unconditional pair and swap the conditional
  polarity when it's helpful; and (iii) remove branches that branch to
  the fallthrough PC.

  The `MachBuffer` also implements branch-island support. On
  architectures like AArch64, this is needed to allow conditional
  branches within plausibly-attainable ranges (+/- 1MB on AArch64
  specifically). It also does this inline while streaming through the
  emission, without any sort of fixpoint algorithm or later moving of
  code, by simply tracking outstanding references and "deadlines" and
  emitting an island just-in-time when we're in danger of going out of
  range.

- A rework of the instruction selector driver. This is largely following
  the same algorithm as before, but is cleaned up significantly, in
  particular in the API: the machine backend can ask for an input arg
  and get any of three forms (constant, register, producing
  instruction), indicating it needs the register or can merge the
  constant or producing instruction as appropriate. This new driver
  takes special care to emit constants right at use-sites (and at phi
  inputs), minimizing their live-ranges, and also special-cases the
  "pinned register" to avoid superfluous moves.

Overall, on `bz2.wasm`, the results are:

    wasmtime full run (compile + runtime) of bz2:

    baseline:   9774M insns, 9742M cycles, 3.918s
    w/ changes: 7012M insns, 6888M cycles, 2.958s  (24.5% faster, 28.3% fewer insns)

    clif-util wasm compile bz2:

    baseline:   2633M insns, 3278M cycles, 1.034s
    w/ changes: 2366M insns, 2920M cycles, 0.923s  (10.7% faster, 10.1% fewer insns)

    All numbers are averages of two runs on an Ampere eMAG.
2020-05-16 23:08:22 -07:00
Nick Fitzgerald
52c6ece5f3 peepmatic: Make peepmatic optional to enable
Rather than outright replacing parts of our existing peephole optimizations
passes, this makes peepmatic an optional cargo feature that can be enabled. This
allows us to take a conservative approach with enabling peepmatic everywhere,
while also allowing us to get it in-tree and make it easier to collaborate on
improving it quickly.
2020-05-14 07:52:23 -07:00
Nick Fitzgerald
090d1c2d32 cranelift: Port most of simple_preopt.rs over to the peepmatic DSL
This ports all of the identity, no-op, simplification, and canonicalization
related optimizations over from being hand-coded to the `peepmatic` DSL. This
does not handle the branch-to-branch optimizations or most of the
divide-by-constant optimizations.
2020-05-14 07:52:23 -07:00
Julian Seward
0bc0503f3f Add a transformation pass which removes phi nodes to which it can demonstrate
that only one value ever flows.  Has been observed to improve generated code
run times by up to 8%.  Compilation cost increases by about 0.6%, but up to 7%
total cost has been observed to be saved; iow it can be a significant win in
terms of compilation time, overall.
2020-05-08 09:41:16 +02:00
Alex Crichton
4c82da440a Move most wasmtime tests into one test suite (#1544)
* Move most wasmtime tests into one test suite

This commit moves most wasmtime tests into a single test suite which
gets compiled into one executable instead of having lots of test
executables. The goal here is to reduce disk space on CI, and this
should be achieved by having fewer executables which means fewer copies
of `libwasmtime.rlib` linked across binaries on the system. More
importantly though this means that DWARF debug information should only
be in one executable rather than duplicated across many.

* Share more build caches

Globally set `RUSTFLAGS` to `-Dwarnings` instead of individually so all
build steps share the same value.

* Allow some dead code in cranelift-codegen

Prevents having to fix all warnings for all possible feature
combinations, only the main ones which come up.

* Update some debug file paths
2020-04-17 17:22:12 -05:00
Chris Fallin
48cf2c2f50 Address review comments:
- Undo temporary changes to default features (`all-arch`) and a
  signal-handler test.
- Remove `SIGTRAP` handler: no longer needed now that we've found an
  "undefined opcode" option on ARM64.
- Rename pp.rs to pretty_print.rs in machinst/.
- Only use empty stack-probe on non-x86. As per a comment in
  rust-lang/compiler-builtins [1], LLVM only supports stack probes on
  x86 and x86-64. Thus, on any other CPU architecture, we cannot refer
  to `__rust_probestack`, because it does not exist.
- Rename arm64 to aarch64.
- Use `target` directive in vcode filetests.
- Run the flags verifier, but without encinfo, when using new backends.
- Clean up warning overrides.
- Fix up use of casts: use u32::from(x) and siblings when possible,
  u32::try_from(x).unwrap() when not, to avoid silent truncation.
- Take immutable `Function` borrows as input; we don't actually
  mutate the input IR.
- Lots of other miscellaneous cleanups.

[1] cae3e6ea23/src/probestack.rs (L39)
2020-04-15 17:21:28 -07:00
Chris Fallin
d83574261c ARM64 backend, part 3 / 11: MachInst infrastructure.
This patch adds the MachInst, or Machine Instruction, infrastructure.
This is the machine-independent portion of the new backend design. It
contains the implementation of the "vcode" (virtual-registerized code)
container, the top-level lowering algorithm and compilation pipeline,
and the trait definitions that the machine backends will fill in.

This backend infrastructure is included in the compilation of the
`codegen` crate, but it is not yet tied into the public APIs; that patch
will come last, after all the other pieces are filled in.

This patch contains code written by Julian Seward <jseward@acm.org> and
Benjamin Bouvier <public@benj.me>, originally developed on a side-branch
before rebasing and condensing into this patch series. See the `arm64`
branch at `https://github.com/cfallin/wasmtime` for original development
history.

Co-authored-by: Julian Seward <jseward@acm.org>
Co-authored-by: Benjamin Bouvier <public@benj.me>
2020-04-11 17:51:11 -07:00
Peter Huene
9f506692c2 Fix clippy warnings.
This commit fixes the current set of (stable) clippy warnings in the repo.
2019-10-24 17:20:12 -07:00
bjorn3
c274d81b5b Fix it 2019-10-02 11:50:44 -07:00
bjorn3
10e226f9ff Always use extern crate std in cranelift-codegen 2019-10-02 11:50:44 -07:00
Joshua Nelson
a1f6457e8a Allow building without std (#1069)
Closes https://github.com/CraneStation/cranelift/issues/1067
2019-09-26 18:00:03 +02:00
Anthony Ramine
178241625c Use slice::from_ref and slice::from_mut 2019-09-23 10:36:03 +02:00
Pat Hickey
89d741f8ae upgrade to target-lexicon 0.8.0
* the target-lexicon crate no longer has or needs the std feature
  in cargo, so we can delete all default-features=false, any mentions
  of its std feature, and the nostd configs in many lib.rs files
* the representation of arm architectures has changed, so some case
  statements needed refactoring
2019-09-04 15:12:17 -07:00
julian-seward1
b8fb52446c Cranelift: implement redundant fill removal on tree-shaped CFG regions. Mozilla bug 1570584. (#906) 2019-08-25 19:37:34 +02:00
bjorn3
edd2bf12fd Export ValueLocRange and DisplayFunctionAnnotations::default() 2019-05-15 09:18:45 +02:00
Yury Delendik
8f95c51730 Reconstruct locations of the original source variable 2019-05-09 00:35:44 -07:00
lazypassion
747ad3c4c5 moved crates in lib/ to src/, renamed crates, modified some files' text (#660)
moved crates in lib/ to src/, renamed crates, modified some files' text (#660)
2019-01-28 15:56:54 -08:00