Commit Graph

151 Commits

Author SHA1 Message Date
Nick Fitzgerald
f0c60f46a8 Cranelift: Remove ABICallee trait (#4701)
* Cranelift: Remove `ABICallee` trait

It has only one implementation: the `ABICalleeImpl` struct. By using that
directly we can avoid unnecessary layers of generics and abstractions as well as
a couple `Box`es that were previously putting the single implementation into a
`Box<dyn>`.

* Cranelift: Rename `ABICalleeImpl` to `AbiCallee`

* Fix comments as per review

* Rename `AbiCallee` to `Callee`
2022-08-15 18:27:05 +00:00
Benjamin Bouvier
8a9b1a9025 Implement an incremental compilation cache for Cranelift (#4551)
This is the implementation of https://github.com/bytecodealliance/wasmtime/issues/4155, using the "inverted API" approach suggested by @cfallin (thanks!) in Cranelift, and trait object to provide a backend for an all-included experience in Wasmtime. 

After the suggestion of Chris, `Function` has been split into mostly two parts:

- on the one hand, `FunctionStencil` contains all the fields required during compilation, and that act as a compilation cache key: if two function stencils are the same, then the result of their compilation (`CompiledCodeBase<Stencil>`) will be the same. This makes caching trivial, as the only thing to cache is the `FunctionStencil`.
- on the other hand, `FunctionParameters` contain the... function parameters that are required to finalize the result of compilation into a `CompiledCode` (aka `CompiledCodeBase<Final>`) with proper final relocations etc., by applying fixups and so on.

Most changes are here to accomodate those requirements, in particular that `FunctionStencil` should be `Hash`able to be used as a key in the cache:

- most source locations are now relative to a base source location in the function, and as such they're encoded as `RelSourceLoc` in the `FunctionStencil`. This required changes so that there's no need to explicitly mark a `SourceLoc` as the base source location, it's automatically detected instead the first time a non-default `SourceLoc` is set.
- user-defined external names in the `FunctionStencil` (aka before this patch `ExternalName::User { namespace, index }`) are now references into an external table of `UserExternalNameRef -> UserExternalName`, present in the `FunctionParameters`, and must be explicitly declared using `Function::declare_imported_user_function`.
- some refactorings have been made for function names:
  - `ExternalName` was used as the type for a `Function`'s name; while it thus allowed `ExternalName::Libcall` in this place, this would have been quite confusing to use it there. Instead, a new enum `UserFuncName` is introduced for this name, that's either a user-defined function name (the above `UserExternalName`) or a test case name.
  - The future of `ExternalName` is likely to become a full reference into the `FunctionParameters`'s mapping, instead of being "either a handle for user-defined external names, or the thing itself for other variants". I'm running out of time to do this, and this is not trivial as it implies touching ISLE which I'm less familiar with.

The cache computes a sha256 hash of the `FunctionStencil`, and uses this as the cache key. No equality check (using `PartialEq`) is performed in addition to the hash being the same, as we hope that this is sufficient data to avoid collisions.

A basic fuzz target has been introduced that tries to do the bare minimum:

- check that a function successfully compiled and cached will be also successfully reloaded from the cache, and returns the exact same function.
- check that a trivial modification in the external mapping of `UserExternalNameRef -> UserExternalName` hits the cache, and that other modifications don't hit the cache.
  - This last check is less efficient and less likely to happen, so probably should be rethought a bit.

Thanks to both @alexcrichton and @cfallin for your very useful feedback on Zulip.

Some numbers show that for a large wasm module we're using internally, this is a 20% compile-time speedup, because so many `FunctionStencil`s are the same, even within a single module. For a group of modules that have a lot of code in common, we get hit rates up to 70% when they're used together. When a single function changes in a wasm module, every other function is reloaded; that's still slower than I expect (between 10% and 50% of the overall compile time), so there's likely room for improvement. 

Fixes #4155.
2022-08-12 16:47:43 +00:00
Afonso Bordado
3ea1813173 x64: Add native lowering for scalar fma (#4539)
Use `vfmadd213{ss,sd}` for these lowerings.
2022-08-11 22:48:16 +00:00
Trevor Elliott
0c2e0494bd x64: Lower fcvt_from_uint in ISLE (#4684)
* Add a test for the existing behavior of fcvt_from_unit

* Migrate the I8, I16, I32 cases of fcvt_from_uint

* Implement the I64 case of fcvt_from_uint

* Add a test for the existing behavior of fcvt_from_uint.f64x2

* Migrate fcvt_from_uint.f64x2 to ISLE

* Lower the last case of `fcvt_from_uint`

* Add a test for `fcvt_from_uint`

* Finish lowering fcmp_from_uint

* Format
2022-08-11 12:28:41 -07:00
Afonso Bordado
c5bc368cfe cranelift: Add COFF TLS Support (#4546)
* cranelift: Implement COFF TLS Relocations

* cranelift: Emit SecRel relocations

* cranelift: Handle _tls_index symbol in backend
2022-08-11 09:33:40 -07:00
Trevor Elliott
0c2a48f682 x64: Migrate selectif and selectif_spectre_guard to ISLE (#4619)
https://github.com/bytecodealliance/wasmtime/pull/4619
2022-08-05 09:36:11 -07:00
Trevor Elliott
dc8362ceec x64: Finish migrating brz and brnz to ISLE (#4614)
https://github.com/bytecodealliance/wasmtime/pull/4614
2022-08-04 12:58:43 -07:00
Trevor Elliott
1fc11bbe51 x64: Migrate brff and I128 branching instructions to ISLE (#4599)
https://github.com/bytecodealliance/wasmtime/pull/4599
2022-08-04 08:58:50 -07:00
Nick Fitzgerald
42bba452a6 Cranelift: Add instructions for getting the current stack/frame/return pointers (#4573)
* Cranelift: Add instructions for getting the current stack/frame pointers and return address

This is the initial part of https://github.com/bytecodealliance/wasmtime/issues/4535

* x64: Remove `Amode::RbpOffset` and use `Amode::ImmReg` instead

We just special case getting operands from `Amode`s now.

* Fix s390x `get_return_address`; require `preserve_frame_pointers=true`

* Assert that `Amode::ImmRegRegShift` doesn't use rbp/rsp

* Handle non-allocatable registers in Amode::with_allocs

* Use "stack" instead of "r15" on s390x

* r14 is an allocatable register on s390x, so it shouldn't be used with `MovPReg`
2022-08-02 14:37:17 -07:00
Chris Fallin
43f1765272 Cranellift: remove Baldrdash support and related features. (#4571)
* Cranellift: remove Baldrdash support and related features.

As noted in Mozilla's bugzilla bug 1781425 [1], the SpiderMonkey team
has recently determined that their current form of integration with
Cranelift is too hard to maintain, and they have chosen to remove it
from their codebase. If and when they decide to build updated support
for Cranelift, they will adopt different approaches to several details
of the integration.

In the meantime, after discussion with the SpiderMonkey folks, they
agree that it makes sense to remove the bits of Cranelift that exist
to support the integration ("Baldrdash"), as they will not need
them. Many of these bits are difficult-to-maintain special cases that
are not actually tested in Cranelift proper: for example, the
Baldrdash integration required Cranelift to emit function bodies
without prologues/epilogues, and instead communicate very precise
information about the expected frame size and layout, then stitched
together something post-facto. This was brittle and caused a lot of
incidental complexity ("fallthrough returns", the resulting special
logic in block-ordering); this is just one example. As another
example, one particular Baldrdash ABI variant processed stack args in
reverse order, so our ABI code had to support both traversal
orders. We had a number of other Baldrdash-specific settings as well
that did various special things.

This PR removes Baldrdash ABI support, the `fallthrough_return`
instruction, and pulls some threads to remove now-unused bits as a
result of those two, with the  understanding that the SpiderMonkey folks
will build new functionality as needed in the future and we can perhaps
find cleaner abstractions to make it all work.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1781425

* Review feedback.

* Fix (?) DWARF debug tests: add `--disable-cache` to wasmtime invocations.

The debugger tests invoke `wasmtime` from within each test case under
the control of a debugger (gdb or lldb). Some of these tests started to
inexplicably fail in CI with unrelated changes, and the failures were
only inconsistently reproducible locally. It seems to be cache related:
if we disable cached compilation on the nested `wasmtime` invocations,
the tests consistently pass.

* Review feedback.
2022-08-02 19:37:56 +00:00
Trevor Elliott
25782b527e x64: Migrate trapif and trapff to ISLE (#4545)
https://github.com/bytecodealliance/wasmtime/pull/4545
2022-08-01 11:24:11 -07:00
Benjamin Bouvier
8d0224341c cranelift: Introduce a feature to enable trace logs (#4484)
* Don't use `log::trace` directly but a feature-enabled `trace` macro
* Don't emit disassembly based on the log level
2022-08-01 11:19:15 +02:00
Trevor Elliott
7ac6134894 x64: Shrink Inst from 72 to 48 bytes (#4514)
https://github.com/bytecodealliance/wasmtime/pull/4514
2022-07-27 10:39:22 -07:00
Afonso Bordado
02c3b47db2 x64: Implement SIMD fma (#4474)
* x64: Add VEX Instruction Encoder

This uses a similar builder pattern to the EVEX Encoder.
Does not yet support memory accesses.

* x64: Add FMA Flag

* x64: Implement SIMD `fma`

* x64: Use 4 register Vex Inst

* x64: Reorder VEX pretty print args
2022-07-25 22:01:02 +00:00
Trevor Elliott
06407dd337 Add a test to prevent x64 Inst size slipping further (#4489)
* Add a test to prevent x64 Inst size slipping further

* Enable the test for 64-bit pointers only
2022-07-21 00:01:33 +00:00
Andrew Brown
8629cbc6a4 x64: port atomic_rmw to ISLE (#4389)
* x64: port `atomic_rmw` to ISLE

This change ports `atomic_rmw` to ISLE for the x64 backend. It does not
change the lowering in any way, though it seems possible that the fixed
regs need not be as fixed and that there are opportunities for single
instruction lowerings. It does rename `inst_common::AtomicRmwOp` to
`MachAtomicRmwOp` to disambiguate with the IR enum with the same name.

* x64: remove remaining hardcoded register constraints for `atomic_rmw`

* x64: use `SyntheticAmode` in `AtomicRmwSeq`

* review: add missing reg collector for amode

* review: collect memory registers in the 'late' phase
2022-07-06 23:58:59 +00:00
bjorn3
d1446f767d Mark return value as define instead of clobber for TLS pseudoinstructions (#4357) 2022-06-30 10:44:51 -07:00
Chris Fallin
b2e28b917a Cranelift: update to latest regalloc2: (#4324)
- Handle call instructions' clobbers with the clobbers API, using RA2's
  clobbers bitmask (bytecodealliance/regalloc2#58) rather than clobbers
  list;

- Pull in changes from bytecodealliance/regalloc2#59 for much more sane
  edge-case behavior w.r.t. liverange splitting.
2022-06-28 09:01:59 -07:00
Chris Fallin
5c2c285dd7 Cranelift/x64: fix register allocator metadata for 8-bit divides. (#4332)
`idiv` on x86-64 only reads `rdx`/`edx`/`dx`/`dl` for divides with width
greater than 8 bits; for an 8-bit divide, it reads the whole 16-bit
divisor from `ax`, as our CISC ancestors intended. This PR fixes the
metadata to avoid a regalloc panic (due to undefined `rdx`) in this
case. Does not affect Wasmtime or other Wasm-frontend embedders.
2022-06-27 12:31:06 -07:00
Alex Crichton
8bb07523e2 x64: Fix codegen for the select instruction with v128 (#4317)
This commit fixes a bug in the previous codegen for the `select`
instruction when the operations of the `select` were of the `v128` type.
Previously teh `XmmCmove` instruction only stored an `OperandSize` of 32
or 64 for a 64 or 32-bit move, but this was also used for these 128-bit
types which meant that when used the wrong move instruction was
generated. The fix applied here is to store the whole `Type` being moved
so the 128-bit variant can be selected as well.
2022-06-27 11:02:40 -07:00
Andrew Brown
816aae6aca x64: port some atomics to ISLE (#4212)
* x64: port `fence` to ISLE
* x64: port `atomic_load` to ISLE
* x64: port `atomic_store` to ISLE
2022-06-02 14:13:10 -07:00
Benjamin Bouvier
6e828df632 Remove unused SourceLoc in many Mach data structures (#4180)
* Remove unused srcloc in MachReloc

* Remove unused srcloc in MachTrap

* Use `into_iter` on array in bench code to suppress a warning

* Remove unused srcloc in MachCallSite
2022-05-23 09:27:28 -07:00
Chris Fallin
019ebf47b1 x64: fix pretty-printing argument order for XmmRmR instructions. (#4094)
The pretty-printing had swapped dst and src2; this was introduced when
we moved to RA2 (sorry about that! IMHO we should do something to
automate the mapping between regalloc arg collection and pretty
printing/emission).

`src2` comes at the end because it has a variable number of register
mentions; this is in line with how many of the other inst formats work.

Actual emitted code was never incorrect, just the pretty-printing.

Updated test golden outputs look correct to me now, including the one
that we saw was incorrect in #3945.
2022-05-03 10:12:58 -07:00
Chris Fallin
61dc38c065 Implement Spectre mitigations for table accesses and br_tables. (#4092)
Currently, we have partial Spectre mitigation: we protect heap accesses
with dynamic bounds checks. Specifically, we guard against errant
accesses on the misspeculated path beyond the bounds-check conditional
branch by adding a conditional move that is also dependent on the
bounds-check condition. This data dependency on the condition is not
speculated and thus will always pick the "safe" value (in the heap case,
a NULL address) on the misspeculated path, until the pipeline flushes
and recovers onto the correct path.

This PR uses the same technique both for table accesses -- used to
implement Wasm tables -- and for jumptables, used to implement Wasm
`br_table` instructions.

In the case of Wasm tables, the cmove picks the table base address on
the misspeculated path. This is equivalent to reading the first table
entry. This prevents loads of arbitrary data addresses on the
misspeculated path.

In the case of `br_table`, the cmove picks index 0 on the misspeculated
path. This is safer than allowing a branch to an address loaded from an
index under misspeculation (i.e., it preserves control-flow integrity
even under misspeculation).

The table mitigation is controlled by a Cranelift setting, on by
default. The br_table mitigation is always on, because it is part of the
single lowering pseudoinstruction. In both cases, the impact should be
minimal: a single extra cmove in a (relatively) rarely-used operation.

The table mitigation is architecture-independent (happens during
legalization); the br_table mitigation has been implemented for both x64
and aarch64. (I don't know enough about s390x to implement this
confidently there, but would happily review a PR to do the same on that
platform.)
2022-05-02 11:19:16 -07:00
Chris Fallin
dd45f44511 x64 backend: add lowerings with load-op-store fusion. (#4071)
x64 backend: add lowerings with load-op-store fusion.

These lowerings use the `OP [mem], reg` forms (or in AT&T syntax, `OP
%reg, (mem)`) -- i.e., x86 instructions that load from memory, perform
an ALU operation, and store the result, all in one instruction. Using
these instruction forms, we can merge three CLIF ops together: a load,
an arithmetic operation, and a store.
2022-04-26 18:58:26 -07:00
Chris Fallin
5774e068b7 Cranelift: fix regalloc2 integration bug wrt blockparam branch args. (#4042)
Previously, the block successor accumulation and the blockparam branch
arg setup were decoupled. The lowering backend implicitly specified
the order of successor edges via its `MachTerminator` enum on the last
instruction in the block, while the `Lower` toplevel
machine-independent driver set up blockparam branch args in the edge
order seen in CLIF.

In some cases, these orders did not match -- for example, when the
conditional branch depended on an FP condition that was implemented by
swapping taken/not-taken edges and inverting the condition code.

This PR refactors the successor handling to be centralized in `Lower`
rather than flow through the terminator `MachInst`, and adds a
successor block and its blockparam args at the same time, ensuring the
orders match.
2022-04-18 09:53:57 -07:00
Chris Fallin
a0318f36f0 Switch Cranelift over to regalloc2. (#3989)
This PR switches Cranelift over to the new register allocator, regalloc2.

See [this document](https://gist.github.com/cfallin/08553421a91f150254fe878f67301801)
for a summary of the design changes. This switchover has implications for
core VCode/MachInst types and the lowering pass.

Overall, this change brings improvements to both compile time and speed of
generated code (runtime), as reported in #3942:

```
Benchmark       Compilation (wallclock)     Execution (wallclock)
blake3-scalar   25% faster                  28% faster
blake3-simd     no diff                     no diff
meshoptimizer   19% faster                  17% faster
pulldown-cmark  17% faster                  no diff
bz2             15% faster                  no diff
SpiderMonkey,   21% faster                  2% faster
  fib(30)
clang.wasm      42% faster                  N/A
```
2022-04-14 10:28:21 -07:00
Chris Fallin
cd173cfe8e ISLE: port fmin, fmax, fmin_pseudo, fmax_pseudo on x64. (#3856) 2022-02-28 14:40:26 -08:00
Chris Fallin
24f145cd1e Migrate clz, ctz, popcnt, bitrev, is_null, is_invalid on x64 to ISLE. (#3848) 2022-02-28 09:45:13 -08:00
Andrew Brown
f87c61176a x64: port select to ISLE (#3682)
* x64: port `select` using an FP comparison to ISLE

This change includes quite a few interlocking parts, required mainly by
the current x64 conventions in ISLE:
 - it adds a way to emit a `cmove` with multiple OR-ing conditions;
   because x64 ISLE cannot currently safely emit a comparison followed
   by several jumps, this adds `MachInst::CmoveOr` and
   `MachInst::XmmCmoveOr` macro instructions. Unfortunately, these macro
   instructions hide the multi-instruction sequence in `lower.isle`
 - to properly keep track of what instructions consume and produce
   flags, @cfallin added a way to pass around variants of
   `ConsumesFlags` and `ProducesFlags`--these changes affect all
   backends
 - then, to lower the `fcmp + select` CLIF, this change adds several
   `cmove*_from_values` helpers that perform all of the awkward
   conversions between `Value`, `ValueReg`, `Reg`, and `Gpr/Xmm`; one
   upside is that now these lowerings have much-improved documentation
   explaining why the various `FloatCC` and `CC` choices are made the
   the way they are.

Co-authored-by: Chris Fallin <chris@cfallin.org>
2022-02-23 10:03:16 -08:00
Nick Fitzgerald
dc86e7a6dc cranelift: Use GPR newtypes extensively in x64 lowering (#3798)
We already defined the `Gpr` newtype and used it in a few places, and we already
defined the `Xmm` newtype and used it extensively. This finishes the transition
to using the newtypes extensively in lowering by making use of `Gpr` in more
places.

Fixes #3685
2022-02-14 12:54:41 -08:00
Nick Fitzgerald
795b0aaf9a cranelift: Add newtype wrappers for x64 register classes
This primary motivation of this large commit (apologies for its size!) is to
introduce `Gpr` and `Xmm` newtypes over `Reg`. This should help catch
difficult-to-diagnose register class mixup bugs in x64 lowerings.

But having a newtype for `Gpr` and `Xmm` themselves isn't enough to catch all of
our operand-with-wrong-register-class bugs, because about 50% of operands on x64
aren't just a register, but a register or memory address or even an
immediate! So we have `{Gpr,Xmm}Mem[Imm]` newtypes as well.

Unfortunately, `GprMem` et al can't be `enum`s and are therefore a little bit
noisier to work with from ISLE. They need to maintain the invariant that their
registers really are of the claimed register class, so they need to encapsulate
the inner data. If they exposed the underlying `enum` variants, then anyone
could just change register classes or construct a `GprMem` that holds an XMM
register, defeating the whole point of these newtypes. So when working with
these newtypes from ISLE, we rely on external constructors like `(gpr_to_gpr_mem
my_gpr)` instead of `(GprMem.Gpr my_gpr)`.

A bit of extra lines of code are included to add support for register mapping
for all of these newtypes as well. Ultimately this is all a bit wordier than I'd
hoped it would be when I first started authoring this commit, but I think it is
all worth it nonetheless!

In the process of adding these newtypes, I didn't want to have to update both
the ISLE `extern` type definition of `MInst` and the Rust definition, so I move
the definition fully into ISLE, similar as aarch64.

Finally, this process isn't complete. I've introduced the newtypes here, and
I've made most XMM-using instructions switch from `Reg` to `Xmm`, as well as
register class-converting instructions, but I haven't moved all of the GPR-using
instructions over to the newtypes yet. I figured this commit was big enough as
it was, and I can continue the adoption of these newtypes in follow up commits.

Part of #3685.
2022-02-03 14:08:08 -08:00
Nick Fitzgerald
5bb3645bd4 cranelift: Port ineg SIMD lowering to ISLE on x64 2022-01-13 15:57:17 -08:00
Nick Fitzgerald
a7dba81c1d cranelift: Port ishl SIMD lowerings to ISLE (#3686) 2022-01-13 09:34:37 -06:00
Chris Fallin
a98f9982fd Merge pull request #3655 from bjorn3/machinst_cleanups2
Remove MachBackend
2022-01-06 13:32:36 -08:00
Nick Fitzgerald
23efaf2196 cranelift: Remove unused x64 instruction helpers 2022-01-06 11:22:54 -08:00
bjorn3
d50f27e8f9 Remove reg_universe method from MachBackend and MachInst 2022-01-06 14:39:50 +01:00
bjorn3
17c3c1813f Remove MachInstEmitInfo 2022-01-04 18:06:01 +01:00
Alex Crichton
1141169ff8 aarch64: Initial work to transition backend to ISLE (#3541)
* aarch64: Initial work to transition backend to ISLE

This commit is what is hoped to be the initial commit towards migrating
the aarch64 backend to ISLE. There's seemingly a lot of changes here but
it's intended to largely be code motion. The current thinking is to
closely follow the x64 backend for how all this is handled and
organized.

Major changes in this PR are:

* The `Inst` enum is now defined in ISLE. This avoids having to define
  it in two places (once in Rust and once in ISLE). I've preserved all
  the comments in the ISLE and otherwise this isn't actually a
  functional change from the Rust perspective, it's still the same enum
  according to Rust.

* Lots of little enums and things were moved to ISLE as well. As with
  `Inst` their definitions didn't change, only where they're defined.
  This will give future ISLE PRs access to all these operations.

* Initial code for lowering `iconst`, `null`, and `bconst` are
  implemented. Ironically none of this is actually used right now
  because constant lowering is handled in `put_input_in_regs` which
  specially handles constants. Nonetheless I wanted to get at least
  something simple working which shows off how to special case various
  things that are specific to AArch64. In a future PR I plan to hook up
  const-lowering in ISLE to this path so even though
  `iconst`-the-clif-instruction is never lowered this should use the
  const lowering defined in ISLE rather than elsewhere in the backend
  (eventually leading to the deletion of the non-ISLE lowering).

* The `IsleContext` skeleton is created and set up for future additions.

* Some code for ISLE that's shared across all backends now lives in
  `isle_prelude_methods!()` and is deduplicated between the AArch64
  backend and the x64 backend.

* Register mapping is tweaked to do the same thing for AArch64 that it
  does for x64. Namely mapping virtual registers is supported instead of
  just virtual to machine registers.

My main goal with this PR was to get AArch64 into a place where new
instructions can be added with relative ease. Additionally I'm hoping to
figure out as part of this change how much to share for ISLE between
AArch64 and x64 (and other backends).

* Don't use priorities with rules

* Update .gitattributes with concise syntax

* Deduplicate some type definitions

* Rebuild ISLE

* Move isa::isle to machinst::isle
2021-11-18 10:38:16 -06:00
Alex Crichton
92394566fc x64: Migrate fabs and bnot vector operations to ISLE
This was my first attempt at transitioning code to ISLE to originally
fix #3327 but that fix has since landed on `main`, so this is instead
now just porting a few operations to ISLE.

Closes #3336
2021-11-16 07:36:49 -08:00
Nick Fitzgerald
33fcd6b4a5 x64: special case 0 to use xor in Inst::gen_constant for i128s 2021-11-10 15:57:58 -08:00
Nick Fitzgerald
d377b665c6 Initial ISLE integration with the x64 backend
On the build side, this commit introduces two things:

1. The automatic generation of various ISLE definitions for working with
CLIF. Specifically, it generates extern type definitions for clif opcodes and
the clif instruction data `enum`, as well as extractors for matching each clif
instructions. This happens inside the `cranelift-codegen-meta` crate.

2. The compilation of ISLE DSL sources to Rust code, that can be included in the
main `cranelift-codegen` compilation.

Next, this commit introduces the integration glue code required to get
ISLE-generated Rust code hooked up in clif-to-x64 lowering. When lowering a clif
instruction, we first try to use the ISLE code path. If it succeeds, then we are
done lowering this instruction. If it fails, then we proceed along the existing
hand-written code path for lowering.

Finally, this commit ports many lowering rules over from hand-written,
open-coded Rust to ISLE.

In the process of supporting ISLE, this commit also makes the x64 `Inst` capable
of expressing SSA by supporting 3-operand forms for all of the existing
instructions that only have a 2-operand form encoding:

    dst = src1 op src2

Rather than only the typical x86-64 2-operand form:

    dst = dst op src

This allows `MachInst` to be in SSA form, since `dst` and `src1` are
disentangled.

("3-operand" and "2-operand" are a little bit of a misnomer since not all
operations are binary operations, but we do the same thing for, e.g., unary
operations by disentangling the sole operand from the result.)

There are two motivations for this change:

1. To allow ISLE lowering code to have value-equivalence semantics. We want ISLE
   lowering to translate a CLIF expression that evaluates to some value into a
   `MachInst` expression that evaluates to the same value. We want both the
   lowering itself and the resulting `MachInst` to be pure and referentially
   transparent. This is both a nice paradigm for compiler writers that are
   authoring and maintaining lowering rules and is a prerequisite to any sort of
   formal verification of our lowering rules in the future.

2. Better align `MachInst` with `regalloc2`'s API, which requires that the input
   be in SSA form.
2021-10-12 17:11:58 -07:00
Alex Crichton
1532516a36 Use relative call instructions between wasm functions (#3275)
* Use relative `call` instructions between wasm functions

This commit is a relatively major change to the way that Wasmtime
generates code for Wasm modules and how functions call each other.
Prior to this commit all function calls between functions, even if they
were defined in the same module, were done indirectly through a
register. To implement this the backend would emit an absolute 8-byte
relocation near all function calls, load that address into a register,
and then call it. While this technique is simple to implement and easy
to get right, it has two primary downsides associated with it:

* Function calls are always indirect which means they are more difficult
  to predict, resulting in worse performance.

* Generating a relocation-per-function call requires expensive
  relocation resolution at module-load time, which can be a large
  contributing factor to how long it takes to load a precompiled module.

To fix these issues, while also somewhat compromising on the previously
simple implementation technique, this commit switches wasm calls within
a module to using the `colocated` flag enabled in Cranelift-speak, which
basically means that a relative call instruction is used with a
relocation that's resolved relative to the pc of the call instruction
itself.

When switching the `colocated` flag to `true` this commit is also then
able to move much of the relocation resolution from `wasmtime_jit::link`
into `wasmtime_cranelift::obj` during object-construction time. This
frontloads all relocation work which means that there's actually no
relocations related to function calls in the final image, solving both
of our points above.

The main gotcha in implementing this technique is that there are
hardware limitations to relative function calls which mean we can't
simply blindly use them. AArch64, for example, can only go +/- 64 MB
from the `bl` instruction to the target, which means that if the
function we're calling is a greater distance away then we would fail to
resolve that relocation. On x86_64 the limits are +/- 2GB which are much
larger, but theoretically still feasible to hit. Consequently the main
increase in implementation complexity is fixing this issue.

This issue is actually already present in Cranelift itself, and is
internally one of the invariants handled by the `MachBuffer` type. When
generating a function relative jumps between basic blocks have similar
restrictions. This commit adds new methods for the `MachBackend` trait
and updates the implementation of `MachBuffer` to account for all these
new branches. Specifically the changes to `MachBuffer` are:

* For AAarch64 the `LabelUse::Branch26` value now supports veneers, and
  AArch64 calls use this to resolve relocations.

* The `emit_island` function has been rewritten internally to handle
  some cases which previously didn't come up before, such as:

  * When emitting an island the deadline is now recalculated, where
    previously it was always set to infinitely in the future. This was ok
    prior since only a `Branch19` supported veneers and once it was
    promoted no veneers were supported, so without multiple layers of
    promotion the lack of a new deadline was ok.

  * When emitting an island all pending fixups had veneers forced if
    their branch target wasn't known yet. This was generally ok for
    19-bit fixups since the only kind getting a veneer was a 19-bit
    fixup, but with mixed kinds it's a bit odd to force veneers for a
    26-bit fixup just because a nearby 19-bit fixup needed a veneer.
    Instead fixups are now re-enqueued unless they're known to be
    out-of-bounds. This may run the risk of generating more islands for
    19-bit branches but it should also reduce the number of islands for
    between-function calls.

  * Otherwise the internal logic was tweaked to ideally be a bit more
    simple, but that's a pretty subjective criteria in compilers...

I've added some simple testing of this for now. A synthetic compiler
option was create to simply add padded 0s between functions and test
cases implement various forms of calls that at least need veneers. A
test is also included for x86_64, but it is unfortunately pretty slow
because it requires generating 2GB of output. I'm hoping for now it's
not too bad, but we can disable the test if it's prohibitive and
otherwise just comment the necessary portions to be sure to run the
ignored test if these parts of the code have changed.

The final end-result of this commit is that for a large module I'm
working with the number of relocations dropped to zero, meaning that
nothing actually needs to be done to the text section when it's loaded
into memory (yay!). I haven't run final benchmarks yet but this is the
last remaining source of significant slowdown when loading modules,
after I land a number of other PRs both active and ones that I only have
locally for now.

* Fix arm32

* Review comments
2021-09-01 13:27:38 -05:00
Andrew Brown
2a9f458ea3 x64: lower i8x16.shuffle to VPERMI2B when possible
When shuffling values from two different registers, the x64 lowering for
`i8x16.shuffle` must first shuffle each register separately and then OR
the results with SSE instructions. With `VPERMI2B`, available in
AVX512VL + AVX512VBMI, this can be done in a single instruction after
the shuffle mask has been moved into the destination register. This
change uses `VPERMI2B` for that case when the CPU supports it.
2021-06-01 11:40:53 -07:00
Andrew Brown
7ef3ae2903 x64: implement vselect with variable blend instructions
This change implements `vselect` using SSE4.1's `BLENDVPS`, `BLENDVPD`,
and `PBLENDVB`. `vselect` is a lane-selecting instruction that is used
by
[simple_preopt.rs](fa1faf5d22/cranelift/codegen/src/simple_preopt.rs (L947-L999))
to lower `bitselect` to a single x86 instruction when the condition mask
is known to be boolean (all 1s or 0s, e.g., from a conversion). This is
better than `bitselect` in general, which lowers to 4-5 instructions.
The old backend had the `vselect` lowering; this simply introduces it to
the new backend.
2021-05-17 11:23:33 -07:00
Andrew Brown
e676589b0c x64: lower i64x2.imul to VPMULLQ when possible
This adds the machinery to encode the VPMULLQ instruction which is
available in AVX512VL and AVX512DQ. When these feature sets are
available, we use this instruction instead of a lengthy 12-instruction
sequence.
2021-05-13 20:14:05 -07:00
Andrew Brown
02796fc670 x64: move encodings to a separate module
In order to benchmark the encoding code with criterion, the functions
and structures must be public. Moving this code to its own module
(instead of keeping as a submodule to `inst`), allows `inst` to remain
private. This avoids having to expose and document (or ignore
documenting) the numerous instruction variants in `inst` while allowing
access to the encoding code. This commit changes no functionality.
2021-05-13 10:46:08 -07:00
Andrew Brown
0acc1451ea x64: lower iabs.i64x2 using a single AVX512 instruction when possible (#2819)
* x64: add EVEX encoding mechanism

Also, includes an empty stub module for the VEX encoding.

* x64: lower abs.i64x2 to VPABSQ when available

* x64: refactor EVEX encodings to use `EvexInstruction`

This change replaces the `encode_evex` function with a builder-style struct, `EvexInstruction`. This approach clarifies the code, adds documentation, and results in slight speedups when benchmarked.

* x64: rename encoding CodeSink to ByteSink
2021-04-15 11:53:58 -07:00
Andrew Brown
9b25b06d86 x64: store to all scalar sizes
Previously, `Inst::store` only understood a subset of the scalar types, which resulted in failures seen in #2826. This change allows `Inst::store` to generate instructions for all scalar widths (`8 | 16 | 32 | 64`) since all of these are supported in the emission code of `Inst::MovRM`.
2021-04-13 12:38:35 -07:00
Andrew Brown
8e495ac79d x64: match multiple ISA requirements before emitting
Because there are instructions that are present in more than one ISA feature set, we need to see if any of the ISA requirements match before emitting. This change includes the `VPABSQ` instruction as an example, which is present in both `AVX512F` and `AVX512VL`.
2021-04-08 10:30:39 -07:00