Commit Graph

891 Commits

Author SHA1 Message Date
Julian Seward
25e31739a6 Implement Wasm Atomics for Cranelift/newBE/aarch64.
The implementation is pretty straightforward.  Wasm atomic instructions fall
into 5 groups

* atomic read-modify-write
* atomic compare-and-swap
* atomic loads
* atomic stores
* fences

and the implementation mirrors that structure, at both the CLIF and AArch64
levels.

At the CLIF level, there are five new instructions, one for each group.  Some
comments about these:

* for those that take addresses (all except fences), the address is contained
  entirely in a single `Value`; there is no offset field as there is with
  normal loads and stores.  Wasm atomics require alignment checks, and
  removing the offset makes implementation of those checks a bit simpler.

* atomic loads and stores get their own instructions, rather than reusing the
  existing load and store instructions, for two reasons:

  - per above comment, makes alignment checking simpler

  - reuse of existing loads and stores would require extension of `MemFlags`
    to indicate atomicity, which sounds semantically unclean.  For example,
    then *any* instruction carrying `MemFlags` could be marked as atomic, even
    in cases where it is meaningless or ambiguous.

* I tried to specify, in comments, the behaviour of these instructions as
  tightly as I could.  Unfortunately there is no way (per my limited CLIF
  knowledge) to enforce the constraint that they may only be used on I8, I16,
  I32 and I64 types, and in particular not on floating point or vector types.

The translation from Wasm to CLIF, in `code_translator.rs`, is unremarkable.

At the AArch64 level, there are also five new instructions, one for each
group.  All of them except `::Fence` contain multiple real machine
instructions.  Atomic r-m-w and atomic c-a-s are emitted as the usual
load-linked store-conditional loops, guarded at both ends by memory fences.
Atomic loads and stores are emitted as a load preceded by a fence, and a store
followed by a fence, respectively.  The amount of fencing may be overkill, but
it reflects exactly what the SM Wasm baseline compiler for AArch64 does.

One reason to implement r-m-w and c-a-s as a single insn which is expanded
only at emission time is that we must be very careful what instructions we
allow in between the load-linked and store-conditional.  In particular, we
cannot allow *any* extra memory transactions in there, since -- particularly
on low-end hardware -- that might cause the transaction to fail, hence
deadlocking the generated code.  That implies that we can't present the LL/SC
loop to the register allocator as its constituent instructions, since it might
insert spills anywhere.  Hence we must present it as a single indivisible
unit, as we do here.  It also has the benefit of reducing the total amount of
work the RA has to do.
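
As an illustration of the loop shape only (not the code emitted by this patch), an atomic read-modify-write can be sketched in Rust as a retry loop around a weak compare-exchange. The function name `atomic_rmw` is hypothetical. The loop body contains only the computation of the new value and no other memory traffic, which is the property that presenting the loop as one indivisible instruction is meant to protect.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Atomically applies `op` to `cell` and returns the previous value.
/// The retry loop mirrors an LL/SC loop: read, compute, attempt the
/// store, and start over if another writer got in between.
pub fn atomic_rmw(cell: &AtomicU32, op: impl Fn(u32) -> u32) -> u32 {
    let mut old = cell.load(Ordering::SeqCst);
    loop {
        let new = op(old);
        // The weak form may fail spuriously, just as a store-conditional can.
        match cell.compare_exchange_weak(old, new, Ordering::SeqCst, Ordering::SeqCst) {
            Ok(prev) => return prev,
            Err(actual) => old = actual,
        }
    }
}
```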

The only other notable feature of the r-m-w and c-a-s translations into
AArch64 code is that they both need a scratch register internally.  Rather
than faking one up by claiming, in `get_regs`, that it modifies an extra
scratch register, and having to have a dummy initialisation of it, these new
instructions (`::LLSC` and `::CAS`) simply use fixed registers in the range
x24-x28.  We rely on the RA's ability to coalesce V<-->R copies to make the
cost of the resulting extra copies zero or almost zero.  x24-x28 are chosen so
as to be call-clobbered, hence their use is less likely to interfere with long
live ranges that span calls.

One subtlety regarding the use of completely fixed input and output registers
is that we must be careful how the surrounding copies to and from the
arg/result registers are done.  In particular, it is not safe to simply emit
copies in some arbitrary order if one of the arg registers is a real reg.  For that
reason, the arguments are first moved into virtual regs if they are not
already there, using a new method `<LowerCtx for Lower>::ensure_in_vreg`.
Again, we rely on coalescing to turn them into no-ops in the common case.

There is also a ridealong fix for the AArch64 lowering case for
`Opcode::Trapif | Opcode::Trapff`, which removes a bug in which two trap insns
in a row were generated.

In the patch as submitted there are 6 "FIXME JRS" comments, which mark things
which I believe to be correct, but for which I would appreciate a second
opinion.  Unless otherwise directed, I will remove them for the final commit
but leave the associated code/comments unchanged.
2020-08-04 09:35:50 +02:00
Alex Crichton
65eaca35dd Refactor where results of compilation are stored (#2086)
* Refactor where results of compilation are stored

This commit refactors the internals of compilation in Wasmtime to change
where results of individual function compilation are stored. Previously
compilation resulted in many maps being returned, and compilation
results generally held all these maps together. This commit instead
switches this to have all metadata stored in a `CompiledFunction`
instead of having a separate map for each item that can be stored.

The motivation for this is primarily to help out with future
module-linking-related PRs. What exactly "module level" is depends on
how we interpret modules and how many modules are in play, so it's a bit
easier for operations in wasmtime to work at the function level where
possible. This means that we don't have to pass around multiple
different maps and a function index, but instead just one map or just
one entry representing a compiled function.

Additionally this change updates where the parallelism of compilation
happens, pushing it into `wasmtime-jit` instead of `wasmtime-environ`.
This is another goal where `wasmtime-jit` will have more knowledge about
module-level pieces with module linking in play. User-facing-wise this
should be the same in terms of parallel compilation, though.

The ultimate goal of this refactoring is to make it easier for the
results of compilation to actually be a set of wasm modules. This means
we won't be able to have a map-per-metadata where the primary key is the
function index, because there will be many modules within one "object
file".

* Don't clear out fields, just don't store them

Persist a smaller set of fields in `CompilationArtifacts` instead of
trying to clear fields out and dynamically not accessing them.
2020-08-03 12:20:51 -05:00
Benjamin Bouvier
e108f14620 machinst x64: use xor/xorps/xorpd to generate zero constants; 2020-07-31 13:17:52 -07:00
Chris Fallin
9a9b5015d0 Merge pull request #2081 from cfallin/aarch64-baldrdash-fix
Aarch64: fix narrow integer-register extension with Baldrdash ABI.
2020-07-31 12:13:38 -07:00
Chris Fallin
dd09865611 Fix MachBuffer branch handling with redirect chains.
When one branch target label in a MachBuffer is redirected to another,
we eventually fix up branches targeting the first to refer to the
redirected target instead. Separately, we have a branch-folding
optimization that, when an unconditional branch occurs as the only
instruction in a block (right at a label) and the previous instruction
is also an unconditional branch (hence no fallthrough), we can elide
that block entirely and redirect the label. Finally, we prevented
infinite loops when resolving label aliases by chasing only one alias
deep.

Unfortunately, these three facts interacted poorly, and this is a result
of our correctness arguments assuming a fully-general "redirect" that
was not limited to one indirection level. In particular, we could have
some label A that redirected to B, then remove the block at B because it
is just a single branch to C, redirecting B to C. A would still redirect
to B, though, without chasing to C, and hence a branch to B would fall
through to the unrelated block that came after block B.

Thanks to @bnjbvr for finding this bug while debugging the x64 backend
and reducing a failure to the function in issue #2082. (This is a very
subtle bug and it seems to have been quite difficult to chase; my
apologies!)

The fix is to (i) chase redirects arbitrarily deep, but also (ii) ensure
that we do not form a cycle of redirects. The latter is done by very
carefully checking the existing fully-resolved target of the label we
are about to redirect *to*; if it resolves back to the branch that
is causing this redirect, then we avoid making the alias. The comments
in this patch make a slightly more detailed argument why this should be
correct.
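
A minimal sketch of the two-part fix, assuming a simplified alias table rather than the real `MachBuffer` data structures (the `Labels` type and its methods are hypothetical): `resolve` chases redirects to arbitrary depth, and `redirect` refuses to create an alias that would close a cycle.

```rust
/// Label alias table: `aliases[l] == Some(t)` means label `l` redirects to `t`.
pub struct Labels {
    aliases: Vec<Option<usize>>,
}

impl Labels {
    pub fn new(count: usize) -> Self {
        Labels { aliases: vec![None; count] }
    }

    /// Follows redirects until reaching a label with no alias.
    /// This terminates because `redirect` never creates a cycle.
    pub fn resolve(&self, mut label: usize) -> usize {
        while let Some(next) = self.aliases[label] {
            label = next;
        }
        label
    }

    /// Redirects `from` to `to`, unless both already resolve to the same
    /// label, in which case the alias would form a cycle and is skipped.
    /// Returns whether the alias was made.
    pub fn redirect(&mut self, from: usize, to: usize) -> bool {
        let from_root = self.resolve(from);
        let to_root = self.resolve(to);
        if from_root == to_root {
            return false;
        }
        // Both ends are unaliased roots here, so linking them stays acyclic.
        self.aliases[from_root] = Some(to_root);
        true
    }
}
```

With labels A, B and C numbered 0, 1 and 2, redirecting A to B and then B to C makes `resolve(0)` return 2, and a later attempt to redirect C back to A is rejected.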

Unfortunately we cannot directly test the CLIF that @bnjbvr reduced
because we don't have a way to assert anything about the machine-code
that comes after the branch folding and emission. However, the dedicated
unit tests in this patch replicate an equivalent folding case, and also
test that we handle branch cycles properly (as argued above).

Fixes #2082.
2020-07-31 19:52:30 +02:00
Chris Fallin
1fbdf169b5 Aarch64: fix narrow integer-register extension with Baldrdash ABI.
In the Baldrdash (SpiderMonkey) embedding, we must take care to
zero-extend all function arguments to callees in integer registers when
the types are narrower than 64 bits. This is because, unlike the native
SysV ABI, the Baldrdash ABI expects high bits to be cleared. Not doing
so leads to difficult-to-trace errors where high bits falsely tag an
int32 as e.g. an object pointer, leading to potential security issues.
2020-07-31 10:19:13 -07:00
Andrew Brown
999fa00d6a machinst x64: add loading of inline 128-bit constants
Eventually the `load + jmp + constant` pattern should be replaced with just `load` once constant pools are more tightly integrated.
2020-07-30 14:16:12 -07:00
Andrew Brown
eda5c6d370 machinst x64: add packed FP negation 2020-07-30 14:16:12 -07:00
Andrew Brown
c74a9d1225 machinst x64: add packed shifts 2020-07-30 14:16:12 -07:00
Andrew Brown
0398033447 machinst x64: add packed FP comparisons
Re-orders the SseOpcode variants alphabetically.
2020-07-30 14:16:12 -07:00
Andrew Brown
e3bd8d696b machinst x64: add basic packed FP arithmetic
Includes instruction definition of packed min/max.
2020-07-30 14:16:12 -07:00
Andrew Brown
77cc2f69c1 machinst x64: allow use of vector-length types 2020-07-30 14:16:12 -07:00
Andrew Brown
dc6220b87c machinst x64: add uses for crate dependencies 2020-07-30 14:16:12 -07:00
Benjamin Bouvier
79abcdb035 machinst x64: add testing to the CI; 2020-07-30 10:32:00 +02:00
Anton Kirilov
adf25d27c2 AArch64: Implement SIMD floating-point arithmetic
Copyright (c) 2020, Arm Limited.
2020-07-28 15:19:47 +01:00
Benjamin Bouvier
7f109a5198 machinst x64: use a sign-extension when loading jump table offsets;
The jump table offset that's loaded out of the jump table could be
signed (if it's an offset to before the jump table itself), so we should
use a signed extension there, not an unsigned extension.
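
The difference can be sketched in Rust, assuming a 32-bit table entry added to a 64-bit table base (the function names are hypothetical and this is not the backend's code):

```rust
/// Computes a jump target by sign-extending the 32-bit table entry,
/// so an offset pointing before the table stays negative.
pub fn target_sign_extended(table_base: u64, entry: u32) -> u64 {
    table_base.wrapping_add(entry as i32 as i64 as u64)
}

/// Computes a jump target by zero-extending the entry instead.
/// A negative offset then becomes a large positive one.
pub fn target_zero_extended(table_base: u64, entry: u32) -> u64 {
    table_base.wrapping_add(u64::from(entry))
}
```

For a table at `0x1000` and an entry encoding `-16`, the sign-extended form lands 16 bytes before the table, while the zero-extended form lands roughly 4 GiB past it.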
2020-07-28 12:29:49 +02:00
Chris Fallin
8fd92093a4 Merge pull request #2061 from cfallin/aarch64-amode
Aarch64 codegen quality: support more general add+extend address computations.
2020-07-27 13:48:55 -07:00
Chris Fallin
f9b98f0ddc Aarch64 codegen quality: support more general add+extend computations.
Previously, our pattern-matching for generating load/store addresses was
somewhat limited. For example, it could not use a register-extend
address mode to handle the following CLIF:

```
   v2760 = uextend.i64 v985
   v2761 = load.i64 notrap aligned readonly v1
   v1018 = iadd v2761, v2760
   store v1017, v1018
```

This PR adds more general support for address expressions made up of
additions and extensions. In particular, it pattern-matches a tree of
64-bit `iadd`s, optionally with `uextend`/`sextend` from 32-bit values
at the leaves, to collect the list of all addends that form the address.
It also collects all offsets at leaves, combining them.
It applies a series of heuristics to make the best use of the
available addressing modes, filling the load/store itself with as many
64-bit registers, zero/sign-extended 32-bit registers, and/or an offset,
then computing the rest with add instructions as necessary. It attempts
to make use of immediate forms (add-immediate or subtract-immediate)
whenever possible, and also uses the built-in extend operators on add
instructions when possible. There are certainly cases where this is not
optimal (i.e., does not generate the strictly shortest sequence of
instructions), but it should be good enough for most code.
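
The collection step can be sketched as a walk over a small expression tree. The `Expr` and `Addends` types below are hypothetical stand-ins for CLIF values, and the sketch covers only the gathering of addends, not the heuristics that assign them to addressing modes.

```rust
/// A toy address expression: 64-bit adds with registers, extended
/// 32-bit registers, or constants at the leaves.
pub enum Expr {
    Reg64(u32),
    UExtend32(u32),
    SExtend32(u32),
    Const(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// Everything that must be summed to form the address.
#[derive(Debug, Default, PartialEq)]
pub struct Addends {
    pub regs64: Vec<u32>,
    pub uext32: Vec<u32>,
    pub sext32: Vec<u32>,
    pub offset: i64,
}

/// Walks the tree of adds, recording each leaf and folding all
/// constant leaves into a single combined offset.
pub fn collect(expr: &Expr, out: &mut Addends) {
    match expr {
        Expr::Reg64(r) => out.regs64.push(*r),
        Expr::UExtend32(r) => out.uext32.push(*r),
        Expr::SExtend32(r) => out.sext32.push(*r),
        Expr::Const(c) => out.offset = out.offset.wrapping_add(*c),
        Expr::Add(a, b) => {
            collect(a, out);
            collect(b, out);
        }
    }
}
```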

Using `perf stat` to measure instruction count (runtime only, on
wasmtime, after populating the cache to avoid measuring compilation),
this impacts `bz2` as follows:

```
pre:

       1006.410425      task-clock (msec)         #    1.000 CPUs utilized
               113      context-switches          #    0.112 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,036      page-faults               #    0.005 M/sec
     3,221,547,476      cycles                    #    3.201 GHz
     4,000,670,104      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,958,613      branch-misses

       1.006071348 seconds time elapsed

post:

        963.499525      task-clock (msec)         #    0.997 CPUs utilized
               117      context-switches          #    0.121 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,081      page-faults               #    0.005 M/sec
     3,039,687,673      cycles                    #    3.155 GHz
     3,837,761,690      instructions              #    1.26  insn per cycle
   <not supported>      branches
        28,254,585      branch-misses

       0.966072682 seconds time elapsed
```

In other words, this reduces instruction count by 4.1% on `bz2`.
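
The percentage follows from the two `instructions` counters in the `perf stat` output. A small helper (the name `percent_reduction` is hypothetical) makes the arithmetic explicit:

```rust
/// Returns the relative reduction from `pre` to `post`, in percent.
pub fn percent_reduction(pre: u64, post: u64) -> f64 {
    (pre as f64 - post as f64) / pre as f64 * 100.0
}
```

Calling `percent_reduction(4_000_670_104, 3_837_761_690)` gives a value of about 4.07, which rounds to the 4.1% quoted.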
2020-07-27 13:10:50 -07:00
Chris Fallin
bad99c93b1 Merge pull request #2051 from cfallin/aarch64-add-negative-imm
Aarch64 codegen quality: handle add-negative-imm as subtract.
2020-07-24 12:26:54 -07:00
Chris Fallin
1b80860f1f Aarch64 codegen quality: handle add-negative-imm as subtract.
We often see patterns like:

```
    mov w2, #0xffff_ffff   // uses ORR with logical immediate form
    add w0, w1, w2
```

which is just `w0 := w1 - 1`. It would be much better to recognize when
the negation of an immediate will fit in a 12-bit immediate field if the
immediate itself does not, and flip add to subtract (and vice versa), so
we can instead generate:

```
    sub w0, w1, #1
```

We see this pattern in e.g. `bz2`, where this commit makes the following
difference (counting instructions with `perf stat`, filling in the
wasmtime cache first then running again to get just runtime):

pre:

```
        992.762250      task-clock (msec)         #    0.998 CPUs utilized
               109      context-switches          #    0.110 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,035      page-faults               #    0.005 M/sec
     3,224,119,134      cycles                    #    3.248 GHz
     4,000,521,171      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,573,755      branch-misses

       0.995072322 seconds time elapsed
```

post:

```
        993.853850      task-clock (msec)         #    0.998 CPUs utilized
               123      context-switches          #    0.124 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,072      page-faults               #    0.005 M/sec
     3,201,278,337      cycles                    #    3.221 GHz
     3,917,061,340      instructions              #    1.22  insn per cycle
   <not supported>      branches
        28,410,633      branch-misses

       0.996008047 seconds time elapsed
```

In other words, a 2.1% reduction in instruction count on `bz2`.
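
The add-to-subtract flip can be sketched as follows, assuming a 32-bit operation and only the plain 12-bit unsigned immediate form (the shifted-immediate form is left out). The `AluImm` type and `select_add_imm32` function are hypothetical and are not the lowering code itself.

```rust
/// How to materialize `x + imm` for a 32-bit add.
#[derive(Debug, PartialEq)]
pub enum AluImm {
    /// Use add-immediate with this value.
    Add(u32),
    /// Use subtract-immediate with this value.
    Sub(u32),
    /// Neither form fits; the constant must go into a register first.
    NeedsRegister,
}

/// Picks add-immediate when `imm` fits in 12 bits, subtract-immediate
/// when its two's-complement negation fits, and a register otherwise.
pub fn select_add_imm32(imm: u32) -> AluImm {
    const MAX_IMM12: u32 = 0xfff;
    let negated = imm.wrapping_neg();
    if imm <= MAX_IMM12 {
        AluImm::Add(imm)
    } else if negated <= MAX_IMM12 {
        AluImm::Sub(negated)
    } else {
        AluImm::NeedsRegister
    }
}
```

For the example in this message, `select_add_imm32(0xffff_ffff)` yields `AluImm::Sub(1)`.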
2020-07-24 11:41:33 -07:00
Benjamin Bouvier
35d9ab19b7 Review fixes; 2020-07-24 19:29:12 +02:00
Benjamin Bouvier
ad4a2f919f machinst x64: implement support for reference types; 2020-07-24 19:29:12 +02:00
Benjamin Bouvier
987c616bf5 machinst x64: implement support for dynamic heaps and explicit bound checks; 2020-07-24 19:29:12 +02:00
Benjamin Bouvier
2e3ad3227d machinst x64: fix encoding of movzx/movsx with non-ABCD input registers;
Using an input register that doesn't belong to the ABCD family (al,
etc.) as the source of movsx/movzx requires a redundant REX prefix, which
was not emitted.
2020-07-24 19:29:12 +02:00
Benjamin Bouvier
de4923356a machinst x64: fix fcmp comparison for NotEqual;
We have to emit both checks against the parity bit (for unordered) and
non-equality bit (for equality), otherwise this returns false when
comparing NaN against itself.
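
A small model of the flags shows why both checks are needed. It assumes the usual behaviour of a scalar unordered compare, which sets both the zero and parity flags when either operand is NaN. The `Flags` type and the function names are hypothetical.

```rust
/// The two flags relevant to a NotEqual test after a float compare.
pub struct Flags {
    pub zf: bool,
    pub pf: bool,
}

/// Models the flag results of an unordered scalar float compare.
pub fn compare(a: f64, b: f64) -> Flags {
    if a.is_nan() || b.is_nan() {
        // Unordered: zero and parity are both set.
        Flags { zf: true, pf: true }
    } else {
        Flags { zf: a == b, pf: false }
    }
}

/// NotEqual using only the zero flag: wrong for unordered inputs.
pub fn not_equal_zf_only(f: &Flags) -> bool {
    !f.zf
}

/// NotEqual using both flags: true when unordered or when not equal.
pub fn not_equal_with_parity(f: &Flags) -> bool {
    f.pf || !f.zf
}
```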
2020-07-24 19:29:12 +02:00
Benjamin Bouvier
4b26f5b120 machinst x64: baldrdash: fix multi-value when both gpr and xmm are returned;
In baldrdash, only the first return value may live in a register, be it
an integer or a floating point value.
2020-07-24 19:29:12 +02:00
Benjamin Bouvier
aa103698d4 machinst x64: extend Copysign to work for f64 inputs too; 2020-07-24 19:29:12 +02:00
Benjamin Bouvier
694af3aec2 machinst x64: implement float Floor/Ceil/Trunc/Nearest as VM calls; 2020-07-24 19:29:12 +02:00
Benjamin Bouvier
48ec806a9d machinst x64: implement Fabs/Fneg in terms of other instructions; 2020-07-24 19:29:12 +02:00
Benjamin Bouvier
03b9e1e86a machinst x64: implement float min/max with the right semantics; 2020-07-24 19:29:12 +02:00
Benjamin Bouvier
e43310a088 machinst x64: adapt conversions for saturation behaviors; 2020-07-24 19:29:12 +02:00
Benjamin Bouvier
cd54f05efd machinst x64: implement float-to-int and int-to-float conversions; 2020-07-24 19:29:12 +02:00
Chris Fallin
2b9fefe89a Add timing for several new-backend stages.
This PR adds a bit more granularity to the output of e.g. `clif-util
compile -T`, indicating how much time is spent in VCode lowering and
various other new-backend-specific tasks.
2020-07-23 09:54:39 -07:00
Chris Fallin
87eb4392c4 Merge pull request #2063 from jgouly/vselect
arm64: Implement Vselect opcode
2020-07-22 13:35:46 -07:00
Chris Fallin
44ef8247a9 Merge pull request #2062 from akirilov-arm/extract_lane
AArch64: Improve code generation for Extractlane + Sextend / Uextend
2020-07-22 13:35:00 -07:00
Chris Fallin
d22cefd220 Merge pull request #2058 from cfallin/aarch64-fix-bool
Aarch64 codegen: represent bool `true` as -1, not 1.
2020-07-22 13:16:12 -07:00
Chris Fallin
b8f6d53a6b Aarch64 codegen: represent bool true as -1, not 1.
It seems that this is actually the correct behavior for bool types wider
than `b1`; some of the vector instruction optimizations depend on bool
lanes representing false and true as all-zeroes and all-ones
respectively. For `b8`..`b64`, this results in an extra negation after a
`cset` when a bool is produced by an `icmp`/`fcmp`, but the most common
case (`b1`) is unaffected, because an all-ones one-bit value is just
`1`.

An example of this assumption can be seen here:

399ee0a54c/cranelift/codegen/src/simple_preopt.rs (L956)

Thanks to Joey Gouly of ARM for noting this issue while implementing
SIMD support, and digging into the source (finding the above example) to
determine the correct behavior.
2020-07-22 12:30:55 -07:00
Joey Gouly
5355c3e3d5 arm64: Implement Vselect opcode
This is implemented the same as Bitselect, as the controlling vector
is a boolean vector. A boolean vector in cranelift has elements
that are either 0 or all 1s, so it can be used to select elements
lane wise.
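
As a scalar illustration (hypothetical helper names, four 32-bit lanes assumed), a bitwise select driven by lanes that are either all zeroes or all ones behaves as a per-lane select:

```rust
/// Takes bits from `if_true` where `mask` is 1 and from `if_false` where it is 0.
pub fn bitselect(mask: u32, if_true: u32, if_false: u32) -> u32 {
    (mask & if_true) | (!mask & if_false)
}

/// Applies `bitselect` to each lane. When every lane of `cond` is
/// 0 or all ones, this picks whole lanes from `a` or `b`.
pub fn vselect4(cond: [u32; 4], a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    let mut out = [0u32; 4];
    for i in 0..4 {
        out[i] = bitselect(cond[i], a[i], b[i]);
    }
    out
}
```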

Copyright (c) 2020, Arm Limited.
2020-07-22 12:50:29 +01:00
Anton Kirilov
420c4f06b8 AArch64: Improve code generation for Extractlane + Sextend / Uextend
Copyright (c) 2020, Arm Limited.
2020-07-22 11:47:51 +01:00
Yury Delendik
399ee0a54c Serialize and deserialize compilation artifacts. (#2020)
* Serialize and deserialize Module
* Use bincode to serialize
* Add wasm_module_serialize; docs
* Simple tests
2020-07-21 15:05:50 -05:00
Chris Fallin
96ef2f1a1b Fix u8::MAX -> std::u8::MAX. (#2047)
As per Carlo Kok on Zulip #cranelift, this breaks builds with stable
Rust pre-1.43, as `u8::MAX` was only stabilized then. We'd like to
support older versions if we can easily do so.

This PR also adds `cranelift-tools` to the crates checked on CI with
Rust 1.41.0, which pulls in all backends (including `aarch64`).
2020-07-20 14:59:15 -05:00
Chris Fallin
784e2f1480 Merge pull request #2038 from jgouly/arith2
arm64: Enable arith2 tests
2020-07-20 09:00:10 -07:00
Chris Fallin
1b3b2dbfd0 Merge pull request #2043 from cfallin/csel-opt
Aarch64: handle csel with icmp/fcmp source without materializing the bool.
2020-07-18 19:33:47 -07:00
Chris Fallin
ea894c0eeb Merge pull request #2042 from cfallin/aarch64-fix-regshift-mask
Aarch64: mask shift-amounts incorporated into reg-reg-shift ALU insts.
2020-07-18 19:33:35 -07:00
Chris Fallin
21dac670f0 Aarch64: handle csel with icmp/fcmp source without materializing the bool.
Previously, we simply compared the input bool to 0, which forced the
value into a register (usually via a cmp and cset), zero-extended it,
etc. This patch performs the same pattern-matching that branches do to
directly perform the cmp and use its flag results with the csel.

On the `bz2` benchmark, the runtime is affected as follows (measuring
with `perf stat`, using wasmtime with its cache enabled, and taking the
second run after the first compiles and populates the cache):

pre:

       1117.232000      task-clock (msec)         #    1.000 CPUs utilized
               133      context-switches          #    0.119 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,041      page-faults               #    0.005 M/sec
     3,511,615,100      cycles                    #    3.143 GHz
     4,272,427,772      instructions              #    1.22  insn per cycle
   <not supported>      branches
        27,980,906      branch-misses

       1.117299838 seconds time elapsed

post:

       1003.738075      task-clock (msec)         #    1.000 CPUs utilized
               121      context-switches          #    0.121 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,052      page-faults               #    0.005 M/sec
     3,224,875,393      cycles                    #    3.213 GHz
     4,000,838,686      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,928,232      branch-misses

       1.003440004 seconds time elapsed

In other words, with this change, on `bz2`, we see a 6.4% reduction in
executed instructions.
2020-07-17 21:10:21 -07:00
Nick Fitzgerald
ee5982fd16 peepmatic: Be generic over the operator type
This lets us avoid the cost of `cranelift_codegen::ir::Opcode` to
`peepmatic_runtime::Operator` conversion overhead, and paves the way for
allowing Peepmatic to support non-clif optimizations (e.g. vcode optimizations).

Rather than defining our own `peepmatic::Operator` type like we used to, now the
whole `peepmatic` crate is effectively generic over a `TOperator` type
parameter. For the Cranelift integration, we use `cranelift_codegen::ir::Opcode`
as the concrete type for our `TOperator` type parameter. For testing, we also
define a `TestOperator` type, so that we can test Peepmatic code without
building all of Cranelift, and we can keep them somewhat isolated from each
other.

The methods that `peepmatic::Operator` had are now translated into trait bounds
on the `TOperator` type. These traits need to be shared between all of
`peepmatic`, `peepmatic-runtime`, and `cranelift-codegen`'s Peepmatic
integration. Therefore, these new traits live in a new crate:
`peepmatic-traits`. This crate acts as a header file of sorts for shared
trait/type/macro definitions.

Additionally, the `peepmatic-runtime` crate no longer depends on the
`peepmatic-macro` procedural macro crate, which should lead to faster build
times for Cranelift when it is using pre-built peephole optimizers.
2020-07-17 16:16:49 -07:00
Chris Fallin
9bd9c628aa Aarch64: mask shift-amounts incorporated into reg-reg-shift ALU insts.
We had previously fixed a bug in which constant shift amounts should be
masked modulo the number of bits in the operand; however, we did not
fix the analogous case for shifts incorporated into the second register
argument of ALU instructions that support integrated shifts.  This
failure to mask resulted in illegal instructions being generated, e.g.
in https://bugzilla.mozilla.org/show_bug.cgi?id=1653502. This PR fixes
the issue by masking the amount, as the shift semantics require.
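
The masking itself is a one-liner. A sketch, assuming the operand width is a power of two (the function name is hypothetical):

```rust
/// Reduces a shift amount modulo the operand width in bits.
/// `operand_bits` is assumed to be a power of two, such as 32 or 64.
pub fn mask_shift_amount(amount: u64, operand_bits: u32) -> u8 {
    debug_assert!(operand_bits.is_power_of_two());
    (amount & u64::from(operand_bits - 1)) as u8
}
```

A shift by 33 on a 32-bit operand therefore becomes a shift by 1, which fits the shift field of the instruction encoding.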
2020-07-17 14:55:23 -07:00
Nick Fitzgerald
ae95ad8733 cranelift: Don't build peepmatic-based optimizations in build.rs
Instead, when the `rebuild-peephole-optimizers` feature is enabled, rebuild them
the first time they are used. This allows peepmatic to run when Cranelift's
`Opcode` is defined and available, which paves the way forward for:

* merging `peepmatic_runtime::operator::Operator` and Cranelift's `Opcode` (we
  are wasting a bunch of cycles converting between the two of them), and

* supporting vcode optimizations in `peepmatic`.
2020-07-17 14:35:16 -07:00
Johnnie Birch
a7cedf3100 Add support for 32 bit and 64 bit fcmp for the new backend
Implements commiss and commisd.
2020-07-17 13:46:54 -07:00
Nick Fitzgerald
8dd4ab2f1e Merge pull request #2022 from MaxGraey/peepmatic-bnot
peepmatic: Add bnot operation
2020-07-17 09:39:38 -07:00