Commit Graph

112 Commits

Alex Crichton
1532516a36 Use relative call instructions between wasm functions (#3275)
* Use relative `call` instructions between wasm functions

This commit is a relatively major change to the way that Wasmtime
generates code for Wasm modules and how functions call each other.
Prior to this commit, all calls between functions, even if they
were defined in the same module, were done indirectly through a
register. To implement this the backend would emit an absolute 8-byte
relocation near all function calls, load that address into a register,
and then call it. While this technique is simple to implement and easy
to get right, it has two primary downsides:

* Function calls are always indirect, which means they are more difficult
  to predict, resulting in worse performance.

* Generating a relocation-per-function call requires expensive
  relocation resolution at module-load time, which can be a large
  contributing factor to how long it takes to load a precompiled module.

To fix these issues, while also somewhat compromising on the previously
simple implementation technique, this commit switches wasm calls within
a module to use the `colocated` flag (in Cranelift-speak), which
basically means that a relative call instruction is used with a
relocation that's resolved relative to the PC of the call instruction
itself.
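
As a minimal sketch (assuming the cranelift-codegen API of this era;
the namespace/index values are hypothetical), this is roughly what the
flag looks like when declaring a function reference:

```rust
use cranelift_codegen::ir::{types, AbiParam, ExtFuncData, ExternalName, FuncRef, Function, Signature};
use cranelift_codegen::isa::CallConv;

fn declare_colocated_callee(func: &mut Function) -> FuncRef {
    let mut sig = Signature::new(CallConv::Fast);
    sig.returns.push(AbiParam::new(types::I32));
    let sigref = func.import_signature(sig);
    func.import_function(ExtFuncData {
        name: ExternalName::user(0, 10), // hypothetical namespace/index
        signature: sigref,
        colocated: true, // same-module callee: eligible for a relative `call`
    })
}
```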

With the `colocated` flag switched to `true`, this commit is also able
to move much of the relocation resolution from `wasmtime_jit::link`
into `wasmtime_cranelift::obj` at object-construction time. This
frontloads all relocation work, which means that there are actually no
relocations related to function calls in the final image, solving both
of the points above.
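
A hedged, self-contained sketch of that frontloaded work (the function
name and layout are illustrative, not the real `wasmtime_cranelift::obj`
code): patch a 26-bit PC-relative call immediate while the text section
is being built, so nothing remains to resolve at load time.

```rust
fn resolve_call26(text: &mut [u8], reloc_at: usize, target: usize) {
    let delta = target as i64 - reloc_at as i64; // byte distance from the call
    assert_eq!(delta & 3, 0, "targets are 4-byte aligned");
    let words = delta >> 2;
    // Must fit the signed 26-bit, word-scaled immediate; otherwise a veneer
    // is needed (see below).
    assert!((-(1 << 25)..(1 << 25)).contains(&words), "out of range");
    let mut insn = u32::from_le_bytes(text[reloc_at..reloc_at + 4].try_into().unwrap());
    insn = (insn & 0xfc00_0000) | ((words as u32) & 0x03ff_ffff); // keep opcode bits
    text[reloc_at..reloc_at + 4].copy_from_slice(&insn.to_le_bytes());
}
```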

The main gotcha in implementing this technique is that there are
hardware limitations on relative function calls which mean we can't
blindly use them. AArch64, for example, can only go +/- 64 MB
from the `bl` instruction to the target, which means that if the
function we're calling is farther away than that, we would fail to
resolve the relocation. On x86_64 the limit is +/- 2GB, which is much
larger but theoretically still feasible to hit. Consequently the main
increase in implementation complexity comes from fixing this issue.

This issue is actually already present in Cranelift itself, and is
internally one of the invariants handled by the `MachBuffer` type: when
generating a function, relative jumps between basic blocks have similar
restrictions. This commit adds new methods to the `MachBackend` trait
and updates the implementation of `MachBuffer` to account for all these
new branches. Specifically the changes to `MachBuffer` are:

* For AArch64 the `LabelUse::Branch26` value now supports veneers, and
  AArch64 calls use this to resolve relocations.

* The `emit_island` function has been rewritten internally to handle
  some cases which didn't come up before, such as:

  * When emitting an island the deadline is now recalculated, where
    previously it was always set infinitely far in the future. This was
    ok before since only `Branch19` fixups supported veneers, and once
    promoted no further veneers were supported, so without multiple
    layers of promotion the lack of a new deadline was ok.

  * When emitting an island all pending fixups had veneers forced if
    their branch target wasn't known yet. This was generally ok for
    19-bit fixups since the only kind getting a veneer was a 19-bit
    fixup, but with mixed kinds it's a bit odd to force veneers for a
    26-bit fixup just because a nearby 19-bit fixup needed a veneer.
    Instead fixups are now re-enqueued unless they're known to be
    out-of-bounds. This may run the risk of generating more islands for
    19-bit branches but it should also reduce the number of islands for
    between-function calls.

  * Otherwise the internal logic was tweaked to ideally be a bit
    simpler, but that's a pretty subjective criterion in compilers... (a
    self-contained sketch of the deadline/re-enqueue idea follows
    below).
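
As a hedged, self-contained model of that deadline/re-enqueue behavior
(all names are hypothetical stand-ins for `MachBuffer` internals, not
its real API):

```rust
#[derive(Clone, Copy)]
struct Fixup {
    at: u32,             // offset of the branch being fixed up
    target: Option<u32>, // resolved target offset, if known yet
    reach: u32,          // how far this branch kind reaches (e.g. 19- vs 26-bit)
}

fn process_island(cur: u32, pending: Vec<Fixup>) -> (Vec<Fixup>, u32) {
    let mut still_pending = Vec::new();
    let mut deadline = u32::MAX; // recalculated, not left "infinitely in the future"
    for f in pending {
        match f.target {
            // Known and in range: patched directly (elided here).
            Some(t) if t.abs_diff(f.at) <= f.reach => {}
            // Unknown target but not yet out of reach: re-enqueue rather than
            // forcing a veneer just because another fixup kind needed one.
            None if cur - f.at < f.reach => {
                deadline = deadline.min(f.at + f.reach);
                still_pending.push(f);
            }
            // Otherwise: emit a veneer and retarget the fixup at it (elided).
            _ => {}
        }
    }
    (still_pending, deadline) // the next island must be emitted before `deadline`
}
```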

I've added some simple testing of this for now. A synthetic compiler
option was created to simply add padding zeros between functions, and
test cases implement various forms of calls that at least need veneers.
A test is also included for x86_64, but it is unfortunately pretty slow
because it requires generating 2GB of output. I'm hoping for now it's
not too bad, but we can disable the test if it's prohibitive; otherwise
comments mark the necessary portions so the ignored test is re-run
whenever these parts of the code change.

The final result of this commit is that, for a large module I'm
working with, the number of relocations dropped to zero, meaning that
nothing actually needs to be done to the text section when it's loaded
into memory (yay!). I haven't run final benchmarks yet, but this is the
last remaining source of significant slowdown when loading modules,
after I land a number of other PRs, both active ones and ones that I
only have locally for now.

* Fix arm32

* Review comments
2021-09-01 13:27:38 -05:00
Sam Parker
cbb7229457 Re-implement atomic loads and stores
The AArch64 support was a bit broken and was using Armv7-style
barriers, which aren't required with Armv8 acquire-release
load/stores.

The fallback CAS and RMW loops for AArch64 have also been updated
to use acquire-release exclusive instructions which, again, removes
the need for barriers. The CAS loop has also been further optimised
by using the extending form of the cmp instruction.
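
An illustrative Rust-level view of the ordering involved (a sketch, not
the backend code): acquire-release atomics on AArch64 lower to exclusive
load/store instructions whose ordering comes from the instructions
themselves, with no standalone barriers.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

fn fetch_add_acqrel(x: &AtomicU64, v: u64) -> u64 {
    // Without Armv8.1 LSE this compiles to an exclusive load/store retry
    // loop (LDAXR/STLXR-style) rather than plain accesses plus DMB barriers.
    x.fetch_add(v, Ordering::AcqRel)
}
```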

Copyright (c) 2021, Arm Limited.
2021-08-05 09:08:08 +01:00
Sam Parker
3bc2f0c701 Enable simd_X_extadd_pairwise_X for AArch64
Lower to [u|s]addlp for AArch64.
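
A hedged scalar model of the pairwise widening add these instructions
compute (the i16x8 -> i32x4 shape shown; the function name is just for
illustration):

```rust
fn extadd_pairwise_i16x8_s(x: [i16; 8]) -> [i32; 4] {
    // Widen each adjacent pair of lanes and add them, as saddlp does.
    core::array::from_fn(|i| x[2 * i] as i32 + x[2 * i + 1] as i32)
}
```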

Copyright (c) 2021, Arm Limited.
2021-08-03 10:25:09 +01:00
Sam Parker
f2806a9192 rebase and ran cargo fmt
Copyright (c) 2021, Arm Limited.
2021-07-28 13:14:20 +01:00
Sam Parker
541a4ee428 Enable simd_extmul_* for AArch64
Lower simd_extmul_[low|high][signed|unsigned] by [s|u]widening the
inputs to an imul node.
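
A hedged scalar model of one variant (extmul_low_i16x8_s): sign-extend
the low four lanes of each input, then multiply, i.e. widen feeding a
multiply.

```rust
fn extmul_low_i16x8_s(a: [i16; 8], b: [i16; 8]) -> [i32; 4] {
    core::array::from_fn(|i| a[i] as i32 * b[i] as i32)
}
```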

Copyright (c) 2021, Arm Limited.
2021-07-28 13:14:20 +01:00
Anton Kirilov
6c3d7092b9 Enable the simd_conversions test for AArch64
Copyright (c) 2021, Arm Limited.
2021-07-16 22:04:45 +01:00
Anton Kirilov
330f02aa09 Enable the simd_i32x4_trunc_sat_f64x2 test for AArch64
Also, reorganize the AArch64-specific VCode instructions for unary
narrowing and widening vector operations, so that they are more
straightforward to use.

Copyright (c) 2021, Arm Limited.
2021-06-30 12:17:53 +01:00
Anton Kirilov
98f1ac789e Enable the simd_i16x8_q15mulr_sat_s test on AArch64
Copyright (c) 2021, Arm Limited.
2021-06-28 12:24:31 +01:00
Afonso Bordado
b8ad99e435 aarch64: Implement TLS ELF GD Relocations
Implement the `TlsValue` opcode in the aarch64 backend for ELF_GD.

This is a little bit unusual, as in other compilers the default TLS mechanism for aarch64 is TLS descriptors.
However, currently we only recognize elf_gd, so let's start with that as a TLS implementation.
2021-06-24 12:21:44 +01:00
Chris Fallin
3d56728b86 Merge pull request #2975 from afonso360/aarch64-icmp
aarch64: Implement lowering i128 icmp instructions
2021-06-09 15:38:41 -07:00
Afonso Bordado
4d085d8fbf aarch64: Add sbcs instruction encodings 2021-06-09 22:56:39 +01:00
Afonso Bordado
61f07d79a7 aarch64: Add adcs instruction encodings 2021-06-09 22:56:39 +01:00
Afonso Bordado
2c4d1c0003 aarch64: Add ands instruction encoding 2021-06-09 22:38:01 +01:00
Afonso Bordado
a2e74b2c45 aarch64: Implement isub for i128 operands 2021-05-22 21:51:41 +01:00
Afonso Bordado
d3b525fa29 aarch64: Implement iadd for i128 operands 2021-05-22 21:21:44 +01:00
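
A hedged scalar model of what the flag-setting encodings above enable
for the i128 `iadd`/`isub` lowerings: ADDS sets the carry from the low
halves and ADC folds it into the high halves (SUBS/SBCS are analogous
for subtraction).

```rust
fn iadd_i128(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    let (lo, carry) = a.0.overflowing_add(b.0); // ADDS: low halves, set carry
    let hi = a.1.wrapping_add(b.1).wrapping_add(carry as u64); // ADC: high halves + carry
    (lo, hi)
}
```
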
Anton Kirilov
480670e17f Enable the simd_boolean test for AArch64
Also, enable the simd_i64x2_arith2 test because it doesn't need
any code changes.

Copyright (c) 2021, Arm Limited.
2021-04-27 20:19:51 +01:00
Anton Kirilov
7248abd591 Cranelift AArch64: Improve the handling of callee-saved registers
SIMD & FP registers are now saved and restored in pairs, similarly
to general-purpose registers. Also, only the bottom 64 bits of the
registers are saved and restored (in the case of non-Baldrdash ABIs),
as required by the Procedure Call Standard for the Arm 64-bit
Architecture.

As for the callee-saved general-purpose registers, if a procedure
needs to save and restore an odd number of them, it no longer uses
store and load pair instructions for the last register.

Copyright (c) 2021, Arm Limited.
2021-04-13 20:23:08 +01:00
Anton Kirilov
07c27039b1 Cranelift AArch64: Add initial support for the Armv8.1 atomics
This commit enables Cranelift's AArch64 backend to generate code
for instruction set extensions (previously only the base Armv8-A
architecture was supported); also, it makes it possible to detect
the extensions supported by the host when JIT compiling. The new
functionality is applied to the IR instruction `AtomicCas`.
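
A sketch of the detection half in spirit (this is not Cranelift's actual
API; it assumes Rust's standard feature-detection macro, where "lse" is
the Armv8.1 atomics extension that lets `AtomicCas` lower to a single
CAS instruction):

```rust
fn host_supports_lse() -> bool {
    #[cfg(target_arch = "aarch64")]
    {
        std::arch::is_aarch64_feature_detected!("lse")
    }
    #[cfg(not(target_arch = "aarch64"))]
    {
        false
    }
}
```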

Copyright (c) 2021, Arm Limited.
2021-03-13 02:31:51 +00:00
Chris Fallin
2d5db92a9e Rework/simplify unwind infrastructure and implement Windows unwind.
Our previous implementation of unwind infrastructure was somewhat
complex and brittle: it parsed generated instructions in order to
reverse-engineer unwind info from prologues. It also relied on some
fragile linkage to communicate instruction-layout information that VCode
was not designed to provide.

A much simpler, more reliable, and easier-to-reason-about approach is to
embed unwind directives as pseudo-instructions in the prologue as we
generate it. That way, we can say what we mean and just emit it
directly.
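
To make that concrete, here's a hedged sketch of the directive shapes
(variant names illustrative, loosely modeled on Cranelift's unwind
support rather than copied from it):

```rust
enum UnwindInst {
    // "The frame pointer and return address were just pushed."
    PushFrameRegs { offset_upward_to_caller_sp: u32 },
    // "The fixed part of the frame is now fully allocated."
    DefineNewFrame { offset_downward_to_clobbers: u32 },
    // "Register `reg` was saved at this offset in the clobber area."
    SaveReg { clobber_offset: u32, reg: u8 },
}
```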

The usual reasoning that leads to the reverse-engineering approach is
that metadata is hard to keep in sync across optimization passes; but
here, (i) prologues are generated at the very end of the pipeline, and
(ii) if we ever do a post-prologue-gen optimization, we can treat unwind
directives as black boxes with unknown side-effects, just as we do for
some other pseudo-instructions today.

It turns out that it was easier to just build this for both x64 and
aarch64 (since they share a factored-out ABI implementation), and wire
up the platform-specific unwind-info generation for Windows and SystemV.
Now we have simpler unwind on all platforms and we can delete the old
unwind infra as soon as we remove the old backend.

There were a few consequences to supporting Fastcall unwind in
particular that led to a refactor of the common ABI. Windows only
supports naming clobbered-register save locations within 240 bytes of
the frame-pointer register, whatever one chooses that to be (RSP or
RBP). We had previously saved clobbers below the fixed frame (and below
nominal-SP). The 240-byte range has to include the old RBP too, so we're
forced to place clobbers at the top of the frame, just below saved
RBP/RIP. This is fine; we always keep a frame pointer anyway because we
use it to refer to stack args. It does mean that offsets of fixed-frame
slots (spillslots, stackslots) from RBP are no longer known before we do
regalloc, so if we ever want to index these off of RBP rather than
nominal-SP because we add support for `alloca` (dynamic frame growth),
then we'll need a "nominal-BP" mode that is resolved after regalloc and
clobber-save code is generated. I added a comment to this effect in
`abi_impl.rs`.

The above refactor touched both x64 and aarch64 because of shared code.
This had a further effect in that the old aarch64 prologue generation
subtracted from `sp` once to allocate space, then used stores to `[sp,
offset]` to save clobbers. Unfortunately the offset only has 7-bit
range, so if there are enough clobbered registers (and there can be --
aarch64 has 384 bytes of registers; at least one unit test hits this)
the stores/loads will be out-of-range. I really don't want to synthesize
large-offset sequences here; better to go back to the simpler
pre-index/post-index `stp r1, r2, [sp, #-16]!` form that works just like
a "push". It's likely not much worse microarchitecturally (dependence
chain on SP, but oh well) and it actually saves an instruction if
there's no other frame to allocate. As a further advantage, it's much
simpler to understand; simpler is usually better.

This PR adds the new backend on Windows to CI as well.
2021-03-11 20:03:52 -08:00
Kasey Carrothers
99be82c866 Replace MachInst::gen_zero_len_nop with gen_nop(0) 2021-01-29 01:15:08 -08:00
Chris Fallin
c84d6be6f4 Detailed debug-info (DWARF) support in new backends (initially x64).
This PR propagates "value labels" all the way from CLIF to DWARF
metadata on the emitted machine code. The key idea is as follows:

- Translate value-label metadata on the input into "value_label"
  pseudo-instructions when lowering into VCode. These
  pseudo-instructions take a register as input, denote a value label,
  and semantically are like a "move into value label" -- i.e., they
  update the current value (as seen by debugging tools) of the given
  local. These pseudo-instructions emit no machine code.

- Perform a dataflow analysis *at the machine-code level*, tracking
  value-labels that propagate into registers and into [SP+constant]
  stack storage. This is a forward dataflow fixpoint analysis where each
  storage location can contain a *set* of value labels, and each value
  label can reside in a *set* of storage locations. (The meet function is
  pairwise intersection by storage location; see the sketch after this
  list.)

  This analysis traces value labels symbolically through loads and
  stores and reg-to-reg moves, so it will naturally handle spills and
  reloads without knowing anything special about them.

- When this analysis converges, we have, at each machine-code offset, a
  mapping from value labels to some number of storage locations; for
  each offset for each label, we choose the best location (prefer
  registers). Note that we can choose any location, as the symbolic
  dataflow analysis is sound and guarantees that the value at the
  value_label instruction propagates to all of the named locations.

- Then we can convert this mapping into a format that the DWARF
  generation code (wasmtime's debug crate) can use.
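
A minimal sketch of that meet function under the representation above
(all type names hypothetical): per storage location, intersect the sets
of value labels flowing in from each predecessor.

```rust
use std::collections::{HashMap, HashSet};

type Loc = u32;   // a register or SP+offset slot, abstractly
type Label = u32; // a value label
type State = HashMap<Loc, HashSet<Label>>;

fn meet(a: &State, b: &State) -> State {
    let mut out = State::new();
    for (loc, la) in a {
        if let Some(lb) = b.get(loc) {
            // Keep only labels that both predecessors agree reside here.
            let inter: HashSet<Label> = la.intersection(lb).copied().collect();
            if !inter.is_empty() {
                out.insert(*loc, inter);
            }
        }
    }
    out
}
```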

This PR also adds the new-backend variant to the gdb tests on CI.
2021-01-21 15:59:49 -08:00
Anton Kirilov
043a8434d2 Cranelift AArch64: Improve the Popcnt implementation
Now the backend uses the CNT instruction, which results in a major
simplification.

Copyright (c) 2021, Arm Limited.
2021-01-19 16:49:47 +00:00
Chris Fallin
6eea015d6c Multi-register value support: framework for Values wider than machine regs.
This will allow support for `I128` values everywhere, and `I64`
values on 32-bit targets (e.g., ARM32 and x86-32). It does not alter the
machine backends to build such support; it just adds the framework for
the MachInst backends to *reason* about a `Value` residing in more than
one register.
2021-01-05 17:45:02 -08:00
Anton Kirilov
f59b274d22 Cranelift AArch64: Further vector constant improvements
Introduce support for MOVI/MVNI with 16-, 32-, and 64-bit elements,
and the vector variant of FMOV.

Copyright (c) 2020, Arm Limited.
2020-12-03 15:30:24 +00:00
Chris Fallin
d413b907b4 Merge pull request #2414 from jgouly/extend-refactor
arm64: Refactor Inst::Extend handling
2020-11-25 17:22:07 -08:00
Alex Crichton
4d64c68b05 Run rustfmt 1.48
Run rustfmt over wasmtime with the new stable release which looks like
it wants to reformat a few lines.
2020-11-19 11:12:30 -08:00
Chris Fallin
073c727a74 x64 and aarch64: carry MemFlags on loads/stores; don't emit trap info unless an op can trap.
This end result was previously enacted by carrying a `SourceLoc` on
every load/store, which was somewhat cumbersome, and only indirectly
encoded metadata about a memory reference (can it trap) by its presence
or absence. We have a type for this -- `MemFlags` -- that tells us
everything we might want to know about a load or store, and we should
plumb it through to code emission instead.

This PR attaches a `MemFlags` to an `Amode` on x64, and puts it on load
and store `Inst` variants on aarch64. These two choices seem to factor
things out in the nicest way: there are relatively few load/store insts
on aarch64 but many addressing modes, while the opposite is true on x64.
2020-11-17 11:43:06 -08:00
Chris Fallin
712ff22492 AArch64 SIMD: pattern-match load+splat into LD1R instruction. 2020-11-16 15:59:28 -08:00
Joey Gouly
70cbc4ca7c arm64: Refactor Inst::Extend handling
This refactors the handling of Inst::Extend and simplifies the lowering
of Bextend and Bmask, which allows the use of SBFX instructions for
extensions from 1-bit booleans. Other extensions use aliases of BFM,
and the code was changed to reflect that, rather than hard-coding bit
patterns. Also, ImmLogic is now implemented, so another hard-coded
instruction can be removed.

As part of looking at boolean handling, `normalize_boolean_result` was
changed to `materialize_boolean_result`, such that it can use either
CSET or CSETM. Using CSETM saves an instruction (previously CSET + SUB)
for booleans bigger than 1-bit.

Copyright (c) 2020, Arm Limited.
2020-11-13 16:17:25 +00:00
Anton Kirilov
edaada3f57 Cranelift AArch64: Various small fixes
* Use FMOV to move 64-bit FP registers and SIMD vectors.
* Add support for additional vector load types.
* Fix the printing of Inst::LoadAddr.

Copyright (c) 2020, Arm Limited.
2020-11-12 13:54:05 +00:00
Julian Seward
41e87a2f99 Support wasm select instruction with V128-typed operands on AArch64.
* this requires upgrading to wasmparser 0.67.0.

* There are no CLIF side changes because the CLIF `select` instruction is
  polymorphic enough.

* on aarch64, there is unfortunately no conditional-move (csel) instruction on
  vectors.  This patch adds a synthetic instruction `VecCSel` which *does*
  behave like that.  At emit time, this is emitted as an if-then-else diamond
  (4 insns).

* aarch64 implementation is otherwise straightforward.
2020-11-11 18:45:24 +01:00
Chris Fallin
4dce51096d MachInst backends: handle SourceLocs out-of-band, not in Insts.
In existing MachInst backends, many instructions -- any that can trap or
result in a relocation -- carry `SourceLoc` values in order to propagate
the location-in-original-source to use to describe resulting traps or
relocation errors.

This is quite tedious, and also error-prone: it is likely that the
necessary plumbing will be missed in some cases, and in any case, it's
unnecessarily verbose.

This PR factors out the `SourceLoc` handling so that it is tracked
during emission as part of the `EmitState`, and plumbed through
automatically by the machine-independent framework. Instruction emission
code that directly emits trap or relocation records can query the
current location as necessary. Then we only need to ensure that memory
references and trap instructions, at their (one) emission point rather
than their (many) lowering/generation points, are wired up correctly.
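
A minimal sketch of the idea (names hypothetical, not the real
`EmitState` definition): the framework sets the current location before
each instruction, and only the few emission points that record traps or
relocations query it.

```rust
#[derive(Clone, Copy, Default)]
struct SourceLoc(u32);

#[derive(Default)]
struct EmitState {
    cur_srcloc: SourceLoc,
}

impl EmitState {
    fn pre_inst(&mut self, loc: SourceLoc) {
        self.cur_srcloc = loc; // set by the machine-independent framework
    }
    fn cur_srcloc(&self) -> SourceLoc {
        self.cur_srcloc // queried when emitting a trap or relocation record
    }
}
```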

This does have the side-effect that some loads and stores that do not
correspond directly to user code's heap accesses will have unnecessary
but harmless trap metadata. For example, the load that fetches a code
offset from a jump table will have a 'heap out of bounds' trap record
attached to it; but because it is bounds-checked, and will never
actually trap if the lowering is correct, this should be harmless.  The
simplicity improvement here seemed more worthwhile to me than plumbing
through a "corresponds to user-level load/store" bit, because the latter
is a bit complex when we allow for op merging.

Closes #2290: though it does not implement a full "metadata" scheme as
described in that issue, this seems simpler overall.
2020-11-10 15:46:53 -08:00
Julian Seward
dd9bfcefaa CL/aarch64: implement the wasm SIMD v128.load{32,64}_zero instructions.
This patch implements, for aarch64, the following wasm SIMD extensions.

  v128.load32_zero and v128.load64_zero instructions
  https://github.com/WebAssembly/simd/pull/237

The changes are straightforward:

* no new CLIF instructions.  They are translated into an existing CLIF scalar
  load followed by a CLIF `scalar_to_vector`.

* the comment/specification for CLIF `scalar_to_vector` has been changed to
  match the actual intended semantics, per consultation with Andrew Brown.

* translation from `scalar_to_vector` to aarch64 `fmov` instruction.  This
  has been generalised slightly so as to allow both 32- and 64-bit transfers.

* special-case zero in `lower_constant_f128` in order to avoid a
  potentially slow call to `Inst::load_fp_constant128`.

* Once "Allow loads to merge into other operations during instruction
  selection in MachInst backends"
  (https://github.com/bytecodealliance/wasmtime/issues/2340) lands,
  we can use that functionality to pattern match the two-CLIF pair and
  emit a single AArch64 instruction.

* A simple filetest has been added.
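
As a hedged scalar model of the instruction semantics above (a sketch,
not the lowering itself): one 32-bit load into lane 0 with the remaining
lanes zeroed, which is what the scalar load + `scalar_to_vector` pairing
expresses.

```rust
fn load32_zero(mem: [u8; 4]) -> [u32; 4] {
    [u32::from_le_bytes(mem), 0, 0, 0]
}
```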

There is no comprehensive testcase in this commit, because the relevant
test suite lives in a separate repo.  The implementation has been tested,
nevertheless.
2020-11-04 20:00:04 +01:00
Chris Fallin
75e02276be Merge pull request #2360 from jgouly/reg_map
aarch64: Fix aarch64_map_regs for FpuRRI
2020-11-04 10:37:50 -08:00
Joey Gouly
0223cb2f8c aarch64: Fix aarch64_map_regs for FpuRRI
This has been wrong since I added it in 02c3f238f.

Copyright (c) 2020, Arm Limited.
2020-11-04 16:53:25 +00:00
Julian Seward
5a5fb11979 CL/aarch64: implement the wasm SIMD i32x4.dot_i16x8_s instruction
This patch implements, for aarch64, the following wasm SIMD extensions

  i32x4.dot_i16x8_s instruction
  https://github.com/WebAssembly/simd/pull/127

It also updates dependencies as follows, in order that the new instruction can
be parsed, decoded, etc:

  wat          to  1.0.27
  wast         to  26.0.1
  wasmparser   to  0.65.0
  wasmprinter  to  0.2.12

The changes are straightforward:

* new CLIF instruction `widening_pairwise_dot_product_s`

* translation from wasm into `widening_pairwise_dot_product_s`

* new AArch64 instructions `smull`, `smull2` (part of the `VecRRR` group)

* translation from `widening_pairwise_dot_product_s` to `smull ; smull2 ; addv`
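
A hedged scalar model of `widening_pairwise_dot_product_s` (what the
instruction sequence above computes, lane-wise): multiply adjacent pairs
of sign-extended 16-bit lanes and sum each pair into a 32-bit lane.

```rust
fn dot_i16x8_s(a: [i16; 8], b: [i16; 8]) -> [i32; 4] {
    core::array::from_fn(|i| {
        a[2 * i] as i32 * b[2 * i] as i32 + a[2 * i + 1] as i32 * b[2 * i + 1] as i32
    })
}
```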

There is no testcase in this commit, because the relevant test suite lives in
a separate repo.  The implementation has been tested, nevertheless.
2020-11-03 14:25:04 +01:00
Anton Kirilov
207779fe1d Cranelift AArch64: Improve code generation for vector constants
In particular, introduce initial support for the MOVI and MVNI
instructions, with 8-bit elements. Also, treat vector constants
as 32- or 64-bit floating-point numbers, if their value allows
it, by relying on the architectural zero extension. Finally,
stop generating literal loads for 32-bit constants.

Copyright (c) 2020, Arm Limited.
2020-10-30 13:16:12 +00:00
Chris Fallin
c35904a8bf Merge pull request #2278 from akirilov-arm/load_splat
Introduce the Cranelift IR instruction `LoadSplat`
2020-10-28 12:54:03 -07:00
Julian Seward
c15d9bd61b CL/aarch64: implement the wasm SIMD pseudo-max/min and FP-rounding instructions
This patch implements, for aarch64, the following wasm SIMD extensions

  Floating-point rounding instructions
  https://github.com/WebAssembly/simd/pull/232

  Pseudo-Minimum and Pseudo-Maximum instructions
  https://github.com/WebAssembly/simd/pull/122

The changes are straightforward:

* `build.rs`: the relevant tests have been enabled

* `cranelift/codegen/meta/src/shared/instructions.rs`: new CLIF instructions
  `fmin_pseudo` and `fmax_pseudo`.  The wasm rounding instructions do not need
  any new CLIF instructions.

* `cranelift/wasm/src/code_translator.rs`: translation into CLIF; this is
  pretty much the same as any other unary or binary vector instruction (for
  the rounding and the pmin/max respectively)

* `cranelift/codegen/src/isa/aarch64/lower_inst.rs`:
  - `fmin_pseudo` and `fmax_pseudo` are converted into a two-instruction
    sequence, `fcmpgt` followed by `bsl` (modeled in the sketch below)
  - the CLIF rounding instructions are converted to a suitable vector
    `frint{n,z,p,m}` instruction.

* `cranelift/codegen/src/isa/aarch64/inst/mod.rs`: minor extension of `pub
  enum VecMisc2` to handle the rounding operations.  And corresponding `emit`
  cases.
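
A hedged scalar model of the pseudo-min/max semantics (lane-wise; pmin
is `b < a ? b : a` and pmax is `a < b ? b : a`, matching the
`fcmpgt` + `bsl` lowering noted above):

```rust
fn pmin_f32(a: f32, b: f32) -> f32 { if b < a { b } else { a } }
fn pmax_f32(a: f32, b: f32) -> f32 { if a < b { b } else { a } }
```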
2020-10-26 10:37:07 +01:00
Yury Delendik
de4af90af6 machinst x64: New backend unwind (#2266)
Addresses unwinding for the experimental x64 backend. The preliminary code enables backtraces with the SystemV calling convention.
2020-10-23 15:19:41 -05:00
Julian Seward
2702942050 CL/aarch64 back end: implement the wasm SIMD bitmask instructions
The `bitmask.{8x16,16x8,32x4}` instructions do not map neatly to any single
AArch64 SIMD instruction, and instead need a sequence of around ten
instructions.  Because of this, this patch is somewhat longer and more complex
than it would be for (e.g.) x64.

Main changes are:

* the relevant testsuite test (`simd_boolean.wast`) has been enabled on aarch64.

* at the CLIF level, add a new instruction `vhigh_bits`, into which these wasm
  instructions are to be translated.

* in the wasm->CLIF translation (code_translator.rs), translate into
  `vhigh_bits`.  This is straightforward.

* in the CLIF->AArch64 translation (lower_inst.rs), translate `vhigh_bits`
  into equivalent sequences of AArch64 instructions.  There is a different
  sequence for each of the `{8x16, 16x8, 32x4}` variants.

All other changes are AArch64-specific, and add instruction definitions needed
by the previous step:

* Add two new families of AArch64 instructions: `VecShiftImm` (vector shift by
  immediate) and `VecExtract` (effectively a double-length vector shift)

* To the existing AArch64 family `VecRRR`, add a `zip1` variant.  To the
  `VecLanesOp` family add an `addv` variant.

* Add supporting code for the above changes to AArch64 instructions:
  - getting the register uses (`aarch64_get_regs`)
  - mapping the registers (`aarch64_map_regs`)
  - printing instructions
  - emitting instructions (`impl MachInstEmit for Inst`).  The handling of
    `VecShiftImm` is a bit complex.
  - emission tests for new instructions and variants.
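
A hedged scalar model of the `vhigh_bits` semantics described above, for
the 8x16 case: gather each lane's sign bit into one small integer (the
wasm `bitmask` result).

```rust
fn vhigh_bits_8x16(v: [i8; 16]) -> u16 {
    v.iter()
        .enumerate()
        .fold(0u16, |m, (i, &lane)| m | ((((lane as u8) >> 7) as u16) << i))
}
```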
2020-10-23 05:26:25 +02:00
Anton Kirilov
e0b911a4df Introduce the Cranelift IR instruction LoadSplat
It corresponds to WebAssembly's `load*_splat` operations, which
were previously represented as a combination of `Load` and `Splat`
instructions. However, there are architectures such as Armv8-A
that have a single machine instruction equivalent to the Wasm
operations. In order to generate it, it is necessary to merge the
`Load` and the `Splat` in the backend, which is not possible
because the load may have side effects. The new IR instruction
works around this limitation.

The AArch64 backend leverages the new instruction to improve code
generation.
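
A hedged scalar model of the new instruction's semantics: a single load
broadcast to every lane, which LD1R performs in one machine instruction.

```rust
fn load32_splat(mem: [u8; 4]) -> [u32; 4] {
    [u32::from_le_bytes(mem); 4]
}
```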

Copyright (c) 2020, Arm Limited.
2020-10-14 13:07:13 +01:00
Chris Fallin
71768bb6cf Fix AArch64 ABI to respect half-caller-save, half-callee-save vec regs.
This PR updates the AArch64 ABI implementation so that it (i) properly
respects that v8-v15 inclusive have callee-save lower halves, and
caller-save upper halves, by conservatively approximating (to full
registers) in the appropriate directions when generating prologue
caller-saves and when informing the regalloc of clobbered regs across
callsites.

In order to prevent saving all of these vector registers in the prologue
of every non-leaf function due to the above approximation, this also
makes use of a new regalloc.rs feature to exclude call instructions'
writes from the clobber set returned by register allocation. This is
safe whenever the caller and callee have the same ABI (because anything
the callee could clobber, the caller is allowed to clobber as well
without saving it in the prologue).

Fixes #2254.
2020-10-06 14:44:02 -07:00
Anton Kirilov
f612e8e7b2 AArch64: Add various missing SIMD bits
In addition, improve the code for stack pointer manipulation.

Copyright (c) 2020, Arm Limited.
2020-09-09 13:37:50 +01:00
Chris Fallin
3d6c4d312f Merge pull request #2187 from akirilov-arm/ALUOp3
AArch64: Introduce an enum for ternary integer operations
2020-09-08 12:57:59 -07:00
Anton Kirilov
e92f949663 AArch64: Introduce an enum for ternary integer operations
This commit performs a small cleanup in the AArch64 backend - after
the MAdd and MSub variants have been extracted, the ALUOp enum is
used purely for binary integer operations.

Also, Inst::Mov has been renamed to Inst::Mov64 for consistency.

Copyright (c) 2020, Arm Limited.
2020-09-08 13:22:22 +01:00
Joey Gouly
650d48cd84 arm64: Don't always materialise a 64-bit constant
This improves the mov/movk/movn sequence when the high half of the
64-bit value is all zero.

Copyright (c) 2020, Arm Limited.
2020-09-01 13:29:01 +01:00
Julian Seward
620e4b4e82 This patch fills in the missing pieces needed to support wasm atomics on newBE/x64.
It does this by providing an implementation of the CLIF instructions `AtomicRmw`, `AtomicCas`,
`AtomicLoad`, `AtomicStore` and `Fence`.

The translation is straightforward.  `AtomicCas` is translated into x64 `cmpxchg`, `AtomicLoad`
becomes a normal load because x64-TSO provides adequate sequencing, `AtomicStore` becomes a
normal store followed by `mfence`, and `Fence` becomes `mfence`.  `AtomicRmw` is the only
complex case: it becomes a normal load, followed by a loop which computes an updated value,
tries to `cmpxchg` it back to memory, and repeats if necessary.

This is a minimum-effort initial implementation.  `AtomicRmw` could be implemented more
efficiently using LOCK-prefixed integer read-modify-write instructions in the case where the old
value in memory is not required.  Subsequent work could add that, if required.

The x64 emitter has been updated to emit the new instructions, obviously.  The `LegacyPrefix`
mechanism has been revised to handle multiple prefix bytes, not just one, since it is now
sometimes necessary to emit both 0x66 (Operand Size Override) and 0xF0 (Lock).

In the aarch64 implementation of atomics, there has been some minor renaming for the sake of
clarity, and for consistency with this x64 implementation.
2020-08-24 11:50:06 +02:00
Anton Kirilov
b895ac0e40 AArch64: Implement SIMD conversions
Copyright (c) 2020, Arm Limited.
2020-08-21 18:03:50 +01:00
Joey Gouly
a518c10141 arm64: Implement SIMD i64x2 multiply
Copyright (c) 2020, Arm Limited.
2020-08-20 13:26:03 +01:00