wasmtime

Author	SHA1	Message	Date
Ulrich Weigand	d02ae3940c	machinst ABI: Allow back-end to define stack alignment The common gen_prologue code currently assumes that the stack pointer has to be aligned to twice the word size. While this is true for many ABIs, it does not hold universally. This patch adds a new callback stack_align that back-ends can provide to define the specific stack alignment required by the ABI on that platform.	2020-11-03 09:43:55 +01:00
Anton Kirilov	207779fe1d	Cranelift AArch64: Improve code generation for vector constants In particular, introduce initial support for the MOVI and MVNI instructions, with 8-bit elements. Also, treat vector constants as 32- or 64-bit floating-point numbers, if their value allows it, by relying on the architectural zero extension. Finally, stop generating literal loads for 32-bit constants. Copyright (c) 2020, Arm Limited.	2020-10-30 13:16:12 +00:00
Chris Fallin	c35904a8bf	Merge pull request #2278 from akirilov-arm/load_splat Introduce the Cranelift IR instruction `LoadSplat`	2020-10-28 12:54:03 -07:00
Julian Seward	c15d9bd61b	CL/aarch64: implement the wasm SIMD pseudo-max/min and FP-rounding instructions This patch implements, for aarch64, the following wasm SIMD extensions Floating-point rounding instructions https://github.com/WebAssembly/simd/pull/232 Pseudo-Minimum and Pseudo-Maximum instructions https://github.com/WebAssembly/simd/pull/122 The changes are straightforward: * `build.rs`: the relevant tests have been enabled * `cranelift/codegen/meta/src/shared/instructions.rs`: new CLIF instructions `fmin_pseudo` and `fmax_pseudo`. The wasm rounding instructions do not need any new CLIF instructions. * `cranelift/wasm/src/code_translator.rs`: translation into CLIF; this is pretty much the same as any other unary or binary vector instruction (for the rounding and the pmin/max respectively) * `cranelift/codegen/src/isa/aarch64/lower_inst.rs`: - `fmin_pseudo` and `fmax_pseudo` are converted into a two instruction sequence, `fcmpgt` followed by `bsl` - the CLIF rounding instructions are converted to a suitable vector `frint{n,z,p,m}` instruction. * `cranelift/codegen/src/isa/aarch64/inst/mod.rs`: minor extension of `pub enum VecMisc2` to handle the rounding operations. And corresponding `emit` cases.	2020-10-26 10:37:07 +01:00
Yury Delendik	de4af90af6	machinst x64: New backend unwind (#2266) Addresses unwind for experimental x64 backend. The preliminary code enables backtrace on the SystemV calling convention.	2020-10-23 15:19:41 -05:00
Julian Seward	2702942050	CL/aarch64 back end: implement the wasm SIMD `bitmask` instructions The `bitmask.{8x16,16x8,32x4}` instructions do not map neatly to any single AArch64 SIMD instruction, and instead need a sequence of around ten instructions. Because of this, this patch is somewhat longer and more complex than it would be for (eg) x64. Main changes are: * the relevant testsuite test (`simd_boolean.wast`) has been enabled on aarch64. * at the CLIF level, add a new instruction `vhigh_bits`, into which these wasm instructions are to be translated. * in the wasm->CLIF translation (code_translator.rs), translate into `vhigh_bits`. This is straightforward. * in the CLIF->AArch64 translation (lower_inst.rs), translate `vhigh_bits` into equivalent sequences of AArch64 instructions. There is a different sequence for each of the `{8x16, 16x8, 32x4}` variants. All other changes are AArch64-specific, and add instruction definitions needed by the previous step: * Add two new families of AArch64 instructions: `VecShiftImm` (vector shift by immediate) and `VecExtract` (effectively a double-length vector shift) * To the existing AArch64 family `VecRRR`, add a `zip1` variant. To the `VecLanesOp` family add an `addv` variant. * Add supporting code for the above changes to AArch64 instructions: - getting the register uses (`aarch64_get_regs`) - mapping the registers (`aarch64_map_regs`) - printing instructions - emitting instructions (`impl MachInstEmit for Inst`). The handling of `VecShiftImm` is a bit complex. - emission tests for new instructions and variants.	2020-10-23 05:26:25 +02:00
Anton Kirilov	e0b911a4df	Introduce the Cranelift IR instruction `LoadSplat` It corresponds to WebAssembly's `load*_splat` operations, which were previously represented as a combination of `Load` and `Splat` instructions. However, there are architectures such as Armv8-A that have a single machine instruction equivalent to the Wasm operations. In order to generate it, it is necessary to merge the `Load` and the `Splat` in the backend, which is not possible because the load may have side effects. The new IR instruction works around this limitation. The AArch64 backend leverages the new instruction to improve code generation. Copyright (c) 2020, Arm Limited.	2020-10-14 13:07:13 +01:00
Benjamin Bouvier	c5bbc87498	machinst: allow passing constant information to the instruction emitter; A new associated type Info is added to MachInstEmit, which is the immutable counterpart to State. It can't easily be constructed from an ABICallee, since that would require adding an associated type to the latter, and doing so leaks the associated type in a lot of places in the code base and makes the code harder to read. Instead, the EmitInfo state can simply be passed to the `Vcode::emit` function directly.	2020-10-08 09:21:51 +02:00
Andrew Brown	ce44719e1f	refactor: change LowerCtx::get_immediate to return a DataValue This change abstracts away (from the perspective of the new backend) how immediate values are stored in InstructionData. It gathers large immediates from necessary places (e.g. constant pool) and delegates to `InstructionData::imm_value` for the rest. This refactor only touches original users of `LowerCtx::get_immediate` but a future change could do the same for any place the new backend is accessing InstructionData directly to retrieve immediates.	2020-10-07 12:17:17 -07:00
Chris Fallin	71768bb6cf	Fix AArch64 ABI to respect half-caller-save, half-callee-save vec regs. This PR updates the AArch64 ABI implementation so that it (i) properly respects that v8-v15 inclusive have callee-save lower halves, and caller-save upper halves, by conservatively approximating (to full registers) in the appropriate directions when generating prologue caller-saves and when informing the regalloc of clobbered regs across callsites. In order to prevent saving all of these vector registers in the prologue of every non-leaf function due to the above approximation, this also makes use of a new regalloc.rs feature to exclude call instructions' writes from the clobber set returned by register allocation. This is safe whenever the caller and callee have the same ABI (because anything the callee could clobber, the caller is allowed to clobber as well without saving it in the prologue). Fixes #2254.	2020-10-06 14:44:02 -07:00
Joey Gouly	eec60c9b06	arm64: Use SignedOffset rather than PreIndexed addressing mode for callee-saved registers This also passes `fixed_frame_storage_size` (previously `total_sp_adjust`) into `gen_clobber_save` so that it can be combined with other stack adjustments. Copyright (c) 2020, Arm Limited.	2020-10-02 16:22:55 +01:00
Chris Fallin	835db11bea	Support for SpiderMonkey's "Wasm ABI 2020". As part of a Wasm JIT update, SpiderMonkey is changing its internal WebAssembly function ABI. The new ABI's frame format includes "caller TLS" and "callee TLS" slots. The details of where these come from are not important; from Cranelift's point of view, the only relevant requirement is that we have two on-stack args that are always present (offsetting other on-stack args), and that we define special argument purposes so that we can supply values for these slots. Note that this adds a new ABI (a variant of the Baldrdash ABI) because we do not want to tightly couple the landing of this PR to the landing of the changes in SpiderMonkey; it's better if both the old and new behavior remain available in Cranelift, so SpiderMonkey can continue to vendor Cranelift even if it does not land (or backs out) the ABI change. Furthermore, note that this needs to be a Cranelift-level change (i.e. cannot be done purely from the translator environment implementation) because the special TLS arguments must always go on the stack, which would not otherwise happen with the usual argument-placement logic; and there is no primitive to push a value directly in CLIF code (the notion of a stack frame is a lower-level concept).	2020-09-30 14:55:56 -07:00
Jakub Krauz	f6a140a662	arm32 codegen This commit adds arm32 code generation for some IR insts. Floating-point instructions are not supported, because regalloc cannot represent overlapping register classes, which are needed by VFP/Neon. There is also no support for big-endianness or for the I64 and I128 types.	2020-09-22 12:49:42 +02:00
Chris Fallin	1c7fa7f785	Merge pull request #2181 from jgouly/madd-opt arm64: Combine mul + add into madd	2020-09-15 11:52:33 -07:00
Joey Gouly	22369cfa0d	arm64: Combine mul + add into madd Copyright (c) 2020, Arm Limited.	2020-09-11 18:06:19 +01:00
Benjamin Bouvier	d9052d0a9c	machinst x64: generate copies of constants during lowering;	2020-09-11 17:41:44 +02:00
Chris Fallin	bd3ba0a774	Merge pull request #2189 from bnjbvr/x64-refactor-sub machinst x64: a few small refactorings/renamings	2020-09-09 12:40:59 -07:00
Benjamin Bouvier	7a833f442a	machinst: common up some instruction data helpers;	2020-09-09 18:03:59 +02:00
Benjamin Bouvier	a835c247c0	machinst: make get_output_reg target independent;	2020-09-09 18:03:59 +02:00
Anton Kirilov	f612e8e7b2	AArch64: Add various missing SIMD bits In addition, improve the code for stack pointer manipulation. Copyright (c) 2020, Arm Limited.	2020-09-09 13:37:50 +01:00
Chris Fallin	e8f772c1ac	x64 new backend: port ABI implementation to shared infrastructure with AArch64. Previously, in #2128, we factored out a common "vanilla 64-bit ABI" implementation from the AArch64 ABI code, with the idea that this should be largely compatible with x64. This PR alters the new x64 backend to make use of the shared infrastructure, removing the duplication that existed previously. The generated code is nearly (not exactly) the same; the only difference relates to how the clobber-save region is padded in the prologue. This also changes some register allocations in the aarch64 code because call support in the shared ABI infra now passes a temp vreg in, rather than requiring use of a fixed, non-allocable temp; tests have been updated, and the runtime behavior is unchanged.	2020-09-08 17:59:01 -07:00
Chris Fallin	3d6c4d312f	Merge pull request #2187 from akirilov-arm/ALUOp3 AArch64: Introduce an enum for ternary integer operations	2020-09-08 12:57:59 -07:00
Chris Fallin	e913bcb26a	Merge pull request #2179 from jgouly/mvn arm64: Don't always materialise a 64-bit constant	2020-09-08 09:17:08 -07:00
bjorn3	9999913a31	Fix sign extension Co-authored-by: Max Graey <maxgraey@gmail.com>	2020-09-08 15:00:24 +02:00
bjorn3	cc35f1e9bb	x64: Misc small integer fixes	2020-09-08 15:00:24 +02:00
bjorn3	74642b166f	x64: Implement ineg and bnot	2020-09-08 15:00:24 +02:00
Anton Kirilov	e92f949663	AArch64: Introduce an enum for ternary integer operations This commit performs a small cleanup in the AArch64 backend - after the MAdd and MSub variants have been extracted, the ALUOp enum is used purely for binary integer operations. Also, Inst::Mov has been renamed to Inst::Mov64 for consistency. Copyright (c) 2020, Arm Limited.	2020-09-08 13:22:22 +01:00
Joey Gouly	650d48cd84	arm64: Don't always materialise a 64-bit constant This improves the mov/movk/movn sequence when the high half of the 64-bit value is all zero. Copyright (c) 2020, Arm Limited.	2020-09-01 13:29:01 +01:00
Benjamin Bouvier	a7f7c23bf9	machinst aarch64: in baldrdash, allow returning only one value across register classes; Baldrdash's API requires that there is at most one result in a register, across all the possible register classes: in particular, it's not possible to return an i64 value in a register while returning a v128 value in another register. This patch adds a notion of "remaining register values", so this is properly taken into account when choosing whether a return value may be put into a register or not.	2020-08-31 12:36:26 +02:00
Julian Seward	620e4b4e82	This patch fills in the missing pieces needed to support wasm atomics on newBE/x64. It does this by providing an implementation of the CLIF instructions `AtomicRmw`, `AtomicCas`, `AtomicLoad`, `AtomicStore` and `Fence`. The translation is straightforward. `AtomicCas` is translated into x64 `cmpxchg`, `AtomicLoad` becomes a normal load because x64-TSO provides adequate sequencing, `AtomicStore` becomes a normal store followed by `mfence`, and `Fence` becomes `mfence`. `AtomicRmw` is the only complex case: it becomes a normal load, followed by a loop which computes an updated value, tries to `cmpxchg` it back to memory, and repeats if necessary. This is a minimum-effort initial implementation. `AtomicRmw` could be implemented more efficiently using LOCK-prefixed integer read-modify-write instructions in the case where the old value in memory is not required. Subsequent work could add that, if required. The x64 emitter has been updated to emit the new instructions. The `LegacyPrefix` mechanism has been revised to handle multiple prefix bytes, not just one, since it is now sometimes necessary to emit both 0x66 (Operand Size Override) and 0xF0 (Lock). In the aarch64 implementation of atomics, there has been some minor renaming for the sake of clarity, and for consistency with this x64 implementation.	2020-08-24 11:50:06 +02:00
Anton Kirilov	b895ac0e40	AArch64: Implement SIMD conversions Copyright (c) 2020, Arm Limited.	2020-08-21 18:03:50 +01:00
Chris Fallin	debacec1c5	Merge pull request #2150 from jgouly/mul64s arm64: Implement SIMD i64x2 multiply	2020-08-20 11:57:56 -07:00
Chris Fallin	051feaad75	Merge pull request #2148 from bjorn3/aarch64_fix_put_input_in_rsa Fix put_input_in_reg	2020-08-20 11:41:35 -07:00
Joey Gouly	a518c10141	arm64: Implement SIMD i64x2 multiply Copyright (c) 2020, Arm Limited.	2020-08-20 13:26:03 +01:00
bjorn3	957eb9eeba	Less unnecessary zero and sign extensions	2020-08-20 10:17:04 +02:00
bjorn3	ba48b9aef1	Fix put_input_in_reg	2020-08-19 19:38:47 +02:00
bjorn3	4a84f3f073	Lower fcvt_from_{u,s}int for 8 and 16 bit ints	2020-08-19 18:07:12 +02:00
Chris Fallin	5cf3fba3da	Refactor AArch64 ABI support to extract common bits for shared impl with x64. We have observed that the ABI implementations for AArch64 and x64 are very similar; in fact, x64's implementation started as a modified copy of AArch64's implementation. This is an artifact of both a similar ABI (both machines pass args and return values in registers first, then the stack, and both machines give considerable freedom with stack-frame layout) and a too-low-level ABI abstraction in the existing design. For machines that fit the mainstream or most common ABI-design idioms, we should be able to do much better. This commit factors AArch64 into machine-specific and machine-independent parts, but does not yet modify x64; that will come next. This should be completely neutral with respect to compile time and generated code performance.	2020-08-14 16:27:39 -07:00
Nick Fitzgerald	05bf9ea3f3	Rename "Stackmap" to "StackMap" And "stackmap" to "stack_map". This commit is purely mechanical.	2020-08-07 10:08:44 -07:00
Anton Kirilov	1ec6930005	Enable the spec::simd::simd_lane test for AArch64 Copyright (c) 2020, Arm Limited.	2020-08-06 11:14:15 +01:00
Julian Seward	25e31739a6	Implement Wasm Atomics for Cranelift/newBE/aarch64. The implementation is pretty straightforward. Wasm atomic instructions fall into 5 groups * atomic read-modify-write * atomic compare-and-swap * atomic loads * atomic stores * fences and the implementation mirrors that structure, at both the CLIF and AArch64 levels. At the CLIF level, there are five new instructions, one for each group. Some comments about these: * for those that take addresses (all except fences), the address is contained entirely in a single `Value`; there is no offset field as there is with normal loads and stores. Wasm atomics require alignment checks, and removing the offset makes implementation of those checks a bit simpler. * atomic loads and stores get their own instructions, rather than reusing the existing load and store instructions, for two reasons: - per above comment, makes alignment checking simpler - reuse of existing loads and stores would require extension of `MemFlags` to indicate atomicity, which sounds semantically unclean. For example, then any instruction carrying `MemFlags` could be marked as atomic, even in cases where it is meaningless or ambiguous. * I tried to specify, in comments, the behaviour of these instructions as tightly as I could. Unfortunately there is no way (per my limited CLIF knowledge) to enforce the constraint that they may only be used on I8, I16, I32 and I64 types, and in particular not on floating point or vector types. The translation from Wasm to CLIF, in `code_translator.rs` is unremarkable. At the AArch64 level, there are also five new instructions, one for each group. All of them except `::Fence` contain multiple real machine instructions. Atomic r-m-w and atomic c-a-s are emitted as the usual load-linked store-conditional loops, guarded at both ends by memory fences. Atomic loads and stores are emitted as a load preceded by a fence, and a store followed by a fence, respectively. 
The amount of fencing may be overkill, but it reflects exactly what the SM Wasm baseline compiler for AArch64 does. One reason to implement r-m-w and c-a-s as a single insn which is expanded only at emission time is that we must be very careful what instructions we allow in between the load-linked and store-conditional. In particular, we cannot allow any extra memory transactions in there, since -- particularly on low-end hardware -- that might cause the transaction to fail, hence deadlocking the generated code. That implies that we can't present the LL/SC loop to the register allocator as its constituent instructions, since it might insert spills anywhere. Hence we must present it as a single indivisible unit, as we do here. It also has the benefit of reducing the total amount of work the RA has to do. The only other notable feature of the r-m-w and c-a-s translations into AArch64 code, is that they both need a scratch register internally. Rather than faking one up by claiming, in `get_regs` that it modifies an extra scratch register, and having to have a dummy initialisation of it, these new instructions (`::LLSC` and `::CAS`) simply use fixed registers in the range x24-x28. We rely on the RA's ability to coalesce V<-->R copies to make the cost of the resulting extra copies zero or almost zero. x24-x28 are chosen so as to be call-clobbered, hence their use is less likely to interfere with long live ranges that span calls. One subtlety regarding the use of completely fixed input and output registers is that we must be careful how the surrounding copy from/to of the arg/result registers is done. In particular, it is not safe to simply emit copies in some arbitrary order if one of the arg registers is a real reg. For that reason, the arguments are first moved into virtual regs if they are not already there, using a new method `<LowerCtx for Lower>::ensure_in_vreg`. Again, we rely on coalescing to turn them into no-ops in the common case. 
There is also a ridealong fix for the AArch64 lowering case for `Opcode::Trapif | Opcode::Trapff`, which removes a bug in which two trap insns in a row were generated. In the patch as submitted there are 6 "FIXME JRS" comments, which mark things which I believe to be correct, but for which I would appreciate a second opinion. Unless otherwise directed, I will remove them for the final commit but leave the associated code/comments unchanged.	2020-08-04 09:35:50 +02:00
Chris Fallin	9a9b5015d0	Merge pull request #2081 from cfallin/aarch64-baldrdash-fix Aarch64: fix narrow integer-register extension with Baldrdash ABI.	2020-07-31 12:13:38 -07:00
Chris Fallin	1fbdf169b5	Aarch64: fix narrow integer-register extension with Baldrdash ABI. In the Baldrdash (SpiderMonkey) embedding, we must take care to zero-extend all function arguments to callees in integer registers when the types are narrower than 64 bits. This is because, unlike the native SysV ABI, the Baldrdash ABI expects high bits to be cleared. Not doing so leads to difficult-to-trace errors where high bits falsely tag an int32 as e.g. an object pointer, leading to potential security issues.	2020-07-31 10:19:13 -07:00
Anton Kirilov	adf25d27c2	AArch64: Implement SIMD floating-point arithmetic Copyright (c) 2020, Arm Limited.	2020-07-28 15:19:47 +01:00
Chris Fallin	8fd92093a4	Merge pull request #2061 from cfallin/aarch64-amode Aarch64 codegen quality: support more general add+extend address computations.	2020-07-27 13:48:55 -07:00
Chris Fallin	f9b98f0ddc	Aarch64 codegen quality: support more general add+extend computations. Previously, our pattern-matching for generating load/store addresses was somewhat limited. For example, it could not use a register-extend address mode to handle the following CLIF: ``` v2760 = uextend.i64 v985 v2761 = load.i64 notrap aligned readonly v1 v1018 = iadd v2761, v2760 store v1017, v1018 ``` This PR adds more general support for address expressions made up of additions and extensions. In particular, it pattern-matches a tree of 64-bit `iadd`s, optionally with `uextend`/`sextend` from 32-bit values at the leaves, to collect the list of all addends that form the address. It also collects all offsets at leaves, combining them. It applies a series of heuristics to make the best use of the available addressing modes, filling the load/store itself with as many 64-bit registers, zero/sign-extended 32-bit registers, and/or an offset, then computing the rest with add instructions as necessary. It attempts to make use of immediate forms (add-immediate or subtract-immediate) whenever possible, and also uses the built-in extend operators on add instructions when possible. There are certainly cases where this is not optimal (i.e., does not generate the strictly shortest sequence of instructions), but it should be good enough for most code. 
Using `perf stat` to measure instruction count (runtime only, on wasmtime, after populating the cache to avoid measuring compilation), this impacts `bz2` as follows: ``` pre: 1006.410425 task-clock (msec) # 1.000 CPUs utilized 113 context-switches # 0.112 K/sec 1 cpu-migrations # 0.001 K/sec 5,036 page-faults # 0.005 M/sec 3,221,547,476 cycles # 3.201 GHz 4,000,670,104 instructions # 1.24 insn per cycle <not supported> branches 27,958,613 branch-misses 1.006071348 seconds time elapsed post: 963.499525 task-clock (msec) # 0.997 CPUs utilized 117 context-switches # 0.121 K/sec 0 cpu-migrations # 0.000 K/sec 5,081 page-faults # 0.005 M/sec 3,039,687,673 cycles # 3.155 GHz 3,837,761,690 instructions # 1.26 insn per cycle <not supported> branches 28,254,585 branch-misses 0.966072682 seconds time elapsed ``` In other words, this reduces instruction count by 4.1% on `bz2`.	2020-07-27 13:10:50 -07:00
Chris Fallin	bad99c93b1	Merge pull request #2051 from cfallin/aarch64-add-negative-imm Aarch64 codegen quality: handle add-negative-imm as subtract.	2020-07-24 12:26:54 -07:00
Chris Fallin	1b80860f1f	Aarch64 codegen quality: handle add-negative-imm as subtract. We often see patterns like: ``` mov w2, #0xffff_ffff // uses ORR with logical immediate form add w0, w1, w2 ``` which is just `w0 := w1 - 1`. It would be much better to recognize when the inverse of an immediate will fit in a 12-bit immediate field if the immediate itself does not, and flip add to subtract (and vice versa), so we can instead generate: ``` sub w0, w1, #1 ``` We see this pattern in e.g. `bz2`, where this commit makes the following difference (counting instructions with `perf stat`, filling in the wasmtime cache first then running again to get just runtime): pre: ``` 992.762250 task-clock (msec) # 0.998 CPUs utilized 109 context-switches # 0.110 K/sec 0 cpu-migrations # 0.000 K/sec 5,035 page-faults # 0.005 M/sec 3,224,119,134 cycles # 3.248 GHz 4,000,521,171 instructions # 1.24 insn per cycle <not supported> branches 27,573,755 branch-misses 0.995072322 seconds time elapsed ``` post: ``` 993.853850 task-clock (msec) # 0.998 CPUs utilized 123 context-switches # 0.124 K/sec 1 cpu-migrations # 0.001 K/sec 5,072 page-faults # 0.005 M/sec 3,201,278,337 cycles # 3.221 GHz 3,917,061,340 instructions # 1.22 insn per cycle <not supported> branches 28,410,633 branch-misses 0.996008047 seconds time elapsed ``` In other words, a 2.1% reduction in instruction count on `bz2`.	2020-07-24 11:41:33 -07:00
Benjamin Bouvier	35d9ab19b7	Review fixes;	2020-07-24 19:29:12 +02:00
Benjamin Bouvier	987c616bf5	machinst x64: implement support for dynamic heaps and explicit bound checks;	2020-07-24 19:29:12 +02:00
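
The add-negative-imm commit listed above (1b80860f1f) describes flipping an add into a subtract when the negated immediate fits the 12-bit immediate field but the immediate itself does not. The following is a minimal sketch of that decision only, not Cranelift's actual lowering code: the names `Imm12`, `Op`, `Lowered`, and `lower_add_sub_32` are hypothetical and chosen for illustration, and the optional shift-by-12 form is included on the assumption that the standard AArch64 arithmetic-immediate encoding applies.

```rust
/// Hypothetical model of an AArch64 arithmetic immediate:
/// 12 bits, optionally shifted left by 12.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Imm12 {
    bits: u16,
    shift12: bool,
}

impl Imm12 {
    /// Returns an encoding if `val` fits the immediate field, else `None`.
    fn maybe_from_u64(val: u64) -> Option<Imm12> {
        if val < 0x1000 {
            Some(Imm12 { bits: val as u16, shift12: false })
        } else if val < 0x100_0000 && (val & 0xfff) == 0 {
            Some(Imm12 { bits: (val >> 12) as u16, shift12: true })
        } else {
            None
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Op {
    Add,
    Sub,
}

/// Result of the (illustrative) lowering decision.
#[derive(Debug, PartialEq, Eq)]
enum Lowered {
    /// Use the immediate form of `op`.
    Imm { op: Op, imm: Imm12 },
    /// Fall back to materializing the constant in a register.
    Reg { op: Op },
}

/// Chooses a form for a 32-bit `op` with constant right-hand side `rhs`.
fn lower_add_sub_32(op: Op, rhs: u32) -> Lowered {
    // First choice: the constant itself fits.
    if let Some(imm) = Imm12::maybe_from_u64(rhs as u64) {
        return Lowered::Imm { op, imm };
    }
    // Second choice: the two's-complement negation fits, so flip the operation.
    let negated = rhs.wrapping_neg();
    if let Some(imm) = Imm12::maybe_from_u64(negated as u64) {
        let flipped = match op {
            Op::Add => Op::Sub,
            Op::Sub => Op::Add,
        };
        return Lowered::Imm { op: flipped, imm };
    }
    Lowered::Reg { op }
}

fn main() {
    // The commit's example: `add w0, w1, #0xffff_ffff` can become `sub w0, w1, #1`.
    println!("{:?}", lower_add_sub_32(Op::Add, 0xffff_ffff));
}
```

The flip is valid because 32-bit addition wraps: adding `0xffff_ffff` and subtracting `1` give the same result modulo 2^32.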

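The add+extend commit listed above (f9b98f0ddc) describes walking a tree of 64-bit `iadd`s, with optional `uextend`/`sextend` of 32-bit values at the leaves, to collect every addend and to combine constant offsets. A rough sketch of just that collection step follows, under stated assumptions: the `Expr` tree, the `Addends` struct, and `collect_addends` are hypothetical stand-ins and do not mirror Cranelift's data structures, and the later heuristics that pick an addressing mode are omitted.

```rust
/// Hypothetical address expression; registers are plain numeric ids.
#[derive(Debug)]
enum Expr {
    Reg64(u32),
    UExtend32(u32),
    SExtend32(u32),
    Const(i64),
    IAdd(Box<Expr>, Box<Expr>),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Ext {
    Zero,
    Sign,
}

/// Everything that must be summed to form the address.
#[derive(Debug, Default, PartialEq, Eq)]
struct Addends {
    regs64: Vec<u32>,
    regs32: Vec<(u32, Ext)>,
    offset: i64,
}

/// Recursively flattens the `iadd` tree into lists of addends,
/// folding all constant leaves into a single offset.
fn collect_addends(expr: &Expr, out: &mut Addends) {
    match expr {
        Expr::Reg64(r) => out.regs64.push(*r),
        Expr::UExtend32(r) => out.regs32.push((*r, Ext::Zero)),
        Expr::SExtend32(r) => out.regs32.push((*r, Ext::Sign)),
        Expr::Const(c) => out.offset = out.offset.wrapping_add(*c),
        Expr::IAdd(lhs, rhs) => {
            collect_addends(lhs, out);
            collect_addends(rhs, out);
        }
    }
}

fn main() {
    // (r1 + uextend(r2)) + (16 + -4)
    let addr = Expr::IAdd(
        Box::new(Expr::IAdd(
            Box::new(Expr::Reg64(1)),
            Box::new(Expr::UExtend32(2)),
        )),
        Box::new(Expr::IAdd(
            Box::new(Expr::Const(16)),
            Box::new(Expr::Const(-4)),
        )),
    );
    let mut addends = Addends::default();
    collect_addends(&addr, &mut addends);
    println!("{:?}", addends);
}
```

With the addends flattened this way, a later pass can fill the load/store addressing mode with whichever registers and offset it supports and compute the remainder with explicit adds, as the commit message describes.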

173 Commits