When storing an argument to a stack location for consumption by a
callee, or storing a return value to an on-stack return slot for
consumption by the caller, the ABI implementation properly extended the
value but then performed a store of only the original width. This fixes
the issue by always performing a 64-bit store of the extended value.
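As a rough illustration of the failure mode (plain Rust, with a byte
array standing in for the 8-byte stack slot; not the actual ABI code): a
store of only the original width leaves stale high bits in the slot,
while a full 64-bit store of the extended value does not.
```
fn main() {
    // Hypothetical 8-byte stack slot that still holds stale data.
    let mut slot = [0xAAu8; 8];

    let arg: u32 = 7;
    let extended: u64 = u64::from(arg); // value is correctly zero-extended

    // Buggy behavior: store only the original 32-bit width.
    slot[..4].copy_from_slice(&(extended as u32).to_le_bytes());
    let seen_by_consumer = u64::from_le_bytes(slot);
    assert_ne!(seen_by_consumer, extended); // high bits of the slot are stale

    // Fixed behavior: store the full 64-bit extended value.
    slot.copy_from_slice(&extended.to_le_bytes());
    assert_eq!(u64::from_le_bytes(slot), extended);
}
```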
Issue reported by @uweigand (thanks!).
In the Baldrdash (SpiderMonkey) embedding, we must take care to
zero-extend all integer-register function arguments narrower than 64
bits when passing them to callees. This is because, unlike the native
SysV ABI, the Baldrdash ABI expects the high bits to be cleared. Not
doing so leads to difficult-to-trace errors in which stale high bits
falsely tag an int32 as, e.g., an object pointer, creating potential
security issues.
Previously, our pattern-matching for generating load/store addresses was
somewhat limited. For example, it could not use a register-extend
address mode to handle the following CLIF:
```
v2760 = uextend.i64 v985
v2761 = load.i64 notrap aligned readonly v1
v1018 = iadd v2761, v2760
store v1017, v1018
```
This PR adds more general support for address expressions made up of
additions and extensions. In particular, it pattern-matches a tree of
64-bit `iadd`s, optionally with `uextend`/`sextend` from 32-bit values
at the leaves, to collect the list of all addends that form the address.
It also collects all offsets at leaves, combining them.
It then applies a series of heuristics to make the best use of the
available addressing modes, folding as many 64-bit registers,
zero/sign-extended 32-bit registers, and/or an immediate offset into the
load/store itself as it can, and computing the rest with add
instructions as necessary. It uses the immediate forms (add-immediate or
subtract-immediate) whenever possible, and also uses the built-in extend
operators on add instructions when it can. There are certainly cases
where this is not optimal (i.e., it does not generate the strictly
shortest possible instruction sequence), but it should be good enough
for most code.
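The addend-collection step can be pictured with a small standalone
sketch. The `AddrExpr` type and `collect_addends` function below are
illustrative only, not the Cranelift data structures; they show how a
tree of 64-bit adds, with optional 32-bit extends and constants at the
leaves, flattens into a list of register addends plus a single combined
offset.
```
// Toy address expression: not Cranelift IR, just an illustration.
enum AddrExpr {
    Reg64(u8),                         // a 64-bit register
    Uext32(u8),                        // uextend.i64 of a 32-bit register
    Sext32(u8),                        // sextend.i64 of a 32-bit register
    Const(i64),                        // an immediate offset
    Add(Box<AddrExpr>, Box<AddrExpr>), // 64-bit iadd
}

// One collected addend: a register, possibly with an extend applied.
#[derive(Debug)]
enum Addend {
    Reg64(u8),
    Uext32(u8),
    Sext32(u8),
}

/// Walk the tree, collecting all register addends and summing all
/// constant leaves into a single offset.
fn collect_addends(expr: &AddrExpr, addends: &mut Vec<Addend>, offset: &mut i64) {
    match expr {
        AddrExpr::Reg64(r) => addends.push(Addend::Reg64(*r)),
        AddrExpr::Uext32(r) => addends.push(Addend::Uext32(*r)),
        AddrExpr::Sext32(r) => addends.push(Addend::Sext32(*r)),
        AddrExpr::Const(c) => *offset = offset.wrapping_add(*c),
        AddrExpr::Add(a, b) => {
            collect_addends(a, addends, offset);
            collect_addends(b, addends, offset);
        }
    }
}

fn main() {
    // (base + uextend(index1)) + (sextend(index2) + 16)
    let expr = AddrExpr::Add(
        Box::new(AddrExpr::Add(
            Box::new(AddrExpr::Reg64(1)),
            Box::new(AddrExpr::Uext32(2)),
        )),
        Box::new(AddrExpr::Add(
            Box::new(AddrExpr::Sext32(3)),
            Box::new(AddrExpr::Const(16)),
        )),
    );
    let (mut addends, mut offset) = (Vec::new(), 0i64);
    collect_addends(&expr, &mut addends, &mut offset);
    println!("addends = {:?}, offset = {}", addends, offset);
    // A real lowering would now pick an addressing mode, e.g.
    // [reg, reg, UXTW] for two addends with a zero offset.
}
```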
Using `perf stat` to measure instruction count (runtime only, on
wasmtime, after populating the cache to avoid measuring compilation),
this impacts `bz2` as follows:
```
pre:
1006.410425 task-clock (msec) # 1.000 CPUs utilized
113 context-switches # 0.112 K/sec
1 cpu-migrations # 0.001 K/sec
5,036 page-faults # 0.005 M/sec
3,221,547,476 cycles # 3.201 GHz
4,000,670,104 instructions # 1.24 insn per cycle
<not supported> branches
27,958,613 branch-misses
1.006071348 seconds time elapsed
post:
963.499525 task-clock (msec) # 0.997 CPUs utilized
117 context-switches # 0.121 K/sec
0 cpu-migrations # 0.000 K/sec
5,081 page-faults # 0.005 M/sec
3,039,687,673 cycles # 3.155 GHz
3,837,761,690 instructions # 1.26 insn per cycle
<not supported> branches
28,254,585 branch-misses
0.966072682 seconds time elapsed
```
In other words, this reduces instruction count by 4.1% on `bz2`.
We often see patterns like:
```
mov w2, #0xffff_ffff // uses ORR with logical immediate form
add w0, w1, w2
```
which is just `w0 := w1 - 1`. It would be much better to recognize when
the negation of an immediate fits in a 12-bit immediate field even
though the immediate itself does not, and flip add to subtract (and vice
versa), so that we can instead generate:
```
sub w0, w1, #1
```
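A minimal sketch of the check, assuming the usual AArch64
arithmetic-immediate encoding (a 12-bit value, optionally shifted left
by 12); the helper names are illustrative, not the backend's:
```
/// Does `imm` fit an AArch64 add/sub immediate: 12 bits, optionally
/// shifted left by 12?
fn fits_arith_imm(imm: u64) -> bool {
    imm < (1 << 12) || (imm & 0xfff == 0 && imm < (1 << 24))
}

/// Choose between `add rd, rn, #imm` and `sub rd, rn, #(-imm)` for a
/// 32-bit addition of a constant, flipping the operation if only the
/// negation fits. Returns (mnemonic, encoded immediate), or None if the
/// constant must be materialized into a register instead.
fn choose_add_or_sub(imm32: u32) -> Option<(&'static str, u64)> {
    if fits_arith_imm(u64::from(imm32)) {
        Some(("add", u64::from(imm32)))
    } else if fits_arith_imm(u64::from(imm32.wrapping_neg())) {
        Some(("sub", u64::from(imm32.wrapping_neg())))
    } else {
        None
    }
}

fn main() {
    // 0xffff_ffff does not fit, but its negation (1) does: flip to sub.
    assert_eq!(choose_add_or_sub(0xffff_ffff), Some(("sub", 1)));
    assert_eq!(choose_add_or_sub(42), Some(("add", 42)));
    assert_eq!(choose_add_or_sub(0x0001_2345), None); // needs a register
}
```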
We see this pattern in e.g. `bz2`, where this commit makes the following
difference (counting instructions with `perf stat`, filling in the
wasmtime cache first then running again to get just runtime):
pre:
```
992.762250 task-clock (msec) # 0.998 CPUs utilized
109 context-switches # 0.110 K/sec
0 cpu-migrations # 0.000 K/sec
5,035 page-faults # 0.005 M/sec
3,224,119,134 cycles # 3.248 GHz
4,000,521,171 instructions # 1.24 insn per cycle
<not supported> branches
27,573,755 branch-misses
0.995072322 seconds time elapsed
```
post:
```
993.853850 task-clock (msec) # 0.998 CPUs utilized
123 context-switches # 0.124 K/sec
1 cpu-migrations # 0.001 K/sec
5,072 page-faults # 0.005 M/sec
3,201,278,337 cycles # 3.221 GHz
3,917,061,340 instructions # 1.22 insn per cycle
<not supported> branches
28,410,633 branch-misses
0.996008047 seconds time elapsed
```
In other words, a 2.1% reduction in instruction count on `bz2`.
Previously, when lowering a select, we simply compared the input bool
to 0, which forced the value into a register (usually via a cmp and
cset), zero-extended it, and so on. This patch applies the same
pattern-matching that branches use, performing the cmp directly and
consuming its flag results with the csel.
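A rough sketch of the fusion, using a toy IR rather than the real
CLIF/VCode types (all names here are illustrative): when the select's
condition comes from an integer compare, the lowering emits the compare
and a flag-consuming csel directly, instead of materializing the boolean
and re-testing it against zero.
```
// Toy IR: not the real Cranelift types.
enum Node {
    IcmpSlt(u8, u8), // signed "less than" on two registers
    Select { cond: Box<Node>, if_true: u8, if_false: u8 },
}

/// Emit a (textual) instruction sequence for a select, fusing a compare
/// condition into the csel when possible.
fn lower_select(n: &Node) -> Vec<String> {
    match n {
        Node::Select { cond, if_true, if_false } => match cond.as_ref() {
            // Fused form: the compare sets flags, the csel consumes them.
            Node::IcmpSlt(a, b) => vec![
                format!("cmp  w{}, w{}", a, b),
                format!("csel w0, w{}, w{}, lt", if_true, if_false),
            ],
            // Fallback: materialize the bool, then re-test it.
            _ => vec![
                "..materialize the bool into w9 (e.g. via cset)..".to_string(),
                "cmp  w9, #0".to_string(),
                format!("csel w0, w{}, w{}, ne", if_true, if_false),
            ],
        },
        _ => vec![],
    }
}

fn main() {
    let sel = Node::Select {
        cond: Box::new(Node::IcmpSlt(1, 2)),
        if_true: 3,
        if_false: 4,
    };
    for insn in lower_select(&sel) {
        println!("{}", insn);
    }
}
```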
On the `bz2` benchmark, the runtime is affected as follows (measuring
with `perf stat`, using wasmtime with its cache enabled, and taking the
second run after the first compiles and populates the cache):
pre:
```
1117.232000 task-clock (msec) # 1.000 CPUs utilized
133 context-switches # 0.119 K/sec
1 cpu-migrations # 0.001 K/sec
5,041 page-faults # 0.005 M/sec
3,511,615,100 cycles # 3.143 GHz
4,272,427,772 instructions # 1.22 insn per cycle
<not supported> branches
27,980,906 branch-misses
1.117299838 seconds time elapsed
```
post:
```
1003.738075 task-clock (msec) # 1.000 CPUs utilized
121 context-switches # 0.121 K/sec
0 cpu-migrations # 0.000 K/sec
5,052 page-faults # 0.005 M/sec
3,224,875,393 cycles # 3.213 GHz
4,000,838,686 instructions # 1.24 insn per cycle
<not supported> branches
27,928,232 branch-misses
1.003440004 seconds time elapsed
```
In other words, with this change, on `bz2`, we see a 6.3% reduction in
executed instructions.
We had previously fixed a bug in which constant shift amounts were not
masked modulo the number of bits in the operand; however, we did not
fix the analogous case for shifts incorporated into the second register
argument of ALU instructions that support integrated shifts. This
failure to mask resulted in illegal instructions being generated, e.g.
in https://bugzilla.mozilla.org/show_bug.cgi?id=1653502. This PR fixes
the issue by masking the amount, as the shift semantics require.
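A minimal sketch of the required masking in plain Rust (the helper is
illustrative, not the backend's actual function): the shift amount
folded into a shifted-register ALU operand must be reduced modulo the
operand width, i.e. masked to 5 bits for 32-bit operations and 6 bits
for 64-bit ones.
```
/// Mask a shift amount to the legal range for the operand width
/// (illustrative helper, not the backend's actual function).
fn mask_shift_amount(amount: u64, ty_bits: u32) -> u32 {
    debug_assert!(ty_bits == 32 || ty_bits == 64);
    (amount as u32) & (ty_bits - 1)
}

fn main() {
    // A 32-bit shift by 32 is a shift by 0; by 33, a shift by 1.
    assert_eq!(mask_shift_amount(32, 32), 0);
    assert_eq!(mask_shift_amount(33, 32), 1);
    // A 64-bit shift by 65 is a shift by 1.
    assert_eq!(mask_shift_amount(65, 64), 1);
    // Without the mask, an amount like 32 would not fit the 5-bit shift
    // field of a 32-bit shifted-register ALU operand, producing an
    // illegal instruction.
}
```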
This commit adds support for generating stackmaps at safepoints to the
new backend framework and to the AArch64 backend in particular. It has
been tested to work with SpiderMonkey.
This commit adds the initial support to allow reftypes to flow through
the program when targeting aarch64. It also adds a fix to the
`ModuleTranslationState` needed to send R32/R64 types over from the
SpiderMonkey embedding.
This commit does not include any support for safepoints in aarch64
or the `MachInst` infrastructure; that is in the next commit.
This commit also makes a drive-by improvement to `Bint`, avoiding an
unneeded zero-extension op when the extended value comes directly from a
conditional-set (which produces a full-width 0 or 1).
In discussions with @bnjbvr, it came up that generating `OneWayCondBr`s
with explicit, hardcoded PC-offsets as part of lowered instruction
sequences is actually unsafe, because the register allocator *might*
insert a spill or reload into the middle of our sequence. We were
careful about this in some cases but somehow missed that it was a
general restriction. Conceptually, all inter-instruction references
should be via labels at the VCode level; explicit offsets are only ever
known at emission time, and resolved by the `MachBuffer`.
To allow for conditional trap checks without modifying the CFG (as seen
by regalloc) during lowering, this PR instead adds a `TrapIf`
pseudo-instruction that conditionally skips a single embedded trap
instruction. It lowers to the same `condbr label ; trap ; label: ...`
sequence, but without the hardcoded branch-target offset in the lowering
code.
The failure to mask the amount triggered a panic due to a subtraction
overflow check; see
https://bugzilla.mozilla.org/show_bug.cgi?id=1649432. Attempting to
shift by an out-of-range amount should be defined to shift by an amount
mod the operand size (i.e., masked to 5 bits for 32-bit shifts, or 6
bits for 64-bit shifts).
This PR adds a conditional move following a heap bounds check through
which the address to be accessed flows. This conditional move ensures
that even if the branch is mispredicted (access is actually out of
bounds, but speculation goes down the in-bounds path), the actually
accessed address is zero (a NULL pointer) rather than the out-of-bounds
address.
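The dataflow can be sketched in plain Rust, with a branchless mask-based
select standing in for the conditional-move instruction the backend
actually emits (e.g. csel on AArch64); the names are illustrative, and
the real mitigation of course relies on the hardware conditional move
rather than on Rust semantics.
```
/// Compute the address to access for a bounds-checked heap access.
/// Even if the in-bounds branch is speculatively taken for an
/// out-of-bounds index, the address that flows onward is 0, not an
/// attacker-controlled out-of-bounds pointer.
fn guarded_heap_addr(heap_base: u64, heap_size: u64, offset: u64) -> u64 {
    let in_bounds = offset < heap_size;
    let candidate = heap_base.wrapping_add(offset);
    // Branchless select: all-ones mask if in bounds, all-zeros if not.
    let mask = (in_bounds as u64).wrapping_neg();
    candidate & mask
}

fn main() {
    let (base, size) = (0x1000_0000u64, 0x1_0000u64);
    assert_eq!(guarded_heap_addr(base, size, 0x10), base + 0x10);
    assert_eq!(guarded_heap_addr(base, size, 0x2_0000), 0); // OOB -> NULL
}
```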
The mitigation is controlled by a flag that is off by default, but can
be set by the embedding. Note that in order to turn it on by default,
we would need to add conditional-move support to the current x86
backend; this does not appear to be present. Once the deprecated
backend is removed in favor of the new backend, IMHO we should turn
this flag on by default.
Note that the mitigation is unnecessary when we use the "huge heap"
technique on 64-bit systems, in which we allocate a range of virtual
address space such that no 32-bit offset can reach other data. Hence,
this only affects small-heap configurations.
When a load/store instruction needs an address of the form `v0 +
uextend(v1)` or `v0 + sextend(v1)` (or the commuted forms thereof), we
currently generate a separate zero/sign-extend operation and then use a
plain `[rA, rB]` addressing mode. This patch extends `lower_address()`
to look at both addends of an address if it has two addends and a zero
offset, recognize extension operations, and incorporate them directly
into a `[rA, rB, UXTW]` or `[rA, rB, SXTW]` form. This should improve
our performance on WebAssembly workloads, at least, because we often see
a 64-bit linear memory base indexed by a 32-bit (Wasm) pointer value.
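For reference, a plain-Rust sketch of the effective-address computation
that the `[rA, rB, UXTW]` and `[rA, rB, SXTW]` forms perform (shift-free
variants shown; this is illustrative, not backend code):
```
/// Effective address of `[base, index, UXTW]`: zero-extend the 32-bit
/// index and add it to the 64-bit base.
fn amode_uxtw(base: u64, index: u32) -> u64 {
    base.wrapping_add(u64::from(index))
}

/// Effective address of `[base, index, SXTW]`: sign-extend the 32-bit
/// index and add it to the 64-bit base.
fn amode_sxtw(base: u64, index: u32) -> u64 {
    base.wrapping_add(index as i32 as i64 as u64)
}

fn main() {
    let base = 0x1_0000_0000u64;
    assert_eq!(amode_uxtw(base, 0xffff_fff0), 0x1_ffff_fff0);
    // Interpreted as signed, 0xffff_fff0 is -16.
    assert_eq!(amode_sxtw(base, 0xffff_fff0), 0x0_ffff_fff0);
}
```
Folding the extend into the load or store this way saves the separate
zero/sign-extend instruction that we previously emitted.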
- Properly mask constant values down to appropriate width when
generating a constant value directly in aarch64 backend. This was a
miscompilation introduced in the new-isel refactor. In combination
with failure to respect NarrowValueMode, this resulted in a very
subtle bug when an `i32` constant was used in bit-twiddling logic.
- Add support for `iadd_ifcout` in aarch64 backend as used in explicit
heap-check mode. With this change, we no longer fail heap-related
tests with the huge-heap-region mode disabled.
- Remove a panic that was occurring in some tests that are currently
ignored on aarch64, by simply returning empty/default information in
`value_label` functionality rather than touching unimplemented APIs.
This is not a bugfix per se, but removes confusing panic messages from
`cargo test` output that might otherwise mislead.
I hadn't realized before that the filetest backend for `test vcode` is
doing essentially what `compile` is doing, but for new (`MachInst`)
backends: it is just getting a disassembly and running it through
filecheck. There's no reason not to reuse `test compile` for the AArch64
tests as well.
This was motivated by the desire to have "this IR compiles successfully"
tests work on both x86 and AArch64. It seems this should work fine by
adding multiple `target` directives when a test case should be
compile-tested on multiple architectures.
This patch includes:
- A complete rework of the way that CLIF blocks and edge blocks are
lowered into VCode blocks. The new mechanism in `BlockLoweringOrder`
computes RPO over the CFG, but with a twist: it merges edge blocks into
heads or tails of original CLIF blocks wherever possible, and it does
this without ever actually materializing the full nodes-plus-edges
graph first. The backend driver lowers blocks in final order so
there's no need to reshuffle later.
- A new `MachBuffer` that replaces the `MachSection`. This is a special
version of a code-sink that is far more than a humble `Vec<u8>`. In
particular, it keeps a record of label definitions and label uses,
with a machine-pluggable `LabelUse` trait that defines various types
of fixups (basically internal relocations). (A minimal sketch of this
label/fixup idea appears after this list.)
Importantly, it implements some simple peephole-style branch rewrites
*inline in the emission pass*, without any separate traversals over
the code to use fallthroughs, swap taken/not-taken arms, etc. It
tracks branches at the tail of the buffer and can (i) remove blocks
that are just unconditional branches (by redirecting the label), (ii)
understand a conditional/unconditional pair and swap the conditional
polarity when it's helpful, and (iii) remove branches that branch to
the fallthrough PC.
The `MachBuffer` also implements branch-island support. On
architectures like AArch64, this is needed to allow conditional
branches within plausibly-attainable ranges (+/- 1MB on AArch64
specifically). It also does this inline while streaming through the
emission, without any sort of fixpoint algorithm or later moving of
code, by simply tracking outstanding references and "deadlines" and
emitting an island just-in-time when we're in danger of going out of
range.
- A rework of the instruction selector driver. This is largely following
the same algorithm as before, but is cleaned up significantly, in
particular in the API: the machine backend can ask for an input arg
and get any of three forms (constant, register, producing
instruction), indicating it needs the register or can merge the
constant or producing instruction as appropriate. This new driver
takes special care to emit constants right at use-sites (and at phi
inputs), minimizing their live-ranges, and also special-cases the
"pinned register" to avoid superfluous moves.
Overall, on `bz2.wasm`, the results are:
wasmtime full run (compile + runtime) of bz2:
baseline: 9774M insns, 9742M cycles, 3.918s
w/ changes: 7012M insns, 6888M cycles, 2.958s (24.5% faster, 28.3% fewer insns)
clif-util wasm compile bz2:
baseline: 2633M insns, 3278M cycles, 1.034s
w/ changes: 2366M insns, 2920M cycles, 0.923s (10.7% faster, 10.1% fewer insns)
All numbers are averages of two runs on an Ampere eMAG.
This PR changes the aarch64 ABI implementation to use positive offsets
from SP, rather than negative offsets from FP, to refer to spill slots
and stack-local storage. This allows for better addressing-mode options,
and hence slightly better code: e.g., the unsigned scaled 12-bit offset
mode can be used to reach anywhere in a 32KB frame without extra
address-construction instructions, whereas negative offsets are limited
to a signed 9-bit unscaled mode (-256 bytes).
To enable this, the PR introduces a notion of "nominal SP offsets" as a
virtual addressing mode, lowered during the emission pass. The offsets
are relative to "SP after adjusting downward to allocate stack/spill
slots", but before pushing clobbers. This allows the addressing-mode
expressions to be generated before register allocation (or during it,
for spill/reload sequences).
To convert these offsets into *true* offsets from SP, we need to track
how much further SP is moved downward, and compensate for this. We do so
with "virtual SP offset adjustment" pseudo-instructions: these are seen
by the emission pass, and result in no instruction (0 byte output), but
update state that is now threaded through each instruction emission in
turn. In this way, we can push e.g. stack args for a call and adjust
the virtual SP offset, allowing reloads from nominal-SP-relative
spillslots while we do the argument setup with "real SP offsets" at the
same time.
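A simplified sketch of the bookkeeping (plain Rust, with illustrative
names): nominal-SP-relative offsets are translated into real SP offsets
at emission time by adding the current virtual SP adjustment, which the
zero-size pseudo-instructions update as SP moves below its nominal
position.
```
/// Emission-time state threaded through instruction emission
/// (illustrative, not the backend's actual types).
struct EmitState {
    /// How far real SP currently sits below nominal SP, in bytes.
    virtual_sp_offset: u64,
}

impl EmitState {
    /// Convert a nominal-SP-relative offset into a real-SP-relative one.
    fn real_sp_offset(&self, nominal_offset: u64) -> u64 {
        nominal_offset + self.virtual_sp_offset
    }

    /// Effect of a "virtual SP offset adjustment" pseudo-instruction:
    /// it emits no bytes, only updates the state.
    fn adjust_virtual_sp(&mut self, delta: u64) {
        self.virtual_sp_offset += delta;
    }
}

fn main() {
    let mut state = EmitState { virtual_sp_offset: 0 };

    // A spill slot lives at nominal-SP + 16.
    assert_eq!(state.real_sp_offset(16), 16); // e.g. ldr x0, [sp, #16]

    // Push 32 bytes of on-stack call arguments: real SP drops by 32,
    // which is recorded via the pseudo-instruction.
    state.adjust_virtual_sp(32);

    // The same spill slot is now reachable at real-SP + 48.
    assert_eq!(state.real_sp_offset(16), 48); // e.g. ldr x0, [sp, #48]
}
```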
Previously, every call was lowered on AArch64 to a `call` instruction, which
takes a signed 26-bit PC-relative offset. Including the 2-bit left shift, this
gives a range of +/- 128 MB. Longer-distance offsets would cause an impossible
relocation record to be emitted (or rather, a record that a more sophisticated
linker would fix up by inserting a shim/veneer).
This commit adds a notion of "relocation distance" in the MachInst backends,
and provides this information for every call target and symbol reference. The
intent is that backends on architectures like AArch64, where there are different
offset sizes / addressing strategies to choose from, can either emit a regular
call or a load-64-bit-constant / call-indirect sequence, as necessary. This
avoids the need to implement complex linking behavior.
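A sketch of the distance reasoning behind this choice (plain Rust,
illustrative names): `bl` encodes a signed 26-bit word offset, i.e.
roughly +/- 128 MB from the call site, so a target that may lie farther
away needs the load-constant / call-indirect form.
```
/// Can a byte displacement be encoded in AArch64 `bl`'s signed 26-bit
/// word-offset field (i.e. within +/- 128 MB and 4-byte aligned)?
fn fits_bl_offset(byte_offset: i64) -> bool {
    byte_offset % 4 == 0
        && byte_offset >= -(1 << 27)
        && byte_offset < (1 << 27)
}

/// Pick a call strategy from the "relocation distance" knowledge: near
/// targets can use a direct `bl`; possibly-far targets need a
/// load-64-bit-constant / call-indirect sequence (e.g. via `blr`).
fn call_strategy(known_near: bool) -> &'static str {
    if known_near { "bl <target>" } else { "load target address; blr" }
}

fn main() {
    assert!(fits_bl_offset(127 * 1024 * 1024));
    assert!(!fits_bl_offset(130 * 1024 * 1024));
    println!("{}", call_strategy(true));
    println!("{}", call_strategy(false));
}
```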
The MachInst driver code provides this information based on the "colocated" bit
in the CLIF symbol references, which appears to have been designed for this
purpose, or at least a similar one. Combined with the `use_colocated_libcalls`
setting, this allows client code to ensure that library calls can link to
library code at any location in the address space.
Separately, the `simplejit` example did not handle `Arm64Call`; rather than doing
so, it appears all that is necessary to get its tests to pass is to set the
`use_colocated_libcalls` flag to false, to make use of the above change. This
fixes the `libcall_function` unit-test in this crate.
This commit implements the stack limit checks in cranelift for the
AArch64 backend. This gets the `stack_limit` argument purpose as well as
a function's global `stack_limit` directive working for the AArch64
backend. I've tested this locally on some hardware and in an emulator
and it looks to be working for basic tests, but I've never really done
AArch64 before so some scrutiny on the instructions would be most
welcome!
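For intuition, a plain-Rust sketch of what the emitted prologue check
decides (illustrative only; the backend emits a compare against the
limit and a conditional trap, not Rust code):
```
/// Decide whether entering a function with the given frame size would
/// run past the stack limit. The real check is emitted in the function
/// prologue as a compare-and-trap sequence, not as Rust code.
fn stack_limit_exceeded(sp: u64, frame_size: u64, stack_limit: u64) -> bool {
    match sp.checked_sub(frame_size) {
        Some(new_sp) => new_sp < stack_limit,
        None => true, // SP would wrap: definitely out of stack
    }
}

fn main() {
    let stack_limit = 0x1000;
    assert!(!stack_limit_exceeded(0x9000, 0x100, stack_limit)); // fine
    assert!(stack_limit_exceeded(0x1080, 0x100, stack_limit));  // would trap
}
```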
Given an integer size N, a left rotation of K places is the same as a
right rotation of N - K places. This means we can use right rotations to
implement left rotations too.
Cranelift's rotation semantics are inherited from WebAssembly, which
means the rotation count is truncated modulo the operand's bit size.
Note that the AArch64 ROR instruction has the same semantics when both
input operands are registers.
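A quick plain-Rust check of the identity: with the count reduced modulo
the bit size N, a left rotation by K equals a right rotation by
(N - K) mod N.
```
/// Left rotation of a 32-bit value implemented with a right rotation,
/// using the WebAssembly/Cranelift "count modulo bit size" semantics.
fn rotl32_via_rotr(x: u32, k: u32) -> u32 {
    let k = k % 32;
    x.rotate_right((32 - k) % 32)
}

fn main() {
    for &x in &[0u32, 1, 0xdead_beef, u32::MAX] {
        for k in 0..100u32 {
            assert_eq!(rotl32_via_rotr(x, k), x.rotate_left(k % 32));
        }
    }
}
```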
Includes a temporary bugfix for popcnt with 32-bit operand. The popcnt
issue was initially identified by Benjamin Bouvier <public@benj.me>, and
the root cause was debugged by Joey Gouly <joey.gouly@arm.com>. This
patch is simply a quick fix that zero-extends the operand to 64 bits;
Joey plans to contribute a more permanent fix shortly (tracked in
#1537).
- Added a filetest for the vcode output of lowering every handled FP opcode.
- Fixed two bugs that were discovered while going through the lowerings:
- Saturating FP->int operators would return `u{32,64}::MIN` rather than
`0` for a NaN input.
- `fcopysign` did not mask off the sign bit of the value whose sign is
overwritten.
These probably would have been caught by Wasm conformance tests soon
(and the validity of these lowerings will ultimately be tested this
way), but let's get them right by inspection, too! (A sketch of the
intended semantics appears below, after this list.)
- Undo temporary changes to default features (`all-arch`) and a
signal-handler test.
- Remove `SIGTRAP` handler: no longer needed now that we've found an
"undefined opcode" option on ARM64.
- Rename pp.rs to pretty_print.rs in machinst/.
- Only use empty stack-probe on non-x86. As per a comment in
rust-lang/compiler-builtins [1], LLVM only supports stack probes on
x86 and x86-64. Thus, on any other CPU architecture, we cannot refer
to `__rust_probestack`, because it does not exist.
- Rename arm64 to aarch64.
- Use `target` directive in vcode filetests.
- Run the flags verifier, but without encinfo, when using new backends.
- Clean up warning overrides.
- Fix up use of casts: use u32::from(x) and siblings when possible,
u32::try_from(x).unwrap() when not, to avoid silent truncation.
- Take immutable `Function` borrows as input; we don't actually
mutate the input IR.
- Lots of other miscellaneous cleanups.
[1] cae3e6ea23/src/probestack.rs (L39)
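Returning to the two FP lowering bugs noted above, here is a plain-Rust
sketch of the intended semantics (illustrative, not the backend's
lowering code): the saturating float-to-int conversion maps NaN to 0,
and `fcopysign` masks off the sign bit of the magnitude operand before
OR-ing in the other operand's sign bit.
```
/// Saturating f64 -> i32 conversion: NaN maps to 0, and out-of-range
/// values clamp to i32::MIN / i32::MAX (Rust's own `as` cast saturates
/// similarly; the branches are spelled out here for clarity).
fn fcvt_to_sint_sat32(x: f64) -> i32 {
    if x.is_nan() {
        0
    } else if x < i32::MIN as f64 {
        i32::MIN
    } else if x > i32::MAX as f64 {
        i32::MAX
    } else {
        x as i32
    }
}

/// fcopysign: take the magnitude bits of `mag` and the sign bit of
/// `sign`. The sign bit of `mag` must be masked off first.
fn fcopysign64(mag: f64, sign: f64) -> f64 {
    const SIGN_BIT: u64 = 1 << 63;
    let bits = (mag.to_bits() & !SIGN_BIT) | (sign.to_bits() & SIGN_BIT);
    f64::from_bits(bits)
}

fn main() {
    assert_eq!(fcvt_to_sint_sat32(f64::NAN), 0);
    assert_eq!(fcvt_to_sint_sat32(1e99), i32::MAX);
    assert_eq!(fcvt_to_sint_sat32(-1e99), i32::MIN);
    assert_eq!(fcopysign64(3.5, -0.0), -3.5);
    assert_eq!(fcopysign64(-3.5, 1.0), 3.5);
}
```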