wasmtime

Author	SHA1	Message	Date
Chris Fallin	f9b98f0ddc	Aarch64 codegen quality: support more general add+extend computations. Previously, our pattern-matching for generating load/store addresses was somewhat limited. For example, it could not use a register-extend address mode to handle the following CLIF:
```
v2760 = uextend.i64 v985
v2761 = load.i64 notrap aligned readonly v1
v1018 = iadd v2761, v2760
store v1017, v1018
```
This PR adds more general support for address expressions made up of additions and extensions. In particular, it pattern-matches a tree of 64-bit `iadd`s, optionally with `uextend`/`sextend` from 32-bit values at the leaves, to collect the list of all addends that form the address. It also collects all offsets at leaves, combining them. It applies a series of heuristics to make the best use of the available addressing modes, filling the load/store itself with as many 64-bit registers, zero/sign-extended 32-bit registers, and/or an offset as the addressing mode allows, then computing the rest with add instructions as necessary. It attempts to make use of immediate forms (add-immediate or subtract-immediate) whenever possible, and also uses the built-in extend operators on add instructions when possible. There are certainly cases where this is not optimal (i.e., does not generate the strictly shortest sequence of instructions), but it should be good enough for most code.
Using `perf stat` to measure instruction count (runtime only, on wasmtime, after populating the cache to avoid measuring compilation), this impacts `bz2` as follows:
```
pre:

       1006.410425      task-clock (msec)         #    1.000 CPUs utilized
               113      context-switches          #    0.112 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,036      page-faults               #    0.005 M/sec
     3,221,547,476      cycles                    #    3.201 GHz
     4,000,670,104      instructions              #    1.24 insn per cycle
   <not supported>      branches
        27,958,613      branch-misses

       1.006071348 seconds time elapsed

post:

        963.499525      task-clock (msec)         #    0.997 CPUs utilized
               117      context-switches          #    0.121 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,081      page-faults               #    0.005 M/sec
     3,039,687,673      cycles                    #    3.155 GHz
     3,837,761,690      instructions              #    1.26 insn per cycle
   <not supported>      branches
        28,254,585      branch-misses

       0.966072682 seconds time elapsed
```
In other words, this reduces instruction count by 4.1% on `bz2`.	2020-07-27 13:10:50 -07:00
Yury Delendik	399ee0a54c	Serialize and deserialize compilation artifacts. (#2020) * Serialize and deserialize Module * Use bincode to serialize * Add wasm_module_serialize; docs * Simple tests	2020-07-21 15:05:50 -05:00
Chris Fallin	96ef2f1a1b	Fix `u8::MAX` -> `std::u8::MAX`. (#2047) As per Carlo Kok on Zulip #cranelift, this breaks builds with stable Rust pre-1.43, as the associated constant `u8::MAX` was only stabilized then. We'd like to support older versions if we can easily do so. This PR also adds `cranelift-tools` to the crates checked on CI with Rust 1.41.0, which pulls in all backends (including `aarch64`).	2020-07-20 14:59:15 -05:00
Chris Fallin	784e2f1480	Merge pull request #2038 from jgouly/arith2 arm64: Enable arith2 tests	2020-07-20 09:00:10 -07:00
Chris Fallin	1b3b2dbfd0	Merge pull request #2043 from cfallin/csel-opt Aarch64: handle csel with icmp/fcmp source without materializing the bool.	2020-07-18 19:33:47 -07:00
Chris Fallin	ea894c0eeb	Merge pull request #2042 from cfallin/aarch64-fix-regshift-mask Aarch64: mask shift-amounts incorporated into reg-reg-shift ALU insts.	2020-07-18 19:33:35 -07:00
Chris Fallin	21dac670f0	Aarch64: handle csel with icmp/fcmp source without materializing the bool. Previously, we simply compared the input bool to 0, which forced the value into a register (usually via a cmp and cset), zero-extended it, etc. This patch performs the same pattern-matching that branches do to directly perform the cmp and use its flag results with the csel. On the `bz2` benchmark, the runtime is affected as follows (measuring with `perf stat`, using wasmtime with its cache enabled, and taking the second run after the first compiles and populates the cache): pre: 1117.232000 task-clock (msec) # 1.000 CPUs utilized 133 context-switches # 0.119 K/sec 1 cpu-migrations # 0.001 K/sec 5,041 page-faults # 0.005 M/sec 3,511,615,100 cycles # 3.143 GHz 4,272,427,772 instructions # 1.22 insn per cycle <not supported> branches 27,980,906 branch-misses 1.117299838 seconds time elapsed post: 1003.738075 task-clock (msec) # 1.000 CPUs utilized 121 context-switches # 0.121 K/sec 0 cpu-migrations # 0.000 K/sec 5,052 page-faults # 0.005 M/sec 3,224,875,393 cycles # 3.213 GHz 4,000,838,686 instructions # 1.24 insn per cycle <not supported> branches 27,928,232 branch-misses 1.003440004 seconds time elapsed In other words, with this change, on `bz2`, we see a 6.3% reduction in executed instructions.	2020-07-17 21:10:21 -07:00
Nick Fitzgerald	ee5982fd16	peepmatic: Be generic over the operator type This lets us avoid the overhead of converting `cranelift_codegen::ir::Opcode` to `peepmatic_runtime::Operator`, and paves the way for allowing Peepmatic to support non-clif optimizations (e.g. vcode optimizations). Rather than defining our own `peepmatic::Operator` type like we used to, now the whole `peepmatic` crate is effectively generic over a `TOperator` type parameter. For the Cranelift integration, we use `cranelift_codegen::ir::Opcode` as the concrete type for our `TOperator` type parameter. For testing, we also define a `TestOperator` type, so that we can test Peepmatic code without building all of Cranelift, and we can keep them somewhat isolated from each other. The methods that `peepmatic::Operator` had are now translated into trait bounds on the `TOperator` type. These traits need to be shared between all of `peepmatic`, `peepmatic-runtime`, and `cranelift-codegen`'s Peepmatic integration. Therefore, these new traits live in a new crate: `peepmatic-traits`. This crate acts as a header file of sorts for shared trait/type/macro definitions. Additionally, the `peepmatic-runtime` crate no longer depends on the `peepmatic-macro` procedural macro crate, which should lead to faster build times for Cranelift when it is using pre-built peephole optimizers.	2020-07-17 16:16:49 -07:00
Chris Fallin	9bd9c628aa	Aarch64: mask shift-amounts incorporated into reg-reg-shift ALU insts. We had previously fixed a bug in which constant shift amounts were not masked modulo the number of bits in the operand; however, we did not fix the analogous case for shifts incorporated into the second register argument of ALU instructions that support integrated shifts. This failure to mask resulted in illegal instructions being generated, e.g. in https://bugzilla.mozilla.org/show_bug.cgi?id=1653502. This PR fixes the issue by masking the amount, as the shift semantics require.	2020-07-17 14:55:23 -07:00
Nick Fitzgerald	ae95ad8733	cranelift: Don't build `peepmatic`-based optimizations in `build.rs` Instead, when the `rebuild-peephole-optimizers` feature is enabled, rebuild them the first time they are used. This allows peepmatic to run when Cranelift's `Opcode` is defined and available, which paves the way forward for: * merging `peepmatic_runtime::operator::Operator` and Cranelift's `Opcode` (we are wasting a bunch of cycles converting between the two of them), and * supporting vcode optimizations in `peepmatic`.	2020-07-17 14:35:16 -07:00
Johnnie Birch	a7cedf3100	Add support for 32-bit and 64-bit fcmp for the new backend Implements comiss and comisd.	2020-07-17 13:46:54 -07:00
Nick Fitzgerald	8dd4ab2f1e	Merge pull request #2022 from MaxGraey/peepmatic-bnot peepmatic: Add bnot operation	2020-07-17 09:39:38 -07:00
Nikolay Volf	4f4edc7aef	Remove spam from "do_remove_constant_phis"	2020-07-17 18:14:16 +02:00
Joey Gouly	40473dffed	arm64: Enable arith2 tests Copyright (c) 2020, Arm Limited.	2020-07-17 15:58:16 +01:00
Benjamin Bouvier	ead8a835c4	machinst x64: add more FP support	2020-07-17 15:56:44 +02:00
bjorn3	5c5a30f76c	Fix review comments	2020-07-17 12:03:17 +02:00
bjorn3	7b7b1f4997	Rename sarg__ to sarg_t	2020-07-17 12:03:17 +02:00
bjorn3	4971d9ee80	Merge {make_incoming,get_outgoing}_{,struct_}arg	2020-07-17 12:03:17 +02:00
bjorn3	0d4fa6d32a	Fix review comments	2020-07-17 12:03:17 +02:00
bjorn3	4431ac1108	Implement SystemV struct argument passing	2020-07-17 12:03:17 +02:00
MaxGraey	c653c563dd	Merge branch 'main' into peepmatic-bnot	2020-07-16 22:01:18 +03:00
Chris Fallin	5e0268a542	Merge pull request #2034 from cfallin/update-regalloc Update to regalloc.rs 0.0.28.	2020-07-16 11:36:11 -07:00
Alex Crichton	63d5b91930	Wasmtime 0.19.0 and Cranelift 0.66.0 (#2027) This commit updates Wasmtime's version to 0.19.0, Cranelift's version to 0.66.0, and updates the release notes as well.	2020-07-16 12:46:21 -05:00
Chris Fallin	756e8b8ea2	Update to regalloc.rs 0.0.28. This version of regalloc.rs includes several bugfixes for reference-types support used by the new backend framework and the aarch64 backend (bytecodealliance/regalloc.rs#85 and bytecodealliance/regalloc.rs#86).	2020-07-16 09:42:09 -07:00
Benjamin Bouvier	bab337fc32	Address review comments;	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	5a55646fc3	machinst x64: support out-of-bounds memory accesses;	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	ea33ce9116	machinst x64: basic support for baldrdash + fix multi-value support	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	00b38c91f6	machinst x64: fix generation of RegMemImm immediate operands;	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	1430c5e436	machinst x64: fix index handling of jump table; The index should be truncated to 32 bits before being used for the jump table entry computation.	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	55b9059954	machinst x64: remove spurious assertion about FP offset requiring to be 16-bytes aligned	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	3905a1b17b	machinst x64: implement SymbolValue and FuncAddr with a movabsq+reloc;	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	cfa0a0c4e8	machinst x64: lower resumable_trap as trap;	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	311027869b	machinst x64: implement popcnt.i64	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	d9310e8d90	machinst x64: fix checked div sequence - it should mark as clobbering (def) rdx, not modifying it - the signed-div check requires a temporary to compare against int64_min	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	f932bccaf8	machinst x64: fix sign-extension at boundary	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	6f5403a94b	machinst x64: lower Ctz using the Bsf x86 instruction	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	33e0d05645	machinst x64: have cmov modify its destination operand; This is tricky: because the move is conditional, the output register may be left undefined if we mark it only as a "def". Make it a "mod" instead, which matches our usage in the codebase, and will make it crash if the output operand isn't unconditionally defined before the instruction.	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	aa7db7fd7b	machinst x64: fix JmpUnknown register mapping;	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	fe7dd41435	machinst x64: fix iconst emission	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	ec2209665a	machinst x64: implement bsr and lower Clz;	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	eda2d143ed	machinst x64: add support for umulhi/smulhi;	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	571061fe4c	machinst x64: add support for rotations;	2020-07-16 18:21:06 +02:00
Benjamin Bouvier	22892466e7	machinst x64: fix implementation of *reduce; They should just generate a plain move, not an extended move, since the high bits are then ignored.	2020-07-16 18:21:06 +02:00
MaxGraey	4564c396d2	Merge branch 'main' into peepmatic-bnot	2020-07-16 16:13:28 +03:00
MaxGraey	657aea5286	remove rule and tests	2020-07-16 14:56:11 +03:00
Andrew Brown	f0b083c6ad	Legalize `[u|s]widen_high` for x86 Use `x86_palignr` and `[u|s]widen_low` for legalizing this instruction.	2020-07-15 11:32:08 -07:00
Andrew Brown	c8ddf8a34c	Encode `[u|s]widen_low` for x86	2020-07-15 11:32:08 -07:00
Andrew Brown	fafef7db77	Add `x86_palignr` instructions This instruction is necessary for implementing `[s|u]widen_high`.	2020-07-15 11:32:08 -07:00
Andrew Brown	0e5e8a62c8	Add `DerivedFunction` for doubling lane widths and halving the number of lanes (i.e. merging) Certain operations (e.g. widening) will have operands with types like `NxM` but will return results with types like `(N*2)x(M/2)` (double the lane width, halve the number of lanes; maintain the same number of vector bits). This is equivalent to applying two `DerivedFunction`s to the type: `DerivedFunction::DoubleWidth` then `DerivedFunction::HalfVector`. Since there is no easy way to apply multiple `DerivedFunction`s (e.g. most of the logic is one-level deep, `1d5a678124/cranelift/codegen/meta/src/gen_inst.rs (L618-L621)`), I added `DerivedFunction::MergeLanes` to do the necessary type conversion.	2020-07-15 11:32:08 -07:00
Chris Fallin	12a31c88d7	Merge pull request #2021 from akirilov-arm/VectorSize AArch64: Introduce an enum to specify vector instruction operand sizes	2020-07-15 09:43:18 -07:00

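The address-mode change in f9b98f0ddc flattens a tree of 64-bit `iadd`s into a list of addends. The leaves may be `uextend`/`sextend` from 32 bits, and constant leaves are summed into one offset. The Rust sketch below illustrates that idea on a tiny hypothetical IR. `Expr`, `Addends`, `collect` and `choose_amode` are illustrative names, not Cranelift's API. The mode-selection heuristic is deliberately simplified: it ignores immediate ranges and assumes at least one 64-bit base register.

```rust
/// Hypothetical mini-IR for a 64-bit address computation.
enum Expr {
    Reg64(u32),                 // 64-bit value, e.g. the result of a load
    Uextend32(u32),             // uextend.i64 of a 32-bit value
    Sextend32(u32),             // sextend.i64 of a 32-bit value
    Const(i64),                 // constant offset
    Iadd(Box<Expr>, Box<Expr>), // 64-bit iadd
}

/// All addends found at the leaves of an iadd tree.
#[derive(Debug, Default)]
struct Addends {
    regs64: Vec<u32>,
    uext32: Vec<u32>,
    sext32: Vec<u32>,
    offset: i64, // sum of all constant leaves
}

/// Walk the iadd tree, recording each leaf and combining constants.
fn collect(e: &Expr, out: &mut Addends) {
    match e {
        Expr::Reg64(r) => out.regs64.push(*r),
        Expr::Uextend32(r) => out.uext32.push(*r),
        Expr::Sextend32(r) => out.sext32.push(*r),
        Expr::Const(c) => out.offset = out.offset.wrapping_add(*c),
        Expr::Iadd(a, b) => {
            collect(a, out);
            collect(b, out);
        }
    }
}

/// Simplified, hypothetical heuristic: the addressing mode holds a base
/// register plus one more addend; every leftover addend costs one `add`.
fn choose_amode(a: &Addends) -> (String, usize) {
    assert!(!a.regs64.is_empty(), "sketch assumes a 64-bit base register");
    let base = a.regs64[0];
    let total = a.regs64.len() + a.uext32.len() + a.sext32.len() + usize::from(a.offset != 0);
    let (amode, absorbed) = if let Some(&w) = a.uext32.first() {
        (format!("[x{}, w{}, UXTW]", base, w), 2)
    } else if let Some(&w) = a.sext32.first() {
        (format!("[x{}, w{}, SXTW]", base, w), 2)
    } else if a.regs64.len() >= 2 {
        (format!("[x{}, x{}]", base, a.regs64[1]), 2)
    } else if a.offset != 0 {
        (format!("[x{}, #{}]", base, a.offset), 2)
    } else {
        (format!("[x{}]", base), 1)
    };
    (amode, total - absorbed)
}

fn main() {
    // The CLIF from the commit message: v1018 = iadd v2761, v2760,
    // where v2760 = uextend.i64 v985 and v2761 is a 64-bit load result.
    let addr = Expr::Iadd(Box::new(Expr::Reg64(2761)), Box::new(Expr::Uextend32(985)));
    let mut a = Addends::default();
    collect(&addr, &mut a);
    let (amode, extra_adds) = choose_amode(&a);
    // Register names reuse the CLIF value numbers purely for illustration.
    println!("store to {} after {} extra add(s)", amode, extra_adds);
}
```

For the CLIF in the commit message, the sketch selects a register-extend mode with no extra `add`, which is the case the commit says the old matcher could not handle.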
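The percentages quoted with the two `bz2` measurements (f9b98f0ddc and 21dac670f0) can be rechecked from the `instructions` lines of the `perf stat` output. Here is a minimal Rust sketch; the helper name `percent_reduction` is hypothetical.

```rust
/// Relative reduction from `before` to `after`, in percent.
fn percent_reduction(before: u64, after: u64) -> f64 {
    (before as f64 - after as f64) / before as f64 * 100.0
}

fn main() {
    // Instruction counts copied from the pre/post `perf stat` runs.
    let addr_modes = percent_reduction(4_000_670_104, 3_837_761_690);
    let csel = percent_reduction(4_272_427_772, 4_000_838_686);
    println!("add+extend address modes: {:.2}%", addr_modes);
    println!("csel without materialized bool: {:.2}%", csel);
}
```

With these inputs the sketch prints about 4.07% and 6.36%. The first matches the quoted 4.1%. The second is slightly above the quoted 6.3%, which appears to have been truncated rather than rounded.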

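For 9bd9c628aa, here is a minimal sketch of the masking rule. A shift on an N-bit value uses the amount modulo N. Because N is a power of two, that is a bitwise AND with N - 1. The function names are hypothetical. The sketch also assumes, following the A64 shifted-register encoding, that the amount field accepts only 0..=31 for 32-bit and 0..=63 for 64-bit operations. That constraint is why an unmasked amount could yield an unencodable instruction.

```rust
/// Reduce a shift amount modulo the operand width `bits`.
/// Assumes `bits` is a power of two (8, 16, 32 or 64), so the modulo
/// can be computed with a mask.
fn mask_shift_amount(amt: u64, bits: u32) -> u64 {
    assert!(bits.is_power_of_two(), "operand width must be a power of two");
    amt & u64::from(bits - 1)
}

/// Assumed A64 constraint: a shifted-register amount must be smaller than
/// the operand width (0..=31 for 32-bit, 0..=63 for 64-bit operations).
fn encodable_in_shifted_register(amt: u64, bits: u32) -> bool {
    amt < u64::from(bits)
}

fn main() {
    for &(amt, bits) in &[(3u64, 32u32), (35, 32), (64, 64), (70, 64)] {
        let masked = mask_shift_amount(amt, bits);
        println!(
            "i{}: shift by {} -> {} (raw encodable: {}, masked encodable: {})",
            bits,
            amt,
            masked,
            encodable_in_shifted_register(amt, bits),
            encodable_in_shifted_register(masked, bits),
        );
    }
}
```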
853 Commits
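The `DerivedFunction::MergeLanes` mapping described in 0e5e8a62c8 takes `NxM` (N-bit lanes, M lanes) to `(N*2)x(M/2)`. This is `DoubleWidth` followed by `HalfVector`, keeping the total vector width. The small Rust sketch below models that composition on a hypothetical `VecType` struct, not the cranelift-codegen meta API. It assumes lanes of at most 64 bits.

```rust
/// A SIMD type such as i16x8: `lane_bits`-wide lanes, `lanes` of them.
#[derive(Debug, Clone, Copy, PartialEq)]
struct VecType {
    lane_bits: u32,
    lanes: u32,
}

impl VecType {
    /// Total vector width in bits.
    fn bits(self) -> u32 {
        self.lane_bits * self.lanes
    }

    /// Analogue of DoubleWidth: N -> N*2 (assumed to stop at 64-bit lanes).
    fn double_width(self) -> Option<VecType> {
        if self.lane_bits >= 64 {
            None
        } else {
            Some(VecType { lane_bits: self.lane_bits * 2, ..self })
        }
    }

    /// Analogue of HalfVector: M -> M/2 (needs at least two lanes).
    fn half_vector(self) -> Option<VecType> {
        if self.lanes < 2 {
            None
        } else {
            Some(VecType { lanes: self.lanes / 2, ..self })
        }
    }

    /// Analogue of MergeLanes: both steps at once, keeping the total width.
    fn merge_lanes(self) -> Option<VecType> {
        self.double_width()?.half_vector()
    }
}

fn main() {
    let i16x8 = VecType { lane_bits: 16, lanes: 8 };
    let merged = i16x8.merge_lanes().expect("i16x8 can be merged");
    // Prints i32x4, still 128 bits wide.
    println!("i{}x{} ({} bits)", merged.lane_bits, merged.lanes, merged.bits());
}
```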