wasmtime

Author	SHA1	Message	Date
Chris Fallin	02ae1b4464	Merge pull request #1846 from julian-seward1/better-phis Rewrite `lower_edge` to produce better phi-translations:	2020-06-09 09:56:52 -07:00
Julian Seward	6d25759c8e	Rewrite `lower_edge` to produce better phi-translations: * ensure that all const assignments are placed at the end of the sequence. This minimises live ranges. * for the non-const assignments, ignore self-assignments. This can dramatically reduce the total number of moves generated, because any self-assignments trigger the overlap-case handling, hence invoking the double-copy behaviour in cases where it's not necessary. It's worth pointing out that self-assignments are common, and are not due to deficiencies in CLIR optimisation. Rather, they occur whenever a loop back edge doesn't modify all loop-carried values. This can easily happen if the loop has multiple "early" back-edges -- "continues" in C parlance. E.g.: loop_header(a, b, c, d, e, f): ... a_new = ... b_new = ... if (..) goto loop_header(a_new, b_new, c, d, e, f) ... c_new = ... d_new = ... if (..) goto loop_header(a_new, b_new, c_new, d_new, e, f) etc. For functions with many live values, this can dramatically reduce the number of spill moves we throw into the register allocator. In terms of compilation costs, this ranges from neutral for functions which spill not at all, or minimally (joey_small, joey_med) to a 7.1% reduction in insn count. In terms of run costs, for one spill-heavy test (bz2 w/ custom timing harness), instruction counts are reduced by 4.3%, data reads by 12.3% and data writes by 18.5%. Note those last two figures include all reads and writes made by the generated code, not just spills/reloads, so the proportional reduction in spill/reload traffic must be greater.	2020-06-09 10:36:32 +02:00
Chris Fallin	fc2a6f273b	Three fixes to various SpiderMonkey-related issues: - Properly mask constant values down to appropriate width when generating a constant value directly in aarch64 backend. This was a miscompilation introduced in the new-isel refactor. In combination with failure to respect NarrowValueMode, this resulted in a very subtle bug when an `i32` constant was used in bit-twiddling logic. - Add support for `iadd_ifcout` in aarch64 backend as used in explicit heap-check mode. With this change, we no longer fail heap-related tests with the huge-heap-region mode disabled. - Remove a panic that was occurring in some tests that are currently ignored on aarch64, by simply returning empty/default information in `value_label` functionality rather than touching unimplemented APIs. This is not a bugfix per-se, but removes confusing panic messages from `cargo test` output that might otherwise mislead.	2020-06-08 13:02:00 -07:00
Andrew Brown	40f31375a5	Add TargetIsa::as_any for downcasting to specific ISA implementations This is necessary when we would like to check specific ISA flags, e.g.	2020-06-03 16:27:57 -07:00
Chris Fallin	fe97659813	Address review comments.	2020-06-03 13:31:34 -07:00
Chris Fallin	615362068f	Multi-value return support.	2020-06-03 13:31:34 -07:00
Benjamin Bouvier	e227608510	mach backend: use vectors instead of sets to remember set of uses/defs for calls; This avoids the set uniqueness (hashing) test, reduces memory churn when re-mapping virtual register onto real registers, and is generally more memory-efficient.	2020-06-02 16:29:05 +02:00
Anton Kirilov	8a928830ac	Enable the wast::Cranelift::spec::simd::simd_store test for AArch64 Copyright (c) 2020, Arm Limited.	2020-05-24 22:53:07 +01:00
Chris Fallin	c9e3b71c39	Merge pull request #1729 from cfallin/machinst-branch-opt Fix MachBuffer branch optimization.	2020-05-20 14:43:57 -07:00
Chris Fallin	13e12908a6	MachBuffer branch opts: comments approximating a semi-formal correctness proof.	2020-05-20 14:12:19 -07:00
Chris Fallin	80ab154d04	Update from review comments.	2020-05-20 12:35:36 -07:00
Benjamin Bouvier	1f620e1b46	cranelift: bump regalloc.rs to 0.0.24 and adapt to latest API changes;	2020-05-20 15:37:15 +02:00
Chris Fallin	e11094b28b	Fix MachBuffer branch optimization. This patch fixes a subtle bug that occurred in the MachBuffer branch optimization: in tracking labels at the current buffer tail using a sorted-by-offset array, the code did not update this array properly when redirecting labels. As a result, the dead-branch removal was unsafe, because not every label pointing to a branch is guaranteed to be redirected properly first. Discovered while doing performance testing: bz2 silently took a wrong branch and exited compression early. (Eek!) To address this problem, this patch adopts a slightly simpler data structure: we only track the labels at the current buffer tail, and at the start of each branch, and we're careful to update these appropriately to maintain the invariants. I'm pretty confident that this is correct now, but we should (still) fuzz it a bunch, because wrong control flow scares me a nonzero amount. I should probably also actually write out a formal proof that these data-structure updates are correct. The optimizations are important for performance (removing useless empty blocks, and taking advantage of any fallthrough opportunities at all), so I don't think we would want to drop them entirely.	2020-05-19 18:09:18 -07:00
Chris Fallin	bdd2873c8c	Address review comments.	2020-05-18 16:25:26 -07:00
Chris Fallin	72e6be9342	Rework of MachInst isel, branch fixups and lowering, and block ordering. This patch includes: - A complete rework of the way that CLIF blocks and edge blocks are lowered into VCode blocks. The new mechanism in `BlockLoweringOrder` computes RPO over the CFG, but with a twist: it merges edge blocks into heads or tails of original CLIF blocks wherever possible, and it does this without ever actually materializing the full nodes-plus-edges graph first. The backend driver lowers blocks in final order so there's no need to reshuffle later. - A new `MachBuffer` that replaces the `MachSection`. This is a special version of a code-sink that is far more than a humble `Vec<u8>`. In particular, it keeps a record of label definitions and label uses, with a machine-pluggable `LabelUse` trait that defines various types of fixups (basically internal relocations). Importantly, it implements some simple peephole-style branch rewrites inline in the emission pass, without any separate traversals over the code to use fallthroughs, swap taken/not-taken arms, etc. It tracks branches at the tail of the buffer and can (i) remove blocks that are just unconditional branches (by redirecting the label), (ii) understand a conditional/unconditional pair and swap the conditional polarity when it's helpful; and (iii) remove branches that branch to the fallthrough PC. The `MachBuffer` also implements branch-island support. On architectures like AArch64, this is needed to allow conditional branches within plausibly-attainable ranges (+/- 1MB on AArch64 specifically). It also does this inline while streaming through the emission, without any sort of fixpoint algorithm or later moving of code, by simply tracking outstanding references and "deadlines" and emitting an island just-in-time when we're in danger of going out of range. - A rework of the instruction selector driver. This is largely following the same algorithm as before, but is cleaned up significantly, in particular in the API: the machine backend can ask for an input arg and get any of three forms (constant, register, producing instruction), indicating it needs the register or can merge the constant or producing instruction as appropriate. This new driver takes special care to emit constants right at use-sites (and at phi inputs), minimizing their live-ranges, and also special-cases the "pinned register" to avoid superfluous moves. Overall, on `bz2.wasm`, the results are: wasmtime full run (compile + runtime) of bz2: baseline: 9774M insns, 9742M cycles, 3.918s w/ changes: 7012M insns, 6888M cycles, 2.958s (24.5% faster, 28.3% fewer insns) clif-util wasm compile bz2: baseline: 2633M insns, 3278M cycles, 1.034s w/ changes: 2366M insns, 2920M cycles, 0.923s (10.7% faster, 10.1% fewer insns) All numbers are averages of two runs on an Ampere eMAG.	2020-05-16 23:08:22 -07:00
Benjamin Bouvier	5987cf5cda	machinst: add a linear-scan checked variant too;	2020-05-13 10:56:32 +02:00
Chris Fallin	17cef9140c	MachInst backend: don't reallocate RealRegUniverses for each function compilation. This saves ~0.14% instruction count, ~0.18% allocated bytes, and ~1.5% allocated blocks on a `clif-util wasm` compilation of `bz2.wasm` for aarch64.	2020-05-08 15:35:16 -07:00
Benjamin Bouvier	528d3c1355	machinst: Steal the used/defs Sets when emitting a call in ABICall;	2020-05-07 12:24:02 +02:00
Benjamin Bouvier	19d8a7f1fb	machinst: Reuse memory across loop iterations in lowering;	2020-05-07 12:24:02 +02:00
Benjamin Bouvier	b24b711c16	machinst: Reduce the number of vec allocations for edge blocks;	2020-05-07 12:24:02 +02:00
Benjamin Bouvier	9215b610ef	machinst: Avoid a lot of short-lived allocations in ABICall;	2020-05-07 12:24:02 +02:00
Benjamin Bouvier	4f919c6460	machinst: bump regalloc to 0.0.23 and return a slice on the successor indexes, in block_succs;	2020-05-07 12:24:02 +02:00
Julian Seward	48521393ae	Update to regalloc.rs version 0.22.	2020-05-06 20:16:31 +02:00
Chris Fallin	6d73fdb70a	Merge pull request #1607 from cfallin/aarch64-stack-frame Rework aarch64 stack frame implementation to use positive offsets.	2020-05-06 10:29:30 -07:00
Chris Fallin	a66724aafd	Rework aarch64 stack frame implementation. This PR changes the aarch64 ABI implementation to use positive offsets from SP, rather than negative offsets from FP, to refer to spill slots and stack-local storage. This allows for better addressing-mode options, and hence slightly better code: e.g., the unsigned scaled 12-bit offset mode can be used to reach anywhere in a 32KB frame without extra address-construction instructions, whereas negative offsets are limited to a signed 9-bit unscaled mode (-256 bytes). To enable this, the PR introduces a notion of "nominal SP offsets" as a virtual addressing mode, lowered during the emission pass. The offsets are relative to "SP after adjusting downward to allocate stack/spill slots", but before pushing clobbers. This allows the addressing-mode expressions to be generated before register allocation (or during it, for spill/reload sequences). To convert these offsets into true offsets from SP, we need to track how much further SP is moved downward, and compensate for this. We do so with "virtual SP offset adjustment" pseudo-instructions: these are seen by the emission pass, and result in no instruction (0 byte output), but update state that is now threaded through each instruction emission in turn. In this way, we can push e.g. stack args for a call and adjust the virtual SP offset, allowing reloads from nominal-SP-relative spillslots while we do the argument setup with "real SP offsets" at the same time.	2020-05-06 09:23:55 -07:00
Benjamin Bouvier	1d90751ba9	machinst: Avoid a full instructions traversal of all the blocks when computing the final block ordering;	2020-05-06 15:13:25 +02:00
Chris Fallin	e39b4aba1c	Fix long-range (non-colocated) aarch64 calls to not use Arm64Call reloc, and fix simplejit to use it. Previously, every call was lowered on AArch64 to a `call` instruction, which takes a signed 26-bit PC-relative offset. Including the 2-bit left shift, this gives a range of +/- 128 MB. Longer-distance offsets would cause an impossible relocation record to be emitted (or rather, a record that a more sophisticated linker would fix up by inserting a shim/veneer). This commit adds a notion of "relocation distance" in the MachInst backends, and provides this information for every call target and symbol reference. The intent is that backends on architectures like AArch64, where there are different offset sizes / addressing strategies to choose from, can either emit a regular call or a load-64-bit-constant / call-indirect sequence, as necessary. This avoids the need to implement complex linking behavior. The MachInst driver code provides this information based on the "colocated" bit in the CLIF symbol references, which appears to have been designed for this purpose, or at least a similar one. Combined with the `use_colocated_libcalls` setting, this allows client code to ensure that library calls can link to library code at any location in the address space. Separately, the `simplejit` example did not handle `Arm64Call`; rather than doing so, it appears all that is necessary to get its tests to pass is to set the `use_colocated_libcalls` flag to false, to make use of the above change. This fixes the `libcall_function` unit-test in this crate.	2020-05-05 09:55:12 -07:00
Chris Fallin	8393412c40	Merge pull request #1632 from cfallin/aarch64-fix-srclocs MachInst backend: attach SourceLoc span information to all ranges.	2020-04-30 16:13:55 -07:00
Chris Fallin	964c6087bd	MachInst backend: attach SourceLoc span information to all ranges. Previously, the SourceLoc information transferred in `VCode` only included PC-spans for non-default SourceLocs. I realized that the invariant we're supposed to keep here is that every PC is covered; if no source information, just use `SourceLoc::default()`. This was spurred by @bjorn3's comment in #1575 (thanks!).	2020-04-30 15:40:55 -07:00
Chris Fallin	be6f060abf	Use new regalloc.rs version with dense vreg->rreg maps. This PR updates Cranelift to use the new version of regalloc.rs (bytecodealliance/regalloc.rs#55) that provides dense vreg->rreg maps to the `map_reg()` function for each instruction, rather than the earlier hashmap-based approach. In one test (regex-rs.wasm), this PR results in a 15% reduction in memory allocations (1245MB -> 1060MB) as measured by DHAT on `clif-util wasm` runs.	2020-04-29 10:42:25 -07:00
Benjamin Bouvier	698dc9c401	Fixes #1619 : Properly bubble up errors when seeing an unexpected type during lowering.	2020-04-29 10:26:22 +02:00
Chris Fallin	b691770faa	MachInst backend: pass through SourceLoc information. This change adds SourceLoc information per instruction in a `VCode<Inst>` container, and keeps this information up-to-date across register allocation and branch reordering. The information is initially collected during instruction lowering, eventually collected on the MachSection, and finally provided to the environment that wraps the codegen crate for wasmtime.	2020-04-24 13:18:01 -07:00
Benjamin Bouvier	19b5b0cc7b	aarch64: pass a lowering context to gen_copy_reg_to_arg;	2020-04-24 17:41:14 +02:00
Benjamin Bouvier	0b13d8c848	aarch64: copy SP whenever it's involved in an address lowering with an explicit add;	2020-04-24 17:41:14 +02:00
Benjamin Bouvier	65ef26b989	Add a setting to choose a register allocator algorithm to use with MachBackend;	2020-04-22 14:47:18 +02:00
Benjamin Bouvier	a7ca37e493	Honour the emit_all_ones_funcaddrs() settings when creating unpatched locations;	2020-04-21 17:22:53 +02:00
Benjamin Bouvier	5b8b75def0	Baldrdash: implement support for sign-extension in returns;	2020-04-21 12:12:56 +02:00
Benjamin Bouvier	241c164e25	Implement pinned register usage through set_pinned_reg/get_pinned_reg;	2020-04-21 12:12:56 +02:00
Benjamin Bouvier	d1b5df31fd	Baldrdash: use the right frame offset when loading arguments from the stack	2020-04-21 12:12:56 +02:00
bjorn3	1bee1af755	Implement stack_addr for AArch64	2020-04-18 13:24:06 +02:00
Chris Fallin	48cf2c2f50	Address review comments: - Undo temporary changes to default features (`all-arch`) and a signal-handler test. - Remove `SIGTRAP` handler: no longer needed now that we've found an "undefined opcode" option on ARM64. - Rename pp.rs to pretty_print.rs in machinst/. - Only use empty stack-probe on non-x86. As per a comment in rust-lang/compiler-builtins [1], LLVM only supports stack probes on x86 and x86-64. Thus, on any other CPU architecture, we cannot refer to `__rust_probestack`, because it does not exist. - Rename arm64 to aarch64. - Use `target` directive in vcode filetests. - Run the flags verifier, but without encinfo, when using new backends. - Clean up warning overrides. - Fix up use of casts: use u32::from(x) and siblings when possible, u32::try_from(x).unwrap() when not, to avoid silent truncation. - Take immutable `Function` borrows as input; we don't actually mutate the input IR. - Lots of other miscellaneous cleanups. [1] `cae3e6ea23/src/probestack.rs (L39)`	2020-04-15 17:21:28 -07:00
Chris Fallin	d83574261c	ARM64 backend, part 3 / 11: MachInst infrastructure. This patch adds the MachInst, or Machine Instruction, infrastructure. This is the machine-independent portion of the new backend design. It contains the implementation of the "vcode" (virtual-registerized code) container, the top-level lowering algorithm and compilation pipeline, and the trait definitions that the machine backends will fill in. This backend infrastructure is included in the compilation of the `codegen` crate, but it is not yet tied into the public APIs; that patch will come last, after all the other pieces are filled in. This patch contains code written by Julian Seward <jseward@acm.org> and Benjamin Bouvier <public@benj.me>, originally developed on a side-branch before rebasing and condensing into this patch series. See the `arm64` branch at `https://github.com/cfallin/wasmtime` for original development history. Co-authored-by: Julian Seward <jseward@acm.org> Co-authored-by: Benjamin Bouvier <public@benj.me>	2020-04-11 17:51:11 -07:00
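
The `lower_edge` commit in the log above (6d25759c8e) describes two ordering rules for phi moves on a CFG edge: ignore self-assignments, and place constant assignments at the end of the sequence. The following is a minimal sketch of those two rules only, not the actual Cranelift code. The names `Src`, `Move`, `order_phi_moves` and `has_overlap` are hypothetical, and the overlap check is a simplified stand-in for whatever test the real lowering uses to decide that a double-copy is needed.

```rust
// Hypothetical types for illustration; these are not Cranelift's data structures.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Src {
    Reg(u32),
    Const(i64),
}

#[derive(Clone, Copy, Debug, PartialEq)]
struct Move {
    dst: u32,
    src: Src,
}

/// Orders the phi moves for one edge: self-assignments are dropped,
/// register-to-register moves keep their original relative order, and
/// constant assignments are placed last to keep their live ranges short.
fn order_phi_moves(moves: &[Move]) -> Vec<Move> {
    let mut reg_moves = Vec::new();
    let mut const_moves = Vec::new();
    for &m in moves {
        match m.src {
            // Self-assignment (e.g. an unmodified loop-carried value): skip.
            Src::Reg(r) if r == m.dst => {}
            Src::Reg(_) => reg_moves.push(m),
            Src::Const(_) => const_moves.push(m),
        }
    }
    reg_moves.extend(const_moves);
    reg_moves
}

/// Simplified overlap test: true if any move's destination is also read as a
/// register source by some move in the set (including the move itself).
fn has_overlap(moves: &[Move]) -> bool {
    moves
        .iter()
        .any(|m| moves.iter().any(|other| other.src == Src::Reg(m.dst)))
}
```

Under this simplified test, a single self-assignment such as `Move { dst: 0, src: Src::Reg(0) }` is enough to report an overlap for the whole set, which mirrors the commit's point that filtering such moves first avoids the overlap handling where it is not needed.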

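Two of the commits above quote AArch64 branch ranges: +/- 128 MB for a direct call (a signed 26-bit PC-relative offset with a 2-bit left shift, e39b4aba1c) and +/- 1MB for conditional branches (72e6be9342). The arithmetic can be checked with a short sketch. The function names are hypothetical, and the 19-bit width used for conditional branches is an assumption taken from the AArch64 `B.cond` encoding rather than from the commit text.

```rust
/// Inclusive byte range reachable by a PC-relative branch whose immediate is
/// a signed field of `bits` bits, scaled by 4 (the 2-bit left shift).
fn branch_range_bytes(bits: u32) -> (i64, i64) {
    let min = -(1i64 << (bits - 1)) * 4;
    let max = ((1i64 << (bits - 1)) - 1) * 4;
    (min, max)
}

/// True if `offset` (in bytes) is 4-byte aligned and within the range of a
/// branch with a `bits`-wide immediate.
fn fits(bits: u32, offset: i64) -> bool {
    let (min, max) = branch_range_bytes(bits);
    offset % 4 == 0 && offset >= min && offset <= max
}
```

With `bits = 26` the lower bound is exactly -128 MiB and the upper bound is 128 MiB minus 4 bytes; a target outside that range is where the commit's load-64-bit-constant / call-indirect sequence, or a branch island for the conditional case, would be needed.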

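The aarch64 stack frame commit above (a66724aafd) describes "nominal SP offsets" that are converted to true SP offsets during emission, using state updated by zero-byte "virtual SP offset adjustment" pseudo-instructions. A rough sketch of that bookkeeping follows. All names here (`EmitState`, `Inst`, `emit`) are illustrative assumptions, not the backend's real types, and the sign convention (a positive adjustment means SP has moved further downward by that many bytes) is chosen for the example.

```rust
// Illustrative names only; not the actual Cranelift types.
#[derive(Default)]
struct EmitState {
    /// How many bytes the real SP currently sits below the nominal SP.
    virtual_sp_offset: i64,
}

enum Inst {
    /// Pseudo-instruction: produces no machine code, only updates the state.
    VirtualSpAdjust(i64),
    /// Access to a spill slot addressed relative to the nominal SP.
    LoadNominal { nominal_offset: i64 },
}

/// "Emits" one instruction. Returns the real SP-relative offset used by the
/// instruction, or `None` for the zero-byte pseudo-instruction.
fn emit(inst: &Inst, state: &mut EmitState) -> Option<i64> {
    match *inst {
        Inst::VirtualSpAdjust(delta) => {
            state.virtual_sp_offset += delta;
            None
        }
        Inst::LoadNominal { nominal_offset } => {
            // Compensate for how far SP has moved since the frame was set up.
            Some(nominal_offset + state.virtual_sp_offset)
        }
    }
}
```

In this sketch, a reload from nominal offset 16 resolves to real offset 16 normally, and to real offset 48 while 32 bytes of stack arguments are pushed for a call.
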
342 Commits