wasmtime

Author	SHA1	Message	Date
bjorn3	9e34df33b9	Remove the old x86 backend	2021-09-29 16:13:46 +02:00
Ulrich Weigand	51131a3acc	Fix s390x regressions (#3330) - Add relocation handling needed after PR #3275 - Fix incorrect handling of signed constants detected by PR #3056 test - Fix LabelUse max pos/neg ranges; fix overflow in buffers.rs - Disable fuzzing tests that require pre-built v8 binaries - Disable cranelift test that depends on i128 - Temporarily disable memory64 tests	2021-09-20 09:12:36 -05:00
Benjamin Bouvier	85ec11acb9	Aarch64: always generate the CFA directive indicating no pointer signing	2021-09-02 09:16:34 +02:00
Alex Crichton	1532516a36	Use relative `call` instructions between wasm functions (#3275 ) * Use relative `call` instructions between wasm functions This commit is a relatively major change to the way that Wasmtime generates code for Wasm modules and how functions call each other. Prior to this commit all function calls between functions, even if they were defined in the same module, were done indirectly through a register. To implement this the backend would emit an absolute 8-byte relocation near all function calls, load that address into a register, and then call it. While this technique is simple to implement and easy to get right, it has two primary downsides associated with it: * Function calls are always indirect which means they are more difficult to predict, resulting in worse performance. * Generating a relocation-per-function call requires expensive relocation resolution at module-load time, which can be a large contributing factor to how long it takes to load a precompiled module. To fix these issues, while also somewhat compromising on the previously simple implementation technique, this commit switches wasm calls within a module to using the `colocated` flag enabled in Cranelift-speak, which basically means that a relative call instruction is used with a relocation that's resolved relative to the pc of the call instruction itself. When switching the `colocated` flag to `true` this commit is also then able to move much of the relocation resolution from `wasmtime_jit::link` into `wasmtime_cranelift::obj` during object-construction time. This frontloads all relocation work which means that there's actually no relocations related to function calls in the final image, solving both of our points above. The main gotcha in implementing this technique is that there are hardware limitations to relative function calls which mean we can't simply blindly use them. 
AArch64, for example, can only go +/- 64 MB from the `bl` instruction to the target, which means that if the function we're calling is a greater distance away then we would fail to resolve that relocation. On x86_64 the limits are +/- 2GB which are much larger, but theoretically still feasible to hit. Consequently the main increase in implementation complexity is fixing this issue. This issue is actually already present in Cranelift itself, and is internally one of the invariants handled by the `MachBuffer` type. When generating a function, relative jumps between basic blocks have similar restrictions. This commit adds new methods for the `MachBackend` trait and updates the implementation of `MachBuffer` to account for all these new branches. Specifically the changes to `MachBuffer` are: * For AArch64 the `LabelUse::Branch26` value now supports veneers, and AArch64 calls use this to resolve relocations. * The `emit_island` function has been rewritten internally to handle some cases which didn't come up before, such as: * When emitting an island the deadline is now recalculated, where previously it was always set to infinitely far in the future. This was ok prior since only a `Branch19` supported veneers and once it was promoted no veneers were supported, so without multiple layers of promotion the lack of a new deadline was ok. * When emitting an island all pending fixups had veneers forced if their branch target wasn't known yet. This was generally ok for 19-bit fixups since the only kind getting a veneer was a 19-bit fixup, but with mixed kinds it's a bit odd to force veneers for a 26-bit fixup just because a nearby 19-bit fixup needed a veneer. Instead fixups are now re-enqueued unless they're known to be out-of-bounds. This may run the risk of generating more islands for 19-bit branches but it should also reduce the number of islands for between-function calls. 
* Otherwise the internal logic was tweaked to ideally be a bit more simple, but that's a pretty subjective criterion in compilers... I've added some simple testing of this for now. A synthetic compiler option was created to simply add padded 0s between functions, and test cases implement various forms of calls that at least need veneers. A test is also included for x86_64, but it is unfortunately pretty slow because it requires generating 2GB of output. I'm hoping for now it's not too bad, but we can disable the test if it's prohibitive and otherwise just comment the necessary portions to be sure to run the ignored test if these parts of the code have changed. The final end-result of this commit is that for a large module I'm working with the number of relocations dropped to zero, meaning that nothing actually needs to be done to the text section when it's loaded into memory (yay!). I haven't run final benchmarks yet but this is the last remaining source of significant slowdown when loading modules, after I land a number of other PRs both active and ones that I only have locally for now. * Fix arm32 * Review comments	2021-09-01 13:27:38 -05:00
Anton Kirilov	7b98be1bee	Cranelift: Simplify leaf functions that do not use the stack (#2960) * Cranelift AArch64: Simplify leaf functions that do not use the stack Leaf functions that do not use the stack (e.g. do not clobber any callee-saved registers) do not need a frame record. Copyright (c) 2021, Arm Limited.	2021-08-27 12:12:37 +02:00
Alex Crichton	6fbddc1931	Replace some cfg(debug) with cfg(debug_assertions) (#3242) * Replace some cfg(debug) with cfg(debug_assertions) Neither Cargo nor rustc ever sets `cfg(debug)` automatically, so it's expected that these usages were intended to be `cfg(debug_assertions)`. * Fix MachBuffer debug-assertion invariant checks. We should only check invariants when we expect them to be true -- specifically, before the branch-simplification algorithm runs. At other times, they may be temporarily violated: e.g., after `add_{cond,uncond}_branch()` but before emitting the branch bytes. This is the expected sequence, and the rest of the code is consistent with that. Some of the checks also were not quite right (w.r.t. the written invariants); specifically, we should not check validity of a label's offset when the label has been aliased to another label. It seems that this is an unfortunate consequence of leftover debug-assertions that weren't actually being run, so weren't kept up-to-date. Should no longer happen now that we actually check these! Co-authored-by: Chris Fallin <chris@cfallin.org>	2021-08-25 22:15:24 -05:00
Nick Fitzgerald	4283d2116d	cranelift: Move most debug-level logs to the trace level Cranelift crates have historically been much more verbose with debug-level logging than most other crates in the Rust ecosystem. We log things like how many parameters a basic block has, the color of virtual registers during regalloc, etc. Even for Cranelift hackers, these things are largely only useful when hacking specifically on Cranelift and looking at a particular test case, not even when using some Cranelift embedding (such as Wasmtime). Most of the time, when people want logging for their Rust programs, they do something like: RUST_LOG=debug cargo run This means that they get all that mostly not useful debug logging out of Cranelift. So they might want to disable logging for Cranelift, or change it to a higher log level: RUST_LOG=debug,cranelift=info cargo run The problem is that this is already more annoying to type than `RUST_LOG=debug`, and that Cranelift isn't one single crate, so you actually have to play whack-a-mole with naming all the Cranelift crates off the top of your head, something more like this: RUST_LOG=debug,cranelift=info,cranelift_codegen=info,cranelift_wasm=info,... Therefore, we're changing most of the `debug!` logs into `trace!` logs: anything that is very Cranelift-internal, unlikely to be useful/meaningful to the "average" Cranelift embedder, or prints a message for each instruction visited during a pass. On the other hand, things that just report a one line statistic for a whole pass, for example, are left as `debug!`. The more verbose the log messages are, the higher the bar they must clear to be `debug!` rather than `trace!`.	2021-07-26 11:50:16 -07:00
Benjamin Bouvier	4c595f4f9d	Remove unused store_stackslot/load_stackslot trait methods.	2021-07-02 18:09:33 +02:00
Benjamin Bouvier	91c65d739f	Remove unused code in machinst	2021-07-02 18:09:33 +02:00
Ulrich Weigand	a90ab8a0cf	Fix updating srclocs in truncate_last_branch The truncate_last_branch removes an instruction that had already been added to the buffer, and must update various bookkeeping. However, updating the "srclocs" field is incorrect: if there is a srclocs entry that spans both the removed branch and some previous instruction, that whole srclocs entry is removed, which makes those previous instructions now uncovered by any srclocs record. This can cause subsequent problems e.g. if one of those instructions traps. Fixed by just truncating instead of fully removing the srclocs record in this case.	2021-06-22 13:53:47 +02:00
Benjamin Bouvier	51edea9e57	cranelift: introduce a new WasmtimeAppleAarch64 calling convention The previous choice to use the WasmtimeSystemV calling convention for apple-aarch64 devices was incorrect: padding of arguments was incorrectly computed. So we have to use some flavor of the apple-aarch64 ABI there. Since we want to support the wasmtime custom convention for multiple returns on apple-aarch64 too, a new custom Wasmtime calling convention was introduced to support this.	2021-06-01 17:29:12 +02:00
Chris Fallin	800cf25bb5	Make the CFG metadata computation conditional on a flag.	2021-05-24 13:01:15 -07:00
Chris Fallin	11a2ef01e7	Provide BB layout info externally in terms of code offsets. This is sometimes useful when performing analyses on the generated machine code: for example, some kinds of code verifiers will want to do a control-flow analysis, and it is much easier to do this if one does not have to recover the CFG from the machine code (doing so requires heavyweight analysis when indirect branches are involved). If one trusts the control-flow lowering and only needs to verify other properties of the code, this can be very useful.	2021-05-24 09:18:06 -07:00
Benjamin Bouvier	50aa645769	cranelift: use a deferred display wrapper for logging the vcode's IR	2021-04-16 10:27:19 +02:00
Ulrich Weigand	10efe8e780	cranelift: Fix spillslot regression on big-endian platforms PR 2840 changed the store_spillslot routine to always store integer registers in full word size to a spill slot. However, the load_spillslot routine was not updated, which may cause the contents to be reloaded in a different type. On big-endian systems this will fetch wrong data. Fixed by using the same type override in load_spillslot.	2021-04-15 21:39:14 +02:00
Chris Fallin	36c667d58d	Merge pull request #2837 from uweigand/outgoing-args Add back support for accumulating outgoing arguments	2021-04-14 12:54:06 -07:00
Chris Fallin	fd4bfbe5a7	Merge pull request #2836 from uweigand/framesizefix Fix frame size after unwind rework	2021-04-14 12:19:38 -07:00
Benjamin Bouvier	e7bced9512	cranelift: always spill i32 with i64 stores; Fixes #2839. See also the issue description and comments in this commit for details of what the fix is about here.	2021-04-14 18:08:52 +02:00
Ulrich Weigand	336c6369b4	Add back support for accumulating outgoing arguments The unwind rework (commit `2d5db92a`) removed support for the feature to allow a target to allocate the space for outgoing function arguments right in the prologue (originally added via commit `80c2d70d`). This patch adds it back.	2021-04-14 13:51:16 +02:00
Ulrich Weigand	e3bb36ba77	Fix frame size after unwind rework After the unwind rework (commit `2d5db92a`) the space used to save clobbered registers now lies between the nominal SP and the FP. Therefore, the size of that space should now be included in the frame size as reported by frame_size(), since this value is used to compute the nominal_sp_to_fp offset.	2021-04-14 13:46:08 +02:00
Alex Crichton	195bf0e29a	Fully support multiple returns in Wasmtime (#2806) * Fully support multiple returns in Wasmtime For quite some time now Wasmtime has "supported" multiple return values, but only in the most bare-bones ways. Up until recently you couldn't get a typed version of functions with multiple return values, and never have you been able to use `Func::wrap` with functions that return multiple values. Even recently where `Func::typed` can call functions that return multiple values it uses a double-indirection by calling a trampoline which calls the real function. The underlying reason for this lack of support is that cranelift's ABI for returning multiple values is not possible to write in Rust. For example if a wasm function returns two `i32` values there is no Rust (or C!) function you can write to correspond to that. This commit, however, fixes that. This commit adds two new ABIs to Cranelift: `WasmtimeSystemV` and `WasmtimeFastcall`. The intention is that these Wasmtime-specific ABIs match their corresponding ABI (e.g. `SystemV` or `WindowsFastcall`) for everything except how multiple values are returned. For multiple return values we simply define our own version of the ABI which Wasmtime implements, which is that for N return values the first is returned as if the function only returned that and the latter N-1 return values are returned via an out-ptr that's the last parameter to the function. These custom ABIs provide the ability for Wasmtime to bind these in Rust meaning that `Func::wrap` can now wrap functions that return multiple values and `Func::typed` no longer uses trampolines when calling functions that return multiple values. Although there's lots of internal changes there's no actual changes in the API surface area of Wasmtime, just a few more impls of more public traits which means that more types are supported in more places! 
Another change made with this PR is a consolidation of how the ABI of each function in a wasm module is selected. The native `SystemV` ABI, for example, is more efficient at returning multiple values than the wasmtime version of the ABI (since more things are in more registers). To continue to take advantage of this Wasmtime will now classify some functions in a wasm module with the "fast" ABI. Only functions that are not reachable externally from the module are classified with the fast ABI (e.g. those not exported, used in tables, or used with `ref.func`). This should enable purely internal functions of modules to have a faster calling convention than those which might be exposed to Wasmtime itself. Closes #1178 * Tweak some names and add docs * "fix" lightbeam compile * Fix TODO with dummy environ * Unwind info is a property of the target, not the ABI * Remove lightbeam unused imports * Attempt to fix arm64 * Document new ABIs aren't stable * Fix filetests to use the right target * Don't always do 64-bit stores with cranelift This was overwriting upper bits when 32-bit registers were being stored into return values, so fix the code inline to do a sized store instead of one-size-fits-all store. * At least get tests passing on the old backend * Fix a typo * Add some filetests with mixed abi calls * Get `multi` example working * Fix doctests on old x86 backend * Add a mixture of wasmtime/system_v tests	2021-04-07 12:34:26 -05:00
Chris Fallin	cb48ea406e	Switch default to new x86_64 backend. This PR switches the default backend on x86, for both the `cranelift-codegen` crate and for Wasmtime, to the new (`MachInst`-style, `VCode`-based) backend that has been under development and testing for some time now. The old backend is still available by default in builds with the `old-x86-backend` feature, or by requesting `BackendVariant::Legacy` from the appropriate APIs. As part of that switch, it adds some more runtime-configurable plumbing to the testing infrastructure so that tests can be run using the appropriate backend. `clif-util test` is now capable of parsing a backend selector option from filetests and instantiating the correct backend. CI has been updated so that the old x86 backend continues to run its tests, just as we used to run the new x64 backend separately. At some point, we will remove the old x86 backend entirely, once we are satisfied that the new backend has not caused any unforeseen issues and we do not need to revert.	2021-04-02 11:35:53 -07:00
Peter Huene	0ddfe97a09	Change how flags are stored in serialized modules. This commit changes how both the shared flags and ISA flags are stored in the serialized module to detect incompatibilities when a serialized module is instantiated. It improves the error reporting when a compiled module has mismatched shared flags.	2021-04-01 21:39:57 -07:00
Peter Huene	29d366db7b	Add a compile command to Wasmtime. This commit adds a `compile` command to the Wasmtime CLI. The command can be used to Ahead-Of-Time (AOT) compile WebAssembly modules. With the `all-arch` feature enabled, AOT compilation can be performed for non-native architectures (i.e. cross-compilation). The `Module::compile` method has been added to perform AOT compilation. A few of the CLI flags relating to "on by default" Wasm features have been changed to be "--disable-XYZ" flags. A simple example of using the `wasmtime compile` command: ```text $ wasmtime compile input.wasm $ wasmtime input.cwasm ```	2021-04-01 19:38:18 -07:00
Alex Crichton	30d9164b6e	Fix a number of warnings cropping up on nightly Rust (#2767) Various small issues here and there, nothing major	2021-03-25 13:19:37 -05:00
Benjamin Bouvier	49ef2c652a	Cranelift: remove logging of vcode when the log level isn't debug or more (#2755) This logging step may be quite expensive, since logging has never been optimized at all. Removing it is a clear win in compile times on my machine for a large wasm module, for which parallel compilation time drops from 6 seconds to 1.5 seconds. Co-authored-by: bjorn3 <bjorn3@users.noreply.github.com>	2021-03-23 16:07:32 +01:00
Benjamin Bouvier	6e6713ae0b	cranelift: add support for the Mac aarch64 calling convention This bumps target-lexicon and adds support for the AppleAarch64 calling convention. Specifically for WebAssembly support, we only have to worry about the new stack slots convention. Stack slots don't need to be at least 8-bytes, they can be as small as the data type's size. For instance, if we need stack slots for (i32, i32), they can be located at offsets (+0, +4). Note that they still need to be properly aligned on the data type they're containing, though, so if we need stack slots for (i32, i64), we can't start the i64 slot at the +4 offset (it must start at the +8 offset). Added one test that was failing on the Mac M1, as well as other tests stressing different yet similar situations.	2021-03-22 10:06:13 +01:00
Chris Fallin	2d5db92a9e	Rework/simplify unwind infrastructure and implement Windows unwind. Our previous implementation of unwind infrastructure was somewhat complex and brittle: it parsed generated instructions in order to reverse-engineer unwind info from prologues. It also relied on some fragile linkage to communicate instruction-layout information that VCode was not designed to provide. A much simpler, more reliable, and easier-to-reason-about approach is to embed unwind directives as pseudo-instructions in the prologue as we generate it. That way, we can say what we mean and just emit it directly. The usual reasoning that leads to the reverse-engineering approach is that metadata is hard to keep in sync across optimization passes; but here, (i) prologues are generated at the very end of the pipeline, and (ii) if we ever do a post-prologue-gen optimization, we can treat unwind directives as black boxes with unknown side-effects, just as we do for some other pseudo-instructions today. It turns out that it was easier to just build this for both x64 and aarch64 (since they share a factored-out ABI implementation), and wire up the platform-specific unwind-info generation for Windows and SystemV. Now we have simpler unwind on all platforms and we can delete the old unwind infra as soon as we remove the old backend. There were a few consequences to supporting Fastcall unwind in particular that led to a refactor of the common ABI. Windows only supports naming clobbered-register save locations within 240 bytes of the frame-pointer register, whatever one chooses that to be (RSP or RBP). We had previously saved clobbers below the fixed frame (and below nominal-SP). The 240-byte range has to include the old RBP too, so we're forced to place clobbers at the top of the frame, just below saved RBP/RIP. This is fine; we always keep a frame pointer anyway because we use it to refer to stack args. 
It does mean that offsets of fixed-frame slots (spillslots, stackslots) from RBP are no longer known before we do regalloc, so if we ever want to index these off of RBP rather than nominal-SP because we add support for `alloca` (dynamic frame growth), then we'll need a "nominal-BP" mode that is resolved after regalloc and clobber-save code is generated. I added a comment to this effect in `abi_impl.rs`. The above refactor touched both x64 and aarch64 because of shared code. This had a further effect in that the old aarch64 prologue generation subtracted from `sp` once to allocate space, then used stores to `[sp, offset]` to save clobbers. Unfortunately the offset only has 7-bit range, so if there are enough clobbered registers (and there can be -- aarch64 has 384 bytes of registers; at least one unit test hits this) the stores/loads will be out-of-range. I really don't want to synthesize large-offset sequences here; better to go back to the simpler pre-index/post-index `stp r1, r2, [sp, #-16]` form that works just like a "push". It's likely not much worse microarchitecturally (dependence chain on SP, but oh well) and it actually saves an instruction if there's no other frame to allocate. As a further advantage, it's much simpler to understand; simpler is usually better. This PR adds the new backend on Windows to CI as well.	2021-03-11 20:03:52 -08:00
Chris Fallin	e41d882144	Merge pull request #2678 from cfallin/x64-fastcall x86-64 Windows fastcall ABI support.	2021-03-05 10:46:47 -08:00
Chris Fallin	6c94eb82aa	x86-64 Windows fastcall ABI support. This adds support for the "fastcall" ABI, which is the native C/C++ ABI on Windows platforms on x86-64. It is similar to but not exactly like System V; primarily, its argument register assignments are different, and it requires stack shadow space. Note that this also adjusts the handling of multi-register values in the shared ABI implementation, and with this change, adjusts handling of `i128`s on both Fastcall/x64 and SysV/x64 platforms. This was done to align with actual behavior by the "rustc ABI" on both platforms, as mapped out experimentally (Compiler Explorer link in comments). This behavior is gated under the `enable_llvm_abi_extensions` flag. Note also that this does not add x64 unwind info on Windows. That will come in a future PR (but is planned!).	2021-03-03 19:53:18 -08:00
Chris Fallin	40db4de44a	Fix incomplete trap metadata due to multiple traps at one address. If an instruction has more than one trap record associated with it (for example: a divide instruction that has participated in load-op fusion, so we have both a heap-out-of-bounds trap record due to its load and a divide-by-zero trap record due to its divide op), the current MachBuffer code would emit only one of the trap records to the sink. Separately, divide instructions probably shouldn't merge loads, because the two separate possible traps at one location might be confusing for some embedders (certainly in Lucet). Divide seems to be the only case in our current codegen where such merging might occur. This PR changes the lowering to always force the divisor into a register. Finally, while working out why trap records were not appearing, I had noticed that `isa::x64::emit_std_enc_mem()` was only emitting heap-OOB trap metadata for loads/stores when it had a srcloc. This PR ensures that the metadata is emitted even when the srcloc is empty. Note that none of the above presents a security or correctness problem; trap metadata only affects the status that we return to the embedder when a Wasm program terminates with a trap.	2021-02-24 15:13:45 -08:00
bjorn3	ff22842da5	More atomic ops	2021-02-18 14:16:15 +01:00
bjorn3	602006ff9d	Fix build_value_labels_ranges for newBE when there are no labels	2021-02-04 11:46:20 +01:00
bjorn3	76d615049d	Make the stackslot offsets available for debuginfo	2021-02-03 17:48:52 +01:00
Kasey Carrothers	99be82c866	Replace MachInst::gen_zero_len_nop with gen_nop(0)	2021-01-29 01:15:08 -08:00
Kasey Carrothers	f76a9d436e	Clean up handling of NOPs in the x64 backend. 1. Restricts max nop size to 15 instead of 16. 2. Fixes an edge case where gen_nop() would return a zero-sized instruction on multiples of 16. 3. Clarifies the documentation of the gen_nop interface to state that returning zero is allowed when preferred_size is zero.	2021-01-28 20:45:00 -08:00
Alex Crichton	503129ad91	Add a method to share `Config` across machines (#2608) With `Module::{serialize,deserialize}` it should be possible to share wasmtime modules across machines or CPUs. Serialization, however, embeds a hash of all configuration values, including cranelift compilation settings. By default wasmtime's selection of the native ISA would enable ISA flags according to CPU features available on the host, but the same CPU features may not be available across two machines. This commit adds a `Config::cranelift_clear_cpu_flags` method which allows clearing the target-specific ISA flags that are automatically inferred by default for the native CPU. Options can then be incrementally built back up as-desired with the `cranelift_other_flag` method.	2021-01-26 15:59:12 -06:00
Chris Fallin	f54d0d05c7	Address review comments.	2021-01-22 16:02:29 -08:00
Chris Fallin	7e12abce71	Fix a few comment typos and add a clarifying comment.	2021-01-21 16:01:46 -08:00
Chris Fallin	997fab55d5	Skip value-label analysis if no value labels are present.	2021-01-21 15:59:52 -08:00
Chris Fallin	c84d6be6f4	Detailed debug-info (DWARF) support in new backends (initially x64). This PR propagates "value labels" all the way from CLIF to DWARF metadata on the emitted machine code. The key idea is as follows: - Translate value-label metadata on the input into "value_label" pseudo-instructions when lowering into VCode. These pseudo-instructions take a register as input, denote a value label, and semantically are like a "move into value label" -- i.e., they update the current value (as seen by debugging tools) of the given local. These pseudo-instructions emit no machine code. - Perform a dataflow analysis at the machine-code level, tracking value-labels that propagate into registers and into [SP+constant] stack storage. This is a forward dataflow fixpoint analysis where each storage location can contain a set of value labels, and each value label can reside in a set of storage locations. (Meet function is pairwise intersection by storage location.) This analysis traces value labels symbolically through loads and stores and reg-to-reg moves, so it will naturally handle spills and reloads without knowing anything special about them. - When this analysis converges, we have, at each machine-code offset, a mapping from value labels to some number of storage locations; for each offset for each label, we choose the best location (prefer registers). Note that we can choose any location, as the symbolic dataflow analysis is sound and guarantees that the value at the value_label instruction propagates to all of the named locations. - Then we can convert this mapping into a format that the DWARF generation code (wasmtime's debug crate) can use. This PR also adds the new-backend variant to the gdb tests on CI.	2021-01-21 15:59:49 -08:00
Chris Fallin	456561f431	x64 and aarch64: allow StructArgument and StructReturn args. The StructReturn ABI is fairly simple at the codegen/isel level: we only need to take care to return the sret pointer as one of the return values if that wasn't specified in the initial function signature. Struct arguments are a little more complex. A struct argument is stored as a chunk of memory in the stack-args space. However, the CLIF semantics are slightly special: on the caller side, the parameter passed in is a pointer to an arbitrary memory block, and we must memcpy this data to the on-stack struct-argument; and on the callee side, we provide a pointer to the passed-in struct-argument as the CLIF block param value. This is necessary to support various ABIs other than Wasm, such as that of Rust (with the cg_clif codegen backend).	2021-01-17 23:11:45 -08:00
Chris Fallin	b4426be072	machinst lowering: update inst color when scanning across branch to allow more load-op merging. A branch is considered side-effecting and so updates the instruction color (which is our way of computing how far instructions can sink). However, in the lowering loop, we did not update current instruction color when scanning backward across branches, which are side-effecting. As a result, the color was stale and fewer load-op merges were permitted than are actually possible. Note that this would not have resulted in any correctness issues, as the stale color is too high (so no merges are permitted that should have been disallowed). Fixes #2562.	2021-01-11 11:20:44 -08:00
Chris Fallin	6eea015d6c	Multi-register value support: framework for Values wider than machine regs. This will allow for support for `I128` values everywhere, and `I64` values on 32-bit targets (e.g., ARM32 and x86-32). It does not alter the machine backends to build such support; it just adds the framework for the MachInst backends to reason about a `Value` residing in more than one register.	2021-01-05 17:45:02 -08:00
Chris Fallin	2cec20aa57	Merge pull request #2486 from cfallin/fix-probestack Two Lucet-related fixes to stack overflow handling.	2020-12-07 16:47:37 -08:00
Chris Fallin	3a01d14712	Two Lucet-related fixes to stack overflow handling. Lucet uses stack probes rather than explicit stack limit checks as Wasmtime does. In bytecodealliance/lucet#616, I have discovered that I previously was not running some Lucet runtime tests with the new backend, so was missing some test failures due to missing pieces in the new backend. This PR adds (i) calls to probestack, when enabled, in the prologue of every function with a stack frame larger than one page (configurable via flags); and (ii) trap metadata for every instruction on x86-64 that can access the stack, hence be the first point at which a stack overflow is detected when the stack pointer is decremented.	2020-12-07 16:08:53 -08:00
Chris Fallin	3e516e784b	Fix lowering instruction-sinking (load-merging) bug. This fixes a subtle corner case exposed during fuzzing. If we have a bit of CLIF like: ``` v0 = load.i64 ... v1 = iadd.i64 v0, ... v2 = do_other_thing v1 v3 = load.i64 v1 ``` and if this is lowered using a machine backend that can merge loads into ALU ops, and that has an addressing mode that can look through add ops, then the following can happen: 1. We lower the load at `v3`. This looks backward at the address operand tree and finds that `v1` is `v0` plus other things; it has an addressing mode that can add `v0`'s register and the other things directly; so it calls `put_value_in_reg(v0)` and uses its register in the amode. At this point, the add producing `v1` has no references, so it will not (yet) be codegen'd. 2. We lower `do_other_thing`, which puts `v1` in a register and uses it. The `iadd` now has a reference. 3. We reach the `iadd` and, because it has a reference, lower it. Our machine has the ability to merge a load into an ALU operation. Crucially, we think the load at `v0` is mergeable because it has only one user, the add at `v1` (!). So we merge it. 4. We reach the `load` at `v0` and because it has been merged into the `iadd`, we do not separately codegen it. The register that holds `v0` is thus never written, and the use of this register by the final load (Step 1) will see an undefined value. The logic error here is that in the presence of pattern matching that looks through pure ops, we can end up with multiple uses of a value that originally had a single use (because we allow lookthrough of pure ops in all cases). In other words, the multiple-use-ness of `v1` "passes through" in some sense to `v0`. However, the load sinking logic is not aware of this. The fix, I think, is pretty simple: we disallow an effectful instruction from sinking/merging if it already has some other use when we look back at it. 
If we disallowed lookthrough of any op that had multiple uses, even pure ones, then we would avoid this scenario; but earlier experiments showed that to have a non-negligible performance impact, so (given that we've worked out the logic above) I think this complexity is worth it.	2020-12-03 14:59:12 -08:00
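The four steps in the commit above can be replayed with a small model. This is only a sketch under stated assumptions: the `Lowering` struct, its two counters, and both `can_sink_*` checks are hypothetical names invented for illustration and are not Cranelift's actual API; only the name `put_value_in_reg` is borrowed from the commit message, and its signature here is made up.

```rust
/// Hypothetical model of the lowering state (not Cranelift's real types).
struct Lowering {
    /// Number of uses of each value in the original CLIF.
    clif_uses: Vec<u32>,
    /// Number of times lowering has already asked for the value in a register.
    reg_uses: Vec<u32>,
}

impl Lowering {
    /// Record that some lowered instruction reads `v` from a register.
    fn put_value_in_reg(&mut self, v: usize) {
        self.reg_uses[v] += 1;
    }

    /// Check before the fix: only the CLIF-level use count is consulted.
    fn can_sink_buggy(&self, v: usize) -> bool {
        self.clif_uses[v] == 1
    }

    /// Check after the fix: also refuse if the value already has another use.
    fn can_sink_fixed(&self, v: usize) -> bool {
        self.clif_uses[v] == 1 && self.reg_uses[v] == 0
    }
}

/// Replays steps 1-3 and returns (buggy decision, fixed decision) for
/// merging the load that defines `v0` into the `iadd` that defines `v1`.
fn replay_scenario() -> (bool, bool) {
    // v0 is used by v1; v1 is used by v2 and v3; v2 and v3 are unused here.
    let mut l = Lowering {
        clif_uses: vec![1, 2, 0, 0],
        reg_uses: vec![0; 4],
    };
    // Step 1: lowering the load at v3 looks through the iadd and uses v0 directly.
    l.put_value_in_reg(0);
    // Step 2: lowering do_other_thing puts v1 in a register.
    l.put_value_in_reg(1);
    // Step 3: lowering the iadd asks whether the load at v0 may be merged.
    (l.can_sink_buggy(0), l.can_sink_fixed(0))
}
```

In this model the pre-fix check agrees to merge the load even though `v0`'s register is already referenced, while the post-fix check refuses, which keeps the load as a separate instruction that writes the register.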
Chris Fallin	60d7f7de0a	Debug info: two fixes in x64 backend. - Sort by generated-code offset to maintain invariant and avoid gimli panic. - Fix srcloc interaction with branch peephole optimization in MachBuffer: if a srcloc range overlaps with a branch that is truncated, remove that srcloc range. These issues were found while fuzzing the new backend (#2453); I suspect that they arise with the new backend because we can sink instructions (e.g. loads or extends) in more interesting ways than before, but I'm not entirely sure. Test coverage will be via the fuzz corpus once #2453 lands.	2020-12-02 10:41:14 -08:00
Chris Fallin	712ff22492	AArch64 SIMD: pattern-match load+splat into `LD1R` instruction.	2020-11-16 15:59:28 -08:00
Chris Fallin	3c8cb7b908	MachInst lowering logic: allow effectful instructions to merge. This PR updates the "coloring" scheme that accounts for side-effects in the MachInst lowering logic. As a result, the new backends will now be able to merge effectful operations (such as memory loads) into other operations; previously, only the other way (pure ops merged into effectful ops) was possible. This will allow, for example, a load+ALU-op combination, as is common on x86. It should even allow a load + ALU-op + store sequence to merge into one lowered instruction. The scheme arose from many fruitful discussions with @julian-seward1 (thanks!); significant credit is due to him for the insights here. The first insight is that given the right basic conditions, i.e. that the root instruction is the only use of an effectful instruction's result, all we need is that the "color" of the effectful instruction is one less than the color of the current instruction. It's easier to think about colors on the program points between instructions: if the color coming out of the first (effectful def) instruction and in to the second (effectful or effect-free use) instruction are the same, then they can merge. Basically the color denotes a version of global state; if the same, then no other effectful ops happened in the meantime. The second insight is that we can keep state as we scan, tracking the "current color", and update this when we sink (merge) an op. Hence when we sink a load into another op, we effectively re-color every instruction it moved over; this may allow further sinks. Consider the example (and assume that we consider loads effectful in order to conservatively ensure a strong memory model; otherwise, replace with other effectful value-producing insts): ``` v0 = load x v1 = load y v2 = add v0, 1 v3 = add v1, 1 ``` Scanning from bottom to top, we first see the add producing `v3` and we can sink the load producing `v1` into it, producing a load + ALU-op machine instruction. 
This is legal because `v1` moves over only `v2`, which is a pure instruction. Consider, though, `v2`: under a simple scheme that has no other context, `v0` could not sink to `v2` because it would move over `v1`, another load. But because we already sank `v1` down to `v3`, we are free to sink `v0` to `v2`; the update of the "current color" during the scan allows this. This PR also cleans up the `LowerCtx` interface a bit at the same time: whereas previously it always gave some subset of (constant, mergeable inst, register) directly from `LowerCtx::get_input()`, it now returns zero or more of (constant, mergeable inst) from `LowerCtx::maybe_get_input_as_source_or_const()`, and returns the register only from `LowerCtx::put_input_in_reg()`. This removes the need to explicitly denote uses of the register, so it's a little safer. Note that this PR does not actually make use of the new ability to merge loads into other ops; that will come in future PRs, especially to optimize the `x64` backend by using direct-memory operands.	2020-11-16 14:53:45 -08:00
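The coloring scheme described in the commit above can be sketched as a forward coloring pass followed by a backward scan. This is a minimal model under stated assumptions: `Inst`, `plan_sinks`, `example_block`, and the `track_current_color` switch are hypothetical names for illustration, each instruction has at most one value input, and nothing here is Cranelift's actual data structure or API.

```rust
/// Hypothetical instruction model: one optional input, one result.
#[derive(Clone, Copy)]
struct Inst {
    /// True for ops treated as effectful (loads, in the commit's example).
    effectful: bool,
    /// Index of the instruction that defines this instruction's input, if any.
    input: Option<usize>,
}

/// For each instruction, returns the index of the instruction it is sunk
/// (merged) into, or `None` if it is lowered on its own.
fn plan_sinks(insts: &[Inst], track_current_color: bool) -> Vec<Option<usize>> {
    // Forward pass: colors live on program points; every effectful
    // instruction bumps the color between its entry and its exit.
    let mut entry = Vec::with_capacity(insts.len());
    let mut exit = Vec::with_capacity(insts.len());
    let mut color = 0u32;
    let mut uses = vec![0u32; insts.len()];
    for inst in insts {
        entry.push(color);
        if inst.effectful {
            color += 1;
        }
        exit.push(color);
        if let Some(def) = inst.input {
            uses[def] += 1;
        }
    }

    // Backward scan, carrying the "current color" along.
    let mut sunk_into: Vec<Option<usize>> = vec![None; insts.len()];
    let mut cur = color;
    for i in (0..insts.len()).rev() {
        if sunk_into[i].is_some() {
            continue; // already merged into a later instruction
        }
        if insts[i].effectful {
            cur = entry[i];
        }
        // The simple scheme compares against the static entry color instead.
        let at = if track_current_color { cur } else { entry[i] };
        if let Some(def) = insts[i].input {
            if insts[def].effectful && uses[def] == 1 && exit[def] == at {
                sunk_into[def] = Some(i);
                // Re-color: the scan point now sits before the sunk instruction.
                cur = entry[def];
            }
        }
    }
    sunk_into
}

/// The example from the commit message:
/// v0 = load x; v1 = load y; v2 = add v0, 1; v3 = add v1, 1
fn example_block() -> Vec<Inst> {
    vec![
        Inst { effectful: true, input: None },
        Inst { effectful: true, input: None },
        Inst { effectful: false, input: Some(0) },
        Inst { effectful: false, input: Some(1) },
    ]
}
```

With `track_current_color` set, both loads in the example are merged into their adds; without it, only the load producing `v1` is, because the load producing `v0` would appear to move over another load.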

342 Commits