wasmtime

Author	SHA1	Message	Date
bjorn3	d50f27e8f9	Remove reg_universe method from MachBackend and MachInst	2022-01-06 14:39:50 +01:00
Alex Crichton	1532516a36	Use relative `call` instructions between wasm functions (#3275 ) * Use relative `call` instructions between wasm functions This commit is a relatively major change to the way that Wasmtime generates code for Wasm modules and how functions call each other. Prior to this commit all function calls between functions, even if they were defined in the same module, were done indirectly through a register. To implement this the backend would emit an absolute 8-byte relocation near all function calls, load that address into a register, and then call it. While this technique is simple to implement and easy to get right, it has two primary downsides associated with it: * Function calls are always indirect which means they are more difficult to predict, resulting in worse performance. * Generating a relocation-per-function call requires expensive relocation resolution at module-load time, which can be a large contributing factor to how long it takes to load a precompiled module. To fix these issues, while also somewhat compromising on the previously simple implementation technique, this commit switches wasm calls within a module to using the `colocated` flag enabled in Cranelift-speak, which basically means that a relative call instruction is used with a relocation that's resolved relative to the pc of the call instruction itself. When switching the `colocated` flag to `true` this commit is also then able to move much of the relocation resolution from `wasmtime_jit::link` into `wasmtime_cranelift::obj` during object-construction time. This frontloads all relocation work which means that there's actually no relocations related to function calls in the final image, solving both of our points above. The main gotcha in implementing this technique is that there are hardware limitations to relative function calls which mean we can't simply blindly use them. AArch64, for example, can only go +/- 64 MB from the `bl` instruction to the target, which means that if the function we're calling is a greater distance away then we would fail to resolve that relocation. On x86_64 the limits are +/- 2GB which are much larger, but theoretically still feasible to hit. Consequently the main increase in implementation complexity is fixing this issue. This issue is actually already present in Cranelift itself, and is internally one of the invariants handled by the `MachBuffer` type. When generating a function relative jumps between basic blocks have similar restrictions. This commit adds new methods for the `MachBackend` trait and updates the implementation of `MachBuffer` to account for all these new branches. Specifically the changes to `MachBuffer` are: * For AAarch64 the `LabelUse::Branch26` value now supports veneers, and AArch64 calls use this to resolve relocations. * The `emit_island` function has been rewritten internally to handle some cases which previously didn't come up before, such as: * When emitting an island the deadline is now recalculated, where previously it was always set to infinitely in the future. This was ok prior since only a `Branch19` supported veneers and once it was promoted no veneers were supported, so without multiple layers of promotion the lack of a new deadline was ok. * When emitting an island all pending fixups had veneers forced if their branch target wasn't known yet. This was generally ok for 19-bit fixups since the only kind getting a veneer was a 19-bit fixup, but with mixed kinds it's a bit odd to force veneers for a 26-bit fixup just because a nearby 19-bit fixup needed a veneer. Instead fixups are now re-enqueued unless they're known to be out-of-bounds. This may run the risk of generating more islands for 19-bit branches but it should also reduce the number of islands for between-function calls. * Otherwise the internal logic was tweaked to ideally be a bit more simple, but that's a pretty subjective criteria in compilers... I've added some simple testing of this for now. A synthetic compiler option was create to simply add padded 0s between functions and test cases implement various forms of calls that at least need veneers. A test is also included for x86_64, but it is unfortunately pretty slow because it requires generating 2GB of output. I'm hoping for now it's not too bad, but we can disable the test if it's prohibitive and otherwise just comment the necessary portions to be sure to run the ignored test if these parts of the code have changed. The final end-result of this commit is that for a large module I'm working with the number of relocations dropped to zero, meaning that nothing actually needs to be done to the text section when it's loaded into memory (yay!). I haven't run final benchmarks yet but this is the last remaining source of significant slowdown when loading modules, after I land a number of other PRs both active and ones that I only have locally for now. * Fix arm32 * Review comments	2021-09-01 13:27:38 -05:00
Chris Fallin	2d5db92a9e	Rework/simplify unwind infrastructure and implement Windows unwind. Our previous implementation of unwind infrastructure was somewhat complex and brittle: it parsed generated instructions in order to reverse-engineer unwind info from prologues. It also relied on some fragile linkage to communicate instruction-layout information that VCode was not designed to provide. A much simpler, more reliable, and easier-to-reason-about approach is to embed unwind directives as pseudo-instructions in the prologue as we generate it. That way, we can say what we mean and just emit it directly. The usual reasoning that leads to the reverse-engineering approach is that metadata is hard to keep in sync across optimization passes; but here, (i) prologues are generated at the very end of the pipeline, and (ii) if we ever do a post-prologue-gen optimization, we can treat unwind directives as black boxes with unknown side-effects, just as we do for some other pseudo-instructions today. It turns out that it was easier to just build this for both x64 and aarch64 (since they share a factored-out ABI implementation), and wire up the platform-specific unwind-info generation for Windows and SystemV. Now we have simpler unwind on all platforms and we can delete the old unwind infra as soon as we remove the old backend. There were a few consequences to supporting Fastcall unwind in particular that led to a refactor of the common ABI. Windows only supports naming clobbered-register save locations within 240 bytes of the frame-pointer register, whatever one chooses that to be (RSP or RBP). We had previously saved clobbers below the fixed frame (and below nominal-SP). The 240-byte range has to include the old RBP too, so we're forced to place clobbers at the top of the frame, just below saved RBP/RIP. This is fine; we always keep a frame pointer anyway because we use it to refer to stack args. It does mean that offsets of fixed-frame slots (spillslots, stackslots) from RBP are no longer known before we do regalloc, so if we ever want to index these off of RBP rather than nominal-SP because we add support for `alloca` (dynamic frame growth), then we'll need a "nominal-BP" mode that is resolved after regalloc and clobber-save code is generated. I added a comment to this effect in `abi_impl.rs`. The above refactor touched both x64 and aarch64 because of shared code. This had a further effect in that the old aarch64 prologue generation subtracted from `sp` once to allocate space, then used stores to `[sp, offset]` to save clobbers. Unfortunately the offset only has 7-bit range, so if there are enough clobbered registers (and there can be -- aarch64 has 384 bytes of registers; at least one unit test hits this) the stores/loads will be out-of-range. I really don't want to synthesize large-offset sequences here; better to go back to the simpler pre-index/post-index `stp r1, r2, [sp, #-16]` form that works just like a "push". It's likely not much worse microarchitecturally (dependence chain on SP, but oh well) and it actually saves an instruction if there's no other frame to allocate. As a further advantage, it's much simpler to understand; simpler is usually better. This PR adds the new backend on Windows to CI as well.	2021-03-11 20:03:52 -08:00
Kasey Carrothers	99be82c866	Replace MachInst::gen_zero_len_nop with gen_nop(0)	2021-01-29 01:15:08 -08:00
Chris Fallin	6eea015d6c	Multi-register value support: framework for Values wider than machine regs. This will allow for support for `I128` values everywhere, and `I64` values on 32-bit targets (e.g., ARM32 and x86-32). It does not alter the machine backends to build such support; it just adds the framework for the MachInst backends to reason about a `Value` residing in more than one register.	2021-01-05 17:45:02 -08:00
Chris Fallin	4dce51096d	MachInst backends: handle SourceLocs out-of-band, not in Insts. In existing MachInst backends, many instructions -- any that can trap or result in a relocation -- carry `SourceLoc` values in order to propagate the location-in-original-source to use to describe resulting traps or relocation errors. This is quite tedious, and also error-prone: it is likely that the necessary plumbing will be missed in some cases, and in any case, it's unnecessarily verbose. This PR factors out the `SourceLoc` handling so that it is tracked during emission as part of the `EmitState`, and plumbed through automatically by the machine-independent framework. Instruction emission code that directly emits trap or relocation records can query the current location as necessary. Then we only need to ensure that memory references and trap instructions, at their (one) emission point rather than their (many) lowering/generation points, are wired up correctly. This does have the side-effect that some loads and stores that do not correspond directly to user code's heap accesses will have unnecessary but harmless trap metadata. For example, the load that fetches a code offset from a jump table will have a 'heap out of bounds' trap record attached to it; but because it is bounds-checked, and will never actually trap if the lowering is correct, this should be harmless. The simplicity improvement here seemed more worthwhile to me than plumbing through a "corresponds to user-level load/store" bit, because the latter is a bit complex when we allow for op merging. Closes #2290: though it does not implement a full "metadata" scheme as described in that issue, this seems simpler overall.	2020-11-10 15:46:53 -08:00
Yury Delendik	de4af90af6	machinst x64: New backend unwind (#2266 ) Addresses unwind for experimental x64 backend. The preliminary code enables backtrace on SystemV call convension.	2020-10-23 15:19:41 -05:00
Chris Fallin	71768bb6cf	Fix AArch64 ABI to respect half-caller-save, half-callee-save vec regs. This PR updates the AArch64 ABI implementation so that it (i) properly respects that v8-v15 inclusive have callee-save lower halves, and caller-save upper halves, by conservatively approximating (to full registers) in the appropriate directions when generating prologue caller-saves and when informing the regalloc of clobbered regs across callsites. In order to prevent saving all of these vector registers in the prologue of every non-leaf function due to the above approximation, this also makes use of a new regalloc.rs feature to exclude call instructions' writes from the clobber set returned by register allocation. This is safe whenever the caller and callee have the same ABI (because anything the callee could clobber, the caller is allowed to clobber as well without saving it in the prologue). Fixes #2254.	2020-10-06 14:44:02 -07:00
Jakub Krauz	f6a140a662	arm32 codegen This commit adds arm32 code generation for some IR insts. Floating-point instructions are not supported, because regalloc does not allow to represent overlapping register classes, which are needed by VFP/Neon. There is also no support for big-endianness, I64 and I128 types.	2020-09-22 12:49:42 +02:00

9 Commits