The patch extends the unwinder to support targets that do not need
to use a dedicated frame pointer register. Specifically, the
changes include:
- Change the "fp" routine in the RegisterMapper to return an
*optional* frame pointer regsiter via Option<Register>.
- On targets that choose to not define a FP register via the above
routine, the UnwindInst::DefineNewFrame operation no longer switches
the CFA to be defined in terms of the FP. (The operation still can
be used to define the location of the clobber area.)
- In addition, on targets that choose not to define a FP register, the
UnwindInst::PushFrameRegs operation is not supported.
- There is a new operation UnwindInst::StackAlloc that needs to be
called on targets without FP whenever the stack pointer is updated.
This caused the CFA offset to be adjusted accordingly. (On
targets with FP this operation is a no-op.)
* Switch macOS to using mach ports for trap handling
This commit moves macOS to using mach ports instead of signals for
handling traps. The motivation for this is listed in #2456, namely that
once mach ports are used in a process that means traditional UNIX signal
handlers won't get used. This means that if Wasmtime is integrated with
Breakpad, for example, then Wasmtime's trap handler never fires and
traps don't work.
The `traphandlers` module is refactored as part of this commit to split
the platform-specific bits into their own files (it was growing quite a
lot for one inline `cfg_if!`). The `unix.rs` and `windows.rs` files
remain the same as they were before with a few minor tweaks for some
refactored interfaces. The `macos.rs` file is brand new and lifts almost
its entire implementation from SpiderMonkey, adapted for Wasmtime
though.
The main gotcha with mach ports is that a separate thread is what
services the exception. Some unsafe magic allows this separate thread to
read non-`Send` and temporary state from other threads, but is hoped to
be safe in this context. The unfortunate downside is that calling wasm
on macOS now involves taking a global lock and modifying a global hash
map twice-per-call. I'm not entirely sure how to get out of this cost
for now, but hopefully for any embeddings on macOS it's not the end of
the world.
Closes#2456
* Add a sketch of arm64 apple support
* store: maintain CallThreadState mapping when switching fibers
* cranelift/aarch64: generate unwind directives to disable pointer auth
Aarch64 post ARMv8.3 has a feature called pointer authentication,
designed to fight ROP/JOP attacks: some pointers may be signed using new
instructions, adding payloads to the high (previously unused) bits of
the pointers. More on this here: https://lwn.net/Articles/718888/
Unwinders on aarch64 need to know if some pointers contained on the call
frame contain an authentication code or not, to be able to properly
authenticate them or use them directly. Since native code may have
enabled it by default (as is the case on the Mac M1), and the default is
that this configuration value is inherited, we need to explicitly
disable it, for the only kind of supported pointers (return addresses).
To do so, we set the value of a non-existing dwarf pseudo register (34)
to 0, as documented in
https://github.com/ARM-software/abi-aa/blob/master/aadwarf64/aadwarf64.rst#note-8.
This is done at the function granularity, in the spirit of Cranelift
compilation model. Alternatively, a single directive could be generated
in the CIE, generating less information per module.
* Make exception handling work on Mac aarch64 too
* fibers: use a breakpoint instruction after the final call in wasmtime_fiber_start
Co-authored-by: Alex Crichton <alex@alexcrichton.com>
Our previous implementation of unwind infrastructure was somewhat
complex and brittle: it parsed generated instructions in order to
reverse-engineer unwind info from prologues. It also relied on some
fragile linkage to communicate instruction-layout information that VCode
was not designed to provide.
A much simpler, more reliable, and easier-to-reason-about approach is to
embed unwind directives as pseudo-instructions in the prologue as we
generate it. That way, we can say what we mean and just emit it
directly.
The usual reasoning that leads to the reverse-engineering approach is
that metadata is hard to keep in sync across optimization passes; but
here, (i) prologues are generated at the very end of the pipeline, and
(ii) if we ever do a post-prologue-gen optimization, we can treat unwind
directives as black boxes with unknown side-effects, just as we do for
some other pseudo-instructions today.
It turns out that it was easier to just build this for both x64 and
aarch64 (since they share a factored-out ABI implementation), and wire
up the platform-specific unwind-info generation for Windows and SystemV.
Now we have simpler unwind on all platforms and we can delete the old
unwind infra as soon as we remove the old backend.
There were a few consequences to supporting Fastcall unwind in
particular that led to a refactor of the common ABI. Windows only
supports naming clobbered-register save locations within 240 bytes of
the frame-pointer register, whatever one chooses that to be (RSP or
RBP). We had previously saved clobbers below the fixed frame (and below
nominal-SP). The 240-byte range has to include the old RBP too, so we're
forced to place clobbers at the top of the frame, just below saved
RBP/RIP. This is fine; we always keep a frame pointer anyway because we
use it to refer to stack args. It does mean that offsets of fixed-frame
slots (spillslots, stackslots) from RBP are no longer known before we do
regalloc, so if we ever want to index these off of RBP rather than
nominal-SP because we add support for `alloca` (dynamic frame growth),
then we'll need a "nominal-BP" mode that is resolved after regalloc and
clobber-save code is generated. I added a comment to this effect in
`abi_impl.rs`.
The above refactor touched both x64 and aarch64 because of shared code.
This had a further effect in that the old aarch64 prologue generation
subtracted from `sp` once to allocate space, then used stores to `[sp,
offset]` to save clobbers. Unfortunately the offset only has 7-bit
range, so if there are enough clobbered registers (and there can be --
aarch64 has 384 bytes of registers; at least one unit test hits this)
the stores/loads will be out-of-range. I really don't want to synthesize
large-offset sequences here; better to go back to the simpler
pre-index/post-index `stp r1, r2, [sp, #-16]` form that works just like
a "push". It's likely not much worse microarchitecturally (dependence
chain on SP, but oh well) and it actually saves an instruction if
there's no other frame to allocate. As a further advantage, it's much
simpler to understand; simpler is usually better.
This PR adds the new backend on Windows to CI as well.
* Make cranelift_codegen::isa::unwind::input public
* Move UnwindCode's common offset field out of the structure
* Make MachCompileResult::unwind_info more generic
* Record initial stack pointer offset
This commit moves the opaque definition of Windows x64 UnwindInfo out of the
ISA and into a location that can be easily used by the top level `UnwindInfo`
enum.
This allows the `unwind` feature to be independent of the individual ISAs
supported.
This commit makes the following changes to unwind information generation in
Cranelift:
* Remove frame layout change implementation in favor of processing the prologue
and epilogue instructions when unwind information is requested. This also
means this work is no longer performed for Windows, which didn't utilize it.
It also helps simplify the prologue and epilogue generation code.
* Remove the unwind sink implementation that required each unwind information
to be represented in final form. For FDEs, this meant writing a
complete frame table per function, which wastes 20 bytes or so for each
function with duplicate CIEs. This also enables Cranelift users to collect the
unwind information and write it as a single frame table.
* For System V calling convention, the unwind information is no longer stored
in code memory (it's only a requirement for Windows ABI to do so). This allows
for more compact code memory for modules with a lot of functions.
* Deletes some duplicate code relating to frame table generation. Users can
now simply use gimli to create a frame table from each function's unwind
information.
Fixes#1181.