* Turn off probestack by default in Cranelift
The probestack feature is not implemented for the aarch64 and s390x
backends and currently the on-by-default status requires the aarch64 and
s390x implementations to be a stub. Turning off probestack by default
allows the s390x and aarch64 backends to panic with an error message to
avoid providing a false sense of security. When the probestack option is
implemented for all backends, however, it may be reasonable to
re-enable.
* aarch64: Improve codegen for AMode fallback
Currently the final fallback for finalizing an `AMode` will generate
both a constant-loading instruction as well as an `add` instruction to
the base register into the same temporary. This commit improves the
codegen by removing the `add` instruction and folding the final add into
the finalized `AMode`. This changes the `extendop` used but both
registers are 64-bit so shouldn't be affected by the extending
operation.
* aarch64: Implement inline stack probes
This commit implements inline stack probes for the aarch64 backend in
Cranelift. The support here is modeled after the x64 support where
unrolled probes are used up to a particular threshold after which a loop
is generated. The instructions here are similar in spirit to x64 except
that unlike x64 the stack pointer isn't modified during the unrolled
loop to avoid needing to re-adjust it back up at the end of the loop.
* Enable inline probestack for AArch64 and Riscv64
This commit enables inline probestacks for the AArch64 and Riscv64
architectures in the same manner that x86_64 has it enabled now. Some
more testing was additionally added since on Unix platforms we should be
guaranteed that Rust's stack overflow message is now printed too.
* Enable probestack for aarch64 in cranelift-fuzzgen
* Address review comments
* Remove implicit stack overflow traps from x64 backend
This commit removes implicit `StackOverflow` traps inserted by the x64
backend for stack-based operations. This was historically required when
stack overflow was detected with page faults but Wasmtime no longer
requires that since it's not suitable for wasm modules which call host
functions. Additionally no other backend implements this form of
implicit trap-code additions so this is intended to synchronize the
behavior of all the backends.
This fixes a test added prior for aarch64 to properly abort the process
instead of accidentally being caught by Wasmtime.
* Fix a style issue
* egraph-based midend: draw the rest of the owl.
* Rename `egg` submodule of cranelift-codegen to `egraph`.
* Apply some feedback from @jsharp during code walkthrough.
* Remove recursion from find_best_node by doing a single pass.
Rather than recursively computing the lowest-cost node for a given
eclass and memoizing the answer at each eclass node, we can do a single
forward pass; because every eclass node refers only to earlier nodes,
this is sufficient. The behavior may slightly differ from the earlier
behavior because we cannot short-circuit costs to zero once a node is
elaborated; but in practice this should not matter.
* Make elaboration non-recursive.
Use an explicit stack instead (with `ElabStackEntry` entries,
alongside a result stack).
* Make elaboration traversal of the domtree non-recursive/stack-safe.
* Work analysis logic in Cranelift-side egraph glue into a general analysis framework in cranelift-egraph.
* Apply static recursion limit to rule application.
* Fix aarch64 wrt dynamic-vector support -- broken rebase.
* Topo-sort cranelift-egraph before cranelift-codegen in publish script, like the comment instructs me to!
* Fix multi-result call testcase.
* Include `cranelift-egraph` in `PUBLISHED_CRATES`.
* Fix atomic_rmw: not really a load.
* Remove now-unnecessary PartialOrd/Ord derivations.
* Address some code-review comments.
* Review feedback.
* Review feedback.
* No overlap in mid-end rules, because we are defining a multi-constructor.
* rustfmt
* Review feedback.
* Review feedback.
* Review feedback.
* Review feedback.
* Remove redundant `mut`.
* Add comment noting what rules can do.
* Review feedback.
* Clarify comment wording.
* Update `has_memory_fence_semantics`.
* Apply @jameysharp's improved loop-level computation.
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Fix suggestion commit.
* Fix off-by-one in new loop-nest analysis.
* Review feedback.
* Review feedback.
* Review feedback.
* Use `Default`, not `std::default::Default`, as per @fitzgen
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
* Apply @fitzgen's comment elaboration to a doc-comment.
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
* Add stat for hitting the rewrite-depth limit.
* Some code motion in split prelude to make the diff a little clearer wrt `main`.
* Take @jameysharp's suggested `try_into()` usage for blockparam indices.
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Take @jameysharp's suggestion to avoid double-match on load op.
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Fix suggestion (add import).
* Review feedback.
* Fix stack_load handling.
* Remove redundant can_store case.
* Take @jameysharp's suggested improvement to FuncEGraph::build() logic
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Tweaks to FuncEGraph::build() on top of suggestion.
* Take @jameysharp's suggested clarified condition
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Clean up after suggestion (unused variable).
* Fix loop analysis.
* loop level asserts
* Revert constant-space loop analysis -- edge cases were incorrect, so let's go with the simple thing for now.
* Take @jameysharp's suggestion re: result_tys
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Fix up after suggestion
* Take @jameysharp's suggestion to use fold rather than reduce
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Fixup after suggestion
* Take @jameysharp's suggestion to remove elaborate_eclass_use's return value.
* Clarifying comment in terminator insts.
Co-authored-by: Jamey Sharp <jamey@minilop.net>
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
This is the implementation of https://github.com/bytecodealliance/wasmtime/issues/4155, using the "inverted API" approach suggested by @cfallin (thanks!) in Cranelift, and trait object to provide a backend for an all-included experience in Wasmtime.
After the suggestion of Chris, `Function` has been split into mostly two parts:
- on the one hand, `FunctionStencil` contains all the fields required during compilation, and that act as a compilation cache key: if two function stencils are the same, then the result of their compilation (`CompiledCodeBase<Stencil>`) will be the same. This makes caching trivial, as the only thing to cache is the `FunctionStencil`.
- on the other hand, `FunctionParameters` contain the... function parameters that are required to finalize the result of compilation into a `CompiledCode` (aka `CompiledCodeBase<Final>`) with proper final relocations etc., by applying fixups and so on.
Most changes are here to accomodate those requirements, in particular that `FunctionStencil` should be `Hash`able to be used as a key in the cache:
- most source locations are now relative to a base source location in the function, and as such they're encoded as `RelSourceLoc` in the `FunctionStencil`. This required changes so that there's no need to explicitly mark a `SourceLoc` as the base source location, it's automatically detected instead the first time a non-default `SourceLoc` is set.
- user-defined external names in the `FunctionStencil` (aka before this patch `ExternalName::User { namespace, index }`) are now references into an external table of `UserExternalNameRef -> UserExternalName`, present in the `FunctionParameters`, and must be explicitly declared using `Function::declare_imported_user_function`.
- some refactorings have been made for function names:
- `ExternalName` was used as the type for a `Function`'s name; while it thus allowed `ExternalName::Libcall` in this place, this would have been quite confusing to use it there. Instead, a new enum `UserFuncName` is introduced for this name, that's either a user-defined function name (the above `UserExternalName`) or a test case name.
- The future of `ExternalName` is likely to become a full reference into the `FunctionParameters`'s mapping, instead of being "either a handle for user-defined external names, or the thing itself for other variants". I'm running out of time to do this, and this is not trivial as it implies touching ISLE which I'm less familiar with.
The cache computes a sha256 hash of the `FunctionStencil`, and uses this as the cache key. No equality check (using `PartialEq`) is performed in addition to the hash being the same, as we hope that this is sufficient data to avoid collisions.
A basic fuzz target has been introduced that tries to do the bare minimum:
- check that a function successfully compiled and cached will be also successfully reloaded from the cache, and returns the exact same function.
- check that a trivial modification in the external mapping of `UserExternalNameRef -> UserExternalName` hits the cache, and that other modifications don't hit the cache.
- This last check is less efficient and less likely to happen, so probably should be rethought a bit.
Thanks to both @alexcrichton and @cfallin for your very useful feedback on Zulip.
Some numbers show that for a large wasm module we're using internally, this is a 20% compile-time speedup, because so many `FunctionStencil`s are the same, even within a single module. For a group of modules that have a lot of code in common, we get hit rates up to 70% when they're used together. When a single function changes in a wasm module, every other function is reloaded; that's still slower than I expect (between 10% and 50% of the overall compile time), so there's likely room for improvement.
Fixes#4155.
* Cranellift: remove Baldrdash support and related features.
As noted in Mozilla's bugzilla bug 1781425 [1], the SpiderMonkey team
has recently determined that their current form of integration with
Cranelift is too hard to maintain, and they have chosen to remove it
from their codebase. If and when they decide to build updated support
for Cranelift, they will adopt different approaches to several details
of the integration.
In the meantime, after discussion with the SpiderMonkey folks, they
agree that it makes sense to remove the bits of Cranelift that exist
to support the integration ("Baldrdash"), as they will not need
them. Many of these bits are difficult-to-maintain special cases that
are not actually tested in Cranelift proper: for example, the
Baldrdash integration required Cranelift to emit function bodies
without prologues/epilogues, and instead communicate very precise
information about the expected frame size and layout, then stitched
together something post-facto. This was brittle and caused a lot of
incidental complexity ("fallthrough returns", the resulting special
logic in block-ordering); this is just one example. As another
example, one particular Baldrdash ABI variant processed stack args in
reverse order, so our ABI code had to support both traversal
orders. We had a number of other Baldrdash-specific settings as well
that did various special things.
This PR removes Baldrdash ABI support, the `fallthrough_return`
instruction, and pulls some threads to remove now-unused bits as a
result of those two, with the understanding that the SpiderMonkey folks
will build new functionality as needed in the future and we can perhaps
find cleaner abstractions to make it all work.
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1781425
* Review feedback.
* Fix (?) DWARF debug tests: add `--disable-cache` to wasmtime invocations.
The debugger tests invoke `wasmtime` from within each test case under
the control of a debugger (gdb or lldb). Some of these tests started to
inexplicably fail in CI with unrelated changes, and the failures were
only inconsistently reproducible locally. It seems to be cache related:
if we disable cached compilation on the nested `wasmtime` invocations,
the tests consistently pass.
* Review feedback.
Preserving frame pointers -- even inside leaf functions -- makes it easy to
capture the stack of a running program, without requiring any side tables or
metadata (like `.eh_frame` sections). Many sampling profilers and similar tools
walk frame pointers to capture stacks. Enabling this option will play nice with
those tools.
Currently, we have partial Spectre mitigation: we protect heap accesses
with dynamic bounds checks. Specifically, we guard against errant
accesses on the misspeculated path beyond the bounds-check conditional
branch by adding a conditional move that is also dependent on the
bounds-check condition. This data dependency on the condition is not
speculated and thus will always pick the "safe" value (in the heap case,
a NULL address) on the misspeculated path, until the pipeline flushes
and recovers onto the correct path.
This PR uses the same technique both for table accesses -- used to
implement Wasm tables -- and for jumptables, used to implement Wasm
`br_table` instructions.
In the case of Wasm tables, the cmove picks the table base address on
the misspeculated path. This is equivalent to reading the first table
entry. This prevents loads of arbitrary data addresses on the
misspeculated path.
In the case of `br_table`, the cmove picks index 0 on the misspeculated
path. This is safer than allowing a branch to an address loaded from an
index under misspeculation (i.e., it preserves control-flow integrity
even under misspeculation).
The table mitigation is controlled by a Cranelift setting, on by
default. The br_table mitigation is always on, because it is part of the
single lowering pseudoinstruction. In both cases, the impact should be
minimal: a single extra cmove in a (relatively) rarely-used operation.
The table mitigation is architecture-independent (happens during
legalization); the br_table mitigation has been implemented for both x64
and aarch64. (I don't know enough about s390x to implement this
confidently there, but would happily review a PR to do the same on that
platform.)
With these fixes, all this PR has to do is instantiate and run the
checker on the `regalloc2::Output`. This is off by default, and is
enabled by setting the `regalloc_checker` Cranelift option.
This restores the old functionality provided by e.g. the
`backtracking_checked` regalloc algorithm setting rather than
`backtracking` when we were still on regalloc.rs.
This PR switches Cranelift over to the new register allocator, regalloc2.
See [this document](https://gist.github.com/cfallin/08553421a91f150254fe878f67301801)
for a summary of the design changes. This switchover has implications for
core VCode/MachInst types and the lowering pass.
Overall, this change brings improvements to both compile time and speed of
generated code (runtime), as reported in #3942:
```
Benchmark Compilation (wallclock) Execution (wallclock)
blake3-scalar 25% faster 28% faster
blake3-simd no diff no diff
meshoptimizer 19% faster 17% faster
pulldown-cmark 17% faster no diff
bz2 15% faster no diff
SpiderMonkey, 21% faster 2% faster
fib(30)
clang.wasm 42% faster N/A
```
This commit changes how both the shared flags and ISA flags are stored in the
serialized module to detect incompatibilities when a serialized module is
instantiated.
It improves the error reporting when a compiled module has mismatched shared
flags.
This commit adds the `wasmtime settings` command to print out available
Cranelift settings for a target (defaults to the host).
The compile command has been updated to remove the Cranelift ISA options in
favor of encouraging users to use `wasmtime settings` to discover what settings
are available. This will reduce the maintenance cost for syncing the compile
command with Cranelift ISA flags.
Our previous implementation of unwind infrastructure was somewhat
complex and brittle: it parsed generated instructions in order to
reverse-engineer unwind info from prologues. It also relied on some
fragile linkage to communicate instruction-layout information that VCode
was not designed to provide.
A much simpler, more reliable, and easier-to-reason-about approach is to
embed unwind directives as pseudo-instructions in the prologue as we
generate it. That way, we can say what we mean and just emit it
directly.
The usual reasoning that leads to the reverse-engineering approach is
that metadata is hard to keep in sync across optimization passes; but
here, (i) prologues are generated at the very end of the pipeline, and
(ii) if we ever do a post-prologue-gen optimization, we can treat unwind
directives as black boxes with unknown side-effects, just as we do for
some other pseudo-instructions today.
It turns out that it was easier to just build this for both x64 and
aarch64 (since they share a factored-out ABI implementation), and wire
up the platform-specific unwind-info generation for Windows and SystemV.
Now we have simpler unwind on all platforms and we can delete the old
unwind infra as soon as we remove the old backend.
There were a few consequences to supporting Fastcall unwind in
particular that led to a refactor of the common ABI. Windows only
supports naming clobbered-register save locations within 240 bytes of
the frame-pointer register, whatever one chooses that to be (RSP or
RBP). We had previously saved clobbers below the fixed frame (and below
nominal-SP). The 240-byte range has to include the old RBP too, so we're
forced to place clobbers at the top of the frame, just below saved
RBP/RIP. This is fine; we always keep a frame pointer anyway because we
use it to refer to stack args. It does mean that offsets of fixed-frame
slots (spillslots, stackslots) from RBP are no longer known before we do
regalloc, so if we ever want to index these off of RBP rather than
nominal-SP because we add support for `alloca` (dynamic frame growth),
then we'll need a "nominal-BP" mode that is resolved after regalloc and
clobber-save code is generated. I added a comment to this effect in
`abi_impl.rs`.
The above refactor touched both x64 and aarch64 because of shared code.
This had a further effect in that the old aarch64 prologue generation
subtracted from `sp` once to allocate space, then used stores to `[sp,
offset]` to save clobbers. Unfortunately the offset only has 7-bit
range, so if there are enough clobbered registers (and there can be --
aarch64 has 384 bytes of registers; at least one unit test hits this)
the stores/loads will be out-of-range. I really don't want to synthesize
large-offset sequences here; better to go back to the simpler
pre-index/post-index `stp r1, r2, [sp, #-16]` form that works just like
a "push". It's likely not much worse microarchitecturally (dependence
chain on SP, but oh well) and it actually saves an instruction if
there's no other frame to allocate. As a further advantage, it's much
simpler to understand; simpler is usually better.
This PR adds the new backend on Windows to CI as well.
This adds support for the "fastcall" ABI, which is the native C/C++ ABI
on Windows platforms on x86-64. It is similar to but not exactly like
System V; primarily, its argument register assignments are different,
and it requires stack shadow space.
Note that this also adjusts the handling of multi-register values in the
shared ABI implementation, and with this change, adjusts handling of
`i128`s on *both* Fastcall/x64 *and* SysV/x64 platforms. This was done
to align with actual behavior by the "rustc ABI" on both platforms, as
mapped out experimentally (Compiler Explorer link in comments). This
behavior is gated under the `enable_llvm_abi_extensions` flag.
Note also that this does *not* add x64 unwind info on Windows. That will
come in a future PR (but is planned!).
With `Module::{serialize,deserialize}` it should be possible to share
wasmtime modules across machines or CPUs. Serialization, however, embeds
a hash of all configuration values, including cranelift compilation
settings. By default wasmtime's selection of the native ISA would enable
ISA flags according to CPU features available on the host, but the same
CPU features may not be available across two machines.
This commit adds a `Config::cranelift_clear_cpu_flags` method which
allows clearing the target-specific ISA flags that are automatically
inferred by default for the native CPU. Options can then be
incrementally built back up as-desired with teh `cranelift_other_flag`
method.
This PR adds a conditional move following a heap bounds check through
which the address to be accessed flows. This conditional move ensures
that even if the branch is mispredicted (access is actually out of
bounds, but speculation goes down in-bounds path), the acually accessed
address is zero (a NULL pointer) rather than the out-of-bounds address.
The mitigation is controlled by a flag that is off by default, but can
be set by the embedding. Note that in order to turn it on by default,
we would need to add conditional-move support to the current x86
backend; this does not appear to be present. Once the deprecated
backend is removed in favor of the new backend, IMHO we should turn
this flag on by default.
Note that the mitigation is unneccessary when we use the "huge heap"
technique on 64-bit systems, in which we allocate a range of virtual
address space such that no 32-bit offset can reach other data. Hence,
this only affects small-heap configurations.
This is a breaking API change: the following settings have been renamed:
- jump_tables_enabled -> enable_jump_tables
- colocated_libcalls -> use_colocated_libcalls
- probestack_enabled -> enable_probestack
- allones_funcaddrs -> emit_all_ones_funcaddrs
The failure crate invents its own traits that don't use
std::error::Error (because failure predates certain features added to
Error); this prevents using ? on an error from failure in a function
using Error. The thiserror crate integrates with the standard Error
trait instead.
This patch:
* removes the "default" opt level, on the basis that it has no definition and
is referred to nowhere in the compiler.
* renames the "fastest" level to "none". The resulting set of transformations
is unchanged.
* renames the "best" level to "speed_and_size". The resulting set of
transformations is unchanged.
* adds a new level, "speed". This is the same as "speed_and_size" except that
it omits transformations aimed only at reducing code size. Currently it
omits only the insn shrinking pass.
-Add resumable_trap, safepoint, isnull, and null instructions
-Add Stackmap struct and StackmapSink trait
Co-authored-by: Mir Ahmed <mirahmed753@gmail.com>
Co-authored-by: Dan Gohman <sunfish@mozilla.com>