* cranelift-isle: Add "partial" flag for constructors
Instead of tying fallibility of constructors to whether they're either
internal or pure, this commit assumes all constructors are infallible
unless tagged otherwise with a "partial" flag.
Internal constructors without the "partial" flag are not allowed to use
constructors which have the "partial" flag on the right-hand side of any
rules, because they have no way to report last-minute match failures.
Multi-constructors should never be "partial"; they report match failures
with an empty iterator instead. In turn this means you can't use partial
constructors on the right-hand side of internal multi-constructor rules.
However, you can use the same constructors on the left-hand side with
`if` or `if-let` instead.
In many cases, ISLE can already trivially prove that an internal
constructor always returns `Some`. With this commit, those cases are
largely unchanged, except for removing all the `Option`s and `Some`s
from the generated code for those terms.
However, for internal non-partial constructors where ISLE could not
prove that, it now emits an `unreachable!` panic as the last-resort,
instead of returning `None` like it used to do. Among the existing
backends, here's how many constructors have these panic cases:
- x64: 14% (53/374)
- aarch64: 15% (41/277)
- riscv64: 23% (26/114)
- s390x: 47% (268/567)
It's often possible to rewrite rules so that ISLE can tell the panic can
never be hit. Just ensure that there's a lowest-priority rule which has
no constraints on the left-hand side.
But in many of these constructors, it's difficult to statically prove
the unhandled cases are unreachable because that's only down to
knowledge about how they're called or other preconditions.
So this commit does not try to enforce that all terms have a last-resort
fallback rule.
* Check term flags while translating expressions
Instead of doing it in a separate pass afterward.
This involved threading all the term flags (pure, multi, partial)
through the recursive `translate_expr` calls, so I extracted the flags
to a new struct so they can all be passed together.
* Validate multi-term usage
Now that I've threaded the flags through `translate_expr`, it's easy to
check this case too, so let's just do it.
* Extract `ReturnKind` to use in `ExternalSig`
There are only three legal states for the combination of `multi` and
`infallible`, so replace those fields of `ExternalSig` with a
three-state enum.
* Remove `Option` wrapper from multi-extractors too
If we'd had any external multi-constructors this would correct their
signatures as well.
* Update ISLE tests
* Tag prelude constructors as pure where appropriate
I believe the only reason these weren't marked `pure` before was because
that would have implied that they're also partial. Now that those two
states are specified separately we apply this flag more places.
* Fix my changes to aarch64 `lower_bmask` and `imm` terms
* Optimizations to egraph framework:
- Save elaborated results by canonical value, not latest value (union
value). Previously we were artificially skipping and re-elaborating
some values we already had because we were not finding them in the
map.
- Make some changes to handling of icmp results: when icmp became
I8-typed (when bools went away), many uses became `(uextend $I32 (icmp
$I8 ...))`, and so patterns in lowering backends were no longer
matching.
This PR includes an x64-specific change to match `(brz (uextend (icmp
...)))` and similarly for `brnz`, but it also takes advantage of the
ability to write rules easily in the egraph mid-end to rewrite selects
with icmp inputs appropriately.
- Extend constprop to understand selects in the egraph mid-end.
With these changes, bz2.wasm sees a ~1% speedup, and spidermonkey.wasm
with a fib.js input sees a 16.8% speedup:
```
$ time taskset 1 target/release/wasmtime run --allow-precompiled --dir=. ./spidermonkey.base.cwasm ./fib.js
1346269
taskset 1 target/release/wasmtime run --allow-precompiled --dir=. ./fib.js 2.14s user 0.01s system 99% cpu 2.148 total
$ time taskset 1 target/release/wasmtime run --allow-precompiled --dir=. ./spidermonkey.egraphs.cwasm ./fib.js
1346269
taskset 1 target/release/wasmtime run --allow-precompiled --dir=. ./fib.js 1.78s user 0.01s system 99% cpu 1.788 total
```
* Review feedback.
Enable regalloc2's SSA verifier in debug builds to check for any outstanding reuse of virtual registers in def constraints. As fuzzing enables debug_assertions, this will enable the SSA verifier when fuzzing as well.
* cranelift-codegen: Use ISLE matching, not same_value
The `same_value` function just wrapped an equality test into an external
constructor, but we can do that with ISLE's equality constraints
instead.
* riscv64: Remove custom condition-code tests
The `lower_icmp` term exists solely to decide whether to sign-extend or
zero-extend the comparison operands, based on whether the condition code
requires a signed comparison. It additionally tested whether the
condition code was == or !=, but produced the same result as for other
unsigned comparisons.
We already have `signed_cond_code` in the ISLE prelude, which classifies
the total-ordering condition codes according to whether they're signed.
It also lumps == and != in the "unsigned" camp, as desired.
So this commit uses the existing method from the prelude instead of
riscv64-local definitions.
Because this version has no constraints on the left-hand side of the
rule in the unsigned case, ISLE generates Rust that always returns
`Some`. That shows that the current use of `unwrap` is justified, at the
only Rust-side call-site of `constructor_lower_icmp`, which is in
cranelift/codegen/src/isa/riscv64/lower/isle.rs.
* ISLE prelude: make offset32 infallible
This extractor always returns `Some`, so it doesn't need to be fallible.
This PR reverts #5128 (commit b3333bf9ea),
adding back the ability for the fuzzing config generator to set
the `use_egraphs` Cranelift option. This will start to fuzz the
egraphs-based optimization framework again, now that #5382 has landed.
* egraph support: rewrite to work in terms of CLIF data structures.
This work rewrites the "egraph"-based optimization framework in
Cranelift to operate on aegraphs (acyclic egraphs) represented in the
CLIF itself rather than as a separate data structure to which and from
which we translate the CLIF.
The basic idea is to add a new kind of value, a "union", that is like an
alias but refers to two other values rather than one. This allows us to
represent an eclass of enodes (values) as a tree. The union node allows
for a value to have *multiple representations*: either constituent value
could be used, and (in well-formed CLIF produced by correct
optimization rules) they must be equivalent.
Like the old egraph infrastructure, we take advantage of acyclicity and
eager rule application to do optimization in a single pass. Like before,
we integrate GVN (during the optimization pass) and LICM (during
elaboration).
Unlike the old egraph infrastructure, everything stays in the
DataFlowGraph. "Pure" enodes are represented as instructions that have
values attached, but that are not placed into the function layout. When
entering "egraph" form, we remove them from the layout while optimizing.
When leaving "egraph" form, during elaboration, we can place an
instruction back into the layout the first time we elaborate the enode;
if we elaborate it more than once, we clone the instruction.
The implementation performs two passes overall:
- One, a forward pass in RPO (to see defs before uses), that (i) removes
"pure" instructions from the layout and (ii) optimizes as it goes. As
before, we eagerly optimize, so we form the entire union of optimized
forms of a value before we see any uses of that value. This lets us
rewrite uses to use the most "up-to-date" form of the value and
canonicalize and optimize that form.
The eager rewriting and acyclic representation make each other work
(we could not eagerly rewrite if there were cycles; and acyclicity
does not miss optimization opportunities only because the first time
we introduce a value, we immediately produce its "best" form). This
design choice is also what allows us to avoid the "parent pointers"
and fixpoint loop of traditional egraphs.
This forward optimization pass keeps a scoped hashmap to "intern"
nodes (thus performing GVN), and also interleaves on a per-instruction
level with alias analysis. The interleaving with alias analysis allows
alias analysis to see the most optimized form of each address (so it
can see equivalences), and allows the next value to see any
equivalences (reuses of loads or stored values) that alias analysis
uncovers.
- Two, a forward pass in domtree preorder, that "elaborates" pure enodes
back into the layout, possibly in multiple places if needed. This
tracks the loop nest and hoists nodes as needed, performing LICM as it
goes. Note that by doing this in forward order, we avoid the
"fixpoint" that traditional LICM needs: we hoist a def before its
uses, so when we place a node, we place it in the right place the
first time rather than moving later.
This PR replaces the old (a)egraph implementation. It removes both the
cranelift-egraph crate and the logic in cranelift-codegen that uses it.
On `spidermonkey.wasm` running a simple recursive Fibonacci
microbenchmark, this work shows 5.5% compile-time reduction and 7.7%
runtime improvement (speedup).
Most of this implementation was done in (very productive) pair
programming sessions with Jamey Sharp, thus:
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
* Review feedback.
* Review feedback.
* Review feedback.
* Bugfix: cprop rule: `(x + k1) - k2` becomes `x - (k2 - k1)`, not `x - (k1 - k2)`.
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
This PR fixes two bugs in the riscv64 backend, where branch instructions were emitted in the middle of a basic block:
Constant emission, where the constants are inlined into the vcode and are jumped over at runtime,
The BrTableCheck pseudo-instruction, which was always emitted before a BrTable instruction, and would handle jumping to the default label.
The first bug was resolved by introducing two new psuedo instructions, LoadConst32 and LoadConst64. Both of these instructions serve to delay the original encoding to emission time, after regalloc2 has run.
The second bug was fixed by removing the BrTableCheck instruction. As it was always emitted directly before BrTable, it was easier to remove it and merge the two into a single instruction.
* Alias analysis: refactor for use by other driver loops.
This PR pulls the core of the alias analysis infrastructure into a
`process_inst()` method that operates on a single instruction, and
allows another compiler pass to apply store-to-load forwarding and
redundant-load elimination interleaved with other work. The existing
behavior remains unchanged; the pass's toplevel loop calls this
extracted method.
This refactor is a prerequisite for using the alias analysis as part of
a refactored egraph-based optimization framework.
* Review feedback: remove unneeded mut.
* Add release notes for 3.0.1
* Update some version directives for crates in Wasmtime
* Mark anything with `publish = false` as version 0.0.0
* Mark the icache coherence crate with the same version as Wasmtime
* Fix manifest directives
Rework the compilation of amodes in the aarch64 backend to stop reusing registers and instead generate fresh virtual registers for intermediates. This resolves some SSA checker violations with the aarch64 backend, and as a nice side-effect removes some unnecessary movs in the generated code.
As loading constants on aarch64 can take up to 4 instructions, we need to plumb through some additional registers. Rather than pass a fixed list of registers in, pass an allocation function.
* cranelift-wasm: Remove `ModuleTranslationState`
We were putting data into it, but never reading data out of it. Can be removed.
* cranelift-wasm: move `funct_state.rs` sub module to `state.rs`
Since it is the only submodule of `state` it can just be the whole `state`
module.
Avoid reusing a destination virtual register for 64-bit constants in the s390x backend. This change addresses a case identified by the regalloc2 ssa validator, as the destination register was written to twice when constants were generated via the MachInst::gen_constant function.
Some of our ISLE rules can never fire because there's a higher-priority
rule that will always fire instead.
Sometimes the worst that can happen is we generate sub-optimal output.
That's not so bad but we'd still like to know about it so we can fix it.
In other cases there might be instructions which can't be lowered in
isolation. If a general rule for lowering one of the instructions is
higher-priority than the rule for lowering the combined sequence, then
lowering the combined sequence will always fail.
Either way, this is always a bug, so make it a fatal error if we can
detect it.
Introduce a temporary for an intermediate value in the lowering of div in the x64 backend. Additionally, add a src argument to the shift_r smart constructor, which is why the diff got larger than just the div lowering.
Avoid reusing output registers in make_i64x2_from_lanes by threading the output name instead, and using smart constructors for x64_pinsrd instead of constructing the instructions directly.
* Cranelift: implement `heap_{load,store}` instruction legalization
This does not remove `heap_addr` yet, but it does factor out the common
bounds-check-and-compute-the-native-address functionality that is shared between
all of `heap_{addr,load,store}`.
Finally, this adds a missing optimization for when we can dedupe explicit bounds
checks for static memories and Spectre mitigations.
* Cranelift: Enable `heap_load_store_*` run tests on all targets
* Turn off probestack by default in Cranelift
The probestack feature is not implemented for the aarch64 and s390x
backends and currently the on-by-default status requires the aarch64 and
s390x implementations to be a stub. Turning off probestack by default
allows the s390x and aarch64 backends to panic with an error message to
avoid providing a false sense of security. When the probestack option is
implemented for all backends, however, it may be reasonable to
re-enable.
* aarch64: Improve codegen for AMode fallback
Currently the final fallback for finalizing an `AMode` will generate
both a constant-loading instruction as well as an `add` instruction to
the base register into the same temporary. This commit improves the
codegen by removing the `add` instruction and folding the final add into
the finalized `AMode`. This changes the `extendop` used but both
registers are 64-bit so shouldn't be affected by the extending
operation.
* aarch64: Implement inline stack probes
This commit implements inline stack probes for the aarch64 backend in
Cranelift. The support here is modeled after the x64 support where
unrolled probes are used up to a particular threshold after which a loop
is generated. The instructions here are similar in spirit to x64 except
that unlike x64 the stack pointer isn't modified during the unrolled
loop to avoid needing to re-adjust it back up at the end of the loop.
* Enable inline probestack for AArch64 and Riscv64
This commit enables inline probestacks for the AArch64 and Riscv64
architectures in the same manner that x86_64 has it enabled now. Some
more testing was additionally added since on Unix platforms we should be
guaranteed that Rust's stack overflow message is now printed too.
* Enable probestack for aarch64 in cranelift-fuzzgen
* Address review comments
* Remove implicit stack overflow traps from x64 backend
This commit removes implicit `StackOverflow` traps inserted by the x64
backend for stack-based operations. This was historically required when
stack overflow was detected with page faults but Wasmtime no longer
requires that since it's not suitable for wasm modules which call host
functions. Additionally no other backend implements this form of
implicit trap-code additions so this is intended to synchronize the
behavior of all the backends.
This fixes a test added prior for aarch64 to properly abort the process
instead of accidentally being caught by Wasmtime.
* Fix a style issue
Was missing some '$' characters and so was comparing string literals against
string literals instead of variable values against string literals. Regenerated
tests to fix them and add missing tests.
* Cranelift: consider heap's guard pages when legalizing `heap_addr`
Fixes#5328
* Update comment to align more directly with implementation
* Add legalization tests for `heap_addr` and offset guard pages
Ulrich Weigand identified two bugs in this code due to it falsely
claiming there were unreachable rules in the s390x backend. The fixes
are:
- Add constraints for pure constructors.
I didn't notice that a constructor which is declared pure (which
currently implies that it is fallible), when used on the left-hand side
of a rule, can cause the rule to fail to match. Therefore, any
constructors on the left-hand side must be noted as additional
constraints on the rule, so that overlap checking can see them.
- Ignore subset-overlaps for rules with equality constraints
This eliminates false positives when checking for unreachable rules. It
introduces false negatives instead but we prefer to fail to detect an
error instead of claiming that valid input is wrong. We can implement a
more accurate check later.
- Remove remaining references to Miette
- Borrow implementation of `line_starts` from codespan-reporting
- Clean up a use of `Result` that no longer conflicts with a local
definition
- When printing plain errors, add a blank line between errors for
readability
Fix shadowing identified in #5322 for imul and swiden_high/swiden_low/uwiden_high/uwiden_low combinations in the x64 backend, and remove some redundant rules from the aarch64 dynamic neon ruleset. Additionally, add tests to the x64 backend showing that the imul specializations are firing.
Add assertions to the OperandCollector that show we're not using pinned vregs, and use reg_fixed_nonallocatable constraints when a real register is used with other constraint generation functions like reg_use etc.
There were several issues with ISLE's existing error reporting
implementation.
- When using Miette for more readable error reports, it would panic if
errors were reported from multiple files in the same run.
- Miette is pretty heavy-weight for what we're doing, with a lot of
dependencies.
- The `Error::Errors` enum variant led to normalization steps in many
places, to avoid using that variant to represent a single error.
This commit:
- replaces Miette with codespan-reporting
- gets rid of a bunch of cargo-vet exemptions
- replaces the `Error::Errors` variant with a new `Errors` type
- removes source info from `Error` variants so they're easy to construct
- adds source info only when formatting `Errors`
- formats `Errors` with a custom `Debug` impl
- shares common code between ISLE's callers, islec and cranelift-codegen
- includes a source snippet even with fancy-errors disabled
I tried to make this a series of smaller commits but I couldn't find any
good split points; everything was too entangled with everything else.
* Cranelift: Define `heap_load` and `heap_store` instructions
* Cranelift: Implement interpreter support for `heap_load` and `heap_store`
* Cranelift: Add a suite runtests for `heap_{load,store}`
There are so many knobs we can twist for heaps and I wanted to exhaustively test
all of them, so I wrote a script to generate the tests. I've checked in the
script in case we want to make any changes in the future, but I don't think it
is worth adding this to CI to check that scripts are up to date or anything like
that.
* Review feedback
In #5174 we decided it doesn't make sense for a rule to have a
bind-pattern at the root of its left-hand side. There's no Rust value
corresponding to the root value of such a term, because it actually
represents a function declaration with one or more arguments.
This commit takes that to its logical conclusion.
`sema::Rule` previously had an `lhs` field whose value must always be a
`Pattern::Term` variant, and anyone using that structure had to deal
with the possibility of finding the wrong variant there.
Now the relevant fields from that variant are stored directly in `Rule`
instead. Also, the (tiny!) portion of `translate_pattern` which applied
when the pattern was the root term is now inlined in `collect_rules`.
Because `translate_pattern` no longer has to special-case the root term,
we can delete its `rule_term` and `is_root` arguments. That brings it
down to a more manageable four arguments, which means many calls fit on
one line now.