In #4104 we discussed whether it makes sense for the division and
remainder ops to support vector types. We concluded that because most
hardware doesn't support it directly, it probably is not ideal to force
all backends to polyfill it. In the future we can always reverse this
decision, perhaps with a platform-independent legalization.
This PR restricts the allowed types on the CLIF ops to integer types
only.
This test was added between the last CI run on #4088 and its merge to
main, and the changes in #4088 (use of constants directly in instruction
via load from constant pool, rather than from a register initialized by
a separate instruction) cause it to fail now.
This PR alters the test to be invariant to regalloc and argument
decisions during lowering, as the test is really checking (per the
comment) that we get two cmoves without an intervening move. As such, it
just matches the instruction opcodes, irrespective of the arguments.
* Allow emitting u64 constants into constant pool.
* Use constant pool for constants on x64 that do not fit in a simm32 and are needed as a RegMem or RegMemImm.
* Fix rip-relative addressing bug in pinsrd emission.
* Narrow `allow(dead_code)` declarations
Having module wide `allow(dead_code)` may hide some code that's really
dead. In this commit I just narrowed the declarations to the specific
enum variants that were not used (as it seems reasonable to keep them
and their handling in all the matches, for future use). And the compiler
found more dead code that I think we can remove safely in the short
term.
With this, the only files annotated with a module-wide
`allow(dead_code)` are isle-generated files.
* resurrect some functions as test helpers
* ISLE compiler: fix priority-trie interval bug. (#4093)
This PR fixes a bug in the ISLE compiler related to rule priorities.
An important note first: the bug did not affect the correctness of the
Cranelift backends, either in theory (because the rules should be
correct applied in any order, even contrary to the stated priorities)
or in practice (because the generated code actually does not change at
all with the DSL compiler fix, only with a separate minimized bug
example).
The issue was a simple swap of `min` for `max` (see first
commit). This is the minimal fix, I think, to get a correct
priority-trie with the minimized bug example in this commit.
However, while debugging this, I started to convince myself that the
complexity of merging multiple priority ranges using the sort of
hybrid interval tree / string-matching trie data structure was
unneeded. The original design was built with the assumption we might
have a bunch of different priority levels, and would need the
efficiency of merging where possible. But in practice we haven't used
priorities this way: the vast majority of lowering rules exist at the
default (priority 0), and just a few overrides are explicitly at prio
1, 2 or (rarely) 3.
So, it turns out to be a lot simpler to label trie edges with (prio,
symbol) rather than (prio-range, symbol), and delete the whole mess of
interval-splitting logic on insertion. It's easier (IMHO) to convince
oneself that the resulting insertion algorithm is correct.
I was worried that this might impact the size of the generated Rust
code or its runtime, but In fact, to my initial surprise (but it makes
sense given the above "rarely used" factor), the generated code with
this compiler fix is *exactly the same*. I rebuilt with `--features
rebuild-isle,all-arch` but... there were no diffs to commit! This is
to me the simplest evidence that we didn't really need that
complexity.
* Fix earlier commit from #4093: properly sort trie.
This commit fixes an in-hindsight-obvious bug in #4093: the trie's edges
must be sorted recursively, not just at the top level.
With this fix, the generated code differs only in one cosmetic way (a
let-binding moves) but otherwise is the same.
This includes @fitzgen's fix to the CI (from the revert in #4102) that
deletes manifests to actually check that the checked-in source is
consistent with the checked-in compiler. The force-rebuild step is now
in a shell script for convenience: anyone hacking on the ISLE compiler
itself can use this script to more easily rebuild everything.
* Add note to build.rs to remind to update force-rebuild-isle.sh
* Revert "ISLE compiler: fix priority-trie interval bug. (#4093)"
This reverts commit 2122337112.
* ci: Delete ISLE manifest files to force an ISLE rebuild
This makes it so that we actually rebuild ISLE rather than just check that the
manifests are up-to-date.
In #3744, we identified that extra `mov` instructions were inserted in
between the `cmov` instructions that CLIF's `select` lowers to. The
switch to regalloc2 resolved this and this test checks that no
intervening `mov`s are inserted. Closes#3744.
The pretty-printing had swapped dst and src2; this was introduced when
we moved to RA2 (sorry about that! IMHO we should do something to
automate the mapping between regalloc arg collection and pretty
printing/emission).
`src2` comes at the end because it has a variable number of register
mentions; this is in line with how many of the other inst formats work.
Actual emitted code was never incorrect, just the pretty-printing.
Updated test golden outputs look correct to me now, including the one
that we saw was incorrect in #3945.
This PR fixes a bug in the ISLE compiler related to rule priorities.
An important note first: the bug did not affect the correctness of the
Cranelift backends, either in theory (because the rules should be
correct applied in any order, even contrary to the stated priorities)
or in practice (because the generated code actually does not change at
all with the DSL compiler fix, only with a separate minimized bug
example).
The issue was a simple swap of `min` for `max` (see first
commit). This is the minimal fix, I think, to get a correct
priority-trie with the minimized bug example in this commit.
However, while debugging this, I started to convince myself that the
complexity of merging multiple priority ranges using the sort of
hybrid interval tree / string-matching trie data structure was
unneeded. The original design was built with the assumption we might
have a bunch of different priority levels, and would need the
efficiency of merging where possible. But in practice we haven't used
priorities this way: the vast majority of lowering rules exist at the
default (priority 0), and just a few overrides are explicitly at prio
1, 2 or (rarely) 3.
So, it turns out to be a lot simpler to label trie edges with (prio,
symbol) rather than (prio-range, symbol), and delete the whole mess of
interval-splitting logic on insertion. It's easier (IMHO) to convince
oneself that the resulting insertion algorithm is correct.
I was worried that this might impact the size of the generated Rust
code or its runtime, but In fact, to my initial surprise (but it makes
sense given the above "rarely used" factor), the generated code with
this compiler fix is *exactly the same*. I rebuilt with `--features
rebuild-isle,all-arch` but... there were no diffs to commit! This is
to me the simplest evidence that we didn't really need that
complexity.
This PR refactors the x64 backend address-mode lowering to use an
incremental-build approach, where it considers each node in a tree of
`iadd`s that feed into a load/store address and, at each step, builds
the best possible `Amode`. It will combine an arbitrary number of
constant offsets (an extension beyond the current rules), and can
capture a left-shifted (scaled) index in any position of the tree
(another extension).
This doesn't have any measurable performance improvement on our Wasm
benchmarks in Sightglass, unfortunately, because the IR lowered from
wasm32 will do address computation in 32 bits and then `uextend` it to
add to the 64-bit heap base. We can't quite lift the 32-bit adds to 64
bits because this loses the wraparound semantics.
(We could label adds as "expected not to overflow", and allow *those* to
be lifted to 64 bit operations; wasm32 heap address computation should
fit this. This is `add nuw` (no unsigned wrap) in LLVM IR terms. That's
likely my next step.)
Nevertheless, (i) this generalizes the cases we can handle, which should
be a good thing, all other things being equal (and in this case, no
compile time impact was measured); and (ii) might benefit non-Wasm
frontends.
Currently, we have partial Spectre mitigation: we protect heap accesses
with dynamic bounds checks. Specifically, we guard against errant
accesses on the misspeculated path beyond the bounds-check conditional
branch by adding a conditional move that is also dependent on the
bounds-check condition. This data dependency on the condition is not
speculated and thus will always pick the "safe" value (in the heap case,
a NULL address) on the misspeculated path, until the pipeline flushes
and recovers onto the correct path.
This PR uses the same technique both for table accesses -- used to
implement Wasm tables -- and for jumptables, used to implement Wasm
`br_table` instructions.
In the case of Wasm tables, the cmove picks the table base address on
the misspeculated path. This is equivalent to reading the first table
entry. This prevents loads of arbitrary data addresses on the
misspeculated path.
In the case of `br_table`, the cmove picks index 0 on the misspeculated
path. This is safer than allowing a branch to an address loaded from an
index under misspeculation (i.e., it preserves control-flow integrity
even under misspeculation).
The table mitigation is controlled by a Cranelift setting, on by
default. The br_table mitigation is always on, because it is part of the
single lowering pseudoinstruction. In both cases, the impact should be
minimal: a single extra cmove in a (relatively) rarely-used operation.
The table mitigation is architecture-independent (happens during
legalization); the br_table mitigation has been implemented for both x64
and aarch64. (I don't know enough about s390x to implement this
confidently there, but would happily review a PR to do the same on that
platform.)
This PR removes "argument polarity": the feature of ISLE extractors that lets them take
inputs aside from the value to be matched.
Cases that need this expressivity have been subsumed by #4072 with if-let clauses;
we can now finally remove this misfeature of the language, which has caused significant
confusion and has always felt like a bit of a hack.
This PR (i) removes the feature from the ISLE compiler; (ii) removes it from the reference
documentation; and (iii) refactors away all uses of the feature in our three existing
backends written in ISLE.
In #4072 I mistakenly put the subsection about if-let clauses in the
language doc just below the next section header, so it's in the wrong
section ("mapping to Rust"). This moves it back upward to where it
should be. Sorry about that!
This PR adds support for `if-let` clauses, as proposed in
bytecodealliance/rfcs#21. These clauses augment the left-hand side
(pattern-matching phase) of rules in the ISLE instruction-selection DSL
with sub-patterns matching on sub-expressions. The ability to list
additional match operations to perform, out-of-line from the original
toplevel pattern, greatly simplifies some idioms. See the RFC for more
details and examples of use.
This fixes a bug where the ISLE compiler would refuse to accept
out-of-order declarations in the order of: (i) use of an implicit
conversion backed by an extern constructor; (ii) extern declaration
for that constructor.
The issue was one of phase separation: we were capturing and noting
"extern constructor" status on terms in the same pass in which we were
typechecking and resolving implicit conversions. Given this knowledge,
the fix is straightforward: externs are picked up in a prior pass.
* Update to clap 3.0
This commit migrates all CLI commands internally used in this project
from structopt/clap2 to clap 3. The intent here is to ensure that we're
using maintained versions of the dependencies as structopt and clap 2
are less maintained nowadays. Most transitions were pretty
straightforward and mostly dealing with structopt/clap3 differences.
* Fix a number of `cargo deny` errors
This commit fixes a few errors around duplicate dependencies which
arose from the prior update to clap3. This also uses a new feature in
`deny.toml`, `skip-tree`, which allows having a bit more targeted
ignores for skips of duplicate version checks. This showed a few more
locations in Wasmtime itself where we could update some dependencies.
* Run a `cargo update` over our dependencies
This'll notably fix a `cargo audit` error where we have a pinned version
of the `regex` crate which has a CVE assigned to it.
* Update to `object` and `hashbrown` crates
Prune some duplicate versions showing up from the previous `cargo update`
Currently, a variable can be named in two different ways in an ISLE
pattern. One can write a pattern like `(T x y)` that binds the two
args of `T` with the subpatterns `x` and `y`, each of which match
anything and capture the value as a bound variable. Or, one can write
a pattern like `(T x =x)`, where the first arg pattern `x` captures
the value in `x` and the second arg pattern `=x` matches only the same
value that was already captured.
It turns out (thanks to @fitzgen for this insight here [1]) that this
distinction can actually be inferred easily: if `x` isn't bound, then
mentioning it binds it; otherwise, it matches only the already-bound
variable. There's no concern about ordering (one mention binding
vs. the other) because (i) the value is equal either way, and (ii) the
types at both sites must be the same.
This language tweak seems like it should simplify things nicely! We
can remove the `=x` syntax later if we want, but this PR doesn't do
so.
[1] https://github.com/bytecodealliance/wasmtime/pull/4071#discussion_r859111513
Also fix and extend the current implementation:
- AtomicRMWOp::Clr != AtomicRmwOp::And, as the input needs to be
inverted first.
- Inputs to the cmp for the RMWLoop case are sign-extended when
needed.
- Lower Xchg to Swp.
- Lower Sub to Add with a negated input.
- Added more runtests.
Copyright (c) 2022, Arm Limited.
x64 backend: add lowerings with load-op-store fusion.
These lowerings use the `OP [mem], reg` forms (or in AT&T syntax, `OP
%reg, (mem)`) -- i.e., x86 instructions that load from memory, perform
an ALU operation, and store the result, all in one instruction. Using
these instruction forms, we can merge three CLIF ops together: a load,
an arithmetic operation, and a store.
The recent work in #4061 introduced a notion of "unique uses" for CLIF
values that both simplified the load-op merging rules and allowed
loads to merge in some more places.
Unfortunately there's one factor that PR didn't account for: a unique
use at the CLIF level could become a multiple-use at the VCode level,
when a lowering uses a value multiple times!
Making this less error-prone in general is hard, because we don't know
the lowering in VCode until it's emitted, so we can't ahead-of-time
know that a value will be used multiple times and prevent its
merging. But we *can* know in the lowerings themselves when we're
doing this. At least we get a panic from regalloc when we get this
wrong; no bad code (uninitialized register being read) should ever
come from a backend bug like this.
This is still a bit less than ideal, but for now the fix is: in
`cmp_and_choose` in the x64 backend (which compares values, then
picks one or the other with a cmove), explicitly put values in
registers.
Fixes#4067 (thanks @Mrmaxmeier for the report!).
* Cranelift: fix#3953: rework single/multiple-use logic in lowering.
This PR addresses the longstanding issue with loads trying to merge
into compares on x86-64, and more generally, with the lowering
framework falsely recognizing "single uses" of one op by
another (which would normally allow merging of side-effecting ops like
loads) when there is *indirect* duplication.
To fix this, we replace the direct `value_uses` count with a
transitive notion of uniqueness (not unlike Rust's `&`/`&mut` and how
a `&mut` downgrades to `&` when accessed through another `&`!). A
value is used multiple times transitively if it has multiple direct
uses, or is used by another op that is used multiple times
transitively.
The canonical example of badness is:
```
v1 := load
v2 := ifcmp v1, ...
v3 := selectif v2, ...
v4 := selectif v2, ...
```
both `v3` and `v4` effectively merge the `ifcmp` (`v2`), so even
though the use of `v1` is "unique", it is codegenned twice. This is
why we ~~can't have nice things~~ can't merge loads into
compares (#3953).
There is quite a subtle and interesting design space around this
problem and how we might solve it. See the long doc-comment on
`ValueUseState` in this PR for more justification for the particular
design here. In particular, this design deliberately simplifies a bit
relative to an "optimal" solution: some uses can *become* unique
depending on merging, but we don't design our data structures for such
updates because that would require significant extra costly
tracking (some sort of transitive refcounting). For example, in the
above, if `selectif` somehow did not merge `ifcmp`, then we would only
codegen the `ifcmp` once into its result register (and use that
register twice); then the load *is* uniquely used, and could be
merged. But that requires transitioning from "multiple use" back to
"unique use" with careful tracking as we do pattern-matching, which
I've chosen to make out-of-scope here for now. In practice, I don't
think it will matter too much (and we can always improve later).
With this PR, we can now re-enable load-op merging for compares. A
subsequent commit does this.
* Update x64 backend to allow load-op merging for `cmp`.
* Update filetests.
* Add test for cmp-mem merging on x64.
* Comment fixes.
* Rework ValueUseState analysis for better performance.
* Update s390x filetest: iadd_ifcout cannot merge loads anymore because it has multiple outputs (ValueUseState limitation)
* Address review comments.
With these fixes, all this PR has to do is instantiate and run the
checker on the `regalloc2::Output`. This is off by default, and is
enabled by setting the `regalloc_checker` Cranelift option.
This restores the old functionality provided by e.g. the
`backtracking_checked` regalloc algorithm setting rather than
`backtracking` when we were still on regalloc.rs.
regalloc2 is a bit pickier about critical edges than regalloc.rs was,
because of how it inserts moves. In particular, if a branch has any
arguments (e.g., a conditional branch or br_table), its successors must
all have only one predecessor, so we can do edge moves at the top of
successor blocks rather than at the end of this block. Otherwise, moves
that semantically must come after the block's last uses (the branch's
args) would be placed before it.
This is almost always the case, because crit-edge splitting ensures that
if we have more than one succ, all our succs will have only one pred.
This is because branch kinds that take arguments (fixed args, not the
blockparam args) tend to have more than one successor: conditionals and
br_tables.
However, a fuzzbug recently illuminated one corner case I had missed: a
br_table can have *one* successor only, if it has a default target and
an empty table. In this case, crit-edge splitting will happily skip a
split and assume that we can insert edge moves at the end of the block
with the br_table. But this will fail.
regalloc2 explicitly checks this and bails with a panic, rather than
continue, so no miscompilation is possible; but without this fix, we
will get these panics on br_tables with empty tables.
Previously, the block successor accumulation and the blockparam branch
arg setup were decoupled. The lowering backend implicitly specified
the order of successor edges via its `MachTerminator` enum on the last
instruction in the block, while the `Lower` toplevel
machine-independent driver set up blockparam branch args in the edge
order seen in CLIF.
In some cases, these orders did not match -- for example, when the
conditional branch depended on an FP condition that was implemented by
swapping taken/not-taken edges and inverting the condition code.
This PR refactors the successor handling to be centralized in `Lower`
rather than flow through the terminator `MachInst`, and adds a
successor block and its blockparam args at the same time, ensuring the
orders match.
Following the merge of regalloc2 support, this became slower because we
are stricter about the critical-edge invariant, generating a separate
edge block for every out-edge even if two or more out-edges go to the
same successor (this is significant in cases of `br_table` with many
entries having the same target block, for example).
Many of those edge blocks are empty and end up collapsed by the
MachBuffer, which leads to a large set of aliased labels.
The invariant validation will dutifully iterate over all the data
structures at every step, validating all of our conditions. But this
gets way slower in the new context, to the point that we'll probably
have some fuzz timeouts.
This was pointed out in [1] but I missed removing this in #3989. Given
that `MachBuffer` has been around for nearly two years now, has been
fuzzed continuously with the invariant validation for that time, and
also has a correctness proof in the comments, it's probably reasonable
to remove this high (recently increased) cost from the fuzzing-specific
compilation configuration.
[1]
https://github.com/bytecodealliance/wasmtime/pull/3989#discussion_r847712263
Merge Mov32 and Mov64 into a single instruction parameterized by a new
OperandSize field. Also combine the Mov[K,N,Z] into a single instruction
with a new opcode to select between the operations.
Copyright (c) 2022, Arm Limited.
This PR switches Cranelift over to the new register allocator, regalloc2.
See [this document](https://gist.github.com/cfallin/08553421a91f150254fe878f67301801)
for a summary of the design changes. This switchover has implications for
core VCode/MachInst types and the lowering pass.
Overall, this change brings improvements to both compile time and speed of
generated code (runtime), as reported in #3942:
```
Benchmark Compilation (wallclock) Execution (wallclock)
blake3-scalar 25% faster 28% faster
blake3-simd no diff no diff
meshoptimizer 19% faster 17% faster
pulldown-cmark 17% faster no diff
bz2 15% faster no diff
SpiderMonkey, 21% faster 2% faster
fib(30)
clang.wasm 42% faster N/A
```
Issue #3963 identified a miscompilation with select in which the second
in the pair of `CMOV`s (one pair per `i128` register) used the wrong
flag. This change fixes the error in the x64 ISLE helper function
emitting these `CMOV` instructions.
This change moves the majority of the lowerings for CLIF's `load`
instruction over to ISLE. To do so, it also migrates the previous
mechanism for creating an `Amode` (`lower_to_amode`) to several ISLE
rules (see `to_amode`).
* Update wasm-tools crates
This commit updates the wasm-tools family of crates as used in Wasmtime.
Notably this brings in the update which removes module linking support
as well as a number of internal refactorings around names and such
within wasmparser itself. This updates all of the wasm translation
support which binds to wasmparser as appropriate.
Other crates all had API-compatible changes for at least what Wasmtime
used so no further changes were necessary beyond updating version
requirements.
* Update a test expectation
* Upgrade all crates to the Rust 2021 edition
I've personally started using the new format strings for things like
`panic!("some message {foo}")` or similar and have been upgrading crates
on a case-by-case basis, but I think it probably makes more sense to go
ahead and blanket upgrade everything so 2021 features are always
available.
* Fix compile of the C API
* Fix a warning
* Fix another warning
* Bump to 0.36.0
* Add a two-week delay to Wasmtime's release process
This commit is a proposal to update Wasmtime's release process with a
two-week delay from branching a release until it's actually officially
released. We've had two issues lately that came up which led to this proposal:
* In #3915 it was realized that changes just before the 0.35.0 release
weren't enough for an embedding use case, but the PR didn't meet the
expectations for a full patch release.
* At Fastly we were about to start rolling out a new version of Wasmtime
when over the weekend the fuzz bug #3951 was found. This led to the
desire internally to have a "must have been fuzzed for this long"
period of time for Wasmtime changes which we felt were better
reflected in the release process itself rather than something about
Fastly's own integration with Wasmtime.
This commit updates the automation for releases to unconditionally
create a `release-X.Y.Z` branch on the 5th of every month. The actual
release from this branch is then performed on the 20th of every month,
roughly two weeks later. This should provide a period of time to ensure
that all changes in a release are fuzzed for at least two weeks and
avoid any further surprises. This should also help with any last-minute
changes made just before a release if they need tweaking since
backporting to a not-yet-released branch is much easier.
Overall there are some new properties about Wasmtime with this proposal
as well:
* The `main` branch will always have a section in `RELEASES.md` which is
listed as "Unreleased" for us to fill out.
* The `main` branch will always be a version ahead of the latest
release. For example it will be bump pre-emptively as part of the
release process on the 5th where if `release-2.0.0` was created then
the `main` branch will have 3.0.0 Wasmtime.
* Dates for major versions are automatically updated in the
`RELEASES.md` notes.
The associated documentation for our release process is updated and the
various scripts should all be updated now as well with this commit.
* Add notes on a security patch
* Clarify security fixes shouldn't be previewed early on CI
* Run the GC smoketest with epoch support enabled as well.
* Handle safepoints in cold blocks properly.
Currently, the way that we find safepoint slots for a given instruction
relies on the instruction index order in the safepoint list matching the
order of instruction emission.
Previous to the introduction of cold-block support, this was trivially
satisfied by sorting the safepoint list: we emit instructions 0, 1, 2,
3, 4, ..., and so if we have safepoints at instructions 1 and 4, we will
encounter them in that order.
However, cold blocks are supported by swizzling the emission order at
the last moment (to avoid having to renumber instructions partway
through the compilation pipeline), so we actually emit instructions out
of index order when cold blocks are present.
Reference-type support in Wasm in particular uses cold blocks for
slowpaths, and has live refs and safepoints in these slowpaths, so we
can reliably "skip" a safepoint (not emit any metadata for it) in the
presence of reftype usage.
This PR fixes the emission code by building a map from instruction index
to safepoint index first, then doing lookups through this map, rather
than following along in-order as it emits instructions.
This change removes all variants of `load*_complex` and `store*_complex`
from Cranelift; this is a breaking change to the instructions exposed by
CLIF. The complete list of instructions removed is: `load_complex`,
`store_complex`, `uload8_complex`, `sload8_complex`, `istore8_complex`,
`sload8_complex`, `uload16_complex`, `sload16_complex`,
`istore16_complex`, `uload32_complex`, `sload32_complex`,
`istore32_complex`, `uload8x8_complex`, `sload8x8_complex`,
`sload16x4_complex`, `uload16x4_complex`, `uload32x2_complex`,
`sload32x2_complex`.
The rationale for this removal is that the Cranelift backend now has the
ability to pattern-match multiple upstream additions in order to
calculate the address to access. Previously, this was not possible so
the `*_complex` instructions were needed. Over time, these instructions
have fallen out of use in this repository, making the additional
overhead of maintaining them a chore.