If an address expression is given to `to_amode` that is completely
constant (no registers at all), then it will produce an `Amode` that has
the resulting constant as an offset, and `(invalid_reg)` as the base.
This is a side-effect of the way we build up the amode step-by-step --
we're waiting to see a register and plug it into the base field. If we
never get a reg though, we need to generate a constant zero into a
register and use that as the base. This PR adds a `finalize_amode`
helper to do just that.
Fixes#4234.
RA2 recently removed the need for a dedicated scratch register for
cyclic moves (bytecodealliance/regalloc2#51). This has moderate positive
performance impact on function bodies that were register-constrained, as
it means that one more register is available. In Sightglass, I measured
+5-8% on `blake3-scalar`, at least among current benchmarks.
* Remove unused srcloc in MachReloc
* Remove unused srcloc in MachTrap
* Use `into_iter` on array in bench code to suppress a warning
* Remove unused srcloc in MachCallSite
Previously, the pinned register (enabled by the `enable_pinned_reg`
Cranelift setting and used via the `get_pinned_reg` and `set_pinned_reg`
CLIF ops) was only used when Cranelift was embedded in SpiderMonkey, in
order to support a pinned heap register. SpiderMonkey has its own
calling convention in Cranelift (named after the integration layer,
"Baldrdash").
However, the feature is more general, and should be usable with the
default system calling convention too, e.g. SysV or Windows Fastcall.
This PR fixes the ABI code to properly treat the pinned register as a
globally allocated register -- and hence an implicit input and output to
every function, not saved/restored in the prologue/epilogue -- for SysV
on x86-64 and aarch64, and Fastcall on x86-64.
Fixes#4170.
This PR adds a basic *alias analysis*, and optimizations that use it.
This is a "mid-end optimization": it operates on CLIF, the
machine-independent IR, before lowering occurs.
The alias analysis (or maybe more properly, a sort of memory-value
analysis) determines when it can prove a particular memory
location is equal to a given SSA value, and when it can, it replaces any
loads of that location.
This subsumes two common optimizations:
* Redundant load elimination: when the same memory address is loaded two
times, and it can be proven that no intervening operations will write
to that memory, then the second load is *redundant* and its result
must be the same as the first. We can use the first load's result and
remove the second load.
* Store-to-load forwarding: when a load can be proven to access exactly
the memory written by a preceding store, we can replace the load's
result with the store's data operand, and remove the load.
Both of these optimizations rely on a "last store" analysis that is a
sort of coloring mechanism, split across disjoint categories of abstract
state. The basic idea is that every memory-accessing operation is put
into one of N disjoint categories; it is disallowed for memory to ever
be accessed by an op in one category and later accessed by an op in
another category. (The frontend must ensure this.)
Then, given this, we scan the code and determine, for each
memory-accessing op, when a single prior instruction is a store to the
same category. This "colors" the instruction: it is, in a sense, a
static name for that version of memory.
This analysis provides an important invariant: if two operations access
memory with the same last-store, then *no other store can alias* in the
time between that last store and these operations. This must-not-alias
property, together with a check that the accessed address is *exactly
the same* (same SSA value and offset), and other attributes of the
access (type, extension mode) are the same, let us prove that the
results are the same.
Given last-store info, we scan the instructions and build a table from
"memory location" key (last store, address, offset, type, extension) to
known SSA value stored in that location. A store inserts a new mapping.
A load may also insert a new mapping, if we didn't already have one.
Then when a load occurs and an entry already exists for its "location",
we can reuse the value. This will be either RLE or St-to-Ld depending on
where the value came from.
Note that this *does* work across basic blocks: the last-store analysis
is a full iterative dataflow pass, and we are careful to check dominance
of a previously-defined value before aliasing to it at a potentially
redundant load. So we will do the right thing if we only have a
"partially redundant" load (loaded already but only in one predecessor
block), but we will also correctly reuse a value if there is a store or
load above a loop and a redundant load of that value within the loop, as
long as no potentially-aliasing stores happen within the loop.
Current codegen had a number of logic errors confusing
NAND with AND WITH COMPLEMENT, and NOR with OR WITH COMPLEMENT.
Add support for the missing z15 instructions and fix logic.
This PR fixes#4066: it modifies the Cranelift `build.rs` workflow to
invoke the ISLE DSL compiler on every compilation, rather than only
when the user specifies a special "rebuild ISLE" feature.
The main benefit of this change is that it vastly simplifies the mental
model required of developers, and removes a bunch of failure modes
we have tried to work around in other ways. There is now just one
"source of truth", the ISLE source itself, in the repository, and so there
is no need to understand a special "rebuild" step and how to handle
merge errors. There is no special process needed to develop the compiler
when modifying the DSL. And there is no "noise" in the git history produced
by constantly-regenerated files.
The two main downsides we discussed in #4066 are:
- Compile time could increase, by adding more to the "meta" step before the main build;
- It becomes less obvious where the source definitions are (everything becomes
more "magic"), which makes exploration and debugging harder.
This PR addresses each of these concerns:
1. To maintain reasonable compile time, it includes work to cut down the
dependencies of the `cranelift-isle` crate to *nothing* (only the Rust stdlib),
in the default build. It does this by putting the error-reporting bits
(`miette` crate) under an optional feature, and the logging (`log` crate) under
a feature-controlled macro, and manually writing an `Error` impl rather than
using `thiserror`. This completely avoids proc macros and the `syn` build slowness.
The user can still get nice errors out of `miette`: this is enabled by specifying
a Cargo feature `--features isle-errors`.
2. To allow the user to optionally inspect the generated source, which nominally
lives in a hard-to-find path inside `target/` now, this PR adds a feature `isle-in-source-tree`
that, as implied by the name, moves the target for ISLE generated source into
the source tree, at `cranelift/codegen/isle_generated_source/`. It seems reasonable
to do this when an explicit feature (opt-in) is specified because this is how ISLE regeneration
currently works as well. To prevent surprises, if the feature is *not* specified, the
build fails if this directory exists.
* Allow emitting u64 constants into constant pool.
* Use constant pool for constants on x64 that do not fit in a simm32 and are needed as a RegMem or RegMemImm.
* Fix rip-relative addressing bug in pinsrd emission.
* Narrow `allow(dead_code)` declarations
Having module wide `allow(dead_code)` may hide some code that's really
dead. In this commit I just narrowed the declarations to the specific
enum variants that were not used (as it seems reasonable to keep them
and their handling in all the matches, for future use). And the compiler
found more dead code that I think we can remove safely in the short
term.
With this, the only files annotated with a module-wide
`allow(dead_code)` are isle-generated files.
* resurrect some functions as test helpers
* ISLE compiler: fix priority-trie interval bug. (#4093)
This PR fixes a bug in the ISLE compiler related to rule priorities.
An important note first: the bug did not affect the correctness of the
Cranelift backends, either in theory (because the rules should be
correct applied in any order, even contrary to the stated priorities)
or in practice (because the generated code actually does not change at
all with the DSL compiler fix, only with a separate minimized bug
example).
The issue was a simple swap of `min` for `max` (see first
commit). This is the minimal fix, I think, to get a correct
priority-trie with the minimized bug example in this commit.
However, while debugging this, I started to convince myself that the
complexity of merging multiple priority ranges using the sort of
hybrid interval tree / string-matching trie data structure was
unneeded. The original design was built with the assumption we might
have a bunch of different priority levels, and would need the
efficiency of merging where possible. But in practice we haven't used
priorities this way: the vast majority of lowering rules exist at the
default (priority 0), and just a few overrides are explicitly at prio
1, 2 or (rarely) 3.
So, it turns out to be a lot simpler to label trie edges with (prio,
symbol) rather than (prio-range, symbol), and delete the whole mess of
interval-splitting logic on insertion. It's easier (IMHO) to convince
oneself that the resulting insertion algorithm is correct.
I was worried that this might impact the size of the generated Rust
code or its runtime, but In fact, to my initial surprise (but it makes
sense given the above "rarely used" factor), the generated code with
this compiler fix is *exactly the same*. I rebuilt with `--features
rebuild-isle,all-arch` but... there were no diffs to commit! This is
to me the simplest evidence that we didn't really need that
complexity.
* Fix earlier commit from #4093: properly sort trie.
This commit fixes an in-hindsight-obvious bug in #4093: the trie's edges
must be sorted recursively, not just at the top level.
With this fix, the generated code differs only in one cosmetic way (a
let-binding moves) but otherwise is the same.
This includes @fitzgen's fix to the CI (from the revert in #4102) that
deletes manifests to actually check that the checked-in source is
consistent with the checked-in compiler. The force-rebuild step is now
in a shell script for convenience: anyone hacking on the ISLE compiler
itself can use this script to more easily rebuild everything.
* Add note to build.rs to remind to update force-rebuild-isle.sh
The pretty-printing had swapped dst and src2; this was introduced when
we moved to RA2 (sorry about that! IMHO we should do something to
automate the mapping between regalloc arg collection and pretty
printing/emission).
`src2` comes at the end because it has a variable number of register
mentions; this is in line with how many of the other inst formats work.
Actual emitted code was never incorrect, just the pretty-printing.
Updated test golden outputs look correct to me now, including the one
that we saw was incorrect in #3945.
This PR refactors the x64 backend address-mode lowering to use an
incremental-build approach, where it considers each node in a tree of
`iadd`s that feed into a load/store address and, at each step, builds
the best possible `Amode`. It will combine an arbitrary number of
constant offsets (an extension beyond the current rules), and can
capture a left-shifted (scaled) index in any position of the tree
(another extension).
This doesn't have any measurable performance improvement on our Wasm
benchmarks in Sightglass, unfortunately, because the IR lowered from
wasm32 will do address computation in 32 bits and then `uextend` it to
add to the 64-bit heap base. We can't quite lift the 32-bit adds to 64
bits because this loses the wraparound semantics.
(We could label adds as "expected not to overflow", and allow *those* to
be lifted to 64 bit operations; wasm32 heap address computation should
fit this. This is `add nuw` (no unsigned wrap) in LLVM IR terms. That's
likely my next step.)
Nevertheless, (i) this generalizes the cases we can handle, which should
be a good thing, all other things being equal (and in this case, no
compile time impact was measured); and (ii) might benefit non-Wasm
frontends.
Currently, we have partial Spectre mitigation: we protect heap accesses
with dynamic bounds checks. Specifically, we guard against errant
accesses on the misspeculated path beyond the bounds-check conditional
branch by adding a conditional move that is also dependent on the
bounds-check condition. This data dependency on the condition is not
speculated and thus will always pick the "safe" value (in the heap case,
a NULL address) on the misspeculated path, until the pipeline flushes
and recovers onto the correct path.
This PR uses the same technique both for table accesses -- used to
implement Wasm tables -- and for jumptables, used to implement Wasm
`br_table` instructions.
In the case of Wasm tables, the cmove picks the table base address on
the misspeculated path. This is equivalent to reading the first table
entry. This prevents loads of arbitrary data addresses on the
misspeculated path.
In the case of `br_table`, the cmove picks index 0 on the misspeculated
path. This is safer than allowing a branch to an address loaded from an
index under misspeculation (i.e., it preserves control-flow integrity
even under misspeculation).
The table mitigation is controlled by a Cranelift setting, on by
default. The br_table mitigation is always on, because it is part of the
single lowering pseudoinstruction. In both cases, the impact should be
minimal: a single extra cmove in a (relatively) rarely-used operation.
The table mitigation is architecture-independent (happens during
legalization); the br_table mitigation has been implemented for both x64
and aarch64. (I don't know enough about s390x to implement this
confidently there, but would happily review a PR to do the same on that
platform.)
This PR removes "argument polarity": the feature of ISLE extractors that lets them take
inputs aside from the value to be matched.
Cases that need this expressivity have been subsumed by #4072 with if-let clauses;
we can now finally remove this misfeature of the language, which has caused significant
confusion and has always felt like a bit of a hack.
This PR (i) removes the feature from the ISLE compiler; (ii) removes it from the reference
documentation; and (iii) refactors away all uses of the feature in our three existing
backends written in ISLE.
Also fix and extend the current implementation:
- AtomicRMWOp::Clr != AtomicRmwOp::And, as the input needs to be
inverted first.
- Inputs to the cmp for the RMWLoop case are sign-extended when
needed.
- Lower Xchg to Swp.
- Lower Sub to Add with a negated input.
- Added more runtests.
Copyright (c) 2022, Arm Limited.
x64 backend: add lowerings with load-op-store fusion.
These lowerings use the `OP [mem], reg` forms (or in AT&T syntax, `OP
%reg, (mem)`) -- i.e., x86 instructions that load from memory, perform
an ALU operation, and store the result, all in one instruction. Using
these instruction forms, we can merge three CLIF ops together: a load,
an arithmetic operation, and a store.
The recent work in #4061 introduced a notion of "unique uses" for CLIF
values that both simplified the load-op merging rules and allowed
loads to merge in some more places.
Unfortunately there's one factor that PR didn't account for: a unique
use at the CLIF level could become a multiple-use at the VCode level,
when a lowering uses a value multiple times!
Making this less error-prone in general is hard, because we don't know
the lowering in VCode until it's emitted, so we can't ahead-of-time
know that a value will be used multiple times and prevent its
merging. But we *can* know in the lowerings themselves when we're
doing this. At least we get a panic from regalloc when we get this
wrong; no bad code (uninitialized register being read) should ever
come from a backend bug like this.
This is still a bit less than ideal, but for now the fix is: in
`cmp_and_choose` in the x64 backend (which compares values, then
picks one or the other with a cmove), explicitly put values in
registers.
Fixes#4067 (thanks @Mrmaxmeier for the report!).
* Cranelift: fix#3953: rework single/multiple-use logic in lowering.
This PR addresses the longstanding issue with loads trying to merge
into compares on x86-64, and more generally, with the lowering
framework falsely recognizing "single uses" of one op by
another (which would normally allow merging of side-effecting ops like
loads) when there is *indirect* duplication.
To fix this, we replace the direct `value_uses` count with a
transitive notion of uniqueness (not unlike Rust's `&`/`&mut` and how
a `&mut` downgrades to `&` when accessed through another `&`!). A
value is used multiple times transitively if it has multiple direct
uses, or is used by another op that is used multiple times
transitively.
The canonical example of badness is:
```
v1 := load
v2 := ifcmp v1, ...
v3 := selectif v2, ...
v4 := selectif v2, ...
```
both `v3` and `v4` effectively merge the `ifcmp` (`v2`), so even
though the use of `v1` is "unique", it is codegenned twice. This is
why we ~~can't have nice things~~ can't merge loads into
compares (#3953).
There is quite a subtle and interesting design space around this
problem and how we might solve it. See the long doc-comment on
`ValueUseState` in this PR for more justification for the particular
design here. In particular, this design deliberately simplifies a bit
relative to an "optimal" solution: some uses can *become* unique
depending on merging, but we don't design our data structures for such
updates because that would require significant extra costly
tracking (some sort of transitive refcounting). For example, in the
above, if `selectif` somehow did not merge `ifcmp`, then we would only
codegen the `ifcmp` once into its result register (and use that
register twice); then the load *is* uniquely used, and could be
merged. But that requires transitioning from "multiple use" back to
"unique use" with careful tracking as we do pattern-matching, which
I've chosen to make out-of-scope here for now. In practice, I don't
think it will matter too much (and we can always improve later).
With this PR, we can now re-enable load-op merging for compares. A
subsequent commit does this.
* Update x64 backend to allow load-op merging for `cmp`.
* Update filetests.
* Add test for cmp-mem merging on x64.
* Comment fixes.
* Rework ValueUseState analysis for better performance.
* Update s390x filetest: iadd_ifcout cannot merge loads anymore because it has multiple outputs (ValueUseState limitation)
* Address review comments.
With these fixes, all this PR has to do is instantiate and run the
checker on the `regalloc2::Output`. This is off by default, and is
enabled by setting the `regalloc_checker` Cranelift option.
This restores the old functionality provided by e.g. the
`backtracking_checked` regalloc algorithm setting rather than
`backtracking` when we were still on regalloc.rs.
regalloc2 is a bit pickier about critical edges than regalloc.rs was,
because of how it inserts moves. In particular, if a branch has any
arguments (e.g., a conditional branch or br_table), its successors must
all have only one predecessor, so we can do edge moves at the top of
successor blocks rather than at the end of this block. Otherwise, moves
that semantically must come after the block's last uses (the branch's
args) would be placed before it.
This is almost always the case, because crit-edge splitting ensures that
if we have more than one succ, all our succs will have only one pred.
This is because branch kinds that take arguments (fixed args, not the
blockparam args) tend to have more than one successor: conditionals and
br_tables.
However, a fuzzbug recently illuminated one corner case I had missed: a
br_table can have *one* successor only, if it has a default target and
an empty table. In this case, crit-edge splitting will happily skip a
split and assume that we can insert edge moves at the end of the block
with the br_table. But this will fail.
regalloc2 explicitly checks this and bails with a panic, rather than
continue, so no miscompilation is possible; but without this fix, we
will get these panics on br_tables with empty tables.
Previously, the block successor accumulation and the blockparam branch
arg setup were decoupled. The lowering backend implicitly specified
the order of successor edges via its `MachTerminator` enum on the last
instruction in the block, while the `Lower` toplevel
machine-independent driver set up blockparam branch args in the edge
order seen in CLIF.
In some cases, these orders did not match -- for example, when the
conditional branch depended on an FP condition that was implemented by
swapping taken/not-taken edges and inverting the condition code.
This PR refactors the successor handling to be centralized in `Lower`
rather than flow through the terminator `MachInst`, and adds a
successor block and its blockparam args at the same time, ensuring the
orders match.
Following the merge of regalloc2 support, this became slower because we
are stricter about the critical-edge invariant, generating a separate
edge block for every out-edge even if two or more out-edges go to the
same successor (this is significant in cases of `br_table` with many
entries having the same target block, for example).
Many of those edge blocks are empty and end up collapsed by the
MachBuffer, which leads to a large set of aliased labels.
The invariant validation will dutifully iterate over all the data
structures at every step, validating all of our conditions. But this
gets way slower in the new context, to the point that we'll probably
have some fuzz timeouts.
This was pointed out in [1] but I missed removing this in #3989. Given
that `MachBuffer` has been around for nearly two years now, has been
fuzzed continuously with the invariant validation for that time, and
also has a correctness proof in the comments, it's probably reasonable
to remove this high (recently increased) cost from the fuzzing-specific
compilation configuration.
[1]
https://github.com/bytecodealliance/wasmtime/pull/3989#discussion_r847712263
Merge Mov32 and Mov64 into a single instruction parameterized by a new
OperandSize field. Also combine the Mov[K,N,Z] into a single instruction
with a new opcode to select between the operations.
Copyright (c) 2022, Arm Limited.
This PR switches Cranelift over to the new register allocator, regalloc2.
See [this document](https://gist.github.com/cfallin/08553421a91f150254fe878f67301801)
for a summary of the design changes. This switchover has implications for
core VCode/MachInst types and the lowering pass.
Overall, this change brings improvements to both compile time and speed of
generated code (runtime), as reported in #3942:
```
Benchmark Compilation (wallclock) Execution (wallclock)
blake3-scalar 25% faster 28% faster
blake3-simd no diff no diff
meshoptimizer 19% faster 17% faster
pulldown-cmark 17% faster no diff
bz2 15% faster no diff
SpiderMonkey, 21% faster 2% faster
fib(30)
clang.wasm 42% faster N/A
```
Issue #3963 identified a miscompilation with select in which the second
in the pair of `CMOV`s (one pair per `i128` register) used the wrong
flag. This change fixes the error in the x64 ISLE helper function
emitting these `CMOV` instructions.
This change moves the majority of the lowerings for CLIF's `load`
instruction over to ISLE. To do so, it also migrates the previous
mechanism for creating an `Amode` (`lower_to_amode`) to several ISLE
rules (see `to_amode`).
* Run the GC smoketest with epoch support enabled as well.
* Handle safepoints in cold blocks properly.
Currently, the way that we find safepoint slots for a given instruction
relies on the instruction index order in the safepoint list matching the
order of instruction emission.
Previous to the introduction of cold-block support, this was trivially
satisfied by sorting the safepoint list: we emit instructions 0, 1, 2,
3, 4, ..., and so if we have safepoints at instructions 1 and 4, we will
encounter them in that order.
However, cold blocks are supported by swizzling the emission order at
the last moment (to avoid having to renumber instructions partway
through the compilation pipeline), so we actually emit instructions out
of index order when cold blocks are present.
Reference-type support in Wasm in particular uses cold blocks for
slowpaths, and has live refs and safepoints in these slowpaths, so we
can reliably "skip" a safepoint (not emit any metadata for it) in the
presence of reftype usage.
This PR fixes the emission code by building a map from instruction index
to safepoint index first, then doing lookups through this map, rather
than following along in-order as it emits instructions.
This change removes all variants of `load*_complex` and `store*_complex`
from Cranelift; this is a breaking change to the instructions exposed by
CLIF. The complete list of instructions removed is: `load_complex`,
`store_complex`, `uload8_complex`, `sload8_complex`, `istore8_complex`,
`sload8_complex`, `uload16_complex`, `sload16_complex`,
`istore16_complex`, `uload32_complex`, `sload32_complex`,
`istore32_complex`, `uload8x8_complex`, `sload8x8_complex`,
`sload16x4_complex`, `uload16x4_complex`, `uload32x2_complex`,
`sload32x2_complex`.
The rationale for this removal is that the Cranelift backend now has the
ability to pattern-match multiple upstream additions in order to
calculate the address to access. Previously, this was not possible so
the `*_complex` instructions were needed. Over time, these instructions
have fallen out of use in this repository, making the additional
overhead of maintaining them a chore.
Previous changes had ported the difficult "`select` based on an `fcmp`"
patterns to ISLE; this completes porting of `select` by moving over the
final two kinds of patterns:
- `select` based on an `icmp`
- `select` based on a value
* x64: port scalar `fcmp` to ISLE
Implement the CLIF lowering for the `fcmp` to ISLE. This adds a new
type-matcher, `ty_scalar_float`, for detecting uses of `F32` and `F64`.
* isle: rename `vec128` to `ty_vec12`
This refactoring changes the name of the `vec128` matcher function to
follow the `ty_*` convention of the other type matchers. It also makes
the helper an inline function call.
* x64: port vector `fcmp` to ISLE
Fuzz testing identified a lowering case for CLIF's `icmp` in which the
double use of a loaded operand resulted in a register allocation error.
This change manually adds `put_in_xmm` to avoid load-coalescing these
values and includes a CLIF filetest to trigger this issue. Closes#3951.
I opened #3953 to discuss a way in which this kind of mistake (i.e.,
forgetting to add `put_in_*` in certain situations) could be avoided.
This change is refactoring only--it should have no logic changes. As
discussed previously, prefixing all machine code instructions with
`x64_` will make it easier to identify what parts of the ISLE code
correspond to single instructions and what parts rely on helpers that
may emit more than one instruction.
Previously, we used the flags of `AND` for `SETcc`. This change uses
`TEST` instead, which discards the AND result but sets the flags needed
for `SETcc`. This reduces register pressure slightly for this sequence.