As with the earlier change that exposed the `DataFlowGraph::insts` field through a restrictive newtype, expose `DataFlowGraph::blocks` through an interface that allows only a restricted set of operations. Making this field public lets us avoid a re-match during SSA construction, and simplifies the implementation of adding a block argument to a block referenced by a `br_table` instruction.
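As a rough illustration of the pattern (names here are illustrative, not the exact upstream API), the newtype wraps the underlying `PrimaryMap` and re-exports only the operations callers are allowed to use:

```rust
use cranelift_entity::{entity_impl, PrimaryMap};

/// Stub entity for this sketch; the real `Block` lives in `ir::entities`.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct Block(u32);
entity_impl!(Block);

#[derive(Default)]
pub struct BlockData { /* block parameters, etc. */ }

/// Restrictive newtype: the map itself stays private.
pub struct Blocks(PrimaryMap<Block, BlockData>);

impl Blocks {
    /// The only way to create a block.
    pub fn add(&mut self) -> Block {
        self.0.push(BlockData::default())
    }

    /// Read-only access to a block's data.
    pub fn get(&self, block: Block) -> Option<&BlockData> {
        self.0.get(block)
    }
}
```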
We don't need to spend time going through the GVN map to dedup a
newly-constructed `iconst 0` when we already matched that value on the
left-hand side of these rules.
Also, mark these rules as subsuming any others since we can't do better
than reducing an expression to a constant.
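As a sketch of the shape involved (adapted from the style of these rules; the committed bodies may differ), the left-hand side binds the matched constant so the right-hand side can return it directly:

```
;; Reuse the matched `iconst 0` instead of constructing (and then
;; GVN-dedup'ing) a fresh one; `subsume` records that nothing can beat
;; a constant result.
(rule (simplify (imul ty zero @ (iconst ty (u64_from_imm64 0)) _))
      (subsume zero))
```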
I was playing around with Souper recently on some Wasm modules I had lying around, and these are some optimization opportunities that popped out which seemed easy enough to add to the egraph-based optimizations.
When investigating #5716, I found that rematerialization of a `call`, in
addition to blowing up for other reasons, caused aliasing of the varargs
list (the `EntityList` in the `ListPool`), such that editing the args of
the second copy of the call instruction inadvertently updated the first
as well.
This PR modifies `DataFlowGraph::clone_inst` so that it always clones
the varargs list if present. This shouldn't have any functional impact
on Cranelift today, because we don't rematerialize any instructions with
varargs; but it's important to get it right to avoid a bug later!
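A minimal sketch of why the clone must be deep (simplified signatures): an `EntityList` is a small index into a shared `ListPool`, so a plain struct copy aliases the same backing storage.

```rust
use cranelift_codegen::ir::Value;
use cranelift_entity::{EntityList, ListPool};

/// Build a fresh list that owns its own storage, so edits to one copy
/// can't affect the other.
fn clone_value_list(
    list: &EntityList<Value>,
    pool: &mut ListPool<Value>,
) -> EntityList<Value> {
    let values = list.as_slice(pool).to_vec();
    EntityList::from_slice(&values, pool)
}
```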
Branch instructions no longer have overlapping behavior, so we can remove `analyze_branch` and instead match on the `InstructionData` directly.
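A sketch of the direct-match style (variant and field names assumed to match the current `InstructionData`):

```rust
use cranelift_codegen::ir::{DataFlowGraph, Inst, InstructionData};

/// Visit a branch's target blocks without an `analyze_branch` summary.
fn visit_branch_targets(dfg: &DataFlowGraph, inst: Inst) {
    match &dfg.insts[inst] {
        InstructionData::Jump { destination, .. } => {
            let _target = destination.block(&dfg.value_lists);
        }
        InstructionData::Brif { blocks, .. } => {
            // Two BlockCalls: the then- and else-destinations.
            for call in blocks {
                let _target = call.block(&dfg.value_lists);
            }
        }
        _ => {}
    }
}
```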
Co-authored-by: Jamey Sharp <jamey@minilop.net>
In the provided test case in #5716, the result of a call was then
added to 0. We have a rewrite rule that sets the remat-bit on any add
of a value and a constant, because these frequently appear (e.g. from
address offset calculations) and this can frequently reduce register
pressure (one long-lived base vs. many long-lived base+offset values).
Separately, we have an algebraic rule that `x+0` rewrites to `x`.
The result of this was that we had an eclass with the remat bit set on
the add, but the add was also union'd into the call. We pick the
latter during extraction, because it's cheaper not to do the add at
all; but we still get the remat bit, and try to remat a call (!),
which blows up later.
This PR fixes the logic to look up the "best value" for a value (i.e.,
whatever extraction determined), and look up the remat bit on *that*
node, not the canonical node.
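A hypothetical sketch of the corrected lookup (types and field names invented for illustration; the real code lives in the egraph extraction/elaboration machinery):

```rust
use std::collections::HashSet;

struct EgraphCtx {
    /// Per-value: the node extraction chose as cheapest for its eclass.
    best_value: Vec<u32>,
    /// Values whose defining node was flagged for rematerialization.
    remat_values: HashSet<u32>,
}

fn should_remat(ctx: &EgraphCtx, value: u32) -> bool {
    // Consult the remat bit on extraction's pick, not on the canonical
    // (lowest-numbered) member of the eclass, so a remat-marked iadd
    // union'd with a call can't cause us to remat the call.
    let best = ctx.best_value[value as usize];
    ctx.remat_values.contains(&best)
}
```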
(Why did the canonical node become the iadd and not the call? Because
the former had a lower value-number, as an accident of IR
construction; we don't impose any requirements on the input CLIF's
value-number ordering, and I don't think this breaks any of the
important acyclic properties, even though there is technically a
dependence from a lower-numbered to a higher-numbered node. In essence
one can think of them as having "virtual numbers" in any true
topologically-sorted order, and the only place the actual integer
indices matter should be in choosing the "canonical ID", which is just
used for dedup'ing, modulo this bug.)
Fixes #5716.
Instead of identifying unused branch tables by looking for unused blocks inside them, track used branch tables while traversing reachable blocks. This introduces an extra allocation of an `EntitySet` to track the used jump tables, but as those are few and this function runs once per `ir::Function`, the allocation seems reasonable.
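A sketch of the traversal (assuming the current `InstructionData` shape for `br_table`):

```rust
use cranelift_codegen::ir::{Block, Function, InstructionData, JumpTable};
use cranelift_entity::EntitySet;

/// Collect the jump tables referenced by instructions in reachable blocks.
fn used_jump_tables(
    func: &Function,
    reachable: impl Iterator<Item = Block>,
) -> EntitySet<JumpTable> {
    let mut used = EntitySet::new();
    for block in reachable {
        for inst in func.layout.block_insts(block) {
            if let InstructionData::BranchTable { table, .. } = &func.dfg.insts[inst] {
                used.insert(*table);
            }
        }
    }
    used
}
```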
I audited the egraph "algebraic" optimization rules for any which
construct an `iconst` on the right-hand side of the rule. In these cases
we need to constrain the type passed to `iconst` to be both `fits_in_64`
and `ty_int`, because `iconst` is not defined on other types.
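For illustration, a rule constructing a constant on its right-hand side takes roughly this shape (a sketch using the helpers named above; the actual rules differ in detail):

```
;; Only build an `iconst` when the result type is an integer no wider
;; than 64 bits; `iconst` is undefined for other types.
(rule (simplify (isub (fits_in_64 (ty_int ty)) x x))
      (subsume (iconst ty (imm64 0))))
```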
* Remove trailing whitespace in `lower.isle` files
* Legalize the `band_not` instruction into a simpler form
This commit legalizes the `band_not` instruction into `band`-of-`bnot`, i.e. two instructions. This is intended to assist with egraph-based optimizations, where the `band_not` instruction no longer has to be specifically included in other bit-operation patterns.
Lowerings of the `band_not` instruction have been moved to a
specialization of the `band` instruction.
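A sketch of the rewrite using the cursor-based legalizer idiom (simplified; the real pass dispatches on opcode):

```rust
use cranelift_codegen::cursor::{Cursor, FuncCursor};
use cranelift_codegen::ir::{Function, Inst, InstBuilder, Value};

/// Replace `band_not x, y` with `band x, (bnot y)` in place.
fn legalize_band_not(func: &mut Function, inst: Inst, x: Value, y: Value) {
    let mut pos = FuncCursor::new(func).at_inst(inst);
    pos.use_srcloc(inst);
    let not_y = pos.ins().bnot(y);
    pos.func.dfg.replace(inst).band(x, not_y);
}
```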
* Legalize `bor_not` into components
Same as prior commit, but for the `bor_not` instruction.
* Legalize `bxor_not` into `bxor`-of-`bnot`
Same as the prior commits. I think this also ended up fixing a bug in the s390x backend where `bxor_not x y` was actually translated as `bnot (bxor x y)` by accident, judging by the test updates.
* Simplify not-fused operands for riscv64
Some delegated-to rules had special cases of the form "if this feature is enabled, use the fused instruction", so move the feature test up into the lowering rules to help trigger other rules when the feature isn't enabled. This should make the riscv64 backend more consistent with how other backends are implemented.
* Remove `B{and,or,xor}Not` from the egraph cost metrics
These opcodes should never reach the egraph now that they're legalized away.
* Add an egraph optimization for `x^-1 => ~x`
This adds a simplification rule to translate xor-against-minus-1 into a `bnot` instruction. This helps trigger various other optimizations in the egraph implementation, as well as various backend lowering rules. It's chiefly useful because Wasm has no `bnot` equivalent, so `bnot` is encoded as `x^-1`.
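A hedged sketch of such a rule (written here with the `i64_sextend_imm64` helper that appears later in these notes; the committed rule may differ):

```
;; Rewrite `x ^ -1` into `bnot x` when the constant sign-extends to -1
;; at the value's type.
(rule (simplify (bxor ty x (iconst ty k)))
      (if-let -1 (i64_sextend_imm64 ty k))
      (bnot ty x))
```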
* Add a wasm test for end-to-end bitwise lowerings
Test end-to-end that various optimizations are applied to input Wasm modules.
* Specifically don't self-update rustup on CI
I forget why this was here originally, but this is failing on Windows
CI. In general there's no need to update rustup, so leave it as-is.
* Cleanup some aarch64 lowering rules
Previously a 32/64-bit split was necessary because the `ALUOp` differed, but that has since been refactored away, so there's no longer any need for duplicate rules.
* Narrow a x64 lowering rule
This rule previously made more sense when it was `band_not` and rarely used, but now its type filter is more specific: it only applies to SIMD types with lanes.
* Simplify xor-against-minus-1 rule
There's no need for the commutative version, since constants are already canonicalized to the right-hand side for egraphs.
* Optimize band-of-bnot when bnot is on the left
Use some more rules in the egraph algebraic optimizations to
canonicalize band/bor/bxor with a `bnot` operand to put the operand on
the right. That way the lowerings in the backends only have to list the
rule once, with the operand on the right, to optimize both styles of
input.
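A sketch of one such canonicalization (illustrative; the same shape applies to `bor` and `bxor`):

```
;; If the `bnot` shows up on the left, swap it to the right so backend
;; lowerings only need to match one operand order.
(rule (simplify (band ty (bnot ty x) y))
      (band ty y (bnot ty x)))
```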
* Add commutative lowering rules
* Update cranelift/codegen/src/isa/x64/lower.isle
Co-authored-by: Jamey Sharp <jamey@minilop.net>
---------
Co-authored-by: Jamey Sharp <jamey@minilop.net>
Also move these optimization rules to cprop.isle; it's where all the
other similar rules are.
Like the other cprop rules, these can subsume any other rules. We can't
do better than reducing an expression to a constant.
The new `i64_sextend_imm64` and `u64_uextend_imm64` constructors are useful helpers to clean up other code. I applied them to `imm64_icmp` while I was here, as well as using the existing `ty_mask` helper to clean up `imm64_masked`.
* Cranelift: Introduce the `tail` calling convention
This is an unstable-ABI calling convention that we will eventually use to
support Wasm tail calls.
Co-Authored-By: Jamey Sharp <jsharp@fastly.com>
* Cranelift: Introduce the `return_call` and `return_call_indirect` instructions
These will be used to implement tail calls for Wasm and any other language
targeting CLIF. The `return_call_indirect` instruction differs from the Wasm
instruction of the same name by taking a native address callee rather than a
Wasm function index.
Co-Authored-By: Jamey Sharp <jsharp@fastly.com>
* Cranelift: Implement verification rules for `return_call[_indirect]`
They must (a sketch of these checks follows the list):
* have the same return types between the caller and callee,
* have the same calling convention between the caller and callee,
* and use a calling convention that supports tail calls.
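A hedged sketch of the checks (simplified to string errors; the real verifier reports structured errors, and the tail-call-support test here is an assumption):

```rust
use cranelift_codegen::ir::Signature;
use cranelift_codegen::isa::CallConv;

fn check_return_call(caller: &Signature, callee: &Signature) -> Result<(), String> {
    if caller.returns != callee.returns {
        return Err("caller and callee must have the same return types".into());
    }
    if caller.call_conv != callee.call_conv {
        return Err("caller and callee must use the same calling convention".into());
    }
    // Assumption for this sketch: only the new `tail` convention
    // supports tail calls.
    if caller.call_conv != CallConv::Tail {
        return Err("calling convention must support tail calls".into());
    }
    Ok(())
}
```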
Co-Authored-By: Jamey Sharp <jsharp@fastly.com>
* cargo fmt
---------
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
This rewrite was introduced in #5676 and then reverted in #5682 due to a footgun
where we accidentally weren't actually checking the `y == !z` precondition. This
commit fixes the precondition check. It also fixes the arithmetic to be
correctly masked to the value type's width.
This reverts commit 268f6bfc1d.
Add a `display` method to `BlockCall` that returns a value implementing `std::fmt::Display`. Rework the display code in the `write` module of cranelift-codegen to use this method instead.
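A self-contained sketch of the pattern (stand-in types; the real adapter borrows the value pools it needs in order to print block arguments):

```rust
use core::fmt;

/// Display adapter: formats a block target plus its arguments.
struct DisplayBlockCall<'a> {
    block: u32,      // stand-in for the target `Block`
    args: &'a [u32], // stand-in for the argument `Value`s
}

impl fmt::Display for DisplayBlockCall<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "block{}", self.block)?;
        if !self.args.is_empty() {
            write!(f, "(")?;
            for (i, arg) in self.args.iter().enumerate() {
                if i > 0 {
                    write!(f, ", ")?;
                }
                write!(f, "v{arg}")?;
            }
            write!(f, ")")?;
        }
        Ok(())
    }
}
```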
Souper requires an `i1` condition value; we don't, and instead implicitly check against 0. We were truncating conditions, but should actually be doing the comparison against `0`.
Remove the boolean parameters from the instruction builder functions, as they were only ever used with `true`. Additionally, change the `returns` and `branches` functions to imply `terminates_block`.
Most of these optimizations are in the egraph `cprop.isle` rules now,
making a separate crate unnecessary.
Also, I think the `udiv` optimizations here are straight-up wrong (doing signed instead of unsigned division, and panicking instead of preserving traps on division by zero), so I'm guessing this crate isn't seriously used anywhere.
At the least, bjorn3 confirms that cg_clif doesn't use this, and I've
verified that Wasmtime doesn't either.
Closes #1090.
Improve the generated code for unordered floating-point comparisons by negating the comparison and inverting the branches. This allows us to pick the unordered versions, which generate significantly better code.
Add a conditional branch instruction with two targets: `brif`. This instruction will eventually replace `brz` and `brnz`, as it encompasses the behavior of both.
This PR also changes the `InstructionData` layout for instruction formats that hold `BlockCall` values, taking the same approach we use for `Value` arguments. This allows `branch_destination` to return a slice of the `BlockCall` values held in the instruction, rather than requiring that we pattern-match on `InstructionData` to fetch the then/else blocks.
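A sketch of consuming the slice-based API (method shape taken from the description above; later refactors may have changed the exact signature):

```rust
use cranelift_codegen::ir::{DataFlowGraph, Inst};

/// Count a branch's successors without matching on the instruction
/// format: one `BlockCall` for `jump`, two for `brif`.
fn successor_count(dfg: &DataFlowGraph, inst: Inst) -> usize {
    dfg.insts[inst].branch_destination().len()
}
```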
Function generation for fuzzing has been updated to generate uses of brif, and I've run the cranelift-fuzzgen target locally for hours without triggering any new failures.
This PR follows up on #5382 and #5391, which rebuilt the egraph-based optimization framework to be more performant, by enabling it by default.
Based on performance results in #5382 (my measurements on SpiderMonkey and bjorn3's independent confirmation with cg_clif), it seems that this is reasonable to enable. Now that we have been fuzzing compiler configurations with egraph opts (#5388) for 6 weeks, having fixed a few fuzzbugs that came up (#5409, #5420, #5438) and subsequently received no further reports from OSS-Fuzz, I believe it is stable enough to rely on.
This PR enables `use_egraphs`, and also normalizes its meaning: previously it forced optimization (it basically meant "turn on the egraph optimization machinery"), now it runs egraph opts if the opt level indicates (it means "use egraphs to optimize if we are going to optimize"). The conditionals in the top-level pass driver are a little subtle, but will get simpler once we can remove the non-egraph path (which we plan to do eventually!).
Fixes #5181.
* Support mergeable-but-side-effectful (idempotent) operations in the egraph's GVN.
This mirrors the similar change made in #5534.
* Add tests for the egraph case.
Add a new type `BlockCall` that represents the pair of a block name with arguments to be passed to it. (The mnemonic here is that it looks a bit like a function call.) Rework the implementation of `jump`, `brz`, and `brnz` to use `BlockCall` instead of storing the block arguments as varargs in the instruction's `ValueList`.
To ensure that we process block arguments held in `BlockCall` values as well as the instruction's own arguments, three new functions have been introduced on `DataFlowGraph` that cover both sets of arguments (see the sketch after this list):
* `inst_values` — returns an iterator over the values used by the instruction, including block arguments
* `map_inst_values` — applies a function to each value used by the instruction, including block arguments
* `overwrite_inst_values` — overwrites all values used by the instruction, including block arguments, with values from an iterator
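For example, `map_inst_values` makes use-rewriting uniform across plain arguments and block arguments (closure-based signature assumed from the description above):

```rust
use cranelift_codegen::ir::{Function, Inst, Value};

/// Replace every use of `old` in `inst` with `new`, including uses
/// inside the instruction's `BlockCall` arguments.
fn replace_uses(func: &mut Function, inst: Inst, old: Value, new: Value) {
    func.dfg.map_inst_values(inst, |v| if v == old { new } else { v });
}
```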
Co-authored-by: Jamey Sharp <jamey@minilop.net>
This is a cleanup to help prepare for #5464.
Most of the diff is inlining the closure for `mark_all_uses_as_multiple`
which was only called once. That avoids some borrow-checker challenges.
The key change is that the former `push_args_on_stack` closure no longer actually pushes the iterator onto the stack, but just returns it. That way the closure doesn't need to name the stack's type, and it can also be reused in the `debug_assert!`.
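A simplified sketch of the shape of this refactor (types invented for illustration):

```rust
// The closure builds and returns the iterator; only the call sites touch
// the stack, so the closure never needs to name the stack's type.
fn walk(lists: &[Vec<u32>]) {
    let iter_args = |list: &Vec<u32>| list.clone().into_iter();

    let mut stack: Vec<std::vec::IntoIter<u32>> = Vec::new();
    for list in lists {
        // Reusable in assertions, since it has no side effects.
        debug_assert_eq!(iter_args(list).count(), list.len());
        stack.push(iter_args(list));
    }
    while let Some(mut frame) = stack.pop() {
        for _value in &mut frame {
            // ...process each value...
        }
    }
}
```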