* egraph-based midend: draw the rest of the owl.
* Rename `egg` submodule of cranelift-codegen to `egraph`.
* Apply some feedback from @jsharp during code walkthrough.
* Remove recursion from find_best_node by doing a single pass.
Rather than recursively computing the lowest-cost node for a given
eclass and memoizing the answer at each eclass node, we can do a single
forward pass; because every eclass node refers only to earlier nodes,
this is sufficient. The behavior may slightly differ from the earlier
behavior because we cannot short-circuit costs to zero once a node is
elaborated; but in practice this should not matter.
* Make elaboration non-recursive.
Use an explicit stack instead (with `ElabStackEntry` entries,
alongside a result stack).
* Make elaboration traversal of the domtree non-recursive/stack-safe.
* Work analysis logic in Cranelift-side egraph glue into a general analysis framework in cranelift-egraph.
* Apply static recursion limit to rule application.
* Fix aarch64 wrt dynamic-vector support -- broken rebase.
* Topo-sort cranelift-egraph before cranelift-codegen in publish script, like the comment instructs me to!
* Fix multi-result call testcase.
* Include `cranelift-egraph` in `PUBLISHED_CRATES`.
* Fix atomic_rmw: not really a load.
* Remove now-unnecessary PartialOrd/Ord derivations.
* Address some code-review comments.
* Review feedback.
* Review feedback.
* No overlap in mid-end rules, because we are defining a multi-constructor.
* rustfmt
* Review feedback.
* Review feedback.
* Review feedback.
* Review feedback.
* Remove redundant `mut`.
* Add comment noting what rules can do.
* Review feedback.
* Clarify comment wording.
* Update `has_memory_fence_semantics`.
* Apply @jameysharp's improved loop-level computation.
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Fix suggestion commit.
* Fix off-by-one in new loop-nest analysis.
* Review feedback.
* Review feedback.
* Review feedback.
* Use `Default`, not `std::default::Default`, as per @fitzgen
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
* Apply @fitzgen's comment elaboration to a doc-comment.
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
* Add stat for hitting the rewrite-depth limit.
* Some code motion in split prelude to make the diff a little clearer wrt `main`.
* Take @jameysharp's suggested `try_into()` usage for blockparam indices.
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Take @jameysharp's suggestion to avoid double-match on load op.
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Fix suggestion (add import).
* Review feedback.
* Fix stack_load handling.
* Remove redundant can_store case.
* Take @jameysharp's suggested improvement to FuncEGraph::build() logic
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Tweaks to FuncEGraph::build() on top of suggestion.
* Take @jameysharp's suggested clarified condition
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Clean up after suggestion (unused variable).
* Fix loop analysis.
* loop level asserts
* Revert constant-space loop analysis -- edge cases were incorrect, so let's go with the simple thing for now.
* Take @jameysharp's suggestion re: result_tys
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Fix up after suggestion
* Take @jameysharp's suggestion to use fold rather than reduce
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Fixup after suggestion
* Take @jameysharp's suggestion to remove elaborate_eclass_use's return value.
* Clarifying comment in terminator insts.
Co-authored-by: Jamey Sharp <jamey@minilop.net>
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
* Elide redundant sentinel values
The `undef_variables` lists were a binding from Variable to Value, but
the Values were always equal to a suffix of the block's parameters. So
instead of storing another copy, we can just get them back from the
block parameters.
According to DHAT, this decreases total memory allocated and number of
bytes written, and increases number of bytes read and instructions
retired, but all by small fractions of a percent. According to
hyperfine, main is "1.00 ± 0.01 times faster".
* Use entity_impl for cranelift_frontend::Variable
Instead of hand-coding essentially the same thing.
* Keep undefined variables in a ListPool
According to DHAT, this improves every measure of performance
(instructions retired, total memory allocated, max heap size, bytes
read, and bytes written), although by fractions of a percent. According
to hyperfine the difference is nearly zero, but on Spidermonkey this
branch is "1.01 ± 0.00 times faster" than main.
* Elide redundant block IDs
In a list of predecessors, we previously kept both the jump instruction
that points to the current block, and the block where that instruction
resides. But we can look up the block from the instruction as long as we
have access to the current Layout, which we do everywhere that it was
necessary. So don't store the block, just store the instruction.
* Keep predecessor definitions in a ListPool
* Make append_jump_argument independent of self
This makes it easier to reason about borrow-checking issues.
* Reuse `results` instead of re-doing variable lookup
This eliminates three array lookups per predecessor by hanging on to the
results of earlier steps a little longer. This only works now because I
previously removed the need to borrow all of `self`, which otherwise
prevented keeping a borrow of self.results alive.
I had experimented with using `Vec::split_off` to copy the relevant
chunk of results to a temporary heap allocation, but the extra
allocation and copy was measurably slower. So it's important that this
is just a borrow.
* Cache single-predecessor block ID when sealing
Of the code in cranelift_frontend, `use_var` is the second-hottest path,
sitting close behind the `build` function that's used when inserting
every new instruction. This makes sense given that the operands of a new
instruction usually need to be looked up immediately before building the
instruction.
So making the single-predecessor loops in `find_var` and `use_var_local`
do fewer memory accesses and execute fewer instructions turns out to
have a measurable effect. It's still only a small fraction of a percent
overall since cranelift-frontend is only a few percent of total runtime.
This patch keeps a block ID in the SSABlockData, which is None unless
both the block is sealed and it has exactly one predecessor. Doing so
avoids two array lookups on each iteration of the two loops.
According to DHAT, compared with main, at this point this PR uses 0.3%
less memory at max heap, reads 0.6% fewer bytes, and writes 0.2% fewer
bytes.
According to Hyperfine, this PR is "1.01 ± 0.01 times faster" than main
when compiling Spidermonkey. On the other hand, Sightglass says main is
1.01x faster than this PR on the same benchmark by CPU cycles. In short,
actual effects are too small to measure reliably.
This is the implementation of https://github.com/bytecodealliance/wasmtime/issues/4155, using the "inverted API" approach suggested by @cfallin (thanks!) in Cranelift, and trait object to provide a backend for an all-included experience in Wasmtime.
After the suggestion of Chris, `Function` has been split into mostly two parts:
- on the one hand, `FunctionStencil` contains all the fields required during compilation, and that act as a compilation cache key: if two function stencils are the same, then the result of their compilation (`CompiledCodeBase<Stencil>`) will be the same. This makes caching trivial, as the only thing to cache is the `FunctionStencil`.
- on the other hand, `FunctionParameters` contain the... function parameters that are required to finalize the result of compilation into a `CompiledCode` (aka `CompiledCodeBase<Final>`) with proper final relocations etc., by applying fixups and so on.
Most changes are here to accomodate those requirements, in particular that `FunctionStencil` should be `Hash`able to be used as a key in the cache:
- most source locations are now relative to a base source location in the function, and as such they're encoded as `RelSourceLoc` in the `FunctionStencil`. This required changes so that there's no need to explicitly mark a `SourceLoc` as the base source location, it's automatically detected instead the first time a non-default `SourceLoc` is set.
- user-defined external names in the `FunctionStencil` (aka before this patch `ExternalName::User { namespace, index }`) are now references into an external table of `UserExternalNameRef -> UserExternalName`, present in the `FunctionParameters`, and must be explicitly declared using `Function::declare_imported_user_function`.
- some refactorings have been made for function names:
- `ExternalName` was used as the type for a `Function`'s name; while it thus allowed `ExternalName::Libcall` in this place, this would have been quite confusing to use it there. Instead, a new enum `UserFuncName` is introduced for this name, that's either a user-defined function name (the above `UserExternalName`) or a test case name.
- The future of `ExternalName` is likely to become a full reference into the `FunctionParameters`'s mapping, instead of being "either a handle for user-defined external names, or the thing itself for other variants". I'm running out of time to do this, and this is not trivial as it implies touching ISLE which I'm less familiar with.
The cache computes a sha256 hash of the `FunctionStencil`, and uses this as the cache key. No equality check (using `PartialEq`) is performed in addition to the hash being the same, as we hope that this is sufficient data to avoid collisions.
A basic fuzz target has been introduced that tries to do the bare minimum:
- check that a function successfully compiled and cached will be also successfully reloaded from the cache, and returns the exact same function.
- check that a trivial modification in the external mapping of `UserExternalNameRef -> UserExternalName` hits the cache, and that other modifications don't hit the cache.
- This last check is less efficient and less likely to happen, so probably should be rethought a bit.
Thanks to both @alexcrichton and @cfallin for your very useful feedback on Zulip.
Some numbers show that for a large wasm module we're using internally, this is a 20% compile-time speedup, because so many `FunctionStencil`s are the same, even within a single module. For a group of modules that have a lot of code in common, we get hit rates up to 70% when they're used together. When a single function changes in a wasm module, every other function is reloaded; that's still slower than I expect (between 10% and 50% of the overall compile time), so there's likely room for improvement.
Fixes#4155.
On the build side, this commit introduces two things:
1. The automatic generation of various ISLE definitions for working with
CLIF. Specifically, it generates extern type definitions for clif opcodes and
the clif instruction data `enum`, as well as extractors for matching each clif
instructions. This happens inside the `cranelift-codegen-meta` crate.
2. The compilation of ISLE DSL sources to Rust code, that can be included in the
main `cranelift-codegen` compilation.
Next, this commit introduces the integration glue code required to get
ISLE-generated Rust code hooked up in clif-to-x64 lowering. When lowering a clif
instruction, we first try to use the ISLE code path. If it succeeds, then we are
done lowering this instruction. If it fails, then we proceed along the existing
hand-written code path for lowering.
Finally, this commit ports many lowering rules over from hand-written,
open-coded Rust to ISLE.
In the process of supporting ISLE, this commit also makes the x64 `Inst` capable
of expressing SSA by supporting 3-operand forms for all of the existing
instructions that only have a 2-operand form encoding:
dst = src1 op src2
Rather than only the typical x86-64 2-operand form:
dst = dst op src
This allows `MachInst` to be in SSA form, since `dst` and `src1` are
disentangled.
("3-operand" and "2-operand" are a little bit of a misnomer since not all
operations are binary operations, but we do the same thing for, e.g., unary
operations by disentangling the sole operand from the result.)
There are two motivations for this change:
1. To allow ISLE lowering code to have value-equivalence semantics. We want ISLE
lowering to translate a CLIF expression that evaluates to some value into a
`MachInst` expression that evaluates to the same value. We want both the
lowering itself and the resulting `MachInst` to be pure and referentially
transparent. This is both a nice paradigm for compiler writers that are
authoring and maintaining lowering rules and is a prerequisite to any sort of
formal verification of our lowering rules in the future.
2. Better align `MachInst` with `regalloc2`'s API, which requires that the input
be in SSA form.