This PR adds a basic *alias analysis*, and optimizations that use it.
This is a "mid-end optimization": it operates on CLIF, the
machine-independent IR, before lowering occurs.
The alias analysis (or maybe more properly, a sort of memory-value
analysis) determines when it can prove that the value stored at a
particular memory location is equal to a given SSA value; when it can, it
replaces loads of that location with that value.
This subsumes two common optimizations:
* Redundant load elimination: when the same memory address is loaded two
times, and it can be proven that no intervening operations will write
to that memory, then the second load is *redundant* and its result
must be the same as the first. We can use the first load's result and
remove the second load.
* Store-to-load forwarding: when a load can be proven to access exactly
the memory written by a preceding store, we can replace the load's
result with the store's data operand, and remove the load (see the
example below).
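For example, in the illustrative CLIF below (function and value names are
made up), `v3` can reuse `v2` by redundant load elimination, and `v4` can
reuse `v1` by store-to-load forwarding:

```
function %f(i64, i32) -> i32, i32 {
block0(v0: i64, v1: i32):
    v2 = load.i32 v0    ; first load of [v0]
    v3 = load.i32 v0    ; redundant: no store in between, so uses of v3
                        ; become uses of v2 and this load is removed
    store v1, v0        ; the "last store" for everything below
    v4 = load.i32 v0    ; reads exactly what the store wrote, so uses of
                        ; v4 become uses of v1 and this load is removed
    return v3, v4
}
```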
Both of these optimizations rely on a "last store" analysis that is a
sort of coloring mechanism, split across disjoint categories of abstract
state. The basic idea is that every memory-accessing operation is put
into one of N disjoint categories; it is disallowed for memory to ever
be accessed by an op in one category and later accessed by an op in
another category. (The frontend must ensure this.)
Then, given this, we scan the code and determine, for each
memory-accessing op, the single prior instruction (if any) that is the
last store to the same category. This "colors" the instruction: the last
store is, in a sense, a static name for that version of memory.
This analysis provides an important invariant: if two operations access
memory with the same last store, then *no other store can alias* in the
time between that last store and these operations. This must-not-alias
property, together with checks that the accessed address is *exactly the
same* (same SSA value and offset) and that the other attributes of the
access (type, extension mode) match, lets us prove that the results are
the same.
Given last-store info, we scan the instructions and build a table from
"memory location" key (last store, address, offset, type, extension) to
known SSA value stored in that location. A store inserts a new mapping.
A load may also insert a new mapping, if we didn't already have one.
Then when a load occurs and an entry already exists for its "location",
we can reuse the value. This yields either redundant load elimination or
store-to-load forwarding, depending on where the value came from.
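Concretely, the key and table can be pictured as the following Rust
sketch (type and field names here are hypothetical, not the PR's actual
definitions):

```rust
use cranelift_codegen::ir::{Inst, Type, Value};
use std::collections::HashMap;

/// Extension mode applied by a load (illustrative).
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Extension {
    None,
    Sign,
    Zero,
}

/// Abstract "memory location": two accesses that agree on every field see
/// the same version of memory at the same address.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct MemoryLoc {
    /// The "color": the unique last store to this category, if known
    /// (`None` for the entry state, i.e. no store seen yet).
    last_store: Option<Inst>,
    /// Base address; must be the *same* SSA value, not merely an equal one.
    address: Value,
    offset: i32,
    /// Type of the accessed value.
    ty: Type,
    extension: Extension,
}

/// Map from location to the SSA value known to reside there. A store
/// inserts `(loc, data_operand)`; a load first looks up `loc` (a hit means
/// the load's result can be replaced) and otherwise inserts
/// `(loc, load_result)`.
type KnownMemValues = HashMap<MemoryLoc, Value>;
```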
Note that this *does* work across basic blocks: the last-store analysis
is a full iterative dataflow pass, and we are careful to check that a
previously-defined value dominates a potentially redundant load before we
reuse it. So we will do the right thing if we only have a
"partially redundant" load (loaded already but only in one predecessor
block), but we will also correctly reuse a value if there is a store or
load above a loop and a redundant load of that value within the loop, as
long as no potentially-aliasing stores happen within the loop.
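A minimal illustration of the loop case (hypothetical CLIF): the load
inside the loop can reuse `v1`, because the last-store state is unchanged
and `v1`'s definition dominates the loop body.

```
function %g(i64) -> i32 {
block0(v0: i64):
    v1 = load.i32 v0    ; load above the loop
    jump block1

block1:
    v2 = load.i32 v0    ; same last store, same address, and v1 dominates
                        ; this point: uses of v2 become uses of v1
    brnz v2, block1     ; the loop body contains no aliasing store
    jump block2

block2:
    return v1
}
```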
Peepmatic was an early attempt at a DSL for peephole optimizations, with the
idea that maybe sometime in the future we could use it for instruction
selection as well. It didn't really pan out, however:
* Peepmatic wasn't quite flexible enough, and adding new operators or snippets
of code implemented externally in Rust was a bit of a pain.
* The performance was never competitive with the hand-written peephole
optimizers. It was *very* size efficient, but that came at the cost of
run-time efficiency. Everything was table-based and interpreted, rather than
generating any Rust code.
Ultimately, for these reasons, we never turned Peepmatic on by default.
We recently landed the ISLE domain-specific language, and it is better
suited than Peepmatic for all the things that Peepmatic was originally designed
to do. It is more flexible and easier to integrate with external Rust code. It
has better run-time efficiency, meeting or even beating hand-written code. I
think a small part of the reason why ISLE excels at these things is that its
design was informed by Peepmatic's failures. I still plan to continue
Peepmatic's mission of generating Cranelift's peephole optimizer passes from
DSL rewrite rules, but using ISLE instead of Peepmatic.
Thank you Peepmatic, rest in peace!
This also paves the way for unifying TargetIsa and MachBackend, since they now map one to one. In theory the two traits could be merged, which would be nice to limit the total number of concepts; on the other hand, they have quite different responsibilities, so it might be fine to keep them separate.
Interestingly, this PR started as removing RegInfo from the TargetIsa trait, since the adapter returned a dummy value there. From the fallout, I noticed that the Display implementations didn't need an ISA anymore (since it was only used to render ISA-specific registers). Also, the whole family of RegInfo / ValueLoc / RegUnit was used exclusively by the old backend, so these could be removed. Notably, some IR instructions had to be removed because they used RegUnit too: these were the oddballs regfill / regmove / regspill / copy_special, IR instructions inserted by the old regalloc. Fare thee well!
* cranelift: Add heap support to filetest infrastructure
* cranelift: Explicit heap pointer placement in filetest annotations
* cranelift: Add documentation about the Heap directive
* cranelift: Clarify that heap filetests pointers must be laid out sequentially
* cranelift: Use wrapping add when computing bound pointer
* cranelift: Better error messages when invalid signatures are found for heap file tests.
* cranelift: Initial fuzzer implementation
* cranelift: Generate multiple test cases in fuzzer
* cranelift: Separate function generator in fuzzer
* cranelift: Insert random instructions in fuzzer
* cranelift: Rename gen_testcase
* cranelift: Implement div for unsigned values in interpreter
* cranelift: Run all test cases in fuzzer
* cranelift: Comment options in function_runner
* cranelift: Improve fuzzgen README.md
* cranelift: Fuzzgen remove unused variable
* cranelift: Fuzzer code style fixes
Thanks! @bjorn3
* cranelift: Fix nits in CLIF fuzzer
Thanks @cfallin!
* cranelift: Implement Arbitrary for TestCase
* cranelift: Remove gen_testcase
* cranelift: Move fuzzers to wasmtime fuzz directory
* cranelift: CLIF-Fuzzer ignore tests that produce traps
* cranelift: CLIF-Fuzzer create new fuzz target to validate generated testcases
* cranelift: Store clif-fuzzer config in a separate struct
* cranelift: Generate variables upfront per function
* cranelift: Prevent publishing of fuzzgen crate
I hadn't realized before that the filetest backend for `test vcode` is
doing essentially what `compile` is doing, but for new (`MachInst`)
backends: it is just getting a disassembly and running it through
filecheck. There's no reason not to reuse `test compile` for the AArch64
tests as well.
This was motivated by the desire to have "this IR compiles successfully"
tests work on both x86 and AArch64. It seems this should work fine by
adding multiple `target` directives when a test case should be
compile-tested on multiple architectures.
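For instance, a single file can now be compile-tested on both
architectures with stacked `target` directives; a sketch (flags and check
lines elided):

```
test compile
target x86_64
target aarch64

function %f(i32) -> i32 {
block0(v0: i32):
    v1 = iadd_imm v0, 1
    return v1
}
```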
Rather than outright replacing parts of our existing peephole optimization
passes, this makes peepmatic an optional cargo feature that can be enabled. This
allows us to take a conservative approach to enabling peepmatic everywhere,
while also getting it in-tree, making it easier to collaborate on improving it
quickly.
This resolves the work started in https://github.com/bytecodealliance/cranelift/pull/1231 and https://github.com/bytecodealliance/wasmtime/pull/1436. Cranelift filetests currently have the ability to run CLIF functions with a signature like `() -> b*` and check that the result is true under the `test run` directive. This PR adds the ability to call functions with arbitrary arguments and non-boolean returns and either print the result or check against a list of expected results:
- `run` commands look like `; run: %add(2, 2) == 4` or `; run: %add(2, 2) != 5` and verify that the executed CLIF function returns the expected value
- `print` commands look like `; print: %add(2, 2)` and print the result of the function to stdout (see the example below)
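Putting these together, an illustrative test file might look like this
(exact directive placement may differ from the final implementation):

```
test run
target x86_64

function %add(i32, i32) -> i32 {
block0(v0: i32, v1: i32):
    v2 = iadd v0, v1
    return v2
}
; run: %add(2, 2) == 4
; run: %add(2, 2) != 5
; print: %add(2, 2)
```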
To make this work, this PR compiles a single Cranelift `Function` into a `CompiledFunction` using a `SingleFunctionCompiler`. Because we will not know the signature of the function until runtime, we use a `Trampoline` to place the values in the appropriate location for the calling convention; this should look a lot like what @alexcrichton is doing with `VMTrampoline` in wasmtime (see 3b7cb6ee64/crates/api/src/func.rs (L510-L526), 3b7cb6ee64/crates/jit/src/compiler.rs (L260)). To avoid re-compiling `Trampoline`s for the same function signatures, `Trampoline`s are cached in the `SingleFunctionCompiler`.
This commit makes the following changes to unwind information generation in
Cranelift:
* Remove frame layout change implementation in favor of processing the prologue
and epilogue instructions when unwind information is requested. This also
means this work is no longer performed for Windows, which didn't utilize it.
It also helps simplify the prologue and epilogue generation code.
* Remove the unwind sink implementation that required each unwind information
to be represented in final form. For FDEs, this meant writing a
complete frame table per function, which wastes 20 bytes or so per
function on duplicate CIEs. This also enables Cranelift users to collect the
unwind information and write it as a single frame table.
* For System V calling convention, the unwind information is no longer stored
in code memory (it's only a requirement for Windows ABI to do so). This allows
for more compact code memory for modules with a lot of functions.
* Deletes some duplicate code relating to frame table generation. Users can
now simply use gimli to create a frame table from each function's unwind
information, as sketched below.
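A minimal sketch of what that usage could look like, assuming the
`create_systemv_cie` and `to_fde` helpers from this change (exact names
and signatures may differ):

```rust
use cranelift_codegen::isa::{unwind::systemv::UnwindInfo, TargetIsa};
use gimli::write::{Address, EhFrame, EndianVec, FrameTable};
use gimli::LittleEndian;

/// Build one `.eh_frame` section covering many functions, sharing one CIE
/// instead of duplicating it per function.
fn build_eh_frame(
    isa: &dyn TargetIsa,
    funcs: Vec<(Address, UnwindInfo)>,
) -> Result<Vec<u8>, gimli::write::Error> {
    let mut table = FrameTable::default();
    let cie_id = table.add_cie(isa.create_systemv_cie().expect("no unwind support"));
    for (addr, unwind) in funcs {
        // One FDE per function, all pointing at the shared CIE.
        table.add_fde(cie_id, unwind.to_fde(addr));
    }
    let mut eh_frame = EhFrame(EndianVec::new(LittleEndian));
    table.write_eh_frame(&mut eh_frame)?;
    Ok(eh_frame.0.into_vec())
}
```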
Fixes #1181.
This patch adds support for filetests with the `vcode` type. This allows
test cases to feed CLIF into the new backend, produce VCode output with
machine instructions, and then perform matching against the
pretty-printed text representation of the VCode.
Tests for the new ARM64 backend using this infrastructure will come in a
followup patch.
* Implement emitting Windows unwind information for fastcall functions.
This commit implements emitting Windows unwind information for x64 fastcall
calling convention functions.
The unwind information can be used to construct a Windows function table at
runtime for JIT'd code, enabling stack walking and unwinding by the operating
system.
* Address code review feedback.
This commit addresses code review feedback:
* Remove unnecessary unsafe code.
* Emit the unwind information always as little endian.
* Fix comments.
A dependency from cranelift-codegen to the byteorder crate was added.
The byteorder crate is a no-dependencies crate with a reasonable
abstraction for writing binary data for a specific endianness.
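As a sketch of the pattern (field names and layout here are illustrative,
not the actual Windows UNWIND_INFO format):

```rust
use byteorder::{LittleEndian, WriteBytesExt};
use std::io;

/// Serialize one record explicitly as little-endian, regardless of the
/// host's byte order.
fn write_unwind_code(out: &mut Vec<u8>, prologue_offset: u8, op: u8) -> io::Result<()> {
    out.write_u8(prologue_offset)?;    // offset of this code in the prologue
    out.write_u8(op)?;                 // operation + info, packed in one byte
    out.write_u16::<LittleEndian>(0)?; // e.g. a 16-bit operand slot
    Ok(())
}
```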
* Address code review feedback.
* Disable default features for the `byteorder` crate.
* Add a comment regarding the Windows ABI unwind code numerical values.
* Panic if we encounter a Windows function with a prologue greater than 256
bytes in size.
* Add ability to run CLIF IR using `clif-util run [-v] {file}` and add `test run` to cranelift-filetests to allow executing CLIF
This refactors the compile/execute parts into a FunctionRunner that is shared between cranelift-filetests and clif-util. CLIF can now be run using `clif-util run` as well as during `clif-util test` for files with a `test run` header. As before, only functions suffixed with a `run` comment are executed. The `run: fn(...) == ...` expression syntax is left for a subsequent change.
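An illustrative file for this directive (shown with current block syntax;
details of the era's CLIF may differ):

```
test run

function %always_true() -> b1 {
block0:
    v0 = bconst.b1 true
    return v0
}
; run
```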
* Add resumable_trap, safepoint, isnull, and null instructions
* Add Stackmap struct and StackmapSink trait
Co-authored-by: Mir Ahmed <mirahmed753@gmail.com>
Co-authored-by: Dan Gohman <sunfish@mozilla.com>