This patch includes:
- A complete rework of the way that CLIF blocks and edge blocks are
lowered into VCode blocks. The new mechanism in `BlockLoweringOrder`
computes RPO over the CFG, but with a twist: it merges edge blocks into
heads or tails of original CLIF blocks wherever possible, and it does
this without ever actually materializing the full nodes-plus-edges
graph first. The backend driver lowers blocks in final order so
there's no need to reshuffle later.
- A new `MachBuffer` that replaces the `MachSection`. This is a special
version of a code-sink that is far more than a humble `Vec<u8>`. In
particular, it keeps a record of label definitions and label uses,
with a machine-pluggable `LabelUse` trait that defines various types
of fixups (basically internal relocations).
Importantly, it implements some simple peephole-style branch rewrites
*inline in the emission pass*, without any separate traversals over
the code to use fallthroughs, swap taken/not-taken arms, etc. It
tracks branches at the tail of the buffer and can (i) remove blocks
that are just unconditional branches (by redirecting the label), (ii)
understand a conditional/unconditional pair and swap the conditional
polarity when it's helpful, and (iii) remove branches that branch to
the fallthrough PC.
The `MachBuffer` also implements branch-island support. On
architectures like AArch64, this is needed to allow conditional
branches within plausibly-attainable ranges (+/- 1MB on AArch64
specifically). It also does this inline while streaming through the
emission, without any sort of fixpoint algorithm or later moving of
code, by simply tracking outstanding references and "deadlines" and
emitting an island just-in-time when we're in danger of going out of
range.
- A rework of the instruction selector driver. This is largely following
the same algorithm as before, but is cleaned up significantly, in
particular in the API: the machine backend can ask for an input arg
and get any of three forms (constant, register, producing
instruction), indicating it needs the register or can merge the
constant or producing instruction as appropriate. This new driver
takes special care to emit constants right at use-sites (and at phi
inputs), minimizing their live-ranges, and also special-cases the
"pinned register" to avoid superfluous moves.
Overall, on `bz2.wasm`, the results are:
wasmtime full run (compile + runtime) of bz2:
baseline: 9774M insns, 9742M cycles, 3.918s
w/ changes: 7012M insns, 6888M cycles, 2.958s (24.5% faster, 28.3% fewer insns)
clif-util wasm compile bz2:
baseline: 2633M insns, 3278M cycles, 1.034s
w/ changes: 2366M insns, 2920M cycles, 0.923s (10.7% faster, 10.1% fewer insns)
All numbers are averages of two runs on an Ampere eMAG.
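The just-in-time island emission described for the `MachBuffer` above can be
sketched roughly as follows. This is a minimal illustration, not the actual
`MachBuffer` code: the `IslandTracker` type and its methods are hypothetical
names, and the sketch ignores the size of the island itself.

```rust
/// Hypothetical sketch of deadline tracking for branch islands.
/// Not the real `MachBuffer` API.
struct IslandTracker {
    /// Offsets of branches whose target labels are not yet resolved.
    pending: Vec<u32>,
    /// Earliest offset at which some pending branch would go out of range.
    deadline: u32,
}

impl IslandTracker {
    fn new() -> Self {
        IslandTracker { pending: Vec::new(), deadline: u32::MAX }
    }

    /// Record a forward reference emitted at `use_offset` that can reach
    /// at most `max_range` bytes ahead.
    fn add_use(&mut self, use_offset: u32, max_range: u32) {
        self.pending.push(use_offset);
        self.deadline = self.deadline.min(use_offset.saturating_add(max_range));
    }

    /// Would emitting `next_size` more bytes at `cur_offset` pass the deadline?
    fn island_needed(&self, cur_offset: u32, next_size: u32) -> bool {
        cur_offset.saturating_add(next_size) > self.deadline
    }

    /// Pretend to emit an island: every pending branch gets a veneer, so
    /// nothing is outstanding afterwards. Returns the branch offsets handled.
    fn emit_island(&mut self) -> Vec<u32> {
        self.deadline = u32::MAX;
        std::mem::take(&mut self.pending)
    }
}
```

An emitter built this way would call `island_needed` before each instruction
and `emit_island` when it returns true, so no fixpoint pass over the code is
required.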
* CI: only test `peepmatic` in one job
This avoids building Z3 in most jobs, which saves CI time.
* Fix curl syntax on Windows
Co-authored-by: Alex Crichton <alex@alexcrichton.com>
This updates our GitHub Actions configuration to use a recently
released feature that allows configuring the default shell for the entire
workflow. Here we set that to `bash` since we frequently do that anyway
and it helps keep syntax consistent throughout the configuration file.
Right now we're just getting a lot of noisy "can't parse manifest" error
messages, and with `cargo audit` running on CI we should be alerted to
security vulnerabilities anyway.
Rather than outright replacing parts of our existing peephole optimization
passes, this makes peepmatic an optional cargo feature that can be enabled. This
allows us to take a conservative approach with enabling peepmatic everywhere,
while also allowing us to get it in-tree and make it easier to collaborate on
improving it quickly.
After replacing an instruction with an alias to an earlier value, trying to
further optimize that value is unnecessary, since we've already processed it,
and was also triggering an assertion.
This ports all of the identity, no-op, simplification, and canonicalization
related optimizations over from being hand-coded to the `peepmatic` DSL. This
does not handle the branch-to-branch optimizations or most of the
divide-by-constant optimizations.
This crate contains oracles, generators, and fuzz targets for use with fuzzing
engines (e.g. libFuzzer). This doesn't contain the actual
`libfuzzer_sys::fuzz_target!` definitions (those are in the `peepmatic-fuzz`
crate) but those definitions are one-liners calling out to functions
defined in this crate.
This crate provides testing utilities for `peepmatic`, and a test-only
instruction set we can use to check that various optimizations do or don't
apply.
Peepmatic is a DSL for peephole optimizations and a compiler for generating
peephole optimizers from them. The user writes a set of optimizations in the
DSL, and then `peepmatic` compiles the set of optimizations into an efficient
peephole optimizer:
```
DSL ----peepmatic----> Peephole Optimizer
```
The generated peephole optimizer has all of its optimizations' left-hand sides
collapsed into a compact automaton that makes matching candidate instruction
sequences fast.
The DSL's optimizations may be written by hand or discovered mechanically with a
superoptimizer like [Souper][]. Eventually, `peepmatic` should have a verifier
that ensures that the DSL's optimizations are sound, similar to what [Alive][]
does for LLVM optimizations.
[Souper]: https://github.com/google/souper
[Alive]: https://github.com/AliveToolkit/alive2
The `peepmatic-runtime` crate contains everything required to use a
`peepmatic`-generated peephole optimizer.
In short: build times and code size.
If you are just using a peephole optimizer, you shouldn't need the functions
to construct it from scratch from the DSL (and the implied code size and
compilation time), let alone even build it at all. You should just
deserialize an already-built peephole optimizer, and then use it.
That's all that is contained here in this crate.
This crate provides the derive macros used by `peepmatic`, notably AST-related
derives that enumerate child AST nodes, and operator-related derives that
provide helpers for type checking.
The `peepmatic-automata` crate builds and queries finite-state transducer
automata.
A transducer is a type of automaton that has not only an input that it
accepts or rejects, but also an output. While a regular automaton checks whether
an input string is in the set that it accepts, a transducer maps
the input strings to values. A regular automaton is sort of a compressed,
immutable set, and a transducer is sort of a compressed, immutable key-value
dictionary. A [trie] compresses a set of strings or map from a string to a
value by sharing prefixes of the input string. Automata and transducers can
compress even better: they can share both prefixes and suffixes. [*Index
1,600,000,000 Keys with Automata and Rust* by Andrew Gallant (aka
burntsushi)][burntsushi-blog-post] is a top-notch introduction.
If you're looking for a general-purpose transducers crate in Rust you're
probably looking for [the `fst` crate][fst-crate]. While this implementation
is fully generic and has no dependencies, its feature set is specific to
`peepmatic`'s needs:
* We need to associate extra data with each state: the match operation to
evaluate next.
* We can't provide the full input string up front, so this crate must
support incremental lookups. This is because the peephole optimizer is
computing the input string incrementally and dynamically: it looks at the
current state's match operation, evaluates it, and then uses the result as
the next character of the input string.
* We also support incremental insertion and output when building the
transducer. This is necessary because we don't want to emit output values
that bind a match on an optimization's left-hand side's pattern (for
example) until after we've succeeded in matching it, which might not
happen until we've reached the n^th state.
* We need to support generic output values. The `fst` crate only supports
`u64` outputs, while we need to build up an optimization's right-hand side
instructions.
This implementation is based on [*Direct Construction of Minimal Acyclic
Subsequential Transducers* by Mihov and Maurel][paper]. That means that keys
must be inserted in lexicographic order during construction.
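The incremental-lookup and ordered-insertion requirements above can be
illustrated with a toy model. This is only a sketch under simplifying
assumptions: `ToyTransducer` and `Query` are hypothetical names rather than
the `peepmatic-automata` API, and the structure is a plain trie that shares
prefixes only, not the minimal transducer from the paper.

```rust
use std::collections::BTreeMap;

/// Toy transducer: a trie whose edges carry generic output values.
struct ToyTransducer<O> {
    /// states[i] maps an input byte to (next state, outputs on that edge).
    states: Vec<BTreeMap<u8, (usize, Vec<O>)>>,
    last_key: Option<Vec<u8>>,
}

impl<O: Clone> ToyTransducer<O> {
    fn new() -> Self {
        ToyTransducer { states: vec![BTreeMap::new()], last_key: None }
    }

    /// Insert `key`, attaching `output` to its final edge. Mirrors the
    /// constraint that keys arrive in lexicographic order.
    fn insert(&mut self, key: &[u8], output: O) {
        assert!(!key.is_empty(), "empty keys are not handled in this sketch");
        if let Some(prev) = &self.last_key {
            assert!(prev.as_slice() < key, "keys must be inserted in order");
        }
        self.last_key = Some(key.to_vec());
        let mut state = 0;
        for (i, &b) in key.iter().enumerate() {
            if !self.states[state].contains_key(&b) {
                let new_state = self.states.len();
                self.states.push(BTreeMap::new());
                self.states[state].insert(b, (new_state, Vec::new()));
            }
            let edge = self.states[state].get_mut(&b).unwrap();
            if i + 1 == key.len() {
                edge.1.push(output.clone());
            }
            state = edge.0;
        }
    }

    /// Start an incremental lookup at the initial state.
    fn query(&self) -> Query<'_, O> {
        Query { fst: self, state: 0 }
    }
}

/// An in-progress lookup that is fed one input byte at a time.
struct Query<'a, O> {
    fst: &'a ToyTransducer<O>,
    state: usize,
}

impl<'a, O> Query<'a, O> {
    /// Take the transition for `input`, returning the outputs on that edge,
    /// or `None` (leaving the state unchanged) if there is no transition.
    fn step(&mut self, input: u8) -> Option<&'a [O]> {
        let fst: &'a ToyTransducer<O> = self.fst;
        let (next, out) = fst.states[self.state].get(&input)?;
        self.state = *next;
        Some(out.as_slice())
    }
}
```

A caller would compute each input byte only after inspecting the current
state, which is the property the peephole optimizer relies on.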
[trie]: https://en.wikipedia.org/wiki/Trie
[burntsushi-blog-post]: https://blog.burntsushi.net/transducers/#ordered-maps
[fst-crate]: https://crates.io/crates/fst
[paper]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3698&rep=rep1&type=pdf
This reduces the size of the Inst enum from 112 bytes to 48 bytes.
Measured with DHAT on a regex-rs.wasm benchmark (`valgrind --tool=dhat clif-util compile --target aarch64`),
the total number of allocated bytes drops by around 170 MB, and the bytes live at t-gmax drop by 3 MB.
Using `perf stat clif-util compile --target aarch64`, the instruction count dropped by 0.6%, cache misses dropped by 6%, and cycles dropped by 2.3%.
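The message above does not say how the size reduction was achieved. One common
way to shrink a Rust enum is to box the payload of its largest variants, which
the following sketch shows. `FatInst`, `SlimInst`, `CallInfo`, and
`inst_sizes` are hypothetical names for illustration and are not Cranelift's
`Inst`.

```rust
use std::mem::size_of;

/// Hypothetical enum with a large inline payload: every value is as big
/// as the largest variant.
#[allow(dead_code)]
enum FatInst {
    Nop,
    Call { dest: [u64; 8], uses: [u8; 32] },
}

/// The same payload moved behind a pointer.
#[allow(dead_code)]
struct CallInfo {
    dest: [u64; 8],
    uses: [u8; 32],
}

/// Boxing the rare, large variant keeps the common case small.
#[allow(dead_code)]
enum SlimInst {
    Nop,
    Call(Box<CallInfo>),
}

/// Returns (size of FatInst, size of SlimInst) in bytes.
fn inst_sizes() -> (usize, usize) {
    (size_of::<FatInst>(), size_of::<SlimInst>())
}
```

The trade-off is an extra heap allocation and pointer chase for the boxed
variant in exchange for a smaller footprint on every other instruction.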
This commit adds a suite of `wasmtime_funcref_table_*` APIs which mirror
the standard APIs but have a few differences:
* More errors are returned. For example, error messages are communicated
through `wasmtime_error_t` and out-of-bounds vs load of null can be
differentiated in the `get` API.
* APIs take `wasm_func_t` instead of `wasm_ref_t`. Given the recent
decision to remove subtyping from the anyref proposal it's not clear
how the C API for tables will be affected, so for now these APIs are
all specialized to only funcref tables.
* Growth now allows access to the previous size of the table, if
desired, which mirrors the `table.grow` instruction.
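The semantics in the list above can be modelled with a small toy table. This
is a sketch in Rust rather than the C API itself: `FuncrefTable`, `GetError`,
and the use of `u32` function indices are assumptions made for illustration,
and none of the `wasmtime_funcref_table_*` signatures are reproduced here.

```rust
/// Hypothetical error type separating the two failure modes of `get`.
#[derive(Debug, PartialEq)]
enum GetError {
    OutOfBounds,
    Null,
}

/// Toy funcref table; functions are stood in for by `u32` indices.
struct FuncrefTable {
    elems: Vec<Option<u32>>,
    max: Option<u32>,
}

impl FuncrefTable {
    fn new(initial: u32, max: Option<u32>) -> Self {
        FuncrefTable { elems: vec![None; initial as usize], max }
    }

    /// Distinguishes an out-of-bounds index from a null entry.
    fn get(&self, idx: u32) -> Result<u32, GetError> {
        match self.elems.get(idx as usize) {
            None => Err(GetError::OutOfBounds),
            Some(None) => Err(GetError::Null),
            Some(Some(f)) => Ok(*f),
        }
    }

    /// Grows by `delta` entries filled with `init`. Like `table.grow`,
    /// success yields the previous size; failure leaves the table as-is.
    fn grow(&mut self, delta: u32, init: Option<u32>) -> Option<u32> {
        let prev = self.elems.len() as u32;
        let new_len = prev.checked_add(delta)?;
        if let Some(max) = self.max {
            if new_len > max {
                return None;
            }
        }
        self.elems.resize(new_len as usize, init);
        Some(prev)
    }
}
```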
This was originally motivated by bytecodealliance/wasmtime-go#5 where
the current APIs we have for working with tables don't quite work. We
don't have a great way to take an anyref constructed from a `Func` and
get the `Func` back out, so for now this sidesteps those concerns while
we sort out the anyref story.
It's intended that once the anyref story has settled and the official C
API has updated we'll likely delete these wasmtime-specific APIs or
implement them as trivial wrappers around the official ones.
* Remove Cranelift's OutOfBounds trap, which is no longer used.
* Change proc_exit to unwind instead of exit the host process.
This implements the semantics in https://github.com/WebAssembly/WASI/pull/235.
Fixes #783.
Fixes #993.
* Fix exit-status tests on Windows.
* Revert the wiggle changes and re-introduce the wasi-common implementations.
* Move `wasi_proc_exit` into the wasmtime-wasi crate.
* Revert the spec_testsuite change.
* Remove the old proc_exit implementations.
* Make `TrapReason` an implementation detail.
* Allow exit status 2 on Windows too.
* Fix a documentation link.
* Really fix a documentation link.
The `wasmtime` crate currently lives in `crates/api` for historical
reasons, because we once called it the `wasmtime-api` crate. This creates a
stumbling block for new contributors.
As discussed on Zulip, rename the directory to `crates/wasmtime`.
Several links were broken by line-breaks between the link caption and
the link itself. This commit fixes them by moving each link onto its own line.
Co-authored-by: k.bagrov <k.bagrov@g.nsu.ru>
If the scratch register is caller-saved, then it might appear in fixed
ranges because of call clobbers. Instead, use a register that's not
caller-saved and has no predefined use in the ABI.