* egraph-based midend: draw the rest of the owl.
* Rename `egg` submodule of cranelift-codegen to `egraph`.
* Apply some feedback from @jsharp during code walkthrough.
* Remove recursion from find_best_node by doing a single pass.
Rather than recursively computing the lowest-cost node for a given
eclass and memoizing the answer at each eclass node, we can do a single
forward pass; because every eclass node refers only to earlier nodes,
this is sufficient. The behavior may slightly differ from the earlier
behavior because we cannot short-circuit costs to zero once a node is
elaborated; but in practice this should not matter.
* Make elaboration non-recursive.
Use an explicit stack instead (with `ElabStackEntry` entries,
alongside a result stack).
* Make elaboration traversal of the domtree non-recursive/stack-safe.
* Work analysis logic in Cranelift-side egraph glue into a general analysis framework in cranelift-egraph.
* Apply static recursion limit to rule application.
* Fix aarch64 wrt dynamic-vector support -- broken rebase.
* Topo-sort cranelift-egraph before cranelift-codegen in publish script, like the comment instructs me to!
* Fix multi-result call testcase.
* Include `cranelift-egraph` in `PUBLISHED_CRATES`.
* Fix atomic_rmw: not really a load.
* Remove now-unnecessary PartialOrd/Ord derivations.
* Address some code-review comments.
* Review feedback.
* Review feedback.
* No overlap in mid-end rules, because we are defining a multi-constructor.
* rustfmt
* Review feedback.
* Review feedback.
* Review feedback.
* Review feedback.
* Remove redundant `mut`.
* Add comment noting what rules can do.
* Review feedback.
* Clarify comment wording.
* Update `has_memory_fence_semantics`.
* Apply @jameysharp's improved loop-level computation.
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Fix suggestion commit.
* Fix off-by-one in new loop-nest analysis.
* Review feedback.
* Review feedback.
* Review feedback.
* Use `Default`, not `std::default::Default`, as per @fitzgen
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
* Apply @fitzgen's comment elaboration to a doc-comment.
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
* Add stat for hitting the rewrite-depth limit.
* Some code motion in split prelude to make the diff a little clearer wrt `main`.
* Take @jameysharp's suggested `try_into()` usage for blockparam indices.
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Take @jameysharp's suggestion to avoid double-match on load op.
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Fix suggestion (add import).
* Review feedback.
* Fix stack_load handling.
* Remove redundant can_store case.
* Take @jameysharp's suggested improvement to FuncEGraph::build() logic
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Tweaks to FuncEGraph::build() on top of suggestion.
* Take @jameysharp's suggested clarified condition
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Clean up after suggestion (unused variable).
* Fix loop analysis.
* loop level asserts
* Revert constant-space loop analysis -- edge cases were incorrect, so let's go with the simple thing for now.
* Take @jameysharp's suggestion re: result_tys
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Fix up after suggestion
* Take @jameysharp's suggestion to use fold rather than reduce
Co-authored-by: Jamey Sharp <jamey@minilop.net>
* Fixup after suggestion
* Take @jameysharp's suggestion to remove elaborate_eclass_use's return value.
* Clarifying comment in terminator insts.
Co-authored-by: Jamey Sharp <jamey@minilop.net>
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
* Elide redundant sentinel values
The `undef_variables` lists were a binding from Variable to Value, but
the Values were always equal to a suffix of the block's parameters. So
instead of storing another copy, we can just get them back from the
block parameters.
According to DHAT, this decreases total memory allocated and number of
bytes written, and increases number of bytes read and instructions
retired, but all by small fractions of a percent. According to
hyperfine, main is "1.00 ± 0.01 times faster".
* Use entity_impl for cranelift_frontend::Variable
Instead of hand-coding essentially the same thing.
* Keep undefined variables in a ListPool
According to DHAT, this improves every measure of performance
(instructions retired, total memory allocated, max heap size, bytes
read, and bytes written), although by fractions of a percent. According
to hyperfine the difference is nearly zero, but on Spidermonkey this
branch is "1.01 ± 0.00 times faster" than main.
* Elide redundant block IDs
In a list of predecessors, we previously kept both the jump instruction
that points to the current block, and the block where that instruction
resides. But we can look up the block from the instruction as long as we
have access to the current Layout, which we do everywhere that it was
necessary. So don't store the block, just store the instruction.
* Keep predecessor definitions in a ListPool
* Make append_jump_argument independent of self
This makes it easier to reason about borrow-checking issues.
* Reuse `results` instead of re-doing variable lookup
This eliminates three array lookups per predecessor by hanging on to the
results of earlier steps a little longer. This only works now because I
previously removed the need to borrow all of `self`, which otherwise
prevented keeping a borrow of self.results alive.
I had experimented with using `Vec::split_off` to copy the relevant
chunk of results to a temporary heap allocation, but the extra
allocation and copy was measurably slower. So it's important that this
is just a borrow.
* Cache single-predecessor block ID when sealing
Of the code in cranelift_frontend, `use_var` is the second-hottest path,
sitting close behind the `build` function that's used when inserting
every new instruction. This makes sense given that the operands of a new
instruction usually need to be looked up immediately before building the
instruction.
So making the single-predecessor loops in `find_var` and `use_var_local`
do fewer memory accesses and execute fewer instructions turns out to
have a measurable effect. It's still only a small fraction of a percent
overall since cranelift-frontend is only a few percent of total runtime.
This patch keeps a block ID in the SSABlockData, which is None unless
both the block is sealed and it has exactly one predecessor. Doing so
avoids two array lookups on each iteration of the two loops.
According to DHAT, compared with main, at this point this PR uses 0.3%
less memory at max heap, reads 0.6% fewer bytes, and writes 0.2% fewer
bytes.
According to Hyperfine, this PR is "1.01 ± 0.01 times faster" than main
when compiling Spidermonkey. On the other hand, Sightglass says main is
1.01x faster than this PR on the same benchmark by CPU cycles. In short,
actual effects are too small to measure reliably.
* Cleanups to cranelift-frontend SSA construction
* Encode sealed/undef_variables relationship in type
A block can't have any undef_variables if it is sealed. It's useful to
make that fact explicit in the types so that any time either value is
used, it's clear that we should think about the other one too.
In addition, encoding this fact in an enum type lets Rust apply an
optimization that reduces the size of SSABlockData by 8 bytes, making it
fit in a 64-byte cache line. I haven't taken the extra step of making
SSABlockData be 64-byte aligned because 1) it doesn't seem to have a
performance impact and b) doing so makes other structures quite a bit
bigger.
* Simplify finish_predecessors_lookup
Using Vec::drain is more concise than a combination of
iter().rev().take() followed by Vec::truncate. And in this case it
doesn't matter what order we examine the results in, because we just
want to know if they're all equal, so we might as well iterate forward
instead of in reverse.
There's no need for the ZeroOneOrMore enum. Instead, there are only two
cases: either we have a single value to use for the variable (possibly
synthesized as a constant zero), or we need to add a block parameter in
every predecessor.
Pre-filtering the results iterator to eliminate the sentinel makes it
easy to identify how many distinct definitions this variable has.
iter.next() indicates if there are any definitions at all, and then
iter.all() is a clear way to express that we want to know if the
remaining definitions are the same as the first one.
* Simplify append_jump_argument
* Avoid assigning default() into SecondaryMap
This eliminates some redundant reads and writes.
* cranelift-frontend: Construct with default()
This eliminates a bunch of boilerplate in favor of a built in `derive`
macro.
Also I'm deleting an import that had the comment "FIXME: Remove in
edition2021", which we've been using everywhere since April.
* Fix tests
* Leverage Cargo's workspace inheritance feature
This commit is an attempt to reduce the complexity of the Cargo
manifests in this repository with Cargo's workspace-inheritance feature
becoming stable in Rust 1.64.0. This feature allows specifying fields in
the root workspace `Cargo.toml` which are then reused throughout the
workspace. For example this PR shares definitions such as:
* All of the Wasmtime-family of crates now use `version.workspace =
true` to have a single location which defines the version number.
* All crates use `edition.workspace = true` to have one default edition
for the entire workspace.
* Common dependencies are listed in `[workspace.dependencies]` to avoid
typing the same version number in a lot of different places (e.g. the
`wasmparser = "0.89.0"` is now in just one spot.
Currently the workspace-inheritance feature doesn't allow having two
different versions to inherit, so all of the Cranelift-family of crates
still manually specify their version. The inter-crate dependencies,
however, are shared amongst the root workspace.
This feature can be seen as a method of "preprocessing" of sorts for
Cargo manifests. This will help us develop Wasmtime but shouldn't have
any actual impact on the published artifacts -- everything's dependency
lists are still the same.
* Fix wasi-crypto tests
This commit replaces #4869 and represents the actual version bump that
should have happened had I remembered to bump the in-tree version of
Wasmtime to 1.0.0 prior to the branch-cut date. Alas!
This is the implementation of https://github.com/bytecodealliance/wasmtime/issues/4155, using the "inverted API" approach suggested by @cfallin (thanks!) in Cranelift, and trait object to provide a backend for an all-included experience in Wasmtime.
After the suggestion of Chris, `Function` has been split into mostly two parts:
- on the one hand, `FunctionStencil` contains all the fields required during compilation, and that act as a compilation cache key: if two function stencils are the same, then the result of their compilation (`CompiledCodeBase<Stencil>`) will be the same. This makes caching trivial, as the only thing to cache is the `FunctionStencil`.
- on the other hand, `FunctionParameters` contain the... function parameters that are required to finalize the result of compilation into a `CompiledCode` (aka `CompiledCodeBase<Final>`) with proper final relocations etc., by applying fixups and so on.
Most changes are here to accomodate those requirements, in particular that `FunctionStencil` should be `Hash`able to be used as a key in the cache:
- most source locations are now relative to a base source location in the function, and as such they're encoded as `RelSourceLoc` in the `FunctionStencil`. This required changes so that there's no need to explicitly mark a `SourceLoc` as the base source location, it's automatically detected instead the first time a non-default `SourceLoc` is set.
- user-defined external names in the `FunctionStencil` (aka before this patch `ExternalName::User { namespace, index }`) are now references into an external table of `UserExternalNameRef -> UserExternalName`, present in the `FunctionParameters`, and must be explicitly declared using `Function::declare_imported_user_function`.
- some refactorings have been made for function names:
- `ExternalName` was used as the type for a `Function`'s name; while it thus allowed `ExternalName::Libcall` in this place, this would have been quite confusing to use it there. Instead, a new enum `UserFuncName` is introduced for this name, that's either a user-defined function name (the above `UserExternalName`) or a test case name.
- The future of `ExternalName` is likely to become a full reference into the `FunctionParameters`'s mapping, instead of being "either a handle for user-defined external names, or the thing itself for other variants". I'm running out of time to do this, and this is not trivial as it implies touching ISLE which I'm less familiar with.
The cache computes a sha256 hash of the `FunctionStencil`, and uses this as the cache key. No equality check (using `PartialEq`) is performed in addition to the hash being the same, as we hope that this is sufficient data to avoid collisions.
A basic fuzz target has been introduced that tries to do the bare minimum:
- check that a function successfully compiled and cached will be also successfully reloaded from the cache, and returns the exact same function.
- check that a trivial modification in the external mapping of `UserExternalNameRef -> UserExternalName` hits the cache, and that other modifications don't hit the cache.
- This last check is less efficient and less likely to happen, so probably should be rethought a bit.
Thanks to both @alexcrichton and @cfallin for your very useful feedback on Zulip.
Some numbers show that for a large wasm module we're using internally, this is a 20% compile-time speedup, because so many `FunctionStencil`s are the same, even within a single module. For a group of modules that have a lot of code in common, we get hit rates up to 70% when they're used together. When a single function changes in a wasm module, every other function is reloaded; that's still slower than I expect (between 10% and 50% of the overall compile time), so there's likely room for improvement.
Fixes#4155.
* Add initial support for fused adapter trampolines
This commit lands a significant new piece of functionality to Wasmtime's
implementation of the component model in the form of the implementation
of fused adapter trampolines. Internally within a component core wasm
modules can communicate with each other by having their exports
`canon lift`'d to get `canon lower`'d into a different component. This
signifies that two components are communicating through a statically
known interface via the canonical ABI at this time. Previously Wasmtime
was able to identify that this communication was happening but it simply
panicked with `unimplemented!` upon seeing it. This commit is the
beginning of filling out this panic location with an actual
implementation.
The implementation route chosen here for fused adapters is to use a
WebAssembly module itself for the implementation. This means that, at
compile time of a component, Wasmtime is generating core WebAssembly
modules which then get recursively compiled within Wasmtime as well. The
choice to use WebAssembly itself as the implementation of fused adapters
stems from a few motivations:
* This does not represent a significant increase in the "trusted
compiler base" of Wasmtime. Getting the Wasm -> CLIF translation
correct once is hard enough much less for an entirely different IR to
CLIF. By generating WebAssembly no new interactions with Cranelift are
added which drastically reduces the possibilities for mistakes.
* Using WebAssembly means that component adapters are insulated from
miscompilations and mistakes. If something goes wrong it's defined
well within the WebAssembly specification how it goes wrong and what
happens as a result. This means that the "blast zone" for a wrong
adapter is the component instance but not the entire host itself.
Accesses to linear memory are guaranteed to be in-bounds and otherwise
handled via well-defined traps.
* A fully-finished fused adapter compiler is expected to be a
significant and quite complex component of Wasmtime. Functionality
along these lines is expected to be needed for Web-based polyfills of
the component model and by using core WebAssembly it provides the
opportunity to share code between Wasmtime and these polyfills for the
component model.
* Finally the runtime implementation of managing WebAssembly modules is
already implemented and quite easy to integrate with, so representing
fused adapters with WebAssembly results in very little extra support
necessary for the runtime implementation of instantiating and managing
a component.
The compiler added in this commit is dubbed Wasmtime's Fused Adapter
Compiler of Trampolines (FACT) because who doesn't like deriving a name
from an acronym. Currently the trampoline compiler is limited in its
support for interface types and only supports a few primitives. I plan
on filing future PRs to flesh out the support here for all the variants
of `InterfaceType`. For now this PR is primarily focused on all of the
other infrastructure for the addition of a trampoline compiler.
With the choice to use core WebAssembly to implement fused adapters it
means that adapters need to be inserted into a module. Unfortunately
adapters cannot all go into a single WebAssembly module because adapters
themselves have dependencies which may be provided transitively through
instances that were instantiated with other adapters. This means that a
significant chunk of this PR (`adapt.rs`) is dedicated to determining
precisely which adapters go into precisely which adapter modules. This
partitioning process attempts to make large modules wherever it can to
cut down on core wasm instantiations but is likely not optimal as
it's just a simple heuristic today.
With all of this added together it's now possible to start writing
`*.wast` tests that internally have adapted modules communicating with
one another. A `fused.wast` test suite was added as part of this PR
which is the beginning of tests for the support of the fused adapter
compiler added in this PR. Currently this is primarily testing some
various topologies of adapters along with direct/indirect modes. This
will grow many more tests over time as more types are supported.
Overall I'm not 100% satisfied with the testing story of this PR. When a
test fails it's very difficult to debug since everything is written in
the text format of WebAssembly meaning there's no "conveniences" to
print out the state of the world when things go wrong and easily debug.
I think this will become even more apparent as more tests are written
for more types in subsequent PRs. At this time though I know of no
better alternative other than leaning pretty heavily on fuzz-testing to
ensure this is all exercised.
* Fix an unused field warning
* Fix tests in `wasmtime-runtime`
* Add some more tests for compiled trampolines
* Remap exports when injecting adapters
The exports of a component were accidentally left unmapped which meant
that they indexed the instance indexes pre-adapter module insertion.
* Fix typo
* Rebase conflicts
* Cranelift: make `ir::Type` a `u16`.
* Cranelift: pack ValueData back into 64 bits.
After extending `Type` to a `u16`, `ValueData` became 12 bytes rather
than 8. This packs it back down to 8 bytes (64 bits) by stealing two
bits from the `Type` for the enum discriminant (leaving 14 bits for the
type itself).
Performance comparison (3-way between original (`ty-u8`), 16-bit `Type`
(`ty-u16`), and this PR (`ty-packed`)):
```
~/work/sightglass% target/release/sightglass-cli benchmark \
-e ~/ty-u8.so -e ~/ty-u16.so -e ~/ty-packed.so \
--iterations-per-process 10 --processes 2 \
benchmarks-next/spidermonkey/benchmark.wasm
compilation
benchmarks-next/spidermonkey/benchmark.wasm
cycles
[20654406874 21749213920.50 22958520306] /home/cfallin/ty-packed.so
[22227738316 22584704883.90 22916433748] /home/cfallin/ty-u16.so
[20659150490 21598675968.60 22588108428] /home/cfallin/ty-u8.so
nanoseconds
[5435333269 5723139427.25 6041072883] /home/cfallin/ty-packed.so
[5848788229 5942729637.85 6030030341] /home/cfallin/ty-u16.so
[5436002390 5683248226.10 5943626225] /home/cfallin/ty-u8.so
```
So, when compiling SpiderMonkey.wasm, making `Type` 16 bits regresses
performance by 4.5% (5.683s -> 5.723s), while this PR gets 14 bits for a 1.0%
cost (5.683s -> 5.723s). That's still not great, and we can likely do better,
but it's a start.
* Fix test failure: entities to/from u32 via `{from,to}_bits`, not `{from,to}_u32`.
* Upgrade all crates to the Rust 2021 edition
I've personally started using the new format strings for things like
`panic!("some message {foo}")` or similar and have been upgrading crates
on a case-by-case basis, but I think it probably makes more sense to go
ahead and blanket upgrade everything so 2021 features are always
available.
* Fix compile of the C API
* Fix a warning
* Fix another warning
* Bump to 0.36.0
* Add a two-week delay to Wasmtime's release process
This commit is a proposal to update Wasmtime's release process with a
two-week delay from branching a release until it's actually officially
released. We've had two issues lately that came up which led to this proposal:
* In #3915 it was realized that changes just before the 0.35.0 release
weren't enough for an embedding use case, but the PR didn't meet the
expectations for a full patch release.
* At Fastly we were about to start rolling out a new version of Wasmtime
when over the weekend the fuzz bug #3951 was found. This led to the
desire internally to have a "must have been fuzzed for this long"
period of time for Wasmtime changes which we felt were better
reflected in the release process itself rather than something about
Fastly's own integration with Wasmtime.
This commit updates the automation for releases to unconditionally
create a `release-X.Y.Z` branch on the 5th of every month. The actual
release from this branch is then performed on the 20th of every month,
roughly two weeks later. This should provide a period of time to ensure
that all changes in a release are fuzzed for at least two weeks and
avoid any further surprises. This should also help with any last-minute
changes made just before a release if they need tweaking since
backporting to a not-yet-released branch is much easier.
Overall there are some new properties about Wasmtime with this proposal
as well:
* The `main` branch will always have a section in `RELEASES.md` which is
listed as "Unreleased" for us to fill out.
* The `main` branch will always be a version ahead of the latest
release. For example it will be bump pre-emptively as part of the
release process on the 5th where if `release-2.0.0` was created then
the `main` branch will have 3.0.0 Wasmtime.
* Dates for major versions are automatically updated in the
`RELEASES.md` notes.
The associated documentation for our release process is updated and the
various scripts should all be updated now as well with this commit.
* Add notes on a security patch
* Clarify security fixes shouldn't be previewed early on CI
On the build side, this commit introduces two things:
1. The automatic generation of various ISLE definitions for working with
CLIF. Specifically, it generates extern type definitions for clif opcodes and
the clif instruction data `enum`, as well as extractors for matching each clif
instructions. This happens inside the `cranelift-codegen-meta` crate.
2. The compilation of ISLE DSL sources to Rust code, that can be included in the
main `cranelift-codegen` compilation.
Next, this commit introduces the integration glue code required to get
ISLE-generated Rust code hooked up in clif-to-x64 lowering. When lowering a clif
instruction, we first try to use the ISLE code path. If it succeeds, then we are
done lowering this instruction. If it fails, then we proceed along the existing
hand-written code path for lowering.
Finally, this commit ports many lowering rules over from hand-written,
open-coded Rust to ISLE.
In the process of supporting ISLE, this commit also makes the x64 `Inst` capable
of expressing SSA by supporting 3-operand forms for all of the existing
instructions that only have a 2-operand form encoding:
dst = src1 op src2
Rather than only the typical x86-64 2-operand form:
dst = dst op src
This allows `MachInst` to be in SSA form, since `dst` and `src1` are
disentangled.
("3-operand" and "2-operand" are a little bit of a misnomer since not all
operations are binary operations, but we do the same thing for, e.g., unary
operations by disentangling the sole operand from the result.)
There are two motivations for this change:
1. To allow ISLE lowering code to have value-equivalence semantics. We want ISLE
lowering to translate a CLIF expression that evaluates to some value into a
`MachInst` expression that evaluates to the same value. We want both the
lowering itself and the resulting `MachInst` to be pure and referentially
transparent. This is both a nice paradigm for compiler writers that are
authoring and maintaining lowering rules and is a prerequisite to any sort of
formal verification of our lowering rules in the future.
2. Better align `MachInst` with `regalloc2`'s API, which requires that the input
be in SSA form.
N.B. There is likely still some light refactoring to do, so that we are
not duplicating so much code. We should also additionally introduce some
test coverage.
Moves the slow path which resizes the vector out-of-line. The actual
indexing is also done in the out-of-line path which avoids the need for
a second bounds check in the fast path after a potential resize.
Looking at some profiles these or their related functions were all
showing up, so this commit adds `#[inline]` to allow cross-crate
inlining by default.
* Remove `once-cell` dependency.
* Remove function address `BTreeMap` from `CompiledModule` in favor of binary
searching finished functions directly.
* Use `with_capacity` when populating `CompiledModule` finished functions and
trampolines.
* entity: Fix typo in generated documentation
The same function documentation was used for `from_u32()` and `as_u32()`
while their behaviour is different
* Refactor where results of compilation are stored
This commit refactors the internals of compilation in Wasmtime to change
where results of individual function compilation are stored. Previously
compilation resulted in many maps being returned, and compilation
results generally held all these maps together. This commit instead
switches this to have all metadata stored in a `CompiledFunction`
instead of having a separate map for each item that can be stored.
The motivation for this is primarily to help out with future
module-linking-related PRs. What exactly "module level" is depends on
how we interpret modules and how many modules are in play, so it's a bit
easier for operations in wasmtime to work at the function level where
possible. This means that we don't have to pass around multiple
different maps and a function index, but instead just one map or just
one entry representing a compiled function.
Additionally this change updates where the parallelism of compilation
happens, pushing it into `wasmtime-jit` instead of `wasmtime-environ`.
This is another goal where `wasmtime-jit` will have more knowledge about
module-level pieces with module linking in play. User-facing-wise this
should be the same in terms of parallel compilation, though.
The ultimate goal of this refactoring is to make it easier for the
results of compilation to actually be a set of wasm modules. This means
we won't be able to have a map-per-metadata where the primary key is the
function index, because there will be many modules within one "object
file".
* Don't clear out fields, just don't store them
Persist a smaller set of fields in `CompilationArtifacts` instead of
trying to clear fields out and dynamically not accessing them.