* Fix some warnings on nightly Rust
Cargo is warning about the usage of workspace dependencies where the
workspace declaration does not mention `default-features` but the
dependency mentions `default-features`, so this explicitly turns off
default features for `cranelift-codegen` at the workspace level and
removes the explicit `default-features = false` at the manifest levels.
* Explicitly enable default feature in wasmtime
* Enable another feature
Nothing major pulled in here, but wanted to update to the latest
versions which enable tail calls by default. When used in Wasmtime,
however, the feature is disabled without the possibility of being
enabled since it's not implemented.
We can sometimes statically determine that a given load will unconditionally
trap. When this happens, we emit an unconditional trap, and we need to stop
adding new instructions to the block. This commit introduces a `Reachability<T>`
type that is like `Option<T>` but specifically for communicating reachability
and is marked `must_use` so that receivers are forced to handle transitions
from the reachable to the unreachable state.
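A minimal sketch of such a type (the names and the helper here are
illustrative, not the exact in-tree definition):

    #[must_use]
    pub enum Reachability<T> {
        /// Still reachable: here is the translated value.
        Reachable(T),
        /// An unconditional trap was emitted; stop adding instructions.
        Unreachable,
    }

    impl<T> Reachability<T> {
        /// Map over the translated value, propagating unreachability.
        pub fn map<U>(self, f: impl FnOnce(T) -> U) -> Reachability<U> {
            match self {
                Reachability::Reachable(t) => Reachability::Reachable(f(t)),
                Reachability::Unreachable => Reachability::Unreachable,
            }
        }
    }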
Additionally, this adds handling of reachable -> unreachable state transitions
to some SIMD op translations that weren't checking for them.
Fixes #5455
Fixes #5456
* Account for fuel before unconditionally trapping Wasm accesses
Fixes #5445
* Add a test for fuel accounting and unconditionally trapping memory accesses
* cranelift-wasm: translate Wasm loads into lower-level CLIF operations
Rather than using `heap_{load,store,addr}`.
* cranelift: Remove the `heap_{addr,load,store}` instructions
These are now legalized in the `cranelift-wasm` frontend.
* cranelift: Remove the `ir::Heap` entity from CLIF
* Port basic memory operation tests to .wat filetests
* Remove test for verifying CLIF heaps
* Remove `heap_addr` from replace_branching_instructions_and_cfg_predecessors.clif test
* Remove `heap_addr` from readonly.clif test
* Remove `heap_addr` from `table_addr.clif` test
* Remove `heap_addr` from the simd-fvpromote_low.clif test
* Remove `heap_addr` from simd-fvdemote.clif test
* Remove `heap_addr` from the load-op-store.clif test
* Remove the CLIF heap runtest
* Remove `heap_addr` from the global_value.clif test
* Remove `heap_addr` from fpromote.clif runtests
* Remove `heap_addr` from fdemote.clif runtests
* Remove `heap_addr` from memory.clif parser test
* Remove `heap_addr` from reject_load_readonly.clif test
* Remove `heap_addr` from reject_load_notrap.clif test
* Remove `heap_addr` from load_readonly_notrap.clif test
* Remove `static-heap-without-guard-pages.clif` test
Will be subsumed when we port `make-heap-load-store-tests.sh` over to
generating `.wat` tests.
* Remove `static-heap-with-guard-pages.clif` test
Will be subsumed when we port `make-heap-load-store-tests.sh` over to `.wat`
tests.
* Remove more heap tests
These will be subsumed by porting `make-heap-load-store-tests.sh` over to `.wat`
tests.
* Remove `heap_addr` from `simple-alias.clif` test
* Remove `heap_addr` from partial-redundancy.clif test
* Remove `heap_addr` from multiple-blocks.clif test
* Remove `heap_addr` from fence.clif test
* Remove `heap_addr` from extends.clif test
* Remove runtests that rely on heaps
Heaps are not a thing in CLIF or the interpreter anymore.
* Add generated load/store `.wat` tests
* Enable memory-related wasm features in `.wat` tests
* Remove CLIF heap from fcmp-mem-bug.clif test
* Add a mode for compiling `.wat` all the way to assembly in filetests
* Also generate WAT to assembly tests in `make-load-store-tests.sh`
* cargo fmt
* Reinstate `f{de,pro}mote.clif` tests without the heap bits
* Remove undefined doc link
* Remove outdated SVG and dot file from docs
* Add docs about `None` returns for base address computation helpers
* Factor out `env.heap_access_spectre_mitigation()` to a local
* Expand docs for `FuncEnvironment::heaps` trait method
* Restore f{de,pro}mote+load clif runtests with stack memory
This adds support for `.wat` tests in `cranelift-filetest`. The test runner
translates the WAT to Wasm and then uses `cranelift-wasm` to translate the Wasm
to CLIF.
These tests are always precise output tests. The test expectations can be
updated by running tests with the `CRANELIFT_TEST_BLESS=1` environment variable
set, similar to our compile precise output tests. The test's expected output is
contained in the last comment in the test file.
The tests allow for configuring the kinds of heaps used to implement Wasm linear
memory via TOML in a `;;!` comment at the start of the test.
To get ISA and Cranelift flags parsing available in the filetests crate, I had
to move the `parse_sets_and_triple` helper from the `cranelift-tools` binary
crate to the `cranelift-reader` crate, where I think it logically
fits.
Additionally, I had to make some more bits of `cranelift-wasm`'s dummy
environment `pub` so that I could properly wrap and compose it with the
environment used for the `.wat` tests. I don't think this is a big deal, but if
we eventually want to clean this stuff up, we can probably remove the dummy
environments completely, remove `translate_module`, and fold them into these new
test environments and test runner (since Wasmtime isn't using those things
anyways).
* cranelift-wasm: Remove `ModuleTranslationState`
We were putting data into it, but never reading data out of it. Can be removed.
* cranelift-wasm: move `func_state.rs` submodule to `state.rs`
Since it is the only submodule of `state`, it can just be the whole `state`
module.
Before, we would do a `heap_addr` to translate the given Wasm memory address
into a native memory address and pass it into the libcall that implemented the
atomic operation, which would then treat the address as a Wasm memory address
and pass it to `validate_atomic_addr` to be bounds checked a second time. This
is a bit nonsensical, as we are validating a native memory address as if it were
a Wasm memory address.
Now, we no longer do a `heap_addr` to translate the Wasm memory address to a
native memory address. Instead, we pass the Wasm memory address to the libcall,
and the libcall is responsible for doing the bounds check (by calling
`validate_atomic_addr` with the correct type of memory address now).
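A rough sketch of the libcall-side check, with simplified stand-in types
rather than Wasmtime's actual definitions:

    struct Memory {
        base: *mut u8,
        len: usize,
    }

    enum Trap {
        MemoryOutOfBounds,
    }

    /// Bounds check the *Wasm* address inside the libcall, and only then
    /// translate it into a native address for the atomic operation.
    fn validate_atomic_addr(
        mem: &Memory,
        wasm_addr: usize,
        access_size: usize,
    ) -> Result<*mut u8, Trap> {
        match wasm_addr.checked_add(access_size) {
            Some(end) if end <= mem.len => Ok(unsafe { mem.base.add(wasm_addr) }),
            _ => Err(Trap::MemoryOutOfBounds),
        }
    }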
* Cranelift: Make `heap_addr` return calculated `base + index + offset`
Rather than returning just `base + index`.
(Note: I've chosen to use the nomenclature "index" for the dynamic operand and
"offset" for the static immediate.)
This moves the addition of the `offset` into `heap_addr`, instead of leaving it
for the subsequent memory operation, so that we can Spectre-guard the full
address, and not allow speculative execution to read the first 4GiB of memory.
Before this commit, we were effectively doing

    load(spectre_guard(base + index) + offset)

Now we are effectively doing

    load(spectre_guard(base + index + offset))
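As a sketch of what the guarded computation amounts to (the real lowering uses
a branchless select rather than an `if`, and the names here are illustrative):

    /// Clamp the fully computed address to null when the access would be out
    /// of bounds, so a misspeculated access cannot reach past the guard.
    fn guarded_addr(base: u64, index: u64, offset: u64, size: u64, bound: u64) -> u64 {
        let addr = base.wrapping_add(index).wrapping_add(offset);
        let oob = index
            .checked_add(offset)
            .and_then(|n| n.checked_add(size))
            .map_or(true, |end| end > bound);
        if oob { 0 } else { addr }
    }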
Finally, this also corrects `heap_addr`'s documented semantics to say that it
returns an address that will trap on access if `index + offset + access_size` is
out of bounds for the given heap, rather than saying that the `heap_addr` itself
will trap. This matches the implemented behavior for static memories, and after
https://github.com/bytecodealliance/wasmtime/pull/5190 lands (which is blocked
on this commit) will also match the implemented behavior for dynamic memories.
* Update heap_addr docs
* Factor out `offset + size` to a helper
Add a MemFlags operand to the bitcast instruction, where only the
`big` and `little` flags are accepted. These define the lane order
to be used when casting between types of different lane counts.
Update all users to pass an appropriate MemFlags argument.
Implement lane swaps where necessary in the s390x back-end.
This is the final part necessary to fix
https://github.com/bytecodealliance/wasmtime/issues/4566.
This branch removes the `trapif` and `trapff` instructions in favor of using an explicit comparison and `trapnz`. This moves us closer to removing `iflags` and `fflags`, but introduces the need to implement instructions like `iadd_cout` in the x64 and aarch64 backends.
- Allow bitcast for vectors with differing lane widths
- Remove raw_bitcast IR instruction
- Change all users of raw_bitcast to bitcast
- Implement support for no-op bitcast cases across backends
This implements the second step of the plan outlined here:
https://github.com/bytecodealliance/wasmtime/issues/4566#issuecomment-1234819394
Remove the boolean types from Cranelift, and the associated instructions `breduce`, `bextend`, `bconst`, and `bint`. Standardize on using 1/0 for the return value from instructions that produce scalar boolean results, and -1/0 for boolean vector elements.
Fixes #3205
Co-authored-by: Afonso Bordado <afonso360@users.noreply.github.com>
Co-authored-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Co-authored-by: Chris Fallin <chris@cfallin.org>
* Replace resize+copy_from_slice with extend_from_slice
Vec::resize initializes the new space, which is wasted effort if we're
just going to call `copy_from_slice` on it immediately afterward. Using
`extend_from_slice` is simpler, and very slightly faster.
If the new size were bigger than the buffer we're copying from, then it
would make sense to initialize the excess. But it isn't: it's always
exactly the same size.
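In sketch form (illustrative, not the exact call site):

    // Before: `resize` zero-initializes the new tail, which is then
    // immediately overwritten by `copy_from_slice`.
    fn append_before(buf: &mut Vec<u8>, src: &[u8]) {
        let start = buf.len();
        buf.resize(start + src.len(), 0);
        buf[start..].copy_from_slice(src);
    }

    // After: append directly, skipping the redundant initialization.
    fn append_after(buf: &mut Vec<u8>, src: &[u8]) {
        buf.extend_from_slice(src);
    }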
* Move helpers from Context to CompiledCode
These methods only use information from Context::compiled_code, so they
should live on CompiledCode instead.
* Remove an unnecessary #[cfg_attr]
There are other uses of `#[allow(clippy::too_many_arguments)]` in this
file, so apparently it doesn't need to be guarded by the "cargo-clippy"
feature.
* Fix a few comments
Two of these were wrong/misleading:
- `FunctionBuilder::new` does not clear the provided func_ctx. It does
debug-assert that the context is already clear, but I don't think
that's worth a comment.
- `switch_to_block` does not "create values for the arguments." That's
done by the combination of `append_block_params_for_function_params`
and `declare_wasm_parameters`.
* wasmtime-cranelift: Misc cleanups
The main change is to use the `CompiledCode` reference we already had
instead of getting it out of `Context` repeatedly. This removes a bunch
of `unwrap()` calls.
* wasmtime-cranelift: Factor out uncached compile
* Elide redundant sentinel values
The `undef_variables` lists held bindings from Variable to Value, but the
the Values were always equal to a suffix of the block's parameters. So
instead of storing another copy, we can just get them back from the
block parameters.
According to DHAT, this decreases total memory allocated and number of
bytes written, and increases number of bytes read and instructions
retired, but all by small fractions of a percent. According to
hyperfine, main is "1.00 ± 0.01 times faster".
* Use entity_impl for cranelift_frontend::Variable
Instead of hand-coding essentially the same thing.
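Roughly, the hand-rolled newtype boilerplate becomes a macro invocation like
this (sketch):

    use cranelift_entity::entity_impl;

    /// An opaque reference to a variable.
    #[derive(Copy, Clone, PartialEq, Eq, Hash)]
    pub struct Variable(u32);
    entity_impl!(Variable, "var");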
* Keep undefined variables in a ListPool
According to DHAT, this improves every measure of performance
(instructions retired, total memory allocated, max heap size, bytes
read, and bytes written), although by fractions of a percent. According
to hyperfine the difference is nearly zero, but on Spidermonkey this
branch is "1.01 ± 0.00 times faster" than main.
* Elide redundant block IDs
In a list of predecessors, we previously kept both the jump instruction
that points to the current block, and the block where that instruction
resides. But we can look up the block from the instruction as long as we
have access to the current Layout, which we do everywhere that it was
necessary. So don't store the block, just store the instruction.
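In sketch form, the predecessor block is recovered on demand from the stored
instruction (hypothetical helper):

    use cranelift_codegen::ir::{Block, Function, Inst};

    /// Look up the block containing a predecessor's branch instruction,
    /// instead of storing that block redundantly next to the instruction.
    fn pred_block_of(func: &Function, branch_inst: Inst) -> Block {
        func.layout
            .inst_block(branch_inst)
            .expect("branch instruction is inserted in the layout")
    }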
* Keep predecessor definitions in a ListPool
* Make append_jump_argument independent of self
This makes it easier to reason about borrow-checking issues.
* Reuse `results` instead of re-doing variable lookup
This eliminates three array lookups per predecessor by hanging on to the
results of earlier steps a little longer. This only works now because I
previously removed the need to borrow all of `self`, which otherwise
prevented keeping a borrow of self.results alive.
I had experimented with using `Vec::split_off` to copy the relevant
chunk of results to a temporary heap allocation, but the extra
allocation and copy was measurably slower. So it's important that this
is just a borrow.
* Cache single-predecessor block ID when sealing
Of the code in cranelift_frontend, `use_var` is the second-hottest path,
sitting close behind the `build` function that's used when inserting
every new instruction. This makes sense given that the operands of a new
instruction usually need to be looked up immediately before building the
instruction.
So making the single-predecessor loops in `find_var` and `use_var_local`
do fewer memory accesses and execute fewer instructions turns out to
have a measurable effect. It's still only a small fraction of a percent
overall since cranelift-frontend is only a few percent of total runtime.
This patch keeps a block ID in the SSABlockData, which is `None` unless
the block is sealed and has exactly one predecessor. Doing so
avoids two array lookups on each iteration of the two loops.
According to DHAT, compared with main, at this point this PR uses 0.3%
less memory at max heap, reads 0.6% fewer bytes, and writes 0.2% fewer
bytes.
According to Hyperfine, this PR is "1.01 ± 0.01 times faster" than main
when compiling Spidermonkey. On the other hand, Sightglass says main is
1.01x faster than this PR on the same benchmark by CPU cycles. In short,
actual effects are too small to measure reliably.
* cranelift-wasm: Assume block is reachable
In handling the WebAssembly "end" operator, cranelift-wasm had logic to
skip generating a jump instruction if the block was both unreachable and
"pristine", meaning no instructions had been added.
However, `translate_operator` checks first that `state.reachable` is
true, so this logic only runs when cranelift-wasm believes that the
current block _is_ reachable. Therefore the condition should always be
true, whether the block is pristine or not.
I've left a debug_assert in case `state.reachable` ever doesn't agree
with `builder.is_unreachable()`, but the assert doesn't fail in any of
the tests. We'll see if fuzzing finds something.
Anyway, outside of cranelift-frontend, this eliminates the only use of
`is_pristine()`, and there were no uses of `is_filled()`. So I've made
both of those private. They're now only used in a nearby debug assert.
* cranelift-frontend: Clarify pristine/filled states
There was a comment here saying "A filled block cannot be pristine."
Given that the intent was for those two states to be mutually exclusive,
I've replaced the two booleans with a three-state enum.
I also replaced all reads of these two flags with method calls. In all
but one case these are only checked in debug assertions, so I don't even
care whether they get inlined. They're easier to read, and this will
make it easier to replace their implementations, which I hope to do
soon.
Finally, I replaced all assignments to either flag with an appropriate
assignment of the corresponding enum state. Keep in mind this
correspondence between the new enum and the old flags:
- Empty: pristine true, filled false
- Partial: pristine false, filled false
- Filled: pristine false, filled true
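As a sketch (the exact names are assumptions):

    enum BlockStatus {
        /// No instructions have been added yet.
        Empty,   // was: pristine = true,  filled = false
        /// Some instructions added, but no terminator yet.
        Partial, // was: pristine = false, filled = false
        /// Terminated; no further instructions may be added.
        Filled,  // was: pristine = false, filled = true
    }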
Every existing update to these flags could only move to a later state.
(For example, Partial couldn't go back to Empty.) In the old flags that
meant that pristine could only go from true to false, and filled could
only go from false to true.
`fill_current_block` was a weird case because at first glance it looks
like it could allow both pristine and filled to be true at the same
time. However, it's only called from `FuncInstBuilder::build`, which
calls `ensure_inserted_block` before doing anything else, and _that_
cleared the pristine flag.
Similarly, `handle_ssa_side_effects` looks like it could allow both
pristine and filled to be true for anything in `split_blocks_created`.
However, those blocks are created by SSABuilder, so their BlockData is
not initialized by `create_block`, and instead uses BlockData::default.
The `Default` implementation here previously set both flags false, while
`create_block` would instead set pristine to true. So these split blocks
were correctly set to the Filled state, and after this patch they are
still set correctly.
* cranelift-frontend: Separate SSA and user block params
Previously there was a `user_param_count` field in BlockData, used
purely to debug-assert that no user parameters are added to a block
after `use_var` adds SSA parameters.
Instead, this patch enforces a strict phase separation between the
period after a block is created when user parameters can be added to it,
and the period when `use_var` may be called and instructions may be
added.
I'm assuming that calls to `use_var` are _always_ followed by inserting
one or more instructions into the block. (If you don't want to insert an
instruction, why do you need to know where instructions in this block
would get variable definitions from?) This patch has no visible effect
for callers which follow that rule.
However, it was previously legal to call `use_var`, then append a block
parameter before adding instructions, so long as `use_var` didn't
actually need to add a block parameter. That could only happen if the
current block is sealed and has exactly one predecessor. So anyone who
was counting on this behavior was playing a dangerous game anyway.
* cranelift-frontend: Defer initializing block data
Every reference to the func_ctx.status SecondaryMap will automatically
create the appropriate entries on-demand, with the sole exception of
`finalize`. In that function, debug assertions use SecondaryMap::keys to
find out which blocks need to be checked.
However, those assertions always succeed for blocks which never had any
instructions added. So it's okay to skip them for blocks which aren't
touched after `create_block`.
* Update wasm-tools dependencies
This update brings in a number of features such as:
* The component model binary format and AST have been slightly adjusted
in a few locations. Names are now dropped from parameters/results in
the internal representation since they were not used anyway. At this
time the ability to bind a multi-return function has not been exposed.
* The `wasmparser` validator pass will now share allocations with prior
functions, providing what's probably a very minor speedup for Wasmtime
itself.
* The text format for many component-related tests now requires named
parameters.
* Some new relaxed-simd instructions are now ignored.
I hope to have a follow-up to expose the multi-return ability to the
embedding API of components.
* Update audit information for new crates
* Leverage Cargo's workspace inheritance feature
This commit is an attempt to reduce the complexity of the Cargo
manifests in this repository with Cargo's workspace-inheritance feature
becoming stable in Rust 1.64.0. This feature allows specifying fields in
the root workspace `Cargo.toml` which are then reused throughout the
workspace. For example this PR shares definitions such as:
* All of the Wasmtime-family of crates now use `version.workspace =
true` to have a single location which defines the version number.
* All crates use `edition.workspace = true` to have one default edition
for the entire workspace.
* Common dependencies are listed in `[workspace.dependencies]` to avoid
typing the same version number in a lot of different places (e.g.
`wasmparser = "0.89.0"` now lives in just one spot).
Currently the workspace-inheritance feature doesn't allow having two
different versions to inherit, so all of the Cranelift-family of crates
still manually specify their version. The inter-crate dependencies,
however, are shared amongst the root workspace.
This feature can be seen as a method of "preprocessing" of sorts for
Cargo manifests. This will help us develop Wasmtime but shouldn't have
any actual impact on the published artifacts -- everything's dependency
lists are still the same.
* Fix wasi-crypto tests
We weren't using the "union" cargo feature for the smallvec crate, which
reduces the size of a `SmallVec` by one machine word. This feature
requires Rust 1.49, but we already require a much newer version.
When using Wasmtime to compile pulldown-cmark from Sightglass, this
saves a decent amount of memory allocations and writes. According to
`valgrind --tool=dhat`:
- 6.2MiB (3.69%) less memory allocated over the program's lifetime
- 0.5MiB (4.13%) less memory allocated at maximum heap size
- 5.5MiB (1.88%) fewer bytes written to
- 0.44% fewer instructions executed
Sightglass reports a statistically significant runtime improvement too:
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 24379323.60 ± 20051394.04 (confidence = 99%)
shrink-abiarg-0406da67c.so is 1.01x to 1.13x faster than main-be690a468.so!
[227506364 355007998.78 423280514] main-be690a468.so
[227686018 330628675.18 406025344] shrink-abiarg-0406da67c.so
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 360151622.56 ± 278294316.90 (confidence = 99%)
shrink-abiarg-0406da67c.so is 1.01x to 1.07x faster than main-be690a468.so!
[8709162212 8911001926.44 9535111576] main-be690a468.so
[5058015392 8550850303.88 9282148438] shrink-abiarg-0406da67c.so
compilation :: cycles :: benchmarks/bz2/benchmark.wasm
Δ = 6936570.28 ± 6897696.38 (confidence = 99%)
shrink-abiarg-0406da67c.so is 1.00x to 1.08x faster than main-be690a468.so!
[155810934 175260571.20 234737344] main-be690a468.so
[119128240 168324000.92 257451074] shrink-abiarg-0406da67c.so
This commit replaces #4869 and represents the actual version bump that
should have happened had I remembered to bump the in-tree version of
Wasmtime to 1.0.0 prior to the branch-cut date. Alas!
* Upgrade wasm-tools crates, namely the component model
This commit pulls in the latest versions of all of the `wasm-tools`
family of crates. There were two major changes that happened in
`wasm-tools` in the meantime:
* bytecodealliance/wasm-tools#697 - this commit introduced a new API for
more efficiently reading binary operators from a wasm binary. The old
`Operator`-based reading was left in place, however, and continues to
be what Wasmtime uses. I hope to update Wasmtime in a future PR to use
this new API, but for now the biggest change is...
* bytecodealliance/wasm-tools#703 - this commit was a major update to
the component model AST. This commit almost entirely deals with the
fallout of this change.
The changes made to the component model were:
1. The `unit` type no longer exists. This was generally a simple change
where the `Unit` case in a few different locations were all removed.
2. The `expected` type was renamed to `result`. This similarly was
relatively lightweight and mostly just a renaming on the surface. I
took this opportunity to rename `val::Result` to `val::ResultVal` and
`types::Result` to `types::ResultType` to avoid clashing with the
standard library types. The `Option`-based types were handled with
this as well.
3. The payload type of `variant` and `result` types are now optional.
This affected many locations that calculate flat type
representations, ABI information, etc. The `#[derive(ComponentType)]`
macro now specifically maps Rust-defined `enum` types which have
no payload to the equivalent in the component model.
4. Functions can now return multiple results. This changed the
signature of invoking component functions because the return value is
now bound by `ComponentNamedList` (renamed from `ComponentParams`).
This had a large effect in the tests, fuzz test case generation, etc.
5. Function types with 2-or-more parameters/results must uniquely name
all parameters/results. This mostly affected the text format used
throughout the tests.
I haven't added specifically new tests for multi-return but I changed a
number of tests to use it. Additionally I've updated the fuzzers to all
exercise multi-return as well so I think we should get some good
coverage with that.
* Update version numbers
* Use crates.io
This is the implementation of https://github.com/bytecodealliance/wasmtime/issues/4155, using the "inverted API" approach suggested by @cfallin (thanks!) in Cranelift, and a trait object to provide a backend for an all-inclusive experience in Wasmtime.
Following Chris's suggestion, `Function` has been split into two main parts:
- on the one hand, `FunctionStencil` contains all the fields required during compilation, and that act as a compilation cache key: if two function stencils are the same, then the result of their compilation (`CompiledCodeBase<Stencil>`) will be the same. This makes caching trivial, as the only thing to cache is the `FunctionStencil`.
- on the other hand, `FunctionParameters` contain the... function parameters that are required to finalize the result of compilation into a `CompiledCode` (aka `CompiledCodeBase<Final>`) with proper final relocations etc., by applying fixups and so on.
Most changes are here to accommodate those requirements, in particular that `FunctionStencil` should be `Hash`able to be used as a key in the cache:
- most source locations are now relative to a base source location in the function, and as such they're encoded as `RelSourceLoc` in the `FunctionStencil`. This required changes so that there's no need to explicitly mark a `SourceLoc` as the base source location; instead, it's automatically detected the first time a non-default `SourceLoc` is set.
- user-defined external names in the `FunctionStencil` (aka before this patch `ExternalName::User { namespace, index }`) are now references into an external table of `UserExternalNameRef -> UserExternalName`, present in the `FunctionParameters`, and must be explicitly declared using `Function::declare_imported_user_function`.
- some refactorings have been made for function names:
- `ExternalName` was used as the type for a `Function`'s name; while this allowed `ExternalName::Libcall` in that position, actually using it there would have been quite confusing. Instead, a new enum `UserFuncName` is introduced for this name, which is either a user-defined function name (the above `UserExternalName`) or a test case name.
- `ExternalName` is likely to eventually become a full reference into the `FunctionParameters`'s mapping, instead of being "either a handle for user-defined external names, or the thing itself for other variants". I'm running out of time to do this, and it is not trivial as it implies touching ISLE, which I'm less familiar with.
The cache computes a sha256 hash of the `FunctionStencil`, and uses this as the cache key. No equality check (using `PartialEq`) is performed in addition to the hash being the same, as we hope that this is sufficient data to avoid collisions.
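In outline, the split and the cache key computation look something like the
following sketch; the field choices are illustrative only, and a `Hasher`
stands in for the real sha256 hash:

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    /// Everything that affects compilation; hashing this yields the cache key.
    #[derive(Hash)]
    struct FunctionStencil {
        signature: Vec<u8>, // stand-in for the real signature
        body: Vec<u8>,      // stand-in for the CLIF body, with RelSourceLocs
    }

    /// Applied when finalizing a `CompiledCodeBase<Stencil>` into a
    /// `CompiledCode`, e.g. the name table and the base source location.
    struct FunctionParameters {
        base_srcloc: u32,              // base for RelSourceLoc -> SourceLoc
        user_named_funcs: Vec<String>, // UserExternalNameRef -> UserExternalName
    }

    fn cache_key(stencil: &FunctionStencil) -> u64 {
        let mut h = DefaultHasher::new();
        stencil.hash(&mut h);
        h.finish()
    }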
A basic fuzz target has been introduced that tries to do the bare minimum:
- check that a function successfully compiled and cached will also be successfully reloaded from the cache, and returns the exact same function.
- check that a trivial modification in the external mapping of `UserExternalNameRef -> UserExternalName` hits the cache, and that other modifications don't hit the cache.
- This last check is less efficient and less likely to happen, so probably should be rethought a bit.
Thanks to both @alexcrichton and @cfallin for your very useful feedback on Zulip.
Some numbers show that for a large wasm module we're using internally, this is a 20% compile-time speedup, because so many `FunctionStencil`s are the same, even within a single module. For a group of modules that have a lot of code in common, we get hit rates up to 70% when they're used together. When a single function changes in a wasm module, every other function is reloaded; that's still slower than I expect (between 10% and 50% of the overall compile time), so there's likely room for improvement.
Fixes #4155.
* Cranelift: remove Baldrdash support and related features.
As noted in Mozilla's bugzilla bug 1781425 [1], the SpiderMonkey team
has recently determined that their current form of integration with
Cranelift is too hard to maintain, and they have chosen to remove it
from their codebase. If and when they decide to build updated support
for Cranelift, they will adopt different approaches to several details
of the integration.
In the meantime, after discussion with the SpiderMonkey folks, they
agree that it makes sense to remove the bits of Cranelift that exist
to support the integration ("Baldrdash"), as they will not need
them. Many of these bits are difficult-to-maintain special cases that
are not actually tested in Cranelift proper: for example, the
Baldrdash integration required Cranelift to emit function bodies
without prologues/epilogues, and instead communicate very precise
information about the expected frame size and layout, then stitched
together something post-facto. This was brittle and caused a lot of
incidental complexity ("fallthrough returns", the resulting special
logic in block-ordering); this is just one example. As another
example, one particular Baldrdash ABI variant processed stack args in
reverse order, so our ABI code had to support both traversal
orders. We had a number of other Baldrdash-specific settings as well
that did various special things.
This PR removes Baldrdash ABI support, the `fallthrough_return`
instruction, and pulls some threads to remove now-unused bits as a
result of those two, with the understanding that the SpiderMonkey folks
will build new functionality as needed in the future and we can perhaps
find cleaner abstractions to make it all work.
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1781425
* Review feedback.
* Fix (?) DWARF debug tests: add `--disable-cache` to wasmtime invocations.
The debugger tests invoke `wasmtime` from within each test case under
the control of a debugger (gdb or lldb). Some of these tests started to
inexplicably fail in CI with unrelated changes, and the failures were
only inconsistently reproducible locally. It seems to be cache related:
if we disable cached compilation on the nested `wasmtime` invocations,
the tests consistently pass.
* Review feedback.
For wasm programs using SIMD vector types, the type known at function
entry or exit may not match the type used within the body of the
function, so we have to bitcast them. We have to check all calls and
returns for this condition, which involves comparing a subset of a
function's signature with the CLIF types we're trying to use. Currently,
this check heap-allocates a short-lived Vec for the appropriate subset
of the signature.
But most of the time none of the values need a bitcast. So this patch
avoids allocating unless at least one bitcast is actually required, and
only saves the pointers to values that need fixing up. I leaned heavily
on iterators to keep space usage constant until we discover allocation
is necessary after all.
I don't think it's possible to eliminate the allocation entirely,
because the function signature we're examining is borrowed from the
FuncBuilder, but we need to mutably borrow the FuncBuilder to insert the
bitcast instructions. Fortunately, the FromIterator implementation for
Vec doesn't reserve any heap space if the iterator is empty, so we can
unconditionally collect into a Vec and still avoid unnecessary
allocations.
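The shape of the trick, as a standalone sketch with simplified types (not the
actual `cranelift-wasm` code):

    /// Pair each value with its expected and actual types, keeping only the
    /// mismatches. Collecting an empty iterator into a `Vec` performs no heap
    /// allocation, so the common no-bitcast case stays allocation-free.
    fn values_needing_bitcast(
        expected: &[u32],
        actual: &[u32],
        values: &[u64],
    ) -> Vec<u64> {
        expected
            .iter()
            .zip(actual)
            .zip(values)
            .filter(|&((want, got), _)| want != got)
            .map(|(_, &v)| v)
            .collect()
    }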
Since the relationship between a function signature and a list of CLIF
values is somewhat complicated, I refactored all the uses of
`bitcast_arguments` and `wasm_param_types`. Instead there's
`bitcast_wasm_params` and `bitcast_wasm_returns` which share a helper
that combines the previous pair of functions into one.
According to DHAT, when compiling the Sightglass Spidermonkey benchmark,
this avoids 52k allocations averaging about 9 bytes each, which are
freed on average within 2k instructions.
Most Sightglass benchmarks, including Spidermonkey, show no performance
difference with this change. Only one has a slowdown, and it's small:
compilation :: nanoseconds :: benchmarks/shootout-matrix/benchmark.wasm
Δ = 689373.34 ± 593720.78 (confidence = 99%)
lazy-bitcast.so is 0.94x to 1.00x faster than main-e121c209f.so!
main-e121c209f.so is 1.00x to 1.06x faster than lazy-bitcast.so!
[19741713 21375844.46 32976047] lazy-bitcast.so
[19345471 20686471.12 30872267] main-e121c209f.so
But several Sightglass benchmarks have notable speed-ups, with smaller
improvements for shootout-ed25519, meshoptimizer, and pulldown-cmark,
and larger ones as follows:
compilation :: nanoseconds :: benchmarks/bz2/benchmark.wasm
Δ = 20071545.47 ± 2950014.92 (confidence = 99%)
lazy-bitcast.so is 1.26x to 1.36x faster than main-e121c209f.so!
main-e121c209f.so is 0.73x to 0.80x faster than lazy-bitcast.so!
[55995164 64849257.68 89083031] lazy-bitcast.so
[79382460 84920803.15 98016185] main-e121c209f.so
compilation :: nanoseconds :: benchmarks/blake3-scalar/benchmark.wasm
Δ = 16620780.61 ± 5395162.13 (confidence = 99%)
lazy-bitcast.so is 1.14x to 1.28x faster than main-e121c209f.so!
main-e121c209f.so is 0.77x to 0.88x faster than lazy-bitcast.so!
[54604352 79877776.35 103666598] lazy-bitcast.so
[94011835 96498556.96 106200091] main-e121c209f.so
compilation :: nanoseconds :: benchmarks/intgemm-simd/benchmark.wasm
Δ = 36891254.34 ± 12403663.50 (confidence = 99%)
lazy-bitcast.so is 1.12x to 1.24x faster than main-e121c209f.so!
main-e121c209f.so is 0.79x to 0.90x faster than lazy-bitcast.so!
[131610845 201289587.88 247341883] lazy-bitcast.so
[232065032 238180842.22 250957563] main-e121c209f.so
This commit builds on bytecodealliance/wasm-tools#690 to add support to
testing of the component model to execute functions when running
`*.wast` files. This support is all built on #4442 as functions are
invoked through a "dynamic" API. Right now the testing and integration
is fairly crude but I'm hoping that we can try to improve it over time
as necessary. For now this should provide a hopefully more convenient
syntax for unit tests and the like.
* cranelift: Reorganize test suite
Group some SIMD operations by instruction.
* cranelift: Deduplicate some shift tests
Also adds new tests with the mod behaviour.
* aarch64: Lower shifts with mod behaviour
* x64: Lower shifts with mod behaviour
* wasmtime: Don't mask SIMD shifts
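For context, Wasm shift operators take the shift amount modulo the lane width,
which is the behaviour the backends now implement directly instead of Wasmtime
masking the amount up front. A scalar sketch:

    /// Wasm's `i32.shl` semantics: the shift amount is taken modulo 32.
    fn wasm_shl32(x: u32, y: u32) -> u32 {
        // `wrapping_shl` masks the amount to `y % 32`, matching Wasm.
        x.wrapping_shl(y)
    }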
* Bump versions of wasm-tools crates
Note that this leaves new features in the component model, such as
outer type aliases for core wasm types, unimplemented for now.
* Move to crates.io-based versions of tools
This commit updates the wasm-tools family of crates, notably pulling in
the refactorings and updates from bytecodealliance/wasm-tools#621 for
the latest iteration of the component model. This commit additionally
updates all support for the component model for these changes, notably:
* Many bits and pieces of type information were refactored. Many
`FooTypeIndex` namings are now `TypeFooIndex`. Additionally there is
now `TypeIndex` as well as `ComponentTypeIndex` for the two type index
spaces in a component.
* A number of new sections are now processed to handle the core and
component variants.
* Internal maps were split such as the `funcs` map into
`component_funcs` and `funcs` (same for `instances`).
* Canonical options are now processed individually instead of as one
bulk `into` definition.
Overall this was not a major update to the internals of handling the
component model in Wasmtime. Instead this was mostly a surface-level
refactoring to make sure that everything lines up with the new binary
format for components.
* All text syntax used in tests was updated to the new syntax.
This PR adds a basic *alias analysis*, and optimizations that use it.
This is a "mid-end optimization": it operates on CLIF, the
machine-independent IR, before lowering occurs.
The alias analysis (or maybe more properly, a sort of memory-value
analysis) determines when it can prove that the value at a particular
memory location is equal to a given SSA value; when it can, it replaces
loads of that location with that value.
This subsumes two common optimizations:
* Redundant load elimination: when the same memory address is loaded two
times, and it can be proven that no intervening operations will write
to that memory, then the second load is *redundant* and its result
must be the same as the first. We can use the first load's result and
remove the second load.
* Store-to-load forwarding: when a load can be proven to access exactly
the memory written by a preceding store, we can replace the load's
result with the store's data operand, and remove the load.
Both of these optimizations rely on a "last store" analysis that is a
sort of coloring mechanism, split across disjoint categories of abstract
state. The basic idea is that every memory-accessing operation is put
into one of N disjoint categories; it is disallowed for memory to ever
be accessed by an op in one category and later accessed by an op in
another category. (The frontend must ensure this.)
Then, given this, we scan the code and determine, for each
memory-accessing op, when a single prior instruction is a store to the
same category. This "colors" the instruction: it is, in a sense, a
static name for that version of memory.
This analysis provides an important invariant: if two operations access
memory with the same last-store, then *no other store can alias* in the
time between that last store and these operations. This must-not-alias
property, together with checks that the accessed address is *exactly
the same* (same SSA value and offset) and that other attributes of the
access (type, extension mode) are the same, lets us prove that the
results are the same.
Given last-store info, we scan the instructions and build a table from
"memory location" key (last store, address, offset, type, extension) to
known SSA value stored in that location. A store inserts a new mapping.
A load may also insert a new mapping, if we didn't already have one.
Then when a load occurs and an entry already exists for its "location",
we can reuse the value. This will be either RLE or St-to-Ld depending on
where the value came from.
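Schematically, the key and the table might look like this (stand-in types, not
the real implementation):

    use std::collections::HashMap;

    type Inst = u32;  // instruction id (stand-in)
    type Value = u32; // SSA value id (stand-in)

    /// The "memory location" key: every field must match exactly for a later
    /// access to reuse a known value.
    #[derive(PartialEq, Eq, Hash)]
    struct MemoryLoc {
        last_store: Option<Inst>, // the "color" from the last-store analysis
        address: Value,           // the *same* SSA value, not merely equal
        offset: i32,
        ty: u8,        // access type (stand-in)
        extension: u8, // extension mode (stand-in)
    }

    /// Location -> SSA value known to be stored there.
    type KnownValues = HashMap<MemoryLoc, Value>;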
Note that this *does* work across basic blocks: the last-store analysis
is a full iterative dataflow pass, and we are careful to check dominance
of a previously-defined value before aliasing to it at a potentially
redundant load. So we will do the right thing if we only have a
"partially redundant" load (loaded already but only in one predecessor
block), but we will also correctly reuse a value if there is a store or
load above a loop and a redundant load of that value within the loop, as
long as no potentially-aliasing stores happen within the loop.