Add a new type BlockCall that represents the pair of a block name with arguments to be passed to it. (The mnemonic here is that it looks a bit like a function call.) Rework the implementation of jump, brz, and brnz to use BlockCall instead of storing the block arguments as varargs in the instruction's ValueList.
To ensure that we're processing block arguments from BlockCall values in instructions, three new functions have been introduced on DataFlowGraph that cover both sets of arguments:
inst_values - returns an iterator that traverses values in the instruction and block arguments
map_inst_values - applies a function to each value in the instruction and block arguments
overwrite_inst_values - overwrites all values in the instruction and block arguments with values from an iterator
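As a rough sketch of how these helpers compose (signatures approximated; `replace_value` is a hypothetical caller, not part of the API):
```
use cranelift_codegen::ir::{DataFlowGraph, Inst, Value};

// Sketch: replace `old` with `new` in every value slot of `inst`,
// including arguments carried by BlockCall values, which a plain
// ValueList traversal would previously have missed.
fn replace_value(dfg: &mut DataFlowGraph, inst: Inst, old: Value, new: Value) {
    dfg.map_inst_values(inst, |v| if v == old { new } else { v });
}
```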
Co-authored-by: Jamey Sharp <jamey@minilop.net>
We have some operations defined on DataFlowGraph purely to work around borrow-checker issues with InstructionData and other data on DataFlowGraph. Part of the problem is that indexing the DFG directly hides the fact that we're only indexing the insts field of the DFG.
This PR makes the insts field of the DFG public, but wraps it in a newtype that only allows indexing. This means that the borrow checker is better able to tell when operations on memory held by the DFG won't conflict, which comes up frequently when mutating ValueLists held by InstructionData.
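A minimal sketch of the indexing-only newtype, with simplified stand-in types (the real field wraps Cranelift's entity map, not a `Vec`):
```
use std::ops::{Index, IndexMut};

// Stand-ins for the real entity types.
#[derive(Clone, Copy)]
pub struct Inst(u32);
pub struct InstructionData;

/// Newtype over instruction storage that exposes only indexing.
/// Borrowing `dfg.insts[inst]` then borrows just this field, so the
/// borrow checker can see it doesn't conflict with, say, a mutable
/// borrow of a ValueList held elsewhere in the DFG.
pub struct Insts(Vec<InstructionData>);

impl Index<Inst> for Insts {
    type Output = InstructionData;
    fn index(&self, inst: Inst) -> &InstructionData {
        &self.0[inst.0 as usize]
    }
}

impl IndexMut<Inst> for Insts {
    fn index_mut(&mut self, inst: Inst) -> &mut InstructionData {
        &mut self.0[inst.0 as usize]
    }
}
```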
* cranelift-wasm: translate Wasm loads into lower-level CLIF operations
Rather than using `heap_{load,store,addr}`.
* cranelift: Remove the `heap_{addr,load,store}` instructions
These are now legalized in the `cranelift-wasm` frontend.
* cranelift: Remove the `ir::Heap` entity from CLIF
* Port basic memory operation tests to .wat filetests
* Remove test for verifying CLIF heaps
* Remove `heap_addr` from replace_branching_instructions_and_cfg_predecessors.clif test
* Remove `heap_addr` from readonly.clif test
* Remove `heap_addr` from `table_addr.clif` test
* Remove `heap_addr` from the simd-fvpromote_low.clif test
* Remove `heap_addr` from simd-fvdemote.clif test
* Remove `heap_addr` from the load-op-store.clif test
* Remove the CLIF heap runtest
* Remove `heap_addr` from the global_value.clif test
* Remove `heap_addr` from fpromote.clif runtests
* Remove `heap_addr` from fdemote.clif runtests
* Remove `heap_addr` from memory.clif parser test
* Remove `heap_addr` from reject_load_readonly.clif test
* Remove `heap_addr` from reject_load_notrap.clif test
* Remove `heap_addr` from load_readonly_notrap.clif test
* Remove `static-heap-without-guard-pages.clif` test
Will be subsumed when we port `make-heap-load-store-tests.sh` to generating
`.wat` tests.
* Remove `static-heap-with-guard-pages.clif` test
Will be subsumed when we port `make-heap-load-store-tests.sh` over to `.wat`
tests.
* Remove more heap tests
These will be subsumed by porting `make-heap-load-store-tests.sh` over to `.wat`
tests.
* Remove `heap_addr` from `simple-alias.clif` test
* Remove `heap_addr` from partial-redundancy.clif test
* Remove `heap_addr` from multiple-blocks.clif test
* Remove `heap_addr` from fence.clif test
* Remove `heap_addr` from extends.clif test
* Remove runtests that rely on heaps
Heaps are not a thing in CLIF or the interpreter anymore.
* Add generated load/store `.wat` tests
* Enable memory-related wasm features in `.wat` tests
* Remove CLIF heap from fcmp-mem-bug.clif test
* Add a mode for compiling `.wat` all the way to assembly in filetests
* Also generate WAT to assembly tests in `make-load-store-tests.sh`
* cargo fmt
* Reinstate `f{de,pro}mote.clif` tests without the heap bits
* Remove undefined doc link
* Remove outdated SVG and dot file from docs
* Add docs about `None` returns for base address computation helpers
* Factor out `env.heap_access_spectre_mitigation()` to a local
* Expand docs for `FuncEnvironment::heaps` trait method
* Restore f{de,pro}mote+load clif runtests with stack memory
* cranelift: Remove iconst.i128
* bugpoint: Report Changed when only one instruction is mutated
* cranelift: Fix egraph bxor rule
* cranelift: Remove some simple_preopt opts for i128
Remove the boolean types from cranelift, and the associated instructions breduce, bextend, bconst, and bint. Standardize on using 1/0 for the return value from instructions that produce scalar boolean results, and -1/0 for boolean vector elements.
Fixes #3205
Co-authored-by: Afonso Bordado <afonso360@users.noreply.github.com>
Co-authored-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Co-authored-by: Chris Fallin <chris@cfallin.org>
* Elide redundant sentinel values
The `undef_variables` lists were a binding from Variable to Value, but
the Values were always equal to a suffix of the block's parameters. So
instead of storing another copy, we can just get them back from the
block parameters.
According to DHAT, this decreases total memory allocated and number of
bytes written, and increases number of bytes read and instructions
retired, but all by small fractions of a percent. According to
hyperfine, main is "1.00 ± 0.01 times faster".
* Use entity_impl for cranelift_frontend::Variable
Instead of hand-coding essentially the same thing.
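Roughly (a sketch; the exact derives may differ from the real definition):
```
use cranelift_entity::entity_impl;

/// An opaque variable handle backed by a dense u32 index.
#[derive(Copy, Clone, PartialEq, Eq, Hash, Debug)]
pub struct Variable(u32);
// Generates the index conversions and the "var0", "var1", ...
// Display impl that were previously hand-written.
entity_impl!(Variable, "var");
```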
* Keep undefined variables in a ListPool
According to DHAT, this improves every measure of performance
(instructions retired, total memory allocated, max heap size, bytes
read, and bytes written), although by fractions of a percent. According
to hyperfine the difference is nearly zero, but on Spidermonkey this
branch is "1.01 ± 0.00 times faster" than main.
* Elide redundant block IDs
In a list of predecessors, we previously kept both the jump instruction
that points to the current block, and the block where that instruction
resides. But we can look up the block from the instruction as long as we
have access to the current Layout, which we do everywhere that it was
necessary. So don't store the block, just store the instruction.
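For illustration, a hedged sketch of recovering the block through the layout (assuming `Layout::inst_block` as the lookup):
```
use cranelift_codegen::ir::{Block, Inst, Layout};

// Sketch: a predecessor record now stores only the branch instruction;
// its containing block is recovered from the layout on demand.
fn predecessor_block(layout: &Layout, branch: Inst) -> Block {
    layout
        .inst_block(branch)
        .expect("a predecessor's branch must be inserted in the layout")
}
```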
* Keep predecessor definitions in a ListPool
* Make append_jump_argument independent of self
This makes it easier to reason about borrow-checking issues.
* Reuse `results` instead of re-doing variable lookup
This eliminates three array lookups per predecessor by hanging on to the
results of earlier steps a little longer. This only works now because I
previously removed the need to borrow all of `self`, which otherwise
prevented keeping a borrow of self.results alive.
I had experimented with using `Vec::split_off` to copy the relevant
chunk of results to a temporary heap allocation, but the extra
allocation and copy was measurably slower. So it's important that this
is just a borrow.
* Cache single-predecessor block ID when sealing
Of the code in cranelift_frontend, `use_var` is the second-hottest path,
sitting close behind the `build` function that's used when inserting
every new instruction. This makes sense given that the operands of a new
instruction usually need to be looked up immediately before building the
instruction.
So making the single-predecessor loops in `find_var` and `use_var_local`
do fewer memory accesses and execute fewer instructions turns out to
have a measurable effect. It's still only a small fraction of a percent
overall since cranelift-frontend is only a few percent of total runtime.
This patch keeps a block ID in the SSABlockData, which is None unless
both the block is sealed and it has exactly one predecessor. Doing so
avoids two array lookups on each iteration of the two loops.
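Sketched as a field on the per-block data (field name assumed):
```
use cranelift_codegen::ir::Block;

// Sketch of the cached field described above.
struct SSABlockData {
    // ... existing fields elided ...
    /// `Some(pred)` only while this block is sealed *and* has exactly
    /// one predecessor; `None` otherwise. Lets the hot
    /// single-predecessor loops follow the chain without re-checking
    /// the sealed flag and the predecessor list on each iteration.
    single_predecessor: Option<Block>,
}
```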
According to DHAT, compared with main, at this point this PR uses 0.3%
less memory at max heap, reads 0.6% fewer bytes, and writes 0.2% fewer
bytes.
According to Hyperfine, this PR is "1.01 ± 0.01 times faster" than main
when compiling Spidermonkey. On the other hand, Sightglass says main is
1.01x faster than this PR on the same benchmark by CPU cycles. In short,
actual effects are too small to measure reliably.
* cranelift-wasm: Assume block is reachable
In handling the WebAssembly "end" operator, cranelift-wasm had logic to
skip generating a jump instruction if the block was both unreachable and
"pristine", meaning no instructions had been added.
However, `translate_operator` checks first that `state.reachable` is
true, so this logic only runs when cranelift-wasm believes that the
current block _is_ reachable. Therefore the condition should always be
true, whether the block is pristine or not.
I've left a debug_assert in case `state.reachable` ever doesn't agree
with `builder.is_unreachable()`, but the assert doesn't fail in any of
the tests. We'll see if fuzzing finds something.
Anyway, outside of cranelift-frontend, this eliminates the only use of
`is_pristine()`, and there were no uses of `is_filled()`. So I've made
both of those private. They're now only used in a nearby debug assert.
* cranelift-frontend: Clarify pristine/filled states
There was a comment here saying "A filled block cannot be pristine."
Given that the intent was for those two states to be mutually exclusive,
I've replaced the two booleans with a three-state enum.
I also replaced all reads of these two flags with method calls. In all
but one case these are only checked in debug assertions, so I don't even
care whether they get inlined. They're easier to read, and this will
make it easier to replace their implementations, which I hope to do
soon.
Finally, I replaced all assignments to either flag with an appropriate
assignment of the corresponding enum state. Keep in mind this
correspondence between the new enum and the old flags:
- Empty: pristine true, filled false
- Partial: pristine false, filled false
- Filled: pristine false, filled true
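As a sketch, with variant names matching the correspondence above:
```
/// Sketch of the three-state replacement for the pristine/filled
/// boolean pair; the states are mutually exclusive by construction.
enum BlockStatus {
    /// No instructions have been added yet (was: pristine).
    Empty,
    /// Some instructions added, but no terminator yet.
    Partial,
    /// The block is terminated and complete (was: filled).
    Filled,
}
```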
Every existing update to these flags could only move to a later state.
(For example, Partial couldn't go back to Empty.) In the old flags that
meant that pristine could only go from true to false, and filled could
only go from false to true.
`fill_current_block` was a weird case because at first glance it looks
like it could allow both pristine and filled to be true at the same
time. However, it's only called from `FuncInstBuilder::build`, which
calls `ensure_inserted_block` before doing anything else, and _that_
cleared the pristine flag.
Similarly, `handle_ssa_side_effects` looks like it could allow both
pristine and filled to be true for anything in `split_blocks_created`.
However, those blocks are created by SSABuilder, so their BlockData is
not initialized by `create_block`, and instead uses BlockData::default.
The `Default` implementation here previously set both flags false, while
`create_block` would instead set pristine to true. So these split blocks
were correctly set to the Filled state, and after this patch they are
still set correctly.
* cranelift-frontend: Separate SSA and user block params
Previously there was a `user_param_count` field in BlockData, used
purely to debug-assert that no user parameters are added to a block
after `use_var` adds SSA parameters.
Instead, this patch enforces a strict phase separation between the
period after a block is created when user parameters can be added to it,
and the period when `use_var` may be called and instructions may be
added.
I'm assuming that calls to `use_var` are _always_ followed by inserting
one or more instructions into the block. (If you don't want to insert an
instruction, why do you need to know where instructions in this block
would get variable definitions from?) This patch has no visible effect
for callers which follow that rule.
However, it was previously legal to call `use_var`, then append a block
parameter before adding instructions, so long as `use_var` didn't
actually need to add a block parameter. That could only happen if the
current block is sealed and has exactly one predecessor. So anyone who
was counting on this behavior was playing a dangerous game anyway.
* cranelift-frontend: Defer initializing block data
Every reference to the func_ctx.status SecondaryMap will automatically
create the appropriate entries on-demand, with the sole exception of
`finalize`. In that function, debug assertions use SecondaryMap::keys to
find out which blocks need to be checked.
However, those assertions always succeed for blocks which never had any
instructions added. So it's okay to skip them for blocks which aren't
touched after `create_block`.
* Cleanups to cranelift-frontend SSA construction
* Encode sealed/undef_variables relationship in type
A block can't have any undef_variables if it is sealed. It's useful to
make that fact explicit in the types so that any time either value is
used, it's clear that we should think about the other one too.
In addition, encoding this fact in an enum type lets Rust apply an
optimization that reduces the size of SSABlockData by 8 bytes, making it
fit in a 64-byte cache line. I haven't taken the extra step of making
SSABlockData be 64-byte aligned because 1) it doesn't seem to have a
performance impact and 2) doing so makes other structures quite a bit
bigger.
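A sketch of that encoding (`Vec` stands in for the `ListPool`-backed list from the earlier commit):
```
use cranelift_frontend::Variable;

// Sketch: a sealed block cannot have pending undefined variables, so
// make that state unrepresentable instead of storing a bool alongside
// a list that must then be empty.
enum Sealed {
    Yes,
    No {
        /// Variables used before all predecessors were known; their
        /// values are a suffix of the block's parameters.
        undef_variables: Vec<Variable>,
    },
}
```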
* Simplify finish_predecessors_lookup
Using Vec::drain is more concise than a combination of
iter().rev().take() followed by Vec::truncate. And in this case it
doesn't matter what order we examine the results in, because we just
want to know if they're all equal, so we might as well iterate forward
instead of in reverse.
There's no need for the ZeroOneOrMore enum. Instead, there are only two
cases: either we have a single value to use for the variable (possibly
synthesized as a constant zero), or we need to add a block parameter in
every predecessor.
Pre-filtering the results iterator to eliminate the sentinel makes it
easy to identify how many distinct definitions this variable has.
iter.next() indicates if there are any definitions at all, and then
iter.all() is a clear way to express that we want to know if the
remaining definitions are the same as the first one.
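Putting these pieces together, a hedged sketch of the whole check, with `u32` standing in for `Value`:
```
// Sketch: either every real definition agrees (use it, or synthesize
// zero if there were none), or a block parameter is needed.
enum Lookup {
    Single(Option<u32>), // None means "synthesize a constant zero"
    NeedBlockParam,
}

fn finish_lookup(results: &mut Vec<u32>, count: usize, sentinel: u32) -> Lookup {
    let start = results.len() - count;
    // Vec::drain is more concise than iter().rev().take() + truncate,
    // and order is irrelevant when we only test for equality.
    let mut defs = results.drain(start..).filter(|&v| v != sentinel);
    match defs.next() {
        // No definitions besides the sentinel.
        None => Lookup::Single(None),
        Some(first) => {
            if defs.all(|v| v == first) {
                Lookup::Single(Some(first))
            } else {
                Lookup::NeedBlockParam
            }
        }
    }
}
```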
* Simplify append_jump_argument
* Avoid assigning default() into SecondaryMap
This eliminates some redundant reads and writes.
* cranelift-frontend: Construct with default()
This eliminates a bunch of boilerplate in favor of a built in `derive`
macro.
Also I'm deleting an import that had the comment "FIXME: Remove in
edition2021", which we've been using everywhere since April.
* Fix tests
In the common case where there is a chain of sealed blocks that each
have exactly one predecessor, we can keep track of any sub-sequence of
those blocks in O(1) space. So there's no need to use the state machine
stack to propagate variable definitions back along the chain.
Instead, we can do one loop to find which block to stop at, then either
get the variable definition from that block or introduce a block
parameter there, and finally do one more loop to update variable
definitions in all the intervening blocks.
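Schematically, and with cycle detection elided (plain indices stand in for blocks, `u32` for values):
```
// Sketch of the two-pass walk: follow the sealed single-predecessor
// chain to a block that defines the variable (or the chain's end),
// then write the answer back into every intervening block, using
// O(1) extra space.
fn use_var_on_chain(
    single_pred: &[Option<usize>], // set only for sealed, single-pred blocks
    defs: &mut [Option<u32>],      // this variable's per-block definition
    start: usize,
) -> u32 {
    // Pass 1: find the block to stop at.
    let mut stop = start;
    while defs[stop].is_none() {
        match single_pred[stop] {
            Some(p) => stop = p,
            None => break, // introduce a block parameter here instead
        }
    }
    let def = defs[stop].unwrap_or(0); // 0 stands in for a new block param
    defs[stop] = Some(def);
    // Pass 2: record the definition along the chain we just walked.
    let mut b = start;
    while b != stop {
        defs[b] = Some(def);
        b = single_pred[b].unwrap();
    }
    def
}
```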
The existing implementation already had to do a graph traversal to
propagate variable definitions correctly, so this doesn't visit any more
blocks than before. However, this change also makes it possible to
integrate cycle detection with the graph traversal. That eliminates the
need for the in_predecessor_cycle flags, and any possibility of spiky
performance profiles in maintaining those flags.
As far as performance goes, this is all pretty much a wash: Changes to
CPU time and CPU cycles are within noise, according to hyperfine and
Sightglass/perf. But it's a substantially simpler implementation, with
fewer invisible interactions between functions.
At control-flow join points, cranelift-frontend's SSA builder currently
checks to see if only one definition of a variable reaches the current
block. If so, it can eliminate the corresponding block parameter and use
the original def directly. It implements this by turning the block
parameter into an alias for the original value.
However, it didn't resolve aliases during this check, except after it
had already determined that there was only one definition.
Resolving aliases first instead allows it to detect that more block
parameters are redundant. And as more block parameters get converted to
aliases, later blocks can see common definitions from further away, so
this has a compounding effect.
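A hedged sketch of the equality check with aliases resolved first (assuming `DataFlowGraph::resolve_aliases`; `single_def` is a hypothetical helper):
```
use cranelift_codegen::ir::{DataFlowGraph, Value};

// Sketch: resolve aliases *before* comparing definitions, so block
// params that were turned into aliases of the same root value are
// recognized as a single definition.
fn single_def(dfg: &DataFlowGraph, defs: impl Iterator<Item = Value>) -> Option<Value> {
    let mut defs = defs.map(|v| dfg.resolve_aliases(v));
    let first = defs.next()?;
    defs.all(|v| v == first).then_some(first)
}
```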
This also merges a special case, where there's exactly one unique
non-sentinel definition but it's actually an alias for the sentinel,
into the general case where all definitions are from the sentinel. As a
result there's only one case that has to introduce a definition of the
variable to zero.
According to `valgrind --tool=dhat`, this is a significant memory
savings. On the pulldown-cmark benchmark from Sightglass:
- 15.3% (1.9MiB) less memory allocated at maximum heap
- 4.1% (6.7MiB) less memory allocated in total
- 9.8% (57MiB) fewer bytes read
- 12.6% (36MiB) fewer bytes written
- 5.4% fewer instructions retired
- 1.04x faster by instructions retired (per Sightglass/perf)
- 1.03x to 1.04x faster by CPU cycles (per Sightglass/perf)
- 1.03 ± 0.01 times faster by CPU time (per hyperfine)
- 1.04x faster by cache accesses (per Sightglass/perf)
On the bz2 benchmark:
- 1.06x faster by instructions retired (per Sightglass/perf)
- 1.05x faster by CPU cycles (per Sightglass/perf)
- 1.04 ± 0.01 times faster by CPU time (per hyperfine)
- 1.02x to 1.03x faster by cache accesses (per Sightglass/perf)
Even on the largest benchmark in Sightglass (spidermonkey.wasm), this is
a measurable improvement:
- 1.03x faster by instructions retired (per Sightglass/perf)
- 1.02x faster by CPU cycles (per Sightglass/perf)
- 1.02 ± 0.00 times faster by CPU time (per hyperfine)
There was no significant difference in cache misses for any benchmark,
according to Sightglass/perf.
* Leverage Cargo's workspace inheritance feature
This commit is an attempt to reduce the complexity of the Cargo
manifests in this repository with Cargo's workspace-inheritance feature
becoming stable in Rust 1.64.0. This feature allows specifying fields in
the root workspace `Cargo.toml` which are then reused throughout the
workspace. For example this PR shares definitions such as:
* All of the Wasmtime-family of crates now use `version.workspace =
true` to have a single location which defines the version number.
* All crates use `edition.workspace = true` to have one default edition
for the entire workspace.
* Common dependencies are listed in `[workspace.dependencies]` to avoid
typing the same version number in a lot of different places (e.g. the
`wasmparser = "0.89.0"` version is now in just one spot).
Currently the workspace-inheritance feature doesn't allow having two
different versions to inherit, so all of the Cranelift-family of crates
still manually specify their version. The inter-crate dependencies,
however, are shared amongst the root workspace.
This feature can be seen as a method of "preprocessing" of sorts for
Cargo manifests. This will help us develop Wasmtime but shouldn't have
any actual impact on the published artifacts -- every crate's dependency
list is still the same.
* Fix wasi-crypto tests
We weren't using the "union" cargo feature for the smallvec crate, which
reduces the size of a SmallVec by one machine word. This feature
requires Rust 1.49 but we already require much newer versions.
When using Wasmtime to compile pulldown-cmark from Sightglass, this
saves a decent amount of memory allocations and writes. According to
`valgrind --tool=dhat`:
- 6.2MiB (3.69%) less memory allocated over the program's lifetime
- 0.5MiB (4.13%) less memory allocated at maximum heap size
- 5.5MiB (1.88%) fewer bytes written to
- 0.44% fewer instructions executed
Sightglass reports a statistically significant runtime improvement too:
```
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 24379323.60 ± 20051394.04 (confidence = 99%)
shrink-abiarg-0406da67c.so is 1.01x to 1.13x faster than main-be690a468.so!
[227506364 355007998.78 423280514] main-be690a468.so
[227686018 330628675.18 406025344] shrink-abiarg-0406da67c.so
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 360151622.56 ± 278294316.90 (confidence = 99%)
shrink-abiarg-0406da67c.so is 1.01x to 1.07x faster than main-be690a468.so!
[8709162212 8911001926.44 9535111576] main-be690a468.so
[5058015392 8550850303.88 9282148438] shrink-abiarg-0406da67c.so
compilation :: cycles :: benchmarks/bz2/benchmark.wasm
Δ = 6936570.28 ± 6897696.38 (confidence = 99%)
shrink-abiarg-0406da67c.so is 1.00x to 1.08x faster than main-be690a468.so!
[155810934 175260571.20 234737344] main-be690a468.so
[119128240 168324000.92 257451074] shrink-abiarg-0406da67c.so
```
* cranelift-frontend: Avoid quadratic behavior
Fixes #4923.
* Improve comments and debug assertions
* Improve comments
One thing that's especially neat about this PR is that, unlike the
`can_optimize_var_lookup` graph traversal, `update_predecessor_cycle`
doesn't need to keep track of all the blocks it has visited in order to
detect cycles. However, the reasons why are subtle and need careful
documentation.
Also neat: We've previously tried keeping either a HashSet or a
SecondaryMap around to re-use the same heap allocation for the `visited`
set, which needs space linear in the number of blocks. After this PR,
we're still using space that's linear in the number of blocks to store
the `in_predecessor_cycle` flag, but that flag fits inside existing
padding in `SSABlockData`, so it's a net savings in memory consumption.
* Avoid quadratic behavior in `update_predecessor_cycle`
So far I hadn't really eliminated the quadratic behavior from
`can_optimize_var_lookup`. I just moved it to happen when the CFG is
modified instead, and switched to indexing directly into the vector of
blocks instead of going through a HashSet. I suspect the latter change
is always a win, but the former is only an improvement assuming that
`use_var` is called more often than `declare_block_predecessor`.
But @cfallin pointed out that it feels like we should be able to do
better by taking advantage of the knowledge that once a block is sealed,
its predecessors can't change any more.
That's not completely trivial to do because changes to the property we
care about propagate toward successors, and we're only keeping pointers
to predecessors. Still, as long as frontends follow the existing
recommendation to seal blocks as soon as possible, maintaining a
conservative approximation using only local information works fine in
practice.
This significantly limits the situations where this graph traversal
could visit a lot of the CFG.
* Review comments
* Memoize `can_optimize_var_lookup`
`can_optimize_var_lookup` can have quadratic behavior if there is a chain
of blocks each containing a `local.get` instruction because each run can
walk up the entire chain. This change memoizes the results of
`can_optimize_var_lookup` so that we can stop following the chain of
predecessors when we hit a block that has previously been handled
(making the operation linear again).
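A generic sketch of the memoization idea (block indices instead of real entities; cycle handling elided):
```
// Sketch: cache each block's answer so a later lookup stops at the
// first block already decided, making repeated walks linear overall.
fn can_optimize(start: usize, pred: &[Option<usize>], memo: &mut [Option<bool>]) -> bool {
    let mut visited = Vec::new();
    let mut cur = start;
    let answer = loop {
        if let Some(cached) = memo[cur] {
            break cached; // previously handled: stop following the chain
        }
        visited.push(cur);
        match pred[cur] {
            Some(p) => cur = p,
            None => break true, // reached the entry block
        }
    };
    // Record the answer for every block seen on this walk.
    for b in visited {
        memo[b] = Some(answer);
    }
    answer
}
```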
* cranelift: Remove of/nof overflow flags from icmp
Neither Wasmtime nor cg-clif uses these flags under any circumstances.
From discussion on #3060 I see it's long been unclear what purpose these
flags served.
Fixes #3060, fixes #4406, and fixes #4875... by deleting all the code
that could have been buggy.
This changes the cranelift-fuzzgen input format by removing some IntCC
options, so I've gone ahead and enabled I128 icmp tests at the same
time. Since only the of/nof cases were failing before, I expect these to
work.
* Restore trapif tests
It's still useful to validate that iadd_ifcout's iflags result can be
forwarded correctly to trapif, and for that purpose it doesn't really
matter what condition code is checked.
This commit replaces #4869 and represents the actual version bump that
should have happened had I remembered to bump the in-tree version of
Wasmtime to 1.0.0 prior to the branch-cut date. Alas!
When trying to read generated CLIF, it's nice to be able to see at a
glance that some of the operands are defined by `iconst` and similar
instructions, without having to go find each operand's definition
manually.
This fixes two problems: minimum symbol alignment for the LARL
instruction, and alignment requirements for LRL/LGRL etc.
The first problem is that the LARL instruction used to load a
symbol address (PC relative) requires that the target symbol
is at least 2-byte aligned. This is always guaranteed for code
symbols (all instructions must be 2-aligned anyway), but not
necessarily for data symbols.
Other s390x compilers fix this problem by ensuring that all
global symbols are always emitted with a minimum 2-byte
alignment. This patch introduces an equivalent mechanism
for cranelift:
- Add a symbol_alignment routine to TargetIsa, similar to the
existing code_section_alignment routine.
- Respect symbol_alignment as minimum alignment for all symbols
emitted in the object backend (code and data).
The second problem is that PC-relative instructions that
directly *access* data (like LRL/LGRL, STRL/STGRL etc.)
not only have the 2-byte requirement like LARL, but actually
require that their memory operand is *naturally* aligned
(i.e. alignment is at least the size of the access).
This property (natural alignment for memory accesses) is
supposed to be provided by the "aligned" flag in MemFlags;
however, this is not implemented correctly at the moment.
To fix this, this patch:
- Only emits PC-relative memory access instructions if the
"aligned" flag is set in the associated MemFlags.
- Fixes a bug in emit_small_memory_copy and emit_small_memset
which currently set the aligned flag unconditionally, ignoring
the actual alignment info passed by their caller.
Tested with wasmtime and cg_clif.
This is the implementation of https://github.com/bytecodealliance/wasmtime/issues/4155, using the "inverted API" approach suggested by @cfallin (thanks!) in Cranelift, and a trait object to provide a backend for an all-included experience in Wasmtime.
After the suggestion of Chris, `Function` has been split into mostly two parts:
- on the one hand, `FunctionStencil` contains all the fields required during compilation, and that act as a compilation cache key: if two function stencils are the same, then the result of their compilation (`CompiledCodeBase<Stencil>`) will be the same. This makes caching trivial, as the only thing to cache is the `FunctionStencil`.
- on the other hand, `FunctionParameters` contain the... function parameters that are required to finalize the result of compilation into a `CompiledCode` (aka `CompiledCodeBase<Final>`) with proper final relocations etc., by applying fixups and so on.
Most changes are here to accommodate those requirements, in particular that `FunctionStencil` should be `Hash`able to be used as a key in the cache:
- most source locations are now relative to a base source location in the function, and as such they're encoded as `RelSourceLoc` in the `FunctionStencil`. This required changes so that there's no need to explicitly mark a `SourceLoc` as the base source location; instead, it's automatically detected the first time a non-default `SourceLoc` is set.
- user-defined external names in the `FunctionStencil` (aka before this patch `ExternalName::User { namespace, index }`) are now references into an external table of `UserExternalNameRef -> UserExternalName`, present in the `FunctionParameters`, and must be explicitly declared using `Function::declare_imported_user_function`.
- some refactorings have been made for function names:
- `ExternalName` was used as the type for a `Function`'s name; while it thus allowed `ExternalName::Libcall` in this place, this would have been quite confusing to use it there. Instead, a new enum `UserFuncName` is introduced for this name, that's either a user-defined function name (the above `UserExternalName`) or a test case name.
- The future of `ExternalName` is likely to become a full reference into the `FunctionParameters`'s mapping, instead of being "either a handle for user-defined external names, or the thing itself for other variants". I'm running out of time to do this, and this is not trivial as it implies touching ISLE which I'm less familiar with.
The cache computes a sha256 hash of the `FunctionStencil`, and uses this as the cache key. No equality check (using `PartialEq`) is performed in addition to the hash being the same, as we hope that this is sufficient data to avoid collisions.
A basic fuzz target has been introduced that tries to do the bare minimum:
- check that a function successfully compiled and cached will be also successfully reloaded from the cache, and returns the exact same function.
- check that a trivial modification in the external mapping of `UserExternalNameRef -> UserExternalName` hits the cache, and that other modifications don't hit the cache.
- This last check is less efficient and less likely to happen, so probably should be rethought a bit.
Thanks to both @alexcrichton and @cfallin for your very useful feedback on Zulip.
Some numbers show that for a large wasm module we're using internally, this is a 20% compile-time speedup, because so many `FunctionStencil`s are the same, even within a single module. For a group of modules that have a lot of code in common, we get hit rates up to 70% when they're used together. When a single function changes in a wasm module, every other function is reloaded; that's still slower than I expect (between 10% and 50% of the overall compile time), so there's likely room for improvement.
Fixes #4155.
* Add `try_use_var` method to `cranelift-frontend`.
- Unlike `use_var`, this method does not panic if the variable has not been defined
before use
* Add `try_declare_var` and `try_def_var`.
- Also implement Error for error enums.
* Use `write!` macro.
* Add `write!` use I missed.
First, we switch from a `BTreeSet` to a `HashSet` because clearing a `BTreeSet`
will deallocate the btree's nodes but clearing a `HashSet` will not deallocate
the backing hash table, saving the space to reuse for future insertions.
Then, we reuse the same set (and therefore the same allocation) across every
call to `can_optimize_var_lookup`.
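In outline (a sketch; the predecessor lookup is abstracted as a closure):
```
use std::collections::HashSet;

/// Sketch: keep one HashSet alive across calls. `clear` retains the
/// backing table (unlike dropping a BTreeSet's nodes), so steady-state
/// calls allocate nothing.
struct VarLookupCache {
    visited: HashSet<u32>,
}

impl VarLookupCache {
    fn can_optimize(&mut self, start: u32, single_pred: impl Fn(u32) -> Option<u32>) -> bool {
        self.visited.clear(); // reuse the allocation from the last call
        let mut cur = start;
        while let Some(p) = single_pred(cur) {
            if !self.visited.insert(cur) {
                return false; // cycle: the walk revisited a block
            }
            cur = p;
        }
        true
    }
}
```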
This results in a 1.22x to 1.32x speed up on various Sightglass benchmarks:
```
compilation :: nanoseconds :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 39478181.76 ± 3441880.32 (confidence = 99%)
main.so is 0.75x to 0.79x faster than reuse-set.so!
reuse-set.so is 1.27x to 1.32x faster than main.so!
[160128343 172174751.09 213325968] main.so
[115055695 132696569.33 200782128] reuse-set.so
compilation :: nanoseconds :: benchmarks/bz2/benchmark.wasm
Δ = 22576954.88 ± 1830771.68 (confidence = 99%)
main.so is 0.77x to 0.81x faster than reuse-set.so!
reuse-set.so is 1.25x to 1.29x faster than main.so!
[100449245 106820149.65 118628066] main.so
[77039172 84243194.77 128168647] reuse-set.so
compilation :: nanoseconds :: benchmarks/spidermonkey/benchmark.wasm
Δ = 664533554.97 ± 22109170.05 (confidence = 99%)
main.so is 0.81x to 0.82x faster than reuse-set.so!
reuse-set.so is 1.22x to 1.23x faster than main.so!
[3549762523 3640587103.35 3798662501] main.so
[2793335181 2976053548.38 3192950484] reuse-set.so
```
* cranelift: Restrict `br_table` to `i32` indices
In #4498 it was proposed that we should only accept `i32` indices
to `br_table`. The rationale is that larger types give users a false
sense of flexibility (since we don't support jump tables larger than
u32's), and narrower types are poorly tested paths that are safer
removed.
* cranelift: Reduce directly from i128 to i32 in Switch
In #4502 we discovered a bug in the switch api where it would emit
`icmp_imm`'s with types that were not able to fully represent the
destination index.
We now reject these inputs. The index value must always have a
type that is capable of addressing the entire range of inputs.
* Allow using jump-tables multiple times (fixes #3347)
If there are multiple `br_table` instructions using the same jump table,
then `append_jump_argument` must not modify the jump table in-place.
When this function is called, we don't know if more `br_table`
instructions might be added later. So this patch conservatively assumes
that all jump tables might be reused. If Cranelift needs to add a block
argument to a block that's the target of some jump table, then the jump
table will be unconditionally cloned.
I'm not sure if having duplicated and unused jump tables will turn out
to be a compile-time performance issue. If it is, there's discussion in
issue #3347 about ways to determine that there can't be any more uses of
a jump table, so that it's safe to modify in-place.
* Re-enable cranelift-fuzzgen fuzz target
I've been running this fuzz target for an hour without finding new bugs.
Let's see if oss-fuzz finds anything now.
Introduce a new concept in the IR that allows a producer to create
dynamic vector types. An IR function can now contain global value(s)
that represent a dynamic scaling factor, for a given fixed-width
vector type. A dynamic type is then created by 'multiplying' the
corresponding global value with a fixed-width type. These new types
can be used just like the existing types and the type system has a
set of hard-coded dynamic types, such as I32X4XN, which the
user-defined types map onto. The dynamic types are also used explicitly
to create dynamic stack slots, which have no set size like their
existing counterparts. New IR instructions are added to access these
new stack entities.
Currently, during codegen, the dynamic scaling factor has to be
lowered to a constant so the dynamic slots do eventually have a
compile-time known size, as do spill slots.
The current lowering for aarch64 just targets Neon, using a dynamic
scale of 1.
Copyright (c) 2022, Arm Limited.
* Run a `cargo update` over our dependencies
This'll notably fix a `cargo audit` error where we have a pinned version
of the `regex` crate which has a CVE assigned to it.
* Update to `object` and `hashbrown` crates
Prune some duplicate versions showing up from the previous `cargo update`
* Upgrade all crates to the Rust 2021 edition
I've personally started using the new format strings for things like
`panic!("some message {foo}")` or similar and have been upgrading crates
on a case-by-case basis, but I think it probably makes more sense to go
ahead and blanket upgrade everything so 2021 features are always
available.
* Fix compile of the C API
* Fix a warning
* Fix another warning
* Bump to 0.36.0
* Add a two-week delay to Wasmtime's release process
This commit is a proposal to update Wasmtime's release process with a
two-week delay from branching a release until it's actually officially
released. We've had two issues lately that came up which led to this proposal:
* In #3915 it was realized that changes just before the 0.35.0 release
weren't enough for an embedding use case, but the PR didn't meet the
expectations for a full patch release.
* At Fastly we were about to start rolling out a new version of Wasmtime
when over the weekend the fuzz bug #3951 was found. This led to the
desire internally to have a "must have been fuzzed for this long"
period of time for Wasmtime changes which we felt were better
reflected in the release process itself rather than something about
Fastly's own integration with Wasmtime.
This commit updates the automation for releases to unconditionally
create a `release-X.Y.Z` branch on the 5th of every month. The actual
release from this branch is then performed on the 20th of every month,
roughly two weeks later. This should provide a period of time to ensure
that all changes in a release are fuzzed for at least two weeks and
avoid any further surprises. This should also help with any last-minute
changes made just before a release if they need tweaking, since
backporting to a not-yet-released branch is much easier.
Overall there are some new properties about Wasmtime with this proposal
as well:
* The `main` branch will always have a section in `RELEASES.md` which is
listed as "Unreleased" for us to fill out.
* The `main` branch will always be a version ahead of the latest
release. For example it will be bumped pre-emptively as part of the
release process on the 5th: if `release-2.0.0` was created, then the
`main` branch will move to Wasmtime 3.0.0.
* Dates for major versions are automatically updated in the
`RELEASES.md` notes.
The associated documentation for our release process is updated and the
various scripts should all be updated now as well with this commit.
* Add notes on a security patch
* Clarify security fixes shouldn't be previewed early on CI
Addresses #3809: when we are asked to create a Cranelift backend with
shared flags that indicate support for SIMD, we should check that the
ISA level needed for our SIMD lowerings is present.
This PR adds a flag to each block, settable via the frontend/builder
interface, that indicates the block will not be frequently
executed. As such, the compiler backend should place the block "out of
line" in the final machine code, so that the ordinary, more frequent
execution path that excludes the block does not have to jump around it.
This is useful for adding handlers for exceptional conditions
(slow-paths, guard violations) in a way that minimizes performance cost.
Fixes #2747.
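For illustration, marking a slow path cold from the builder might look like this sketch (assuming a `set_cold_block` method on `FunctionBuilder`; the exact API may differ):
```
use cranelift_codegen::ir::Block;
use cranelift_frontend::FunctionBuilder;

// Sketch: create a slow-path block and mark it cold so the backend
// places it out of the hot code path.
fn make_guard_handler(builder: &mut FunctionBuilder<'_>) -> Block {
    let slow_path = builder.create_block();
    builder.set_cold_block(slow_path);
    // ... later: builder.switch_to_block(slow_path) and emit the
    // trap or fixup code for the exceptional condition ...
    slow_path
}
```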