6fe69d00ca8d1760584cf55723fc65b6e9464ab6
159 Commits

28cf995fd3 | cranelift-frontend: make FunctionBuilder::finalize consume self (#5316)
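
For context, a minimal sketch of the `cranelift-frontend` builder flow this touches (illustrative setup; the point is only the last call): after this change `finalize` takes `self` by value, so the borrow of the `Function` ends there and the builder cannot be reused.

```rust
use cranelift_codegen::ir::{types, AbiParam, Function, Signature, UserFuncName};
use cranelift_codegen::isa::CallConv;
use cranelift_frontend::{FunctionBuilder, FunctionBuilderContext};

fn build_const_fn() -> Function {
    let mut sig = Signature::new(CallConv::SystemV);
    sig.returns.push(AbiParam::new(types::I32));
    let mut func = Function::with_name_signature(UserFuncName::user(0, 0), sig);

    let mut fn_ctx = FunctionBuilderContext::new();
    let mut builder = FunctionBuilder::new(&mut func, &mut fn_ctx);
    let block = builder.create_block();
    builder.switch_to_block(block);
    builder.seal_block(block);
    let v = builder.ins().iconst(types::I32, 42);
    builder.ins().return_(&[v]);

    // With #5316 this consumes the builder instead of taking `&mut self`.
    builder.finalize();
    func
}
```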

7a31c5b07c | Deduplicate listings of traps in Wasmtime (#5299)
This commit replaces `wasmtime_environ::TrapCode` with `wasmtime::Trap`. This is made possible by past refactorings which slimmed down the `Trap` definition in the `wasmtime` crate to a simple `enum`. This means there's one less place where all the various trap codes need to be listed in Wasmtime.
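
From the embedder side, the slimmed-down type looks roughly like this (a hedged sketch assuming the wasmtime 4.x-era API, where call errors are `anyhow::Error` values that can be downcast to the `Trap` enum):

```rust
use wasmtime::Trap;

// `err` is the error returned by a failed `Func::call` / `TypedFunc::call`.
fn describe_trap(err: &anyhow::Error) -> &'static str {
    match err.downcast_ref::<Trap>() {
        Some(Trap::UnreachableCodeReached) => "wasm `unreachable` executed",
        Some(Trap::MemoryOutOfBounds) => "out-of-bounds memory access",
        Some(Trap::IntegerDivisionByZero) => "integer division by zero",
        Some(_) => "some other trap",
        None => "host error, not a trap",
    }
}
```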

47fa1ad6a8 | Rework bounds checking for atomic operations (#5239)
Before, we would do a `heap_addr` to translate the given Wasm memory address into a native memory address and pass it into the libcall that implemented the atomic operation, which would then treat the address as a Wasm memory address and pass it to `validate_atomic_addr` to be bounds checked a second time. This is a bit nonsensical, as we are validating a native memory address as if it were a Wasm memory address.

Now, we no longer do a `heap_addr` to translate the Wasm memory address to a native memory address. Instead, we pass the Wasm memory address to the libcall, and the libcall is responsible for doing the bounds check (by calling `validate_atomic_addr` with the correct type of memory address now).
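
The check the libcall now performs has roughly the following shape (an illustrative sketch only, not the actual `validate_atomic_addr` signature):

```rust
/// Illustrative only: bounds- and alignment-check a *Wasm* address for an
/// atomic access of `size` bytes against a linear memory of `memory_len` bytes.
fn check_atomic_addr(wasm_addr: u64, size: u64, memory_len: u64) -> Result<(), &'static str> {
    // Atomic accesses must be naturally aligned.
    if wasm_addr % size != 0 {
        return Err("misaligned atomic access");
    }
    // Overflow-safe bounds check: wasm_addr + size <= memory_len.
    let end = wasm_addr.checked_add(size).ok_or("out of bounds")?;
    if end > memory_len {
        return Err("out of bounds");
    }
    Ok(())
}
```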

cd53bed898 | Implement AOT compilation for components (#5160)
* Pull `Module` out of `ModuleTextBuilder`. This commit is the first in what will likely be a number towards preparing for serializing a compiled component to bytes, a precompiled artifact. To that end my rough plan is to merge all of the compiled artifacts for a component into one large object file instead of having lots of separate object files and lots of separate mmaps to manage. To that end I plan on eventually using `ModuleTextBuilder` to build one large text section for all core wasm modules and trampolines, meaning that `ModuleTextBuilder` is no longer specific to one module. I've extracted out functionality such as function name calculation as well as relocation resolving (now a closure passed in) in preparation for this. For now this just keeps tests passing, and the trajectory for this should become more clear over the following commits.
* Remove component-specific object emission. This commit removes the `ComponentCompiler::emit_obj` function in favor of `Compiler::emit_obj`, now renamed `append_code`. This involved significantly refactoring code emission to take a flat list of functions into `append_code`, and the caller is responsible for weaving together various "families" of functions and un-weaving them afterwards.
* Consolidate ELF parsing in `CodeMemory`. This commit moves the ELF file parsing and section iteration from `CompiledModule` into `CodeMemory` so one location keeps track of section ranges and such. This is in preparation for sharing much of this code with components, which need all the same sections to get tracked but won't be using `CompiledModule`. A small side benefit from this is that the section parsing done in `CodeMemory` and `CompiledModule` is no longer duplicated.
* Remove separately tracked traps in components. Previously components would generate an "always trapping" function and the metadata around which pc was allowed to trap was handled manually for components. With recent refactorings the Wasmtime-standard trap section in object files is now being generated for components as well, which means that can be reused instead of custom-tracking this metadata. This commit removes the manual tracking for the `always_trap` functions and plumbs the necessary bits around to make components look more like modules.
* Remove a now-unnecessary `Arc` in `Module`. Not expected to have any measurable impact on performance, but complexity-wise this should make it a bit easier to understand the internals since there's no longer any need to store this somewhere else than its owner's location.
* Merge compilation artifacts of components. This commit is a large refactoring of the component compilation process to produce a single artifact instead of multiple binary artifacts. The core wasm compilation process is refactored as well to share as much code as necessary with the component compilation process. This method of representing a compiled component necessitated a few medium-sized changes internally within Wasmtime:
  * A new data structure was created, `CodeObject`, which represents metadata about a single compiled artifact. This is then stored as an `Arc` within a component and a module. For `Module` this is always uniquely owned and represents a shuffling around of data from one owner to another. For a `Component`, however, this is shared amongst all loaded modules and the top-level component.
  * The "module registry" which is used for symbolicating backtraces and for trap information has been updated to account for a single region of loaded code holding possibly multiple modules. This involved adding a second-level `BTreeMap` for now. This will likely slow down instantiation slightly, but if it poses an issue in the future this should be able to be represented with a more clever data structure.

  This commit additionally solves a number of longstanding issues with components, such as compiling only one host-to-wasm trampoline per signature instead of possibly once-per-module. Additionally the `SignatureCollection` registration now happens once-per-component instead of once-per-module-within-a-component.
* Fix compile errors from prior commits
* Support AOT-compiling components. This commit adds support for AOT-compiled components in the same manner as `Module`, specifically adding (see the sketch after this list):
  * `Engine::precompile_component`
  * `Component::serialize`
  * `Component::deserialize`
  * `Component::deserialize_file`

  Internally the support for components looks quite similar to `Module`. All the prior commits to this made adding the support here (unsurprisingly) easy. Components are represented as a single object file as are modules, and the functions for each module are all piled into the same object file next to each other (as are areas such as data sections). Support was also added here to quickly differentiate compiled components vs compiled modules via the `e_flags` field in the ELF header.
* Prevent serializing exported modules on components. The current representation of a module within a component means that the implementation of `Module::serialize` will not work if the module is exported from a component. The reason for this is that `serialize` doesn't actually do anything and simply returns the underlying mmap as a list of bytes. The mmap, however, has `.wasmtime.info` describing component metadata as opposed to this module's metadata. While rewriting this section could be implemented, it's not so easy to do and is otherwise not seen as a super important feature right now anyway.
* Fix windows build
* Fix an unused function warning
* Update crates/environ/src/compilation.rs

Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>

ec12415b1f | cranelift: Remove redundant branch and select instructions (#5097)
As discussed in the 2022/10/19 meeting, this PR removes many of the branch and select instructions that used iflags, in favor of using brz/brnz and select in their place. Additionally, it reworks selectif_spectre_guard to take an i8 input instead of an iflags input. For reference, the removed instructions are: br_icmp, brif, brff, trueif, trueff, and selectif.
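
For illustration, the replacement pattern in builder code at the time of this change looks like the sketch below (assumed `builder`, values, and blocks; `icmp` produces an ordinary integer that feeds `brnz` instead of an iflags value feeding `brif`/`br_icmp`):

```rust
use cranelift_codegen::ir::condcodes::IntCC;
use cranelift_codegen::ir::{Block, Value};
use cranelift_frontend::FunctionBuilder;

/// Branch to `then_block` when `a == b`, otherwise to `else_block`, without
/// any iflags-typed values.
fn branch_on_eq(
    builder: &mut FunctionBuilder,
    a: Value,
    b: Value,
    then_block: Block,
    else_block: Block,
) {
    let cond = builder.ins().icmp(IntCC::Equal, a, b);
    builder.ins().brnz(cond, then_block, &[]);
    builder.ins().jump(else_block, &[]);
}
```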

32a7593c94 | cranelift: Remove booleans (#5031)
Remove the boolean types from cranelift, and the associated instructions breduce, bextend, bconst, and bint. Standardize on using 1/0 for the return value from instructions that produce scalar boolean results, and -1/0 for boolean vector elements.

Fixes #3205

Co-authored-by: Afonso Bordado <afonso360@users.noreply.github.com>
Co-authored-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Co-authored-by: Chris Fallin <chris@cfallin.org>
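
A small sketch of what this means for IR-building code (assumed builder and inputs): comparisons now yield an ordinary integer 1/0 rather than a `b1`, so the old `bint` step disappears.

```rust
use cranelift_codegen::ir::condcodes::IntCC;
use cranelift_codegen::ir::{types, Value};
use cranelift_frontend::FunctionBuilder;

/// Returns `a == b` as an i32 0/1 value. Previously `icmp` produced a `b1`
/// that had to be converted with `bint`; with booleans removed it already
/// produces an integer 1/0 that can simply be widened.
fn eq_as_i32(builder: &mut FunctionBuilder, a: Value, b: Value) -> Value {
    let flag = builder.ins().icmp(IntCC::Equal, a, b);
    builder.ins().uextend(types::I32, flag)
}
```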

04b30acad9 | Misc cleanups (#5014)
* Replace resize+copy_from_slice with extend_from_slice (illustrated right after this list). Vec::resize initializes the new space, which is wasted effort if we're just going to call `copy_from_slice` on it immediately afterward. Using `extend_from_slice` is simpler, and very slightly faster. If the new size were bigger than the buffer we're copying from, then it would make sense to initialize the excess. But it isn't: it's always exactly the same size.
* Move helpers from Context to CompiledCode. These methods only use information from Context::compiled_code, so they should live on CompiledCode instead.
* Remove an unnecessary #[cfg_attr]. There are other uses of `#[allow(clippy::too_many_arguments)]` in this file, so apparently it doesn't need to be guarded by the "cargo-clippy" feature.
* Fix a few comments. Two of these were wrong/misleading:
  - `FunctionBuilder::new` does not clear the provided func_ctx. It does debug-assert that the context is already clear, but I don't think that's worth a comment.
  - `switch_to_block` does not "create values for the arguments." That's done by the combination of `append_block_params_for_function_params` and `declare_wasm_parameters`.
* wasmtime-cranelift: Misc cleanups. The main change is to use the `CompiledCode` reference we already had instead of getting it out of `Context` repeatedly. This removes a bunch of `unwrap()` calls.
* wasmtime-cranelift: Factor out uncached compile
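
The first cleanup is the kind of before/after shown here (an illustrative example; `buf` and `src` stand in for the actual buffers):

```rust
fn append_before(buf: &mut Vec<u8>, src: &[u8]) {
    // Old pattern: `resize` zero-initializes the new tail, and `copy_from_slice`
    // immediately overwrites it, so the initialization is wasted work.
    let old_len = buf.len();
    buf.resize(old_len + src.len(), 0);
    buf[old_len..].copy_from_slice(src);
}

fn append_after(buf: &mut Vec<u8>, src: &[u8]) {
    // New pattern: simpler, and very slightly faster.
    buf.extend_from_slice(src);
}
```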

29c7de7340 | Update wasm-tools dependencies (#4970)
* Update wasm-tools dependencies. This update brings in a number of features such as:
  * The component model binary format and AST have been slightly adjusted in a few locations. Names are dropped from parameters/results now in the internal representation since they were not used anyway. At this time the ability to bind a multi-return function has not been exposed.
  * The `wasmparser` validator pass will now share allocations with prior functions, providing what's probably a very minor speedup for Wasmtime itself.
  * The text format for many component-related tests now requires named parameters.
  * Some new relaxed-simd instructions are updated to be ignored.

  I hope to have a follow-up to expose the multi-return ability to the embedding API of components.
* Update audit information for new crates

7b311004b5 | Leverage Cargo's workspace inheritance feature (#4905)
* Leverage Cargo's workspace inheritance feature. This commit is an attempt to reduce the complexity of the Cargo manifests in this repository with Cargo's workspace-inheritance feature becoming stable in Rust 1.64.0. This feature allows specifying fields in the root workspace `Cargo.toml` which are then reused throughout the workspace. For example this PR shares definitions such as:
  * All of the Wasmtime-family of crates now use `version.workspace = true` to have a single location which defines the version number.
  * All crates use `edition.workspace = true` to have one default edition for the entire workspace.
  * Common dependencies are listed in `[workspace.dependencies]` to avoid typing the same version number in a lot of different places (e.g. `wasmparser = "0.89.0"` is now in just one spot).

  Currently the workspace-inheritance feature doesn't allow having two different versions to inherit, so all of the Cranelift-family of crates still manually specify their version. The inter-crate dependencies, however, are shared amongst the root workspace. This feature can be seen as a method of "preprocessing" of sorts for Cargo manifests. This will help us develop Wasmtime but shouldn't have any actual impact on the published artifacts -- everything's dependency lists are still the same.
* Fix wasi-crypto tests

d8b290898c | Initial forward-edge CFI implementation (#3693)
* Initial forward-edge CFI implementation. Give the user the option to start all basic blocks that are targets of indirect branches with the BTI instruction introduced by the Branch Target Identification extension to the Arm instruction set architecture. Copyright (c) 2022, Arm Limited.
* Refactor `from_artifacts` to avoid second `make_executable` (#1). This involves "parsing" twice, but this is parsing just the header of an ELF file so it's not a very intensive operation and should be ok to do twice.
* Address the code review feedback. Copyright (c) 2022, Arm Limited.

Co-authored-by: Alex Crichton <alex@alexcrichton.com>

65930640f8 | Bump Wasmtime to 2.0.0 (#4874)
This commit replaces #4869 and represents the actual version bump that should have happened had I remembered to bump the in-tree version of Wasmtime to 1.0.0 prior to the branch-cut date. Alas!

dde2c5a3b6 | Align functions according to their ISA's requirements (#4826)
Add a `function_alignment` function to the `TargetIsa` trait, and use it to align functions when generating objects. Additionally, collect the maximum alignment required for pc-relative constants in functions and pass that value out. Use the max of these two values when padding functions for alignment.

This fixes a bug on x86_64 where rip-relative loads to sse registers could cause a segfault, as functions weren't always guaranteed to be aligned to 16-byte addresses.

Fixes #4812
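
The padding rule described here boils down to something like the following (an illustrative helper, not the actual object-building code; `isa_alignment` and `constant_alignment` stand in for the two values the commit collects):

```rust
/// Round `offset` up to a multiple of `align`, which must be a power of two.
fn align_up(offset: u64, align: u64) -> u64 {
    debug_assert!(align.is_power_of_two());
    (offset + align - 1) & !(align - 1)
}

/// A function is placed at the next offset aligned to the larger of the ISA's
/// required function alignment and the strictest alignment needed by any
/// pc-relative constant in its body, so e.g. rip-relative SSE loads on x86_64
/// always see 16-byte-aligned constants.
fn next_function_offset(section_offset: u64, isa_alignment: u64, constant_alignment: u64) -> u64 {
    align_up(section_offset, isa_alignment.max(constant_alignment))
}
```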

57dca934ad | Upgrade wasm-tools crates, namely the component model (#4715)
* Upgrade wasm-tools crates, namely the component model. This commit pulls in the latest versions of all of the `wasm-tools` family of crates. There were two major changes that happened in `wasm-tools` in the meantime:
  * bytecodealliance/wasm-tools#697 - this commit introduced a new API for more efficiently reading binary operators from a wasm binary. The old `Operator`-based reading was left in place, however, and continues to be what Wasmtime uses. I hope to update Wasmtime in a future PR to use this new API, but for now the biggest change is...
  * bytecodealliance/wasm-tools#703 - this commit was a major update to the component model AST. This commit almost entirely deals with the fallout of this change. The changes made to the component model were:
    1. The `unit` type no longer exists. This was generally a simple change where the `Unit` case in a few different locations was removed.
    2. The `expected` type was renamed to `result`. This similarly was relatively lightweight and mostly just a renaming on the surface. I took this opportunity to rename `val::Result` to `val::ResultVal` and `types::Result` to `types::ResultType` to avoid clashing with the standard library types. The `Option`-based types were handled with this as well.
    3. The payload type of `variant` and `result` types is now optional. This affected many locations that calculate flat type representations, ABI information, etc. The `#[derive(ComponentType)]` macro now specifically maps Rust-defined `enum` types which have no payload to the equivalent in the component model.
    4. Functions can now return multiple results. This changed the signature of invoking component functions because the return value is now bound by `ComponentNamedList` (renamed from `ComponentParams`). This had a large effect in the tests, fuzz test case generation, etc.
    5. Function types with 2-or-more parameters/results must uniquely name all parameters/results. This mostly affected the text format used throughout the tests. I haven't added specifically new tests for multi-return but I changed a number of tests to use it. Additionally I've updated the fuzzers to all exercise multi-return as well, so I think we should get some good coverage with that.
* Update version numbers
* Use crates.io

8a9b1a9025 | Implement an incremental compilation cache for Cranelift (#4551)
This is the implementation of https://github.com/bytecodealliance/wasmtime/issues/4155, using the "inverted API" approach suggested by @cfallin (thanks!) in Cranelift, and a trait object to provide a backend for an all-included experience in Wasmtime.

After the suggestion of Chris, `Function` has been split into mostly two parts:
- on the one hand, `FunctionStencil` contains all the fields required during compilation, and that act as a compilation cache key: if two function stencils are the same, then the result of their compilation (`CompiledCodeBase<Stencil>`) will be the same. This makes caching trivial, as the only thing to cache is the `FunctionStencil`.
- on the other hand, `FunctionParameters` contain the... function parameters that are required to finalize the result of compilation into a `CompiledCode` (aka `CompiledCodeBase<Final>`) with proper final relocations etc., by applying fixups and so on.

Most changes are here to accommodate those requirements, in particular that `FunctionStencil` should be `Hash`able to be used as a key in the cache:
- most source locations are now relative to a base source location in the function, and as such they're encoded as `RelSourceLoc` in the `FunctionStencil`. This required changes so that there's no need to explicitly mark a `SourceLoc` as the base source location; it's automatically detected instead the first time a non-default `SourceLoc` is set.
- user-defined external names in the `FunctionStencil` (aka before this patch `ExternalName::User { namespace, index }`) are now references into an external table of `UserExternalNameRef -> UserExternalName`, present in the `FunctionParameters`, and must be explicitly declared using `Function::declare_imported_user_function`.
- some refactorings have been made for function names:
  - `ExternalName` was used as the type for a `Function`'s name; while it thus allowed `ExternalName::Libcall` in this place, this would have been quite confusing to use it there. Instead, a new enum `UserFuncName` is introduced for this name, that's either a user-defined function name (the above `UserExternalName`) or a test case name.
  - The future of `ExternalName` is likely to become a full reference into the `FunctionParameters`'s mapping, instead of being "either a handle for user-defined external names, or the thing itself for other variants". I'm running out of time to do this, and this is not trivial as it implies touching ISLE which I'm less familiar with.

The cache computes a sha256 hash of the `FunctionStencil`, and uses this as the cache key. No equality check (using `PartialEq`) is performed in addition to the hash being the same, as we hope that this is sufficient data to avoid collisions.

A basic fuzz target has been introduced that tries to do the bare minimum:
- check that a function successfully compiled and cached will also be successfully reloaded from the cache, and returns the exact same function.
- check that a trivial modification in the external mapping of `UserExternalNameRef -> UserExternalName` hits the cache, and that other modifications don't hit the cache.
- This last check is less efficient and less likely to happen, so probably should be rethought a bit.

Thanks to both @alexcrichton and @cfallin for your very useful feedback on Zulip.

Some numbers show that for a large wasm module we're using internally, this is a 20% compile-time speedup, because so many `FunctionStencil`s are the same, even within a single module. For a group of modules that have a lot of code in common, we get hit rates up to 70% when they're used together. When a single function changes in a wasm module, every other function is reloaded; that's still slower than I expect (between 10% and 50% of the overall compile time), so there's likely room for improvement.

Fixes #4155.
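
Conceptually, the stencil/parameters split enables a cache shaped like the sketch below (hypothetical types, not the actual `cranelift-codegen` definitions; the real cache keys on a sha256 hash of the stencil rather than storing it in a map):

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for FunctionStencil / FunctionParameters /
// CompiledCodeBase<Stencil>: everything position-independent and hashable
// lives in the stencil; everything needed only for finalization lives in
// the parameters.
#[derive(Hash, PartialEq, Eq, Clone)]
struct Stencil {
    ir: Vec<u8>, // canonicalized IR, relative source locations, name refs, ...
}
struct Parameters {
    user_names: Vec<String>, // UserExternalNameRef -> UserExternalName mapping, ...
}
#[derive(Clone)]
struct CompiledStencil {
    code: Vec<u8>, // machine code with unresolved, relative fixups
}

struct Cache {
    entries: HashMap<Stencil, CompiledStencil>,
}

impl Cache {
    fn compile(&mut self, stencil: &Stencil, params: &Parameters) -> Vec<u8> {
        // Identical stencils reuse the expensive compilation result...
        let compiled = self
            .entries
            .entry(stencil.clone())
            .or_insert_with(|| expensive_compile(stencil))
            .clone();
        // ...and only the cheap finalization step depends on the parameters.
        finalize(&compiled, params)
    }
}

fn expensive_compile(_stencil: &Stencil) -> CompiledStencil {
    CompiledStencil { code: Vec::new() }
}

fn finalize(compiled: &CompiledStencil, _params: &Parameters) -> Vec<u8> {
    compiled.code.clone()
}
```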

67870d1518 | s390x: Support both big- and little-endian vector lane order (#4682)
This implements the s390x back-end portion of the solution for https://github.com/bytecodealliance/wasmtime/issues/4566

We now support both big- and little-endian vector lane order in code generation. The order used for a function is determined by the function's ABI: if it uses a Wasmtime ABI, it will use little-endian lane order, and big-endian lane order otherwise. (This ensures that all raw_bitcast instructions generated by both wasmtime and other cranelift frontends can always be implemented as a no-op.)

Lane order affects the implementation of a number of operations:
- Vector immediates
- Vector memory load / store (in big- and little-endian variants)
- Operations explicitly using lane numbers (insertlane, extractlane, shuffle, swizzle)
- Operations implicitly using lane numbers (iadd_pairwise, narrow/widen, promote/demote, fcvt_low, vhigh_bits)

In addition, when calling a function using a different lane order, we need to lane-swap all vector values passed or returned in registers.

A small number of changes to common code were also needed:
- Ensure we always select a Wasmtime calling convention on s390x in crates/cranelift (func_signature).
- Fix vector immediates for filetests/runtests. In PR #4427, I attempted to fix this by byte-swapping the V128 value, but with the new scheme, we'd instead need to perform a per-lane byte swap. Since we do not know the actual type in write_to_slice and read_from_slice, this isn't easily possible. Revert this part of PR #4427 again, and instead just mark the memory buffer as little-endian when emitting the trampoline; the back-end will then emit correct code to load the constant.
- Change a runtest in simd-bitselect-to-vselect.clif to no longer make little-endian lane order assumptions.
- Remove runtests in simd-swizzle.clif that make little-endian lane order assumptions by relying on implicit type conversion when using a non-i16x8 swizzle result type (this feature should probably be removed anyway).

Tested with both wasmtime and cg_clif.
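
The lane-index remapping underlying the explicit-lane-number cases is simple (an illustrative helper; the actual lowering of course operates on vector registers, not on indices in isolation):

```rust
/// Map a lane index between little-endian and big-endian lane order; the
/// mapping is its own inverse. For a vector with 16 lanes, lane 0 in one
/// order is lane 15 in the other.
fn swap_lane_order(lane: u8, lane_count: u8) -> u8 {
    debug_assert!(lane < lane_count);
    lane_count - 1 - lane
}
```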

650979ae40 | Implement strings in adapter modules (#4623)
* Implement strings in adapter modules. This commit is a hefty addition to Wasmtime's support for the component model. This implements the final remaining type (in the current type hierarchy) unimplemented in adapter module trampolines: strings. Strings are the most complicated type to implement in adapter trampolines because they are highly structured chunks of data in memory (according to specific encodings). Additionally each lift/lower operation can choose its own encoding for strings, meaning that Wasmtime, the host, may have to convert between any pairwise ordering of string encodings. The `CanonicalABI.md` in the component-model repo in general specifies all the fiddly bits of string encoding, so there's not a ton of wiggle room for Wasmtime to get creative. This PR largely "just" implements that. The high-level architecture of this implementation is:
  * Fused adapters are first identified to determine src/dst string encodings. This statically fixes what transcoding operation is being performed.
  * The generated adapter will be responsible for managing calls to `realloc` and performing bounds checks. The adapter itself does not perform memory copies or validation of string contents, however. Instead each transcoding operation is modeled as an imported function into the adapter module. This means that the adapter module dynamically, during compile time, determines what string transcoders are needed. Note that an imported transcoder is not only parameterized over the transcoding operation but additionally which memory is the source and which is the destination.
  * The imported core wasm functions are modeled as a new `CoreDef::Transcoder` structure. These transcoders end up being small Cranelift-compiled trampolines. The Cranelift-compiled trampoline will load the actual base pointer of memory and add it to the relative pointers passed as function arguments. This trampoline then calls a transcoder "libcall" which enters Rust-defined functions for actual transcoding operations.
  * Each possible transcoding operation is implemented in Rust with a unique name and a unique signature depending on the needs of the transcoder. I've tried to document inline what each transcoder does.

  This means that the `Module::translate_string` in adapter modules is by far the largest translation method. The main reason for this is due to the management around calling the imported transcoder functions in the face of validating string pointer/lengths and performing the dance of `realloc`-vs-transcode at the right time. I've tried to ensure that each individual case in transcoding is documented well enough to understand what's going on as well. Additionally in this PR is a full implementation in the host for the `latin1+utf16` encoding, which means that both lifting and lowering host strings now work with this encoding.

  Currently the implementation of each transcoder function is likely far from optimal. Where possible I've leaned on the standard library itself and for latin1-related things I'm leaning on the `encoding_rs` crate. I initially tried to implement everything with `encoding_rs` but was unable to uniformly do so easily. For now I settled on trying to get a known-correct (even in the face of endianness) implementation for all of these transcoders. If and when performance becomes an issue, it should be possible to implement more optimized versions of each of these transcoding operations (one simple case is sketched after this list).

  Testing this commit has been somewhat difficult and my general plan, like with the `(list T)` type, is to rely heavily on fuzzing to cover the various cases here. In this PR though I've added a simple test that pushes some statically known strings through all the pairs of encodings between source and destination. I've attempted to pick "interesting" strings that one way or another stress the various paths in each transcoding operation to ideally get full branch coverage there. Additionally a suite of "negative" tests has also been added to ensure that validity of encoding is actually checked.
* Fix a temporarily commented out case
* Fix wasmtime-runtime tests
* Update deny.toml configuration
* Add `BSD-3-Clause` for the `encoding_rs` crate
* Remove some unused licenses
* Add an exemption for `encoding_rs` for now
* Split up the `translate_string` method. Move out all the closures and package up captured state into smaller lists of arguments.
* Test out-of-bounds for zero-length strings
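
As a tiny illustration of one of the simpler directions (latin1 to utf8), the core of such a transcoder can be written in plain Rust as below; the real transcoders additionally deal with `realloc`, bounds checks, the other encoding pairs, and `encoding_rs` where appropriate:

```rust
/// Latin-1 code units map one-to-one onto Unicode scalar values U+0000..=U+00FF,
/// so transcoding to UTF-8 is a per-byte char conversion; the output is between
/// 1x and 2x the input length.
fn latin1_to_utf8(src: &[u8]) -> String {
    src.iter().map(|&b| char::from(b)).collect()
}

#[test]
fn latin1_to_utf8_works() {
    assert_eq!(latin1_to_utf8(b"caf\xe9"), "café");
}
```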

412fa04911 | Bump Wasmtime to 0.41.0 (#4620)
Co-authored-by: Wasmtime Publish <wasmtime-publish@users.noreply.github.com>

70ce288dc7 | Save exit Wasm FP and PC in component-to-host trampolines (#4601)
* Wasmtime: Add a pointer to `VMRuntimeLimits` in component contexts
* Save exit Wasm FP and PC in component-to-host trampolines. Fixes #4535
* Add comment about why we deref the trampoline's FP
* Update some tests to use new `vmruntime_limits_*` methods

9f82644cc3 | Some minor cleanups/refactorings in components (#4582)
This is a collection of some minor renamings, refactorings, sharing of code, etc. This was all discovered during my addition of string support to adapter functions and I figured it'd be best to front-load this and land it ahead of the full patch since it's getting complex.

ff37c9d8a4 | [cranelift] Rejigger the compile API (#4540)
* Move `emit_to_memory` to `MachCompileResult`. This small refactoring makes it clearer to me that emitting to memory doesn't require anything else from the compilation `Context`. While it's a trivial change, it's a small public API change that shouldn't cause too much trouble, and doesn't seem RFC-worthy. Happy to hear different opinions about this, though!
* Hide the MachCompileResult behind a method
* Add a `CompileError` wrapper type that references a `Function`
* Rename MachCompileResult to CompiledCode
* Additionally remove the last unsafe API in cranelift-codegen

46782b18c2 | wasmtime: Implement fast Wasm stack walking (#4431)
* Always preserve frame pointers in Wasmtime
This allows us to efficiently and simply capture Wasm stacks without maintaining
and synchronizing any safety-critical side tables between the compiler and the
runtime.
* wasmtime: Implement fast Wasm stack walking
Why do we want Wasm stack walking to be fast? Because we capture stacks whenever
there is a trap and traps actually happen fairly frequently with short-lived
programs and WASI's `exit`.
Previously, we would rely on generating the system unwind info (e.g.
`.eh_frame`) and using the system unwinder (via the `backtrace` crate) to walk
the full stack and filter out any non-Wasm stack frames. This can,
unfortunately, be slow for two primary reasons:
1. The system unwinder is doing `O(all-kinds-of-frames)` work rather than
`O(wasm-frames)` work.
2. System unwind info and the system unwinder need to be much more general than
a purpose-built stack walker for Wasm needs to be. It has to handle any kind of
stack frame that any compiler might emit, whereas our Wasm frames are emitted by
Cranelift and always have frame pointers. This translates into implementation
complexity and general overhead. There can also be unnecessary-for-our-use-cases
global synchronization and locks involved, further slowing down stack walking in
the presence of multiple threads trying to capture stacks in parallel.
This commit introduces a purpose-built stack walker for traversing just our Wasm
frames. To find all the sequences of Wasm-to-Wasm stack frames, and ignore
non-Wasm stack frames, we keep a linked list of `(entry stack pointer, exit
frame pointer)` pairs. This linked list is maintained via Wasm-to-host and
host-to-Wasm trampolines. Within a sequence of Wasm-to-Wasm calls, we can use
frame pointers (which Cranelift preserves) to find the next older Wasm frame on
the stack, and we keep doing this until we reach the entry stack pointer,
meaning that the next older frame will be a host frame.
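
In code, the walk over one contiguous run of Wasm frames looks roughly like this hypothetical sketch (not Wasmtime's actual `Backtrace` implementation; it assumes the usual frame layout with the saved frame pointer at `fp` and the return address one word above it):

```rust
/// Illustrative only: visit the return addresses of one contiguous run of Wasm
/// frames, bounded by the stack pointer recorded when the host entered Wasm and
/// the frame pointer recorded when Wasm last exited to the host.
unsafe fn walk_wasm_frames(entry_sp: usize, exit_fp: usize, mut visit_pc: impl FnMut(usize)) {
    let mut fp = exit_fp;
    while fp != entry_sp {
        // The return address is saved one word above the saved frame pointer.
        let return_addr = *((fp + std::mem::size_of::<usize>()) as *const usize);
        visit_pc(return_addr);
        // The saved frame pointer links to the next older frame.
        fp = *(fp as *const usize);
    }
}
```
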
The trampolines need to avoid a couple of stumbling blocks. First, they need to be
compiled ahead of time, since we may not have access to a compiler at
runtime (e.g. if the `cranelift` feature is disabled) but still want to be able
to call functions that have already been compiled and get stack traces for those
functions. Usually this means we would compile the appropriate trampolines
inside `Module::new` and the compiled module object would hold the
trampolines. However, we *also* need to support calling host functions that are
wrapped into `wasmtime::Func`s and there doesn't exist *any* ahead-of-time
compiled module object to hold the appropriate trampolines:
```rust
// Define a host function.
let func_type = wasmtime::FuncType::new(
    vec![wasmtime::ValType::I32],
    vec![wasmtime::ValType::I32],
);
let func = Func::new(&mut store, func_type, |_, params, results| {
    // ...
    Ok(())
});

// Call that host function.
let mut results = vec![wasmtime::Val::I32(0)];
func.call(&mut store, &[wasmtime::Val::I32(0)], &mut results)?;
```
Therefore, we define one host-to-Wasm trampoline and one Wasm-to-host trampoline
in assembly that work for all Wasm and host function signatures. These
trampolines are careful to only use volatile registers, avoid touching any
register that is an argument in the calling convention ABI, and tail call to the
target callee function. This allows forwarding any set of arguments and any
returns to and from the callee, while also allowing us to maintain our linked
list of Wasm stack and frame pointers before transferring control to the
callee. These trampolines are not used in Wasm-to-Wasm calls, only when crossing
the host-Wasm boundary, so they do not impose overhead on regular calls. (And if
using one trampoline for all host-Wasm boundary crossing ever breaks branch
prediction enough in the CPU to become any kind of bottleneck, we can do fun
things like have multiple copies of the same trampoline and choose a random copy
for each function, sharding the functions across branch predictor entries.)
Finally, this commit also ends the use of a synthetic `Module` and allocating a
stubbed out `VMContext` for host functions. Instead, we define a
`VMHostFuncContext` with its own magic value, similar to `VMComponentContext`,
specifically for host functions.
<h2>Benchmarks</h2>
<h3>Traps and Stack Traces</h3>
Large improvements to taking stack traces on traps, ranging from shaving off 64%
to 99.95% of the time it used to take.
<details>
```
multi-threaded-traps/0 time: [2.5686 us 2.5808 us 2.5934 us]
thrpt: [0.0000 elem/s 0.0000 elem/s 0.0000 elem/s]
change:
time: [-85.419% -85.153% -84.869%] (p = 0.00 < 0.05)
thrpt: [+560.90% +573.56% +585.84%]
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
4 (4.00%) high mild
4 (4.00%) high severe
multi-threaded-traps/1 time: [2.9021 us 2.9167 us 2.9322 us]
thrpt: [341.04 Kelem/s 342.86 Kelem/s 344.58 Kelem/s]
change:
time: [-91.455% -91.294% -91.096%] (p = 0.00 < 0.05)
thrpt: [+1023.1% +1048.6% +1070.3%]
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) high mild
5 (5.00%) high severe
multi-threaded-traps/2 time: [2.9996 us 3.0145 us 3.0295 us]
thrpt: [660.18 Kelem/s 663.47 Kelem/s 666.76 Kelem/s]
change:
time: [-94.040% -93.910% -93.762%] (p = 0.00 < 0.05)
thrpt: [+1503.1% +1542.0% +1578.0%]
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high severe
multi-threaded-traps/4 time: [5.5768 us 5.6052 us 5.6364 us]
thrpt: [709.68 Kelem/s 713.63 Kelem/s 717.25 Kelem/s]
change:
time: [-93.193% -93.121% -93.052%] (p = 0.00 < 0.05)
thrpt: [+1339.2% +1353.6% +1369.1%]
Performance has improved.
multi-threaded-traps/8 time: [8.6408 us 9.1212 us 9.5438 us]
thrpt: [838.24 Kelem/s 877.08 Kelem/s 925.84 Kelem/s]
change:
time: [-94.754% -94.473% -94.202%] (p = 0.00 < 0.05)
thrpt: [+1624.7% +1709.2% +1806.1%]
Performance has improved.
multi-threaded-traps/16 time: [10.152 us 10.840 us 11.545 us]
thrpt: [1.3858 Melem/s 1.4760 Melem/s 1.5761 Melem/s]
change:
time: [-97.042% -96.823% -96.577%] (p = 0.00 < 0.05)
thrpt: [+2821.5% +3048.1% +3281.1%]
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
many-modules-registered-traps/1
time: [2.6278 us 2.6361 us 2.6447 us]
thrpt: [378.11 Kelem/s 379.35 Kelem/s 380.55 Kelem/s]
change:
time: [-85.311% -85.108% -84.909%] (p = 0.00 < 0.05)
thrpt: [+562.65% +571.51% +580.76%]
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
3 (3.00%) high mild
6 (6.00%) high severe
many-modules-registered-traps/8
time: [2.6294 us 2.6460 us 2.6623 us]
thrpt: [3.0049 Melem/s 3.0235 Melem/s 3.0425 Melem/s]
change:
time: [-85.895% -85.485% -85.022%] (p = 0.00 < 0.05)
thrpt: [+567.63% +588.95% +608.95%]
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
3 (3.00%) high mild
5 (5.00%) high severe
many-modules-registered-traps/64
time: [2.6218 us 2.6329 us 2.6452 us]
thrpt: [24.195 Melem/s 24.308 Melem/s 24.411 Melem/s]
change:
time: [-93.629% -93.551% -93.470%] (p = 0.00 < 0.05)
thrpt: [+1431.4% +1450.6% +1469.5%]
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
many-modules-registered-traps/512
time: [2.6569 us 2.6737 us 2.6923 us]
thrpt: [190.17 Melem/s 191.50 Melem/s 192.71 Melem/s]
change:
time: [-99.277% -99.268% -99.260%] (p = 0.00 < 0.05)
thrpt: [+13417% +13566% +13731%]
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild
many-modules-registered-traps/4096
time: [2.7258 us 2.7390 us 2.7535 us]
thrpt: [1.4876 Gelem/s 1.4955 Gelem/s 1.5027 Gelem/s]
change:
time: [-99.956% -99.955% -99.955%] (p = 0.00 < 0.05)
thrpt: [+221417% +223380% +224881%]
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
many-stack-frames-traps/1
time: [1.4658 us 1.4719 us 1.4784 us]
thrpt: [676.39 Kelem/s 679.38 Kelem/s 682.21 Kelem/s]
change:
time: [-90.368% -89.947% -89.586%] (p = 0.00 < 0.05)
thrpt: [+860.23% +894.72% +938.21%]
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
5 (5.00%) high mild
3 (3.00%) high severe
many-stack-frames-traps/8
time: [2.4772 us 2.4870 us 2.4973 us]
thrpt: [3.2034 Melem/s 3.2167 Melem/s 3.2294 Melem/s]
change:
time: [-85.550% -85.370% -85.199%] (p = 0.00 < 0.05)
thrpt: [+575.65% +583.51% +592.03%]
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
4 (4.00%) high mild
4 (4.00%) high severe
many-stack-frames-traps/64
time: [10.109 us 10.171 us 10.236 us]
thrpt: [6.2525 Melem/s 6.2925 Melem/s 6.3309 Melem/s]
change:
time: [-78.144% -77.797% -77.336%] (p = 0.00 < 0.05)
thrpt: [+341.22% +350.38% +357.55%]
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
5 (5.00%) high mild
2 (2.00%) high severe
many-stack-frames-traps/512
time: [126.16 us 126.54 us 126.96 us]
thrpt: [4.0329 Melem/s 4.0461 Melem/s 4.0583 Melem/s]
change:
time: [-65.364% -64.933% -64.453%] (p = 0.00 < 0.05)
thrpt: [+181.32% +185.17% +188.71%]
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high severe
```
</details>
<h3>Calls</h3>
There is, however, a small regression in raw Wasm-to-host and host-to-Wasm call
performance due to the new trampolines. It seems to be on the order of about 2-10
nanoseconds per call, depending on the benchmark.
I believe this regression is ultimately acceptable because
1. this overhead will be vastly dominated by whatever work a non-nop callee
actually does,
2. we will need these trampolines, or something like them, when implementing the
Wasm exceptions proposal to do things like translate Wasm's exceptions into
Rust's `Result`s,
3. and because the performance improvements to trapping and capturing stack
traces are of a much larger magnitude than these call regressions.
<details>
```
sync/no-hook/host-to-wasm - typed - nop
time: [28.683 ns 28.757 ns 28.844 ns]
change: [+16.472% +17.183% +17.904%] (p = 0.00 < 0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low mild
4 (4.00%) high mild
5 (5.00%) high severe
sync/no-hook/host-to-wasm - untyped - nop
time: [42.515 ns 42.652 ns 42.841 ns]
change: [+12.371% +14.614% +17.462%] (p = 0.00 < 0.05)
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
1 (1.00%) high mild
10 (10.00%) high severe
sync/no-hook/host-to-wasm - unchecked - nop
time: [33.936 ns 34.052 ns 34.179 ns]
change: [+25.478% +26.938% +28.369%] (p = 0.00 < 0.05)
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
7 (7.00%) high mild
2 (2.00%) high severe
sync/no-hook/host-to-wasm - typed - nop-params-and-results
time: [34.290 ns 34.388 ns 34.502 ns]
change: [+40.802% +42.706% +44.526%] (p = 0.00 < 0.05)
Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
5 (5.00%) high mild
8 (8.00%) high severe
sync/no-hook/host-to-wasm - untyped - nop-params-and-results
time: [62.546 ns 62.721 ns 62.919 ns]
change: [+2.5014% +3.6319% +4.8078%] (p = 0.00 < 0.05)
Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
2 (2.00%) high mild
10 (10.00%) high severe
sync/no-hook/host-to-wasm - unchecked - nop-params-and-results
time: [42.609 ns 42.710 ns 42.831 ns]
change: [+20.966% +22.282% +23.475%] (p = 0.00 < 0.05)
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
4 (4.00%) high mild
7 (7.00%) high severe
sync/hook-sync/host-to-wasm - typed - nop
time: [29.546 ns 29.675 ns 29.818 ns]
change: [+20.693% +21.794% +22.836%] (p = 0.00 < 0.05)
Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
sync/hook-sync/host-to-wasm - untyped - nop
time: [45.448 ns 45.699 ns 45.961 ns]
change: [+17.204% +18.514% +19.590%] (p = 0.00 < 0.05)
Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
4 (4.00%) high mild
10 (10.00%) high severe
sync/hook-sync/host-to-wasm - unchecked - nop
time: [34.334 ns 34.437 ns 34.558 ns]
change: [+23.225% +24.477% +25.886%] (p = 0.00 < 0.05)
Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
5 (5.00%) high mild
7 (7.00%) high severe
sync/hook-sync/host-to-wasm - typed - nop-params-and-results
time: [36.594 ns 36.763 ns 36.974 ns]
change: [+41.967% +47.261% +52.086%] (p = 0.00 < 0.05)
Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
3 (3.00%) high mild
9 (9.00%) high severe
sync/hook-sync/host-to-wasm - untyped - nop-params-and-results
time: [63.541 ns 63.831 ns 64.194 ns]
change: [-4.4337% -0.6855% +2.7134%] (p = 0.73 > 0.05)
No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
6 (6.00%) high mild
2 (2.00%) high severe
sync/hook-sync/host-to-wasm - unchecked - nop-params-and-results
time: [43.968 ns 44.169 ns 44.437 ns]
change: [+18.772% +21.802% +24.623%] (p = 0.00 < 0.05)
Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
3 (3.00%) high mild
12 (12.00%) high severe
async/no-hook/host-to-wasm - typed - nop
time: [4.9612 us 4.9743 us 4.9889 us]
change: [+9.9493% +11.911% +13.502%] (p = 0.00 < 0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
6 (6.00%) high mild
4 (4.00%) high severe
async/no-hook/host-to-wasm - untyped - nop
time: [5.0030 us 5.0211 us 5.0439 us]
change: [+10.841% +11.873% +12.977%] (p = 0.00 < 0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
3 (3.00%) high mild
7 (7.00%) high severe
async/no-hook/host-to-wasm - typed - nop-params-and-results
time: [4.9273 us 4.9468 us 4.9700 us]
change: [+4.7381% +6.8445% +8.8238%] (p = 0.00 < 0.05)
Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
5 (5.00%) high mild
9 (9.00%) high severe
async/no-hook/host-to-wasm - untyped - nop-params-and-results
time: [5.1151 us 5.1338 us 5.1555 us]
change: [+9.5335% +11.290% +13.044%] (p = 0.00 < 0.05)
Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
3 (3.00%) high mild
13 (13.00%) high severe
async/hook-sync/host-to-wasm - typed - nop
time: [4.9330 us 4.9394 us 4.9467 us]
change: [+10.046% +11.038% +12.035%] (p = 0.00 < 0.05)
Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
5 (5.00%) high mild
7 (7.00%) high severe
async/hook-sync/host-to-wasm - untyped - nop
time: [5.0073 us 5.0183 us 5.0310 us]
change: [+9.3828% +10.565% +11.752%] (p = 0.00 < 0.05)
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
3 (3.00%) high mild
5 (5.00%) high severe
async/hook-sync/host-to-wasm - typed - nop-params-and-results
time: [4.9610 us 4.9839 us 5.0097 us]
change: [+9.0857% +11.513% +14.359%] (p = 0.00 < 0.05)
Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
7 (7.00%) high mild
6 (6.00%) high severe
async/hook-sync/host-to-wasm - untyped - nop-params-and-results
time: [5.0995 us 5.1272 us 5.1617 us]
change: [+9.3600% +11.506% +13.809%] (p = 0.00 < 0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
6 (6.00%) high mild
4 (4.00%) high severe
async-pool/no-hook/host-to-wasm - typed - nop
time: [2.4242 us 2.4316 us 2.4396 us]
change: [+7.8756% +8.8803% +9.8346%] (p = 0.00 < 0.05)
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
5 (5.00%) high mild
3 (3.00%) high severe
async-pool/no-hook/host-to-wasm - untyped - nop
time: [2.5102 us 2.5155 us 2.5210 us]
change: [+12.130% +13.194% +14.270%] (p = 0.00 < 0.05)
Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
4 (4.00%) high mild
8 (8.00%) high severe
async-pool/no-hook/host-to-wasm - typed - nop-params-and-results
time: [2.4203 us 2.4310 us 2.4440 us]
change: [+4.0380% +6.3623% +8.7534%] (p = 0.00 < 0.05)
Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
5 (5.00%) high mild
9 (9.00%) high severe
async-pool/no-hook/host-to-wasm - untyped - nop-params-and-results
time: [2.5501 us 2.5593 us 2.5700 us]
change: [+8.8802% +10.976% +12.937%] (p = 0.00 < 0.05)
Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
5 (5.00%) high mild
11 (11.00%) high severe
async-pool/hook-sync/host-to-wasm - typed - nop
time: [2.4135 us 2.4190 us 2.4254 us]
change: [+8.3640% +9.3774% +10.435%] (p = 0.00 < 0.05)
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
6 (6.00%) high mild
5 (5.00%) high severe
async-pool/hook-sync/host-to-wasm - untyped - nop
time: [2.5172 us 2.5248 us 2.5357 us]
change: [+11.543% +12.750% +13.982%] (p = 0.00 < 0.05)
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) high mild
7 (7.00%) high severe
async-pool/hook-sync/host-to-wasm - typed - nop-params-and-results
time: [2.4214 us 2.4353 us 2.4532 us]
change: [+1.5158% +5.0872% +8.6765%] (p = 0.00 < 0.05)
Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
2 (2.00%) high mild
13 (13.00%) high severe
async-pool/hook-sync/host-to-wasm - untyped - nop-params-and-results
time: [2.5499 us 2.5607 us 2.5748 us]
change: [+10.146% +12.459% +14.919%] (p = 0.00 < 0.05)
Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
3 (3.00%) high mild
15 (15.00%) high severe
sync/no-hook/wasm-to-host - nop - typed
time: [6.6135 ns 6.6288 ns 6.6452 ns]
change: [+37.927% +38.837% +39.869%] (p = 0.00 < 0.05)
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
2 (2.00%) high mild
5 (5.00%) high severe
sync/no-hook/wasm-to-host - nop-params-and-results - typed
time: [15.930 ns 15.993 ns 16.067 ns]
change: [+3.9583% +5.6286% +7.2430%] (p = 0.00 < 0.05)
Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
11 (11.00%) high mild
1 (1.00%) high severe
sync/no-hook/wasm-to-host - nop - untyped
time: [20.596 ns 20.640 ns 20.690 ns]
change: [+4.3293% +5.2047% +6.0935%] (p = 0.00 < 0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
5 (5.00%) high mild
5 (5.00%) high severe
sync/no-hook/wasm-to-host - nop-params-and-results - untyped
time: [42.659 ns 42.882 ns 43.159 ns]
change: [-2.1466% -0.5079% +1.2554%] (p = 0.58 > 0.05)
No change in performance detected.
Found 15 outliers among 100 measurements (15.00%)
1 (1.00%) high mild
14 (14.00%) high severe
sync/no-hook/wasm-to-host - nop - unchecked
time: [10.671 ns 10.691 ns 10.713 ns]
change: [+83.911% +87.620% +92.062%] (p = 0.00 < 0.05)
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
2 (2.00%) high mild
7 (7.00%) high severe
sync/no-hook/wasm-to-host - nop-params-and-results - unchecked
time: [11.136 ns 11.190 ns 11.263 ns]
change: [-29.719% -28.446% -27.029%] (p = 0.00 < 0.05)
Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
4 (4.00%) high mild
10 (10.00%) high severe
sync/hook-sync/wasm-to-host - nop - typed
time: [6.7964 ns 6.8087 ns 6.8226 ns]
change: [+21.531% +24.206% +27.331%] (p = 0.00 < 0.05)
Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
4 (4.00%) high mild
10 (10.00%) high severe
sync/hook-sync/wasm-to-host - nop-params-and-results - typed
time: [15.865 ns 15.921 ns 15.985 ns]
change: [+4.8466% +6.3330% +7.8317%] (p = 0.00 < 0.05)
Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
3 (3.00%) high mild
13 (13.00%) high severe
sync/hook-sync/wasm-to-host - nop - untyped
time: [21.505 ns 21.587 ns 21.677 ns]
change: [+8.0908% +9.1943% +10.254%] (p = 0.00 < 0.05)
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
4 (4.00%) high mild
4 (4.00%) high severe
sync/hook-sync/wasm-to-host - nop-params-and-results - untyped
time: [44.018 ns 44.128 ns 44.261 ns]
change: [-1.4671% -0.0458% +1.2443%] (p = 0.94 > 0.05)
No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
5 (5.00%) high mild
9 (9.00%) high severe
sync/hook-sync/wasm-to-host - nop - unchecked
time: [11.264 ns 11.326 ns 11.387 ns]
change: [+80.225% +81.659% +83.068%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe
sync/hook-sync/wasm-to-host - nop-params-and-results - unchecked
time: [11.816 ns 11.865 ns 11.920 ns]
change: [-29.152% -28.040% -26.957%] (p = 0.00 < 0.05)
Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
8 (8.00%) high mild
6 (6.00%) high severe
async/no-hook/wasm-to-host - nop - typed
time: [6.6221 ns 6.6385 ns 6.6569 ns]
change: [+43.618% +44.755% +45.965%] (p = 0.00 < 0.05)
Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
6 (6.00%) high mild
7 (7.00%) high severe
async/no-hook/wasm-to-host - nop-params-and-results - typed
time: [15.884 ns 15.929 ns 15.983 ns]
change: [+3.5987% +5.2053% +6.7846%] (p = 0.00 < 0.05)
Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
3 (3.00%) high mild
13 (13.00%) high severe
async/no-hook/wasm-to-host - nop - untyped
time: [20.615 ns 20.702 ns 20.821 ns]
change: [+6.9799% +8.1212% +9.2819%] (p = 0.00 < 0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) high mild
8 (8.00%) high severe
async/no-hook/wasm-to-host - nop-params-and-results - untyped
time: [41.956 ns 42.207 ns 42.521 ns]
change: [-4.3057% -2.7730% -1.2428%] (p = 0.00 < 0.05)
Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
3 (3.00%) high mild
11 (11.00%) high severe
async/no-hook/wasm-to-host - nop - unchecked
time: [10.440 ns 10.474 ns 10.513 ns]
change: [+83.959% +85.826% +87.541%] (p = 0.00 < 0.05)
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
5 (5.00%) high mild
6 (6.00%) high severe
async/no-hook/wasm-to-host - nop-params-and-results - unchecked
time: [11.476 ns 11.512 ns 11.554 ns]
change: [-29.857% -28.383% -26.978%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
1 (1.00%) low mild
6 (6.00%) high mild
5 (5.00%) high severe
async/no-hook/wasm-to-host - nop - async-typed
time: [26.427 ns 26.478 ns 26.532 ns]
change: [+6.5730% +7.4676% +8.3983%] (p = 0.00 < 0.05)
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
2 (2.00%) high mild
7 (7.00%) high severe
async/no-hook/wasm-to-host - nop-params-and-results - async-typed
time: [28.557 ns 28.693 ns 28.880 ns]
change: [+1.9099% +3.7332% +5.9731%] (p = 0.00 < 0.05)
Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
1 (1.00%) high mild
14 (14.00%) high severe
async/hook-sync/wasm-to-host - nop - typed
time: [6.7488 ns 6.7630 ns 6.7784 ns]
change: [+19.935% +22.080% +23.683%] (p = 0.00 < 0.05)
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
4 (4.00%) high mild
5 (5.00%) high severe
async/hook-sync/wasm-to-host - nop-params-and-results - typed
time: [15.928 ns 16.031 ns 16.149 ns]
change: [+5.5188% +6.9567% +8.3839%] (p = 0.00 < 0.05)
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
9 (9.00%) high mild
2 (2.00%) high severe
async/hook-sync/wasm-to-host - nop - untyped
time: [21.930 ns 22.114 ns 22.296 ns]
change: [+4.6674% +7.7588% +10.375%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
async/hook-sync/wasm-to-host - nop-params-and-results - untyped
time: [42.684 ns 42.858 ns 43.081 ns]
change: [-5.2957% -3.4693% -1.6217%] (p = 0.00 < 0.05)
Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
2 (2.00%) high mild
12 (12.00%) high severe
async/hook-sync/wasm-to-host - nop - unchecked
time: [11.026 ns 11.053 ns 11.086 ns]
change: [+70.751% +72.378% +73.961%] (p = 0.00 < 0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
5 (5.00%) high mild
5 (5.00%) high severe
async/hook-sync/wasm-to-host - nop-params-and-results - unchecked
time: [11.840 ns 11.900 ns 11.982 ns]
change: [-27.977% -26.584% -24.887%] (p = 0.00 < 0.05)
Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
3 (3.00%) high mild
15 (15.00%) high severe
async/hook-sync/wasm-to-host - nop - async-typed
time: [27.601 ns 27.709 ns 27.882 ns]
change: [+8.1781% +9.1102% +10.030%] (p = 0.00 < 0.05)
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
2 (2.00%) low mild
3 (3.00%) high mild
6 (6.00%) high severe
async/hook-sync/wasm-to-host - nop-params-and-results - async-typed
time: [28.955 ns 29.174 ns 29.413 ns]
change: [+1.1226% +3.0366% +5.1126%] (p = 0.00 < 0.05)
Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
7 (7.00%) high mild
6 (6.00%) high severe
async-pool/no-hook/wasm-to-host - nop - typed
time: [6.5626 ns 6.5733 ns 6.5851 ns]
change: [+40.561% +42.307% +44.514%] (p = 0.00 < 0.05)
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
5 (5.00%) high mild
4 (4.00%) high severe
async-pool/no-hook/wasm-to-host - nop-params-and-results - typed
time: [15.820 ns 15.886 ns 15.969 ns]
change: [+4.1044% +5.7928% +7.7122%] (p = 0.00 < 0.05)
Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
4 (4.00%) high mild
13 (13.00%) high severe
async-pool/no-hook/wasm-to-host - nop - untyped
time: [20.481 ns 20.521 ns 20.566 ns]
change: [+6.7962% +7.6950% +8.7612%] (p = 0.00 < 0.05)
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
6 (6.00%) high mild
5 (5.00%) high severe
async-pool/no-hook/wasm-to-host - nop-params-and-results - untyped
time: [41.834 ns 41.998 ns 42.189 ns]
change: [-3.8185% -2.2687% -0.7541%] (p = 0.01 < 0.05)
Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
3 (3.00%) high mild
10 (10.00%) high severe
async-pool/no-hook/wasm-to-host - nop - unchecked
time: [10.353 ns 10.380 ns 10.414 ns]
change: [+82.042% +84.591% +87.205%] (p = 0.00 < 0.05)
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
4 (4.00%) high mild
3 (3.00%) high severe
async-pool/no-hook/wasm-to-host - nop-params-and-results - unchecked
time: [11.123 ns 11.168 ns 11.228 ns]
change: [-30.813% -29.285% -27.874%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
11 (11.00%) high mild
1 (1.00%) high severe
async-pool/no-hook/wasm-to-host - nop - async-typed
time: [27.442 ns 27.528 ns 27.638 ns]
change: [+7.5215% +9.9795% +12.266%] (p = 0.00 < 0.05)
Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
3 (3.00%) high mild
15 (15.00%) high severe
async-pool/no-hook/wasm-to-host - nop-params-and-results - async-typed
time: [29.014 ns 29.148 ns 29.312 ns]
change: [+2.0227% +3.4722% +4.9047%] (p = 0.00 < 0.05)
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
6 (6.00%) high mild
1 (1.00%) high severe
async-pool/hook-sync/wasm-to-host - nop - typed
time: [6.7916 ns 6.8116 ns 6.8325 ns]
change: [+20.937% +22.050% +23.281%] (p = 0.00 < 0.05)
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
5 (5.00%) high mild
6 (6.00%) high severe
async-pool/hook-sync/wasm-to-host - nop-params-and-results - typed
time: [15.917 ns 15.975 ns 16.051 ns]
change: [+4.6404% +6.4217% +8.3075%] (p = 0.00 < 0.05)
Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
5 (5.00%) high mild
11 (11.00%) high severe
async-pool/hook-sync/wasm-to-host - nop - untyped
time: [21.558 ns 21.612 ns 21.679 ns]
change: [+8.1158% +9.1409% +10.217%] (p = 0.00 < 0.05)
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
2 (2.00%) high mild
7 (7.00%) high severe
async-pool/hook-sync/wasm-to-host - nop-params-and-results - untyped
time: [42.475 ns 42.614 ns 42.775 ns]
change: [-6.3613% -4.4709% -2.7647%] (p = 0.00 < 0.05)
Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
3 (3.00%) high mild
15 (15.00%) high severe
async-pool/hook-sync/wasm-to-host - nop - unchecked
time: [11.150 ns 11.195 ns 11.247 ns]
change: [+74.424% +77.056% +79.811%] (p = 0.00 < 0.05)
Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
3 (3.00%) high mild
11 (11.00%) high severe
async-pool/hook-sync/wasm-to-host - nop-params-and-results - unchecked
time: [11.639 ns 11.695 ns 11.760 ns]
change: [-30.212% -29.023% -27.954%] (p = 0.00 < 0.05)
Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
7 (7.00%) high mild
8 (8.00%) high severe
async-pool/hook-sync/wasm-to-host - nop - async-typed
time: [27.480 ns 27.712 ns 27.984 ns]
change: [+2.9764% +6.5061% +9.8914%] (p = 0.00 < 0.05)
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
6 (6.00%) high mild
2 (2.00%) high severe
async-pool/hook-sync/wasm-to-host - nop-params-and-results - async-typed
time: [29.218 ns 29.380 ns 29.600 ns]
change: [+5.2283% +7.7247% +10.822%] (p = 0.00 < 0.05)
Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
2 (2.00%) high mild
14 (14.00%) high severe
```
</details>
* Add s390x support for frame pointer-based stack walking
* wasmtime: Allow `Caller::get_export` to get all exports
* fuzzing: Add a fuzz target to check that our stack traces are correct
We generate Wasm modules that keep track of their own stack as they call and
return between functions, and then we periodically check that if the host
captures a backtrace, it matches what the Wasm module has recorded.
* Remove VM offsets for `VMHostFuncContext` since it isn't used by JIT code
* Add doc comment with stack walking implementation notes
* Document the extra state that can be passed to `wasmtime_runtime::Backtrace` methods
* Add extensive comments for stack walking function
* Factor architecture-specific bits of stack walking out into modules
* Initialize store-related fields in a vmctx to null when there is no store yet
Rather than leaving them as uninitialized data.
* Use `set_callee` instead of manually setting the vmctx field
* Use a more informative compile error message for unsupported architectures
* Document unsafety of `prepare_host_to_wasm_trampoline`
* Use `bti c` instead of `hint #34` in inline aarch64 assembly
* Remove outdated TODO comment
* Remove setting of `last_wasm_exit_fp` in `set_jit_trap`
This is no longer needed as the value is plumbed through to the backtrace code
directly now.
* Only set the stack limit once, in the face of re-entrancy into Wasm
* Add comments for s390x-specific stack walking bits
* Use the helper macro for all libcalls
If we forget to use it and then trigger a GC from the libcall, we could miss
stack frames when walking the stack, fail to find live GC refs, and then get
use-after-free bugs. It is much less risky to always use the helper macro that
takes care of all of that for us.
* Use the `asm_sym!` macro in Wasm-to-libcall trampolines
This macro handles the macOS-specific underscore prefix stuff for us.
* wasmtime: add size and align to `externref` assertion error message
* Extend the `stacks` fuzzer to have host frames in between Wasm frames
This way we get one or more contiguous sequences of Wasm frames on the stack,
instead of exactly one.
* Add documentation for aarch64-specific backtrace helpers
* Clarify that we only support little-endian aarch64 in trampoline comment
* Use `.machine z13` in s390x assembly file
Since apparently our CI machines have pretty old assemblers that don't have
`.machine z14`. This should be fine though since these trampolines don't make
use of anything that is introduced in z14.
* Fix aarch64 build
* Fix macOS build
* Document the `asm_sym!` macro
* Add windows support to the `wasmtime-asm-macros` crate
* Add windows support to host<--->Wasm trampolines
* Fix trap handler build on windows
* Run `rustfmt` on s390x trampoline source file
* Temporarily disable some assertions about a trap's backtrace in the component model tests
Follow up to re-enable this and fix the associated issue:
https://github.com/bytecodealliance/wasmtime/issues/4535
* Refactor libcall definitions with fewer macros
This refactors the `libcall!` macro to use the
`foreach_builtin_function!` macro to define all of the trampolines.
Additionally, the macro previously surrounding each libcall is no longer
necessary, which helps avoid an excess of macros.
* Use `VMOpaqueContext::from_vm_host_func_context` in `VMHostFuncContext::new`
* Move `backtrace` module to be submodule of `traphandlers`
This avoids making some things `pub(crate)` in `traphandlers` that really
shouldn't be.
* Fix macOS aarch64 build
* Use "i64" instead of "word" in aarch64-specific file
* Save/restore entry SP and exit FP/return pointer in the face of panicking imported host functions
Also clean up assertions surrounding our saved entry/exit registers.
* Put "typed" vs "untyped" in the same position of call benchmark names
Regardless if we are doing wasm-to-host or host-to-wasm
* Fix stacks test case generator build for new `wasm-encoder`
* Fix build for s390x
* Expand libcalls in s390x asm
* Disable more parts of component tests now that backtrace assertions are a bit tighter
* Remove assertion that can maybe fail on s390x
Co-authored-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Co-authored-by: Alex Crichton <alex@alexcrichton.com>
|
||
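The frame-pointer-based stack walking described in the commit above can be pictured with a small sketch. This is not Wasmtime's actual implementation: the `walk_wasm_frames` helper, the `exit_fp`/`entry_sp` values, and the `+ 8` return-address offset are simplifying assumptions for a typical x86-64/aarch64 frame layout.

```rust
/// Minimal sketch of walking a chain of Wasm frames by following saved frame
/// pointers, bounded by the stack pointer recorded when Wasm was first entered.
/// Hypothetical helper; offsets and bounds checks are illustrative only.
unsafe fn walk_wasm_frames(exit_fp: usize, entry_sp: usize, mut visit: impl FnMut(usize)) {
    let mut fp = exit_fp;
    // Stop once we walk past the host frame that originally entered Wasm.
    while fp != 0 && fp < entry_sp {
        // On common frame layouts the return address is stored one word above
        // the saved frame pointer.
        let return_addr = *((fp + 8) as *const usize);
        if return_addr == 0 {
            break;
        }
        visit(return_addr);
        // Follow the caller's saved frame pointer to reach the next frame.
        fp = *(fp as *const usize);
    }
}
```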
|
|
174b60dcf7 |
Add *.wast support for invoking components (#4526)
This commit builds on bytecodealliance/wasm-tools#690 to add support for testing the component model by executing functions when running `*.wast` files. This support is all built on #4442, as functions are invoked through a "dynamic" API. Right now the testing and integration are fairly crude, but I'm hoping that we can improve them over time as necessary. For now this should provide a more convenient syntax for unit tests and the like. |
||
|
|
1321c234e5 |
Remove dependency on more-asserts (#4408)
* Remove dependency on `more-asserts` In my recent adventures to do a bit of gardening on our dependencies I noticed that there's a new major version for the `more-asserts` crate. Instead of updating to this though I've opted to instead remove the dependency since I don't think we heavily lean on this crate and otherwise one-off prints are probably sufficient to avoid the need for pulling in a whole crate for this. * Remove exemption for `more-asserts` |
||
|
|
97894bc65e |
Add initial support for fused adapter trampolines (#4501)
* Add initial support for fused adapter trampolines This commit lands a significant new piece of functionality to Wasmtime's implementation of the component model in the form of the implementation of fused adapter trampolines. Internally within a component core wasm modules can communicate with each other by having their exports `canon lift`'d to get `canon lower`'d into a different component. This signifies that two components are communicating through a statically known interface via the canonical ABI at this time. Previously Wasmtime was able to identify that this communication was happening but it simply panicked with `unimplemented!` upon seeing it. This commit is the beginning of filling out this panic location with an actual implementation. The implementation route chosen here for fused adapters is to use a WebAssembly module itself for the implementation. This means that, at compile time of a component, Wasmtime is generating core WebAssembly modules which then get recursively compiled within Wasmtime as well. The choice to use WebAssembly itself as the implementation of fused adapters stems from a few motivations: * This does not represent a significant increase in the "trusted compiler base" of Wasmtime. Getting the Wasm -> CLIF translation correct once is hard enough much less for an entirely different IR to CLIF. By generating WebAssembly no new interactions with Cranelift are added which drastically reduces the possibilities for mistakes. * Using WebAssembly means that component adapters are insulated from miscompilations and mistakes. If something goes wrong it's defined well within the WebAssembly specification how it goes wrong and what happens as a result. This means that the "blast zone" for a wrong adapter is the component instance but not the entire host itself. Accesses to linear memory are guaranteed to be in-bounds and otherwise handled via well-defined traps. * A fully-finished fused adapter compiler is expected to be a significant and quite complex component of Wasmtime. Functionality along these lines is expected to be needed for Web-based polyfills of the component model and by using core WebAssembly it provides the opportunity to share code between Wasmtime and these polyfills for the component model. * Finally the runtime implementation of managing WebAssembly modules is already implemented and quite easy to integrate with, so representing fused adapters with WebAssembly results in very little extra support necessary for the runtime implementation of instantiating and managing a component. The compiler added in this commit is dubbed Wasmtime's Fused Adapter Compiler of Trampolines (FACT) because who doesn't like deriving a name from an acronym. Currently the trampoline compiler is limited in its support for interface types and only supports a few primitives. I plan on filing future PRs to flesh out the support here for all the variants of `InterfaceType`. For now this PR is primarily focused on all of the other infrastructure for the addition of a trampoline compiler. With the choice to use core WebAssembly to implement fused adapters it means that adapters need to be inserted into a module. Unfortunately adapters cannot all go into a single WebAssembly module because adapters themselves have dependencies which may be provided transitively through instances that were instantiated with other adapters. This means that a significant chunk of this PR (`adapt.rs`) is dedicated to determining precisely which adapters go into precisely which adapter modules. 
This partitioning process attempts to make large modules wherever it can to cut down on core wasm instantiations but is likely not optimal as it's just a simple heuristic today. With all of this added together it's now possible to start writing `*.wast` tests that internally have adapted modules communicating with one another. A `fused.wast` test suite was added as part of this PR which is the beginning of tests for the support of the fused adapter compiler added in this PR. Currently this is primarily testing some various topologies of adapters along with direct/indirect modes. This will grow many more tests over time as more types are supported. Overall I'm not 100% satisfied with the testing story of this PR. When a test fails it's very difficult to debug since everything is written in the text format of WebAssembly meaning there's no "conveniences" to print out the state of the world when things go wrong and easily debug. I think this will become even more apparent as more tests are written for more types in subsequent PRs. At this time though I know of no better alternative other than leaning pretty heavily on fuzz-testing to ensure this is all exercised. * Fix an unused field warning * Fix tests in `wasmtime-runtime` * Add some more tests for compiled trampolines * Remap exports when injecting adapters The exports of a component were accidentally left unmapped which meant that they indexed the instance indexes pre-adapter module insertion. * Fix typo * Rebase conflicts |
||
|
|
9c43749dfe |
[RFC] Dynamic Vector Support (#4200)
Introduce a new concept in the IR that allows a producer to create dynamic vector types. An IR function can now contain global value(s) that represent a dynamic scaling factor, for a given fixed-width vector type. A dynamic type is then created by 'multiplying' the corresponding global value with a fixed-width type. These new types can be used just like the existing types and the type system has a set of hard-coded dynamic types, such as I32X4XN, which the user defined types map onto. The dynamic types are also used explicitly to create dynamic stack slots, which have no set size like their existing counterparts. New IR instructions are added to access these new stack entities. Currently, during codegen, the dynamic scaling factor has to be lowered to a constant so the dynamic slots do eventually have a compile-time known size, as do spill slots. The current lowering for aarch64 just targets Neon, using a dynamic scale of 1. Copyright (c) 2022, Arm Limited. |
||
|
|
9ae060a12a |
Update some dependency versions used by Wasmtime (#4405)
No major motivation here, mostly just dependency gardening. |
||
|
|
41ba851a95 |
Bump versions of wasm-tools crates (#4380)
* Bump versions of wasm-tools crates Note that this leaves new features in the component model, outer type aliases for core wasm types, unimplemented for now. * Move to crates.io-based versions of tools |
||
|
|
7c428bbd62 |
Bump Wasmtime to 0.40.0 (#4378)
Co-authored-by: Wasmtime Publish <wasmtime-publish@users.noreply.github.com> |
||
|
|
f0278c5db7 |
Implement canon lower of a canon lift function in the same component (#4347)
* Implement `canon lower` of a `canon lift` function in the same component
This commit implements the "degenerate" logic for implementing a function within a component that is lifted and then immediately lowered again. In this situation the lowered function will immediately generate a trap and doesn't need to implement anything else. The implementation in this commit is somewhat heavyweight, but I think it is probably justified more so by future additions to the component model than by what exactly is here right now. It's not expected that this "always trap" functionality will really be used all that often since it would generally mean a buggy component, but the functionality plumbed through here is hopefully going to be useful for implementing component-to-component adapter trampolines. Specifically this commit implements a strategy where the `canon.lower`'d function is generated by Cranelift and simply has a single trap instruction when called, doing nothing else. The main complexity comes from juggling around all the data associated with these functions, primarily plumbing the traps through into the `ModuleRegistry` to ensure that the global `is_wasm_trap_pc` function returns `true` and that, at runtime, when we look up information about the trap it's all readily available (e.g. translating the trapping pc to a `TrapCode`).
* Fix non-component build
* Fix some offset calculations
* Only create one "always trap" per signature
Use an internal map to deduplicate during compilation (a minimal sketch of this idea follows below). |
||
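The deduplication mentioned in the final bullet above can be sketched as a simple cache keyed by signature index. The types and names here are hypothetical stand-ins, not Wasmtime's actual internals.

```rust
use std::collections::HashMap;

/// Placeholder for a compiled "always trap" stub.
struct AlwaysTrap;

/// Hypothetical cache: compile at most one trapping stub per function signature.
#[derive(Default)]
struct AlwaysTrapCache {
    by_signature: HashMap<u32, usize>, // signature index -> slot in `compiled`
    compiled: Vec<AlwaysTrap>,
}

impl AlwaysTrapCache {
    fn get_or_compile(&mut self, sig: u32, compile: impl FnOnce() -> AlwaysTrap) -> usize {
        if let Some(&idx) = self.by_signature.get(&sig) {
            return idx; // reuse the stub already compiled for this signature
        }
        let idx = self.compiled.len();
        self.compiled.push(compile());
        self.by_signature.insert(sig, idx);
        idx
    }
}
```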
|
|
c1b3962f7b |
Implement lowered-then-lifted functions (#4327)
* Implement lowered-then-lifted functions This commit is a few features bundled into one, culminating in the implementation of lowered-then-lifted functions for the component model. It's probably not going to be used all that often but this is possible within a valid component so Wasmtime needs to do something relatively reasonable. The main things implemented in this commit are: * Component instances are now assigned a `RuntimeComponentInstanceIndex` to differentiate each one. This will be used in the future to detect fusion (one instance lowering a function from another instance). For now it's used to allocate separate `VMComponentFlags` for each internal component instance. * The `CoreExport<FuncIndex>` of lowered functions was changed to a `CoreDef` since technically a lowered function can use another lowered function as the callee. This ended up being not too difficult to plumb through as everything else was already in place. * A need arose to compile host-to-wasm trampolines which weren't already present. Currently wasm in a component is always entered through a host-to-wasm trampoline but core wasm modules are the source of all the trampolines. In the case of a lowered-then-lifted function there may not actually be any core wasm modules, so component objects now contain necessary trampolines not otherwise provided by the core wasm objects. This feature required splitting a new function into the `Compiler` trait for creating a host-to-wasm trampoline. After doing this core wasm compilation was also updated to leverage this which further enabled compiling trampolines in parallel as opposed to the previous synchronous compilation. * Review comments |
||
|
|
66b829b1bf |
Change how unwind information is stored on Windows (#4314)
* Change how unwind information is stored on Windows
Unwind information on Windows is stored in two separate locations. The first location is the unwind information itself which corresponds to `UNWIND_INFO`. The second location is a list of `RUNTIME_INFO` structures which point to function bodies and `UNWIND_INFO` structures. Currently in Wasmtime the `UNWIND_INFO` structures are stored just after functions themselves with a somewhat cryptic comment indicating that Windows prefers this (I'm unsure as to the provenance of this comment). The `RUNTIME_INFO` data is then stored in a separate section which has the custom name of `_wasmtime_winx64_unwind`. After my recent foray into trying to debug windows-2022 bad unwind information again I realized though that Windows actually has official sections for these two unwind information items. The `.xdata` section is used to store the `UNWIND_INFO` structures and the `.pdata` section stores the `RUNTIME_INFO` list. To try to be somewhat idiomatic and perhaps one day even hook into standard Windows debugging tools I went ahead and refactored how our unwind information is stored to match this. Perhaps the main benefit of this is that it reduces the size of the read/execute section of the binary. Previously the unwind information was executable since it was stored in the `.text` section, but unnecessarily so. Now it's in a read-only section which is in theory a small amount of hardening. Otherwise though I don't think this will really help all that much with hooking into standard debugging tools like `objdump` because it's all still stored in an ELF file rather than a COFF file.
* Review comments |
||
|
|
3339dd1f01 |
Implement the post-return attribute (#4297)
This commit implements the `post-return` feature of the canonical ABI in the component model. This attribute is an optionally-specified function which is to be executed after the return value has been processed by the caller to optionally clean-up the return value. This enables, for example, returning an allocated string and the host then knows how to clean it up to prevent memory leaks in the original module. The API exposed in this PR changes the prior `TypedFunc::call` API in behavior but not in its signature. Previously the `TypedFunc::call` method would set the `may_enter` flag on the way out, but now that operation is deferred until a new `TypedFunc::post_return` method is called. This means that once a method on an instance is invoked then nothing else can be done on the instance until the `post_return` method is called. Note that the method must be called irrespective of whether the `post-return` canonical ABI option was specified or not. Internally wasm will be invoked if necessary. This is a pretty wonky and unergonomic API to work with. For now I couldn't think of a better alternative that improved on the ergonomics. In the theory that the raw Wasmtime bindings for a component may not be used all that heavily (instead `wit-bindgen` would largely be used) I'm hoping that this isn't too much of an issue in the future. cc #4185 |
||
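As a rough illustration of the call/`post_return` sequencing described above, the sketch below assumes a component instance exporting a function named `greet` that returns a string; the exact embedding API details may differ between Wasmtime versions.

```rust
use anyhow::Result;
use wasmtime::component::TypedFunc;
use wasmtime::Store;

// Hypothetical helper: call an exported component function twice, releasing the
// return value in between via `post_return`.
fn call_twice(store: &mut Store<()>, greet: &TypedFunc<(), (String,)>) -> Result<()> {
    let (msg,) = greet.call(&mut *store, ())?;
    // Until post_return is invoked, nothing else may be called on this instance,
    // whether or not the component actually declared a `post-return` function.
    greet.post_return(&mut *store)?;
    println!("first call returned: {msg}");

    let (msg,) = greet.call(&mut *store, ())?; // now allowed again
    greet.post_return(&mut *store)?;
    println!("second call returned: {msg}");
    Ok(())
}
```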
|
|
6778b4fce2 |
Update comment on stack overflow checking (#4290)
* Update comment on stack overflow checking This commit moves the top-level comment in `crates/cranelift/src/lib.rs` into the location where the global value for the stack limit is generated. Stack overflow checking is pretty localized nowadays so there's not much need to have it at the top of the crate and most of the words there were just adapted to this new location. Closes #4286 * Review comments |
||
|
|
258dc9de42 |
fix(wasmtime): Config methods should be idempotent (#4252)
This commit refactors `Config` to use a separate `CompilerConfig` field instead of operating on `CompilerBuilder` directly, making all of its methods idempotent (see the sketch below). Fixes #4189 |
||
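A minimal sketch of the idempotency idea referenced above: settings are buffered in an intermediate structure keyed by name, so calling a `Config` method twice simply overwrites the previous value instead of mutating a builder twice. The names here are illustrative rather than Wasmtime's exact internals.

```rust
use std::collections::HashMap;

/// Hypothetical buffered compiler settings, applied to the real compiler
/// builder only once, when the engine is constructed.
#[derive(Default, Clone)]
struct CompilerConfig {
    settings: HashMap<&'static str, String>,
}

#[derive(Default)]
struct Config {
    compiler_config: CompilerConfig,
}

impl Config {
    /// Repeated calls overwrite the stored value, so the method is idempotent
    /// with respect to the eventual compiler configuration.
    fn cranelift_opt_level(&mut self, level: &str) -> &mut Self {
        self.compiler_config
            .settings
            .insert("opt_level", level.to_string());
        self
    }
}
```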
|
|
7d7ddceb17 |
Update wasm-tools crates (#4246)
This commit updates the wasm-tools family of crates, notably pulling in the refactorings and updates from bytecodealliance/wasm-tools#621 for the latest iteration of the component model. This commit additionally updates all support for the component model for these changes, notably: * Many bits and pieces of type information was refactored. Many `FooTypeIndex` namings are now `TypeFooIndex`. Additionally there is now `TypeIndex` as well as `ComponentTypeIndex` for the two type index spaces in a component. * A number of new sections are now processed to handle the core and component variants. * Internal maps were split such as the `funcs` map into `component_funcs` and `funcs` (same for `instances`). * Canonical options are now processed individually instead of one bulk `into` definition. Overall this was not a major update to the internals of handling the component model in Wasmtime. Instead this was mostly a surface-level refactoring to make sure that everything lines up with the new binary format for components. * All text syntax used in tests was updated to the new syntax. |
||
|
|
c15c3061ca |
CFI improvements to the AArch64 fiber implementation (#4195)
Now the fiber implementation on AArch64 authenticates function return addresses and includes the relevant BTI instructions, except on macOS. Also, change the locations of the saved FP and LR registers on the fiber stack to make them compliant with the Procedure Call Standard for the Arm 64-bit Architecture. Copyright (c) 2022, Arm Limited. |
||
|
|
2b52f47b83 |
Add shared memories (#4187)
* Add shared memories This change adds the ability to use shared memories in Wasmtime when the [threads proposal] is enabled. Shared memories are annotated as `shared` in the WebAssembly syntax, e.g., `(memory 1 1 shared)`, and are protected from concurrent access during `memory.size` and `memory.grow`. [threads proposal]: https://github.com/WebAssembly/threads/blob/master/proposals/threads/Overview.md In order to implement this in Wasmtime, there are two main cases to cover: - a program may simply create a shared memory and possibly export it; this means that Wasmtime itself must be able to create shared memories - a user may create a shared memory externally and pass it in as an import during instantiation; this is the case when the program contains code like `(import "env" "memory" (memory 1 1 shared))`--this case is handled by a new Wasmtime API type--`SharedMemory` Because of the first case, this change allows any of the current memory-creation mechanisms to work as-is. Wasmtime can still create either static or dynamic memories in either on-demand or pooling modes, and any of these memories can be considered shared. When shared, the `Memory` runtime container will lock appropriately during `memory.size` and `memory.grow` operations; since all memories use this container, it is an ideal place for implementing the locking once and once only. The second case is covered by the new `SharedMemory` structure. It uses the same `Mmap` allocation under the hood as non-shared memories, but allows the user to perform the allocation externally to Wasmtime and share the memory across threads (via an `Arc`). The pointer address to the actual memory is carefully wired through and owned by the `SharedMemory` structure itself. This means that there are differing views of where to access the pointer (i.e., `VMMemoryDefinition`): for owned memories (the default), the `VMMemoryDefinition` is stored directly by the `VMContext`; in the `SharedMemory` case, however, this `VMContext` must point to this separate structure. To ensure that the `VMContext` can always point to the correct `VMMemoryDefinition`, this change alters the `VMContext` structure. Since a `SharedMemory` owns its own `VMMemoryDefinition`, the `defined_memories` table in the `VMContext` becomes a sequence of pointers--in the shared memory case, they point to the `VMMemoryDefinition` owned by the `SharedMemory` and in the owned memory case (i.e., not shared) they point to `VMMemoryDefinition`s stored in a new table, `owned_memories`. This change adds an additional indirection (through the `*mut VMMemoryDefinition` pointer) that could add overhead. Using an imported memory as a proxy, we measured a 1-3% overhead of this approach on the `pulldown-cmark` benchmark. To avoid this, Cranelift-generated code will special-case the owned memory access (i.e., load a pointer directly to the `owned_memories` entry) for `memory.size` so that only shared memories (and imported memories, as before) incur the indirection cost. 
* review: remove thread feature check * review: swap wasmtime-types dependency for existing wasmtime-environ use * review: remove unused VMMemoryUnion * review: reword cross-engine error message * review: improve tests * review: refactor to separate prevent Memory <-> SharedMemory conversion * review: into_shared_memory -> as_shared_memory * review: remove commented out code * review: limit shared min/max to 32 bits * review: skip imported memories * review: imported memories are not owned * review: remove TODO * review: document unsafe send + sync * review: add limiter assertion * review: remove TODO * review: improve tests * review: fix doc test * fix: fixes based on discussion with Alex This changes several key parts: - adds memory indexes to imports and exports - makes `VMMemoryDefinition::current_length` an atomic usize * review: add `Extern::SharedMemory` * review: remove TODO * review: atomically load from VMMemoryDescription in JIT-generated code * review: add test probing the last available memory slot across threads * fix: move assertion to new location due to rebase * fix: doc link * fix: add TODOs to c-api * fix: broken doc link * fix: modify pooling allocator messages in tests * review: make owned_memory_index panic instead of returning an option * review: clarify calculation of num_owned_memories * review: move 'use' to top of file * review: change '*const [u8]' to '*mut [u8]' * review: remove TODO * review: avoid hard-coding memory index * review: remove 'preallocation' parameter from 'Memory::_new' * fix: component model memory length * review: check that shared memory plans are static * review: ignore growth limits for shared memory * review: improve atomic store comment * review: add FIXME for memory growth failure * review: add comment about absence of bounds-checked 'memory.size' * review: make 'current_length()' doc comment more precise * review: more comments related to memory.size non-determinism * review: make 'vmmemory' unreachable for shared memory * review: move code around * review: thread plan through to 'wrap()' * review: disallow shared memory allocation with the pooling allocator |
||
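A short, hedged usage sketch of the embedder-facing `SharedMemory` type described above, creating a shared memory externally and passing it in as an import; exact method names and signatures may vary slightly across Wasmtime versions.

```rust
use anyhow::Result;
use wasmtime::{Config, Engine, Instance, MemoryType, Module, SharedMemory, Store};

fn main() -> Result<()> {
    // Shared memories require the threads proposal to be enabled.
    let mut config = Config::new();
    config.wasm_threads(true);
    let engine = Engine::new(&config)?;

    // Create the shared memory outside of any store so it can be shared across threads.
    let memory = SharedMemory::new(&engine, MemoryType::shared(1, 1))?;

    // A module importing a shared memory, mirroring the `(import "env" "memory" ...)` case above.
    let module = Module::new(
        &engine,
        r#"(module (import "env" "memory" (memory 1 1 shared)))"#,
    )?;

    let mut store = Store::new(&engine, ());
    let _instance = Instance::new(&mut store, &module, &[memory.into()])?;
    Ok(())
}
```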
|
|
55946704cb |
Bump Wasmtime to 0.39.0 (#4225)
Co-authored-by: Wasmtime Publish <wasmtime-publish@users.noreply.github.com> |
||
|
|
3ed6fae7b3 |
Add trampoline compilation support for lowered imports (#4206)
* Add trampoline compilation support for lowered imports This commit adds support to the component model implementation for compiling trampolines suitable for calling host imports. Currently this is purely just the compilation side of things, modifying the wasmtime-cranelift crate and additionally filling out a new `VMComponentOffsets` type (similar to `VMOffsets`). The actual creation of a `VMComponentContext` is still not performed and will be a subsequent PR. Internally though some tests are actually possible with this where we at least assert that compilation of a component and creation of everything in-memory doesn't panic or trip any assertions, so some tests are added here for that as well. * Fix some test errors |
||
|
|
f638b390b6 |
Refactor some internals of wasmtime-cranelift (#4202)
* Split `wasm_to_host_trampoline` into pieces
In the upcoming component model support for imports my plan is to reuse some of these pieces but not the entirety of the current `wasm_to_host_trampoline`. In an effort to make that diff smaller this commit splits up the function preemptively into pieces to get reused later.
* Delete unused `for_each_libcall` macros
Came across this when working on the object support for cranelift.
* Refactor some object creation details
This commit refactors some of the internals around creating an object file in the wasmtime-cranelift integration. The old `ObjectBuilder` is now named `ModuleTextBuilder` and is only used to create the text section rather than other sections too. This helps maintain the invariant that the unwind information section is placed directly after the text section without having an odd API for doing this. Additionally the unwind information creation is moved externally from the `ModuleTextBuilder` to a standalone structure. This separate structure is currently in use in the component model work I'm doing although I may change that to using the `ModuleTextBuilder` instead. In any case it seemed nice to encapsulate all of the unwinding information into one standalone structure. Finally, the insertion of native debug information has been refactored to happen in a new `append_dwarf` method to keep all the dwarf-related stuff together in one place as much as possible.
* Fix a doctest
* Fix a typo |
||
|
|
f4b9020913 |
Change wasm-to-host trampolines to take the values_vec size (#4192)
* Change wasm-to-host trampolines to take the values_vec size
This commit changes the ABI of wasm-to-host trampolines, which are only used right now for functions created with `Func::new`, to pass along the size of the `values_vec` argument. Previously the trampoline simply received `*mut ValRaw` and assumed that it was the appropriate size. By receiving a size as well we can thread through `&mut [ValRaw]` internally instead of `*mut ValRaw`. The original motivation for this is that I'm planning to leverage these trampolines for the component model for host-defined functions. Out of an abundance of caution, to make sure that everything lines up, I wanted to be able to write down asserts about the size received at runtime compared to the size expected. This overall led me to the desire to thread this size parameter through on the assumption that it would not impact performance all that much. I ran two benchmarks locally from the `call.rs` benchmark and got:
* `sync/no-hook/wasm-to-host - nop - unchecked` - no change
* `sync/no-hook/wasm-to-host - nop-params-and-results - unchecked` - 5% slower
This is what I roughly expected in that if nothing actually reads the new parameter (e.g. no arguments) then threading through the parameter is effectively free. Otherwise, accesses to the `ValRaw` storage are now bounds-checked internally in Wasmtime instead of assuming it's valid, leading to the 5% slowdown (~9.6ns to ~10.3ns). If this becomes a performance bottleneck for a particular use case then we should be fine to remove the bounds checking here or otherwise only bounds check in debug mode; otherwise I plan on leaving this as-is. Of particular note this also changes the C API for `*_unchecked` functions where the C callback now receives the size of the array as well.
* Add docs |
||
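The ABI change above, passing the `values_vec` length alongside the pointer so the host side can work with a checked slice, might look roughly like the following. The `ValRaw` definition and the trampoline signature here are simplified placeholders, not Wasmtime's exact ones.

```rust
use std::slice;

/// Simplified stand-in for Wasmtime's `ValRaw` union of raw value representations.
#[repr(C)]
#[derive(Copy, Clone)]
struct ValRaw(u128);

/// Hypothetical wasm-to-host trampoline: the new `values_len` parameter lets the
/// host build a bounds-checked `&mut [ValRaw]` instead of trusting a bare pointer.
unsafe extern "C" fn wasm_to_host_trampoline(values_vec: *mut ValRaw, values_len: usize) {
    let values: &mut [ValRaw] = slice::from_raw_parts_mut(values_vec, values_len);
    // Indexing into `values` is now bounds-checked, which is the source of the
    // small slowdown measured in the benchmark above.
    let _ = values;
}
```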
|
|
3a7910ecb0 | Reuse Cranelift codegen contexts across wasmtime compilations (#4181) | ||
|
|
6e828df632 |
Remove unused SourceLoc in many Mach data structures (#4180)
* Remove unused srcloc in MachReloc * Remove unused srcloc in MachTrap * Use `into_iter` on array in bench code to suppress a warning * Remove unused srcloc in MachCallSite |
||
|
|
fcf6208750 |
Initial skeleton of some component model processing (#4005)
* Initial skeleton of some component model processing
This commit is the first of what will likely be many to implement the component model proposal in Wasmtime. This will be structured as a series of incremental commits, most of which haven't been written yet. My hope is to make this incremental and over time to make this easier to review and easier to test each step in isolation. Here much of the skeleton of how components are going to work in Wasmtime is sketched out. This is not a complete implementation of the component model so it's not all that useful yet, but some things you can do are:
* Process the type section into a representation amenable for working with in Wasmtime.
* Process the module section and register core wasm modules.
* Process the instance section for core wasm modules.
* Process core wasm module imports.
* Process core wasm instance aliasing.
* Ability to compile a component with core wasm embedded.
* Ability to instantiate a component with no imports.
* Ability to get functions from this component.
This is already starting to diverge from the previous module linking representation where a `Component` will try to avoid unnecessary metadata about the component and instead internally only have the bare minimum necessary to instantiate the module. My hope is we can avoid constructing most of the index spaces during instantiation only for it all to get thrown away. Additionally I'm predicting that we'll need to see through processing where possible to know how to generate adapters and where they are fused. At this time you can't actually call a component's functions, and that's the next PR that I would like to make.
* Add tests for the component model support
This commit uses the recently updated wasm-tools crates to add tests for the component model added in the previous commit. This involved updating the `wasmtime-wast` crate for component-model changes. Currently the component support there is quite primitive, but enough to at least instantiate components and verify the internals of Wasmtime are all working correctly. Additionally some simple tests for the embedding API have also been added. |
||
|
|
0824abbae4 |
Add a basic alias analysis with redundant-load elim and store-to-load fowarding opts. (#4163)
This PR adds a basic *alias analysis*, and optimizations that use it. This is a "mid-end optimization": it operates on CLIF, the machine-independent IR, before lowering occurs. The alias analysis (or maybe more properly, a sort of memory-value analysis) determines when it can prove a particular memory location is equal to a given SSA value, and when it can, it replaces any loads of that location. This subsumes two common optimizations: * Redundant load elimination: when the same memory address is loaded two times, and it can be proven that no intervening operations will write to that memory, then the second load is *redundant* and its result must be the same as the first. We can use the first load's result and remove the second load. * Store-to-load forwarding: when a load can be proven to access exactly the memory written by a preceding store, we can replace the load's result with the store's data operand, and remove the load. Both of these optimizations rely on a "last store" analysis that is a sort of coloring mechanism, split across disjoint categories of abstract state. The basic idea is that every memory-accessing operation is put into one of N disjoint categories; it is disallowed for memory to ever be accessed by an op in one category and later accessed by an op in another category. (The frontend must ensure this.) Then, given this, we scan the code and determine, for each memory-accessing op, when a single prior instruction is a store to the same category. This "colors" the instruction: it is, in a sense, a static name for that version of memory. This analysis provides an important invariant: if two operations access memory with the same last-store, then *no other store can alias* in the time between that last store and these operations. This must-not-alias property, together with a check that the accessed address is *exactly the same* (same SSA value and offset), and other attributes of the access (type, extension mode) are the same, let us prove that the results are the same. Given last-store info, we scan the instructions and build a table from "memory location" key (last store, address, offset, type, extension) to known SSA value stored in that location. A store inserts a new mapping. A load may also insert a new mapping, if we didn't already have one. Then when a load occurs and an entry already exists for its "location", we can reuse the value. This will be either RLE or St-to-Ld depending on where the value came from. Note that this *does* work across basic blocks: the last-store analysis is a full iterative dataflow pass, and we are careful to check dominance of a previously-defined value before aliasing to it at a potentially redundant load. So we will do the right thing if we only have a "partially redundant" load (loaded already but only in one predecessor block), but we will also correctly reuse a value if there is a store or load above a loop and a redundant load of that value within the loop, as long as no potentially-aliasing stores happen within the loop. |
||
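The "memory location key to known SSA value" table described above can be sketched with a small map; the field types and names here are illustrative simplifications, not Cranelift's actual data structures.

```rust
use std::collections::HashMap;

/// Hypothetical key for a memory access: same last store, same address value,
/// same offset and access type means the access refers to the same bytes.
#[derive(Hash, PartialEq, Eq, Clone, Copy)]
struct MemoryLoc {
    last_store: u32, // "color" from the last-store analysis
    address: u32,    // SSA value holding the base address
    offset: i32,
    ty: u8, // access type plus extension mode
}

#[derive(Default)]
struct KnownValues {
    table: HashMap<MemoryLoc, u32>, // location -> SSA value known to live there
}

impl KnownValues {
    /// A store records its data operand for later store-to-load forwarding.
    fn record_store(&mut self, loc: MemoryLoc, data: u32) {
        self.table.insert(loc, data);
    }

    /// A load either reuses a known value (redundant load elimination or
    /// store-to-load forwarding) or records its own result for later reuse.
    fn visit_load(&mut self, loc: MemoryLoc, result: u32) -> Option<u32> {
        match self.table.get(&loc) {
            Some(&known) => Some(known),
            None => {
                self.table.insert(loc, result);
                None
            }
        }
    }
}
```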
|
|
89ccc56e46 |
Update the wasm-tools family of crates (#4165)
* Update the wasm-tools family of crates This commit updates these crates as used by Wasmtime for the recently published versions to pull in changes necessary to support the component model. I've split this out from #4005 to make it clear what's impacted here and #4005 can simply rebase on top of this to pick up the necessary changes. * More test fixes |
||
|
|
52524d258c |
Expose TrapCode::Interrupt on epoch based interruption (#4105)
|
||
|
|
9a6854456d |
Bump Wasmtime to 0.38.0 (#4103)
Co-authored-by: Wasmtime Publish <wasmtime-publish@users.noreply.github.com> |
||
|
|
871a9d93f2 |
Update some dependencies in Cargo.lock (#4081)
* Run a `cargo update` over our dependencies This'll notably fix a `cargo audit` error where we have a pinned version of the `regex` crate which has a CVE assigned to it. * Update to `object` and `hashbrown` crates Prune some duplicate versions showing up from the previous `cargo update` |
||
|
|
51d82aebfd |
Store the ValRaw type in little-endian format (#4035)
* Store the `ValRaw` type in little-endian format
This commit changes the internal representation of the `ValRaw` type to
an unconditionally little-endian format instead of its current
native-endian format. The documentation and various accessors here have
been updated as well as the associated trampolines that read `ValRaw`
to always work with little-endian values, converting to the host
endianness as necessary.
The motivation for this change originally comes from the implementation
of the component model that I'm working on. One aspect of the component
model's canonical ABI is how variants are passed to functions as
immediate arguments. For example for a component model function:
```
foo: function(x: expected<i32, f64>)
```
This translates to a core wasm function:
```wasm
(module
(func (export "foo") (param i32 i64)
;; ...
)
)
```
The first `i32` parameter to the core wasm function is the discriminant
of whether the result is an "ok" or an "err". The second `i64`, however,
is the "join" operation on the `i32` and `f64` payloads. Essentially
these two types are unioned into one type to get passed into the function.
Currently in the implementation of the component model my plan is to
construct a `*mut [ValRaw]` to pass through to WebAssembly, always
invoking component exports through host trampolines. This means that the
implementation for `Result<T, E>` needs to do the correct "join"
operation here when encoding a particular case into the corresponding
`ValRaw`.
I personally found this particularly tricky to do structurally. The
solution that I settled on with fitzgen was that if `ValRaw` was always
stored in a little endian format then we could employ a trick where when
encoding a variant we first set all the `ValRaw` slots to zero, then the
associated case we have is encoding. Afterwards the `ValRaw` values are
already encoded into the correct format as if they'd been "join"ed.
For example if we were to encode `Ok(1i32)` then this would produce
`ValRaw { i32: 1 }`, which memory-wise is equivalent to `ValRaw { i64: 1 }`
if the other bytes in the `ValRaw` are guaranteed to be zero. Similarly
storing `ValRaw { f64 }` is equivalent to the storage required for
`ValRaw { i64 }` here in the join operation.
Note, though, that this equivalence relies on everything being
little-endian. Otherwise the in-memory representations of `ValRaw { i32: 1 }`
and `ValRaw { i64: 1 }` are different.
That motivation is what leads to this change. It's expected that this is
a low-to-zero cost change in the sense that little-endian platforms will
see no change and big-endian platforms are already required to
efficiently byte-swap loads/stores as WebAssembly requires that.
Additionally the `ValRaw` type is an esoteric niche use case primarily
used for accelerating the C API right now, so it's expected that not
many users will have to update for this change.
* Track down some more endianness conversions
|
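The "zero the slot, then write the payload" trick described above only works if every payload is stored little-endian; a hypothetical illustration (not Wasmtime's actual `ValRaw` definition) is sketched below.

```rust
/// Hypothetical 64-bit raw slot; Wasmtime's real `ValRaw` is wider and has accessors.
#[repr(C)]
#[derive(Copy, Clone)]
union RawSlot {
    i32: i32,
    i64: i64,
}

/// Encode the `ok(1: i32)` case of an `expected<i32, f64>`: zero the whole slot,
/// then write the payload little-endian. Because the upper bytes are already
/// zero, the resulting bytes are identical to the "joined" `i64` representation.
fn encode_ok_i32(v: i32) -> RawSlot {
    let mut slot = RawSlot { i64: 0 };
    slot.i32 = v.to_le();
    slot
}

fn main() {
    let slot = encode_ok_i32(1);
    // Reading the wider field back out converts from little-endian to host order.
    let joined = i64::from_le(unsafe { slot.i64 });
    assert_eq!(joined, 1);
}
```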