* Update wasm-tools crates
This commit updates the wasm-tools family of crates as used in Wasmtime.
Notably this brings in the update which removes module linking support
as well as a number of internal refactorings around names and such
within wasmparser itself. This updates all of the wasm translation
support which binds to wasmparser as appropriate.
Other crates all had API-compatible changes for at least what Wasmtime
used so no further changes were necessary beyond updating version
requirements.
* Update a test expectation
* Refactor away the `Instantiator` type in Wasmtime
This internal type in Wasmtime was primarily used for the module linking
proposal to handle instantiation of many instances and refactor out the
sync and async parts to minimize duplication. With the removal of the
module linking proposal, however, this type isn't really necessary any
longer. In working to implement the component model proposal I was
looking already to refactor this and I figured it'd be good to land that
ahead of time on `main` separate of other refactorings.
This commit removes the `Instantiator` type in the `instance` module.
The type was already private to Wasmtime so this shouldn't have any
impact on consumers. This allows simplifying various code paths to avoid
another abstraction. The meat of instantiation is moved to
`Instance::new_raw` which should be reusable for the component model as
well.
One bug is actually fixed in this commit as well where
`Linker::instantiate` and `InstancePre::instantiate` failed to check
that async support was disabled on a store. This means that they could
have led to a panic if used with an async store and a start function
called an async import (or an async resource limiter yielded). A few
tests were updated with this.
* Review comments
* Upgrade all crates to the Rust 2021 edition
I've personally started using the new format strings for things like
`panic!("some message {foo}")` or similar and have been upgrading crates
on a case-by-case basis, but I think it probably makes more sense to go
ahead and blanket upgrade everything so 2021 features are always
available.
* Fix compile of the C API
* Fix a warning
* Fix another warning
* Bump to 0.36.0
* Add a two-week delay to Wasmtime's release process
This commit is a proposal to update Wasmtime's release process with a
two-week delay from branching a release until it's actually officially
released. We've had two issues lately that came up which led to this proposal:
* In #3915 it was realized that changes just before the 0.35.0 release
weren't enough for an embedding use case, but the PR didn't meet the
expectations for a full patch release.
* At Fastly we were about to start rolling out a new version of Wasmtime
when over the weekend the fuzz bug #3951 was found. This led to the
desire internally to have a "must have been fuzzed for this long"
period of time for Wasmtime changes which we felt were better
reflected in the release process itself rather than something about
Fastly's own integration with Wasmtime.
This commit updates the automation for releases to unconditionally
create a `release-X.Y.Z` branch on the 5th of every month. The actual
release from this branch is then performed on the 20th of every month,
roughly two weeks later. This should provide a period of time to ensure
that all changes in a release are fuzzed for at least two weeks and
avoid any further surprises. This should also help with any last-minute
changes made just before a release if they need tweaking since
backporting to a not-yet-released branch is much easier.
Overall there are some new properties about Wasmtime with this proposal
as well:
* The `main` branch will always have a section in `RELEASES.md` which is
listed as "Unreleased" for us to fill out.
* The `main` branch will always be a version ahead of the latest
release. For example it will be bump pre-emptively as part of the
release process on the 5th where if `release-2.0.0` was created then
the `main` branch will have 3.0.0 Wasmtime.
* Dates for major versions are automatically updated in the
`RELEASES.md` notes.
The associated documentation for our release process is updated and the
various scripts should all be updated now as well with this commit.
* Add notes on a security patch
* Clarify security fixes shouldn't be previewed early on CI
This splits the existing `lookup_by_declaration` function into a
lookup-per-type-of-item. This refactor ends up cleaning up a fair bit of
code in the `wasmtime` crate by removing a number of `unreachable!()`
blocks which are now no longer necessary.
* Remove duplicate `TypeTables` type
This was once needed historically but it is no longer needed.
* Make the internals of `TypeTables` private
Instead of reaching internally for the `wasm_signatures` map an `Index`
implementation now exists to indirect accesses through the type of the
index being accessed. For the component model this table of types will
grow a number of other tables and this'll assist in consuming sites not
having to worry so much about which map they're reaching into.
* Remove the module linking implementation in Wasmtime
This commit removes the experimental implementation of the module
linking WebAssembly proposal from Wasmtime. The module linking is no
longer intended for core WebAssembly but is instead incorporated into
the component model now at this point. This means that very large parts
of Wasmtime's implementation of module linking are no longer applicable
and would change greatly with an implementation of the component model.
The main purpose of this is to remove Wasmtime's reliance on the support
for module-linking in `wasmparser` and tooling crates. With this
reliance removed we can move over to the `component-model` branch of
`wasmparser` and use the updated support for the component model.
Additionally given the trajectory of the component model proposal the
embedding API of Wasmtime will not look like what it looks like today
for WebAssembly. For example the core wasm `Instance` will not change
and instead a `Component` is likely to be added instead.
Some more rationale for this is in #3941, but the basic idea is that I
feel that it's not going to be viable to develop support for the
component model on a non-`main` branch of Wasmtime. Additionaly I don't
think it's viable, for the same reasons as `wasm-tools`, to support the
old module linking proposal and the new component model at the same
time.
This commit takes a moment to not only delete the existing module
linking implementation but some abstractions are also simplified. For
example module serialization is a bit simpler that there's only one
module. Additionally instantiation is much simpler since the only
initializer we have to deal with are imports and nothing else.
Closes#3941
* Fix doc link
* Update comments
* Instead of simply panicking, return an error when we attempt to resume on a dying fiber.
This situation should never occur in the existing code base, but can be
triggered if support for running outside async code in a call hook.
* Shift `async_cx()` to return an `Option`, reflecting if the fiber is dying.
This should never happen in the existing code base, but is a nice
forward-looking guard. The current implementations simply lift the
trap that would eventually be produced by such an operation into
a `Trap` (or similar) at the invocation of `async_cx()`.
* Add support for using `async` call hooks.
This retains the ability to do non-async hooks. Hooks end up being
implemented as an async trait with a handler call, to get around some
issues passing around async closures. This change requires some of
the prior changes to handle picking up blocked tasks during fiber
shutdown, to avoid some panics during timeouts and other such events.
* More fully specify a doc link, to avoid a doc-building error.
* Revert the use of catchable traps on cancellation of a fiber; turn them into expect()/unwrap().
The justification for this revert is that (a) these events shouldn't
happen, and (b) they wouldn't be catchable by wasm anyways.
* Replace a duplicated check in `async` hook evaluation with a single check.
This also moves the checks inside of their respective Async variants,
meaning that if you're using an async-enabled version of wasmtime but
using the synchronous versions of the callbacks, you won't pay any
penalty for validating the async context.
* Use `match &mut ...` insead of `ref mut`.
* Add some documentation on why/when `async_cx` can return None.
* Add two simple test cases for async call hooks.
* Fix async_cx() to check both the box and the value for current_poll_cx.
In the prior version, we only checked that the box had not been cleared,
but had not ensured that there was an actual context for us to use. This
updates the check to validate both, returning None if the inner context
is missing. This allows us to skip a validation check inside `block_on`,
since all callers will have run through the `async_cx` check prior to
arrival.
* Tweak the timeout test to address PR suggestions.
* Add a test about dropping async hooks while suspended
Should help exercise that the check for `None` is properly handled in a
few more locations.
Co-authored-by: Alex Crichton <alex@alexcrichton.com>
This commit exposes some various details and config options for having
finer-grain control over mlock-ing the memory of modules. This amounts
to three different changes being present in this commit:
* A new `Module::image_range` API is added to expose the range in host
memory of where the compiled image resides. This enables embedders to
make mlock-ing decisions independently of Wasmtime. Otherwise though
there's not too much useful that can be done with this range
information at this time.
* A new `Config::force_memory_init_memfd` option has been added. This
option is used to force the usage of `memfd_create` on Linux even when
the original module comes from a file on disk. With mlock-ing the main
purpose for Wasmtime is likely to be avoiding major page faults that
go back to disk, so this is another major source of avoiding page
faults by ensuring that the initialization contents of memory are
always in RAM.
* The `memory_images` field of a `Module` has gone back to being lazily
created on the first instantiation, effectively reverting #3914. This
enables embedders to defer the creation of the image to as late as
possible to allow modules to be created from precompiled images
without actually loading all the contents of the data segments from
disk immediately.
These changes are all somewhat low-level controls which aren't intended
to be generally used by embedders. If fine-grained control is desired
though it's hoped that these knobs provide what's necessary to be
achieved.
* Support disabling backtraces at compile time
This commit adds support to Wasmtime to disable, at compile time, the
gathering of backtraces on traps. The `wasmtime` crate now sports a
`wasm-backtrace` feature which, when disabled, will mean that backtraces
are never collected at compile time nor are unwinding tables inserted
into compiled objects.
The motivation for this commit stems from the fact that generating a
backtrace is quite a slow operation. Currently backtrace generation is
done with libunwind and `_Unwind_Backtrace` typically found in glibc or
other system libraries. When thousands of modules are loaded into the
same process though this means that the initial backtrace can take
nearly half a second and all subsequent backtraces can take upwards of
hundreds of milliseconds. Relative to all other operations in Wasmtime
this is extremely expensive at this time. In the future we'd like to
implement a more performant backtrace scheme but such an implementation
would require coordination with Cranelift and is a big chunk of work
that may take some time, so in the meantime if embedders don't need a
backtrace they can still use this option to disable backtraces at
compile time and avoid the performance pitfalls of collecting
backtraces.
In general I tried to originally make this a runtime configuration
option but ended up opting for a compile-time option because `Trap::new`
otherwise has no arguments and always captures a backtrace. By making
this a compile-time option it was possible to configure, statically, the
behavior of `Trap::new`. Additionally I also tried to minimize the
amount of `#[cfg]` necessary by largely only having it at the producer
and consumer sites.
Also a noteworthy restriction of this implementation is that if
backtrace support is disabled at compile time then reference types
support will be unconditionally disabled at runtime. With backtrace
support disabled there's no way to trace the stack of wasm frames which
means that GC can't happen given our current implementation.
* Always enable backtraces for the C API
* Delete historical interruptable support in Wasmtime
This commit removes the `Config::interruptable` configuration along with
the `InterruptHandle` type from the `wasmtime` crate. The original
support for adding interruption to WebAssembly was added pretty early on
in the history of Wasmtime when there was no other method to prevent an
infinite loop from the host. Nowadays, however, there are alternative
methods for interruption such as fuel or epoch-based interruption.
One of the major downsides of `Config::interruptable` is that even when
it's not enabled it forces an atomic swap to happen when entering
WebAssembly code. This technically could be a non-atomic swap if the
configuration option isn't enabled but that produces even more branch-y
code on entry into WebAssembly which is already something we try to
optimize. Calling into WebAssembly is on the order of a dozens of
nanoseconds at this time and an atomic swap, even uncontended, can add
up to 5ns on some platforms.
The main goal of this PR is to remove this atomic swap on entry into
WebAssembly. This is done by removing the `Config::interruptable` field
entirely, moving all existing consumers to epochs instead which are
suitable for the same purposes. This means that the stack overflow check
is no longer entangled with the interruption check and perhaps one day
we could continue to optimize that further as well.
Some consequences of this change are:
* Epochs are now the only method of remote-thread interruption.
* There are no more Wasmtime traps that produces the `Interrupted` trap
code, although we may wish to move future traps to this so I left it
in place.
* The C API support for interrupt handles was also removed and bindings
for epoch methods were added.
* Function-entry checks for interruption are a tiny bit less efficient
since one check is performed for the stack limit and a second is
performed for the epoch as opposed to the `Config::interruptable`
style of bundling the stack limit and the interrupt check in one. It's
expected though that this is likely to not really be measurable.
* The old `VMInterrupts` structure is renamed to `VMRuntimeLimits`.
This commit removes the currently existing laziness-via-`OnceCell` when
a `Module` is created for creating a `ModuleMemoryImages` data
structure. Processing of data is now already shifted to compile time for
the wasm module which means that creating a `ModuleMemoryImages` is
either cheap because the module is backed by a file on disk, it's a
single `write` into the kernel to a memfd, or it's cheap as it's not
supported. This should help make module instantiation time more
deterministic, even for the first instantiation of a module.
* Implement runtime checks for compilation settings
This commit fills out a few FIXME annotations by implementing run-time
checks that when a `Module` is created it has compatible codegen
settings for the current host (as `Module` is proof of "this code can
run"). This is done by implementing new `Engine`-level methods which
validate compiler settings. These settings are validated on
`Module::new` as well as when loading serialized modules.
Settings are split into two categories, one for "shared" top-level
settings and one for ISA-specific settings. Both categories now have
allow-lists hardcoded into `Engine` which indicate the acceptable values
for each setting (if applicable). ISA-specific settings are checked with
the Rust standard library's `std::is_x86_feature_detected!` macro. Other
macros for other platforms are not stable at this time but can be added
here if necessary.
Closes#3897
* Fix fall-through logic to actually be correct
* Use a `OnceCell`, not an `AtomicBool`
* Fix some broken tests
This commit fixes calling the call hooks configured in a store for host
functions defined with `Func::new_unchecked` or similar. I believe that
this was just an accidental oversight and there's no fundamental reason
to not support this.
This commit aims to achieve the goal of being able to run the test suite
on Windows with `--test-threads 1`, or more notably allowing Wasmtime's
defaults to work better with the main thread on Windows which appears to
have a smaller stack by default than Linux by comparison. In decreasing
the default wasm stack size a test is also update to probe for less
stack to work on Windows' main thread by default, ideally allowing the
full test suite to work with `--test-threads 1` (although this isn't
added to CI as it's not really critical).
Closes#3857
* Shrink the size of the anyfunc table in `VMContext`
This commit shrinks the size of the `VMCallerCheckedAnyfunc` table
allocated into a `VMContext` to be the size of the number of "escaped"
functions in a module rather than the number of functions in a module.
Escaped functions include exports, table elements, etc, and are
typically an order of magnitude smaller than the number of functions in
general. This should greatly shrink the `VMContext` for some modules
which while we aren't necessarily having any problems with that today
shouldn't cause any problems in the future.
The original motivation for this was that this came up during the recent
lazy-table-initialization work and while it no longer has a direct
performance benefit since tables aren't initialized at all on
instantiation it should still improve long-running instances
theoretically with smaller `VMContext` allocations as well as better
locality between anyfuncs.
* Fix some tests
* Remove redundant hash set
* Use a helper for pushing function type information
* Use a more descriptive `is_escaping` method
* Clarify a comment
* Fix condition
* Remove the `ModuleLimits` pooling configuration structure
This commit is an attempt to improve the usability of the pooling
allocator by removing the need to configure a `ModuleLimits` structure.
Internally this structure has limits on all forms of wasm constructs but
this largely bottoms out in the size of an allocation for an instance in
the instance pooling allocator. Maintaining this list of limits can be
cumbersome as modules may get tweaked over time and there's otherwise no
real reason to limit the number of globals in a module since the main
goal is to limit the memory consumption of a `VMContext` which can be
done with a memory allocation limit rather than fine-tuned control over
each maximum and minimum.
The new approach taken in this commit is to remove `ModuleLimits`. Some
fields, such as `tables`, `table_elements` , `memories`, and
`memory_pages` are moved to `InstanceLimits` since they're still
enforced at runtime. A new field `size` is added to `InstanceLimits`
which indicates, in bytes, the maximum size of the `VMContext`
allocation. If the size of a `VMContext` for a module exceeds this value
then instantiation will fail.
This involved adding a few more checks to `{Table, Memory}::new_static`
to ensure that the minimum size is able to fit in the allocation, since
previously modules were validated at compile time of the module that
everything fit and that validation no longer happens (it happens at
runtime).
A consequence of this commit is that Wasmtime will have no built-in way
to reject modules at compile time if they'll fail to be instantiated
within a particular pooling allocator configuration. Instead a module
must attempt instantiation see if a failure happens.
* Fix benchmark compiles
* Fix some doc links
* Fix a panic by ensuring modules have limited tables/memories
* Review comments
* Add back validation at `Module` time instantiation is possible
This allows for getting an early signal at compile time that a module
will never be instantiable in an engine with matching settings.
* Provide a better error message when sizes are exceeded
Improve the error message when an instance size exceeds the maximum by
providing a breakdown of where the bytes are all going and why the large
size is being requested.
* Try to fix test in qemu
* Flag new test as 64-bit only
Sizes are all specific to 64-bit right now
* fuzzing: Add a custom mutator based on `wasm-mutate`
* fuzz: Add a version of the `compile` fuzz target that uses `wasm-mutate`
* Update `wasmparser` dependencies
* Enable copy-on-write heap initialization by default
This commit enables the `Config::memfd` feature by default now that it's
been fuzzed for a few weeks on oss-fuzz, and will continue to be fuzzed
leading up to the next release of Wasmtime in early March. The
documentation of the `Config` option has been updated as well as adding
a CLI flag to disable the feature.
* Remove ubiquitous "memfd" terminology
Switch instead to forms of "memory image" or "cow" or some combination
thereof.
* Update new option names
* Enable SSE 4.2 unconditionally
Fuzzing over the weekend found that `i64x2` comparison operators
require `pcmpgtq` which is an SSE 4.2 instruction. Along the lines of #3816
this commit unconditionally enables and requires SSE 4.2 for compilation
and fuzzing. It will no longer be possible to create a compiler for
x86_64 with simd enabled if SSE 4.2 is disabled.
* Update comment
In #3820 we see an issue with the new heuristics that control use of
memfd: it's entirely possible for a reasonable Wasm module produced by a
snapshotting system to have a relatively sparse heap (less than 50%
filled). A system that avoids memfd because of this would have an
undesirable performance reduction on such modules.
Ultimately we should try to implement a hybrid scheme where we support
outlier/leftover initializers, but for now this PR makes the "always
allow dense" limit configurable. This way, embedders that want to ensure
that memfd is used can do so, if they have other knowledge about the
maximum heap size allowed in their system.
(Partially addresses #3820 but let's leave it open to track the hybrid
idea)
* x64: enable VTune support by default
After significant work in the `ittapi-rs` crate, this dependency should
build without issue on Wasmtime's supported operating systems: Windows,
Linux, and macOS. The difference in the release binary is <20KB, so this
change makes `vtune` a default build feature. This change upgrades
`ittapi-rs` to v0.2.0 and updates the documentation.
* review: add configuration for defaults in more places
* review: remove OS conditional compilation, add architecture
* review: do not default vtune feature in wasmtime-jit
Addresses #3809: when we are asked to create a Cranelift backend with
shared flags that indicate support for SIMD, we should check that the
ISA level needed for our SIMD lowerings is present.
* Shrink the size of `FuncData`
Before this commit on a 64-bit system the `FuncData` type had a size of
88 bytes and after this commit it has a size of 32 bytes. A `FuncData`
is required for all host functions in a store, including those inserted
from a `Linker` into a store used during linking. This means that
instantiation ends up creating a nontrivial number of these types and
pushing them into the store. Looking at some profiles there were some
surprisingly expensive movements of `FuncData` from the stack to a
vector for moves-by-value generated by Rust. Shrinking this type enables
more efficient code to be generated and additionally means less storage
is needed in a store's function array.
For instantiating the spidermonkey and rustpython modules this improves
instantiation by 10% since they each import a fair number of host
functions and the speedup here is relative to the number of items
imported.
* Use `ptr::copy_nonoverlapping` during initialization
Prevoiusly `ptr::copy` was used for copying imports into place which
translates to `memmove`, but `ptr::copy_nonoverlapping` can be used here
since it's statically known these areas don't overlap. While this
doesn't end up having a performance difference it's something I kept
noticing while looking at the disassembly of `initialize_vmcontext` so I
figured I'd go ahead and implement.
* Indirect shared signature ids in the VMContext
This commit is a small improvement for the instantiation time of modules
by avoiding copying a list of `VMSharedSignatureIndex` entries into each
`VMContext`, instead building one inside of a module and sharing that
amongst all instances. This involves less lookups at instantiation time
and less movement of data during instantiation. The downside is that
type-checks on `call_indirect` now involve an additionally load, but I'm
assuming that these are somewhat pessimized enough as-is that the
runtime impact won't be much there.
For instantiation performance this is a 5-10% win with
rustpyhon/spidermonky instantiation. This should also reduce the size of
each `VMContext` for an instantiation since signatures are no longer
stored inline but shared amongst all instances with one module.
Note that one subtle change here is that the array of
`VMSharedSignatureIndex` was previously indexed by `TypeIndex`, and now
it's indexed by `SignaturedIndex` which is a deduplicated form of
`TypeIndex`. This is done because we already had a list of those lying
around in `Module`, so it was easier to reuse that than to build a
separate array and store it somewhere.
* Reserve space in `Store<T>` with `InstancePre`
This commit updates the instantiation process to reserve space in a
`Store<T>` for the functions that an `InstancePre<T>`, as part of
instantiation, will insert into it. Using an `InstancePre<T>` to
instantiate allows pre-computing the number of host functions that will
be inserted into a store, and by pre-reserving space we can avoid costly
reallocations during instantiation by ensuring the function vector has
enough space to fit everything during the instantiation process.
Overall this makes instantiation of rustpython/spidermonkey about 8%
faster locally.
* Fix tests
* Use checked arithmetic
* Skip memfd creation with precompiled modules
This commit updates the memfd support internally to not actually use a
memfd if a compiled module originally came from disk via the
`wasmtime::Module::deserialize_file` API. In this situation we already
have a file descriptor open and there's no need to copy a module's heap
image to a new file descriptor.
To facilitate a new source of `mmap` the currently-memfd-specific-logic
of creating a heap image is generalized to a new form of
`MemoryInitialization` which is attempted for all modules at
module-compile-time. This means that the serialized artifact to disk
will have the memory image in its entirety waiting for us. Furthermore
the memory image is ensured to be padded and aligned carefully to the
target system's page size, notably meaning that the data section in the
final object file is page-aligned and the size of the data section is
also page aligned.
This means that when a precompiled module is mapped from disk we can
reuse the underlying `File` to mmap all initial memory images. This
means that the offset-within-the-memory-mapped-file can differ for
memfd-vs-not, but that's just another piece of state to track in the
memfd implementation.
In the limit this waters down the term "memfd" for this technique of
quickly initializing memory because we no longer use memfd
unconditionally (only when the backing file isn't available).
This does however open up an avenue in the future to porting this
support to other OSes because while `memfd_create` is Linux-specific
both macOS and Windows support mapping a file with copy-on-write. This
porting isn't done in this PR and is left for a future refactoring.
Closes#3758
* Enable "memfd" support on all unix systems
Cordon off the Linux-specific bits and enable the memfd support to
compile and run on platforms like macOS which have a Linux-like `mmap`.
This only works if a module is mapped from a precompiled module file on
disk, but that's better than not supporting it at all!
* Fix linux compile
* Use `Arc<File>` instead of `MmapVecFileBacking`
* Use a named struct instead of mysterious tuples
* Comment about unsafety in `Module::deserialize_file`
* Fix tests
* Fix uffd compile
* Always align data segments
No need to have conditional alignment since their sizes are all aligned
anyway
* Update comment in build.rs
* Use rustix, not `region`
* Fix some confusing logic/names around memory indexes
These functions all work with memory indexes, not specifically defined
memory indexes.
* Move function names out of `Module`
This commit moves function names in a module out of the
`wasmtime_environ::Module` type and into separate sections stored in the
final compiled artifact. Spurred on by #3787 to look at module load
times I noticed that a huge amount of time was spent in deserializing
this map. The `spidermonkey.wasm` file, for example, has a 3MB name
section which is a lot of unnecessary data to deserialize at module load
time.
The names of functions are now split out into their own dedicated
section of the compiled artifact and metadata about them is stored in a
more compact format at runtime by avoiding a `BTreeMap` and instead
using a sorted array. Overall this improves deserialize times by up to
80% for modules with large name sections since the name section is no
longer deserialized at load time and it's lazily paged in as names are
actually referenced.
* Fix a typo
* Fix compiled module determinism
Need to not only sort afterwards but also first to ensure the data of
the name section is consistent.
During instance initialization, we build two sorts of arrays eagerly:
- We create an "anyfunc" (a `VMCallerCheckedAnyfunc`) for every function
in an instance.
- We initialize every element of a funcref table with an initializer to
a pointer to one of these anyfuncs.
Most instances will not touch (via call_indirect or table.get) all
funcref table elements. And most anyfuncs will never be referenced,
because most functions are never placed in tables or used with
`ref.func`. Thus, both of these initialization tasks are quite wasteful.
Profiling shows that a significant fraction of the remaining
instance-initialization time after our other recent optimizations is
going into these two tasks.
This PR implements two basic ideas:
- The anyfunc array can be lazily initialized as long as we retain the
information needed to do so. For now, in this PR, we just recreate the
anyfunc whenever a pointer is taken to it, because doing so is fast
enough; in the future we could keep some state to know whether the
anyfunc has been written yet and skip this work if redundant.
This technique allows us to leave the anyfunc array as uninitialized
memory, which can be a significant savings. Filling it with
initialized anyfuncs is very expensive, but even zeroing it is
expensive: e.g. in a large module, it can be >500KB.
- A funcref table can be lazily initialized as long as we retain a link
to its corresponding instance and function index for each element. A
zero in a table element means "uninitialized", and a slowpath does the
initialization.
Funcref tables are a little tricky because funcrefs can be null. We need
to distinguish "element was initially non-null, but user stored explicit
null later" from "element never touched" (ie the lazy init should not
blow away an explicitly stored null). We solve this by stealing the LSB
from every funcref (anyfunc pointer): when the LSB is set, the funcref
is initialized and we don't hit the lazy-init slowpath. We insert the
bit on storing to the table and mask it off after loading.
We do have to set up a precomputed array of `FuncIndex`s for the table
in order for this to work. We do this as part of the module compilation.
This PR also refactors the way that the runtime crate gains access to
information computed during module compilation.
Performance effect measured with in-tree benches/instantiation.rs, using
SpiderMonkey built for WASI, and with memfd enabled:
```
BEFORE:
sequential/default/spidermonkey.wasm
time: [68.569 us 68.696 us 68.856 us]
sequential/pooling/spidermonkey.wasm
time: [69.406 us 69.435 us 69.465 us]
parallel/default/spidermonkey.wasm: with 1 background thread
time: [69.444 us 69.470 us 69.497 us]
parallel/default/spidermonkey.wasm: with 16 background threads
time: [183.72 us 184.31 us 184.89 us]
parallel/pooling/spidermonkey.wasm: with 1 background thread
time: [69.018 us 69.070 us 69.136 us]
parallel/pooling/spidermonkey.wasm: with 16 background threads
time: [326.81 us 337.32 us 347.01 us]
WITH THIS PR:
sequential/default/spidermonkey.wasm
time: [6.7821 us 6.8096 us 6.8397 us]
change: [-90.245% -90.193% -90.142%] (p = 0.00 < 0.05)
Performance has improved.
sequential/pooling/spidermonkey.wasm
time: [3.0410 us 3.0558 us 3.0724 us]
change: [-95.566% -95.552% -95.537%] (p = 0.00 < 0.05)
Performance has improved.
parallel/default/spidermonkey.wasm: with 1 background thread
time: [7.2643 us 7.2689 us 7.2735 us]
change: [-89.541% -89.533% -89.525%] (p = 0.00 < 0.05)
Performance has improved.
parallel/default/spidermonkey.wasm: with 16 background threads
time: [147.36 us 148.99 us 150.74 us]
change: [-18.997% -18.081% -17.285%] (p = 0.00 < 0.05)
Performance has improved.
parallel/pooling/spidermonkey.wasm: with 1 background thread
time: [3.1009 us 3.1021 us 3.1033 us]
change: [-95.517% -95.511% -95.506%] (p = 0.00 < 0.05)
Performance has improved.
parallel/pooling/spidermonkey.wasm: with 16 background threads
time: [49.449 us 50.475 us 51.540 us]
change: [-85.423% -84.964% -84.465%] (p = 0.00 < 0.05)
Performance has improved.
```
So an improvement of something like 80-95% for a very large module (7420
functions in its one funcref table, 31928 functions total).
This commit corrects a few places where `MemoryIndex` was used and treated like
a `DefinedMemoryIndex` in the pooling instance allocator.
When the unstable `multi-memory` proposal is enabled, it is possible to cause a
newly allocated instance to use an incorrect base address for any defined
memories by having the module being instantiated also import a memory.
This requires enabling the unstable `multi-memory` proposal, configuring the
use of the pooling instance allocator (not the default), and then configuring
the module limits to allow imported memories (also not the default).
The fix is to replace all uses of `MemoryIndex` with `DefinedMemoryIndex` in
the pooling instance allocator.
Several `debug_assert!` have also been updated to `assert!` to sanity check the
state of the pooling allocator even in release builds.
This commit updates the `memfd` support in Wasmtime to have a runtime
toggle as to whether it's used or not. The compile-time feature gating
`memfd` support is now also re-enabled by default, but the new runtime
switch is still disabled-by-default.
Additionally this commit updates our fuzz oracle to turn on/off the
memfd flag to re-enable fuzzing with memfd on oss-fuzz.
Since memfd support just landed, and has had only ~0.5 weeks to bake
with fuzzing, we want to make release 0.34.0 of Wasmtime without it
enabled by default. This PR disables memfd by default; it can be enabled
by specifying the `memfd` feature for the `wasmtime` crate, or when
building the commandline binary.
We plan to explicitly add memfd-enabled fuzzing targets, let that go for
a while, then probably re-enable memfd in the subsequent release if no
issues come up.
As a followup to the recent memfd allocator work, this PR makes the
memfd image creation occur on the first instantiation, rather than
immediately when the `Module` is loaded.
This shaves off a potentially surprising cost spike that would have
otherwise occurred: prior to the memfd work, no allocator eagerly read
the module's initial heap state into RAM. The behavior should now more
closely resemble what happened before (and the improvements in overall
instantiation time and performance, as compared to either eager init
with pure-mmap memory or user-mode pagefault handling with uffd,
remain).
* Tweak memfd-related features crates
This commit changes the `memfd` feature for the `wasmtime-cli` crate
from an always-on feature to a default-on feature which can be disabled
at compile time. Additionally the `pooling-allocator` feature is also
given similar treatment.
Additionally some documentation was added for the `memfd` feature on the
`wasmtime` crate.
* Don't store `Arc<T>` in `InstanceAllocationRequest`
Instead store `&Arc<T>` to avoid having the clone that lives in
`InstanceAllocationRequest` not actually going anywhere. Otherwise all
instance allocation requires an extra clone to create it for the request
and an extra decrement when the request goes away. Internally clones are
made as necessary when creating instances.
* Enable the pooling allocator by default for `wasmtime-cli`
While perhaps not the most useful option since the CLI doesn't have a
great way to take advantage of this it probably makes sense to at least
match the features of `wasmtime` itself.
* Fix some lints and issues
* More compile fixes
This policy attempts to reuse the same instance slot for subsequent
instantiations of the same module. This is particularly useful when
using a pooling backend such as memfd that benefits from this reuse: for
example, in the memfd case, instantiating the same module into the same
slot allows us to avoid several calls to mmap() because the same
mappings can be reused.
The policy tracks a freelist per "compiled module ID", and when
allocating a slot for an instance, tries these three options in order:
1. A slot from the freelist for this module (i.e., last used for another
instantiation of this particular module), or
3. A slot that was last used by some other module or never before.
The "victim" slot for choice 2 is randomly chosen.
The data structures are carefully designed so that all updates are O(1),
and there is no retry-loop in any of the random selection.
This policy is now the default when the memfd backend is selected via
the `memfd-allocator` feature flag.
* Lazily populate a store's trampoline map
This commit is another installment of "how fast can we make
instantiation". Currently when instantiating a module with many function
imports each function, typically from the host, is inserted into the
store. This insertion process stores the `VMTrampoline` for the host
function in a side table so it can be looked up later if the host
function is called through the `Func` interface. This insertion process,
however, involves a hash map insertion which can be relatively expensive
at the scale of the rest of the instantiation process.
The optimization implemented in this commit is to avoid inserting
trampolines into the store at `Func`-insertion-time (aka instantiation
time) and instead only lazily populate the map of trampolines when
needed. The theory behind this is that almost all `Func` instances that
are called indirectly from the host are actually wasm functions, not
host-defined functions. This means that they already don't need to go
through the map of host trampolines and can instead be looked up from
the module they're defined in. With the assumed rarity of host functions
making `lookup_trampoline` a bit slower seems ok.
The `lookup_trampoline` function will now, on a miss from the wasm
modules and `host_trampolines` map, lazily iterate over the functions
within the store and insert trampolines into the `host_trampolines` map.
This process will eventually reach something which matches the function
provided because it should at least hit the same host function. The
relevant `lookup_trampoline` now sports a new documentation block
explaining all this as well for future readers.
Concretely this commit speeds up instantiation of an empty module with
100 imports and ~80 unique signatures from 10.6us to 6.4us, a 40%
improvement.
* Review comments
* Remove debug assert
As first suggested by Jan on the Zulip here [1], a cheap and effective
way to obtain copy-on-write semantics of a "backing image" for a Wasm
memory is to mmap a file with `MAP_PRIVATE`. The `memfd` mechanism
provided by the Linux kernel allows us to create anonymous,
in-memory-only files that we can use for this mapping, so we can
construct the image contents on-the-fly then effectively create a CoW
overlay. Furthermore, and importantly, `madvise(MADV_DONTNEED, ...)`
will discard the CoW overlay, returning the mapping to its original
state.
By itself this is almost enough for a very fast
instantiation-termination loop of the same image over and over,
without changing the address space mapping at all (which is
expensive). The only missing bit is how to implement
heap *growth*. But here memfds can help us again: if we create another
anonymous file and map it where the extended parts of the heap would
go, we can take advantage of the fact that a `mmap()` mapping can
be *larger than the file itself*, with accesses beyond the end
generating a `SIGBUS`, and the fact that we can cheaply resize the
file with `ftruncate`, even after a mapping exists. So we can map the
"heap extension" file once with the maximum memory-slot size and grow
the memfd itself as `memory.grow` operations occur.
The above CoW technique and heap-growth technique together allow us a
fastpath of `madvise()` and `ftruncate()` only when we re-instantiate
the same module over and over, as long as we can reuse the same
slot. This fastpath avoids all whole-process address-space locks in
the Linux kernel, which should mean it is highly scalable. It also
avoids the cost of copying data on read, as the `uffd` heap backend
does when servicing pagefaults; the kernel's own optimized CoW
logic (same as used by all file mmaps) is used instead.
[1] https://bytecodealliance.zulipchat.com/#narrow/stream/206238-general/topic/Copy.20on.20write.20based.20instance.20reuse/near/266657772
* Lazily load types into `Func`
This commit changes the construction of a `Func` to lazily load the type
information if required instead of always loading the type information
at `Func`-construction time. The main purpose of this change is to
accelerate instantiation of instances which have many imports. Currently
in the fast way of doing this the instantiation loop looks like:
let mut store = Store::new(&engine, ...);
let instance = instance_pre.instantiate(&mut store);
In this situation the `instance_pre` will typically load host-defined
functions (defined via `Linker` APIs) into the `Store` as individual
`Func` items and then perform the instantiation process. The operation
of loading a `HostFunc` into a `Store` however currently involves two
expensive operations:
* First a read-only lock is taken on the `RwLock` around engine
signatures.
* Next a clone of the wasm type is made to pull it out of the engine
signature registry.
Neither of these is actually necessary for imported functions. The
`FuncType` for imported functions is never used since all comparisons
happen with the intern'd indices instead. The only time a `FuncType` is
used typically is for exported functions when using `Func::typed` or
similar APIs which need type information.
This commit makes this path faster by storing `Option<FuncType>` instead
of `FuncType` within a `Func`. This means that it starts out as `None`
and is only filled in on-demand as necessary. This means that when
instantiating a module with many imports no clones/locks are done.
On a simple microbenchmark where a module with 100 imports is
instantiated this PR improves instantiation time by ~35%. Due to the
rwlock used here and the general inefficiency of pthreads rwlocks the
effect is even more profound when many threads are performing the same
instantiation process. On x86_64 with 8 threads performing instantiation
this PR improves instantiation time by 80% and on arm64 it improves by
97% (wow read-contended glibc rwlocks on arm64 are slow).
Note that much of the improvement here is also from memory
allocatoins/deallocations no longer being performed because dropping
functions within a store no longer requires deallocating the `FuncType`
if it's not present.
A downside of this PR is that `Func::ty` is now unconditionally taking
an rwlock if the type hasn't already been filled in. (it uses the
engine). If this is an issue in the future though we can investigate at
that time using somthing like a `Once` to lazily fill in even when
mutable access to the store isn't available.
* Review comments