Addresses #3809: when we are asked to create a Cranelift backend with
shared flags that indicate support for SIMD, we should check that the
ISA level needed for our SIMD lowerings is present.
* Port fix for `CVE-2022-23636` to `main`.
This commit ports the fix for `CVE-2022-23636` to `main`, but performs a
refactoring that makes it unnecessary for the instance itself to track if it
has been initialized; such a change was not targeted enough for a security
patch.
The pooling allocator will now only initialize an instance if all of its
associated resource creation succeeds. If the resource creation fails, no
instance is dropped as none was initialized.
Also updates `RELEASES.md` to include the related patch releases.
* Add `Instance::new_at` to fully initialize an instance.
Added `Instance::new_at` to fully initialize an instance at a given address.
This will hopefully prevent the possibility that an `Instance` structure
doesn't have an initialized `VMContext` when it is dropped.
* Unconditionally enable sse3, ssse3, and sse4.1 when fuzzing
This commit unconditionally enables some x86_64 instructions when
fuzzing because the cranelift backend is known to not work if these
features are disabled. From discussion on the wasm simd proposal the
assumed general baseline for running simd code is SSE4.1 anyway.
At this time I haven't added any sort of checks in Wasmtime itself.
Wasmtime by default uses the native architecture and when explicitly
enabling features this still needs to be explicitly specified.
Closes#3809
* Update crates/fuzzing/src/generators.rs
Co-authored-by: Andrew Brown <andrew.brown@intel.com>
Co-authored-by: Andrew Brown <andrew.brown@intel.com>
This commit improves the stability of the fuzz targets by ensuring the
generated configs and modules are congruent, especially when the pooling
allocator is being used.
For the `differential` target, this means both configurations must use the same
allocation strategy for now as one side generates the module that might not be
compatible with another arbitrary config now that we fuzz the pooling
allocator.
These changes also ensure that constraints put on the config are more
consistently applied, especially when using a fuel-based timeout.
* Fuzz cranelift cpu flag settings with Wasmtime
This commit updates the `Config` fuzz-generator to consume some of the
input as configuration settings for codegen flags we pass to cranelift.
This should allow for ideally some more coverage where settings are
disabled or enabled, ideally finding possible bugs in feature-specific
implementations or generic implementations that are rarely used if the
feature-specific ones almost always take precedent.
The technique used in this commit is to weight selection of codegen
settings less frequently than using the native settings. Afterwards each
listed feature is individually enabled or disabled depending on the
input fuzz data, and if a feature is enabled but the host doesn't
actually support it then the fuzz input is rejected with a log message.
The goal here is to still have many fuzz inputs accepted but also ensure
determinism across hosts. If there's a bug specifically related to
enabling a flag then running it on a host without the flag should
indicate that the flag isn't supported rather than silently leaving it
disabled and reporting the fuzz case a success.
* Use built-in `Unstructured::ratio` method
* Tweak macro
* Bump arbitrary dep version
This commit makes it such that the pooling allocator will be configured with a
much lower upper bound for memory pages, which will greatly reduce the
likelihood that the fuzzer memory limits will be hit from having too many
memories from too many instances committed.
* Shrink the size of `FuncData`
Before this commit on a 64-bit system the `FuncData` type had a size of
88 bytes and after this commit it has a size of 32 bytes. A `FuncData`
is required for all host functions in a store, including those inserted
from a `Linker` into a store used during linking. This means that
instantiation ends up creating a nontrivial number of these types and
pushing them into the store. Looking at some profiles there were some
surprisingly expensive movements of `FuncData` from the stack to a
vector for moves-by-value generated by Rust. Shrinking this type enables
more efficient code to be generated and additionally means less storage
is needed in a store's function array.
For instantiating the spidermonkey and rustpython modules this improves
instantiation by 10% since they each import a fair number of host
functions and the speedup here is relative to the number of items
imported.
* Use `ptr::copy_nonoverlapping` during initialization
Prevoiusly `ptr::copy` was used for copying imports into place which
translates to `memmove`, but `ptr::copy_nonoverlapping` can be used here
since it's statically known these areas don't overlap. While this
doesn't end up having a performance difference it's something I kept
noticing while looking at the disassembly of `initialize_vmcontext` so I
figured I'd go ahead and implement.
* Indirect shared signature ids in the VMContext
This commit is a small improvement for the instantiation time of modules
by avoiding copying a list of `VMSharedSignatureIndex` entries into each
`VMContext`, instead building one inside of a module and sharing that
amongst all instances. This involves less lookups at instantiation time
and less movement of data during instantiation. The downside is that
type-checks on `call_indirect` now involve an additionally load, but I'm
assuming that these are somewhat pessimized enough as-is that the
runtime impact won't be much there.
For instantiation performance this is a 5-10% win with
rustpyhon/spidermonky instantiation. This should also reduce the size of
each `VMContext` for an instantiation since signatures are no longer
stored inline but shared amongst all instances with one module.
Note that one subtle change here is that the array of
`VMSharedSignatureIndex` was previously indexed by `TypeIndex`, and now
it's indexed by `SignaturedIndex` which is a deduplicated form of
`TypeIndex`. This is done because we already had a list of those lying
around in `Module`, so it was easier to reuse that than to build a
separate array and store it somewhere.
* Reserve space in `Store<T>` with `InstancePre`
This commit updates the instantiation process to reserve space in a
`Store<T>` for the functions that an `InstancePre<T>`, as part of
instantiation, will insert into it. Using an `InstancePre<T>` to
instantiate allows pre-computing the number of host functions that will
be inserted into a store, and by pre-reserving space we can avoid costly
reallocations during instantiation by ensuring the function vector has
enough space to fit everything during the instantiation process.
Overall this makes instantiation of rustpython/spidermonkey about 8%
faster locally.
* Fix tests
* Use checked arithmetic
* Skip memfd creation with precompiled modules
This commit updates the memfd support internally to not actually use a
memfd if a compiled module originally came from disk via the
`wasmtime::Module::deserialize_file` API. In this situation we already
have a file descriptor open and there's no need to copy a module's heap
image to a new file descriptor.
To facilitate a new source of `mmap` the currently-memfd-specific-logic
of creating a heap image is generalized to a new form of
`MemoryInitialization` which is attempted for all modules at
module-compile-time. This means that the serialized artifact to disk
will have the memory image in its entirety waiting for us. Furthermore
the memory image is ensured to be padded and aligned carefully to the
target system's page size, notably meaning that the data section in the
final object file is page-aligned and the size of the data section is
also page aligned.
This means that when a precompiled module is mapped from disk we can
reuse the underlying `File` to mmap all initial memory images. This
means that the offset-within-the-memory-mapped-file can differ for
memfd-vs-not, but that's just another piece of state to track in the
memfd implementation.
In the limit this waters down the term "memfd" for this technique of
quickly initializing memory because we no longer use memfd
unconditionally (only when the backing file isn't available).
This does however open up an avenue in the future to porting this
support to other OSes because while `memfd_create` is Linux-specific
both macOS and Windows support mapping a file with copy-on-write. This
porting isn't done in this PR and is left for a future refactoring.
Closes#3758
* Enable "memfd" support on all unix systems
Cordon off the Linux-specific bits and enable the memfd support to
compile and run on platforms like macOS which have a Linux-like `mmap`.
This only works if a module is mapped from a precompiled module file on
disk, but that's better than not supporting it at all!
* Fix linux compile
* Use `Arc<File>` instead of `MmapVecFileBacking`
* Use a named struct instead of mysterious tuples
* Comment about unsafety in `Module::deserialize_file`
* Fix tests
* Fix uffd compile
* Always align data segments
No need to have conditional alignment since their sizes are all aligned
anyway
* Update comment in build.rs
* Use rustix, not `region`
* Fix some confusing logic/names around memory indexes
These functions all work with memory indexes, not specifically defined
memory indexes.
* Move function names out of `Module`
This commit moves function names in a module out of the
`wasmtime_environ::Module` type and into separate sections stored in the
final compiled artifact. Spurred on by #3787 to look at module load
times I noticed that a huge amount of time was spent in deserializing
this map. The `spidermonkey.wasm` file, for example, has a 3MB name
section which is a lot of unnecessary data to deserialize at module load
time.
The names of functions are now split out into their own dedicated
section of the compiled artifact and metadata about them is stored in a
more compact format at runtime by avoiding a `BTreeMap` and instead
using a sorted array. Overall this improves deserialize times by up to
80% for modules with large name sections since the name section is no
longer deserialized at load time and it's lazily paged in as names are
actually referenced.
* Fix a typo
* Fix compiled module determinism
Need to not only sort afterwards but also first to ensure the data of
the name section is consistent.
* Add the instance allocation strategy to generated fuzzing configs.
This commit adds support for generating configs with arbitrary instance
allocation strategies.
With this, the pooling allocator will be fuzzed as part of the existing fuzz
targets.
* Refine maximum constants for arbitrary module limits.
* Add an `instantiate-many` fuzz target.
This commit adds a new `instantiate-many` fuzz target that will attempt to
instantiate and terminate modules in an arbitrary order.
It generates up to 5 modules, from which a random sequence of instances will be
created.
The primary benefactor of this fuzz target is the pooling instance allocator.
* Allow no aliasing in generated modules when using the pooling allocator.
This commit prevents aliases in the generated modules as they might count
against the configured import limits of the pooling allocator.
As the existing module linking proposal implementation will eventually be
deprecated in favor of the component model proposal, it isn't very important
that we test aliases in generated modules with the pooling allocator.
* Improve distribution of memory config in fuzzing.
The previous commit attempted to provide a 32-bit upper bound to 64-bit
arbitrary values, which skewed the distribution heavily in favor of the upper
bound.
This commit removes the constraint and instead uses arbitrary 32-bit values
that are converted to 64-bit values in the `Arbitrary` implementation.
In working on #3787 I see now that our coverage of loading precompiled
files specifically is somewhat lacking, so this adds a config option to
the fuzzers where, if enabled, will round-trip all compiled modules
through the filesystem to test out the mmapped-file case.
Other than doc updates, this just contains bytecodealliance/cap-std#235,
a fix for compilation errors on Rust nightly that look like this:
```
error[E0308]: mismatched types
--> cap-primitives/src/fs/via_parent/rename.rs:22:58
|
22 | let (old_dir, old_basename) = open_parent(old_start, &old_path)?;
| ^^^^^^^^^ expected struct `Path`, found opaque type
|
::: cap-primitives/src/rustix/fs/dir_utils.rs:67:48
|
67 | pub(crate) fn strip_dir_suffix(path: &Path) -> impl Deref<Target = Path> + '_ {
| ------------------------------ the found opaque type
|
= note: expected struct `Path`
found opaque type `impl Deref<Target = Path>`
```
During instance initialization, we build two sorts of arrays eagerly:
- We create an "anyfunc" (a `VMCallerCheckedAnyfunc`) for every function
in an instance.
- We initialize every element of a funcref table with an initializer to
a pointer to one of these anyfuncs.
Most instances will not touch (via call_indirect or table.get) all
funcref table elements. And most anyfuncs will never be referenced,
because most functions are never placed in tables or used with
`ref.func`. Thus, both of these initialization tasks are quite wasteful.
Profiling shows that a significant fraction of the remaining
instance-initialization time after our other recent optimizations is
going into these two tasks.
This PR implements two basic ideas:
- The anyfunc array can be lazily initialized as long as we retain the
information needed to do so. For now, in this PR, we just recreate the
anyfunc whenever a pointer is taken to it, because doing so is fast
enough; in the future we could keep some state to know whether the
anyfunc has been written yet and skip this work if redundant.
This technique allows us to leave the anyfunc array as uninitialized
memory, which can be a significant savings. Filling it with
initialized anyfuncs is very expensive, but even zeroing it is
expensive: e.g. in a large module, it can be >500KB.
- A funcref table can be lazily initialized as long as we retain a link
to its corresponding instance and function index for each element. A
zero in a table element means "uninitialized", and a slowpath does the
initialization.
Funcref tables are a little tricky because funcrefs can be null. We need
to distinguish "element was initially non-null, but user stored explicit
null later" from "element never touched" (ie the lazy init should not
blow away an explicitly stored null). We solve this by stealing the LSB
from every funcref (anyfunc pointer): when the LSB is set, the funcref
is initialized and we don't hit the lazy-init slowpath. We insert the
bit on storing to the table and mask it off after loading.
We do have to set up a precomputed array of `FuncIndex`s for the table
in order for this to work. We do this as part of the module compilation.
This PR also refactors the way that the runtime crate gains access to
information computed during module compilation.
Performance effect measured with in-tree benches/instantiation.rs, using
SpiderMonkey built for WASI, and with memfd enabled:
```
BEFORE:
sequential/default/spidermonkey.wasm
time: [68.569 us 68.696 us 68.856 us]
sequential/pooling/spidermonkey.wasm
time: [69.406 us 69.435 us 69.465 us]
parallel/default/spidermonkey.wasm: with 1 background thread
time: [69.444 us 69.470 us 69.497 us]
parallel/default/spidermonkey.wasm: with 16 background threads
time: [183.72 us 184.31 us 184.89 us]
parallel/pooling/spidermonkey.wasm: with 1 background thread
time: [69.018 us 69.070 us 69.136 us]
parallel/pooling/spidermonkey.wasm: with 16 background threads
time: [326.81 us 337.32 us 347.01 us]
WITH THIS PR:
sequential/default/spidermonkey.wasm
time: [6.7821 us 6.8096 us 6.8397 us]
change: [-90.245% -90.193% -90.142%] (p = 0.00 < 0.05)
Performance has improved.
sequential/pooling/spidermonkey.wasm
time: [3.0410 us 3.0558 us 3.0724 us]
change: [-95.566% -95.552% -95.537%] (p = 0.00 < 0.05)
Performance has improved.
parallel/default/spidermonkey.wasm: with 1 background thread
time: [7.2643 us 7.2689 us 7.2735 us]
change: [-89.541% -89.533% -89.525%] (p = 0.00 < 0.05)
Performance has improved.
parallel/default/spidermonkey.wasm: with 16 background threads
time: [147.36 us 148.99 us 150.74 us]
change: [-18.997% -18.081% -17.285%] (p = 0.00 < 0.05)
Performance has improved.
parallel/pooling/spidermonkey.wasm: with 1 background thread
time: [3.1009 us 3.1021 us 3.1033 us]
change: [-95.517% -95.511% -95.506%] (p = 0.00 < 0.05)
Performance has improved.
parallel/pooling/spidermonkey.wasm: with 16 background threads
time: [49.449 us 50.475 us 51.540 us]
change: [-85.423% -84.964% -84.465%] (p = 0.00 < 0.05)
Performance has improved.
```
So an improvement of something like 80-95% for a very large module (7420
functions in its one funcref table, 31928 functions total).
This commit corrects a few places where `MemoryIndex` was used and treated like
a `DefinedMemoryIndex` in the pooling instance allocator.
When the unstable `multi-memory` proposal is enabled, it is possible to cause a
newly allocated instance to use an incorrect base address for any defined
memories by having the module being instantiated also import a memory.
This requires enabling the unstable `multi-memory` proposal, configuring the
use of the pooling instance allocator (not the default), and then configuring
the module limits to allow imported memories (also not the default).
The fix is to replace all uses of `MemoryIndex` with `DefinedMemoryIndex` in
the pooling instance allocator.
Several `debug_assert!` have also been updated to `assert!` to sanity check the
state of the pooling allocator even in release builds.
This commit updates the `memfd` support in Wasmtime to have a runtime
toggle as to whether it's used or not. The compile-time feature gating
`memfd` support is now also re-enabled by default, but the new runtime
switch is still disabled-by-default.
Additionally this commit updates our fuzz oracle to turn on/off the
memfd flag to re-enable fuzzing with memfd on oss-fuzz.
We currently skip some tests when running our qemu-based tests for
aarch64 and s390x. Qemu has broken madvise(MADV_DONTNEED) semantics --
specifically, it just ignores madvise() [1].
We could continue to whack-a-mole the tests whenever we create new
functionality that relies on madvise() semantics, but ideally we'd just
have emulation that properly emulates!
The earlier discussions on the qemu mailing list [2] had a proposed
patch for this, but (i) this patch doesn't seem to apply cleanly anymore
(it's 3.5 years old) and (ii) it's pretty complex due to the need to
handle qemu's ability to emulate differing page sizes on host and guest.
It turns out that we only really need this for CI when host and guest
have the same page size (4KiB), so we *could* just pass the madvise()s
through. I wouldn't expect such a patch to ever land upstream in qemu,
but it satisfies our needs I think. So this PR modifies our CI setup to
patch qemu before building it locally with a little one-off patch.
[1]
https://github.com/bytecodealliance/wasmtime/pull/2518#issuecomment-747280133
[2]
https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg05416.html
Since memfd support just landed, and has had only ~0.5 weeks to bake
with fuzzing, we want to make release 0.34.0 of Wasmtime without it
enabled by default. This PR disables memfd by default; it can be enabled
by specifying the `memfd` feature for the `wasmtime` crate, or when
building the commandline binary.
We plan to explicitly add memfd-enabled fuzzing targets, let that go for
a while, then probably re-enable memfd in the subsequent release if no
issues come up.
* Consolidate methods of memory initialization
This commit consolidates the few locations that we have which are
performing memory initialization. Namely the uffd logic for creating
paged memory as well as the memfd logic for creating a memory image now
share an implementation to avoid duplicating bounds-checks or other
validation conditions. The main purpose of this commit is to fix a
fuzz-bug where a multiplication overflowed. The overflow itself was
benign but it seemed better to fix the overflow in only one place
instead of multiple.
The overflow in question is specifically when an initializer is checked
to be statically out-of-bounds and multiplies a memory's minimum size by
the wasm page size, returning the result as a `u64`. For
memory64-memories of size `1 << 48` this multiplication will overflow.
This was actually a preexisting bug with the `try_paged_init` function
which was copied for memfd, but cropped up here since memfd is used more
often than paged initialization. The fix here is to skip validation of
the `end` index if the size of memory is `1 << 64` since if the `end`
index can be represented as a `u64` then it's in-bounds. This is
somewhat of an esoteric case, though, since a memory of minimum size `1
<< 64` can't ever exist (we can't even ask the os for that much memory,
and even if we could it would fail).
* Fix memfd test
* Fix some tests
* Remove InitMemory enum
* Add an `is_segmented` helper method
* More clear variable name
* Make arguments to `init_memory` more descriptive
As a followup to the recent memfd allocator work, this PR makes the
memfd image creation occur on the first instantiation, rather than
immediately when the `Module` is loaded.
This shaves off a potentially surprising cost spike that would have
otherwise occurred: prior to the memfd work, no allocator eagerly read
the module's initial heap state into RAM. The behavior should now more
closely resemble what happened before (and the improvements in overall
instantiation time and performance, as compared to either eager init
with pure-mmap memory or user-mode pagefault handling with uffd,
remain).
This fixes a bug in the memfd-related management of a linear memory
where for dynamic memories memfd wasn't informed of the extra room that
the dynamic memory could grow into, only the actual size of linear
memory, which ended up tripping an assert once the memory was grown. The
fix here is pretty simple which is to factor in this extra space when
passing the allocation size to the creation of the `MemFdSlot`.
* Tweak memfd-related features crates
This commit changes the `memfd` feature for the `wasmtime-cli` crate
from an always-on feature to a default-on feature which can be disabled
at compile time. Additionally the `pooling-allocator` feature is also
given similar treatment.
Additionally some documentation was added for the `memfd` feature on the
`wasmtime` crate.
* Don't store `Arc<T>` in `InstanceAllocationRequest`
Instead store `&Arc<T>` to avoid having the clone that lives in
`InstanceAllocationRequest` not actually going anywhere. Otherwise all
instance allocation requires an extra clone to create it for the request
and an extra decrement when the request goes away. Internally clones are
made as necessary when creating instances.
* Enable the pooling allocator by default for `wasmtime-cli`
While perhaps not the most useful option since the CLI doesn't have a
great way to take advantage of this it probably makes sense to at least
match the features of `wasmtime` itself.
* Fix some lints and issues
* More compile fixes
* memfd: Reduce some syscalls in the on-demand case
This tweaks the internal organization of the `MemFdSlot` to avoid some
syscalls in the default case as well as opportunistically in the pooling
case. The two cases added here are:
* A `MemFdSlot` is now created with a specified initial size. For
pooling this is 0 but for the on-demand case this can be non-zero.
* When `instantiate` is called with no prior image and the sizes match
(as will be the case for on-demand allocation) then `mprotect` is
skipped entirely.
* In the `clear_and_remain-ready` case the `mprotect` is skipped if the
heap wasn't grown at all.
This should avoid ever using `mprotect` unnecessarily and makes the
ranges we `mprotect` a bit smaller as well.
* Review comments
* Tweak allow to apply to whole crate
This policy attempts to reuse the same instance slot for subsequent
instantiations of the same module. This is particularly useful when
using a pooling backend such as memfd that benefits from this reuse: for
example, in the memfd case, instantiating the same module into the same
slot allows us to avoid several calls to mmap() because the same
mappings can be reused.
The policy tracks a freelist per "compiled module ID", and when
allocating a slot for an instance, tries these three options in order:
1. A slot from the freelist for this module (i.e., last used for another
instantiation of this particular module), or
3. A slot that was last used by some other module or never before.
The "victim" slot for choice 2 is randomly chosen.
The data structures are carefully designed so that all updates are O(1),
and there is no retry-loop in any of the random selection.
This policy is now the default when the memfd backend is selected via
the `memfd-allocator` feature flag.
* Lazily populate a store's trampoline map
This commit is another installment of "how fast can we make
instantiation". Currently when instantiating a module with many function
imports each function, typically from the host, is inserted into the
store. This insertion process stores the `VMTrampoline` for the host
function in a side table so it can be looked up later if the host
function is called through the `Func` interface. This insertion process,
however, involves a hash map insertion which can be relatively expensive
at the scale of the rest of the instantiation process.
The optimization implemented in this commit is to avoid inserting
trampolines into the store at `Func`-insertion-time (aka instantiation
time) and instead only lazily populate the map of trampolines when
needed. The theory behind this is that almost all `Func` instances that
are called indirectly from the host are actually wasm functions, not
host-defined functions. This means that they already don't need to go
through the map of host trampolines and can instead be looked up from
the module they're defined in. With the assumed rarity of host functions
making `lookup_trampoline` a bit slower seems ok.
The `lookup_trampoline` function will now, on a miss from the wasm
modules and `host_trampolines` map, lazily iterate over the functions
within the store and insert trampolines into the `host_trampolines` map.
This process will eventually reach something which matches the function
provided because it should at least hit the same host function. The
relevant `lookup_trampoline` now sports a new documentation block
explaining all this as well for future readers.
Concretely this commit speeds up instantiation of an empty module with
100 imports and ~80 unique signatures from 10.6us to 6.4us, a 40%
improvement.
* Review comments
* Remove debug assert
Following up on #3696, use the new is-terminal crate to test for a tty
rather than having platform-specific logic in Wasmtime. The is-terminal
crate has a platform-independent API which takes a handle.
This also updates the tree to cap-std 0.24 etc., to avoid depending on
multiple versions of io-lifetimes at once, as enforced by the cargo deny
check.
With the addition of `sock_accept()` in `wasi-0.11.0`, wasmtime can now
implement basic networking for pre-opened sockets.
For Windows `AsHandle` was replaced with `AsRawHandleOrSocket` to cope
with the duality of Handles and Sockets.
For Unix a `wasi_cap_std_sync::net::Socket` enum was created to handle
the {Tcp,Unix}{Listener,Stream} more efficiently in
`WasiCtxBuilder::preopened_socket()`.
The addition of that many `WasiFile` implementors was mainly necessary,
because of the difference in the `num_ready_bytes()` function.
A known issue is Windows now busy polling on sockets, because except
for `stdin`, nothing is querying the status of windows handles/sockets.
Another know issue on Windows, is that there is no crate providing
support for `fcntl(fd, F_GETFL, 0)` on a socket.
Signed-off-by: Harald Hoyer <harald@profian.com>
Testing so far with recent Wasmtime has not been able to show the need
for avoiding the process-wide mmap lock in real-world use-cases. As
such, the technique of using an anonymous file and ftruncate() to extend
it seems unnecessary; instead, memfd can always use anonymous zeroed
memory for heap backing where the CoW image is not present, and
mprotect() to extend the heap limit by changing page protections.
As first suggested by Jan on the Zulip here [1], a cheap and effective
way to obtain copy-on-write semantics of a "backing image" for a Wasm
memory is to mmap a file with `MAP_PRIVATE`. The `memfd` mechanism
provided by the Linux kernel allows us to create anonymous,
in-memory-only files that we can use for this mapping, so we can
construct the image contents on-the-fly then effectively create a CoW
overlay. Furthermore, and importantly, `madvise(MADV_DONTNEED, ...)`
will discard the CoW overlay, returning the mapping to its original
state.
By itself this is almost enough for a very fast
instantiation-termination loop of the same image over and over,
without changing the address space mapping at all (which is
expensive). The only missing bit is how to implement
heap *growth*. But here memfds can help us again: if we create another
anonymous file and map it where the extended parts of the heap would
go, we can take advantage of the fact that a `mmap()` mapping can
be *larger than the file itself*, with accesses beyond the end
generating a `SIGBUS`, and the fact that we can cheaply resize the
file with `ftruncate`, even after a mapping exists. So we can map the
"heap extension" file once with the maximum memory-slot size and grow
the memfd itself as `memory.grow` operations occur.
The above CoW technique and heap-growth technique together allow us a
fastpath of `madvise()` and `ftruncate()` only when we re-instantiate
the same module over and over, as long as we can reuse the same
slot. This fastpath avoids all whole-process address-space locks in
the Linux kernel, which should mean it is highly scalable. It also
avoids the cost of copying data on read, as the `uffd` heap backend
does when servicing pagefaults; the kernel's own optimized CoW
logic (same as used by all file mmaps) is used instead.
[1] https://bytecodealliance.zulipchat.com/#narrow/stream/206238-general/topic/Copy.20on.20write.20based.20instance.20reuse/near/266657772