Testing so far with recent Wasmtime has not been able to show the need
for avoiding the process-wide mmap lock in real-world use-cases. As
such, the technique of using an anonymous file and ftruncate() to extend
it seems unnecessary; instead, memfd can always use anonymous zeroed
memory for heap backing where the CoW image is not present, and
mprotect() to extend the heap limit by changing page protections.
As first suggested by Jan on the Zulip here [1], a cheap and effective
way to obtain copy-on-write semantics of a "backing image" for a Wasm
memory is to mmap a file with `MAP_PRIVATE`. The `memfd` mechanism
provided by the Linux kernel allows us to create anonymous,
in-memory-only files that we can use for this mapping, so we can
construct the image contents on-the-fly then effectively create a CoW
overlay. Furthermore, and importantly, `madvise(MADV_DONTNEED, ...)`
will discard the CoW overlay, returning the mapping to its original
state.
By itself this is almost enough for a very fast
instantiation-termination loop of the same image over and over,
without changing the address space mapping at all (which is
expensive). The only missing bit is how to implement
heap *growth*. But here memfds can help us again: if we create another
anonymous file and map it where the extended parts of the heap would
go, we can take advantage of the fact that a `mmap()` mapping can
be *larger than the file itself*, with accesses beyond the end
generating a `SIGBUS`, and the fact that we can cheaply resize the
file with `ftruncate`, even after a mapping exists. So we can map the
"heap extension" file once with the maximum memory-slot size and grow
the memfd itself as `memory.grow` operations occur.
The above CoW technique and heap-growth technique together allow us a
fastpath of `madvise()` and `ftruncate()` only when we re-instantiate
the same module over and over, as long as we can reuse the same
slot. This fastpath avoids all whole-process address-space locks in
the Linux kernel, which should mean it is highly scalable. It also
avoids the cost of copying data on read, as the `uffd` heap backend
does when servicing pagefaults; the kernel's own optimized CoW
logic (same as used by all file mmaps) is used instead.
[1] https://bytecodealliance.zulipchat.com/#narrow/stream/206238-general/topic/Copy.20on.20write.20based.20instance.20reuse/near/266657772
This PR introduces a new way of performing cooperative timeslicing that
is intended to replace the "fuel" mechanism. The tradeoff is that this
mechanism interrupts with less precision: not at deterministic points
where fuel runs out, but rather when the Engine enters a new epoch. The
generated code instrumentation is substantially faster, however, because
it does not need to do as much work as when tracking fuel; it only loads
the global "epoch counter" and does a compare-and-branch at backedges
and function prologues.
This change has been measured as ~twice as fast as fuel-based
timeslicing for some workloads, especially control-flow-intensive
workloads such as the SpiderMonkey JS interpreter on Wasm/WASI.
The intended interface is that the embedder of the `Engine` performs an
`engine.increment_epoch()` call periodically, e.g. once per millisecond.
An async invocation of a Wasm guest on a `Store` can specify a number of
epoch-ticks that are allowed before an async yield back to the
executor's event loop. (The initial amount and automatic "refills" are
configured on the `Store`, just as for fuel.) This call does only
signal-safe work (it increments an `AtomicU64`) so could be invoked from
a periodic signal, or from a thread that wakes up once per period.
In preparing to move the s390x back-end to ISLE, I noticed a few
missing pieces in the common prelude code. This patch:
- Defines the reference types $R32 / $R64.
- Provides a trap_code_bad_conversion_to_integer helper.
- Provides an avoid_div_traps helper. This requires passing the
generic flags in addition to the ISA-specifc flags into the
ISLE lowering context.
In preparing the back-end to move to ISLE, I detected a
number of codegen bugs in the existing code, which are
fixed here:
- Fix internal compiler error with uload16/icmp corner case.
- Fix broken Cls lowering.
- Correctly mask shift count for i8/i16 shifts.
In addition, I made several changes to operand encodings
in various MInst patterns. These should not have any
functional effect, but will make the ISLE migration easier:
- Encode floating-point constants as u32/u64 in MInst patterns.
- Encode shift amounts as u8 and Reg in ShiftOp pattern.
- Use MemArg in LoadMultiple64 and StoreMultiple64 patterns.
- Cranelift: items to discuss 2022 roadmap, coordinate who's working on
what in ISLE transition for all platforms, and discuss platform tiers
and arm32 (wrt above)
- Wasmtime: item to discuss new revelations re: memfd/CoW and epoch
interruption schemes
This PR adds a flag to each block that can be set via the frontend/builder
interface that indicates that the block will not be frequently
executed. As such, the compiler backend should place the block "out of
line" in the final machine code, so that the ordinary, more frequent
execution path that excludes the block does not have to jump around it.
This is useful for adding handlers for exceptional conditions
(slow-paths, guard violations) in a way that minimizes performance cost.
Fixes#2747.
* Provide helpers for demangling function names
* Profile trampolines in vtune too
* get rid of mapping
* avoid code duplication with jitdump_linux
* maintain previous default display name for wasm functions
* no dash, grrr
* Remove unused profiling error type
Before this PR, each profiler (perf/vtune, at the moment) had to have a
demangler for each of the programming languages that could have been
compiled to wasm and fed into wasmtime. With this, wasmtime now
demangles names before even forwarding them to the underlying profiler,
which makes for a unified representation in profilers, and avoids
incorrect demangling in profilers.
* Update lots of `isa/*/*.clif` tests to `precise-output`
This commit goes through the `aarch64` and `x64` subdirectories and
subjectively changes tests from `test compile` to add `precise-output`.
This then auto-updates all the test expectations so they can be
automatically instead of manually updated in the future. Not all tests
were migrated, largely subject to the whims of myself, mainly looking to
see if the test was looking for specific instructions or just checking
the whole assembly output.
* Filter out `;;` comments from test expctations
Looks like the cranelift parser picks up all comments, not just those
trailing the function, so use a convention where `;;` is used for
human-readable-comments in test cases and `;`-prefixed comments are the
test expectation.
* cranelift: Add ability to auto-update test expectations
One of the problems of the current `*.clif` testing is that the files
are difficult to update when widespread changes are made (such as
removing modification of the frame pointer). Additionally when changing
register allocation or similar it can cause a large number of changes in
tests but the tests themselves didn't actually break. For this reason
this commit adds the ability to automatically update test expectations.
The idea behind this commit is that tests of the form `test compile` can
also optionally be flagged with the `precise-output` flag:
test compile precise-output
and when doing so the compiled form of each function is asserted to 100%
match the following comments and their test expectations. If a match is
not found then a `BLESS=1` environment variable can be used to
automatically rewrite the test file itself with the correct assertion.
If the environment variable isn't present and the expectation doesn't
match then the test fails.
It's hoped that, if approved, a follow-up commit can add
`precise-output` to all current `test compile` tests (or make it the
default) and all tests can be mass-updated. When developing locally test
expectations need not be written and instead tests can be run with
`BLESS=1` and the output can be manually verified. The environment
variable will not be present on CI which means that changes to the
output which don't also change the test expectation will cause CI to
fail. Furthermore this should still make updates to the test output
easily readable in review on CI because the test expectations are
intended to look the same as before.
Closes#1539
* Use raw vcode output in tests
* Fix a merge conflict
* Review comments