Commit Graph

9803 Commits

Author SHA1 Message Date
Chris Fallin
84a8368e88 Fix to the optimization: mprotect(NONE) sometimes needed after skipping the initial mmap. 2022-02-01 16:34:06 -08:00
Chris Fallin
01e6bb81fb Review feedback. 2022-02-01 15:49:44 -08:00
Nick Fitzgerald
491e98233e Merge pull request #3750 from bytecodealliance/pch/fix_3749
fix #3749: returns count should count the returns, not the params.
2022-02-01 10:29:17 -08:00
Pat Hickey
aa4c81a4e7 fix #3749: returns count should count the returns, not the params. 2022-02-01 09:46:46 -08:00
Chris Fallin
0ff8f6ab20 Make build-config magic use memfd by default. 2022-01-31 22:39:20 -08:00
Chris Fallin
ccfa245261 Optimization: only mprotect the *new* bit of heap, not all of it.
(This was not a correctness bug, but is an obvious performance bug...)
2022-01-31 21:25:40 -08:00
Chris Fallin
982df2f2e5 Review feedback. 2022-01-31 16:40:14 -08:00
Harald Hoyer
853a025613 Implement sock_accept
With the addition of `sock_accept()` in `wasi-0.11.0`, wasmtime can now
implement basic networking for pre-opened sockets.

For Windows `AsHandle` was replaced with `AsRawHandleOrSocket` to cope
with the duality of Handles and Sockets.

For Unix a `wasi_cap_std_sync::net::Socket` enum was created to handle
the {Tcp,Unix}{Listener,Stream} more efficiently in
`WasiCtxBuilder::preopened_socket()`.

The addition of that many `WasiFile` implementors was mainly necessary,
because of the difference in the `num_ready_bytes()` function.

A known issue is Windows now busy polling on sockets, because except
for `stdin`, nothing is querying the status of windows handles/sockets.

Another know issue on Windows, is that there is no crate providing
support for `fcntl(fd, F_GETFL, 0)` on a socket.

Signed-off-by: Harald Hoyer <harald@profian.com>
2022-01-31 16:25:11 -08:00
Nick Fitzgerald
8ba3294881 Merge pull request #3745 from uweigand/move-emit
Move emit and emit_safepoint to prelude.isle
2022-01-31 14:43:36 -08:00
Chris Fallin
570dee63f3 Use MemFdSlot in the on-demand allocator as well. 2022-01-31 13:59:51 -08:00
Ulrich Weigand
a3e2f5c28b Move emit and emit_safepoint to prelude.isle
Even though the implementation of emit and emit_safepoint may
be platform-specific, the interface ought to be common so that
other code in prelude.isle may safely call these constructors.

This patch moves the definition of emit (from all platforms)
and emit_safepoint (s390x only) to prelude.isle.  This required
adding an emit_safepoint implementation to aarch64 and x64 as
well - the latter is still a stub as special move mitosis
handling will be required.
2022-01-31 22:54:04 +01:00
Nick Fitzgerald
3c2f695dd3 Merge pull request #3724 from uweigand/s390x-isle-branchtrap
s390x: Migrate branches and traps to ISLE
2022-01-31 13:42:49 -08:00
Chris Fallin
3702e81d30 Remove ftruncate-trick for heap growth with memfd backend.
Testing so far with recent Wasmtime has not been able to show the need
for avoiding the process-wide mmap lock in real-world use-cases. As
such, the technique of using an anonymous file and ftruncate() to extend
it seems unnecessary; instead, memfd can always use anonymous zeroed
memory for heap backing where the CoW image is not present, and
mprotect() to extend the heap limit by changing page protections.
2022-01-31 12:53:22 -08:00
Chris Fallin
b73ac83c37 Add a pooling allocator mode based on copy-on-write mappings of memfds.
As first suggested by Jan on the Zulip here [1], a cheap and effective
way to obtain copy-on-write semantics of a "backing image" for a Wasm
memory is to mmap a file with `MAP_PRIVATE`. The `memfd` mechanism
provided by the Linux kernel allows us to create anonymous,
in-memory-only files that we can use for this mapping, so we can
construct the image contents on-the-fly then effectively create a CoW
overlay. Furthermore, and importantly, `madvise(MADV_DONTNEED, ...)`
will discard the CoW overlay, returning the mapping to its original
state.

By itself this is almost enough for a very fast
instantiation-termination loop of the same image over and over,
without changing the address space mapping at all (which is
expensive). The only missing bit is how to implement
heap *growth*. But here memfds can help us again: if we create another
anonymous file and map it where the extended parts of the heap would
go, we can take advantage of the fact that a `mmap()` mapping can
be *larger than the file itself*, with accesses beyond the end
generating a `SIGBUS`, and the fact that we can cheaply resize the
file with `ftruncate`, even after a mapping exists. So we can map the
"heap extension" file once with the maximum memory-slot size and grow
the memfd itself as `memory.grow` operations occur.

The above CoW technique and heap-growth technique together allow us a
fastpath of `madvise()` and `ftruncate()` only when we re-instantiate
the same module over and over, as long as we can reuse the same
slot. This fastpath avoids all whole-process address-space locks in
the Linux kernel, which should mean it is highly scalable. It also
avoids the cost of copying data on read, as the `uffd` heap backend
does when servicing pagefaults; the kernel's own optimized CoW
logic (same as used by all file mmaps) is used instead.

[1] https://bytecodealliance.zulipchat.com/#narrow/stream/206238-general/topic/Copy.20on.20write.20based.20instance.20reuse/near/266657772
2022-01-31 12:53:18 -08:00
Alex Crichton
a25f7bdba5 Don't copy VMBuiltinFunctionsArray into each VMContext (#3741)
* Don't copy `VMBuiltinFunctionsArray` into each `VMContext`

This is another PR along the lines of "let's squeeze all possible
performance we can out of instantiation". Before this PR we would copy,
by value, the contents of `VMBuiltinFunctionsArray` into each
`VMContext` allocated. This array of function pointers is modestly-sized
but growing over time as we add various intrinsics. Additionally it's
the exact same for all `VMContext` allocations.

This PR attempts to speed up instantiation slightly by instead storing
an indirection to the function array. This means that calling a builtin
intrinsic is a tad bit slower since it requires two loads instead of one
(one to get the base pointer, another to get the actual address).
Otherwise though `VMContext` initialization is now simply setting one
pointer instead of doing a `memcpy` from one location to another.

With some macro-magic this commit also replaces the previous
implementation with one that's more `const`-friendly which also gets us
compile-time type-checks of libcalls as well as compile-time
verification that all libcalls are defined.

Overall, as with #3739, the win is very modest here. Locally I measured
a speedup from 1.9us to 1.7us taken to instantiate an empty module with
one function. While small at these scales it's still a 10% improvement!

* Review comments
2022-01-28 16:24:34 -06:00
Alex Crichton
2f494240f8 Lazily allocate the bump-alloc chunk in the externref table (#3739)
This commit updates the allocation of a `VMExternRefActivationsTable`
structure to perform zero malloc memory allocations. Previously it would
allocate a page-size of `chunk` plus some space in hash sets for future
insertions. The main trick here implemented is that after the first gc
during the slow path the fast chunk allocation is allocated and
configured.

The motivation for this PR is that given our recent work to further
refine and optimize the instantiation process this allocation started to
show up in a nontrivial fashion. Most modules today never touch this
table anyway as almost none of them use reference types, so the time
spent allocation and deallocating the table per-store was largely wasted
time.

Concretely on a microbenchmark this PR speeds up instantiation of a
module with one function by 30%, decreasing the instantiation cost from
1.8us to 1.2us. Overall a pretty minor win but when the instantiation
times we're measuring start being in the single-digit microseconds this
win ends up getting magnified!
2022-01-28 16:10:05 -06:00
Nick Fitzgerald
34537a39cd Merge pull request #3734 from fitzgen/externref-debug-assertion-fix
Fix a debug assertion in `externref` garbage collections
2022-01-28 11:03:03 -08:00
Nick Fitzgerald
19f8d94959 Expand on activations table invariants comment in libcalls.rs 2022-01-28 09:47:05 -08:00
Nick Fitzgerald
f292ff55cf Do another GC after running Wasm in the table_ops test oracle 2022-01-27 16:53:10 -08:00
Nick Fitzgerald
cc8d7778e2 Make the table_ops test case generator use globals as well
This will make it generate `table.set`s that come from `global.get`s and
`global.get`s that come from `table.set`s. Ultimately, it should give us much
more fuzzing coverage of `externref` globals, their barriers, and passing
`externref`s into and out of Wasm to get or set globals.
2022-01-27 16:53:10 -08:00
Andrew Brown
90bfa123e0 docs: render the fcmp tables as code (#3735)
Looking at [the `fcmp`
documentation](https://docs.rs/cranelift-codegen/0.80.0/cranelift_codegen/ir/trait.InstBuilder.html#method.fcmp)--generated
from Cranelift's instruction definitions, the charts explaining the
logic for the various conditions is unreadable. Since rendering those charts
as plain text is problematic, this change wraps them as code sections
for a consistent layout.
2022-01-27 16:36:33 -08:00
Nick Fitzgerald
cbc6f6071f Fix a debug assertion in externref garbage collections
When we GC, we assert the invariant that all `externref`s we find on the stack
have a corresponding entry in the `VMExternRefActivationsTable`. However, we
also might be in code that is in the process of fixing up this invariant and
adding an entry to the table, but the table's bump chunk is full, and so we do a
GC and then add the entry into the table. This will ultimately maintain our
desired invariant, but there is a moment in time when we are doing the GC where
the invariant is relaxed which is okay because the reference will be in the
table before we return to Wasm or do anything else. This isn't a possible UAF,
in other words. To make it so that the assertion won't trip, we explicitly
insert the reference into the table *before* we GC, so that the invariant is not
relaxed across a possibly-GCing operation (even though it would be safe in this
particular case).
2022-01-27 14:46:01 -08:00
Alex Crichton
7928a3ffb4 Lazily load types into Func (#3727)
* Lazily load types into `Func`

This commit changes the construction of a `Func` to lazily load the type
information if required instead of always loading the type information
at `Func`-construction time. The main purpose of this change is to
accelerate instantiation of instances which have many imports. Currently
in the fast way of doing this the instantiation loop looks like:

    let mut store = Store::new(&engine, ...);
    let instance = instance_pre.instantiate(&mut store);

In this situation the `instance_pre` will typically load host-defined
functions (defined via `Linker` APIs) into the `Store` as individual
`Func` items and then perform the instantiation process. The operation
of loading a `HostFunc` into a `Store` however currently involves two
expensive operations:

* First a read-only lock is taken on the `RwLock` around engine
  signatures.
* Next a clone of the wasm type is made to pull it out of the engine
  signature registry.

Neither of these is actually necessary for imported functions. The
`FuncType` for imported functions is never used since all comparisons
happen with the intern'd indices instead. The only time a `FuncType` is
used typically is for exported functions when using `Func::typed` or
similar APIs which need type information.

This commit makes this path faster by storing `Option<FuncType>` instead
of `FuncType` within a `Func`. This means that it starts out as `None`
and is only filled in on-demand as necessary. This means that when
instantiating a module with many imports no clones/locks are done.

On a simple microbenchmark where a module with 100 imports is
instantiated this PR improves instantiation time by ~35%. Due to the
rwlock used here and the general inefficiency of pthreads rwlocks the
effect is even more profound when many threads are performing the same
instantiation process. On x86_64 with 8 threads performing instantiation
this PR improves instantiation time by 80% and on arm64 it improves by
97% (wow read-contended glibc rwlocks on arm64 are slow).

Note that much of the improvement here is also from memory
allocatoins/deallocations no longer being performed because dropping
functions within a store no longer requires deallocating the `FuncType`
if it's not present.

A downside of this PR is that `Func::ty` is now unconditionally taking
an rwlock if the type hasn't already been filled in. (it uses the
engine). If this is an issue in the future though we can investigate at
that time using somthing like a `Once` to lazily fill in even when
mutable access to the store isn't available.

* Review comments
2022-01-26 14:23:37 -06:00
Nick Fitzgerald
595e3227c5 Merge pull request #3725 from fitzgen/use-emit-helper-aarch64
cranelift: Use `emit` helper function on aarch64
2022-01-25 10:35:59 -08:00
Nick Fitzgerald
9c7de7692b cranelift: Use emit helper function on aarch64
...instead of pushing to the `emitted_insts` vec directly.
2022-01-25 09:59:39 -08:00
Ulrich Weigand
36369a6f35 s390x: Migrate branches and traps to ISLE
In order to migrate branches to ISLE, we define a second entry
point `lower_branch` which gets the list of branch targets as
additional argument.

This requires a small change to `lower_common`: the `isle_lower`
callback argument is changed from a function pointer to a closure.
This allows passing the extra argument via a closure.

Traps make use of the recently added facility to emit safepoints
from ISLE, but are otherwise straightforward.
2022-01-25 18:15:32 +01:00
Chris Fallin
cd6b73fc90 Merge pull request #3723 from uweigand/isle-safepoint
ISLE: Allow emitting safepoint insns
2022-01-25 08:56:22 -08:00
Chris Fallin
ce63a113ab Merge pull request #3717 from uweigand/s390x-branchtarget
s390x: Refactor branch and jumptable emission
2022-01-25 08:55:31 -08:00
Ulrich Weigand
906f6a35cf ISLE: Allow emitting safepoint insns
Change the implementation of emitted_insts in IsleContext from
a plain vector of instructions into a vector of tuples, where
the second element is a boolean that indicates whether this
instruction should be emitted as a safepoint.

This allows targets to emit safepoint insns via ISLE.
2022-01-25 14:21:41 +01:00
Dan Gohman
5fc01bafc7 Fix isatty in WASI. (#3696)
WASI doesn't have an `isatty` ioctl or syscall, so wasi-libc's `isatty`
implementation uses the file descriptor type and rights to determine if
the file descriptor is likely to be a tty. The real fix here will be to
add an `isatty` call to WASI. But for now, have Wasmtime set the
filetype and rights for file descriptors so that wasi-libc's `isatty`
works as expected.
2022-01-24 11:45:16 -08:00
Andrew Brown
b1ad02e43a docs: remove fuzzit badge (#3719)
The badge no longer displays. I believe this is due to the acquisition by GitLab; other repositories seem to be reporting similar issues.
2022-01-24 10:46:34 -08:00
Nick Fitzgerald
20225e2f88 Merge pull request #3718 from uweigand/clif-insndata
ISLE: Fix clif.isle InstructionData entries
2022-01-24 09:33:26 -08:00
Chris Fallin
fa29c6f029 Merge pull request #3720 from bytecodealliance/fitzgen-patch-1
Add cranelift meeting notes 2022-01-24
2022-01-24 09:32:13 -08:00
Nick Fitzgerald
5e28b4e3bc Add cranelift meeting notes 2022-01-24 2022-01-24 09:28:52 -08:00
Tim Park
198d4d64b0 Make examples compile without warnings (#3716) 2022-01-24 09:26:11 -06:00
Ulrich Weigand
071d3a68d0 ISLE: Fix clif.isle InstructionData entries
Attempt to match a Jump instruction in ISLE will currently lead to the
generated files not compiling.  This is because the definition of the
InstructionData enum in clif.isle does not match the actual type used
in Rust code.

Specifically, clif.isle erroneously omits the ValueList variable-length
argument entry if the format does not use a typevar operand.  This is
the case for Jump and a few other formats.  The problem is caused by
a bug in the gen_isle routine in meta/src/gen_inst.rs.
2022-01-24 12:54:16 +01:00
Ulrich Weigand
cee00c6591 s390x: Refactor branch and jumptable emission
The BranchTarget abstraction is no longer needed, since all branches are
being emitted using a MachLabel target.  Remove BranchTarget and simply
use MachLabel everywhere a branch target is required.  (This brings the
s390x back-end in line with what x64 does as well.)

In addition, simplify jumptable emission by moving all instructions
that do not depend on the internal label (i.e. the conditional branch
to the default label, as well as the scaling the index register) out of
the combined JTSequence instruction.

This refactoring will make moving branch generation to ISLE easier.
2022-01-24 12:22:53 +01:00
Dan Gohman
881c19473d Use ptr::cast instead of as casts in several places. (#3507)
`ptr::cast` has the advantage of being unable to silently cast
`*const T` to `*mut T`. This turned up several places that were
performing such casts, which this PR also fixes.
2022-01-21 13:03:17 -08:00
Chris Fallin
a9d209732d Merge pull request #3706 from uweigand/isle-s390x
s390x: Add ISLE support
2022-01-21 11:25:49 -08:00
Chris Fallin
d61e4e0559 Merge pull request #3709 from cfallin/cold-blocks-dead-code-bug
Cranelift: Fix cold-blocks-related lowering bug.
2022-01-21 11:18:48 -08:00
Chris Fallin
ef1b2d2fa8 Cranelift: Fix cold-blocks-related lowering bug.
If a block is marked cold but has side-effect-free code that is only
used by side-effectful code in non-cold blocks, we will erroneously fail
to emit it, causing a regalloc failure.

This is due to the interaction of block ordering and lowering: we rely
on block ordering to visit uses before defs (except for backedges) so
that we can effectively do an inline liveness analysis and skip lowering
operations that are not used anywhere. This "inline DCE" is needed
because instruction lowering can pattern-match and merge one instruction
into another, removing the need to generate the source instruction.

Unfortunately, the way that I added cold-block support in #3698 was
oblivious to this -- it just changed the block sort order. For
efficiency reasons, we generate code in its final order directly, so it
would not be tenable to generate it in e.g. RPO first and then reorder
cold blocks to the bottom; we really do want to visit in the same order
as the final code.

This PR fixes the bug by moving the point at which cold blocks are sunk
to emission-time instead. This is cheaper than either trying to visit
blocks during lowering in RPO but add to VCode out-of-order, or trying
to do some expensive analysis to recover proper liveness. It's not clear
that the latter would be possible anyway -- the need to lower some
instructions depends on other instructions' isel results/merging
success, so we really do need to visit in RPO, and we can't simply lower
all instructions as side-effecting roots (some can't be toplevel nodes).

The one downside of this approach is that the VCode itself still has
cold blocks inline; so in the text format (and hence compile-tests) it's
not possible to see the sinking. This PR adds a test for cold-block
sinking that actually verifies the machine code. (The test also includes
an add-instruction in the cold path that would have been incorrectly
skipped prior to this fix.)

Fortunately this bug would not have been triggered by the one current
use of cold blocks in #3699, because there the only operation in the
cold block was an (always effectful) call instruction. The worst-case
effect of the bug in other code would be a regalloc panic; no silent
miscompilations could result.
2022-01-21 10:47:49 -08:00
Ulrich Weigand
a94e72b5b7 s390x: Add ISLE support
This adds ISLE support for the s390x back-end and moves lowering
of most instructions to ISLE.  The only instructions still remaining
are calls, returns, traps, and branches, most of which will need
additional support in common code.

Generated code is not intended to be (significantly) different
than before; any additional optimizations now made easier to
implement due to the ISLE layer can be added in follow-on patches.

There were a few differences in some filetests, but those are all
just simple register allocation changes (and all to the better!).
2022-01-21 19:30:56 +01:00
Nick Fitzgerald
78ff829733 Merge pull request #3708 from cfallin/cold-blocks-clif
Add syntax for cold blocks to CLIF.
2022-01-21 10:19:46 -08:00
Alex Crichton
2b04466955 Document epoch interruption in async documentation (#3712)
I, after-the-fact, now recall that we document fuel and other
interruption schemes in the `Config::async_support` documentation so
I've re-worded that section to mention epoch-based interruption as well.
2022-01-21 11:16:27 -06:00
Chris Fallin
51649d56b7 Add syntax for cold blocks to CLIF.
This commit adds support for denoting cold blocks in the CLIF text
format as follows:

```plain

function %f() {
block0(...):
  ...

block1 cold:
  ...

block2(...) cold:
  ...

block3:
  ...
```

With this syntax, we are able to see the cold-block flag in CLIF, we can
write tests using it, and it is preserved when round-tripping.

Fixes #3701.
2022-01-20 16:49:52 -08:00
Chris Fallin
90e7cef56c Merge pull request #3699 from cfallin/epoch-interruption
Add epoch-based interruption for cooperative async timeslicing.
2022-01-20 14:45:30 -08:00
Chris Fallin
8a55b5c563 Add epoch-based interruption for cooperative async timeslicing.
This PR introduces a new way of performing cooperative timeslicing that
is intended to replace the "fuel" mechanism. The tradeoff is that this
mechanism interrupts with less precision: not at deterministic points
where fuel runs out, but rather when the Engine enters a new epoch. The
generated code instrumentation is substantially faster, however, because
it does not need to do as much work as when tracking fuel; it only loads
the global "epoch counter" and does a compare-and-branch at backedges
and function prologues.

This change has been measured as ~twice as fast as fuel-based
timeslicing for some workloads, especially control-flow-intensive
workloads such as the SpiderMonkey JS interpreter on Wasm/WASI.

The intended interface is that the embedder of the `Engine` performs an
`engine.increment_epoch()` call periodically, e.g. once per millisecond.
An async invocation of a Wasm guest on a `Store` can specify a number of
epoch-ticks that are allowed before an async yield back to the
executor's event loop. (The initial amount and automatic "refills" are
configured on the `Store`, just as for fuel.) This call does only
signal-safe work (it increments an `AtomicU64`) so could be invoked from
a periodic signal, or from a thread that wakes up once per period.
2022-01-20 13:58:17 -08:00
Chris Fallin
2615ef967f Merge pull request #3702 from uweigand/isle-prep-s390x
s390x: Codegen fixes and preparation for ISLE migration
2022-01-20 12:02:08 -08:00
Nick Fitzgerald
0670d7beb5 Merge pull request #3703 from uweigand/isle-prep-common
ISLE standard prelude: Additional types and helpers
2022-01-20 10:09:51 -08:00
Ulrich Weigand
be60a19623 ISLE standard prelude: Additional types and helpers
In preparing to move the s390x back-end to ISLE, I noticed a few
missing pieces in the common prelude code.  This patch:

- Defines the reference types $R32 / $R64.
- Provides a trap_code_bad_conversion_to_integer helper.
- Provides an avoid_div_traps helper.  This requires passing the
  generic flags in addition to the ISA-specifc flags into the
  ISLE lowering context.
2022-01-20 17:23:31 +01:00