Commit Graph

30 Commits

Author SHA1 Message Date
Alex Crichton
15bb0c6903 Remove the ModuleLimits pooling configuration structure (#3837)
* Remove the `ModuleLimits` pooling configuration structure

This commit is an attempt to improve the usability of the pooling
allocator by removing the need to configure a `ModuleLimits` structure.
Internally this structure has limits on all forms of wasm constructs but
this largely bottoms out in the size of an allocation for an instance in
the instance pooling allocator. Maintaining this list of limits can be
cumbersome as modules may get tweaked over time and there's otherwise no
real reason to limit the number of globals in a module since the main
goal is to limit the memory consumption of a `VMContext` which can be
done with a memory allocation limit rather than fine-tuned control over
each maximum and minimum.

The new approach taken in this commit is to remove `ModuleLimits`. Some
fields, such as `tables`, `table_elements` , `memories`, and
`memory_pages` are moved to `InstanceLimits` since they're still
enforced at runtime. A new field `size` is added to `InstanceLimits`
which indicates, in bytes, the maximum size of the `VMContext`
allocation. If the size of a `VMContext` for a module exceeds this value
then instantiation will fail.

This involved adding a few more checks to `{Table, Memory}::new_static`
to ensure that the minimum size is able to fit in the allocation, since
previously modules were validated at compile time of the module that
everything fit and that validation no longer happens (it happens at
runtime).

A consequence of this commit is that Wasmtime will have no built-in way
to reject modules at compile time if they'll fail to be instantiated
within a particular pooling allocator configuration. Instead a module
must attempt instantiation see if a failure happens.

* Fix benchmark compiles

* Fix some doc links

* Fix a panic by ensuring modules have limited tables/memories

* Review comments

* Add back validation at `Module` time instantiation is possible

This allows for getting an early signal at compile time that a module
will never be instantiable in an engine with matching settings.

* Provide a better error message when sizes are exceeded

Improve the error message when an instance size exceeds the maximum by
providing a breakdown of where the bytes are all going and why the large
size is being requested.

* Try to fix test in qemu

* Flag new test as 64-bit only

Sizes are all specific to 64-bit right now
2022-02-25 09:11:51 -06:00
Alex Crichton
c0c368d151 Use mmap'd *.cwasm as a source for memory initialization images (#3787)
* Skip memfd creation with precompiled modules

This commit updates the memfd support internally to not actually use a
memfd if a compiled module originally came from disk via the
`wasmtime::Module::deserialize_file` API. In this situation we already
have a file descriptor open and there's no need to copy a module's heap
image to a new file descriptor.

To facilitate a new source of `mmap` the currently-memfd-specific-logic
of creating a heap image is generalized to a new form of
`MemoryInitialization` which is attempted for all modules at
module-compile-time. This means that the serialized artifact to disk
will have the memory image in its entirety waiting for us. Furthermore
the memory image is ensured to be padded and aligned carefully to the
target system's page size, notably meaning that the data section in the
final object file is page-aligned and the size of the data section is
also page aligned.

This means that when a precompiled module is mapped from disk we can
reuse the underlying `File` to mmap all initial memory images. This
means that the offset-within-the-memory-mapped-file can differ for
memfd-vs-not, but that's just another piece of state to track in the
memfd implementation.

In the limit this waters down the term "memfd" for this technique of
quickly initializing memory because we no longer use memfd
unconditionally (only when the backing file isn't available).
This does however open up an avenue in the future to porting this
support to other OSes because while `memfd_create` is Linux-specific
both macOS and Windows support mapping a file with copy-on-write. This
porting isn't done in this PR and is left for a future refactoring.

Closes #3758

* Enable "memfd" support on all unix systems

Cordon off the Linux-specific bits and enable the memfd support to
compile and run on platforms like macOS which have a Linux-like `mmap`.
This only works if a module is mapped from a precompiled module file on
disk, but that's better than not supporting it at all!

* Fix linux compile

* Use `Arc<File>` instead of `MmapVecFileBacking`

* Use a named struct instead of mysterious tuples

* Comment about unsafety in `Module::deserialize_file`

* Fix tests

* Fix uffd compile

* Always align data segments

No need to have conditional alignment since their sizes are all aligned
anyway

* Update comment in build.rs

* Use rustix, not `region`

* Fix some confusing logic/names around memory indexes

These functions all work with memory indexes, not specifically defined
memory indexes.
2022-02-10 15:40:40 -06:00
Chris Fallin
39a52ceb4f Implement lazy funcref table and anyfunc initialization. (#3733)
During instance initialization, we build two sorts of arrays eagerly:

- We create an "anyfunc" (a `VMCallerCheckedAnyfunc`) for every function
  in an instance.

- We initialize every element of a funcref table with an initializer to
  a pointer to one of these anyfuncs.

Most instances will not touch (via call_indirect or table.get) all
funcref table elements. And most anyfuncs will never be referenced,
because most functions are never placed in tables or used with
`ref.func`. Thus, both of these initialization tasks are quite wasteful.
Profiling shows that a significant fraction of the remaining
instance-initialization time after our other recent optimizations is
going into these two tasks.

This PR implements two basic ideas:

- The anyfunc array can be lazily initialized as long as we retain the
  information needed to do so. For now, in this PR, we just recreate the
  anyfunc whenever a pointer is taken to it, because doing so is fast
  enough; in the future we could keep some state to know whether the
  anyfunc has been written yet and skip this work if redundant.

  This technique allows us to leave the anyfunc array as uninitialized
  memory, which can be a significant savings. Filling it with
  initialized anyfuncs is very expensive, but even zeroing it is
  expensive: e.g. in a large module, it can be >500KB.

- A funcref table can be lazily initialized as long as we retain a link
  to its corresponding instance and function index for each element. A
  zero in a table element means "uninitialized", and a slowpath does the
  initialization.

Funcref tables are a little tricky because funcrefs can be null. We need
to distinguish "element was initially non-null, but user stored explicit
null later" from "element never touched" (ie the lazy init should not
blow away an explicitly stored null). We solve this by stealing the LSB
from every funcref (anyfunc pointer): when the LSB is set, the funcref
is initialized and we don't hit the lazy-init slowpath. We insert the
bit on storing to the table and mask it off after loading.

We do have to set up a precomputed array of `FuncIndex`s for the table
in order for this to work. We do this as part of the module compilation.

This PR also refactors the way that the runtime crate gains access to
information computed during module compilation.

Performance effect measured with in-tree benches/instantiation.rs, using
SpiderMonkey built for WASI, and with memfd enabled:

```
BEFORE:

sequential/default/spidermonkey.wasm
                        time:   [68.569 us 68.696 us 68.856 us]
sequential/pooling/spidermonkey.wasm
                        time:   [69.406 us 69.435 us 69.465 us]

parallel/default/spidermonkey.wasm: with 1 background thread
                        time:   [69.444 us 69.470 us 69.497 us]
parallel/default/spidermonkey.wasm: with 16 background threads
                        time:   [183.72 us 184.31 us 184.89 us]
parallel/pooling/spidermonkey.wasm: with 1 background thread
                        time:   [69.018 us 69.070 us 69.136 us]
parallel/pooling/spidermonkey.wasm: with 16 background threads
                        time:   [326.81 us 337.32 us 347.01 us]

WITH THIS PR:

sequential/default/spidermonkey.wasm
                        time:   [6.7821 us 6.8096 us 6.8397 us]
                        change: [-90.245% -90.193% -90.142%] (p = 0.00 < 0.05)
                        Performance has improved.
sequential/pooling/spidermonkey.wasm
                        time:   [3.0410 us 3.0558 us 3.0724 us]
                        change: [-95.566% -95.552% -95.537%] (p = 0.00 < 0.05)
                        Performance has improved.

parallel/default/spidermonkey.wasm: with 1 background thread
                        time:   [7.2643 us 7.2689 us 7.2735 us]
                        change: [-89.541% -89.533% -89.525%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/default/spidermonkey.wasm: with 16 background threads
                        time:   [147.36 us 148.99 us 150.74 us]
                        change: [-18.997% -18.081% -17.285%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/spidermonkey.wasm: with 1 background thread
                        time:   [3.1009 us 3.1021 us 3.1033 us]
                        change: [-95.517% -95.511% -95.506%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/spidermonkey.wasm: with 16 background threads
                        time:   [49.449 us 50.475 us 51.540 us]
                        change: [-85.423% -84.964% -84.465%] (p = 0.00 < 0.05)
                        Performance has improved.
```

So an improvement of something like 80-95% for a very large module (7420
functions in its one funcref table, 31928 functions total).
2022-02-09 13:56:53 -08:00
Alex Crichton
04d2caea7b Consolidate methods of memory initialization (#3766)
* Consolidate methods of memory initialization

This commit consolidates the few locations that we have which are
performing memory initialization. Namely the uffd logic for creating
paged memory as well as the memfd logic for creating a memory image now
share an implementation to avoid duplicating bounds-checks or other
validation conditions. The main purpose of this commit is to fix a
fuzz-bug where a multiplication overflowed. The overflow itself was
benign but it seemed better to fix the overflow in only one place
instead of multiple.

The overflow in question is specifically when an initializer is checked
to be statically out-of-bounds and multiplies a memory's minimum size by
the wasm page size, returning the result as a `u64`. For
memory64-memories of size `1 << 48` this multiplication will overflow.
This was actually a preexisting bug with the `try_paged_init` function
which was copied for memfd, but cropped up here since memfd is used more
often than paged initialization. The fix here is to skip validation of
the `end` index if the size of memory is `1 << 64` since if the `end`
index can be represented as a `u64` then it's in-bounds. This is
somewhat of an esoteric case, though, since a memory of minimum size `1
<< 64` can't ever exist (we can't even ask the os for that much memory,
and even if we could it would fail).

* Fix memfd test

* Fix some tests

* Remove InitMemory enum

* Add an `is_segmented` helper method

* More clear variable name

* Make arguments to `init_memory` more descriptive
2022-02-04 13:17:25 -06:00
Alex Crichton
b647561c44 memfd: Some minor follow-ups (#3759)
* Tweak memfd-related features crates

This commit changes the `memfd` feature for the `wasmtime-cli` crate
from an always-on feature to a default-on feature which can be disabled
at compile time. Additionally the `pooling-allocator` feature is also
given similar treatment.

Additionally some documentation was added for the `memfd` feature on the
`wasmtime` crate.

* Don't store `Arc<T>` in `InstanceAllocationRequest`

Instead store `&Arc<T>` to avoid having the clone that lives in
`InstanceAllocationRequest` not actually going anywhere. Otherwise all
instance allocation requires an extra clone to create it for the request
and an extra decrement when the request goes away. Internally clones are
made as necessary when creating instances.

* Enable the pooling allocator by default for `wasmtime-cli`

While perhaps not the most useful option since the CLI doesn't have a
great way to take advantage of this it probably makes sense to at least
match the features of `wasmtime` itself.

* Fix some lints and issues

* More compile fixes
2022-02-03 09:17:04 -06:00
Chris Fallin
6011420557 Pooling allocator: add a reuse-affinity policy.
This policy attempts to reuse the same instance slot for subsequent
instantiations of the same module. This is particularly useful when
using a pooling backend such as memfd that benefits from this reuse: for
example, in the memfd case, instantiating the same module into the same
slot allows us to avoid several calls to mmap() because the same
mappings can be reused.

The policy tracks a freelist per "compiled module ID", and when
allocating a slot for an instance, tries these three options in order:

1. A slot from the freelist for this module (i.e., last used for another
   instantiation of this particular module), or
3. A slot that was last used by some other module or never before.

The "victim" slot for choice 2 is randomly chosen.

The data structures are carefully designed so that all updates are O(1),
and there is no retry-loop in any of the random selection.

This policy is now the default when the memfd backend is selected via
the `memfd-allocator` feature flag.
2022-02-02 12:25:30 -08:00
Chris Fallin
b73ac83c37 Add a pooling allocator mode based on copy-on-write mappings of memfds.
As first suggested by Jan on the Zulip here [1], a cheap and effective
way to obtain copy-on-write semantics of a "backing image" for a Wasm
memory is to mmap a file with `MAP_PRIVATE`. The `memfd` mechanism
provided by the Linux kernel allows us to create anonymous,
in-memory-only files that we can use for this mapping, so we can
construct the image contents on-the-fly then effectively create a CoW
overlay. Furthermore, and importantly, `madvise(MADV_DONTNEED, ...)`
will discard the CoW overlay, returning the mapping to its original
state.

By itself this is almost enough for a very fast
instantiation-termination loop of the same image over and over,
without changing the address space mapping at all (which is
expensive). The only missing bit is how to implement
heap *growth*. But here memfds can help us again: if we create another
anonymous file and map it where the extended parts of the heap would
go, we can take advantage of the fact that a `mmap()` mapping can
be *larger than the file itself*, with accesses beyond the end
generating a `SIGBUS`, and the fact that we can cheaply resize the
file with `ftruncate`, even after a mapping exists. So we can map the
"heap extension" file once with the maximum memory-slot size and grow
the memfd itself as `memory.grow` operations occur.

The above CoW technique and heap-growth technique together allow us a
fastpath of `madvise()` and `ftruncate()` only when we re-instantiate
the same module over and over, as long as we can reuse the same
slot. This fastpath avoids all whole-process address-space locks in
the Linux kernel, which should mean it is highly scalable. It also
avoids the cost of copying data on read, as the `uffd` heap backend
does when servicing pagefaults; the kernel's own optimized CoW
logic (same as used by all file mmaps) is used instead.

[1] https://bytecodealliance.zulipchat.com/#narrow/stream/206238-general/topic/Copy.20on.20write.20based.20instance.20reuse/near/266657772
2022-01-31 12:53:18 -08:00
Chris Fallin
8a55b5c563 Add epoch-based interruption for cooperative async timeslicing.
This PR introduces a new way of performing cooperative timeslicing that
is intended to replace the "fuel" mechanism. The tradeoff is that this
mechanism interrupts with less precision: not at deterministic points
where fuel runs out, but rather when the Engine enters a new epoch. The
generated code instrumentation is substantially faster, however, because
it does not need to do as much work as when tracking fuel; it only loads
the global "epoch counter" and does a compare-and-branch at backedges
and function prologues.

This change has been measured as ~twice as fast as fuel-based
timeslicing for some workloads, especially control-flow-intensive
workloads such as the SpiderMonkey JS interpreter on Wasm/WASI.

The intended interface is that the embedder of the `Engine` performs an
`engine.increment_epoch()` call periodically, e.g. once per millisecond.
An async invocation of a Wasm guest on a `Store` can specify a number of
epoch-ticks that are allowed before an async yield back to the
executor's event loop. (The initial amount and automatic "refills" are
configured on the `Store`, just as for fuel.) This call does only
signal-safe work (it increments an `AtomicU64`) so could be invoked from
a periodic signal, or from a thread that wakes up once per period.
2022-01-20 13:58:17 -08:00
Dan Gohman
ea0cb971fb Update to rustix 0.26.2. (#3521)
This pulls in a fix for Android, where Android's seccomp policy on older
versions is to make `openat2` irrecoverably crash the process, so we have
to do a version check up front rather than relying on `ENOSYS` to
determine if `openat2` is supported.

And it pulls in the fix for the link errors when multiple versions of
rsix/rustix are linked in.

And it has updates for two crate renamings: rsix has been renamed to
rustix, and unsafe-io has been renamed to io-extras.
2021-11-15 10:21:13 -08:00
Pat Hickey
52542b6c01 mock enough of the store to pass the uffd test 2021-10-22 08:56:13 -07:00
Pat Hickey
efef0769fe make uffd test compile, but not pass 2021-10-22 08:39:00 -07:00
Dan Gohman
47490b4383 Use rsix to make system calls in Wasmtime. (#3355)
* Use rsix to make system calls in Wasmtime.

`rsix` is a system call wrapper crate that we use in `wasi-common`,
which can provide the following advantages in the rest of Wasmtime:

 - It eliminates some `unsafe` blocks in Wasmtime's code. There's
   still an `unsafe` block in the library, but this way, the `unsafe`
   is factored out and clearly scoped.

 - And, it makes error handling more consistent, factoring out code for
   checking return values and `io::Error::last_os_error()`, and code that
   does `errno::set_errno(0)`.

This doesn't cover *all* system calls; `rsix` doesn't implement
signal-handling APIs, and this doesn't cover calls made through `std` or
crates like `userfaultfd`, `rand`, and `region`.
2021-09-17 15:28:56 -07:00
Alex Crichton
a237e73b5a Remove some allocations in CodeMemory (#3253)
* Remove some allocations in `CodeMemory`

This commit removes the `FinishedFunctions` type as well as allocations
associated with trampolines when allocating inside of a `CodeMemory`.
The main goal of this commit is to improve the time spent in
`CodeMemory` where currently today a good portion of time is spent
simply parsing symbol names and trying to extract function indices from
them. Instead this commit implements a new strategy (different from #3236)
where compilation records offset/length information for all
functions/trampolines so this doesn't need to be re-learned from the
object file later.

A consequence of this commit is that this offset information will be
decoded/encoded through `bincode` unconditionally, but we can also
optimize that later if necessary as well.

Internally this involved quite a bit of refactoring since the previous
map for `FinishedFunctions` was relatively heavily relied upon.

* comments
2021-08-30 10:35:17 -05:00
Alex Crichton
a662f5361d Move wasm data sections out of wasmtime_environ::Module (#3231)
* Reduce indentation in `to_paged`

Use a few early-returns from `match` to avoid lots of extra indentation.

* Move wasm data sections out of `wasmtime_environ::Module`

This is the first step down the road of #3230. The long-term goal is
that `Module` is always `bincode`-decoded, but wasm data segments are a
possibly very-large portion of this residing in modules which we don't
want to shove through bincode. This refactors the internals of wasmtime
to be ok with this data living separately from the `Module` itself,
providing access at necessary locations.

Wasm data segments are now extracted from a wasm module and
concatenated directly. Data sections then describe ranges within this
concatenated list of data, and passive data works the same way. This
implementation does not lend itself to eventually optimizing the case
where passive data is dropped and no longer needed. That's left for a
future PR.
2021-08-24 14:04:03 -05:00
Alex Crichton
87c33c2969 Remove wasmtime-environ's dependency on cranelift-codegen (#3199)
* Move `CompiledFunction` into wasmtime-cranelift

This commit moves the `wasmtime_environ::CompiledFunction` type into the
`wasmtime-cranelift` crate. This type has lots of Cranelift-specific
pieces of compilation and doesn't need to be generated by all Wasmtime
compilers. This replaces the usage in the `Compiler` trait with a
`Box<Any>` type that each compiler can select. Each compiler must still
produce a `FunctionInfo`, however, which is shared information we'll
deserialize for each module.

The `wasmtime-debug` crate is also folded into the `wasmtime-cranelift`
crate as a result of this commit. One possibility was to move the
`CompiledFunction` commit into its own crate and have `wasmtime-debug`
depend on that, but since `wasmtime-debug` is Cranelift-specific at this
time it didn't seem like it was too too necessary to keep it separate.
If `wasmtime-debug` supports other backends in the future we can
recreate a new crate, perhaps with it refactored to not depend on
Cranelift.

* Move wasmtime_environ::reference_type

This now belongs in wasmtime-cranelift and nowhere else

* Remove `Type` reexport in wasmtime-environ

One less dependency on `cranelift-codegen`!

* Remove `types` reexport from `wasmtime-environ`

Less cranelift!

* Remove `SourceLoc` from wasmtime-environ

Change the `srcloc`, `start_srcloc`, and `end_srcloc` fields to a custom
`FilePos` type instead of `ir::SourceLoc`. These are only used in a few
places so there's not much to lose from an extra abstraction for these
leaf use cases outside of cranelift.

* Remove wasmtime-environ's dep on cranelift's `StackMap`

This commit "clones" the `StackMap` data structure in to
`wasmtime-environ` to have an independent representation that that
chosen by Cranelift. This allows Wasmtime to decouple this runtime
dependency of stack map information and let the two evolve
independently, if necessary.

An alternative would be to refactor cranelift's implementation into a
separate crate and have wasmtime depend on that but it seemed a bit like
overkill to do so and easier to clone just a few lines for this.

* Define code offsets in wasmtime-environ with `u32`

Don't use Cranelift's `binemit::CodeOffset` alias to define this field
type since the `wasmtime-environ` crate will be losing the
`cranelift-codegen` dependency soon.

* Commit to using `cranelift-entity` in Wasmtime

This commit removes the reexport of `cranelift-entity` from the
`wasmtime-environ` crate and instead directly depends on the
`cranelift-entity` crate in all referencing crates. The original reason
for the reexport was to make cranelift version bumps easier since it's
less versions to change, but nowadays we have a script to do that.
Otherwise this encourages crates to use whatever they want from
`cranelift-entity` since  we'll always depend on the whole crate.

It's expected that the `cranelift-entity` crate will continue to be a
lean crate in dependencies and suitable for use at both runtime and
compile time. Consequently there's no need to avoid its usage in
Wasmtime at runtime, since "remove Cranelift at compile time" is
primarily about the `cranelift-codegen` crate.

* Remove most uses of `cranelift-codegen` in `wasmtime-environ`

There's only one final use remaining, which is the reexport of
`TrapCode`, which will get handled later.

* Limit the glob-reexport of `cranelift_wasm`

This commit removes the glob reexport of `cranelift-wasm` from the
`wasmtime-environ` crate. This is intended to explicitly define what
we're reexporting and is a transitionary step to curtail the amount of
dependencies taken on `cranelift-wasm` throughout the codebase. For
example some functions used by debuginfo mapping are better imported
directly from the crate since they're Cranelift-specific. Note that
this is intended to be a temporary state affairs, soon this reexport
will be gone entirely.

Additionally this commit reduces imports from `cranelift_wasm` and also
primarily imports from `crate::wasm` within `wasmtime-environ` to get a
better sense of what's imported from where and what will need to be
shared.

* Extract types from cranelift-wasm to cranelift-wasm-types

This commit creates a new crate called `cranelift-wasm-types` and
extracts type definitions from the `cranelift-wasm` crate into this new
crate. The purpose of this crate is to be a shared definition of wasm
types that can be shared both by compilers (like Cranelift) as well as
wasm runtimes (e.g. Wasmtime). This new `cranelift-wasm-types` crate
doesn't depend on `cranelift-codegen` and is the final step in severing
the unconditional dependency from Wasmtime to `cranelift-codegen`.

The final refactoring in this commit is to then reexport this crate from
`wasmtime-environ`, delete the `cranelift-codegen` dependency, and then
update all `use` paths to point to these new types.

The main change of substance here is that the `TrapCode` enum is
mirrored from Cranelift into this `cranelift-wasm-types` crate. While
this unfortunately results in three definitions (one more which is
non-exhaustive in Wasmtime itself) it's hopefully not too onerous and
ideally something we can patch up in the future.

* Get lightbeam compiling

* Remove unnecessary dependency

* Fix compile with uffd

* Update publish script

* Fix more uffd tests

* Rename cranelift-wasm-types to wasmtime-types

This reflects the purpose a bit more where it's types specifically
intended for Wasmtime and its support.

* Fix publish script
2021-08-18 13:14:52 -05:00
Alex Crichton
e68aa99588 Implement the memory64 proposal in Wasmtime (#3153)
* Implement the memory64 proposal in Wasmtime

This commit implements the WebAssembly [memory64 proposal][proposal] in
both Wasmtime and Cranelift. In terms of work done Cranelift ended up
needing very little work here since most of it was already prepared for
64-bit memories at one point or another. Most of the work in Wasmtime is
largely refactoring, changing a bunch of `u32` values to something else.

A number of internal and public interfaces are changing as a result of
this commit, for example:

* Acessors on `wasmtime::Memory` that work with pages now all return
  `u64` unconditionally rather than `u32`. This makes it possible to
  accommodate 64-bit memories with this API, but we may also want to
  consider `usize` here at some point since the host can't grow past
  `usize`-limited pages anyway.

* The `wasmtime::Limits` structure is removed in favor of
  minimum/maximum methods on table/memory types.

* Many libcall intrinsics called by jit code now unconditionally take
  `u64` arguments instead of `u32`. Return values are `usize`, however,
  since the return value, if successful, is always bounded by host
  memory while arguments can come from any guest.

* The `heap_addr` clif instruction now takes a 64-bit offset argument
  instead of a 32-bit one. It turns out that the legalization of
  `heap_addr` already worked with 64-bit offsets, so this change was
  fairly trivial to make.

* The runtime implementation of mmap-based linear memories has changed
  to largely work in `usize` quantities in its API and in bytes instead
  of pages. This simplifies various aspects and reflects that
  mmap-memories are always bound by `usize` since that's what the host
  is using to address things, and additionally most calculations care
  about bytes rather than pages except for the very edge where we're
  going to/from wasm.

Overall I've tried to minimize the amount of `as` casts as possible,
using checked `try_from` and checked arithemtic with either error
handling or explicit `unwrap()` calls to tell us about bugs in the
future. Most locations have relatively obvious things to do with various
implications on various hosts, and I think they should all be roughly of
the right shape but time will tell. I mostly relied on the compiler
complaining that various types weren't aligned to figure out
type-casting, and I manually audited some of the more obvious locations.
I suspect we have a number of hidden locations that will panic on 32-bit
hosts if 64-bit modules try to run there, but otherwise I think we
should be generally ok (famous last words). In any case I wouldn't want
to enable this by default naturally until we've fuzzed it for some time.

In terms of the actual underlying implementation, no one should expect
memory64 to be all that fast. Right now it's implemented with
"dynamic" heaps which have a few consequences:

* All memory accesses are bounds-checked. I'm not sure how aggressively
  Cranelift tries to optimize out bounds checks, but I suspect not a ton
  since we haven't stressed this much historically.

* Heaps are always precisely sized. This means that every call to
  `memory.grow` will incur a `memcpy` of memory from the old heap to the
  new. We probably want to at least look into `mremap` on Linux and
  otherwise try to implement schemes where dynamic heaps have some
  reserved pages to grow into to help amortize the cost of
  `memory.grow`.

The memory64 spec test suite is scheduled to now run on CI, but as with
all the other spec test suites it's really not all that comprehensive.
I've tried adding more tests for basic things as I've had to implement
guards for them, but I wouldn't really consider the testing adequate
from just this PR itself. I did try to take care in one test to actually
allocate a 4gb+ heap and then avoid running that in the pooling
allocator or in emulation because otherwise that may fail or take
excessively long.

[proposal]: https://github.com/WebAssembly/memory64/blob/master/proposals/memory64/Overview.md

* Fix some tests

* More test fixes

* Fix wasmtime tests

* Fix doctests

* Revert to 32-bit immediate offsets in `heap_addr`

This commit updates the generation of addresses in wasm code to always
use 32-bit offsets for `heap_addr`, and if the calculated offset is
bigger than 32-bits we emit a manual add with an overflow check.

* Disable memory64 for spectest fuzzing

* Fix wrong offset being added to heap addr

* More comments!

* Clarify bytes/pages
2021-08-12 09:40:20 -05:00
Alex Crichton
7ce46043dc Add guard pages to the front of linear memories (#2977)
* Add guard pages to the front of linear memories

This commit implements a safety feature for Wasmtime to place guard
pages before the allocation of all linear memories. Guard pages placed
after linear memories are typically present for performance (at least)
because it can help elide bounds checks. Guard pages before a linear
memory, however, are never strictly needed for performance or features.
The intention of a preceding guard page is to help insulate against bugs
in Cranelift or other code generators, such as CVE-2021-32629.

This commit adds a `Config::guard_before_linear_memory` configuration
option, defaulting to `true`, which indicates whether guard pages should
be present both before linear memories as well as afterwards. Guard
regions continue to be controlled by
`{static,dynamic}_memory_guard_size` methods.

The implementation here affects both on-demand allocated memories as
well as the pooling allocator for memories. For on-demand memories this
adjusts the size of the allocation as well as adjusts the calculations
for the base pointer of the wasm memory. For the pooling allocator this
will place a singular extra guard region at the very start of the
allocation for memories. Since linear memories in the pooling allocator
are contiguous every memory already had a preceding guard region in
memory, it was just the previous memory's guard region afterwards. Only
the first memory needed this extra guard.

I've attempted to write some tests to help test all this, but this is
all somewhat tricky to test because the settings are pretty far away
from the actual behavior. I think, though, that the tests added here
should help cover various use cases and help us have confidence in
tweaking the various `Config` settings beyond their defaults.

Note that this also contains a semantic change where
`InstanceLimits::memory_reservation_size` has been removed. Instead this
field is now inferred from the `static_memory_maximum_size` and guard
size settings. This should hopefully remove some duplication in these
settings, canonicalizing on the guard-size/static-size settings as the
way to control memory sizes and virtual reservations.

* Update config docs

* Fix a typo

* Fix benchmark

* Fix wasmtime-runtime tests

* Fix some more tests

* Try to fix uffd failing test

* Review items

* Tweak 32-bit defaults

Makes the pooling allocator a bit more reasonable by default on 32-bit
with these settings.
2021-06-18 09:57:08 -05:00
Alex Crichton
7a1b7cdf92 Implement RFC 11: Redesigning Wasmtime's APIs (#2897)
Implement Wasmtime's new API as designed by RFC 11. This is quite a large commit which has had lots of discussion externally, so for more information it's best to read the RFC thread and the PR thread.
2021-06-03 09:10:53 -05:00
Peter Huene
f12b4c467c Add resource limiting to the Wasmtime API. (#2736)
* Add resource limiting to the Wasmtime API.

This commit adds a `ResourceLimiter` trait to the Wasmtime API.

When used in conjunction with `Store::new_with_limiter`, this can be used to
monitor and prevent WebAssembly code from growing linear memories and tables.

This is particularly useful when hosts need to take into account host resource
usage to determine if WebAssembly code can consume more resources.

A simple `StaticResourceLimiter` is also included with these changes that will
simply limit the size of linear memories or tables for all instances created in
the store based on static values.

* Code review feedback.

* Implemented `StoreLimits` and `StoreLimitsBuilder`.
* Moved `max_instances`, `max_memories`, `max_tables` out of `Config` and into
  `StoreLimits`.
* Moved storage of the limiter in the runtime into `Memory` and `Table`.
* Made `InstanceAllocationRequest` use a reference to the limiter.
* Updated docs.
* Made `ResourceLimiterProxy` generic to remove a level of indirection.
* Fixed the limiter not being used for `wasmtime::Memory` and
  `wasmtime::Table`.

* Code review feedback and bug fix.

* `Memory::new` now returns `Result<Self>` so that an error can be returned if
  the initial requested memory exceeds any limits placed on the store.

* Changed an `Arc` to `Rc` as the `Arc` wasn't necessary.

* Removed `Store` from the `ResourceLimiter` callbacks. Custom resource limiter
  implementations are free to capture any context they want, so no need to
  unnecessarily store a weak reference to `Store` from the proxy type.

* Fixed a bug in the pooling instance allocator where an instance would be
  leaked from the pool. Previously, this would only have happened if the OS was
  unable to make the necessary linear memory available for the instance. With
  these changes, however, the instance might not be created due to limits
  placed on the store. We now properly deallocate the instance on error.

* Added more tests, including one that covers the fix mentioned above.

* Code review feedback.

* Add another memory to `test_pooling_allocator_initial_limits_exceeded` to
  ensure a partially created instance is successfully deallocated.
* Update some doc comments for better documentation of `Store` and
  `ResourceLimiter`.
2021-04-19 09:19:20 -05:00
Peter Huene
b775b68cfb Make module information lookup from runtime safe.
This commit uses a two-phase lookup of stack map information from modules
rather than giving back raw pointers to stack maps.

First the runtime looks up information about a module from a pc value, which
returns an `Arc` it keeps a reference on while completing the stack map lookup.

Second it then queries the module information for the stack map from a pc
value, getting a reference to the stack map (which is now safe because of the
`Arc` held by the runtime).
2021-04-16 12:30:10 -07:00
Peter Huene
ea72c621f0 Remove the stack map registry.
This commit removes the stack map registry and instead uses the existing
information from the store's module registry to lookup stack maps.

A trait is now used to pass the lookup context to the runtime, implemented by
`Store` to do the lookup.

With this change, module registration in `Store` is now entirely limited to
inserting the module into the module registry.
2021-04-16 11:08:21 -07:00
Alex Crichton
18dd82ba7d Improve signature lookup happening during instantiation (#2818)
This commit is intended to be a perf improvement for instantiation of
modules with lots of functions. Previously the `lookup_shared_signature`
callback was showing up quite high in profiles as part of instantiation.

As some background, this callback is used to translate from a module's
`SignatureIndex` to a `VMSharedSignatureIndex` which the instance
stores. This callback is called for two reasons, one is to translate all
of the module's own types into `VMSharedSignatureIndex` for the purposes
of `call_indirect` (the translation of that loads from this table to
compare indices). The second reason is that a `VMCallerCheckedAnyfunc`
is prepared for all functions and this embeds a `VMSharedSignatureIndex`
inside of it.

The slow part today is that the lookup callback was called
once-per-function and each lookup involved hashing a full
`WasmFuncType`. Albeit our hash algorithm is still Rust's default
SipHash algorithm which is quite slow, but we also shouldn't need to
re-hash each signature if we see it multiple times anyway.

The fix applied in this commit is to change this lookup callback to an
`enum` where one variant is that there's a table to lookup from. This
table is a `PrimaryMap` which means that lookup is quite fast. The only
thing we need to do is to prepare the table ahead of time. Currently
this happens on the instantiation path because in my measurments the
creation of the table is quite fast compared to the rest of
instantiation. If this becomes an issue, though, we can look into
creating the table as part of `SigRegistry::register_module` and caching
it somewhere (I'm not entirely sure where but I'm sure we can figure it
out).

There's in generally not a ton of efficiency around the `SigRegistry`
type. I'm hoping though that this fixes the next-lowest-hanging-fruit in
terms of performance without complicating the implementation too much. I
tried a few variants and this change seemed like the best balance
between simplicity and still a nice performance gain.

Locally I measured an improvement in instantiation time for a large-ish
module by reducing the time from ~3ms to ~2.6ms per instance.
2021-04-08 15:04:18 -05:00
Peter Huene
f8f51afac1 Split out fiber stacks from fibers.
This commit splits out a `FiberStack` from `Fiber`, allowing the instance
allocator trait to return `FiberStack` rather than raw stack pointers. This
keeps the stack creation mostly in `wasmtime_fiber`, but now the on-demand
instance allocator can make use of it.

The instance allocators no longer have to return a "not supported" error to
indicate that the store should allocate its own fiber stack.

This includes a bunch of cleanup in the instance allocator to scope stacks to
the new "async" feature in the runtime.

Closes #2708.
2021-03-18 20:21:02 -07:00
Peter Huene
5fa0f8d469 Move linear memory faulted guard page tracking into Memory.
This commit moves the tracking for faulted guard pages in a linear memory into
`Memory`.
2021-03-08 11:27:25 -08:00
Peter Huene
7a93132ffa Code review feedback.
* Improve comments.
* Drop old table element *after* updating the table.
* Extract out the same `cfg_if!` to a single constant.
2021-03-08 09:04:13 -08:00
Peter Huene
ff840b3d3b More PR feedback changes.
* More use of `anyhow`.
* Change `make_accessible` into `protect_linear_memory` to better demonstrate
  what it is used for; this will make the uffd implementation make a little
  more sense.
* Remove `create_memory_map` in favor of just creating the `Mmap` instances in
  the pooling allocator. This also removes the need for `MAP_NORESERVE` in the
  uffd implementation.
* Moar comments.
* Remove `BasePointerIterator` in favor of `impl Iterator`.
* The uffd implementation now only monitors linear memory pages and will only
  receive faults on pages that could potentially be accessible and never on a
  statically known guard page.
* Stop allocating memory or table pools if the maximum limit of the memory or
  table is 0.
2021-03-04 20:14:40 -08:00
Peter Huene
a464465e2f Code review feedback changes.
* Add `anyhow` dependency to `wasmtime-runtime`.
* Revert `get_data` back to `fn`.
* Remove `DataInitializer` and box the data in `Module` translation instead.
* Improve comments on `MemoryInitialization`.
* Remove `MemoryInitialization::OutOfBounds` in favor of proper bulk memory
  semantics.
* Use segmented memory initialization except for when the uffd feature is
  enabled on Linux.
* Validate modules with the allocator after translation.
* Updated various functions in the runtime to return `anyhow::Result`.
* Use a slice when copying pages instead of `ptr::copy_nonoverlapping`.
* Remove unnecessary casts in `OnDemandAllocator::deallocate`.
* Better document the `uffd` feature.
* Use WebAssembly page-sized pages in the paged initialization.
* Remove the stack pool from the uffd handler and simply protect just the guard
  pages.
2021-03-04 18:19:46 -08:00
Peter Huene
505437e353 Code cleanup.
Last minute code clean up to fix some comments and rename `address_space_size`
to `memory_reservation_size` to better describe what the option is doing.
2021-03-04 18:19:46 -08:00
Peter Huene
f5c4d87c45 Implement on-demand memory initialization for the uffd feature.
This commit implements copying paged initialization data upon a fault of a
linear memory page.

If the initialization data is "paged", then the appropriate pages are copied
into the Wasm page (or zeroed if the page is not present in the
initialization data).

If the initialization data is not "paged", the Wasm page is zeroed so that
module instantiation can initialize the pages.
2021-03-04 18:19:45 -08:00
Peter Huene
a2c439117a Implement user fault handling with userfaultfd on Linux.
This commit implements the `uffd` feature which turns on support for utilizing
the `userfaultfd` system call on Linux for the pooling instance allocator.

By handling page faults in userland, we are able to detect guard page accesses
without having to constantly change memory page protections.

This should help reduce the number of syscalls as well as kernel lock
contentions when many threads are allocating and deallocating instances.

Additionally, the user fault handler can lazy initialize linear
memories of an instance (implementation to come).
2021-03-04 18:18:52 -08:00