Commit Graph

23 Commits

Peter Huene
ea72c621f0 Remove the stack map registry.
This commit removes the stack map registry and instead uses the existing
information from the store's module registry to look up stack maps.

A trait is now used to pass the lookup context to the runtime, implemented by
`Store` to do the lookup.

With this change, module registration in `Store` is now entirely limited to
inserting the module into the module registry.
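
As a rough sketch of the shape of that trait (type names here are
stand-ins, not the exact definitions in the runtime crate):

```rust
use std::sync::Arc;

// Hypothetical stand-in for the per-module stack map / trap information
// kept by the store's module registry.
pub struct ModuleInfo;

/// Sketch of the lookup-context trait: the runtime asks the embedder
/// (implemented by `Store`) to resolve a program counter to the module
/// whose registry entry holds the relevant stack maps.
pub trait ModuleInfoLookup {
    fn lookup(&self, pc: usize) -> Option<Arc<ModuleInfo>>;
}
```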
2021-04-16 11:08:21 -07:00
Peter Huene
a2466b3c23 Move the signature registry into Engine.
This commit moves the shared signature registry out of `Store` and into
`Engine`.

This helps eliminate work that was performed whenever a `Module` was
instantiated into a `Store`.

Now a `Module` is registered with the shared signature registry upon creation,
storing the mapping from the module's signature index space to the shared index
space.

This also refactors the "frame info" registry into a general purpose "module
registry" that is used to look up trap information, signature information, and
(soon) stack map information.
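
As a hedged sketch of the registration step (type names and internals are
illustrative only, not the exact `wasmtime` definitions):

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real index and type definitions.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct SignatureIndex(u32);
#[derive(Clone, Copy)]
struct VMSharedSignatureIndex(u32);
#[derive(Clone, PartialEq, Eq, Hash)]
struct WasmFuncType(String); // params/results elided

/// Engine-wide registry: each distinct function type gets one shared
/// index, deduplicated across every module in the `Engine`.
#[derive(Default)]
struct SignatureRegistry {
    shared: HashMap<WasmFuncType, VMSharedSignatureIndex>,
}

impl SignatureRegistry {
    /// Called once at `Module` creation: returns the mapping from the
    /// module's own signature index space to the shared index space.
    fn register_module(
        &mut self,
        signatures: &[(SignatureIndex, WasmFuncType)],
    ) -> HashMap<SignatureIndex, VMSharedSignatureIndex> {
        signatures
            .iter()
            .map(|(idx, ty)| {
                let next = VMSharedSignatureIndex(self.shared.len() as u32);
                let shared = *self.shared.entry(ty.clone()).or_insert(next);
                (*idx, shared)
            })
            .collect()
    }
}
```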
2021-04-16 11:06:44 -07:00
Nick Fitzgerald
2a32567871 Merge pull request #2821 from alexcrichton/faster-vmoffsets
Precompute fields in `VMOffsets`
2021-04-08 14:17:11 -07:00
Alex Crichton
18dd82ba7d Improve signature lookup happening during instantiation (#2818)
This commit is intended to be a perf improvement for instantiation of
modules with lots of functions. Previously the `lookup_shared_signature`
callback was showing up quite high in profiles as part of instantiation.

As some background, this callback is used to translate from a module's
`SignatureIndex` to a `VMSharedSignatureIndex` which the instance
stores. This callback is called for two reasons: one is to translate all
of the module's own types into `VMSharedSignatureIndex` for the purposes
of `call_indirect` (the translation of that loads from this table to
compare indices). The second reason is that a `VMCallerCheckedAnyfunc`
is prepared for all functions and this embeds a `VMSharedSignatureIndex`
inside of it.

The slow part today is that the lookup callback was called
once per function and each lookup involved hashing a full
`WasmFuncType`. Our hash algorithm is still Rust's default SipHash,
which is quite slow, but we also shouldn't need to re-hash each
signature every time we see it anyway.

The fix applied in this commit is to change this lookup callback to an
`enum` where one variant is that there's a table to lookup from. This
table is a `PrimaryMap` which means that lookup is quite fast. The only
thing we need to do is to prepare the table ahead of time. Currently
this happens on the instantiation path because in my measurements the
creation of the table is quite fast compared to the rest of
instantiation. If this becomes an issue, though, we can look into
creating the table as part of `SigRegistry::register_module` and caching
it somewhere (I'm not entirely sure where but I'm sure we can figure it
out).
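
Roughly, the lookup callback becomes something like the following (a
sketch only; a plain slice stands in for the `PrimaryMap`, and the names
are not the exact ones used in the code):

```rust
#[derive(Clone, Copy)]
struct SignatureIndex(u32);
#[derive(Clone, Copy)]
struct VMSharedSignatureIndex(u32);

/// Two lookup strategies: a precomputed table (fast indexed load) or the
/// old callback-style lookup (hashing the full `WasmFuncType`).
enum SharedSignatures<'a> {
    Table(&'a [VMSharedSignatureIndex]),
    Lookup(&'a dyn Fn(SignatureIndex) -> VMSharedSignatureIndex),
}

impl SharedSignatures<'_> {
    fn lookup(&self, index: SignatureIndex) -> VMSharedSignatureIndex {
        match self {
            SharedSignatures::Table(table) => table[index.0 as usize],
            SharedSignatures::Lookup(f) => f(index),
        }
    }
}
```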

In general there's not a ton of efficiency around the `SigRegistry`
type. I'm hoping, though, that this fixes the next-lowest-hanging fruit
in terms of performance without complicating the implementation too much.
I tried a few variants and this change seemed like the best balance
between simplicity and a nice performance gain.

Locally I measured an improvement in instantiation time for a large-ish
module by reducing the time from ~3ms to ~2.6ms per instance.
2021-04-08 15:04:18 -05:00
Alex Crichton
c91e14d83f Precompute fields in VMOffsets
This commit updates the implementation of `VMOffsets` to frontload all
checked arithmetic on construction of the `VMOffsets` which allows
eliding all checked arithmetic when accessing the fields of `VMOffsets`.
For testing and such, this also adds a new constructor from a new
`VMOffsetsFields` structure, which is a clone of the old definition.
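
A minimal sketch of the idea, with illustrative field names and sizes
rather than the real ones:

```rust
struct VMOffsetsFields {
    num_defined_globals: u32,
    num_defined_memories: u32,
}

struct VMOffsets {
    globals_begin: u32,
    memories_begin: u32,
    size: u32,
}

impl VMOffsets {
    /// All checked arithmetic happens once, at construction time.
    fn new(fields: VMOffsetsFields) -> VMOffsets {
        let globals_begin = 0u32;
        let memories_begin = globals_begin
            .checked_add(fields.num_defined_globals.checked_mul(16).unwrap())
            .unwrap();
        let size = memories_begin
            .checked_add(fields.num_defined_memories.checked_mul(8).unwrap())
            .unwrap();
        VMOffsets { globals_begin, memories_begin, size }
    }

    /// Accessors become trivial, unchecked field reads.
    #[inline]
    fn vmctx_globals_begin(&self) -> u32 {
        self.globals_begin
    }
}
```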

This should help speed up some profile hot spots I've been seeing, where
all the checked arithmetic on field sizes was slowing down the various
accessors during instantiation (which uses `VMOffsets` to initialize
various fields of the `VMContext`).
2021-04-08 12:46:17 -07:00
Alex Crichton
d4b54ee0a8 More optimizations for calling into WebAssembly (#2759)
* Combine stack-based cleanups for faster wasm calls

This commit is an extension of #2757 where the goal is to optimize entry
into WebAssembly. Currently wasmtime has two stack-based cleanups when
entering wasm, one for the externref activation table and another for
stack limits getting reset. This commit fuses these two cleanups
together into one and moves some code around, which means fewer closures
and fewer captures, speeding up calls into wasm a bit more.
Overall this drops the execution time from 88ns to 80ns locally for me.

This also updates the atomic orderings when updating the stack limit
from `SeqCst` to `Relaxed`. While `SeqCst` is a reasonable starting
point, the usage here should be safe with `Relaxed` since we're not
using the atomics to actually protect any memory; they simply receive
signals from other threads.
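
A sketch of the fused cleanup as a single RAII guard (hypothetical names;
the real code lives in the call paths of the `wasmtime` crate):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// One guard covers both pieces of per-call cleanup instead of two nested
/// closures. `Relaxed` is used because the atomic only carries a signal
/// between threads; it doesn't protect any memory.
struct EntryGuard<'a> {
    stack_limit: &'a AtomicUsize,
    previous: usize,
}

impl<'a> EntryGuard<'a> {
    fn enter(stack_limit: &'a AtomicUsize, new_limit: usize) -> Self {
        let previous = stack_limit.swap(new_limit, Ordering::Relaxed);
        EntryGuard { stack_limit, previous }
    }
}

impl Drop for EntryGuard<'_> {
    fn drop(&mut self) {
        // Reset the stack limit (and, in the real code, the externref
        // activation table state) in one place on exit from wasm.
        self.stack_limit.store(self.previous, Ordering::Relaxed);
    }
}
```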

* Determine whether a pc is wasm via a global map

The macOS implementation of traps recently changed to using mach ports
for handlers instead of signal handlers. This means that a previously
relied-upon invariant, that each thread fixes its own trap, was broken.
The macOS implementation worked around this by maintaining a global map
from thread id to thread-local information.

This global map is quite slow though. It involves taking a lock and
updating a hash map on all calls into WebAssembly. In my local testing
this accounts for >70% of the overhead of calling into WebAssembly on
macOS. Naturally it'd be great to remove this!

This commit fixes this issue and removes the global lock/map that is
updated on all calls into WebAssembly. The fix is to maintain a global
map of wasm modules and their trap addresses in the `wasmtime` crate.
Doing so is relatively simple since we're already tracking this
information at the `Store` level.
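
A hedged sketch of what such a global map can look like (names and
locking strategy are assumptions for illustration; the real registry also
carries the modules' trap information):

```rust
use std::collections::BTreeMap;
use std::sync::{OnceLock, RwLock};

/// Global map from the end address of each module's compiled code region
/// to its start address, populated when a module is registered with a
/// `Store` rather than on every call into wasm.
fn code_registry() -> &'static RwLock<BTreeMap<usize, usize>> {
    static REGISTRY: OnceLock<RwLock<BTreeMap<usize, usize>>> = OnceLock::new();
    REGISTRY.get_or_init(|| RwLock::new(BTreeMap::new()))
}

fn register_module_code(start: usize, end: usize) {
    code_registry().write().unwrap().insert(end, start);
}

/// Usable from any thread, e.g. the mach-port handler thread on macOS.
fn is_wasm_pc(pc: usize) -> bool {
    code_registry()
        .read()
        .unwrap()
        .range(pc..)
        .next()
        .map_or(false, |(_end, start)| *start <= pc)
}
```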

Once we've got a global map then the macOS implementation can use this
from a foreign thread and everything works out.

Locally this brings the overhead, on macOS specifically, of calling into
wasm from 80ns to ~20ns.

* Fix compiles

* Review comments
2021-03-24 11:41:33 -05:00
Alex Crichton
c95971ab59 Optimize calling a WebAssembly function (#2757)
This commit implements a few optimizations, mainly inlining, that should
improve the performance of calling a WebAssembly function. This code
path can be quite hot depending on the embedding case and we hadn't
really put much effort into optimizing the nitty gritty.

The predominant optimization here is adding `#[inline]` to trivial
functions so performance is improved without having to compile with LTO.
Another optimization is to call `lazy_per_thread_init` when traps are
initialized per-thread (when a `Store` is created) rather than each time
a function is called. The next optimization is to change the unwind
reason in the `CallThreadState` to `MaybeUninit` to avoid extra checks
in the default case about whether we need to drop its variants (since in
the happy path we never need to drop it). The final optimization is to
optimize out a few checks when `async` support is disabled for a small
speed boost.
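
The `MaybeUninit` change looks roughly like this (a simplified sketch,
not the actual `CallThreadState` definition):

```rust
use std::mem::MaybeUninit;

enum UnwindReason {
    Trap(String),
    Panic(Box<dyn std::any::Any + Send>),
}

struct CallThreadState {
    // Only written when an unwind actually happens, so the happy path
    // never constructs or drops an `UnwindReason`.
    unwind: MaybeUninit<UnwindReason>,
}

impl CallThreadState {
    #[inline]
    fn new() -> Self {
        CallThreadState { unwind: MaybeUninit::uninit() }
    }

    fn record_unwind(&mut self, reason: UnwindReason) {
        // The unwinding path later reads this back with `assume_init`.
        self.unwind.write(reason);
    }
}
```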

In a small benchmark where wasmtime calls a simple wasm function my
macOS computer dropped from 110ns to 86ns overhead, a 20% decrease. The
macOS overhead is still largely dominated by the global lock acquisition
and hash table management for traps right now, but I suspect the Linux
overhead is much better (should be on the order of ~30 or so ns).

We still have a long way to go to compete with SpiderMonkey which, in
testing, seems to have ~6ns of overhead calling the same wasm function on
my computer.
2021-03-23 15:22:37 -05:00
Peter Huene
1a0493946d Make the storage of wasmtime_runtime::Table consistent.
This change makes the storage of `Table` more internally consistent.

Elements are stored as raw pointers for both static and dynamic table storage.

Explicitly storing elements as pointers removes assumptions being made by the
pooling allocator in terms of the size and default representation of the
elements.

However, care must be taken to properly clone externrefs for table operations.
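
A rough sketch of the consistent storage (simplified; the real type
distinguishes funcref and externref tables and manages externref clones
explicitly):

```rust
/// Elements are raw pointers in both cases: `*mut VMCallerCheckedAnyfunc`
/// for funcref tables, or the raw `VMExternRef` pointer for externref
/// tables, which must be cloned/dropped manually on reads and writes.
enum TableElements {
    /// Dynamic storage owned by the table.
    Dynamic(Vec<*mut u8>),
    /// Static storage handed out by the pooling allocator.
    Static(&'static mut [*mut u8]),
}

impl TableElements {
    fn as_slice(&self) -> &[*mut u8] {
        match self {
            TableElements::Dynamic(v) => v.as_slice(),
            TableElements::Static(s) => &**s,
        }
    }
}
```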
2021-03-05 18:36:14 -08:00
Alex Crichton
0e41861662 Implement limiting WebAssembly execution with fuel (#2611)
* Consume fuel during function execution

This commit adds codegen infrastructure necessary to instrument wasm
code to consume fuel as it executes. Currently nothing is really done
with the fuel, but that'll come in later commits.

The focus of this commit is to implement the codegen infrastructure
necessary to consume fuel and account for fuel consumed correctly.

* Periodically check remaining fuel in wasm JIT code

This commit enables wasm code to periodically check whether fuel has
run out. When fuel runs out, an intrinsic is called which can do whatever
it needs to do in response. For now a trap is thrown to have at least
some semantics in synchronous stores, but another planned use for this
feature is for asynchronous stores to periodically yield back to the
host when fuel runs out.

Checks for remaining fuel happen in the same locations as interrupt
checks, which is to say the start of the function as well as loop
headers.
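
The logic of the emitted check, written out in Rust as a sketch (the
representation is an assumption for illustration; the real implementation
emits equivalent CLIF over a fuel value reached through the vmctx):

```rust
struct VMInterrupts {
    // Illustrative convention: fuel consumption counts upward toward
    // zero, with injected fuel subtracted from this value.
    fuel_consumed: i64,
}

/// Emitted at function entry and at loop headers, alongside the existing
/// interrupt checks.
fn fuel_check(interrupts: &mut VMInterrupts, block_cost: i64, out_of_fuel: &mut dyn FnMut()) {
    interrupts.fuel_consumed += block_cost;
    if interrupts.fuel_consumed >= 0 {
        // The real code calls an intrinsic: synchronous stores trap, and
        // async stores may instead yield back to the host.
        out_of_fuel();
    }
}
```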

* Improve codegen by caching `*const VMInterrupts`

The location of the shared interrupt value and fuel value is through a
double-indirection on the vmctx (load through the vmctx and then load
through that pointer). The second pointer in this chain, however, never
changes, so we can alter codegen to account for this, removing some
extraneous load instructions and maybe even reducing some register
pressure.

* Add tests that fuel can abort infinite loops

* More fuzzing with fuel

Use fuel, in addition to wall-clock time, to time out modules, using the
fuzz input to decide which mechanism to use.

* Update docs on trapping instructions

* Fix doc links

* Fix a fuzz test

* Change setting fuel to adding fuel

* Fix a doc link

* Squelch some rustdoc warnings
2021-01-29 08:57:17 -06:00
Alex Crichton
068340d30f Fix a case of using the wrong stack map during gcs (#2396)
This commit fixes an issue where, when looking up the stack map for a pc
within a function, we might end up reading the *previous* function's
stack maps. This then later caused asserts to trip because we started
interpreting random data as a `VMExternRef` when it wasn't. The fix was
to add `None` markers for "this range has no stack map" in the function
ranges map.
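
As a sketch of the fixed lookup (simplified types; the actual structures
live in the frame-info code):

```rust
use std::collections::BTreeMap;

struct StackMap; // stand-in for the real stack map data

/// Keyed by each function's end address; ranges with no stack maps hold
/// an explicit `None` so a lookup can't fall through to the previous
/// function's maps.
type FunctionRanges = BTreeMap<usize, (usize, Option<Vec<StackMap>>)>;

fn stack_maps_for_pc(ranges: &FunctionRanges, pc: usize) -> Option<&Vec<StackMap>> {
    let (end, (start, maps)) = ranges.range(pc..).next()?;
    if *start <= pc && pc < *end {
        maps.as_ref()
    } else {
        None
    }
}
```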

Closes #2386
2020-11-12 13:24:00 -06:00
Nick Fitzgerald
05bf9ea3f3 Rename "Stackmap" to "StackMap"
And "stackmap" to "stack_map".

This commit is purely mechanical.
2020-08-07 10:08:44 -07:00
Nick Fitzgerald
98e899f6b3 fuzz: Add a fuzz target for table.{get,set} operations
This new fuzz target exercises sequences of `table.get`s, `table.set`s, and
GCs.

It already found a couple bugs:

* Some leaks due to ref count cycles between stores and host-defined functions
  closing over those stores.

* If there are no live references for a PC, Cranelift can avoid emitting an
  associated stack map. This was running afoul of a debug assertion.
2020-06-30 12:00:57 -07:00
Nick Fitzgerald
8c5f59c0cf wasmtime: Implement table.get and table.set
These instructions have fast, inline JIT paths for the common cases, and only
call out to host VM functions for the slow paths. This required some changes to
`cranelift-wasm`'s `FuncEnvironment`: instead of taking a `FuncCursor` to insert
an instruction sequence within the current basic block,
`FuncEnvironment::translate_table_{get,set}` now take a `&mut FunctionBuilder`
so that they can create whole new basic blocks. This is necessary for
implementing GC read/write barriers that involve branching (e.g. checking for
null, or whether a store buffer is at capacity).
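
Roughly the shape of that change, with simplified stand-in types rather
than the real Cranelift ones:

```rust
// Hypothetical, simplified stand-ins; the real trait lives in
// `cranelift-wasm` and uses Cranelift's own types.
struct FuncCursor;      // append-only position within one basic block
struct FunctionBuilder; // can create new blocks and branch between them
struct Value;

trait FuncEnvironment {
    // Before: a cursor restricted the hook to straight-line code in the
    // current block.
    fn translate_table_get_old(&mut self, pos: &mut FuncCursor, index: Value) -> Value;

    // After: the full builder lets the hook emit branching GC barriers
    // (null checks, store-buffer capacity checks, ...).
    fn translate_table_get(&mut self, builder: &mut FunctionBuilder, index: Value) -> Value;
}
```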

Furthermore, it required that the `load`, `load_complex`, and `store`
instructions handle loading and storing through an `r{32,64}` rather than just
`i{32,64}` addresses. This involved making `r{32,64}` types acceptable
instantiations of the `iAddr` type variable, plus a few new instruction
encodings.

Part of #929
2020-06-30 12:00:57 -07:00
Alex Crichton
0acd2072c2 Fix doc warnings and link failures (#1948)
Also add configuration to CI to fail doc generation if any links are
broken. Unfortunately we can't blanket deny all warnings in rustdoc
since some are unconditional warnings, but for now this is hopefully
good enough.

Closes #1947
2020-06-30 13:01:49 -05:00
Nick Fitzgerald
58bb5dd953 wasmtime: Add support for func.ref and table.grow with funcrefs
`funcref`s are implemented as `NonNull<VMCallerCheckedAnyfunc>`.

This should be more efficient than using a `VMExternRef` that points at a
`VMCallerCheckedAnyfunc` because it gets rid of an indirection, dynamic
allocation, and some reference counting.

Note that the null function reference is *NOT* a null pointer; it is a
`VMCallerCheckedAnyfunc` that has a null `func_ptr` member.
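
In other words (a simplified sketch of the representation, with most
fields omitted):

```rust
use std::ptr::NonNull;

#[repr(C)]
struct VMCallerCheckedAnyfunc {
    func_ptr: *const u8, // null here means "the null funcref"
    // type index, vmctx pointer, ... omitted in this sketch
}

/// A funcref is a non-null pointer to an anyfunc...
type FuncRef = NonNull<VMCallerCheckedAnyfunc>;

/// ...and the null funcref is an anyfunc whose `func_ptr` is null, not a
/// null `FuncRef`.
fn is_null_funcref(f: FuncRef) -> bool {
    unsafe { f.as_ref().func_ptr.is_null() }
}
```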

Part of #929
2020-06-24 10:08:13 -07:00
Nick Fitzgerald
7e167cae10 externref: Address review feedback 2020-06-15 15:39:26 -07:00
Nick Fitzgerald
618c278e41 externref: implement a canary for GC stack walking
This allows us to detect when stack walking has failed to walk the whole
stack and we are potentially missing on-stack roots. In that case it would
be unsafe to do a GC, because we could free objects too early, leading to
use-after-free, so when we detect this scenario we skip the GC.
2020-06-15 09:39:37 -07:00
Nick Fitzgerald
f30ce1fe97 externref: implement stack map-based garbage collection
For host VM code, we use plain reference counting, where cloning increments
the reference count, and dropping decrements it. We can avoid many of the
on-stack increment/decrement operations that typically plague the
performance of reference counting via Rust's ownership and borrowing system.
Moving a `VMExternRef` avoids mutating its reference count, and borrowing it
either avoids the reference count increment or delays it until if/when the
`VMExternRef` is cloned.

When passing a `VMExternRef` into compiled Wasm code, we don't want to do
reference count mutations for every compiled `local.{get,set}`, nor for
every function call. Therefore, we use a variation of **deferred reference
counting**, where we only mutate reference counts when storing
`VMExternRef`s somewhere that outlives the activation: into a global or
table. Simultaneously, we over-approximate the set of `VMExternRef`s that
are inside Wasm function activations. Periodically, we walk the stack at GC
safe points, and use stack map information to precisely identify the set of
`VMExternRef`s inside Wasm activations. Then we take the difference between
this precise set and our over-approximation, and decrement the reference
count for each of the `VMExternRef`s that are in our over-approximation but
not in the precise set. Finally, the over-approximation is replaced with the
precise set.
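
The core of one GC step, as a hedged sketch (raw addresses stand in for
`VMExternRef` pointers; the real code works over the activations table's
internal storage):

```rust
use std::collections::HashSet;

/// `over_approximation` is the activations table's current contents;
/// `precise` is the set found by walking the stack maps at a GC safe
/// point; `dec_ref` performs the deferred reference-count decrement.
fn gc_step(
    over_approximation: &mut HashSet<usize>,
    precise: HashSet<usize>,
    dec_ref: &mut dyn FnMut(usize),
) {
    // Everything we conservatively assumed was live on the stack, but
    // which the precise set does not contain, gets decremented now.
    for extern_ref in over_approximation.difference(&precise) {
        dec_ref(*extern_ref);
    }
    // Replace the over-approximation with the precise set.
    *over_approximation = precise;
}
```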

The `VMExternRefActivationsTable` implements the over-approximated set of
`VMExternRef`s referenced by Wasm activations. Calling a Wasm function and
passing it a `VMExternRef` moves the `VMExternRef` into the table, and the
compiled Wasm function logically "borrows" the `VMExternRef` from the
table. Similarly, `global.get` and `table.get` operations clone the gotten
`VMExternRef` into the `VMExternRefActivationsTable` and then "borrow" the
reference out of the table.

When a `VMExternRef` is returned to host code from a Wasm function, the host
increments the reference count (because the reference is logically
"borrowed" from the `VMExternRefActivationsTable` and the reference count
from the table will be dropped at the next GC).

For more general information on deferred reference counting, see *An
Examination of Deferred Reference Counting and Cycle Detection* by Quinane:
https://openresearch-repository.anu.edu.au/bitstream/1885/42030/2/hon-thesis.pdf

cc #929

Fixes #1804
2020-06-15 09:39:37 -07:00
Nick Fitzgerald
b78eafcfd3 externref: Do not impl common traits, have free functions instead
If you aren't expecting `VMExternRef`'s pointer-equality semantics, then these
trait implementations can be foot guns. Instead of implementing the trait, make
free functions in the `VMExternRef` namespace. This way, callers have to be a
little more explicit.
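
For example, something along these lines (a sketch, not the exact
function set):

```rust
struct VMExternRef(*const u8); // stand-in for the real type

impl VMExternRef {
    /// An associated function rather than a `PartialEq` impl: callers
    /// must spell out that they want pointer-identity comparison.
    pub fn eq(a: &VMExternRef, b: &VMExternRef) -> bool {
        std::ptr::eq(a.0, b.0)
    }
}
```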
2020-06-01 15:09:51 -07:00
Nick Fitzgerald
98da9ee8a9 Use std::alloc::handle_alloc_error instead of home-rolled version 2020-06-01 15:09:51 -07:00
Nick Fitzgerald
25548d7fbe externref: Share more Layout-computing code 2020-06-01 15:09:51 -07:00
Nick Fitzgerald
a8ee0554a9 wasmtime: Initial, partial support for externref
This is enough to get an `externref -> externref` identity function
passing.

However, `externref`s that are dropped by compiled Wasm code are (safely)
leaked. Follow up work will leverage cranelift's stack maps to resolve this
issue.
2020-06-01 15:09:51 -07:00
Nick Fitzgerald
1898d52966 runtime: Introduce VMExternRef and VMExternData
`VMExternRef` is a reference-counted box for any kind of data that is
external and opaque to running Wasm. Sometimes it might hold a Wasmtime
thing, other times it might hold something from a Wasmtime embedder and is
opaque even to us. It is morally equivalent to `Rc<dyn Any>` in Rust, but
additionally always fits in a pointer-sized word. `VMExternRef` is
non-nullable, and `Option<VMExternRef>` uses a null pointer to represent
`None`.

The one part of `VMExternRef` that can't ever be opaque to us is the
reference count. Even when we don't know what's inside a `VMExternRef`, we
need to be able to manipulate its reference count as we add and remove
references to it. And we need to do this from compiled Wasm code, so it must
be `repr(C)`!

`VMExternRef` itself is just a pointer to a `VMExternData`, which holds the
opaque, boxed value, its reference count, and its vtable pointer.

The `VMExternData` struct is *preceded* by the dynamically-sized value boxed
up and referenced by one or more `VMExternRef`s:

```ignore
     ,-------------------------------------------------------.
     |                                                       |
     V                                                       |
    +----------------------------+-----------+-----------+   |
    | dynamically-sized value... | ref_count | value_ptr |---'
    +----------------------------+-----------+-----------+
                                 | VMExternData          |
                                 +-----------------------+
                                  ^
+-------------+                   |
| VMExternRef |-------------------+
+-------------+                   |
                                  |
+-------------+                   |
| VMExternRef |-------------------+
+-------------+                   |
                                  |
  ...                            ===
                                  |
+-------------+                   |
| VMExternRef |-------------------'
+-------------+
```

The `value_ptr` member always points backwards to the start of the
dynamically-sized value (which is also the start of the heap allocation for
this value-and-`VMExternData` pair). Because it is a `dyn` pointer, it is
fat, and also points to the value's `Any` vtable.

The boxed value and the `VMExternData` footer are held in a single heap
allocation. The layout described above is used to make satisfying the
value's alignment easy: we just need to ensure that the heap allocation used
to hold everything satisfies its alignment. It also ensures that we don't
need a ton of excess padding between the `VMExternData` and the value for
values with large alignment.
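
A hedged sketch of these shapes in Rust; field order and exact types are
illustrative, not the precise wasmtime-runtime definitions:

```rust
use std::any::Any;
use std::cell::Cell;
use std::ptr::NonNull;

#[repr(C)]
struct VMExternData {
    ref_count: Cell<usize>,
    /// Fat pointer back to the start of the allocation (the boxed value
    /// itself), also carrying the value's `Any` vtable.
    value_ptr: NonNull<dyn Any>,
}

/// A `VMExternRef` is just a non-null pointer to the `VMExternData`
/// footer; `Option<VMExternRef>` uses the null pointer for `None`.
#[repr(transparent)]
struct VMExternRef(NonNull<VMExternData>);
```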
2020-06-01 14:53:10 -07:00