Implement lazy funcref table and anyfunc initialization. (#3733)
During instance initialization, we build two sorts of arrays eagerly:
- We create an "anyfunc" (a `VMCallerCheckedAnyfunc`) for every function
in an instance.
- For every funcref table element that has an initializer, we store a
pointer to one of these anyfuncs.
Most instances will not touch (via call_indirect or table.get) all
funcref table elements. And most anyfuncs will never be referenced,
because most functions are never placed in tables or used with
`ref.func`. Thus, both of these initialization tasks are quite wasteful.
Profiling shows that a significant fraction of the remaining
instance-initialization time after our other recent optimizations is
going into these two tasks.
This PR implements two basic ideas:
- The anyfunc array can be lazily initialized as long as we retain the
information needed to do so. For now, in this PR, we just recreate the
anyfunc whenever a pointer is taken to it, because doing so is fast
enough; in the future we could keep some state to know whether the
anyfunc has been written yet and skip this work if redundant.
This technique allows us to leave the anyfunc array as uninitialized
memory, which can be a significant savings. Filling it with
initialized anyfuncs is very expensive, but even zeroing it is
expensive: e.g. in a large module, it can be >500KB.
- A funcref table can be lazily initialized as long as we retain a link
to its corresponding instance and function index for each element. A
zero in a table element means "uninitialized", and a slowpath does the
initialization.
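The first idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Anyfunc` is a simplified stand-in for `VMCallerCheckedAnyfunc`, and `LazyAnyfuncs` and `type_of_func` are hypothetical names for "the information needed" to rebuild a slot. The slot array is left uninitialized, and a slot is (re)written each time a pointer to it is requested.

```rust
use std::mem::MaybeUninit;

/// Hypothetical stand-in for `VMCallerCheckedAnyfunc`; the real fields differ.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Anyfunc {
    func_index: u32,
    type_index: u32,
}

/// An uninitialized slot array plus the metadata needed to build any slot.
struct LazyAnyfuncs {
    slots: Vec<MaybeUninit<Anyfunc>>,
    /// Assumed per-function metadata retained from compilation.
    type_of_func: Vec<u32>,
}

impl LazyAnyfuncs {
    fn new(type_of_func: Vec<u32>) -> Self {
        let n = type_of_func.len();
        let mut slots: Vec<MaybeUninit<Anyfunc>> = Vec::with_capacity(n);
        // SAFETY: `MaybeUninit<T>` needs no initialization and capacity >= n,
        // so no element is written or zeroed here.
        unsafe { slots.set_len(n) };
        Self { slots, type_of_func }
    }

    /// Writes the anyfunc for `index` and returns a pointer to it. The slot
    /// is rewritten on every call: redundant, but cheap.
    fn get(&mut self, index: usize) -> *const Anyfunc {
        let value = Anyfunc {
            func_index: index as u32,
            type_index: self.type_of_func[index],
        };
        self.slots[index].write(value) as *const Anyfunc
    }
}
```

Tracking which slots have already been written, as mentioned above, would let `get` skip the store on repeat calls at the cost of extra state.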
Funcref tables are a little tricky because funcrefs can be null. We need
to distinguish "element was initially non-null, but user stored explicit
null later" from "element never touched" (i.e., the lazy init should not
blow away an explicitly stored null). We solve this by stealing the LSB
from every funcref (anyfunc pointer): when the LSB is set, the funcref
is initialized and we don't hit the lazy-init slowpath. We insert the
bit on storing to the table and mask it off after loading.
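A small sketch of the tagging scheme, under stated assumptions: `Anyfunc`, `LazyTable`, `INIT_BIT`, and the `initial` array (standing in for the precomputed per-element function indices) are hypothetical names, and the real table code is generated differently. A word of zero means "never touched"; any word with the LSB set is initialized, even if its pointer part is null.

```rust
/// Hypothetical anyfunc stand-in; 8-byte alignment keeps the pointer's LSB free.
#[repr(align(8))]
struct Anyfunc {
    func_index: u32,
}

const INIT_BIT: usize = 1;

struct LazyTable<'a> {
    /// Tagged pointer words: 0 = uninitialized, LSB set = initialized.
    words: Vec<usize>,
    /// Assumed precomputed initial function index per element
    /// (`None` means the element starts out null).
    initial: Vec<Option<usize>>,
    anyfuncs: &'a [Anyfunc],
}

impl<'a> LazyTable<'a> {
    fn new(initial: Vec<Option<usize>>, anyfuncs: &'a [Anyfunc]) -> Self {
        Self { words: vec![0; initial.len()], initial, anyfuncs }
    }

    /// Store path: insert the tag bit, even when storing null.
    fn set(&mut self, i: usize, f: Option<&'a Anyfunc>) {
        let ptr = f.map_or(0, |r| r as *const Anyfunc as usize);
        self.words[i] = ptr | INIT_BIT;
    }

    /// Load path: run the slow path only for untouched elements, then
    /// mask the tag bit off.
    fn get(&mut self, i: usize) -> Option<&'a Anyfunc> {
        if self.words[i] & INIT_BIT == 0 {
            let anyfuncs: &'a [Anyfunc] = self.anyfuncs;
            let f = self.initial[i].map(|idx| &anyfuncs[idx]);
            self.set(i, f);
        }
        let ptr = (self.words[i] & !INIT_BIT) as *const Anyfunc;
        // SAFETY: every non-null pointer stored here came from `&'a Anyfunc`.
        unsafe { ptr.as_ref() }
    }
}
```

Because `set` always sets the tag bit, an explicitly stored null becomes the word `1` rather than `0`, so a later `get` returns null instead of re-running the initializer.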
We do have to set up a precomputed array of `FuncIndex`s for the table
in order for this to work. We do this as part of the module compilation.
This PR also refactors the way that the runtime crate gains access to
information computed during module compilation.
Performance effect measured with in-tree benches/instantiation.rs, using
SpiderMonkey built for WASI, and with memfd enabled:
```
BEFORE:
sequential/default/spidermonkey.wasm
    time:   [68.569 us 68.696 us 68.856 us]
sequential/pooling/spidermonkey.wasm
    time:   [69.406 us 69.435 us 69.465 us]
parallel/default/spidermonkey.wasm: with 1 background thread
    time:   [69.444 us 69.470 us 69.497 us]
parallel/default/spidermonkey.wasm: with 16 background threads
    time:   [183.72 us 184.31 us 184.89 us]
parallel/pooling/spidermonkey.wasm: with 1 background thread
    time:   [69.018 us 69.070 us 69.136 us]
parallel/pooling/spidermonkey.wasm: with 16 background threads
    time:   [326.81 us 337.32 us 347.01 us]

WITH THIS PR:
sequential/default/spidermonkey.wasm
    time:   [6.7821 us 6.8096 us 6.8397 us]
    change: [-90.245% -90.193% -90.142%] (p = 0.00 < 0.05)
    Performance has improved.
sequential/pooling/spidermonkey.wasm
    time:   [3.0410 us 3.0558 us 3.0724 us]
    change: [-95.566% -95.552% -95.537%] (p = 0.00 < 0.05)
    Performance has improved.
parallel/default/spidermonkey.wasm: with 1 background thread
    time:   [7.2643 us 7.2689 us 7.2735 us]
    change: [-89.541% -89.533% -89.525%] (p = 0.00 < 0.05)
    Performance has improved.
parallel/default/spidermonkey.wasm: with 16 background threads
    time:   [147.36 us 148.99 us 150.74 us]
    change: [-18.997% -18.081% -17.285%] (p = 0.00 < 0.05)
    Performance has improved.
parallel/pooling/spidermonkey.wasm: with 1 background thread
    time:   [3.1009 us 3.1021 us 3.1033 us]
    change: [-95.517% -95.511% -95.506%] (p = 0.00 < 0.05)
    Performance has improved.
parallel/pooling/spidermonkey.wasm: with 16 background threads
    time:   [49.449 us 50.475 us 51.540 us]
    change: [-85.423% -84.964% -84.465%] (p = 0.00 < 0.05)
    Performance has improved.
```
So an improvement of roughly 85-95% in five of the six configurations, and
about 18% for parallel/default with 16 background threads, for a very large
module (7420 functions in its one funcref table, 31928 functions total).
@@ -22,8 +22,13 @@
#![cfg_attr(not(memfd), allow(unused_variables, unreachable_code))]

use std::sync::atomic::AtomicU64;
use std::sync::Arc;

use anyhow::Error;
use wasmtime_environ::DefinedFuncIndex;
use wasmtime_environ::DefinedMemoryIndex;
use wasmtime_environ::FunctionInfo;
use wasmtime_environ::SignatureIndex;

mod export;
mod externref;
@@ -145,3 +150,42 @@ pub unsafe trait Store {
    /// completely semantically transparent. Returns the new deadline.
    fn new_epoch(&mut self) -> Result<u64, Error>;
}

/// Functionality required by this crate for a particular module. This
/// is chiefly needed for lazy initialization of various bits of
/// instance state.
///
/// When an instance is created, it holds an Arc<dyn ModuleRuntimeInfo>
/// so that it can get to signatures, metadata on functions, memfd and
/// funcref-table images, etc. All of these things are ordinarily known
/// by the higher-level layers of Wasmtime. Specifically, the main
/// implementation of this trait is provided by
/// `wasmtime::module::ModuleInner`. Since the runtime crate sits at
/// the bottom of the dependence DAG though, we don't know or care about
/// that; we just need some implementor of this trait for each
/// allocation request.
pub trait ModuleRuntimeInfo: Send + Sync + 'static {
    /// The underlying Module.
    fn module(&self) -> &Arc<wasmtime_environ::Module>;

    /// The signatures.
    fn signature(&self, index: SignatureIndex) -> VMSharedSignatureIndex;

    /// The base address of where JIT functions are located.
    fn image_base(&self) -> usize;

    /// Descriptors about each compiled function, such as the offset from
    /// `image_base`.
    fn function_info(&self, func_index: DefinedFuncIndex) -> &FunctionInfo;

    /// memfd images, if any, for this module.
    fn memfd_image(&self, memory: DefinedMemoryIndex) -> anyhow::Result<Option<&Arc<MemoryMemFd>>>;

    /// A unique ID for this particular module. This can be used to
    /// allow for fastpaths to optimize a "re-instantiate the same
    /// module again" case.
    fn unique_id(&self) -> Option<CompiledModuleId>;

    /// A slice pointing to all data that is referenced by this instance.
    fn wasm_data(&self) -> &[u8];
}
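
The layering described in the doc comment above can be illustrated with a reduced sketch. Everything here is hypothetical and much smaller than the real trait: `RuntimeInfo` keeps only two of the methods, and `Instance` and `ModuleInner` are simplified stand-ins. The point is only that the lower layer defines the trait and holds an `Arc<dyn ...>`, while a higher layer supplies the implementor.

```rust
use std::sync::Arc;

// Lower ("runtime") layer: knows only the trait, not who implements it.
trait RuntimeInfo: Send + Sync + 'static {
    fn image_base(&self) -> usize;
    fn wasm_data(&self) -> &[u8];
}

struct Instance {
    info: Arc<dyn RuntimeInfo>,
}

impl Instance {
    fn new(info: Arc<dyn RuntimeInfo>) -> Self {
        Self { info }
    }

    /// Example of lazily consulting module-level data at use time.
    fn data_len(&self) -> usize {
        self.info.wasm_data().len()
    }
}

// Higher layer: owns the compiled artifacts and implements the trait.
struct ModuleInner {
    base: usize,
    data: Vec<u8>,
}

impl RuntimeInfo for ModuleInner {
    fn image_base(&self) -> usize {
        self.base
    }

    fn wasm_data(&self) -> &[u8] {
        &self.data
    }
}
```

Each instance created from the same module can clone the same `Arc`, so the module-level data is shared rather than copied per instance.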