Implement lazy funcref table and anyfunc initialization. (#3733)

During instance initialization, we build two sorts of arrays eagerly:

- We create an "anyfunc" (a `VMCallerCheckedAnyfunc`) for every function
  in an instance.

- We initialize every element of a funcref table with an initializer to
  a pointer to one of these anyfuncs.

Most instances will not touch (via `call_indirect` or `table.get`) all
funcref table elements. And most anyfuncs will never be referenced,
because most functions are never placed in tables or used with
`ref.func`. Thus, both of these initialization tasks are quite wasteful.
Profiling shows that a significant fraction of the remaining
instance-initialization time after our other recent optimizations is
going into these two tasks.

This PR implements two basic ideas:

- The anyfunc array can be lazily initialized as long as we retain the
  information needed to do so. For now, in this PR, we just recreate the
  anyfunc whenever a pointer is taken to it, because doing so is fast
  enough; in the future we could keep some state to know whether the
  anyfunc has been written yet and skip this work if redundant.

  This technique allows us to leave the anyfunc array as uninitialized
  memory, which can be a significant savings. Filling it with
  initialized anyfuncs is very expensive, but even zeroing it is
  expensive: e.g. in a large module, it can be >500KB.

- A funcref table can be lazily initialized as long as we retain a link
  to its corresponding instance and function index for each element. A
  zero in a table element means "uninitialized", and a slowpath does the
  initialization.
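The two ideas above can be sketched roughly as follows. This is an illustrative Rust sketch, not the PR's actual code; the struct layout and all names here are placeholders for the real runtime types:

```rust
/// Illustrative stand-in for `VMCallerCheckedAnyfunc`; the real
/// struct's layout lives in the runtime crate.
#[derive(Clone, Copy, Default)]
pub struct Anyfunc {
    pub func_ptr: usize,
    pub type_index: u32,
}

pub struct Instance {
    /// In the real runtime this memory starts out uninitialized;
    /// here we just model the "rewrite on every request" behavior.
    anyfuncs: Vec<Anyfunc>,
}

impl Instance {
    /// Recreate the anyfunc each time a pointer to it is taken. Per
    /// the PR, this is fast enough that no "already written" tracking
    /// is needed yet.
    pub fn get_anyfunc(&mut self, idx: usize) -> &Anyfunc {
        self.anyfuncs[idx] = Anyfunc {
            func_ptr: 0x1000 + idx, // placeholder for the real code pointer
            type_index: 0,
        };
        &self.anyfuncs[idx]
    }
}
```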

Funcref tables are a little tricky because funcrefs can be null. We need
to distinguish "element was initially non-null, but user stored explicit
null later" from "element never touched" (i.e., the lazy init should not
blow away an explicitly stored null). We solve this by stealing the LSB
from every funcref (anyfunc pointer): when the LSB is set, the funcref
is initialized and we don't hit the lazy-init slowpath. We insert the
bit on storing to the table and mask it off after loading.
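A minimal sketch of this tagging scheme (illustrative, not the PR's actual code; it assumes anyfunc pointers are at least 2-byte aligned, so bit 0 is free to use as a tag):

```rust
// Sketch of the LSB-tagging scheme for funcref table slots.
pub const FUNCREF_INIT_BIT: usize = 1;
pub const FUNCREF_MASK: usize = !FUNCREF_INIT_BIT;

/// Store a funcref (possibly a true null) into a table slot, tagging
/// it so later loads can tell "explicitly stored" from "untouched".
pub fn table_store(slot: &mut usize, anyfunc_ptr: usize) {
    *slot = anyfunc_ptr | FUNCREF_INIT_BIT;
}

/// Load from a table slot: `None` means the slot was never written,
/// so the caller must run the lazy-init slowpath; otherwise mask off
/// the tag bit to recover the real (possibly null) pointer.
pub fn table_load(slot: usize) -> Option<usize> {
    if slot == 0 {
        None
    } else {
        Some(slot & FUNCREF_MASK)
    }
}
```

Note that an explicitly stored null becomes the nonzero value `1` in the table, so it is never confused with an uninitialized (zero) slot.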

We do have to set up a precomputed array of `FuncIndex`s for the table
in order for this to work. We do this as part of the module compilation.

This PR also refactors the way that the runtime crate gains access to
information computed during module compilation.

Performance effect measured with in-tree benches/instantiation.rs, using
SpiderMonkey built for WASI, and with memfd enabled:

```
BEFORE:

sequential/default/spidermonkey.wasm
                        time:   [68.569 us 68.696 us 68.856 us]
sequential/pooling/spidermonkey.wasm
                        time:   [69.406 us 69.435 us 69.465 us]

parallel/default/spidermonkey.wasm: with 1 background thread
                        time:   [69.444 us 69.470 us 69.497 us]
parallel/default/spidermonkey.wasm: with 16 background threads
                        time:   [183.72 us 184.31 us 184.89 us]
parallel/pooling/spidermonkey.wasm: with 1 background thread
                        time:   [69.018 us 69.070 us 69.136 us]
parallel/pooling/spidermonkey.wasm: with 16 background threads
                        time:   [326.81 us 337.32 us 347.01 us]

WITH THIS PR:

sequential/default/spidermonkey.wasm
                        time:   [6.7821 us 6.8096 us 6.8397 us]
                        change: [-90.245% -90.193% -90.142%] (p = 0.00 < 0.05)
                        Performance has improved.
sequential/pooling/spidermonkey.wasm
                        time:   [3.0410 us 3.0558 us 3.0724 us]
                        change: [-95.566% -95.552% -95.537%] (p = 0.00 < 0.05)
                        Performance has improved.

parallel/default/spidermonkey.wasm: with 1 background thread
                        time:   [7.2643 us 7.2689 us 7.2735 us]
                        change: [-89.541% -89.533% -89.525%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/default/spidermonkey.wasm: with 16 background threads
                        time:   [147.36 us 148.99 us 150.74 us]
                        change: [-18.997% -18.081% -17.285%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/spidermonkey.wasm: with 1 background thread
                        time:   [3.1009 us 3.1021 us 3.1033 us]
                        change: [-95.517% -95.511% -95.506%] (p = 0.00 < 0.05)
                        Performance has improved.
parallel/pooling/spidermonkey.wasm: with 16 background threads
                        time:   [49.449 us 50.475 us 51.540 us]
                        change: [-85.423% -84.964% -84.465%] (p = 0.00 < 0.05)
                        Performance has improved.
```

So an improvement of something like 80-95% for a very large module (7420
functions in its one funcref table, 31928 functions total).
Author: Chris Fallin
Date: 2022-02-09 13:56:53 -08:00
Committed by: GitHub
Parent: 1b27508a42
Commit: 39a52ceb4f
26 changed files with 1000 additions and 437 deletions


@@ -18,8 +18,12 @@ macro_rules! foreach_builtin_function {
memory_fill(vmctx, i32, i64, i32, i64) -> ();
/// Returns an index for wasm's `memory.init` instruction.
memory_init(vmctx, i32, i32, i64, i32, i32) -> ();
/// Returns a value for wasm's `ref.func` instruction.
ref_func(vmctx, i32) -> (pointer);
/// Returns an index for wasm's `data.drop` instruction.
data_drop(vmctx, i32) -> ();
/// Returns a table entry after lazily initializing it.
table_get_lazy_init_funcref(vmctx, i32, i32) -> (pointer);
/// Returns an index for Wasm's `table.grow` instruction for `funcref`s.
table_grow_funcref(vmctx, i32, i32, pointer) -> (i32);
/// Returns an index for Wasm's `table.grow` instruction for `externref`s.


@@ -29,6 +29,7 @@ mod compilation;
mod module;
mod module_environ;
pub mod obj;
mod ref_bits;
mod stack_map;
mod trap_encoding;
mod tunables;
@@ -39,6 +40,7 @@ pub use crate::builtin::*;
pub use crate::compilation::*;
pub use crate::module::*;
pub use crate::module_environ::*;
pub use crate::ref_bits::*;
pub use crate::stack_map::StackMap;
pub use crate::trap_encoding::*;
pub use crate::tunables::Tunables;


@@ -1,6 +1,7 @@
//! Data structures for representing decoded wasm modules.
use crate::{EntityRef, ModuleTranslation, PrimaryMap, Tunables, WASM_PAGE_SIZE};
use crate::{ModuleTranslation, PrimaryMap, Tunables, WASM_PAGE_SIZE};
use cranelift_entity::{packed_option::ReservedValue, EntityRef};
use indexmap::IndexMap;
use serde::{Deserialize, Serialize};
use std::collections::BTreeMap;
@@ -259,6 +260,92 @@ impl ModuleTranslation<'_> {
}
self.module.memory_initialization = MemoryInitialization::Paged { map };
}
/// Attempts to convert the module's table initializers to
/// FuncTable form where possible. This enables lazy table
/// initialization later by providing a one-to-one map of initial
/// table values, without having to parse all segments.
pub fn try_func_table_init(&mut self) {
// This should be large enough to support very large Wasm
// modules with huge funcref tables, but small enough to avoid
// OOMs or DoS on truly sparse tables.
const MAX_FUNC_TABLE_SIZE: u32 = 1024 * 1024;
let segments = match &self.module.table_initialization {
TableInitialization::Segments { segments } => segments,
TableInitialization::FuncTable { .. } => {
// Already done!
return;
}
};
// Build the table arrays per-table.
let mut tables = PrimaryMap::with_capacity(self.module.table_plans.len());
// Keep the "leftovers" for eager init.
let mut leftovers = vec![];
for segment in segments {
// Skip imported tables: we can't provide a preconstructed
// table for them, because their values depend on the
// imported table overlaid with whatever segments we have.
if self
.module
.defined_table_index(segment.table_index)
.is_none()
{
leftovers.push(segment.clone());
continue;
}
// If this is not a funcref table, then we can't support a
// pre-computed table of function indices.
if self.module.table_plans[segment.table_index].table.wasm_ty != WasmType::FuncRef {
leftovers.push(segment.clone());
continue;
}
// If the base of this segment is dynamic, then we can't
// include it in the statically-built array of initial
// contents.
if segment.base.is_some() {
leftovers.push(segment.clone());
continue;
}
// Get the end of this segment. If out-of-bounds, or too
// large for our dense table representation, then skip the
// segment.
let top = match segment.offset.checked_add(segment.elements.len() as u32) {
Some(top) => top,
None => {
leftovers.push(segment.clone());
continue;
}
};
let table_size = self.module.table_plans[segment.table_index].table.minimum;
if top > table_size || top > MAX_FUNC_TABLE_SIZE {
leftovers.push(segment.clone());
continue;
}
// We can now incorporate this segment into the initializers array.
while tables.len() <= segment.table_index.index() {
tables.push(vec![]);
}
let elements = &mut tables[segment.table_index];
if elements.is_empty() {
elements.resize(table_size as usize, FuncIndex::reserved_value());
}
let dst = &mut elements[(segment.offset as usize)..(top as usize)];
dst.copy_from_slice(&segment.elements[..]);
}
self.module.table_initialization = TableInitialization::FuncTable {
tables,
segments: leftovers,
};
}
}
impl Default for MemoryInitialization {
@@ -460,7 +547,7 @@ impl TablePlan {
}
}
/// A WebAssembly table initializer.
/// A WebAssembly table initializer segment.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct TableInitializer {
/// The index of a table to initialize.
@@ -473,6 +560,56 @@ pub struct TableInitializer {
pub elements: Box<[FuncIndex]>,
}
/// Table initialization data for all tables in the module.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub enum TableInitialization {
/// "Segment" mode: table initializer segments, possibly with
/// dynamic bases, possibly applying to an imported table.
///
/// Every kind of table initialization is supported by the
/// Segments mode.
Segments {
/// The segment initializers. All apply to the table for which
/// this TableInitialization is specified.
segments: Vec<TableInitializer>,
},
/// "FuncTable" mode: a single array per table, with a function
/// index or null per slot. This is only possible to provide for a
/// given table when it is defined by the module itself, and can
/// only include data from initializer segments that have
/// statically-knowable bases (i.e., not dependent on global
/// values).
///
/// Any segments that are not compatible with this mode are held
/// in the `segments` array of "leftover segments", which are
/// still processed eagerly.
///
/// This mode facilitates lazy initialization of the tables. It is
/// thus "nice to have", but not necessary for correctness.
FuncTable {
/// For each table, an array of function indices (or
/// FuncIndex::reserved_value(), meaning no initialized value,
/// hence null by default). Array elements correspond
/// one-to-one to table elements; i.e., `elements[i]` is the
/// initial value for `table[i]`.
tables: PrimaryMap<TableIndex, Vec<FuncIndex>>,
/// Leftover segments that need to be processed eagerly on
/// instantiation. These either apply to an imported table (so
/// we can't pre-build a full image of the table from this
/// overlay) or have dynamically (at instantiation time)
/// determined bases.
segments: Vec<TableInitializer>,
},
}
impl Default for TableInitialization {
fn default() -> Self {
TableInitialization::Segments { segments: vec![] }
}
}
/// Different types that can appear in a module.
///
/// Note that each of these variants is intended to index further into a
@@ -512,8 +649,8 @@ pub struct Module {
/// The module "start" function, if present.
pub start_func: Option<FuncIndex>,
/// WebAssembly table initializers.
pub table_initializers: Vec<TableInitializer>,
/// WebAssembly table initialization data, per table.
pub table_initialization: TableInitialization,
/// WebAssembly linear memory initializer.
pub memory_initialization: MemoryInitialization,


@@ -5,8 +5,8 @@ use crate::module::{
use crate::{
DataIndex, DefinedFuncIndex, ElemIndex, EntityIndex, EntityType, FuncIndex, Global,
GlobalIndex, GlobalInit, InstanceIndex, InstanceTypeIndex, MemoryIndex, ModuleIndex,
ModuleTypeIndex, PrimaryMap, SignatureIndex, TableIndex, Tunables, TypeIndex, WasmError,
WasmFuncType, WasmResult,
ModuleTypeIndex, PrimaryMap, SignatureIndex, TableIndex, TableInitialization, Tunables,
TypeIndex, WasmError, WasmFuncType, WasmResult,
};
use cranelift_entity::packed_option::ReservedValue;
use std::borrow::Cow;
@@ -512,9 +512,6 @@ impl<'data> ModuleEnvironment<'data> {
Payload::ElementSection(elements) => {
validator.element_section(&elements)?;
let cnt = usize::try_from(elements.get_count()).unwrap();
self.result.module.table_initializers.reserve_exact(cnt);
for (index, entry) in elements.into_iter().enumerate() {
let wasmparser::Element {
kind,
@@ -527,7 +524,7 @@ impl<'data> ModuleEnvironment<'data> {
// entries listed in this segment. Note that it's not
// possible to create anything other than a `ref.null
// extern` for externref segments, so those just get
// translate to the reserved value of `FuncIndex`.
// translated to the reserved value of `FuncIndex`.
let items_reader = items.get_items_reader()?;
let mut elements =
Vec::with_capacity(usize::try_from(items_reader.get_count()).unwrap());
@@ -576,15 +573,18 @@ impl<'data> ModuleEnvironment<'data> {
)));
}
};
self.result
.module
.table_initializers
.push(TableInitializer {
table_index,
base,
offset,
elements: elements.into(),
});
let table_segments = match &mut self.result.module.table_initialization
{
TableInitialization::Segments { segments } => segments,
TableInitialization::FuncTable { .. } => unreachable!(),
};
table_segments.push(TableInitializer {
table_index,
base,
offset,
elements: elements.into(),
});
}
ElementKind::Passive => {


@@ -0,0 +1,36 @@
//! Definitions for bits in the in-memory / in-table representation of references.
/// An "initialized bit" in a funcref table.
///
/// We lazily initialize tables of funcrefs, and this mechanism
/// requires us to interpret zero as "uninitialized", triggering a
/// slowpath on table read to possibly initialize the element. (This
/// has to be *zero* because that is the only value we can cheaply
/// initialize, e.g. with newly mmap'd memory.)
///
/// However, the user can also store a null reference into a table. We
/// have to interpret this as "actually null", and not "lazily
/// initialize to the original funcref that this slot had".
///
/// To do so, we rewrite nulls into the "initialized null" value. Note
/// that this should *only exist inside the table*: whenever we load a
/// value out of a table, we immediately mask off the low bit that
/// contains the initialized-null flag. Conversely, when we store into
/// a table, we have to translate a true null into an "initialized
/// null".
///
/// We can generalize a bit in order to simplify the table-set logic: we
/// can set the LSB of *all* explicitly stored values to 1 in order to
/// note that they are indeed explicitly stored. We then mask off this
/// bit every time we load.
///
/// Note that we take care to set this bit and mask it off when
/// accessing tables directly in fastpaths in generated code as well.
pub const FUNCREF_INIT_BIT: usize = 1;
/// The mask we apply to all refs loaded from funcref tables.
///
/// This allows us to use the LSB as an "initialized flag" (see
/// above) to distinguish an initialized element from an
/// uninitialized element in a lazily-initialized funcref table.
pub const FUNCREF_MASK: usize = !FUNCREF_INIT_BIT;
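As a standalone illustration of these two constants (a sketch, not code from this PR): tagging makes any explicitly stored value, even a true null, nonzero, so zero uniquely means "uninitialized", and masking on load recovers the original pointer value.

```rust
pub const FUNCREF_INIT_BIT: usize = 1;
pub const FUNCREF_MASK: usize = !FUNCREF_INIT_BIT;

/// Tag a value on its way into the table.
pub fn tag(ptr: usize) -> usize {
    ptr | FUNCREF_INIT_BIT
}

/// Untag a value on its way out of the table.
pub fn untag(stored: usize) -> usize {
    stored & FUNCREF_MASK
}
```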