Add support for keeping pooling allocator pages resident (#5207)

When new wasm instances are created repeatedly in high-concurrency
environments, one of the largest bottlenecks is contention on the
kernel-level locks protecting virtual memory. Usage in such environments
is expected to go through the pooling instance allocator with the
`memory-init-cow` feature enabled, which means that the kernel-level VM
lock is acquired in operations such as the following (a rough sketch of
the corresponding syscalls is included after the list):

1. Growing a heap with `mprotect` (write lock)
2. Faulting in memory during usage (read lock)
3. Resetting a heap's contents with `madvise` (read lock)
4. Shrinking a heap with `mprotect` when reusing a slot (write lock)
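
For illustration, here is a rough sketch (not Wasmtime's actual code) of the
syscall pattern behind a pooled linear-memory slot's lifecycle on Linux. It
assumes the `libc` crate, and the helper names are hypothetical:

```rust
use std::io::{Error, Result};

/// (1) Grow a heap: make more of the slot's reservation accessible.
/// `mprotect` takes the process VM lock for writing, and (2) the first
/// access to each newly accessible page later faults it in under the read lock.
unsafe fn grow_heap(base: *mut u8, old_len: usize, new_len: usize) -> Result<()> {
    let rc = libc::mprotect(
        base.add(old_len).cast(),
        new_len - old_len,
        libc::PROT_READ | libc::PROT_WRITE,
    );
    if rc != 0 {
        return Err(Error::last_os_error());
    }
    Ok(())
}

/// (3) Reset a heap's contents by discarding its pages with `madvise` (they
/// read back as zero afterwards; the VM lock is taken for reading), then
/// (4) shrink the heap back down with `mprotect` (write lock) before the
/// slot is reused for the next instance.
unsafe fn reset_and_shrink(base: *mut u8, accessible_len: usize) -> Result<()> {
    if libc::madvise(base.cast(), accessible_len, libc::MADV_DONTNEED) != 0 {
        return Err(Error::last_os_error());
    }
    if libc::mprotect(base.cast(), accessible_len, libc::PROT_NONE) != 0 {
        return Err(Error::last_os_error());
    }
    Ok(())
}
```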

Rapid use of these operations can be detrimental to performance, especially
on otherwise heavily loaded systems, and the cost grows with how frequently
they happen. This commit is aimed at case (2) above, reducing the number of
page faults that must be fulfilled by the kernel.

Currently these page faults happen for three reasons:

* When memory is first accessed after the heap is grown.
* When the initial linear memory image is accessed for the first time.
* When the initial zero'd heap contents, not part of the linear memory
  image, are accessed.

This PR primarily addresses the last of these cases, and to a lesser
extent the first as well. Specifically it adds the ability to partially
reset a pooled linear memory with `memset` rather than `madvise`. Both
have the effect of resetting contents to zero, but they differ in their
effect on paging: `memset` keeps the pages resident in memory rather than
returning them to the kernel. This means that reusing a linear memory slot
will not trigger a page fault on any page that was previously `memset`,
since everything remains paged into the process.
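
As a minimal sketch of the idea, mirroring the `reset_table_pages_to_zero`
helper added in the diff below but with the decommit spelled as a raw
`madvise` via the `libc` crate (the helper name here is hypothetical):

```rust
use std::io::{Error, Result};

unsafe fn partially_reset(base: *mut u8, len: usize, keep_resident: usize) -> Result<()> {
    // Zero the first `keep_resident` bytes by hand so those pages stay
    // resident; reusing the slot won't page fault on them.
    let memset_len = len.min(keep_resident);
    std::ptr::write_bytes(base, 0, memset_len);

    // Hand the rest back to the kernel; it reads back as zero but must be
    // faulted in again if the next instance actually touches it.
    let rc = libc::madvise(
        base.add(memset_len).cast(),
        len - memset_len,
        libc::MADV_DONTNEED,
    );
    if rc != 0 {
        return Err(Error::last_os_error());
    }
    Ok(())
}
```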

The end result is that accesses to linear memory which has been touched by
`memset` will no longer page fault on reuse. On more recent kernels (6.0+)
this additionally means that pages which were zero'd by `memset`, made
inaccessible with `PROT_NONE`, and then made accessible again with
`PROT_READ | PROT_WRITE` will not page fault either. This situation is
common when a wasm instance grows its heap slightly and uses that memory,
and the heap is then shrunk when the slot is reused for the next instance.
Note that this particular optimization requires a 6.0+ kernel.

The same optimization is additionally applied to async stacks in the
pooling allocator as well as to table elements. Wasmtime's defaults are not
changing with this PR; instead, knobs are exposed for embedders to turn if
they so desire. This is currently being experimented with at Fastly, and I
may come back and alter Wasmtime's defaults if that seems suitable after
our measurements.
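
As a rough illustration of turning these knobs, here is a sketch using the
fields added to `PoolingInstanceAllocatorConfig` in this commit. The byte
values are arbitrary, and the import path is an assumption since this is an
internal `wasmtime-runtime` type rather than the public embedder API:

```rust
// Path is an assumption for illustration; in this commit the type lives in
// the internal `wasmtime-runtime` crate, not the public `wasmtime` API.
use wasmtime_runtime::PoolingInstanceAllocatorConfig;

fn configure_pooling() -> PoolingInstanceAllocatorConfig {
    let mut config = PoolingInstanceAllocatorConfig::default();
    // Zero async stacks on reuse, keeping the top 64 KiB of each stack
    // resident via `memset`.
    config.async_stack_zeroing = true;
    config.async_stack_keep_resident = 64 * 1024;
    // Keep up to 1 MiB of each linear memory resident when its slot is
    // reset; anything beyond that is released to the kernel with `madvise`.
    config.linear_memory_keep_resident = 1 << 20;
    // Same idea for table storage.
    config.table_keep_resident = 64 * 1024;
    config
}
```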
Author:    Alex Crichton
Date:      2022-11-04 15:56:34 -05:00
Committer: GitHub
Parent:    b14551d7ca
Commit:    d3a6181939

5 changed files with 320 additions and 52 deletions

@@ -126,6 +126,8 @@ struct InstancePool {
     index_allocator: Mutex<PoolingAllocationState>,
     memories: MemoryPool,
     tables: TablePool,
+    linear_memory_keep_resident: usize,
+    table_keep_resident: usize,
 }
 
 impl InstancePool {
@@ -156,6 +158,8 @@ impl InstancePool {
             )),
             memories: MemoryPool::new(&config.limits, tunables)?,
             tables: TablePool::new(&config.limits)?,
+            linear_memory_keep_resident: config.linear_memory_keep_resident,
+            table_keep_resident: config.table_keep_resident,
         };
 
         Ok(pool)
@@ -373,7 +377,10 @@ impl InstancePool {
                 // image, just drop it here, and let the drop handler for the
                 // slot unmap in a way that retains the address space
                 // reservation.
-                if image.clear_and_remain_ready().is_ok() {
+                if image
+                    .clear_and_remain_ready(self.linear_memory_keep_resident)
+                    .is_ok()
+                {
                     self.memories
                         .return_memory_image_slot(instance_index, def_mem_idx, image);
                 }
@@ -437,10 +444,20 @@ impl InstancePool {
             );
             drop(table);
-            decommit_table_pages(base, size).expect("failed to decommit table pages");
+            self.reset_table_pages_to_zero(base, size)
+                .expect("failed to decommit table pages");
         }
     }
 
+    fn reset_table_pages_to_zero(&self, base: *mut u8, size: usize) -> Result<()> {
+        let size_to_memset = size.min(self.table_keep_resident);
+        unsafe {
+            std::ptr::write_bytes(base, 0, size_to_memset);
+            decommit_table_pages(base.add(size_to_memset), size - size_to_memset)?;
+        }
+        Ok(())
+    }
+
     fn validate_table_plans(&self, module: &Module) -> Result<()> {
         let tables = module.table_plans.len() - module.num_imported_tables;
         if tables > self.tables.max_tables {
@@ -807,6 +824,7 @@ struct StackPool {
     page_size: usize,
     index_allocator: Mutex<PoolingAllocationState>,
     async_stack_zeroing: bool,
+    async_stack_keep_resident: usize,
 }
 
 #[cfg(all(feature = "async", unix))]
@@ -852,6 +870,7 @@ impl StackPool {
             max_instances,
             page_size,
             async_stack_zeroing: config.async_stack_zeroing,
+            async_stack_keep_resident: config.async_stack_keep_resident,
             // We always use a `NextAvailable` strategy for stack
             // allocation. We don't want or need an affinity policy
             // here: stacks do not benefit from being allocated to the
@@ -919,11 +938,32 @@ impl StackPool {
         assert!(index < self.max_instances);
 
         if self.async_stack_zeroing {
-            reset_stack_pages_to_zero(bottom_of_stack as _, stack_size).unwrap();
+            self.zero_stack(bottom_of_stack, stack_size);
         }
 
         self.index_allocator.lock().unwrap().free(SlotId(index));
     }
+
+    fn zero_stack(&self, bottom: usize, size: usize) {
+        // Manually zero the top of the stack to keep the pages resident in
+        // memory and avoid future page faults. Use the system to deallocate
+        // pages past this. This hopefully strikes a reasonable balance between:
+        //
+        // * memset for the whole range is probably expensive
+        // * madvise for the whole range incurs expensive future page faults
+        // * most threads probably don't use most of the stack anyway
+        let size_to_memset = size.min(self.async_stack_keep_resident);
+        unsafe {
+            std::ptr::write_bytes(
+                (bottom + size - size_to_memset) as *mut u8,
+                0,
+                size_to_memset,
+            );
+        }
+
+        // Use the system to reset remaining stack pages to zero.
+        reset_stack_pages_to_zero(bottom as _, size - size_to_memset).unwrap();
+    }
 }
 
 /// Configuration options for the pooling instance allocator supplied at
@@ -940,6 +980,22 @@ pub struct PoolingInstanceAllocatorConfig {
     pub limits: InstanceLimits,
     /// Whether or not async stacks are zeroed after use.
    pub async_stack_zeroing: bool,
+    /// If async stack zeroing is enabled and the host platform is Linux this is
+    /// how much memory to zero out with `memset`.
+    ///
+    /// The rest of memory will be zeroed out with `madvise`.
+    pub async_stack_keep_resident: usize,
+    /// How much linear memory, in bytes, to keep resident after resetting for
+    /// use with the next instance. This much memory will be `memset` to zero
+    /// when a linear memory is deallocated.
+    ///
+    /// Memory exceeding this amount in the wasm linear memory will be released
+    /// with `madvise` back to the kernel.
+    ///
+    /// Only applicable on Linux.
+    pub linear_memory_keep_resident: usize,
+    /// Same as `linear_memory_keep_resident` but for tables.
+    pub table_keep_resident: usize,
 }
 
 impl Default for PoolingInstanceAllocatorConfig {
@@ -949,6 +1005,9 @@ impl Default for PoolingInstanceAllocatorConfig {
             stack_size: 2 << 20,
             limits: InstanceLimits::default(),
             async_stack_zeroing: false,
+            async_stack_keep_resident: 0,
+            linear_memory_keep_resident: 0,
+            table_keep_resident: 0,
         }
     }
 }