Implement strings in adapter modules (#4623)

* Implement strings in adapter modules

This commit is a hefty addition to Wasmtime's support for the component
model. This implements the final remaining type (in the current type
hierarchy) unimplemented in adapter module trampolines: strings. Strings
are the most complicated type to implement in adapter trampolines
because they are highly structured chunks of data in memory (according
to specific encodings). Additionally, each lift/lower operation can
choose its own string encoding, meaning that Wasmtime, the host, may
have to convert between any ordered pair of string encodings.
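
As a rough illustration (not code from this PR), the set of conversions
the host may face can be enumerated as every ordered (source,
destination) pair of the three encodings. The `StringEncoding` variant
names mirror the ones visible in the diff; `ALL` and
`transcoding_pairs` are hypothetical helpers introduced only for this
sketch.

```rust
/// The three string encodings a lift/lower operation can select.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum StringEncoding {
    Utf8,
    Utf16,
    /// Also written `latin1+utf16`.
    CompactUtf16,
}

/// Hypothetical helper: every encoding, in a fixed order.
const ALL: [StringEncoding; 3] = [
    StringEncoding::Utf8,
    StringEncoding::Utf16,
    StringEncoding::CompactUtf16,
];

/// Hypothetical helper: every ordered (source, destination) pair.
/// Same-encoding pairs are included because they still need a
/// validating copy between two linear memories.
fn transcoding_pairs() -> Vec<(StringEncoding, StringEncoding)> {
    let mut pairs = Vec::new();
    for &src in ALL.iter() {
        for &dst in ALL.iter() {
            pairs.push((src, dst));
        }
    }
    pairs
}
```

With three encodings this yields nine ordered pairs.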

The `CanonicalABI.md` in the component-model repo in general specifies
all the fiddly bits of string encoding so there's not a ton of wiggle
room for Wasmtime to get creative. This PR largely "just" implements
that. The high-level architecture of this implementation is:

* Fused adapters are first identified to determine src/dst string
  encodings. This statically fixes what transcoding operation is being
  performed.

* The generated adapter will be responsible for managing calls to
  `realloc` and performing bounds checks. The adapter itself does not
  perform memory copies or validation of string contents, however.
  Instead each transcoding operation is modeled as an imported function
  into the adapter module.  This means that the adapter module
  dynamically, during compile time, determines what string transcoders
  are needed. Note that an imported transcoder is not only parameterized
  over the transcoding operation but additionally which memory is the
  source and which is the destination.

* The imported core wasm functions are modeled as a new
  `CoreDef::Transcoder` structure. These transcoders end up being small
  Cranelift-compiled trampolines. The Cranelift-compiled trampoline will
  load the actual base pointer of memory and add it to the relative
  pointers passed as function arguments. This trampoline then calls a
  transcoder "libcall" which enters Rust-defined functions for actual
  transcoding operations.

* Each possible transcoding operation is implemented in Rust with a
  unique name and a unique signature depending on the needs of the
  transcoder. I've tried to document inline what each transcoder does.
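
The parameterization described above (operation plus source and
destination memory) can be pictured with a minimal sketch. Everything
here is hypothetical and for illustration only: `Op`, `TranscoderKey`
and `Imports` are not the types used in Wasmtime, and the real set of
operations is larger.

```rust
use std::collections::HashMap;

/// Hypothetical subset of transcoding operations.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Op {
    Copy,
    Utf8ToUtf16,
    Utf16ToUtf8,
}

/// Hypothetical key: one import is needed per distinct combination of
/// operation, source memory and destination memory.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct TranscoderKey {
    op: Op,
    from_memory: u32,
    to_memory: u32,
}

/// Hypothetical table of transcoder imports for one adapter module.
#[derive(Default)]
struct Imports {
    indices: HashMap<TranscoderKey, u32>,
    order: Vec<TranscoderKey>,
}

impl Imports {
    /// Returns the import index for `key`, adding a new import only the
    /// first time a given key is requested.
    fn import(&mut self, key: TranscoderKey) -> u32 {
        if let Some(&idx) = self.indices.get(&key) {
            return idx;
        }
        let idx = self.order.len() as u32;
        self.indices.insert(key, idx);
        self.order.push(key);
        idx
    }
}
```

Under this assumption, requesting the same operation between the same
two memories reuses one import, while swapping the memories produces a
distinct one.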

This means that the `Module::translate_string` in adapter modules is by
far the largest translation method. The main reason for this is due to
the management around calling the imported transcoder functions in the
face of validating string pointer/lengths and performing the dance of
`realloc`-vs-transcode at the right time. I've tried to ensure that each
individual case in transcoding is documented well enough to understand
what's going on as well.

Additionally in this PR is a full implementation in the host for the
`latin1+utf16` encoding, which means that both lifting and lowering host
strings now work with this encoding.
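
A small sketch of how a `latin1+utf16` length can be interpreted,
assuming the 32-bit tag value `UTF16_TAG = 1 << 31` that appears in the
diff (which itself notes memory64 may need different constants). The
`Repr` enum and the `classify` and `byte_len` functions are
hypothetical names used only for this illustration.

```rust
/// High bit of the length: set when the payload is UTF-16.
const UTF16_TAG: usize = 1 << 31;

/// Hypothetical description of what a tagged length refers to.
#[derive(Debug, PartialEq, Eq)]
enum Repr {
    /// One byte per character.
    Latin1 { bytes: usize },
    /// Two bytes per 16-bit code unit.
    Utf16 { code_units: usize },
}

/// Splits a tagged length into its encoding and unit count.
fn classify(len: usize) -> Repr {
    if len & UTF16_TAG == 0 {
        Repr::Latin1 { bytes: len }
    } else {
        Repr::Utf16 { code_units: len ^ UTF16_TAG }
    }
}

/// Number of bytes occupied in linear memory, or `None` on overflow.
fn byte_len(len: usize) -> Option<usize> {
    match classify(len) {
        Repr::Latin1 { bytes } => Some(bytes),
        Repr::Utf16 { code_units } => code_units.checked_mul(2),
    }
}
```

The byte length is what a bounds check against linear memory would
need, which is why the UTF-16 case doubles the unit count.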

Currently the implementation of each transcoder function is likely far
from optimal. Where possible I've leaned on the standard library itself
and for latin1-related things I'm leaning on the `encoding_rs` crate. I
initially tried to implement everything with `encoding_rs` but was
unable to uniformly do so easily. For now I settled on trying to get a
known-correct (even in the face of endianness) implementation for all of
these transcoders. If and when performance becomes an issue it should be
possible to implement more optimized versions of each of these
transcoding operations.
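
To illustrate what "known-correct even in the face of endianness" can
look like with only the standard library, here is a minimal sketch. It
is not the transcoder code from this PR; `decode_utf16_le` and
`decode_latin1` are hypothetical helper names.

```rust
use std::char::DecodeUtf16Error;

/// Decodes little-endian UTF-16 bytes into a `String`. Code units are
/// assembled with `u16::from_le_bytes`, so the result does not depend
/// on the host's endianness. A trailing odd byte, if any, is ignored
/// by `chunks_exact`; a real implementation would reject it earlier.
fn decode_utf16_le(bytes: &[u8]) -> Result<String, DecodeUtf16Error> {
    let units = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]));
    std::char::decode_utf16(units).collect()
}

/// Decodes Latin-1 bytes into a `String`. Each Latin-1 byte maps to
/// the Unicode code point with the same numeric value.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}
```

Unpaired surrogates surface as an `Err` from `decode_utf16_le`, which
is the kind of validity check the negative tests exercise.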

Testing this commit has been somewhat difficult and my general plan,
like with the `(list T)` type, is to rely heavily on fuzzing to cover
the various cases here. In this PR though I've added a simple test that
pushes some statically known strings through all the pairs of encodings
between source and destination. I've attempted to pick "interesting"
strings that one way or another stress the various paths in each
transcoding operation to ideally get full branch coverage there.
Additionally a suite of "negative" tests has also been added to ensure
that validity of encoding is actually checked.
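
As a sketch of what "interesting" inputs might look like, the strings
below are chosen to hit different paths: empty, pure ASCII,
Latin-1-only, non-Latin-1 within the basic multilingual plane, and
characters needing surrogate pairs. These are illustrative picks, not
the strings used by the test in this PR, and `fits_latin1` and
`utf16_round_trips` are hypothetical helpers.

```rust
/// Illustrative inputs, one per category of interest.
const INTERESTING: &[&str] = &[
    "",                 // zero-length
    "hello",            // ASCII only
    "caf\u{e9}",        // fits in Latin-1, but multi-byte in UTF-8
    "\u{4e2d}\u{6587}", // needs UTF-16, no surrogates
    "a\u{1f600}b",      // needs a surrogate pair in UTF-16
];

/// True if every character is representable as a single Latin-1 byte.
fn fits_latin1(s: &str) -> bool {
    s.chars().all(|c| u32::from(c) <= 0xff)
}

/// True if encoding to UTF-16 and decoding back reproduces `s`.
fn utf16_round_trips(s: &str) -> bool {
    let units: Vec<u16> = s.encode_utf16().collect();
    match String::from_utf16(&units) {
        Ok(back) => back == s,
        Err(_) => false,
    }
}
```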

* Fix a temporarily commented out case

* Fix wasmtime-runtime tests

* Update deny.toml configuration

* Add `BSD-3-Clause` for the `encoding_rs` crate
* Remove some unused licenses

* Add an exemption for `encoding_rs` for now

* Split up the `translate_string` method

Move out all the closures and package up captured state into smaller
lists of arguments.

* Test out-of-bounds for zero-length strings
This commit is contained in:
Alex Crichton
2022-08-08 11:01:57 -05:00
committed by GitHub
parent e6d339b6ac
commit 650979ae40
33 changed files with 3239 additions and 190 deletions


@@ -37,6 +37,7 @@ once_cell = "1.12.0"
rayon = { version = "1.0", optional = true }
object = { version = "0.29", default-features = false, features = ['read_core', 'elf'] }
async-trait = { version = "0.1.51", optional = true }
encoding_rs = { version = "0.8.31", optional = true }
[target.'cfg(target_os = "windows")'.dependencies.windows-sys]
version = "0.36.0"
@@ -116,4 +117,5 @@ component-model = [
"wasmtime-runtime/component-model",
"dep:wasmtime-component-macro",
"dep:wasmtime-component-util",
"dep:encoding_rs",
]


@@ -11,7 +11,7 @@ use std::ptr::NonNull;
use std::sync::Arc;
use wasmtime_environ::component::{
AlwaysTrapInfo, ComponentTypes, FunctionInfo, GlobalInitializer, LoweredIndex,
RuntimeAlwaysTrapIndex, StaticModuleIndex, Translator,
RuntimeAlwaysTrapIndex, RuntimeTranscoderIndex, StaticModuleIndex, Translator,
};
use wasmtime_environ::{PrimaryMap, ScopeVec, SignatureIndex, Trampoline, TrapCode};
use wasmtime_jit::CodeMemory;
@@ -64,6 +64,10 @@ struct ComponentInner {
/// These functions are "degenerate functions" here solely to implement
/// functions that are `canon lift`'d then immediately `canon lower`'d.
always_trap: PrimaryMap<RuntimeAlwaysTrapIndex, AlwaysTrapInfo>,
/// Where all the cranelift-generated transcode functions are located in the
/// compiled image of this component.
transcoders: PrimaryMap<RuntimeTranscoderIndex, FunctionInfo>,
}
impl Component {
@@ -158,7 +162,7 @@ impl Component {
|| Component::compile_component(engine, &component, &types, &provided_trampolines),
);
let static_modules = static_modules?;
let (lowerings, always_trap, trampolines, trampoline_obj) = trampolines?;
let (lowerings, always_trap, transcoders, trampolines, trampoline_obj) = trampolines?;
let mut trampoline_obj = CodeMemory::new(trampoline_obj);
let code = trampoline_obj.publish()?;
let text = wasmtime_jit::subslice_range(code.text, code.mmap);
@@ -218,6 +222,7 @@ impl Component {
text,
lowerings,
always_trap,
transcoders,
}),
})
}
@@ -231,6 +236,7 @@ impl Component {
) -> Result<(
PrimaryMap<LoweredIndex, FunctionInfo>,
PrimaryMap<RuntimeAlwaysTrapIndex, AlwaysTrapInfo>,
PrimaryMap<RuntimeTranscoderIndex, FunctionInfo>,
Vec<Trampoline>,
wasmtime_runtime::MmapVec,
)> {
@@ -239,22 +245,31 @@ impl Component {
|| -> Result<_> {
Ok(engine.join_maybe_parallel(
|| compile_always_trap(engine, component, types),
|| compile_trampolines(engine, component, types, provided_trampolines),
|| -> Result<_> {
Ok(engine.join_maybe_parallel(
|| compile_transcoders(engine, component, types),
|| compile_trampolines(engine, component, types, provided_trampolines),
))
},
))
},
);
let (lowerings, other) = results;
let (always_trap, trampolines) = other?;
let (always_trap, other) = other?;
let (transcoders, trampolines) = other?;
let mut obj = engine.compiler().object()?;
let (lower, traps, trampolines) = engine.compiler().component_compiler().emit_obj(
lowerings?,
always_trap?,
trampolines?,
&mut obj,
)?;
let (lower, traps, transcoders, trampolines) =
engine.compiler().component_compiler().emit_obj(
lowerings?,
always_trap?,
transcoders?,
trampolines?,
&mut obj,
)?;
return Ok((
lower,
traps,
transcoders,
trampolines,
wasmtime_jit::mmap_vec_from_obj(obj)?,
));
@@ -307,6 +322,30 @@ impl Component {
.collect())
}
fn compile_transcoders(
engine: &Engine,
component: &wasmtime_environ::component::Component,
types: &ComponentTypes,
) -> Result<PrimaryMap<RuntimeTranscoderIndex, Box<dyn Any + Send>>> {
let always_trap = component
.initializers
.iter()
.filter_map(|init| match init {
GlobalInitializer::Transcoder(i) => Some(i),
_ => None,
})
.collect::<Vec<_>>();
Ok(engine
.run_maybe_parallel(always_trap, |info| {
engine
.compiler()
.component_compiler()
.compile_transcoder(component, info, types)
})?
.into_iter()
.collect())
}
fn compile_trampolines(
engine: &Engine,
component: &wasmtime_environ::component::Component,
@@ -376,6 +415,11 @@ impl Component {
self.func(&info.info)
}
pub(crate) fn transcoder_ptr(&self, index: RuntimeTranscoderIndex) -> NonNull<VMFunctionBody> {
let info = &self.inner.transcoders[index];
self.func(info)
}
fn func(&self, info: &FunctionInfo) -> NonNull<VMFunctionBody> {
let text = self.text();
let trampoline = &text[info.start as usize..][..info.length as usize];


@@ -1,7 +1,7 @@
use crate::component::func::{Func, Memory, MemoryMut, Options};
use crate::store::StoreOpaque;
use crate::{AsContext, AsContextMut, StoreContext, StoreContextMut, ValRaw};
use anyhow::{bail, Context, Result};
use anyhow::{anyhow, bail, Context, Result};
use std::borrow::Cow;
use std::fmt;
use std::marker;
@@ -801,6 +801,10 @@ unsafe impl Lift for char {
}
}
// TODO: these probably need different constants for memory64
const UTF16_TAG: usize = 1 << 31;
const MAX_STRING_BYTE_LENGTH: usize = (1 << 31) - 1;
// Note that this is similar to `ComponentType for WasmStr` except it can only
// be used for lowering, not lifting.
unsafe impl ComponentType for str {
@@ -843,16 +847,51 @@ unsafe impl Lower for str {
}
fn lower_string<T>(mem: &mut MemoryMut<'_, T>, string: &str) -> Result<(usize, usize)> {
// Note that in general the wasm module can't assume anything about what the
// host strings are encoded as. Additionally hosts are allowed to have
// differently-encoded strings at runtime. Finally when copying a string
// into wasm it's somewhat strict in the sense that the various patterns of
// allocation and such are already dictated for us.
//
// In general what this means is that when copying a string from the host
// into the destination we need to follow one of the cases of copying into
// WebAssembly. It doesn't particularly matter which case as long as it ends
// up in the right encoding. For example a destination encoding of
// latin1+utf16 has a number of ways to get copied into and we do something
// here that isn't the default "utf8 to latin1+utf16" since we have access
// to simd-accelerated helpers in the `encoding_rs` crate. This is ok though
// because we can fake that the host string was already stored in latin1
// format and follow that copy pattern instead.
match mem.string_encoding() {
// This corresponds to `store_string_copy` in the canonical ABI where
// the host's representation is utf-8 and the wasm module wants utf-8 so
// a copy is all that's needed (and the `realloc` can be precise for the
// initial memory allocation).
StringEncoding::Utf8 => {
if string.len() > MAX_STRING_BYTE_LENGTH {
bail!(
"string length of {} too large to copy into wasm",
string.len()
);
}
let ptr = mem.realloc(0, 0, 1, string.len())?;
if string.len() > 0 {
mem.as_slice_mut()[ptr..][..string.len()].copy_from_slice(string.as_bytes());
}
Ok((ptr, string.len()))
}
// This corresponds to `store_utf8_to_utf16` in the canonical ABI. Here
// an over-large allocation is performed and then shrunk afterwards if
// necessary.
StringEncoding::Utf16 => {
let size = string.len() * 2;
if size > MAX_STRING_BYTE_LENGTH {
bail!(
"string length of {} too large to copy into wasm",
string.len()
);
}
let mut ptr = mem.realloc(0, 0, 2, size)?;
let mut copied = 0;
if size > 0 {
@@ -869,8 +908,60 @@ fn lower_string<T>(mem: &mut MemoryMut<'_, T>, string: &str) -> Result<(usize, u
}
Ok((ptr, copied))
}
StringEncoding::CompactUtf16 => {
unimplemented!("compact-utf-16");
// This corresponds to `store_string_to_latin1_or_utf16`
let bytes = string.as_bytes();
let mut iter = string.char_indices();
let mut ptr = mem.realloc(0, 0, 2, bytes.len())?;
let mut dst = &mut mem.as_slice_mut()[ptr..][..bytes.len()];
let mut result = 0;
while let Some((i, ch)) = iter.next() {
// Test if this `char` fits into the latin1 encoding.
if let Ok(byte) = u8::try_from(u32::from(ch)) {
dst[result] = byte;
result += 1;
continue;
}
// .. if utf16 is forced to be used then the allocation is
// bumped up to the maximum size.
let worst_case = bytes
.len()
.checked_mul(2)
.ok_or_else(|| anyhow!("byte length overflow"))?;
if worst_case > MAX_STRING_BYTE_LENGTH {
bail!("byte length too large");
}
ptr = mem.realloc(ptr, bytes.len(), 2, worst_case)?;
dst = &mut mem.as_slice_mut()[ptr..][..worst_case];
// Previously encoded latin1 bytes are inflated to their 16-bit
// size for utf16
for i in (0..result).rev() {
dst[2 * i] = dst[i];
dst[2 * i + 1] = 0;
}
// and then the remainder of the string is encoded.
for (u, bytes) in string[i..]
.encode_utf16()
.zip(dst[2 * result..].chunks_mut(2))
{
let u_bytes = u.to_le_bytes();
bytes[0] = u_bytes[0];
bytes[1] = u_bytes[1];
result += 1;
}
if worst_case > 2 * result {
ptr = mem.realloc(ptr, worst_case, 2, 2 * result)?;
}
return Ok((ptr, result | UTF16_TAG));
}
if result < bytes.len() {
ptr = mem.realloc(ptr, bytes.len(), 2, result)?;
}
Ok((ptr, result))
}
}
}
@@ -898,7 +989,13 @@ impl WasmStr {
let byte_len = match memory.string_encoding() {
StringEncoding::Utf8 => Some(len),
StringEncoding::Utf16 => len.checked_mul(2),
StringEncoding::CompactUtf16 => unimplemented!(),
StringEncoding::CompactUtf16 => {
if len & UTF16_TAG == 0 {
Some(len)
} else {
(len ^ UTF16_TAG).checked_mul(2)
}
}
};
match byte_len.and_then(|len| ptr.checked_add(len)) {
Some(n) if n <= memory.as_slice().len() => {}
@@ -939,8 +1036,14 @@ impl WasmStr {
fn to_str_from_store<'a>(&self, store: &'a StoreOpaque) -> Result<Cow<'a, str>> {
match self.options.string_encoding() {
StringEncoding::Utf8 => self.decode_utf8(store),
StringEncoding::Utf16 => self.decode_utf16(store),
StringEncoding::CompactUtf16 => unimplemented!(),
StringEncoding::Utf16 => self.decode_utf16(store, self.len),
StringEncoding::CompactUtf16 => {
if self.len & UTF16_TAG == 0 {
self.decode_latin1(store)
} else {
self.decode_utf16(store, self.len ^ UTF16_TAG)
}
}
}
}
@@ -952,10 +1055,10 @@ impl WasmStr {
Ok(str::from_utf8(&memory[self.ptr..][..self.len])?.into())
}
fn decode_utf16<'a>(&self, store: &'a StoreOpaque) -> Result<Cow<'a, str>> {
fn decode_utf16<'a>(&self, store: &'a StoreOpaque, len: usize) -> Result<Cow<'a, str>> {
let memory = self.options.memory(store);
// See notes in `decode_utf8` for why this is panicking indexing.
let memory = &memory[self.ptr..][..self.len * 2];
let memory = &memory[self.ptr..][..len * 2];
Ok(std::char::decode_utf16(
memory
.chunks(2)
@@ -964,6 +1067,14 @@ impl WasmStr {
.collect::<Result<String, _>>()?
.into())
}
fn decode_latin1<'a>(&self, store: &'a StoreOpaque) -> Result<Cow<'a, str>> {
// See notes in `decode_utf8` for why this is panicking indexing.
let memory = self.options.memory(store);
Ok(encoding_rs::mem::decode_latin1(
&memory[self.ptr..][..self.len],
))
}
}
// Note that this is similar to `ComponentType for str` except it can only be
@@ -1068,7 +1179,7 @@ where
let size = list
.len()
.checked_mul(elem_size)
.ok_or_else(|| anyhow::anyhow!("size overflow copying a list"))?;
.ok_or_else(|| anyhow!("size overflow copying a list"))?;
let ptr = mem.realloc(0, 0, T::ALIGN32, size)?;
let mut cur = ptr;
for item in list {


@@ -10,9 +10,9 @@ use std::sync::Arc;
use wasmtime_environ::component::{
AlwaysTrap, ComponentTypes, CoreDef, CoreExport, Export, ExportItem, ExtractMemory,
ExtractPostReturn, ExtractRealloc, GlobalInitializer, InstantiateModule, LowerImport,
RuntimeImportIndex, RuntimeInstanceIndex, RuntimeModuleIndex,
RuntimeImportIndex, RuntimeInstanceIndex, RuntimeModuleIndex, Transcoder,
};
use wasmtime_environ::{EntityIndex, Global, GlobalInit, PrimaryMap, WasmType};
use wasmtime_environ::{EntityIndex, EntityType, Global, GlobalInit, PrimaryMap, WasmType};
use wasmtime_runtime::component::{ComponentInstance, OwnedComponentInstance};
/// An instantiated component.
@@ -142,6 +142,11 @@ impl InstanceData {
},
})
}
CoreDef::Transcoder(idx) => {
wasmtime_runtime::Export::Function(wasmtime_runtime::ExportFunction {
anyfunc: self.state.transcoder_anyfunc(*idx),
})
}
}
}
@@ -287,6 +292,8 @@ impl<'a> Instantiator<'a> {
_ => unreachable!(),
});
}
GlobalInitializer::Transcoder(e) => self.transcoder(e),
}
}
Ok(())
@@ -328,6 +335,17 @@ impl<'a> Instantiator<'a> {
);
}
fn transcoder(&mut self, transcoder: &Transcoder) {
self.data.state.set_transcoder(
transcoder.index,
self.component.transcoder_ptr(transcoder.index),
self.component
.signatures()
.shared_signature(transcoder.signature)
.expect("found unregistered signature"),
);
}
fn extract_memory(&mut self, store: &mut StoreOpaque, memory: &ExtractMemory) {
let mem = match self.data.lookup_export(store, &memory.export) {
wasmtime_runtime::Export::Memory(m) => m,
@@ -371,24 +389,16 @@ impl<'a> Instantiator<'a> {
// core wasm instantiations internally within a component are
// unnecessary and superfluous. Naturally though mistakes may be
// made, so double-check this property of wasmtime in debug mode.
if cfg!(debug_assertions) {
let export = self.data.lookup_def(store, arg);
let (_, _, expected) = imports.next().unwrap();
let val = unsafe { crate::Extern::from_wasmtime_export(export, store) };
crate::types::matching::MatchCx {
store,
engine: store.engine(),
signatures: module.signatures(),
types: module.types(),
}
.extern_(&expected, &val)
.expect("unexpected typecheck failure");
self.assert_type_matches(store, module, arg, expected);
}
let export = self.data.lookup_def(store, arg);
// The unsafety here should be ok since the `export` is loaded
// directly from an instance which should only give us valid export
// items.
let export = self.data.lookup_def(store, arg);
unsafe {
self.core_imports.push_export(&export);
}
@@ -397,6 +407,41 @@ impl<'a> Instantiator<'a> {
&self.core_imports
}
fn assert_type_matches(
&mut self,
store: &mut StoreOpaque,
module: &Module,
arg: &CoreDef,
expected: EntityType,
) {
let export = self.data.lookup_def(store, arg);
// If this value is a core wasm function then the type check is inlined
// here. This can otherwise fail `Extern::from_wasmtime_export` because
// there's no guarantee that there exists a trampoline for `f` so this
// can't fall through to the case below
if let wasmtime_runtime::Export::Function(f) = &export {
match expected {
EntityType::Function(expected) => {
let actual = unsafe { f.anyfunc.as_ref().type_index };
assert_eq!(module.signatures().shared_signature(expected), Some(actual));
return;
}
_ => panic!("function not expected"),
}
}
let val = unsafe { crate::Extern::from_wasmtime_export(export, store) };
crate::types::matching::MatchCx {
store,
engine: store.engine(),
signatures: module.signatures(),
types: module.types(),
}
.extern_(&expected, &val)
.expect("unexpected typecheck failure");
}
}
/// A "pre-instantiated" [`Instance`] which has all of its arguments already


@@ -1462,6 +1462,15 @@ impl Config {
compiler.build()
}
/// Internal setting for whether adapter modules for components will have
/// extra WebAssembly instructions inserted performing more debug checks
/// than are necessary.
#[cfg(feature = "component-model")]
pub fn debug_adapter_modules(&mut self, debug: bool) -> &mut Self {
self.tunables.debug_adapter_modules = debug;
self
}
}
fn round_up_to_pages(val: u64) -> u64 {