Implement strings in adapter modules (#4623)

* Implement strings in adapter modules

This commit is a hefty addition to Wasmtime's support for the component model. This implements the final remaining type (in the current type hierarchy) unimplemented in adapter module trampolines: strings. Strings are the most complicated type to implement in adapter trampolines because they are highly structured chunks of data in memory (according to specific encodings). Additionally each lift/lower operation can choose its own encoding for strings, meaning that Wasmtime, the host, may have to convert between any pairwise ordering of string encodings.

The `CanonicalABI.md` in the component-model repo in general specifies all the fiddly bits of string encoding so there's not a ton of wiggle room for Wasmtime to get creative. This PR largely "just" implements that. The high-level architecture of this implementation is:

* Fused adapters are first identified to determine src/dst string encodings. This statically fixes what transcoding operation is being performed.

* The generated adapter will be responsible for managing calls to `realloc` and performing bounds checks. The adapter itself does not perform memory copies or validation of string contents, however. Instead each transcoding operation is modeled as an imported function into the adapter module. This means that the adapter module dynamically, during compile time, determines what string transcoders are needed. Note that an imported transcoder is not only parameterized over the transcoding operation but additionally which memory is the source and which is the destination.

* The imported core wasm functions are modeled as a new `CoreDef::Transcoder` structure. These transcoders end up being small Cranelift-compiled trampolines. The Cranelift-compiled trampoline will load the actual base pointer of memory and add it to the relative pointers passed as function arguments. This trampoline then calls a transcoder "libcall" which enters Rust-defined functions for actual transcoding operations.

* Each possible transcoding operation is implemented in Rust with a unique name and a unique signature depending on the needs of the transcoder. I've tried to document inline what each transcoder does.

This means that the `Module::translate_string` in adapter modules is by far the largest translation method. The main reason for this is the management around calling the imported transcoder functions in the face of validating string pointer/lengths and performing the dance of `realloc`-vs-transcode at the right time. I've tried to ensure that each individual case in transcoding is documented well enough to understand what's going on as well.

Additionally in this PR is a full implementation in the host for the `latin1+utf16` encoding which means that both lifting and lowering host strings now works with this encoding.

Currently the implementation of each transcoder function is likely far from optimal. Where possible I've leaned on the standard library itself and for latin1-related things I'm leaning on the `encoding_rs` crate. I initially tried to implement everything with `encoding_rs` but was unable to uniformly do so easily. For now I settled on trying to get a known-correct (even in the face of endianness) implementation for all of these transcoders. If and when performance becomes an issue it should be possible to implement more optimized versions of each of these transcoding operations.

Testing this commit has been somewhat difficult and my general plan, like with the `(list T)` type, is to rely heavily on fuzzing to cover the various cases here. In this PR though I've added a simple test that pushes some statically known strings through all the pairs of encodings between source and destination. I've attempted to pick "interesting" strings that one way or another stress the various paths in each transcoding operation to ideally get full branch coverage there. Additionally a suite of "negative" tests has also been added to ensure that validity of encoding is actually checked.

* Fix a temporarily commented out case

* Fix wasmtime-runtime tests

* Update deny.toml configuration

* Add `BSD-3-Clause` for the `encoding_rs` crate
* Remove some unused licenses
* Add an exemption for `encoding_rs` for now

* Split up the `translate_string` method

Move out all the closures and package up captured state into smaller lists of arguments.

* Test out-of-bounds for zero-length strings
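
To make the "unique signature depending on the needs of the transcoder" point concrete, here is a minimal sketch of a latin1-to-utf8 style operation that reports both how much source was consumed and how much destination was written. The function name `latin1_to_utf8` and its slice-based signature are assumptions for illustration only; the actual libcalls work on raw pointers into linear memory and are not reproduced here.

```rust
/// Hypothetical helper (not Wasmtime's actual libcall): transcodes as many
/// Latin-1 bytes from `src` as fit into `dst` as UTF-8.
///
/// Returns `(bytes_read, bytes_written)` so that a caller can grow the
/// destination and resume from the first unread source byte.
fn latin1_to_utf8(src: &[u8], dst: &mut [u8]) -> (usize, usize) {
    let mut read = 0;
    let mut written = 0;
    for &byte in src {
        // Each Latin-1 byte maps to the Unicode scalar value with the same
        // number, which `char::from(u8)` implements.
        let ch = char::from(byte);
        let needed = ch.len_utf8(); // 1 byte for ASCII, 2 bytes otherwise
        if written + needed > dst.len() {
            // Destination is full: stop and report partial progress.
            break;
        }
        ch.encode_utf8(&mut dst[written..written + needed]);
        read += 1;
        written += needed;
    }
    (read, written)
}
```

With the Latin-1 input `[0x63, 0x61, 0x66, 0xE9]` ("café") and an 8-byte destination this returns `(4, 5)`; with a 4-byte destination it stops before the final byte and returns `(3, 3)`. Returning both counts is what lets a caller interleave transcoding with a `realloc` of the destination.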
2022-08-08 11:01:57 -05:00
parent e6d339b6ac
commit 650979ae40
33 changed files with 3239 additions and 190 deletions
--- a/crates/environ/src/fact/transcode.rs
+++ b/crates/environ/src/fact/transcode.rs
@@ -0,0 +1,178 @@
+use crate::fact::core_types::CoreTypes;
+use crate::MemoryIndex;
+use serde::{Deserialize, Serialize};
+use std::collections::HashMap;
+use wasm_encoder::{EntityType, ValType};
+
+pub struct Transcoders {
+    imported: HashMap<Transcoder, u32>,
+    prev_func_imports: u32,
+    imports: Vec<(String, EntityType, Transcoder)>,
+}
+
+#[derive(Copy, Clone, Hash, Eq, PartialEq)]
+pub struct Transcoder {
+    pub from_memory: MemoryIndex,
+    pub from_memory64: bool,
+    pub to_memory: MemoryIndex,
+    pub to_memory64: bool,
+    pub op: Transcode,
+}
+
+/// Possible transcoding operations that must be provided by the host.
+///
+/// Note that each transcoding operation may have a unique signature depending
+/// on the precise operation.
+#[allow(missing_docs)]
+#[derive(Debug, Copy, Clone, Hash, Eq, PartialEq, Serialize, Deserialize)]
+pub enum Transcode {
+    Copy(FixedEncoding),
+    Latin1ToUtf16,
+    Latin1ToUtf8,
+    Utf16ToCompactProbablyUtf16,
+    Utf16ToCompactUtf16,
+    Utf16ToLatin1,
+    Utf16ToUtf8,
+    Utf8ToCompactUtf16,
+    Utf8ToLatin1,
+    Utf8ToUtf16,
+}
+
+#[derive(Debug, Copy, Clone, Hash, Eq, PartialEq, Serialize, Deserialize)]
+#[allow(missing_docs)]
+pub enum FixedEncoding {
+    Utf8,
+    Utf16,
+    Latin1,
+}
+
+impl Transcoders {
+    pub fn new(prev_func_imports: u32) -> Transcoders {
+        Transcoders {
+            imported: HashMap::new(),
+            prev_func_imports,
+            imports: Vec::new(),
+        }
+    }
+
+    pub fn import(&mut self, types: &mut CoreTypes, transcoder: Transcoder) -> u32 {
+        *self.imported.entry(transcoder).or_insert_with(|| {
+            let idx = self.prev_func_imports + (self.imports.len() as u32);
+            self.imports
+                .push((transcoder.name(), transcoder.ty(types), transcoder));
+            idx
+        })
+    }
+
+    pub fn imports(&self) -> impl Iterator<Item = (&str, &str, EntityType, &Transcoder)> {
+        self.imports
+            .iter()
+            .map(|(name, ty, transcoder)| ("transcode", &name[..], *ty, transcoder))
+    }
+}
+
+impl Transcoder {
+    fn name(&self) -> String {
+        format!(
+            "{} (mem{} => mem{})",
+            self.op.desc(),
+            self.from_memory.as_u32(),
+            self.to_memory.as_u32(),
+        )
+    }
+
+    fn ty(&self, types: &mut CoreTypes) -> EntityType {
+        let from_ptr = if self.from_memory64 {
+            ValType::I64
+        } else {
+            ValType::I32
+        };
+        let to_ptr = if self.to_memory64 {
+            ValType::I64
+        } else {
+            ValType::I32
+        };
+
+        let ty = match self.op {
+            // These direct transcodings take the source pointer, the source
+            // code units, and the destination pointer.
+            //
+            // The memories being copied between are part of each intrinsic and
+            // the destination code units are the same as the source.
+            // Note that the pointers are dynamically guaranteed to be aligned
+            // and in-bounds for the code units length as defined by the string
+            // encoding.
+            Transcode::Copy(_) | Transcode::Latin1ToUtf16 => {
+                types.function(&[from_ptr, from_ptr, to_ptr], &[])
+            }
+
+            // Transcoding from utf8 to utf16 takes the from ptr/len as well as
+            // a destination. The destination is valid for len*2 bytes. The
+            // return value is how many code units were written to the
+            // destination.
+            Transcode::Utf8ToUtf16 => types.function(&[from_ptr, from_ptr, to_ptr], &[to_ptr]),
+
+            // Transcoding to utf8 as a smaller format takes all the parameters
+            // and returns the amount of space consumed in the src/destination
+            Transcode::Utf16ToUtf8 | Transcode::Latin1ToUtf8 => {
+                types.function(&[from_ptr, from_ptr, to_ptr, to_ptr], &[from_ptr, to_ptr])
+            }
+
+            // The return type is a tagged length which indicates which
+            // encoding (latin1 or utf16) was used for the destination.
+            Transcode::Utf16ToCompactProbablyUtf16 => {
+                types.function(&[from_ptr, from_ptr, to_ptr], &[to_ptr])
+            }
+
+            // The initial step of transcoding from a fixed format to a compact
+            // format. Takes the ptr/len of the source and the destination
+            // pointer. The destination length is implicitly the same. Returns
+            // how many code units were consumed in the source, which is also
+            // how many bytes were written to the destination.
+            Transcode::Utf8ToLatin1 | Transcode::Utf16ToLatin1 => {
+                types.function(&[from_ptr, from_ptr, to_ptr], &[from_ptr, to_ptr])
+            }
+
+            // The final step of transcoding to a compact format when the fixed
+            // transcode has failed. This takes the ptr/len of the source that's
+            // remaining to transcode. Then this takes the destination ptr/len
+            // as well as the destination bytes written so far with latin1.
+            // Finally this returns the number of code units written to the
+            // destination.
+            Transcode::Utf8ToCompactUtf16 | Transcode::Utf16ToCompactUtf16 => {
+                types.function(&[from_ptr, from_ptr, to_ptr, to_ptr, to_ptr], &[to_ptr])
+            }
+        };
+        EntityType::Function(ty)
+    }
+}
+
+impl Transcode {
+    /// Returns a human-readable description for this transcoding operation.
+    pub fn desc(&self) -> &'static str {
+        match self {
+            Transcode::Copy(FixedEncoding::Utf8) => "utf8-to-utf8",
+            Transcode::Copy(FixedEncoding::Utf16) => "utf16-to-utf16",
+            Transcode::Copy(FixedEncoding::Latin1) => "latin1-to-latin1",
+            Transcode::Latin1ToUtf16 => "latin1-to-utf16",
+            Transcode::Latin1ToUtf8 => "latin1-to-utf8",
+            Transcode::Utf16ToCompactProbablyUtf16 => "utf16-to-compact-probably-utf16",
+            Transcode::Utf16ToCompactUtf16 => "utf16-to-compact-utf16",
+            Transcode::Utf16ToLatin1 => "utf16-to-latin1",
+            Transcode::Utf16ToUtf8 => "utf16-to-utf8",
+            Transcode::Utf8ToCompactUtf16 => "utf8-to-compact-utf16",
+            Transcode::Utf8ToLatin1 => "utf8-to-latin1",
+            Transcode::Utf8ToUtf16 => "utf8-to-utf16",
+        }
+    }
+}
+
+impl FixedEncoding {
+    pub(crate) fn width(&self) -> u8 {
+        match self {
+            FixedEncoding::Utf8 => 1,
+            FixedEncoding::Utf16 => 2,
+            FixedEncoding::Latin1 => 1,
+        }
+    }
+}
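
The deduplication done by `Transcoders::import` above can be sketched in isolation as follows. This is a simplified illustration, not the code from this commit: `Key` and `ImportTable` are hypothetical stand-ins for `Transcoder` and `Transcoders`, and the operation is represented by its description string rather than the `Transcode` enum.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `Transcoder`: the operation plus the source and
/// destination memories it is specialized for.
#[derive(Copy, Clone, Hash, Eq, PartialEq, Debug)]
struct Key {
    from_memory: u32,
    to_memory: u32,
    op: &'static str,
}

/// Hypothetical stand-in for `Transcoders`.
struct ImportTable {
    /// Function index already assigned to each distinct key.
    indices: HashMap<Key, u32>,
    /// Number of function imports that precede the transcoder imports.
    first_index: u32,
    /// Keys in the order their imports must be emitted.
    order: Vec<Key>,
}

impl ImportTable {
    fn new(first_index: u32) -> Self {
        ImportTable {
            indices: HashMap::new(),
            first_index,
            order: Vec::new(),
        }
    }

    /// Returns the function index for `key`, allocating the next free index
    /// only the first time a given key is requested.
    fn import(&mut self, key: Key) -> u32 {
        let first_index = self.first_index;
        let order = &mut self.order;
        *self.indices.entry(key).or_insert_with(|| {
            let idx = first_index + order.len() as u32;
            order.push(key);
            idx
        })
    }
}
```

Requesting the same key twice yields the same index and a single recorded import, while the same operation between a different pair of memories gets its own index. That matches the commit message's note that a transcoder is parameterized over both the operation and the memories involved.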