This PR adds a new `isa::lookup_variant()` that takes a `BackendVariant`
(`Legacy`, `MachInst` or `Any`), and exposes both x86 backends as
separate variants if both are compiled into the build.
This will allow new use-cases that require both backends in the
same process: for example, differential fuzzing between the old and
new backends, or perhaps dynamic feature-flag selection between them.
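For illustration, a minimal sketch of selecting a specific backend
(the `lookup_variant` and `BackendVariant` names are from this PR; the
surrounding setup is an assumption):

```rust
use cranelift_codegen::isa::{self, BackendVariant, LookupError};
use target_lexicon::Triple;

fn machinst_builder() -> Result<isa::Builder, LookupError> {
    // Ask specifically for the new MachInst backend; `Legacy` selects
    // the old backend and `Any` picks whichever is compiled in.
    isa::lookup_variant(Triple::host(), BackendVariant::MachInst)
}
```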
This makes `fstat` work for stdout, stdin, and stderr as expected.
These seemed like the only reasonable functions to implement from the
`filestat_*` set for stdout, stdin, and stderr.
Fixes #2515
Mostly just tweaks to docs/naming/readability/tidying up.
The biggest thing is that the wasm bytes are passed in during compilation now,
rather than on initialization, which lets us remove the lifetime from our state
struct and makes wrangling unsafe conversions that much easier.
The jitdump header contains a "magic" field that is defined to hold
the value 0x4A695444 as u32 in native endianness. (This allows
consumers of the file to detect the endianness of the platform
where the file was written, and apply it when reading other fields.)
However, the current code always writes 0x4A695444 in little-endian
byte order, even on big-endian systems. This makes consumers fail
when attempting to read files written on big-endian platforms.
This is fixed by always writing the magic in native endianness.
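For illustration (not the actual jitdump writer), a native-endian
write uses `to_ne_bytes` where the old code effectively used
`to_le_bytes` on every platform:

```rust
use std::io::{self, Write};

const JITDUMP_MAGIC: u32 = 0x4A695444;

fn write_magic(out: &mut impl Write) -> io::Result<()> {
    // `to_ne_bytes` emits the platform's native byte order, letting
    // readers detect the producer's endianness from this field.
    out.write_all(&JITDUMP_MAGIC.to_ne_bytes())
}
```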
Android always has `utimensat` available, so it is not necessary (or
possible, for that matter) to emulate it. Mark the fallback path as
`unreachable!()`.
WebAssembly memory operations are by definition little-endian even on
big-endian target platforms. However, other memory accesses will require
native target endianness (e.g. to access parts of the VMContext that
are also accessed by VM native code). This means that on big-endian
targets,
the code generator will have to handle both little- and big-endian
memory accesses. However, there is currently no way to encode that
distinction into the Cranelift IR that describes memory accesses.
This patch provides such a way by adding an (optional) explicit
endianness marker to an instance of MemFlags. Since each Cranelift IR
instruction that describes memory accesses already has an instance of
MemFlags attached, this can now be used to provide endianness
information.
Note that by default, memory accesses will continue to use the native
target ISA endianness. To override this to specify an explicit
endianness, a MemFlags value that was built using the set_endianness
routine must be used. This patch does so for accesses that implement
WebAssembly memory operations.
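For illustration, a sketch of the intended usage (method names are
taken from this patch; the exact signatures are assumptions):

```rust
use cranelift_codegen::ir::{Endianness, MemFlags};

fn wasm_access_flags() -> MemFlags {
    // By default a MemFlags value uses the native target ISA
    // endianness; overriding it marks the access as little-endian,
    // as WebAssembly memory operations require.
    let mut flags = MemFlags::new();
    flags.set_endianness(Endianness::Little);
    flags
}
```

Any load or store instruction carrying these flags is then lowered as
a little-endian access even on a big-endian target.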
This patch addresses issue #2124.
Recent changes to fuzzers made expectations more strict about handling
errors while fuzzing, but this erroneously changed a module compilation
step to always assume that the input wasm is valid. Instead, a flag is
now passed through indicating whether the wasm blob is known valid or
invalid, and we panic only if compilation fails on known-valid input.
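A hypothetical sketch of the shape of the fix (names are illustrative,
not the actual fuzzer code):

```rust
use wasmtime::{Engine, Module};

// Only panic when a wasm blob known to be valid fails to compile;
// invalid inputs are allowed to fail silently.
fn compile_module(engine: &Engine, wasm: &[u8], known_valid: bool) -> Option<Module> {
    match Module::new(engine, wasm) {
        Ok(module) => Some(module),
        Err(e) if known_valid => panic!("known-valid wasm failed to compile: {}", e),
        Err(_) => None,
    }
}
```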
This method attempted to reserve space in the `results` list of final
modules. Unfortunately `results.reserve(nmodules)` isn't enough here
because this can be called many times before a module is actually
finished and pushed onto the vector. The attempted logic to work around
this was buggy, however, and would simply trigger geometric growth on
every single reservation because it erroneously assumed that a
reservation would be exactly met.
This is fixed by avoiding looking at the vector's capacity and instead
keeping track of modules-to-be in a side field. This field is
incremented and passed to `reserve`, as it represents the number of
modules that will eventually make their way into the result vector.
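A sketch of the fixed bookkeeping (field and type names are
hypothetical, not the actual wasmtime internals):

```rust
// `Module` stands in for the real compiled-module type.
struct Module;

struct Results {
    results: Vec<Module>,
    // Modules that have been started but not yet pushed onto `results`.
    upcoming: usize,
}

impl Results {
    fn reserve(&mut self, nmodules: usize) {
        // Track reservations ourselves instead of inspecting
        // `results.capacity()`, so repeated calls before a push don't
        // trigger geometric growth every time.
        self.upcoming += nmodules;
        self.results.reserve(self.upcoming);
    }

    fn push(&mut self, module: Module) {
        self.upcoming = self.upcoming.saturating_sub(1);
        self.results.push(module);
    }
}
```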
`fd_readdir` returns a "bufused" value, which indicates the number of
bytes read into the buffer. WASI libc expects this value to be equal
to the size of the buffer if the end of the directory has not yet
been scanned.
Previously, wasi-common's `fd_readdir` was writing as many complete
entries as it could fit and then stopping, but this meant it was
returning a size less than the buffer size even when the directory had
more entries. This patch makes it continue writing up until the end
of the buffer, and return that number of bytes, to let WASI libc
know that there's more to be read.
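An illustrative sketch of the new behavior (not the actual wasi-common
code): keep copying entries, truncating the final one at the end of
the buffer.

```rust
// Each serialized dirent is a byte slice; returns "bufused".
fn write_dirents<'a>(buf: &mut [u8], entries: impl Iterator<Item = &'a [u8]>) -> usize {
    let mut used = 0;
    for entry in entries {
        let remaining = buf.len() - used;
        // Copy as much of the entry as fits; a truncated final entry
        // makes bufused == buf.len(), telling WASI libc more remains.
        let n = entry.len().min(remaining);
        buf[used..used + n].copy_from_slice(&entry[..n]);
        used += n;
        if n < entry.len() {
            break;
        }
    }
    used
}
```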
Fixes #2493.
This commit updates all the wasm-tools crates that we use and enables
fuzzing of the module linking proposal in our various fuzz targets. This
also refactors some of the dummy value generation logic to be
infallible, the thinking being that we don't want to accidentally hide
errors while fuzzing. Additionally, instantiation is only allowed to
fail with a `Trap`; other failure reasons are unwrapped.
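A hedged sketch of the trap-only expectation (written against the
older `wasmtime` API shape; the exact signatures are assumptions):

```rust
use wasmtime::{Extern, Instance, Module, Store, Trap};

// Instantiation may only fail with a `Trap`; any other error variant
// is surfaced as a fuzzer-found bug via the `expect`.
fn instantiate(store: &Store, module: &Module, imports: &[Extern]) -> Option<Instance> {
    match Instance::new(store, module, imports) {
        Ok(instance) => Some(instance),
        Err(e) => {
            let _trap: Trap = e
                .downcast()
                .expect("instantiation should only fail with a Trap");
            None
        }
    }
}
```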
The new crate introduced here, `wasmtime-bench-api`, creates a shared
library, e.g. `wasmtime_bench_api.so`, for executing Wasm benchmarks
using Wasmtime. It allows us to measure several phases separately by
exposing `engine_compile_module`, `engine_instantiate_module`, and
`engine_execute_module`, which pass around an opaque pointer to the
internally initialized state. This state is initialized and freed by
`engine_create` and `engine_free`, respectively. The API also
introduces a way of passing in functions to satisfy the `"bench"
"start"` and `"bench" "end"` symbols that we expect Wasm benchmarks to
import. The API is exposed in a C-compatible way so that we can
dynamically load it (carefully) in our benchmark runner.
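A hypothetical usage sketch from the runner's perspective; the real
ABI takes more arguments (e.g. the Wasm bytes and the bench callbacks),
so these signatures are assumptions:

```rust
use std::os::raw::c_void;

extern "C" {
    // Hypothetical signatures; the real API threads the Wasm, config,
    // and the "bench" start/end callbacks through these calls.
    fn engine_create() -> *mut c_void;
    fn engine_compile_module(state: *mut c_void) -> i32;
    fn engine_instantiate_module(state: *mut c_void) -> i32;
    fn engine_execute_module(state: *mut c_void) -> i32;
    fn engine_free(state: *mut c_void);
}

fn bench_once() {
    unsafe {
        let state = engine_create();
        // The runner can time each phase independently.
        assert_eq!(engine_compile_module(state), 0);
        assert_eq!(engine_instantiate_module(state), 0);
        assert_eq!(engine_execute_module(state), 0);
        engine_free(state);
    }
}
```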
I was having limited success fuzzing locally because apparently the
fuzzer was spawning too many threads. Looking into it, that indeed
appears to be the case! The threads which time out wasm execution only
exit after their sleep has completely finished, meaning that if we
execute a ton of wasm that exits quickly, each run will generate a
sleeping thread.
This commit fixes the issue by using some synchronization to ensure the
sleeping thread exits when our fuzzed run also exits.
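An illustrative sketch of the synchronization pattern, using a
`Condvar` so the watchdog wakes as soon as the run finishes (names and
structure are assumptions, not the actual fuzzer code):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

// Run `f` under a timeout; the watchdog thread exits as soon as the
// run completes instead of sleeping for the full timeout.
fn with_timeout<R>(timeout: Duration, f: impl FnOnce() -> R) -> R {
    let pair = Arc::new((Mutex::new(false), Condvar::new()));
    let pair2 = Arc::clone(&pair);
    let watchdog = std::thread::spawn(move || {
        let (lock, cvar) = &*pair2;
        let mut finished = lock.lock().unwrap();
        while !*finished {
            let (guard, result) = cvar.wait_timeout(finished, timeout).unwrap();
            finished = guard;
            if result.timed_out() {
                // The real fuzzer would interrupt the Wasm here.
                break;
            }
        }
    });
    let result = f();
    let (lock, cvar) = &*pair;
    *lock.lock().unwrap() = true;
    cvar.notify_one();
    watchdog.join().unwrap();
    result
}
```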
As a subtle consequence of the recent load-op fusion, popcnt of a
value that came from a load.i32 was compiling into a 64-bit load. This
is a result of the way in which x86 infers the width of loads: the
width is a property of the instruction containing the memory reference,
not of the memory reference itself. So the `input_to_reg_mem()` helper
(convert an
instruction input into a register or memory reference) was providing the
appropriate memory reference for the result of a load.i32, but never
encoded the assumption that it would only be used in a 32-bit
instruction. It turns out that `popcnt.i32` uses a 64-bit instruction
to load this RM op, hence widening a 32-bit load into a 64-bit one
(which is problematic when the offset is `memory_length - 4`).
Separately, popcnt was using the RM operand twice, resulting in two
loads if we merged a load. This isn't a correctness bug in practice
because only a racy sequence (store interleaving between the loads)
would produce incorrect results, but we decided earlier to treat loads
as effectful for now, neither reordering nor duplicating them, to
deliberately reduce complexity.
Because of the second issue, the fix is just to force the operand into a
register always, so any source load will not be merged.
Discovered via fuzzing with oss-fuzz.
Lucet uses stack probes rather than the explicit stack limit checks
that Wasmtime uses. In bytecodealliance/lucet#616, I discovered that I
had previously not been running some Lucet runtime tests with the new
backend, and so was missing some test failures due to missing pieces
in the new backend.
This PR adds (i) calls to probestack, when enabled, in the prologue of
every function with a stack frame larger than one page (configurable via
flags); and (ii) trap metadata for every instruction on x86-64 that
can access the stack and hence can be the first point at which a stack
overflow is detected when the stack pointer is decremented.
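For illustration, enabling the behavior through Cranelift's shared
settings (the flag names may vary between versions; treat them as
assumptions):

```rust
use cranelift_codegen::settings::{self, Configurable};

fn probestack_flags() -> settings::Flags {
    let mut builder = settings::builder();
    // `enable_probestack` turns on probe calls in the prologue;
    // `probestack_size_log2` controls the guard-page granularity the
    // probes assume (2^12 = one 4 KiB page).
    builder.enable("enable_probestack").unwrap();
    builder.set("probestack_size_log2", "12").unwrap();
    settings::Flags::new(builder)
}
```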
The x64 backend currently builds the `RealRegUniverse` in a way that
generates somewhat suboptimal code. In many blocks, we see uses of
callee-save (non-volatile) registers (r12, r13, r14, rbx) first, even
in very short leaf functions where there are plenty of volatiles to
use. This leads to unnecessary spills/reloads.
On one (local) test program, a medium-sized C benchmark compiled to Wasm
and run on Wasmtime, I am seeing a ~10% performance improvement with
this change; it will be less pronounced in programs with high register
pressure (there we are likely to use all registers regardless, so the
prologue/epilogue will save/restore all callee-saves), or in programs
with fewer calls, but this is a clear win for small functions and in
many cases removes prologue/epilogue clobber-saves altogether.
Separately, I think the RA's coalescing is tripping up a bit in some
cases; see e.g. the filetest touched by this commit that loads a value
into %rsi then moves to %rax and returns immediately. This is an
orthogonal issue, though, and should be addressed (if worthwhile) in
regalloc.rs.