* Fix StructReturn handling: properly mark the clobber, and offset actual rets.
The legalization of `StructReturn` was causing issues in the new
call-handling code: the `StructReturn` ret was included in the `SigData` as
if it were an actual CLIF-level return value, but it is not.
Prior to using regalloc constraints for return values, we
unconditionally included rax (or the architecture's usual return
register) as a def, so it would be properly handled as "clobbered" by
the regalloc. With the new scheme, we include defs on the call only for
CLIF-level outputs. Callees with `StructReturn` args were thus not known
to clobber the return-value register, and values might be corrupted.
This PR updates the code to include a `StructReturn` ret as a clobber
rather than a returned value in the relevant spots. I observed it
causing saves/restores of rax in some CLIF that @bjorn3 provided me, but
I was having difficulty minimizing this into a test-case that I would be
comfortable including as a precise-output case (including the whole
thing verbatim would lock down a bunch of other irrelevant details and
cause test-update noise later). If we can find a more minimized example
I'm happy to include it as a filetest.
Fixes#5018.
* Replace resize+copy_from_slice with extend_from_slice
Vec::resize initializes the new space, which is wasted effort if we're
just going to call `copy_from_slice` on it immediately afterward. Using
`extend_from_slice` is simpler, and very slightly faster.
If the new size were bigger than the buffer we're copying from, then it
would make sense to initialize the excess. But it isn't: it's always
exactly the same size.
* Move helpers from Context to CompiledCode
These methods only use information from Context::compiled_code, so they
should live on CompiledCode instead.
* Remove an unnecessary #[cfg_attr]
There are other uses of `#[allow(clippy::too_many_arguments)]` in this
file, so apparently it doesn't need to be guarded by the "cargo-clippy"
feature.
* Fix a few comments
Two of these were wrong/misleading:
- `FunctionBuilder::new` does not clear the provided func_ctx. It does
debug-assert that the context is already clear, but I don't think
that's worth a comment.
- `switch_to_block` does not "create values for the arguments." That's
done by the combination of `append_block_params_for_function_params`
and `declare_wasm_parameters`.
* wasmtime-cranelift: Misc cleanups
The main change is to use the `CompiledCode` reference we already had
instead of getting it out of `Context` repeatedly. This removes a bunch
of `unwrap()` calls.
* wasmtime-cranelift: Factor out uncached compile
Resolve overlap in the RiscV64 backend by adding priorities to rules. Additionally, one test updated as a result of this work, as a peephole optimization for addition with immediates fires now.
Resolve overlap in the ISLE prelude and the x64 inst module by introducing new types that allow better sharing of extractor resuls, or falling back on priorities.
* Leverage Cargo's workspace inheritance feature
This commit is an attempt to reduce the complexity of the Cargo
manifests in this repository with Cargo's workspace-inheritance feature
becoming stable in Rust 1.64.0. This feature allows specifying fields in
the root workspace `Cargo.toml` which are then reused throughout the
workspace. For example this PR shares definitions such as:
* All of the Wasmtime-family of crates now use `version.workspace =
true` to have a single location which defines the version number.
* All crates use `edition.workspace = true` to have one default edition
for the entire workspace.
* Common dependencies are listed in `[workspace.dependencies]` to avoid
typing the same version number in a lot of different places (e.g. the
`wasmparser = "0.89.0"` is now in just one spot.
Currently the workspace-inheritance feature doesn't allow having two
different versions to inherit, so all of the Cranelift-family of crates
still manually specify their version. The inter-crate dependencies,
however, are shared amongst the root workspace.
This feature can be seen as a method of "preprocessing" of sorts for
Cargo manifests. This will help us develop Wasmtime but shouldn't have
any actual impact on the published artifacts -- everything's dependency
lists are still the same.
* Fix wasi-crypto tests
* Port branches to ISLE (AArch64)
Ported the existing implementations of the following opcodes for AArch64
to ISLE:
- `Brz`
- `Brnz`
- `Brif`
- `Brff`
- `BrIcmp`
- `Jump`
- `BrTable`
Copyright (c) 2022 Arm Limited
* Remove dead code
Copyright (c) 2022 Arm Limited
We weren't using the "union" cargo feature for the smallvec crate, which
reduces the size of a SmallVec by one machine word. This feature
requires Rust 1.49 but we already require much newer versions.
When using Wasmtime to compile pulldown-cmark from Sightglass, this
saves a decent amount of memory allocations and writes. According to
`valgrind --tool=dhat`:
- 6.2MiB (3.69%) less memory allocated over the program's lifetime
- 0.5MiB (4.13%) less memory allocated at maximum heap size
- 5.5MiB (1.88%) fewer bytes written to
- 0.44% fewer instructions executed
Sightglass reports a statistically significant runtime improvement too:
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 24379323.60 ± 20051394.04 (confidence = 99%)
shrink-abiarg-0406da67c.so is 1.01x to 1.13x faster than main-be690a468.so!
[227506364 355007998.78 423280514] main-be690a468.so
[227686018 330628675.18 406025344] shrink-abiarg-0406da67c.so
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 360151622.56 ± 278294316.90 (confidence = 99%)
shrink-abiarg-0406da67c.so is 1.01x to 1.07x faster than main-be690a468.so!
[8709162212 8911001926.44 9535111576] main-be690a468.so
[5058015392 8550850303.88 9282148438] shrink-abiarg-0406da67c.so
compilation :: cycles :: benchmarks/bz2/benchmark.wasm
Δ = 6936570.28 ± 6897696.38 (confidence = 99%)
shrink-abiarg-0406da67c.so is 1.00x to 1.08x faster than main-be690a468.so!
[155810934 175260571.20 234737344] main-be690a468.so
[119128240 168324000.92 257451074] shrink-abiarg-0406da67c.so
* Upgrade to regalloc2 0.4.1.
Incorporates bytecodealliance/regalloc2#85, which fixes a fuzzbug
related to constraints and liverange splits.
* Add audit of regalloc2 upgrade.
Ported the existing implementations of the following opcodes for AArch64
to ISLE:
- `Trueif`
- `Trueff`
- `Trapif`
- `Trapff`
- `Select`
- `Selectif`
- `SelectifSpectreGuard`
Copyright (c) 2022 Arm Limited
* ISLE: add support for multi-extractors and multi-constructors.
This support allows for rules that process multiple matching values per
extractor call on the left-hand side, and as a result, can produce
multiple values from the constructor whose body they define.
This is useful in situations where we are matching on an input data
structure that can have multiple "nodes" for a given value or ID, for
example in an e-graph.
* Review feedback: all multi-ctors and multi-etors return iterators; no `Vec` case.
* Add additional warning suppressions to generated-code toplevels to be consistent with new islec output.
Improved the instruction lowering for the following opcodes on AArch64,
and introduced support for converting to integers less than 32-bits wide
as per the docs:
- `FcvtToSintSat`
- `FcvtToUintSat`
Copyright (c) 2022 Arm Limited
* Vector bitcast support (AArch64 & Interpreter)
Implemented support for `bitcast` on vector values for AArch64 and the
interpreter.
Also corrected the verifier to ensure that the size, in bits, of the input and
output types match for a `bitcast`, per the docs.
Copyright (c) 2022 Arm Limited
* `I128` same-type bitcast support
Copyright (c) 2022 Arm Limited
* Directly return input for 64-bit GPR<=>GPR bitcast
Copyright (c) 2022 Arm Limited
* Cranelift: use regalloc2 constraints on caller side of ABI code.
This PR updates the shared ABI code and backends to use register-operand
constraints rather than explicit pinned-vreg moves for register
arguments and return values.
The s390x backend was not updated, because it has its own implementation
of ABI code. Ideally we could converge back to the code shared by x64
and aarch64 (which didn't exist when s390x ported calls to ISLE, so the
current situation is underestandable, to be clear!). I'll leave this for
future work.
This PR exposed several places where regalloc2 needed to be a bit more
flexible with constraints; it requires regalloc2#74 to be merged and
pulled in.
* Update to regalloc2 0.3.3.
In addition to version bump, this required removing two asserts as
`SpillSlot`s no longer carry their class (so we can't assert that they
have the correct class).
* Review comments.
* Filetest updates.
* Add cargo-vet audit for regalloc2 0.3.2 -> 0.3.3 upgrade.
* Update to regalloc2 0.4.0.
* Port `icmp` to ISLE (AArch64)
Ported the existing implementation of `icmp` (and, by extension, the
`lower_icmp` function) to ISLE for AArch64.
Copyright (c) 2022 Arm Limited
* Allow 'producer chains', eliminating `Nop0`s
Copyright (c) 2022 Arm Limited
* s390x: update some regalloc metadata to remove use of `reg_mod`.
This is a step toward ultimately removing modify-operands, which along
with removal of pinned vregs, lets us move to a completely
constraint-based and fully-SSA regalloc input and get some nice
advantages eventually.
There are still a few uses of `mod` operands and pinned vregs remaining,
especially around the "regpair" abstraction. Those proved to be a bit
trickier to update though, so will have to be done separately.
* Review feedback: restore two-arg pretty-print form.
* Review feedback.
* ABI: implement register arguments with constraints.
Currently, Cranelift's ABI code emits a sequence of moves from physical
registers into vregs at the top of the function body, one for every
register-carried argument.
For a number of reasons, we want to move to operand constraints instead,
and remove the use of explicitly-named "pinned vregs"; this allows for
better regalloc in theory, as it removes the need to "reverse-engineer"
the sequence of moves.
This PR alters the ABI code so that it generates a single "args"
pseudo-instruction as the first instruction in the function body. This
pseudo-inst defs all register arguments, and constrains them to the
appropriate registers at the def-point. Subsequently the regalloc can
move them wherever it needs to.
Some care was taken not to have this pseudo-inst show up in
post-regalloc disassemblies, but the change did cause a general regalloc
"shift" in many tests, so the precise-output updates are a bit noisy.
Sorry about that!
A subsequent PR will handle the other half of the ABI code, namely, the
callsite case, with a similar preg-to-constraint conversion.
* Update based on review feedback.
* Review feedback.
Previously, Cranelift panicked (via a a panic in regalloc2) when the
virtual-register limit of 2M (2^21) was reached. This resulted in a
perplexing and unhelpful failure when the user provided a too-large
input (such as the Wasm module in #4865).
This PR adds an explicit check when allocating vregs that fails with a
"code too large" error when the limit is hit, producing output such as
(on the minimized testcase from #4865):
```
Error: failed to compile wasm function 3785 at offset 0xa3f3
Caused by:
Compilation error: Code for function is too large
```
Fixes#4865.
* Initial forward-edge CFI implementation
Give the user the option to start all basic blocks that are targets
of indirect branches with the BTI instruction introduced by the
Branch Target Identification extension to the Arm instruction set
architecture.
Copyright (c) 2022, Arm Limited.
* Refactor `from_artifacts` to avoid second `make_executable` (#1)
This involves "parsing" twice but this is parsing just the header of an
ELF file so it's not a very intensive operation and should be ok to do
twice.
* Address the code review feedback
Copyright (c) 2022, Arm Limited.
Co-authored-by: Alex Crichton <alex@alexcrichton.com>
Using fallible extractors that produce no values for flag checks means
that it's not possible to pattern match cases where those flags are
false. This change reworks the existing flag-checking extractors to be
infallible, returning the flag's boolean value from the context instead.
This is a cherry-pick of a long-ago commit, 2d46637. The original
message reads:
> Now that `SyntheticAmode` can refer to constants, there is no longer a
> need for a separate instruction format--standard load instructions will
> work.
Since then, the transition to ISLE and the use of `XmmLoadConst` in many
more places makes this change a larger diff than the original. The basic
idea is the same, though: the extra indirection of `Inst::XMmLoadConst`
is removed and replaced by a direct use of `VCodeConstant` as a
`SyntheticAmode`. This has no effect on codegen, but the CLIF output is
now clearer in that the actual instruction is displayed (e.g., `movdqu`)
instead of a made-up instruction (`load_const`).
* Improve panic message if typevar_operand is None
* cranelift-fuzzgen: Don't allocate for each choice
I don't think the performance of test-case generation is at all
important here. I'm actually doing this in preparation for a bigger
refactor where I want to be able to borrow the list of valid choices for
a given opcode without worrying about lifetimes.
* cranelift-fuzzgen: Remove next_func_index
It's only used locally within `generate_funcrefs`, so it doesn't need to
be in the FunctionBuilder struct.
Also there's already a local counter that I think is good enough for
this. As far as I know, the function indexes only need to be distinct,
not contiguous.
* cranelift-fuzzgen: Separate resources from config
The function-global variables, blocks, etc that are generated before
generating instructions are all owned collections without any lifetime
parameters. By contrast, the Unstructured and Config are both borrowed.
Separating them will make it easier to borrow from the owned resources.
The previous implementation assumed that nothing had clobbered the
LR register since the current function had started executing, so
it would be incorrect for a non-leaf function, for example, that
contains the `get_return_address` operation right after a call.
The operation is valid only if the `preserve_frame_pointers` flag
is enabled, which implies that the presence of a frame record on
the stack is guaranteed.
Copyright (c) 2022, Arm Limited.
* cranelift: Remove of/nof overflow flags from icmp
Neither Wasmtime nor cg-clif use these flags under any circumstances.
From discussion on #3060 I see it's long been unclear what purpose these
flags served.
Fixes#3060, fixes#4406, and fixes #4875... by deleting all the code
that could have been buggy.
This changes the cranelift-fuzzgen input format by removing some IntCC
options, so I've gone ahead and enabled I128 icmp tests at the same
time. Since only the of/nof cases were failing before, I expect these to
work.
* Restore trapif tests
It's still useful to validate that iadd_ifcout's iflags result can be
forwarded correctly to trapif, and for that purpose it doesn't really
matter what condition code is checked.
This commit replaces #4869 and represents the actual version bump that
should have happened had I remembered to bump the in-tree version of
Wasmtime to 1.0.0 prior to the branch-cut date. Alas!
* cranelift-codegen: Remove all uses of DataValue
This type is only used by the interpreter, cranelift-fuzzgen, and
filetests. I haven't found another convenient crate for those to all
depend on where this type can live instead, but this small refactor at
least makes it obvious that code generation does not in any way depend
on the implementation of this type.
* Make DataValue, not Ieee32/64, respect IEEE754
This fixes#4857 by partially reverting #4849.
It turns out that Ieee32 and Ieee64 need bitwise equality semantics so
they can be used as hash-table keys.
Moving the IEEE754 semantics up a layer to DataValue makes sense in
conjunction with #4855, where we introduced a DataValue::bitwise_eq
alternative implementation of equality for those cases where users of
DataValue still want the bitwise equality semantics.
* cranelift-interpreter: Use eq/ord from DataValue
This fixes#4828, again, now that the comparison operators on DataValue
have the right IEEE754 semantics.
* Add regression test from issue #4857
* cranelift: Add `fcmp` tests
Some of these are disabled on aarch64 due to not being implemented yet.
* cranelift: Implement float PartialEq for Ieee{32,64} (fixes#4828)
Previously `PartialEq` was auto derived. This means that it was implemented in terms of PartialEq in a u32.
This is not correct for floats because `NaN != NaN`.
PartialOrd was manually implemented in 6d50099816, but it seems like it was an oversight to leave PartialEq out until now.
The test suite depends on the previous behaviour so we adjust it to keep comparing bits instead of floats.
* cranelift: Disable `fcmp ord` tests on aarch64
* cranelift: Disable `fcmp ueq` tests on aarch64
Previously the implementations of the various atomic memory IR operations
ignored the memory operation flags that were passed.
Copyright (c) 2022, Arm Limited.
Co-authored-by: Chris Fallin <chris@cfallin.org>
This slipped through the regalloc2 operand code update in #4811: the
CvtFloatToUintSeq pseudo-instruction actually clobbers its source. It
was marked as a "mod" operand in the original and I mistakenly
converted it to a "use" as I had not seen the actual clobber. The
instruction now takes an extra temp and makes a copy of `src` in the
appropriate place.
Fixes#4840.
This PR removes all uses of modify-operands in the aarch64 backend,
replacing them with reused-input operands instead. This has the nice
effect of removing a bunch of move instructions and more clearly
representing inputs and outputs.
This PR also removes the explicit use of pinned vregs in the aarch64
backend, instead using fixed-register constraints on the operands when
insts or pseudo-inst sequences require certain registers.
This is the second PR in the regalloc-semantics cleanup series; after
the remaining backend (s390x) and the ABI code are cleaned up as well,
we'll be able to simplify the regalloc2 frontend.