Introduce a new concept in the IR that allows a producer to create
dynamic vector types. An IR function can now contain global value(s)
that represent a dynamic scaling factor, for a given fixed-width
vector type. A dynamic type is then created by 'multiplying' the
corresponding global value with a fixed-width type. These new types
can be used just like the existing types and the type system has a
set of hard-coded dynamic types, such as I32X4XN, which the user
defined types map onto. The dynamic types are also used explicitly
to create dynamic stack slots, which have no set size like their
existing counterparts. New IR instructions are added to access these
new stack entities.
Currently, during codegen, the dynamic scaling factor has to be
lowered to a constant so the dynamic slots do eventually have a
compile-time known size, as do spill slots.
The current lowering for aarch64 just targets Neon, using a dynamic
scale of 1.
Copyright (c) 2022, Arm Limited.
@yuyang-ok reported via zulip that i128 overflow tests were:
1. different from the interpreter implementation
2. wrong on some of the test cases
This fixes both the tests and the aarch64 implementation and adds the
interpreter to the testsuite.
The previous `cls` code was producing wrong results when fed with a -1 i8.
The fix here is to sign extend instead of zero extending since we want
to keep the sign bit as one in order for it to be counted correctly
in the cls instruction
This also merges the interpreter only tests now that aarch64
correctly supports this instruction
* Upgrade to regalloc2 v0.2.3 to get bugfix from bytecodealliance/regalloc2#60.
* Update RELEASES.md.
* Update two compile tests based on slightly shifting regalloc output.
RA2 recently removed the need for a dedicated scratch register for
cyclic moves (bytecodealliance/regalloc2#51). This has moderate positive
performance impact on function bodies that were register-constrained, as
it means that one more register is available. In Sightglass, I measured
+5-8% on `blake3-scalar`, at least among current benchmarks.
Previously, the pinned register (enabled by the `enable_pinned_reg`
Cranelift setting and used via the `get_pinned_reg` and `set_pinned_reg`
CLIF ops) was only used when Cranelift was embedded in SpiderMonkey, in
order to support a pinned heap register. SpiderMonkey has its own
calling convention in Cranelift (named after the integration layer,
"Baldrdash").
However, the feature is more general, and should be usable with the
default system calling convention too, e.g. SysV or Windows Fastcall.
This PR fixes the ABI code to properly treat the pinned register as a
globally allocated register -- and hence an implicit input and output to
every function, not saved/restored in the prologue/epilogue -- for SysV
on x86-64 and aarch64, and Fastcall on x86-64.
Fixes#4170.
* Upgrade to regalloc2 0.1.3.
This pulls in bytecodealliance/regalloc2#49, which slightly improves
codegen in some cases where a safepoint (for reference-typed values)
occurs in the same liverange as a register-constrained use. For
example, in bytecodealliance/wasmtime#3785, an extra move instruction
appeared and a callee-save register was used (necessitating a more
expensive prologue) because of suboptimal splitting heuristics, which
this PR fixes. The updated RA2 heuristics appear to have no measured
downsides in existing benchmarks and improve the manually-observed
codegen issue.
* Update filetests where regalloc2 improvement altered behavior with reftypes.
Also fix and extend the current implementation:
- AtomicRMWOp::Clr != AtomicRmwOp::And, as the input needs to be
inverted first.
- Inputs to the cmp for the RMWLoop case are sign-extended when
needed.
- Lower Xchg to Swp.
- Lower Sub to Add with a negated input.
- Added more runtests.
Copyright (c) 2022, Arm Limited.
This PR switches Cranelift over to the new register allocator, regalloc2.
See [this document](https://gist.github.com/cfallin/08553421a91f150254fe878f67301801)
for a summary of the design changes. This switchover has implications for
core VCode/MachInst types and the lowering pass.
Overall, this change brings improvements to both compile time and speed of
generated code (runtime), as reported in #3942:
```
Benchmark Compilation (wallclock) Execution (wallclock)
blake3-scalar 25% faster 28% faster
blake3-simd no diff no diff
meshoptimizer 19% faster 17% faster
pulldown-cmark 17% faster no diff
bz2 15% faster no diff
SpiderMonkey, 21% faster 2% faster
fib(30)
clang.wasm 42% faster N/A
```
This change removes all variants of `load*_complex` and `store*_complex`
from Cranelift; this is a breaking change to the instructions exposed by
CLIF. The complete list of instructions removed is: `load_complex`,
`store_complex`, `uload8_complex`, `sload8_complex`, `istore8_complex`,
`sload8_complex`, `uload16_complex`, `sload16_complex`,
`istore16_complex`, `uload32_complex`, `sload32_complex`,
`istore32_complex`, `uload8x8_complex`, `sload8x8_complex`,
`sload16x4_complex`, `uload16x4_complex`, `uload32x2_complex`,
`sload32x2_complex`.
The rationale for this removal is that the Cranelift backend now has the
ability to pattern-match multiple upstream additions in order to
calculate the address to access. Previously, this was not possible so
the `*_complex` instructions were needed. Over time, these instructions
have fallen out of use in this repository, making the additional
overhead of maintaining them a chore.
* Update lots of `isa/*/*.clif` tests to `precise-output`
This commit goes through the `aarch64` and `x64` subdirectories and
subjectively changes tests from `test compile` to add `precise-output`.
This then auto-updates all the test expectations so they can be
automatically instead of manually updated in the future. Not all tests
were migrated, largely subject to the whims of myself, mainly looking to
see if the test was looking for specific instructions or just checking
the whole assembly output.
* Filter out `;;` comments from test expctations
Looks like the cranelift parser picks up all comments, not just those
trailing the function, so use a convention where `;;` is used for
human-readable-comments in test cases and `;`-prefixed comments are the
test expectation.
* cranelift: Add ability to auto-update test expectations
One of the problems of the current `*.clif` testing is that the files
are difficult to update when widespread changes are made (such as
removing modification of the frame pointer). Additionally when changing
register allocation or similar it can cause a large number of changes in
tests but the tests themselves didn't actually break. For this reason
this commit adds the ability to automatically update test expectations.
The idea behind this commit is that tests of the form `test compile` can
also optionally be flagged with the `precise-output` flag:
test compile precise-output
and when doing so the compiled form of each function is asserted to 100%
match the following comments and their test expectations. If a match is
not found then a `BLESS=1` environment variable can be used to
automatically rewrite the test file itself with the correct assertion.
If the environment variable isn't present and the expectation doesn't
match then the test fails.
It's hoped that, if approved, a follow-up commit can add
`precise-output` to all current `test compile` tests (or make it the
default) and all tests can be mass-updated. When developing locally test
expectations need not be written and instead tests can be run with
`BLESS=1` and the output can be manually verified. The environment
variable will not be present on CI which means that changes to the
output which don't also change the test expectation will cause CI to
fail. Furthermore this should still make updates to the test output
easily readable in review on CI because the test expectations are
intended to look the same as before.
Closes#1539
* Use raw vcode output in tests
* Fix a merge conflict
* Review comments
This commit migrates these existing instructions to ISLE from the manual
lowerings implemented today. This was mostly straightforward but while I
was at it I fixed what appeared to be broken translations for I{8,16}
for `clz`, `cls`, and `ctz`. Previously the lowerings would produce
results as-if the input was 32-bits, but now I believe they all
correctly account for the bit-width.
This patch makes spillslot allocation, spilling and reloading all based
on register class only. Hence when we have a 32- or 64-bit value in a
128-bit XMM register on x86-64 or vector register on aarch64, this
results in larger spillslots and spills/restores.
Why make this change, if it results in less efficient stack-frame usage?
Simply put, it is safer: there is always a risk when allocating
spillslots or spilling/reloading that we get the wrong type and make the
spillslot or the store/load too small. This was one contributing factor
to CVE-2021-32629, and is now the source of a fuzzbug in SIMD code that
puns an arbitrary user-controlled vector constant over another
stackslot. (If this were a pointer, that could result in RCE. SIMD is
not yet on by default in a release, fortunately.
In particular, we have not been particularly careful about using moves
between values of different types, for example with `raw_bitcast` or
with certain SIMD operations, and such moves indicate to regalloc.rs
that vregs are in equivalence classes and some arbitrary vreg in the
class is provided when allocating the spillslot or spilling/reloading.
Since regalloc.rs does not track actual type, and since we haven't been
careful about moves, we can't really trust this "arbitrary vreg in
equivalence class" to provide accurate type information.
In the fix to CVE-2021-32629 we fixed this for integer registers by
always spilling/reloading 64 bits; this fix can be seen as the analogous
change for FP/vector regs.
This commit translates the `rotl` and `rotr` lowerings already existing
to ISLE. The port was relatively straightforward with the biggest
changing being the instructions generated around i128 rotl/rotr
primarily due to register changes.
* aarch64: Migrate ishl/ushr/sshr to ISLE
This commit migrates the `ishl`, `ushr`, and `sshr` instructions to
ISLE. These involve special cases for almost all types of integers
(including vectors) and helper functions for the i128 lowerings since
the i128 lowerings look to be used for other instructions as well. This
doesn't delete the i128 lowerings in the Rust code just yet because
they're still used by Rust lowerings, but they should be deletable in
due time once those lowerings are translated to ISLE.
* Use more descriptive names for i128 lowerings
* Use a with_flags-lookalike for csel
* Use existing `with_flags_*`
* Coment backwards order
* Update generated code
* aarch64: Migrate some bit-ops to ISLE
This commit migrates these instructions to ISLE:
* `bnot`
* `band`
* `bor`
* `bxor`
* `band_not`
* `bor_not`
* `bxor_not`
The translations were relatively straightforward but the interesting
part here was trying to reduce the duplication between all these
instructions. I opted for a route that's similar to what the lowering
does today, having a `decl` which takes the `ALUOp` and then performs
further pattern matching internally. This enabled each instruction's
lowering to be pretty simple while we still get to handle all the fancy
cases of shifts, constants, etc, for each instruction.
* Actually delete previous lowerings
* Remove dead code
This commit migrates the sign/zero extension instructions from
`lower_inst.rs` to ISLE. There's actually a fair amount going on in this
migration since a few other pieces needed touching up along the way as
well:
* First is the actual migration of `uextend` and `sextend`. These
instructions are relatively simple but end up having a number of special
cases. I've attempted to replicate all the cases here but
double-checks would be good.
* This commit actually fixes a few issues where if the result of a vector
extraction is sign/zero-extended into i128 that actually results in
panics in the current backend.
* This commit adds exhaustive testing for
extension-of-a-vector-extraction is a noop wrt extraction.
* A bugfix around ISLE glue was required to get this commit working,
notably the case where the `RegMapper` implementation was trying to
map an input to an output (meaning ISLE was passing through an input
unmodified to the output) wasn't working. This requires a `mov`
instruction to be generated and this commit updates the glue to do
this. At the same time this commit updates the ISLE glue to share more
infrastructure between x64 and aarch64 so both backends get this fix
instead of just aarch64.
Overall I think that the translation to ISLE was a net benefit for these
instructions. It's relatively obvious what all the cases are now unlike
before where it took a few reads of the code and some boolean switches
to figure out which path was taken for each flavor of input. I think
there's still possible improvements here where, for example, the
`put_in_reg_{s,z}ext64` helper doesn't use this logic so technically
those helpers could also pattern match the "well atomic loads and vector
extractions automatically do this for us" but that's a possible future
improvement for later (and shouldn't be too too hard with some ISLE
refactoring).
* aarch64: Migrate {s,u}{div,rem} to ISLE
This commit migrates four different instructions at once to ISLE:
* `sdiv`
* `udiv`
* `srem`
* `urem`
These all share similar codegen and center around the `div` instruction
to use internally. The main feature of these was to model the manual
traps since the `div` instruction doesn't trap on overflow, instead
requiring manual checks to adhere to the semantics of the instruction
itself.
While I was here I went ahead and implemented an optimization for these
instructions when the right-hand-side is a constant with a known value.
For `udiv`, `srem`, and `urem` if the right-hand-side is a nonzero
constant then the checks for traps can be skipped entirely. For `sdiv`
if the constant is not 0 and not -1 then additionally all checks can be
elided. Finally if the right-hand-side of `sdiv` is -1 the zero-check is
elided, but it still needs a check for `i64::MIN` on the left-hand-side
and currently there's a TODO where `-1` is still checked too.
* Rebasing and review conflicts
This commit is the first "meaty" instruction added to ISLE for the
AArch64 backend. I chose to pick the first two in the current lowering's
`match` statement, `isub` and `iadd`. These two turned out to be
particularly interesting for a few reasons:
* Both had clearly migratable-to-ISLE behavior along the lines of
special-casing per type. For example 128-bit and vector arithmetic
were both easily translateable.
* The `iadd` instruction has special cases for fusing with a
multiplication to generate `madd` which is expressed pretty easily in
ISLE.
* Otherwise both instructions had a number of forms where they attempted
to interpret the RHS as various forms of constants, extends, or
shifts. There's a bit of a design space of how best to represent this
in ISLE and what I settled on was to have a special case for each form
of instruction, and the special cases are somewhat duplicated between
`iadd` and `isub`. There's custom "extractors" for the special cases
and instructions that support these special cases will have an
`rule`-per-case.
Overall I think the ISLE transitioned pretty well. I don't think that
the aarch64 backend is going to follow the x64 backend super closely,
though. For example the x64 backend is having a helper-per-instruction
at the moment but with AArch64 it seems to make more sense to only have
a helper-per-enum-variant-of-`MInst`. This is because the same
instruction (e.g. `ALUOp::Sub32`) can be expressed with multiple
different forms depending on the payload.
It's worth noting that the ISLE looks like it's a good deal larger than
the code actually being removed from lowering as part of this commit. I
think this is deceptive though because a lot of the logic in
`put_input_in_rse_imm12_maybe_negated` and `alu_inst_imm12` is being
inlined into the ISLE definitions for each instruction instead of having
it all packed into the helper functions. Some of the "boilerplate" here
is the addition of various ISLE utilities as well.
Rename the existing AtomicRMW to AtomicRMWLoop and directly lower
atomic_rmw operations, without a loop if LSE support is available.
Copyright (c) 2021, Arm Limited
* Cranelift AArch64: Simplify leaf functions that do not use the stack
Leaf functions that do not use the stack (e.g. do not clobber any
callee-saved registers) do not need a frame record.
Copyright (c) 2021, Arm Limited.
The AArch64 support was a bit broken and was using Armv7 style
barriers, which aren't required with Armv8 acquire-release
load/stores.
The fallback CAS loops and RMW, for AArch64, have also been updated
to use acquire-release, exclusive, instructions which, again, remove
the need for barriers. The CAS loop has also been further optimised
by using the extending form of the cmp instruction.
Copyright (c) 2021, Arm Limited.
* Change VMMemoryDefinition::current_length to `usize`
This commit changes the definition of
`VMMemoryDefinition::current_length` to `usize` from its previous
definition of `u32`. This is a pretty impactful change because it also
changes the cranelift semantics of "dynamic" heaps where the bound
global value specifier must now match the pointer type for the platform
rather than the index type for the heap.
The motivation for this change is that the `current_length` field (or
bound for the heap) is intended to reflect the current size of the heap.
This is bound by `usize` on the host platform rather than `u32` or`
u64`. The previous choice of `u32` couldn't represent a 4GB memory
because we couldn't put a number representing 4GB into the
`current_length` field. By using `usize`, which reflects the host's
memory allocation, this should better reflect the size of the heap and
allows Wasmtime to support a full 4GB heap for a wasm program (instead
of 4GB minus one page).
This commit also updates the legalization of the `heap_addr` clif
instruction to appropriately cast the address to the platform's pointer
type, handling bounds checks along the way. The practical impact for
today's targets is that a `uextend` is happening sooner than it happened
before, but otherwise there is no intended impact of this change. In the
future when 64-bit memories are supported there will likely need to be
fancier logic which handles offsets a bit differently (especially in the
case of a 64-bit memory on a 32-bit host).
The clif `filetest` changes should show the differences in codegen, and
the Wasmtime changes are largely removing casts here and there.
Closes#3022
* Add tests for memory.size at maximum memory size
* Add a dfg helper method
This commit addresses two issues:
* A panic when shifting any non i128 type by i128 amounts (#3064)
* Wrong results when lowering shifts with small types (i8, i16)
In these types when shifting for amounts larger than the size of the
type, we would not get the wrapping behaviour that we see on i32 and i64.
This is because in these larger types, the wrapping behaviour is automatically
implemented by using the appropriate instruction, however we do not
have i8 and i16 specific instructions, so we have to manually wrap
the shift amount with an AND instruction.
This issue is also found on x86_64 and s390x, and a separate issue will
be filed for those.
Closes#3064
Implement the `TlsValue` opcode in the aarch64 backend for ELF_GD.
This is a little bit unusual as the default TLS mechanism for aarch64 is TLS Descriptors in other compilers.
However currently we only recognize elf_gd so lets start with that as a TLS implementation.