Share a zero value in the translation of ushr for i128. This increases the lifetime of the value by a few instructions, and reduces the number of registers used in the translation by one, which seems like an acceptable trade-off.
When adding some optimization rules for `icmp` in the egraph
infrastructure, we ended up creating a path to legal CLIF but with
patterns unsupported by three of our four backends: specifically,
`select_spectre_guard` with a general truthy input, rather than an
`icmp`.
In #5206 we discussed replacing `select_spectre_guard` with something
more specific, and that could still be a long-term solution here, but
doing so now would interfere with ongoing refactoring of heap access
lowering, so I've opted not to do so. (In that issue I was concerned
about complexity and didn't see the need but with this fuzzbug I'm
starting to feel a bit differently; maybe we should remove this
non-orthogonal op in the long run.)
Fixes#5417.
All instructions using the CPU flags types (IFLAGS/FFLAGS) were already
removed. This patch completes the cleanup by removing all remaining
instructions that define values of CPU flags types, as well as the
types themselves.
Specifically, the following features are removed:
- The IFLAGS and FFLAGS types and the SpecialType category.
- Special handling of IFLAGS and FFLAGS in machinst/isle.rs and
machinst/lower.rs.
- The ifcmp, ifcmp_imm, ffcmp, iadd_ifcin, iadd_ifcout, iadd_ifcarry,
isub_ifbin, isub_ifbout, and isub_ifborrow instructions.
- The writes_cpu_flags instruction property.
- The flags verifier pass.
- Flags handling in the interpreter.
All of these features are currently unused; no functional change
intended by this patch.
This addresses https://github.com/bytecodealliance/wasmtime/issues/3249.
* cranelift-isle: Add "partial" flag for constructors
Instead of tying fallibility of constructors to whether they're either
internal or pure, this commit assumes all constructors are infallible
unless tagged otherwise with a "partial" flag.
Internal constructors without the "partial" flag are not allowed to use
constructors which have the "partial" flag on the right-hand side of any
rules, because they have no way to report last-minute match failures.
Multi-constructors should never be "partial"; they report match failures
with an empty iterator instead. In turn this means you can't use partial
constructors on the right-hand side of internal multi-constructor rules.
However, you can use the same constructors on the left-hand side with
`if` or `if-let` instead.
In many cases, ISLE can already trivially prove that an internal
constructor always returns `Some`. With this commit, those cases are
largely unchanged, except for removing all the `Option`s and `Some`s
from the generated code for those terms.
However, for internal non-partial constructors where ISLE could not
prove that, it now emits an `unreachable!` panic as the last-resort,
instead of returning `None` like it used to do. Among the existing
backends, here's how many constructors have these panic cases:
- x64: 14% (53/374)
- aarch64: 15% (41/277)
- riscv64: 23% (26/114)
- s390x: 47% (268/567)
It's often possible to rewrite rules so that ISLE can tell the panic can
never be hit. Just ensure that there's a lowest-priority rule which has
no constraints on the left-hand side.
But in many of these constructors, it's difficult to statically prove
the unhandled cases are unreachable because that's only down to
knowledge about how they're called or other preconditions.
So this commit does not try to enforce that all terms have a last-resort
fallback rule.
* Check term flags while translating expressions
Instead of doing it in a separate pass afterward.
This involved threading all the term flags (pure, multi, partial)
through the recursive `translate_expr` calls, so I extracted the flags
to a new struct so they can all be passed together.
* Validate multi-term usage
Now that I've threaded the flags through `translate_expr`, it's easy to
check this case too, so let's just do it.
* Extract `ReturnKind` to use in `ExternalSig`
There are only three legal states for the combination of `multi` and
`infallible`, so replace those fields of `ExternalSig` with a
three-state enum.
* Remove `Option` wrapper from multi-extractors too
If we'd had any external multi-constructors this would correct their
signatures as well.
* Update ISLE tests
* Tag prelude constructors as pure where appropriate
I believe the only reason these weren't marked `pure` before was because
that would have implied that they're also partial. Now that those two
states are specified separately we apply this flag more places.
* Fix my changes to aarch64 `lower_bmask` and `imm` terms
* Optimizations to egraph framework:
- Save elaborated results by canonical value, not latest value (union
value). Previously we were artificially skipping and re-elaborating
some values we already had because we were not finding them in the
map.
- Make some changes to handling of icmp results: when icmp became
I8-typed (when bools went away), many uses became `(uextend $I32 (icmp
$I8 ...))`, and so patterns in lowering backends were no longer
matching.
This PR includes an x64-specific change to match `(brz (uextend (icmp
...)))` and similarly for `brnz`, but it also takes advantage of the
ability to write rules easily in the egraph mid-end to rewrite selects
with icmp inputs appropriately.
- Extend constprop to understand selects in the egraph mid-end.
With these changes, bz2.wasm sees a ~1% speedup, and spidermonkey.wasm
with a fib.js input sees a 16.8% speedup:
```
$ time taskset 1 target/release/wasmtime run --allow-precompiled --dir=. ./spidermonkey.base.cwasm ./fib.js
1346269
taskset 1 target/release/wasmtime run --allow-precompiled --dir=. ./fib.js 2.14s user 0.01s system 99% cpu 2.148 total
$ time taskset 1 target/release/wasmtime run --allow-precompiled --dir=. ./spidermonkey.egraphs.cwasm ./fib.js
1346269
taskset 1 target/release/wasmtime run --allow-precompiled --dir=. ./fib.js 1.78s user 0.01s system 99% cpu 1.788 total
```
* Review feedback.
Fix shadowing identified in #5322 for imul and swiden_high/swiden_low/uwiden_high/uwiden_low combinations in the x64 backend, and remove some redundant rules from the aarch64 dynamic neon ruleset. Additionally, add tests to the x64 backend showing that the imul specializations are firing.
Add a MemFlags operand to the bitcast instruction, where only the
`big` and `little` flags are accepted. These define the lane order
to be used when casting between types of different lane counts.
Update all users to pass an appropriate MemFlags argument.
Implement lane swaps where necessary in the s390x back-end.
This is the final part necessary to fix
https://github.com/bytecodealliance/wasmtime/issues/4566.
The simplifier was performing an optimization to replace bitselect
with vselect if the all bytes of the condition mask could be shown
to be all ones or all zeros.
This optimization only ever made any difference in codegen on the
x64 target. Therefore, move this optimization to the x64 back-end
and perform it in ISLE instead. Resulting codegen should be
unchanged, with slightly improved compile time.
This also eliminates a few endian-dependent bitcast operations.
* fuzzgen: Request only one variable for bswap
This was included by accident. Bswap only has one input, instead of two.
* cranelift: Add `bswap.i128` support
Adds support only for x86, AArch64, S390X.
RISCV does not yet have bswap.
This branch removes the trapif and trapff instructions, in favor of using an explicit comparison and trapnz. This moves us closer to removing iflags and fflags, but introduces the need to implement instructions like iadd_cout in the x64 and aarch64 backends.
- Allow bitcast for vectors with differing lane widths
- Remove raw_bitcast IR instruction
- Change all users of raw_bitcast to bitcast
- Implement support for no-op bitcast cases across backends
This implements the second step of the plan outlined here:
https://github.com/bytecodealliance/wasmtime/issues/4566#issuecomment-1234819394
Adds Bswap to the Cranelift IR. Implements the Bswap instruction
in the x64 and aarch64 codegen backends. Cranelift users can now:
```
builder.ins().bswap(value)
```
to get a native byteswap instruction.
* x64: implements the 32- and 64-bit bswap instruction, following
the pattern set by similar unary instrutions (Neg and Not) - it
only operates on a dst register, but is parameterized with both
a src and dst which are expected to be the same register.
As x64 bswap instruction is only for 32- or 64-bit registers,
the 16-bit swap is implemented as a rotate left by 8.
Updated x64 RexFlags type to support emitting for single-operand
instructions like bswap
* aarch64: Bswap gets emitted as aarch64 rev16, rev32,
or rev64 instruction as appropriate.
* s390x: Bswap was already supported in backend, just had to add
a bit of plumbing
* For completeness, added bswap to the interpreter as well.
* added filetests and runtests for each ISA
* added bswap to fuzzgen, thanks to afonso360 for the code there
* 128-bit swaps are not yet implemented, that can be done later
Add a new instruction uadd_overflow_trap, which is a fused version of iadd_ifcout and trapif. Adding this instruction removes a dependency on the iflags type, and would allow us to move closer to removing it entirely.
The instruction is defined for the i32 and i64 types only, and is currently only used in the legalization of heap_addr.
As discussed in the 2022/10/19 meeting, this PR removes many of the branch and select instructions that used iflags, in favor if using brz/brnz and select in their place. Additionally, it reworks selectif_spectre_guard to take an i8 input instead of an iflags input.
For reference, the removed instructions are: br_icmp, brif, brff, trueif, trueff, and selectif.
Remove the boolean types from cranelift, and the associated instructions breduce, bextend, bconst, and bint. Standardize on using 1/0 for the return value from instructions that produce scalar boolean results, and -1/0 for boolean vector elements.
Fixes#3205
Co-authored-by: Afonso Bordado <afonso360@users.noreply.github.com>
Co-authored-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Co-authored-by: Chris Fallin <chris@cfallin.org>
Using fallible extractors that produce no values for flag checks means
that it's not possible to pattern match cases where those flags are
false. This change reworks the existing flag-checking extractors to be
infallible, returning the flag's boolean value from the context instead.
Lower nop in ISLE in the x64 backend, and remove the final Ok(()) from the lower function to assert that all cases that aren't handled in ISLE will panic.
The x64 lowring of `vany_true` both sinks mergeable loads and uses the
original register. This PR fixes the lowering to force the value into a
register first. Ideally we should solve the issue by catching this in
the ISLE type system, as described in #4745, but this resolves the issue
for now.
Fixes#4807.
Lower extractlane, scalar_to_vector and splat in ISLE.
This PR also makes some changes to the SinkableLoad api
* change the return type of sink_load to RegMem as there are more functions available for dealing with RegMem
* add reg_mem_to_reg_mem_imm and register it as an automatic conversion
Lower `shuffle` and `swizzle` in ISLE.
This PR surfaced a bug with the lowering of `shuffle` when avx512vl and avx512vbmi are enabled: we use `vpermi2b` as the implementation, but panic if the immediate shuffle mask contains any out-of-bounds values. The behavior when the avx512 extensions are not present is that out-of-bounds values are turned into `0` in the result.
I've resolved this by detecting when the shuffle immediate has out-of-bounds indices in the avx512-enabled lowering, and generating an additional mask to zero out the lanes where those indices occur. This brings the avx512 case into line with the semantics of the `shuffle` op: 94bcbe8446/cranelift/codegen/meta/src/shared/instructions.rs (L1495-L1498)
* x64: Mask shift amounts for small types
* cranelift: Disable i128 shifts in fuzzer again
They are fixed. But we had a bunch of fuzzgen issues come in, and we don't want to accidentaly mark them as fixed
* cranelift: Avoid masking shifts for 32 and 64 bit cases
* cranelift: Add const shift tests and fix them
* cranelift: Remove const `rotl` cases
Now that `put_masked_in_imm8_gpr` works properly we can simplify rotl/rotr
Lower stack_addr, udiv, sdiv, urem, srem, umulhi, and smulhi in ISLE.
For udiv, sdiv, urem, and srem I opted to move the original lowering into an extern constructor, as the interactions with rax and rdx for the div instruction didn't seem meaningful to implement in ISLE. However, I'm happy to revisit this choice and move more of the embedding into ISLE.
Fixes#4736
Fix lowerings that were using values as both a Reg and a RegMem, making it look like a load could be sunk while its value in a register was still being used. Also add an assert that checks that loads that are sunk are never used.
The sse_cmp_op rule had cases that would produce SseOperand values that aren't legal to use with MInst.XmmRmR, and was only used in vector_all_ones when constructing an XmmRmR value. Additionally, vector_all_ones always called sse_cmp_op with the same type, so the other cases were redundant.
The solution in this PR is to remove sse_cmp_op entirely and inline a call to x64_pcmpeqd directly in vector_all_ones, and remove the unused argument from vector_all_ones.
All of the `*_imm` instructions are rewritten during legalization to an
explicit `iconst` plus the general form of the operator, so backends
never see them. Therefore these ISLE rules in the x64 backend can never
match anything.
* Add a test for iadd_pairwise with swiden input
* Implement iadd_pairwise for swiden_{low,high} input
* Add a test case for iadd_pairwise with uwiden input
* Implement iadd_pairwise with uwiden
* Add a test for the existing behavior of fcvt_from_unit
* Migrate the I8, I16, I32 cases of fcvt_from_uint
* Implement the I64 case of fcvt_from_uint
* Add a test for the existing behavior of fcvt_from_uint.f64x2
* Migrate fcvt_from_uint.f64x2 to ISLE
* Lower the last case of `fcvt_from_uint`
* Add a test for `fcvt_from_uint`
* Finish lowering fcmp_from_uint
* Format