Commit Graph

1601 Commits

Author SHA1 Message Date
Chris Fallin
54acd8b3e2 x64 backend: fix to_amode with constant address (no registers). (#4239)
If an address expression is given to `to_amode` that is completely
constant (no registers at all), then it will produce an `Amode` that has
the resulting constant as an offset, and `(invalid_reg)` as the base.
This is a side-effect of the way we build up the amode step-by-step --
we're waiting to see a register and plug it into the base field. If we
never get a reg though, we need to generate a constant zero into a
register and use that as the base. This PR adds a `finalize_amode`
helper to do just that.

Fixes #4234.
2022-06-07 11:40:10 -07:00
Johnnie Birch
3f152273d3 X64: Port fpromote to ISLE (#4230) 2022-06-06 14:47:44 -07:00
Andrew Brown
6df56e6aa6 x64: port atomic_cas to ISLE (#4223) 2022-06-06 13:20:33 -07:00
Sam Parker
acfeda4d80 [AArch64] Port IaddPairwise to ISLE (#4201)
Copyright (c) 2022, Arm Limited.
2022-06-06 15:37:13 +01:00
Andrew Brown
816aae6aca x64: port some atomics to ISLE (#4212)
* x64: port `fence` to ISLE
* x64: port `atomic_load` to ISLE
* x64: port `atomic_store` to ISLE
2022-06-02 14:13:10 -07:00
Sam Parker
010e028d67 [AArch64] Port AtomicCAS to isle (#4140)
Copyright (c) 2022, Arm Limited.
2022-05-25 09:19:24 +01:00
Chris Fallin
b830c3cf93 Pull in regalloc2 v0.2.0, with no more separate scratch registers. (#4182)
RA2 recently removed the need for a dedicated scratch register for
cyclic moves (bytecodealliance/regalloc2#51). This has moderate positive
performance impact on function bodies that were register-constrained, as
it means that one more register is available. In Sightglass, I measured
+5-8% on `blake3-scalar`, at least among current benchmarks.
2022-05-23 12:51:04 -07:00
Benjamin Bouvier
6e828df632 Remove unused SourceLoc in many Mach data structures (#4180)
* Remove unused srcloc in MachReloc

* Remove unused srcloc in MachTrap

* Use `into_iter` on array in bench code to suppress a warning

* Remove unused srcloc in MachCallSite
2022-05-23 09:27:28 -07:00
Chris Fallin
32622b3e6f Cranelift: fix use of pinned reg with SysV calling convention. (#4176)
Previously, the pinned register (enabled by the `enable_pinned_reg`
Cranelift setting and used via the `get_pinned_reg` and `set_pinned_reg`
CLIF ops) was only used when Cranelift was embedded in SpiderMonkey, in
order to support a pinned heap register. SpiderMonkey has its own
calling convention in Cranelift (named after the integration layer,
"Baldrdash").

However, the feature is more general, and should be usable with the
default system calling convention too, e.g. SysV or Windows Fastcall.

This PR fixes the ABI code to properly treat the pinned register as a
globally allocated register -- and hence an implicit input and output to
every function, not saved/restored in the prologue/epilogue -- for SysV
on x86-64 and aarch64, and Fastcall on x86-64.

Fixes #4170.
2022-05-23 09:18:51 -07:00
Chris Fallin
0824abbae4 Add a basic alias analysis with redundant-load elim and store-to-load fowarding opts. (#4163)
This PR adds a basic *alias analysis*, and optimizations that use it.
This is a "mid-end optimization": it operates on CLIF, the
machine-independent IR, before lowering occurs.

The alias analysis (or maybe more properly, a sort of memory-value
analysis) determines when it can prove a particular memory
location is equal to a given SSA value, and when it can, it replaces any
loads of that location.

This subsumes two common optimizations:

* Redundant load elimination: when the same memory address is loaded two
  times, and it can be proven that no intervening operations will write
  to that memory, then the second load is *redundant* and its result
  must be the same as the first. We can use the first load's result and
  remove the second load.

* Store-to-load forwarding: when a load can be proven to access exactly
  the memory written by a preceding store, we can replace the load's
  result with the store's data operand, and remove the load.

Both of these optimizations rely on a "last store" analysis that is a
sort of coloring mechanism, split across disjoint categories of abstract
state. The basic idea is that every memory-accessing operation is put
into one of N disjoint categories; it is disallowed for memory to ever
be accessed by an op in one category and later accessed by an op in
another category. (The frontend must ensure this.)

Then, given this, we scan the code and determine, for each
memory-accessing op, when a single prior instruction is a store to the
same category. This "colors" the instruction: it is, in a sense, a
static name for that version of memory.

This analysis provides an important invariant: if two operations access
memory with the same last-store, then *no other store can alias* in the
time between that last store and these operations. This must-not-alias
property, together with a check that the accessed address is *exactly
the same* (same SSA value and offset), and other attributes of the
access (type, extension mode) are the same, let us prove that the
results are the same.

Given last-store info, we scan the instructions and build a table from
"memory location" key (last store, address, offset, type, extension) to
known SSA value stored in that location. A store inserts a new mapping.
A load may also insert a new mapping, if we didn't already have one.
Then when a load occurs and an entry already exists for its "location",
we can reuse the value. This will be either RLE or St-to-Ld depending on
where the value came from.

Note that this *does* work across basic blocks: the last-store analysis
is a full iterative dataflow pass, and we are careful to check dominance
of a previously-defined value before aliasing to it at a potentially
redundant load. So we will do the right thing if we only have a
"partially redundant" load (loaded already but only in one predecessor
block), but we will also correctly reuse a value if there is a store or
load above a loop and a redundant load of that value within the loop, as
long as no potentially-aliasing stores happen within the loop.
2022-05-20 13:19:32 -07:00
Andrew Brown
e898cb750a x64: remove TODO for i128 load (#4159)
This work has already been finished in a previous PR.
2022-05-17 17:43:43 -07:00
Anton Kirilov
edf07a8da6 Cranelift AArch64: Migrate Bitselect and Vselect to ISLE (#4139)
Copyright (c) 2022, Arm Limited.
2022-05-16 09:39:28 -07:00
Ulrich Weigand
0243a16679 s390x: Fix bitwise operations (#4146)
Current codegen had a number of logic errors confusing
NAND with AND WITH COMPLEMENT, and NOR with OR WITH COMPLEMENT.

Add support for the missing z15 instructions and fix logic.
2022-05-12 10:05:22 -07:00
Chris Fallin
5d671952ee Cranelift: do not check in generated ISLE code; regenerate on every compile. (#4143)
This PR fixes #4066: it modifies the Cranelift `build.rs` workflow to
invoke the ISLE DSL compiler on every compilation, rather than only
when the user specifies a special "rebuild ISLE" feature.

The main benefit of this change is that it vastly simplifies the mental
model required of developers, and removes a bunch of failure modes
we have tried to work around in other ways. There is now just one
"source of truth", the ISLE source itself, in the repository, and so there
is no need to understand a special "rebuild" step and how to handle
merge errors. There is no special process needed to develop the compiler
when modifying the DSL. And there is no "noise" in the git history produced
by constantly-regenerated files.

The two main downsides we discussed in #4066 are:
- Compile time could increase, by adding more to the "meta" step before the main build;
- It becomes less obvious where the source definitions are (everything becomes
  more "magic"), which makes exploration and debugging harder.

This PR addresses each of these concerns:

1. To maintain reasonable compile time, it includes work to cut down the
   dependencies of the `cranelift-isle` crate to *nothing* (only the Rust stdlib),
   in the default build. It does this by putting the error-reporting bits
   (`miette` crate) under an optional feature, and the logging (`log` crate) under
   a feature-controlled macro, and manually writing an `Error` impl rather than
   using `thiserror`. This completely avoids proc macros and the `syn` build slowness.

   The user can still get nice errors out of `miette`: this is enabled by specifying
   a Cargo feature `--features isle-errors`.

2. To allow the user to optionally inspect the generated source, which nominally
   lives in a hard-to-find path inside `target/` now, this PR adds a feature `isle-in-source-tree`
   that, as implied by the name, moves the target for ISLE generated source into
   the source tree, at `cranelift/codegen/isle_generated_source/`. It seems reasonable
   to do this when an explicit feature (opt-in) is specified because this is how ISLE regeneration
   currently works as well. To prevent surprises, if the feature is *not* specified, the
   build fails if this directory exists.
2022-05-11 22:25:24 -07:00
Chris Fallin
eb435f3057 x64: use constant pool for u64 constants rather than movabs. (#4088)
* Allow emitting u64 constants into constant pool.

* Use constant pool for constants on x64 that do not fit in a simm32 and are needed as a RegMem or RegMemImm.

* Fix rip-relative addressing bug in pinsrd emission.
2022-05-10 09:21:05 -07:00
Benjamin Bouvier
71fc16bbeb Narrow allow(dead_code) declarations (#4116)
* Narrow `allow(dead_code)` declarations

Having module wide `allow(dead_code)` may hide some code that's really
dead. In this commit I just narrowed the declarations to the specific
enum variants that were not used (as it seems reasonable to keep them
and their handling in all the matches, for future use). And the compiler
found more dead code that I think we can remove safely in the short
term.

With this, the only files annotated with a module-wide
`allow(dead_code)` are isle-generated files.

* resurrect some functions as test helpers
2022-05-10 12:02:52 +02:00
Chris Fallin
2af8d1e93c Cranelift/ISLE: re-apply prio-trie fix, this time with fixed fix. (#4117)
* ISLE compiler: fix priority-trie interval bug. (#4093)

This PR fixes a bug in the ISLE compiler related to rule priorities.

An important note first: the bug did not affect the correctness of the
Cranelift backends, either in theory (because the rules should be
correct applied in any order, even contrary to the stated priorities)
or in practice (because the generated code actually does not change at
all with the DSL compiler fix, only with a separate minimized bug
example).

The issue was a simple swap of `min` for `max` (see first
commit). This is the minimal fix, I think, to get a correct
priority-trie with the minimized bug example in this commit.

However, while debugging this, I started to convince myself that the
complexity of merging multiple priority ranges using the sort of
hybrid interval tree / string-matching trie data structure was
unneeded. The original design was built with the assumption we might
have a bunch of different priority levels, and would need the
efficiency of merging where possible. But in practice we haven't used
priorities this way: the vast majority of lowering rules exist at the
default (priority 0), and just a few overrides are explicitly at prio
1, 2 or (rarely) 3.

So, it turns out to be a lot simpler to label trie edges with (prio,
symbol) rather than (prio-range, symbol), and delete the whole mess of
interval-splitting logic on insertion. It's easier (IMHO) to convince
oneself that the resulting insertion algorithm is correct.

I was worried that this might impact the size of the generated Rust
code or its runtime, but In fact, to my initial surprise (but it makes
sense given the above "rarely used" factor), the generated code with
this compiler fix is *exactly the same*. I rebuilt with `--features
rebuild-isle,all-arch` but... there were no diffs to commit! This is
to me the simplest evidence that we didn't really need that
complexity.

* Fix earlier commit from #4093: properly sort trie.

This commit fixes an in-hindsight-obvious bug in #4093: the trie's edges
must be sorted recursively, not just at the top level.

With this fix, the generated code differs only in one cosmetic way (a
let-binding moves) but otherwise is the same.

This includes @fitzgen's fix to the CI (from the revert in #4102) that
deletes manifests to actually check that the checked-in source is
consistent with the checked-in compiler. The force-rebuild step is now
in a shell script for convenience: anyone hacking on the ISLE compiler
itself can use this script to more easily rebuild everything.

* Add note to build.rs to remind to update force-rebuild-isle.sh
2022-05-09 16:36:48 -07:00
Benjamin Bouvier
8bd507db65 Partially rewrite the constant-phi-nodes pass to make it more idiomatic (#4111)
* Add [must_use] to the timer functions

And remove unused vcode_post_ra function

* Make remove-constant-phis code more idiomatic
2022-05-09 19:22:34 +02:00
Chris Fallin
019ebf47b1 x64: fix pretty-printing argument order for XmmRmR instructions. (#4094)
The pretty-printing had swapped dst and src2; this was introduced when
we moved to RA2 (sorry about that! IMHO we should do something to
automate the mapping between regalloc arg collection and pretty
printing/emission).

`src2` comes at the end because it has a variable number of register
mentions; this is in line with how many of the other inst formats work.

Actual emitted code was never incorrect, just the pretty-printing.

Updated test golden outputs look correct to me now, including the one
that we saw was incorrect in #3945.
2022-05-03 10:12:58 -07:00
Chris Fallin
f85047b084 Rework x64 addressing-mode lowering to be slightly more flexible. (#4080)
This PR refactors the x64 backend address-mode lowering to use an
incremental-build approach, where it considers each node in a tree of
`iadd`s that feed into a load/store address and, at each step, builds
the best possible `Amode`. It will combine an arbitrary number of
constant offsets (an extension beyond the current rules), and can
capture a left-shifted (scaled) index in any position of the tree
(another extension).

This doesn't have any measurable performance improvement on our Wasm
benchmarks in Sightglass, unfortunately, because the IR lowered from
wasm32 will do address computation in 32 bits and then `uextend` it to
add to the 64-bit heap base. We can't quite lift the 32-bit adds to 64
bits because this loses the wraparound semantics.

(We could label adds as "expected not to overflow", and allow *those* to
be lifted to 64 bit operations; wasm32 heap address computation should
fit this.  This is `add nuw` (no unsigned wrap) in LLVM IR terms. That's
likely my next step.)

Nevertheless, (i) this generalizes the cases we can handle, which should
be a good thing, all other things being equal (and in this case, no
compile time impact was measured); and (ii) might benefit non-Wasm
frontends.
2022-05-02 16:20:39 -07:00
Chris Fallin
61dc38c065 Implement Spectre mitigations for table accesses and br_tables. (#4092)
Currently, we have partial Spectre mitigation: we protect heap accesses
with dynamic bounds checks. Specifically, we guard against errant
accesses on the misspeculated path beyond the bounds-check conditional
branch by adding a conditional move that is also dependent on the
bounds-check condition. This data dependency on the condition is not
speculated and thus will always pick the "safe" value (in the heap case,
a NULL address) on the misspeculated path, until the pipeline flushes
and recovers onto the correct path.

This PR uses the same technique both for table accesses -- used to
implement Wasm tables -- and for jumptables, used to implement Wasm
`br_table` instructions.

In the case of Wasm tables, the cmove picks the table base address on
the misspeculated path. This is equivalent to reading the first table
entry. This prevents loads of arbitrary data addresses on the
misspeculated path.

In the case of `br_table`, the cmove picks index 0 on the misspeculated
path. This is safer than allowing a branch to an address loaded from an
index under misspeculation (i.e., it preserves control-flow integrity
even under misspeculation).

The table mitigation is controlled by a Cranelift setting, on by
default. The br_table mitigation is always on, because it is part of the
single lowering pseudoinstruction. In both cases, the impact should be
minimal: a single extra cmove in a (relatively) rarely-used operation.

The table mitigation is architecture-independent (happens during
legalization); the br_table mitigation has been implemented for both x64
and aarch64. (I don't know enough about s390x to implement this
confidently there, but would happily review a PR to do the same on that
platform.)
2022-05-02 11:19:16 -07:00
Chris Fallin
03793b71a7 ISLE: remove all uses of argument polarity, and remove it from the language. (#4091)
This PR removes "argument polarity": the feature of ISLE extractors that lets them take
inputs aside from the value to be matched.

Cases that need this expressivity have been subsumed by #4072 with if-let clauses;
we can now finally remove this misfeature of the language, which has caused significant
confusion and has always felt like a bit of a hack.

This PR (i) removes the feature from the ISLE compiler; (ii) removes it from the reference
documentation; and (iii) refactors away all uses of the feature in our three existing
backends written in ISLE.
2022-05-02 09:52:12 -07:00
Chris Fallin
eceb433b28 Remove =x uses from ISLE, and remove support from the DSL compiler. (#4078)
This is a follow-up on #4074: now that we have the simplified syntax, we
can remove the old, redundant syntax.
2022-04-28 11:17:08 -07:00
Nick Fitzgerald
7cbfb39047 Remove old peepmatic source file (#4085)
Peepmatic has long since been removed, we have ISLE now.
2022-04-28 11:04:14 -07:00
Sam Parker
12b4374cd5 [AArch64] Port atomic rmw to ISLE (#4021)
Also fix and extend the current implementation:
- AtomicRMWOp::Clr != AtomicRmwOp::And, as the input needs to be
  inverted first.
- Inputs to the cmp for the RMWLoop case are sign-extended when
  needed.
- Lower Xchg to Swp.
- Lower Sub to Add with a negated input.
- Added more runtests.

Copyright (c) 2022, Arm Limited.
2022-04-27 13:13:59 -07:00
Chris Fallin
dd45f44511 x64 backend: add lowerings with load-op-store fusion. (#4071)
x64 backend: add lowerings with load-op-store fusion.

These lowerings use the `OP [mem], reg` forms (or in AT&T syntax, `OP
%reg, (mem)`) -- i.e., x86 instructions that load from memory, perform
an ALU operation, and store the result, all in one instruction. Using
these instruction forms, we can merge three CLIF ops together: a load,
an arithmetic operation, and a store.
2022-04-26 18:58:26 -07:00
Chris Fallin
164bfeaf7e x64 backend: migrate stores, and remainder of loads (I128 case), to ISLE. (#4069) 2022-04-26 09:50:46 -07:00
Chris Fallin
f384938a10 x64 backend: fix a load-op merging bug with integer min/max. (#4068)
The recent work in #4061 introduced a notion of "unique uses" for CLIF
values that both simplified the load-op merging rules and allowed
loads to merge in some more places.

Unfortunately there's one factor that PR didn't account for: a unique
use at the CLIF level could become a multiple-use at the VCode level,
when a lowering uses a value multiple times!

Making this less error-prone in general is hard, because we don't know
the lowering in VCode until it's emitted, so we can't ahead-of-time
know that a value will be used multiple times and prevent its
merging. But we *can* know in the lowerings themselves when we're
doing this. At least we get a panic from regalloc when we get this
wrong; no bad code (uninitialized register being read) should ever
come from a backend bug like this.

This is still a bit less than ideal, but for now the fix is: in
`cmp_and_choose` in the x64 backend (which compares values, then
picks one or the other with a cmove), explicitly put values in
registers.

Fixes #4067 (thanks @Mrmaxmeier for the report!).
2022-04-25 10:32:09 -07:00
Chris Fallin
e4b7c8a737 Cranelift: fix #3953: rework single/multiple-use logic in lowering. (#4061)
* Cranelift: fix #3953: rework single/multiple-use logic in lowering.

This PR addresses the longstanding issue with loads trying to merge
into compares on x86-64, and more generally, with the lowering
framework falsely recognizing "single uses" of one op by
another (which would normally allow merging of side-effecting ops like
loads) when there is *indirect* duplication.

To fix this, we replace the direct `value_uses` count with a
transitive notion of uniqueness (not unlike Rust's `&`/`&mut` and how
a `&mut` downgrades to `&` when accessed through another `&`!). A
value is used multiple times transitively if it has multiple direct
uses, or is used by another op that is used multiple times
transitively.

The canonical example of badness is:

```
    v1 := load
    v2 := ifcmp v1, ...
    v3 := selectif v2, ...
    v4 := selectif v2, ...
```

both `v3` and `v4` effectively merge the `ifcmp` (`v2`), so even
though the use of `v1` is "unique", it is codegenned twice. This is
why we ~~can't have nice things~~ can't merge loads into
compares (#3953).

There is quite a subtle and interesting design space around this
problem and how we might solve it. See the long doc-comment on
`ValueUseState` in this PR for more justification for the particular
design here. In particular, this design deliberately simplifies a bit
relative to an "optimal" solution: some uses can *become* unique
depending on merging, but we don't design our data structures for such
updates because that would require significant extra costly
tracking (some sort of transitive refcounting). For example, in the
above, if `selectif` somehow did not merge `ifcmp`, then we would only
codegen the `ifcmp` once into its result register (and use that
register twice); then the load *is* uniquely used, and could be
merged. But that requires transitioning from "multiple use" back to
"unique use" with careful tracking as we do pattern-matching, which
I've chosen to make out-of-scope here for now. In practice, I don't
think it will matter too much (and we can always improve later).

With this PR, we can now re-enable load-op merging for compares. A
subsequent commit does this.

* Update x64 backend to allow load-op merging for `cmp`.

* Update filetests.

* Add test for cmp-mem merging on x64.

* Comment fixes.

* Rework ValueUseState analysis for better performance.

* Update s390x filetest: iadd_ifcout cannot merge loads anymore because it has multiple outputs (ValueUseState limitation)

* Address review comments.
2022-04-22 18:00:48 -07:00
Johnnie Birch
6a36a1d15d X64: Port Sqrt to ISLE (#4065) 2022-04-22 00:42:22 -07:00
Chris Fallin
0af8737ec3 Add support for running the regalloc2 checker. (#4043)
With these fixes, all this PR has to do is instantiate and run the
checker on the `regalloc2::Output`. This is off by default, and is
enabled by setting the `regalloc_checker` Cranelift option.

This restores the old functionality provided by e.g. the
`backtracking_checked` regalloc algorithm setting rather than
`backtracking` when we were still on regalloc.rs.
2022-04-18 14:06:07 -07:00
Chris Fallin
5aa9bdc7eb Cranelift: fix fuzzbug in critical-edge splitting. (#4044)
regalloc2 is a bit pickier about critical edges than regalloc.rs was,
because of how it inserts moves. In particular, if a branch has any
arguments (e.g., a conditional branch or br_table), its successors must
all have only one predecessor, so we can do edge moves at the top of
successor blocks rather than at the end of this block. Otherwise, moves
that semantically must come after the block's last uses (the branch's
args) would be placed before it.

This is almost always the case, because crit-edge splitting ensures that
if we have more than one succ, all our succs will have only one pred.
This is because branch kinds that take arguments (fixed args, not the
blockparam args) tend to have more than one successor: conditionals and
br_tables.

However, a fuzzbug recently illuminated one corner case I had missed: a
br_table can have *one* successor only, if it has a default target and
an empty table. In this case, crit-edge splitting will happily skip a
split and assume that we can insert edge moves at the end of the block
with the br_table. But this will fail.

regalloc2 explicitly checks this and bails with a panic, rather than
continue, so no miscompilation is possible; but without this fix, we
will get these panics on br_tables with empty tables.
2022-04-18 10:59:26 -07:00
Chris Fallin
5774e068b7 Cranelift: fix regalloc2 integration bug wrt blockparam branch args. (#4042)
Previously, the block successor accumulation and the blockparam branch
arg setup were decoupled. The lowering backend implicitly specified
the order of successor edges via its `MachTerminator` enum on the last
instruction in the block, while the `Lower` toplevel
machine-independent driver set up blockparam branch args in the edge
order seen in CLIF.

In some cases, these orders did not match -- for example, when the
conditional branch depended on an FP condition that was implemented by
swapping taken/not-taken edges and inverting the condition code.

This PR refactors the successor handling to be centralized in `Lower`
rather than flow through the terminator `MachInst`, and adds a
successor block and its blockparam args at the same time, ensuring the
orders match.
2022-04-18 09:53:57 -07:00
Chris Fallin
7cf5f05830 Cranelift: remove slow invariant validation in cfg(fuzzing) from MachBuffer. (#4038)
Following the merge of regalloc2 support, this became slower because we
are stricter about the critical-edge invariant, generating a separate
edge block for every out-edge even if two or more out-edges go to the
same successor (this is significant in cases of `br_table` with many
entries having the same target block, for example).

Many of those edge blocks are empty and end up collapsed by the
MachBuffer, which leads to a large set of aliased labels.

The invariant validation will dutifully iterate over all the data
structures at every step, validating all of our conditions. But this
gets way slower in the new context, to the point that we'll probably
have some fuzz timeouts.

This was pointed out in [1] but I missed removing this in #3989. Given
that `MachBuffer` has been around for nearly two years now, has been
fuzzed continuously with the invariant validation for that time, and
also has a correctness proof in the comments, it's probably reasonable
to remove this high (recently increased) cost from the fuzzing-specific
compilation configuration.

[1]
https://github.com/bytecodealliance/wasmtime/pull/3989#discussion_r847712263
2022-04-15 09:04:02 -05:00
Sam Parker
cf533a8041 [AArch64] Merge Fcmp32 and Fcmp64 (#4032)
Copyright (c) 2022, Arm Limited.
2022-04-14 15:39:43 -07:00
Sam Parker
682ef7b470 [AArch64] Refactor Mov instructions (#4033)
Merge Mov32 and Mov64 into a single instruction parameterized by a new
OperandSize field. Also combine the Mov[K,N,Z] into a single instruction
with a new opcode to select between the operations.

Copyright (c) 2022, Arm Limited.
2022-04-14 14:51:12 -07:00
Sam Parker
dd442a4d2f [AArch64] Merge 32- and 64-bit FPUOp1 (#4031)
Copyright (c) 2022, Arm Limited.
2022-04-14 14:00:48 -07:00
Sam Parker
7c0ea28fc8 [AArch64] Merge 32- and 64-bit FPUOp2 (#4029)
And remove the unused saturating add/sub opcodes.

Copyright (c) 2022, Arm Limited.
2022-04-14 13:07:00 -07:00
Sam Parker
e142f587a7 [AArch64] Refactor ALUOp3 (#3950)
As well as adding generic pattern for msub along with runtests
for madd and msub.

Copyright (c) 2022, Arm Limited.
2022-04-14 12:16:56 -07:00
Chris Fallin
a0318f36f0 Switch Cranelift over to regalloc2. (#3989)
This PR switches Cranelift over to the new register allocator, regalloc2.

See [this document](https://gist.github.com/cfallin/08553421a91f150254fe878f67301801)
for a summary of the design changes. This switchover has implications for
core VCode/MachInst types and the lowering pass.

Overall, this change brings improvements to both compile time and speed of
generated code (runtime), as reported in #3942:

```
Benchmark       Compilation (wallclock)     Execution (wallclock)
blake3-scalar   25% faster                  28% faster
blake3-simd     no diff                     no diff
meshoptimizer   19% faster                  17% faster
pulldown-cmark  17% faster                  no diff
bz2             15% faster                  no diff
SpiderMonkey,   21% faster                  2% faster
  fib(30)
clang.wasm      42% faster                  N/A
```
2022-04-14 10:28:21 -07:00
Nikita Baksalyar
f9cf4fe640 Fix documentation for codegen::Context::compile (#4019)
The function docs incorrectly referred to an argument that's no longer there.
2022-04-12 13:01:00 -07:00
Andrew Brown
7a55779c6b x64: fix miscompilation of select.i128 (#4017)
Issue #3963 identified a miscompilation with select in which the second
in the pair of `CMOV`s (one pair per `i128` register) used the wrong
flag. This change fixes the error in the x64 ISLE helper function
emitting these `CMOV` instructions.
2022-04-12 09:56:57 -07:00
Andrew Brown
f62199da8c x64: port load to ISLE (#3993)
This change moves the majority of the lowerings for CLIF's `load`
instruction over to ISLE. To do so, it also migrates the previous
mechanism for creating an `Amode` (`lower_to_amode`) to several ISLE
rules (see `to_amode`).
2022-04-07 18:31:22 -07:00
Chris Fallin
666c2554ea Merge pull request from GHSA-gwc9-348x-qwv2
* Run the GC smoketest with epoch support enabled as well.

* Handle safepoints in cold blocks properly.

Currently, the way that we find safepoint slots for a given instruction
relies on the instruction index order in the safepoint list matching the
order of instruction emission.

Previous to the introduction of cold-block support, this was trivially
satisfied by sorting the safepoint list: we emit instructions 0, 1, 2,
3, 4, ..., and so if we have safepoints at instructions 1 and 4, we will
encounter them in that order.

However, cold blocks are supported by swizzling the emission order at
the last moment (to avoid having to renumber instructions partway
through the compilation pipeline), so we actually emit instructions out
of index order when cold blocks are present.

Reference-type support in Wasm in particular uses cold blocks for
slowpaths, and has live refs and safepoints in these slowpaths, so we
can reliably "skip" a safepoint (not emit any metadata for it) in the
presence of reftype usage.

This PR fixes the emission code by building a map from instruction index
to safepoint index first, then doing lookups through this map, rather
than following along in-order as it emits instructions.
2022-03-31 14:26:01 -07:00
Andrew Brown
bd6fe11ca9 cranelift: remove load_complex and store_complex (#3976)
This change removes all variants of `load*_complex` and `store*_complex`
from Cranelift; this is a breaking change to the instructions exposed by
CLIF. The complete list of instructions removed is: `load_complex`,
`store_complex`, `uload8_complex`, `sload8_complex`, `istore8_complex`,
`sload8_complex`, `uload16_complex`, `sload16_complex`,
`istore16_complex`, `uload32_complex`, `sload32_complex`,
`istore32_complex`, `uload8x8_complex`, `sload8x8_complex`,
`sload16x4_complex`, `uload16x4_complex`, `uload32x2_complex`,
`sload32x2_complex`.

The rationale for this removal is that the Cranelift backend now has the
ability to pattern-match multiple upstream additions in order to
calculate the address to access. Previously, this was not possible so
the `*_complex` instructions were needed. Over time, these instructions
have fallen out of use in this repository, making the additional
overhead of maintaining them a chore.
2022-03-31 10:05:10 -07:00
Andrew Brown
e8dd13cf87 x64: port the remainder of select to ISLE (#3973)
Previous changes had ported the difficult "`select` based on an `fcmp`"
patterns to ISLE; this completes porting of `select` by moving over the
final two kinds of patterns:
 - `select` based on an `icmp`
 - `select` based on a value
2022-03-30 13:32:26 -07:00
Andrew Brown
5d8dd648d7 x64: port fcmp to ISLE (#3967)
* x64: port scalar `fcmp` to ISLE

Implement the CLIF lowering for the `fcmp` to ISLE. This adds a new
type-matcher, `ty_scalar_float`, for detecting uses of `F32` and `F64`.

* isle: rename `vec128` to `ty_vec12`

This refactoring changes the name of the `vec128` matcher function to
follow the `ty_*` convention of the other type matchers. It also makes
the helper an inline function call.

* x64: port vector `fcmp` to ISLE
2022-03-29 15:41:49 -07:00
Andrew Brown
4d5bd5f90e x64: fix register allocation panic due to load-coalesced value (#3954)
Fuzz testing identified a lowering case for CLIF's `icmp` in which the
double use of a loaded operand resulted in a register allocation error.
This change manually adds `put_in_xmm` to avoid load-coalescing these
values and includes a CLIF filetest to trigger this issue. Closes #3951.

I opened #3953 to discuss a way in which this kind of mistake (i.e.,
forgetting to add `put_in_*` in certain situations) could be avoided.
2022-03-21 18:46:27 -07:00
Andrew Brown
3bfbb3226e x64: prefix all machine instructions with x64_ (#3947)
This change is refactoring only--it should have no logic changes. As
discussed previously, prefixing all machine code instructions with
`x64_` will make it easier to identify what parts of the ISLE code
correspond to single instructions and what parts rely on helpers that
may emit more than one instruction.
2022-03-18 17:53:15 -07:00
Andrew Brown
5fa104205d x64: improve generation of i128 icmp (#3946)
Previously, we used the flags of `AND` for `SETcc`. This change uses
`TEST` instead, which discards the AND result but sets the flags needed
for `SETcc`. This reduces register pressure slightly for this sequence.
2022-03-18 16:36:31 -07:00