Commit Graph

156 Commits

Author SHA1 Message Date
Chris Fallin
61dc38c065 Implement Spectre mitigations for table accesses and br_tables. (#4092)
Currently, we have partial Spectre mitigation: we protect heap accesses
with dynamic bounds checks. Specifically, we guard against errant
accesses on the misspeculated path beyond the bounds-check conditional
branch by adding a conditional move that is also dependent on the
bounds-check condition. This data dependency on the condition is not
speculated and thus will always pick the "safe" value (in the heap case,
a NULL address) on the misspeculated path, until the pipeline flushes
and recovers onto the correct path.

This PR uses the same technique both for table accesses -- used to
implement Wasm tables -- and for jumptables, used to implement Wasm
`br_table` instructions.

In the case of Wasm tables, the cmove picks the table base address on
the misspeculated path. This is equivalent to reading the first table
entry. This prevents loads of arbitrary data addresses on the
misspeculated path.

In the case of `br_table`, the cmove picks index 0 on the misspeculated
path. This is safer than allowing a branch to an address loaded from an
index under misspeculation (i.e., it preserves control-flow integrity
even under misspeculation).

The table mitigation is controlled by a Cranelift setting, on by
default. The br_table mitigation is always on, because it is part of the
single lowering pseudoinstruction. In both cases, the impact should be
minimal: a single extra cmove in a (relatively) rarely-used operation.

The table mitigation is architecture-independent (happens during
legalization); the br_table mitigation has been implemented for both x64
and aarch64. (I don't know enough about s390x to implement this
confidently there, but would happily review a PR to do the same on that
platform.)
2022-05-02 11:19:16 -07:00
Chris Fallin
dd45f44511 x64 backend: add lowerings with load-op-store fusion. (#4071)
x64 backend: add lowerings with load-op-store fusion.

These lowerings use the `OP [mem], reg` forms (or in AT&T syntax, `OP
%reg, (mem)`) -- i.e., x86 instructions that load from memory, perform
an ALU operation, and store the result, all in one instruction. Using
these instruction forms, we can merge three CLIF ops together: a load,
an arithmetic operation, and a store.
2022-04-26 18:58:26 -07:00
Chris Fallin
a0318f36f0 Switch Cranelift over to regalloc2. (#3989)
This PR switches Cranelift over to the new register allocator, regalloc2.

See [this document](https://gist.github.com/cfallin/08553421a91f150254fe878f67301801)
for a summary of the design changes. This switchover has implications for
core VCode/MachInst types and the lowering pass.

Overall, this change brings improvements to both compile time and speed of
generated code (runtime), as reported in #3942:

```
Benchmark       Compilation (wallclock)     Execution (wallclock)
blake3-scalar   25% faster                  28% faster
blake3-simd     no diff                     no diff
meshoptimizer   19% faster                  17% faster
pulldown-cmark  17% faster                  no diff
bz2             15% faster                  no diff
SpiderMonkey,   21% faster                  2% faster
  fib(30)
clang.wasm      42% faster                  N/A
```
2022-04-14 10:28:21 -07:00
Chris Fallin
cd173cfe8e ISLE: port fmin, fmax, fmin_pseudo, fmax_pseudo on x64. (#3856) 2022-02-28 14:40:26 -08:00
Chris Fallin
24f145cd1e Migrate clz, ctz, popcnt, bitrev, is_null, is_invalid on x64 to ISLE. (#3848) 2022-02-28 09:45:13 -08:00
Andrew Brown
f87c61176a x64: port select to ISLE (#3682)
* x64: port `select` using an FP comparison to ISLE

This change includes quite a few interlocking parts, required mainly by
the current x64 conventions in ISLE:
 - it adds a way to emit a `cmove` with multiple OR-ing conditions;
   because x64 ISLE cannot currently safely emit a comparison followed
   by several jumps, this adds `MachInst::CmoveOr` and
   `MachInst::XmmCmoveOr` macro instructions. Unfortunately, these macro
   instructions hide the multi-instruction sequence in `lower.isle`
 - to properly keep track of what instructions consume and produce
   flags, @cfallin added a way to pass around variants of
   `ConsumesFlags` and `ProducesFlags`--these changes affect all
   backends
 - then, to lower the `fcmp + select` CLIF, this change adds several
   `cmove*_from_values` helpers that perform all of the awkward
   conversions between `Value`, `ValueReg`, `Reg`, and `Gpr/Xmm`; one
   upside is that now these lowerings have much-improved documentation
   explaining why the various `FloatCC` and `CC` choices are made the
   the way they are.

Co-authored-by: Chris Fallin <chris@cfallin.org>
2022-02-23 10:03:16 -08:00
Nick Fitzgerald
dc86e7a6dc cranelift: Use GPR newtypes extensively in x64 lowering (#3798)
We already defined the `Gpr` newtype and used it in a few places, and we already
defined the `Xmm` newtype and used it extensively. This finishes the transition
to using the newtypes extensively in lowering by making use of `Gpr` in more
places.

Fixes #3685
2022-02-14 12:54:41 -08:00
Nick Fitzgerald
795b0aaf9a cranelift: Add newtype wrappers for x64 register classes
This primary motivation of this large commit (apologies for its size!) is to
introduce `Gpr` and `Xmm` newtypes over `Reg`. This should help catch
difficult-to-diagnose register class mixup bugs in x64 lowerings.

But having a newtype for `Gpr` and `Xmm` themselves isn't enough to catch all of
our operand-with-wrong-register-class bugs, because about 50% of operands on x64
aren't just a register, but a register or memory address or even an
immediate! So we have `{Gpr,Xmm}Mem[Imm]` newtypes as well.

Unfortunately, `GprMem` et al can't be `enum`s and are therefore a little bit
noisier to work with from ISLE. They need to maintain the invariant that their
registers really are of the claimed register class, so they need to encapsulate
the inner data. If they exposed the underlying `enum` variants, then anyone
could just change register classes or construct a `GprMem` that holds an XMM
register, defeating the whole point of these newtypes. So when working with
these newtypes from ISLE, we rely on external constructors like `(gpr_to_gpr_mem
my_gpr)` instead of `(GprMem.Gpr my_gpr)`.

A bit of extra lines of code are included to add support for register mapping
for all of these newtypes as well. Ultimately this is all a bit wordier than I'd
hoped it would be when I first started authoring this commit, but I think it is
all worth it nonetheless!

In the process of adding these newtypes, I didn't want to have to update both
the ISLE `extern` type definition of `MInst` and the Rust definition, so I move
the definition fully into ISLE, similar as aarch64.

Finally, this process isn't complete. I've introduced the newtypes here, and
I've made most XMM-using instructions switch from `Reg` to `Xmm`, as well as
register class-converting instructions, but I haven't moved all of the GPR-using
instructions over to the newtypes yet. I figured this commit was big enough as
it was, and I can continue the adoption of these newtypes in follow up commits.

Part of #3685.
2022-02-03 14:08:08 -08:00
bjorn3
17c3c1813f Remove MachInstEmitInfo 2022-01-04 18:06:01 +01:00
Nick Fitzgerald
d377b665c6 Initial ISLE integration with the x64 backend
On the build side, this commit introduces two things:

1. The automatic generation of various ISLE definitions for working with
CLIF. Specifically, it generates extern type definitions for clif opcodes and
the clif instruction data `enum`, as well as extractors for matching each clif
instructions. This happens inside the `cranelift-codegen-meta` crate.

2. The compilation of ISLE DSL sources to Rust code, that can be included in the
main `cranelift-codegen` compilation.

Next, this commit introduces the integration glue code required to get
ISLE-generated Rust code hooked up in clif-to-x64 lowering. When lowering a clif
instruction, we first try to use the ISLE code path. If it succeeds, then we are
done lowering this instruction. If it fails, then we proceed along the existing
hand-written code path for lowering.

Finally, this commit ports many lowering rules over from hand-written,
open-coded Rust to ISLE.

In the process of supporting ISLE, this commit also makes the x64 `Inst` capable
of expressing SSA by supporting 3-operand forms for all of the existing
instructions that only have a 2-operand form encoding:

    dst = src1 op src2

Rather than only the typical x86-64 2-operand form:

    dst = dst op src

This allows `MachInst` to be in SSA form, since `dst` and `src1` are
disentangled.

("3-operand" and "2-operand" are a little bit of a misnomer since not all
operations are binary operations, but we do the same thing for, e.g., unary
operations by disentangling the sole operand from the result.)

There are two motivations for this change:

1. To allow ISLE lowering code to have value-equivalence semantics. We want ISLE
   lowering to translate a CLIF expression that evaluates to some value into a
   `MachInst` expression that evaluates to the same value. We want both the
   lowering itself and the resulting `MachInst` to be pure and referentially
   transparent. This is both a nice paradigm for compiler writers that are
   authoring and maintaining lowering rules and is a prerequisite to any sort of
   formal verification of our lowering rules in the future.

2. Better align `MachInst` with `regalloc2`'s API, which requires that the input
   be in SSA form.
2021-10-12 17:11:58 -07:00
Johnnie Birch
e373ddfe1b Add extend-add-pairwise instructions x64 2021-07-30 15:06:58 -07:00
Johnnie Birch
500f530322 Add support for i32x4_trunc_sat_f64x2_s for x64 2021-07-26 22:24:30 -07:00
Johnnie Birch
23290f0450 Add support for i32x4_trunc_sat_f64x2_u for x64 2021-07-26 22:24:30 -07:00
Johnnie Birch
5deda27977 Add support for Saturating Rounding Q-format Multiplication for x64 2021-07-26 20:32:46 -07:00
Nick Fitzgerald
4283d2116d cranelift: Move most debug-level logs to the trace level
Cranelift crates have historically been much more verbose with debug-level
logging than most other crates in the Rust ecosystem. We log things like how
many parameters a basic block has, the color of virtual registers during
regalloc, etc. Even for Cranelift hackers, these things are largely only useful
when hacking specifically on Cranelift and looking at a particular test case,
not even when using some Cranelift embedding (such as Wasmtime).

Most of the time, when people want logging for their Rust programs, they do
something like:

    RUST_LOG=debug cargo run

This means that they get all that mostly not useful debug logging out of
Cranelift. So they might want to disable logging for Cranelift, or change it to
a higher log level:

    RUST_LOG=debug,cranelift=info cargo run

The problem is that this is already more annoying to type that `RUST_LOG=debug`,
and that Cranelift isn't one single crate, so you actually have to play
whack-a-mole with naming all the Cranelift crates off the top of your head,
something more like this:

    RUST_LOG=debug,cranelift=info,cranelift_codegen=info,cranelift_wasm=info,...

Therefore, we're changing most of the `debug!` logs into `trace!` logs: anything
that is very Cranelift-internal, unlikely to be useful/meaningful to the
"average" Cranelift embedder, or prints a message for each instruction visited
during a pass. On the other hand, things that just report a one line statistic
for a whole pass, for example, are left as `debug!`. The more verbose the log
messages are, the higher the bar they must clear to be `debug!` rather than
`trace!`.
2021-07-26 11:50:16 -07:00
Johnnie Birch
6fbe0b72bd Add simd_extmul_* support for x64 2021-07-15 01:07:52 -07:00
Johnnie Birch
2d676d838f Implements f64x2.convert_low_i32x4_u for x64 2021-07-09 10:39:05 -07:00
Johnnie Birch
1770880e19 x64: add support for packed promote and demote (#2783)
* Add support for x64 packed promote low

* Add support for x64 packed floating point demote

* Update vector promote low and demote by adding constraints

Also does some renaming and minor refactoring
2021-06-04 15:59:20 -07:00
Andrew Brown
8dc4cc9fe3 x64: fix AVX512 flag checks
Previously, the multiple flags for certain AVX512 instructions were
checked using `OR`: e.g., if the CPU has AVX512VL `OR` AVX512DQ,
emit `VPMULLQ`. This is incorrect--the logic should be `AND`. The Intel
Software Developer Manual, vol. 1, sec. 15.4, has more information on
this (notable there is the suggestion to check with `XGETBV` that the OS
is allowing the use of the XMM registers--but that is a separate issue).
This change switches to `AND` logic in the new backend.
2021-06-01 11:41:16 -07:00
Andrew Brown
2a9f458ea3 x64: lower i8x16.shuffle to VPERMI2B when possible
When shuffling values from two different registers, the x64 lowering for
`i8x16.shuffle` must first shuffle each register separately and then OR
the results with SSE instructions. With `VPERMI2B`, available in
AVX512VL + AVX512VBMI, this can be done in a single instruction after
the shuffle mask has been moved into the destination register. This
change uses `VPERMI2B` for that case when the CPU supports it.
2021-06-01 11:40:53 -07:00
Andrew Brown
459fce3467 x64: lower i8x16.popcnt to VPOPCNTB when possible
When AVX512VL or AVX512BITALG are available, Wasm SIMD's `popcnt`
instruction can be lowered to a single x64 instruction, `VPOPCNTB`,
instead of 8+ instructions.
2021-05-25 12:16:25 -07:00
Andrew Brown
54b45d28a3 x64: lower fcvt_from_uint to VCVTUDQ2PS when possible
When AVX512VL and AVX512F are available, use a single instruction
(`VCVTUDQ2PS`) instead of a length 9-instruction sequence. This
optimization is a port from the legacy x86 backend.
2021-05-19 12:20:11 -07:00
Andrew Brown
7ef3ae2903 x64: implement vselect with variable blend instructions
This change implements `vselect` using SSE4.1's `BLENDVPS`, `BLENDVPD`,
and `PBLENDVB`. `vselect` is a lane-selecting instruction that is used
by
[simple_preopt.rs](fa1faf5d22/cranelift/codegen/src/simple_preopt.rs (L947-L999))
to lower `bitselect` to a single x86 instruction when the condition mask
is known to be boolean (all 1s or 0s, e.g., from a conversion). This is
better than `bitselect` in general, which lowers to 4-5 instructions.
The old backend had the `vselect` lowering; this simply introduces it to
the new backend.
2021-05-17 11:23:33 -07:00
Andrew Brown
e676589b0c x64: lower i64x2.imul to VPMULLQ when possible
This adds the machinery to encode the VPMULLQ instruction which is
available in AVX512VL and AVX512DQ. When these feature sets are
available, we use this instruction instead of a lengthy 12-instruction
sequence.
2021-05-13 20:14:05 -07:00
Andrew Brown
02796fc670 x64: move encodings to a separate module
In order to benchmark the encoding code with criterion, the functions
and structures must be public. Moving this code to its own module
(instead of keeping as a submodule to `inst`), allows `inst` to remain
private. This avoids having to expose and document (or ignore
documenting) the numerous instruction variants in `inst` while allowing
access to the encoding code. This commit changes no functionality.
2021-05-13 10:46:08 -07:00
Andrew Brown
0acc1451ea x64: lower iabs.i64x2 using a single AVX512 instruction when possible (#2819)
* x64: add EVEX encoding mechanism

Also, includes an empty stub module for the VEX encoding.

* x64: lower abs.i64x2 to VPABSQ when available

* x64: refactor EVEX encodings to use `EvexInstruction`

This change replaces the `encode_evex` function with a builder-style struct, `EvexInstruction`. This approach clarifies the code, adds documentation, and results in slight speedups when benchmarked.

* x64: rename encoding CodeSink to ByteSink
2021-04-15 11:53:58 -07:00
bjorn3
b272d4b7da Fix srem.{i8,i16} 2021-04-13 21:28:27 +02:00
Andrew Brown
8e495ac79d x64: match multiple ISA requirements before emitting
Because there are instructions that are present in more than one ISA feature set, we need to see if any of the ISA requirements match before emitting. This change includes the `VPABSQ` instruction as an example, which is present in both `AVX512F` and `AVX512VL`.
2021-04-08 10:30:39 -07:00
Andrew Brown
d32501c554 x64: refactor REX-specific encoding machinery to its own module
In preparation for adding new encoding modes to the x64 backend (e.g. VEX,
EVEX), this change moves all of the current instruction encoding functions to
`encodings::rex`. This refactor does not change any logic.
2021-04-02 11:17:39 -07:00
Johnnie Birch
31d3db1ec2 Implements convert low signed integer to float for x64 simd 2021-03-26 12:13:29 -07:00
Chris Fallin
b429f77ee9 Handle srem properly when avoid_div_traps is false.
The codegen for div/rem ops has two modes, depending on the
`avoid_div_traps` flag: it can either do all checks for trapping
conditions explicitly, and use explicit trap instructions, then use a
hardware divide instruction that will not trap (`avoid_div_traps ==
true`); or it can run in a mode where a hardware FP fault on the divide
instruction implies a Wasm trap (`avoid_div_traps == false`). Wasmtime
uses the former while Lucet (for example) uses the latter.

It turns out that because we run all our spec tests run under Wasmtime,
we missed a spec corner case that fails in the latter: INT_MIN % -1 == 0
per the spec, but causes a trap with the x86 signed divide/remainder
instruction. Hence, in Lucet, this specific remainder computation would
incorrectly result in a Wasm trap.

This PR fixes the issue by just forcing use of the explicit-checks
implementation for `srem` even when `avoid_div_traps` is false.
2021-03-24 22:30:07 -07:00
Chris Fallin
2d5db92a9e Rework/simplify unwind infrastructure and implement Windows unwind.
Our previous implementation of unwind infrastructure was somewhat
complex and brittle: it parsed generated instructions in order to
reverse-engineer unwind info from prologues. It also relied on some
fragile linkage to communicate instruction-layout information that VCode
was not designed to provide.

A much simpler, more reliable, and easier-to-reason-about approach is to
embed unwind directives as pseudo-instructions in the prologue as we
generate it. That way, we can say what we mean and just emit it
directly.

The usual reasoning that leads to the reverse-engineering approach is
that metadata is hard to keep in sync across optimization passes; but
here, (i) prologues are generated at the very end of the pipeline, and
(ii) if we ever do a post-prologue-gen optimization, we can treat unwind
directives as black boxes with unknown side-effects, just as we do for
some other pseudo-instructions today.

It turns out that it was easier to just build this for both x64 and
aarch64 (since they share a factored-out ABI implementation), and wire
up the platform-specific unwind-info generation for Windows and SystemV.
Now we have simpler unwind on all platforms and we can delete the old
unwind infra as soon as we remove the old backend.

There were a few consequences to supporting Fastcall unwind in
particular that led to a refactor of the common ABI. Windows only
supports naming clobbered-register save locations within 240 bytes of
the frame-pointer register, whatever one chooses that to be (RSP or
RBP). We had previously saved clobbers below the fixed frame (and below
nominal-SP). The 240-byte range has to include the old RBP too, so we're
forced to place clobbers at the top of the frame, just below saved
RBP/RIP. This is fine; we always keep a frame pointer anyway because we
use it to refer to stack args. It does mean that offsets of fixed-frame
slots (spillslots, stackslots) from RBP are no longer known before we do
regalloc, so if we ever want to index these off of RBP rather than
nominal-SP because we add support for `alloca` (dynamic frame growth),
then we'll need a "nominal-BP" mode that is resolved after regalloc and
clobber-save code is generated. I added a comment to this effect in
`abi_impl.rs`.

The above refactor touched both x64 and aarch64 because of shared code.
This had a further effect in that the old aarch64 prologue generation
subtracted from `sp` once to allocate space, then used stores to `[sp,
offset]` to save clobbers. Unfortunately the offset only has 7-bit
range, so if there are enough clobbered registers (and there can be --
aarch64 has 384 bytes of registers; at least one unit test hits this)
the stores/loads will be out-of-range. I really don't want to synthesize
large-offset sequences here; better to go back to the simpler
pre-index/post-index `stp r1, r2, [sp, #-16]` form that works just like
a "push". It's likely not much worse microarchitecturally (dependence
chain on SP, but oh well) and it actually saves an instruction if
there's no other frame to allocate. As a further advantage, it's much
simpler to understand; simpler is usually better.

This PR adds the new backend on Windows to CI as well.
2021-03-11 20:03:52 -08:00
Andrew Brown
508f8fa5a9 [x64] Add i64x2.abs
This instruction has a single instruction lowering in AVX512F/VL and a three instruction lowering in AVX but neither is currently supported in the x64 backend. To implement this, we instead subtract the vector from 0 and use a blending instruction to pick the lanes containing the absolute value.
2021-03-02 12:30:02 -08:00
Chris Fallin
cdb60ec5a9 Merge pull request #2682 from cfallin/shift-bugs
Fix some `i128` shift-related bugs in x64 backend.
2021-02-26 15:13:08 -08:00
Chris Fallin
40db4de44a Fix incomplete trap metadata due to multiple traps at one address.
If an instruction has more than one trap record associated with it (for
example: a divide instruction that has participated in load-op fusion,
so we have both a heap-out-of-bounds trap record due to its load and a
divide-by-zero trap record due to its divide op), the current MachBuffer
code would emit only one of the trap records to the sink.

Separately, divide instructions probably shouldn't merge loads, because
the two separate possible traps at one location might be confusing for
some embedders (certainly in Lucet). Divide seems to be the only case in
our current codegen where such merging might occur. This PR changes the
lowering to always force the divisor into a register.

Finally, while working out why trap records were not appearing, I had
noticed that `isa::x64::emit_std_enc_mem()` was only emitting heap-OOB
trap metadata for loads/stores when it had a srcloc. This PR ensures
that the metadata is emitted even when the srcloc is empty.

Note that none of the above presents a security or correctness problem;
trap metadata only affects the status that we return to the embedder
when a Wasm program terminates with a trap.
2021-02-24 15:13:45 -08:00
Chris Fallin
0f3e00b25e Fix some i128 shift-related bugs in x64 backend.
This fixes #2672 and #2679, and also fixes an incorrect instruction
emission (`test` with small immediate) that we had missed earlier.

The shift-related fixes have to do with (i) shifts by 0 bits, as a
special case that must be handled; and (ii) shifts by a 128-bit amount,
which we can handle by just dropping the upper half (we only use 3--7
bits of shift amount).

This adjusts the lowerings appropriately, and also adds run-tests to
ensure that the lowerings actually execute correctly (previously we only
had compile-tests with golden lowerings; I'd like to correct this for
more ops eventually, adding run-tests beyond what the Wasm spec and
frontend covers).
2021-02-23 14:22:04 -08:00
bjorn3
ff22842da5 More atomic ops 2021-02-18 14:16:15 +01:00
Kasey Carrothers
9c3edee9d0 Add methods to construct RexFlags from OperandSizes.
This unifies the logic around Rex prefix emission and hopefully makes REX prefix errors less likely.
There are still several instructions that use other sources to determine the flags, so set_w and clear_w are left as is.

Additional cleanups:
  * Change always_emit_if_8bit_needed to take a Reg instead of a u8 for type safety.
  * Deduplicated emission code in MovRM.
2021-02-17 18:48:05 -08:00
Kasey Carrothers
7bd96c8e2f Refactor x64::Insts that use an is_64 bool to use OperandSize. 2021-02-03 10:40:11 -08:00
Kasey Carrothers
3306408100 Refactor x64::Inst to use OperandSize instead of u8s.
TODO: some types take a 'is_64_bit' bool. Those are left unchanged for now.
2021-02-03 10:40:11 -08:00
Kasey Carrothers
b12d41bfe9 Expand x64 OperandSize to support 8 and 16-bit operands.
This is in preparation for refactoring all x64::Inst arms to use OperandSize.

Current uses of OperandSize fall into two categories:
  1. XMM operations which require 32/64 bit operands
  2. Immediates which only care about 64-bit or not.

Adds assertions to existing Inst constructors to check that they are passed valid sizes.
This change also removes the implicit widening of 1 and 2 byte values to 4 bytes. from_bytes() is only used by category 2, so removing this behavior will not change any visible behavior.

Overall this change should be a no-op.
2021-02-03 10:40:11 -08:00
Benjamin Bouvier
13027ad670 cranelift x64: add instruction set checks for popcnt/tzcnt/lzcnt; 2021-01-30 13:38:55 +01:00
Benjamin Bouvier
2275519cb1 cranelift x64: use the POPCNT instruction for Popcount when it's available; 2021-01-29 19:41:01 +01:00
Benjamin Bouvier
6bf6612d96 cranelift x64: use the TZCNT instruction for Ctz when it's available; 2021-01-29 19:41:01 +01:00
Benjamin Bouvier
d3acd9a283 cranelift x64: use the LZCNT instruction for Clz when it's available; 2021-01-29 19:41:01 +01:00
Johnnie Birch
cbd7a6a80e Add sse41 lowering for rounding x64 2021-01-28 17:37:17 -08:00
Chris Fallin
c84d6be6f4 Detailed debug-info (DWARF) support in new backends (initially x64).
This PR propagates "value labels" all the way from CLIF to DWARF
metadata on the emitted machine code. The key idea is as follows:

- Translate value-label metadata on the input into "value_label"
  pseudo-instructions when lowering into VCode. These
  pseudo-instructions take a register as input, denote a value label,
  and semantically are like a "move into value label" -- i.e., they
  update the current value (as seen by debugging tools) of the given
  local. These pseudo-instructions emit no machine code.

- Perform a dataflow analysis *at the machine-code level*, tracking
  value-labels that propagate into registers and into [SP+constant]
  stack storage. This is a forward dataflow fixpoint analysis where each
  storage location can contain a *set* of value labels, and each value
  label can reside in a *set* of storage locations. (Meet function is
  pairwise intersection by storage location.)

  This analysis traces value labels symbolically through loads and
  stores and reg-to-reg moves, so it will naturally handle spills and
  reloads without knowing anything special about them.

- When this analysis converges, we have, at each machine-code offset, a
  mapping from value labels to some number of storage locations; for
  each offset for each label, we choose the best location (prefer
  registers). Note that we can choose any location, as the symbolic
  dataflow analysis is sound and guarantees that the value at the
  value_label instruction propagates to all of the named locations.

- Then we can convert this mapping into a format that the DWARF
  generation code (wasmtime's debug crate) can use.

This PR also adds the new-backend variant to the gdb tests on CI.
2021-01-21 15:59:49 -08:00
bjorn3
81d248c057 Implement Mach-O TLS access for x64 newBE 2021-01-21 18:25:56 +01:00
Chris Fallin
0f563f786a Add ELF TLS support in new x64 backend.
This follows the implementation in the legacy x86 backend, including
hardcoded sequence that is compatible with what the linker expects. We
could potentially do better here, but it is likely not necessary.

Thanks to @bjorn3 for a bugfix to an earlier version of this.
2021-01-17 22:48:51 -08:00
Chris Fallin
71ead6e31d x64 backend: implement 128-bit ops and misc fixes.
This implements all of the ops on I128 that are implemented by the
legacy x86 backend, and includes all that are required by at least one
major use-case (cg_clif rustc backend).

The sequences are open-coded where necessary; for e.g. the bit
operations, this can be somewhat complex, but these sequences have been
tested carefully. This PR also includes a drive-by fix of clz/ctz for 8-
and 16-bit cases where they were incorrect previously.

Also includes ridealong fixes developed while bringing up cg_clif
support, because they are difficult to completely separate due to
other refactors that occurred in this PR:

- fix REX prefix logic for some 8-bit instructions.

  When using an 8-bit register in 64-bit mode on x86-64, the REX prefix
  semantics are somewhat subtle: without the REX prefix, register numbers
  4--7 correspond to the second-to-lowest byte of the first four registers
  (AH, CH, BH, DH), whereas with the REX prefix, these register numbers
  correspond to the usual encoding (SPL, BPL, SIL, DIL). We could always
  emit a REX byte for instructions with 8-bit cases (this is harmless even
  if unneeded), but this would unnecessarily inflate code size; instead,
  the usual approach is to emit it only for these registers.

  This logic was present in some cases but missing for some other
  instructions: divide, not, negate, shifts.

  Fixes #2508.

- avoid unaligned SSE loads on some f64 ops.

  The implementations of several FP ops, such as fabs/fneg, used SSE
  instructions. This is not a problem per-se, except that load-op merging
  did not take *alignment* into account. Specifically, if an op on an f64
  loaded from memory happened to merge that load, and the instruction into
  which it was merged was an SSE instruction, then the SSE instruction
  imposes stricter (128-bit) alignment requirements than the load.f64 did.

  This PR simply forces any instruction lowerings that could use SSE
  instructions to implement non-SIMD operations to take inputs in
  registers only, and avoid load-op merging.

  Fixes #2507.

- two bugfixes exposed by cg_clif: urem/srem.i8, select.b1.

  - urem/srem.i8: the 8-bit form of the DIV instruction on x86-64 places
    the remainder in AH, not RDX, different from all the other width-forms
    of this instruction.

  - select.b1: we were not recognizing selects of boolean values as
    integer-typed operations, so we were generating XMM moves instead (!).
2021-01-14 13:45:50 -08:00