Compute the bound values for expand_fcvt_to_sint using bitwise integer
arithmetic rather than floating-point arithmetic, to avoid relying on the
host's floating-point behavior.
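As an illustration of the idea, a power-of-two bound can be assembled
directly from its IEEE-754 bit pattern; this is a hedged sketch, not the
exact constants or code used by the legalizer:

    fn main() {
        // Build 2^31 as an f32 from its bit pattern: sign 0, biased exponent
        // 127 + 31, zero mantissa. No host floating-point arithmetic is used
        // to derive the value.
        let bits: u32 = (127 + 31) << 23;
        let bound = f32::from_bits(bits);
        assert_eq!(bound, 2147483648.0f32);
    }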
This allows the assertions to be disabled in release builds, making the
code faster and smaller at the expense of not performing the checks.
Assertions can be re-enabled in release builds by setting
debug-assertions = true in the [profile.release] section of Cargo.toml,
as the top-level Cargo.toml file already does.
When relaxing a branch, restrict the set of candidate encodings to those which
have the same input constraints as the original encoding choice. This prevents
situations where relaxation prefers a non-REX-prefixed encoding over a
REX-prefixed one because the end of the instruction can be one byte closer to
the destination, even though the encoding needs the REX prefix because of one
of the operand registers.
This also makes the Context type perform encoding verification after
relaxation, to catch similar problems in the future.
Fixes #256.
Emergency stack slots are a new kind of stack slot added relatively
recently. They need to be allocated a stack offset just like explicit
and spill slots.
Also, make StackSlotData's offset field an Option, to catch problems
like this in the future. Previously the value 0 was used when an offset
hadn't been assigned yet, which made it non-obvious whether the field
meant "not assigned yet" or "assigned the value 0".
Spiderwasm on 32-bit x86 always uses a 16-byte-aligned stack pointer.
Change the setting for the "native" convention as well, for
compatibility with Linux and Darwin ABIs, and so that if a platform
has different ABI rules, the problem will be detected in code emitted by
Cretonne, rather than somewhere else.
The term "local variables" predated the SSA builder in the front-end
crate, which also provides a way to implement source-language local
variables. The name "explicit stack slot" makes it clear what this
construct is.
Adds support for transforming integer division and remainder by constants
into sequences that do not involve division instructions.
* div/rem by constant powers of two are turned into right shifts, plus some
fixups for the signed cases.
* div/rem by constant non-powers of two are turned into double-length
multiplies by a magic constant, plus some fixups involving shifts,
addition and subtraction, that depend on the constant, the word size and
the signedness involved. (A sketch of both rewrites appears after this
list.)
* The following cases are transformed: div and rem, signed or unsigned, 32
or 64 bit. The only un-transformed cases are: unsigned div and rem by
zero, signed div and rem by zero or -1.
* This is all incorporated within a new transformation pass, "preopt", in
lib/cretonne/src/preopt.rs.
* In preopt.rs, fn do_preopt() is the main driver. It is designed to be
extensible to transformations of other kinds of instructions. Currently
it merely uses a helper to identify div/rem transformation candidates and
another helper to perform the transformation.
* In preopt.rs, fn get_div_info() pattern-matches to find candidates: both
cases where the second argument is an immediate, and cases where it is an
identifier bound to an immediate at its definition point.
* In preopt.rs, fn do_divrem_transformation() does the heavy lifting of the
transformation proper. It in turn uses magic{S,U}{32,64} to calculate the
magic numbers required for the transformations.
* There are many test cases for the transformation proper:
filetests/preopt/div_by_const_non_power_of_2.cton
filetests/preopt/div_by_const_power_of_2.cton
filetests/preopt/rem_by_const_non_power_of_2.cton
filetests/preopt/rem_by_const_power_of_2.cton
filetests/preopt/div_by_const_indirect.cton
preopt.rs also contains a set of tests for magic number generation.
* The main (non-power-of-2) transformation requires instructions that return
the high word of a double-length multiply. For this, instructions umulhi
and smulhi have been added to the core instruction set. These will map
directly to single instructions on most non-intel targets.
* Intel does not have an instruction exactly like that. For the intel
target, instructions x86_umulx and x86_smulx have been added. These map to
real instructions and return both result words. The intel legalizer will
rewrite {s,u}mulhi into x86_{s,u}mulx uses that throw away the lower half
of the result. Tests:
filetests/isa/intel/legalize-mulhi.cton (new file)
filetests/isa/intel/binary64.cton (added x86_{s,u}mulx encoding tests)
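Both rewrites can be illustrated in plain Rust. This is a hedged sketch of
the idea only; the helper names and the magic constant for the divide-by-3
case are illustrative and are not the code in preopt.rs:

    // The high word of a 32x32 -> 64 bit unsigned multiply, i.e. what the new
    // umulhi instruction computes.
    fn umulhi32(a: u32, b: u32) -> u32 {
        ((a as u64 * b as u64) >> 32) as u32
    }

    // Signed division by 2^k (1 <= k <= 31): an arithmetic shift plus a fixup
    // that biases negative dividends so the result rounds toward zero, as a
    // division instruction would.
    fn sdiv_by_pow2(x: i32, k: u32) -> i32 {
        let sign = x >> 31;                             // all ones if x < 0
        let bias = ((sign as u32) >> (32 - k)) as i32;  // 2^k - 1 if x < 0
        x.wrapping_add(bias) >> k
    }

    // Unsigned division by the non-power-of-two constant 3, using the magic
    // multiplier 0xAAAAAAAB (3 * 0xAAAAAAAB == 2^33 + 1): take the high word
    // of the product and shift. Other divisors need different constants and
    // extra add/sub fixups.
    fn udiv_by_3(x: u32) -> u32 {
        umulhi32(x, 0xAAAA_AAAB) >> 1
    }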
With the change to the parser to preserve indices, it now inserts
placeholders to pad out index spaces as needed. Placeholder functions
use reserved signature indices, so skip them when writing output, to
avoid emitting them as "sig4294967295".
Cretonne clients don't need to know how the register allocator works.
Export the RegDiversions type from the binemit module instead. It is
used by the "test binemit" driver.
StackSlotKind::OutgoingArg stack slots have an offset that is relative
to our own stack pointer, while all other stack slot kinds have offsets
that are relative to the caller's stack pointer.
Make sure we generate the right sp-relative offsets for outgoing
arguments too.
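A hedged sketch of the distinction, with assumed type and field names and a
simplified sign convention for a downward-growing stack:

    // Hypothetical, simplified stack slot kinds for illustration only.
    enum StackSlotKind {
        ExplicitSlot,
        SpillSlot,
        IncomingArg,
        OutgoingArg,
    }

    // Express a slot's recorded offset relative to our own stack pointer.
    // OutgoingArg offsets already use our sp as the base; the other kinds are
    // recorded relative to the caller's sp, i.e. the sp value before this
    // function allocated its frame of `frame_size` bytes.
    fn sp_relative_offset(kind: StackSlotKind, offset: i32, frame_size: i32) -> i32 {
        match kind {
            StackSlotKind::OutgoingArg => offset,
            _ => offset + frame_size,
        }
    }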
This makes it easier to debug testcases:
- the entity numbers in a .cton file match the entity numbers used
within Cretonne.
- serializing and deserializing doesn't cause indices to change.
One disadvantage is that if a .cton file uses sparse entity numbers,
deserializing to the in-memory form doesn't compact it. However, the
text format is not intended to be performance-critical, so this isn't
expected to be a big burden.
When the input is a NaN, we need to generate a different trap code, so
use the new trapff instruction to generate such a trap after the first
floating-point comparison.
trapff is the floating-point equivalent of trapif: trap when a given
condition holds in the floating-point flags.
Define Intel encodings comparable to the trapif encodings.
This enables code generation that never causes a SIGFPE signal to be
raised from a division instruction. Instead, division and remainder
calculations are protected by explicit traps.
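As a plain-Rust illustration of what the legalized code guards against (the
trap messages below are illustrative, not actual Cretonne trap codes):

    // Signed 32-bit division with every faulting case turned into an explicit,
    // checked trap, so the hardware divide itself can never raise SIGFPE.
    fn guarded_sdiv(x: i32, y: i32) -> i32 {
        if y == 0 {
            panic!("trap: integer division by zero");
        }
        if x == i32::MIN && y == -1 {
            panic!("trap: integer overflow");
        }
        x / y
    }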
* gen_settings: don't try to display a Preset descriptor in Flags
Trying to display a preset doesn't make sense, and before this commit it
did not display anything meaningful - the printout just said e.g.
"haswell =\n".
The offset byte of a preset descriptor isn't a valid offset into the
flag bytes; it is actually an offset into the PRESETS table. It will
cause a panic when the offset is out of bounds for the flag bytes,
which happens in the intel ISA as of this commit.
* intel settings: test that the Display impl doesn't panic
This instruction loads a stack limit from a global variable and compares
it to the stack pointer, trapping if the stack has grown beyond the
limit.
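In plain Rust, the check performed by the instruction amounts to the
following (assuming a downward-growing stack; the trap message is
illustrative):

    // Trap if the stack pointer has moved below the limit loaded from the
    // global variable, i.e. the stack has grown past its allowed size.
    fn stack_limit_check(sp: u64, stack_limit: u64) {
        if sp < stack_limit {
            panic!("trap: stack overflow");
        }
    }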
Also add an expand_flags transform group containing legalization patterns
for ISAs with CPU flags.
Fixes #234.
Changes:
* Adds a new generic instruction, SELECTIF, that does value selection (a la
conditional move) similarly to existing SELECT, except that it is
controlled by condition code input and flags-register inputs.
* Adds a new Intel x86_64 variant, 'baseline', that supports SSE2 and
nothing else.
* Adds new Intel x86_64 instructions BSR and BSF.
* Implements generic CLZ, CTZ and POPCOUNT on x86_64 'baseline' targets
using the new BSR, BSF and SELECTIF instructions (a sketch of the CLZ case
appears after this list).
* Implements SELECTIF on x86_64 targets using conditional-moves.
* new test filetests/isa/intel/baseline_clz_ctz_popcount.cton
(for legalization)
* new test filetests/isa/intel/baseline_clz_ctz_popcount_encoding.cton
(for encoding)
* Allow lib/cretonne/meta/gen_legalizer.py to generate non-snake-caseified
Rust without rustc complaining.
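A hedged plain-Rust sketch of the CLZ legalization idea; bsr32 here is a
stand-in for the x86 BSR instruction, and the zero check stands in for the
SELECTIF on the flags BSR produces for a zero input:

    // Index of the most-significant set bit, i.e. what x86 BSR computes.
    // BSR leaves its result undefined for a zero input, which is why the
    // lowered sequence needs a select on the zero flag.
    fn bsr32(x: u32) -> u32 {
        debug_assert!(x != 0);
        let mut i = 31;
        while x & (1 << i) == 0 {
            i -= 1;
        }
        i
    }

    // clz expressed in terms of BSR plus a select for the zero case.
    fn clz32(x: u32) -> u32 {
        if x == 0 { 32 } else { 31 - bsr32(x) }
    }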
Fixes #238.
This Function method can be used after the final code layout has been
computed. It returns all the instructions in an EBB along with their
encoded size and offset from the beginning of the function.
This is useful for extracting additional metadata about trapping
instructions and other things that may be needed by a VM.
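A hypothetical usage sketch; the method name inst_offsets, the shape of the
items it yields, and the surrounding types are assumptions based on the
description above, not a confirmed API:

    // Collect (instruction, offset from function start, encoded size) triples,
    // e.g. to build a VM's trap metadata tables after final code layout.
    // Function, EncInfo and Inst are the cretonne library types.
    fn collect_metadata(func: &Function, encinfo: &EncInfo) -> Vec<(Inst, u32, u8)> {
        let mut out = Vec::new();
        for ebb in func.layout.ebbs() {
            for (offset, inst, size) in func.inst_offsets(ebb, encinfo) {
                out.push((inst, offset, size));
            }
        }
        out
    }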
The dominator tree pre-order is defined at the EBB granularity, but we
are looking for dominating nodes at the instruction level. This means
that we sometimes need to look higher up the DomForest stack for a
dominating node, using DominatorTree::dominates() instead of
DominatorTreePreorder::dominates().
Each dominance check involves the domtree.last_dominator() function
scanning up the dominator tree, starting from the new node that was
pushed. We can eliminate this duplicate work by exposing the
last_dominator() function to push_node().
As we are searching through nodes on the stack, maintain a last_dom
program point representing the previous return value from
last_dominator(). This way, we're only scanning the dominator tree once.
Use a better algorithm for resolving interferences in virtual registers.
This improves code quality by generating far fewer copies on some
complicated functions.
After the initial union-find phase, the check_vreg() function uses a
Budimlic forest to check for interference between the values in the
virtual registers, as before. All the interference-free vregs are done.
Others are passed to synthesize_vreg() which dissolves the vreg and then
attempts to rebuild one or more vregs from the contained values.
The pairwise interference checks use *virtual copies* to make sure that
any future conflicts can be resolved by inserting a copy instruction.
This technique was not present in the old coalescer, which caused some
correctness issues.
This coalescing algorithm makes much better code, but it is generally a
bit slower than before. Some of the slowdown is made up for by the
following passes, which run faster because they have less code to process.
Example 1: the Python interpreter, which contains a very large function
with a lot of variables.
Before:
 15.664    0.011  Register allocation
  1.535    1.535  RA liveness analysis
  2.872    1.911  RA coalescing CSSA
  4.436    4.436  RA spilling
  2.610    2.598  RA reloading
  4.200    4.199  RA coloring
After:
  9.795    0.013  Register allocation
  1.372    1.372  RA liveness analysis
  6.231    6.227  RA coalescing CSSA
  0.712    0.712  RA spilling
  0.598    0.598  RA reloading
  0.869    0.869  RA coloring
Coalescing is more than twice as slow, but because of the vastly better
code quality, overall register allocation time is improved by 37%.
Example 2: the clang compiler.
Before:
 57.148    0.035  Register allocation
  9.630    9.630  RA liveness analysis
  7.210    7.169  RA coalescing CSSA
  9.972    9.972  RA spilling
 11.602   11.572  RA reloading
 18.698   18.672  RA coloring
After:
 64.792    0.042  Register allocation
  8.630    8.630  RA liveness analysis
 22.937   22.928  RA coalescing CSSA
  8.684    8.684  RA spilling
  9.559    9.551  RA reloading
 14.939   14.936  RA coloring
Here coalescing is 3x slower, but overall regalloc time only regresses
by 13%.
Most examples are less extreme than these two. They just get better code
at about the same compile time.