Add a calling-convention setting to the `Flags` used as part of the
`TargetIsa`. This allows Cretonne code that generates calls to use the
correct convention, such as when emitting libcalls during legalization
or when the wasm frontend is decoding functions. This setting can be
overridden per-function.
This also adds "fast", "cold", and "fastcall" conventions, with "fast"
as the new default. Note that "fast" and "cold" are not intended to be
ABI-compatible across Cretonne versions.
This will also ensure Windows users will get an `unimplemented!` rather
than silent calling-convention mismatches, which reflects the fact that
Windows calling conventions are not yet implemented.
This also renames SpiderWASM, which isn't camel-case, to Baldrdash,
which is, and which is also a more relevant name.
When an instruction has multiple valid encodings, such as with and
without a REX prefix on x86-64, Cretonne typically picks the encoding
which gives the register allocator the most flexibility, which is
typically the longest encoding. This patch adds a pass that runs after
register allocation that picks the smallest encoding, working within the
constraints of the register allocator's choices. The result is smaller
and easier to read encodings.
In the future, we may want to merge this pass into the relaxation pass,
or possibly fold it into the final encoding step, however for now, a
discrete pass will suffice.
Choosing smaller instruction encodings on eg. x86 is an optimization,
rather than a useful discrete setting.
Use "is_compressed" only for ISAs that have an explicit compression feature
that users of the output may to be aware of, such as RISC-V's RVC or
ARM's Thumb-2.
This makes it a little more consistent; now, "cretonne" is never capitalized
in identifier, path, or URL contexts. It is capitalized in natural
language contexts when referring to the project.
This adds a "colocated" flag to function and symbolic global variables which
indicates that they are defined along with the current function, so they can
use PC-relative addressing.
This also changes the function decl syntax; the name now always precedes the
signature, and the "function" keyword is no longer included.
The regmove and regfill instructions temporarily divert a value's
location, and these temporary diversions are not reflected in
`func.locations`. For now, make an extra scan through the instructions
of the function to find any regmove or regfill instructions in order to
find all used callee-saved registers.
This fixes#296.
The main use for non-PIC code at present is JIT code, and JIT code can
live anywhere in memory and reference other symbols defined anywhere in
memory, so it needs to use the "large" code model.
func_addr and globalsym_addr instructions were already using `movabs`
to support arbitrary 64-bit addresses, so this just makes calls be
legalized to support arbitrary 64-bit addresses also.
* Only save callee-saved registers that are actually being used.
* Rename AllocatableSet to RegisterSet
* Style cleanup and small renames for readability.
* Adjust x86 prologue-epilogue test to account for callee-saved register optimization.
* Add more tests for prologue-epilogue optimizations.
To keep cross-compiling straightforward, Cretonne shouldn't have any
behavior that depends on the host. This renames the "Native" calling
convention to "SystemV", which has a defined meaning for each target,
so that it's clear that the calling convention doesn't change
depending on what host Cretonne is running on.
* Add a pre-opt optimization to change constants into immediates.
This converts 'iadd' + 'iconst' into 'iadd_imm', and so on.
* Optimize away redundant `bint` instructions.
Cretonne has a concept of "Testable" values, which can be either boolean
or integer. When the an instruction needing a "Testable" value receives
the result of a `bint`, converting boolean to integer, eliminate the
`bint`, as it's redundant.
* Postopt: Optimize using CPU flags.
This introduces a post-legalization optimization pass which converts
compare+branch sequences to use flags values on CPUs which support it.
* Define a form of x86's `urm` that doesn't clobber FLAGS.
movzbl/movsbl/etc. don't clobber FLAGS; define a form of the `urm`
recipe that represents this.
* Implement a DCE pass.
This pass deletes instructions with no side effects and no results that
are used.
* Clarify ambiguity about "32-bit" and "64-bit" in comments.
* Add x86 encodings for icmp_imm.
* Add a testcase for postopt CPU flags optimization.
This covers the basic functionality of transforming compare+branch
sequences to use CPU flags.
* Pattern-match irsub_imm in preopt.
* First draft of TrapSink implementation.
* Add trap sink calls to 'trapif' and 'trapff' recipes.
* Add SourceLoc to trap sink calls, and add trap sink calls to all loads and stores.
* Add IntegerDivisionByZero trap to div recipe.
* Only emit load/store traps if 'notrap' flag is not set on the instruction.
* Update filetest machinery to add new trap sink functionality.
* Update filetests to include traps in output.
* Add a few more trap outputs to filetests.
* Add trap output to CLI tool.
Value aliases aren't instructions, so they don't have a location in the
CFG, so it's not meaningful to query whether a value alias is defined
within a loop.
While there may be CPUs that have a domain crossing penalty here,
this also helps the generated code look more like the code produced
by other compilers.
EFLAGS is a subregister of RFLAGS. For consistency with GPRs where we
use the 64-bit names to refer to the registers, use the 64-bit name for
RFLAGS as well.
Mark loads from globals generated by cton_wasm or by legalization as
`aligned` and `notrap`, since memory for these globals should be
allocated by the runtime environment for that purpose. This reduces
the number of potentially trapping instructions, which can reduce
the amount of metadata required by embedding environments.
* add x86 encodings for shift-immediate instructions
implements encodings for ishl_imm, sshr_imm, and ushr_imm. uses 8-bit immediates.
added tests for the encodings to intel/binary64.cton. Canonical versions
come from llvm-mc.
* translate test to use shift-immediates
* shift immediate encodings: use enc_i32_i64
and note why the regular shift encodings cant use it above
* add additional encoding tests for shift immediates
this covers 32 bit mode, and 64 bit operations in 64 bit mode.
The term "local variables" predated the SSA builder in the front-end
crate, which also provides a way to implement source-language local
variables. The name "explicit stack slot" makes it clear what this
construct is.
Adds support for transforming integer division and remainder by constants
into sequences that do not involve division instructions.
* div/rem by constant powers of two are turned into right shifts, plus some
fixups for the signed cases.
* div/rem by constant non-powers of two are turned into double length
multiplies by a magic constant, plus some fixups involving shifts,
addition and subtraction, that depends on the constant, the word size and
the signedness involved.
* The following cases are transformed: div and rem, signed or unsigned, 32
or 64 bit. The only un-transformed cases are: unsigned div and rem by
zero, signed div and rem by zero or -1.
* This is all incorporated within a new transformation pass, "preopt", in
lib/cretonne/src/preopt.rs.
* In preopt.rs, fn do_preopt() is the main driver. It is designed to be
extensible to transformations of other kinds of instructions. Currently
it merely uses a helper to identify div/rem transformation candidates and
another helper to perform the transformation.
* In preopt.rs, fn get_div_info() pattern matches to find candidates, both
cases where the second arg is an immediate, and cases where the second
arg is an identifier bound to an immediate at its definition point.
* In preopt.rs, fn do_divrem_transformation() does the heavy lifting of the
transformation proper. It in turn uses magic{S,U}{32,64} to calculate the
magic numbers required for the transformations.
* There are many test cases for the transformation proper:
filetests/preopt/div_by_const_non_power_of_2.cton
filetests/preopt/div_by_const_power_of_2.cton
filetests/preopt/rem_by_const_non_power_of_2.cton
filetests/preopt/rem_by_const_power_of_2.cton
filetests/preopt/div_by_const_indirect.cton
preopt.rs also contains a set of tests for magic number generation.
* The main (non-power-of-2) transformation requires instructions that return
the high word of a double-length multiply. For this, instructions umulhi
and smulhi have been added to the core instruction set. These will map
directly to single instructions on most non-intel targets.
* intel does not have an instruction exactly like that. For intel,
instructions x86_umulx and x86_smulx have been added. These map to real
instructions and return both result words. The intel legaliser will
rewrite {s,u}mulhi into x86_{s,u}mulx uses that throw away the lower half
word. Tests:
filetests/isa/intel/legalize-mulhi.cton (new file)
filetests/isa/intel/binary64.cton (added x86_{s,u}mulx encoding tests)
With the change to the parser to preserve indices, it now inserts
placeholders to pad out index spaces as needed. Placeholder functions
use reserved signature indices, so skip them when writing them out,
to avoid writing them out as "sig4294967295".
StackSlotKind::OutgoingArg stack slots have an offset that is relative
to our own stack pointer, while all other stack slot kinds have offsets
that are relative to the caller's stack pointer.
Make sure we generate the right sp-relative offsets for outgoing
arguments too.
This makes it easier to debug testcases:
- the entity numbers in a .cton file match the entity numbers used
within Cretonne.
- serializing and deserializing doesn't cause indices to change.
One disadvantage is that if a .cton file uses sparse entity numbers,
deserializing to the in-memory form doesn't compact it. However, the
text format is not intended to be performance-critical, so this isn't
expected to be a big burden.
This is the floating point equivalent of trapif: Trap when a given
condition is in the floating-point flags.
Define Intel encodings comparable to the trapif encodings.
This enables code generation that never causes a SIGFPE signal to be
raised from a division instruction. Instead, division and remainder
calculations are protected by explicit traps.
This instruction loads a stack limit from a global variable and compares
it to the stack pointer, trapping if the stack has grown beyond the
limit.
Also add a expand_flags transform group containing legalization patterns
for ISAs with CPU flags.
Fixes#234.
The instruction set has variants with 8-bit and 32-bit signed immediate
operands.
Add a TODO to use a TEST instruction for the special case ifcmp_imm x, 0.
Changes:
* Adds a new generic instruction, SELECTIF, that does value selection (a la
conditional move) similarly to existing SELECT, except that it is
controlled by condition code input and flags-register inputs.
* Adds a new Intel x86_64 variant, 'baseline', that supports SSE2 and
nothing else.
* Adds new Intel x86_64 instructions BSR and BSF.
* Implements generic CLZ, CTZ and POPCOUNT on x86_64 'baseline' targets
using the new BSR, BSF and SELECTIF instructions.
* Implements SELECTIF on x86_64 targets using conditional-moves.
* new test filetests/isa/intel/baseline_clz_ctz_popcount.cton
(for legalization)
* new test filetests/isa/intel/baseline_clz_ctz_popcount_encoding.cton
(for encoding)
* Allow lib/cretonne/meta/gen_legalizer.py to generate non-snake-caseified
Rust without rustc complaining.
Fixes#238.
The fuzzer bugs #219 and #227 are both cases where the register
allocator coloring pass "runs out of registers". What's really happening
is that the constraint solver failed to find a solution, even when one
existed.
Suppose we have three solver variables:
v0(GPR, out, global)
v1(GPR, in)
v2(GPR, in, out)
And suppose registers %r0 and %r1 are available on both input and output
sides of the instruction, but only %r1 is available for global outputs.
A valid solution would be:
v0 -> %r1
v1 -> %r1
v2 -> %r0
However, the solver would pick registers for the three values in
numerical order because v1 and v2 have the same domain size (=2). This
would assign v1 -> %r0 and then fail to find a free register for v2.
Fix this by prioritizing in+out variables over single-sided variables
even when their domains are equal. This means the v2 gets assigned a
register before v1, and it gets a chance to pick a register that is
still available on both in and out sides.
Also try to avoid depending on value numbers in the solver. These bugs
were hard to reproduce because a test case invariably would have
different value numbers, causing the solver to order its variables
differently and succeed. Throw in the previous solution and original
register assignments as tie breakers which are stable and not dependent
on value numbers.
This is still not a substitute for a proper solver search algorithm that
we will probably have to write eventually.
Fixes#219Fixes#227
The error exposed by this test case no longer happens after the
coalescer was rewritten to to follow the Budimlic paper. It's still a
good coalescer test.
Fixes#216 by including the test case.
The Intel instruction "v1 = ushr v2, v2" will implicitly fix the output
register for v2 to %rcx because the output is tied to the first input
operand and the second input operand is fixed to %rcx.
Make sure we handle this transitive constraint when checking for
interference with the globally live registers.
Fixes#218