The dominator tree pre-order is defined at the EBB granularity, but we
are looking for dominating nodes at the instruction level. This means
that we sometimes need to look higher up the DomForest stack for a
dominating node, using DominatorTree::dominates() instead of
DominatorTreePreorder::dominates().
Each dominance check involves the domtree.last_dominator() function
scanning up the dominator tree, starting from the new node that was
pushed. We can eliminate this duplicate work by exposing the
last_dominator() function to push_node().
As we are searching through nodes on the stack, maintain a last_dom
program point representing the previous return value from
last_dominator(). This way, we're only scanning the dominator tree once.
Use a better algorithm for resolving interferences in virtual registers.
This improves code quality by generating much fewer copies on some
complicated functions.
After the initial union-find phase, the check_vreg() function uses a
Budimlic forest to check for interference between the values in the
virtual registers, as before. All the interference-free vregs are done.
Others are passed to synthesize_vreg() which dissolves the vreg and then
attempts to rebuild one or more vregs from the contained values.
The pairwise interference checks use *virtual copies* to make sure that
any future conflicts can be resolved by inserting a copy instruction.
This technique was not present in the old coalescer which caused some
correctness issues.
This coalescing algorithm makes much better code, and it is generally a
bit slower than before. Some of the slowdown is made up by the following
passes being faster because they have to process less code.
Example 1, the Python interpreter which contains a very large function
with a lot of variables.
Before:
15.664 0.011 Register allocation
1.535 1.535 RA liveness analysis
2.872 1.911 RA coalescing CSSA
4.436 4.436 RA spilling
2.610 2.598 RA reloading
4.200 4.199 RA coloring
After:
9.795 0.013 Register allocation
1.372 1.372 RA liveness analysis
6.231 6.227 RA coalescing CSSA
0.712 0.712 RA spilling
0.598 0.598 RA reloading
0.869 0.869 RA coloring
Coalescing is more than twice as slow, but because of the vastly better
code quality, overall register allocation time is improved by 37%.
Example 2, the clang compiler.
Before:
57.148 0.035 Register allocation
9.630 9.630 RA liveness analysis
7.210 7.169 RA coalescing CSSA
9.972 9.972 RA spilling
11.602 11.572 RA reloading
18.698 18.672 RA coloring
After:
64.792 0.042 Register allocation
8.630 8.630 RA liveness analysis
22.937 22.928 RA coalescing CSSA
8.684 8.684 RA spilling
9.559 9.551 RA reloading
14.939 14.936 RA coloring
Here coalescing is 3x slower, but overall regalloc time only regresses
by 13%.
Most examples are less extreme than these two. They just get better code
at about the same compile time.
Use a dominator tree pre-order to speed up the dominance checks and only
check for interference with the nearest dominating predecessor in a
virtual register.
This makes the CSSA verification about 2x as fast.
The ir::layout module is assigning sequence numbers to all EBBs and
instructions so relative positions can be computed in constant time.
This works a lot like BASIC line numbers where we initially use numbers
10, 20, 30, ... so we can insert new instructions in the middle of the
sequence without renumbering everything.
In some cases where the coalescer is misbehaving and inserting a lot of
copy instructions, we end up having to renumber a larger and larger
number of instructions to make space in the sequence. This causes the
following reload pass to be very slow, spending most of its time
renumbering instructions.
Fix this by putting an upper limit on the number of instructions we're
willing to renumber locally. When the limit is reached, switch to a full
function renumbering with the major stride of 10. This gives us new
elasticity in the sequence numbers.
- Time to compile the Python interpreter in #229 drops from 4826 s -> 15.8 s.
- The godot benchmark in #226 drops from 1257 s -> 75 s.
- The AngryBots1 benchmark does not have the coalescer misbehavior.
Its compilation time changes 22.9 s -> 23.1 s.
It's worth noting that the sequence numbering is still technically
quadratic with this fix. The system is not designed to handle a large
number of instructions inserted in a single location. It expects a more
even distribution of new instructions.
We still need to fix the coalescer. It should not insert so many copies
in degenerate cases.
The fuzzer bugs #219 and #227 are both cases where the register
allocator coloring pass "runs out of registers". What's really happening
is that the constraint solver failed to find a solution, even when one
existed.
Suppose we have three solver variables:
v0(GPR, out, global)
v1(GPR, in)
v2(GPR, in, out)
And suppose registers %r0 and %r1 are available on both input and output
sides of the instruction, but only %r1 is available for global outputs.
A valid solution would be:
v0 -> %r1
v1 -> %r1
v2 -> %r0
However, the solver would pick registers for the three values in
numerical order because v1 and v2 have the same domain size (=2). This
would assign v1 -> %r0 and then fail to find a free register for v2.
Fix this by prioritizing in+out variables over single-sided variables
even when their domains are equal. This means the v2 gets assigned a
register before v1, and it gets a chance to pick a register that is
still available on both in and out sides.
Also try to avoid depending on value numbers in the solver. These bugs
were hard to reproduce because a test case invariably would have
different value numbers, causing the solver to order its variables
differently and succeed. Throw in the previous solution and original
register assignments as tie breakers which are stable and not dependent
on value numbers.
This is still not a substitute for a proper solver search algorithm that
we will probably have to write eventually.
Fixes#219Fixes#227
The error exposed by this test case no longer happens after the
coalescer was rewritten to to follow the Budimlic paper. It's still a
good coalescer test.
Fixes#216 by including the test case.
The Intel instruction "v1 = ushr v2, v2" will implicitly fix the output
register for v2 to %rcx because the output is tied to the first input
operand and the second input operand is fixed to %rcx.
Make sure we handle this transitive constraint when checking for
interference with the globally live registers.
Fixes#218
When the coloring pass sees an instruction with a fixed input register
constraint that is already satisfied, make sure to tell the solver
about it anyway.
There are situations where the solver wants to convert a value to a
solver variable, and we can't allow that if the same value is also used
for a fixed register operand.
Fixes#221.
The spiller wasn't tracking register pressure correctly for dead EBB
parameters in visit_ebb_header(). Make sure we free any dead EBB
parameters.
Fixes#223
* Switch RegClass to a bitmap implementation.
* Add special RegClass to remove r13 from 'ld' recipe.
* Use MASK_LEN constant instead of magic number.
* Enforce that RegClass slicing is only valid on contiguous classes.
* Use Optional[int] for RegClass optional bitmask parameter.
* Add comment explaining use of Intel ISA's GPR_NORIP register class.
The old coalescing algorithm had some algorithmic complexity issues when
dealing with large virtual registers. Reimplement to use a proper
union-find algorithm so we only need one pass through the dominator
forests for virtual registers that are interference free.
Virtual registers that do have interference are split and new registers
built.
This pass is about twice as fast as the old one when dealing with
complex virtual registers.
The initial phase of computing virtual registers can now be implemented
with a textbook union-find algorithm using a disjoint set forest
complete with rank and path compression optimizations.
The disjoint set forest is converted to virtual register value lists in
a single linear scan implemented in finish_union_find().
This union-find algorithm will soon be used by the coalescer.
- Allow the syntax "specials=True" to indicate that a type variable can
assume all special types. Use this for the unconstrained type variable
created in ast.py.
- Fix TypeSet.copy() to avoid deepcopy() which doesn't do the right
thing for the self.specials set.
- Fix TypeSet.typeset_key() to just use the name of special types
instead of the full SpecialType objects.
Ghost instructions and values are supposed to be stored as metadata
alongside the compiled program such that the ghost values can be
computed from the real register/stack values when the program is stopped
for debugging or de-optimization.
If we allow an EBB parameter to be a ghost value, we have no way of
computing its real value using ghost instructions. We would need to know
a complete execution trace of the stopped program to figure out which
values were passed to the ghost parameter.
Instead we require EBB parameters to be real values materialized in
registers or on the stack. We use the regclass_for_abi_type() TargetIsa
callback to determine the initial register class for these parameters.
They can then be spilled later if needed.
Fixes#215.
We want to disable dominance checks in unreachable code. The
is_reachable() check for EBB parameter values was checking if the
defining EBB was reachable, not the EBB using the value.
This bug showed up in fuzzing and in #213.
It looks like mypy 0.560 doesn't like when a local variable changes its
type inside a function.
Fixes introduce a new variable instead of reusing an existing one.
Add an addend field to reloc_external, and use it to move the
responsibility for accounting for the difference between the end of an
instruction (where the PC is considered to be in PC-relative on intel)
and the beginning of the immediate field into the encoding code.
Specifically, this makes IntelGOTPCRel4 directly correspond to
R_X86_64_GOTPCREL, instead of also carrying an implicit `- 4`.
When the spiller needs to make a register available for a conditional
branch instruction, it can be necessary to spill some of the EBB
arguments on the branch instruction. This is ok because EBB argument
values belong to the same virtual register as the corresponding EBB
parameter and we spill the whole virtreg to the same slot.
Also make sure free_regs() can handle values that are killed by the
current instruction *and* spilled.
The stack implementation if the Budimlic dominator forest doesn't work
correctly with a CFG RPO. It needs the domtree pre-order.
Also handle EBB pre-order vs inst-level preorder. Manage the stack
according to EBB dominance. Look for a dominating value by searching the
stack. This is different from the Budimlic algorithm because we're
computing the dominator tree pre-order with EBB granularity only.
Fixes#207.
Instead of requiring the values in a virtual register to be sorted
according to the domtree.rpo_cmp() order, just require any topological
ordering w.r.t. dominance.
The coalescer with stop using the RPO shortly.
The coalescer makes sure that matching EBB arguments and parameters are
always in the same virtual registers, and therefore also in the same
stack slot if they are spilled.
This means that the reload pass should never rewrite an EBB argument if
the argument value is spilled. This comes up in cases where the branch
instruction needs the same value in a register:
brnz v9, ebb3(v9)
If the virtual register containing v9 is spilled, the branch instruction
must be reloaded like:
v52 = fill v9
brnz v52, ebb3(v9)
The branch register argument must be rewritten, and the EBB argument
must be referring to the original stack value.
Fixes#208.
Add a DominatorTreePreorder data structure which can be initialized for
a DominatorTree and used for queries involving a pre-order of the
dominator tree.
Print out the pre-order and send it through filecheck in "test domtree"
file tests.
Add a "cfg_postorder:" printout to the "test domtree" file tests and use
that to check the computed CFG post-order instead of doing it manually
with Rust code.