Use a better algorithm for resolving interferences in virtual registers.
This improves code quality by generating far fewer copies in some
complicated functions.
After the initial union-find phase, the check_vreg() function uses a
Budimlic forest to check for interference between the values in the
virtual registers, as before. All the interference-free vregs are done.
Others are passed to synthesize_vreg() which dissolves the vreg and then
attempts to rebuild one or more vregs from the contained values.
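The initial union-find phase can be pictured with the sketch below. It is a minimal illustration, not the coalescer's actual code: `UnionFind`, `candidate_vregs`, and the use of plain `usize` indices for values are all hypothetical, and it assumes the pairs being joined are values that would like to share a register (for example an EBB parameter and a matching branch argument).

```rust
/// Minimal union-find over dense value indices.
/// Hypothetical helper for illustration; not the coalescer's real data structure.
pub struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    /// Every value starts out in its own singleton set.
    pub fn new(n: usize) -> Self {
        UnionFind { parent: (0..n).collect() }
    }

    /// Find the representative of `x`, shortening the path as we go (path halving).
    pub fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            let grandparent = self.parent[self.parent[x]];
            self.parent[x] = grandparent;
            x = grandparent;
        }
        x
    }

    /// Merge the sets containing `a` and `b`.
    pub fn union(&mut self, a: usize, b: usize) {
        let ra = self.find(a);
        let rb = self.find(b);
        if ra != rb {
            self.parent[rb] = ra;
        }
    }
}

/// Group values into candidate virtual registers.
/// `pairs` is an assumed input: values that should end up in the same register.
/// Only groups with more than one value are returned.
pub fn candidate_vregs(num_values: usize, pairs: &[(usize, usize)]) -> Vec<Vec<usize>> {
    let mut uf = UnionFind::new(num_values);
    for &(a, b) in pairs {
        uf.union(a, b);
    }
    let mut groups: Vec<Vec<usize>> = vec![Vec::new(); num_values];
    for v in 0..num_values {
        let root = uf.find(v);
        groups[root].push(v);
    }
    groups.retain(|g| g.len() > 1);
    groups
}
```

Each returned group corresponds to a candidate that would then be checked for interference. The sketch deliberately stops before that check.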
The pairwise interference checks use *virtual copies* to make sure that
any future conflicts can be resolved by inserting a copy instruction.
This technique was not present in the old coalescer, which caused some
correctness issues.
This coalescing algorithm produces much better code, but it is generally
a bit slower than before. Some of the slowdown is recovered because the
following passes run faster when they have less code to process.
Example 1, the Python interpreter, which contains a very large function
with a lot of variables.
Before:
15.664 0.011 Register allocation
1.535 1.535 RA liveness analysis
2.872 1.911 RA coalescing CSSA
4.436 4.436 RA spilling
2.610 2.598 RA reloading
4.200 4.199 RA coloring
After:
9.795 0.013 Register allocation
1.372 1.372 RA liveness analysis
6.231 6.227 RA coalescing CSSA
0.712 0.712 RA spilling
0.598 0.598 RA reloading
0.869 0.869 RA coloring
Coalescing is more than twice as slow, but because of the vastly better
code quality, overall register allocation time is improved by 37%.
Example 2, the clang compiler.
Before:
57.148 0.035 Register allocation
9.630 9.630 RA liveness analysis
7.210 7.169 RA coalescing CSSA
9.972 9.972 RA spilling
11.602 11.572 RA reloading
18.698 18.672 RA coloring
After:
64.792 0.042 Register allocation
8.630 8.630 RA liveness analysis
22.937 22.928 RA coalescing CSSA
8.684 8.684 RA spilling
9.559 9.551 RA reloading
14.939 14.936 RA coloring
Here coalescing is 3x slower, but overall regalloc time only regresses
by 13%.
Most examples are less extreme than these two. They just get better code
at about the same compile time.
Add EBB parameter and EBB argument to the langref glossary to clarify
the distinction between formal EBB parameter values and arguments passed
to branches.
- Replace "ebb_arg" with "ebb_param" in function names that deal with
EBB parameters.
- Rename the ValueDef variants to Result and Param.
- A bunch of other small langref fixes.
No functional changes intended.
The EntityRef trait is used by more than just the EntityMap now, so it
should live in its own module.
Also move the entity_impl! macro into the new module so it can be used
for defining new entity references anywhere.
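As a rough illustration of how an entity reference trait and its helper macro fit together, consider the simplified sketch below. The trait shape (a `new`/`index` pair) and the single-argument macro form are assumptions made for the example, and `MyRef` is a hypothetical entity type; the real definitions may carry more functionality.

```rust
/// Simplified stand-in for an entity reference trait: a small, copyable
/// handle that can be converted to and from a dense index.
pub trait EntityRef: Copy + Eq {
    fn new(index: usize) -> Self;
    fn index(self) -> usize;
}

/// Simplified stand-in for the helper macro: implements `EntityRef` for a
/// tuple struct wrapping a `u32`.
macro_rules! entity_impl {
    ($entity:ident) => {
        impl EntityRef for $entity {
            fn new(index: usize) -> Self {
                // Assumed for this sketch: indices always fit in 32 bits.
                assert!(index <= u32::MAX as usize);
                $entity(index as u32)
            }

            fn index(self) -> usize {
                self.0 as usize
            }
        }
    };
}

/// Hypothetical new entity reference defined outside the map module.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MyRef(u32);
entity_impl!(MyRef);
```

With the macro living next to the trait, any module can declare a new reference type in two lines without depending on the map implementation.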
When comparing instructions in the same EBB, behave as if the RPO visits
instructions in program order.
- Add a Layout::pp_ebb() method for convenience. It gets the EBB
containing any program point.
- Add a conversion from ValueDef to ExpandedProgramPoint so it can be
used with the rpo_cmp method.
LiveRanges represent the live-in range of a value as a sorted
list of intervals. Each interval starts at an EBB and continues
to an instruction. Before this commit, the LiveRange would store
an interval for each EBB. This commit changes the representation
such that intervals continuing from one EBB to another are coalesced
into one.
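The change in representation can be sketched as follows. This is only an illustration under simplifying assumptions: program points are modeled as plain sequence numbers, `Interval` and `coalesce` are hypothetical names, and an interval is taken to continue into the next EBB exactly when the next interval begins at the sequence number right after the previous end.

```rust
/// One live-in interval, modeled with plain sequence numbers.
/// `begin` stands for the EBB header, `end` for the last instruction covered.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Interval {
    pub begin: u32,
    pub end: u32,
}

/// Merge a sorted list of per-EBB intervals so that a run of intervals
/// flowing from one EBB straight into the next becomes a single interval.
pub fn coalesce(per_ebb: &[Interval]) -> Vec<Interval> {
    let mut out: Vec<Interval> = Vec::new();
    for &iv in per_ebb {
        // Extend the previous interval if this one starts right where it ended.
        let merged = match out.last_mut() {
            Some(last) if last.end + 1 == iv.begin => {
                last.end = iv.end;
                true
            }
            _ => false,
        };
        if !merged {
            out.push(iv);
        }
    }
    out
}
```

For a value that is live through several consecutive EBBs, the list shrinks from one entry per EBB to a single entry, which keeps lookups in the sorted list short.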
Fixes #37.
We will track live ranges separately for each SSA value, rather than per
virtual register like LLVM does.
This is the basis for a register allocator, so place it in a new
regalloc module.
The ProgramOrder::cmp() comparison is often used where one or both
arguments are statically known to be an Inst or Ebb. Give the compiler a
better chance to discover this via inlining and other optimizations.
- Make cmp() generic with Into<ExpandedProgramPoint> bounds.
- Implement the natural From<T> traits for ExpandedProgramPoint.
- Make Layout::pp_seq() generic with the same bound.
Now, with inlining and constant folding, passing an Inst argument to
PO::cmp() will result in a call to a monomorphized Layout::seq::<Inst>()
which can avoid the dynamic match to select a table for looking up the
sequence number.
The result is that comparing two program points of statically known type
results in two direct table lookups and a sequence number comparison.
This all uses ExpandedProgramPoint because it is more likely to be
transparent to the constant folder than the bit-packed ProgramPoint
type.
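The shape of the generic comparison can be sketched like this. The types below are simplified stand-ins, not the real definitions: `Inst` and `Ebb` are modeled as `u32` newtypes, the sequence-number tables are plain vectors, and the field names `inst_seq` and `ebb_seq` are hypothetical.

```rust
use std::cmp::Ordering;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Inst(pub u32);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Ebb(pub u32);

/// Unpacked program point: either an instruction or an EBB header.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ExpandedProgramPoint {
    Inst(Inst),
    Ebb(Ebb),
}

impl From<Inst> for ExpandedProgramPoint {
    fn from(inst: Inst) -> Self {
        ExpandedProgramPoint::Inst(inst)
    }
}

impl From<Ebb> for ExpandedProgramPoint {
    fn from(ebb: Ebb) -> Self {
        ExpandedProgramPoint::Ebb(ebb)
    }
}

/// Simplified layout holding one sequence-number table per entity kind.
pub struct Layout {
    pub inst_seq: Vec<u32>,
    pub ebb_seq: Vec<u32>,
}

impl Layout {
    /// Look up the sequence number of any program point. When `PP` is
    /// statically `Inst` or `Ebb`, inlining lets the match fold away.
    #[inline]
    pub fn pp_seq<PP: Into<ExpandedProgramPoint>>(&self, pp: PP) -> u32 {
        match pp.into() {
            ExpandedProgramPoint::Inst(inst) => self.inst_seq[inst.0 as usize],
            ExpandedProgramPoint::Ebb(ebb) => self.ebb_seq[ebb.0 as usize],
        }
    }

    /// Compare two program points in program order.
    #[inline]
    pub fn cmp<A, B>(&self, a: A, b: B) -> Ordering
    where
        A: Into<ExpandedProgramPoint>,
        B: Into<ExpandedProgramPoint>,
    {
        self.pp_seq(a).cmp(&self.pp_seq(b))
    }
}
```

A call such as `layout.cmp(ebb, inst)` is monomorphized for `(Ebb, Inst)`, so after inlining the optimizer has a good chance of reducing it to two table lookups and an integer comparison.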