Update the documentation for the merger, and also for various changes in Cranelift. Remove some old obsolete documentation, and convert the remaining Sphinx files to Markdown. Some of the remaining content is still out of date, but this is a step forward.
326 lines
15 KiB
Markdown
326 lines
15 KiB
Markdown
# Register Allocation in Cranelift
|
|
|
|
Cranelift uses a *decoupled, SSA-based* register allocator. Decoupled means that
|
|
register allocation is split into two primary phases: *spilling* and
|
|
*coloring*. SSA-based means that the code stays in SSA form throughout the
|
|
register allocator, and in fact is still in SSA form after register allocation.
|
|
|
|
Before the register allocator is run, all instructions in the function must be
|
|
*legalized*, which means that every instruction has an entry in the
|
|
`encodings` table. The encoding entries also provide register class
|
|
constraints on the instruction's operands that the register allocator must
|
|
satisfy.
|
|
|
|
After the register allocator has run, the `locations` table provides a
|
|
register or stack slot location for all SSA values used by the function. The
|
|
register allocator may have inserted `spill`, `fill`, and
|
|
`copy` instructions to make that possible.
|
|
|
|
## SSA-based register allocation
|
|
|
|
The phases of the SSA-based register allocator are:
|
|
|
|
Liveness analysis
|
|
For each SSA value, determine exactly where it is live.
|
|
|
|
Coalescing
|
|
Form *virtual registers* which are sets of SSA values that should be
|
|
assigned to the same location. Split live ranges such that values that
|
|
belong to the same virtual register don't have interfering live ranges.
|
|
|
|
Spilling
|
|
The process of deciding which SSA values go in a stack slot and which
|
|
values go in a register. The spilling phase can also split live ranges by
|
|
inserting `copy` instructions, or transform the code in other ways to
|
|
reduce the number of values kept in registers.
|
|
|
|
After spilling, the number of live register values never exceeds the number
|
|
of available registers.
|
|
|
|
Reload
|
|
Insert `spill` and `fill` instructions as necessary such that
|
|
instructions that expect their operands in registers won't see values that
|
|
live on the stack and vice versa.
|
|
|
|
Reuse registers containing values loaded from the stack as much as possible
|
|
without exceeding the maximum allowed register pressure.
|
|
|
|
Coloring
|
|
The process of assigning specific registers to the live values. It's a
|
|
property of SSA form that this can be done in a linear scan of the
|
|
dominator tree without causing any additional spills.
|
|
|
|
Make sure that specific register operand constraints are satisfied.
|
|
|
|
The contract between the spilling and coloring phases is that the number of
|
|
values in registers never exceeds the number of available registers. This
|
|
sounds simple enough in theory, but in practice there are some complications.
|
|
|
|
### Real-world complications to SSA coloring
|
|
|
|
In practice, instruction set architectures don't have "K interchangeable
|
|
registers", and register pressure can't be measured with a single number. There
|
|
are complications:
|
|
|
|
Different register banks
|
|
Most ISAs separate integer registers from floating point registers, and
|
|
instructions require their operands to come from a specific bank. This is a
|
|
fairly simple problem to deal with since the register banks are completely
|
|
disjoint. We simply count the number of integer and floating-point values
|
|
that are live independently, and make sure that each number does not exceed
|
|
the size of their respective register banks.
|
|
|
|
Instructions with fixed operands
|
|
Some instructions use a fixed register for an operand. This happens on the
|
|
x86 ISAs:
|
|
|
|
- Dynamic shift and rotate instructions take the shift amount in CL.
|
|
- Division instructions use RAX and RDX for both input and output operands.
|
|
- Wide multiply instructions use fixed RAX and RDX registers for input and
|
|
output operands.
|
|
- A few SSE variable blend instructions use a hardwired XMM0 input operand.
|
|
|
|
Operands constrained to register subclasses
|
|
Some instructions can only use a subset of the registers for some operands.
|
|
For example, the ARM NEON vmla (scalar) instruction requires the scalar
|
|
operand to be located in D0-15 or even D0-7, depending on the data type.
|
|
The other operands can be from the full D0-31 register set.
|
|
|
|
ABI boundaries
|
|
Before making a function call, arguments must be placed in specific
|
|
registers and stack locations determined by the ABI, and return values
|
|
appear in fixed registers.
|
|
|
|
Some registers can be clobbered by the call and some are saved by the
|
|
callee. In some cases, only the low bits of a register are saved by the
|
|
callee. For example, ARM64 callees save only the low 64 bits of v8-15, and
|
|
Win64 callees only save the low 128 bits of AVX registers.
|
|
|
|
ABI boundaries also affect the location of arguments to the entry block and
|
|
return values passed to the `return` instruction.
|
|
|
|
Aliasing registers
|
|
Different registers sometimes share the same bits in the register bank.
|
|
This can make it difficult to measure register pressure. For example, the
|
|
x86 registers RAX, EAX, AX, AL, and AH overlap.
|
|
|
|
If only one of the aliasing registers can be used at a time, the aliasing
|
|
doesn't cause problems since the registers can simply be counted as one
|
|
unit.
|
|
|
|
Early clobbers
|
|
Sometimes an instruction requires that the register used for an output
|
|
operand does not alias any of the input operands. This happens for inline
|
|
assembly and in some other special cases.
|
|
|
|
|
|
## Liveness Analysis
|
|
|
|
All the register allocator passes need to know exactly where SSA values are
|
|
live. The liveness analysis computes this information.
|
|
|
|
The data structure representing the live range of a value uses the linear
|
|
layout of the function. All instructions and EBB headers are assigned a
|
|
*program position*. A starting point for a live range can be one of the
|
|
following:
|
|
|
|
- The instruction where the value is defined.
|
|
- The EBB header where the value is an EBB parameter.
|
|
- An EBB header where the value is live-in because it was defined in a
|
|
dominating block.
|
|
|
|
The ending point of a live range can be:
|
|
|
|
- The last instruction to use the value.
|
|
- A branch or jump to an EBB where the value is live-in.
|
|
|
|
When all the EBBs in a function are laid out linearly, the live range of a
|
|
value doesn't have to be a contiguous interval, although it will be in a
|
|
majority of cases. There can be holes in the linear live range.
|
|
|
|
The part of a value's live range that falls inside a single EBB will always be
|
|
an interval without any holes. This follows from the dominance requirements of
|
|
SSA. A live range is represented as:
|
|
|
|
- The interval inside the EBB where the value is defined.
|
|
- A set of intervals for EBBs where the value is live-in.
|
|
|
|
Any value that is only used inside a single EBB will have an empty set of
|
|
live-in intervals. Some values are live across large parts of the function, and
|
|
this can often be represented with coalesced live-in intervals covering many
|
|
EBBs. It is important that the live range data structure doesn't have to grow
|
|
linearly with the number of EBBs covered by a live range.
|
|
|
|
This representation is very similar to LLVM's `LiveInterval` data structure
|
|
with a few important differences:
|
|
|
|
- The Cranelift `LiveRange` only covers a single SSA value, while LLVM's
|
|
`LiveInterval` represents the union of multiple related SSA values in a
|
|
virtual register. This makes Cranelift's representation smaller because
|
|
individual segments don't have to annotated with a value number.
|
|
- Cranelift stores the def-interval separately from a list of coalesced live-in
|
|
intervals, while LLVM stores an array of segments. The two representations
|
|
are equivalent, but Cranelift optimizes for the common case of a value that is
|
|
only used locally.
|
|
- It is simpler to check if two live ranges are overlapping. The dominance
|
|
properties of SSA form means that it is only necessary to check the
|
|
def-interval of each live range against the intervals of the other range. It
|
|
is not necessary to check for overlap between the two sets of live-in
|
|
intervals. This makes the overlap check logarithmic in the number of live-in
|
|
intervals instead of linear.
|
|
- LLVM represents a program point as `SlotIndex` which holds a pointer to a
|
|
32-byte `IndexListEntry` struct. The entries are organized in a double
|
|
linked list that mirrors the ordering of instructions in a basic block. This
|
|
allows 'tombstone' program points corresponding to instructions that have
|
|
been deleted.
|
|
|
|
Cranelift uses a 32-bit program point representation that encodes an
|
|
instruction or EBB number directly. There are no 'tombstones' for deleted
|
|
instructions, and no mirrored linked list of instructions. Live ranges must
|
|
be updated when instructions are deleted.
|
|
|
|
A consequence of Cranelift's more compact representation is that two program
|
|
points can't be compared without the context of a function layout.
|
|
|
|
## Coalescing algorithm
|
|
|
|
Unconstrained SSA form is not well suited to register allocation because of the problems
|
|
that can arise around EBB parameters and arguments. Consider this simple example:
|
|
|
|
```
|
|
function %interference(i32, i32) -> i32 {
|
|
ebb0(v0: i32, v1: i32):
|
|
brz v0, ebb1(v1)
|
|
jump ebb1(v0)
|
|
|
|
ebb1(v2: i32):
|
|
v3 = iadd v1, v2
|
|
return v3
|
|
}
|
|
```
|
|
|
|
Here, the value `v1` is both passed as an argument to `ebb1` *and* it is
|
|
live in to the EBB because it is used by the `iadd` instruction. Since
|
|
EBB arguments on the `brz` instruction need to be in the same register as
|
|
the corresponding EBB parameter `v2`, there is going to be interference
|
|
between `v1` and `v2` in the `ebb1` block.
|
|
|
|
The interference can be resolved by isolating the SSA values passed as EBB arguments:
|
|
|
|
```
|
|
function %coalesced(i32, i32) -> i32 {
|
|
ebb0(v0: i32, v1: i32):
|
|
v5 = copy v1
|
|
brz v0, ebb1(v5)
|
|
v6 = copy v0
|
|
jump ebb1(v6)
|
|
|
|
ebb1(v2: i32):
|
|
v3 = iadd.i32 v1, v2
|
|
return v3
|
|
}
|
|
```
|
|
|
|
Now the EBB argument is `v5` which is *not* itself live into `ebb1`,
|
|
resolving the interference.
|
|
|
|
The coalescing pass groups the SSA values into sets called *virtual registers*
|
|
and inserts copies such that:
|
|
|
|
1. Whenever a value is passed as an EBB argument, the corresponding EBB
|
|
parameter value belongs to the same virtual register as the passed argument
|
|
value.
|
|
2. The live ranges of values belonging to the same virtual register do not
|
|
interfere, i.e. they don't overlap anywhere.
|
|
|
|
Most virtual registers contains only a single isolated SSA value because most
|
|
SSA values are never passed as EBB arguments. The `VirtRegs` data structure
|
|
doesn't store any information about these singleton virtual registers, it only
|
|
tracks larger virtual registers and assumes that any value it doesn't know about
|
|
is its own singleton virtual register
|
|
|
|
Once the values have been partitioned into interference-free virtual registers,
|
|
the code is said to be in `conventional SSA form (CSSA)
|
|
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.107.7249>`_. A program
|
|
in CSSA form can be register allocated correctly by assigning all the values in
|
|
a virtual register to the same stack or register location.
|
|
|
|
Conventional SSA form and the virtual registers are maintained through all the
|
|
register allocator passes.
|
|
|
|
|
|
## Spilling algorithm
|
|
|
|
The spilling pass is responsible for lowering the register pressure enough that
|
|
the coloring pass is guaranteed to be able to find a coloring solution. It does
|
|
this by assigning whole virtual registers to stack slots.
|
|
|
|
Besides just counting registers, the spiller also has to look at the
|
|
instruction's operand constraints because sometimes the constraints can require
|
|
extra registers to solve, raising the register pressure:
|
|
|
|
- If a single value is used more than once by an instruction, and the operands
|
|
have conflicting constraints, two registers must be used. The most common case is
|
|
when a single value is passed as two separate arguments to a function call.
|
|
- If an instruction has a *tied operand constraint* where one of the input operands
|
|
must use the same register as the output operand, the spiller makes sure that
|
|
the tied input value doesn't interfere with the output value by inserting a copy
|
|
if needed.
|
|
|
|
The spilling heuristic used by Cranelift is very simple. Whenever the spiller
|
|
determines that the register pressure is too high at some instruction, it picks
|
|
the live SSA value whose definition is farthest away as the spill candidate.
|
|
Then it spills all values in the corresponding virtual register to the same
|
|
spill slot. It is important that all values in a virtual register get the same
|
|
spill slot, otherwise we could need memory-to-memory copies when passing spilled
|
|
arguments to a spilled EBB parameter.
|
|
|
|
This simple heuristic tends to spill values with long live ranges, and it
|
|
depends on the reload pass to do a good job of reusing registers reloaded from
|
|
spill slots if the spilled value gets used a lot. The idea is to minimize stack
|
|
*write* traffic with the spilling heuristic and to minimize stack *read* traffic
|
|
with the reload pass.
|
|
|
|
## Coloring algorithm
|
|
|
|
The SSA coloring algorithm is based on a single observation: If two SSA values
|
|
interfere, one of the values must be live where the other value is defined.
|
|
|
|
We visit the EBBs in a topological order such that all dominating EBBs are
|
|
visited before the current EBB. The instructions in an EBB are visited in a
|
|
top-down order, and each value define by the instruction is assigned an
|
|
available register. With this iteration order, every value that is live at an
|
|
instruction has already been assigned to a register.
|
|
|
|
This coloring algorithm works if the following condition holds:
|
|
|
|
At every instruction, consider the values live through the instruction. No
|
|
matter how the live values have been assigned to registers, there must be
|
|
available registers of the right register classes available for the values
|
|
defined by the instruction.
|
|
|
|
We'll need to modify this condition in order to deal with the real-world
|
|
complications.
|
|
|
|
The coloring algorithm needs to keep track of the set of live values at each
|
|
instruction. At the top of an EBB, this set can be computed as the union of:
|
|
|
|
- The set of live values before the immediately dominating branch or jump
|
|
instruction. The topological iteration order guarantees that this set is
|
|
available. Values whose live range indicate that they are not live-in to the
|
|
current EBB should be filtered out.
|
|
- The set of parameters the EBB. These values should all be live-in, although
|
|
it is possible that some are dead and never used anywhere.
|
|
|
|
For each live value, we also track its kill point in the current EBB. This is
|
|
the last instruction to use the value in the EBB. Values that are live-out
|
|
through the EBB terminator don't have a kill point. Note that the kill point
|
|
can be a branch to another EBB that uses the value, so the kill instruction
|
|
doesn't have to be a use of the value.
|
|
|
|
When advancing past an instruction, the live set is updated:
|
|
|
|
- Any values whose kill point is the current instruction are removed.
|
|
- Any values defined by the instruction are added, unless their kill point is
|
|
the current instruction. This corresponds to a dead def which has no uses.
|