x64 regalloc register order: put caller-saves (volatiles) first.
The x64 backend currently builds the `RealRegUniverse` in a way that generates somewhat suboptimal code. In many blocks, we see callee-save (non-volatile) registers (r12, r13, r14, rbx) used first, even in very short leaf functions where there are plenty of volatiles available. This leads to unnecessary spills/reloads.

On one (local) test program, a medium-sized C benchmark compiled to Wasm and run on Wasmtime, I am seeing a ~10% performance improvement with this change. The improvement will be less pronounced in programs with high register pressure (there we are likely to use all registers regardless, so the prologue/epilogue will save/restore all callee-saves) or in programs with fewer calls, but this is a clear win for small functions and in many cases removes prologue/epilogue clobber-saves altogether.

Separately, I think the RA's coalescing is tripping up a bit in some cases; see e.g. the filetest touched by this commit that loads a value into %rsi, then moves it to %rax and returns immediately. This is an orthogonal issue, though, and should be addressed (if worthwhile) in regalloc.rs.
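To illustrate the register-ordering change described in this message, here is a minimal sketch, not the actual backend code. The `Gpr` enum and the functions `allocation_order` and `callee_saves_touched` are hypothetical names invented for this example and are not part of regalloc.rs or Cranelift. The sketch assumes the System V AMD64 convention, leaves out rsp and rbp as non-allocatable, and models the allocator as simply preferring registers earlier in the list.

```rust
/// Hypothetical stand-in for an allocatable x64 general-purpose register.
/// rsp and rbp are omitted on the assumption that they are not allocatable.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Gpr {
    Rax, Rcx, Rdx, Rbx, Rsi, Rdi,
    R8, R9, R10, R11, R12, R13, R14, R15,
}

impl Gpr {
    /// True for registers a callee must preserve under the System V AMD64 ABI.
    pub fn is_callee_saved(self) -> bool {
        matches!(self, Gpr::Rbx | Gpr::R12 | Gpr::R13 | Gpr::R14 | Gpr::R15)
    }
}

/// Builds a preference order with caller-saved (volatile) registers first,
/// followed by callee-saved ones. Relative order within each group is kept.
pub fn allocation_order() -> Vec<Gpr> {
    let all = [
        Gpr::Rax, Gpr::Rcx, Gpr::Rdx, Gpr::Rbx, Gpr::Rsi, Gpr::Rdi,
        Gpr::R8, Gpr::R9, Gpr::R10, Gpr::R11,
        Gpr::R12, Gpr::R13, Gpr::R14, Gpr::R15,
    ];
    // `partition` puts the elements matching the predicate in the first Vec.
    let (callee, caller): (Vec<Gpr>, Vec<Gpr>) =
        all.iter().copied().partition(|r| r.is_callee_saved());
    caller.into_iter().chain(callee).collect()
}

/// Simplified model: a function that needs `n` registers takes the first `n`
/// from `order`. Returns the callee-saved registers it would have to
/// save and restore in its prologue/epilogue.
pub fn callee_saves_touched(order: &[Gpr], n: usize) -> Vec<Gpr> {
    order
        .iter()
        .copied()
        .take(n)
        .filter(|r| r.is_callee_saved())
        .collect()
}
```

Under this simplified model, a small leaf function that needs only a handful of registers touches no callee-saves at all, so it needs no clobber-saves in its prologue/epilogue. Callee-saved registers come into play only once the volatile ones are exhausted.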
@@ -70,8 +70,8 @@ block0:
     return v1
 }
 ; check: uninit %xmm0
-; nextln: pinsrw $$0, %r12, %xmm0
-; nextln: pinsrw $$1, %r12, %xmm0
+; nextln: pinsrw $$0, %rsi, %xmm0
+; nextln: pinsrw $$1, %rsi, %xmm0
 ; nextln: pshufd $$0, %xmm0, %xmm0
 
 function %splat_i32(i32) -> i32x4 {