* Cranelift: fix #3953: rework single/multiple-use logic in lowering.

  This PR addresses the longstanding issue with loads trying to merge into compares on x86-64, and more generally, with the lowering framework falsely recognizing "single uses" of one op by another (which would normally allow merging of side-effecting ops like loads) when there is *indirect* duplication. To fix this, we replace the direct `value_uses` count with a transitive notion of uniqueness (not unlike Rust's `&`/`&mut` and how a `&mut` downgrades to `&` when accessed through another `&`!). A value is used multiple times transitively if it has multiple direct uses, or is used by another op that is used multiple times transitively.

  The canonical example of badness is:

  ```
  v1 := load
  v2 := ifcmp v1, ...
  v3 := selectif v2, ...
  v4 := selectif v2, ...
  ```

  Both `v3` and `v4` effectively merge the `ifcmp` (`v2`), so even though the use of `v1` is "unique", it is codegenned twice. This is why we ~~can't have nice things~~ can't merge loads into compares (#3953).

  There is quite a subtle and interesting design space around this problem and how we might solve it. See the long doc-comment on `ValueUseState` in this PR for more justification for the particular design here. In particular, this design deliberately simplifies a bit relative to an "optimal" solution: some uses can *become* unique depending on merging, but we don't design our data structures for such updates because that would require significant extra costly tracking (some sort of transitive refcounting). For example, in the above, if `selectif` somehow did not merge `ifcmp`, then we would only codegen the `ifcmp` once into its result register (and use that register twice); then the load *is* uniquely used, and could be merged. But that requires transitioning from "multiple use" back to "unique use" with careful tracking as we do pattern-matching, which I've chosen to make out-of-scope here for now. In practice, I don't think it will matter too much (and we can always improve later).

  With this PR, we can now re-enable load-op merging for compares. A subsequent commit does this.

* Update x64 backend to allow load-op merging for `cmp`.
* Update filetests.
* Add test for cmp-mem merging on x64.
* Comment fixes.
* Rework ValueUseState analysis for better performance.
* Update s390x filetest: iadd_ifcout cannot merge loads anymore because it has multiple outputs (ValueUseState limitation).
* Address review comments.
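For concreteness, here is a rough, self-contained sketch of the transitive use-state computation described above, written against a toy IR. The `ValueUseState` name comes from this PR, but the variant names (`Unused`/`Once`/`Multiple`), the `Inst` representation, the `value_use_states` helper, and the explicit `v0` address operand in the example are assumptions made for illustration only, not Cranelift's actual data structures or API.

```rust
/// Illustrative model only: not Cranelift's actual types.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ValueUseState {
    Unused,
    Once,
    Multiple,
}

/// A toy instruction: produces one value, uses some others.
struct Inst {
    result: usize,
    args: Vec<usize>,
}

/// Compute, for each value, whether it is used zero times, exactly once,
/// or (transitively) multiple times. A value inherits `Multiple` if any
/// user's own result is itself multiply used, because merging that user
/// into each of *its* users would duplicate the value's computation.
fn value_use_states(num_values: usize, insts: &[Inst]) -> Vec<ValueUseState> {
    // Direct use counts first.
    let mut state = vec![ValueUseState::Unused; num_values];
    for inst in insts {
        for &arg in &inst.args {
            state[arg] = match state[arg] {
                ValueUseState::Unused => ValueUseState::Once,
                _ => ValueUseState::Multiple,
            };
        }
    }
    // Propagate `Multiple` from results down to their arguments until a
    // fixpoint: if an instruction's result is multiply used, each of its
    // arguments is effectively multiply used too.
    let mut changed = true;
    while changed {
        changed = false;
        for inst in insts {
            if state[inst.result] == ValueUseState::Multiple {
                for &arg in &inst.args {
                    if state[arg] != ValueUseState::Multiple {
                        state[arg] = ValueUseState::Multiple;
                        changed = true;
                    }
                }
            }
        }
    }
    state
}

fn main() {
    // The canonical example above, with v0 assumed as the load's address.
    let insts = vec![
        Inst { result: 1, args: vec![0] }, // v1 := load v0
        Inst { result: 2, args: vec![1] }, // v2 := ifcmp v1
        Inst { result: 3, args: vec![2] }, // v3 := selectif v2
        Inst { result: 4, args: vec![2] }, // v4 := selectif v2
    ];
    let states = value_use_states(5, &insts);
    assert_eq!(states[2], ValueUseState::Multiple); // two direct uses
    assert_eq!(states[1], ValueUseState::Multiple); // one direct use, but transitively multiple
    assert_eq!(states[0], ValueUseState::Multiple); // the load's address, likewise
    println!("{:?}", states);
}
```

Note that in this sketch the state only ever moves toward `Multiple`; consistent with the deliberate simplification described above, nothing transitions back to unique-use as pattern matching proceeds.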
test compile
target x86_64

function %add_from_mem_u32_1(i64, i32) -> i32 {
block0(v0: i64, v1: i32):
  v2 = load.i32 v0
  v3 = iadd.i32 v2, v1
  ; check: addl %esi, 0(%rdi), %esi
  return v3
}

function %add_from_mem_u32_2(i64, i32) -> i32 {
block0(v0: i64, v1: i32):
  v2 = load.i32 v0
  v3 = iadd.i32 v1, v2
  ; check: addl %esi, 0(%rdi), %esi
  return v3
}

function %add_from_mem_u64_1(i64, i64) -> i64 {
block0(v0: i64, v1: i64):
  v2 = load.i64 v0
  v3 = iadd.i64 v2, v1
  ; check: addq %rsi, 0(%rdi), %rsi
  return v3
}

function %add_from_mem_u64_2(i64, i64) -> i64 {
block0(v0: i64, v1: i64):
  v2 = load.i64 v0
  v3 = iadd.i64 v1, v2
  ; check: addq %rsi, 0(%rdi), %rsi
  return v3
}

; test narrow loads: 8-bit load should not merge because the `addl` is 32 bits
; and would load 32 bits from memory, which may go beyond the end of the heap.
function %add_from_mem_not_narrow(i64, i8) -> i8 {
block0(v0: i64, v1: i8):
  v2 = load.i8 v0
  v3 = iadd.i8 v2, v1
  ; check: movzbq 0(%rdi), %rax
  ; nextln: addl %eax, %esi, %eax
  return v3
}

function %no_merge_if_lookback_use(i64, i64) -> i64 {
block0(v0: i64, v1: i64):
  v2 = load.i64 v0
  v3 = iadd.i64 v2, v0
  store.i64 v3, v1
  v4 = load.i64 v3
  return v4
  ; check: movq 0(%rdi), %r11
  ; nextln: movq %r11, %rax
  ; nextln: addq %rax, %rdi, %rax
  ; nextln: movq %rax, 0(%rsi)
  ; nextln: movq 0(%r11,%rdi,1), %rax
}

function %merge_scalar_to_vector(i64) -> i32x4 {
block0(v0: i64):
  v1 = load.i32 v0
  v2 = scalar_to_vector.i32x4 v1
  ; check: movss 0(%rdi), %xmm0

  jump block1
block1:
  return v2
}

function %cmp_mem(i64) -> i64 {
block0(v0: i64):
  v1 = load.i64 v0
  v2 = icmp eq v0, v1
  v3 = bint.i64 v2
  return v3

  ; check: cmpq 0(%rdi), %rdi
  ; nextln: setz %al
}