x64: Only branch once in br_table (#5850)

This uses the `cmov`, which was previously necessary for Spectre
mitigation, to clamp the table index instead of zeroing it. By then
placing the default target as the last entry in the table, we can use
just one branch instruction in all cases.
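Concretely, the new shape is roughly the following (a sketch only, not the literal lowering: N stands for the number of non-default table entries, so slot N holds the default target, and register names follow the comments in the diff below):

    ;; emitted by lowering as regular instructions:
    cmp $N, %idx
    mov $N, %tmp            ;; clamp value = index of the default slot
    cmovb %idx, %tmp        ;; %tmp = min(%idx, N), without branching
    ;; emitted by the br_table pseudoinstruction, using the clamped
    ;; value as its index (shown as %idx in the diff below):
    lea start_of_jump_table(%rip), %tmp1
    movslq [%tmp1, %tmp, 4], %tmp2
    addq %tmp2, %tmp1
    jmp *%tmp1              ;; the only branch, in bounds or not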

Since there isn't a bounds-check branch anymore, this sequence no
longer needs Spectre mitigation. And since we no longer need to be
careful about preserving flags, half of the instructions can be moved
out of this pseudoinstruction and emitted as regular instructions
instead.
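The flags constraint is why the old sequence zeroed %tmp1 with a move-immediate in the first place: the zeroing sat between the bounds-check `cmp` and the `cmovnb` that consumed its flags, so the usual xor zeroing idiom was off limits (again a sketch of the old shape, not the exact emitted code):

    cmp $size, %idx         ;; sets the flags the cmov reads later
    jnb $default_target     ;; branch one: the bounds check
    movl %idx, %tmp2
    mov $0, %tmp1           ;; must be a move-immediate: xor %tmp1, %tmp1
                            ;; would clobber the flags before the cmov
    cmovnb %tmp1, %tmp2     ;; Spectre mitigation: zero the index if the
                            ;; bounds-check branch misspeculated
    ...                     ;; branch two: the indirect jump via the table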

This is a net savings of three bytes in the encoding of x64's br_table
pseudoinstruction. The generated code can sometimes be longer overall
because the blocks are emitted in a slightly different order.

My benchmark results show a very small effect on runtime performance
with this change.

The spidermonkey benchmark in Sightglass runs "1.01x faster" than main
by instructions retired, but with no significant difference in CPU
cycles. I think that means it rarely hit the default case in any
br_table instructions it executed.

The pulldown-cmark benchmark in Sightglass runs "1.01x faster" than main
by CPU cycles, but main runs "1.00x faster" by instructions retired. I
think that means this benchmark hit the default case a significant
amount of the time, so it executed a few more instructions per br_table,
but perhaps those branches were predicted better.
Author: Jamey Sharp (committed by GitHub)
Date: 2023-02-23 20:46:38 -08:00
Parent: c5d9d5b10f
Commit: 7d790fcdfe
5 changed files with 126 additions and 131 deletions

@@ -1596,42 +1596,15 @@ pub(crate) fn emit(
 // maximum range of 2 GB. If we later consider using shorter-range label references,
 // this will need to be revisited.
 
-// Save index in a tmp (the live range of ridx only goes to start of this
-// sequence; rtmp1 or rtmp2 may overwrite it).
-
-// We generate the following sequence:
-// ;; generated by lowering: cmp #jmp_table_size, %idx
-// jnb $default_target
-// movl %idx, %tmp2
-// mov $0, %tmp1
-// cmovnb %tmp1, %tmp2 ;; Spectre mitigation.
+// We generate the following sequence. Note that the only read of %idx is before the
+// write to %tmp2, so regalloc may use the same register for both; fix x64/inst/mod.rs
+// if you change this.
 // lea start_of_jump_table_offset(%rip), %tmp1
-// movslq [%tmp1, %tmp2, 4], %tmp2 ;; shift of 2, viz. multiply index by 4
+// movslq [%tmp1, %idx, 4], %tmp2 ;; shift of 2, viz. multiply index by 4
 // addq %tmp2, %tmp1
 // j *%tmp1
 // $start_of_jump_table:
 // -- jump table entries
-one_way_jmp(sink, CC::NB, *default_target); // idx unsigned >= jmp table size
-
-// Copy the index (and make sure to clear the high 32-bits lane of tmp2).
-let inst = Inst::movzx_rm_r(ExtMode::LQ, RegMem::reg(idx), tmp2);
-inst.emit(&[], sink, info, state);
-
-// Zero `tmp1` to overwrite `tmp2` with zeroes on the
-// out-of-bounds case (Spectre mitigation using CMOV).
-// Note that we need to do this with a move-immediate
-// form, because we cannot clobber the flags.
-let inst = Inst::imm(OperandSize::Size32, 0, tmp1);
-inst.emit(&[], sink, info, state);
-
-// Spectre mitigation: CMOV to zero the index if the out-of-bounds branch above misspeculated.
-let inst = Inst::cmove(
-    OperandSize::Size64,
-    CC::NB,
-    RegMem::reg(tmp1.to_reg()),
-    tmp2,
-);
-inst.emit(&[], sink, info, state);
 
 // Load base address of jump table.
 let start_of_jumptable = sink.get_label();
@@ -1645,7 +1618,7 @@ pub(crate) fn emit(
 RegMem::mem(Amode::imm_reg_reg_shift(
     0,
     Gpr::new(tmp1.to_reg()).unwrap(),
-    Gpr::new(tmp2.to_reg()).unwrap(),
+    Gpr::new(idx).unwrap(),
     2,
 )),
 tmp2,
@@ -1668,7 +1641,7 @@ pub(crate) fn emit(
 // Emit jump table (table of 32-bit offsets).
 sink.bind_label(start_of_jumptable);
 let jt_off = sink.cur_offset();
-for &target in targets.iter() {
+for &target in targets.iter().chain(std::iter::once(default_target)) {
     let word_off = sink.cur_offset();
     // off_into_table is an addend here embedded in the label to be later patched at
     // the end of codegen. The offset is initially relative to this jump table entry;
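
Putting the pieces together, a minimal Rust model of the new dispatch may help (the function and names here are hypothetical, for illustration only, and not Cranelift's code): the index is clamped to the table length, the default target occupies the final slot, and a single table load plus indirect jump handles every case.

    /// Hypothetical model of the clamped br_table dispatch, not Cranelift code.
    /// Slots 0..targets.len() hold the in-range targets; the final slot holds
    /// the default target appended by the loop in the last hunk above.
    fn br_table_model(idx: usize, targets: &[u32], default_target: u32) -> u32 {
        // The cmp/cmov pair from lowering: clamp instead of branching, so
        // every out-of-range index selects the default slot.
        let clamped = idx.min(targets.len());

        // The emitted table, mirroring
        // `targets.iter().chain(std::iter::once(default_target))`.
        let table: Vec<u32> = targets
            .iter()
            .copied()
            .chain(std::iter::once(default_target))
            .collect();

        // The single indirect branch: one load, one jump, no bounds check.
        table[clamped]
    }

    fn main() {
        let targets = [10, 20, 30];
        assert_eq!(br_table_model(1, &targets, 99), 20); // in range
        assert_eq!(br_table_model(7, &targets, 99), 99); // clamped to default
    }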