x64: Add rudimentary support for some AVX instructions (#5795)
* x64: Add rudimentary support for some AVX instructions

I was poking around SpiderMonkey's wasm backend and saw that the various assembler functions it uses are all `v*`-prefixed, which suggests they're intended for AVX instructions. Cranelift currently supports very few AVX-based instructions, so I figured I'd take a crack at it!

The support added here is a bit of a mishmash when viewed alone, but my general goal was to take a single instruction from the SIMD proposal for WebAssembly and migrate all of its component instructions to AVX. By random chance I picked a fairly complicated one: `f32x4.min`. This wasm instruction is implemented on x64 with four unique SSE instructions, which made it a good candidate.

Further digging into AVX-vs-SSE shows that there should be two major benefits to using AVX over SSE:

* Primarily, AVX instructions largely use a three-operand form, where two input registers are operated on and a separate output register is specified. This is in contrast to SSE's predominant one-register-is-both-input-and-output pattern. This should free up the register allocator a bit and additionally remove the need for movement between registers.

* As #4767 notes, the memory operands of VEX-encoded instructions (a.k.a. AVX instructions) do not have strict alignment requirements, which means we would be able to sink loads and stores into individual instructions instead of emitting separate instructions.

So I set out on my journey to implement the instructions used by `f32x4.min`. The first few were fairly easy. The machinst backends are already of the shape "take these inputs and compute the output", where the x86 requirement that one register be both input and output is post-processed in. This means the `inst.isle` creation helpers for SSE instructions were already of the correct form to use for AVX. I chose to add new `rule` branches to those instruction creation helpers, for example `x64_andnps`.
Each new `rule` only runs if AVX is enabled and emits an AVX instruction instead of the SSE instruction achieving the same goal. This means that no lowerings of clif instructions were modified; only new machine instructions are generated.

The VEX encoding was previously not heavily used in Cranelift. Its only current users are the FMA-style instructions that Cranelift has at this time. These FMA instructions take one more operand than `vandnps`, for example, so I split the existing `XmmRmRVex` into a few more variants to fit the shapes of the instructions that needed generating for `f32x4.min`. This was accompanied by more AVX opcode definitions, more emission support, etc.

Upon implementing all of this it turned out that the test suite was failing on my machine because the memory-operand encodings of VEX instructions were not supported. I didn't explicitly add those myself, but some preexisting RIP-relative addressing was leaking into the new instructions through existing tests. I opted to go ahead and fill out the memory addressing modes of the VEX encoding to get the tests passing again.

All in all, this PR adds a number of AVX instructions to the x64 backend, updates 5 existing instruction producers to use AVX instructions conditionally, implements VEX memory operands, and adds some simple tests for the new output of `f32x4.min`. The existing runtest for `f32x4.min` caught a few intermediate bugs along the way, and I additionally added a plain `target x86_64` to that runtest to ensure that it executes both with and without AVX, exercising both lowerings. I'll also note that this, and future support, should be well-fuzzed through Wasmtime's fuzzing, which may explicitly disable AVX support despite the machine having access to AVX, so non-AVX lowerings should remain well-tested into the future.

It's also worth mentioning that I am not an AVX, VEX, or x64 expert.
Implementing the memory-operand part of VEX was the hardest part of this PR, and while I think it should be good, someone else should definitely double-check me. Additionally, I haven't added many instructions to the x64 backend before, so I may have missed obvious places to add tests; I'm happy to follow up with anything to be more thorough if necessary. Finally, I should note that this is just the tip of the iceberg when it comes to AVX. My hope is to get some of the idioms sorted out here to make it easier for future PRs to add one-off instruction lowerings.

* Review feedback
@@ -312,18 +312,50 @@ pub(crate) fn emit_std_enc_mem(
    prefixes.emit(sink);

    // After prefixes, first emit the REX byte depending on the kind of
    // addressing mode that's being used.
    match *mem_e {
        Amode::ImmReg { simm32, base, .. } => {
            // First, the REX byte.
        Amode::ImmReg { base, .. } => {
            let enc_e = int_reg_enc(base);
            rex.emit_two_op(sink, enc_g, enc_e);
        }

            // Now the opcode(s). These include any other prefixes the caller
            // hands to us.
            while num_opcodes > 0 {
                num_opcodes -= 1;
                sink.put1(((opcodes >> (num_opcodes << 3)) & 0xFF) as u8);
            }
        Amode::ImmRegRegShift {
            base: reg_base,
            index: reg_index,
            ..
        } => {
            let enc_base = int_reg_enc(*reg_base);
            let enc_index = int_reg_enc(*reg_index);
            rex.emit_three_op(sink, enc_g, enc_index, enc_base);
        }

        Amode::RipRelative { .. } => {
            // note REX.B = 0.
            rex.emit_two_op(sink, enc_g, 0);
        }
    }

    // Now the opcode(s). These include any other prefixes the caller
    // hands to us.
    while num_opcodes > 0 {
        num_opcodes -= 1;
        sink.put1(((opcodes >> (num_opcodes << 3)) & 0xFF) as u8);
    }

    // And finally encode the mod/rm bytes and all further information.
    emit_modrm_sib_disp(sink, enc_g, mem_e, bytes_at_end)
}

pub(crate) fn emit_modrm_sib_disp(
    sink: &mut MachBuffer<Inst>,
    enc_g: u8,
    mem_e: &Amode,
    bytes_at_end: u8,
) {
    match *mem_e {
        Amode::ImmReg { simm32, base, .. } => {
            let enc_e = int_reg_enc(base);

            // Now the mod/rm and associated immediates. This is
            // significantly complicated due to the multiple special cases.
@@ -377,15 +409,6 @@ pub(crate) fn emit_std_enc_mem(
            let enc_base = int_reg_enc(*reg_base);
            let enc_index = int_reg_enc(*reg_index);

            // The rex byte.
            rex.emit_three_op(sink, enc_g, enc_index, enc_base);

            // All other prefixes and opcodes.
            while num_opcodes > 0 {
                num_opcodes -= 1;
                sink.put1(((opcodes >> (num_opcodes << 3)) & 0xFF) as u8);
            }

            // modrm, SIB, immediates.
            if low8_will_sign_extend_to_32(simm32) && enc_index != regs::ENC_RSP {
                sink.put1(encode_modrm(1, enc_g & 7, 4));
@@ -401,16 +424,6 @@ pub(crate) fn emit_std_enc_mem(
            }

        Amode::RipRelative { ref target } => {
            // First, the REX byte, with REX.B = 0.
            rex.emit_two_op(sink, enc_g, 0);

            // Now the opcode(s). These include any other prefixes the caller
            // hands to us.
            while num_opcodes > 0 {
                num_opcodes -= 1;
                sink.put1(((opcodes >> (num_opcodes << 3)) & 0xFF) as u8);
            }

            // RIP-relative is mod=00, rm=101.
            sink.put1(encode_modrm(0, enc_g & 7, 0b101));

@@ -4,7 +4,10 @@
use super::evex::Register;
use super::rex::{LegacyPrefixes, OpcodeMap};
use super::ByteSink;
use crate::isa::x64::encoding::rex::encode_modrm;
use crate::isa::x64::args::Amode;
use crate::isa::x64::encoding::rex;
use crate::isa::x64::inst::Inst;
use crate::machinst::MachBuffer;

/// Constructs a VEX-encoded instruction using a builder pattern. This approach makes it visually
/// easier to transform something the manual's syntax, `VEX.128.66.0F 73 /7 ib` to code:
@@ -16,11 +19,29 @@ pub struct VexInstruction {
    opcode: u8,
    w: bool,
    reg: u8,
    rm: Register,
    rm: RegisterOrAmode,
    vvvv: Option<Register>,
    imm: Option<u8>,
}

#[allow(missing_docs)]
pub enum RegisterOrAmode {
    Register(Register),
    Amode(Amode),
}

impl From<u8> for RegisterOrAmode {
    fn from(reg: u8) -> Self {
        RegisterOrAmode::Register(reg.into())
    }
}

impl From<Amode> for RegisterOrAmode {
    fn from(amode: Amode) -> Self {
        RegisterOrAmode::Amode(amode)
    }
}

impl Default for VexInstruction {
    fn default() -> Self {
        Self {
@@ -30,7 +51,7 @@ impl Default for VexInstruction {
            opcode: 0x00,
            w: false,
            reg: 0x00,
            rm: Register::default(),
            rm: RegisterOrAmode::Register(Register::default()),
            vvvv: None,
            imm: None,
        }
@@ -105,12 +126,12 @@ impl VexInstruction {
        self
    }

    /// Set the register to use for the `rm` bits; many instructions use this as the "read from
    /// register/memory" operand. Currently this does not support memory addressing (TODO).Setting
    /// this affects both the ModRM byte (`rm` section) and the VEX prefix (the extension bits for
    /// register encodings > 8).
    /// Set the register to use for the `rm` bits; many instructions use this
    /// as the "read from register/memory" operand. Setting this affects both
    /// the ModRM byte (`rm` section) and the VEX prefix (the extension bits
    /// for register encodings > 8).
    #[inline(always)]
    pub fn rm(mut self, reg: impl Into<Register>) -> Self {
    pub fn rm(mut self, reg: impl Into<RegisterOrAmode>) -> Self {
        self.rm = reg.into();
        self
    }
@@ -150,15 +171,33 @@ impl VexInstruction {
    /// The X bit in encoded format (inverted).
    #[inline(always)]
    fn x_bit(&self) -> u8 {
        // TODO
        (!0) & 1
        let reg = match &self.rm {
            RegisterOrAmode::Register(_) => 0,
            RegisterOrAmode::Amode(Amode::ImmReg { .. }) => 0,
            RegisterOrAmode::Amode(Amode::ImmRegRegShift { index, .. }) => {
                index.to_real_reg().unwrap().hw_enc()
            }
            RegisterOrAmode::Amode(Amode::RipRelative { .. }) => 0,
        };

        !(reg >> 3) & 1
    }

    /// The B bit in encoded format (inverted).
    #[inline(always)]
    fn b_bit(&self) -> u8 {
        let rm: u8 = self.rm.into();
        (!(rm >> 3)) & 1
        let reg = match &self.rm {
            RegisterOrAmode::Register(r) => (*r).into(),
            RegisterOrAmode::Amode(Amode::ImmReg { base, .. }) => {
                base.to_real_reg().unwrap().hw_enc()
            }
            RegisterOrAmode::Amode(Amode::ImmRegRegShift { base, .. }) => {
                base.to_real_reg().unwrap().hw_enc()
            }
            RegisterOrAmode::Amode(Amode::RipRelative { .. }) => 0,
        };

        !(reg >> 3) & 1
    }

    /// Is the 2 byte prefix available for this instruction?
@@ -176,6 +215,7 @@ impl VexInstruction {
        // encoded by the three-byte form of VEX
        !(self.map == OpcodeMap::_0F3A || self.map == OpcodeMap::_0F38)
    }

    /// The last byte of the 2byte and 3byte prefixes is mostly the same, share the common
    /// encoding logic here.
    #[inline(always)]
@@ -225,8 +265,8 @@ impl VexInstruction {
        sink.put1(last_byte);
    }

    /// Emit the VEX-encoded instruction to the code sink:
    pub fn encode<CS: ByteSink + ?Sized>(&self, sink: &mut CS) {
    /// Emit the VEX-encoded instruction to the provided buffer.
    pub fn encode(&self, sink: &mut MachBuffer<Inst>) {
        // 2/3 byte prefix
        if self.use_2byte_prefix() {
            self.encode_2byte_prefix(sink);
@@ -237,13 +277,21 @@ impl VexInstruction {
        // 1 Byte Opcode
        sink.put1(self.opcode);

        // 1 ModRM Byte
        // Not all instructions use Reg as a reg, some use it as an extension of the opcode.
        let rm: u8 = self.rm.into();
        sink.put1(encode_modrm(3, self.reg & 7, rm & 7));

        // TODO: 0/1 byte SIB
        // TODO: 0/1/2/4 bytes DISP
        match &self.rm {
            // Not all instructions use Reg as a reg, some use it as an extension
            // of the opcode.
            RegisterOrAmode::Register(reg) => {
                let rm: u8 = (*reg).into();
                sink.put1(rex::encode_modrm(3, self.reg & 7, rm & 7));
            }
            // For address-based modes reuse the logic from the `rex` module
            // for the modrm and trailing bytes since VEX uses the same
            // encoding.
            RegisterOrAmode::Amode(amode) => {
                let bytes_at_end = if self.imm.is_some() { 1 } else { 0 };
                rex::emit_modrm_sib_disp(sink, self.reg & 7, amode, bytes_at_end);
            }
        }

        // Optional 1 Byte imm
        if let Some(imm) = self.imm {
@@ -278,8 +326,9 @@ impl Default for VexVectorLength {
#[cfg(test)]
mod tests {
    use super::*;
    use crate::isa::x64::inst::args::Gpr;
    use crate::isa::x64::inst::regs;
    use std::vec::Vec;
    use crate::opts::MemFlags;

    #[test]
    fn vpslldq() {
@@ -288,7 +337,7 @@ mod tests {

        let dst = regs::xmm1().to_real_reg().unwrap().hw_enc();
        let src = regs::xmm2().to_real_reg().unwrap().hw_enc();
        let mut sink0 = Vec::new();
        let mut sink = MachBuffer::new();

        VexInstruction::new()
            .length(VexVectorLength::V128)
@@ -299,9 +348,10 @@ mod tests {
            .vvvv(dst)
            .rm(src)
            .imm(0x17)
            .encode(&mut sink0);
            .encode(&mut sink);

        assert_eq!(sink0, vec![0xc5, 0xf1, 0x73, 0xfa, 0x17]);
        let bytes = sink.finish().data;
        assert_eq!(bytes.as_slice(), [0xc5, 0xf1, 0x73, 0xfa, 0x17]);
    }

    #[test]
@@ -314,7 +364,7 @@ mod tests {
        let a = regs::xmm2().to_real_reg().unwrap().hw_enc();
        let b = regs::xmm3().to_real_reg().unwrap().hw_enc();
        let c = regs::xmm4().to_real_reg().unwrap().hw_enc();
        let mut sink0 = Vec::new();
        let mut sink = MachBuffer::new();

        VexInstruction::new()
            .length(VexVectorLength::V128)
@@ -326,9 +376,10 @@ mod tests {
            .vvvv(a)
            .rm(b)
            .imm_reg(c)
            .encode(&mut sink0);
            .encode(&mut sink);

        assert_eq!(sink0, vec![0xc4, 0xe3, 0x69, 0x4b, 0xcb, 0x40]);
        let bytes = sink.finish().data;
        assert_eq!(bytes.as_slice(), [0xc4, 0xe3, 0x69, 0x4b, 0xcb, 0x40]);
    }

    #[test]
@@ -339,7 +390,7 @@ mod tests {
        let dst = regs::xmm10().to_real_reg().unwrap().hw_enc();
        let a = regs::xmm11().to_real_reg().unwrap().hw_enc();
        let b = regs::xmm12().to_real_reg().unwrap().hw_enc();
        let mut sink0 = Vec::new();
        let mut sink = MachBuffer::new();

        VexInstruction::new()
            .length(VexVectorLength::V256)
@@ -350,8 +401,91 @@ mod tests {
            .vvvv(a)
            .rm(b)
            .imm(4)
            .encode(&mut sink0);
            .encode(&mut sink);

        assert_eq!(sink0, vec![0xc4, 0x41, 0x24, 0xc2, 0xd4, 0x04]);
        let bytes = sink.finish().data;
        assert_eq!(bytes.as_slice(), [0xc4, 0x41, 0x24, 0xc2, 0xd4, 0x04]);
    }

    #[test]
    fn vandnps() {
        // VEX.128.0F 55 /r
        // VANDNPS xmm0, xmm1, xmm2

        let dst = regs::xmm2().to_real_reg().unwrap().hw_enc();
        let src1 = regs::xmm1().to_real_reg().unwrap().hw_enc();
        let src2 = regs::xmm0().to_real_reg().unwrap().hw_enc();
        let mut sink = MachBuffer::new();

        VexInstruction::new()
            .length(VexVectorLength::V128)
            .prefix(LegacyPrefixes::None)
            .map(OpcodeMap::_0F)
            .opcode(0x55)
            .reg(dst)
            .vvvv(src1)
            .rm(src2)
            .encode(&mut sink);

        let bytes = sink.finish().data;
        assert_eq!(bytes.as_slice(), [0xc5, 0xf0, 0x55, 0xd0]);
    }

    #[test]
    fn vandnps_mem() {
        // VEX.128.0F 55 /r
        // VANDNPS 10(%r13), xmm1, xmm2

        let dst = regs::xmm2().to_real_reg().unwrap().hw_enc();
        let src1 = regs::xmm1().to_real_reg().unwrap().hw_enc();
        let src2 = Amode::ImmReg {
            base: regs::r13(),
            flags: MemFlags::trusted(),
            simm32: 10,
        };
        let mut sink = MachBuffer::new();

        VexInstruction::new()
            .length(VexVectorLength::V128)
            .prefix(LegacyPrefixes::None)
            .map(OpcodeMap::_0F)
            .opcode(0x55)
            .reg(dst)
            .vvvv(src1)
            .rm(src2)
            .encode(&mut sink);

        let bytes = sink.finish().data;
        assert_eq!(bytes.as_slice(), [0xc4, 0xc1, 0x70, 0x55, 0x55, 0x0a]);
    }

    #[test]
    fn vandnps_more_mem() {
        // VEX.128.0F 55 /r
        // VANDNPS 100(%rax,%r13,4), xmm1, xmm2

        let dst = regs::xmm2().to_real_reg().unwrap().hw_enc();
        let src1 = regs::xmm1().to_real_reg().unwrap().hw_enc();
        let src2 = Amode::ImmRegRegShift {
            base: Gpr::new(regs::rax()).unwrap(),
            index: Gpr::new(regs::r13()).unwrap(),
            flags: MemFlags::trusted(),
            simm32: 100,
            shift: 2,
        };
        let mut sink = MachBuffer::new();

        VexInstruction::new()
            .length(VexVectorLength::V128)
            .prefix(LegacyPrefixes::None)
            .map(OpcodeMap::_0F)
            .opcode(0x55)
            .reg(dst)
            .vvvv(src1)
            .rm(src2)
            .encode(&mut sink);

        let bytes = sink.finish().data;
        assert_eq!(bytes.as_slice(), [0xc4, 0xa1, 0x70, 0x55, 0x54, 0xa8, 100]);
    }
}