Rework of MachInst isel, branch fixups and lowering, and block ordering.
This patch includes:
- A complete rework of the way that CLIF blocks and edge blocks are
lowered into VCode blocks. The new mechanism in `BlockLoweringOrder`
computes RPO over the CFG, but with a twist: it merges edge blocks into
heads or tails of original CLIF blocks wherever possible, and it does
this without ever actually materializing the full nodes-plus-edges
graph first. The backend driver lowers blocks in final order so
there's no need to reshuffle later.
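
  To make this concrete, here is a minimal sketch of the merging rule,
  with simplified types (the real `BlockLoweringOrder` works on CLIF
  `Block`/`Inst` entities, records which branch instruction each edge
  corresponds to, and also merges an edge into the *tail* of a block that
  has exactly one out-edge):

  ```rust
  // Illustrative sketch only; not the exact types in this patch.
  #[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
  enum Lowered {
      Orig { block: usize },
      EdgeAndOrig { pred: usize, block: usize },
      Edge { pred: usize, succ: usize },
  }

  /// Successors of a lowered block in the implicit CLIF-plus-edges graph.
  fn lowered_succs(
      succs_of: &dyn Fn(usize) -> Vec<usize>,
      in_count: &dyn Fn(usize) -> usize,
      b: Lowered,
  ) -> Vec<Lowered> {
      match b {
          // From an original block (possibly with a merged in-edge), each
          // out-edge is merged into the head of a single-predecessor
          // successor; otherwise it stays a standalone edge block.
          Lowered::Orig { block } | Lowered::EdgeAndOrig { block, .. } => succs_of(block)
              .into_iter()
              .map(|succ| {
                  if in_count(succ) == 1 {
                      Lowered::EdgeAndOrig { pred: block, block: succ }
                  } else {
                      Lowered::Edge { pred: block, succ }
                  }
              })
              .collect(),
          // A standalone edge block has exactly one successor: the original
          // block it leads into.
          Lowered::Edge { succ, .. } => vec![Lowered::Orig { block: succ }],
      }
  }
  ```

  A DFS over this successor function, tracking visited original blocks
  only, yields the reverse postorder without ever materializing the
  nodes-plus-edges graph.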
- A new `MachBuffer` that replaces the `MachSection`. This is a special
version of a code-sink that is far more than a humble `Vec<u8>`. In
particular, it keeps a record of label definitions and label uses,
with a machine-pluggable `LabelUse` trait that defines various types
of fixups (basically internal relocations).
Importantly, it implements some simple peephole-style branch rewrites
*inline in the emission pass*, without any separate traversals over
the code to use fallthroughs, swap taken/not-taken arms, etc. It
tracks branches at the tail of the buffer and can (i) remove blocks
that are just unconditional branches (by redirecting the label), (ii)
recognize a conditional/unconditional pair and swap the conditional's
polarity when that helps, and (iii) remove branches whose target is
the fallthrough PC.
The `MachBuffer` also implements branch-island support. On
architectures like AArch64, this is needed because conditional
branches have a limited reach (+/- 1 MB on AArch64 specifically). The
buffer handles this inline too, while streaming through emission,
without any sort of fixpoint algorithm or later moving of
code, by simply tracking outstanding references and "deadlines" and
emitting an island just-in-time when we're in danger of going out of
range.
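
  A rough sketch of the bookkeeping that makes both the peephole rewrites
  and the islands possible inline (hypothetical and heavily simplified;
  the real `MachBuffer` drives ranges and patching through the
  machine-pluggable `LabelUse` trait and tracks the most recent branches
  so the conditional/unconditional rewrites can fire):

  ```rust
  // Hypothetical, simplified sketch; not the actual MachBuffer API.
  type CodeOffset = u32;
  type Label = u32;

  struct Fixup {
      at: CodeOffset,        // offset of the use (the field to patch)
      label: Label,          // label it refers to
      max_range: CodeOffset, // how far forward this kind of use can reach
  }

  struct Buffer {
      data: Vec<u8>,
      label_offsets: Vec<Option<CodeOffset>>, // label -> bound offset, once known
      fixups: Vec<Fixup>,
      // Earliest offset by which some pending fixup must be resolved (or
      // given a veneer) before it goes out of range.
      island_deadline: CodeOffset,
  }

  impl Buffer {
      fn new(num_labels: usize) -> Buffer {
          Buffer {
              data: vec![],
              label_offsets: vec![None; num_labels],
              fixups: vec![],
              island_deadline: CodeOffset::MAX,
          }
      }

      fn cur_offset(&self) -> CodeOffset {
          self.data.len() as CodeOffset
      }

      /// Record a label use at the current offset and tighten the deadline.
      fn use_label(&mut self, label: Label, max_range: CodeOffset) {
          let at = self.cur_offset();
          self.island_deadline = self.island_deadline.min(at + max_range);
          self.fixups.push(Fixup { at, label, max_range });
      }

      /// Checked before emitting each instruction: if emitting it could push
      /// a pending fixup past its reach, emit an island (veneers) first.
      fn island_needed(&self, worst_case_inst_size: CodeOffset) -> bool {
          self.cur_offset() + worst_case_inst_size > self.island_deadline
      }

      /// Bind a label at the current offset. If the last thing emitted was an
      /// unconditional branch to this very label, it targets the fallthrough
      /// PC and can simply be chopped off (a real implementation would also
      /// remove or redirect the branch's fixup and any labels bound to it).
      fn bind_label(&mut self, label: Label, last_uncond_branch: Option<(CodeOffset, Label)>) {
          if let Some((branch_start, target)) = last_uncond_branch {
              if target == label {
                  self.data.truncate(branch_start as usize);
              }
          }
          self.label_offsets[label as usize] = Some(self.cur_offset());
      }
  }
  ```

  The deadline only tightens as uses accumulate, so a single comparison
  before each instruction suffices; no fixpoint pass over the code is
  needed.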
- A rework of the instruction selector driver. It largely follows the
same algorithm as before, but is cleaned up significantly, in
particular in the API: the machine backend can ask for an input arg
and get any of three forms (constant, register, producing
instruction), and can indicate whether it needs the value in a register
or can merge the constant or producing instruction as appropriate. This
new driver
takes special care to emit constants right at use-sites (and at phi
inputs), minimizing their live-ranges, and also special-cases the
"pinned register" to avoid superfluous moves.
Overall, on `bz2.wasm`, the results are:
wasmtime full run (compile + runtime) of bz2:
baseline: 9774M insns, 9742M cycles, 3.918s
w/ changes: 7012M insns, 6888M cycles, 2.958s (24.5% faster, 28.3% fewer insns)
clif-util wasm compile bz2:
baseline: 2633M insns, 3278M cycles, 1.034s
w/ changes: 2366M insns, 2920M cycles, 0.923s (10.7% faster, 10.1% fewer insns)
All numbers are averages of two runs on an Ampere eMAG.
@@ -1,49 +1,579 @@
//! Computation of basic block order in emitted code.
//!
//! This module handles the translation from CLIF BBs to VCode BBs.
//!
//! The basic idea is that we compute a sequence of "lowered blocks" that
//! correspond to subgraphs of the CLIF CFG plus an implicit block on *every*
//! edge (not just critical edges). Conceptually, the lowering pipeline wants to
//! insert moves for phi-nodes on every block-to-block transfer; these blocks
//! always conceptually exist, but may be merged with an "original" CLIF block
//! (and hence not actually exist; this is equivalent to inserting the blocks
//! only on critical edges).
//!
//! Each `LoweredBlock` names just an original CLIF block, an original CLIF
//! block prepended or appended with an edge block (never both, though), or just
//! an edge block.
//!
//! To compute this lowering, we do a DFS over the CLIF-plus-edge-block graph
//! (never actually materialized, just defined by a "successors" function), and
//! compute the reverse postorder.
//!
//! This algorithm isn't perfect w.r.t. generated code quality: we don't, for
//! example, consider any information about whether edge blocks will actually
//! have content, because this computation happens as part of lowering *before*
//! regalloc, and regalloc may or may not insert moves/spills/reloads on any
//! particular edge. But it works relatively well and is conceptually simple.
|
||||
use crate::entity::SecondaryMap;
|
||||
use crate::fx::{FxHashMap, FxHashSet};
|
||||
use crate::ir::{Block, Function, Inst, Opcode};
|
||||
use crate::machinst::lower::visit_block_succs;
|
||||
use crate::machinst::*;
|
||||
|
||||
/// Simple reverse postorder-based block order emission.
|
||||
///
|
||||
/// TODO: use a proper algorithm, such as the bottom-up straight-line-section
|
||||
/// construction algorithm.
|
||||
struct BlockRPO {
|
||||
visited: Vec<bool>,
|
||||
postorder: Vec<BlockIndex>,
|
||||
use log::debug;
|
||||
use smallvec::SmallVec;
|
||||
|
||||
/// Mapping from CLIF BBs to VCode BBs.
|
||||
#[derive(Debug)]
|
||||
pub struct BlockLoweringOrder {
|
||||
/// Lowered blocks, in BlockIndex order. Each block is some combination of
|
||||
/// (i) a CLIF block, and (ii) inserted crit-edge blocks before or after;
|
||||
/// see [LoweredBlock] for details.
|
||||
lowered_order: Vec<LoweredBlock>,
|
||||
/// Successors for all lowered blocks, in one serialized vector. Indexed by
|
||||
/// the ranges in `lowered_succ_ranges`.
|
||||
lowered_succs: Vec<(Inst, LoweredBlock)>,
|
||||
/// BlockIndex values for successors for all lowered blocks, in the same
|
||||
/// order as `lowered_succs`.
|
||||
lowered_succ_indices: Vec<(Inst, BlockIndex)>,
|
||||
/// Ranges in `lowered_succs` giving the successor lists for each lowered
|
||||
/// block. Indexed by lowering-order index (`BlockIndex`).
|
||||
lowered_succ_ranges: Vec<(usize, usize)>,
|
||||
/// Mapping from CLIF BB to BlockIndex (index in lowered order). Note that
|
||||
/// some CLIF BBs may not be lowered; in particular, we skip unreachable
|
||||
/// blocks.
|
||||
orig_map: SecondaryMap<Block, Option<BlockIndex>>,
|
||||
}
|
||||
|
||||
impl BlockRPO {
|
||||
fn new<I: VCodeInst>(vcode: &VCode<I>) -> BlockRPO {
|
||||
BlockRPO {
|
||||
visited: vec![false; vcode.num_blocks()],
|
||||
postorder: Vec::with_capacity(vcode.num_blocks()),
|
||||
/// The origin of a block in the lowered block-order: either an original CLIF
|
||||
/// block, or an inserted edge-block, or a combination of the two if an edge is
|
||||
/// non-critical.
|
||||
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
|
||||
pub enum LoweredBlock {
|
||||
/// Block in original CLIF, with no merged edge-blocks.
|
||||
Orig {
|
||||
/// Original CLIF block.
|
||||
block: Block,
|
||||
},
|
||||
/// Block in the original CLIF, plus edge-block to one succ (which is the
|
||||
/// one successor of the original block).
|
||||
OrigAndEdge {
|
||||
/// The original CLIF block contained in this lowered block.
|
||||
block: Block,
|
||||
/// The edge (jump) instruction transitioning from this block
|
||||
/// to the next, i.e., corresponding to the included edge-block. This
|
||||
/// will be an instruction in `block`.
|
||||
edge_inst: Inst,
|
||||
/// The successor CLIF block.
|
||||
succ: Block,
|
||||
},
|
||||
/// Block in the original CLIF, preceded by edge-block from one pred (which
|
||||
/// is the one pred of the original block).
|
||||
EdgeAndOrig {
|
||||
/// The previous CLIF block, i.e., the edge block's predecessor.
|
||||
pred: Block,
|
||||
/// The edge (jump) instruction corresponding to the included
|
||||
/// edge-block. This will be an instruction in `pred`.
|
||||
edge_inst: Inst,
|
||||
/// The original CLIF block included in this lowered block.
|
||||
block: Block,
|
||||
},
|
||||
/// Split critical edge between two CLIF blocks. This lowered block does not
|
||||
/// correspond to any original CLIF blocks; it only serves as an insertion
|
||||
/// point for work to happen on the transition from `pred` to `succ`.
|
||||
Edge {
|
||||
/// The predecessor CLIF block.
|
||||
pred: Block,
|
||||
/// The edge (jump) instruction corresponding to this edge's transition.
|
||||
/// This will be an instruction in `pred`.
|
||||
edge_inst: Inst,
|
||||
/// The successor CLIF block.
|
||||
succ: Block,
|
||||
},
|
||||
}
|
||||
|
||||
impl LoweredBlock {
|
||||
/// The associated original (CLIF) block included in this lowered block, if
|
||||
/// any.
|
||||
pub fn orig_block(self) -> Option<Block> {
|
||||
match self {
|
||||
LoweredBlock::Orig { block, .. }
|
||||
| LoweredBlock::OrigAndEdge { block, .. }
|
||||
| LoweredBlock::EdgeAndOrig { block, .. } => Some(block),
|
||||
LoweredBlock::Edge { .. } => None,
|
||||
}
|
||||
}
|
||||
|
||||
fn visit<I: VCodeInst>(&mut self, vcode: &VCode<I>, block: BlockIndex) {
|
||||
self.visited[block as usize] = true;
|
||||
for succ in vcode.succs(block) {
|
||||
if !self.visited[succ.get() as usize] {
|
||||
self.visit(vcode, succ.get());
|
||||
/// The associated in-edge, if any.
|
||||
pub fn in_edge(self) -> Option<(Block, Inst, Block)> {
|
||||
match self {
|
||||
LoweredBlock::EdgeAndOrig {
|
||||
pred,
|
||||
edge_inst,
|
||||
block,
|
||||
} => Some((pred, edge_inst, block)),
|
||||
_ => None,
|
||||
}
|
||||
}
|
||||
|
||||
/// the associated out-edge, if any. Also includes edge-only blocks.
|
||||
pub fn out_edge(self) -> Option<(Block, Inst, Block)> {
|
||||
match self {
|
||||
LoweredBlock::OrigAndEdge {
|
||||
block,
|
||||
edge_inst,
|
||||
succ,
|
||||
} => Some((block, edge_inst, succ)),
|
||||
LoweredBlock::Edge {
|
||||
pred,
|
||||
edge_inst,
|
||||
succ,
|
||||
} => Some((pred, edge_inst, succ)),
|
||||
_ => None,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl BlockLoweringOrder {
|
||||
/// Compute and return a lowered block order for `f`.
|
||||
pub fn new(f: &Function) -> BlockLoweringOrder {
|
||||
debug!("BlockLoweringOrder: function body {:?}", f);
|
||||
|
||||
// Step 1: compute the in-edge and out-edge count of every block.
|
||||
let mut block_in_count = SecondaryMap::with_default(0);
|
||||
let mut block_out_count = SecondaryMap::with_default(0);
|
||||
|
||||
// Cache the block successors to avoid re-examining branches below.
|
||||
let mut block_succs: SmallVec<[(Inst, Block); 128]> = SmallVec::new();
|
||||
let mut block_succ_range = SecondaryMap::with_default((0, 0));
|
||||
let mut fallthrough_return_block = None;
|
||||
for block in f.layout.blocks() {
|
||||
let block_succ_start = block_succs.len();
|
||||
visit_block_succs(f, block, |inst, succ| {
|
||||
block_out_count[block] += 1;
|
||||
block_in_count[succ] += 1;
|
||||
block_succs.push((inst, succ));
|
||||
});
|
||||
let block_succ_end = block_succs.len();
|
||||
block_succ_range[block] = (block_succ_start, block_succ_end);
|
||||
|
||||
for inst in f.layout.block_likely_branches(block) {
|
||||
if f.dfg[inst].opcode() == Opcode::Return {
|
||||
// Implicit output edge for any return.
|
||||
block_out_count[block] += 1;
|
||||
}
|
||||
if f.dfg[inst].opcode() == Opcode::FallthroughReturn {
|
||||
// Fallthrough return block must come last.
|
||||
debug_assert!(fallthrough_return_block == None);
|
||||
fallthrough_return_block = Some(block);
|
||||
}
|
||||
}
|
||||
}
|
||||
if Some(block) != vcode.fallthrough_return_block {
|
||||
self.postorder.push(block);
|
||||
// Implicit input edge for entry block.
|
||||
if let Some(entry) = f.layout.entry_block() {
|
||||
block_in_count[entry] += 1;
|
||||
}
|
||||
|
||||
// Here we define the implicit CLIF-plus-edges graph. There are
|
||||
// conceptually two such graphs: the original, with every edge explicit,
|
||||
// and the merged one, with blocks (represented by `LoweredBlock`
|
||||
// values) that contain original CLIF blocks, edges, or both. This
|
||||
// function returns a lowered block's successors as per the latter, with
|
||||
// consideration to edge-block merging.
|
||||
//
|
||||
// Note that there is a property of the block-merging rules below
|
||||
// that is very important to ensure we don't miss any lowered blocks:
|
||||
// any block in the implicit CLIF-plus-edges graph will *only* be
|
||||
// included in one block in the merged graph.
|
||||
//
|
||||
// This, combined with the property that every edge block is reachable
|
||||
// only from one predecessor (and hence cannot be reached by a DFS
|
||||
// backedge), means that it is sufficient in our DFS below to track
|
||||
// visited-bits per original CLIF block only, not per edge. This greatly
|
||||
// simplifies the data structures (no need to keep a sparse hash-set of
|
||||
// (block, block) tuples).
|
||||
let compute_lowered_succs = |ret: &mut Vec<(Inst, LoweredBlock)>, block: LoweredBlock| {
|
||||
let start_idx = ret.len();
|
||||
match block {
|
||||
LoweredBlock::Orig { block } | LoweredBlock::EdgeAndOrig { block, .. } => {
|
||||
// At an orig block; successors are always edge blocks,
|
||||
// possibly with orig blocks following.
|
||||
let range = block_succ_range[block];
|
||||
for &(edge_inst, succ) in &block_succs[range.0..range.1] {
|
||||
if block_in_count[succ] == 1 {
|
||||
ret.push((
|
||||
edge_inst,
|
||||
LoweredBlock::EdgeAndOrig {
|
||||
pred: block,
|
||||
edge_inst,
|
||||
block: succ,
|
||||
},
|
||||
));
|
||||
} else {
|
||||
ret.push((
|
||||
edge_inst,
|
||||
LoweredBlock::Edge {
|
||||
pred: block,
|
||||
edge_inst,
|
||||
succ,
|
||||
},
|
||||
));
|
||||
}
|
||||
}
|
||||
}
|
||||
LoweredBlock::Edge {
|
||||
succ, edge_inst, ..
|
||||
}
|
||||
| LoweredBlock::OrigAndEdge {
|
||||
succ, edge_inst, ..
|
||||
} => {
|
||||
// At an edge block; successors are always orig blocks,
|
||||
// possibly with edge blocks following.
|
||||
if block_out_count[succ] == 1 {
|
||||
let range = block_succ_range[succ];
|
||||
// check if the one succ is a real CFG edge (vs.
|
||||
// implicit return succ).
|
||||
if range.1 - range.0 > 0 {
|
||||
debug_assert!(range.1 - range.0 == 1);
|
||||
let (succ_edge_inst, succ_succ) = block_succs[range.0];
|
||||
ret.push((
|
||||
edge_inst,
|
||||
LoweredBlock::OrigAndEdge {
|
||||
block: succ,
|
||||
edge_inst: succ_edge_inst,
|
||||
succ: succ_succ,
|
||||
},
|
||||
));
|
||||
} else {
|
||||
ret.push((edge_inst, LoweredBlock::Orig { block: succ }));
|
||||
}
|
||||
} else {
|
||||
ret.push((edge_inst, LoweredBlock::Orig { block: succ }));
|
||||
}
|
||||
}
|
||||
}
|
||||
let end_idx = ret.len();
|
||||
(start_idx, end_idx)
|
||||
};
|
||||
|
||||
// Build the explicit LoweredBlock-to-LoweredBlock successors list.
|
||||
let mut lowered_succs = vec![];
|
||||
let mut lowered_succ_indices = vec![];
|
||||
|
||||
// Step 2: Compute RPO traversal of the implicit CLIF-plus-edge-block graph. Use an
|
||||
// explicit stack so we don't overflow the real stack with a deep DFS.
|
||||
#[derive(Debug)]
|
||||
struct StackEntry {
|
||||
this: LoweredBlock,
|
||||
succs: (usize, usize), // range in lowered_succs
|
||||
cur_succ: usize, // index in lowered_succs
|
||||
}
|
||||
|
||||
let mut stack: SmallVec<[StackEntry; 16]> = SmallVec::new();
|
||||
let mut visited = FxHashSet::default();
|
||||
let mut postorder = vec![];
|
||||
if let Some(entry) = f.layout.entry_block() {
|
||||
// FIXME(cfallin): we might be able to use OrigAndEdge. Find a way
|
||||
// to not special-case the entry block here.
|
||||
let block = LoweredBlock::Orig { block: entry };
|
||||
visited.insert(block);
|
||||
let range = compute_lowered_succs(&mut lowered_succs, block);
|
||||
lowered_succ_indices.resize(lowered_succs.len(), 0);
|
||||
stack.push(StackEntry {
|
||||
this: block,
|
||||
succs: range,
|
||||
cur_succ: range.1,
|
||||
});
|
||||
}
|
||||
|
||||
let mut deferred_last = None;
|
||||
while !stack.is_empty() {
|
||||
let stack_entry = stack.last_mut().unwrap();
|
||||
let range = stack_entry.succs;
|
||||
if stack_entry.cur_succ == range.0 {
|
||||
let orig_block = stack_entry.this.orig_block();
|
||||
if orig_block.is_some() && orig_block == fallthrough_return_block {
|
||||
deferred_last = Some((stack_entry.this, range));
|
||||
} else {
|
||||
postorder.push((stack_entry.this, range));
|
||||
}
|
||||
stack.pop();
|
||||
} else {
|
||||
// Heuristic: chase the children in reverse. This puts the first
|
||||
// successor block first in RPO, all other things being equal,
|
||||
// which tends to prioritize loop backedges over out-edges,
|
||||
// putting the edge-block closer to the loop body and minimizing
|
||||
// live-ranges in linear instruction space.
|
||||
let next = lowered_succs[stack_entry.cur_succ - 1].1;
|
||||
stack_entry.cur_succ -= 1;
|
||||
if visited.contains(&next) {
|
||||
continue;
|
||||
}
|
||||
visited.insert(next);
|
||||
let range = compute_lowered_succs(&mut lowered_succs, next);
|
||||
lowered_succ_indices.resize(lowered_succs.len(), 0);
|
||||
stack.push(StackEntry {
|
||||
this: next,
|
||||
succs: range,
|
||||
cur_succ: range.1,
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
postorder.reverse();
|
||||
let mut rpo = postorder;
|
||||
if let Some(d) = deferred_last {
|
||||
rpo.push(d);
|
||||
}
|
||||
|
||||
// Step 3: now that we have RPO, build the BlockIndex/BB fwd/rev maps.
|
||||
let mut lowered_order = vec![];
|
||||
let mut lowered_succ_ranges = vec![];
|
||||
let mut lb_to_bindex = FxHashMap::default();
|
||||
for (block, succ_range) in rpo.into_iter() {
|
||||
lb_to_bindex.insert(block, lowered_order.len() as BlockIndex);
|
||||
lowered_order.push(block);
|
||||
lowered_succ_ranges.push(succ_range);
|
||||
}
|
||||
|
||||
let lowered_succ_indices = lowered_succs
|
||||
.iter()
|
||||
.map(|&(inst, succ)| (inst, lb_to_bindex.get(&succ).cloned().unwrap()))
|
||||
.collect();
|
||||
|
||||
let mut orig_map = SecondaryMap::with_default(None);
|
||||
for (i, lb) in lowered_order.iter().enumerate() {
|
||||
let i = i as BlockIndex;
|
||||
if let Some(b) = lb.orig_block() {
|
||||
orig_map[b] = Some(i);
|
||||
}
|
||||
}
|
||||
|
||||
let result = BlockLoweringOrder {
|
||||
lowered_order,
|
||||
lowered_succs,
|
||||
lowered_succ_indices,
|
||||
lowered_succ_ranges,
|
||||
orig_map,
|
||||
};
|
||||
debug!("BlockLoweringOrder: {:?}", result);
|
||||
result
|
||||
}
|
||||
|
||||
fn rpo<I: VCodeInst>(self, vcode: &VCode<I>) -> Vec<BlockIndex> {
|
||||
let mut rpo = self.postorder;
|
||||
rpo.reverse();
|
||||
if let Some(block) = vcode.fallthrough_return_block {
|
||||
rpo.push(block);
|
||||
}
|
||||
rpo
|
||||
/// Get the lowered order of blocks.
|
||||
pub fn lowered_order(&self) -> &[LoweredBlock] {
|
||||
&self.lowered_order[..]
|
||||
}
|
||||
|
||||
/// Get the successors for a lowered block, by index in `lowered_order()`'s
|
||||
/// returned slice. Each successsor is paired with the edge-instruction
|
||||
/// (branch) corresponding to this edge.
|
||||
pub fn succs(&self, block: BlockIndex) -> &[(Inst, LoweredBlock)] {
|
||||
let range = self.lowered_succ_ranges[block as usize];
|
||||
&self.lowered_succs[range.0..range.1]
|
||||
}
|
||||
|
||||
/// Get the successor indices for a lowered block.
|
||||
pub fn succ_indices(&self, block: BlockIndex) -> &[(Inst, BlockIndex)] {
|
||||
let range = self.lowered_succ_ranges[block as usize];
|
||||
&self.lowered_succ_indices[range.0..range.1]
|
||||
}
|
||||
|
||||
/// Get the lowered block index containing a CLIF block, if any. (May not be
|
||||
/// present if the original CLIF block was unreachable.)
|
||||
pub fn lowered_block_for_bb(&self, bb: Block) -> Option<BlockIndex> {
|
||||
self.orig_map[bb]
|
||||
}
|
||||
}
|
||||
|
||||
/// Compute the final block order.
|
||||
pub fn compute_final_block_order<I: VCodeInst>(vcode: &VCode<I>) -> Vec<BlockIndex> {
|
||||
let mut rpo = BlockRPO::new(vcode);
|
||||
rpo.visit(vcode, vcode.entry());
|
||||
rpo.rpo(vcode)
|
||||
#[cfg(test)]
|
||||
mod test {
|
||||
use super::*;
|
||||
use crate::cursor::{Cursor, FuncCursor};
|
||||
use crate::ir::types::*;
|
||||
use crate::ir::{AbiParam, ExternalName, Function, InstBuilder, Signature};
|
||||
use crate::isa::CallConv;
|
||||
|
||||
fn build_test_func(n_blocks: usize, edges: &[(usize, usize)]) -> Function {
|
||||
assert!(n_blocks > 0);
|
||||
|
||||
let name = ExternalName::testcase("test0");
|
||||
let mut sig = Signature::new(CallConv::SystemV);
|
||||
sig.params.push(AbiParam::new(I32));
|
||||
let mut func = Function::with_name_signature(name, sig);
|
||||
let blocks = (0..n_blocks)
|
||||
.map(|i| {
|
||||
let bb = func.dfg.make_block();
|
||||
assert!(bb.as_u32() == i as u32);
|
||||
bb
|
||||
})
|
||||
.collect::<Vec<_>>();
|
||||
|
||||
let arg0 = func.dfg.append_block_param(blocks[0], I32);
|
||||
|
||||
let mut pos = FuncCursor::new(&mut func);
|
||||
|
||||
let mut edge = 0;
|
||||
for i in 0..n_blocks {
|
||||
pos.insert_block(blocks[i]);
|
||||
let mut succs = vec![];
|
||||
while edge < edges.len() && edges[edge].0 == i {
|
||||
succs.push(edges[edge].1);
|
||||
edge += 1;
|
||||
}
|
||||
if succs.len() == 0 {
|
||||
pos.ins().return_(&[arg0]);
|
||||
} else if succs.len() == 1 {
|
||||
pos.ins().jump(blocks[succs[0]], &[]);
|
||||
} else if succs.len() == 2 {
|
||||
pos.ins().brnz(arg0, blocks[succs[0]], &[]);
|
||||
pos.ins().jump(blocks[succs[1]], &[]);
|
||||
} else {
|
||||
panic!("Too many successors");
|
||||
}
|
||||
}
|
||||
|
||||
func
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_blockorder_diamond() {
|
||||
let func = build_test_func(4, &[(0, 1), (0, 2), (1, 3), (2, 3)]);
|
||||
let order = BlockLoweringOrder::new(&func);
|
||||
|
||||
assert_eq!(order.lowered_order.len(), 6);
|
||||
|
||||
assert!(order.lowered_order[0].orig_block().unwrap().as_u32() == 0);
|
||||
assert!(order.lowered_order[0].in_edge().is_none());
|
||||
assert!(order.lowered_order[0].out_edge().is_none());
|
||||
|
||||
assert!(order.lowered_order[1].orig_block().unwrap().as_u32() == 1);
|
||||
assert!(order.lowered_order[1].in_edge().unwrap().0.as_u32() == 0);
|
||||
assert!(order.lowered_order[1].in_edge().unwrap().2.as_u32() == 1);
|
||||
|
||||
assert!(order.lowered_order[2].orig_block().is_none());
|
||||
assert!(order.lowered_order[2].in_edge().is_none());
|
||||
assert!(order.lowered_order[2].out_edge().unwrap().0.as_u32() == 1);
|
||||
assert!(order.lowered_order[2].out_edge().unwrap().2.as_u32() == 3);
|
||||
|
||||
assert!(order.lowered_order[3].orig_block().unwrap().as_u32() == 2);
|
||||
assert!(order.lowered_order[3].in_edge().unwrap().0.as_u32() == 0);
|
||||
assert!(order.lowered_order[3].in_edge().unwrap().2.as_u32() == 2);
|
||||
assert!(order.lowered_order[3].out_edge().is_none());
|
||||
|
||||
assert!(order.lowered_order[4].orig_block().is_none());
|
||||
assert!(order.lowered_order[4].in_edge().is_none());
|
||||
assert!(order.lowered_order[4].out_edge().unwrap().0.as_u32() == 2);
|
||||
assert!(order.lowered_order[4].out_edge().unwrap().2.as_u32() == 3);
|
||||
|
||||
assert!(order.lowered_order[5].orig_block().unwrap().as_u32() == 3);
|
||||
assert!(order.lowered_order[5].in_edge().is_none());
|
||||
assert!(order.lowered_order[5].out_edge().is_none());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_blockorder_critedge() {
|
||||
// 0
|
||||
// / \
|
||||
// 1 2
|
||||
// / \ \
|
||||
// 3 4 |
|
||||
// |\ _|____|
|
||||
// | \/ |
|
||||
// | /\ |
|
||||
// 5 6
|
||||
//
|
||||
// (3 -> 5, 3 -> 6, 4 -> 6 are critical edges and must be split)
|
||||
//
|
||||
let func = build_test_func(
|
||||
7,
|
||||
&[
|
||||
(0, 1),
|
||||
(0, 2),
|
||||
(1, 3),
|
||||
(1, 4),
|
||||
(2, 5),
|
||||
(3, 5),
|
||||
(3, 6),
|
||||
(4, 6),
|
||||
],
|
||||
);
|
||||
let order = BlockLoweringOrder::new(&func);
|
||||
|
||||
assert_eq!(order.lowered_order.len(), 11);
|
||||
println!("ordered = {:?}", order.lowered_order);
|
||||
|
||||
// block 0
|
||||
assert!(order.lowered_order[0].orig_block().unwrap().as_u32() == 0);
|
||||
assert!(order.lowered_order[0].in_edge().is_none());
|
||||
assert!(order.lowered_order[0].out_edge().is_none());
|
||||
|
||||
// edge 0->1 + block 1
|
||||
assert!(order.lowered_order[1].orig_block().unwrap().as_u32() == 1);
|
||||
assert!(order.lowered_order[1].in_edge().unwrap().0.as_u32() == 0);
|
||||
assert!(order.lowered_order[1].in_edge().unwrap().2.as_u32() == 1);
|
||||
assert!(order.lowered_order[1].out_edge().is_none());
|
||||
|
||||
// edge 1->3 + block 3
|
||||
assert!(order.lowered_order[2].orig_block().unwrap().as_u32() == 3);
|
||||
assert!(order.lowered_order[2].in_edge().unwrap().0.as_u32() == 1);
|
||||
assert!(order.lowered_order[2].in_edge().unwrap().2.as_u32() == 3);
|
||||
assert!(order.lowered_order[2].out_edge().is_none());
|
||||
|
||||
// edge 3->5
|
||||
assert!(order.lowered_order[3].orig_block().is_none());
|
||||
assert!(order.lowered_order[3].in_edge().is_none());
|
||||
assert!(order.lowered_order[3].out_edge().unwrap().0.as_u32() == 3);
|
||||
assert!(order.lowered_order[3].out_edge().unwrap().2.as_u32() == 5);
|
||||
|
||||
// edge 3->6
|
||||
assert!(order.lowered_order[4].orig_block().is_none());
|
||||
assert!(order.lowered_order[4].in_edge().is_none());
|
||||
assert!(order.lowered_order[4].out_edge().unwrap().0.as_u32() == 3);
|
||||
assert!(order.lowered_order[4].out_edge().unwrap().2.as_u32() == 6);
|
||||
|
||||
// edge 1->4 + block 4
|
||||
assert!(order.lowered_order[5].orig_block().unwrap().as_u32() == 4);
|
||||
assert!(order.lowered_order[5].in_edge().unwrap().0.as_u32() == 1);
|
||||
assert!(order.lowered_order[5].in_edge().unwrap().2.as_u32() == 4);
|
||||
assert!(order.lowered_order[5].out_edge().is_none());
|
||||
|
||||
// edge 4->6
|
||||
assert!(order.lowered_order[6].orig_block().is_none());
|
||||
assert!(order.lowered_order[6].in_edge().is_none());
|
||||
assert!(order.lowered_order[6].out_edge().unwrap().0.as_u32() == 4);
|
||||
assert!(order.lowered_order[6].out_edge().unwrap().2.as_u32() == 6);
|
||||
|
||||
// block 6
|
||||
assert!(order.lowered_order[7].orig_block().unwrap().as_u32() == 6);
|
||||
assert!(order.lowered_order[7].in_edge().is_none());
|
||||
assert!(order.lowered_order[7].out_edge().is_none());
|
||||
|
||||
// edge 0->2 + block 2
|
||||
assert!(order.lowered_order[8].orig_block().unwrap().as_u32() == 2);
|
||||
assert!(order.lowered_order[8].in_edge().unwrap().0.as_u32() == 0);
|
||||
assert!(order.lowered_order[8].in_edge().unwrap().2.as_u32() == 2);
|
||||
assert!(order.lowered_order[8].out_edge().is_none());
|
||||
|
||||
// edge 2->5
|
||||
assert!(order.lowered_order[9].orig_block().is_none());
|
||||
assert!(order.lowered_order[9].in_edge().is_none());
|
||||
assert!(order.lowered_order[9].out_edge().unwrap().0.as_u32() == 2);
|
||||
assert!(order.lowered_order[9].out_edge().unwrap().2.as_u32() == 5);
|
||||
|
||||
// block 5
|
||||
assert!(order.lowered_order[10].orig_block().unwrap().as_u32() == 5);
|
||||
assert!(order.lowered_order[10].in_edge().is_none());
|
||||
assert!(order.lowered_order[10].out_edge().is_none());
|
||||
}
|
||||
}
|
||||
|
||||
cranelift/codegen/src/machinst/buffer.rs (new file, 1035 lines; diff suppressed because it is too large)
@@ -18,8 +18,12 @@ pub fn compile<B: LowerBackend + MachBackend>(
|
||||
where
|
||||
B::MInst: ShowWithRRU,
|
||||
{
|
||||
// This lowers the CL IR.
|
||||
let mut vcode = Lower::new(f, abi)?.lower(b)?;
|
||||
// Compute lowered block order.
|
||||
let block_order = BlockLoweringOrder::new(f);
|
||||
// Build the lowering context.
|
||||
let lower = Lower::new(f, abi, block_order)?;
|
||||
// Lower the IR.
|
||||
let mut vcode = lower.lower(b)?;
|
||||
|
||||
debug!(
|
||||
"vcode from lowering: \n{}",
|
||||
@@ -65,11 +69,6 @@ where
|
||||
// all at once. This also inserts prologues/epilogues.
|
||||
vcode.replace_insns_from_regalloc(result);
|
||||
|
||||
vcode.remove_redundant_branches();
|
||||
|
||||
// Do final passes over code to finalize branches.
|
||||
vcode.finalize_branches();
|
||||
|
||||
debug!(
|
||||
"vcode after regalloc: final version:\n{}",
|
||||
vcode.show_rru(Some(b.reg_universe()))
|
||||
|
||||
(File diff suppressed because it is too large)
@@ -109,6 +109,7 @@ use regalloc::RegUsageCollector;
|
||||
use regalloc::{
|
||||
RealReg, RealRegUniverse, Reg, RegClass, RegUsageMapper, SpillSlot, VirtualReg, Writable,
|
||||
};
|
||||
use smallvec::SmallVec;
|
||||
use std::string::String;
|
||||
use target_lexicon::Triple;
|
||||
|
||||
@@ -124,8 +125,8 @@ pub mod abi;
|
||||
pub use abi::*;
|
||||
pub mod pretty_print;
|
||||
pub use pretty_print::*;
|
||||
pub mod sections;
|
||||
pub use sections::*;
|
||||
pub mod buffer;
|
||||
pub use buffer::*;
|
||||
pub mod adapter;
|
||||
pub use adapter::*;
|
||||
|
||||
@@ -152,6 +153,9 @@ pub trait MachInst: Clone + Debug {
|
||||
/// Generate a move.
|
||||
fn gen_move(to_reg: Writable<Reg>, from_reg: Reg, ty: Type) -> Self;
|
||||
|
||||
/// Generate a constant into a reg.
|
||||
fn gen_constant(to_reg: Writable<Reg>, value: u64, ty: Type) -> SmallVec<[Self; 4]>;
|
||||
|
||||
/// Generate a zero-length no-op.
|
||||
fn gen_zero_len_nop() -> Self;
|
||||
|
||||
@@ -166,7 +170,7 @@ pub trait MachInst: Clone + Debug {
|
||||
|
||||
/// Generate a jump to another target. Used during lowering of
|
||||
/// control flow.
|
||||
fn gen_jump(target: BlockIndex) -> Self;
|
||||
fn gen_jump(target: MachLabel) -> Self;
|
||||
|
||||
/// Generate a NOP. The `preferred_size` parameter allows the caller to
|
||||
/// request a NOP of that size, or as close to it as possible. The machine
|
||||
@@ -175,22 +179,62 @@ pub trait MachInst: Clone + Debug {
|
||||
/// the instruction must have a nonzero size.
|
||||
fn gen_nop(preferred_size: usize) -> Self;
|
||||
|
||||
/// Rewrite block targets using the block-target map.
|
||||
fn with_block_rewrites(&mut self, block_target_map: &[BlockIndex]);
|
||||
|
||||
/// Finalize branches once the block order (fallthrough) is known.
|
||||
fn with_fallthrough_block(&mut self, fallthrough_block: Option<BlockIndex>);
|
||||
|
||||
/// Update instruction once block offsets are known. These offsets are
|
||||
/// relative to the beginning of the function. `targets` is indexed by
|
||||
/// BlockIndex.
|
||||
fn with_block_offsets(&mut self, my_offset: CodeOffset, targets: &[CodeOffset]);
|
||||
/// Get the register universe for this backend.
|
||||
fn reg_universe(flags: &Flags) -> RealRegUniverse;
|
||||
|
||||
/// Align a basic block offset (from start of function). By default, no
|
||||
/// alignment occurs.
|
||||
fn align_basic_block(offset: CodeOffset) -> CodeOffset {
|
||||
offset
|
||||
}
|
||||
|
||||
/// What is the worst-case instruction size emitted by this instruction type?
|
||||
fn worst_case_size() -> CodeOffset;
|
||||
|
||||
/// A label-use kind: a type that describes the types of label references that
|
||||
/// can occur in an instruction.
|
||||
type LabelUse: MachInstLabelUse;
|
||||
}
|
||||
|
||||
/// A descriptor of a label reference (use) in an instruction set.
|
||||
pub trait MachInstLabelUse: Clone + Copy + Debug + Eq {
|
||||
/// Required alignment for any veneer. Usually the required instruction
|
||||
/// alignment (e.g., 4 for a RISC with 32-bit instructions, or 1 for x86).
|
||||
const ALIGN: CodeOffset;
|
||||
|
||||
/// What is the maximum PC-relative range (positive)? E.g., if `1024`, a
|
||||
/// label-reference fixup at offset `x` is valid if the label resolves to `x
|
||||
/// + 1024`.
|
||||
fn max_pos_range(self) -> CodeOffset;
|
||||
/// What is the maximum PC-relative range (negative)? This is the absolute
|
||||
/// value; i.e., if `1024`, then a label-reference fixup at offset `x` is
|
||||
/// valid if the label resolves to `x - 1024`.
|
||||
fn max_neg_range(self) -> CodeOffset;
|
||||
/// What is the size of code-buffer slice this label-use needs to patch in
|
||||
/// the label's value?
|
||||
fn patch_size(self) -> CodeOffset;
|
||||
/// Perform a code-patch, given the offset into the buffer of this label use
|
||||
/// and the offset into the buffer of the label's definition.
|
||||
/// It is guaranteed that, given `delta = offset - label_offset`, we will
|
||||
/// have `offset >= -self.max_neg_range()` and `offset <=
|
||||
/// self.max_pos_range()`.
|
||||
fn patch(self, buffer: &mut [u8], use_offset: CodeOffset, label_offset: CodeOffset);
|
||||
/// Can the label-use be patched to a veneer that supports a longer range?
|
||||
/// Usually valid for jumps (a short-range jump can jump to a longer-range
|
||||
/// jump), but not for e.g. constant pool references, because the constant
|
||||
/// load would require different code (one more level of indirection).
|
||||
fn supports_veneer(self) -> bool;
|
||||
/// How many bytes are needed for a veneer?
|
||||
fn veneer_size(self) -> CodeOffset;
|
||||
/// Generate a veneer. The given code-buffer slice is `self.veneer_size()`
|
||||
/// bytes long at offset `veneer_offset` in the buffer. The original
|
||||
/// label-use will be patched to refer to this veneer's offset. A new
|
||||
/// (offset, LabelUse) is returned that allows the veneer to use the actual
|
||||
/// label. For veneers to work properly, it is expected that the new veneer
|
||||
/// has a larger range; on most platforms this probably means either a
|
||||
/// "long-range jump" (e.g., on ARM, the 26-bit form), or if already at that
|
||||
/// stage, a jump that supports a full 32-bit range, for example.
|
||||
fn generate_veneer(self, buffer: &mut [u8], veneer_offset: CodeOffset) -> (CodeOffset, Self);
|
||||
}
|
||||
|
||||
/// Describes a block terminator (not call) in the vcode, when its branches
|
||||
@@ -202,26 +246,26 @@ pub enum MachTerminator<'a> {
|
||||
/// A return instruction.
|
||||
Ret,
|
||||
/// An unconditional branch to another block.
|
||||
Uncond(BlockIndex),
|
||||
Uncond(MachLabel),
|
||||
/// A conditional branch to one of two other blocks.
|
||||
Cond(BlockIndex, BlockIndex),
|
||||
Cond(MachLabel, MachLabel),
|
||||
/// An indirect branch with known possible targets.
|
||||
Indirect(&'a [BlockIndex]),
|
||||
Indirect(&'a [MachLabel]),
|
||||
}
|
||||
|
||||
/// A trait describing the ability to encode a MachInst into binary machine code.
|
||||
pub trait MachInstEmit<O: MachSectionOutput> {
|
||||
pub trait MachInstEmit: MachInst {
|
||||
/// Persistent state carried across `emit` invocations.
|
||||
type State: Default + Clone + Debug;
|
||||
/// Emit the instruction.
|
||||
fn emit(&self, code: &mut O, flags: &Flags, state: &mut Self::State);
|
||||
fn emit(&self, code: &mut MachBuffer<Self>, flags: &Flags, state: &mut Self::State);
|
||||
}
|
||||
|
||||
/// The result of a `MachBackend::compile_function()` call. Contains machine
|
||||
/// code (as bytes) and a disassembly, if requested.
|
||||
pub struct MachCompileResult {
|
||||
/// Machine code.
|
||||
pub sections: MachSections,
|
||||
pub buffer: MachBufferFinalized,
|
||||
/// Size of stack frame, in bytes.
|
||||
pub frame_size: u32,
|
||||
/// Disassembly, if requested.
|
||||
@@ -231,7 +275,7 @@ pub struct MachCompileResult {
|
||||
impl MachCompileResult {
|
||||
/// Get a `CodeInfo` describing section sizes from this compilation result.
|
||||
pub fn code_info(&self) -> CodeInfo {
|
||||
let code_size = self.sections.total_size();
|
||||
let code_size = self.buffer.total_size();
|
||||
CodeInfo {
|
||||
code_size,
|
||||
jumptables_size: 0,
|
||||
|
||||
@@ -1,460 +0,0 @@
|
||||
//! In-memory representation of compiled machine code, in multiple sections
|
||||
//! (text, constant pool / rodata, etc). Emission occurs into multiple sections
|
||||
//! simultaneously, so we buffer the result in memory and hand off to the
|
||||
//! caller at the end of compilation.
|
||||
|
||||
use crate::binemit::{Addend, CodeOffset, CodeSink, Reloc};
|
||||
use crate::ir::{ExternalName, Opcode, SourceLoc, TrapCode};
|
||||
|
||||
use alloc::vec::Vec;
|
||||
|
||||
/// A collection of sections with defined start-offsets.
|
||||
pub struct MachSections {
|
||||
/// Sections, in offset order.
|
||||
pub sections: Vec<MachSection>,
|
||||
}
|
||||
|
||||
impl MachSections {
|
||||
/// New, empty set of sections.
|
||||
pub fn new() -> MachSections {
|
||||
MachSections { sections: vec![] }
|
||||
}
|
||||
|
||||
/// Add a section with a known offset and size. Returns the index.
|
||||
pub fn add_section(&mut self, start: CodeOffset, length: CodeOffset) -> usize {
|
||||
let idx = self.sections.len();
|
||||
self.sections.push(MachSection::new(start, length));
|
||||
idx
|
||||
}
|
||||
|
||||
/// Mutably borrow the given section by index.
|
||||
pub fn get_section<'a>(&'a mut self, idx: usize) -> &'a mut MachSection {
|
||||
&mut self.sections[idx]
|
||||
}
|
||||
|
||||
/// Get mutable borrows of two sections simultaneously. Used during
|
||||
/// instruction emission to provide references to the .text and .rodata
|
||||
/// (constant pool) sections.
|
||||
pub fn two_sections<'a>(
|
||||
&'a mut self,
|
||||
idx1: usize,
|
||||
idx2: usize,
|
||||
) -> (&'a mut MachSection, &'a mut MachSection) {
|
||||
assert!(idx1 < idx2);
|
||||
assert!(idx1 < self.sections.len());
|
||||
assert!(idx2 < self.sections.len());
|
||||
let (first, rest) = self.sections.split_at_mut(idx2);
|
||||
(&mut first[idx1], &mut rest[0])
|
||||
}
|
||||
|
||||
/// Emit this set of sections to a set of sinks for the code,
|
||||
/// relocations, traps, and stackmap.
|
||||
pub fn emit<CS: CodeSink>(&self, sink: &mut CS) {
|
||||
// N.B.: we emit every section into the .text section as far as
|
||||
// the `CodeSink` is concerned; we do not bother to segregate
|
||||
// the contents into the actual program text, the jumptable and the
|
||||
// rodata (constant pool). This allows us to generate code assuming
|
||||
// that these will not be relocated relative to each other, and avoids
|
||||
// having to designate each section as belonging in one of the three
|
||||
// fixed categories defined by `CodeSink`. If this becomes a problem
|
||||
// later (e.g. because of memory permissions or similar), we can
|
||||
// add this designation and segregate the output; take care, however,
|
||||
// to add the appropriate relocations in this case.
|
||||
|
||||
for section in &self.sections {
|
||||
if section.data.len() > 0 {
|
||||
while sink.offset() < section.start_offset {
|
||||
sink.put1(0);
|
||||
}
|
||||
section.emit(sink);
|
||||
}
|
||||
}
|
||||
sink.begin_jumptables();
|
||||
sink.begin_rodata();
|
||||
sink.end_codegen();
|
||||
}
|
||||
|
||||
/// Get a list of source location mapping tuples in sorted-by-start-offset order.
|
||||
pub fn get_srclocs_sorted<'a>(&'a self) -> MachSectionsSrcLocs<'a> {
|
||||
MachSectionsSrcLocs::new(&self.sections)
|
||||
}
|
||||
|
||||
/// Get the total required size for these sections.
|
||||
pub fn total_size(&self) -> CodeOffset {
|
||||
if self.sections.len() == 0 {
|
||||
0
|
||||
} else {
|
||||
// Find the last non-empty section.
|
||||
self.sections
|
||||
.iter()
|
||||
.rev()
|
||||
.find(|s| s.data.len() > 0)
|
||||
.map(|s| s.cur_offset_from_start())
|
||||
.unwrap_or(0)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// An iterator over the srclocs in each section.
|
||||
/// Returns MachSrcLocs in an order sorted by start location.
|
||||
pub struct MachSectionsSrcLocs<'a> {
|
||||
sections: &'a [MachSection],
|
||||
cur_section: usize,
|
||||
cur_srcloc: usize,
|
||||
// For validation:
|
||||
last_offset: CodeOffset,
|
||||
}
|
||||
|
||||
impl<'a> MachSectionsSrcLocs<'a> {
|
||||
fn new(sections: &'a [MachSection]) -> MachSectionsSrcLocs<'a> {
|
||||
MachSectionsSrcLocs {
|
||||
sections,
|
||||
cur_section: 0,
|
||||
cur_srcloc: 0,
|
||||
last_offset: 0,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl<'a> Iterator for MachSectionsSrcLocs<'a> {
|
||||
type Item = &'a MachSrcLoc;
|
||||
|
||||
fn next(&mut self) -> Option<&'a MachSrcLoc> {
|
||||
// We simply iterate through sections and srcloc records in order. This produces a
|
||||
// sorted order naturally because sections are in starting-offset-order, and srclocs
|
||||
// are produced as a section is emitted into, so are in order as well.
|
||||
|
||||
// If we're out of sections, we're done.
|
||||
if self.cur_section >= self.sections.len() {
|
||||
return None;
|
||||
}
|
||||
|
||||
// Otherwise, make sure we have a srcloc in the current section left to return, and
|
||||
// advance to the next section if not. Done if we run out of sections.
|
||||
while self.cur_srcloc >= self.sections[self.cur_section].srclocs.len() {
|
||||
self.cur_srcloc = 0;
|
||||
self.cur_section += 1;
|
||||
if self.cur_section >= self.sections.len() {
|
||||
return None;
|
||||
}
|
||||
}
|
||||
|
||||
let loc = &self.sections[self.cur_section].srclocs[self.cur_srcloc];
|
||||
self.cur_srcloc += 1;
|
||||
debug_assert!(loc.start >= self.last_offset);
|
||||
self.last_offset = loc.start;
|
||||
Some(loc)
|
||||
}
|
||||
}
|
||||
|
||||
/// An abstraction over MachSection and MachSectionSize: some
|
||||
/// receiver of section data.
|
||||
pub trait MachSectionOutput {
|
||||
/// Get the current offset from the start of all sections.
|
||||
fn cur_offset_from_start(&self) -> CodeOffset;
|
||||
|
||||
/// Get the start offset of this section.
|
||||
fn start_offset(&self) -> CodeOffset;
|
||||
|
||||
/// Add 1 byte to the section.
|
||||
fn put1(&mut self, _: u8);
|
||||
|
||||
/// Add 2 bytes to the section.
|
||||
fn put2(&mut self, value: u16) {
|
||||
let [b0, b1] = value.to_le_bytes();
|
||||
self.put1(b0);
|
||||
self.put1(b1);
|
||||
}
|
||||
|
||||
/// Add 4 bytes to the section.
|
||||
fn put4(&mut self, value: u32) {
|
||||
let [b0, b1, b2, b3] = value.to_le_bytes();
|
||||
self.put1(b0);
|
||||
self.put1(b1);
|
||||
self.put1(b2);
|
||||
self.put1(b3);
|
||||
}
|
||||
|
||||
/// Add 8 bytes to the section.
|
||||
fn put8(&mut self, value: u64) {
|
||||
let [b0, b1, b2, b3, b4, b5, b6, b7] = value.to_le_bytes();
|
||||
self.put1(b0);
|
||||
self.put1(b1);
|
||||
self.put1(b2);
|
||||
self.put1(b3);
|
||||
self.put1(b4);
|
||||
self.put1(b5);
|
||||
self.put1(b6);
|
||||
self.put1(b7);
|
||||
}
|
||||
|
||||
/// Add a slice of bytes to the section.
|
||||
fn put_data(&mut self, data: &[u8]);
|
||||
|
||||
/// Add a relocation at the current offset.
|
||||
fn add_reloc(&mut self, loc: SourceLoc, kind: Reloc, name: &ExternalName, addend: Addend);
|
||||
|
||||
/// Add a trap record at the current offset.
|
||||
fn add_trap(&mut self, loc: SourceLoc, code: TrapCode);
|
||||
|
||||
/// Add a call return address record at the current offset.
|
||||
fn add_call_site(&mut self, loc: SourceLoc, opcode: Opcode);
|
||||
|
||||
/// Start the output for the given source-location at the current offset.
|
||||
fn start_srcloc(&mut self, loc: SourceLoc);
|
||||
|
||||
/// End the output for the previously-given source-location at the current offset.
|
||||
fn end_srcloc(&mut self);
|
||||
|
||||
/// Align up to the given alignment.
|
||||
fn align_to(&mut self, align_to: CodeOffset) {
|
||||
assert!(align_to.is_power_of_two());
|
||||
while self.cur_offset_from_start() & (align_to - 1) != 0 {
|
||||
self.put1(0);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// A section of output to be emitted to a CodeSink / RelocSink in bulk.
|
||||
/// Multiple sections may be created with known start offsets in advance; the
|
||||
/// usual use-case is to create the .text (code) and .rodata (constant pool) at
|
||||
/// once, after computing the length of the code, so that constant references
|
||||
/// can use known offsets as instructions are emitted.
|
||||
pub struct MachSection {
|
||||
/// The starting offset of this section.
|
||||
pub start_offset: CodeOffset,
|
||||
/// The limit of this section, defined by the start of the next section.
|
||||
pub length_limit: CodeOffset,
|
||||
/// The section contents, as raw bytes.
|
||||
pub data: Vec<u8>,
|
||||
/// Any relocations referring to this section.
|
||||
pub relocs: Vec<MachReloc>,
|
||||
/// Any trap records referring to this section.
|
||||
pub traps: Vec<MachTrap>,
|
||||
/// Any call site records referring to this section.
|
||||
pub call_sites: Vec<MachCallSite>,
|
||||
/// Any source location mappings referring to this section.
|
||||
pub srclocs: Vec<MachSrcLoc>,
|
||||
/// The current source location in progress (after `start_srcloc()` and before `end_srcloc()`).
|
||||
/// This is a (start_offset, src_loc) tuple.
|
||||
pub cur_srcloc: Option<(CodeOffset, SourceLoc)>,
|
||||
}
|
||||
|
||||
impl MachSection {
|
||||
/// Create a new section, known to start at `start_offset` and with a size limited to `length_limit`.
|
||||
pub fn new(start_offset: CodeOffset, length_limit: CodeOffset) -> MachSection {
|
||||
MachSection {
|
||||
start_offset,
|
||||
length_limit,
|
||||
data: vec![],
|
||||
relocs: vec![],
|
||||
traps: vec![],
|
||||
call_sites: vec![],
|
||||
srclocs: vec![],
|
||||
cur_srcloc: None,
|
||||
}
|
||||
}
|
||||
|
||||
/// Emit this section to the CodeSink and other associated sinks. The
|
||||
/// current offset of the CodeSink must match the starting offset of this
|
||||
/// section.
|
||||
pub fn emit<CS: CodeSink>(&self, sink: &mut CS) {
|
||||
assert!(sink.offset() == self.start_offset);
|
||||
|
||||
let mut next_reloc = 0;
|
||||
let mut next_trap = 0;
|
||||
let mut next_call_site = 0;
|
||||
for (idx, byte) in self.data.iter().enumerate() {
|
||||
if next_reloc < self.relocs.len() {
|
||||
let reloc = &self.relocs[next_reloc];
|
||||
if reloc.offset == idx as CodeOffset {
|
||||
sink.reloc_external(reloc.srcloc, reloc.kind, &reloc.name, reloc.addend);
|
||||
next_reloc += 1;
|
||||
}
|
||||
}
|
||||
if next_trap < self.traps.len() {
|
||||
let trap = &self.traps[next_trap];
|
||||
if trap.offset == idx as CodeOffset {
|
||||
sink.trap(trap.code, trap.srcloc);
|
||||
next_trap += 1;
|
||||
}
|
||||
}
|
||||
if next_call_site < self.call_sites.len() {
|
||||
let call_site = &self.call_sites[next_call_site];
|
||||
if call_site.ret_addr == idx as CodeOffset {
|
||||
sink.add_call_site(call_site.opcode, call_site.srcloc);
|
||||
next_call_site += 1;
|
||||
}
|
||||
}
|
||||
sink.put1(*byte);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl MachSectionOutput for MachSection {
|
||||
fn cur_offset_from_start(&self) -> CodeOffset {
|
||||
self.start_offset + self.data.len() as CodeOffset
|
||||
}
|
||||
|
||||
fn start_offset(&self) -> CodeOffset {
|
||||
self.start_offset
|
||||
}
|
||||
|
||||
fn put1(&mut self, value: u8) {
|
||||
assert!(((self.data.len() + 1) as CodeOffset) <= self.length_limit);
|
||||
self.data.push(value);
|
||||
}
|
||||
|
||||
fn put_data(&mut self, data: &[u8]) {
|
||||
assert!(((self.data.len() + data.len()) as CodeOffset) <= self.length_limit);
|
||||
self.data.extend_from_slice(data);
|
||||
}
|
||||
|
||||
fn add_reloc(&mut self, srcloc: SourceLoc, kind: Reloc, name: &ExternalName, addend: Addend) {
|
||||
let name = name.clone();
|
||||
self.relocs.push(MachReloc {
|
||||
offset: self.data.len() as CodeOffset,
|
||||
srcloc,
|
||||
kind,
|
||||
name,
|
||||
addend,
|
||||
});
|
||||
}
|
||||
|
||||
fn add_trap(&mut self, srcloc: SourceLoc, code: TrapCode) {
|
||||
self.traps.push(MachTrap {
|
||||
offset: self.data.len() as CodeOffset,
|
||||
srcloc,
|
||||
code,
|
||||
});
|
||||
}
|
||||
|
||||
fn add_call_site(&mut self, srcloc: SourceLoc, opcode: Opcode) {
|
||||
self.call_sites.push(MachCallSite {
|
||||
ret_addr: self.data.len() as CodeOffset,
|
||||
srcloc,
|
||||
opcode,
|
||||
});
|
||||
}
|
||||
|
||||
fn start_srcloc(&mut self, loc: SourceLoc) {
|
||||
self.cur_srcloc = Some((self.cur_offset_from_start(), loc));
|
||||
}
|
||||
|
||||
fn end_srcloc(&mut self) {
|
||||
let (start, loc) = self
|
||||
.cur_srcloc
|
||||
.take()
|
||||
.expect("end_srcloc() called without start_srcloc()");
|
||||
let end = self.cur_offset_from_start();
|
||||
// Skip zero-length extends.
|
||||
debug_assert!(end >= start);
|
||||
if end > start {
|
||||
self.srclocs.push(MachSrcLoc { start, end, loc });
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// A MachSectionOutput implementation that records only size.
|
||||
pub struct MachSectionSize {
|
||||
/// The starting offset of this section.
|
||||
pub start_offset: CodeOffset,
|
||||
/// The current offset of this section.
|
||||
pub offset: CodeOffset,
|
||||
}
|
||||
|
||||
impl MachSectionSize {
|
||||
/// Create a new size-counting dummy section.
|
||||
pub fn new(start_offset: CodeOffset) -> MachSectionSize {
|
||||
MachSectionSize {
|
||||
start_offset,
|
||||
offset: start_offset,
|
||||
}
|
||||
}
|
||||
|
||||
/// Return the size this section would take if emitted with a real sink.
|
||||
pub fn size(&self) -> CodeOffset {
|
||||
self.offset - self.start_offset
|
||||
}
|
||||
}
|
||||
|
||||
impl MachSectionOutput for MachSectionSize {
|
||||
fn cur_offset_from_start(&self) -> CodeOffset {
|
||||
// All size-counting sections conceptually start at offset 0; this doesn't
|
||||
// matter when counting code size.
|
||||
self.offset
|
||||
}
|
||||
|
||||
fn start_offset(&self) -> CodeOffset {
|
||||
self.start_offset
|
||||
}
|
||||
|
||||
fn put1(&mut self, _: u8) {
|
||||
self.offset += 1;
|
||||
}
|
||||
|
||||
fn put_data(&mut self, data: &[u8]) {
|
||||
self.offset += data.len() as CodeOffset;
|
||||
}
|
||||
|
||||
fn add_reloc(&mut self, _: SourceLoc, _: Reloc, _: &ExternalName, _: Addend) {}
|
||||
|
||||
fn add_trap(&mut self, _: SourceLoc, _: TrapCode) {}
|
||||
|
||||
fn add_call_site(&mut self, _: SourceLoc, _: Opcode) {}
|
||||
|
||||
fn start_srcloc(&mut self, _: SourceLoc) {}
|
||||
|
||||
fn end_srcloc(&mut self) {}
|
||||
}
|
||||
|
||||
/// A relocation resulting from a compilation.
|
||||
pub struct MachReloc {
|
||||
/// The offset at which the relocation applies, *relative to the
|
||||
/// containing section*.
|
||||
pub offset: CodeOffset,
|
||||
/// The original source location.
|
||||
pub srcloc: SourceLoc,
|
||||
/// The kind of relocation.
|
||||
pub kind: Reloc,
|
||||
/// The external symbol / name to which this relocation refers.
|
||||
pub name: ExternalName,
|
||||
/// The addend to add to the symbol value.
|
||||
pub addend: i64,
|
||||
}
|
||||
|
||||
/// A trap record resulting from a compilation.
|
||||
pub struct MachTrap {
|
||||
/// The offset at which the trap instruction occurs, *relative to the
|
||||
/// containing section*.
|
||||
pub offset: CodeOffset,
|
||||
/// The original source location.
|
||||
pub srcloc: SourceLoc,
|
||||
/// The trap code.
|
||||
pub code: TrapCode,
|
||||
}
|
||||
|
||||
/// A call site record resulting from a compilation.
|
||||
pub struct MachCallSite {
|
||||
/// The offset of the call's return address, *relative to the containing section*.
|
||||
pub ret_addr: CodeOffset,
|
||||
/// The original source location.
|
||||
pub srcloc: SourceLoc,
|
||||
/// The call's opcode.
|
||||
pub opcode: Opcode,
|
||||
}
|
||||
|
||||
/// A source-location mapping resulting from a compilation.
|
||||
#[derive(Clone, Debug)]
|
||||
pub struct MachSrcLoc {
|
||||
/// The start of the region of code corresponding to a source location.
|
||||
/// This is relative to the start of the function, not to the start of the
|
||||
/// section.
|
||||
pub start: CodeOffset,
|
||||
/// The end of the region of code corresponding to a source location.
|
||||
/// This is relative to the start of the section, not to the start of the
|
||||
/// section.
|
||||
pub end: CodeOffset,
|
||||
/// The source location.
|
||||
pub loc: SourceLoc,
|
||||
}
|
||||
@@ -17,8 +17,7 @@
|
||||
//! See the main module comment in `mod.rs` for more details on the VCode-based
|
||||
//! backend pipeline.
|
||||
|
||||
use crate::entity::SecondaryMap;
|
||||
use crate::ir::{self, Block, SourceLoc};
|
||||
use crate::ir::{self, SourceLoc};
|
||||
use crate::machinst::*;
|
||||
use crate::settings;
|
||||
|
||||
@@ -30,8 +29,6 @@ use regalloc::{
|
||||
|
||||
use alloc::boxed::Box;
|
||||
use alloc::{borrow::Cow, vec::Vec};
|
||||
use log::debug;
|
||||
use smallvec::SmallVec;
|
||||
use std::fmt;
|
||||
use std::iter;
|
||||
use std::string::String;
|
||||
@@ -43,8 +40,8 @@ pub type BlockIndex = u32;
|
||||
|
||||
/// VCodeInst wraps all requirements for a MachInst to be in VCode: it must be
|
||||
/// a `MachInst` and it must be able to emit itself at least to a `SizeCodeSink`.
|
||||
pub trait VCodeInst: MachInst + MachInstEmit<MachSection> + MachInstEmit<MachSectionSize> {}
|
||||
impl<I: MachInst + MachInstEmit<MachSection> + MachInstEmit<MachSectionSize>> VCodeInst for I {}
|
||||
pub trait VCodeInst: MachInst + MachInstEmit {}
|
||||
impl<I: MachInst + MachInstEmit> VCodeInst for I {}
|
||||
|
||||
/// A function in "VCode" (virtualized-register code) form, after lowering.
|
||||
/// This is essentially a standard CFG of basic blocks, where each basic block
|
||||
@@ -80,29 +77,11 @@ pub struct VCode<I: VCodeInst> {
|
||||
/// correspond to each basic block's successors.
|
||||
block_succs: Vec<BlockIx>,
|
||||
|
||||
/// Block indices by IR block.
|
||||
block_by_bb: SecondaryMap<ir::Block, BlockIndex>,
|
||||
|
||||
/// IR block for each VCode Block. The length of this Vec will likely be
|
||||
/// less than the total number of Blocks, because new Blocks (for edge
|
||||
/// splits, for example) are appended during lowering.
|
||||
bb_by_block: Vec<ir::Block>,
|
||||
|
||||
/// Order of block IDs in final generated code.
|
||||
final_block_order: Vec<BlockIndex>,
|
||||
|
||||
/// Final block offsets. Computed during branch finalization and used
|
||||
/// during emission.
|
||||
final_block_offsets: Vec<CodeOffset>,
|
||||
|
||||
/// Size of code, accounting for block layout / alignment.
|
||||
code_size: CodeOffset,
|
||||
/// Block-order information.
|
||||
block_order: BlockLoweringOrder,
|
||||
|
||||
/// ABI object.
|
||||
abi: Box<dyn ABIBody<I = I>>,
|
||||
|
||||
/// The block targeted by fallthrough_returns, if there's one.
|
||||
pub fallthrough_return_block: Option<BlockIndex>,
|
||||
}
|
||||
|
||||
/// A builder for a VCode function body. This builder is designed for the
|
||||
@@ -123,12 +102,8 @@ pub struct VCodeBuilder<I: VCodeInst> {
|
||||
/// In-progress VCode.
|
||||
vcode: VCode<I>,
|
||||
|
||||
/// Current basic block instructions, in reverse order (because blocks are
|
||||
/// built bottom-to-top).
|
||||
bb_insns: SmallVec<[(I, SourceLoc); 32]>,
|
||||
|
||||
/// Current IR-inst instructions, in forward order.
|
||||
ir_inst_insns: SmallVec<[(I, SourceLoc); 4]>,
|
||||
/// Index of the last block-start in the vcode.
|
||||
block_start: InsnIndex,
|
||||
|
||||
/// Start of succs for the current block in the concatenated succs list.
|
||||
succ_start: usize,
|
||||
@@ -139,12 +114,11 @@ pub struct VCodeBuilder<I: VCodeInst> {
|
||||
|
||||
impl<I: VCodeInst> VCodeBuilder<I> {
|
||||
/// Create a new VCodeBuilder.
|
||||
pub fn new(abi: Box<dyn ABIBody<I = I>>) -> VCodeBuilder<I> {
|
||||
let vcode = VCode::new(abi);
|
||||
pub fn new(abi: Box<dyn ABIBody<I = I>>, block_order: BlockLoweringOrder) -> VCodeBuilder<I> {
|
||||
let vcode = VCode::new(abi, block_order);
|
||||
VCodeBuilder {
|
||||
vcode,
|
||||
bb_insns: SmallVec::new(),
|
||||
ir_inst_insns: SmallVec::new(),
|
||||
block_start: 0,
|
||||
succ_start: 0,
|
||||
cur_srcloc: SourceLoc::default(),
|
||||
}
|
||||
@@ -155,14 +129,9 @@ impl<I: VCodeInst> VCodeBuilder<I> {
|
||||
&mut *self.vcode.abi
|
||||
}
|
||||
|
||||
/// Set the fallthrough_return target block for this function. There must be at most once per
|
||||
/// function.
|
||||
pub fn set_fallthrough_return_block(&mut self, bb: Block) {
|
||||
debug_assert!(
|
||||
self.vcode.fallthrough_return_block.is_none(),
|
||||
"a function must have at most one fallthrough-return instruction"
|
||||
);
|
||||
self.vcode.fallthrough_return_block = Some(self.bb_to_bindex(bb));
|
||||
/// Access to the BlockLoweringOrder object.
|
||||
pub fn block_order(&self) -> &BlockLoweringOrder {
|
||||
&self.vcode.block_order
|
||||
}
|
||||
|
||||
/// Set the type of a VReg.
|
||||
@@ -173,53 +142,17 @@ impl<I: VCodeInst> VCodeBuilder<I> {
|
||||
self.vcode.vreg_types[vreg.get_index()] = ty;
|
||||
}
|
||||
|
||||
/// Return the underlying bb-to-BlockIndex map.
|
||||
pub fn blocks_by_bb(&self) -> &SecondaryMap<ir::Block, BlockIndex> {
|
||||
&self.vcode.block_by_bb
|
||||
}
|
||||
|
||||
/// Initialize the bb-to-BlockIndex map. Returns the first free
|
||||
/// BlockIndex.
|
||||
pub fn init_bb_map(&mut self, blocks: &[ir::Block]) -> BlockIndex {
|
||||
let mut bindex: BlockIndex = 0;
|
||||
for bb in blocks.iter() {
|
||||
self.vcode.block_by_bb[*bb] = bindex;
|
||||
self.vcode.bb_by_block.push(*bb);
|
||||
bindex += 1;
|
||||
}
|
||||
bindex
|
||||
}
|
||||
|
||||
    /// Get the BlockIndex for an IR block.
    pub fn bb_to_bindex(&self, bb: ir::Block) -> BlockIndex {
        self.vcode.block_by_bb[bb]
    }

    /// Set the current block as the entry block.
    pub fn set_entry(&mut self, block: BlockIndex) {
        self.vcode.entry = block;
    }

    /// End the current IR instruction. Must be called after pushing any
    /// instructions and prior to ending the basic block.
    pub fn end_ir_inst(&mut self) {
        while let Some(pair) = self.ir_inst_insns.pop() {
            self.bb_insns.push(pair);
        }
    }

    /// End the current basic block. Must be called after emitting vcode insts
    /// for IR insts and prior to ending the function (building the VCode).
    pub fn end_bb(&mut self) -> BlockIndex {
        assert!(self.ir_inst_insns.is_empty());
        let block_num = self.vcode.block_ranges.len() as BlockIndex;
        // Push the instructions.
        let start_idx = self.vcode.insts.len() as InsnIndex;
        while let Some((i, loc)) = self.bb_insns.pop() {
            self.vcode.insts.push(i);
            self.vcode.srclocs.push(loc);
        }
    pub fn end_bb(&mut self) {
        let start_idx = self.block_start;
        let end_idx = self.vcode.insts.len() as InsnIndex;
        self.block_start = end_idx;
        // Add the instruction index range to the list of blocks.
        self.vcode.block_ranges.push((start_idx, end_idx));
        // End the successors list.
@@ -228,8 +161,6 @@ impl<I: VCodeInst> VCodeBuilder<I> {
            .block_succ_range
            .push((self.succ_start, succ_end));
        self.succ_start = succ_end;

        block_num
    }

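For orientation, a minimal sketch (not part of this patch) of how a lowering driver might drive the builder once blocks are handed to it in final order. The `lower_block` callback and the instruction-push method name are assumptions based on the methods visible in this diff, and the module's own types are assumed to be in scope:

// Hypothetical driver loop: blocks arrive in final order, so the builder
// only records instruction ranges as it goes and never reshuffles them.
fn lower_function<I: VCodeInst>(
    block_order: BlockLoweringOrder,
    abi: Box<dyn ABIBody<I = I>>,
    mut lower_block: impl FnMut(&mut VCodeBuilder<I>, BlockIndex),
) -> VCode<I> {
    let mut builder = VCodeBuilder::new(abi, block_order);
    let nblocks = builder.block_order().lowered_order().len() as BlockIndex;
    for bindex in 0..nblocks {
        // Emit this block's instructions (via the push method documented
        // below), then close the block; `end_bb` records the (start, end)
        // instruction range and the block's successor range.
        lower_block(&mut builder, bindex);
        builder.end_bb();
    }
    builder.build()
}
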
    /// Push an instruction for the current BB and current IR inst within the BB.
@@ -237,19 +168,27 @@ impl<I: VCodeInst> VCodeBuilder<I> {
        match insn.is_term() {
            MachTerminator::None | MachTerminator::Ret => {}
            MachTerminator::Uncond(target) => {
                self.vcode.block_succs.push(BlockIx::new(target));
                self.vcode.block_succs.push(BlockIx::new(target.get()));
            }
            MachTerminator::Cond(true_branch, false_branch) => {
                self.vcode.block_succs.push(BlockIx::new(true_branch));
                self.vcode.block_succs.push(BlockIx::new(false_branch));
                self.vcode.block_succs.push(BlockIx::new(true_branch.get()));
                self.vcode
                    .block_succs
                    .push(BlockIx::new(false_branch.get()));
            }
            MachTerminator::Indirect(targets) => {
                for target in targets {
                    self.vcode.block_succs.push(BlockIx::new(*target));
                    self.vcode.block_succs.push(BlockIx::new(target.get()));
                }
            }
        }
        self.ir_inst_insns.push((insn, self.cur_srcloc));
        self.vcode.insts.push(insn);
        self.vcode.srclocs.push(self.cur_srcloc);
    }

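A note on the successor bookkeeping above: the targets carried by `MachTerminator` are now `MachLabel`s, and `target.get()` is used directly as a block index. That relies on the label-numbering convention established in `emit()` below ("first N MachLabels are simply block indices"). A small sketch of the assumed invariant:

// Assumed invariant: a label created for a block index round-trips to that
// same index, so successor labels can be stored as BlockIx values directly.
fn label_is_block_index(block: BlockIndex) -> bool {
    MachLabel::from_block(block).get() == block
}
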
    /// Get the current source location.
    pub fn get_srcloc(&self) -> SourceLoc {
        self.cur_srcloc
    }

    /// Set the current source location.
@@ -259,8 +198,6 @@ impl<I: VCodeInst> VCodeBuilder<I> {

    /// Build the final VCode.
    pub fn build(self) -> VCode<I> {
        assert!(self.ir_inst_insns.is_empty());
        assert!(self.bb_insns.is_empty());
        self.vcode
    }
}
@@ -282,35 +219,9 @@ fn is_redundant_move<I: VCodeInst>(insn: &I) -> bool {
    }
}

fn is_trivial_jump_block<I: VCodeInst>(vcode: &VCode<I>, block: BlockIndex) -> Option<BlockIndex> {
    let range = vcode.block_insns(BlockIx::new(block));

    debug!(
        "is_trivial_jump_block: block {} has len {}",
        block,
        range.len()
    );

    if range.len() != 1 {
        return None;
    }
    let insn = range.first();

    debug!(
        " -> only insn is: {:?} with terminator {:?}",
        vcode.get_insn(insn),
        vcode.get_insn(insn).is_term()
    );

    match vcode.get_insn(insn).is_term() {
        MachTerminator::Uncond(target) => Some(target),
        _ => None,
    }
}

impl<I: VCodeInst> VCode<I> {
    /// New empty VCode.
    fn new(abi: Box<dyn ABIBody<I = I>>) -> VCode<I> {
    fn new(abi: Box<dyn ABIBody<I = I>>, block_order: BlockLoweringOrder) -> VCode<I> {
        VCode {
            liveins: abi.liveins(),
            liveouts: abi.liveouts(),
@@ -321,13 +232,8 @@ impl<I: VCodeInst> VCode<I> {
            block_ranges: vec![],
            block_succ_range: vec![],
            block_succs: vec![],
            block_by_bb: SecondaryMap::with_default(0),
            bb_by_block: vec![],
            final_block_order: vec![],
            final_block_offsets: vec![],
            code_size: 0,
            block_order,
            abi,
            fallthrough_return_block: None,
        }
    }

@@ -367,8 +273,6 @@ impl<I: VCodeInst> VCode<I> {
    /// instructions including spliced fill/reload/move instructions, and replace
    /// the VCode with them.
    pub fn replace_insns_from_regalloc(&mut self, result: RegAllocResult<Self>) {
        self.final_block_order = compute_final_block_order(self);

        // Record the spillslot count and clobbered registers for the ABI/stack
        // setup code.
        self.abi.set_num_spillslots(result.num_spill_slots as usize);
@@ -383,11 +287,12 @@ impl<I: VCodeInst> VCode<I> {
        let mut final_block_ranges = vec![(0, 0); self.num_blocks()];
        let mut final_srclocs = vec![];

        for block in &self.final_block_order {
            let (start, end) = block_ranges[*block as usize];
        for block in 0..self.num_blocks() {
            let block = block as BlockIndex;
            let (start, end) = block_ranges[block as usize];
            let final_start = final_insns.len() as InsnIndex;

            if *block == self.entry {
            if block == self.entry {
                // Start with the prologue.
                let prologue = self.abi.gen_prologue();
                let len = prologue.len();
@@ -429,7 +334,7 @@ impl<I: VCodeInst> VCode<I> {
            }

            let final_end = final_insns.len() as InsnIndex;
            final_block_ranges[*block as usize] = (final_start, final_end);
            final_block_ranges[block as usize] = (final_start, final_end);
        }

        debug_assert!(final_insns.len() == final_srclocs.len());
@@ -439,175 +344,68 @@ impl<I: VCodeInst> VCode<I> {
        self.block_ranges = final_block_ranges;
    }

    /// Removes redundant branches, rewriting targets to point directly to the
    /// ultimate block at the end of a chain of trivial one-target jumps.
    pub fn remove_redundant_branches(&mut self) {
        // For each block, compute the actual target block, looking through up to one
        // block with single-target jumps (this will remove empty edge blocks inserted
        // by phi-lowering).
        let block_rewrites: Vec<BlockIndex> = (0..self.num_blocks() as u32)
            .map(|bix| is_trivial_jump_block(self, bix).unwrap_or(bix))
            .collect();
        let mut refcounts: Vec<usize> = vec![0; self.num_blocks()];

        debug!(
            "remove_redundant_branches: block_rewrites = {:?}",
            block_rewrites
        );

        refcounts[self.entry as usize] = 1;

        for block in 0..self.num_blocks() as u32 {
            for insn in self.block_insns(BlockIx::new(block)) {
                self.get_insn_mut(insn)
                    .with_block_rewrites(&block_rewrites[..]);
                match self.get_insn(insn).is_term() {
                    MachTerminator::Uncond(bix) => {
                        refcounts[bix as usize] += 1;
                    }
                    MachTerminator::Cond(bix1, bix2) => {
                        refcounts[bix1 as usize] += 1;
                        refcounts[bix2 as usize] += 1;
                    }
                    MachTerminator::Indirect(blocks) => {
                        for block in blocks {
                            refcounts[*block as usize] += 1;
                        }
                    }
                    _ => {}
                }
            }
        }

        let deleted: Vec<bool> = refcounts.iter().map(|r| *r == 0).collect();

        let block_order = std::mem::replace(&mut self.final_block_order, vec![]);
        self.final_block_order = block_order
            .into_iter()
            .filter(|b| !deleted[*b as usize])
            .collect();

        // Rewrite successor information based on the block-rewrite map.
        for succ in &mut self.block_succs {
            let new_succ = block_rewrites[succ.get() as usize];
            *succ = BlockIx::new(new_succ);
        }
    }

    /// Mutate branch instructions to (i) lower two-way condbrs to one-way,
    /// depending on fallthrough; and (ii) use concrete offsets.
    pub fn finalize_branches(&mut self)
    /// Emit the instructions to a `MachBuffer`, containing fixed-up code and external
    /// reloc/trap/etc. records ready for use.
    pub fn emit(&self) -> MachBuffer<I>
    where
        I: MachInstEmit<MachSectionSize>,
        I: MachInstEmit,
    {
        // Compute fallthrough block, indexed by block.
        let num_final_blocks = self.final_block_order.len();
        let mut block_fallthrough: Vec<Option<BlockIndex>> = vec![None; self.num_blocks()];
        for i in 0..(num_final_blocks - 1) {
            let from = self.final_block_order[i];
            let to = self.final_block_order[i + 1];
            block_fallthrough[from as usize] = Some(to);
        }

        // Pass over VCode instructions and finalize two-way branches into
        // one-way branches with fallthrough.
        for block in 0..self.num_blocks() {
            let next_block = block_fallthrough[block];
            let (start, end) = self.block_ranges[block];

            for iix in start..end {
                let insn = &mut self.insts[iix as usize];
                insn.with_fallthrough_block(next_block);
            }
        }

        let flags = self.abi.flags();

        // Compute block offsets.
        let mut code_section = MachSectionSize::new(0);
        let mut block_offsets = vec![0; self.num_blocks()];
        let mut buffer = MachBuffer::new();
        let mut state = Default::default();
        for &block in &self.final_block_order {
            code_section.offset = I::align_basic_block(code_section.offset);
            block_offsets[block as usize] = code_section.offset;
            let (start, end) = self.block_ranges[block as usize];
            for iix in start..end {
                self.insts[iix as usize].emit(&mut code_section, flags, &mut state);
            }
        }

        // We now have the section layout.
        self.final_block_offsets = block_offsets;
        self.code_size = code_section.size();

        // Update branches with known block offsets. This looks like the
        // traversal above, but (i) does not update block_offsets, rather uses
        // it (so forward references are now possible), and (ii) mutates the
        // instructions.
        let mut code_section = MachSectionSize::new(0);
        let mut state = Default::default();
        for &block in &self.final_block_order {
            code_section.offset = I::align_basic_block(code_section.offset);
            let (start, end) = self.block_ranges[block as usize];
            for iix in start..end {
                self.insts[iix as usize]
                    .with_block_offsets(code_section.offset, &self.final_block_offsets[..]);
                self.insts[iix as usize].emit(&mut code_section, flags, &mut state);
            }
        }
    }

    /// Emit the instructions to a list of sections.
    pub fn emit(&self) -> MachSections
    where
        I: MachInstEmit<MachSection>,
    {
        let mut sections = MachSections::new();
        let code_idx = sections.add_section(0, self.code_size);
        let code_section = sections.get_section(code_idx);
        let mut state = Default::default();
        buffer.reserve_labels_for_blocks(self.num_blocks() as BlockIndex); // first N MachLabels are simply block indices.

        let flags = self.abi.flags();
        let mut cur_srcloc = None;
        for &block in &self.final_block_order {
            let new_offset = I::align_basic_block(code_section.cur_offset_from_start());
            while new_offset > code_section.cur_offset_from_start() {
        for block in 0..self.num_blocks() {
            let block = block as BlockIndex;
            let new_offset = I::align_basic_block(buffer.cur_offset());
            while new_offset > buffer.cur_offset() {
                // Pad with NOPs up to the aligned block offset.
                let nop = I::gen_nop((new_offset - code_section.cur_offset_from_start()) as usize);
                nop.emit(code_section, flags, &mut Default::default());
                let nop = I::gen_nop((new_offset - buffer.cur_offset()) as usize);
                nop.emit(&mut buffer, flags, &mut Default::default());
            }
            assert_eq!(code_section.cur_offset_from_start(), new_offset);
            assert_eq!(buffer.cur_offset(), new_offset);

            let (start, end) = self.block_ranges[block as usize];
            buffer.bind_label(MachLabel::from_block(block));
            for iix in start..end {
                let srcloc = self.srclocs[iix as usize];
                if cur_srcloc != Some(srcloc) {
                    if cur_srcloc.is_some() {
                        code_section.end_srcloc();
                        buffer.end_srcloc();
                    }
                    code_section.start_srcloc(srcloc);
                    buffer.start_srcloc(srcloc);
                    cur_srcloc = Some(srcloc);
                }

                self.insts[iix as usize].emit(code_section, flags, &mut state);
                self.insts[iix as usize].emit(&mut buffer, flags, &mut state);
            }

            if cur_srcloc.is_some() {
                code_section.end_srcloc();
                buffer.end_srcloc();
                cur_srcloc = None;
            }

            // Do we need an island? Get the worst-case size of the next BB and see if, having
            // emitted that many bytes, we will be beyond the deadline.
            if block < (self.num_blocks() - 1) as BlockIndex {
                let next_block = block + 1;
                let next_block_range = self.block_ranges[next_block as usize];
                let next_block_size = next_block_range.1 - next_block_range.0;
                let worst_case_next_bb = I::worst_case_size() * next_block_size;
                if buffer.island_needed(worst_case_next_bb) {
                    buffer.emit_island();
                }
            }
        }

        sections
        buffer
    }

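As a concrete illustration of the island check above, a worked sketch of the deadline reasoning with made-up numbers (the 44-byte worst-case instruction size and the specific offsets are hypothetical; the +/- 1 MB conditional-branch reach is the AArch64 figure from the commit message):

// Worked example of the just-in-time island decision (made-up numbers).
fn island_needed_example() -> bool {
    // A conditional-branch fixup recorded at offset 0x1000 with +/- 1 MiB
    // reach must be patched no later than this deadline offset.
    let fixup_offset: u32 = 0x1000;
    let branch_range: u32 = 1 << 20;
    let deadline = fixup_offset + branch_range;

    // Worst case for the next block: 2_000 instructions at an assumed
    // worst-case size of 44 bytes each.
    let cur_offset: u32 = 0xF_8000;
    let worst_case_next_bb: u32 = 2_000 * 44;

    // If emitting the next block could cross the deadline, an island (a
    // region of longer-range code that pending references can be redirected
    // through) is emitted now, before the next block, so fixups stay in range.
    cur_offset + worst_case_next_bb > deadline
}
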
    /// Get the IR block for a BlockIndex, if one exists.
    pub fn bindex_to_bb(&self, block: BlockIndex) -> Option<ir::Block> {
        if (block as usize) < self.bb_by_block.len() {
            Some(self.bb_by_block[block as usize])
        } else {
            None
        }
        self.block_order.lowered_order()[block as usize].orig_block()
    }
}

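The rewritten `bindex_to_bb` above consults the lowering order: a lowered block that wraps an original CLIF block reports it, while a block synthesized purely for an edge yields `None`. A small, hypothetical caller as a sketch:

// Hypothetical caller: annotate a lowered block with its CLIF origin, if any.
fn describe_block<I: VCodeInst>(vcode: &VCode<I>, block: BlockIndex) -> String {
    match vcode.bindex_to_bb(block) {
        Some(bb) => format!("vcode block {} (from CLIF {})", block, bb),
        None => format!("vcode block {} (edge block, no CLIF origin)", block),
    }
}
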
@@ -712,7 +510,6 @@ impl<I: VCodeInst> fmt::Debug for VCode<I> {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        writeln!(f, "VCode_Debug {{")?;
        writeln!(f, " Entry block: {}", self.entry)?;
        writeln!(f, " Final block order: {:?}", self.final_block_order)?;

        for block in 0..self.num_blocks() {
            writeln!(f, "Block {}:", block,)?;
@@ -736,52 +533,21 @@ impl<I: VCodeInst + ShowWithRRU> ShowWithRRU for VCode<I> {
    fn show_rru(&self, mb_rru: Option<&RealRegUniverse>) -> String {
        use std::fmt::Write;

        // Calculate an order in which to display the blocks. This is the same
        // as final_block_order, but also includes blocks which are in the
        // representation but not in final_block_order.
        let mut display_order = Vec::<usize>::new();
        // First display blocks in `final_block_order`
        for bix in &self.final_block_order {
            assert!((*bix as usize) < self.num_blocks());
            display_order.push(*bix as usize);
        }
        // Now also take care of those not listed in `final_block_order`.
        // This is quadratic, but it's also debug-only code.
        for bix in 0..self.num_blocks() {
            if display_order.contains(&bix) {
                continue;
            }
            display_order.push(bix);
        }

        let mut s = String::new();
        write!(&mut s, "VCode_ShowWithRRU {{{{\n").unwrap();
        write!(&mut s, " Entry block: {}\n", self.entry).unwrap();
        write!(
            &mut s,
            " Final block order: {:?}\n",
            self.final_block_order
        )
        .unwrap();

        for i in 0..self.num_blocks() {
            let block = display_order[i];
            let block = i as BlockIndex;

            let omitted = if !self.final_block_order.is_empty() && i >= self.final_block_order.len()
            {
                "** OMITTED **"
            } else {
                ""
            };

            write!(&mut s, "Block {}: {}\n", block, omitted).unwrap();
            if let Some(bb) = self.bindex_to_bb(block as BlockIndex) {
            write!(&mut s, "Block {}:\n", block).unwrap();
            if let Some(bb) = self.bindex_to_bb(block) {
                write!(&mut s, " (original IR block: {})\n", bb).unwrap();
            }
            for succ in self.succs(block as BlockIndex) {
            for succ in self.succs(block) {
                write!(&mut s, " (successor: Block {})\n", succ.get()).unwrap();
            }
            let (start, end) = self.block_ranges[block];
            let (start, end) = self.block_ranges[block as usize];
            write!(&mut s, " (instruction range: {} .. {})\n", start, end).unwrap();
            for inst in start..end {
                write!(