egraph support: rewrite to work in terms of CLIF data structures. (#5382)

* egraph support: rewrite to work in terms of CLIF data structures.

This work rewrites the "egraph"-based optimization framework in
Cranelift to operate on aegraphs (acyclic egraphs) represented in the
CLIF itself rather than as a separate data structure to which and from
which we translate the CLIF.

The basic idea is to add a new kind of value, a "union", that is like
an alias but refers to two other values rather than one. This allows
us to represent an eclass of enodes (values) as a tree. The union node
allows for a value to have *multiple representations*: either
constituent value could be used, and (in well-formed CLIF produced by
correct optimization rules) they must be equivalent.

Like the old egraph infrastructure, we take advantage of acyclicity
and eager rule application to do optimization in a single pass. Like
before, we integrate GVN (during the optimization pass) and LICM
(during elaboration).

Unlike the old egraph infrastructure, everything stays in the
DataFlowGraph. "Pure" enodes are represented as instructions that have
values attached, but that are not placed into the function layout.
When entering "egraph" form, we remove them from the layout while
optimizing. When leaving "egraph" form, during elaboration, we can
place an instruction back into the layout the first time we elaborate
the enode; if we elaborate it more than once, we clone the
instruction.

The implementation performs two passes overall:

- One, a forward pass in RPO (to see defs before uses), that (i)
  removes "pure" instructions from the layout and (ii) optimizes as it
  goes. As before, we eagerly optimize, so we form the entire union of
  optimized forms of a value before we see any uses of that value.
  This lets us rewrite uses to use the most "up-to-date" form of the
  value and canonicalize and optimize that form.

  The eager rewriting and acyclic representation make each other work
  (we could not eagerly rewrite if there were cycles; and acyclicity
  does not miss optimization opportunities only because the first time
  we introduce a value, we immediately produce its "best" form). This
  design choice is also what allows us to avoid the "parent pointers"
  and fixpoint loop of traditional egraphs.

  This forward optimization pass keeps a scoped hashmap to "intern"
  nodes (thus performing GVN), and also interleaves on a
  per-instruction level with alias analysis. The interleaving with
  alias analysis allows alias analysis to see the most optimized form
  of each address (so it can see equivalences), and allows the next
  value to see any equivalences (reuses of loads or stored values)
  that alias analysis uncovers.

- Two, a forward pass in domtree preorder, that "elaborates" pure
  enodes back into the layout, possibly in multiple places if needed.
  This tracks the loop nest and hoists nodes as needed, performing
  LICM as it goes. Note that by doing this in forward order, we avoid
  the "fixpoint" that traditional LICM needs: we hoist a def before
  its uses, so when we place a node, we place it in the right place
  the first time rather than moving later.

This PR replaces the old (a)egraph implementation. It removes both the
cranelift-egraph crate and the logic in cranelift-codegen that uses
it.

On `spidermonkey.wasm` running a simple recursive Fibonacci
microbenchmark, this work shows 5.5% compile-time reduction and 7.7%
runtime improvement (speedup).

Most of this implementation was done in (very productive) pair
programming sessions with Jamey Sharp, thus:

Co-authored-by: Jamey Sharp <jsharp@fastly.com>

* Review feedback.

* Review feedback.

* Review feedback.

* Bugfix: cprop rule: `(x + k1) - k2` becomes `x - (k2 - k1)`, not `x - (k1 - k2)`.

Co-authored-by: Jamey Sharp <jsharp@fastly.com>
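
The cprop bugfix in the last bullet of the message can be checked with plain wrapping arithmetic. The following is a standalone sketch, not Cranelift or ISLE code: `original`, `rewrite_fixed`, and `rewrite_buggy` are hypothetical names, and `u32` with wrapping operations is assumed as a stand-in for CLIF's modular integer add and subtract.

```rust
/// Left-hand side of the rule: `(x + k1) - k2`, modulo 2^32.
fn original(x: u32, k1: u32, k2: u32) -> u32 {
    x.wrapping_add(k1).wrapping_sub(k2)
}

/// Fixed right-hand side: `x - (k2 - k1)`.
fn rewrite_fixed(x: u32, k1: u32, k2: u32) -> u32 {
    x.wrapping_sub(k2.wrapping_sub(k1))
}

/// Buggy right-hand side: `x - (k1 - k2)`.
fn rewrite_buggy(x: u32, k1: u32, k2: u32) -> u32 {
    x.wrapping_sub(k1.wrapping_sub(k2))
}

fn main() {
    // (10 + 3) - 5 = 8, and 10 - (5 - 3) = 8.
    assert_eq!(original(10, 3, 5), 8);
    assert_eq!(rewrite_fixed(10, 3, 5), 8);
    // The buggy form computes 10 - (3 - 5) = 10 + 2 = 12 instead.
    assert_eq!(rewrite_buggy(10, 3, 5), 12);

    // The fixed form agrees with the original on a small grid of inputs,
    // including values that wrap around.
    for &x in &[0u32, 1, 10, u32::MAX] {
        for &k1 in &[0u32, 3, u32::MAX] {
            for &k2 in &[0u32, 5, u32::MAX] {
                assert_eq!(original(x, k1, k2), rewrite_fixed(x, k1, k2));
            }
        }
    }
}
```

The two right-hand sides differ by a sign on the folded constant, so they only coincide when `k1 - k2` equals `k2 - k1` modulo 2^32, for example when `k1 == k2`.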
2022-12-06 14:58:57 -08:00
parent 08d44e3746
commit f980defe17
42 changed files with 1890 additions and 3884 deletions
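
As a rough illustration of the "union" value described in the commit message, here is a minimal standalone sketch of an eclass stored as a tree of union nodes, with a toy extraction step that picks the cheapest member. It is not the Cranelift implementation: `Value`, `Def`, `Graph`, and the flat per-node `cost` are all hypothetical, and real extraction would presumably also account for the cost of a node's arguments.

```rust
/// Hypothetical stand-in for a CLIF `Value`: an index into `Graph::defs`.
type Value = usize;

#[derive(Debug, Clone, Copy)]
enum Def {
    /// An ordinary enode, with an assumed flat cost.
    Node { cost: u32 },
    /// A union of two earlier values that are assumed to be equivalent.
    Union(Value, Value),
}

struct Graph {
    defs: Vec<Def>,
}

impl Graph {
    /// Add an ordinary node and return its value.
    fn node(&mut self, cost: u32) -> Value {
        self.defs.push(Def::Node { cost });
        self.defs.len() - 1
    }

    /// Add a union node. It may only refer to values that already exist,
    /// which is what keeps the structure acyclic.
    fn union(&mut self, a: Value, b: Value) -> Value {
        let id = self.defs.len();
        debug_assert!(a < id && b < id);
        self.defs.push(Def::Union(a, b));
        id
    }

    /// Walk the union tree rooted at `v` and return the cheapest
    /// non-union member together with its cost. Ties keep the left side.
    fn best(&self, v: Value) -> (Value, u32) {
        match self.defs[v] {
            Def::Node { cost } => (v, cost),
            Def::Union(a, b) => {
                let (best_a, best_b) = (self.best(a), self.best(b));
                if best_b.1 < best_a.1 { best_b } else { best_a }
            }
        }
    }
}

fn main() {
    let mut g = Graph { defs: Vec::new() };
    let a = g.node(3); // original form of some expression
    let b = g.node(1); // a cheaper rewritten form
    let u1 = g.union(a, b); // eclass {a, b}
    let c = g.node(2); // another equivalent form
    let u2 = g.union(u1, c); // eclass {a, b, c}

    // Extraction from the newest union sees every member of the eclass.
    assert_eq!(g.best(u2), (b, 1));
    // An older root only sees the members that existed at that point.
    assert_eq!(g.best(u1), (b, 1));
    assert_eq!(g.best(a), (a, 3));
}
```

Because each union refers only to older values, a use that was rewritten to `u1` never observes `c`; this mirrors the commit message's point that uses refer to the "latest" form available when they are processed.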
--- a/cranelift/codegen/src/egraph.rs
+++ b/cranelift/codegen/src/egraph.rs
@@ -1,342 +1,462 @@
-//! Egraph-based mid-end optimization framework.
+//! Support for egraphs represented in the DataFlowGraph.

+use crate::alias_analysis::{AliasAnalysis, LastStores};
+use crate::ctxhash::{CtxEq, CtxHash, CtxHashMap};
+use crate::cursor::{Cursor, CursorPosition, FuncCursor};
 use crate::dominator_tree::DominatorTree;
-use crate::egraph::stores::PackedMemoryState;
-use crate::flowgraph::ControlFlowGraph;
-use crate::loop_analysis::{LoopAnalysis, LoopLevel};
-use crate::trace;
-use crate::{
-    fx::{FxHashMap, FxHashSet},
-    inst_predicates::has_side_effect,
-    ir::{Block, Function, Inst, InstructionData, InstructionImms, Opcode, Type},
+use crate::egraph::domtree::DomTreeWithChildren;
+use crate::egraph::elaborate::Elaborator;
+use crate::fx::FxHashSet;
+use crate::inst_predicates::is_pure_for_egraph;
+use crate::ir::{
+    DataFlowGraph, Function, Inst, InstructionData, Type, Value, ValueDef, ValueListPool,
 };
-use alloc::vec::Vec;
-use core::ops::Range;
-use cranelift_egraph::{EGraph, Id, Language, NewOrExisting};
-use cranelift_entity::EntityList;
+use crate::loop_analysis::LoopAnalysis;
+use crate::opts::generated_code::ContextIter;
+use crate::opts::IsleContext;
+use crate::trace;
+use crate::unionfind::UnionFind;
+use cranelift_entity::packed_option::ReservedValue;
 use cranelift_entity::SecondaryMap;
+use std::hash::Hasher;

+mod cost;
 mod domtree;
 mod elaborate;
-mod node;
-mod stores;

-use elaborate::Elaborator;
-pub use node::{Node, NodeCtx};
-pub use stores::{AliasAnalysis, MemoryState};
-
-pub struct FuncEGraph<'a> {
+/// Pass over a Function that does the whole aegraph thing.
+///
+/// - Removes non-skeleton nodes from the Layout.
+/// - Performs a GVN-and-rule-application pass over all Values
+///   reachable from the skeleton, potentially creating new Union
+///   nodes (i.e., an aegraph) so that some values have multiple
+///   representations.
+/// - Does "extraction" on the aegraph: selects the best value out of
+///   the tree-of-Union nodes for each used value.
+/// - Does "scoped elaboration" on the aegraph: chooses one or more
+///   locations for pure nodes to become instructions again in the
+///   layout, as forced by the skeleton.
+///
+/// At the beginning and end of this pass, the CLIF should be in a
+/// state that passes the verifier and, additionally, has no Union
+/// nodes. During the pass, Union nodes may exist, and instructions in
+/// the layout may refer to results of instructions that are not
+/// placed in the layout.
+pub struct EgraphPass<'a> {
+    /// The function we're operating on.
+    func: &'a mut Function,
    /// Dominator tree, used for elaboration pass.
    domtree: &'a DominatorTree,
-    /// Loop analysis results, used for built-in LICM during elaboration.
+    /// Alias analysis, used during optimization.
+    alias_analysis: &'a mut AliasAnalysis<'a>,
+    /// "Domtree with children": like `domtree`, but with an explicit
+    /// list of children, rather than just parent pointers.
+    domtree_children: DomTreeWithChildren,
+    /// Loop analysis results, used for built-in LICM during
+    /// elaboration.
    loop_analysis: &'a LoopAnalysis,
-    /// Last-store tracker for integrated alias analysis during egraph build.
-    alias_analysis: AliasAnalysis,
-    /// The egraph itself.
-    pub(crate) egraph: EGraph<NodeCtx, Analysis>,
-    /// "node context", containing arenas for node data.
-    pub(crate) node_ctx: NodeCtx,
-    /// Ranges in `side_effect_ids` for sequences of side-effecting
-    /// eclasses per block.
-    side_effects: SecondaryMap<Block, Range<u32>>,
-    side_effect_ids: Vec<Id>,
-    /// Map from store instructions to their nodes; used for store-to-load forwarding.
-    pub(crate) store_nodes: FxHashMap<Inst, (Type, Id)>,
-    /// Ranges in `blockparam_ids_tys` for sequences of blockparam
-    /// eclass IDs and types per block.
-    blockparams: SecondaryMap<Block, Range<u32>>,
-    blockparam_ids_tys: Vec<(Id, Type)>,
-    /// Which canonical node IDs do we want to rematerialize in each
+    /// Which canonical Values do we want to rematerialize in each
    /// block where they're used?
-    pub(crate) remat_ids: FxHashSet<Id>,
-    /// Which canonical node IDs have an enode whose value subsumes
-    /// all others it's unioned with?
-    pub(crate) subsume_ids: FxHashSet<Id>,
-    /// Statistics recorded during the process of building,
-    /// optimizing, and lowering out of this egraph.
+    ///
+    /// (A canonical Value is the *oldest* Value in an eclass,
+    /// i.e. tree of union value-nodes).
+    remat_values: FxHashSet<Value>,
+    /// Stats collected while we run this pass.
    pub(crate) stats: Stats,
-    /// Current rewrite-recursion depth. Used to enforce a finite
-    /// limit on rewrite rule application so that we don't get stuck
-    /// in an infinite chain.
+    /// Union-find that maps all members of a Union tree (eclass) back
+    /// to the *oldest* (lowest-numbered) `Value`.
+    eclasses: UnionFind<Value>,
+}
+
+/// Context passed through node insertion and optimization.
+pub(crate) struct OptimizeCtx<'opt, 'analysis>
+where
+    'analysis: 'opt,
+{
+    // Borrowed from EgraphPass:
+    pub(crate) func: &'opt mut Function,
+    pub(crate) value_to_opt_value: &'opt mut SecondaryMap<Value, Value>,
+    pub(crate) gvn_map: &'opt mut CtxHashMap<(Type, InstructionData), Value>,
+    pub(crate) eclasses: &'opt mut UnionFind<Value>,
+    pub(crate) remat_values: &'opt mut FxHashSet<Value>,
+    pub(crate) stats: &'opt mut Stats,
+    pub(crate) alias_analysis: &'opt mut AliasAnalysis<'analysis>,
+    pub(crate) alias_analysis_state: &'opt mut LastStores,
+    // Held locally during optimization of one node (recursively):
    pub(crate) rewrite_depth: usize,
+    pub(crate) subsume_values: FxHashSet<Value>,
 }

-#[derive(Clone, Debug, Default)]
-pub(crate) struct Stats {
-    pub(crate) node_created: u64,
-    pub(crate) node_param: u64,
-    pub(crate) node_result: u64,
-    pub(crate) node_pure: u64,
-    pub(crate) node_inst: u64,
-    pub(crate) node_load: u64,
-    pub(crate) node_dedup_query: u64,
-    pub(crate) node_dedup_hit: u64,
-    pub(crate) node_dedup_miss: u64,
-    pub(crate) node_ctor_created: u64,
-    pub(crate) node_ctor_deduped: u64,
-    pub(crate) node_union: u64,
-    pub(crate) node_subsume: u64,
-    pub(crate) store_map_insert: u64,
-    pub(crate) side_effect_nodes: u64,
-    pub(crate) rewrite_rule_invoked: u64,
-    pub(crate) rewrite_depth_limit: u64,
-    pub(crate) store_to_load_forward: u64,
-    pub(crate) elaborate_visit_node: u64,
-    pub(crate) elaborate_memoize_hit: u64,
-    pub(crate) elaborate_memoize_miss: u64,
-    pub(crate) elaborate_memoize_miss_remat: u64,
-    pub(crate) elaborate_licm_hoist: u64,
-    pub(crate) elaborate_func: u64,
-    pub(crate) elaborate_func_pre_insts: u64,
-    pub(crate) elaborate_func_post_insts: u64,
+/// For passing to `insert_pure_enode`. Sometimes the enode already
+/// exists as an Inst (from the original CLIF), and sometimes we're in
+/// the middle of creating it and want to avoid inserting it if
+/// possible until we know we need it.
+pub(crate) enum NewOrExistingInst {
+    New(InstructionData, Type),
+    Existing(Inst),
 }

-impl<'a> FuncEGraph<'a> {
-    /// Create a new EGraph for the given function. Requires the
-    /// domtree to be precomputed as well; the domtree is used for
-    /// scheduling when lowering out of the egraph.
-    pub fn new(
-        func: &Function,
-        domtree: &'a DominatorTree,
-        loop_analysis: &'a LoopAnalysis,
-        cfg: &ControlFlowGraph,
-    ) -> FuncEGraph<'a> {
-        let num_values = func.dfg.num_values();
-        let num_blocks = func.dfg.num_blocks();
-        let node_count_estimate = num_values * 2;
-        let alias_analysis = AliasAnalysis::new(func, cfg);
-        let mut this = Self {
-            domtree,
-            loop_analysis,
-            alias_analysis,
-            egraph: EGraph::with_capacity(node_count_estimate, Some(Analysis)),
-            node_ctx: NodeCtx::with_capacity_for_dfg(&func.dfg),
-            side_effects: SecondaryMap::with_capacity(num_blocks),
-            side_effect_ids: Vec::with_capacity(node_count_estimate),
-            store_nodes: FxHashMap::default(),
-            blockparams: SecondaryMap::with_capacity(num_blocks),
-            blockparam_ids_tys: Vec::with_capacity(num_blocks * 10),
-            remat_ids: FxHashSet::default(),
-            subsume_ids: FxHashSet::default(),
-            stats: Default::default(),
-            rewrite_depth: 0,
+impl NewOrExistingInst {
+    fn get_inst_key<'a>(&'a self, dfg: &'a DataFlowGraph) -> (Type, InstructionData) {
+        match self {
+            NewOrExistingInst::New(data, ty) => (*ty, *data),
+            NewOrExistingInst::Existing(inst) => {
+                let ty = dfg.ctrl_typevar(*inst);
+                (ty, dfg[*inst].clone())
+            }
+        }
+    }
+}
+
+impl<'opt, 'analysis> OptimizeCtx<'opt, 'analysis>
+where
+    'analysis: 'opt,
+{
+    /// Optimization of a single instruction.
+    ///
+    /// This does a few things:
+    /// - Looks up the instruction in the GVN deduplication map. If we
+    ///   already have the same instruction somewhere else, with the
+    ///   same args, then we can alias the original instruction's
+    ///   results and omit this instruction entirely.
+    ///   - Note that we do this canonicalization based on the
+    ///     instruction with its arguments as *canonical* eclass IDs,
+    ///     that is, the oldest (smallest index) `Value` reachable in
+    ///     the tree-of-unions (whole eclass). This ensures that we
+    ///     properly canonicalize newer nodes that use newer "versions"
+    ///     of a value that are still equal to the older versions.
+    /// - If the instruction is "new" (not deduplicated), then apply
+    ///   optimization rules:
+    ///   - All of the mid-end rules written in ISLE.
+    ///   - Store-to-load forwarding.
+    /// - Update the value-to-opt-value map, and update the eclass
+    ///   union-find, if we rewrote the value to different form(s).
+    pub(crate) fn insert_pure_enode(&mut self, inst: NewOrExistingInst) -> Value {
+        // Create the external context for looking up and updating the
+        // GVN map. This is necessary so that instructions themselves
+        // do not have to carry all the references or data for a full
+        // `Eq` or `Hash` impl.
+        let gvn_context = GVNContext {
+            union_find: self.eclasses,
+            value_lists: &self.func.dfg.value_lists,
        };
-        this.store_nodes.reserve(func.dfg.num_values() / 8);
-        this.remat_ids.reserve(func.dfg.num_values() / 4);
-        this.subsume_ids.reserve(func.dfg.num_values() / 4);
-        this.build(func);
-        this
+
+        self.stats.pure_inst += 1;
+        if let NewOrExistingInst::New(..) = inst {
+            self.stats.new_inst += 1;
+        }
+
+        // Does this instruction already exist? If so, add entries to
+        // the value-map to rewrite uses of its results to the results
+        // of the original (existing) instruction. If not, optimize
+        // the new instruction.
+        if let Some(&orig_result) = self
+            .gvn_map
+            .get(&inst.get_inst_key(&self.func.dfg), &gvn_context)
+        {
+            self.stats.pure_inst_deduped += 1;
+            if let NewOrExistingInst::Existing(inst) = inst {
+                debug_assert_eq!(self.func.dfg.inst_results(inst).len(), 1);
+                let result = self.func.dfg.first_result(inst);
+                self.value_to_opt_value[result] = orig_result;
+                self.eclasses.union(result, orig_result);
+                self.stats.union += 1;
+                result
+            } else {
+                orig_result
+            }
+        } else {
+            // Now actually insert the InstructionData and attach
+            // result value (exactly one).
+            let (inst, result, ty) = match inst {
+                NewOrExistingInst::New(data, typevar) => {
+                    let inst = self.func.dfg.make_inst(data);
+                    // TODO: reuse return value?
+                    self.func.dfg.make_inst_results(inst, typevar);
+                    let result = self.func.dfg.first_result(inst);
+                    // Add to eclass unionfind.
+                    self.eclasses.add(result);
+                    // New inst. We need to do the analysis of its result.
+                    (inst, result, typevar)
+                }
+                NewOrExistingInst::Existing(inst) => {
+                    let result = self.func.dfg.first_result(inst);
+                    let ty = self.func.dfg.ctrl_typevar(inst);
+                    (inst, result, ty)
+                }
+            };
+
+            let opt_value = self.optimize_pure_enode(inst);
+            let gvn_context = GVNContext {
+                union_find: self.eclasses,
+                value_lists: &self.func.dfg.value_lists,
+            };
+            self.gvn_map
+                .insert((ty, self.func.dfg[inst].clone()), opt_value, &gvn_context);
+            self.value_to_opt_value[result] = opt_value;
+            opt_value
+        }
    }

-    fn build(&mut self, func: &Function) {
-        // Mapping of SSA `Value` to eclass ID.
-        let mut value_to_id = FxHashMap::default();
+    /// Optimizes an enode by applying any matching mid-end rewrite
+    /// rules (or store-to-load forwarding, which is a special case),
+    /// unioning together all possible optimized (or rewritten) forms
+    /// of this expression into an eclass and returning the `Value`
+    /// that represents that eclass.
+    fn optimize_pure_enode(&mut self, inst: Inst) -> Value {
+        // A pure node always has exactly one result.
+        let orig_value = self.func.dfg.first_result(inst);

-        // For each block in RPO, create an enode for block entry, for
-        // each block param, and for each instruction.
-        for &block in self.domtree.cfg_postorder().iter().rev() {
-            let loop_level = self.loop_analysis.loop_level(block);
-            let blockparam_start =
-                u32::try_from(self.blockparam_ids_tys.len()).expect("Overflow in blockparam count");
-            for (i, &value) in func.dfg.block_params(block).iter().enumerate() {
-                let ty = func.dfg.value_type(value);
-                let param = self
-                    .egraph
-                    .add(
-                        Node::Param {
-                            block,
-                            index: i
-                                .try_into()
-                                .expect("blockparam index should fit in Node::Param"),
-                            ty,
-                            loop_level,
-                        },
-                        &mut self.node_ctx,
-                    )
-                    .get();
-                value_to_id.insert(value, param);
-                self.blockparam_ids_tys.push((param, ty));
-                self.stats.node_created += 1;
-                self.stats.node_param += 1;
-            }
-            let blockparam_end =
-                u32::try_from(self.blockparam_ids_tys.len()).expect("Overflow in blockparam count");
-            self.blockparams[block] = blockparam_start..blockparam_end;
+        let mut isle_ctx = IsleContext { ctx: self };

-            let side_effect_start =
-                u32::try_from(self.side_effect_ids.len()).expect("Overflow in side-effect count");
-            for inst in func.layout.block_insts(block) {
-                // Build args from SSA values.
-                let args = EntityList::from_iter(
-                    func.dfg.inst_args(inst).iter().map(|&arg| {
-                        let arg = func.dfg.resolve_aliases(arg);
-                        *value_to_id
-                            .get(&arg)
-                            .expect("Must have seen def before this use")
-                    }),
-                    &mut self.node_ctx.args,
+        // Limit rewrite depth. When we apply optimization rules, they
+        // may create new nodes (values) and those are, recursively,
+        // optimized eagerly as soon as they are created. So we may
+        // have more than one ISLE invocation on the stack. (This is
+        // necessary so that as the toplevel builds the
+        // right-hand-side expression bottom-up, it uses the "latest"
+        // optimized values for all the constituent parts.) To avoid
+        // infinite or problematic recursion, we bound the rewrite
+        // depth to a small constant here.
+        const REWRITE_LIMIT: usize = 5;
+        if isle_ctx.ctx.rewrite_depth > REWRITE_LIMIT {
+            isle_ctx.ctx.stats.rewrite_depth_limit += 1;
+            return orig_value;
+        }
+        isle_ctx.ctx.rewrite_depth += 1;
+
+        // Invoke the ISLE toplevel constructor, getting all new
+        // values produced as equivalents to this value.
+        trace!("Calling into ISLE with original value {}", orig_value);
+        isle_ctx.ctx.stats.rewrite_rule_invoked += 1;
+        let optimized_values =
+            crate::opts::generated_code::constructor_simplify(&mut isle_ctx, orig_value);
+
+        // Create a union of all new values with the original (or
+        // maybe just one new value marked as "subsuming" the
+        // original, if present.)
+        let mut union_value = orig_value;
+        if let Some(mut optimized_values) = optimized_values {
+            while let Some(optimized_value) = optimized_values.next(&mut isle_ctx) {
+                trace!(
+                    "Returned from ISLE for {}, got {:?}",
+                    orig_value,
+                    optimized_value
                );
+                if optimized_value == orig_value {
+                    trace!(" -> same as orig value; skipping");
+                    continue;
+                }
+                if isle_ctx.ctx.subsume_values.contains(&optimized_value) {
+                    // Merge in the unionfind so canonicalization
+                    // still works, but take *only* the subsuming
+                    // value, and break now.
+                    isle_ctx.ctx.eclasses.union(optimized_value, union_value);
+                    union_value = optimized_value;
+                    break;
+                }

-                let results = func.dfg.inst_results(inst);
-                let ty = if results.len() == 1 {
-                    func.dfg.value_type(results[0])
+                let old_union_value = union_value;
+                union_value = isle_ctx
+                    .ctx
+                    .func
+                    .dfg
+                    .union(old_union_value, optimized_value);
+                isle_ctx.ctx.stats.union += 1;
+                trace!(" -> union: now {}", union_value);
+                isle_ctx.ctx.eclasses.add(union_value);
+                isle_ctx
+                    .ctx
+                    .eclasses
+                    .union(old_union_value, optimized_value);
+                isle_ctx.ctx.eclasses.union(old_union_value, union_value);
+            }
+        }
+
+        isle_ctx.ctx.rewrite_depth -= 1;
+
+        union_value
+    }
+
+    /// Optimize a "skeleton" instruction, possibly removing
+    /// it. Returns `true` if the instruction should be removed from
+    /// the layout.
+    fn optimize_skeleton_inst(&mut self, inst: Inst) -> bool {
+        self.stats.skeleton_inst += 1;
+        // Not pure, but may still be a load or store:
+        // process it to see if we can optimize it.
+        if let Some(new_result) =
+            self.alias_analysis
+                .process_inst(self.func, self.alias_analysis_state, inst)
+        {
+            self.stats.alias_analysis_removed += 1;
+            let result = self.func.dfg.first_result(inst);
+            self.value_to_opt_value[result] = new_result;
+            true
+        } else {
+            // Set all results to identity-map to themselves
+            // in the value-to-opt-value map.
+            for &result in self.func.dfg.inst_results(inst) {
+                self.value_to_opt_value[result] = result;
+                self.eclasses.add(result);
+            }
+            false
+        }
+    }
+}
+
+impl<'a> EgraphPass<'a> {
+    /// Create a new EgraphPass.
+    pub fn new(
+        func: &'a mut Function,
+        domtree: &'a DominatorTree,
+        loop_analysis: &'a LoopAnalysis,
+        alias_analysis: &'a mut AliasAnalysis<'a>,
+    ) -> Self {
+        let num_values = func.dfg.num_values();
+        let domtree_children = DomTreeWithChildren::new(func, domtree);
+        Self {
+            func,
+            domtree,
+            domtree_children,
+            loop_analysis,
+            alias_analysis,
+            stats: Stats::default(),
+            eclasses: UnionFind::with_capacity(num_values),
+            remat_values: FxHashSet::default(),
+        }
+    }
+
+    /// Run the process.
+    pub fn run(&mut self) {
+        self.remove_pure_and_optimize();
+
+        trace!("egraph built:\n{}\n", self.func.display());
+        if cfg!(feature = "trace-log") {
+            for (value, def) in self.func.dfg.values_and_defs() {
+                trace!(" -> {} = {:?}", value, def);
+                match def {
+                    ValueDef::Result(i, 0) => {
+                        trace!("  -> {} = {:?}", i, self.func.dfg[i]);
+                    }
+                    _ => {}
+                }
+            }
+        }
+        trace!("stats: {:?}", self.stats);
+        self.elaborate();
+    }
+
+    /// Remove pure nodes from the `Layout` of the function, ensuring
+    /// that only the "side-effect skeleton" remains, and also
+    /// optimize the pure nodes. This is the first step of
+    /// egraph-based processing and turns the pure CFG-based CLIF into
+    /// a CFG skeleton with a sea of (optimized) nodes tying it
+    /// together.
+    ///
+    /// As we walk through the code, we eagerly apply optimization
+    /// rules; at any given point we have a "latest version" of an
+    /// eclass of possible representations for a `Value` in the
+    /// original program, which is itself a `Value` at the root of a
+    /// union-tree. We keep a map from the original values to these
+    /// optimized values. When we encounter any instruction (pure or
+    /// side-effecting skeleton) we rewrite its arguments to capture
+    /// the "latest" optimized forms of these values. (We need to do
+    /// this as part of this pass, and not later using a finished map,
+    /// because the eclass can continue to be updated and we need to
+    /// only refer to its subset that exists at this stage, to
+    /// maintain acyclicity.)
+    fn remove_pure_and_optimize(&mut self) {
+        let mut cursor = FuncCursor::new(self.func);
+        let mut value_to_opt_value: SecondaryMap<Value, Value> =
+            SecondaryMap::with_default(Value::reserved_value());
+        let mut gvn_map: CtxHashMap<(Type, InstructionData), Value> =
+            CtxHashMap::with_capacity(cursor.func.dfg.num_values());
+
+        // In domtree preorder, visit blocks. (TODO: factor out an
+        // iterator from this and elaborator.)
+        let root = self.domtree_children.root();
+        let mut block_stack = vec![root];
+        while let Some(block) = block_stack.pop() {
+            // We popped this block; push children
+            // immediately, then process this block.
+            block_stack.extend(self.domtree_children.children(block));
+
+            trace!("Processing block {}", block);
+            cursor.set_position(CursorPosition::Before(block));
+
+            let mut alias_analysis_state = self.alias_analysis.block_starting_state(block);
+
+            for &param in cursor.func.dfg.block_params(block) {
+                trace!("creating initial singleton eclass for blockparam {}", param);
+                self.eclasses.add(param);
+                value_to_opt_value[param] = param;
+            }
+            while let Some(inst) = cursor.next_inst() {
+                trace!("Processing inst {}", inst);
+
+                // While we're passing over all insts, create initial
+                // singleton eclasses for all result and blockparam
+                // values.  Also do initial analysis of all inst
+                // results.
+                for &result in cursor.func.dfg.inst_results(inst) {
+                    trace!("creating initial singleton eclass for {}", result);
+                    self.eclasses.add(result);
+                }
+
+                // Rewrite args of *all* instructions using the
+                // value-to-opt-value map.
+                cursor.func.dfg.resolve_aliases_in_arguments(inst);
+                for arg in cursor.func.dfg.inst_args_mut(inst) {
+                    let new_value = value_to_opt_value[*arg];
+                    trace!("rewriting arg {} of inst {} to {}", arg, inst, new_value);
+                    debug_assert_ne!(new_value, Value::reserved_value());
+                    *arg = new_value;
+                }
+
+                // Build a context for optimization, with borrows of
+                // state. We can't invoke a method on `self` because
+                // we've borrowed `self.func` mutably (as
+                // `cursor.func`) so we pull apart the pieces instead
+                // here.
+                let mut ctx = OptimizeCtx {
+                    func: cursor.func,
+                    value_to_opt_value: &mut value_to_opt_value,
+                    gvn_map: &mut gvn_map,
+                    eclasses: &mut self.eclasses,
+                    rewrite_depth: 0,
+                    subsume_values: FxHashSet::default(),
+                    remat_values: &mut self.remat_values,
+                    stats: &mut self.stats,
+                    alias_analysis: self.alias_analysis,
+                    alias_analysis_state: &mut alias_analysis_state,
+                };
+
+                if is_pure_for_egraph(ctx.func, inst) {
+                    // Insert into GVN map and optimize any new nodes
+                    // inserted (recursively performing this work for
+                    // any nodes the optimization rules produce).
+                    let inst = NewOrExistingInst::Existing(inst);
+                    ctx.insert_pure_enode(inst);
+                    // We've now rewritten all uses, or will when we
+                    // see them, and the instruction exists as a pure
+                    // enode in the eclass, so we can remove it.
+                    cursor.remove_inst_and_step_back();
                } else {
-                    crate::ir::types::INVALID
-                };
-
-                let load_mem_state = self.alias_analysis.get_state_for_load(inst);
-                let is_readonly_load = match func.dfg[inst] {
-                    InstructionData::Load {
-                        opcode: Opcode::Load,
-                        flags,
-                        ..
-                    } => flags.readonly() && flags.notrap(),
-                    _ => false,
-                };
-
-                // Create the egraph node.
-                let op = InstructionImms::from(&func.dfg[inst]);
-                let opcode = op.opcode();
-                let srcloc = func.srclocs[inst];
-                let arity = u16::try_from(results.len())
-                    .expect("More than 2^16 results from an instruction");
-
-                let node = if is_readonly_load {
-                    self.stats.node_created += 1;
-                    self.stats.node_pure += 1;
-                    Node::Pure {
-                        op,
-                        args,
-                        ty,
-                        arity,
-                    }
-                } else if let Some(load_mem_state) = load_mem_state {
-                    let addr = args.as_slice(&self.node_ctx.args)[0];
-                    trace!("load at inst {} has mem state {:?}", inst, load_mem_state);
-                    self.stats.node_created += 1;
-                    self.stats.node_load += 1;
-                    Node::Load {
-                        op,
-                        ty,
-                        addr,
-                        mem_state: load_mem_state,
-                        srcloc,
-                    }
-                } else if has_side_effect(func, inst) || opcode.can_load() {
-                    self.stats.node_created += 1;
-                    self.stats.node_inst += 1;
-                    Node::Inst {
-                        op,
-                        args,
-                        ty,
-                        arity,
-                        srcloc,
-                        loop_level,
-                    }
-                } else {
-                    self.stats.node_created += 1;
-                    self.stats.node_pure += 1;
-                    Node::Pure {
-                        op,
-                        args,
-                        ty,
-                        arity,
-                    }
-                };
-                let dedup_needed = self.node_ctx.needs_dedup(&node);
-                let is_pure = matches!(node, Node::Pure { .. });
-
-                let mut id = self.egraph.add(node, &mut self.node_ctx);
-
-                if dedup_needed {
-                    self.stats.node_dedup_query += 1;
-                    match id {
-                        NewOrExisting::New(_) => {
-                            self.stats.node_dedup_miss += 1;
-                        }
-                        NewOrExisting::Existing(_) => {
-                            self.stats.node_dedup_hit += 1;
-                        }
-                    }
-                }
-
-                if opcode == Opcode::Store {
-                    let store_data_ty = func.dfg.value_type(func.dfg.inst_args(inst)[0]);
-                    self.store_nodes.insert(inst, (store_data_ty, id.get()));
-                    self.stats.store_map_insert += 1;
-                }
-
-                // Loads that did not already merge into an existing
-                // load: try to forward from a store (store-to-load
-                // forwarding).
-                if let NewOrExisting::New(new_id) = id {
-                    if load_mem_state.is_some() {
-                        let opt_id = crate::opts::store_to_load(new_id, self);
-                        trace!("store_to_load: {} -> {}", new_id, opt_id);
-                        if opt_id != new_id {
-                            id = NewOrExisting::Existing(opt_id);
-                        }
-                    }
-                }
-
-                // Now either optimize (for new pure nodes), or add to
-                // the side-effecting list (for all other new nodes).
-                let id = match id {
-                    NewOrExisting::Existing(id) => id,
-                    NewOrExisting::New(id) if is_pure => {
-                        // Apply all optimization rules immediately; the
-                        // aegraph (acyclic egraph) works best when we do
-                        // this so all uses pick up the eclass with all
-                        // possible enodes.
-                        crate::opts::optimize_eclass(id, self)
-                    }
-                    NewOrExisting::New(id) => {
-                        self.side_effect_ids.push(id);
-                        self.stats.side_effect_nodes += 1;
-                        id
-                    }
-                };
-
-                // Create results and save in Value->Id map.
-                match results {
-                    &[] => {}
-                    &[one_result] => {
-                        trace!("build: value {} -> id {}", one_result, id);
-                        value_to_id.insert(one_result, id);
-                    }
-                    many_results => {
-                        debug_assert!(many_results.len() > 1);
-                        for (i, &result) in many_results.iter().enumerate() {
-                            let ty = func.dfg.value_type(result);
-                            let projection = self
-                                .egraph
-                                .add(
-                                    Node::Result {
-                                        value: id,
-                                        result: i,
-                                        ty,
-                                    },
-                                    &mut self.node_ctx,
-                                )
-                                .get();
-                            self.stats.node_created += 1;
-                            self.stats.node_result += 1;
-                            trace!("build: value {} -> id {}", result, projection);
-                            value_to_id.insert(result, projection);
-                        }
+                    if ctx.optimize_skeleton_inst(inst) {
+                        cursor.remove_inst_and_step_back();
                    }
                }
            }
-
-            let side_effect_end =
-                u32::try_from(self.side_effect_ids.len()).expect("Overflow in side-effect count");
-            let side_effect_range = side_effect_start..side_effect_end;
-            self.side_effects[block] = side_effect_range;
        }
    }

    /// Scoped elaboration: compute a final ordering of op computation
-    /// for each block and replace the given Func body.
+    /// for each block and update the given Func body. After this
+    /// runs, the function body is back in the state where every
+    /// Inst with a used result is placed in the layout (possibly
+    /// duplicated, if our code-motion logic decides this is the best
+    /// option).
    ///
    /// This works in concert with the domtree. We do a preorder
    /// traversal of the domtree, tracking a scoped map from Id to
@@ -354,76 +474,95 @@ impl<'a> FuncEGraph<'a> {
    /// thus computed "as late as possible", but then memoized into
    /// the Id-to-Value map and available to all dominated blocks and
    /// for the rest of this block. (This subsumes GVN.)
-    pub fn elaborate(&mut self, func: &mut Function) {
-        let mut elab = Elaborator::new(
-            func,
+    fn elaborate(&mut self) {
+        let mut elaborator = Elaborator::new(
+            self.func,
            self.domtree,
+            &self.domtree_children,
            self.loop_analysis,
-            &self.egraph,
-            &self.node_ctx,
-            &self.remat_ids,
+            &mut self.remat_values,
+            &mut self.eclasses,
            &mut self.stats,
        );
-        elab.elaborate(
-            |block| {
-                let blockparam_range = self.blockparams[block].clone();
-                &self.blockparam_ids_tys
-                    [blockparam_range.start as usize..blockparam_range.end as usize]
-            },
-            |block| {
-                let side_effect_range = self.side_effects[block].clone();
-                &self.side_effect_ids
-                    [side_effect_range.start as usize..side_effect_range.end as usize]
-            },
-        );
+        elaborator.elaborate();
+
+        self.check_post_egraph();
    }
-}

-/// State for egraph analysis that computes all needed properties.
-pub(crate) struct Analysis;
-
-/// Analysis results for each eclass id.
-#[derive(Clone, Debug)]
-pub(crate) struct AnalysisValue {
-    pub(crate) loop_level: LoopLevel,
-}
-
-impl Default for AnalysisValue {
-    fn default() -> Self {
-        Self {
-            loop_level: LoopLevel::root(),
+    #[cfg(debug_assertions)]
+    fn check_post_egraph(&self) {
+        // Verify that no union nodes are reachable from inst args,
+        // and that all inst args' defining instructions are in the
+        // layout.
+        for block in self.func.layout.blocks() {
+            for inst in self.func.layout.block_insts(block) {
+                for &arg in self.func.dfg.inst_args(inst) {
+                    match self.func.dfg.value_def(arg) {
+                        ValueDef::Result(i, _) => {
+                            debug_assert!(self.func.layout.inst_block(i).is_some());
+                        }
+                        ValueDef::Union(..) => {
+                            panic!("egraph union node {} still reachable at {}!", arg, inst);
+                        }
+                        _ => {}
+                    }
+                }
+            }
        }
    }
+
+    #[cfg(not(debug_assertions))]
+    fn check_post_egraph(&self) {}
 }

-impl cranelift_egraph::Analysis for Analysis {
-    type L = NodeCtx;
-    type Value = AnalysisValue;
+/// Implementation of external-context equality and hashing on
+/// InstructionData. This allows us to deduplicate instructions given
+/// some context that lets us see their value lists and the mapping from
+/// any value to its "canonical value" (in an eclass).
+struct GVNContext<'a> {
+    value_lists: &'a ValueListPool,
+    union_find: &'a UnionFind<Value>,
+}

-    fn for_node(
+impl<'a> CtxEq<(Type, InstructionData), (Type, InstructionData)> for GVNContext<'a> {
+    fn ctx_eq(
        &self,
-        ctx: &NodeCtx,
-        n: &Node,
-        values: &SecondaryMap<Id, AnalysisValue>,
-    ) -> AnalysisValue {
-        let loop_level = match n {
-            &Node::Pure { ref args, .. } => args
-                .as_slice(&ctx.args)
-                .iter()
-                .map(|&arg| values[arg].loop_level)
-                .max()
-                .unwrap_or(LoopLevel::root()),
-            &Node::Load { addr, .. } => values[addr].loop_level,
-            &Node::Result { value, .. } => values[value].loop_level,
-            &Node::Inst { loop_level, .. } | &Node::Param { loop_level, .. } => loop_level,
-        };
-
-        AnalysisValue { loop_level }
-    }
-
-    fn meet(&self, _ctx: &NodeCtx, v1: &AnalysisValue, v2: &AnalysisValue) -> AnalysisValue {
-        AnalysisValue {
-            loop_level: std::cmp::max(v1.loop_level, v2.loop_level),
-        }
+        (a_ty, a_inst): &(Type, InstructionData),
+        (b_ty, b_inst): &(Type, InstructionData),
+    ) -> bool {
+        a_ty == b_ty
+            && a_inst.eq(b_inst, self.value_lists, |value| {
+                self.union_find.find(value)
+            })
    }
 }
+
+impl<'a> CtxHash<(Type, InstructionData)> for GVNContext<'a> {
+    fn ctx_hash<H: Hasher>(&self, state: &mut H, (ty, inst): &(Type, InstructionData)) {
+        std::hash::Hash::hash(&ty, state);
+        inst.hash(state, self.value_lists, |value| self.union_find.find(value));
+    }
+}
+
+/// Statistics collected during egraph-based processing.
+#[derive(Clone, Debug, Default)]
+pub(crate) struct Stats {
+    pub(crate) pure_inst: u64,
+    pub(crate) pure_inst_deduped: u64,
+    pub(crate) skeleton_inst: u64,
+    pub(crate) alias_analysis_removed: u64,
+    pub(crate) new_inst: u64,
+    pub(crate) union: u64,
+    pub(crate) subsume: u64,
+    pub(crate) remat: u64,
+    pub(crate) rewrite_rule_invoked: u64,
+    pub(crate) rewrite_depth_limit: u64,
+    pub(crate) elaborate_visit_node: u64,
+    pub(crate) elaborate_memoize_hit: u64,
+    pub(crate) elaborate_memoize_miss: u64,
+    pub(crate) elaborate_memoize_miss_remat: u64,
+    pub(crate) elaborate_licm_hoist: u64,
+    pub(crate) elaborate_func: u64,
+    pub(crate) elaborate_func_pre_insts: u64,
+    pub(crate) elaborate_func_post_insts: u64,
+}