egraph support: rewrite to work in terms of CLIF data structures. (#5382)
* egraph support: rewrite to work in terms of CLIF data structures.

  This work rewrites the "egraph"-based optimization framework in Cranelift to operate on aegraphs (acyclic egraphs) represented in the CLIF itself, rather than as a separate data structure to and from which we translate the CLIF.

  The basic idea is to add a new kind of value, a "union", that is like an alias but refers to two other values rather than one. This allows us to represent an eclass of enodes (values) as a tree. The union node allows a value to have *multiple representations*: either constituent value could be used, and (in well-formed CLIF produced by correct optimization rules) they must be equivalent.

  Like the old egraph infrastructure, we take advantage of acyclicity and eager rule application to do optimization in a single pass. Like before, we integrate GVN (during the optimization pass) and LICM (during elaboration). Unlike the old egraph infrastructure, everything stays in the DataFlowGraph. "Pure" enodes are represented as instructions that have values attached, but that are not placed into the function layout. When entering "egraph" form, we remove them from the layout while optimizing. When leaving "egraph" form, during elaboration, we can place an instruction back into the layout the first time we elaborate the enode; if we elaborate it more than once, we clone the instruction.

  The implementation performs two passes overall:

  - One, a forward pass in RPO (to see defs before uses), that (i) removes "pure" instructions from the layout and (ii) optimizes as it goes. As before, we eagerly optimize, so we form the entire union of optimized forms of a value before we see any uses of that value. This lets us rewrite uses to use the most "up-to-date" form of the value and canonicalize and optimize that form.

    The eager rewriting and acyclic representation make each other work: we could not eagerly rewrite if there were cycles, and acyclicity misses no optimization opportunities only because the first time we introduce a value, we immediately produce its "best" form. This design choice is also what allows us to avoid the "parent pointers" and fixpoint loop of traditional egraphs.

    This forward optimization pass keeps a scoped hashmap to "intern" nodes (thus performing GVN), and also interleaves on a per-instruction level with alias analysis. The interleaving with alias analysis allows alias analysis to see the most optimized form of each address (so it can see equivalences), and allows the next value to see any equivalences (reuses of loads or stored values) that alias analysis uncovers.

  - Two, a forward pass in domtree preorder that "elaborates" pure enodes back into the layout, possibly in multiple places if needed. This pass tracks the loop nest and hoists nodes as needed, performing LICM as it goes. Note that by doing this in forward order, we avoid the "fixpoint" that traditional LICM needs: we hoist a def before its uses, so when we place a node, we place it in the right place the first time rather than moving it later.

  This PR replaces the old (a)egraph implementation. It removes both the cranelift-egraph crate and the logic in cranelift-codegen that uses it.

  On `spidermonkey.wasm` running a simple recursive Fibonacci microbenchmark, this work shows a 5.5% compile-time reduction and a 7.7% runtime improvement (speedup).

  Most of this implementation was done in (very productive) pair programming sessions with Jamey Sharp, thus:

  Co-authored-by: Jamey Sharp <jsharp@fastly.com>

* Review feedback.

* Review feedback.

* Review feedback.

* Bugfix: cprop rule: `(x + k1) - k2` becomes `x - (k2 - k1)`, not `x - (k1 - k2)`.

Co-authored-by: Jamey Sharp <jsharp@fastly.com>
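The cprop bugfix mentioned in the last commit above can be sanity-checked with ordinary wrapping integer arithmetic. This is a standalone sketch, not the actual ISLE rule:

```rust
// Check the constant-propagation identity: in two's-complement (wrapping)
// arithmetic, (x + k1) - k2 == x - (k2 - k1). The buggy form,
// x - (k1 - k2), differs by 2*(k1 - k2), so it only agrees when k1 == k2.
fn main() {
    for &(x, k1, k2) in &[(10i64, 3i64, 7i64), (-5, 100, 2), (i64::MAX, 1, -1)] {
        let lhs = x.wrapping_add(k1).wrapping_sub(k2);
        let correct = x.wrapping_sub(k2.wrapping_sub(k1));
        let buggy = x.wrapping_sub(k1.wrapping_sub(k2));
        assert_eq!(lhs, correct);
        if k1 != k2 {
            assert_ne!(lhs, buggy);
        }
    }
    println!("cprop identity holds");
}
```

Note that the identity holds even when the intermediate addition overflows, which is why the rewrite is valid for CLIF's wrapping integer ops.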
@@ -12,7 +12,7 @@
 use crate::alias_analysis::AliasAnalysis;
 use crate::dce::do_dce;
 use crate::dominator_tree::DominatorTree;
-use crate::egraph::FuncEGraph;
+use crate::egraph::EgraphPass;
 use crate::flowgraph::ControlFlowGraph;
 use crate::ir::Function;
 use crate::isa::TargetIsa;
@@ -26,6 +26,7 @@ use crate::result::{CodegenResult, CompileResult};
 use crate::settings::{FlagsOrIsa, OptLevel};
 use crate::simple_gvn::do_simple_gvn;
 use crate::simple_preopt::do_preopt;
+use crate::trace;
 use crate::unreachable_code::eliminate_unreachable_code;
 use crate::verifier::{verify_context, VerifierErrors, VerifierResult};
 use crate::{timing, CompileError};
@@ -191,15 +192,7 @@ impl Context {
         self.remove_constant_phis(isa)?;

         if isa.flags().use_egraphs() {
-            log::debug!(
-                "About to optimize with egraph phase:\n{}",
-                self.func.display()
-            );
-            self.compute_loop_analysis();
-            let mut eg = FuncEGraph::new(&self.func, &self.domtree, &self.loop_analysis, &self.cfg);
-            eg.elaborate(&mut self.func);
-            log::debug!("After egraph optimization:\n{}", self.func.display());
-            log::info!("egraph stats: {:?}", eg.stats);
+            self.egraph_pass()?;
         } else if opt_level != OptLevel::None && isa.flags().enable_alias_analysis() {
             self.replace_redundant_loads()?;
             self.simple_gvn(isa)?;
@@ -379,4 +372,24 @@ impl Context {
         do_souper_harvest(&self.func, out);
         Ok(())
     }
+
+    /// Run optimizations via the egraph infrastructure.
+    pub fn egraph_pass(&mut self) -> CodegenResult<()> {
+        trace!(
+            "About to optimize with egraph phase:\n{}",
+            self.func.display()
+        );
+        self.compute_loop_analysis();
+        let mut alias_analysis = AliasAnalysis::new(&self.func, &self.domtree);
+        let mut pass = EgraphPass::new(
+            &mut self.func,
+            &self.domtree,
+            &self.loop_analysis,
+            &mut alias_analysis,
+        );
+        pass.run();
+        log::info!("egraph stats: {:?}", pass.stats);
+        trace!("After egraph optimization:\n{}", self.func.display());
+        Ok(())
+    }
 }
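The "union" value idea described in the commit message can be sketched independently of Cranelift's actual types. In this hypothetical model (the real representation lives in the `DataFlowGraph`, not in a standalone enum), an eclass is a binary tree of union nodes whose leaves are concrete enodes, and every member of the tree is equivalent:

```rust
// A standalone sketch (not Cranelift's actual data structures) of the
// aegraph idea: a "union" refers to two equivalent values, so a whole
// eclass of equivalent representations forms a tree.
#[derive(Debug)]
enum Node {
    // A concrete enode; the String stands in for real instruction data.
    Enode(String),
    // A union of two equivalent representations.
    Union(Box<Node>, Box<Node>),
}

// Enumerate all equivalent representations in one eclass tree.
fn members<'a>(n: &'a Node, out: &mut Vec<&'a str>) {
    match n {
        Node::Enode(s) => out.push(s),
        Node::Union(a, b) => {
            members(a, out);
            members(b, out);
        }
    }
}

fn main() {
    // Suppose x*2 was rewritten to x+x and then to x<<1; each rewrite
    // adds one union node on top, never mutating existing nodes.
    let eclass = Node::Union(
        Box::new(Node::Union(
            Box::new(Node::Enode("x*2".into())),
            Box::new(Node::Enode("x+x".into())),
        )),
        Box::new(Node::Enode("x<<1".into())),
    );
    let mut all = Vec::new();
    members(&eclass, &mut all);
    assert_eq!(all, ["x*2", "x+x", "x<<1"]);
    println!("{:?}", all);
}
```

Because union nodes are append-only and refer only to older values, the structure stays acyclic, which is what lets the pass below optimize eagerly in a single forward traversal.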
cranelift/codegen/src/ctxhash.rs (new file, 168 lines)
@@ -0,0 +1,168 @@
//! A hashmap with "external hashing": nodes are hashed or compared for
//! equality only with some external context provided on lookup/insert.
//! This allows very memory-efficient data structures where
//! node-internal data references some other storage (e.g., offsets into
//! an array or pool of shared data).

use hashbrown::raw::RawTable;
use std::hash::{Hash, Hasher};

/// Trait that allows for equality comparison given some external
/// context.
///
/// Note that this trait is implemented by the *context*, rather than
/// the item type, for somewhat complex lifetime reasons (lack of GATs
/// to allow `for<'ctx> Ctx<'ctx>`-like associated types in traits on
/// the value type).
pub trait CtxEq<V1: ?Sized, V2: ?Sized> {
    /// Determine whether `a` and `b` are equal, given the context in
    /// `self`.
    fn ctx_eq(&self, a: &V1, b: &V2) -> bool;
}

/// Trait that allows for hashing given some external context.
pub trait CtxHash<Value: ?Sized>: CtxEq<Value, Value> {
    /// Compute the hash of `value`, given the context in `self`.
    fn ctx_hash<H: Hasher>(&self, state: &mut H, value: &Value);
}

/// A null-comparator context type for underlying value types that
/// already have `Eq` and `Hash`.
#[derive(Default)]
pub struct NullCtx;

impl<V: Eq + Hash> CtxEq<V, V> for NullCtx {
    fn ctx_eq(&self, a: &V, b: &V) -> bool {
        a.eq(b)
    }
}
impl<V: Eq + Hash> CtxHash<V> for NullCtx {
    fn ctx_hash<H: Hasher>(&self, state: &mut H, value: &V) {
        value.hash(state);
    }
}

/// A bucket in the hash table.
///
/// Some performance-related design notes: we cache the hashcode for
/// speed, as this often buys a few percent speed in
/// interning-table-heavy workloads. We only keep the low 32 bits of
/// the hashcode, for memory efficiency: in common use, `K` and `V`
/// are often 32 bits also, and a 12-byte bucket is measurably better
/// than a 16-byte bucket.
struct BucketData<K, V> {
    hash: u32,
    k: K,
    v: V,
}

/// A HashMap that takes external context for all operations.
pub struct CtxHashMap<K, V> {
    raw: RawTable<BucketData<K, V>>,
}

impl<K, V> CtxHashMap<K, V> {
    /// Create an empty hashmap with pre-allocated space for the given
    /// capacity.
    pub fn with_capacity(capacity: usize) -> Self {
        Self {
            raw: RawTable::with_capacity(capacity),
        }
    }
}

fn compute_hash<Ctx, K>(ctx: &Ctx, k: &K) -> u32
where
    Ctx: CtxHash<K>,
{
    let mut hasher = crate::fx::FxHasher::default();
    ctx.ctx_hash(&mut hasher, k);
    hasher.finish() as u32
}

impl<K, V> CtxHashMap<K, V> {
    /// Insert a new key-value pair, returning the old value associated
    /// with this key (if any).
    pub fn insert<Ctx>(&mut self, k: K, v: V, ctx: &Ctx) -> Option<V>
    where
        Ctx: CtxEq<K, K> + CtxHash<K>,
    {
        let hash = compute_hash(ctx, &k);
        match self.raw.find(hash as u64, |bucket| {
            hash == bucket.hash && ctx.ctx_eq(&bucket.k, &k)
        }) {
            Some(bucket) => {
                let data = unsafe { bucket.as_mut() };
                Some(std::mem::replace(&mut data.v, v))
            }
            None => {
                let data = BucketData { hash, k, v };
                self.raw
                    .insert_entry(hash as u64, data, |bucket| bucket.hash as u64);
                None
            }
        }
    }

    /// Look up a key, returning a borrow of the value if present.
    pub fn get<'a, Q, Ctx>(&'a self, k: &Q, ctx: &Ctx) -> Option<&'a V>
    where
        Ctx: CtxEq<K, Q> + CtxHash<Q> + CtxHash<K>,
    {
        let hash = compute_hash(ctx, k);
        self.raw
            .find(hash as u64, |bucket| {
                hash == bucket.hash && ctx.ctx_eq(&bucket.k, k)
            })
            .map(|bucket| {
                let data = unsafe { bucket.as_ref() };
                &data.v
            })
    }
}

#[cfg(test)]
mod test {
    use super::*;
    use std::hash::Hash;

    #[derive(Clone, Copy, Debug)]
    struct Key {
        index: u32,
    }
    struct Ctx {
        vals: &'static [&'static str],
    }
    impl CtxEq<Key, Key> for Ctx {
        fn ctx_eq(&self, a: &Key, b: &Key) -> bool {
            self.vals[a.index as usize].eq(self.vals[b.index as usize])
        }
    }
    impl CtxHash<Key> for Ctx {
        fn ctx_hash<H: Hasher>(&self, state: &mut H, value: &Key) {
            self.vals[value.index as usize].hash(state);
        }
    }

    #[test]
    fn test_basic() {
        let ctx = Ctx {
            vals: &["a", "b", "a"],
        };

        let k0 = Key { index: 0 };
        let k1 = Key { index: 1 };
        let k2 = Key { index: 2 };

        assert!(ctx.ctx_eq(&k0, &k2));
        assert!(!ctx.ctx_eq(&k0, &k1));
        assert!(!ctx.ctx_eq(&k2, &k1));

        let mut map: CtxHashMap<Key, u64> = CtxHashMap::with_capacity(4);
        assert_eq!(map.insert(k0, 42, &ctx), None);
        assert_eq!(map.insert(k2, 84, &ctx), Some(42));
        assert_eq!(map.get(&k1, &ctx), None);
        assert_eq!(*map.get(&k0, &ctx).unwrap(), 84);
    }
}
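The pattern in the file above — equality and hashing parameterized by an external context — can be illustrated without `hashbrown` using a linear-scan map. This is a toy sketch with hypothetical names (`TinyCtxMap`, `PoolCtx`), not the real `CtxHashMap`:

```rust
// Toy version of the "external context" trait: the context, not the key
// type, decides what equality means.
trait CtxEq<V> {
    fn ctx_eq(&self, a: &V, b: &V) -> bool;
}

// A linear-scan "map" that defers all key comparisons to the context.
struct TinyCtxMap<K, V> {
    entries: Vec<(K, V)>,
}

impl<K, V> TinyCtxMap<K, V> {
    fn new() -> Self {
        TinyCtxMap { entries: Vec::new() }
    }
    fn insert<C: CtxEq<K>>(&mut self, k: K, v: V, ctx: &C) -> Option<V> {
        for (ek, ev) in self.entries.iter_mut() {
            if ctx.ctx_eq(ek, &k) {
                return Some(std::mem::replace(ev, v));
            }
        }
        self.entries.push((k, v));
        None
    }
    fn get<'a, C: CtxEq<K>>(&'a self, k: &K, ctx: &C) -> Option<&'a V> {
        self.entries.iter().find(|(ek, _)| ctx.ctx_eq(ek, k)).map(|(_, v)| v)
    }
}

// Keys are indices into an external string pool held by the context,
// so the keys themselves stay 4 bytes.
struct PoolCtx<'a> {
    pool: &'a [&'a str],
}
impl<'a> CtxEq<u32> for PoolCtx<'a> {
    fn ctx_eq(&self, a: &u32, b: &u32) -> bool {
        self.pool[*a as usize] == self.pool[*b as usize]
    }
}

fn main() {
    let ctx = PoolCtx { pool: &["a", "b", "a"] };
    let mut m: TinyCtxMap<u32, u64> = TinyCtxMap::new();
    assert_eq!(m.insert(0, 42, &ctx), None);
    assert_eq!(m.insert(2, 84, &ctx), Some(42)); // index 2 aliases "a"
    assert_eq!(m.get(&1, &ctx), None);
    assert_eq!(*m.get(&0, &ctx).unwrap(), 84);
    println!("ok");
}
```

The real `CtxHashMap` applies the same idea to a hash table: in the egraph pass below, the GVN map's context carries the union-find and value-list pools so that `InstructionData` keys can be compared by canonical argument values without owning that data.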
@@ -1,342 +1,462 @@
//! Egraph-based mid-end optimization framework.
//! Support for egraphs represented in the DataFlowGraph.

use crate::alias_analysis::{AliasAnalysis, LastStores};
use crate::ctxhash::{CtxEq, CtxHash, CtxHashMap};
use crate::cursor::{Cursor, CursorPosition, FuncCursor};
use crate::dominator_tree::DominatorTree;
use crate::egraph::stores::PackedMemoryState;
use crate::flowgraph::ControlFlowGraph;
use crate::loop_analysis::{LoopAnalysis, LoopLevel};
use crate::trace;
use crate::{
    fx::{FxHashMap, FxHashSet},
    inst_predicates::has_side_effect,
    ir::{Block, Function, Inst, InstructionData, InstructionImms, Opcode, Type},
use crate::egraph::domtree::DomTreeWithChildren;
use crate::egraph::elaborate::Elaborator;
use crate::fx::FxHashSet;
use crate::inst_predicates::is_pure_for_egraph;
use crate::ir::{
    DataFlowGraph, Function, Inst, InstructionData, Type, Value, ValueDef, ValueListPool,
};
use alloc::vec::Vec;
use core::ops::Range;
use cranelift_egraph::{EGraph, Id, Language, NewOrExisting};
use cranelift_entity::EntityList;
use crate::loop_analysis::LoopAnalysis;
use crate::opts::generated_code::ContextIter;
use crate::opts::IsleContext;
use crate::trace;
use crate::unionfind::UnionFind;
use cranelift_entity::packed_option::ReservedValue;
use cranelift_entity::SecondaryMap;
use std::hash::Hasher;

mod cost;
mod domtree;
mod elaborate;
mod node;
mod stores;

use elaborate::Elaborator;
pub use node::{Node, NodeCtx};
pub use stores::{AliasAnalysis, MemoryState};
pub struct FuncEGraph<'a> {
/// Pass over a Function that does the whole aegraph thing.
///
/// - Removes non-skeleton nodes from the Layout.
/// - Performs a GVN-and-rule-application pass over all Values
///   reachable from the skeleton, potentially creating new Union
///   nodes (i.e., an aegraph) so that some values have multiple
///   representations.
/// - Does "extraction" on the aegraph: selects the best value out of
///   the tree-of-Union nodes for each used value.
/// - Does "scoped elaboration" on the aegraph: chooses one or more
///   locations for pure nodes to become instructions again in the
///   layout, as forced by the skeleton.
///
/// At the beginning and end of this pass, the CLIF should be in a
/// state that passes the verifier and, additionally, has no Union
/// nodes. During the pass, Union nodes may exist, and instructions in
/// the layout may refer to results of instructions that are not
/// placed in the layout.
pub struct EgraphPass<'a> {
    /// The function we're operating on.
    func: &'a mut Function,
    /// Dominator tree, used for elaboration pass.
    domtree: &'a DominatorTree,
    /// Loop analysis results, used for built-in LICM during elaboration.
    /// Alias analysis, used during optimization.
    alias_analysis: &'a mut AliasAnalysis<'a>,
    /// "Domtree with children": like `domtree`, but with an explicit
    /// list of children, rather than just parent pointers.
    domtree_children: DomTreeWithChildren,
    /// Loop analysis results, used for built-in LICM during
    /// elaboration.
    loop_analysis: &'a LoopAnalysis,
    /// Last-store tracker for integrated alias analysis during egraph build.
    alias_analysis: AliasAnalysis,
    /// The egraph itself.
    pub(crate) egraph: EGraph<NodeCtx, Analysis>,
    /// "node context", containing arenas for node data.
    pub(crate) node_ctx: NodeCtx,
    /// Ranges in `side_effect_ids` for sequences of side-effecting
    /// eclasses per block.
    side_effects: SecondaryMap<Block, Range<u32>>,
    side_effect_ids: Vec<Id>,
    /// Map from store instructions to their nodes; used for store-to-load forwarding.
    pub(crate) store_nodes: FxHashMap<Inst, (Type, Id)>,
    /// Ranges in `blockparam_ids_tys` for sequences of blockparam
    /// eclass IDs and types per block.
    blockparams: SecondaryMap<Block, Range<u32>>,
    blockparam_ids_tys: Vec<(Id, Type)>,
    /// Which canonical node IDs do we want to rematerialize in each
    /// Which canonical Values do we want to rematerialize in each
    /// block where they're used?
    pub(crate) remat_ids: FxHashSet<Id>,
    /// Which canonical node IDs have an enode whose value subsumes
    /// all others it's unioned with?
    pub(crate) subsume_ids: FxHashSet<Id>,
    /// Statistics recorded during the process of building,
    /// optimizing, and lowering out of this egraph.
    ///
    /// (A canonical Value is the *oldest* Value in an eclass,
    /// i.e. tree of union value-nodes).
    remat_values: FxHashSet<Value>,
    /// Stats collected while we run this pass.
    pub(crate) stats: Stats,
    /// Current rewrite-recursion depth. Used to enforce a finite
    /// limit on rewrite rule application so that we don't get stuck
    /// in an infinite chain.
    /// Union-find that maps all members of a Union tree (eclass) back
    /// to the *oldest* (lowest-numbered) `Value`.
    eclasses: UnionFind<Value>,
}
/// Context passed through node insertion and optimization.
pub(crate) struct OptimizeCtx<'opt, 'analysis>
where
    'analysis: 'opt,
{
    // Borrowed from EgraphPass:
    pub(crate) func: &'opt mut Function,
    pub(crate) value_to_opt_value: &'opt mut SecondaryMap<Value, Value>,
    pub(crate) gvn_map: &'opt mut CtxHashMap<(Type, InstructionData), Value>,
    pub(crate) eclasses: &'opt mut UnionFind<Value>,
    pub(crate) remat_values: &'opt mut FxHashSet<Value>,
    pub(crate) stats: &'opt mut Stats,
    pub(crate) alias_analysis: &'opt mut AliasAnalysis<'analysis>,
    pub(crate) alias_analysis_state: &'opt mut LastStores,
    // Held locally during optimization of one node (recursively):
    pub(crate) rewrite_depth: usize,
    pub(crate) subsume_values: FxHashSet<Value>,
}

#[derive(Clone, Debug, Default)]
pub(crate) struct Stats {
    pub(crate) node_created: u64,
    pub(crate) node_param: u64,
    pub(crate) node_result: u64,
    pub(crate) node_pure: u64,
    pub(crate) node_inst: u64,
    pub(crate) node_load: u64,
    pub(crate) node_dedup_query: u64,
    pub(crate) node_dedup_hit: u64,
    pub(crate) node_dedup_miss: u64,
    pub(crate) node_ctor_created: u64,
    pub(crate) node_ctor_deduped: u64,
    pub(crate) node_union: u64,
    pub(crate) node_subsume: u64,
    pub(crate) store_map_insert: u64,
    pub(crate) side_effect_nodes: u64,
    pub(crate) rewrite_rule_invoked: u64,
    pub(crate) rewrite_depth_limit: u64,
    pub(crate) store_to_load_forward: u64,
    pub(crate) elaborate_visit_node: u64,
    pub(crate) elaborate_memoize_hit: u64,
    pub(crate) elaborate_memoize_miss: u64,
    pub(crate) elaborate_memoize_miss_remat: u64,
    pub(crate) elaborate_licm_hoist: u64,
    pub(crate) elaborate_func: u64,
    pub(crate) elaborate_func_pre_insts: u64,
    pub(crate) elaborate_func_post_insts: u64,
/// For passing to `insert_pure_enode`. Sometimes the enode already
/// exists as an Inst (from the original CLIF), and sometimes we're in
/// the middle of creating it and want to avoid inserting it if
/// possible until we know we need it.
pub(crate) enum NewOrExistingInst {
    New(InstructionData, Type),
    Existing(Inst),
}
impl<'a> FuncEGraph<'a> {
    /// Create a new EGraph for the given function. Requires the
    /// domtree to be precomputed as well; the domtree is used for
    /// scheduling when lowering out of the egraph.
    pub fn new(
        func: &Function,
        domtree: &'a DominatorTree,
        loop_analysis: &'a LoopAnalysis,
        cfg: &ControlFlowGraph,
    ) -> FuncEGraph<'a> {
        let num_values = func.dfg.num_values();
        let num_blocks = func.dfg.num_blocks();
        let node_count_estimate = num_values * 2;
        let alias_analysis = AliasAnalysis::new(func, cfg);
        let mut this = Self {
            domtree,
            loop_analysis,
            alias_analysis,
            egraph: EGraph::with_capacity(node_count_estimate, Some(Analysis)),
            node_ctx: NodeCtx::with_capacity_for_dfg(&func.dfg),
            side_effects: SecondaryMap::with_capacity(num_blocks),
            side_effect_ids: Vec::with_capacity(node_count_estimate),
            store_nodes: FxHashMap::default(),
            blockparams: SecondaryMap::with_capacity(num_blocks),
            blockparam_ids_tys: Vec::with_capacity(num_blocks * 10),
            remat_ids: FxHashSet::default(),
            subsume_ids: FxHashSet::default(),
            stats: Default::default(),
            rewrite_depth: 0,
impl NewOrExistingInst {
    fn get_inst_key<'a>(&'a self, dfg: &'a DataFlowGraph) -> (Type, InstructionData) {
        match self {
            NewOrExistingInst::New(data, ty) => (*ty, *data),
            NewOrExistingInst::Existing(inst) => {
                let ty = dfg.ctrl_typevar(*inst);
                (ty, dfg[*inst].clone())
            }
        }
    }
}
impl<'opt, 'analysis> OptimizeCtx<'opt, 'analysis>
where
    'analysis: 'opt,
{
    /// Optimization of a single instruction.
    ///
    /// This does a few things:
    /// - Looks up the instruction in the GVN deduplication map. If we
    ///   already have the same instruction somewhere else, with the
    ///   same args, then we can alias the original instruction's
    ///   results and omit this instruction entirely.
    ///   - Note that we do this canonicalization based on the
    ///     instruction with its arguments as *canonical* eclass IDs,
    ///     that is, the oldest (smallest index) `Value` reachable in
    ///     the tree-of-unions (whole eclass). This ensures that we
    ///     properly canonicalize newer nodes that use newer "versions"
    ///     of a value that are still equal to the older versions.
    /// - If the instruction is "new" (not deduplicated), then apply
    ///   optimization rules:
    ///   - All of the mid-end rules written in ISLE.
    ///   - Store-to-load forwarding.
    /// - Update the value-to-opt-value map, and update the eclass
    ///   union-find, if we rewrote the value to different form(s).
    pub(crate) fn insert_pure_enode(&mut self, inst: NewOrExistingInst) -> Value {
        // Create the external context for looking up and updating the
        // GVN map. This is necessary so that instructions themselves
        // do not have to carry all the references or data for a full
        // `Eq` or `Hash` impl.
        let gvn_context = GVNContext {
            union_find: self.eclasses,
            value_lists: &self.func.dfg.value_lists,
        };
        this.store_nodes.reserve(func.dfg.num_values() / 8);
        this.remat_ids.reserve(func.dfg.num_values() / 4);
        this.subsume_ids.reserve(func.dfg.num_values() / 4);
        this.build(func);
        this

        self.stats.pure_inst += 1;
        if let NewOrExistingInst::New(..) = inst {
            self.stats.new_inst += 1;
        }

        // Does this instruction already exist? If so, add entries to
        // the value-map to rewrite uses of its results to the results
        // of the original (existing) instruction. If not, optimize
        // the new instruction.
        if let Some(&orig_result) = self
            .gvn_map
            .get(&inst.get_inst_key(&self.func.dfg), &gvn_context)
        {
            self.stats.pure_inst_deduped += 1;
            if let NewOrExistingInst::Existing(inst) = inst {
                debug_assert_eq!(self.func.dfg.inst_results(inst).len(), 1);
                let result = self.func.dfg.first_result(inst);
                self.value_to_opt_value[result] = orig_result;
                self.eclasses.union(result, orig_result);
                self.stats.union += 1;
                result
            } else {
                orig_result
            }
        } else {
            // Now actually insert the InstructionData and attach
            // result value (exactly one).
            let (inst, result, ty) = match inst {
                NewOrExistingInst::New(data, typevar) => {
                    let inst = self.func.dfg.make_inst(data);
                    // TODO: reuse return value?
                    self.func.dfg.make_inst_results(inst, typevar);
                    let result = self.func.dfg.first_result(inst);
                    // Add to eclass unionfind.
                    self.eclasses.add(result);
                    // New inst. We need to do the analysis of its result.
                    (inst, result, typevar)
                }
                NewOrExistingInst::Existing(inst) => {
                    let result = self.func.dfg.first_result(inst);
                    let ty = self.func.dfg.ctrl_typevar(inst);
                    (inst, result, ty)
                }
            };

            let opt_value = self.optimize_pure_enode(inst);
            let gvn_context = GVNContext {
                union_find: self.eclasses,
                value_lists: &self.func.dfg.value_lists,
            };
            self.gvn_map
                .insert((ty, self.func.dfg[inst].clone()), opt_value, &gvn_context);
            self.value_to_opt_value[result] = opt_value;
            opt_value
        }
    }
    fn build(&mut self, func: &Function) {
        // Mapping of SSA `Value` to eclass ID.
        let mut value_to_id = FxHashMap::default();
    /// Optimizes an enode by applying any matching mid-end rewrite
    /// rules (or store-to-load forwarding, which is a special case),
    /// unioning together all possible optimized (or rewritten) forms
    /// of this expression into an eclass and returning the `Value`
    /// that represents that eclass.
    fn optimize_pure_enode(&mut self, inst: Inst) -> Value {
        // A pure node always has exactly one result.
        let orig_value = self.func.dfg.first_result(inst);

        // For each block in RPO, create an enode for block entry, for
        // each block param, and for each instruction.
        for &block in self.domtree.cfg_postorder().iter().rev() {
            let loop_level = self.loop_analysis.loop_level(block);
            let blockparam_start =
                u32::try_from(self.blockparam_ids_tys.len()).expect("Overflow in blockparam count");
            for (i, &value) in func.dfg.block_params(block).iter().enumerate() {
                let ty = func.dfg.value_type(value);
                let param = self
                    .egraph
                    .add(
                        Node::Param {
                            block,
                            index: i
                                .try_into()
                                .expect("blockparam index should fit in Node::Param"),
                            ty,
                            loop_level,
                        },
                        &mut self.node_ctx,
                    )
                    .get();
                value_to_id.insert(value, param);
                self.blockparam_ids_tys.push((param, ty));
                self.stats.node_created += 1;
                self.stats.node_param += 1;
            }
            let blockparam_end =
                u32::try_from(self.blockparam_ids_tys.len()).expect("Overflow in blockparam count");
            self.blockparams[block] = blockparam_start..blockparam_end;
        let mut isle_ctx = IsleContext { ctx: self };

            let side_effect_start =
                u32::try_from(self.side_effect_ids.len()).expect("Overflow in side-effect count");
            for inst in func.layout.block_insts(block) {
                // Build args from SSA values.
                let args = EntityList::from_iter(
                    func.dfg.inst_args(inst).iter().map(|&arg| {
                        let arg = func.dfg.resolve_aliases(arg);
                        *value_to_id
                            .get(&arg)
                            .expect("Must have seen def before this use")
                    }),
                    &mut self.node_ctx.args,
        // Limit rewrite depth. When we apply optimization rules, they
        // may create new nodes (values) and those are, recursively,
        // optimized eagerly as soon as they are created. So we may
        // have more than one ISLE invocation on the stack. (This is
        // necessary so that as the toplevel builds the
        // right-hand-side expression bottom-up, it uses the "latest"
        // optimized values for all the constituent parts.) To avoid
        // infinite or problematic recursion, we bound the rewrite
        // depth to a small constant here.
        const REWRITE_LIMIT: usize = 5;
        if isle_ctx.ctx.rewrite_depth > REWRITE_LIMIT {
            isle_ctx.ctx.stats.rewrite_depth_limit += 1;
            return orig_value;
        }
        isle_ctx.ctx.rewrite_depth += 1;

        // Invoke the ISLE toplevel constructor, getting all new
        // values produced as equivalents to this value.
        trace!("Calling into ISLE with original value {}", orig_value);
        isle_ctx.ctx.stats.rewrite_rule_invoked += 1;
        let optimized_values =
            crate::opts::generated_code::constructor_simplify(&mut isle_ctx, orig_value);

        // Create a union of all new values with the original (or
        // maybe just one new value marked as "subsuming" the
        // original, if present.)
        let mut union_value = orig_value;
        if let Some(mut optimized_values) = optimized_values {
            while let Some(optimized_value) = optimized_values.next(&mut isle_ctx) {
                trace!(
                    "Returned from ISLE for {}, got {:?}",
                    orig_value,
                    optimized_value
                );
                if optimized_value == orig_value {
                    trace!(" -> same as orig value; skipping");
                    continue;
                }
                if isle_ctx.ctx.subsume_values.contains(&optimized_value) {
                    // Merge in the unionfind so canonicalization
                    // still works, but take *only* the subsuming
                    // value, and break now.
                    isle_ctx.ctx.eclasses.union(optimized_value, union_value);
                    union_value = optimized_value;
                    break;
                }

                let results = func.dfg.inst_results(inst);
                let ty = if results.len() == 1 {
                    func.dfg.value_type(results[0])
                let old_union_value = union_value;
                union_value = isle_ctx
                    .ctx
                    .func
                    .dfg
                    .union(old_union_value, optimized_value);
                isle_ctx.ctx.stats.union += 1;
                trace!(" -> union: now {}", union_value);
                isle_ctx.ctx.eclasses.add(union_value);
                isle_ctx
                    .ctx
                    .eclasses
                    .union(old_union_value, optimized_value);
                isle_ctx.ctx.eclasses.union(old_union_value, union_value);
            }
        }

        isle_ctx.ctx.rewrite_depth -= 1;

        union_value
    }
    /// Optimize a "skeleton" instruction, possibly removing
    /// it. Returns `true` if the instruction should be removed from
    /// the layout.
    fn optimize_skeleton_inst(&mut self, inst: Inst) -> bool {
        self.stats.skeleton_inst += 1;
        // Not pure, but may still be a load or store:
        // process it to see if we can optimize it.
        if let Some(new_result) =
            self.alias_analysis
                .process_inst(self.func, self.alias_analysis_state, inst)
        {
            self.stats.alias_analysis_removed += 1;
            let result = self.func.dfg.first_result(inst);
            self.value_to_opt_value[result] = new_result;
            true
        } else {
            // Set all results to identity-map to themselves
            // in the value-to-opt-value map.
            for &result in self.func.dfg.inst_results(inst) {
                self.value_to_opt_value[result] = result;
                self.eclasses.add(result);
            }
            false
        }
    }
}
impl<'a> EgraphPass<'a> {
    /// Create a new EgraphPass.
    pub fn new(
        func: &'a mut Function,
        domtree: &'a DominatorTree,
        loop_analysis: &'a LoopAnalysis,
        alias_analysis: &'a mut AliasAnalysis<'a>,
    ) -> Self {
        let num_values = func.dfg.num_values();
        let domtree_children = DomTreeWithChildren::new(func, domtree);
        Self {
            func,
            domtree,
            domtree_children,
            loop_analysis,
            alias_analysis,
            stats: Stats::default(),
            eclasses: UnionFind::with_capacity(num_values),
            remat_values: FxHashSet::default(),
        }
    }

    /// Run the process.
    pub fn run(&mut self) {
        self.remove_pure_and_optimize();

        trace!("egraph built:\n{}\n", self.func.display());
        if cfg!(feature = "trace-log") {
            for (value, def) in self.func.dfg.values_and_defs() {
                trace!(" -> {} = {:?}", value, def);
                match def {
                    ValueDef::Result(i, 0) => {
                        trace!(" -> {} = {:?}", i, self.func.dfg[i]);
                    }
                    _ => {}
                }
            }
        }
        trace!("stats: {:?}", self.stats);
        self.elaborate();
    }

    /// Remove pure nodes from the `Layout` of the function, ensuring
    /// that only the "side-effect skeleton" remains, and also
    /// optimize the pure nodes. This is the first step of
    /// egraph-based processing and turns the pure CFG-based CLIF into
    /// a CFG skeleton with a sea of (optimized) nodes tying it
    /// together.
    ///
    /// As we walk through the code, we eagerly apply optimization
    /// rules; at any given point we have a "latest version" of an
    /// eclass of possible representations for a `Value` in the
    /// original program, which is itself a `Value` at the root of a
    /// union-tree. We keep a map from the original values to these
    /// optimized values. When we encounter any instruction (pure or
    /// side-effecting skeleton) we rewrite its arguments to capture
    /// the "latest" optimized forms of these values. (We need to do
    /// this as part of this pass, and not later using a finished map,
    /// because the eclass can continue to be updated and we need to
    /// only refer to its subset that exists at this stage, to
    /// maintain acyclicity.)
    fn remove_pure_and_optimize(&mut self) {
        let mut cursor = FuncCursor::new(self.func);
        let mut value_to_opt_value: SecondaryMap<Value, Value> =
            SecondaryMap::with_default(Value::reserved_value());
        let mut gvn_map: CtxHashMap<(Type, InstructionData), Value> =
            CtxHashMap::with_capacity(cursor.func.dfg.num_values());

        // In domtree preorder, visit blocks. (TODO: factor out an
        // iterator from this and elaborator.)
        let root = self.domtree_children.root();
        let mut block_stack = vec![root];
        while let Some(block) = block_stack.pop() {
            // We popped this block; push children
            // immediately, then process this block.
            block_stack.extend(self.domtree_children.children(block));

            trace!("Processing block {}", block);
            cursor.set_position(CursorPosition::Before(block));

            let mut alias_analysis_state = self.alias_analysis.block_starting_state(block);

            for &param in cursor.func.dfg.block_params(block) {
                trace!("creating initial singleton eclass for blockparam {}", param);
                self.eclasses.add(param);
                value_to_opt_value[param] = param;
            }

            while let Some(inst) = cursor.next_inst() {
                trace!("Processing inst {}", inst);

                // While we're passing over all insts, create initial
                // singleton eclasses for all result and blockparam
                // values. Also do initial analysis of all inst
                // results.
                for &result in cursor.func.dfg.inst_results(inst) {
                    trace!("creating initial singleton eclass for {}", result);
                    self.eclasses.add(result);
                }

                // Rewrite args of *all* instructions using the
                // value-to-opt-value map.
                cursor.func.dfg.resolve_aliases_in_arguments(inst);
                for arg in cursor.func.dfg.inst_args_mut(inst) {
                    let new_value = value_to_opt_value[*arg];
                    trace!("rewriting arg {} of inst {} to {}", arg, inst, new_value);
                    debug_assert_ne!(new_value, Value::reserved_value());
                    *arg = new_value;
                }

                // Build a context for optimization, with borrows of
                // state. We can't invoke a method on `self` because
                // we've borrowed `self.func` mutably (as
                // `cursor.func`) so we pull apart the pieces instead
                // here.
                let mut ctx = OptimizeCtx {
                    func: cursor.func,
                    value_to_opt_value: &mut value_to_opt_value,
                    gvn_map: &mut gvn_map,
                    eclasses: &mut self.eclasses,
                    rewrite_depth: 0,
                    subsume_values: FxHashSet::default(),
                    remat_values: &mut self.remat_values,
                    stats: &mut self.stats,
                    alias_analysis: self.alias_analysis,
                    alias_analysis_state: &mut alias_analysis_state,
                };

                if is_pure_for_egraph(ctx.func, inst) {
                    // Insert into GVN map and optimize any new nodes
                    // inserted (recursively performing this work for
                    // any nodes the optimization rules produce).
                    let inst = NewOrExistingInst::Existing(inst);
                    ctx.insert_pure_enode(inst);
                    // We've now rewritten all uses, or will when we
                    // see them, and the instruction exists as a pure
                    // enode in the eclass, so we can remove it.
                    cursor.remove_inst_and_step_back();
                } else {
                    crate::ir::types::INVALID
                };

                let load_mem_state = self.alias_analysis.get_state_for_load(inst);
                let is_readonly_load = match func.dfg[inst] {
                    InstructionData::Load {
                        opcode: Opcode::Load,
                        flags,
                        ..
                    } => flags.readonly() && flags.notrap(),
                    _ => false,
                };

                // Create the egraph node.
                let op = InstructionImms::from(&func.dfg[inst]);
                let opcode = op.opcode();
                let srcloc = func.srclocs[inst];
                let arity = u16::try_from(results.len())
                    .expect("More than 2^16 results from an instruction");

                let node = if is_readonly_load {
                    self.stats.node_created += 1;
                    self.stats.node_pure += 1;
                    Node::Pure {
                        op,
                        args,
                        ty,
                        arity,
                    }
                } else if let Some(load_mem_state) = load_mem_state {
                    let addr = args.as_slice(&self.node_ctx.args)[0];
                    trace!("load at inst {} has mem state {:?}", inst, load_mem_state);
                    self.stats.node_created += 1;
                    self.stats.node_load += 1;
                    Node::Load {
                        op,
                        ty,
                        addr,
                        mem_state: load_mem_state,
                        srcloc,
                    }
                } else if has_side_effect(func, inst) || opcode.can_load() {
                    self.stats.node_created += 1;
                    self.stats.node_inst += 1;
                    Node::Inst {
                        op,
                        args,
                        ty,
                        arity,
                        srcloc,
                        loop_level,
                    }
                } else {
                    self.stats.node_created += 1;
                    self.stats.node_pure += 1;
                    Node::Pure {
                        op,
                        args,
                        ty,
                        arity,
                    }
                };
                let dedup_needed = self.node_ctx.needs_dedup(&node);
                let is_pure = matches!(node, Node::Pure { .. });

                let mut id = self.egraph.add(node, &mut self.node_ctx);

                if dedup_needed {
                    self.stats.node_dedup_query += 1;
                    match id {
                        NewOrExisting::New(_) => {
                            self.stats.node_dedup_miss += 1;
                        }
                        NewOrExisting::Existing(_) => {
                            self.stats.node_dedup_hit += 1;
                        }
                    }
                }

                if opcode == Opcode::Store {
                    let store_data_ty = func.dfg.value_type(func.dfg.inst_args(inst)[0]);
                    self.store_nodes.insert(inst, (store_data_ty, id.get()));
                    self.stats.store_map_insert += 1;
                }

                // Loads that did not already merge into an existing
                // load: try to forward from a store (store-to-load
                // forwarding).
                if let NewOrExisting::New(new_id) = id {
                    if load_mem_state.is_some() {
                        let opt_id = crate::opts::store_to_load(new_id, self);
                        trace!("store_to_load: {} -> {}", new_id, opt_id);
                        if opt_id != new_id {
                            id = NewOrExisting::Existing(opt_id);
                        }
                    }
                }

                // Now either optimize (for new pure nodes), or add to
                // the side-effecting list (for all other new nodes).
                let id = match id {
                    NewOrExisting::Existing(id) => id,
                    NewOrExisting::New(id) if is_pure => {
                        // Apply all optimization rules immediately; the
                        // aegraph (acyclic egraph) works best when we do
                        // this so all uses pick up the eclass with all
                        // possible enodes.
                        crate::opts::optimize_eclass(id, self)
                    }
                    NewOrExisting::New(id) => {
                        self.side_effect_ids.push(id);
                        self.stats.side_effect_nodes += 1;
                        id
                    }
                };

|
||||
// Create results and save in Value->Id map.
|
||||
match results {
|
||||
&[] => {}
|
||||
&[one_result] => {
|
||||
trace!("build: value {} -> id {}", one_result, id);
|
||||
value_to_id.insert(one_result, id);
|
||||
}
|
||||
many_results => {
|
||||
debug_assert!(many_results.len() > 1);
|
||||
for (i, &result) in many_results.iter().enumerate() {
|
||||
let ty = func.dfg.value_type(result);
|
||||
let projection = self
|
||||
.egraph
|
||||
.add(
|
||||
Node::Result {
|
||||
value: id,
|
||||
result: i,
|
||||
ty,
|
||||
},
|
||||
&mut self.node_ctx,
|
||||
)
|
||||
.get();
|
||||
self.stats.node_created += 1;
|
||||
self.stats.node_result += 1;
|
||||
trace!("build: value {} -> id {}", result, projection);
|
||||
value_to_id.insert(result, projection);
|
||||
}
|
||||
if ctx.optimize_skeleton_inst(inst) {
|
||||
cursor.remove_inst_and_step_back();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let side_effect_end =
|
||||
u32::try_from(self.side_effect_ids.len()).expect("Overflow in side-effect count");
|
||||
let side_effect_range = side_effect_start..side_effect_end;
|
||||
self.side_effects[block] = side_effect_range;
|
||||
}
|
||||
}
|
||||
|
||||
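The explicit-stack domtree preorder walk used above (pop a block, immediately push its children, then process it) also drives the elaborator. A hedged standalone sketch, where a plain `HashMap` of child lists stands in for `DomTreeWithChildren` and `u32` indices stand in for `Block`:

```rust
// Minimal sketch of domtree preorder with an explicit stack, as in
// `remove_pure_and_optimize`. The `children` map is a hypothetical
// stand-in for `DomTreeWithChildren`.
use std::collections::HashMap;

fn preorder(root: u32, children: &HashMap<u32, Vec<u32>>) -> Vec<u32> {
    let mut order = Vec::new();
    let mut stack = vec![root];
    while let Some(block) = stack.pop() {
        // We popped this block; push children immediately, then
        // process this block.
        stack.extend(children.get(&block).into_iter().flatten().copied());
        order.push(block);
    }
    order
}

fn main() {
    // Hypothetical domtree: block 0 dominates 1 and 2; 1 dominates 3.
    let mut children = HashMap::new();
    children.insert(0u32, vec![1u32, 2]);
    children.insert(1, vec![3]);
    let order = preorder(0, &children);
    // Every block is visited after its dominator, so defs are seen
    // before uses.
    assert_eq!(order[0], 0);
    let pos = |b| order.iter().position(|&x| x == b);
    assert!(pos(1) < pos(3));
    println!("{:?}", order);
}
```

Visiting blocks in this order is what lets the pass see defs before uses without a fixpoint.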
    /// Scoped elaboration: compute a final ordering of op computation
    /// for each block and replace the given Func body.
    /// for each block and update the given Func body. After this
    /// runs, the function body is back into the state where every
    /// Inst with a used result is placed in the layout (possibly
    /// duplicated, if our code-motion logic decides this is the best
    /// option).
    ///
    /// This works in concert with the domtree. We do a preorder
    /// traversal of the domtree, tracking a scoped map from Id to
@@ -354,76 +474,95 @@ impl<'a> FuncEGraph<'a> {
    /// thus computed "as late as possible", but then memoized into
    /// the Id-to-Value map and available to all dominated blocks and
    /// for the rest of this block. (This subsumes GVN.)
    pub fn elaborate(&mut self, func: &mut Function) {
        let mut elab = Elaborator::new(
            func,
    fn elaborate(&mut self) {
        let mut elaborator = Elaborator::new(
            self.func,
            self.domtree,
            &self.domtree_children,
            self.loop_analysis,
            &self.egraph,
            &self.node_ctx,
            &self.remat_ids,
            &mut self.remat_values,
            &mut self.eclasses,
            &mut self.stats,
        );
        elab.elaborate(
            |block| {
                let blockparam_range = self.blockparams[block].clone();
                &self.blockparam_ids_tys
                    [blockparam_range.start as usize..blockparam_range.end as usize]
            },
            |block| {
                let side_effect_range = self.side_effects[block].clone();
                &self.side_effect_ids
                    [side_effect_range.start as usize..side_effect_range.end as usize]
            },
        );
        elaborator.elaborate();

        self.check_post_egraph();
    }
}

/// State for egraph analysis that computes all needed properties.
pub(crate) struct Analysis;

/// Analysis results for each eclass id.
#[derive(Clone, Debug)]
pub(crate) struct AnalysisValue {
    pub(crate) loop_level: LoopLevel,
}

impl Default for AnalysisValue {
    fn default() -> Self {
        Self {
            loop_level: LoopLevel::root(),
    #[cfg(debug_assertions)]
    fn check_post_egraph(&self) {
        // Verify that no union nodes are reachable from inst args,
        // and that all inst args' defining instructions are in the
        // layout.
        for block in self.func.layout.blocks() {
            for inst in self.func.layout.block_insts(block) {
                for &arg in self.func.dfg.inst_args(inst) {
                    match self.func.dfg.value_def(arg) {
                        ValueDef::Result(i, _) => {
                            debug_assert!(self.func.layout.inst_block(i).is_some());
                        }
                        ValueDef::Union(..) => {
                            panic!("egraph union node {} still reachable at {}!", arg, inst);
                        }
                        _ => {}
                    }
                }
            }
        }
    }

    #[cfg(not(debug_assertions))]
    fn check_post_egraph(&self) {}
}

impl cranelift_egraph::Analysis for Analysis {
    type L = NodeCtx;
    type Value = AnalysisValue;
/// Implementation of external-context equality and hashing on
/// InstructionData. This allows us to deduplicate instructions given
/// some context that lets us see its value lists and the mapping from
/// any value to "canonical value" (in an eclass).
struct GVNContext<'a> {
    value_lists: &'a ValueListPool,
    union_find: &'a UnionFind<Value>,
}

    fn for_node(
impl<'a> CtxEq<(Type, InstructionData), (Type, InstructionData)> for GVNContext<'a> {
    fn ctx_eq(
        &self,
        ctx: &NodeCtx,
        n: &Node,
        values: &SecondaryMap<Id, AnalysisValue>,
    ) -> AnalysisValue {
        let loop_level = match n {
            &Node::Pure { ref args, .. } => args
                .as_slice(&ctx.args)
                .iter()
                .map(|&arg| values[arg].loop_level)
                .max()
                .unwrap_or(LoopLevel::root()),
            &Node::Load { addr, .. } => values[addr].loop_level,
            &Node::Result { value, .. } => values[value].loop_level,
            &Node::Inst { loop_level, .. } | &Node::Param { loop_level, .. } => loop_level,
        };

        AnalysisValue { loop_level }
    }

    fn meet(&self, _ctx: &NodeCtx, v1: &AnalysisValue, v2: &AnalysisValue) -> AnalysisValue {
        AnalysisValue {
            loop_level: std::cmp::max(v1.loop_level, v2.loop_level),
        }
        (a_ty, a_inst): &(Type, InstructionData),
        (b_ty, b_inst): &(Type, InstructionData),
    ) -> bool {
        a_ty == b_ty
            && a_inst.eq(b_inst, self.value_lists, |value| {
                self.union_find.find(value)
            })
    }
}

impl<'a> CtxHash<(Type, InstructionData)> for GVNContext<'a> {
    fn ctx_hash<H: Hasher>(&self, state: &mut H, (ty, inst): &(Type, InstructionData)) {
        std::hash::Hash::hash(&ty, state);
        inst.hash(state, self.value_lists, |value| self.union_find.find(value));
    }
}

/// Statistics collected during egraph-based processing.
#[derive(Clone, Debug, Default)]
pub(crate) struct Stats {
    pub(crate) pure_inst: u64,
    pub(crate) pure_inst_deduped: u64,
    pub(crate) skeleton_inst: u64,
    pub(crate) alias_analysis_removed: u64,
    pub(crate) new_inst: u64,
    pub(crate) union: u64,
    pub(crate) subsume: u64,
    pub(crate) remat: u64,
    pub(crate) rewrite_rule_invoked: u64,
    pub(crate) rewrite_depth_limit: u64,
    pub(crate) elaborate_visit_node: u64,
    pub(crate) elaborate_memoize_hit: u64,
    pub(crate) elaborate_memoize_miss: u64,
    pub(crate) elaborate_memoize_miss_remat: u64,
    pub(crate) elaborate_licm_hoist: u64,
    pub(crate) elaborate_func: u64,
    pub(crate) elaborate_func_pre_insts: u64,
    pub(crate) elaborate_func_post_insts: u64,
}

cranelift/codegen/src/egraph/cost.rs (new file, 97 lines)
@@ -0,0 +1,97 @@
//! Cost functions for egraph representation.

use crate::ir::Opcode;

/// A cost of computing some value in the program.
///
/// Costs are measured in an arbitrary unit that we represent in a
/// `u32`. The ordering is meant to be meaningful, but the value of a
/// single unit is arbitrary (and "not to scale"). We use a collection
/// of heuristics to try to make this approximation at least usable.
///
/// We start by defining costs for each opcode (see `pure_op_cost`
/// below). The cost of computing some value, initially, is the cost
/// of its opcode, plus the cost of computing its inputs.
///
/// We then adjust the cost according to loop nests: for each
/// loop-nest level, we multiply by 1024. Because we only have 32
/// bits, we limit this scaling to a loop-level of two (i.e., multiply
/// by 2^20 ~= 1M).
///
/// Arithmetic on costs is always saturating: we don't want to wrap
/// around and return to a tiny cost when adding the costs of two very
/// expensive operations. It is better to approximate and lose some
/// precision than to lose the ordering by wrapping.
///
/// Finally, we reserve the highest value, `u32::MAX`, as a sentinel
/// that means "infinite". This is separate from the finite costs and
/// not reachable by doing arithmetic on them (even when overflowing)
/// -- we saturate just *below* infinity. (This is done by the
/// `finite()` method.) An infinite cost is used to represent a value
/// that cannot be computed, or otherwise serve as a sentinel when
/// performing search for the lowest-cost representation of a value.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub(crate) struct Cost(u32);
impl Cost {
    pub(crate) fn at_level(&self, loop_level: usize) -> Cost {
        let loop_level = std::cmp::min(2, loop_level);
        let multiplier = 1u32 << ((10 * loop_level) as u32);
        Cost(self.0.saturating_mul(multiplier)).finite()
    }

    pub(crate) fn infinity() -> Cost {
        // 2^32 - 1 is, uh, pretty close to infinite... (we use `Cost`
        // only for heuristics and always saturate so this suffices!)
        Cost(u32::MAX)
    }

    pub(crate) fn zero() -> Cost {
        Cost(0)
    }

    /// Clamp this cost at a "finite" value. Can be used in
    /// conjunction with saturating ops to avoid saturating into
    /// `infinity()`.
    fn finite(self) -> Cost {
        Cost(std::cmp::min(u32::MAX - 1, self.0))
    }
}

impl std::default::Default for Cost {
    fn default() -> Cost {
        Cost::zero()
    }
}

impl std::ops::Add<Cost> for Cost {
    type Output = Cost;
    fn add(self, other: Cost) -> Cost {
        Cost(self.0.saturating_add(other.0)).finite()
    }
}

/// Return the cost of a *pure* opcode. Caller is responsible for
/// checking that the opcode came from an instruction that satisfies
/// `inst_predicates::is_pure_for_egraph()`.
pub(crate) fn pure_op_cost(op: Opcode) -> Cost {
    match op {
        // Constants.
        Opcode::Iconst | Opcode::F32const | Opcode::F64const => Cost(0),
        // Extends/reduces.
        Opcode::Uextend | Opcode::Sextend | Opcode::Ireduce | Opcode::Iconcat | Opcode::Isplit => {
            Cost(1)
        }
        // "Simple" arithmetic.
        Opcode::Iadd
        | Opcode::Isub
        | Opcode::Band
        | Opcode::BandNot
        | Opcode::Bor
        | Opcode::BorNot
        | Opcode::Bxor
        | Opcode::BxorNot
        | Opcode::Bnot => Cost(2),
        // Everything else (pure.)
        _ => Cost(3),
    }
}

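To see how this cost model behaves concretely, here is a standalone mirror of the `Cost` arithmetic (re-declared locally so it runs outside Cranelift; the constants follow the code above), showing loop-level scaling and saturation below the infinity sentinel:

```rust
// Standalone mirror of the `Cost` arithmetic above, demonstrating
// loop-level scaling (x1024 per level, clamped at level 2) and
// saturating addition that stays below the `u32::MAX` sentinel.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Cost(u32);

impl Cost {
    fn at_level(&self, loop_level: usize) -> Cost {
        let loop_level = std::cmp::min(2, loop_level);
        let multiplier = 1u32 << ((10 * loop_level) as u32);
        Cost(self.0.saturating_mul(multiplier)).finite()
    }
    fn finite(self) -> Cost {
        Cost(std::cmp::min(u32::MAX - 1, self.0))
    }
    fn add(self, other: Cost) -> Cost {
        Cost(self.0.saturating_add(other.0)).finite()
    }
}

fn main() {
    // An add (cost 2) inside one loop level costs 2 * 1024.
    assert_eq!(Cost(2).at_level(1), Cost(2048));
    // Loop levels deeper than 2 are clamped: 2 * 2^20.
    assert_eq!(Cost(2).at_level(5), Cost(2 << 20));
    // Addition saturates just *below* the `u32::MAX` "infinity".
    assert_eq!(Cost(u32::MAX - 1).add(Cost(100)), Cost(u32::MAX - 1));
    println!("ok");
}
```

The clamp in `finite()` is what keeps saturated finite costs distinguishable from the `infinity()` sentinel.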
File diff suppressed because it is too large
@@ -1,366 +0,0 @@
//! Node definition for EGraph representation.

use super::PackedMemoryState;
use crate::ir::{Block, DataFlowGraph, InstructionImms, Opcode, RelSourceLoc, Type};
use crate::loop_analysis::LoopLevel;
use cranelift_egraph::{CtxEq, CtxHash, Id, Language, UnionFind};
use cranelift_entity::{EntityList, ListPool};
use std::hash::{Hash, Hasher};

#[derive(Debug)]
pub enum Node {
    /// A blockparam. Effectively an input/root; does not refer to
    /// predecessors' branch arguments, because this would create
    /// cycles.
    Param {
        /// CLIF block this param comes from.
        block: Block,
        /// Index of blockparam within block.
        index: u32,
        /// Type of the value.
        ty: Type,
        /// The loop level of this Param.
        loop_level: LoopLevel,
    },
    /// A CLIF instruction that is pure (has no side-effects). Not
    /// tied to any location; we will compute a set of locations at
    /// which to compute this node during lowering back out of the
    /// egraph.
    Pure {
        /// The instruction data, without SSA values.
        op: InstructionImms,
        /// eclass arguments to the operator.
        args: EntityList<Id>,
        /// Type of result, if one.
        ty: Type,
        /// Number of results.
        arity: u16,
    },
    /// A CLIF instruction that has side-effects or is otherwise not
    /// representable by `Pure`.
    Inst {
        /// The instruction data, without SSA values.
        op: InstructionImms,
        /// eclass arguments to the operator.
        args: EntityList<Id>,
        /// Type of result, if one.
        ty: Type,
        /// Number of results.
        arity: u16,
        /// The source location to preserve.
        srcloc: RelSourceLoc,
        /// The loop level of this Inst.
        loop_level: LoopLevel,
    },
    /// A projection of one result of an `Inst` or `Pure`.
    Result {
        /// `Inst` or `Pure` node.
        value: Id,
        /// Index of the result we want.
        result: usize,
        /// Type of the value.
        ty: Type,
    },

    /// A load instruction. Nominally a side-effecting `Inst` (and
    /// included in the list of side-effecting roots so it will always
    /// be elaborated), but represented as a distinct kind of node so
    /// that we can leverage deduplication to do
    /// redundant-load-elimination for free (and make store-to-load
    /// forwarding much easier).
    Load {
        // -- identity depends on:
        /// The original load operation. Must have one argument, the
        /// address.
        op: InstructionImms,
        /// The type of the load result.
        ty: Type,
        /// Address argument. Actual address has an offset, which is
        /// included in `op` (and thus already considered as part of
        /// the key).
        addr: Id,
        /// The abstract memory state that this load accesses.
        mem_state: PackedMemoryState,

        // -- not included in dedup key:
        /// Source location, for traps. Not included in Eq/Hash.
        srcloc: RelSourceLoc,
    },
}

impl Node {
    pub(crate) fn is_non_pure(&self) -> bool {
        match self {
            Node::Inst { .. } | Node::Load { .. } => true,
            _ => false,
        }
    }
}

/// Shared pools for type and id lists in nodes.
pub struct NodeCtx {
    /// Arena for arg eclass-ID lists.
    pub args: ListPool<Id>,
}

impl NodeCtx {
    pub(crate) fn with_capacity_for_dfg(dfg: &DataFlowGraph) -> Self {
        let n_args = dfg.value_lists.capacity();
        Self {
            args: ListPool::with_capacity(n_args),
        }
    }
}

impl NodeCtx {
    fn ids_eq(&self, a: &EntityList<Id>, b: &EntityList<Id>, uf: &mut UnionFind) -> bool {
        let a = a.as_slice(&self.args);
        let b = b.as_slice(&self.args);
        a.len() == b.len() && a.iter().zip(b.iter()).all(|(&a, &b)| uf.equiv_id_mut(a, b))
    }

    fn hash_ids<H: Hasher>(&self, a: &EntityList<Id>, hash: &mut H, uf: &mut UnionFind) {
        let a = a.as_slice(&self.args);
        for &id in a {
            uf.hash_id_mut(hash, id);
        }
    }
}

impl CtxEq<Node, Node> for NodeCtx {
    fn ctx_eq(&self, a: &Node, b: &Node, uf: &mut UnionFind) -> bool {
        match (a, b) {
            (
                &Node::Param {
                    block,
                    index,
                    ty,
                    loop_level: _,
                },
                &Node::Param {
                    block: other_block,
                    index: other_index,
                    ty: other_ty,
                    loop_level: _,
                },
            ) => block == other_block && index == other_index && ty == other_ty,
            (
                &Node::Result { value, result, ty },
                &Node::Result {
                    value: other_value,
                    result: other_result,
                    ty: other_ty,
                },
            ) => uf.equiv_id_mut(value, other_value) && result == other_result && ty == other_ty,
            (
                &Node::Pure {
                    ref op,
                    ref args,
                    ty,
                    arity: _,
                },
                &Node::Pure {
                    op: ref other_op,
                    args: ref other_args,
                    ty: other_ty,
                    arity: _,
                },
            ) => *op == *other_op && self.ids_eq(args, other_args, uf) && ty == other_ty,
            (
                &Node::Inst { ref args, .. },
                &Node::Inst {
                    args: ref other_args,
                    ..
                },
            ) => self.ids_eq(args, other_args, uf),
            (
                &Node::Load {
                    ref op,
                    ty,
                    addr,
                    mem_state,
                    ..
                },
                &Node::Load {
                    op: ref other_op,
                    ty: other_ty,
                    addr: other_addr,
                    mem_state: other_mem_state,
                    // Explicitly exclude: `inst` and `srcloc`. We
                    // want loads to merge if identical in
                    // opcode/offset, address expression, and last
                    // store (this does implicit
                    // redundant-load-elimination.)
                    //
                    // Note however that we *do* include `ty` (the
                    // type) and match on that: we otherwise would
                    // have no way of disambiguating loads of
                    // different widths to the same address.
                    ..
                },
            ) => {
                op == other_op
                    && ty == other_ty
                    && uf.equiv_id_mut(addr, other_addr)
                    && mem_state == other_mem_state
            }
            _ => false,
        }
    }
}

impl CtxHash<Node> for NodeCtx {
    fn ctx_hash(&self, value: &Node, uf: &mut UnionFind) -> u64 {
        let mut state = crate::fx::FxHasher::default();
        std::mem::discriminant(value).hash(&mut state);
        match value {
            &Node::Param {
                block,
                index,
                ty: _,
                loop_level: _,
            } => {
                block.hash(&mut state);
                index.hash(&mut state);
            }
            &Node::Result {
                value,
                result,
                ty: _,
            } => {
                uf.hash_id_mut(&mut state, value);
                result.hash(&mut state);
            }
            &Node::Pure {
                ref op,
                ref args,
                ty,
                arity: _,
            } => {
                op.hash(&mut state);
                self.hash_ids(args, &mut state, uf);
                ty.hash(&mut state);
            }
            &Node::Inst { ref args, .. } => {
                self.hash_ids(args, &mut state, uf);
            }
            &Node::Load {
                ref op,
                ty,
                addr,
                mem_state,
                ..
            } => {
                op.hash(&mut state);
                ty.hash(&mut state);
                uf.hash_id_mut(&mut state, addr);
                mem_state.hash(&mut state);
            }
        }

        state.finish()
    }
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub(crate) struct Cost(u32);
impl Cost {
    pub(crate) fn at_level(&self, loop_level: LoopLevel) -> Cost {
        let loop_level = std::cmp::min(2, loop_level.level());
        let multiplier = 1u32 << ((10 * loop_level) as u32);
        Cost(self.0.saturating_mul(multiplier)).finite()
    }

    pub(crate) fn infinity() -> Cost {
        // 2^32 - 1 is, uh, pretty close to infinite... (we use `Cost`
        // only for heuristics and always saturate so this suffices!)
        Cost(u32::MAX)
    }

    pub(crate) fn zero() -> Cost {
        Cost(0)
    }

    /// Clamp this cost at a "finite" value. Can be used in
    /// conjunction with saturating ops to avoid saturating into
    /// `infinity()`.
    fn finite(self) -> Cost {
        Cost(std::cmp::min(u32::MAX - 1, self.0))
    }
}

impl std::default::Default for Cost {
    fn default() -> Cost {
        Cost::zero()
    }
}

impl std::ops::Add<Cost> for Cost {
    type Output = Cost;
    fn add(self, other: Cost) -> Cost {
        Cost(self.0.saturating_add(other.0)).finite()
    }
}

pub(crate) fn op_cost(op: &InstructionImms) -> Cost {
    match op.opcode() {
        // Constants.
        Opcode::Iconst | Opcode::F32const | Opcode::F64const => Cost(0),
        // Extends/reduces.
        Opcode::Uextend | Opcode::Sextend | Opcode::Ireduce | Opcode::Iconcat | Opcode::Isplit => {
            Cost(1)
        }
        // "Simple" arithmetic.
        Opcode::Iadd
        | Opcode::Isub
        | Opcode::Band
        | Opcode::BandNot
        | Opcode::Bor
        | Opcode::BorNot
        | Opcode::Bxor
        | Opcode::BxorNot
        | Opcode::Bnot => Cost(2),
        // Everything else.
        _ => Cost(3),
    }
}

impl Language for NodeCtx {
    type Node = Node;

    fn children<'a>(&'a self, node: &'a Node) -> &'a [Id] {
        match node {
            Node::Param { .. } => &[],
            Node::Pure { args, .. } | Node::Inst { args, .. } => args.as_slice(&self.args),
            Node::Load { addr, .. } => std::slice::from_ref(addr),
            Node::Result { value, .. } => std::slice::from_ref(value),
        }
    }

    fn children_mut<'a>(&'a mut self, node: &'a mut Node) -> &'a mut [Id] {
        match node {
            Node::Param { .. } => &mut [],
            Node::Pure { args, .. } | Node::Inst { args, .. } => args.as_mut_slice(&mut self.args),
            Node::Load { addr, .. } => std::slice::from_mut(addr),
            Node::Result { value, .. } => std::slice::from_mut(value),
        }
    }

    fn needs_dedup(&self, node: &Node) -> bool {
        match node {
            Node::Pure { .. } | Node::Load { .. } => true,
            _ => false,
        }
    }
}

#[cfg(test)]
mod test {
    #[test]
    #[cfg(target_pointer_width = "64")]
    fn node_size() {
        use super::*;
        assert_eq!(std::mem::size_of::<InstructionImms>(), 16);
        assert_eq!(std::mem::size_of::<Node>(), 32);
    }
}
@@ -1,293 +0,0 @@
//! Last-store tracking via alias analysis.
//!
//! We partition memory state into several *disjoint pieces* of
//! "abstract state". There are a finite number of such pieces:
//! currently, we call them "heap", "table", "vmctx", and "other". Any
//! given address in memory belongs to exactly one disjoint piece.
//!
//! We never track which piece a concrete address belongs to at
//! runtime; this is a purely static concept. Instead, all
//! memory-accessing instructions (loads and stores) are labeled with
//! one of these four categories in the `MemFlags`. It is forbidden
//! for a load or store to access memory under one category and a
//! later load or store to access the same memory under a different
//! category. This is ensured to be true by construction during
//! frontend translation into CLIF and during legalization.
//!
//! Given that this non-aliasing property is ensured by the producer
//! of CLIF, we can compute a *may-alias* property: one load or store
//! may-alias another load or store if both access the same category
//! of abstract state.
//!
//! The "last store" pass helps to compute this aliasing: we perform a
//! fixpoint analysis to track the last instruction that *might have*
//! written to a given part of abstract state. We also track the block
//! containing this store.
//!
//! We can't say for sure that the "last store" *did* actually write
//! that state, but we know for sure that no instruction *later* than
//! it (up to the current instruction) did. However, we can get a
//! must-alias property from this: if at a given load or store, we
//! look backward to the "last store", *AND* we find that it has
//! exactly the same address expression and value type, then we know
//! that the current instruction's access *must* be to the same memory
//! location.
//!
//! To get this must-alias property, we leverage the node
//! hashconsing. We design the Eq/Hash (node identity relation
//! definition) of the `Node` struct so that all loads with (i) the
//! same "last store", and (ii) the same address expression, and (iii)
//! the same opcode-and-offset, will deduplicate (the first will be
//! computed, and the later ones will use the same value). Furthermore
//! we have an optimization that rewrites a load into the stored value
//! of the last store *if* the last store has the same address
//! expression and constant offset.
//!
//! This gives us two optimizations, "redundant load elimination" and
//! "store-to-load forwarding".
//!
//! In theory we could also do *dead-store elimination*, where if a
//! store overwrites a value earlier written by another store, *and*
//! if no other load/store to the abstract state category occurred,
//! *and* no other trapping instruction occurred (at which point we
//! need an up-to-date memory state because post-trap-termination
//! memory state can be observed), *and* we can prove the original
//! store could not have trapped, then we can eliminate the original
//! store. Because this is so complex, and the conditions for doing it
//! correctly when post-trap state must be correct likely reduce the
//! potential benefit, we don't yet do this.

use crate::flowgraph::ControlFlowGraph;
|
||||
use crate::fx::{FxHashMap, FxHashSet};
|
||||
use crate::inst_predicates::has_memory_fence_semantics;
|
||||
use crate::ir::{Block, Function, Inst, InstructionData, MemFlags, Opcode};
|
||||
use crate::trace;
|
||||
use cranelift_entity::{EntityRef, SecondaryMap};
|
||||
use smallvec::{smallvec, SmallVec};
|
||||
|
||||
/// For a given program point, the vector of last-store instruction
|
||||
/// indices for each disjoint category of abstract state.
|
||||
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
|
||||
struct LastStores {
|
||||
heap: MemoryState,
|
||||
table: MemoryState,
|
||||
vmctx: MemoryState,
|
||||
other: MemoryState,
|
||||
}
|
||||
|
||||
/// State of memory seen by a load.
|
||||
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash, Default)]
|
||||
pub enum MemoryState {
|
||||
/// State at function entry: nothing is known (but it is one
|
||||
/// consistent value, so two loads from "entry" state at the same
|
||||
/// address will still provide the same result).
|
||||
#[default]
|
||||
Entry,
|
||||
/// State just after a store by the given instruction. The
|
||||
/// instruction is a store from which we can forward.
|
||||
Store(Inst),
|
||||
/// State just before the given instruction. Used for abstract
|
||||
/// value merges at merge-points when we cannot name a single
|
||||
/// producing site.
|
||||
BeforeInst(Inst),
|
||||
/// State just after the given instruction. Used when the
|
||||
/// instruction may update the associated state, but is not a
|
||||
/// store whose value we can cleanly forward. (E.g., perhaps a
|
||||
/// barrier of some sort.)
|
||||
AfterInst(Inst),
|
||||
}
|
||||
|
||||
/// Memory state index, packed into a u32.
|
||||
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
|
||||
pub struct PackedMemoryState(u32);
|
||||
|
||||
impl From<MemoryState> for PackedMemoryState {
|
||||
fn from(state: MemoryState) -> Self {
|
||||
match state {
|
||||
MemoryState::Entry => Self(0),
|
||||
MemoryState::Store(i) => Self(1 | (i.index() as u32) << 2),
|
||||
MemoryState::BeforeInst(i) => Self(2 | (i.index() as u32) << 2),
|
||||
MemoryState::AfterInst(i) => Self(3 | (i.index() as u32) << 2),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl PackedMemoryState {
|
||||
/// Does this memory state refer to a specific store instruction?
|
||||
pub fn as_store(&self) -> Option<Inst> {
|
||||
if self.0 & 3 == 1 {
|
||||
Some(Inst::from_bits(self.0 >> 2))
|
||||
} else {
|
||||
None
|
||||
}
|
||||
}
|
||||
}
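The `PackedMemoryState` encoding above keeps a 2-bit variant tag in the low bits and the instruction index in the remaining 30 bits; only the `Store` tag can be recovered by `as_store`. A standalone toy sketch of that scheme (the `State` enum and free functions here are illustrative stand-ins, not part of the diff):

```rust
// Toy model of the 2-bit-tag packing: the low two bits select the
// variant, and the instruction index occupies the upper 30 bits.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    Entry,
    Store(u32),
    BeforeInst(u32),
    AfterInst(u32),
}

fn pack(s: State) -> u32 {
    match s {
        State::Entry => 0,
        State::Store(i) => 1 | (i << 2),
        State::BeforeInst(i) => 2 | (i << 2),
        State::AfterInst(i) => 3 | (i << 2),
    }
}

// Only the `Store` tag (low bits == 1) yields a forwardable store.
fn as_store(packed: u32) -> Option<u32> {
    if packed & 3 == 1 {
        Some(packed >> 2)
    } else {
        None
    }
}

fn main() {
    assert_eq!(as_store(pack(State::Store(42))), Some(42));
    assert_eq!(as_store(pack(State::AfterInst(42))), None);
    assert_eq!(as_store(pack(State::Entry)), None);
    println!("ok");
}
```

This keeps the per-load memory state in a single `u32`, which matters because one is stored per tracked load.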

impl LastStores {
    fn update(&mut self, func: &Function, inst: Inst) {
        let opcode = func.dfg[inst].opcode();
        if has_memory_fence_semantics(opcode) {
            self.heap = MemoryState::AfterInst(inst);
            self.table = MemoryState::AfterInst(inst);
            self.vmctx = MemoryState::AfterInst(inst);
            self.other = MemoryState::AfterInst(inst);
        } else if opcode.can_store() {
            if let Some(memflags) = func.dfg[inst].memflags() {
                *self.for_flags(memflags) = MemoryState::Store(inst);
            } else {
                self.heap = MemoryState::AfterInst(inst);
                self.table = MemoryState::AfterInst(inst);
                self.vmctx = MemoryState::AfterInst(inst);
                self.other = MemoryState::AfterInst(inst);
            }
        }
    }

    fn for_flags(&mut self, memflags: MemFlags) -> &mut MemoryState {
        if memflags.heap() {
            &mut self.heap
        } else if memflags.table() {
            &mut self.table
        } else if memflags.vmctx() {
            &mut self.vmctx
        } else {
            &mut self.other
        }
    }

    fn meet_from(&mut self, other: &LastStores, loc: Inst) {
        let meet = |a: MemoryState, b: MemoryState| -> MemoryState {
            match (a, b) {
                (a, b) if a == b => a,
                _ => MemoryState::BeforeInst(loc),
            }
        };

        self.heap = meet(self.heap, other.heap);
        self.table = meet(self.table, other.table);
        self.vmctx = meet(self.vmctx, other.vmctx);
        self.other = meet(self.other, other.other);
    }
}
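The `meet_from` lattice is deliberately shallow: two equal incoming states meet to themselves, and any disagreement collapses to `BeforeInst(loc)` at the merge point, so no forwarding happens across a conflicting join. A toy sketch of that behavior (types here are illustrative, not the diff's):

```rust
// Toy meet: agreement is preserved; any conflict becomes an opaque
// "state just before the merge point" that cannot be forwarded from.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    Entry,
    Store(u32),
    BeforeInst(u32),
}

fn meet(a: State, b: State, loc: u32) -> State {
    if a == b {
        a
    } else {
        State::BeforeInst(loc)
    }
}

fn main() {
    // Both predecessors saw the same store: keep it.
    assert_eq!(meet(State::Store(7), State::Store(7), 99), State::Store(7));
    // Disagreement: collapse to the merge point.
    assert_eq!(meet(State::Store(7), State::Entry, 99), State::BeforeInst(99));
    println!("ok");
}
```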

/// An alias-analysis pass.
pub struct AliasAnalysis {
    /// Last-store instruction (or none) for a given load. Use a hash map
    /// instead of a `SecondaryMap` because this is sparse.
    load_mem_state: FxHashMap<Inst, PackedMemoryState>,
}

impl AliasAnalysis {
    /// Perform an alias analysis pass.
    pub fn new(func: &Function, cfg: &ControlFlowGraph) -> AliasAnalysis {
        log::trace!("alias analysis: input is:\n{:?}", func);
        let block_input = Self::compute_block_input_states(func, cfg);
        let load_mem_state = Self::compute_load_last_stores(func, block_input);
        AliasAnalysis { load_mem_state }
    }

    fn compute_block_input_states(
        func: &Function,
        cfg: &ControlFlowGraph,
    ) -> SecondaryMap<Block, Option<LastStores>> {
        let mut block_input = SecondaryMap::with_capacity(func.dfg.num_blocks());
        let mut worklist: SmallVec<[Block; 16]> = smallvec![];
        let mut worklist_set = FxHashSet::default();
        let entry = func.layout.entry_block().unwrap();
        worklist.push(entry);
        worklist_set.insert(entry);
        block_input[entry] = Some(LastStores::default());

        while let Some(block) = worklist.pop() {
            worklist_set.remove(&block);
            let state = block_input[block].clone().unwrap();

            trace!("alias analysis: input to {} is {:?}", block, state);

            let state = func
                .layout
                .block_insts(block)
                .fold(state, |mut state, inst| {
                    state.update(func, inst);
                    trace!("after {}: state is {:?}", inst, state);
                    state
                });

            for succ in cfg.succ_iter(block) {
                let succ_first_inst = func.layout.first_inst(succ).unwrap();
                let succ_state = &mut block_input[succ];
                let old = succ_state.clone();
                if let Some(succ_state) = succ_state.as_mut() {
                    succ_state.meet_from(&state, succ_first_inst);
                } else {
                    *succ_state = Some(state);
                };
                let updated = *succ_state != old;

                if updated && worklist_set.insert(succ) {
                    worklist.push(succ);
                }
            }
        }

        block_input
    }

    fn compute_load_last_stores(
        func: &Function,
        block_input: SecondaryMap<Block, Option<LastStores>>,
    ) -> FxHashMap<Inst, PackedMemoryState> {
        let mut load_mem_state = FxHashMap::default();
        load_mem_state.reserve(func.dfg.num_insts() / 8);

        for block in func.layout.blocks() {
            let mut state = block_input[block].clone().unwrap();

            for inst in func.layout.block_insts(block) {
                trace!(
                    "alias analysis: scanning at {} with state {:?} ({:?})",
                    inst,
                    state,
                    func.dfg[inst],
                );

                // N.B.: we match `Load` specifically, and not any
                // other kinds of loads (or any opcode such that
                // `opcode.can_load()` returns true), because some
                // "can load" instructions actually have very
                // different semantics (are not just a load of a
                // particularly-typed value). For example, atomic
                // (load/store, RMW, CAS) instructions "can load" but
                // definitely should not participate in store-to-load
                // forwarding or redundant-load elimination. Our goal
                // here is to provide a `MemoryState` just for plain
                // old loads whose semantics we can completely reason
                // about.
                if let InstructionData::Load {
                    opcode: Opcode::Load,
                    flags,
                    ..
                } = func.dfg[inst]
                {
                    let mem_state = *state.for_flags(flags);
                    trace!(
                        "alias analysis: at {}: load with mem_state {:?}",
                        inst,
                        mem_state,
                    );

                    load_mem_state.insert(inst, mem_state.into());
                }

                state.update(func, inst);
            }
        }

        load_mem_state
    }

    /// Get the state seen by a load, if any.
    pub fn get_state_for_load(&self, inst: Inst) -> Option<PackedMemoryState> {
        self.load_mem_state.get(&inst).copied()
    }
}
@@ -45,6 +45,35 @@ pub fn has_side_effect(func: &Function, inst: Inst) -> bool {
    trivially_has_side_effects(opcode) || is_load_with_defined_trapping(opcode, data)
}

/// Does the given instruction behave as a "pure" node with respect to
/// aegraph semantics?
///
/// - Actual pure nodes (arithmetic, etc)
/// - Loads with the `readonly` flag set
pub fn is_pure_for_egraph(func: &Function, inst: Inst) -> bool {
    let is_readonly_load = match func.dfg[inst] {
        InstructionData::Load {
            opcode: Opcode::Load,
            flags,
            ..
        } => flags.readonly() && flags.notrap(),
        _ => false,
    };
    // Multi-value results do not play nicely with much of the egraph
    // infrastructure. They are in practice used only for multi-return
    // calls and some other odd instructions (e.g. iadd_cout) which,
    // for now, we can afford to leave in place as opaque
    // side-effecting ops. So if more than one result, then the inst
    // is "not pure". Similarly, ops with zero results can be used
    // only for their side-effects, so are never pure. (Or if they
    // are, we can always trivially eliminate them with no effect.)
    let has_one_result = func.dfg.inst_results(inst).len() == 1;

    let op = func.dfg[inst].opcode();

    has_one_result && (is_readonly_load || (!op.can_load() && !trivially_has_side_effects(op)))
}

/// Does the given instruction have any side-effect as per [has_side_effect], or else is a load,
/// but not the get_pinned_reg opcode?
pub fn has_lowering_side_effect(func: &Function, inst: Inst) -> bool {

@@ -125,23 +125,6 @@ impl DataFlowGraph {
        self.immediates.clear();
    }

    /// Clear all instructions, but keep blocks and other metadata
    /// (signatures, constants, immediates). Everything to do with
    /// `Value`s is cleared, including block params and debug info.
    ///
    /// Used during egraph-based optimization to clear out the pre-opt
    /// body so that we can regenerate it from the egraph.
    pub(crate) fn clear_insts(&mut self) {
        self.insts.clear();
        self.results.clear();
        self.value_lists.clear();
        self.values.clear();
        self.values_labels = None;
        for block in self.blocks.values_mut() {
            block.params = ValueList::new();
        }
    }

    /// Get the total number of instructions created in this function, whether they are currently
    /// inserted in the layout or not.
    ///
@@ -173,6 +156,11 @@ impl DataFlowGraph {
        self.values.len()
    }

    /// Get an iterator over all values and their definitions.
    pub fn values_and_defs(&self) -> impl Iterator<Item = (Value, ValueDef)> + '_ {
        self.values().map(|value| (value, self.value_def(value)))
    }

    /// Starts collection of debug information.
    pub fn collect_debug_info(&mut self) {
        if self.values_labels.is_none() {
@@ -279,12 +267,6 @@ impl DataFlowGraph {
        self.values[v].ty()
    }

    /// Fill in the type of a value, only if currently invalid (as a placeholder).
    pub(crate) fn fill_in_value_type(&mut self, v: Value, ty: Type) {
        debug_assert!(self.values[v].ty().is_invalid() || self.values[v].ty() == ty);
        self.values[v].set_type(ty);
    }

    /// Get the definition of a value.
    ///
    /// This is either the instruction that defined it or the Block that has the value as an
@@ -298,6 +280,7 @@ impl DataFlowGraph {
                // detect alias loops without overrunning the stack.
                self.value_def(self.resolve_aliases(original))
            }
            ValueData::Union { x, y, .. } => ValueDef::Union(x, y),
        }
    }

@@ -313,6 +296,7 @@ impl DataFlowGraph {
            Inst { inst, num, .. } => Some(&v) == self.inst_results(inst).get(num as usize),
            Param { block, num, .. } => Some(&v) == self.block_params(block).get(num as usize),
            Alias { .. } => false,
            Union { .. } => false,
        }
    }

@@ -422,6 +406,8 @@ pub enum ValueDef {
    Result(Inst, usize),
    /// Value is the n'th parameter to a block.
    Param(Block, usize),
    /// Value is a union of two other values.
    Union(Value, Value),
}

impl ValueDef {
@@ -458,6 +444,7 @@ impl ValueDef {
    pub fn num(self) -> usize {
        match self {
            Self::Result(_, n) | Self::Param(_, n) => n,
            Self::Union(_, _) => 0,
        }
    }
}
@@ -476,6 +463,11 @@ enum ValueData {
    /// An alias value can't be linked as an instruction result or block parameter. It is used as a
    /// placeholder when the original instruction or block has been rewritten or modified.
    Alias { ty: Type, original: Value },

    /// Union is a "fork" in representation: the value can be
    /// represented as either of the values named here. This is used
    /// for aegraph (acyclic egraph) representation in the DFG.
    Union { ty: Type, x: Value, y: Value },
}

/// Bit-packed version of ValueData, for efficiency.
@@ -483,40 +475,71 @@ enum ValueData {
/// Layout:
///
/// ```plain
/// | tag:2 | type:14 | num:16 | index:32 |
/// | tag:2 | type:14 | x:24 | y:24 |
///
/// Inst 00 ty inst output inst index
/// Param 01 ty blockparam num block index
/// Alias 10 ty 0 value index
/// Union 11 ty first value second value
/// ```
#[derive(Clone, Copy, Debug, PartialEq, Hash)]
#[cfg_attr(feature = "enable-serde", derive(Serialize, Deserialize))]
struct ValueDataPacked(u64);

/// Encodes a value in 0..2^32 into 0..2^n, where n is less than 32
/// (and is implied by `bits`), by translating 2^32-1 (0xffffffff)
/// into 2^n-1 and panic'ing on 2^n..2^32-1.
fn encode_narrow_field(x: u32, bits: u8) -> u32 {
    if x == 0xffff_ffff {
        (1 << bits) - 1
    } else {
        debug_assert!(x < (1 << bits));
        x
    }
}

/// The inverse of the above `encode_narrow_field`: unpacks 2^n-1 into
/// 2^32-1.
fn decode_narrow_field(x: u32, bits: u8) -> u32 {
    if x == (1 << bits) - 1 {
        0xffff_ffff
    } else {
        x
    }
}
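The narrow-field helpers exist so that the 32-bit reserved-value sentinel `0xffff_ffff` survives a round trip through a 24-bit field: in-range indices pass through unchanged, and the sentinel maps to the n-bit all-ones pattern and back. A self-contained copy of the two helpers with a round-trip check:

```rust
// Copies of the diff's narrow-field codec: values below 2^bits pass
// through; the 32-bit sentinel 0xffff_ffff maps to the bits-wide
// all-ones pattern (and back on decode).
fn encode_narrow_field(x: u32, bits: u8) -> u32 {
    if x == 0xffff_ffff {
        (1 << bits) - 1
    } else {
        debug_assert!(x < (1 << bits));
        x
    }
}

fn decode_narrow_field(x: u32, bits: u8) -> u32 {
    if x == (1 << bits) - 1 {
        0xffff_ffff
    } else {
        x
    }
}

fn main() {
    // The sentinel round-trips through a 24-bit field...
    assert_eq!(
        decode_narrow_field(encode_narrow_field(0xffff_ffff, 24), 24),
        0xffff_ffff
    );
    // ...and so does an ordinary in-range index.
    assert_eq!(decode_narrow_field(encode_narrow_field(12345, 24), 24), 12345);
    println!("ok");
}
```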

impl ValueDataPacked {
    const INDEX_SHIFT: u64 = 0;
    const INDEX_BITS: u64 = 32;
    const NUM_SHIFT: u64 = Self::INDEX_SHIFT + Self::INDEX_BITS;
    const NUM_BITS: u64 = 16;
    const TYPE_SHIFT: u64 = Self::NUM_SHIFT + Self::NUM_BITS;
    const TYPE_BITS: u64 = 14;
    const TAG_SHIFT: u64 = Self::TYPE_SHIFT + Self::TYPE_BITS;
    const TAG_BITS: u64 = 2;
    const Y_SHIFT: u8 = 0;
    const Y_BITS: u8 = 24;
    const X_SHIFT: u8 = Self::Y_SHIFT + Self::Y_BITS;
    const X_BITS: u8 = 24;
    const TYPE_SHIFT: u8 = Self::X_SHIFT + Self::X_BITS;
    const TYPE_BITS: u8 = 14;
    const TAG_SHIFT: u8 = Self::TYPE_SHIFT + Self::TYPE_BITS;
    const TAG_BITS: u8 = 2;

    const TAG_INST: u64 = 1;
    const TAG_PARAM: u64 = 2;
    const TAG_ALIAS: u64 = 3;
    const TAG_INST: u64 = 0;
    const TAG_PARAM: u64 = 1;
    const TAG_ALIAS: u64 = 2;
    const TAG_UNION: u64 = 3;

    fn make(tag: u64, ty: Type, num: u16, index: u32) -> ValueDataPacked {
    fn make(tag: u64, ty: Type, x: u32, y: u32) -> ValueDataPacked {
        debug_assert!(tag < (1 << Self::TAG_BITS));
        debug_assert!(ty.repr() < (1 << Self::TYPE_BITS));

        let x = encode_narrow_field(x, Self::X_BITS);
        let y = encode_narrow_field(y, Self::Y_BITS);

        ValueDataPacked(
            (tag << Self::TAG_SHIFT)
                | ((ty.repr() as u64) << Self::TYPE_SHIFT)
                | ((num as u64) << Self::NUM_SHIFT)
                | ((index as u64) << Self::INDEX_SHIFT),
                | ((x as u64) << Self::X_SHIFT)
                | ((y as u64) << Self::Y_SHIFT),
        )
    }

    #[inline(always)]
    fn field(self, shift: u64, bits: u64) -> u64 {
    fn field(self, shift: u8, bits: u8) -> u64 {
        (self.0 >> shift) & ((1 << bits) - 1)
    }
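The new layout replaces the old `num:16 | index:32` fields with two 24-bit fields `x` and `y` so that a `Union` can name two values in one packed word. A standalone sketch of the pack/extract arithmetic under the new `| tag:2 | type:14 | x:24 | y:24 |` layout (the free functions and sample field values are illustrative):

```rust
// Mirror of the new ValueDataPacked bit layout, as free functions.
const Y_SHIFT: u8 = 0;
const Y_BITS: u8 = 24;
const X_SHIFT: u8 = Y_SHIFT + Y_BITS; // 24
const X_BITS: u8 = 24;
const TYPE_SHIFT: u8 = X_SHIFT + X_BITS; // 48
const TYPE_BITS: u8 = 14;
const TAG_SHIFT: u8 = TYPE_SHIFT + TYPE_BITS; // 62

fn make(tag: u64, ty: u64, x: u64, y: u64) -> u64 {
    (tag << TAG_SHIFT) | (ty << TYPE_SHIFT) | (x << X_SHIFT) | (y << Y_SHIFT)
}

fn field(packed: u64, shift: u8, bits: u8) -> u64 {
    (packed >> shift) & ((1u64 << bits) - 1)
}

fn main() {
    // Tag 3 = Union; x and y are the two 24-bit value indices.
    let p = make(3, 0x7, 0xabcdef, 0x123456);
    assert_eq!(field(p, TAG_SHIFT, 2), 3);
    assert_eq!(field(p, TYPE_SHIFT, TYPE_BITS), 0x7);
    assert_eq!(field(p, X_SHIFT, X_BITS), 0xabcdef);
    assert_eq!(field(p, Y_SHIFT, Y_BITS), 0x123456);
    println!("ok");
}
```

Note the cost of this change: instruction, block, and value indices must now fit in 24 bits (modulo the reserved-value sentinel handled by `encode_narrow_field`), which is why the `From<ValueDataPacked>` path below decodes each field through `decode_narrow_field`.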
|
||||
|
||||
@@ -537,14 +560,17 @@ impl From<ValueData> for ValueDataPacked {
|
||||
fn from(data: ValueData) -> Self {
|
||||
match data {
|
||||
ValueData::Inst { ty, num, inst } => {
|
||||
Self::make(Self::TAG_INST, ty, num, inst.as_bits())
|
||||
Self::make(Self::TAG_INST, ty, num.into(), inst.as_bits())
|
||||
}
|
||||
ValueData::Param { ty, num, block } => {
|
||||
Self::make(Self::TAG_PARAM, ty, num, block.as_bits())
|
||||
Self::make(Self::TAG_PARAM, ty, num.into(), block.as_bits())
|
||||
}
|
||||
ValueData::Alias { ty, original } => {
|
||||
Self::make(Self::TAG_ALIAS, ty, 0, original.as_bits())
|
||||
}
|
||||
ValueData::Union { ty, x, y } => {
|
||||
Self::make(Self::TAG_ALIAS, ty, x.as_bits(), y.as_bits())
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -552,25 +578,33 @@ impl From<ValueData> for ValueDataPacked {
|
||||
impl From<ValueDataPacked> for ValueData {
|
||||
fn from(data: ValueDataPacked) -> Self {
|
||||
let tag = data.field(ValueDataPacked::TAG_SHIFT, ValueDataPacked::TAG_BITS);
|
||||
let ty = data.field(ValueDataPacked::TYPE_SHIFT, ValueDataPacked::TYPE_BITS) as u16;
|
||||
let num = data.field(ValueDataPacked::NUM_SHIFT, ValueDataPacked::NUM_BITS) as u16;
|
||||
let index = data.field(ValueDataPacked::INDEX_SHIFT, ValueDataPacked::INDEX_BITS) as u32;
|
||||
let ty = u16::try_from(data.field(ValueDataPacked::TYPE_SHIFT, ValueDataPacked::TYPE_BITS))
|
||||
.expect("Mask should ensure result fits in a u16");
|
||||
let x = u32::try_from(data.field(ValueDataPacked::X_SHIFT, ValueDataPacked::X_BITS))
|
||||
.expect("Mask should ensure result fits in a u32");
|
||||
let y = u32::try_from(data.field(ValueDataPacked::Y_SHIFT, ValueDataPacked::Y_BITS))
|
||||
.expect("Mask should ensure result fits in a u32");
|
||||
|
||||
let ty = Type::from_repr(ty);
|
||||
match tag {
|
||||
ValueDataPacked::TAG_INST => ValueData::Inst {
|
||||
ty,
|
||||
num,
|
||||
inst: Inst::from_bits(index),
|
||||
num: u16::try_from(x).expect("Inst result num should fit in u16"),
|
||||
inst: Inst::from_bits(decode_narrow_field(y, ValueDataPacked::Y_BITS)),
|
||||
},
|
||||
ValueDataPacked::TAG_PARAM => ValueData::Param {
|
||||
ty,
|
||||
num,
|
||||
block: Block::from_bits(index),
|
||||
num: u16::try_from(x).expect("Blockparam index should fit in u16"),
|
||||
block: Block::from_bits(decode_narrow_field(y, ValueDataPacked::Y_BITS)),
|
||||
},
|
||||
ValueDataPacked::TAG_ALIAS => ValueData::Alias {
|
||||
ty,
|
||||
original: Value::from_bits(index),
|
||||
original: Value::from_bits(decode_narrow_field(y, ValueDataPacked::Y_BITS)),
|
||||
},
|
||||
ValueDataPacked::TAG_UNION => ValueData::Union {
|
||||
ty,
|
||||
x: Value::from_bits(decode_narrow_field(x, ValueDataPacked::X_BITS)),
|
||||
y: Value::from_bits(decode_narrow_field(y, ValueDataPacked::Y_BITS)),
|
||||
},
|
||||
_ => panic!("Invalid tag {} in ValueDataPacked 0x{:x}", tag, data.0),
|
||||
}
|
||||
@@ -582,8 +616,11 @@ impl From<ValueDataPacked> for ValueData {
|
||||
impl DataFlowGraph {
|
||||
/// Create a new instruction.
|
||||
///
|
||||
/// The type of the first result is indicated by `data.ty`. If the instruction produces
|
||||
/// multiple results, also call `make_inst_results` to allocate value table entries.
|
||||
/// The type of the first result is indicated by `data.ty`. If the
|
||||
/// instruction produces multiple results, also call
|
||||
/// `make_inst_results` to allocate value table entries. (It is
|
||||
/// always safe to call `make_inst_results`, regardless of how
|
||||
/// many results the instruction has.)
|
||||
pub fn make_inst(&mut self, data: InstructionData) -> Inst {
|
||||
let n = self.num_insts() + 1;
|
||||
self.results.resize(n);
|
||||
@@ -608,6 +645,7 @@ impl DataFlowGraph {
|
||||
match self.value_def(value) {
|
||||
ir::ValueDef::Result(inst, _) => self.display_inst(inst),
|
||||
ir::ValueDef::Param(_, _) => panic!("value is not defined by an instruction"),
|
||||
ir::ValueDef::Union(_, _) => panic!("value is a union of two other values"),
|
||||
}
|
||||
}
|
||||
|
||||
@@ -823,6 +861,19 @@ impl DataFlowGraph {
|
||||
self.insts[inst].put_value_list(branch_values)
|
||||
}
|
||||
|
||||
/// Clone an instruction, attaching new result `Value`s and
|
||||
/// returning them.
|
||||
pub fn clone_inst(&mut self, inst: Inst) -> Inst {
|
||||
// First, add a clone of the InstructionData.
|
||||
let inst_data = self[inst].clone();
|
||||
let new_inst = self.make_inst(inst_data);
|
||||
// Get the controlling type variable.
|
||||
let ctrl_typevar = self.ctrl_typevar(inst);
|
||||
// Create new result values.
|
||||
self.make_inst_results(new_inst, ctrl_typevar);
|
||||
new_inst
|
||||
}
|
||||
|
||||
/// Get the first result of an instruction.
|
||||
///
|
||||
/// This function panics if the instruction doesn't have any result.
|
||||
@@ -847,6 +898,14 @@ impl DataFlowGraph {
|
||||
self.results[inst]
|
||||
}
|
||||
|
||||
/// Create a union of two values.
|
||||
pub fn union(&mut self, x: Value, y: Value) -> Value {
|
||||
// Get the type.
|
||||
let ty = self.value_type(x);
|
||||
debug_assert_eq!(ty, self.value_type(y));
|
||||
self.make_value(ValueData::Union { ty, x, y })
|
||||
}
|
||||
|
||||
/// Get the call signature of a direct or indirect call instruction.
|
||||
/// Returns `None` if `inst` is not a call instruction.
|
||||
pub fn call_signature(&self, inst: Inst) -> Option<SigRef> {
|
||||
|
||||
@@ -61,18 +61,6 @@ impl Layout {
|
||||
self.last_block = None;
|
||||
}
|
||||
|
||||
/// Clear instructions from every block, but keep the blocks.
|
||||
///
|
||||
/// Used by the egraph-based optimization to clear out the
|
||||
/// function body but keep the CFG skeleton.
|
||||
pub(crate) fn clear_insts(&mut self) {
|
||||
self.insts.clear();
|
||||
for block in self.blocks.values_mut() {
|
||||
block.first_inst = None.into();
|
||||
block.last_inst = None.into();
|
||||
}
|
||||
}
|
||||
|
||||
/// Returns the capacity of the `BlockData` map.
|
||||
pub fn block_capacity(&self) -> usize {
|
||||
self.blocks.capacity()
|
||||
|
||||
@@ -48,7 +48,7 @@ pub use crate::ir::function::{DisplayFunctionAnnotations, Function};
|
||||
pub use crate::ir::globalvalue::GlobalValueData;
|
||||
pub use crate::ir::heap::{HeapData, HeapStyle};
|
||||
pub use crate::ir::instructions::{
|
||||
InstructionData, InstructionImms, Opcode, ValueList, ValueListPool, VariableArgs,
|
||||
InstructionData, Opcode, ValueList, ValueListPool, VariableArgs,
|
||||
};
|
||||
pub use crate::ir::jumptable::JumpTableData;
|
||||
pub use crate::ir::known_symbol::KnownSymbol;
|
||||
|
||||
@@ -37,6 +37,7 @@ impl From<ValueDef> for ProgramPoint {
|
||||
match def {
|
||||
ValueDef::Result(inst, _) => inst.into(),
|
||||
ValueDef::Param(block, _) => block.into(),
|
||||
ValueDef::Union(_, _) => panic!("Union does not have a single program point"),
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -78,6 +79,7 @@ impl From<ValueDef> for ExpandedProgramPoint {
|
||||
match def {
|
||||
ValueDef::Result(inst, _) => inst.into(),
|
||||
ValueDef::Param(block, _) => block.into(),
|
||||
ValueDef::Union(_, _) => panic!("Union does not have a single program point"),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -585,5 +585,27 @@ macro_rules! isle_common_prelude_methods {
|
||||
| IntCC::SignedLessThan => Some(*cc),
|
||||
}
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn unpack_value_array_2(&mut self, arr: &ValueArray2) -> (Value, Value) {
|
||||
let [a, b] = *arr;
|
||||
(a, b)
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn pack_value_array_2(&mut self, a: Value, b: Value) -> ValueArray2 {
|
||||
[a, b]
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn unpack_value_array_3(&mut self, arr: &ValueArray3) -> (Value, Value, Value) {
|
||||
let [a, b, c] = *arr;
|
||||
(a, b, c)
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn pack_value_array_3(&mut self, a: Value, b: Value, c: Value) -> ValueArray3 {
|
||||
[a, b, c]
|
||||
}
|
||||
};
|
||||
}
|
||||
|
||||
@@ -95,6 +95,7 @@ mod alias_analysis;
|
||||
mod bitset;
|
||||
mod constant_hash;
|
||||
mod context;
|
||||
mod ctxhash;
|
||||
mod dce;
|
||||
mod divconst_magic_numbers;
|
||||
mod egraph;
|
||||
@@ -111,6 +112,7 @@ mod result;
|
||||
mod scoped_hash_map;
|
||||
mod simple_gvn;
|
||||
mod simple_preopt;
|
||||
mod unionfind;
|
||||
mod unreachable_code;
|
||||
mod value_label;
|
||||
|
||||
|
||||
@@ -37,7 +37,7 @@ struct LoopData {
|
||||
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
|
||||
pub struct LoopLevel(u8);
|
||||
impl LoopLevel {
|
||||
const INVALID: u8 = 0xff;
|
||||
const INVALID: u8 = u8::MAX;
|
||||
|
||||
/// Get the root level (no loop).
|
||||
pub fn root() -> Self {
|
||||
|
||||
@@ -56,25 +56,8 @@ macro_rules! isle_lower_prelude_methods {
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn unpack_value_array_2(&mut self, arr: &ValueArray2) -> (Value, Value) {
|
||||
let [a, b] = *arr;
|
||||
(a, b)
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn pack_value_array_2(&mut self, a: Value, b: Value) -> ValueArray2 {
|
||||
[a, b]
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn unpack_value_array_3(&mut self, arr: &ValueArray3) -> (Value, Value, Value) {
|
||||
let [a, b, c] = *arr;
|
||||
(a, b, c)
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn pack_value_array_3(&mut self, a: Value, b: Value, c: Value) -> ValueArray3 {
|
||||
[a, b, c]
|
||||
fn value_type(&mut self, val: Value) -> Type {
|
||||
self.lower_ctx.dfg().value_type(val)
|
||||
}
|
||||
|
||||
#[inline]
|
||||
@@ -230,11 +213,6 @@ macro_rules! isle_lower_prelude_methods {
|
||||
self.lower_ctx.dfg()[inst]
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn value_type(&mut self, val: Value) -> Type {
|
||||
self.lower_ctx.dfg().value_type(val)
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn def_inst(&mut self, val: Value) -> Option<Inst> {
|
||||
self.lower_ctx.dfg().value_def(val).inst()
|
||||
|
||||
@@ -1,308 +1,131 @@
|
||||
//! Optimization driver using ISLE rewrite rules on an egraph.
|
||||
|
||||
use crate::egraph::Analysis;
|
||||
use crate::egraph::FuncEGraph;
|
||||
pub use crate::egraph::{Node, NodeCtx};
|
||||
use crate::egraph::{NewOrExistingInst, OptimizeCtx};
|
||||
use crate::ir::condcodes;
|
||||
pub use crate::ir::condcodes::{FloatCC, IntCC};
|
||||
use crate::ir::dfg::ValueDef;
|
||||
pub use crate::ir::immediates::{Ieee32, Ieee64, Imm64, Offset32, Uimm32, Uimm64, Uimm8};
|
||||
pub use crate::ir::types::*;
|
||||
pub use crate::ir::{
|
||||
dynamic_to_fixed, AtomicRmwOp, Block, Constant, DynamicStackSlot, FuncRef, GlobalValue, Heap,
|
||||
HeapImm, Immediate, InstructionImms, JumpTable, MemFlags, Opcode, StackSlot, Table, TrapCode,
|
||||
Type, Value,
|
||||
dynamic_to_fixed, AtomicRmwOp, Block, Constant, DataFlowGraph, DynamicStackSlot, FuncRef,
|
||||
GlobalValue, Heap, HeapImm, Immediate, InstructionData, JumpTable, MemFlags, Opcode, StackSlot,
|
||||
Table, TrapCode, Type, Value,
|
||||
};
|
||||
use crate::isle_common_prelude_methods;
|
||||
use crate::machinst::isle::*;
|
||||
use crate::trace;
|
||||
pub use cranelift_egraph::{Id, NewOrExisting, NodeIter};
|
||||
use cranelift_entity::{EntityList, EntityRef};
|
||||
use smallvec::SmallVec;
|
||||
use cranelift_entity::packed_option::ReservedValue;
|
||||
use smallvec::{smallvec, SmallVec};
|
||||
use std::marker::PhantomData;
|
||||
|
||||
pub type IdArray = EntityList<Id>;
|
||||
#[allow(dead_code)]
|
||||
pub type Unit = ();
|
||||
pub type Range = (usize, usize);
|
||||
pub type ValueArray2 = [Value; 2];
|
||||
pub type ValueArray3 = [Value; 3];
|
||||
|
||||
pub type ConstructorVec<T> = SmallVec<[T; 8]>;
|
||||
|
||||
mod generated_code;
|
||||
pub(crate) mod generated_code;
|
||||
use generated_code::ContextIter;
|
||||
|
||||
struct IsleContext<'a, 'b> {
|
||||
egraph: &'a mut FuncEGraph<'b>,
|
||||
pub(crate) struct IsleContext<'a, 'b, 'c> {
|
||||
pub(crate) ctx: &'a mut OptimizeCtx<'b, 'c>,
|
||||
}
|
||||
|
||||
const REWRITE_LIMIT: usize = 5;
pub fn optimize_eclass<'a>(id: Id, egraph: &mut FuncEGraph<'a>) -> Id {
    trace!("running rules on eclass {}", id.index());
    egraph.stats.rewrite_rule_invoked += 1;

    if egraph.rewrite_depth > REWRITE_LIMIT {
        egraph.stats.rewrite_depth_limit += 1;
        return id;
    }
    egraph.rewrite_depth += 1;

    // Find all possible rewrites and union them in, returning the
    // union.
    let mut ctx = IsleContext { egraph };
    let optimized_ids = generated_code::constructor_simplify(&mut ctx, id);
    let mut union_id = id;
    if let Some(mut ids) = optimized_ids {
        while let Some(new_id) = ids.next(&mut ctx) {
            if ctx.egraph.subsume_ids.contains(&new_id) {
                trace!(" -> eclass {} subsumes {}", new_id, id);
                ctx.egraph.stats.node_subsume += 1;
                // Merge in the unionfind so canonicalization still
                // works, but take *only* the subsuming ID, and break
                // now.
                ctx.egraph.egraph.unionfind.union(union_id, new_id);
                union_id = new_id;
                break;
            }
            ctx.egraph.stats.node_union += 1;
            let old_union_id = union_id;
            union_id = ctx
                .egraph
                .egraph
                .union(&ctx.egraph.node_ctx, union_id, new_id);
            trace!(
                " -> union eclass {} with {} to get {}",
                new_id,
                old_union_id,
                union_id
            );
        }
    }
    trace!(" -> optimize {} got {}", id, union_id);
    ctx.egraph.rewrite_depth -= 1;
    union_id
}

pub(crate) fn store_to_load<'a>(id: Id, egraph: &mut FuncEGraph<'a>) -> Id {
    // Note that we only examine the latest enode in the eclass: opts
    // are invoked for every new enode added to an eclass, so
    // traversing the whole eclass would be redundant.
    let load_key = egraph.egraph.classes[id].get_node().unwrap();
    if let Node::Load {
        op:
            InstructionImms::Load {
                opcode: Opcode::Load,
                offset: load_offset,
                ..
            },
        ty: load_ty,
        addr: load_addr,
        mem_state,
        ..
    } = load_key.node(&egraph.egraph.nodes)
    {
        if let Some(store_inst) = mem_state.as_store() {
            trace!(" -> got load op for id {}", id);
            if let Some((store_ty, store_id)) = egraph.store_nodes.get(&store_inst) {
                trace!(" -> got store id: {} ty: {}", store_id, store_ty);
                let store_key = egraph.egraph.classes[*store_id].get_node().unwrap();
                if let Node::Inst {
                    op:
                        InstructionImms::Store {
                            opcode: Opcode::Store,
                            offset: store_offset,
                            ..
                        },
                    args: store_args,
                    ..
                } = store_key.node(&egraph.egraph.nodes)
                {
                    let store_args = store_args.as_slice(&egraph.node_ctx.args);
                    let store_data = store_args[0];
                    let store_addr = store_args[1];
                    if *load_offset == *store_offset
                        && *load_ty == *store_ty
                        && egraph.egraph.unionfind.equiv_id_mut(*load_addr, store_addr)
                    {
                        trace!(" -> same offset, type, address; forwarding");
                        egraph.stats.store_to_load_forward += 1;
                        return store_data;
                    }
                }
            }
        }
    }

    id
}
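The forwarding condition above requires the load and store to agree on offset, type, and (union-find-canonicalized) address. A simplified, hypothetical model of just that check — the `Access` type and its fields are illustrative stand-ins, not Cranelift APIs:

```rust
// Hypothetical, simplified model of store-to-load forwarding: a load may
// reuse the most recent stored value only when access type, offset, and
// canonicalized address all match exactly.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Access {
    addr: u32,    // canonicalized address value
    offset: i32,  // immediate offset
    ty_bytes: u8, // access size/type
}

fn forward(store: Access, store_data: u64, load: Access) -> Option<u64> {
    if store.addr == load.addr && store.offset == load.offset && store.ty_bytes == load.ty_bytes {
        Some(store_data) // forward the stored value; the load is elided
    } else {
        None // fall back to performing the load
    }
}

fn main() {
    let st = Access { addr: 1, offset: 8, ty_bytes: 8 };
    assert_eq!(forward(st, 42, st), Some(42));
    assert_eq!(forward(st, 42, Access { offset: 16, ..st }), None);
}
```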

-struct NodesEtorIter<'a, 'b>
-where
-    'b: 'a,
-{
-    root: Id,
-    iter: NodeIter<NodeCtx, Analysis>,
+pub(crate) struct InstDataEtorIter<'a, 'b, 'c> {
+    stack: SmallVec<[Value; 8]>,
+    _phantom1: PhantomData<&'a ()>,
+    _phantom2: PhantomData<&'b ()>,
+    _phantom3: PhantomData<&'c ()>,
 }

impl<'a, 'b> generated_code::ContextIter for NodesEtorIter<'a, 'b>
where
    'b: 'a,
{
    type Context = IsleContext<'a, 'b>;
    type Output = (Type, InstructionImms, IdArray);

    fn next(&mut self, ctx: &mut IsleContext<'a, 'b>) -> Option<Self::Output> {
        while let Some(node) = self.iter.next(&ctx.egraph.egraph) {
            trace!("iter from root {}: node {:?}", self.root, node);
            match node {
                Node::Pure { op, args, ty, arity }
                | Node::Inst {
                    op, args, ty, arity, ..
                } if *arity == 1 => {
                    return Some((*ty, op.clone(), args.clone()));
                }
                _ => {}
            }
        }
        None
    }
}

impl<'a, 'b, 'c> InstDataEtorIter<'a, 'b, 'c> {
    fn new(root: Value) -> Self {
        debug_assert_ne!(root, Value::reserved_value());
        Self {
            stack: smallvec![root],
            _phantom1: PhantomData,
            _phantom2: PhantomData,
            _phantom3: PhantomData,
        }
    }
}

impl<'a, 'b> generated_code::Context for IsleContext<'a, 'b> {
    isle_common_prelude_methods!();

    fn eclass_type(&mut self, eclass: Id) -> Option<Type> {
        let mut iter = self.egraph.egraph.enodes(eclass);
        while let Some(node) = iter.next(&self.egraph.egraph) {
            match node {
                &Node::Pure { ty, arity, .. } | &Node::Inst { ty, arity, .. } if arity == 1 => {
                    return Some(ty);
                }
                &Node::Load { ty, .. } => return Some(ty),
                &Node::Result { ty, .. } => return Some(ty),
                &Node::Param { ty, .. } => return Some(ty),
                _ => {}
            }
        }
        None
    }

    fn at_loop_level(&mut self, eclass: Id) -> (u8, Id) {
        (
            self.egraph.egraph.analysis_value(eclass).loop_level.level() as u8,
            eclass,
        )
    }

    type enodes_etor_iter = NodesEtorIter<'a, 'b>;

    fn enodes_etor(&mut self, eclass: Id) -> Option<NodesEtorIter<'a, 'b>> {
        Some(NodesEtorIter {
            root: eclass,
            iter: self.egraph.egraph.enodes(eclass),
        })
    }

    fn pure_enode_ctor(&mut self, ty: Type, op: &InstructionImms, args: IdArray) -> Id {
        let op = op.clone();
        match self.egraph.egraph.add(
            Node::Pure {
                op,
                args,
                ty,
                arity: 1,
            },
            &mut self.egraph.node_ctx,
        ) {
            NewOrExisting::New(id) => {
                self.egraph.stats.node_created += 1;
                self.egraph.stats.node_pure += 1;
                self.egraph.stats.node_ctor_created += 1;
                optimize_eclass(id, self.egraph)
            }
            NewOrExisting::Existing(id) => {
                self.egraph.stats.node_ctor_deduped += 1;
                id
            }
        }
    }

    fn id_array_0_etor(&mut self, arg0: IdArray) -> Option<()> {
        let values = arg0.as_slice(&self.egraph.node_ctx.args);
        if values.len() == 0 {
            Some(())
        } else {
            None
        }
    }

    fn id_array_0_ctor(&mut self) -> IdArray {
        EntityList::default()
    }

    fn id_array_1_etor(&mut self, arg0: IdArray) -> Option<Id> {
        let values = arg0.as_slice(&self.egraph.node_ctx.args);
        if values.len() == 1 {
            Some(values[0])
        } else {
            None
        }
    }

    fn id_array_1_ctor(&mut self, arg0: Id) -> IdArray {
        EntityList::from_iter([arg0].into_iter(), &mut self.egraph.node_ctx.args)
    }

    fn id_array_2_etor(&mut self, arg0: IdArray) -> Option<(Id, Id)> {
        let values = arg0.as_slice(&self.egraph.node_ctx.args);
        if values.len() == 2 {
            Some((values[0], values[1]))
        } else {
            None
        }
    }

    fn id_array_2_ctor(&mut self, arg0: Id, arg1: Id) -> IdArray {
        EntityList::from_iter([arg0, arg1].into_iter(), &mut self.egraph.node_ctx.args)
    }

    fn id_array_3_etor(&mut self, arg0: IdArray) -> Option<(Id, Id, Id)> {
        let values = arg0.as_slice(&self.egraph.node_ctx.args);
        if values.len() == 3 {
            Some((values[0], values[1], values[2]))
        } else {
            None
        }
    }

    fn id_array_3_ctor(&mut self, arg0: Id, arg1: Id, arg2: Id) -> IdArray {
        EntityList::from_iter(
            [arg0, arg1, arg2].into_iter(),
            &mut self.egraph.node_ctx.args,
        )
    }

    fn remat(&mut self, id: Id) -> Id {
        trace!("remat: {}", id);
        self.egraph.remat_ids.insert(id);
        id
    }

    fn subsume(&mut self, id: Id) -> Id {
        trace!("subsume: {}", id);
        self.egraph.subsume_ids.insert(id);
        id
    }
}

impl<'a, 'b, 'c> ContextIter for InstDataEtorIter<'a, 'b, 'c>
where
    'b: 'a,
    'c: 'b,
{
    type Context = IsleContext<'a, 'b, 'c>;
    type Output = (Type, InstructionData);

    fn next(&mut self, ctx: &mut IsleContext<'a, 'b, 'c>) -> Option<Self::Output> {
        while let Some(value) = self.stack.pop() {
            debug_assert_ne!(value, Value::reserved_value());
            let value = ctx.ctx.func.dfg.resolve_aliases(value);
            trace!("iter: value {:?}", value);
            match ctx.ctx.func.dfg.value_def(value) {
                ValueDef::Union(x, y) => {
                    debug_assert_ne!(x, Value::reserved_value());
                    debug_assert_ne!(y, Value::reserved_value());
                    trace!(" -> {}, {}", x, y);
                    self.stack.push(x);
                    self.stack.push(y);
                    continue;
                }
                ValueDef::Result(inst, _) if ctx.ctx.func.dfg.inst_results(inst).len() == 1 => {
                    let ty = ctx.ctx.func.dfg.value_type(value);
                    trace!(" -> value of type {}", ty);
                    return Some((ty, ctx.ctx.func.dfg[inst].clone()));
                }
                _ => {}
            }
        }
        None
    }
}
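The iterator above walks a value's union tree with an explicit stack, yielding each concrete instruction-defined representation of the eclass. A self-contained model of that walk — simplified types standing in for the CLIF `DataFlowGraph`, with `Leaf` playing the role of `ValueDef::Result`:

```rust
// Simplified model of the eclass traversal: a value is either a leaf
// (an instruction result) or a union of two other values, and we walk
// the union tree with an explicit stack, yielding the leaves.
#[derive(Clone, Copy)]
enum ValueDef {
    Leaf(u32),           // payload standing in for an instruction result
    Union(usize, usize), // indices of two equivalent values in `defs`
}

fn leaves(defs: &[ValueDef], root: usize) -> Vec<u32> {
    let mut stack = vec![root];
    let mut out = Vec::new();
    while let Some(v) = stack.pop() {
        match defs[v] {
            ValueDef::Union(x, y) => {
                stack.push(x);
                stack.push(y);
            }
            ValueDef::Leaf(data) => out.push(data),
        }
    }
    out
}

fn main() {
    // union(leaf 10, union(leaf 20, leaf 30))
    let defs = vec![
        ValueDef::Leaf(10),
        ValueDef::Leaf(20),
        ValueDef::Leaf(30),
        ValueDef::Union(1, 2),
        ValueDef::Union(0, 3),
    ];
    let mut got = leaves(&defs, 4);
    got.sort();
    assert_eq!(got, vec![10, 20, 30]);
}
```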

impl<'a, 'b, 'c> generated_code::Context for IsleContext<'a, 'b, 'c> {
    isle_common_prelude_methods!();

    type inst_data_etor_iter = InstDataEtorIter<'a, 'b, 'c>;

    fn inst_data_etor(&mut self, eclass: Value) -> Option<InstDataEtorIter<'a, 'b, 'c>> {
        Some(InstDataEtorIter::new(eclass))
    }

    fn make_inst_ctor(&mut self, ty: Type, op: &InstructionData) -> Value {
        let value = self
            .ctx
            .insert_pure_enode(NewOrExistingInst::New(op.clone(), ty));
        trace!("make_inst_ctor: {:?} -> {}", op, value);
        value
    }

    fn value_array_2_ctor(&mut self, arg0: Value, arg1: Value) -> ValueArray2 {
        [arg0, arg1]
    }

    fn value_array_3_ctor(&mut self, arg0: Value, arg1: Value, arg2: Value) -> ValueArray3 {
        [arg0, arg1, arg2]
    }

    #[inline]
    fn value_type(&mut self, val: Value) -> Type {
        self.ctx.func.dfg.value_type(val)
    }

    fn remat(&mut self, value: Value) -> Value {
        trace!("remat: {}", value);
        self.ctx.remat_values.insert(value);
        self.ctx.stats.remat += 1;
        value
    }

    fn subsume(&mut self, value: Value) -> Value {
        trace!("subsume: {}", value);
        self.ctx.subsume_values.insert(value);
        self.ctx.stats.subsume += 1;
        value
    }
}

@@ -145,31 +145,15 @@
   (iadd ty x x))

 ;; x<<32>>32: uextend/sextend 32->64.
-(rule (simplify (ushr $I64 (ishl $I64 (uextend $I64 x @ (eclass_type $I32)) (iconst _ (simm32 32))) (iconst _ (simm32 32))))
+(rule (simplify (ushr $I64 (ishl $I64 (uextend $I64 x @ (value_type $I32)) (iconst _ (simm32 32))) (iconst _ (simm32 32))))
       (uextend $I64 x))

-(rule (simplify (sshr $I64 (ishl $I64 (uextend $I64 x @ (eclass_type $I32)) (iconst _ (simm32 32))) (iconst _ (simm32 32))))
+(rule (simplify (sshr $I64 (ishl $I64 (uextend $I64 x @ (value_type $I32)) (iconst _ (simm32 32))) (iconst _ (simm32 32))))
       (sextend $I64 x))

 ;; TODO: strength reduction: mul/div to shifts
 ;; TODO: div/rem by constants -> magic multiplications
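These two rules rely on the identity that a 64-bit left shift by 32 followed by a logical (resp. arithmetic) right shift by 32 is a zero- (resp. sign-) extension of the low 32 bits. A quick sanity check in plain Rust, with ordinary integer operations standing in for the CLIF instructions:

```rust
fn main() {
    let x: u64 = 0xffff_ffff_8000_0001; // upper bits arbitrary

    // (ushr (ishl x 32) 32) == uextend of the low 32 bits
    assert_eq!((x << 32) >> 32, x as u32 as u64);

    // (sshr (ishl x 32) 32) == sextend of the low 32 bits
    assert_eq!(((x << 32) as i64 >> 32) as u64, x as u32 as i32 as i64 as u64);
}
```

The ISLE patterns are narrower than this identity: they additionally require the shifted value to already be a 32-to-64-bit `uextend`, so the rewrite can simply reuse `x`.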

;; Reassociate when it benefits LICM.
(rule (simplify (iadd ty (iadd ty x y) z))
      (if-let (at_loop_level lx _) x)
      (if-let (at_loop_level ly _) y)
      (if-let (at_loop_level lz _) z)
      (if (u8_lt lx ly))
      (if (u8_lt lz ly))
      (iadd ty (iadd ty x z) y))
(rule (simplify (iadd ty (iadd ty x y) z))
      (if-let (at_loop_level lx _) x)
      (if-let (at_loop_level ly _) y)
      (if-let (at_loop_level lz _) z)
      (if (u8_lt ly lx))
      (if (u8_lt lz lx))
      (iadd ty (iadd ty y z) x))
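The reassociation is sound because integer addition (wrapping, as in CLIF `iadd`) is associative and commutative, so the two lower-loop-level operands can be grouped into an inner add that LICM can hoist. A minimal check of the regrouping identities:

```rust
fn main() {
    // Values chosen to exercise wraparound as well.
    let (x, y, z): (u64, u64, u64) = (0xdead_beef, u64::MAX - 3, 42);

    // (x + y) + z == (x + z) + y  (first rule's regrouping)
    assert_eq!(
        x.wrapping_add(y).wrapping_add(z),
        x.wrapping_add(z).wrapping_add(y)
    );

    // (x + y) + z == (y + z) + x  (second rule's regrouping)
    assert_eq!(
        x.wrapping_add(y).wrapping_add(z),
        y.wrapping_add(z).wrapping_add(x)
    );
}
```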

;; Rematerialize ALU-op-with-imm and iconsts in each block where they're
;; used. This is neutral (add-with-imm) or positive (iconst) for
;; register pressure, and these ops are very cheap.

@@ -107,7 +107,7 @@
 (rule (simplify (isub ty
                       (iadd ty x (iconst ty (u64_from_imm64 k1)))
                       (iconst ty (u64_from_imm64 k2))))
-      (isub ty x (iconst ty (imm64 (u64_sub k1 k2)))))
+      (isub ty x (iconst ty (imm64 (u64_sub k2 k1)))))
 (rule (simplify (iadd ty
                       (isub ty x (iconst ty (u64_from_imm64 k1)))
                       (iconst ty (u64_from_imm64 k2))))
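This hunk is the constant-propagation bugfix called out in the commit message: `(x + k1) - k2` must become `x - (k2 - k1)`, not `x - (k1 - k2)`. Both forms can be checked directly with wrapping arithmetic:

```rust
fn main() {
    let (x, k1, k2): (u64, u64, u64) = (1000, 7, 300);
    let lhs = x.wrapping_add(k1).wrapping_sub(k2); // (x + k1) - k2 = 707

    // Corrected rewrite: x - (k2 - k1).
    assert_eq!(lhs, x.wrapping_sub(k2.wrapping_sub(k1)));

    // The old, buggy form x - (k1 - k2) gives a different value.
    assert_ne!(lhs, x.wrapping_sub(k1.wrapping_sub(k2)));
}
```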

@@ -32,6 +32,15 @@

 ;; `cranelift-entity`-based identifiers.
 (type Type (primitive Type))
 (type Value (primitive Value))
 (type ValueList (primitive ValueList))

 ;; ISLE representation of `&[Value]`.
 (type ValueSlice (primitive ValueSlice))

 ;; Extract the type of a `Value`.
 (decl value_type (Type) Value)
 (extern extractor infallible value_type value_type)

 (decl u32_add (u32 u32) u32)
 (extern constructor u32_add u32_add)

@@ -5,15 +5,10 @@

 ;; `cranelift-entity`-based identifiers.
 (type Inst (primitive Inst))
 (type Value (primitive Value))

 ;; ISLE representation of `&[Value]`.
 (type ValueSlice (primitive ValueSlice))

 ;; ISLE representation of `Vec<u8>`.
 (type VecMask extern (enum))

 (type ValueList (primitive ValueList))
 (type ValueRegs (primitive ValueRegs))
 (type WritableValueRegs (primitive WritableValueRegs))

@@ -214,10 +209,6 @@
 (decl inst_data (InstructionData) Inst)
 (extern extractor infallible inst_data inst_data)

-;; Extract the type of a `Value`.
-(decl value_type (Type) Value)
-(extern extractor infallible value_type value_type)
-
 ;; Extract the type of the instruction's first result.
 (decl result_type (Type) Inst)
 (extractor (result_type ty)

@@ -2,60 +2,33 @@

 ;;;;; eclass and enode access ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

-;; An eclass ID.
-(type Id (primitive Id))
-
-;; What is the type of an eclass (if a single type)?
-(decl eclass_type (Type) Id)
-(extern extractor eclass_type eclass_type)
-
-;; Helper to wrap an Id-matching pattern and extract type.
-(decl has_type (Type Id) Id)
-(extractor (has_type ty id)
-           (and (eclass_type ty)
-                id))
-
 ;; Extract any node(s) for the given eclass ID.
-(decl multi enodes (Type InstructionImms IdArray) Id)
-(extern extractor enodes enodes_etor)
+(decl multi inst_data (Type InstructionData) Value)
+(extern extractor inst_data inst_data_etor)

 ;; Construct a pure node, returning a new (or deduplicated
 ;; already-existing) eclass ID.
-(decl pure_enode (Type InstructionImms IdArray) Id)
-(extern constructor pure_enode pure_enode_ctor)
+(decl make_inst (Type InstructionData) Value)
+(extern constructor make_inst make_inst_ctor)

-;; Type of an Id slice (for args).
-(type IdArray (primitive IdArray))
-
-(decl id_array_0 () IdArray)
-(extern constructor id_array_0 id_array_0_ctor)
-(extern extractor id_array_0 id_array_0_etor)
-(decl id_array_1 (Id) IdArray)
-(extern constructor id_array_1 id_array_1_ctor)
-(extern extractor id_array_1 id_array_1_etor)
-(decl id_array_2 (Id Id) IdArray)
-(extern constructor id_array_2 id_array_2_ctor)
-(extern extractor id_array_2 id_array_2_etor)
-(decl id_array_3 (Id Id Id) IdArray)
-(extern constructor id_array_3 id_array_3_ctor)
-(extern extractor id_array_3 id_array_3_etor)
-
-;; Extractor to get the min loop-level of an eclass.
-(decl at_loop_level (u8 Id) Id)
-(extern extractor infallible at_loop_level at_loop_level)
+;; Constructors for value arrays.
+(decl value_array_2_ctor (Value Value) ValueArray2)
+(extern constructor value_array_2_ctor value_array_2_ctor)
+(decl value_array_3_ctor (Value Value Value) ValueArray3)
+(extern constructor value_array_3_ctor value_array_3_ctor)

 ;;;;; optimization toplevel ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

 ;; The main matcher rule invoked by the toplevel driver.
-(decl multi simplify (Id) Id)
+(decl multi simplify (Value) Value)

 ;; Mark a node as requiring remat when used in a different block.
-(decl remat (Id) Id)
+(decl remat (Value) Value)
 (extern constructor remat remat)

 ;; Mark a node as subsuming whatever else it's rewritten from -- this
 ;; is definitely preferable, not just a possible option. Useful for,
 ;; e.g., constant propagation where we arrive at a definite "final
 ;; answer".
-(decl subsume (Id) Id)
+(decl subsume (Value) Value)
 (extern constructor subsume subsume)

@@ -39,14 +39,14 @@ struct HashKey<'a, 'f: 'a> {
 impl<'a, 'f: 'a> Hash for HashKey<'a, 'f> {
     fn hash<H: Hasher>(&self, state: &mut H) {
         let pool = &self.pos.borrow().func.dfg.value_lists;
-        self.inst.hash(state, pool);
+        self.inst.hash(state, pool, |value| value);
         self.ty.hash(state);
     }
 }
 impl<'a, 'f: 'a> PartialEq for HashKey<'a, 'f> {
     fn eq(&self, other: &Self) -> bool {
         let pool = &self.pos.borrow().func.dfg.value_lists;
-        self.inst.eq(&other.inst, pool) && self.ty == other.ty
+        self.inst.eq(&other.inst, pool, |value| value) && self.ty == other.ty
     }
 }
 impl<'a, 'f: 'a> Eq for HashKey<'a, 'f> {}

cranelift/codegen/src/unionfind.rs (new file, 74 lines)
@@ -0,0 +1,74 @@
//! Simple union-find data structure.

use crate::trace;
use cranelift_entity::{packed_option::ReservedValue, EntityRef, SecondaryMap};
use std::hash::Hash;

/// A union-find data structure. The data structure can allocate
/// `Id`s, indicating eclasses, and can merge eclasses together.
#[derive(Clone, Debug, PartialEq)]
pub struct UnionFind<Idx: EntityRef> {
    parent: SecondaryMap<Idx, Val<Idx>>,
}

#[derive(Clone, Debug, PartialEq)]
struct Val<Idx>(Idx);

impl<Idx: EntityRef + ReservedValue> Default for Val<Idx> {
    fn default() -> Self {
        Self(Idx::reserved_value())
    }
}

impl<Idx: EntityRef + Hash + std::fmt::Display + Ord + ReservedValue> UnionFind<Idx> {
    /// Create a new `UnionFind` with the given capacity.
    pub fn with_capacity(cap: usize) -> Self {
        UnionFind {
            parent: SecondaryMap::with_capacity(cap),
        }
    }

    /// Add an `Idx` to the `UnionFind`, with its own equivalence class
    /// initially. All `Idx`s must be added before being queried or
    /// unioned.
    pub fn add(&mut self, id: Idx) {
        debug_assert!(id != Idx::reserved_value());
        self.parent[id] = Val(id);
    }

    /// Find the canonical `Idx` of a given `Idx`.
    pub fn find(&self, mut node: Idx) -> Idx {
        while node != self.parent[node].0 {
            node = self.parent[node].0;
        }
        node
    }

    /// Find the canonical `Idx` of a given `Idx`, updating the data
    /// structure in the process so that future queries for this `Idx`
    /// (and others in its chain up to the root of the equivalence
    /// class) will be faster.
    pub fn find_and_update(&mut self, mut node: Idx) -> Idx {
        // "Path splitting" mutating find (Tarjan and Van Leeuwen).
        debug_assert!(node != Idx::reserved_value());
        while node != self.parent[node].0 {
            let next = self.parent[self.parent[node].0].0;
            debug_assert!(next != Idx::reserved_value());
            self.parent[node] = Val(next);
            node = next;
        }
        debug_assert!(node != Idx::reserved_value());
        node
    }

    /// Merge the equivalence classes of the two `Idx`s.
    pub fn union(&mut self, a: Idx, b: Idx) {
        let a = self.find_and_update(a);
        let b = self.find_and_update(b);
        let (a, b) = (std::cmp::min(a, b), std::cmp::max(a, b));
        if a != b {
            // Always canonicalize toward lower IDs.
            self.parent[b] = Val(a);
            trace!("union: {}, {}", a, b);
        }
    }
}
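A standalone sketch of the same structure, using plain `usize` indices instead of `cranelift-entity` maps so it runs without Cranelift:

```rust
// Simplified sketch of the union-find above: `parent[i] == i` marks a root,
// and `find` compresses chains by repointing each visited node at its
// grandparent as it walks.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    fn find(&mut self, mut node: usize) -> usize {
        while node != self.parent[node] {
            let next = self.parent[self.parent[node]];
            self.parent[node] = next;
            node = next;
        }
        node
    }

    // Canonicalize toward the lower ID, as the Cranelift version does.
    fn union(&mut self, a: usize, b: usize) {
        let (a, b) = (self.find(a), self.find(b));
        let (lo, hi) = (a.min(b), a.max(b));
        if lo != hi {
            self.parent[hi] = lo;
        }
    }
}

fn main() {
    let mut uf = UnionFind::new(5);
    uf.union(3, 4);
    uf.union(1, 3);
    assert_eq!(uf.find(4), 1); // 4 -> 3 -> 1, canonicalized to the lowest ID
    assert_eq!(uf.find(2), 2); // untouched nodes remain their own class
}
```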
@@ -1041,6 +1041,10 @@ impl<'a> Verifier<'a> {
                 ));
             }
         }
+        ValueDef::Union(_, _) => {
+            // Nothing: union nodes themselves have no location,
+            // so we cannot check any dominance properties.
+        }
     }
     Ok(())
 }

@@ -1070,6 +1074,11 @@ impl<'a> Verifier<'a> {
             self.context(loc_inst),
             format!("instruction result {} is not defined by the instruction", v),
         )),
+        ValueDef::Union(_, _) => errors.fatal((
+            loc_inst,
+            self.context(loc_inst),
+            format!("instruction result {} is a union node", v),
+        )),
     }
 }

@@ -298,6 +298,7 @@ fn type_suffix(func: &Function, inst: Inst) -> Option<Type> {
     let def_block = match func.dfg.value_def(ctrl_var) {
         ValueDef::Result(instr, _) => func.layout.inst_block(instr),
         ValueDef::Param(block, _) => Some(block),
+        ValueDef::Union(..) => None,
     };
     if def_block.is_some() && def_block == func.layout.inst_block(inst) {
         return None;