egraph-based midend: draw the rest of the owl (productionized). (#4953)

* egraph-based midend: draw the rest of the owl.

* Rename `egg` submodule of cranelift-codegen to `egraph`.

* Apply some feedback from @jsharp during code walkthrough.

* Remove recursion from find_best_node by doing a single pass.

Rather than recursively computing the lowest-cost node for a given
eclass and memoizing the answer at each eclass node, we can do a single
forward pass; because every eclass node refers only to earlier nodes,
one pass is sufficient. The result may differ slightly from the earlier
behavior because we can no longer short-circuit a node's cost to zero
once it has been elaborated, but in practice this should not matter.
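
As an illustration only (simplified standalone types, not the real eclass IDs or `Cost` values, and without the per-eclass minimum over alternative nodes), a minimal sketch of why a single forward pass suffices when every node's arguments come earlier in the array:

```rust
/// Illustrative stand-in for an eclass node: its own cost plus the
/// indices of its argument nodes, which always precede it in the array.
struct Node {
    own_cost: u32,
    args: Vec<usize>,
}

/// Single forward pass: by the time we reach node `i`, every argument's
/// best cost is already final, so no recursion or memoization is needed.
fn best_costs(nodes: &[Node]) -> Vec<u32> {
    let mut best = vec![0u32; nodes.len()];
    for (i, node) in nodes.iter().enumerate() {
        let arg_cost: u32 = node.args.iter().map(|&a| best[a]).sum();
        best[i] = node.own_cost.saturating_add(arg_cost);
    }
    best
}

fn main() {
    let nodes = vec![
        Node { own_cost: 0, args: vec![] },     // e.g. a constant
        Node { own_cost: 2, args: vec![0] },    // uses node 0
        Node { own_cost: 3, args: vec![0, 1] }, // uses nodes 0 and 1
    ];
    assert_eq!(best_costs(&nodes), vec![0, 2, 5]);
}
```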

* Make elaboration non-recursive.

Use an explicit stack instead (with `ElabStackEntry` entries,
alongside a result stack).
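
As an illustration only (a toy expression type and made-up entry names, not the real `ElabStackEntry` or egraph types), a minimal sketch of the pattern: a `Start` entry pushes a "resume" entry plus its children onto the work stack, and the resume entry later pops the children's results off a separate result stack.

```rust
/// Illustrative expression tree standing in for eclass nodes.
enum Expr {
    Leaf(i64),
    Add(Box<Expr>, Box<Expr>),
}

enum StackEntry<'a> {
    /// Next action: visit this node and enqueue its children.
    Start(&'a Expr),
    /// Children already enqueued; combine their results when reached.
    PendingAdd,
}

fn eval(root: &Expr) -> i64 {
    let mut work = vec![StackEntry::Start(root)];
    let mut results: Vec<i64> = vec![];
    while let Some(entry) = work.pop() {
        match entry {
            StackEntry::Start(Expr::Leaf(v)) => results.push(*v),
            StackEntry::Start(Expr::Add(a, b)) => {
                // Push the "resume" entry first, then the children, so the
                // children are processed (and their results pushed) before
                // we come back to combine them.
                work.push(StackEntry::PendingAdd);
                work.push(StackEntry::Start(&**b));
                work.push(StackEntry::Start(&**a));
            }
            StackEntry::PendingAdd => {
                let b = results.pop().unwrap();
                let a = results.pop().unwrap();
                results.push(a + b);
            }
        }
    }
    results.pop().unwrap()
}

fn main() {
    let e = Expr::Add(Box::new(Expr::Leaf(2)), Box::new(Expr::Leaf(3)));
    assert_eq!(eval(&e), 5);
}
```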

* Make elaboration traversal of the domtree non-recursive/stack-safe.

* Rework the analysis logic in the Cranelift-side egraph glue into a general analysis framework in cranelift-egraph.

* Apply static recursion limit to rule application.

* Fix aarch64 dynamic-vector support, broken by a rebase.

* Topo-sort cranelift-egraph before cranelift-codegen in publish script, like the comment instructs me to!

* Fix multi-result call testcase.

* Include `cranelift-egraph` in `PUBLISHED_CRATES`.

* Fix atomic_rmw: not really a load.

* Remove now-unnecessary PartialOrd/Ord derivations.

* Address some code-review comments.

* Review feedback.

* Review feedback.

* No overlap in mid-end rules, because we are defining a multi-constructor.

* rustfmt

* Review feedback.

* Review feedback.

* Review feedback.

* Review feedback.

* Remove redundant `mut`.

* Add comment noting what rules can do.

* Review feedback.

* Clarify comment wording.

* Update `has_memory_fence_semantics`.

* Apply @jameysharp's improved loop-level computation.

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Fix suggestion commit.

* Fix off-by-one in new loop-nest analysis.

* Review feedback.

* Review feedback.

* Review feedback.

* Use `Default`, not `std::default::Default`, as per @fitzgen

Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>

* Apply @fitzgen's comment elaboration to a doc-comment.

Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>

* Add stat for hitting the rewrite-depth limit.

* Some code motion in split prelude to make the diff a little clearer wrt `main`.

* Take @jameysharp's suggested `try_into()` usage for blockparam indices.

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Take @jameysharp's suggestion to avoid double-match on load op.

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Fix suggestion (add import).

* Review feedback.

* Fix stack_load handling.

* Remove redundant can_store case.

* Take @jameysharp's suggested improvement to FuncEGraph::build() logic

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Tweaks to FuncEGraph::build() on top of suggestion.

* Take @jameysharp's suggested clarified condition

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Clean up after suggestion (unused variable).

* Fix loop analysis.

* Add loop-level asserts.

* Revert constant-space loop analysis -- edge cases were incorrect, so let's go with the simple thing for now.

* Take @jameysharp's suggestion re: result_tys

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Fix up after suggestion

* Take @jameysharp's suggestion to use fold rather than reduce

Co-authored-by: Jamey Sharp <jamey@minilop.net>

* Fixup after suggestion

* Take @jameysharp's suggestion to remove elaborate_eclass_use's return value.

* Clarifying comment in terminator insts.

Co-authored-by: Jamey Sharp <jamey@minilop.net>
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com>
Author: Chris Fallin
Date: 2022-10-11 18:15:53 -07:00
Committed by: GitHub
Parent: e2f1ced0b6
Commit: 2be12a5167
59 changed files with 5125 additions and 1580 deletions


@@ -0,0 +1,69 @@
//! Extended domtree with various traversal support.
use crate::dominator_tree::DominatorTree;
use crate::ir::{Block, Function};
use cranelift_entity::{packed_option::PackedOption, SecondaryMap};
#[derive(Clone, Debug)]
pub(crate) struct DomTreeWithChildren {
nodes: SecondaryMap<Block, DomTreeNode>,
root: Block,
}
#[derive(Clone, Copy, Debug, Default)]
struct DomTreeNode {
children: PackedOption<Block>,
next: PackedOption<Block>,
}
impl DomTreeWithChildren {
pub(crate) fn new(func: &Function, domtree: &DominatorTree) -> DomTreeWithChildren {
let mut nodes: SecondaryMap<Block, DomTreeNode> =
SecondaryMap::with_capacity(func.dfg.num_blocks());
for block in func.layout.blocks() {
let idom_inst = match domtree.idom(block) {
Some(idom_inst) => idom_inst,
None => continue,
};
let idom = func
.layout
.inst_block(idom_inst)
.expect("Dominating instruction should be part of a block");
nodes[block].next = nodes[idom].children;
nodes[idom].children = block.into();
}
let root = func.layout.entry_block().unwrap();
Self { nodes, root }
}
pub(crate) fn root(&self) -> Block {
self.root
}
pub(crate) fn children<'a>(&'a self, block: Block) -> DomTreeChildIter<'a> {
let block = self.nodes[block].children;
DomTreeChildIter {
domtree: self,
block,
}
}
}
pub(crate) struct DomTreeChildIter<'a> {
domtree: &'a DomTreeWithChildren,
block: PackedOption<Block>,
}
impl<'a> Iterator for DomTreeChildIter<'a> {
type Item = Block;
fn next(&mut self) -> Option<Block> {
self.block.expand().map(|block| {
self.block = self.domtree.nodes[block].next;
block
})
}
}


@@ -0,0 +1,612 @@
//! Elaboration phase: lowers EGraph back to sequences of operations
//! in CFG nodes.
use super::domtree::DomTreeWithChildren;
use super::node::{op_cost, Cost, Node, NodeCtx};
use super::Analysis;
use super::Stats;
use crate::dominator_tree::DominatorTree;
use crate::fx::FxHashSet;
use crate::ir::{Block, Function, Inst, Opcode, RelSourceLoc, Type, Value, ValueList};
use crate::loop_analysis::LoopAnalysis;
use crate::scoped_hash_map::ScopedHashMap;
use crate::trace;
use alloc::vec::Vec;
use cranelift_egraph::{EGraph, Id, Language, NodeKey};
use cranelift_entity::{packed_option::PackedOption, SecondaryMap};
use smallvec::{smallvec, SmallVec};
use std::ops::Add;
type LoopDepth = u32;
pub(crate) struct Elaborator<'a> {
func: &'a mut Function,
domtree: &'a DominatorTree,
loop_analysis: &'a LoopAnalysis,
node_ctx: &'a NodeCtx,
egraph: &'a EGraph<NodeCtx, Analysis>,
id_to_value: ScopedHashMap<Id, IdValue>,
id_to_best_cost_and_node: SecondaryMap<Id, (Cost, Id)>,
/// Stack of blocks and loops in current elaboration path.
loop_stack: SmallVec<[LoopStackEntry; 8]>,
cur_block: Option<Block>,
first_branch: SecondaryMap<Block, PackedOption<Inst>>,
remat_ids: &'a FxHashSet<Id>,
/// Explicitly-unrolled value elaboration stack.
elab_stack: Vec<ElabStackEntry>,
elab_result_stack: Vec<IdValue>,
/// Explicitly-unrolled block elaboration stack.
block_stack: Vec<BlockStackEntry>,
stats: &'a mut Stats,
}
#[derive(Clone, Debug)]
struct LoopStackEntry {
/// The hoist point: a block that immediately dominates this
/// loop. May not be an immediate predecessor, but will be a valid
/// point to place all loop-invariant ops: they must depend only
/// on inputs that dominate the loop, so are available at (the end
/// of) this block.
hoist_block: Block,
/// The depth in the scope map.
scope_depth: u32,
}
#[derive(Clone, Debug)]
enum ElabStackEntry {
/// Next action is to resolve this id into a node and elaborate
/// args.
Start { id: Id },
/// Args have been pushed; waiting for results.
PendingNode {
canonical: Id,
node_key: NodeKey,
remat: bool,
num_args: usize,
},
/// Waiting for a result to return one projected value of a
/// multi-value result.
PendingProjection { canonical: Id, index: usize },
}
#[derive(Clone, Debug)]
enum BlockStackEntry {
Elaborate { block: Block, idom: Option<Block> },
Pop,
}
#[derive(Clone, Debug)]
enum IdValue {
/// A single value.
Value {
depth: LoopDepth,
block: Block,
value: Value,
},
/// Multiple results; indices in `node_args`.
Values {
depth: LoopDepth,
block: Block,
values: ValueList,
},
}
impl IdValue {
fn block(&self) -> Block {
match self {
IdValue::Value { block, .. } | IdValue::Values { block, .. } => *block,
}
}
}
impl<'a> Elaborator<'a> {
pub(crate) fn new(
func: &'a mut Function,
domtree: &'a DominatorTree,
loop_analysis: &'a LoopAnalysis,
egraph: &'a EGraph<NodeCtx, Analysis>,
node_ctx: &'a NodeCtx,
remat_ids: &'a FxHashSet<Id>,
stats: &'a mut Stats,
) -> Self {
let num_blocks = func.dfg.num_blocks();
let mut id_to_best_cost_and_node =
SecondaryMap::with_default((Cost::infinity(), Id::invalid()));
id_to_best_cost_and_node.resize(egraph.classes.len());
Self {
func,
domtree,
loop_analysis,
egraph,
node_ctx,
id_to_value: ScopedHashMap::with_capacity(egraph.classes.len()),
id_to_best_cost_and_node,
loop_stack: smallvec![],
cur_block: None,
first_branch: SecondaryMap::with_capacity(num_blocks),
remat_ids,
elab_stack: vec![],
elab_result_stack: vec![],
block_stack: vec![],
stats,
}
}
fn cur_loop_depth(&self) -> LoopDepth {
self.loop_stack.len() as LoopDepth
}
fn start_block(&mut self, idom: Option<Block>, block: Block, block_params: &[(Id, Type)]) {
trace!(
"start_block: block {:?} with idom {:?} at loop depth {} scope depth {}",
block,
idom,
self.cur_loop_depth(),
self.id_to_value.depth()
);
// Note that if the *entry* block is a loop header, we will
// not make note of the loop here because it will not have an
// immediate dominator. We must disallow this case because we
// will skip adding the `LoopStackEntry` here but our
// `LoopAnalysis` will otherwise still make note of this loop
// and loop depths will not match.
if let Some(idom) = idom {
if self.loop_analysis.is_loop_header(block).is_some() {
self.loop_stack.push(LoopStackEntry {
// Any code hoisted out of this loop will have code
// placed in `idom`, and will have def mappings
// inserted in to the scoped hashmap at that block's
// level.
hoist_block: idom,
scope_depth: (self.id_to_value.depth() - 1) as u32,
});
trace!(
" -> loop header, pushing; depth now {}",
self.loop_stack.len()
);
}
} else {
debug_assert!(
self.loop_analysis.is_loop_header(block).is_none(),
"Entry block (domtree root) cannot be a loop header!"
);
}
self.cur_block = Some(block);
for &(id, ty) in block_params {
let value = self.func.dfg.append_block_param(block, ty);
trace!(" -> block param id {:?} value {:?}", id, value);
self.id_to_value.insert_if_absent(
id,
IdValue::Value {
depth: self.cur_loop_depth(),
block,
value,
},
);
}
}
fn add_node(&mut self, node: &Node, args: &[Value], to_block: Block) -> ValueList {
let (instdata, result_tys) = match node {
Node::Pure { op, types, .. } | Node::Inst { op, types, .. } => (
op.with_args(args, &mut self.func.dfg.value_lists),
types.as_slice(&self.node_ctx.types),
),
Node::Load { op, ty, .. } => (
op.with_args(args, &mut self.func.dfg.value_lists),
std::slice::from_ref(ty),
),
_ => panic!("Cannot `add_node()` on block param or projection"),
};
let srcloc = match node {
Node::Inst { srcloc, .. } | Node::Load { srcloc, .. } => *srcloc,
_ => RelSourceLoc::default(),
};
let opcode = instdata.opcode();
// Is this instruction either an actual terminator (an
// instruction that must end the block), or at least in the
// group of branches at the end (including conditional
// branches that may be followed by an actual terminator)? We
// call this the "terminator group", and we record the first
// inst in this group (`first_branch` below) so that we do not
// insert instructions needed only by args of later
// instructions in the terminator group in the middle of the
// terminator group.
//
// E.g., for the original sequence
// v1 = op ...
// brnz vCond, block1
// jump block2(v1)
//
// elaboration would naively produce
//
// brnz vCond, block1
// v1 = op ...
// jump block2(v1)
//
// but we use the `first_branch` mechanism below to ensure
// that once we've emitted at least one branch, all other
// elaborated insts have to go before that. So we emit brnz
// first, then as we elaborate the jump, we find we need the
// `op`; we `insert_inst` it *before* the brnz (which is the
// `first_branch`).
let is_terminator_group_inst =
opcode.is_branch() || opcode.is_return() || opcode == Opcode::Trap;
let inst = self.func.dfg.make_inst(instdata);
self.func.srclocs[inst] = srcloc;
for &ty in result_tys {
self.func.dfg.append_result(inst, ty);
}
if is_terminator_group_inst {
self.func.layout.append_inst(inst, to_block);
if self.first_branch[to_block].is_none() {
self.first_branch[to_block] = Some(inst).into();
}
} else if let Some(branch) = self.first_branch[to_block].into() {
self.func.layout.insert_inst(inst, branch);
} else {
self.func.layout.append_inst(inst, to_block);
}
self.func.dfg.inst_results_list(inst)
}
fn compute_best_nodes(&mut self) {
let best = &mut self.id_to_best_cost_and_node;
for (eclass_id, eclass) in &self.egraph.classes {
trace!("computing best for eclass {:?}", eclass_id);
if let Some(child1) = eclass.child1() {
trace!(" -> child {:?}", child1);
best[eclass_id] = best[child1];
}
if let Some(child2) = eclass.child2() {
trace!(" -> child {:?}", child2);
if best[child2].0 < best[eclass_id].0 {
best[eclass_id] = best[child2];
}
}
if let Some(node_key) = eclass.get_node() {
let node = node_key.node(&self.egraph.nodes);
trace!(" -> eclass {:?}: node {:?}", eclass_id, node);
let (cost, id) = match node {
Node::Param { .. }
| Node::Inst { .. }
| Node::Load { .. }
| Node::Result { .. } => (Cost::zero(), eclass_id),
Node::Pure { op, .. } => {
let args_cost = self
.node_ctx
.children(node)
.iter()
.map(|&arg_id| {
trace!(" -> arg {:?}", arg_id);
best[arg_id].0
})
// Can't use `.sum()` for `Cost` types; do
// an explicit reduce instead.
.fold(Cost::zero(), Cost::add);
let level = self.egraph.analysis_value(eclass_id).loop_level;
let cost = op_cost(op).at_level(level) + args_cost;
(cost, eclass_id)
}
};
if cost < best[eclass_id].0 {
best[eclass_id] = (cost, id);
}
}
debug_assert_ne!(best[eclass_id].0, Cost::infinity());
debug_assert_ne!(best[eclass_id].1, Id::invalid());
trace!("best for eclass {:?}: {:?}", eclass_id, best[eclass_id]);
}
}
fn elaborate_eclass_use(&mut self, id: Id) {
self.elab_stack.push(ElabStackEntry::Start { id });
self.process_elab_stack();
debug_assert_eq!(self.elab_result_stack.len(), 1);
self.elab_result_stack.clear();
}
fn process_elab_stack(&mut self) {
while let Some(entry) = self.elab_stack.last() {
match entry {
&ElabStackEntry::Start { id } => {
// We always replace the Start entry, so pop it now.
self.elab_stack.pop();
self.stats.elaborate_visit_node += 1;
let canonical = self.egraph.canonical_id(id);
trace!("elaborate: id {}", id);
let remat = if let Some(val) = self.id_to_value.get(&canonical) {
// Look at the defined block, and determine whether this
// node kind allows rematerialization if the value comes
// from another block. If so, ignore the hit and recompute
// below.
let remat = val.block() != self.cur_block.unwrap()
&& self.remat_ids.contains(&canonical);
if !remat {
trace!("elaborate: id {} -> {:?}", id, val);
self.stats.elaborate_memoize_hit += 1;
self.elab_result_stack.push(val.clone());
continue;
}
trace!("elaborate: id {} -> remat", id);
self.stats.elaborate_memoize_miss_remat += 1;
// The op is pure at this point, so it is always valid to
// remove from this map.
self.id_to_value.remove(&canonical);
true
} else {
self.remat_ids.contains(&canonical)
};
self.stats.elaborate_memoize_miss += 1;
// Get the best option; we use `id` (latest id) here so we
// have a full view of the eclass.
let (_, best_node_eclass) = self.id_to_best_cost_and_node[id];
debug_assert_ne!(best_node_eclass, Id::invalid());
trace!(
"elaborate: id {} -> best {} -> eclass node {:?}",
id,
best_node_eclass,
self.egraph.classes[best_node_eclass]
);
let node_key = self.egraph.classes[best_node_eclass].get_node().unwrap();
let node = node_key.node(&self.egraph.nodes);
trace!(" -> enode {:?}", node);
// Is the node a block param? We should never get here if so
// (they are inserted when first visiting the block).
if matches!(node, Node::Param { .. }) {
unreachable!("Param nodes should already be inserted");
}
// Is the node a result projection? If so, resolve
// the value we are projecting a part of, then
// eventually return here (saving state with a
// PendingProjection).
if let Node::Result { value, result, .. } = node {
trace!(" -> result; pushing arg value {}", value);
self.elab_stack.push(ElabStackEntry::PendingProjection {
index: *result,
canonical,
});
self.elab_stack.push(ElabStackEntry::Start { id: *value });
continue;
}
// We're going to need to emit this
// operator. First, enqueue all args to be
// elaborated. Push state to receive the results
// and later elab this node.
let num_args = self.node_ctx.children(&node).len();
self.elab_stack.push(ElabStackEntry::PendingNode {
canonical,
node_key,
remat,
num_args,
});
// Push args in reverse order so we process the
// first arg first.
for &arg_id in self.node_ctx.children(&node).iter().rev() {
self.elab_stack.push(ElabStackEntry::Start { id: arg_id });
}
}
&ElabStackEntry::PendingNode {
canonical,
node_key,
remat,
num_args,
} => {
self.elab_stack.pop();
let node = node_key.node(&self.egraph.nodes);
// We should have all args resolved at this point.
let arg_idx = self.elab_result_stack.len() - num_args;
let args = &self.elab_result_stack[arg_idx..];
// Gather the individual output-CLIF `Value`s.
let arg_values: SmallVec<[Value; 8]> = args
.iter()
.map(|idvalue| match idvalue {
IdValue::Value { value, .. } => *value,
IdValue::Values { .. } => {
panic!("enode depends directly on multi-value result")
}
})
.collect();
// Compute max loop depth.
let max_loop_depth = args
.iter()
.map(|idvalue| match idvalue {
IdValue::Value { depth, .. } => *depth,
IdValue::Values { .. } => unreachable!(),
})
.max()
.unwrap_or(0);
// Remove args from result stack.
self.elab_result_stack.truncate(arg_idx);
// Determine the location at which we emit it. This is the
// current block *unless* we hoist above a loop when all args
// are loop-invariant (and this op is pure).
let (loop_depth, scope_depth, block) = if node.is_non_pure() {
// Non-pure op: always at the current location.
(
self.cur_loop_depth(),
self.id_to_value.depth(),
self.cur_block.unwrap(),
)
} else if max_loop_depth == self.cur_loop_depth() || remat {
// Pure op, but depends on some value at the current loop
// depth, or remat forces it here: as above.
(
self.cur_loop_depth(),
self.id_to_value.depth(),
self.cur_block.unwrap(),
)
} else {
// Pure op, and does not depend on any args at current
// loop depth: hoist out of loop.
self.stats.elaborate_licm_hoist += 1;
let data = &self.loop_stack[max_loop_depth as usize];
(max_loop_depth, data.scope_depth as usize, data.hoist_block)
};
// Loop scopes are a subset of all scopes.
debug_assert!(scope_depth >= loop_depth as usize);
// This is an actual operation; emit the node in sequence now.
let results = self.add_node(node, &arg_values[..], block);
let results_slice = results.as_slice(&self.func.dfg.value_lists);
// Build the result and memoize in the id-to-value map.
let result = if results_slice.len() == 1 {
IdValue::Value {
depth: loop_depth,
block,
value: results_slice[0],
}
} else {
IdValue::Values {
depth: loop_depth,
block,
values: results,
}
};
self.id_to_value.insert_if_absent_with_depth(
canonical,
result.clone(),
scope_depth,
);
// Push onto the elab-results stack.
self.elab_result_stack.push(result)
}
&ElabStackEntry::PendingProjection { index, canonical } => {
self.elab_stack.pop();
// Grab the input from the elab-result stack.
let value = self.elab_result_stack.pop().expect("Should have result");
let (depth, block, values) = match value {
IdValue::Values {
depth,
block,
values,
..
} => (depth, block, values),
IdValue::Value { .. } => {
unreachable!("Projection nodes should not be used on single results");
}
};
let values = values.as_slice(&self.func.dfg.value_lists);
let value = IdValue::Value {
depth,
block,
value: values[index],
};
self.id_to_value.insert_if_absent(canonical, value.clone());
self.elab_result_stack.push(value);
}
}
}
}
fn elaborate_block<'b, PF: Fn(Block) -> &'b [(Id, Type)], SEF: Fn(Block) -> &'b [Id]>(
&mut self,
idom: Option<Block>,
block: Block,
block_params_fn: &PF,
block_side_effects_fn: &SEF,
) {
let blockparam_ids_tys = (block_params_fn)(block);
self.start_block(idom, block, blockparam_ids_tys);
for &id in (block_side_effects_fn)(block) {
self.elaborate_eclass_use(id);
}
}
fn elaborate_domtree<'b, PF: Fn(Block) -> &'b [(Id, Type)], SEF: Fn(Block) -> &'b [Id]>(
&mut self,
block_params_fn: &PF,
block_side_effects_fn: &SEF,
domtree: &DomTreeWithChildren,
) {
let root = domtree.root();
self.block_stack.push(BlockStackEntry::Elaborate {
block: root,
idom: None,
});
while let Some(top) = self.block_stack.pop() {
match top {
BlockStackEntry::Elaborate { block, idom } => {
self.block_stack.push(BlockStackEntry::Pop);
self.id_to_value.increment_depth();
self.elaborate_block(idom, block, block_params_fn, block_side_effects_fn);
// Push children. We are doing a preorder
// traversal so we do this after processing this
// block above.
let block_stack_end = self.block_stack.len();
for child in domtree.children(block) {
self.block_stack.push(BlockStackEntry::Elaborate {
block: child,
idom: Some(block),
});
}
// Reverse what we just pushed so we elaborate in
// original block order. (The domtree iter is a
// single-ended iter over a singly-linked list so
// we can't `.rev()` above.)
self.block_stack[block_stack_end..].reverse();
}
BlockStackEntry::Pop => {
self.id_to_value.decrement_depth();
if let Some(innermost_loop) = self.loop_stack.last() {
if innermost_loop.scope_depth as usize == self.id_to_value.depth() {
self.loop_stack.pop();
}
}
}
}
}
}
fn clear_func_body(&mut self) {
// Clear all instructions and args/results from the DFG. We
// rebuild them entirely during elaboration. (TODO: reuse the
// existing inst for the *first* copy of a given node.)
self.func.dfg.clear_insts();
// Clear the instructions in every block, but leave the list
// of blocks and their layout unmodified.
self.func.layout.clear_insts();
self.func.srclocs.clear();
}
pub(crate) fn elaborate<'b, PF: Fn(Block) -> &'b [(Id, Type)], SEF: Fn(Block) -> &'b [Id]>(
&mut self,
block_params_fn: PF,
block_side_effects_fn: SEF,
) {
let domtree = DomTreeWithChildren::new(self.func, self.domtree);
self.stats.elaborate_func += 1;
self.stats.elaborate_func_pre_insts += self.func.dfg.num_insts() as u64;
self.clear_func_body();
self.compute_best_nodes();
self.elaborate_domtree(&block_params_fn, &block_side_effects_fn, &domtree);
self.stats.elaborate_func_post_insts += self.func.dfg.num_insts() as u64;
}
}


@@ -0,0 +1,376 @@
//! Node definition for EGraph representation.
use super::MemoryState;
use crate::ir::{Block, DataFlowGraph, Inst, InstructionImms, Opcode, RelSourceLoc, Type};
use crate::loop_analysis::LoopLevel;
use cranelift_egraph::{BumpArena, BumpSlice, CtxEq, CtxHash, Id, Language, UnionFind};
use cranelift_entity::{EntityList, ListPool};
use std::hash::{Hash, Hasher};
#[derive(Debug)]
pub enum Node {
/// A blockparam. Effectively an input/root; does not refer to
/// predecessors' branch arguments, because this would create
/// cycles.
Param {
/// CLIF block this param comes from.
block: Block,
/// Index of blockparam within block.
index: u32,
/// Type of the value.
ty: Type,
/// The loop level of this Param.
loop_level: LoopLevel,
},
/// A CLIF instruction that is pure (has no side-effects). Not
/// tied to any location; we will compute a set of locations at
/// which to compute this node during lowering back out of the
/// egraph.
Pure {
/// The instruction data, without SSA values.
op: InstructionImms,
/// eclass arguments to the operator.
args: EntityList<Id>,
/// Types of results.
types: BumpSlice<Type>,
},
/// A CLIF instruction that has side-effects or is otherwise not
/// representable by `Pure`.
Inst {
/// The instruction data, without SSA values.
op: InstructionImms,
/// eclass arguments to the operator.
args: EntityList<Id>,
/// Types of results.
types: BumpSlice<Type>,
/// The index of the original instruction. We include this so
/// that the `Inst`s are not deduplicated: every instance is a
/// logically separate and unique side-effect. However,
/// because we clear the DataFlowGraph before elaboration,
/// this `Inst` is *not* valid to fetch any details from the
/// original instruction.
inst: Inst,
/// The source location to preserve.
srcloc: RelSourceLoc,
/// The loop level of this Inst.
loop_level: LoopLevel,
},
/// A projection of one result of an `Inst` or `Pure`.
Result {
/// `Inst` or `Pure` node.
value: Id,
/// Index of the result we want.
result: usize,
/// Type of the value.
ty: Type,
},
/// A load instruction. Nominally a side-effecting `Inst` (and
/// included in the list of side-effecting roots so it will always
/// be elaborated), but represented as a distinct kind of node so
/// that we can leverage deduplication to do
/// redundant-load-elimination for free (and make store-to-load
/// forwarding much easier).
Load {
// -- identity depends on:
/// The original load operation. Must have one argument, the
/// address.
op: InstructionImms,
/// The type of the load result.
ty: Type,
/// Address argument. Actual address has an offset, which is
/// included in `op` (and thus already considered as part of
/// the key).
addr: Id,
/// The abstract memory state that this load accesses.
mem_state: MemoryState,
// -- not included in dedup key:
/// The `Inst` we will use for a trap location for this
/// load. Excluded from Eq/Hash so that loads that are
/// identical except for the specific instance will dedup on
/// top of each other.
inst: Inst,
/// Source location, for traps. Not included in Eq/Hash.
srcloc: RelSourceLoc,
},
}
impl Node {
pub(crate) fn is_non_pure(&self) -> bool {
match self {
Node::Inst { .. } | Node::Load { .. } => true,
_ => false,
}
}
}
/// Shared pools for type and id lists in nodes.
pub struct NodeCtx {
/// Arena for result-type arrays.
pub types: BumpArena<Type>,
/// Arena for arg eclass-ID lists.
pub args: ListPool<Id>,
}
impl NodeCtx {
pub(crate) fn with_capacity_for_dfg(dfg: &DataFlowGraph) -> Self {
let n_types = dfg.num_values();
let n_args = dfg.value_lists.capacity();
Self {
types: BumpArena::arena_with_capacity(n_types),
args: ListPool::with_capacity(n_args),
}
}
}
impl NodeCtx {
fn ids_eq(&self, a: &EntityList<Id>, b: &EntityList<Id>, uf: &mut UnionFind) -> bool {
let a = a.as_slice(&self.args);
let b = b.as_slice(&self.args);
a.len() == b.len() && a.iter().zip(b.iter()).all(|(&a, &b)| uf.equiv_id_mut(a, b))
}
fn hash_ids<H: Hasher>(&self, a: &EntityList<Id>, hash: &mut H, uf: &mut UnionFind) {
let a = a.as_slice(&self.args);
for &id in a {
uf.hash_id_mut(hash, id);
}
}
}
impl CtxEq<Node, Node> for NodeCtx {
fn ctx_eq(&self, a: &Node, b: &Node, uf: &mut UnionFind) -> bool {
match (a, b) {
(
&Node::Param {
block,
index,
ty,
loop_level: _,
},
&Node::Param {
block: other_block,
index: other_index,
ty: other_ty,
loop_level: _,
},
) => block == other_block && index == other_index && ty == other_ty,
(
&Node::Result { value, result, ty },
&Node::Result {
value: other_value,
result: other_result,
ty: other_ty,
},
) => uf.equiv_id_mut(value, other_value) && result == other_result && ty == other_ty,
(
&Node::Pure {
ref op,
ref args,
ref types,
},
&Node::Pure {
op: ref other_op,
args: ref other_args,
types: ref other_types,
},
) => {
*op == *other_op
&& self.ids_eq(args, other_args, uf)
&& types.as_slice(&self.types) == other_types.as_slice(&self.types)
}
(
&Node::Inst { inst, ref args, .. },
&Node::Inst {
inst: other_inst,
args: ref other_args,
..
},
) => inst == other_inst && self.ids_eq(args, other_args, uf),
(
&Node::Load {
ref op,
ty,
addr,
mem_state,
..
},
&Node::Load {
op: ref other_op,
ty: other_ty,
addr: other_addr,
mem_state: other_mem_state,
// Explicitly exclude: `inst` and `srcloc`. We
// want loads to merge if identical in
// opcode/offset, address expression, and last
// store (this does implicit
// redundant-load-elimination.)
//
// Note however that we *do* include `ty` (the
// type) and match on that: we otherwise would
// have no way of disambiguating loads of
// different widths to the same address.
..
},
) => {
op == other_op
&& ty == other_ty
&& uf.equiv_id_mut(addr, other_addr)
&& mem_state == other_mem_state
}
_ => false,
}
}
}
impl CtxHash<Node> for NodeCtx {
fn ctx_hash(&self, value: &Node, uf: &mut UnionFind) -> u64 {
let mut state = crate::fx::FxHasher::default();
std::mem::discriminant(value).hash(&mut state);
match value {
&Node::Param {
block,
index,
ty: _,
loop_level: _,
} => {
block.hash(&mut state);
index.hash(&mut state);
}
&Node::Result {
value,
result,
ty: _,
} => {
uf.hash_id_mut(&mut state, value);
result.hash(&mut state);
}
&Node::Pure {
ref op,
ref args,
types: _,
} => {
op.hash(&mut state);
self.hash_ids(args, &mut state, uf);
// Don't hash `types`: it requires an indirection
// (hence cache misses), and result type *should* be
// fully determined by op and args.
}
&Node::Inst { inst, ref args, .. } => {
inst.hash(&mut state);
self.hash_ids(args, &mut state, uf);
}
&Node::Load {
ref op,
ty,
addr,
mem_state,
..
} => {
op.hash(&mut state);
ty.hash(&mut state);
uf.hash_id_mut(&mut state, addr);
mem_state.hash(&mut state);
}
}
state.finish()
}
}
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub(crate) struct Cost(u32);
impl Cost {
pub(crate) fn at_level(&self, loop_level: LoopLevel) -> Cost {
let loop_level = std::cmp::min(2, loop_level.level());
let multiplier = 1u32 << ((10 * loop_level) as u32);
Cost(self.0.saturating_mul(multiplier)).finite()
}
pub(crate) fn infinity() -> Cost {
// 2^32 - 1 is, uh, pretty close to infinite... (we use `Cost`
// only for heuristics and always saturate so this suffices!)
Cost(u32::MAX)
}
pub(crate) fn zero() -> Cost {
Cost(0)
}
/// Clamp this cost at a "finite" value. Can be used in
/// conjunction with saturating ops to avoid saturating into
/// `infinity()`.
fn finite(self) -> Cost {
Cost(std::cmp::min(u32::MAX - 1, self.0))
}
}
impl std::default::Default for Cost {
fn default() -> Cost {
Cost::zero()
}
}
impl std::ops::Add<Cost> for Cost {
type Output = Cost;
fn add(self, other: Cost) -> Cost {
Cost(self.0.saturating_add(other.0)).finite()
}
}
pub(crate) fn op_cost(op: &InstructionImms) -> Cost {
match op.opcode() {
// Constants.
Opcode::Iconst | Opcode::F32const | Opcode::F64const | Opcode::Bconst => Cost(0),
// Extends/reduces.
Opcode::Bextend
| Opcode::Breduce
| Opcode::Uextend
| Opcode::Sextend
| Opcode::Ireduce
| Opcode::Iconcat
| Opcode::Isplit => Cost(1),
// "Simple" arithmetic.
Opcode::Iadd
| Opcode::Isub
| Opcode::Band
| Opcode::BandNot
| Opcode::Bor
| Opcode::BorNot
| Opcode::Bxor
| Opcode::BxorNot
| Opcode::Bnot => Cost(2),
// Everything else.
_ => Cost(3),
}
}
impl Language for NodeCtx {
type Node = Node;
fn children<'a>(&'a self, node: &'a Node) -> &'a [Id] {
match node {
Node::Param { .. } => &[],
Node::Pure { args, .. } | Node::Inst { args, .. } => args.as_slice(&self.args),
Node::Load { addr, .. } => std::slice::from_ref(addr),
Node::Result { value, .. } => std::slice::from_ref(value),
}
}
fn children_mut<'a>(&'a mut self, node: &'a mut Node) -> &'a mut [Id] {
match node {
Node::Param { .. } => &mut [],
Node::Pure { args, .. } | Node::Inst { args, .. } => args.as_mut_slice(&mut self.args),
Node::Load { addr, .. } => std::slice::from_mut(addr),
Node::Result { value, .. } => std::slice::from_mut(value),
}
}
fn needs_dedup(&self, node: &Node) -> bool {
match node {
Node::Pure { .. } | Node::Load { .. } => true,
_ => false,
}
}
}


@@ -0,0 +1,266 @@
//! Last-store tracking via alias analysis.
//!
//! We partition memory state into several *disjoint pieces* of
//! "abstract state". There are a finite number of such pieces:
//! currently, we call them "heap", "table", "vmctx", and "other". Any
//! given address in memory belongs to exactly one disjoint piece.
//!
//! One never tracks which piece a concrete address belongs to at
//! runtime; this is a purely static concept. Instead, all
//! memory-accessing instructions (loads and stores) are labeled with
//! one of these four categories in the `MemFlags`. It is forbidden
//! for a load or store to access memory under one category and a
//! later load or store to access the same memory under a different
//! category. This is ensured to be true by construction during
//! frontend translation into CLIF and during legalization.
//!
//! Given that this non-aliasing property is ensured by the producer
//! of CLIF, we can compute a *may-alias* property: one load or store
//! may-alias another load or store if both access the same category
//! of abstract state.
//!
//! The "last store" pass helps to compute this aliasing: we perform a
//! fixpoint analysis to track the last instruction that *might have*
//! written to a given part of abstract state. We also track the block
//! containing this store.
//!
//! We can't say for sure that the "last store" *did* actually write
//! that state, but we know for sure that no instruction *later* than
//! it (up to the current instruction) did. However, we can get a
//! must-alias property from this: if at a given load or store, we
//! look backward to the "last store", *AND* we find that it has
//! exactly the same address expression and value type, then we know
//! that the current instruction's access *must* be to the same memory
//! location.
//!
//! To get this must-alias property, we leverage the node
//! hashconsing. We design the Eq/Hash (node identity relation
//! definition) of the `Node` struct so that all loads with (i) the
//! same "last store", and (ii) the same address expression, and (iii)
//! the same opcode-and-offset, will deduplicate (the first will be
//! computed, and the later ones will use the same value). Furthermore
//! we have an optimization that rewrites a load into the stored value
//! of the last store *if* the last store has the same address
//! expression and constant offset.
//!
//! This gives us two optimizations, "redundant load elimination" and
//! "store-to-load forwarding".
//!
//! In theory we could also do *dead-store elimination*, where if a
//! store overwrites a value earlier written by another store, *and*
//! if no other load/store to the abstract state category occurred,
//! *and* no other trapping instruction occurred (at which point we
//! need an up-to-date memory state because post-trap-termination
//! memory state can be observed), *and* we can prove the original
//! store could not have trapped, then we can eliminate the original
//! store. Because this is so complex, and the conditions for doing it
//! correctly when post-trap state must be correct likely reduce the
//! potential benefit, we don't yet do this.
use crate::flowgraph::ControlFlowGraph;
use crate::fx::{FxHashMap, FxHashSet};
use crate::inst_predicates::has_memory_fence_semantics;
use crate::ir::{Block, Function, Inst, InstructionData, MemFlags, Opcode};
use crate::trace;
use cranelift_entity::SecondaryMap;
use smallvec::{smallvec, SmallVec};
/// For a given program point, the vector of last-store instruction
/// indices for each disjoint category of abstract state.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct LastStores {
heap: MemoryState,
table: MemoryState,
vmctx: MemoryState,
other: MemoryState,
}
/// State of memory seen by a load.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash, Default)]
pub enum MemoryState {
/// State at function entry: nothing is known (but it is one
/// consistent value, so two loads from "entry" state at the same
/// address will still provide the same result).
#[default]
Entry,
/// State just after a store by the given instruction. The
/// instruction is a store from which we can forward.
Store(Inst),
/// State just before the given instruction. Used for abstract
/// value merges at merge-points when we cannot name a single
/// producing site.
BeforeInst(Inst),
/// State just after the given instruction. Used when the
/// instruction may update the associated state, but is not a
/// store whose value we can cleanly forward. (E.g., perhaps a
/// barrier of some sort.)
AfterInst(Inst),
}
impl LastStores {
fn update(&mut self, func: &Function, inst: Inst) {
let opcode = func.dfg[inst].opcode();
if has_memory_fence_semantics(opcode) {
self.heap = MemoryState::AfterInst(inst);
self.table = MemoryState::AfterInst(inst);
self.vmctx = MemoryState::AfterInst(inst);
self.other = MemoryState::AfterInst(inst);
} else if opcode.can_store() {
if let Some(memflags) = func.dfg[inst].memflags() {
*self.for_flags(memflags) = MemoryState::Store(inst);
} else {
self.heap = MemoryState::AfterInst(inst);
self.table = MemoryState::AfterInst(inst);
self.vmctx = MemoryState::AfterInst(inst);
self.other = MemoryState::AfterInst(inst);
}
}
}
fn for_flags(&mut self, memflags: MemFlags) -> &mut MemoryState {
if memflags.heap() {
&mut self.heap
} else if memflags.table() {
&mut self.table
} else if memflags.vmctx() {
&mut self.vmctx
} else {
&mut self.other
}
}
fn meet_from(&mut self, other: &LastStores, loc: Inst) {
let meet = |a: MemoryState, b: MemoryState| -> MemoryState {
match (a, b) {
(a, b) if a == b => a,
_ => MemoryState::BeforeInst(loc),
}
};
self.heap = meet(self.heap, other.heap);
self.table = meet(self.table, other.table);
self.vmctx = meet(self.vmctx, other.vmctx);
self.other = meet(self.other, other.other);
}
}
/// An alias-analysis pass.
pub struct AliasAnalysis {
/// Last-store instruction (or none) for a given load. Use a hash map
/// instead of a `SecondaryMap` because this is sparse.
load_mem_state: FxHashMap<Inst, MemoryState>,
}
impl AliasAnalysis {
/// Perform an alias analysis pass.
pub fn new(func: &Function, cfg: &ControlFlowGraph) -> AliasAnalysis {
log::trace!("alias analysis: input is:\n{:?}", func);
let block_input = Self::compute_block_input_states(func, cfg);
let load_mem_state = Self::compute_load_last_stores(func, block_input);
AliasAnalysis { load_mem_state }
}
fn compute_block_input_states(
func: &Function,
cfg: &ControlFlowGraph,
) -> SecondaryMap<Block, Option<LastStores>> {
let mut block_input = SecondaryMap::with_capacity(func.dfg.num_blocks());
let mut worklist: SmallVec<[Block; 8]> = smallvec![];
let mut worklist_set = FxHashSet::default();
let entry = func.layout.entry_block().unwrap();
worklist.push(entry);
worklist_set.insert(entry);
block_input[entry] = Some(LastStores::default());
while let Some(block) = worklist.pop() {
worklist_set.remove(&block);
let state = block_input[block].clone().unwrap();
trace!("alias analysis: input to {} is {:?}", block, state);
let state = func
.layout
.block_insts(block)
.fold(state, |mut state, inst| {
state.update(func, inst);
trace!("after {}: state is {:?}", inst, state);
state
});
for succ in cfg.succ_iter(block) {
let succ_first_inst = func.layout.first_inst(succ).unwrap();
let succ_state = &mut block_input[succ];
let old = succ_state.clone();
if let Some(succ_state) = succ_state.as_mut() {
succ_state.meet_from(&state, succ_first_inst);
} else {
*succ_state = Some(state);
};
let updated = *succ_state != old;
if updated && worklist_set.insert(succ) {
worklist.push(succ);
}
}
}
block_input
}
fn compute_load_last_stores(
func: &Function,
block_input: SecondaryMap<Block, Option<LastStores>>,
) -> FxHashMap<Inst, MemoryState> {
let mut load_mem_state = FxHashMap::default();
for block in func.layout.blocks() {
let mut state = block_input[block].clone().unwrap();
for inst in func.layout.block_insts(block) {
trace!(
"alias analysis: scanning at {} with state {:?} ({:?})",
inst,
state,
func.dfg[inst],
);
// N.B.: we match `Load` specifically, and not any
// other kinds of loads (or any opcode such that
// `opcode.can_load()` returns true), because some
// "can load" instructions actually have very
// different semantics (are not just a load of a
// particularly-typed value). For example, atomic
// (load/store, RMW, CAS) instructions "can load" but
// definitely should not participate in store-to-load
// forwarding or redundant-load elimination. Our goal
// here is to provide a `MemoryState` just for plain
// old loads whose semantics we can completely reason
// about.
if let InstructionData::Load {
opcode: Opcode::Load,
flags,
..
} = func.dfg[inst]
{
let mem_state = *state.for_flags(flags);
trace!(
"alias analysis: at {}: load with mem_state {:?}",
inst,
mem_state,
);
load_mem_state.insert(inst, mem_state);
}
state.update(func, inst);
}
}
load_mem_state
}
/// Get the state seen by a load, if any.
pub fn get_state_for_load(&self, inst: Inst) -> Option<MemoryState> {
self.load_mem_state.get(&inst).copied()
}
}