peepmatic: Introduce the peepmatic-automata crate

The `peepmatic-automata` crate builds and queries finite-state transducer
automata.

A transducer is a type of automaton that has not only an input that it
accepts or rejects, but also an output. While a regular automaton checks
whether an input string is in the set that the automaton accepts, a
transducer maps input strings to values. A regular automaton is something
like a compressed, immutable set, and a transducer is something like a
compressed, immutable key-value dictionary. A [trie] compresses a set of
strings, or a map from strings to values, by sharing prefixes of the input
strings. Automata and transducers can compress even better: they can share
both prefixes and suffixes. [*Index 1,600,000,000 Keys with Automata and
Rust* by Andrew Gallant (aka burntsushi)][burntsushi-blog-post] is a
top-notch introduction.

If you're looking for a general-purpose transducers crate in Rust, you're
probably looking for [the `fst` crate][fst-crate]. While this implementation
is fully generic and has no dependencies, its feature set is specific to
`peepmatic`'s needs:

* We need to associate extra data with each state: the match operation to
  evaluate next.

* We can't provide the full input string up front, so this crate must
  support incremental lookups. This is because the peephole optimizer is
  computing the input string incrementally and dynamically: it looks at the
  current state's match operation, evaluates it, and then uses the result as
  the next character of the input string.

* We also support incremental insertion and output when building the
  transducer. This is necessary because we don't want to emit output values
  that bind a match on an optimization's left-hand side pattern (for
  example) until after we've succeeded in matching it, which might not
  happen until we've reached the nth state.

* We need to support generic output values. The `fst` crate only supports
  `u64` outputs, while we need to build up an optimization's right-hand side
  instructions.
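
To make the incremental-lookup requirement above concrete, here is a minimal
sketch of a cursor that is fed one input symbol at a time and exposes the data
attached to its current state. Every name in it (`Transducer`, `Query`,
`current_data`, `next`, `example`) is hypothetical and chosen for illustration;
it is not this crate's API, and the transducer is built by hand rather than
minimized.

```rust
use std::collections::BTreeMap;

/// Hypothetical, hand-rolled transducer for illustration only; this is not
/// the `peepmatic-automata` API.
struct Transducer {
    /// Extra data per state, standing in for "the match operation to
    /// evaluate next".
    state_data: Vec<&'static str>,
    /// `transitions[state][input]` is the next state plus the output emitted
    /// when that edge is taken.
    transitions: Vec<BTreeMap<u32, (usize, Vec<&'static str>)>>,
}

/// A cursor that is fed one input symbol at a time.
struct Query<'a> {
    transducer: &'a Transducer,
    state: usize,
}

impl<'a> Query<'a> {
    /// Start at state 0, which this sketch assumes is the start state.
    fn new(transducer: &'a Transducer) -> Self {
        Query {
            transducer,
            state: 0,
        }
    }

    /// The data attached to the current state.
    fn current_data(&self) -> &'static str {
        self.transducer.state_data[self.state]
    }

    /// Feed one symbol. Returns the output on the edge taken, or `None`
    /// (leaving the cursor where it was) if there is no such edge.
    fn next(&mut self, input: u32) -> Option<&'a [&'static str]> {
        let transducer: &'a Transducer = self.transducer;
        let (next_state, output) = transducer.transitions[self.state].get(&input)?;
        self.state = *next_state;
        Some(output.as_slice())
    }
}

/// Builds a three-state example by hand: 0 --1--> 1 --0--> 2.
fn example() -> Transducer {
    let mut from_start = BTreeMap::new();
    from_start.insert(1, (1, vec!["bind lhs"]));
    let mut from_middle = BTreeMap::new();
    from_middle.insert(0, (2, vec!["emit rhs"]));
    Transducer {
        state_data: vec!["is-add?", "rhs-constant?", "done"],
        transitions: vec![from_start, from_middle, BTreeMap::new()],
    }
}
```

A caller would read `current_data()`, evaluate whatever that data describes,
and pass the result to `next()` as the next input symbol, collecting the
returned outputs along the way.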

This implementation is based on [*Direct Construction of Minimal Acyclic
Subsequential Transducers* by Mihov and Maurel][paper]. That means that keys
must be inserted in lexicographic order during construction.
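
As a rough illustration of the ordering constraint, the sketch below rejects
any key that is not strictly greater than the previously inserted one.
`SortedKeyGuard`, `check`, and `common_prefix_len` are hypothetical names, not
part of this crate. The sketch also reports how many leading symbols a key
shares with its predecessor, on the assumption that this shared prefix is the
information a sorted-order construction works from.

```rust
/// Hypothetical helper, not part of `peepmatic-automata`: remembers the
/// previously inserted key and rejects keys that arrive out of order.
struct SortedKeyGuard {
    previous: Option<Vec<u8>>,
}

impl SortedKeyGuard {
    fn new() -> Self {
        SortedKeyGuard { previous: None }
    }

    /// Accepts `key` only if it is strictly greater (lexicographically) than
    /// the previously accepted key. On success, returns the number of leading
    /// symbols shared with that previous key (0 for the first key).
    fn check(&mut self, key: &[u8]) -> Result<usize, String> {
        let shared = match &self.previous {
            None => 0,
            Some(prev) => {
                // Slices of `u8` compare lexicographically.
                if key <= prev.as_slice() {
                    return Err(format!(
                        "key {:?} is not greater than previous key {:?}",
                        key, prev
                    ));
                }
                common_prefix_len(prev, key)
            }
        };
        self.previous = Some(key.to_vec());
        Ok(shared)
    }
}

/// Length of the longest common prefix of `a` and `b`.
fn common_prefix_len(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}
```

A rejected key leaves the guard unchanged, so the caller can report the error
and continue with the next key.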

[trie]: https://en.wikipedia.org/wiki/Trie
[burntsushi-blog-post]: https://blog.burntsushi.net/transducers/#ordered-maps
[fst-crate]: https://crates.io/crates/fst
[paper]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3698&rep=rep1&type=pdf
Nick Fitzgerald
2020-05-01 15:30:37 -07:00
parent 0592b5a995
commit c82326a1ae
5 changed files with 1603 additions and 0 deletions

@@ -0,0 +1,195 @@
//! `serde::Serialize` and `serde::Deserialize` implementations for `Automaton`.
//!
//! Rather than prefix each serialized field with which field it is, we always
//! serialize fields in alphabetical order. Make sure to maintain this if you
//! add or remove fields!
//!
//! Each time you add/remove a field, or change serialization in any other way,
//! make sure to bump `SERIALIZATION_VERSION`.

use crate::{Automaton, Output, State};
use serde::{
    de::{self, Deserializer, SeqAccess, Visitor},
    ser::SerializeTupleStruct,
    Deserialize, Serialize, Serializer,
};
use std::collections::BTreeMap;
use std::fmt;
use std::hash::Hash;
use std::marker::PhantomData;

const SERIALIZATION_VERSION: u32 = 1;

impl Serialize for State {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        serializer.serialize_u32(self.0)
    }
}

impl<'de> Deserialize<'de> for State {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>,
    {
        Ok(State(deserializer.deserialize_u32(U32Visitor)?))
    }
}

struct U32Visitor;

impl<'de> Visitor<'de> for U32Visitor {
    type Value = u32;

    fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
        f.write_str("an integer between `0` and `2^32 - 1`")
    }

    fn visit_u8<E>(self, value: u8) -> Result<Self::Value, E>
    where
        E: de::Error,
    {
        Ok(u32::from(value))
    }

    fn visit_u32<E>(self, value: u32) -> Result<Self::Value, E>
    where
        E: de::Error,
    {
        Ok(value)
    }

    fn visit_u64<E>(self, value: u64) -> Result<Self::Value, E>
    where
        E: de::Error,
    {
        use std::u32;
        if value <= u64::from(u32::MAX) {
            Ok(value as u32)
        } else {
            Err(E::custom(format!("u32 out of range: {}", value)))
        }
    }
}

impl<TAlphabet, TState, TOutput> Serialize for Automaton<TAlphabet, TState, TOutput>
where
    TAlphabet: Serialize + Clone + Eq + Hash + Ord,
    TState: Serialize + Clone + Eq + Hash,
    TOutput: Serialize + Output,
{
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        let Automaton {
            final_states,
            start_state,
            state_data,
            transitions,
        } = self;

        let mut s = serializer.serialize_tuple_struct("Automaton", 5)?;
        s.serialize_field(&SERIALIZATION_VERSION)?;
        s.serialize_field(final_states)?;
        s.serialize_field(start_state)?;
        s.serialize_field(state_data)?;
        s.serialize_field(transitions)?;
        s.end()
    }
}

impl<'de, TAlphabet, TState, TOutput> Deserialize<'de> for Automaton<TAlphabet, TState, TOutput>
where
    TAlphabet: 'de + Deserialize<'de> + Clone + Eq + Hash + Ord,
    TState: 'de + Deserialize<'de> + Clone + Eq + Hash,
    TOutput: 'de + Deserialize<'de> + Output,
{
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>,
    {
        deserializer.deserialize_tuple_struct(
            "Automaton",
            5,
            AutomatonVisitor {
                phantom: PhantomData,
            },
        )
    }
}

struct AutomatonVisitor<'de, TAlphabet, TState, TOutput>
where
    TAlphabet: 'de + Deserialize<'de> + Clone + Eq + Hash + Ord,
    TState: 'de + Deserialize<'de> + Clone + Eq + Hash,
    TOutput: 'de + Deserialize<'de> + Output,
{
    phantom: PhantomData<&'de (TAlphabet, TState, TOutput)>,
}

impl<'de, TAlphabet, TState, TOutput> Visitor<'de>
    for AutomatonVisitor<'de, TAlphabet, TState, TOutput>
where
    TAlphabet: 'de + Deserialize<'de> + Clone + Eq + Hash + Ord,
    TState: 'de + Deserialize<'de> + Clone + Eq + Hash,
    TOutput: 'de + Deserialize<'de> + Output,
{
    type Value = Automaton<TAlphabet, TState, TOutput>;

    fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
        f.write_str("Automaton")
    }

    fn visit_seq<A>(self, mut seq: A) -> Result<Self::Value, A::Error>
    where
        A: SeqAccess<'de>,
    {
        match seq.next_element::<u32>()? {
            Some(v) if v == SERIALIZATION_VERSION => {}
            Some(v) => {
                return Err(de::Error::invalid_value(
                    de::Unexpected::Unsigned(v as u64),
                    &self,
                ));
            }
            None => return Err(de::Error::invalid_length(0, &"Automaton expects 5 elements")),
        }

        let final_states = match seq.next_element::<BTreeMap<State, TOutput>>()? {
            Some(x) => x,
            None => return Err(de::Error::invalid_length(1, &"Automaton expects 5 elements")),
        };

        let start_state = match seq.next_element::<State>()? {
            Some(x) => x,
            None => return Err(de::Error::invalid_length(2, &"Automaton expects 5 elements")),
        };

        let state_data = match seq.next_element::<Vec<Option<TState>>>()? {
            Some(x) => x,
            None => return Err(de::Error::invalid_length(3, &"Automaton expects 5 elements")),
        };

        let transitions = match seq.next_element::<Vec<BTreeMap<TAlphabet, (State, TOutput)>>>()? {
            Some(x) => x,
            None => return Err(de::Error::invalid_length(4, &"Automaton expects 5 elements")),
        };

        let automata = Automaton {
            final_states,
            start_state,
            state_data,
            transitions,
        };

        // Ensure that the deserialized automata is well-formed.
        automata
            .check_representation()
            .map_err(|msg| de::Error::custom(msg))?;

        Ok(automata)
    }
}