An epsilon machine (\epsilon-machine) is the minimally sized, maximally predictive Hidden Markov Model of a stochastic process. \epsilon-Machines are a useful mathematical abstraction, but can also be inferred (reconstructed) from finite strings of symbols.
This page documents code for reconstruction using finite data, found in the Computational Mechanics in Python (CMPy) package. This code implements state splitting and tree merging reconstruction techniques for finite data sources. For background on \epsilon-machines, see An Algorithm for Pattern Discovery in Time Series.
Fetching CMPy requires the free git version control system:
git clone git://vcs.cse.ucdavis.edu/cmpy.git
Install the cmpy package:
cd cmpy sudo python setup.py install
A great way to experiment with cmpy is the interactive python shell provided by ipython. To copy code from this tutorial and paste it directly into an ipython session, first enter the following command at the ipython prompt:
%doctest_mode
OS X users may find this tutorial useful.
Basic use requires creating a TreeMerger or StateSplitter object, and calling its infer method which takes time series data and any non-default inference parameters. Several stochastic processes are also available for generating data, by calling generate(number_of_symbols) on a process object. An example should make this clear.
Below we infer the \epsilon-machine for the golden mean process that generates a binary string with no consecutive zeros, and write the machine to a Graphviz file named em.dot by default.
from cmpy.inference import * t = TreeMerger() data = GoldenMean().generate(1000) t.infer(data) t.writeGraph()
The resulting \epsilon-machine, exported from Graphviz:
Note the transition probabilities 49% and 51% should each be 50% if given infinite data, and may vary more or less for different realizations and data lengths.
Next we infer from the even process, generating a binary string where ones occur in blocks of even length only. When writing the Graphviz file, we omit transient states and specify the filename even.dot. Here we use state splitting instead of tree merging.
from cmpy.inference import *
s = StateSplitter()
s.infer(EvenProcess().generate(1000))
s.writeGraph('even.dot', transients=False)
Note the state labeled 1 is missing, as it is a transient state omitted by the transients=False parameter.
A tutorial on sequential inference can be found here.
Several issues can arise when reconstructing machines from finite data, including dangling states and non-deterministic transitions. The code in cmpy.inference has developed heuristics for addressing some of these issues. These heuristics enable reconstruction of valid \epsilon-machines at the potential cost of model accuracy.
Reconstruction is highly sensitive to the choice of algorithm and parameters. Some algorithms struggle with certain types of processes. For example, compare the inferred \epsilon-machine for RRXOR using tree merging versus state splitting. When inferring from an unknown data source, experimenting with different algorithms and parameters is suggested. In general, higher parameter values require more data for accurate modeling. For tree merging, the following excerpt from James Hanson's thesis provides insight:
"The simplest reason for nondeterminism is insufficient statistics. That is, the tree depth D was chosen too large relative to the size N of the data sequence. In this case, empirical fluctuations in the observed length D subwords corrupt the shape of the overall tree, ultimately leading to subtrees with "missing" branches due to incomplete statistics. This artificially drives up the number of morphs observed, leading to an explosion in the number of machine states. The extra states are generally "dangling" states with no exiting transitions, and are generally reached by nondeterministic transitions. As more data is added to the tree, the missing branches in the subtrees are gradually filled in, reducing the number of morphs that occur, and the danglers and nondeterministic transitions disappear.
Another source of nondeterminism is nonstationarity in the data. If the length D subwords laid down on the tree were generated by a nonstationary source, or were collected in some nonstationary manner, branches at different levels of the tree may connect different pairs of morphs...
A different problem occurs when L is very small, for example because D was small. In that case, the reconstructed machines can be inaccurate models of the underlying process that generated the data. This can be understood by observing that if the data source induces correlations on a length scale of L', morphs of depth L < L' will fail to capture them. That is, two subtrees that are equivalent at depth L may be distinct at depth L'. Nodes are classed as future equivalent when in fact they are not. This leads to machines that are too "loose", in that they accept sequences that the data source cannot in fact generate.
In practice, the two extremes of (i) inaccuracy due to small morph depth and (ii) nondeterminism due to bad statistics are played off one another in a systematic fashion. For fixed, finite data size N, a series of reconstructions is performed for progressively increasing values of D and L. At first the machine is inaccurate because L is too small. Then as D and L increase the machine grows as the different classes of future equivalence are better resolved. If the data is stationary and if N is sufficiently large, there is a period during which several successively reconstructed machines are identical; at this point the reconstruction is said to have stabilized. Finally, as D and L grow too large, statistical fluctuations begin to dominate, the machine size blows up, and nondeterministic transitions appear."