Observable Operator Models

This paper describes a new approach to modelling discrete stochastic processes, called observable operator models (OOMs). OOMs were introduced by Jaeger as a generalization of hidden Markov models (HMMs). The theory of OOMs combines probabilistic and linear-algebraic tools, which has an important advantage: with the tools of linear algebra, a very simple and efficient learning algorithm can be derived for OOMs, one that appears to outperform the known algorithms for HMMs. This learning algorithm is presented in detail in the second part of the article.


Introduction
The theory of hidden Markov models (HMMs) was developed in the 1960s (Baum and Petrie, 1966). In the 1980s the model became very popular in applications, first of all in speech recognition. Since then, hidden Markov models have proved very useful in nanotechnology (Hunter, Jones, Sagar, and Lafontaine, 1995), telecommunication (Shue, Dey, Anderson, and Bruyne, 1999), speech recognition (Huang, Ariki, and Jack, 1990), financial mathematics (Elliott, Malcolm, and Tsoi, 2002), and astronomy (Berger, 1997).
Observable operator models (OOMs) were introduced by Jaeger as a generalization of HMMs (Jaeger, 1997, 2000b). HMMs have a structure of hidden states and emission distributions, whereas the theory of OOMs concentrates on the observations themselves. In OOM theory the model trajectory is seen as a sequence of linear operators, not as a sequence of states. This idea leads to the linear-algebraic structure of OOMs, which provides efficient methods for estimation and learning. The core of the learning algorithm has a time complexity of O(N + nm^3), where N is the size of the training data, n is the number of distinguishable outcomes, and m is the model state space dimension, i.e. the dimension of the vector space on which the observable operators act. Jaeger (2000a) presents a comprehensive study of the OOM learning algorithm.
The observation probability matrix is B = (b_j(k)), where b_j(k) = P(Y_t = a_k | X_t = s_j). Finally, the initial state distribution of the Markov chain is π = (π_i), where π_i = P(X_0 = s_i). Therefore, a hidden Markov model is given by the triple λ = (A, B, π). For the useful applications of HMMs we have to solve three basic problems:
1. Evaluation problem: Given the observation sequence y_1, ..., y_n and the model λ = (A, B, π), how do we efficiently compute the probability P(Y_1 = y_1, ..., Y_n = y_n | λ)?
2. Specification of the hidden states: Given the observation sequence Y_1, ..., Y_n and the model λ = (A, B, π), how do we choose a corresponding state sequence s_1, ..., s_n which best explains the observations?
3. Parameter estimation: How do we adjust the model parameters λ = (A, B, π) to maximize the probability P(Y_1 = y_1, ..., Y_n = y_n | λ)?
For a detailed description and solutions of these problems see Rabiner (1989).
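The evaluation problem has the cleanest classical solution: the forward algorithm computes the sequence probability in O(nm^2) time. The following is a minimal sketch; the transition and emission matrices are hypothetical example values, not taken from the source.

```python
import numpy as np

def forward_probability(A, B, pi, obs):
    """Forward algorithm: P(Y_1 = y_1, ..., Y_n = y_n | lambda).

    A[i, j] = P(X_{t+1} = s_j | X_t = s_i)  (row-stochastic transitions)
    B[i, k] = P(Y_t = a_k | X_t = s_i)      (emission probabilities)
    pi[i]   = P(X_0 = s_i)                  (initial distribution)
    obs is a list of outcome indices.
    """
    alpha = pi * B[:, obs[0]]            # alpha_1(i) = pi_i * b_i(y_1)
    for y in obs[1:]:
        alpha = (A.T @ alpha) * B[:, y]  # propagate one step, then emit y
    return float(alpha.sum())

# Hypothetical two-state, two-outcome model.
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([1/3, 2/3])
p = forward_probability(A, B, pi, [0, 1, 0])
```

A quick consistency check on the recursion: the probabilities of all eight length-3 sequences sum to one.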

The OOMs as a Generalization of HMMs
OOMs provide another point of view on HMM theory. When we use HMMs we consider a trajectory as a sequence of states, whereas using an OOM means that a trajectory is seen as a sequence of operators, each generating the next outcome from the previous one.
In what follows we describe how a given HMM corresponds to an OOM in the sense that they generate the same observation process.
Let (X_t)_{t∈N} be a Markov chain with a finite state space S = {s_1, s_2, ..., s_m}, given by the initial distribution w_0 and the state transition probability matrix M. When the Markov chain is in state s_j at time t, it produces an observable outcome Y_t with time-invariant probability. The set of outcomes is O = {a_1, a_2, ..., a_n}. The distribution of the outcomes can be characterized by the emission probabilities P(Y_t = a | X_t = s_j), a ∈ O, 1 ≤ j ≤ m. For every a ∈ O we define a diagonal matrix O_a which contains the probabilities P(Y_t = a | X_t = s_1), ..., P(Y_t = a | X_t = s_m) on its diagonal. Figure 1 shows an example of an HMM with two hidden states and two outcomes. The hidden states are {s_1, s_2}, the outcomes are {a, b}. The initial state distribution of the Markov chain is (1/3, 2/3)^T. The state transition probabilities are indicated on the arrows between s_1 and s_2; the emission probabilities appear between the hidden states and the outcomes.
This HMM is uniquely characterized by the matrices M, O_a, O_b and the vector w_0 given in Figure 1. If we define the matrix T_a = M^T O_a, then we obtain that P(Y_0 = a) = 1 T_a w_0, where 1 = (1, ..., 1) denotes the m-dimensional row vector of ones. This equation is valid for any observation sequence in the sense that

P(Y_0 = a_{i_0}, Y_1 = a_{i_1}, ..., Y_n = a_{i_n}) = 1 T_{a_{i_n}} ··· T_{a_{i_1}} T_{a_{i_0}} w_0.

Therefore, the distribution of the process Y_t is specified by the operators T_{a_i} and the vector w_0, inasmuch as they contain the same information as the matrices M, O_a, and w_0. So an HMM can be seen as a structure (R^m, (T_a)_{a∈O}, w_0), where R^m is the domain of the operators T_a.
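This correspondence can be verified numerically. In the sketch below, the transition and emission matrices are hypothetical stand-ins for those of Figure 1 (only the initial distribution (1/3, 2/3)^T is taken from the example); the point is that 1 T_{a_n} ··· T_{a_0} w_0 reproduces the probabilities computed directly from the HMM.

```python
import numpy as np

M = np.array([[0.8, 0.2], [0.3, 0.7]])     # hypothetical transition matrix
emis = np.array([[0.9, 0.1], [0.2, 0.8]])  # emis[j, k] = P(Y_t = a_k | X_t = s_j)
w0 = np.array([1/3, 2/3])                  # initial state distribution
ones = np.ones(2)                          # the row vector 1 = (1, ..., 1)

# Observable operators T_a = M^T O_a, with O_a diagonal.
T = [M.T @ np.diag(emis[:, a]) for a in range(2)]

def oom_prob(seq):
    """P(Y_0 = a_{i_0}, ..., Y_n = a_{i_n}) = 1 T_{a_{i_n}} ... T_{a_{i_0}} w_0."""
    v = w0
    for a in seq:                          # apply operators in temporal order
        v = T[a] @ v
    return float(ones @ v)

def hmm_prob(seq):
    """The same probability computed directly from the HMM (forward recursion)."""
    alpha = w0 * emis[:, seq[0]]
    for a in seq[1:]:
        alpha = (M.T @ alpha) * emis[:, a]
    return float(alpha.sum())
```

Because 1 M^T = 1 for a stochastic matrix M, the two computations agree exactly for every sequence.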
In our example we obtain the structure (R^2, (T_a, T_b), w_0) with T_a = M^T O_a and T_b = M^T O_b built from the matrices in Figure 1. Using this idea we arrive at the definition of OOMs by imposing weaker requirements. We write τ_a instead of T_a and μ instead of M^T. A structure (R^m, (τ_a)_{a∈O}, w_0), where μ = Σ_{a∈O} τ_a, is called an observable operator model if

1. 1 w_0 = 1,
2. 1 μ = 1,
3. 1 τ_{a_{i_k}} ··· τ_{a_{i_0}} w_0 ≥ 0 for every finite sequence a_{i_0} ... a_{i_k} over O.

These conditions are less stringent than those for HMMs because negative entries are allowed in w_0 and μ, whereas in the theory of HMMs w_0 is a probability vector and μ is a stochastic matrix.
The third condition ensures that a non-negative value is obtained whenever we compute probabilities in the OOM.

Probability Clock
An obvious question is whether there exist OOMs which cannot be modelled by an HMM. The next example answers this question. Take an OOM with outcomes O = {a, b}. Let τ_φ be the linear mapping which rotates R^3 by an angle φ around the axis (1, 0, 0), where φ is not a rational multiple of 2π, and set τ_a = α τ_φ with 0 < α < 1. Let τ_b be an operator which projects every vector onto the direction of w_0. Note that every occurrence of b "resets" the process to a multiple of the initial vector w_0. Therefore, only the conditional probabilities P(Y_t = a | Y_0 = a, ..., Y_{t-1} = a) matter; as Figure 2 shows, they oscillate in t without ever becoming periodic. This process is called the probability clock. The probability clock cannot be modelled by an HMM, which shows that the class of OOMs is strictly greater than the class of HMMs. This can be proved by means of convex analysis, see Jaeger (2000b); the same analysis also provides an equivalent condition for an OOM to be expressible as an HMM.
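The probability clock can be sketched numerically. The angle φ = 1 (one radian is not a rational multiple of 2π), the damping α = 0.45, and the initial vector w_0 = (0.8, 0.2, 0)^T are illustrative choices, picked so that all computed probabilities stay in (0, 1); τ_b is scaled so that the column sums of τ_a + τ_b equal one, as the OOM definition requires.

```python
import numpy as np

phi, alpha = 1.0, 0.45
R = np.array([[1.0, 0.0, 0.0],
              [0.0, np.cos(phi), -np.sin(phi)],
              [0.0, np.sin(phi),  np.cos(phi)]])  # rotation about (1, 0, 0)
tau_a = alpha * R
w0 = np.array([0.8, 0.2, 0.0])                    # illustrative initial vector
ones = np.ones(3)

# tau_b maps every vector to a multiple of w0 ("resets" the clock); its
# scale is chosen so that 1 (tau_a + tau_b) = 1.
tau_b = np.outer(w0, ones @ (np.eye(3) - tau_a))

# Conditional probabilities P(Y_t = a | Y_0 = a, ..., Y_{t-1} = a):
w, probs = w0.copy(), []
for t in range(50):
    p = float(ones @ tau_a @ w)   # probability of yet another 'a'
    probs.append(p)
    w = tau_a @ w / p             # normalized state update
```

The values in `probs` rise and fall without ever becoming periodic, which is the behaviour plotted in Figure 2.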

Generating and Prediction with OOMs
Suppose that the distribution of the process (Y_t)_{t∈N} is specified by the OOM A = (R^m, (τ_a)_{a∈O}, w_0). The task is to generate paths of Y_t. First we have to generate Y_0. The distribution of Y_0 is known: P(Y_0 = a) = 1 τ_a w_0. At time 0 we make a random decision for the value of Y_0 according to these probabilities. Assume that an initial realization a_{i_0}, a_{i_1}, ..., a_{i_{t-1}} of the process is already known. Then we have the following expression for the conditional probabilities:

P(Y_t = b | Y_0 = a_{i_0}, ..., Y_{t-1} = a_{i_{t-1}}) = 1 τ_b w_t,    (1)

where w_t = τ_{a_{i_{t-1}}} w_{t-1} / 1 τ_{a_{i_{t-1}}} w_{t-1}. Hence w_t can be computed from w_{t-1} recursively. The prediction task is very similar to the generation task. After an initial realization a_{i_0}, a_{i_1}, ..., a_{i_{t-1}} we would like to know the probability of the occurrence of b at the next step; this probability is given by Equation (1). Similarly, the probability of collective outcomes over multiple time steps can be computed as

P(Y_t = b_1, ..., Y_{t+k} = b_{k+1} | Y_0 = a_{i_0}, ..., Y_{t-1} = a_{i_{t-1}}) = 1 τ_{b_{k+1}} ··· τ_{b_1} w_t.
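The generation loop can be sketched as follows. The operators below come from a hypothetical two-state HMM converted via T_a = M^T O_a (so they form a valid OOM), and the state update is exactly the recursion w_t = τ_{a_{i_{t-1}}} w_{t-1} / 1 τ_{a_{i_{t-1}}} w_{t-1} from the text.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical OOM obtained from a two-state HMM via T_a = M^T O_a.
M = np.array([[0.8, 0.2], [0.3, 0.7]])
emis = np.array([[0.9, 0.1], [0.2, 0.8]])
tau = [M.T @ np.diag(emis[:, a]) for a in range(2)]
w = np.array([1/3, 2/3])
ones = np.ones(2)

path = []
for t in range(1000):
    # One-step conditional distribution: P(Y_t = a | past) = 1 tau_a w_t.
    p = np.array([ones @ tau[a] @ w for a in range(2)])
    a = int(rng.choice(2, p=p))
    path.append(a)
    w = tau[a] @ w / p[a]          # recursive update of w_t
```

Note that 1 w_t = 1 is preserved by the update, so the one-step probabilities always sum to one.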

Learning OOMs from Data
In this section we follow the work of Jaeger (2000a).
Assume that a sequence S = a_0 a_1 ... a_N is given which is a path of an unknown stationary OOM A = (R^m, (τ_a)_{a∈O}, w_0). An OOM A = (R^m, (τ_a)_{a∈O}, w_0) is stationary if μ w_0 = w_0, where μ = Σ_{a∈O} τ_a. The learning task is to find an OOM Ã from S which is as close to A as possible. To this end, we first have to determine the dimension m, and then we have to estimate the observable operators τ_a and the initial vector w_0.
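The stationarity condition μ w_0 = w_0 says that w_0 is an eigenvector of μ for the eigenvalue 1, normalized so that 1 w_0 = 1. A small sketch with hypothetical HMM-derived operators (for which μ = M^T):

```python
import numpy as np

# Hypothetical operators of a two-dimensional OOM (HMM-derived).
M = np.array([[0.8, 0.2], [0.3, 0.7]])
emis = np.array([[0.9, 0.1], [0.2, 0.8]])
tau = [M.T @ np.diag(emis[:, a]) for a in range(2)]

mu = tau[0] + tau[1]                     # mu = sum over a of tau_a (= M^T here)
vals, vecs = np.linalg.eig(mu)
k = int(np.argmin(np.abs(vals - 1.0)))   # pick the eigenvalue closest to 1
w0 = np.real(vecs[:, k])
w0 = w0 / w0.sum()                       # normalize so that 1 w0 = 1
```

For these matrices the stationary vector works out to (0.6, 0.4)^T.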
Determining the correct dimension of the OOM is very important: if the dimension is too low, the data are underexploited; if it is too high, the data are overfitted and the model is too expensive from a computational point of view.

The Dimension of a Stochastic Process
In this section the dimension of a stochastic process will be introduced as a general notion, which has a close connection with the dimension of an OOM of the process.
To define the dimension of a stochastic process we need some notation. Let O* denote the set of all finite sequences over O, including the empty sequence ε. As a shorthand we write P((Y_t, ..., Y_{t+s}) = a | (Y_0, ..., Y_{t-1}) = b) = P(a | b). For every b ∈ O* we introduce the conditional probability function

g_b : O* \ {ε} → R,   g_b(a) = P(a | b).   (2)

The defining equation of the operators t_a carries over by linearity from the basis elements to all b ∈ O*. It follows from the definitions that if a = a_{i_0} a_{i_1} ... a_{i_k}, then

t_a = t_{a_{i_k}} ∘ ··· ∘ t_{a_{i_1}} ∘ t_{a_{i_0}}.   (3)

Using Equation (3) we can compute probabilities of finite sequences. These probabilities can be computed from the basis of B, as the following theorem shows.
Theorem 1 (Jaeger, 2000b) Let a ∈ O* and let t_a(g_ε) = Σ_{i=1}^n α_i g_{b_i} be the representation of t_a(g_ε) in the basis B_0 = {g_{b_1}, ..., g_{b_n}}. Then it holds that P(a) = Σ_{i=1}^n α_i. This theorem states that the distribution of the process (Y_t) is uniquely characterized by the observable operators (t_a)_{a∈O}, so the following definition makes sense.
Definition 3 Let (Y_t)_{t∈N} be a stochastic process with values in a finite set O. The structure (B, (t_a)_{a∈O}, g_ε) is called the predictor-space observable operator model of the process. The vector space dimension of B is called the dimension of the process.
Consider the elements of O*, the finite realizations of the process Y_t. Let O* = {o_1, o_2, o_3, ...} be ordered lexicographically and let D be the infinite matrix whose entries are the conditional probabilities d_ij = P(o_i | o_j). According to Definition 3, the dimension of the stochastic process Y_t is equal to the maximal number of linearly independent column vectors of the matrix D, that is, the rank of D. Hence determining the dimension of the process Y_t is equivalent to finding the rank of the matrix D.

Determination of the OOM Dimension
The following theorem shows the connection between the dimension of a process and the dimension of the ordinary OOM of this process.This will be the basic idea in the estimation procedure.
Theorem 2 (Jaeger, 2000b) a) If Y t is a process with finite dimension m, then an m-dimensional ordinary OOM of this process exists.
b) A process Y t whose distribution is described by a k-dimensional OOM has a dimension m ≤ k.
The correct dimension of the OOM is equal to the rank of the matrix D. Therefore, we first have to estimate the conditional probabilities d_ij = P(o_i | o_j) from the realization S = a_0 a_1 ... a_N, and then we have to estimate the true rank of the resulting "noisy" matrix D.
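The rank computation can be sketched on a finite block of D. Here the conditional probabilities are computed exactly from a hypothetical two-state HMM, so the numerical rank recovers the true process dimension; with probabilities estimated from data, one would instead count the singular values above a noise threshold.

```python
import numpy as np

# Hypothetical two-state HMM defining the process Y_t.
M = np.array([[0.8, 0.2], [0.3, 0.7]])
emis = np.array([[0.9, 0.1], [0.2, 0.8]])
tau = [M.T @ np.diag(emis[:, a]) for a in range(2)]
w0 = np.array([0.6, 0.4])            # stationary initial vector
ones = np.ones(2)

def apply_word(seq, v):
    for a in seq:
        v = tau[a] @ v
    return v

# All words of length 1 or 2 over O = {a_1, a_2}.
words = [(0,), (1,), (0, 0), (0, 1), (1, 0), (1, 1)]

# d_ij = P(o_i | o_j): probability of the future word o_i given the past o_j.
D = np.array([[float(ones @ apply_word(oi, apply_word(oj, w0)))
               / float(ones @ apply_word(oj, w0))
               for oj in words] for oi in words])

rank = int(np.linalg.matrix_rank(D))
```

Each column of D factors through the two-dimensional state vector τ_{o_j} w_0, so the rank of any such block is at most 2, and for these matrices it is exactly 2.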
Choosing the correct model dimension of the OOM is very difficult. This is a very hard task in HMM theory as well, and it was solved only recently. In real life the task is usually not to learn models with the true process dimension, because empirical processes generated by complex physical systems are quite likely very high-dimensional (even infinite-dimensional). However, one can hope that only a few of the true dimensions are responsible for most of the stochastic phenomena that appear in the data. Given finite training data, one therefore aims at a low-dimensional model that captures these dominant dimensions.

The Learning Algorithm
Assume that a sequence S = a_0 a_1 ... a_N is given which is a path of an unknown stationary process Y_t. Furthermore, assume that the dimension of Y_t is known to be m and that the characteristic events A_1, ..., A_m have already been selected. We would like to construct the interpretable OOM of the process.
In the first step, we estimate w_0. Proposition 2 states that w_0 = (P(A_1), ..., P(A_m))^T. Therefore, a natural estimate of w_0 is ŵ_0 = (P̂(A_1), ..., P̂(A_m))^T, where P̂(A_i) is the relative frequency in S of the sequences making up A_i, and k is the common length of the events A_i.
In the second step we estimate the operators τ_a. According to Proposition 2, for any sequence b_j it holds that

τ_a (τ_{b_j} w_0) = (P(b_j a A_1), ..., P(b_j a A_m))^T.   (4)

An m-dimensional linear operator is uniquely determined by the values it takes on m linearly independent vectors, and this fact leads to an estimate of τ_a via (4). We estimate m linearly independent vectors v_j = τ_{b_j} w_0 = (P(b_j A_1), ..., P(b_j A_m))^T by replacing each probability P(b_j A_i) with its relative frequency P̂(b_j A_i) in S, where l is the length of b_j. We estimate the image vectors τ_a v_j = τ_a(τ_{b_j} w_0) analogously, using the relative frequencies P̂(b_j a A_i). Thus we obtain estimates (ṽ_j, (τ_a v_j)~) of the m argument-value pairs (v_j, τ_a v_j). Finally, we can estimate τ_a using the following elementary fact: if we collect the vectors ṽ_j as columns in a matrix Ṽ and the vectors (τ_a v_j)~ as columns in a matrix W̃_a, then τ̃_a = W̃_a Ṽ^{-1}. Instead of single sequences b_j we can take collective events B_j (1 ≤ j ≤ m) of some common length l to construct Ṽ_ij = P̂(B_j A_i) and (W̃_a)_ij = P̂(B_j a A_i). We call the B_j indicative events. The learning algorithm is then the following:

1st step: Ṽ_ij = P̂(B_j A_i),
2nd step: (W̃_a)_ij = P̂(B_j a A_i),
3rd step: τ̃_a = W̃_a Ṽ^{-1}.

As for the selection of characteristic events, we first remark that given infinite training data we may choose indicative and characteristic events virtually arbitrarily and still, almost surely, arrive at a perfect model estimate. In real life, however, we deal with finite training data. In this case the particular choice of indicative and characteristic events matters greatly, as it determines the quality of the estimated model: although the learning algorithm is asymptotically correct, the speed of convergence and the attainable model quality for finite training data depend on this choice.
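The three steps can be sketched end to end. The generating HMM below is hypothetical; the characteristic and indicative events are the single symbols, A_i = B_i = {a_i} (so k = l = 1), and all probabilities are sliding-window relative frequencies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state HMM used only to generate training data.
M = np.array([[0.8, 0.2], [0.3, 0.7]])       # transitions
emis = np.array([[0.9, 0.1], [0.2, 0.8]])    # emissions
true_tau = [M.T @ np.diag(emis[:, a]) for a in range(2)]
true_w0 = np.array([0.6, 0.4])               # stationary initial vector
ones = np.ones(2)

# Generate a path S = a_0 a_1 ... a_N of the stationary process.
N = 200_000
obs = np.empty(N, dtype=int)
u_emit, u_step = rng.random(N), rng.random(N)
s = 0 if rng.random() < true_w0[0] else 1
for t in range(N):
    obs[t] = 0 if u_emit[t] < emis[s, 0] else 1
    s = 0 if u_step[t] < M[s, 0] else 1

# Steps 1 and 2: relative frequencies V_ij = P^(B_j A_i), (W_a)_ij = P^(B_j a A_i),
# with characteristic events A_i = {a_i} and indicative events B_j = {a_j}.
m = 2
V = np.zeros((m, m))
W = np.zeros((m, m, m))
np.add.at(V, (obs[1:N-1], obs[0:N-2]), 1.0)
np.add.at(W, (obs[1:N-1], obs[2:N], obs[0:N-2]), 1.0)
V /= N - 2
W /= N - 2

# Step 3: tau~_a = W~_a V~^{-1}; w~_0 estimated from the symbol frequencies.
tau_hat = [W[a] @ np.linalg.inv(V) for a in range(m)]
w0_hat = np.array([float(np.mean(obs == a)) for a in range(m)])

def seq_prob(taus, w, seq):
    """P(a_0 ... a_k) = 1 tau_{a_k} ... tau_{a_0} w."""
    for a in seq:
        w = taus[a] @ w
    return float(ones @ w)
```

For this model the learned operators reproduce the true sequence probabilities up to sampling error, which can be checked by comparing `seq_prob` under the learned and the generating model.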
In the current state of OOM theory one cannot provide a method for an optimal choice, but there are some helpful rules: • The characteristic and indicative events should occur in the data with roughly equal frequencies. This minimizes the average relative error in Ṽ and W̃_a.
• The inversion of the matrix Ṽ should be as insensitive as possible to errors in the matrix entries. So we have to be careful with the choice of R and U in the singular value decomposition V = RSU^T.
• The sequences contained in a characteristic event A_i should have high mutual correlation, and members of different characteristic events should have low mutual correlation: if a_1, a_2 ∈ A_i and a_1 is likely to occur next, then so should be a_2.
• Indicative events should exhaust O^l for some l, i.e. B_1 ∪ ... ∪ B_m = O^l. This guarantees that as we move the inspection window over S, at every position we get at least one count for Ṽ, so we exploit S best.
More selection criteria and explicit constructions can be found in the following papers. Jaeger (2000a) deals with the connection between unbiased estimation and the selection of indicative and characteristic events; moreover, it provides a sufficient condition on the characteristic events under which the OOM estimated from data is interpretable. In Kretzschmar (2003) the characteristic events are chosen so as to minimize the model variance arising from the pseudo-inverse operation Ṽ^{-1} in the 3rd step of the learning procedure. Jaeger, Zhao, and Kolling (2006) provide a method to minimize the model variance by using reverse OOMs; in contrast to Kretzschmar (2003), they minimize the variance of Ṽ itself.

Conclusions
In this paper we presented an introduction to the theory of observable operator models (OOMs). OOMs arose as a generalization of HMMs and form a deep connection between linear algebra and stochastic processes. Using the tools of linear algebra, a very simple and efficient learning algorithm can be developed for OOMs, which appears to be better than the known algorithms for HMMs. It turned out that the class of HMMs is a strict subset of the class of OOMs; as an example we mentioned the probability clock, which is an OOM but cannot be modelled by an HMM.
In the second part of the paper the learning algorithm for OOMs was presented in detail. First, we have to determine the dimension of the OOM, and second, we have to estimate the observable operators. The learning method appears simple, but some problems remain only partially solved, for example how the characteristic and indicative events should be chosen so that we get a computationally simple algorithm that converges fast enough to the "real" OOM in the background.

Figure 1: An example of an HMM with transition and emission probabilities.

Figure 2: The horizontal axis represents t, the vertical axis the conditional probabilities P(Y_t = a | Y_0 = a, Y_1 = a, ..., Y_{t-1} = a) of the probability clock.
Here g_ε(a) = P(a | ε) = P(a). Let B denote the linear subspace spanned by the set {g_b | b ∈ O*} in the linear space of functions f : O* \ {ε} → R. Choose O*_0 ⊆ O* so that B_0 = {g_b | b ∈ O*_0} is a basis of B. The next step is to define a linear function for every a ∈ O: t_a(g_b) = P(a | b) · g_{ba}.