Efficient HMM with Feature Vectors for Improved Sequence Analysis

In summary: in this problem, the "states" could be defined by student, week, and cluster. After understanding what a "state" is, you can start to think about how to define a "transition": a transition is simply a change in the state of a Markov process, such as a student moving from one cluster to another. Once you have definitions for a "state" and a "transition", it is easy to start thinking about how to define a "model". A model in this context is simply a set of equations that describe how the state of a Markov process changes over time.
  • #1
Srecko
Hi all,

Not sure if this is the right place for this question, but it has been bothering me for some time and I would really appreciate any kind of help. I am trying to fit an HMM, but here every observation in the sequence has a feature vector: a probability distribution describing how likely that observation is to appear in each of several groups.

I have seen this approach http://www.cs.jhu.edu/~mpaul/files/emnlp2012-m4.pdf that basically uses a log-linear function to translate a vector of continuous values into a discrete value. However, I'm not sure whether I understood it correctly, nor how to apply it. My idea was to apply this (or a similar) algorithm to my data and then use an existing HMM implementation (probably in R) to fit a model.
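As a much cruder alternative to the log-linear approach in that paper, one baseline is to collapse each probability vector to a single discrete symbol by taking the most probable group (argmax), so that any off-the-shelf discrete-emission HMM can be used. This throws away information, but it is a quick sanity check. A minimal sketch (the example vectors are made up):

```python
def discretize(prob_vectors):
    """Map each probability vector to the index of its largest entry."""
    return [max(range(len(v)), key=lambda i: v[i]) for v in prob_vectors]

observations = [
    (0.2, 0.3, 0.5, 0.0),
    (0.8, 0.0, 0.0, 0.2),
    (0.1, 0.1, 0.1, 0.7),
]
symbols = discretize(observations)  # one discrete "group" per observation
```

The resulting symbol sequences could then be fed to a standard discrete HMM fitting routine.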

Thanks for any ideas, suggestions...
 
  • #3
My expertise is in probability theory (not statistics). I was unable to respond because your terminology was too obscure. (HMM?)
 
  • #4
Hidden Markov model.
I don't even understand what Srecko wants to do and where the problem is.
 
  • #5
Thanks mathman and mfb,

Really sorry for the confusion. Let me try to provide a better explanation. A sample from my dataset would look like this:

Code:
sequence parent_i week    student cluster1 cluster2 cluster3 cluster4 cluster5 cluster6 cluster7
       0       0 w2          1111   0.032   0.007   0.056   0.027   0.056   0.072       0
       1      -1 w1          2222   0.073   0.114   0.089   0.004   0.044   0.165       0
       2       0 w6          3333   0.003   0.038       0  0.0004  0.0002   0.138   0.045
       3       0 w2          4444       0       0       0   0.004       0       0       0
       4      -1 w0          5555   0.212   0.016   0.011   0.011   0.004   0.071  0.0006
       5      -1 w1          5555   0.013  0.0004   0.016       0       0       0       0
       6    1329 w2          5555   0.008   0.011   0.099    0.04   0.096   0.158       0
       7    1442 w3          5555   0.098       0  0.0006       0    0.08   0.021   0.136
       8    1829 w4          5555   0.129   0.175   0.147   0.003   0.007    0.06   0.025

That is, instead of having a discrete value for each observation, I have a vector of continuous values that represents the extent to which each student belongs to each cluster (community). What I am trying to do is fit a hidden Markov model in order to examine the most common transitions. An easier scenario would be if I had:

student, week, cluster - where cluster is a value between 1 and 10 (for example).

Please let me know if this is still unclear.
 
  • #6
Srecko said:
Please let me know if this is still unclear.

I suggest you begin by relying on the mathematical definition of a Markov process. A Markov process has "states". What defines a "state" in your problem?

The "states" in a Markov process are generally given by a vector of variables. A variable can simply denote a discrete category (like x = 1 for "cheerful", x = 2 for "sad"). A variable can denote a continuous quantity (like t = temperature). If the "state" of a Markov process is to be defined by a particular function then, in practice, you would define the function by a finite number of parameters (like A, B, C for the function Ax^2 + Bx + C).

If "students" is to be a state variable, you need to explain whether the state of the process is defined by information about one student or whether it indicates information about a set of several students.

A Markov process has "transition rules" that give the probability that one state transitions to another state as the process moves forward one time step (in the case of a discrete time process). If you have a finite number of states, the transition rules can be specified by a matrix. Do you have a finite number of states? If a state is specified by a value of a continuous variable (like temperature) or by a function, then you need a way to specify the transition rules as functions.
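The finite-state case above can be sketched concretely: with N states, the transition rules form an N x N row-stochastic matrix, and one time step maps a distribution over states to a new distribution. The matrix entries here are illustrative:

```python
def step(dist, transition):
    """Propagate a probability distribution one time step: p' = p @ P."""
    n = len(dist)
    return [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]

P = [
    [0.9, 0.1, 0.0],   # transition probabilities out of state 0
    [0.2, 0.6, 0.2],   # out of state 1
    [0.0, 0.3, 0.7],   # out of state 2
]
p0 = [1.0, 0.0, 0.0]   # start in state 0 with certainty
p1 = step(p0, P)       # distribution over states after one step
```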

Data from one time series of a Markov process can be called data about a "trajectory" of the process. Does your data consist of one trajectory? Or do you have data from several different trajectories?
 
  • #7
Thank you for the great explanation, Stephen.

I agree that the state is not clearly defined here and that I'm trying to apply HMMs in a situation that is not very well defined. Let's make several things clear. I have trajectories for several students. When I specified the model, I basically defined the length of each trajectory. On the other hand, let's say for the moment that each observation is described by a single discrete value (from 1 to 6, indicating different clusters of students) and forget the vectors. We would then have student 1 and his sequence 1, 3, 5, 4; student 2 with 3, 2, 1, 5; etc. After fitting the model, each state would be defined by one or more clusters. If there are 3 states, I might find that state 1 is described by clusters 2 and 3, state 2 by cluster 5, and state 3 by clusters 1 and 4. Given that, I should be able to say that during the course, students tend to transition from clusters 2 and 3 to cluster 5 (for example). Of course, I would have to identify and "describe" those clusters in order to interpret such results.
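For the discrete "easier scenario" described above, the most common transitions can be examined directly by counting cluster-to-cluster moves across all student trajectories (a maximum-likelihood estimate of a plain Markov chain, with no hidden states yet). A sketch with made-up sequences:

```python
from collections import Counter

def count_transitions(trajectories):
    """Count (from_cluster, to_cluster) pairs over all trajectories."""
    counts = Counter()
    for seq in trajectories:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts

trajectories = [
    [1, 3, 5, 4],   # student 1
    [3, 2, 1, 5],   # student 2
    [1, 3, 5, 5],   # student 3
]
counts = count_transitions(trajectories)
most_common = counts.most_common(2)  # the most frequent transitions
```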

Now, let's get back to vectors. I don't have discrete values (1-6), but instead of that each emission is described by a vector of probabilities. The approach I shared earlier (http://www.cs.jhu.edu/~mpaul/files/emnlp2012-m4.pdf) does something like that. Instead of students, they deal with words and topics instead of clusters. Eventually, we are able to say words transition from this topic to another topic.

Finally, regarding transitions: if I understood correctly, I don't have to define the transitions; I guess those should be inferred by the model? At least, when fitting a model using depmix (depmixS4), it's usually enough to define the number of states...
 
  • #8
You're using the terms "cluster" and "emission" without explaining what they mean and without explaining their relation to the definition of "state".

I suppose would-be advisors must look at the documentation for depmix: http://raptor1.bizlab.mtsu.edu/S-Drive/TEFF/Rlib/library/depmix/doc/depmix-intro.pdf
 
  • #9
You are right, Stephen.

Each state should be "described" by one or more "clusters". Those probabilities are actually obtained using Mixed Membership Stochastic Blockmodels (MMSB), applied to the graph of learners. I didn't want to go into all the details since I thought that what I said before would be enough. Obviously I was wrong. So, everything begins with a graph. Once I had a graph, I applied MMSB to identify communities of learners (clusters/groups). MMSB is a "soft clustering" method, and as output it gives the probability that a node (i.e., a student) belongs to each of the identified communities. Those probabilities describe each of the observations.

Finally, each state in HMMs would be (ideally) described with one or more cluster(s) (i.e., communities obtained using MMSB).

I did check the depmix documentation, but I wasn't able to find implementation that would support fitting HMM with vector (instead of single discrete/continuous value).
 
  • #10
Srecko said:
So, everything begins with a graph.

In mathematics, there are "graphs" that are studied in Graph Theory. There are also graphs of functions. In business presentations, there are graphs in the sense of bar graphs and pie charts. If you are talking about the kind of graph that is studied in Graph Theory then such a graph has "vertices" and "edges". The vertices can have "labels". Perhaps the "labels" define "states"?

It is ambiguous to mention "probabilities" without defining them. A probability is the probability of some event happening. If you just call it a "probability", it isn't clear what event is associated with it.

It's hard to understand the problem from your writing, especially for people who aren't familiar with the software you mention. If you don't yet have a clear idea of the problem as a mathematical problem, try describing the problem in practical terms. Don't describe it merely by saying you are applying certain software.
 
  • #11
Hi Stephen,

I am familiar with graph theory (at least more than with HMMs), and the only reason I posted here was to ask whether there is an HMM implementation that allows describing each observation in a sequence/trajectory with a feature vector, instead of a single (discrete or continuous) value. From a practical perspective (at least as I see it), I am not sure that understanding what the states are (or any other concept) is necessary.

I really appreciate your help, and really would like to discuss all the implementation details and why I did what I did. Currently, I am only concerned with HMM part. If I understood correctly, standard HMM implementation assumes that each observation in a trajectory can be described with a single value (e.g., Rain - 0.8, Sun - 0.3, Snow - 0.4). However, each observation in my case is described with a feature vector (e.g., Rain - (0.2, 0.3, 0.5), Sun - (0.4, 0.6, 0.000)).
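I'm not certain how depmixS4 handles this, but one standard way HMM implementations support vector-valued observations is to attach a multivariate Gaussian emission density to each hidden state, so an entire membership vector can be scored under each state. A minimal sketch with a diagonal covariance (all means and variances below are made up):

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log-density of vector x under an independent (diagonal) Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

# Two hypothetical hidden states, each with a mean membership profile.
state_means = [(0.7, 0.1, 0.2), (0.1, 0.8, 0.1)]
state_vars = [(0.05, 0.05, 0.05), (0.05, 0.05, 0.05)]

obs = (0.6, 0.2, 0.2)  # one observation's membership vector
scores = [diag_gaussian_logpdf(obs, m, v)
          for m, v in zip(state_means, state_vars)]
best_state = max(range(len(scores)), key=lambda k: scores[k])
```

In a full HMM these per-state emission scores would be combined with the transition probabilities by the forward-backward or Viterbi algorithm.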

I thought that would be enough and didn't want to overwhelm everyone here with all the details. However, if you think it necessary, I'm happy to share them all.
 
  • #12
Srecko said:
If I understood correctly, standard HMM implementation assumes that each observation in a trajectory can be described with a single value (e.g., Rain - 0.8, Sun - 0.3, Snow - 0.4).
Let's straighten that out first! My idea of trajectory data for weather would be like this:
day 1 rain
day 2 sun
day 3 sun
day 4 rain
day 5 snow
...etc.

Your idea of trajectory data in that example has the numbers 0.8, 0.3, 0.4 in it. What do those numbers mean?
 
  • #13
That example would work as well. The difference between the problem you described and the one I'm trying to solve is the following: I don't have "rain", "sun", "snow", nor do I have "1", "2", "3". Instead of those discrete values, I have

day 1 (0.2, 0.3, 0.5, 0.0)
day 2 (0.8, 0.0, 0.0, 0.2)
...
 
  • #14
Srecko said:
I have

day 1 (0.2, 0.3, 0.5, 0.0)
day 2 (0.8, 0.0, 0.0, 0.2)
...

OK, that makes things clearer. Since you mentioned students, is the model supposed to apply to each student? Or are there state variables that make the model behave differently for different students? For example, do things depend on a student's age, weight, etc.?

I gather you have a continuum of states rather than a finite number of states. Is the purpose of "clusters" to create a model that uses only a finite number of states?

For example, if the data has the format [day, (A,B,C,D)], then do we want to define states as certain ranges of values of A, B, C, D? For example, State 1 might be the condition "A^2 + B^2 < 3 and 0.02 < B < C < 2.1".
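The idea of states as regions of the continuous (A, B, C, D) space can be sketched as a classification function that maps every observation into a discrete state. The conditions below are invented purely for illustration:

```python
def classify(a, b, c, d):
    """Assign an observation (a, b, c, d) to a discrete state by region."""
    if a > 0.5:          # State 1: activity dominated by cluster A
        return 1
    if b + c > 0.5:      # State 2: activity split between clusters B and C
        return 2
    return 3             # State 3: everything else

state = classify(0.8, 0.0, 0.0, 0.2)  # falls in State 1
```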
 
  • #15
Each student has his/her own trajectory, but the model doesn't depend on a student's age or other "properties". Also, it should be a finite number of states. I didn't think about conditions, but each state should be defined by certain values of A, B, C, and D. For example, State 1 is described with C, State 2 with A and D, while State 3 is defined with B.

Thinking of conditions, that does seem to make sense.
 
  • #16
Srecko said:
For example, State 1 is described with C,

Is State 1 actually a single state? Or is it a different state for each different value of C ?
 
  • #17
It is a single state. It's just that we can say that cluster 1 represents this state. So, when we say (for example) that students most commonly transition from states 1 and 3 to state 2, we should be able to say that students usually transition from C and B to A and D (following the previous example).
 
  • #18
Srecko said:
It is a single state.

Then what does the numerical value associated with C represent?
 
  • #19
The numerical value simply indicates to what extent a student "belongs" to group C. That is, observing his interactions with other students, how much of his activity is associated with other students in group C.

Here we are talking about clustering again. There are (at least) two ways to cluster students (or detect communities) in the network of students. The first approach, "hard" clustering, would tell us that student1 belongs to group/cluster/community C (depending on what we are trying to observe). "Soft" clustering, on the other hand, will tell us that student1 belongs to group C 22%, group A 60%, etc.

Finally, when we say that state 1 is described by cluster C, we need to define cluster C. How are we going to do that? We define a threshold and select all the students in that cluster whose value of "belonging" is above the threshold. Those students (with their properties, e.g., grade, GPA, number of posts) "describe" the cluster.
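The thresholding step just described can be sketched directly: pick the students whose degree of belonging to a given cluster exceeds a cutoff, so their properties can be used to "describe" that cluster. The student names, memberships, and the 0.5 threshold below are all illustrative:

```python
memberships = {
    "student1": {"A": 0.60, "B": 0.10, "C": 0.22, "D": 0.08},
    "student2": {"A": 0.05, "B": 0.15, "C": 0.70, "D": 0.10},
    "student3": {"A": 0.10, "B": 0.20, "C": 0.55, "D": 0.15},
}

def cluster_members(memberships, cluster, threshold=0.5):
    """Students whose degree of belonging to `cluster` exceeds the cutoff."""
    return sorted(s for s, m in memberships.items()
                  if m[cluster] > threshold)

members_of_C = cluster_members(memberships, "C")
```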
 
  • #20
Srecko said:
The numerical value simply indicates to what extent a student "belongs" to group C. That is, observing his interactions with other students, how much of his activity is associated with other students in group C.

I understand. So, as time passes, a student's "degree of belonging" to cluster C can change - correct?

I would not call C a "state" of the particular Markov process that produces your example data. I'd call C a "state variable". A set of definite values of the state variables (like (0.5, 0.0, 0.2, 0.3) ) is a "state".

You mention an "emission" and the link you gave deals with the "emission of tokens". Are "emissions" relevant to your data? For example, if the data for student number 6 begins: [day 1, (0.5, 0.0, 0.2, 0.3)], [day 2, (0.3, 0.1, 0.5, 0.5)], ... is there some sort of "emission" produced by student 6 as he goes from day 1 to day 2?
 
  • #21
Stephen Tashi said:
I understand. So, as time passes, a student's "degree of belonging" to cluster C can change - correct?
Yes, the degree can (and usually does) change from week to week.

It could be that C and D describe a state, so it's not C only. I agree it should be called "state variable".
I don't see any emission between two states. It is important to consider the previous state when inferring the current state, but I guess that is the point of HMMs.
 
  • #22
The terminology for a model of your data is a "continuous-state, discrete-time Markov process". If you choose to add "hidden" states to the model, then you apply the adjective "hidden" to "Markov".

I'm not familiar with fitting such models to data, but I think we can look up information on that topic.

One reference I found (http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA415930 is a page with the link to the pdf) makes the interesting claim that the Kalman filter technique is an example of such a model.

The idea of an equivalence between Kalman filters and HMMs will come as no surprise to many readers, for the analogous nature of these two areas has been noted in the literature for many years. The specifics of the equivalence may be more surprising, however. The Kalman filter is not just analogous to an HMM. The Kalman-filter model is an HMM with linear Gaussian model densities. The Baum algorithm for this HMM is a Kalman smoother, as is the Viterbi algorithm.

On the one hand, it's easy to find information on the Kalman filter. On the other hand, Kalman filter questions on this forum often go unanswered. The Kalman filter is better known in engineering than in theoretical mathematics.
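To make the quoted claim concrete: in a Kalman filter, the hidden state evolves linearly with Gaussian noise, and each observation is the state plus Gaussian noise, which is exactly an HMM with linear Gaussian densities. A minimal scalar (random-walk) predict-and-update cycle, with illustrative noise values:

```python
def kalman_step(mean, var, z, process_var=0.1, obs_var=0.5):
    """One predict + update cycle for a scalar random-walk model."""
    # Predict: the state carries over; uncertainty grows by process noise.
    mean_pred, var_pred = mean, var + process_var
    # Update: blend the prediction with observation z via the Kalman gain.
    gain = var_pred / (var_pred + obs_var)
    mean_new = mean_pred + gain * (z - mean_pred)
    var_new = (1 - gain) * var_pred
    return mean_new, var_new

mean, var = 0.0, 1.0               # diffuse initial belief about the state
for z in [0.9, 1.1, 1.0]:          # noisy observations of the hidden state
    mean, var = kalman_step(mean, var, z)
```

After a few observations near 1.0, the posterior mean moves toward 1.0 and the posterior variance shrinks, which is the filtering counterpart of the forward pass in an HMM.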

Do you have an advisor for the work you are doing? Has the advisor dropped any hints about relevant articles? Offhand, I don't see how the article you linked applies to your problem; plus, the problem studied in that article is unfamiliar to me. I don't understand what real-world phenomena are being modeled in that article.
 
  • #23
Thank you for the really amazing help, Stephen.

Appreciate all your comments. I will read the document you shared and try to see how a Kalman filter could fit here. It seems like a good approach, but so did the one using Block HMMs... so I guess I will have to do some more reading.
 

