Efficient HMM with Feature Vectors for Improved Sequence Analysis

In summary: in this problem, the "states" could be defined by student, week, and cluster. After understanding what a "state" is, you can start to think about how to define a "transition": a transition is simply a change in the state of a Markov process, such as a student moving from one cluster to another. Once you have definitions for a "state" and a "transition", it is easy to start thinking about how to define a "model". A model in this context is simply a set of equations that describe how the state of a Markov process changes over time.
  • #1
Srecko
Hi all,

Not sure if this is the right place for this question, but it has been bothering me for some time and I would really appreciate any kind of help. I am trying to fit an HMM, but here every observation in the sequence has a feature vector: a probability distribution describing how likely that observation is to appear in each of several groups.

I have seen this approach http://www.cs.jhu.edu/~mpaul/files/emnlp2012-m4.pdf that basically uses a log-linear function to translate a vector of continuous values into a discrete value. However, I'm not sure whether I understood it correctly, nor how to apply it. My idea was to apply this (or a similar) algorithm to my data and then use an existing HMM implementation (probably in R) to fit a model.
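As a much cruder alternative to the log-linear approach in that paper, one baseline is to collapse each probability vector to a single discrete symbol by taking the most probable group (argmax), so that any off-the-shelf discrete-emission HMM can be used. This throws away information, but it is a quick sanity check. A minimal sketch (the example vectors are made up):

```python
def discretize(prob_vectors):
    """Map each probability vector to the index of its largest entry."""
    return [max(range(len(v)), key=lambda i: v[i]) for v in prob_vectors]

observations = [
    (0.2, 0.3, 0.5, 0.0),
    (0.8, 0.0, 0.0, 0.2),
    (0.1, 0.1, 0.1, 0.7),
]
symbols = discretize(observations)  # one discrete "group" per observation
```

The resulting symbol sequences could then be fed to a standard discrete HMM fitting routine.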

Thanks for any ideas, suggestions...
 
  • #3
My expertise is in probability theory (not statistics). I was unable to respond because your terminology was too obscure. (HMM?)
 
  • #4
Hidden Markov model.
I don't even understand what Srecko wants to do and where the problem is.
 
  • #5
Thanks mathman and mfb,

Really sorry for the confusion. Let me try to provide a better explanation. A sample from my dataset would look like this:

Code:
sequence parent_i week    student cluster1 cluster2 cluster3 cluster4 cluster5 cluster6 cluster7
       0       0 w2          1111   0.032   0.007   0.056   0.027   0.056   0.072       0
       1      -1 w1          2222   0.073   0.114   0.089   0.004   0.044   0.165       0
       2       0 w6          3333   0.003   0.038       0  0.0004  0.0002   0.138   0.045
       3       0 w2          4444       0       0       0   0.004       0       0       0
       4      -1 w0          5555   0.212   0.016   0.011   0.011   0.004   0.071  0.0006
       5      -1 w1          5555   0.013  0.0004   0.016       0       0       0       0
       6    1329 w2          5555   0.008   0.011   0.099    0.04   0.096   0.158       0
       7    1442 w3          5555   0.098       0  0.0006       0    0.08   0.021   0.136
       8    1829 w4          5555   0.129   0.175   0.147   0.003   0.007    0.06   0.025

That is, instead of having a discrete value for each observation, I have a vector of continuous values that represents the extent to which each student belongs to each cluster (community). What I am trying to do is fit a hidden Markov model in order to examine the most common transitions. An easier scenario would be if I had:

student, week, cluster - where cluster is a value between 1 and 10 (for example).

Please let me know if this is still unclear.
 
  • #6
Srecko said:
Please let me know if this is still unclear.

I suggest you begin by relying on the mathematical definition of a Markov process. A Markov process has "states". What defines a "state" in your problem?

The "states" in a Markov process are generally given by a vector of variables. A variable can simply denote a discrete category (like x = 1 for "cheerful", x = 2 for "sad"). A variable can denote a continuous quantity (like t = temperature). If the "state" of a Markov process is to be defined by a particular function then, in practice, you would define the function by a finite number of parameters (like A, B, C for the function Ax^2 + Bx + C).

If "students" is to be a state variable, you need to explain whether the state of the process is defined by information about one student or whether it indicates information about a set of several students.

A Markov process has "transition rules" that give the probability that one state transitions to another state as the process moves forward one time step (in the case of a discrete time process). If you have a finite number of states, the transition rules can be specified by a matrix. Do you have a finite number of states? If a state is specified by a value of a continuous variable (like temperature) or by a function, then you need a way to specify the transition rules as functions.
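The finite-state case above can be sketched concretely: with N states, the transition rules form an N x N row-stochastic matrix, and one time step maps a distribution over states to a new distribution. The matrix entries here are illustrative:

```python
def step(dist, transition):
    """Propagate a probability distribution one time step: p' = p @ P."""
    n = len(dist)
    return [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]

P = [
    [0.9, 0.1, 0.0],   # transition probabilities out of state 0
    [0.2, 0.6, 0.2],   # out of state 1
    [0.0, 0.3, 0.7],   # out of state 2
]
p0 = [1.0, 0.0, 0.0]   # start in state 0 with certainty
p1 = step(p0, P)       # distribution over states after one step
```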

Data from one time series of a Markov process can be called data about a "trajectory" of the process. Does your data consist of one trajectory? Or do you have data from several different trajectories?
 
  • #7
Thank you for the great explanation, Stephen.

I agree that the state is not clearly defined here and that I'm trying to apply HMMs in a situation that is not very well defined. Let's make several things clear. I have trajectories for several students. When I specified the model, I basically defined the length of each trajectory. On the other hand, let's say for the moment that each observation is described by a single discrete value (from 1 to 6, indicating different clusters of students) and forget the vectors. We would then have student 1 and his sequence 1, 3, 5, 4; student 2 with 3, 2, 1, 5; etc. After fitting the model, each state would be defined by one or more clusters. If there are 3 states, I might find that state 1 is described by clusters 2 and 3, state 2 by cluster 5, and state 3 by clusters 1 and 4. Given that, I should be able to say that during the course, students tend to transition from clusters 2 and 3 to cluster 5 (for example). Of course, I would have to identify and "describe" those clusters in order to interpret such results.
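For the discrete "easier scenario" described above, the most common transitions can be examined directly by counting cluster-to-cluster moves across all student trajectories (a maximum-likelihood estimate of a plain Markov chain, with no hidden states yet). A sketch with made-up sequences:

```python
from collections import Counter

def count_transitions(trajectories):
    """Count (from_cluster, to_cluster) pairs over all trajectories."""
    counts = Counter()
    for seq in trajectories:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts

trajectories = [
    [1, 3, 5, 4],   # student 1
    [3, 2, 1, 5],   # student 2
    [1, 3, 5, 5],   # student 3
]
counts = count_transitions(trajectories)
most_common = counts.most_common(2)  # the most frequent transitions
```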

Now, let's get back to vectors. I don't have discrete values (1-6), but instead of that each emission is described by a vector of probabilities. The approach I shared earlier (http://www.cs.jhu.edu/~mpaul/files/emnlp2012-m4.pdf) does something like that. Instead of students, they deal with words and topics instead of clusters. Eventually, we are able to say words transition from this topic to another topic.

Finally, regarding transitions: if I understood correctly, I don't have to define the transitions; I guess those should be inferred by the model? At least, when fitting a model using depmix (depmixS4), it's usually enough to define the number of states...
 
  • #8
You're using the terms "cluster" and "emission" without explaining what they mean and without explaining their relation to the definition of "state".

I suppose would-be advisors must look at the documentation for depmix: http://raptor1.bizlab.mtsu.edu/S-Drive/TEFF/Rlib/library/depmix/doc/depmix-intro.pdf
 
  • #9
You are right, Stephen.

Each state should be "described" by one or more "clusters". Those probabilities are actually obtained using Mixed Membership Stochastic Blockmodels (MMSB), applied to the graph of learners. I didn't want to go into all the details since I thought that what I said before would be enough. Obviously I was wrong. So, everything begins with a graph. Once I had a graph, I applied MMSB to identify communities of learners (clusters/groups). MMSB is a "soft clustering" method, and as output it gives the probability that a node (i.e., a student) belongs to each of the identified communities. Those probabilities describe each of the observations.

Finally, each state in HMMs would be (ideally) described with one or more cluster(s) (i.e., communities obtained using MMSB).

I did check the depmix documentation, but I wasn't able to find implementation that would support fitting HMM with vector (instead of single discrete/continuous value).
 
  • #10
Srecko said:
So, everything begins with a graph.

In mathematics, there are "graphs" that are studied in Graph Theory. There are also graphs of functions. In business presentations, there are graphs in the sense of bar graphs and pie charts. If you are talking about the kind of graph that is studied in Graph Theory then such a graph has "vertices" and "edges". The vertices can have "labels". Perhaps the "labels" define "states"?

It is ambiguous to mention "probabilities" without defining them. A probability is the probability of some event happening. If you just call it a "probability", it isn't clear what event is associated with it.

It's hard to understand the problem from your writing, especially for people who aren't familiar with the software you mention. If you don't yet have a clear idea of the problem as a mathematical problem, try describing the problem in practical terms. Don't describe it merely by saying you are applying certain software.
 
  • #11
Hi Stephen,

I am familiar with graph theory (at least more than with HMMs), and the only reason I posted here was to ask whether there is an HMM implementation that allows describing each observation in a sequence/trajectory with a feature vector, instead of a single (discrete or continuous) value. From a practical perspective (at least as I see it), I am not sure that understanding what the states are (or any other concept) is necessary.

I really appreciate your help, and really would like to discuss all the implementation details and why I did what I did. Currently, I am only concerned with HMM part. If I understood correctly, standard HMM implementation assumes that each observation in a trajectory can be described with a single value (e.g., Rain - 0.8, Sun - 0.3, Snow - 0.4). However, each observation in my case is described with a feature vector (e.g., Rain - (0.2, 0.3, 0.5), Sun - (0.4, 0.6, 0.000)).
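I'm not certain how depmixS4 handles this, but one standard way HMM implementations support vector-valued observations is to attach a multivariate Gaussian emission density to each hidden state, so an entire membership vector can be scored under each state. A minimal sketch with a diagonal covariance (all means and variances below are made up):

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log-density of vector x under an independent (diagonal) Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

# Two hypothetical hidden states, each with a mean membership profile.
state_means = [(0.7, 0.1, 0.2), (0.1, 0.8, 0.1)]
state_vars = [(0.05, 0.05, 0.05), (0.05, 0.05, 0.05)]

obs = (0.6, 0.2, 0.2)  # one observation's membership vector
scores = [diag_gaussian_logpdf(obs, m, v)
          for m, v in zip(state_means, state_vars)]
best_state = max(range(len(scores)), key=lambda k: scores[k])
```

In a full HMM these per-state emission scores would be combined with the transition probabilities by the forward-backward or Viterbi algorithm.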

I thought that would be enough and didn't want to overwhelm everyone here with all the details. However, if you think it necessary, I'm happy to share them all.
 
  • #12
Srecko said:
If I understood correctly, standard HMM implementation assumes that each observation in a trajectory can be described with a single value (e.g., Rain - 0.8, Sun - 0.3, Snow - 0.4).
Let's straighten that out first! My idea of trajectory data for weather would be like this:
day 1 rain
day 2 sun
day 3 sun
day 4 rain
day 5 snow
...etc.

Your idea of trajectory data in that example has the numbers 0.8, 0.3, 0.4 in it. What do those numbers mean?
 
  • #13
That example would work as well. The difference between the problem you described and the one I'm trying to solve is the following: I don't have "rain", "sun", "snow", nor do I have "1", "2", "3". Instead of those discrete values, I have

day 1 (0.2, 0.3, 0.5, 0.0)
day 2 (0.8, 0.0, 0.0, 0.2)
...
 
  • #14
Srecko said:
I have

day 1 (0.2, 0.3, 0.5, 0.0)
day 2 (0.8, 0.0, 0.0, 0.2)
...

OK, that makes things clearer. Since you mentioned students, is the model supposed to apply to each student? Or are there state variables that make the model behave differently for different students? For example, do things depend on a student's age, weight, etc.?

I gather you have a continuum of states rather than a finite number of states. Is the purpose of "clusters" to create a model that uses only a finite number of states?

For example, if the data has the format [day, (A,B,C,D)], then do we want to define states as certain ranges of values of A, B, C, D? For example, State 1 might be the condition "A^2 + B^2 < 3 and 0.02 < B < C < 2.1".
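The idea of states as regions of the continuous (A, B, C, D) space can be sketched as a classification function that maps every observation into a discrete state. The conditions below are invented purely for illustration:

```python
def classify(a, b, c, d):
    """Assign an observation (a, b, c, d) to a discrete state by region."""
    if a > 0.5:          # State 1: activity dominated by cluster A
        return 1
    if b + c > 0.5:      # State 2: activity split between clusters B and C
        return 2
    return 3             # State 3: everything else

state = classify(0.8, 0.0, 0.0, 0.2)  # falls in State 1
```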
 
  • #15
Each student has his/her own trajectory, but the model doesn't depend on a student's age or other "properties". Also, it should be a finite number of states. I didn't think about conditions, but each state should be defined by certain values of A, B, C, and D. For example, State 1 is described with C, State 2 with A and D, while State 3 is defined with B.

Thinking of conditions, that does seem to make sense.
 
  • #16
Srecko said:
For example, State 1 is described with C,

Is State 1 actually a single state? Or is it a different state for each different value of C ?
 
  • #17
It is a single state. It's just that we can say that cluster 1 represents this state. So, when we say (for example) that students most commonly transition from states 1 and 3 to state 2, we should be able to say that students usually transition from C and B to A and D (following the previous example).
 
  • #18
Srecko said:
It is a single state.

Then what does the numerical value associated with C represent?
 
  • #19
The numerical value simply indicates to what extent a student "belongs" to group C. That is, observing his interactions with other students, how much of his activity is associated with other students in group C.

Here we are talking about clustering again. There are (at least) two ways to cluster students (or detect communities) in the network of students. The first approach, "hard" clustering, would tell us that student1 belongs to group/cluster/community C (depending on what we are trying to observe). "Soft" clustering, on the other hand, will tell us that student1 belongs to group C 22%, group A 60%, etc.

Finally, when we say that state 1 is described by cluster C, we need to define cluster C. How are we going to do that? We define a threshold and select all the students in that cluster whose value of "belonging" is above the threshold. Those students (with their properties, e.g., grade, GPA, number of posts) "describe" the cluster.
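The thresholding step just described can be sketched directly: pick the students whose degree of belonging to a given cluster exceeds a cutoff, so their properties can be used to "describe" that cluster. The student names, memberships, and the 0.5 threshold below are all illustrative:

```python
memberships = {
    "student1": {"A": 0.60, "B": 0.10, "C": 0.22, "D": 0.08},
    "student2": {"A": 0.05, "B": 0.15, "C": 0.70, "D": 0.10},
    "student3": {"A": 0.10, "B": 0.20, "C": 0.55, "D": 0.15},
}

def cluster_members(memberships, cluster, threshold=0.5):
    """Students whose degree of belonging to `cluster` exceeds the cutoff."""
    return sorted(s for s, m in memberships.items()
                  if m[cluster] > threshold)

members_of_C = cluster_members(memberships, "C")
```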
 
  • #20
Srecko said:
The numerical value simply indicates to what extent a student "belongs" to group C. That is, observing his interactions with other students, how much of his activity is associated with other students in group C.

I understand. So, as time passes, a student's "degree of belonging" to cluster C can change - correct?

I would not call C a "state" of the particular Markov process that produces your example data. I'd call C a "state variable". A set of definite values of the state variables (like (0.5, 0.0, 0.2, 0.3) ) is a "state".

You mention an "emission" and the link you gave deals with the "emission of tokens". Are "emissions" relevant to your data? For example, if the data for student number 6 begins: [day 1, (0.5, 0.0, 0.2, 0.3)], [day 2, (0.3, 0.1, 0.5, 0.5)], ... is there some sort of "emission" produced by student 6 as he goes from day 1 to day 2?
 
  • #21
Stephen Tashi said:
I understand. So, as time passes, a student's "degree of belonging" to cluster C can change - correct?
Yes, the degree can (and usually does) change from week to week.

It could be that C and D describe a state, so it's not C only. I agree it should be called "state variable".
I don't see any emission between two states. It is important to consider the previous state when inferring the current state, but I guess that is the point of HMMs.
 
  • #22
The terminology for a model of your data is a "continuous-state, discrete-time Markov process". If you choose to add "hidden" states to the model, then you apply the adjective "hidden" to "Markov".

I'm not familiar with fitting such models to data, but I think we can look up information on that topic.

One reference I found (http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA415930 is a page with the link to the pdf) makes the interesting claim that the Kalman filter technique is an example of such a model.

The idea of an equivalence between Kalman filters and HMMs will come as no surprise to many readers, for the analogous nature of these two areas has been noted in the literature for many years. The specifics of the equivalence may be more surprising, however. The Kalman filter is not just analogous to an HMM. The Kalman-filter model is an HMM with linear Gaussian model densities. The Baum algorithm for this HMM is a Kalman smoother, as is the Viterbi algorithm.

On the one hand, it's easy to find information on the Kalman filter. On the other hand, Kalman filter questions on this forum often go unanswered. The Kalman filter is better known in engineering than in theoretical mathematics.
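To make the quoted claim concrete: in a Kalman filter, the hidden state evolves linearly with Gaussian noise, and each observation is the state plus Gaussian noise, which is exactly an HMM with linear Gaussian densities. A minimal scalar (random-walk) predict-and-update cycle, with illustrative noise values:

```python
def kalman_step(mean, var, z, process_var=0.1, obs_var=0.5):
    """One predict + update cycle for a scalar random-walk model."""
    # Predict: the state carries over; uncertainty grows by process noise.
    mean_pred, var_pred = mean, var + process_var
    # Update: blend the prediction with observation z via the Kalman gain.
    gain = var_pred / (var_pred + obs_var)
    mean_new = mean_pred + gain * (z - mean_pred)
    var_new = (1 - gain) * var_pred
    return mean_new, var_new

mean, var = 0.0, 1.0               # diffuse initial belief about the state
for z in [0.9, 1.1, 1.0]:          # noisy observations of the hidden state
    mean, var = kalman_step(mean, var, z)
```

After a few observations near 1.0, the posterior mean moves toward 1.0 and the posterior variance shrinks, which is the filtering counterpart of the forward pass in an HMM.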

Do you have an advisor for the work you are doing? Has the advisor dropped any hints about relevant articles? Offhand, I don't see how the article you linked applies to your problem; plus, the problem studied in that article is unfamiliar to me. I don't understand what real-world phenomena are being modeled in that article.
 
  • #23
Thank you for the really amazing help, Stephen.

Appreciate all your comments. I will read the document you shared and try to see how a Kalman filter could fit here. It seems like a good approach, but so did the one using Block HMMs... so I guess I will have to do some more reading.
 

