HMM training with variable length data

  • #1
malina
Hi All,

I need to train an HMM using data with sequences of variable length (5 - 500 symbols per input sequence).

From what I've seen thus far, all (or most) training is performed on data sets of a fixed size, although the HMM structure imposes no explicit requirement for this.

So, first of all - what am I missing, and is it indeed inadvisable to train an HMM with variable-length data? Does this violate the stochastic assumptions of the EM/Viterbi algorithms?

Next, for the model that I obtain, I get "good" performance on "short" sequences, but as the sequences get longer, the performance decreases (and sometimes recovers). I can relate this to two possible causes:
1) Longer sequences have dynamics the HMM fails to capture, since they are not the majority of the training set; hence the "random" prediction behavior.
2) The HMM gets stuck on a short-length model (which is another way of phrasing (1), but not exactly).

Can someone please advise on the matter?
Thanks!
 
  • #2

Hi Malina,

Different-length sequences may have an impact, since some hidden states can be hard to reach with short samples. If that is the case, training on many short sequences will dominate the model in the initial steps and downweight the relevance of hidden states whose influence appears only in long sequences.

This is likely the reason why your model works so well for short sequences but increasingly fails for long ones.

Now, if you did not have this issue you could use all the data you have to train the model, but in this case you might be better off ignoring the short sequences altogether and working only with the medium/long ones to see how it performs.
 
  • #3
Thanks Viraltux,

The assumption is that the model is reflected similarly in long/short sequences, i.e., short sequences teach the model about relations later seen in longer sequences. Think of it as partial sequences being available. Hence, supposedly, you should not see a difference between the sequences (unless the state transitions are captured incorrectly, which can happen with partial training data :-().
Unfortunately, I don't have enough long sequences for training :-(
Malina.
 
  • #4

OK then, you can treat this as a missing-data problem. One little trick you can try is the following: imagine you have 100 short sequences and only 10 long ones of length 500. Then, for every short sequence, you randomly pick one of your 10 long ones and paste its tail onto the short sequence, so that the result is 100 equal-sized long sequences mixing short and long data.

For example, if you have the short sequences
A,B,B,C
D,A,B,A
and one long sequence
B,C,D,D,E,A,B,C,C,E,A,A,B

then instead of that data you train your model with

A,B,B,C,E,A,B,C,C,E,A,A,B
D,A,B,A,E,A,B,C,C,E,A,A,B
B,C,D,D,E,A,B,C,C,E,A,A,B

By doing this, the short-sequence performance will remain the same, but the long-sequence parameters will not be underestimated for lack of data. Now, this is not ideal, and there is a whole literature out there on how to treat missing data, but I think the best you can do is think about the best strategy for completing the sequences in the problem you are dealing with, to avoid overfitting the model parameters to short sequences.
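If it helps, the cut-and-paste completion above can be sketched in a few lines of Python (the function name and structure are mine, purely illustrative; it assumes every short sequence is no longer than the target length):

```python
import random

def pad_with_long_tails(short_seqs, long_seqs, rng=None):
    """Pad each short sequence with the tail of a randomly chosen long
    sequence, so every training sequence ends up with equal length.
    Assumes len(short) <= len(longest long sequence)."""
    rng = rng or random.Random(0)
    target = max(len(s) for s in long_seqs)
    padded = []
    for s in short_seqs:
        # Take the tail of a random long sequence, starting where
        # the short sequence ends, so lengths line up at `target`.
        tail = rng.choice(long_seqs)[len(s):target]
        padded.append(list(s) + list(tail))
    # Keep the original long sequences in the training set too.
    return padded + [list(s) for s in long_seqs]
```

Running it on the toy example above (two short sequences, one long one) reproduces the three equal-length sequences shown.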

Good Luck Malina!
 
  • #5
Thanks!
Will keep you updated if something extremely cool comes out of this.
M.
 
  • #6
Sure! Please do! :smile:
 

Related to HMM training with variable length data

1. What is HMM training with variable length data?

HMM training with variable length data is a method of training a Hidden Markov Model (HMM) using data sequences of varying lengths. This is often used in speech recognition, where the length of the spoken words or sentences can vary.

2. Why is HMM training with variable length data important?

This method is important because it allows for the modeling of sequences with varying lengths, which is necessary in many real-world applications such as speech recognition and natural language processing. It also helps to improve the accuracy of the model by taking into account the variability in the data.

3. How does HMM training with variable length data work?

In this method, the HMM is trained using an algorithm called the Baum-Welch algorithm, which takes into account the variable length of the data sequences. It uses a technique called dynamic programming to efficiently estimate the parameters of the HMM.
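As a concrete illustration, the forward recursion at the heart of Baum-Welch scores each sequence independently, so sequences of different lengths pose no formal problem. A minimal scaled version (the two-state parameters below are made up for demonstration):

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | pi, A, B) for one sequence.
    obs: list of symbol indices; pi: initial state probabilities;
    A: state transition matrix; B: emission matrix (states x symbols)."""
    alpha = pi * B[:, obs[0]]          # alpha_1(i) = pi_i * b_i(o_1)
    c = alpha.sum()
    logp = np.log(c)
    alpha = alpha / c                  # rescale to avoid underflow
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]
        c = alpha.sum()
        logp += np.log(c)
        alpha = alpha / c
    return logp

# Made-up 2-state, 2-symbol model; any sequence length can be scored.
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
```

The scaling constants `c` are what make long sequences tractable; without them the raw probabilities underflow after a few hundred symbols.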

4. What are the challenges of HMM training with variable length data?

One of the main challenges is dealing with the varying lengths of the data sequences, as this can make it difficult to estimate the parameters of the HMM accurately. Additionally, the Baum-Welch algorithm can be computationally expensive for large datasets.

5. What are some applications of HMM training with variable length data?

HMM training with variable length data is commonly used in speech recognition, where the length of spoken words or sentences can vary. It is also used in natural language processing, bioinformatics, and other fields where sequential data with varying lengths needs to be modeled.
