C.2 Algorithms for Hidden Markov Models
C.2.5 The Baum-Welch Algorithm
An always recurrent problem with HMMs is to adjust the parametersA,B and πto maximise the probability of observing a certain sequence of symbols. This is quite a hard problem and no analytic solutions have been given yet. Though, in contrast to an analytic solution, there are a few algorithms which are able to optimise the parameters iteratively; one of them is known as the Baum-Welch algorithm. The idea here is to keep on training the HMM, to maximise the probability of observing a certain symbol sequenceO, until a certain threshold has been reached.
Before taking a look at the Baum-Welch algorithm, we introduce the joint prob-ability of being in state Si at time t and in state Sj at time t+ 1 given the observation sequence Oand the model λ:
ξt(i, j) =P(qt=Si, qt+1=Sj|O, λ), 1≤i≤N
1≤j≤N (C.31)
The event of being in stateSi at timet and in stateSj at timet+ 1 could be illustrated graphically; see figure C.6.
Si Si
bj( Ot+1) aij
αt(i) (j)
β t+1
t+1
t t+2
t−1
Figure C.6: Graphically representation of the joint event of be-ing in stateSi at timetand in stateSj at timet+ 1.
From figure C.6 it is clear to see, that the probability given in equation (C.31) could be written as:
ξt(i, j) = αt(i)aijbj(Ot+1)βt+1(j)
P(O|λ) , 1≤i≤N
1≤j≤N (C.32)
To be in state Si at time t we need to have observed the symbol sequence O=O1O2· · ·Ot; thus the αt(i) term. To be in stateSj at timet+ 1, coming from stateSi and observing the symbolOt+1 in Sj, we must make a transition from stateSitoSj and account for the probability ofOt+1inSj; thus the term aijbj(Ot+1). We should also take into account observing the rest of the symbol sequenceOt+2Ot+3· · ·OT, given we are in stateSjat timet+ 1; thus the term βt+1(j). Finally we divide with the probability P(O|λ), to achieve a proper normalisation value forξt(i, j).
If we take a closer look atξt(i, j) summing it up for allj, we get the following result:
XN
j=1
ξt(i, j) =
XN
j=1
αt(i)aijbj(Ot+1)βt+1(j)
P(O|λ) (C.33)
=
·XN
j=1
aijbj(Ot+1)βt+1(j)
¸ αt(i)
P(O|λ) (C.34)
by def. ofβt(i)
= βt(i)αt(i)
P(O|λ) (C.35)
C.2 Algorithms for Hidden Markov Models 125 which in fact is equal to the variableγt(i) as defined in equation (C.22) on page 120, resulting in the following equation:
γt(i) = XN
j=1
ξt(i, j) (C.36)
As mentioned in section C.2.4 on page 119, γt(i) is the probability of being in stateSi at timet. This means that:
γ1(i) = the probability of being in stateSi at time t= 1 γ2(i) = the probability of being in stateSi at time t= 2
· · · = · · ·
γT(i) = the probability of being in stateSi at time t=T
And summing up all these probabilities over time gives us a measure, which can be interpreted as the number of times we have visited stateSi. Or by excluding t = T from the sum, a measure, which can be interpreted as the number of transition made from stateSi.
T−1X
t=1
γt(i) = Expected number of transitions made from stateSi (C.37)
As mentioned beforeξ(i, j) is the probability of being in stateSiat timetand in stateSjat timet+1, thus making a transition fromSitoSj. The summation of ξt(i, j) over time from t=1 to t=T-1, can therefore be interpreted as the number of transitions made from stateSi to Sj.
T−1X
t=1
ξt(i, j) = Expected number of transitions from stateSi to stateSj
(C.38)
We can now reestimate the parameters of the HMM λ= (A, B, π) by the
pa-rametersA,B andπgiven by:
πi = the probability of being inSi at time 1 =γ1(i) (C.39)
aij = Expected number of transitions from state Si to stateSj
Expected number of transitions out of stateSi
=
PT−1
t=1 ξt(i, j) PT−1
t=1 γt(i) (C.40)
bj(k) = Expected number of times inSj observing symbolvk
Expected number of times inSj
= PT
Ott=1=vk
γt(j) PT
t=1γt(j) (C.41)
In the re-estimation of aij we need the probability of doing justone transition fromSitoSj, the sumPT−1
t=1 ξt(i, j) gives us the probability ofall the expected transitions, so we need to divide by the number of transitions out of stateSi. The same applies for the re-estimation ofbj(k), where we divide by the number of times in stateSj.
If we are given a HMMλ= (A, B, π) and its re-estimated modelλ= (A, B, π), where A, B and π are calculated as described above, it can be proven that either the two models are the same,λ=λ, or the probability of observing the symbol sequence O in λ is greater then observing it in λ, P(O|λ) > P(O|λ).
This enable us to reestimate the HMMλiteratively until a preferred threshold µhas been reached:
step 1: Given the HMMλ, find the re-estimated modelλusing equation (C.39), (C.40) and (C.41).
step 2: If µ > P(O|λ)−P(O|λ) then terminate the algorithm and return λ otherwise setλ=λand go to step 1.
If we look at the re-estimation of the initial probability distributionπ, we can see from equation (C.22) that it is defined by:
γ1(i) = α1(i)β1(i)
P(O|λ) (C.42)
From equation (C.13) we know that P(O|λ) can be expressed as PN
i=1αT(i) giving us:
γ1(i) = α1(i)β1(i) PN
i=1αT(i) (C.43)
C.2 Algorithms for Hidden Markov Models 127 The α1(i) term requires 1 multiplication for every i, see equation (C.11) on page 113, whereas the β1(i) term will require (3N−1)(T−1) calculations for everyi, see equation (C.18) on page 117. For everyi, 1 ≤i≤N, we need to find a new numerator of equation (C.43), resulting in (1 + (3N−1)(T−1))N calculations. The denominator only needs to be calculated once, this requires 1 multiplication for the initialisation part of αT(i), N+ 1 multiplications plus N−1 additions for the induction part ofαT(i), see section C.2.2 on page 116, and we need to find αT(i) for all i, where 1≤i≤N, giving the denominator in the order of (1 +N + 1 +N −1)N = (2N + 1)N calculations. If we add the required calculations for the numerator and denominator of equation (C.43) together, we get in the order of (1 + (3N−1)(T−1))N+ (2N+ 1)Ncalculations or an asymptotic running time ofO(N2T).
If we look at the re-estimation of the state transition probability distribution in terms of the forward and backward variables, we can see from equation (C.36) and (C.32) thataij is defined by:
aij = When finding the re-estimations of the state transition probabilities, we first calculate all forward and backward probabilities, taking in the order of (2N+ 1)N(T −1) and (3N −1)N(T−1) calculations respectively; see section C.2.2 and C.2.3. We then need to do 3 multiplications to find the expression of αt(i)aijbj(Ot+1)βt+1(j) in the numerator of equation (C.48) for every i,j and t, where 1 ≤ i ≤ N, 1 ≤ j ≤ N and 1 ≤ t ≤ T −1, taking in the order of 3N2(T−1) calculations. When finding the denominator of equation (C.48), we just reuse the calculated value ofαt(i)aijbj(Ot+1)βt+1(j) in the numerator and sum up all the values for eachj, resulting in iadditions for every j, giving us N2calculations. Finally we should divide the nominator with the denominator for every i and j, resulting in N2 divisions. The complete running time for the re-estimation of the state transition probabilities therefore results in (2N+
1)N(T−1)+(3N−1)N(T−1)+3N2(T−1)+N2+N2 = (T−1)(8N2) + 2N2 calculations or an upper asymptotic running time ofO(T N2).
When re-estimating the observation sequence probability distributionsbj(k), we first need to find the gamma values γt(j) for every t and j, where 1≤j ≤N calculations for theβt(j) term,N additions for thePN
i=1αT(i) term, and 2T N calculations when multiplyingαt(j)βt(j) and dividing withP
i=1αT(i) for every tandj; thus resulting in (2N+ 1)N(T−1) + (3N−1)N(T−1) +N+ 2T N = 5N2(T−1) +N(1 + 2T) calculations or an upper asymptotic running time of O(T N2) for finding γt(j). Secondly we need to go trough all the observation symbolsOtfor every stateSj and whenvk=Otwe should add up the value of γt(j) tobj(k). To normalise the values of everybj(k) we divide by thePT
t=1γt(j) term. It takesN T additions for calculating all thePT
Ott=1=vk
γt(j) terms for every t and j, N T additions when finding PT
t=1γt(j) for every t and j, and N T divisions when doing the normalisation of bj(k). All this together with the computing of the gamma values result in 3N T + 5N2(T −1) +N(1 + 2T) = 5N2(T−1) +N(1 + 5T) calculations or an upper asymptotic running time of O(T N2).
We should notice that the computing of the alpha, beta and gamma values, only need to be calculated one time and we are in such way able to spare some calculations when re-estimatingpii,aijandbj(k) all at once. Though, the upper asymptotic running time for re-estimating pii, aij andbj(k) at the same time, will still beO(T N2).