Basic Elements of the HMM Model (Hidden Markov Model)
I. The basic elements of the HMM model (Hidden Markov Model)
1. Five basic elements of HMM
In an HMM there are 5 basic elements: {I, O, A, B, π}. Let us introduce these 5 basic elements in the context of the sequence labeling task:
(1) I: state sequence. Here, it refers to the label behind each word.
(2) O: observation sequence. Here, it refers to the words themselves.
(3) A: state transition probability matrix. Here, it refers to the probability of one label transitioning to the next label.
(4) B: observation probability matrix, i.e., the emission probability matrix. Here, it refers to the probability of generating a certain word under a certain label.
(5) π: initial probability vector. Here, it refers to the initial probability of each label.
where:
$I = (i_1, i_2, i_3, \dots, i_N)$: state sequence
$O = (o_1, o_2, o_3, \dots, o_N)$: observation sequence
$A = [a_{ij}]_{N \times N}$: state transition probability matrix
$B = [b_j(k)]_{N \times M}$: observation (emission) probability matrix, with $M$ the number of distinct observation symbols
$\pi$: initial state probability vector
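As a concrete illustration (a made-up mini example, using a BIO-style label set consistent with the O/B/I tags that appear later in this article), the two sequences for one sentence might look like:

```python
# observation sequence O: the words themselves
O = ["I", "work", "at", "Phoenix", "Finance"]
# state sequence I: the label behind each word ("B"/"I" mark an entity span, "O" means outside)
I = ["O", "O", "O", "B", "I"]
```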
2. HMM model
Model: λ = (A,B,π)
A, B, π are the three elements of the Hidden Markov Model
These three elements are obtained by counting statistics from data: they are the parameters of the model, and the process of obtaining them is the process of model training, so the HMM is a relatively simple algorithm.
Once the model is known, the fully connected paths between the states can be considered known. Given an observation sequence, the optimal path among all paths is found with the Viterbi algorithm.
3. Two assumptions of the HMM model
(1) Homogeneous Markov assumption (also called first-order Markov assumption)
The state of the hidden Markov chain at any time t depends only on the state at the previous time, and is independent of the states and observations at all other times.
$P(i_t \mid i_{t-1}, o_{t-1}, \dots, i_1) = P(i_t \mid i_{t-1})$
(2) Observation independence assumption
The observation at any time depends only on the state of the Markov chain at that time, and is independent of the observations and states at all other times.
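Written as a formula in the same notation as the first assumption (this equation is not in the original text, but it is the standard statement of the assumption):
$P(o_t \mid i_t, o_{t-1}, i_{t-1}, \dots, o_1, i_1) = P(o_t \mid i_t)$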
The above elements can be counted from the training corpus.
Finally, based on these statistics, we apply the Viterbi algorithm to calculate
the label sequence behind the word sequence.
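The counting mentioned here can be sketched directly in code. A minimal supervised-estimation sketch (the function and variable names are my own, and the corpus format, a list of sentences of (word, label) pairs, is an assumption rather than something specified in the original text; no smoothing is applied):

```python
from collections import Counter, defaultdict

def train_hmm(corpus):
    """Estimate (pi, A, B) by counting from a labeled corpus.

    corpus: list of sentences, each a list of (word, label) pairs.
    Returns plain dictionaries of probabilities (no smoothing).
    """
    init_counts = Counter()              # how often each label starts a sentence
    trans_counts = defaultdict(Counter)  # label -> counts of the next label
    emit_counts = defaultdict(Counter)   # label -> counts of emitted words

    for sentence in corpus:
        labels = [label for _, label in sentence]
        init_counts[labels[0]] += 1
        for word, label in sentence:
            emit_counts[label][word] += 1
        for prev, cur in zip(labels, labels[1:]):
            trans_counts[prev][cur] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    pi = normalize(init_counts)
    A = {label: normalize(c) for label, c in trans_counts.items()}
    B = {label: normalize(c) for label, c in emit_counts.items()}
    return pi, A, B

# toy usage with a single labeled sentence
corpus = [[("I", "O"), ("work", "O"), ("at", "O"), ("Phoenix", "B"), ("Finance", "I")]]
pi, A, B = train_hmm(corpus)
print(pi["O"], A["O"]["B"], B["B"]["Phoenix"])   # 1.0 0.333... 1.0
```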
II. Three application scenarios of the Hidden Markov Model
For named entity recognition we only use one of them: finding the most likely label sequence behind the observation sequence.
Finally, let's talk about the three problems that HMM solves:
1. Evaluation (probability calculation problem)
Given the model parameters λ = (A, B, π), compute the probability of a given observation sequence, i.e., find P(O|λ).
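The original text does not show how P(O|λ) is actually computed; the standard solution is the forward algorithm. A minimal sketch, assuming the parameters are numpy arrays and states/observations are integer-encoded (these assumptions and the function name are mine):

```python
import numpy as np

def forward_prob(obs, pi, A, B):
    """P(O | lambda) computed with the forward algorithm.

    obs: list of integer-encoded observations o_1..o_T
    pi: (N,), A: (N, N), B: (N, M) probability arrays
    """
    alpha = pi * B[:, obs[0]]            # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
    return float(alpha.sum())            # P(O | lambda) = sum_i alpha_T(i)
```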
2. Learning (learning problem)
Given an observation sequence $O = (o_1, o_2, \dots, o_n)$, how do we adjust the model parameters λ = (π, A, B) to maximize P(O|λ)? This is the problem of finding the model. With labeled training data, as in our sequence labeling task, it reduces to counting the statistics from the corpus as described above.
3. Decoding (prediction or decoding problem), the most commonly used
Given an observation sequence O and a model λ, find the most probable state sequence $S = (s_1, s_2, \dots, s_T)$.
For example: through training on entity-labeled data we obtain the model λ. Now, given the observation sequence "I work at Phoenix Finance", we want to find the most likely named entities, that is, the state (label) sequence S corresponding to the word sequence (I / work / at / Phoenix Finance).
III. Finding the model λ: solving the second problem
The HMM is a generative model, so what we seek is the joint probability.
Note: when we speak of "finding the model", we usually mean finding the objective function. For example, in linear regression the objective function is $h(λ) = λ_1 X + λ_2$, and determining it only requires the parameters λ, so "finding the model" usually means finding the parameters.
That is, what we compute is the joint probability $P(X, Y)$.
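Concretely, in the notation of section I (states $i_t$, observations $o_t$), this joint probability factorizes into exactly the three parameter tables under the two assumptions above (a standard identity, not spelled out in the original text):
$P(O, I \mid \lambda) = \pi_{i_1} \, b_{i_1}(o_1) \prod_{t=2}^{N} a_{i_{t-1} i_t} \, b_{i_t}(o_t)$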
IV. The Viterbi algorithm (Viterbi): solving the third problem
The Viterbi algorithm uses dynamic programming to solve the HMM prediction problem: given the model and an observation sequence, find the most probable state sequence.
Suppose the state sequence is $x_1, x_2, x_3, \dots, x_N$ and the corresponding observation sequence is $y_1, y_2, y_3, \dots, y_N$.
Then our problem becomes: given the input sequence $y_1, y_2, y_3, \dots, y_N$, what is the most likely corresponding sequence of Chinese characters $x_1, x_2, x_3, \dots, x_N$?
Formula:
$x_1, x_2, x_3, \dots, x_N = \mathrm{ArgMax}\, P(x_1, x_2, x_3, \dots, x_N \mid y_1, y_2, y_3, \dots, y_N) = \mathrm{ArgMax} \prod_{i=1}^{N} P(y_i \mid x_i) \cdot P(x_i \mid x_{i-1})$
The expression $\mathrm{ArgMax} \prod_{i=1}^{N} P(y_i \mid x_i) \cdot P(x_i \mid x_{i-1})$ is obtained mainly by a transformation with the Bayes formula.
We know that the Bayes formula is $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$.
Then $P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$, where $P(y)$ is a known constant and $P(x)$ is in fact $P(x_t \mid x_{t-1})$, because according to the Markov assumption the current moment depends only on the previous moment.
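Putting the pieces together, the transformation behind the ArgMax formula above can be spelled out as follows (a short derivation using the two HMM assumptions; it is not written out in the original text):
$P(x_1, \dots, x_N \mid y_1, \dots, y_N) \propto P(y_1, \dots, y_N \mid x_1, \dots, x_N)\,P(x_1, \dots, x_N) = \prod_{i=1}^{N} P(y_i \mid x_i) \cdot \prod_{i=1}^{N} P(x_i \mid x_{i-1})$
where the first product uses the observation independence assumption, the second uses the homogeneous Markov assumption, and $P(x_1 \mid x_0)$ is read as the initial probability of $x_1$.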
For example, input the observation sequence "I love China" (in the original Chinese, the four characters 我 / 爱 / 中 / 国). Some candidate label sequences are:
我 爱 中 国
O  O  O  B
O  B  O  I
O  O  B  I
B  O  I  B
The third row (O O B I, which marks 中国 as an entity) is the optimal path we are looking for.
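To see what checking every candidate against the formula above literally means, here is a throwaway brute-force scorer (the names are my own; it assumes pi, A, B are indexable probability tables, and it is feasible only for very short sequences, since the number of candidates grows exponentially with the length). The Viterbi sketch at the end of the article does the same job efficiently.

```python
from itertools import product

def brute_force_decode(obs, pi, A, B):
    """Score every label sequence with pi[x_1]*B[x_1][o_1] * prod_t A[x_{t-1}][x_t]*B[x_t][o_t]
    and keep the best one. Exponential in len(obs); for illustration only."""
    n_states = len(pi)
    best_states, best_p = None, -1.0
    for states in product(range(n_states), repeat=len(obs)):
        p = pi[states[0]] * B[states[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
        if p > best_p:
            best_states, best_p = list(states), p
    return best_states, best_p
```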
The Viterbi algorithm (Viterbi), step by step
Note: in the Viterbi computation, what is computed is the shortest path to each individual node, not the shortest path between two whole layers.
1. Property
If the path P with the highest probability (equivalently, the shortest path) passes through some node, say a in the figure, then the sub-path Q from the starting point S to a along P must be the shortest path between S and a.
Otherwise, replacing Q with the shortest path R from S to a would give a path shorter than P, which is a contradiction. This proves that the principle of optimality holds.
2. Algorithm
To find the shortest path between S and E, is there a better way than traversing all the paths?
Among all the paths there must certainly be a shortest one. Let's solve it step by step from the start:
(1) First, the starting point is S, and there are three possible paths from S to the A column: S-A1, S-A2, S-A3, as shown in the figure below:
We cannot yet say which of S-A1, S-A2, or S-A3 is certain to be part of the global shortest path; so far, any of them could be a candidate.
(2) Then we move on to the second layer.
<1> We start with the first node of the second layer:
As shown above, there are only 3 paths through B1:
S-A1-B1
S-A2-B1
S-A3-B1
If the final shortest path passes through B1, it must use the shortest of these three paths, so the other two can be deleted.
<2> Then we start the second node of the second layer:
Similarly, as shown in the figure above, there are 3 paths
through B2:
S-A1-B2
S-A2-B2
S-A3-B2
If the final shortest path passes through B2, it must use the shortest of these three paths, so the other two can be deleted.
<3> Then we start the third node of the second layer:
Similarly, as shown in the figure above, there are also 3
paths through B3:
S-A1-B3
S-A2-B3
S-A3-B3
If the final shortest path passes through B3, it must use the shortest of these three paths, so the other two can be deleted.
<4> After all the nodes of the second layer have been traversed, three paths remain (one ending at each of B1, B2, B3).
We don't yet have enough information to tell which one must
be a subpath of the global shortest path.
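In HMM notation, this "keep only the best incoming path at every node of a layer" step is exactly the Viterbi recurrence (not written out in the original text, added here for reference):
$\delta_t(j) = \max_{1 \le i \le N} \big[\delta_{t-1}(i)\, a_{ij}\big]\, b_j(o_t), \qquad \delta_1(j) = \pi_j\, b_j(o_1)$
where $\delta_t(j)$ is the probability of the best path that ends in state $j$ at time $t$.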
(3) Then we continue the algorithm at the third layer: again we keep only the best path into each node, and we still do not have enough information to tell which of the surviving paths must be a subpath of the global shortest path.
(4) Then we continue the algorithm at the last layer: point E is the end point, so by comparing the total lengths of the three remaining paths we can tell which one is the shortest path.
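The layer-by-layer procedure described above is exactly the Viterbi dynamic program. A minimal Python sketch (the function and variable names are my own; it assumes integer-encoded states and observations, and works in log space to avoid numerical underflow):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most probable state sequence for an observation sequence.

    obs: list of integer-encoded observations o_1..o_T
    pi: (N,), A: (N, N), B: (N, M) probability arrays
    """
    N, T = len(pi), len(obs)
    # work in log space so long sequences do not underflow
    logpi, logA, logB = np.log(pi), np.log(A), np.log(B)

    delta = np.zeros((T, N))            # best log-score of any path ending in state j at time t
    psi = np.zeros((T, N), dtype=int)   # back-pointer: best previous state for (t, j)

    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA   # scores[i, j]: come from state i, move to state j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]

    # backtrack from the best final state
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# toy usage (made-up numbers): 2 states, 3 possible observation symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], pi, A, B))   # -> [0, 0, 1]
```

For the named entity task, the observations would be the word indices of a sentence and the states the label indices, with pi, A, B taken from the counting step described earlier (converted to arrays).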