# Basic Elements of the HMM Model (Hidden Markov Model)

## I.  The basic elements of the HMM model (Hidden Markov Model)

### 1. Five basic elements of HMM

In an HMM there are 5 basic elements: {I, O, A, B, π}. Let us introduce these 5 basic elements in the context of the sequence labeling task:

(1) I: the state sequence. Here it is the sequence of labels behind the words, i.e., the tag attached to each word.

(2) O: the observation sequence. Here it is the sequence of words themselves.

(3) A: the state transition probability matrix. Here it gives the probability that one label is followed by another label.

(4) B: the observation probability matrix, also called the emission probability matrix. Here it gives the probability of generating a certain word under a certain label.

(5) π: the initial state probability vector. Here it gives the probability of each label appearing at the start of a sequence.

where:

$I = (i_1, i_2, i_3, \ldots, i_N)$ — state sequence

$O = (o_1, o_2, o_3, \ldots, o_N)$ — observation sequence

$A = [a_{ij}]_{N \times N}$ — state transition probability matrix

$B = [b_j(k)]_{N \times M}$ — observation (emission) probability matrix, where N is the number of distinct labels and M is the number of distinct observed words

$\pi$ — initial state probability vector

### 2. HMM model

Model: λ = (A,B,π)

A, B, π are the three elements of the Hidden Markov Model

These three elements are obtained by counting statistics from the data. They are the parameters of the model, and the process of obtaining them is the model training process, which is why the HMM algorithm is a relatively simple algorithm.
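As a concrete illustration of "training by counting", here is a minimal sketch in Python; the toy corpus, label set, and variable names are assumptions made purely for illustration:

```python
from collections import defaultdict

# A minimal sketch of supervised HMM training by counting.
# The toy corpus is an illustrative assumption: each sentence is a
# list of (word, label) pairs.
corpus = [
    [("I", "O"), ("work", "O"), ("at", "O"), ("Phoenix", "B-ORG"), ("Finance", "I-ORG")],
    [("Phoenix", "B-ORG"), ("Finance", "I-ORG"), ("is", "O"), ("a", "O"), ("company", "O")],
]

init_count = defaultdict(int)   # pi: how often each label starts a sentence
trans_count = defaultdict(int)  # A:  (prev_label, label) transition counts
trans_from = defaultdict(int)   # number of transitions leaving each label
emit_count = defaultdict(int)   # B:  (label, word) emission counts
label_total = defaultdict(int)  # total occurrences of each label

for sent in corpus:
    prev = None
    for word, label in sent:
        label_total[label] += 1
        emit_count[(label, word)] += 1
        if prev is None:
            init_count[label] += 1
        else:
            trans_count[(prev, label)] += 1
            trans_from[prev] += 1
        prev = label

# Turn the counts into probabilities (no smoothing, for clarity).
pi = {lab: c / len(corpus) for lab, c in init_count.items()}
A = {pair: c / trans_from[pair[0]] for pair, c in trans_count.items()}
B = {pair: c / label_total[pair[0]] for pair, c in emit_count.items()}

print(pi, A, B, sep="\n")
```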

Once the model is known, the fully connected graph of paths over the states can also be considered known. Given an observation sequence, the Viterbi algorithm finds the optimal path among all these paths.

### 3. Two assumptions of the HMM model

#### (1) Homogeneous Markov assumption (also called the first-order Markov assumption)

The state of the hidden Markov chain at any time t only depends on the state at the previous time, and has nothing to do with the state at other times and the observed state.

$$P(i_t \mid i_{t-1}, o_{t-1}, \ldots, i_1) = P(i_t \mid i_{t-1})$$

#### (2) Observation independence assumption

The observed state at any time only depends on the state of the Markov chain at that time, and has nothing to do with the observed states at other times.
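In the same notation as the first assumption, this is usually written as:

$$P(o_t \mid i_t, i_{t-1}, o_{t-1}, \ldots, i_1, o_1) = P(o_t \mid i_t)$$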

The above elements can be counted from the training corpus. Finally, based on these statistics, we apply the Viterbi algorithm to calculate the label sequence behind the word sequence.

## II. Three application scenarios of the Hidden Markov Model

We only use one of them for named entity recognition - to find the most likely label sequence behind the observation sequence.

Finally, let's talk about the three problems that HMM solves:

1. Evaluation (probability calculation problem)

Knowing the model parameters λ = (A, B, π), compute the probability of a given observation sequence, that is, find P(O|λ).
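Written out, this is a sum over all possible state sequences; in practice it is computed efficiently with the forward algorithm rather than by brute-force enumeration:

$$P(O \mid \lambda) = \sum_{I} P(O \mid I, \lambda)\, P(I \mid \lambda)$$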

2. Learning (parameter learning problem)

Given an observation sequence $O = (o_1, o_2, \ldots, o_n)$, how do we adjust the model parameters $\lambda = (\pi, A, B)$ so that $P(O \mid \lambda)$ is maximized? This is the problem of finding the model.

3. Decoding (prediction or decoding problem), the most commonly used of the three

Given an observation sequence O and a model λ, find the most probable state sequence $S = (s_1, s_2, \ldots, s_T)$.

For example: after training on entity-labeled data we obtain the model λ. Now, given the observation sequence "I work at Phoenix Finance", we want to find the most likely named entities, which means finding the corresponding state (label) sequence S, one label for each token of (I, work, at, Phoenix Finance).

## III. Finding the model λ: solving the second problem

HMM is a generative model, so what it models is the joint probability $P(X, Y)$.

Note: when we say "finding the model", we usually mean finding the objective function. For example, in linear regression the objective function is $h(λ)=λ_1X+λ_2$; once its parameters λ are known the function is fully determined, so "finding the model" really means finding the parameters.
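Concretely, for an observation sequence $O = (o_1, \ldots, o_T)$ and a state (label) sequence $I = (i_1, \ldots, i_T)$, this joint probability factorizes under the two assumptions above as:

$$P(O, I \mid \lambda) = \pi_{i_1}\, b_{i_1}(o_1) \prod_{t=2}^{T} a_{i_{t-1} i_t}\, b_{i_t}(o_t)$$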

## IV. The Viterbi algorithm: solving the third problem

The Viterbi algorithm uses dynamic programming to solve the HMM prediction problem: given the model and an observation sequence, find the most probable state sequence.

Suppose the state sequence is $x_1, x_2, x_3, \ldots, x_N$ and the corresponding observation sequence is $y_1, y_2, y_3, \ldots, y_N$.

Then our problem becomes: given the input sequence $y_1, y_2, y_3, \ldots, y_N$, what is the most likely sequence of Chinese characters $x_1, x_2, x_3, \ldots, x_N$?

The formula is:

$$x_1, x_2, x_3, \ldots, x_N = \arg\max P(x_1, x_2, x_3, \ldots, x_N \mid y_1, y_2, y_3, \ldots, y_N) = \arg\max \prod_{i=1}^{N} P(y_i \mid x_i)\, P(x_i \mid x_{i-1})$$

The expression $\arg\max \prod_{i=1}^{N} P(y_i \mid x_i)\, P(x_i \mid x_{i-1})$ is obtained mainly through Bayes' theorem.

We know that Bayes' theorem is

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

so

$$P(x \mid y) = \frac{P(y \mid x)\, P(x)}{P(y)}$$

where $P(y)$ is a known constant, and $P(x)$ is in fact $P(x_t \mid x_{t-1})$: by the Markov assumption, the current state depends only on the state at the previous moment.
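Putting the pieces together: dropping the constant $P(y)$ and applying the two HMM assumptions to the numerator gives exactly the product form used above (with the convention that $P(x_1 \mid x_0)$ is the initial probability of $x_1$):

$$P(x_1, \ldots, x_N \mid y_1, \ldots, y_N) \propto P(y_1, \ldots, y_N \mid x_1, \ldots, x_N)\, P(x_1, \ldots, x_N) = \prod_{i=1}^{N} P(y_i \mid x_i) \prod_{i=1}^{N} P(x_i \mid x_{i-1})$$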

For example, take the input observation sequence "I love China" (in Chinese, 我爱中国: four characters, one label per character). The candidate label sequences are:

O O O B

O B O I

O O B I

B O I B

The third row (O O B I) is the optimal path we are looking for.

### The Viterbi algorithm (Viterbi)

Note: during the calculation, the Viterbi algorithm computes the shortest path between pairs of nodes, not the shortest path between two whole layers.


1. Property

If the most probable path P (equivalently, the shortest path) passes through some node a, then the sub-path Q from the starting point S to a must itself be the shortest path from S to a.

Otherwise, replacing Q with the true shortest path R from S to a would give a path shorter than P, which is a contradiction. This shows that the principle of optimality is satisfied.

2. Algorithm

To find the shortest path from S to E, is there a better way than enumerating all paths?

Among all the paths there must be a shortest one:

Let's start solving step by step from scratch:

(1) First, the starting point is S. From S to the nodes in column A there are three possible paths: S-A1, S-A2, S-A3.

We cannot yet say which of S-A1, S-A2, S-A3 must be part of the global shortest path; so far, each of them is still a candidate.

(2) Then move on to the second layer.

<1> First, consider the paths that pass through B1. There are only 3 of them:

S-A1-B1

S-A2-B1

S-A3-B1

If the final shortest path passes through B1 in the second layer, it must use the shortest of these three, so the other two can be deleted.

<2> Next, consider the second node of the second layer, B2.

Similarly, there are 3 paths through B2:

S-A1-B2

S-A2-B2

S-A3-B2

If the final shortest path passes through B2 in the second layer, it must use the shortest of these three, and the other two can be deleted.

<3> Then consider the third node of the second layer, B3.

Similarly, there are also 3 paths through B3:

S-A1-B3

S-A2-B3

S-A3-B3

If the final shortest path passes through B3 in the second layer, it must use the shortest of these three, and the other two can be deleted.

<4> After all the nodes of the second layer have been processed, three paths remain, one ending at each of B1, B2, B3.

We don't yet have enough information to tell which one must be a subpath of the global shortest path.
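This "keep only the best way into each node" step is exactly the dynamic-programming recurrence behind Viterbi; in shortest-path notation it reads:

$$d(S, B_j) = \min_{i}\,\bigl[\, d(S, A_i) + d(A_i, B_j) \,\bigr], \quad j = 1, 2, 3$$

where $d(S, X)$ denotes the length of the shortest path from S to node X found so far.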

(3) Then we continue the algorithm at the third layer:

We don't yet have enough information to tell which one must be a subpath of the global shortest path.

(4) Then we continue the algorithm at the last layer:

Point E is the end point: by comparing the total lengths of the remaining three paths, we can tell which one is the global shortest path.
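The same layer-by-layer procedure, applied to probabilities (maximizing a product instead of minimizing a path length), gives the Viterbi decoder for HMM labeling. Below is a minimal sketch; it assumes the pi, A, B dictionaries produced by the counting sketch earlier, and all names are illustrative:

```python
def viterbi(obs, states, pi, A, B, eps=1e-12):
    """Most probable label sequence for the observation list `obs`.

    pi: dict  label -> initial probability
    A:  dict  (prev_label, label) -> transition probability
    B:  dict  (label, word) -> emission probability
    Unseen events fall back to the tiny probability `eps`.
    """
    # delta[t][s]: probability of the best path ending in state s at step t
    # psi[t][s]:   predecessor of s on that best path (for backtracking)
    delta = [{s: pi.get(s, eps) * B.get((s, obs[0]), eps) for s in states}]
    psi = [{}]

    for t in range(1, len(obs)):
        delta.append({})
        psi.append({})
        for s in states:
            best_prev, best_score = max(
                ((p, delta[t - 1][p] * A.get((p, s), eps)) for p in states),
                key=lambda item: item[1],
            )
            delta[t][s] = best_score * B.get((s, obs[t]), eps)
            psi[t][s] = best_prev

    # Backtrack from the best state at the last step.
    last = max(delta[-1], key=delta[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return list(reversed(path))


# Hypothetical usage with the toy parameters estimated earlier:
# viterbi(["I", "work", "at", "Phoenix", "Finance"], ["O", "B-ORG", "I-ORG"], pi, A, B)
```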
