PR414 / PR813
Estimating models via the Expectation-Maximisation (EM) algorithm
Purpose: To introduce the EM algorithm, which is used to train mixture models, HMMs, etc.
Material: LECTURE NOTES; the paper by T.K. Moon, "The Expectation Maximization Algorithm", IEEE Signal Processing Magazine, November 1996.
In this lecture we introduce a very powerful and elegant algorithm which can be used
in situations where we have observations that are only indirectly tied to the underlying
parameters that we wish to estimate. It is extremely useful, has attractive convergence
properties and finds very wide application. Below we consider a number of applications
of the EM algorithm, all of them directly relevant (there are more in the notes).
- Clustering/Vector Quantisation (VQ):
Quite often in pattern recognition/coding we observe a series of feature
vectors x_1, x_2, ..., x_N and want to find a (much smaller) set of
representative vectors y_1, y_2, ..., y_M (a "codebook") that can serve as a
model/substitute for the original vectors. We want this codebook to be
optimal in some sense, e.g. minimising the sum of squared quantisation
errors. We have observed the feature vectors, but how can we estimate the
codebook from them?
- Mixture probability density functions (PDFs):
To escape the limitations of unimodality and symmetry of a Gaussian PDF,
we often prefer to substitute it with a mixture of Gaussian PDFs.
If the match between a feature vector x and a single Gaussian PDF with mean
mu and covariance C is given by p(x) = N(x; mu, C), then its multimodal
mixture version is given by
p(x) = sum_{k=1}^{K} w_k N(x; mu_k, C_k),   with w_k >= 0 and sum_k w_k = 1
(a small numerical sketch of this density appears after this list).
Once again, we have observed the feature vectors x_1, ..., x_N, but how can
we use them to estimate the (unknown) underlying mixture parameters w_k,
mu_k and C_k? (By the way, the VQ problem is just a special case of mixture
PDFs!)
- Hidden Markov models (HMMs):
The scenario is very reminiscent of the above, only more complex.
Now our model consists of a number of states 1, 2, ..., Q. Each state i
has a transition probability a_ij joining it to state j. Each state also
has a PDF b_i(x), which in its turn may be a mixture as described above.
Typically we also have a probabilistic description pi_i of which states
the model may start off in. How do we use the feature vectors
x_1, x_2, ..., x_T to estimate these unknown parameters? (By the way, if
you think about it long enough, you will discover that the mixture PDF
problem is just a special (degenerate) case of the HMM!)
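To make the mixture notation concrete, here is a minimal Matlab sketch
(purely illustrative; the weights, means and covariances below are made-up
values, not estimates) of evaluating p(x) = sum_k w_k N(x; mu_k, C_k) at a
single two-dimensional feature vector:
x  = [0.5; -1.0];                     % observed 2-D feature vector
w  = [0.3 0.7];                       % mixture weights, sum to 1
mu = {[0; 0], [2; 1]};                % component means
C  = {eye(2), [1 0.3; 0.3 0.5]};      % component covariance matrices
p  = 0;
for k = 1:numel(w)
    d = x - mu{k};                    % deviation from the k-th mean
    p = p + w(k) * exp(-0.5 * d' * (C{k} \ d)) / (2*pi*sqrt(det(C{k})));
end
p                                     % mixture density value at x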
Before we continue on to the EM algorithm, we first need to
introduce the concept of estimation, particularly maximum likelihood
estimation (MLE). Consider a model with parameters theta. We have a number
of observations x_1, ..., x_N from which we want to estimate theta. The MLE
approach is to consider the PDF p(x | theta) in a slightly unusual way. In
contrast to normal use, we now take x as constant and theta as a variable.
When viewed from this perspective, we refer to the function value as a
likelihood instead of a probability density value. We now search for the
theta that maximises the likelihood (ML) for the given observations. This
value then serves as the ML estimate. When the relationship between the
observations and the unknown parameter is direct, we can solve this
maximisation via differentiation, numerical techniques or whatever. It is
from this approach that we obtain the expressions we so often use for
estimating means etc. In previous work where you estimated Gaussian PDFs
the relationship was direct and you (probably without knowing it) were
using MLE. Please note: although MLE is very popular, it is by no means the
only acceptable estimate. In later lectures we will investigate some
alternatives.
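As a concrete (and standard) example: for N independent scalar observations
x_1, ..., x_N assumed to come from a single Gaussian PDF with mean mu and
variance sigma^2, the log-likelihood, and the estimates obtained by setting
its derivatives with respect to mu and sigma^2 to zero, are
\log L(\mu, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 ,
\qquad
\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n ,
\qquad
\hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N}(x_n - \hat{\mu})^2 .
Note that the ML variance estimate divides by N; the familiar N-1 divisor
arises when one insists on an unbiased estimate instead (the PR813 question
in the project below).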
In the scenarios sketched above, notice that each time we have observations
that are only indirectly related to the parameters that we want to estimate.
For instance, in the Gaussian mixture case we would really like to know
which mixture component a given observation should be associated with. In
each case the problem was tricky because we were not sure
where in the underlying model any given observation fitted. However, if we did have
the underlying model available, we would probably be able to state where we
expect a given observation to fit in. This is where the EM algorithm comes in.
- Initialisation: Assume some initial model. (The better your initial
guess, the better your final model is likely to be.)
- Expectation Step: Use your current model and the observed data
to estimate the unknown factors that will make the relationship to the
underlying model direct.
- Maximisation Step: Based on this (estimated) information, now
make an ML estimate of the parameters we are really interested in.
- Convergence: Iterate from the E-step until convergence.
The algorithm is guaranteed to converge (and normally does so fairly quickly),
but the result may only be a local optimum! The absence of tunable parameters
is a major benefit. Compared to steepest descent and similar methods
this is nirvana!
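As an illustration only (not prescribed material; all variable names,
sample sizes and starting values below are made up), the four steps might
look as follows in Matlab for a one-dimensional two-component Gaussian
mixture:
x = [randn(1,100)-2, 1.5*randn(1,100)+3];   % some observed data
w = [0.5 0.5]; mu = [-1 1]; v = [1 1];      % Initialisation: rough starting model
for iter = 1:50
    % E-step: responsibility of each component for each observation
    for k = 1:2
        g(k,:) = w(k) * exp(-(x-mu(k)).^2 ./ (2*v(k))) ./ sqrt(2*pi*v(k));
    end
    r = g ./ (ones(2,1) * sum(g,1));        % normalise over the components
    % M-step: weighted ML re-estimates of weights, means and variances
    for k = 1:2
        Nk    = sum(r(k,:));
        w(k)  = Nk / numel(x);
        mu(k) = sum(r(k,:) .* x) / Nk;
        v(k)  = sum(r(k,:) .* (x-mu(k)).^2) / Nk;
    end
end                                         % Convergence: iterate E- and M-steps
[w; mu; v]                                  % inspect the estimated parameters
Each pass through the loop performs one E-step and one M-step; in practice
one would also monitor the log-likelihood to decide when to stop iterating.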
Project: (To hand in two lectures from today)
- (All) Play around with simple MLE to derive the (by now well-known)
expressions for estimating the parameters of Gaussian PDFs.
(PR813: why the N-1 term in the variance estimate?)
- (PR414: full covariance only)
Generate a doughnut-shaped annulus data set containing two-dimensional
feature vectors (one possible recipe is sketched after this project list).
Implement Gaussian mixture models based on full, diagonal and
spherical covariance matrices, respectively. Fit each GMM to the
annulus data set and display the resulting component densities on
top of the data set, using ellipses centred at the component means.
The shape of each ellipse is derived from the component covariance matrix
cv via the following Matlab code snippet:
t = (0:200)*2*pi/200;          % 201 points around the unit circle
circle = [cos(t); sin(t)];     % 2 x 201 matrix of unit-circle coordinates
[U,D] = eig(cv);               % eigenvectors/eigenvalues of the covariance matrix
ellipse = U*sqrt(D)*circle;    % one-standard-deviation ellipse; shift by the component mean before plotting
Discuss the differences between the three GMM versions, as well as
the effect of the number of components on the fit achieved.
- (All; PR414: full covariance only) Use Gaussian mixtures in the previous
classification tasks (simvowel, speakers, faces).
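Since the notes do not prescribe a recipe for the doughnut-shaped data set,
here is one possible Matlab sketch (the sample size, mean radius and radial
spread are arbitrary choices):
N     = 500;                                      % number of feature vectors (arbitrary)
theta = 2*pi*rand(1,N);                           % angles uniform around the circle
rad   = 3 + 0.3*randn(1,N);                       % radii scattered about a mean radius of 3
data  = [rad .* cos(theta); rad .* sin(theta)];   % 2 x N annulus data matrix
plot(data(1,:), data(2,:), '.'); axis equal       % inspect the doughnut shape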