Computer vision for dummies » Statistics — https://www.visiondummy.com
A blog about intelligent algorithms, machine learning, computer vision, data mining and more.

How to draw a covariance error ellipse?
https://www.visiondummy.com/2014/04/draw-error-ellipse-representing-covariance-matrix/
Thu, 03 Apr 2014

Introduction

In this post, I will show how to draw an error ellipse, a.k.a. confidence ellipse, for 2D normally distributed data. The error ellipse represents an iso-contour of the Gaussian distribution, and allows you to visualize a 2D confidence interval. The following figure shows a 95% confidence ellipse for a set of 2D normally distributed data samples. This confidence ellipse defines the region that contains 95% of all samples that can be drawn from the underlying Gaussian distribution.

Error ellipse

Figure 1. 2D confidence ellipse for normally distributed data

In the next sections we will discuss how to obtain confidence ellipses for different confidence values (e.g. 99% confidence interval), and we will show how to plot these ellipses using Matlab or C++ code.

Axis-aligned confidence ellipses

Before deriving a general methodology to obtain an error ellipse, let’s have a look at the special case where the major axis of the ellipse is aligned with the X-axis, as shown by the following figure:

Confidence ellipse

Figure 2. Confidence ellipse for uncorrelated Gaussian data

The above figure illustrates that the angle of the ellipse is determined by the covariance of the data. In this case, the covariance is zero, such that the data is uncorrelated, resulting in an axis-aligned error ellipse.

Table 1. Covariance matrix of the data shown in Figure 2

    \begin{equation*} \begin{bmatrix} 8.4213 & 0 \\ 0 & 0.9387 \end{bmatrix} \end{equation*}

Furthermore, it is clear that the magnitudes of the ellipse axes depend on the variance of the data. In our case, the largest variance is in the direction of the X-axis, whereas the smallest variance lies in the direction of the Y-axis.

In general, the equation of an axis-aligned ellipse with a major axis of length 2a and a minor axis of length 2b, centered at the origin, is defined by the following equation:

(1)   \begin{equation*} \left(\frac{ x } { a }\right)^2 + \left(\frac{ y } { b }\right)^2 = 1 \end{equation*}

In our case, the lengths of the axes are defined by the standard deviations \sigma_x and \sigma_y of the data, such that the equation of the error ellipse becomes:

(2)   \begin{equation*}  \left(\frac{ x } { \sigma_x }\right)^2 + \left(\frac{ y } { \sigma_y }\right)^2 = s \end{equation*}

where s defines the scale of the ellipse and could be any arbitrary number (e.g. s=1). The question is now how to choose s, such that the scale of the resulting ellipse represents a chosen confidence level (e.g. a 95% confidence level corresponds to s=5.991).

Our 2D data is sampled from a multivariate Gaussian with zero covariance. This means that both the x-values and the y-values are normally distributed too. Therefore, the left hand side of equation (2) actually represents the sum of squares of independent, normally distributed data samples. The sum of squared Gaussian data points is known to be distributed according to a so-called Chi-Square distribution. A Chi-Square distribution is defined in terms of ‘degrees of freedom’, which represent the number of unknowns. In our case there are two unknowns, and therefore two degrees of freedom.

Therefore, we can obtain the probability that the above sum, and thus s, equals a specific value from the Chi-Square density. In fact, since we are interested in a confidence interval, we are looking for the probability that s is less than or equal to a specific value, which can easily be obtained using the cumulative Chi-Square distribution. As statisticians are lazy people, we usually don’t try to calculate this probability ourselves, but simply look it up in a probability table: https://people.richland.edu/james/lecture/m170/tbl-chi.html.

For example, using this probability table we can easily find that, in the 2-degrees of freedom case:

    \begin{equation*} P(s < 5.991) = 1-0.05 = 0.95 \end{equation*}

Therefore, a 95% confidence interval corresponds to s=5.991. In other words, 95% of the data will fall inside the ellipse defined as:

(3)   \begin{equation*} \left(\frac{ x } { \sigma_x }\right)^2 + \left(\frac{ y } { \sigma_y }\right)^2 = 5.991 \end{equation*}

Similarly, a 99% confidence interval corresponds to s=9.210 and a 90% confidence interval corresponds to s=4.605.
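For two degrees of freedom these scale values need not come from a table at all: the cumulative Chi-Square distribution has the closed form F(s) = 1 - e^{-s/2}, which inverts to s = -2\ln(1-p). As a quick sanity check, the Python snippet below reproduces the tabulated values (the helper name chi2_quantile_2dof is ours, not a library function; the article's own samples are in Matlab and C++):

```python
import math

def chi2_quantile_2dof(p):
    """Inverse CDF of the Chi-Square distribution with 2 degrees of
    freedom: solves 1 - exp(-s/2) = p for s. (Hypothetical helper;
    only valid for 2 degrees of freedom.)"""
    return -2.0 * math.log(1.0 - p)

for p in (0.90, 0.95, 0.99):
    print(p, round(chi2_quantile_2dof(p), 3))  # 4.605, 5.991, 9.210
```

For other degrees of freedom no such closed form exists, and you would fall back on a statistics library or on the probability table linked above.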

The error ellipse shown by figure 2 can therefore be drawn as an ellipse with a major axis length equal to 2\sigma_x \sqrt{5.991} and a minor axis length equal to 2\sigma_y \sqrt{5.991}.

Arbitrary confidence ellipses

In cases where the data is correlated, such that the covariance is nonzero, the resulting error ellipse will not be axis aligned. In this case, the reasoning of the above paragraph only holds if we temporarily define a new coordinate system in which the ellipse becomes axis-aligned, and then rotate the resulting ellipse afterwards.

In other words, whereas we calculated the variances \sigma_x and \sigma_y parallel to the x-axis and y-axis earlier, we now need to calculate these variances parallel to what will become the major and minor axis of the confidence ellipse. The directions in which these variances need to be calculated are illustrated by a pink and a green arrow in figure 1.

Error ellipse

Figure 1. 2D confidence ellipse for normally distributed data

These directions are actually the directions in which the data varies the most, and are defined by the covariance matrix. The covariance matrix can be considered as a matrix that linearly transformed some original data to obtain the currently observed data. In a previous article about eigenvectors and eigenvalues we showed that the direction vectors along such a linear transformation are the eigenvectors of the transformation matrix. Indeed, the vectors shown by pink and green arrows in figure 1, are the eigenvectors of the covariance matrix of the data, whereas the length of the vectors corresponds to the eigenvalues.


The eigenvalues therefore represent the spread of the data in the direction of the eigenvectors. In other words, the eigenvalues represent the variance of the data in the direction of the eigenvectors. In the case of axis-aligned error ellipses, i.e. when the covariance equals zero, the eigenvalues equal the diagonal entries (the variances) of the covariance matrix, and the eigenvectors coincide with the x-axis and y-axis. In the case of arbitrary correlated data, the eigenvectors indicate the directions of the largest spread of the data, whereas the eigenvalues define how large this spread really is.

Thus, the 95% confidence ellipse can be defined similarly to the axis-aligned case, with the major axis of length 2\sqrt{5.991 \lambda_1} and the minor axis of length 2\sqrt{5.991 \lambda_2}, where \lambda_1 and \lambda_2 represent the eigenvalues of the covariance matrix.

To obtain the orientation of the ellipse, we simply calculate the angle of the largest eigenvector towards the x-axis:

(4)   \begin{equation*} \alpha = \arctan \frac{\mathbf{v}_1(y)}{\mathbf{v}_1(x)} \end{equation*}

where \mathbf{v}_1 is the eigenvector of the covariance matrix that corresponds to the largest eigenvalue.

Based on the minor and major axis lengths and the angle \alpha between the major axis and the x-axis, it becomes trivial to plot the confidence ellipse. Figure 3 shows error ellipses for several confidence values:

Error ellipses

Figure 3. Confidence ellipses for normally distributed data
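As a sketch of the full recipe, the snippet below computes the eigenvalues of a 2x2 covariance matrix in closed form and derives the axis lengths and the angle of equation (4). It is given in Python purely for illustration (the function error_ellipse is hypothetical; in practice you would use the linked Matlab or C++ code, or a library eigensolver):

```python
import math

def error_ellipse(cov, s=5.991):
    """Major/minor axis lengths and orientation of the confidence
    ellipse for a symmetric 2x2 covariance matrix [[a, b], [b, c]].
    s is the Chi-Square scale (5.991 gives a 95% ellipse)."""
    (a, b), (_, c) = cov
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2.0
    root = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    lam1, lam2 = half_trace + root, half_trace - root
    # Angle of the largest eigenvector towards the x-axis, equation (4).
    if b != 0:
        angle = math.atan2(lam1 - a, b)
    else:
        angle = 0.0 if a >= c else math.pi / 2.0
    # Axis lengths 2*sqrt(s*lambda), as derived above.
    return 2.0 * math.sqrt(s * lam1), 2.0 * math.sqrt(s * lam2), angle

# Zero covariance (Table 1): the ellipse is axis-aligned, so angle is 0.
major, minor, angle = error_ellipse([[8.4213, 0.0], [0.0, 0.9387]])
print(major, minor, angle)
```

Given the axis lengths and the angle, any plotting library's ellipse primitive can then draw the contour.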

Source Code

Matlab source code
C++ source code (uses OpenCV)

Conclusion

In this article we showed how to obtain the error ellipse for 2D normally distributed data, according to a chosen confidence value. This is often useful when visualizing or analyzing data and will be of interest in a future article about PCA.

Furthermore, source code samples were provided for Matlab and C++.

If you’re new to this blog, don’t forget to subscribe, or follow me on twitter!


Why divide the sample variance by N-1?
https://www.visiondummy.com/2014/03/divide-variance-n-1/
Fri, 07 Mar 2014

Introduction

In this article, we will derive the well known formulas for calculating the mean and the variance of normally distributed data, in order to answer the question in the article’s title. However, for readers who are not interested in the ‘why’ of this question but only in the ‘when’, the answer is quite simple:

If you have to estimate both the mean and the variance of the data (which is typically the case), then divide by N-1, such that the variance is obtained as:

    \begin{equation*} \sigma^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu)^2 \end{equation*}

If, on the other hand, the mean of the true population is known such that only the variance needs to be estimated, then divide by N, such that the variance is obtained as:

    \begin{equation*} \sigma^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2 \end{equation*}

Whereas the former is what you will typically need, an example of the latter would be the estimation of the spread of white Gaussian noise. Since the mean of white Gaussian noise is known to be zero, only the variance needs to be estimated in this case.

If data is normally distributed we can completely characterize it by its mean \mu and its variance \sigma^2. The variance is the square of the standard deviation \sigma, which represents the average deviation of each data point from the mean. In other words, the variance represents the spread of the data. For normally distributed data, 68.3% of the observations will have a value between \mu-\sigma and \mu+\sigma. This is illustrated by the following figure, which shows a Gaussian density function with mean \mu=10 and variance \sigma^2 = 3^2 = 9:

Gaussian density

Figure 1. Gaussian density function. For normally distributed data, 68% of the samples fall within the interval defined by the mean plus and minus the standard deviation.
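The 68.3% figure can be verified from the standard normal cumulative distribution function, which Python's standard library exposes through the error function; a minimal check (the helper normal_cdf is ours, for illustration):

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Probability of falling within one standard deviation of the mean.
p = normal_cdf(1.0) - normal_cdf(-1.0)
print(round(p, 3))  # 0.683
```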

Usually we do not have access to the complete population of the data. In the above example, we would typically have a few observations at our disposal but we do not have access to all possible observations that define the x-axis of the plot. For example, we might have the following set of observations:

Table 1
Observation ID Observed Value
Observation 1 10
Observation 2 12
Observation 3 7
Observation 4 5
Observation 5 11

If we now calculate the empirical mean by summing up all values and dividing by the number of observations, we have:

(1)   \begin{equation*} \mu = \frac{10+12+7+5+11}{5} = 9. \end{equation*}

Usually we assume that the empirical mean is close to the actually unknown mean of the distribution, and thus assume that the observed data is sampled from a Gaussian distribution with mean \mu=9. In this example, the actual mean of the distribution is 10, so the empirical mean indeed is close to the actual mean.

The variance of the data is calculated as follows:

(2)   \begin{equation*} \sigma^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu)^2 = \frac{(10-9)^2+(12-9)^2+(7-9)^2+(5-9)^2+(11-9)^2}{4} = 8.5. \end{equation*}

Again, we usually assume that this empirical variance is close to the real and unknown variance of underlying distribution. In this example, the real variance was 9, so indeed the empirical variance is close to the real variance.

The question at hand is now why the formulas used to calculate the empirical mean and the empirical variance are correct. In fact, another often used formula to calculate the variance, is defined as follows:

(3)   \begin{equation*} \sigma^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2 = \frac{(10-9)^2+(12-9)^2+(7-9)^2+(5-9)^2+(11-9)^2}{5} = 6.8. \end{equation*}

The only difference between equation (2) and (3) is that the former divides by N-1, whereas the latter divides by N. Both formulas are actually correct, but when to use which one depends on the situation.
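Python's standard library implements both conventions, which makes the distinction easy to check against the numbers above (statistics.variance divides by N-1, statistics.pvariance by N; shown here in Python although the article's own samples are in Matlab and C++):

```python
import statistics

data = [10, 12, 7, 5, 11]

print(statistics.mean(data))       # 9, the empirical mean of equation (1)
print(statistics.variance(data))   # 8.5, equation (2): divides by N-1
print(statistics.pvariance(data))  # 6.8, equation (3): divides by N
```

statistics.pvariance also accepts a known true mean via its mu parameter, which covers the known-mean case discussed in the introduction.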

In the following sections, we will completely derive the formulas that best approximate the unknown variance and mean of a normal distribution, given a few samples from this distribution. We will show in which cases to divide the variance by N and in which cases to normalize by N-1.

A formula that approximates a parameter (mean or variance) is called an estimator. In the following, we will denote the real and unknown parameters of the distribution by \hat{\mu} and \hat{\sigma}^2. The estimators, e.g. the empirical average and empirical variance, are denoted as \mu and \sigma^2.

To find the optimal estimators, we first need an analytical expression for the likelihood of observing a specific data point x_i, given the fact that the population is normally distributed with a given mean \mu and standard deviation \sigma. A normal distribution with known parameters is usually denoted as N(\mu, \sigma^2). The likelihood function is then:

(4)   \begin{align*} &x_i \sim N(\mu, \sigma^2) \\ &\Rightarrow P(x_i; \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2\sigma^2}(x_i - \mu)^2}. \end{align*}

To calculate the mean and variance, we obviously need more than one sample from this distribution. In the following, let \vec{x}=(x_1, x_2, \ldots, x_N) be a vector that contains all the available samples (e.g. all the values from the example in table 1). If all these samples are statistically independent, we can write their joint likelihood function as the product of all individual likelihoods:

(5)   \begin{equation*} P(\vec{x}; \mu, \sigma^2) = P(x_1, x_2, ..., x_N; \mu, \sigma^2) = P(x_1; \mu, \sigma^2)P(x_2; \mu, \sigma^2)...P(x_N; \mu, \sigma^2) = \prod_{i=1}^N P(x_i; \mu, \sigma^2) \end{equation*}

Plugging equation (4) into equation (5) then yields an analytical expression for this joint probability density function:

(6)   \begin{align*} P(\vec{x}; \mu, \sigma^2) &= \prod_{i=1}^N \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2\sigma^2}(x_i - \mu)^2}\\ &= \frac{1}{(2 \pi \sigma^2)^{\frac{N}{2}}} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^N(x_i - \mu)^2} \end{align*}

Equation (6) will be important in the next sections and will be used to derive the well known expressions for the estimators of the mean and the variance of a Gaussian distribution.

Minimum variance, unbiased estimators

To determine if an estimator is a ‘good’ estimator, we first need to define what a ‘good’ estimator really is. The goodness of an estimator depends on two measures, namely its bias and its variance (yes, we will talk about the variance of the mean-estimator and the variance of the variance-estimator). Both measures are briefly discussed in this section.

Parameter bias

Imagine that we could obtain different (disjoint) subsets of the complete population. In analogy to our previous example, imagine that, apart from the data in Table 1, we also have a Table 2 and a Table 3 with different observations. Then a good estimator for the mean would be an estimator that on average equals the real mean. Although we can live with the idea that the empirical mean from one subset of data is not equal to the real mean, as in our example, a good estimator should make sure that the average of the estimated means over all subsets equals the real mean. This constraint is expressed mathematically by stating that the Expected Value of the estimator should equal the real parameter value:

(7)   \begin{align*}  &E[\mu] = \hat{\mu}\\ &E[\sigma^2] = \hat{\sigma}^2 \end{align*}

If the above conditions hold, then the estimators are called ‘unbiased estimators’. If the conditions do not hold, the estimators are said to be ‘biased’, since on average they will either underestimate or overestimate the true value of the parameter.

Parameter variance

Unbiased estimators guarantee that on average they yield an estimate that equals the real parameter. However, this does not mean that each estimate is a good estimate. For instance, if the real mean is 10, an unbiased estimator could estimate the mean as 50 on one population subset and as -30 on another subset. The expected value of the estimate would then indeed be 10, which equals the real parameter, but the quality of the estimator clearly also depends on the spread of each estimate. An estimator that yields the estimates (10, 15, 5, 12, 8) for five different subsets of the population is unbiased just like an estimator that yields the estimates (50, -30, 100, -90, 10). However, all estimates from the first estimator are closer to the true value than those from the second estimator.

Therefore, a good estimator not only has a low bias, but also yields a low variance. This variance is expressed as the mean squared error of the estimator:

    \begin{align*} &Var(\mu) = E[(\hat{\mu} - \mu)^2]\\ &Var(\sigma^2) = E[(\hat{\sigma}^2 - \sigma^2)^2] \end{align*}

A good estimator is therefore a low bias, low variance estimator. The optimal estimator, if such an estimator exists, is the one that has no bias and a variance that is lower than that of any other possible estimator. Such an estimator is called the minimum variance, unbiased (MVU) estimator. In the next section, we will derive the analytical expressions for the mean and the variance estimators of a Gaussian distribution. We will show that the MVU estimator for the variance of a normal distribution requires us to divide the variance by N under certain assumptions, and requires us to divide by N-1 if these assumptions do not hold.

Maximum Likelihood estimation

Although numerous techniques can be used to obtain an estimator of the parameters based on a subset of the population data, the simplest of all is probably the maximum likelihood approach.

The probability of observing \vec{x} was defined by equation (6) as P(\vec{x}; \mu, \sigma^2). If we fix \mu and \sigma^2 in this function, while letting \vec{x} vary, we obtain the Gaussian distribution as plotted by figure 1. However, we could also choose a fixed \vec{x} and let \mu and/or \sigma^2 vary. For example, we can choose \vec{x}=(10, 12, 7, 5, 11) as in our previous example. We also choose a fixed \mu=10, and let \sigma^2 vary. Figure 2 shows the resulting likelihood for each value of \sigma^2, given the fixed \vec{x} and \mu:

Maximum likelihood parameter estimation

Figure 2. This plot shows the likelihood of observing fixed data \vec{x} if the data is normally distributed with a chosen, fixed \mu=10, plotted against various values of a varying \sigma^2.

In the above figure, we calculated the likelihood P(\vec{x};\sigma^2) by varying \sigma^2 for a fixed \mu=10. Each point in the resulting curve represents the likelihood that the observation \vec{x} is a sample from a Gaussian distribution with parameter \sigma^2. The parameter value that corresponds to the highest likelihood is then most likely the parameter of the distribution our data originated from. Therefore, we can determine the optimal \sigma^2 by finding the maximum in this likelihood curve. In this example, the maximum is at \sigma^2 = 7.8, such that the standard deviation is \sqrt{\sigma^2} \approx 2.8. Indeed, if we calculate the variance in the traditional way, with a given \mu=10, we find that it equals 7.8:

    \begin{equation*} \frac{(10-10)^2+(12-10)^2+(7-10)^2+(5-10)^2+(11-10)^2}{5} = 7.8. \end{equation*}

Therefore, the formula to compute the variance based on sample data is simply derived by finding the peak of this likelihood function. Furthermore, instead of fixing \mu, we can let both \mu and \sigma^2 vary at the same time. Finding both estimators then corresponds to finding the maximum of a two-dimensional likelihood function.
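This peak-finding can be mimicked numerically. The sketch below (a brute-force grid search, purely for illustration; the article itself finds the maximum analytically in the next section) evaluates the log of equation (6) for the example data with fixed \mu = 10 over a grid of candidate variances:

```python
import math

data = [10, 12, 7, 5, 11]
mu = 10.0

def log_likelihood(s):
    """Logarithm of equation (6) for the fixed data and fixed mu,
    viewed as a function of the variance s = sigma^2."""
    n = len(data)
    ssd = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2.0 * math.pi * s) - ssd / (2.0 * s)

# Brute-force search over a grid of candidate variances.
grid = [0.1 * k for k in range(1, 301)]
s_ml = max(grid, key=log_likelihood)
print(round(s_ml, 3))  # 7.8, the mean squared deviation from mu
```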

To find the maximum of a function, we simply set its derivative to zero. If we want to find the maximum of a function with two variables, we calculate the partial derivative towards each of these variables and set both to zero. In the following, let \hat{\mu}_{ML} be the optimal estimator for the population mean as obtained using the maximum likelihood method, and let \hat{\sigma}^2_{ML} be the optimal estimator for the variance. To maximize the likelihood function we simply calculate its (partial) derivatives and set them to zero as follows:

    \begin{align*} &\hat{\mu}_{ML} = \arg\max_\mu P(\vec{x}; \mu, \sigma^2)\\ &\Rightarrow \frac{\partial P(\vec{x}; \mu, \sigma^2)}{\partial \mu} = 0 \end{align*}

and

    \begin{align*} &\hat{\sigma}^2_{ML} = \arg\max_{\sigma^2} P(\vec{x}; \mu, \sigma^2)\\ &\Rightarrow \frac{\partial P(\vec{x}; \mu, \sigma^2)}{\partial \sigma^2} = 0 \end{align*}

In the following paragraphs we will use this technique to obtain the MVU estimators of both \hat{\mu} and \hat{\sigma}. We consider two cases:

The first case assumes that the true mean of the distribution \hat{\mu} is known. Therefore, we only need to estimate the variance and the problem then corresponds to finding the maximum in a one-dimensional likelihood function, parameterized by \sigma^2. Although this situation does not occur often in practice, it definitely has practical applications. For instance, if we know that a signal (e.g. the color value of a pixel in an image) should have a specific value, but the signal has been polluted by white noise (Gaussian noise with zero mean), then the mean of the distribution is known and we only need to estimate the variance.

The second case deals with the situation where both the true mean and the true variance are unknown. This is the case you would encounter most and where you would obtain an estimate of the mean and the variance based on your sample data.

In the next paragraphs we will show that each case results in a different MVU estimator. More specifically, the first case requires the variance estimator to be normalized by N to be MVU, whereas the second case requires division by N-1.

Estimating the variance if the mean is known

Parameter estimation

If the true mean of the distribution is known, then the likelihood function is only parameterized on \sigma^2. Obtaining the maximum likelihood estimator then corresponds to solving:

(8)   \begin{equation*} \hat{\sigma}^2_{ML} = \arg\max_{\sigma^2} P(\vec{x}; \sigma^2) \end{equation*}

However, calculating the derivative of P(\vec{x}; \sigma^2), defined by equation (6), is rather involved due to the exponent in the function. In fact, it is much easier to maximize the log-likelihood function instead of the likelihood function itself. Since the logarithm is a monotonically increasing function, the maximum will be attained at the same point. Therefore, we solve the following problem instead:

(9)   \begin{equation*} \hat{\sigma}^2_{ML} = \arg\max_{\sigma^2}\log(P(\vec{x}; \sigma^2)). \end{equation*}

In the following we set s=\sigma^2 to obtain a simpler notation. To find the maximum of the log-likelihood function, we simply calculate the derivative of the logarithm of equation (6) and set it to zero:

    \begin{align*} &\frac{\partial \log(P(\vec{x}; \sigma^2))}{\partial \sigma^2} = 0\\ &\Leftrightarrow \frac{\partial \log(P(\vec{x}; s))}{\partial s} = 0\\ &\Leftrightarrow \frac{\partial}{\partial s} \log \left( \frac{1}{(2 \pi s)^{\frac{N}{2}}} e^{-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2} \right) = 0\\ &\Leftrightarrow \frac{\partial}{\partial s} \log \left( (2 \pi)^{-\frac{N}{2}} \right) + \frac{\partial}{\partial s} \log \left( s^{-\frac{N}{2}} \right) + \frac{\partial}{\partial s} \left(-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2 \right) = 0\\ &\Leftrightarrow -\frac{N}{2} \frac{\partial}{\partial s} \log \left( s \right) - \frac{1}{2} \sum_{i=1}^N(x_i - \mu)^2 \frac{\partial}{\partial s} \left(\frac{1}{s}\right) = 0\\ &\Leftrightarrow -\frac{N}{2s} + \frac{1}{2s^2} \sum_{i=1}^N(x_i - \mu)^2 = 0\\ &\Leftrightarrow \frac{N}{2s^2} \left (\frac{1}{N} \sum_{i=1}^N(x_i - \mu)^2 - s \right) = 0 \end{align*}

It is clear that if N > 0, then the only possible solution to the above is:

(10)   \begin{equation*} s = \sigma^2 = \frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2. \end{equation*}

Note that this maximum likelihood estimator for \hat{\sigma}^2 is indeed the traditional formula to calculate the variance of normal data. The normalization factor is \frac{1}{N}.

However, the maximum likelihood method does not guarantee that the resulting estimator is unbiased. On the other hand, if the obtained estimator is unbiased, then the maximum likelihood method does guarantee that it is also minimum variance, and thus MVU. Therefore, we need to check whether the estimator in equation (10) is unbiased.

Performance evaluation

To check if the estimator defined by equation (10) is unbiased, we need to check if the condition of equation (7) holds, and thus if

    \begin{equation*} E[s] = \hat{s}. \end{equation*}

To do this, we plug equation (10) into E[s] and write:

    \begin{align*} E[s] &= E \left[\frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2 \right] = \frac{1}{N} \sum_{i=1}^N E \left[(x_i - \mu)^2 \right] = \frac{1}{N} \sum_{i=1}^N E \left[x_i^2 - 2x_i \mu + \mu^2 \right]\\ &= \frac{1}{N} \left( N E[x_i^2] -2N \mu E[x_i] + N \mu^2 \right) \\ &= \frac{1}{N} \left( N E[x_i^2] -2N \mu^2 + N \mu^2 \right) \\ &= \frac{1}{N} \left( N E[x_i^2] -N \mu^2 \right) \\ \end{align*}

Furthermore, an important property of variance is that the true variance \hat{s} can be written as \hat{s} = E[x_i^2] - E[x_i]^2 such that E[x_i^2] = \hat{s} + E[x_i]^2 = \hat{s} + \mu^2. Using this property in the above equation yields:

    \begin{align*} E[s] &= \frac{1}{N} \left( N E[x_i^2] -N \mu^2 \right) \\ &= \frac{1}{N} \left( N \hat{s} + N \mu^2 -N \mu^2 \right)\\ &= \frac{1}{N} \left( N \hat{s} \right)\\ &= \hat{s} \end{align*}

Since E[s]=\hat{s}, the condition shown by equation (7) holds, and therefore the obtained estimator for the variance \hat{s} of the data is unbiased. Furthermore, because the maximum likelihood method guarantees that an unbiased estimator is also minimum variance (MVU), this means that no other estimator exists that can do better than the one obtained here.
Therefore, we have to divide by N instead of N-1 while calculating the variance of normally distributed data, if the true mean of the underlying distribution is known.
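A small Monte Carlo experiment supports this: repeatedly drawing N = 5 samples from N(10, 9) and averaging the divide-by-N estimator (computed with the known mean) lands on the true variance. The simulation below is an illustrative sketch, not part of the original article:

```python
import random

random.seed(1)
true_mu, true_sigma = 10.0, 3.0   # true variance is 9
N, trials = 5, 200000

acc = 0.0
for _ in range(trials):
    sample = [random.gauss(true_mu, true_sigma) for _ in range(N)]
    # Known true mean: divide the sum of squared deviations by N.
    acc += sum((x - true_mu) ** 2 for x in sample) / N
print(acc / trials)  # close to the true variance of 9
```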

Estimating the variance if the mean is unknown

Parameter estimation

In the previous section, the true mean of the distribution was known, such that we only had to find an estimator for the variance of the data. However, if the true mean is not known, then an estimator has to be found for the mean too. Furthermore, this mean estimate is used by the variance estimator. As a result, we will show that the earlier obtained estimator for the variance is no longer unbiased. Furthermore, we will show that we can ‘unbias’ the estimator in this case by dividing by N-1 instead of by N, which slightly increases the variance of the estimator.

As before, we use the maximum likelihood method to obtain the estimators based on the log-likelihood function. We first find the ML estimator for \hat{\mu}:

    \begin{align*} &\frac{\partial \log(P(\vec{x}; s, \mu))}{\partial \mu} = 0\\ &\Leftrightarrow \frac{\partial}{\partial \mu} \log \left( \frac{1}{(2 \pi s)^{\frac{N}{2}}} e^{-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2} \right) = 0\\ &\Leftrightarrow \frac{\partial}{\partial \mu} \log \left( \frac{1}{(2 \pi)^{\frac{N}{2}}} \right) + \frac{\partial}{\partial \mu} \log \left(e^{-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2} \right) = 0\\ &\Leftrightarrow \frac{\partial}{\partial \mu} \left(-\frac{1}{2s}\sum_{i=1}^N(x_i - \mu)^2 \right) = 0\\ &\Leftrightarrow -\frac{1}{2s}\frac{\partial}{\partial \mu} \left(\sum_{i=1}^N(x_i - \mu)^2 \right) = 0\\ &\Leftrightarrow -\frac{1}{2s} \left(\sum_{i=1}^N -2(x_i - \mu) \right) = 0\\ &\Leftrightarrow \frac{1}{s} \left(\sum_{i=1}^N (x_i - \mu) \right) = 0\\ &\Leftrightarrow \frac{N}{s} \left( \frac{1}{N} \sum_{i=1}^N (x_i) - \mu \right) = 0\\ \end{align*}

If N>0, then it is clear that the above equation only has a solution if:

(11)   \begin{equation*} \mu = \frac{1}{N} \sum_{i=1}^N (x_i). \end{equation*}

Note that this is indeed the well known formula to calculate the mean of a distribution. Although we all knew this formula, we have now proved that it is the maximum likelihood estimator for the true and unknown mean \hat{\mu} of a normal distribution. For now, we will just assume that the estimator that we found earlier for the variance, defined by equation (10), is still the MVU variance estimator. In the next section however, we will show that this estimator is no longer unbiased.

Performance evaluation

To check if the estimator \mu for the true mean \hat{\mu} is unbiased, we have to make sure that the condition of equation (7) holds:

    \begin{equation*} E[\mu] = E \left[\frac{1}{N} \sum_{i=1}^N (x_i) \right] = \frac{1}{N}\sum_{i=1}^N E[x_i] = \frac{1}{N} N E[x_i] = \frac{1}{N} N \hat{\mu} = \hat{\mu}.  \end{equation*}

Since E[\mu] = \hat{\mu}, the obtained estimator for the mean of the distribution is unbiased. Since the maximum likelihood method guarantees that an unbiased estimator is also the minimum variance estimator, we have proved that \mu is the MVU estimator of the mean.

To check if the earlier found estimator s for the variance \hat{s} is still unbiased when it is based on the empirical mean \mu instead of the true mean \hat{\mu}, we simply plug the obtained estimator \mu into the earlier derived estimator s of equation (10):

    \begin{align*} s &= \sigma^2 = \frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2\\ &=\frac{1}{N}\sum_{i=1}^N \left(x_i - \frac{1}{N} \sum_{j=1}^N x_j \right)^2\\ &=\frac{1}{N}\sum_{i=1}^N \left[x_i^2 - 2 x_i \frac{1}{N} \sum_{j=1}^N x_j + \left(\frac{1}{N} \sum_{j=1}^N x_j \right)^2 \right]\\ &=\frac{\sum_{i=1}^N x_i^2}{N} - \frac{2\sum_{i=1}^N x_i \sum_{j=1}^N x_j}{N^2} + \left(\frac{\sum_{i=1}^N x_i}{N} \right)^2\\ &=\frac{\sum_{i=1}^N x_i^2}{N} - \left(\frac{\sum_{i=1}^N x_i}{N} \right)^2 \end{align*}

To check if the estimator is still unbiased, we now need to check again if the condition of equation (7) holds:

    \begin{align*} E[s] &= E \left[ \frac{\sum_{i=1}^N x_i^2}{N} - \left(\frac{\sum_{i=1}^N x_i}{N} \right)^2 \right ] \\ & = \frac{\sum_{i=1}^N E[x_i^2]}{N} - \frac{E[(\sum_{i=1}^N x_i)^2]}{N^2} \\ \end{align*}

As mentioned in the previous section, an important property of variance is that the true variance \hat{s} can be written as \hat{s} = E[x_i^2] - E[x_i]^2 such that E[x_i^2] = \hat{s} + E[x_i]^2 = \hat{s} + \mu^2. Using this property in the above equation yields:

    \begin{align*} E[s] &= \frac{\sum_{i=1}^N E[x_i^2]}{N} - \frac{E[(\sum_{i=1}^N x_i)^2]}{N^2} \\ &= \hat{s} + \mu^2 - \frac{E[(\sum_{i=1}^N x_i)^2]}{N^2} \\ &= \hat{s} + \mu^2 - \frac{E[\sum_{i=1}^N x_i^2 + \sum_i^N \sum_{j\neq i}^N x_i x_j]}{N^2} \\ &= \hat{s} + \mu^2 - \frac{N(\hat{s}+\mu^2) + \sum_i^N \sum_{j\neq i}^N E[x_i] E[x_j]}{N^2} \\ &= \hat{s} + \mu^2 - \frac{N(\hat{s}+\mu^2) + N(N-1)\mu^2}{N^2} \\ &= \hat{s} + \mu^2 - \frac{\hat{s}+\mu^2 + (N-1)\mu^2}{N} \\ &= \hat{s} + \mu^2 - \frac{\hat{s}}{N} - \frac{\mu^2}{N} - \mu^2 + \frac{\mu^2}{N}\\ &= \hat{s} - \frac{\hat{s}}{N}\\ &= \hat{s} \left( 1 - \frac{1}{N} \right)\\ &= \hat{s} \left(\frac{N-1}{N} \right) \end{align*}

Since clearly E[s] \neq \hat{s}, this shows that the estimator for the variance of the distribution is no longer unbiased. In fact, this estimator on average underestimates the true variance by a factor \frac{N-1}{N}. As the number of samples approaches infinity (N \rightarrow \infty), this bias disappears. For small sample sets, however, the bias is significant and should be eliminated.

Fixing the bias

Since the bias is merely a factor, we can eliminate it by scaling the biased estimator s defined by equation (10) by the inverse of the bias. We therefore define a new, unbiased estimate s\prime as follows:

    \begin{align*} s\prime &= \left ( \frac{N-1}{N} \right )^{-1} s\\ s\prime &= \left ( \frac{N-1}{N} \right )^{-1} \frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2\\ s\prime &= \left ( \frac{N}{N-1} \right ) \frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2\\ s\prime &= \frac{1}{N-1}\sum_{i=1}^N(x_i - \mu)^2\\ \end{align*}

This estimator is now unbiased and indeed resembles the traditional formula to calculate the variance, where we divide by N-1 instead of N. However, note that the resulting estimator is no longer the overall minimum variance estimator, but it is the estimator with the minimum variance amongst all unbiased estimators. If we divide by N, then the estimator is biased, and if we divide by N-1, the estimator is not the overall minimum variance estimator. However, in general, a biased estimator is much worse than an estimator with a slightly higher variance. Therefore, if the mean of the population is unknown, division by N-1 should be used instead of division by N.
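The bias factor \frac{N-1}{N} and its fix can likewise be checked by simulation: with N = 5 and a true variance of 9, the divide-by-N estimator built on the empirical mean should average about 9 \cdot \frac{4}{5} = 7.2, while dividing by N-1 recovers 9. An illustrative sketch, not part of the original article:

```python
import random

random.seed(2)
true_var = 9.0
N, trials = 5, 200000

biased = unbiased = 0.0
for _ in range(trials):
    sample = [random.gauss(10.0, 3.0) for _ in range(N)]
    m = sum(sample) / N                      # empirical mean
    ssd = sum((x - m) ** 2 for x in sample)  # squared deviations from it
    biased += ssd / N
    unbiased += ssd / (N - 1)
print(biased / trials)    # about 7.2 = 9 * (N-1)/N: underestimates
print(unbiased / trials)  # about 9: dividing by N-1 removes the bias
```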

Conclusion

In this article, we showed where the usual formulas for calculating the mean and the variance of normally distributed data come from. Furthermore, we have proven that the normalization factor in the variance estimator formula should be \frac{1}{N} if the true mean of the population is known, and should be \frac{1}{N-1} if the mean itself also has to be estimated.

