Approximating the Sample Proportion as a Normal Distribution

We have already seen that, if the underlying data points of a sample can be modelled as a Bernoulli distribution, the mean and the variance of the sample proportion are given by

$E[\hat{p}] = p$

$\operatorname{Var}(\hat{p}) = \frac{p(1-p)}{n}$

However, we have not yet discussed the distribution of the sample proportion. The sample proportion is the number of successes divided by the number of data points in the sample. Without that division, we would simply have the number of successes, which we know follows the binomial distribution. This means that the sample proportion follows a scaled binomial distribution, and we can use this to calculate probabilities related to the sample proportion.
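As a minimal sketch of how such a probability could be calculated exactly, the code below sums binomial probabilities to find the chance that the sample proportion is at most some threshold. The function names and the values n = 100, p = 0.5 and threshold 0.45 are illustrative assumptions, not part of the course material.

```python
from math import comb, floor


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def prob_proportion_at_most(threshold, n, p):
    """Exact P(p_hat <= threshold), where p_hat = X / n."""
    # p_hat <= threshold is the same event as X <= threshold * n.
    # The small tolerance guards against floating-point round-off in the product.
    max_successes = min(n, floor(threshold * n + 1e-9))
    return sum(binomial_pmf(k, n, p) for k in range(max_successes + 1))


# Illustrative values (assumed): 100 data points, true proportion 0.5.
print(prob_proportion_at_most(0.45, n=100, p=0.5))
```

The key step is rewriting the event about the proportion as an event about the count of successes, so that the binomial distribution can be used directly.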

For a binomial distribution with a large number of trials $n$, the probability distribution is very similar to a normal distribution with the parameters

$\mu = np$

$\sigma^2 = np(1-p)$
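One way to see this similarity is to compare the two distributions numerically. The sketch below evaluates the binomial probability and the normal density at a few values of k. The helper name `compare_binomial_to_normal` and the choice of n = 100 and p = 0.5 are assumptions made for illustration.

```python
from math import comb, sqrt
from statistics import NormalDist


def compare_binomial_to_normal(n, p, ks):
    """Return (k, binomial pmf at k, normal density at k) for each k in ks."""
    # Normal distribution with mu = np and sigma^2 = np(1 - p).
    normal = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    rows = []
    for k in ks:
        pmf = comb(n, k) * p**k * (1 - p)**(n - k)
        rows.append((k, pmf, normal.pdf(k)))
    return rows


# Illustrative values (assumed): n = 100 trials, p = 0.5.
for k, pmf, density in compare_binomial_to_normal(100, 0.5, range(40, 61, 5)):
    print(f"k={k}: binomial={pmf:.4f}, normal={density:.4f}")
```

Trying smaller values of n, or values of p close to 0 or 1, should show the two columns drifting further apart, which is why the approximation is stated for large n.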

This is not just a lucky coincidence, but in fact a fundamental result in probability theory called the Central Limit Theorem. We will not explore the theorem in this course, but it is important to note that this result holds only under the assumption that each data point is independent of the others.

Since the sample proportion is a scaled binomial distribution, this means we can approximate its distribution as a normal distribution with parameters

$\mu = p$

$\sigma^2 = \frac{p(1-p)}{n}$
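To see where these parameters come from, write $X$ for the number of successes (a symbol introduced here for illustration), so that $\hat{p} = X/n$. Dividing a random variable by $n$ divides its mean by $n$ and its variance by $n^2$, which gives

$E[\hat{p}] = \frac{E[X]}{n} = \frac{np}{n} = p$

$\operatorname{Var}(\hat{p}) = \frac{\operatorname{Var}(X)}{n^2} = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}$

These agree with the mean and variance of the sample proportion stated at the start of the section.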

This approximation means we can calculate probabilities related to the sample proportion statistic, which can help us to reason about whether our model parameters accurately describe reality.
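As a rough sketch of such a calculation, the code below uses the normal approximation to estimate the probability that the sample proportion is at most a given threshold. The function name and the values n = 100, p = 0.5 and threshold 0.45 are illustrative assumptions. No continuity correction is applied, so the result will differ somewhat from the exact binomial probability.

```python
from math import sqrt
from statistics import NormalDist


def approx_prob_proportion_at_most(threshold, n, p):
    """Approximate P(p_hat <= threshold) using N(p, p(1 - p) / n)."""
    # NormalDist takes the standard deviation, so take the square root
    # of the variance p(1 - p) / n.
    sampling_dist = NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n))
    return sampling_dist.cdf(threshold)


# Illustrative values (assumed): 100 data points, assumed true proportion 0.5.
print(approx_prob_proportion_at_most(0.45, n=100, p=0.5))
```

With these values the threshold sits one standard deviation below the mean, so the result is the standard normal probability of falling below -1.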

...