The Sample Proportion as a Random Variable

Imagine we are collecting data that we can categorise into two groups: a success case and a failure case. Once we have collected a sample of $n$ data points, we might be interested in the proportion of points in this sample that were successes. We call this the sample proportion (denoted $\hat{p}$), and the formula to calculate it is

$$\hat{p} = \frac{\#\{\text{successes}\}}{\#\{\text{data points}\}}$$
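As a quick sanity check, the formula takes only a couple of lines of Python (the 0/1 data below is made up for illustration, with 1 marking a success):

```python
# Illustrative binary data: 1 = success, 0 = failure.
data = [1, 0, 1, 1, 0, 1, 0, 1]

# Sample proportion: number of successes divided by number of data points.
p_hat = sum(data) / len(data)
print(p_hat)  # 5 successes out of 8 points -> 0.625
```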

This is the first statistic we have explored, and we note that since its numerical value depends on the random variables underlying the individual data points, $\hat{p}$ is itself a random variable.

As we are dealing with multiple data points that have two outcomes (success or failure), this statistic is closely tied to the Bernoulli and binomial distributions. In fact, if the underlying data points can be modelled as Bernoulli random variables with parameter $p$, then the expected value of the sample proportion is that parameter:

$$E[\hat{p}] = p$$

We can also find the variance of the sample proportion, which is given by

$$\operatorname{Var}(\hat{p}) = \frac{p(1-p)}{n}$$

The interesting observation from this formula is that the variance of the sample proportion decreases as the number of data points $n$ grows. This means that as we collect more data, the difference between $\hat{p}$ and $p$ will tend to shrink.
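The shrinking variance shows up clearly in a small simulation. The sketch below (assuming a true parameter of $p = 0.3$, chosen arbitrarily) repeatedly draws samples of size $n$, computes $\hat{p}$ for each, and compares the empirical variance of those estimates against the formula $p(1-p)/n$:

```python
import random

random.seed(0)
p = 0.3  # assumed true success probability for the simulation

def sample_proportion(n):
    # Draw n Bernoulli(p) data points and return the sample proportion.
    return sum(random.random() < p for _ in range(n)) / n

for n in (10, 100, 1000):
    # Empirical variance of p-hat across many repeated samples of size n.
    estimates = [sample_proportion(n) for _ in range(1000)]
    mean = sum(estimates) / len(estimates)
    empirical_var = sum((x - mean) ** 2 for x in estimates) / len(estimates)
    print(f"n={n:5d}  empirical={empirical_var:.5f}  theory={p * (1 - p) / n:.5f}")
```

Each printed row should show the empirical and theoretical variances agreeing closely, with both shrinking by a factor of 10 as $n$ does the opposite.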

This is an important result. Recall that we cannot directly observe the parameters of a distribution, such as $p$, but we can measure a statistic like $\hat{p}$. This means we can use our observations of $\hat{p}$ to estimate the parameter of an underlying Bernoulli distribution.
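A minimal sketch of this estimation idea, again with made-up 0/1 data. Reporting the plug-in standard error $\sqrt{\hat{p}(1-\hat{p})/n}$ goes slightly beyond the text, but it is a common way to quantify how far the estimate might sit from the true $p$:

```python
# Illustrative observed data: 1 = success, 0 = failure.
data = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]
n = len(data)

p_hat = sum(data) / n                  # point estimate of the unknown p
se = (p_hat * (1 - p_hat) / n) ** 0.5  # plug-in standard error of p-hat
print(f"estimate of p: {p_hat}, standard error: {se:.3f}")
```

By the variance formula above, collecting more data points shrinks this standard error, so the estimate becomes more reliable as $n$ grows.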

...
