[ESTIMATION] Must-have Stat Knowledge

This post is the first in a series on estimation.

The following material is the bare minimum you need to understand maximum likelihood estimation and everything built on it.

I assume you already have some basic knowledge of probability theory, e.g. the definition of probability, so we’ll skip that part.

We will review conditional probability, the total probability theorem, Bayes' theorem, the Bernoulli distribution, the binomial distribution, the multinomial distribution, the normal distribution, the likelihood function, logarithm rules, and derivative rules.


The Venn diagrams are useful for understanding probabilities.

Upper panel: mutually exclusive events (disjoint events), which are necessarily dependent. Lower panel: non-mutually exclusive events (overlapping events), which may be independent.

Conditional probability

Given two events A and B from the sigma-field of a probability space, with P(B) > 0, the conditional probability of A given B is defined as the quotient of the probability of the intersection of A and B and the probability of B:

P(A|B)=\frac{P(A \cap B)}{P(B)}

Sometimes P(A \cap B) is denoted as P(A,B).

Conditioning on B defines a new universe: the sample space shrinks to B, and probabilities within it are rescaled by P(B).

In general, it cannot be assumed that P(A|B) \approx P(B|A).

Events A and B are defined to be statistically independent if

P(A \cap B)=P(A)P(B)

Note that mutually exclusive events (or disjoint events) must be dependent events, but non-mutually exclusive events (or overlapping events) could be independent.
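
As a quick illustration (a made-up example, not from the definitions above), the R snippet below estimates P(A|B) by simulation for one roll of a fair die, with A = "the roll is even" and B = "the roll is at least 4", and checks that A and B are not independent.

# Monte Carlo check of conditional probability and (non-)independence
# A = "roll is even", B = "roll is at least 4" for a fair six-sided die
set.seed(1)
rolls <- sample(1:6, 100000, replace=TRUE)
A <- rolls %% 2 == 0
B <- rolls >= 4
mean(A & B) / mean(B)  # estimate of P(A|B); exact value is (1/3)/(1/2) = 2/3
mean(A) * mean(B)      # about 1/4, not equal to P(A,B) = 1/3, so A and B are dependent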

Total probability theorem

The events A and \bar{A} form a partition of S, where S denotes the sample space.

P(A)+P(\bar{A})=1

P(B)=P(A,B)+P(\bar{A},B)=P(B|A)P(A)+P(B|\bar{A})P(\bar{A})
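
A quick numerical check in R, with made-up probabilities chosen only for illustration:

# total probability: P(B) = P(B|A)P(A) + P(B|Abar)P(Abar)
# made-up inputs: P(A) = 0.4, P(B|A) = 0.5, P(B|Abar) = 0.2
p_A      <- 0.4
p_B_A    <- 0.5
p_B_notA <- 0.2
p_B <- p_B_A * p_A + p_B_notA * (1 - p_A)
p_B  # 0.32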

Bayes' theorem

P(A|B)=\frac{P(A,B)}{P(B)}=\frac{P(B|A)P(A)}{P(B)}=\frac{P(B|A)P(A)}{P(B|A)P(A)+P(B|\bar{A})P(\bar{A})}

Bayes' theorem can be regarded as a rule for updating an initial probability P(A), called the prior probability, into a revised probability P(A|B), called the posterior probability, which takes into account the information provided by B.
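
Here is a minimal R sketch of that prior-to-posterior update, using made-up numbers for a rare condition and a noisy test: prior P(A) = 0.01, P(B|A) = 0.95, P(B|\bar{A}) = 0.05.

# Bayes' theorem: posterior P(A|B) from the prior P(A) and the likelihoods
p_A      <- 0.01  # prior probability of A   (made-up)
p_B_A    <- 0.95  # P(B | A)                 (made-up)
p_B_notA <- 0.05  # P(B | Abar)              (made-up)
p_B   <- p_B_A * p_A + p_B_notA * (1 - p_A)  # total probability theorem
p_A_B <- p_B_A * p_A / p_B                   # posterior
p_A_B  # about 0.161: observing B raises P(A) from 0.01 to roughly 0.16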

Click here for an intuitive explanation of Bayes' theorem.

Bernoulli distribution

The Bernoulli distribution, named after the Swiss mathematician Jacob Bernoulli, is the probability distribution of a random variable that takes the value 1 with success probability p and the value 0 with failure probability q = 1 - p. For example, toss a coin and record whether it lands heads.

If X is a random variable with this distribution, we have:

Pr(X=1) = 1 - Pr(X=0) = 1 - q = p

\mathbb{E}[X] = Pr(X=1) \cdot 1 + Pr(X=0) \cdot 0 = p \cdot 1 + q\cdot 0 = p

Have some fun with the following R code.

#draw two numbers from the Bernoulli distribution with p=0.5
p<-0.5
set.seed(1232)
if( runif(1) < p ) {y<-1} else {y<-0}
y
set.seed(2222)
if( runif(1) < p ) {y<-1} else {y<-0}
y
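
Equivalently (not part of the original snippet), a Bernoulli draw is a binomial draw with size = 1, so rbinom can generate the same kind of values directly:

# two Bernoulli(0.5) draws via the built-in binomial generator
set.seed(1232)
rbinom(n=2, size=1, prob=0.5)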

Binomial distribution

The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.

f(k;n,p) = \Pr(X = k) = \binom n k p^k(1-p)^{n-k}=\exp\{k \cdot \log{\frac{p}{1-p}}+n \cdot \log(1-p)+\log \binom n k\}

for k = 0, 1, 2, …, n, where

\binom n k =\frac{n!}{k!(n-k)!}

Suppose a biased coin comes up heads with probability 0.3 when tossed. What is the probability of achieving 0, 1,…, 6 heads after six tosses?

\Pr(0\text{ heads}) = f(0) = \Pr(X = 0) = {6\choose 0}0.3^0 (1-0.3)^{6-0} \approx 0.1176
\Pr(1\text{ heads}) = f(1) = \Pr(X = 1) = {6\choose 1}0.3^1 (1-0.3)^{6-1} \approx 0.3025
\Pr(2\text{ heads}) = f(2) = \Pr(X = 2) = {6\choose 2}0.3^2 (1-0.3)^{6-2} \approx 0.3241
\Pr(3\text{ heads}) = f(3) = \Pr(X = 3) = {6\choose 3}0.3^3 (1-0.3)^{6-3} \approx 0.1852
\Pr(4\text{ heads}) = f(4) = \Pr(X = 4) = {6\choose 4}0.3^4 (1-0.3)^{6-4} \approx 0.0595
\Pr(5\text{ heads}) = f(5) = \Pr(X = 5) = {6\choose 5}0.3^5 (1-0.3)^{6-5} \approx 0.0102
\Pr(6\text{ heads}) = f(6) = \Pr(X = 6) = {6\choose 6}0.3^6 (1-0.3)^{6-6} \approx 0.0007
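
The hand-computed values above can be reproduced with R's dbinom as a quick check:

# binomial probabilities of k = 0, ..., 6 heads in 6 tosses with p = 0.3
round(dbinom(0:6, size=6, prob=0.3), 4)
# 0.1176 0.3025 0.3241 0.1852 0.0595 0.0102 0.0007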

If X ~ B(n, p), that is, X is a binomially distributed random variable, n being the total number of experiments and p the probability of each experiment yielding a successful result, then the expected value of X is:

\mathbb{E}[X] = np

# random generation for the binomial distribution with size=10 and prob=0.5
rbinom(n=2, size=10, prob=0.5)

Multinomial distribution

The multinomial distribution is a generalization of the binomial distribution. For n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.

f(x_1,\ldots,x_k;n,p_1,\ldots,p_k)= \Pr(X_1 = x_1 \text{ and } \ldots \text{ and } X_k = x_k)= \frac{n!}{x_1!\cdots x_k!}p_1^{x_1}\cdots p_k^{x_k}

for non-negative integers x_1, \ldots, x_k such that \sum_{i=1}^k x_i=n.

The expected number of times the outcome i was observed over n trials is

\mathbb{E}[X_i] = n p_i

# random generation for the multinomial distribution with size=10 and prob=c(0.1,0.3,0.6)
rmultinom(n=2, size=10, prob=c(0.1,0.3,0.6))
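
A quick empirical check (my own addition) that the average category counts over many draws approach \mathbb{E}[X_i] = n p_i:

# average category counts over many multinomial draws with size=10
set.seed(42)
draws <- rmultinom(n=10000, size=10, prob=c(0.1,0.3,0.6))  # one draw per column
rowMeans(draws)  # close to 10 * c(0.1, 0.3, 0.6) = c(1, 3, 6)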

Normal distribution

The normal (or Gaussian) distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.

The probability density of the normal distribution is:

f(x~|~\mu,~\sigma^2) = \frac{1}{\sigma\sqrt{2\pi} }~e^{-\frac{(x-\mu)^2}{2\sigma^2}}

\mathbb{E}[X]=\mu

var(X)=\sigma^2

If \mu = 0 and \sigma = 1, the distribution is called the standard normal distribution (or the unit normal distribution), denoted by N(0,1), and a random variable with that distribution is called a standard normal deviate.

# random generation for the normal distribution with mean=10 and variance=9
rnorm(n=10, mean=10, sd=3)
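
As a sanity check (not in the original post), the sample mean and standard deviation of many draws should be close to \mu = 10 and \sigma = 3:

# empirical mean and standard deviation of many N(10, 9) draws
set.seed(123)
x <- rnorm(n=100000, mean=10, sd=3)
mean(x)  # close to 10
sd(x)    # close to 3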

Likelihood function

A likelihood function (often simply the likelihood) is a function of the parameters of a statistical model. Likelihood functions play a key role in statistical inference, especially methods of estimating a parameter from a set of statistics.

“Probability” is used when describing a function of the outcome given a fixed parameter value.

“Likelihood” is used when describing a function of a parameter given an outcome.

The likelihood of a set of parameter values, \theta, given outcomes x, is equal to the probability of those observed outcomes given those parameter values, that is
\mathscr{L}(\theta |x) = P(x | \theta)
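
As a small illustration (a made-up example), the R code below evaluates the binomial likelihood \mathscr{L}(p|k) for k = 7 successes in n = 10 trials over a grid of p values; the likelihood peaks near p = k/n = 0.7, which previews the idea behind maximum likelihood estimation.

# likelihood of p given k = 7 successes in n = 10 trials
n <- 10
k <- 7
p_grid <- seq(0.01, 0.99, by=0.01)
lik <- dbinom(k, size=n, prob=p_grid)  # L(p | k) = P(k | p), viewed as a function of p
p_grid[which.max(lik)]                 # about 0.7
plot(p_grid, lik, type="l", xlab="p", ylab="likelihood")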

Logarithm rules
If b raised to the power of y equals x, b^y = x, then the base-b logarithm of x is y: \log_b(x) = y
The natural logarithm is the logarithm to the base e: \ln(x) = \log_e(x)
Logarithm product rule: \log_b(x \cdot y) = \log_b(x) + \log_b(y)
Logarithm quotient rule: \log_b(x / y) = \log_b(x) - \log_b(y)
Logarithm power rule: \log_b(x^y) = y \cdot \log_b(x)
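
These identities are easy to verify numerically in R (a small self-made check, using the natural log):

# numerical check of the logarithm rules
all.equal(log(6 * 7), log(6) + log(7))  # product rule
all.equal(log(6 / 7), log(6) - log(7))  # quotient rule
all.equal(log(6^7), 7 * log(6))         # power rule
log(exp(1))                             # ln(e) = 1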

Derivative rules
c'=0 where c is a constant.
x'=1
(x^2)'=2x
(\ln(x))'=1/x
Power Rule: (x^n)'=nx^{n-1}
Chain Rule: (f(g(x)))'=f'(g(x)) \cdot g'(x)
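
R's symbolic differentiation function D() can confirm these rules (a small self-made check):

# symbolic derivatives as a check of the rules above
D(expression(x^2), "x")           # 2 * x
D(expression(log(x)), "x")        # 1/x
D(expression(x^n), "x")           # x^(n - 1) * n       (power rule)
D(expression(log(x^2 + 1)), "x")  # 2 * x / (x^2 + 1)   (chain rule)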
