Binomial distribution
$$Y_i \mid \theta \sim \mathrm{Ber}(\theta), \qquad \sum_{i=1}^{n} Y_i \mid \theta \sim \mathrm{Binom}(n, \theta), \qquad \theta \sim \mathrm{Beta}(a, b)$$
$$\theta \mid \text{data} \sim \mathrm{Beta}\!\left(a + \sum_{i=1}^{n} y_i,\; b + n - \sum_{i=1}^{n} y_i\right)$$
$$\tilde{y} \mid \theta \sim \mathrm{Ber}(\theta), \qquad \tilde{y} \mid \text{data} \sim \mathrm{Ber}\!\left(\frac{a + \sum_{i=1}^{n} y_i}{a + b + n}\right)$$
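As a minimal numerical sketch of this update (the data `y` and the flat hyperparameters `a = b = 1` are invented for illustration):

```python
import numpy as np
from scipy import stats

a, b = 1, 1                    # Beta(a, b) prior hyperparameters (a flat prior, assumed here)
y = np.array([1, 0, 1, 1, 0])  # hypothetical Bernoulli observations
n, s = len(y), y.sum()

posterior = stats.beta(a + s, b + n - s)  # theta | data ~ Beta(a + sum(y), b + n - sum(y))
p_next = (a + s) / (a + b + n)            # P(y_tilde = 1 | data), the posterior predictive
print(posterior.mean(), p_next)           # the two coincide: (a + sum(y)) / (a + b + n)
```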
Poisson Distribution
📖 There is a book with N pages.
Let $Y_i$ be the number of typos on the $i$-th page; the total number of typos in the whole book is then $\sum Y_i$.
Assume, say, $E(Y_i) = 2$ typos per page on average; in general, let $\theta$ denote the typo rate per page, so $E(Y_i) = \theta$.
$$\hat{\theta} = \frac{\sum Y_i}{N}, \qquad Y_i \mid \theta \sim \mathrm{Poi}(\theta), \qquad \sum_{i=1}^{n} Y_i \mid \theta \sim \mathrm{Poi}(n\theta)$$
The sum of the $Y_i$ is again Poisson because the per-page counts are i.i.d. For the prior, we can assume $\theta$ follows a Gamma distribution:
$$\theta \sim \mathrm{Gamma}(a, b), \qquad \theta \mid \text{data} \sim \mathrm{Gamma}\!\left(a + \sum y_i,\; b + n\right)$$
The posterior is again Gamma, which follows from one line of algebra:
$$p(\theta \mid \text{data}) \propto p(\text{data} \mid \theta)\, p(\theta) \propto e^{-n\theta}\, \theta^{\sum y_i} \cdot \theta^{a-1} e^{-b\theta} = \theta^{a + \sum y_i - 1}\, e^{-(b+n)\theta}$$
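In code, the update is a one-liner (a sketch with invented counts and hyperparameters; note that SciPy's `gamma` is parameterized by `scale = 1/rate`):

```python
import numpy as np
from scipy import stats

a, b = 2.0, 1.0                # Gamma(a, b) prior: shape a, rate b (assumed values)
y = np.array([1, 3, 0, 2, 2])  # hypothetical typo counts for n = 5 pages
n = len(y)

# theta | data ~ Gamma(a + sum(y), b + n); SciPy uses scale = 1 / rate
posterior = stats.gamma(a + y.sum(), scale=1.0 / (b + n))
print(posterior.mean(), posterior.var())  # (a + sum(y))/(b + n), (a + sum(y))/(b + n)^2
```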
With the prior and posterior in hand, we have a complete Bayesian Poisson model.
$a + \sum y_i$: the number of typos we effectively already know about (prior pseudo-typos $a$ plus observed typos $\sum y_i$).
$b + n$: the number of pages we effectively already know about (prior pseudo-pages $b$ plus observed pages $n$).
1) What is the expected number of typos per page? - Posterior Moments
$$\theta \mid \text{data} \sim \mathrm{Gamma}\!\left(a + \sum y_i,\; b + n\right)$$
$$\mu = \frac{a + \sum y_i}{b + n} = \frac{b}{b + n} \cdot \frac{a}{b} + \frac{n}{b + n} \cdot \frac{\sum y_i}{n}, \qquad \sigma^2 = \frac{a + \sum y_i}{(b + n)^2}$$
The posterior mean is a weighted average of the prior mean $a/b$ and the sample mean $\sum y_i / n$, with weights proportional to $b$ (prior pages) and $n$ (observed pages).
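The decomposition is easy to verify numerically (same invented numbers as in the sketch above):

```python
a, b = 2.0, 1.0
y = [1, 3, 0, 2, 2]
n, s = len(y), sum(y)

post_mean = (a + s) / (b + n)                                 # posterior mean
weighted = (b / (b + n)) * (a / b) + (n / (b + n)) * (s / n)  # prior-mean / sample-mean blend
post_var = (a + s) / (b + n) ** 2                             # posterior variance
assert abs(post_mean - weighted) < 1e-12
```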
2) What is the distribution of a new observation? - Posterior Prediction
$$\tilde{Y} \mid \text{data} \sim \mathrm{NB}\!\left(a + \sum y_i,\; p\right), \qquad p = \frac{1}{b + n + 1}$$
This follows from the derivation below.
$$p(\tilde{y} \mid \text{data}) = \int p(\tilde{y}, \theta \mid \text{data})\, d\theta = \int p(\tilde{y} \mid \theta, \text{data})\, p(\theta \mid \text{data})\, d\theta = \int p(\tilde{y} \mid \theta)\, p(\theta \mid \text{data})\, d\theta, \qquad \tilde{Y} \mid \theta \sim \mathrm{Poi}(\theta)$$
Given $\theta$, a new observation is conditionally independent of the data, so its likelihood is the same Poisson likelihood as for the observed pages.
NB (Negative Binomial) counts the number of successes before the $r$-th failure, where each trial succeeds with probability $p$. Here $r = a + \sum y_i$, and $\tilde{Y}$ is the number of successes, where a "success" means another typo occurring on the new page.
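A quick Monte Carlo check of this predictive (a sketch; note that SciPy's `nbinom` takes the complementary probability $(b+n)/(b+n+1)$ as its `p`, the opposite convention from the text):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a, b = 2.0, 1.0                # same invented hyperparameters as above
y = np.array([1, 3, 0, 2, 2])
n, s = len(y), y.sum()

# Monte Carlo: theta ~ posterior Gamma, then y_tilde | theta ~ Poisson(theta)
theta = stats.gamma(a + s, scale=1.0 / (b + n)).rvs(100_000, random_state=rng)
y_tilde = rng.poisson(theta)

# Closed form: NB with r = a + sum(y). SciPy's `p` is the probability of a
# "failure" in the text's terms, i.e. (b + n) / (b + n + 1).
nb = stats.nbinom(a + s, (b + n) / (b + n + 1))
print(np.mean(y_tilde == 0), nb.pmf(0))  # the two should agree closely
```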
We can also derive the predictive mean and variance from this posterior predictive distribution. A drawback of Poisson modeling, however, is overdispersion: real count data often have a variance much larger than their mean, while the Poisson model forces the variance to equal the mean $\theta$. In such cases a Negative Binomial model or a hierarchical (normal) model is a better choice.
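A minimal diagnostic for overdispersion is to compare the sample mean and variance (the counts below are invented to be deliberately clumpy):

```python
import numpy as np

y = np.array([0, 0, 1, 0, 12, 0, 9, 0, 1, 0])  # invented, clumpy typo counts
print(y.mean(), y.var(ddof=1))  # variance far above the mean flags overdispersion,
                                # so a plain Poisson fit would understate the spread
```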
Exponential Families
$$f(x;\theta) = \begin{cases} \exp\!\left[p(\theta)K(x) + s(x) + q(\theta)\right] & x \in S \\ 0 & \text{otherwise} \end{cases}$$
under the following conditions:
- $S$ does not depend on $\theta$
- $p(\theta)$ is a nontrivial continuous function of $\theta \in \Omega$
- If $X$ is continuous, $K'(x) \not\equiv 0$ and $s(x)$ are continuous functions of $x \in S$; if $X$ is discrete, $K(x)$ is a nontrivial function of $x \in S$
Equivalently, in the natural-parameter form with $\phi = p(\theta)$:
$$f(y \mid \phi) = \begin{cases} h(y)\, c(\phi) \exp\!\left[\phi K(y)\right] & y \in S \\ 0 & \text{otherwise} \end{cases}$$
under the following conditions:
- $S$ does not depend on $\theta$
- $\phi$ is a nontrivial continuous function of $\theta \in \Omega$
- If $Y$ is continuous, $K'(y) \not\equiv 0$ and $h(y)$ are continuous functions of $y \in S$; if $Y$ is discrete, $K(y)$ is a nontrivial function of $y \in S$
If a probability density/mass function can be written in the form above, the distribution belongs to an exponential family. Most well-known distributions (Bernoulli, binomial, Poisson, normal, gamma, ...) are members of exponential families.
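As a concrete instance, the Poisson pmf from the model above matches the second form with $\phi = \log\theta$:
$$f(y \mid \theta) = \frac{\theta^{y} e^{-\theta}}{y!} = \underbrace{\frac{1}{y!}}_{h(y)}\, \underbrace{e^{-\theta}}_{c(\phi)}\, \exp[\underbrace{(\log \theta)}_{\phi}\, \underbrace{y}_{K(y)}], \qquad y = 0, 1, 2, \dots$$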
Sufficient statistic for theta
$Y_1 = u_1(X_1, \dots, X_n)$ is a sufficient statistic for $\theta$.
$p(X_1, X_2, \dots, X_n \mid Y_1 = y_1)$ does not depend on the parameter $\theta$.
The statistic carries all the information in the sample about the parameter.
$X_1, \dots, X_n \overset{\text{iid}}{\sim} f(x;\theta)$: if $f$ belongs to an exponential family, then $\sum_{i=1}^{n} K(X_i)$ is a sufficient statistic for $\theta$.
These four statements are equivalent ways of characterizing sufficiency.
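For the Poisson model above, $K(x) = x$, so the total typo count $\sum X_i$ is sufficient for $\theta$; the standard factorization check makes this visible:
$$f(x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} \frac{\theta^{x_i} e^{-\theta}}{x_i!} = \underbrace{\theta^{\sum x_i}\, e^{-n\theta}}_{\text{depends on } \theta \text{ only through } \sum x_i} \cdot \underbrace{\frac{1}{\prod_i x_i!}}_{\text{free of } \theta}$$
This is exactly why the Gamma posterior earlier depends on the data only through $\sum y_i$.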
If the posterior distribution $p(\theta \mid x)$ is in the same probability distribution family as the prior $p(\theta)$, the prior and posterior are called conjugate distributions (Wikipedia).
$$f(y_1, \dots, y_n \mid \phi) = \prod_{i} h(y_i)\, c(\phi)\, e^{\phi K(y_i)} \propto c(\phi)^{n}\, e^{\phi \sum K(y_i)}$$
$$p(\phi) = \kappa(n_0, t_0)\, c(\phi)^{n_0}\, e^{n_0 t_0 \phi} \propto c(\phi)^{n_0}\, e^{n_0 t_0 \phi}$$
$$p(\phi \mid y) \propto p(\phi)\, f(y \mid \phi) \propto c(\phi)^{n_0 + n} \exp\!\left[\phi \left(n_0 t_0 + n \cdot \frac{\sum K(y_i)}{n}\right)\right]$$
This has the same form as the prior, with $n_0 \to n_0 + n$ and $n_0 t_0 \to n_0 t_0 + \sum K(y_i)$, so the prior is conjugate.
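As a sanity check, this general recipe reproduces the Gamma prior of the Poisson model above: with $\phi = \log\theta$, $c(\phi) = e^{-\theta}$ and $K(y) = y$, so
$$p(\phi) \propto c(\phi)^{n_0}\, e^{n_0 t_0 \phi} = e^{-n_0 \theta}\, \theta^{n_0 t_0} \quad\Longrightarrow\quad p(\theta) \propto \theta^{n_0 t_0 - 1}\, e^{-n_0 \theta}$$
(the extra $1/\theta$ comes from the Jacobian $d\phi/d\theta$), i.e. a $\mathrm{Gamma}(n_0 t_0, n_0)$ prior: $n_0$ plays the role of $b$ (prior pages) and $n_0 t_0$ the role of $a$ (prior typos), matching the pseudo-count reading given earlier.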