Applying Gibbs Sampling

Gibbs Sampling can be used in population genetics research to build populations.

Gibbs Sampling in statistical genetics

Often times in population genetics research, it is useful to classify individuals in a sample into populations. While other areas of study may use factors like linguistic, cultural, or physical characters, it is not always easy to assign populations based on genotypes.

Pritchard et al. 2000 proposed a method to assign individuals to populations and simultaneously estimate allele frequencies: homozygous dominant, heterozygous, and homozygous recessive. This method can utilize various genetic markers as indication of alleles including SNPS, RFLPS, micro-satellites, etc. Additionally, they follow a few assumptions:

Hardy Weinberg Equilibrium \[p^2+2pq+q^2 = 1\] Where \(p^2\) is the homozygous dominant allele frequency, \(2pq\) is the heterozygous allele frequency, and \(q^2\) is the homozygous recessive allele frequency. By assuming Hardy Weinberg Equilibrium, we are stating that the genetic variation in a population will remain constant from one generation to the next.

Under these assumptions, each allele at each locus in each genotype is an independent draw from the appropriate frequency distribution.

Building a Model

Assume each population is modeled by a characteristic set of allele frequencies.

Information about P and Z is given by the posterior distribution:

\[\begin{aligned}P(Z, P |X) &= \frac{P(Z, P, X)}{P(X)} \\ &\propto P(P) P(Z) P(X|Z, P) \end{aligned}\]

This is a great opportunity to use Gibbs Sampling! It is not possible to sample from the posterior, \(P(Z, P|X)\), directly. We don’t know what \(P(P)\) or \(P(Z)\) are because they are our unknown variables. However, it is possible to use conditional sampling to build an approximate distribution: \((Z_1,P_1), (Z_2, P_2), ..., (Z_n, P_n)\).

Start with randomly drawn initial value \(Z_0\) as a hypothetical population of origin and \(P_0\) as a hypothetical allele frequency. Then, iterate the following steps:

  1. Sample \(P_1\) from \(P(P | X, Z_0)\)
    • This estimates allele frequencies for each population assuming population of origin for the individual is known.
  2. Sample \(Z_1\) from \(P(Z | X, P_1)\)
    • This estimates the population of origin for each individual assuming that the population allele frequencies are known.
  3. Continue the pattern \(n\) times:
    • Sample \(P_n\) from \(P(P | X, Z_{n-1})\)
    • Sample \(Z_n\) from \(P(Z | X, P_n)\)