Processing math: 100%
+ - 0:00:00
Notes for current slide
Notes for next slide

Applied Probabilistic Programming. Part I: RStan

Practical Demonstration

Renny Doig

1 / 24

Overview

1. Example Dataset

2. Proposed Statistical Model

3. The Model as a Stan Program

4. Implementation in R

5. Analyzing Results

2 / 24

Example Problem Statement

3 / 24

Gene Expression Data

Suppose we have counts for the expression of the MT-ATP6 gene, taken from brain tissues from 1,000 subjects. We would like to make inference about the mean expression level for this gene which accounts for potential difference in means between types of brain tissue. In particular, we assume each sample is taken from one of the four following types:

  1. Cerebellum

  2. Hippocampus

  3. Hypothalamus

  4. Brain stem



Simulated data (modeled on GTEx)

4 / 24

Proposed Statistical Model

5 / 24

Finite Gaussian Mixture Model

A Gaussian mixture model is a linear combination of Gaussian random variables, where each component corresponds to a sub-population and each weight is the probability of belonging to each sub-population

  • In general, have K components (sub-populations)

  • A randomly sampled individual belongs to component k{1,,K} with probability αk

  • Random variable Z used to indicate sub-population: ZCategorical(K,αα)

  • An observation on an individual belonging to component k will be a Y|Z=kN(μk,σ)

6 / 24

Finite Gaussian Mixture Model

A Gaussian mixture model is a linear combination of Gaussian random variables, where each component corresponds to a sub-population and each weight is the probability of belonging to each sub-population

  • In general, have K components (sub-populations)

  • A randomly sampled individual belongs to component k{1,,K} with probability αk

  • Random variable Z used to indicate sub-population: ZCategorical(K,αα)

  • An observation on an individual belonging to component k will be a Y|Z=kN(μk,σ)


  • Since Z is discrete it is incompatible with HMC

  • The workaround is to marginalize out this indicator by evaluating the likelihood as a sum

YKk=1αkN(μk,σ)

6 / 24

Bayesian GMM

To make this model Bayesian, we need to assign prior distributions over the component proportions, means, and variance:

Y|αα,μμ,σKk=1αkN(μk,σ)ααDirichlet(1)μμNormal(4000,1500)σ2InvGamma(3,6002)

  • Dirichlet is a multivariate extension of the Beta distribution

    • Used for a random vectors whose components take values in (0,1) and sum to 1
7 / 24

Bayesian GMM

To make this model Bayesian, we need to assign prior distributions over the component proportions, means, and variance:

Y|αα,μμ,σKk=1αkN(μk,σ)ααDirichlet(1)μμNormal(4000,1500)σ2InvGamma(3,6002)

  • Dirichlet is a multivariate extension of the Beta distribution

    • Used for a random vectors whose components take values in (0,1) and sum to 1

The resulting posterior distribution is: π(αα,μμ,σ|y)ni=1[Kk=1αkN(yi|μk,σ)]N(μμ;4000,1500)IG(σ2;3,6002)D(αα;1)

7 / 24

The Model as a Stan Program

8 / 24

Basic Components of a Stan Program

Our model is specified in a separate .stan file

  • Written in the Stan probabilistic programming language

  • Contains up to seven blocks, each serving a different purpose

  • We only need the three basic blocks:

    1. data{...}: Specify all user input

    2. parameters{...}: Specify any parameters of the model which need to be estimated

    3. model{...}: Evaluates the log-posterior

  • Stan uses this file to automatically construct a Hamiltonian Monte Carlo sampler for this model

9 / 24

Variable Definition

The syntax for defining a variable in Stan is:

type<lower=a><upper=b>[dimension] name;

  • type: Specifies type of variable

    • Basic types are int and real

    • Can also have vector, matrix, and others

  • <lower=a>: Specifies the lower bound of the variable; default is unbounded below

  • <upper=b>: Specifies the upper bound of the variable; default is unbounded above

  • dimension: Specifies dimension of the variable

    • For vector it is the length

    • For n×m matrix is specified [n,m]

10 / 24

GMM: Data and Parameters

Now we can specify the data and parameters section of our RStan code:

data {
int<lower=1> n; // number of observations
int<lower=2> K; // number of components
vector[n] y; // observations
}
parameters {
simplex[K] alpha; // mixing proportions
ordered[K] mu; // means of components
real<lower=0> sigma; // standard deviations of components
}

Identifiability: Without ordering μμ, the model is unidentifiable

  • E.g. between two iterations, swap μ1 and μ2

    • This would leave the likelihood unchanged, and therefore the model is unidentifiable
  • Imposing an ordering prevents this swapping

11 / 24

Evaluating the Posterior

The log-posterior can be specified in two ways

  • Suppose we want to model YN(μ,σ)
  1. Distributional statement: y ~ normal(mu, sigma);

  2. Increment statement: target += normal_lpdf(y | mu, sigma);

    • target is a protected keyword and requires special functions to access


  • These are equivalent statements; can have combination in your model block

  • Both are vectorized, so if y is a vector, then it sums over all log-pdf evaluations

12 / 24

Note on Log-Sum-Exponential

Recall in our posterior, we needed to evaluate Kk=1αkN(y|μk,σ)

  • Since Stan operates on the log-posterior, we need the log of this quantity

  • In terms of log-densities, this would be written as: log(Kk=1exp[log(αk)+log(N(y|μk,σ))])

  • In Stan, if we have a vector x=[a,b,c], the function log_sum_exp(x) will evaluate log(exp(a)+exp(b)+exp(c))

  • This will be used to evaluate the log-posterior

13 / 24

GMM: Model

Recall that the posterior we want to encode is: π(αα,μμ,σ|y)ni=1[Kk=1αkN(yi|μk,σ)]N(μμ;4000,1500)IG(σ2;3,6002)D(αα;1)

model {
// priors
target += gamma_lpdf(1/sigma^2 | 3, 600^2);
target += normal_lpdf(mu | 4000, 1500);
target += dirichlet_lpdf(alpha | rep_vector(1, K));
// likelihood
for(i in 1:n){
vector[K] temp = log(alpha);
for(k in 1:K){
temp[k] += normal_lpdf(y[i] | mu[k], sigma);
}
target += log_sum_exp(temp);
}
}
14 / 24

Implementation in R

15 / 24

The RStan Package

The Stan software is accessed in R through the rstan package


General flow of implementation is

  1. Create .stan file specifying our model

  2. Read data into R and reformat to match the Stan program

  3. Execute the stan function

    • Automatically generate a Hamiltonian Monte Carlo sampler

    • Produce posterior samples, plus diagnostics

  4. Generate plots, numerical summaries, etc.


Note on Compute Canada installation: may require manual installation of V8, Rcpp, or RcppEigen packages

16 / 24

The stan function

The required arguments to the stan function are:

  • file: A string indicating the path of the .stan file containing the model code

  • data: A named list with elements matching those specified in the data block of the Stan file


Some optional, but often used arguments:

  • chains: The number of MCMC chains used (default 4)

  • iter: Total length of each chain (default 2000)

  • init: A list or function used to initialize each chain (default "random")

  • seed: The seed used for random number generation


A full list of parameters can be found in the CRAN documentation

17 / 24

Sampling from the Bayesian GMM

# load the RStan library
library(rstan)
# read in data
Expression <- read_csv("../Data/Expression.csv")$Expression
# enable parallel computing
options(mc.cores=5)
# set up data
n <- length(Expression)
K <- 4
stan_data <- list(n=n, K=K, y=Expression)
18 / 24

Sampling from the Bayesian GMM

# for the rdirichlet function
library(extraDistr)
# initial values
init_func <- function(){
list(alpha = as.numeric(rdirichlet(1, rep(1, K))),
mu = sort(rnorm(K, 4000, 1500)),
sigma = 1 / sqrt(rgamma(1, 3, 600^2)))
}
# run stan
fit <- stan(file = "gaussian_mixture.stan",
data = stan_data,
chains = 5,
iter = 1e4,
init = init_func,
seed = 2)

Note: If seed not supplied then Stan will set it randomly, overriding previously set seeds

19 / 24

Analyzing the Results

20 / 24

Examples of Plots

library(bayesplot)
mcmc_areas(fit, regex_pars="alpha", prob=0.95) + theme(axis.text = element_text(size=20))

mcmc_areas(fit, regex_pars="mu", prob=0.95) + theme(axis.text = element_text(size=20))

21 / 24

Numerical Summary of Results

RStan also has a summary function for stan-fit objects:

summ <- summary(fit)$summary
colnames(summ)
[1] "mean" "se_mean" "sd" "2.5%" "25%" "50%" "75%"
[8] "97.5%" "n_eff" "Rhat"
round(summ[rownames(summ)!="lp__", c("2.5%", "mean", "97.5%")],2)
2.5% mean 97.5%
alpha[1] 0.32 0.35 0.38
alpha[2] 0.17 0.20 0.22
alpha[3] 0.23 0.26 0.29
alpha[4] 0.17 0.20 0.23
mu[1] 2963.46 2994.70 3025.78
mu[2] 3936.70 3987.60 4038.10
mu[3] 4964.72 5006.32 5047.77
mu[4] 5902.28 5945.14 5987.92
sigma 249.41 263.07 278.16
22 / 24

Extracting the Data

It might be preferable to just extract the samples from the stanfit object using the extract function

par_fit <- extract(fit, pars="lp__", include=F)
  • This returns a list of named matrices or vectors of posterior samples

    • E.g. par_fit$alpha is a matrix with 4 columns, one for each αk
  • lp__ is the value of the log-posterior at each iteration

  • include can be used to invert the meaning of pars (i.e. lp__ is excluded from the extraction)


This can then be converted to a data frame

par_fit <- as.data.frame(par_fit)
23 / 24

Concluding Remarks

RStan provides a appealing Bayesian analysis software

  • Probabilistic programming allows for flexibility; any model that can be written in the Stan languange can be sampled from

  • A Hamiltonian Monte Carlo sampler is automatically created and executed to sample from the posterior

  • The package is well-maintained; no debugging of the algorithm itself necessary


These slides can be found here: https://www.sfu.ca/~rennyd/RStanTutorial/tutorial.html

  • Supporting files can be found in the parent directory
24 / 24

Overview

1. Example Dataset

2. Proposed Statistical Model

3. The Model as a Stan Program

4. Implementation in R

5. Analyzing Results

2 / 24
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow