Fitting Bayesian Models using Stan and R

An introduction to fitting Bayesian models using Stan and R


What’s Stan and Why Use It?

Stan is a programming language designed to make statistical modeling easier and faster, especially for Bayesian estimation problems. Stan can help you estimate complex models with large numbers of parameters, and can generally do it faster than alternatives like JAGS/BUGS.

That all sounds good, but why is that useful for me? Suppose you have a hierarchical ecological modeling problem with data clustered by space, time, and species, such as estimating the effect of ocean temperatures on coral growth. You can use Stan to fit that model (and it will likely be faster than BUGS/JAGS if you’re already used to working in those platforms). Suppose you just want to use informative priors to help fit a growth model. You can use Stan for that. Suppose you just prefer Bayesian analysis and want to run a simple multiple regression. Stan can do that.

The purpose of this document is not to perfectly describe or debate Bayesian analysis, but to provide a path to get you started using Stan in your research. This is not a deep dive into the inner workings of Stan and model fitting - explanations and examples are designed to try and help people get the hang of using Stan. There are many great resources to understand the inner workings of Stan and provide higher-level examples, but I’ve found that they are often either too basic, or jump straight to the finished product without explaining the steps to get there. This tutorial is intended as a bridge to help get people from zero to working in Stan. Comments and suggestions are welcome!

Bayesian analysis on its own has a number of appealing traits

  • Forces you to very explicitly write out the error structure of your model and prior beliefs

  • Bayesian credible intervals mean what we’d like Frequentist confidence intervals to mean

  • Is a statistically sound way to use prior knowledge

  • Gets you away from the (in my opinion) very confusing world of “fixed” vs “random” effects, since in a Bayesian world all parameters are random, and the question becomes how you structure the priors on those parameters

  • Diagnostics are much more intuitive than the wild wilderness of Frequentist tests (IMO)

But like all things, it comes with tradeoffs

  • Computationally intensive (read: can take a lot longer to run, like hours or days compared to seconds to minutes)

  • Priors require careful thought

  • Unfamiliar reviewers might not like it (though you can and should push back on this one, but still, worth thinking about)

I assume that if you’re reading this you’re already interested in using Bayesian tools in your analysis. If you want further information on the philosophy, mathematics, and usefulness of Bayesian methods, I really recommend Bayesian Models: A Statistical Primer for Ecologists by Hobbs and Hooten (2015) and Statistical Rethinking: A Bayesian Course with Examples in R and Stan by McElreath (2016).

How does Stan work?

Stan is a programming language that allows you to write and fit models. It is also a “compiled” language, meaning that you have to write a model, then compile it to run it. This is unlike interpreted languages like R that let you more or less run code as you go. This can be a pain if you’re not used to that sort of thing, but it helps make Stan much faster than, say, writing the same model in R.

Stan’s key feature is a set of tools for numerical model fitting. While Stan provides a few different fitting algorithms, we’re going to focus on the Markov Chain Monte Carlo (MCMC) based methods here.

MCMCs are an incredibly useful class of algorithms that can be used to fit complex models and provide Bayesian inference for statistical problems.

An MCMC works more or less like this (a toy R sketch of these steps follows the list):

  1. Pick some initial parameters

  2. See how well those parameters fit the data (Calculate the posterior probability of those parameters given your data, the likelihood, and priors)

  3. Pick a new set of parameters through some function.

  4. Accept or reject the new parameters with a probability related to how much the new parameters improve (or worsen) the model fit.

  5. Repeat the process many many many times
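
To make those steps concrete, here is a toy Metropolis-style sampler written in R, estimating the mean of some made-up normal data with a known standard deviation. This is just an illustrative sketch (the data, proposal scale, and prior are invented for the example, and this is not what Stan does internally), but the accept/reject logic mirrors the steps above.

set.seed(42)

y <- rnorm(50, mean = 3, sd = 1) # some made-up data with a known sd of 1

# log posterior: normal likelihood for the data plus a vague normal prior on the mean
log_post <- function(mu, y) {
  sum(dnorm(y, mean = mu, sd = 1, log = TRUE)) + dnorm(mu, 0, 10, log = TRUE)
}

n_iter <- 5000

mu_draws <- numeric(n_iter)

mu_draws[1] <- 0 # step 1: pick some initial parameter value

for (i in 2:n_iter) {
  proposal <- mu_draws[i - 1] + rnorm(1, 0, 0.5) # step 3: propose a new parameter value
  # steps 2 and 4: compare posterior probabilities and accept the proposal
  # with probability min(1, posterior ratio)
  if (log(runif(1)) < log_post(proposal, y) - log_post(mu_draws[i - 1], y)) {
    mu_draws[i] <- proposal
  } else {
    mu_draws[i] <- mu_draws[i - 1]
  }
}

hist(mu_draws[-(1:1000)]) # drop the early draws; the rest approximate the posterior of mu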

This simple algorithm can be shown to always converge on an approximation of the posterior probability distribution of the model eventually. The challenge then is mostly in designing functions for selecting and accepting/rejecting new values that will help the algorithm converge in your (or the universe’s) lifetime.

That’s where Stan comes in. If you write the model, Stan has a number of built in algorithms for helping you use MCMC to fit and diagnose that model quickly and efficiently. It’s beyond the scope of this article or my ability to explain exactly how it does this, but Monnahan, Thorson, and Branch (2016) provides a really great summary of the mathematics behind Stan’s algorithms, that I will now butcher.

Stan makes use of two main tools to efficiently solve Bayesian problems: Hamiltonian Monte Carlo (HMC) and the no-U-turn sampler (NUTS). A simple MCMC might choose a new parameter value by drawing from a multivariate normal distribution centered on the last parameter value, with some tuned or supplied covariance matrix. This means that each new parameter value is likely to be highly correlated with the last parameter value, requiring you to draw a large number of samples and then “thin” the samples to create independent draws from the posterior.

HMC and NUTS work a little differently. The details are very complicated, and I recommend reading more on this if you’re interested (see Monnahan, Thorson, and Branch (2016)), but the good news is someone else worked out all the details so that you can use it! But it’s good to have a rough sense of what’s going on.

Think of the posterior probability of your model like a small planet with peaks and valleys. The HMC/NUTS algorithm is an astronaut that lands on this planet, and the astronaut’s goal is to find the highest peaks in the landscape. But, there’s a catch. The astronaut is blindfolded, but has a sensor that tells her four things: her elevation, her distance from her starting point, her speed, and her thrust. So, she could just systematically walk across the landscape, taking an elevation measurement at each and every point. This would work, but would take forever if the landscape is at all large (and since it’s Mars, it is). So instead, she does something like this. She starts traveling in a random direction, with the rule being she has to pick an initial thrust that she tries to maintain. She can then decide to take tiny little steps, or huge gigantic leaps (remember, it’s Mars so the gravity is much lower), all while trying to maintain the same thrust and speed. If she takes giant leaps, she covers the ground really quickly, but risks jumping right over mountains and missing them entirely. If she takes little tiny steps it’ll just take forever to survey the planet.

She picks a step size and a thrust, and starts traveling, keeping track of her distance from her initial starting point, only stopping when her distance from her initial starting point starts to decrease (a U-turn, i.e. when the path she’s on doubles back on itself), or when, for the same thrust, her speed has plummeted (she’s crashed into the mountain) or gone through the roof (she’s accidentally reached escape velocity and is leaving the planet, this is my sad attempt to come up with another analogy for divergences), or when she has taken the maximum number of steps that she is willing to take on any one search path (max treedepth hit). Once she’s stopped, she looks at her altimeter. If she’s higher than she was when she started, she marks that spot as a possible peak. If she’s lower than where she started, she marks that spot as a possible peak with probability proportional to how much lower she is than her starting point (if she’s just a bit lower she’s likely to mark it, if she’s way lower she’s very unlikely to mark it). If she decides to mark that spot, she then starts a new path from that point. If she doesn’t mark that spot, she returns to her original position, marks that original position again, and tries a new path.

This leaves really one key decision for the astronaut to make: how big should her stepsize be? She makes this decision by tallying the average rate at which she accepts a new position, and she knows from some smart people that an 80% acceptance rate is reflective of a good balance of taking big enough steps to be efficient but not so large that you miss important features of the landscape. So, she keeps trying new step sizes until she hits that target acceptance rate.

There’s a lot in here then, but to translate this back into practical terms, this means that for the most part, the only key decision that you, the analyst, needs to make in implementing the HMC/NUTS algorithm is what your target acceptance rate is. Stan then does the work for you of finding the right step size to achieve that target acceptance rate during its “warmup” period. Once it’s done with that, it enters the “sampling period,” holding that same step size constant.

We’ll get into the implementation details of this later, but that gives a general idea of what’s happening when you call Stan. This has a few really nice features. It’s usually more efficient than, say, the MCMC algorithms underpinning JAGS, and the HMC process means that you usually don’t have to thin anymore, since the HMC itself should more or less pick values that are largely independent of the last value (though this isn’t always true, and sometimes some thinning is needed). This saves you that annoying step in other Monte Carlo based approaches of having to run a boatload of samples and then “waste” them by thinning them down (though HMC still discards a lot of steps in the process). Stan makes it much easier to set a target number of samples from the posterior. In addition, programs such as BUGS like to try and take advantage of Gibbs sampling to speed things up, which requires conjugate priors, giving you an incentive to select conjugate priors for your model. Unfortunately, nature is rarely so kind; we should pick distributions based off reasonable hypotheses about the state of the world, not computational efficiency. Stan (i.e. HMC and NUTS) doesn’t care at all whether you are using conjugate priors, encouraging you to find the right model, not the most mathematically convenient model.

So, this all means that you can focus on writing your model, and leave the complicated tuning and running of HMC to NUTS.

A Note on Divergences

HMC uses a process called the “leapfrog integrator” to draw a sketch of the posterior probability surface, since for all but the most trivial models there is no analytical solution. The nice thing is, failures in this integrator are identified by “divergences,” which basically mean that the sampler is no longer following the surface of the model correctly. So, the model is going along at say an average energy of 10, and then all of a sudden goes to 20, 40, 100, etc. This tells NUTS that something has gone wrong, and it abandons ship. This can be a sign of two things. Either you just need to decrease the step size (which you do by increasing the target acceptance rate), or it can be a sign that there’s a problem in the design of your model. In the latter case, this can sometimes be solved by just reparameterizing your model (which we’ll touch on later), but in other cases it can be a sign of a fundamental problem in your model (e.g. a population model that goes negative and as a result produces Inf or NA values in your posterior probability). Understanding and dealing with divergences is a bit of an art, but it’s important to note that Stan will warn you if divergences pop up, and it’s something you have to look into. Divergences mean that the model didn’t correctly survey the posterior probability and so your results may be suspect.

Filling your Stan Toolbox

Stan is based off of C++, but is written differently. It has a bunch of features that make it a bit easier to deal with than C++, but this also means that, unlike other frameworks like TMB, you can’t just Google the C++ solution to any given problem.

You can write your Stan model as a code chunk in an R Markdown file, or as a string, but for anything but the most trivial model I prefer to write my model in a separate file with a .stan extension (e.g. in RStudio, File > New File > Save As > my_model.stan). You can find the .stan files in the repo, but for this post I’ll also present the code as code chunks in the body of the blog.

A Stan model is broken into a number of “blocks,” each of which define a particular part of the model. There are many different possible blocks you can use, but to start with we’re going to work with the three that basically every model has to have: data, parameters, and model

/* 
You can do long blocks of comments
like this
*/


data{
// load in data
}

parameters{
// define parameters the model is trying to estimate

}
model{

// the posterior probability function

}

You need to preallocate space for any variable that you include in your model, by declaring its type and its size. These declarations need to all happen at the start of each block. Stan uses what is called “strong typing,” meaning basically that it is very strict in terms of what you can do to different kinds of data, and if you try and treat say an array of real numbers like a vector of real numbers it will blow up.

So, let’s start with our data section. Suppose we have N observations of data Y, where N is an integer (the number of observations) and Y is a vector of real numbers. We would declare those like this:


data{

int N; // the number of observations

vector[N] Y; // our data

}

This tells Stan that we have an integer object N, and we have a vector Y of length N.

Notice that like C++, every statement needs to be closed with a semi-colon ;, and that comments are marked by //.

You can also declare arrays, like X below, using a different (and rather confusing) structure


data{

int N; // the number of observations

vector[N] Y; //

real X[N];

}

In this case, Y is a vector of size [N,1] (Stan treats vectors as a matrix-based data type), and X is an array of N one-dimensional real values. That is a pretty confusing distinction, the main differences being that

  • vectors allow vector/matrix algebra. You can’t do this with arrays

  • arrays allow for integer storage. So, if for example you want to pass a bunch of indices showing which columns correspond to different subgroups, you need to use an array: int index[n]; produces an array of length n, with each element storing one integer

To give a few illustrations of this


real x[10];

real y[10];

real z[10];

z = x * y

Would not work, since you can’t multiply arrays like this.


vector[10] x;

vector[10] y;

vector[10] z;

z = x * y

Also does not work! Why not??? These are vectors, we should be able to multiply them. By default, Stan goes with matrix multiplication, and x and y are both [10,1] matrices, which can’t be matrix multiplied together. You can get Stan to do element-wise operations by using .* (or ./ for division).

So this would work


vector[10] x;

vector[10] y;

vector[10] z;

z = x .* y;

One really nice thing about Stan, as opposed to straight up C++, is that it allows for pretty easy indexing.

This works


vector[10] x;

vector[10] y;

vector[5] z;

z = x[1:5] .* y[1:5];

As does this


vector[10] x;

vector[10] y;

int i[5];

vector[5] z;

i = {1,2,3,4,7};

z = x[i] .* y[i];

If things seem like they should be working and they’re not, 9 times out of 10 it’s a problem with these kinds of things (e.g. trying to multiply an array by a vector). One nice feature of running models through RStudio is that if you open up your .stan script in RStudio, there’s a little “check” button in the top right-hand corner of the script editor that will catch a lot of that stuff. It won’t tell you where the problem is, but it will let you know that your .stan file won’t run as written.

There are a ton of other details out there in terms of structuring and manipulating data, and you can see chapters 3 and 4 of the official Stan documentation for a very detailed description, but these examples should get you started with the kinds of operations most of us do.

One other feature worth calling out for a moment is Stan’s ability to set bounds on just about anything. For example, suppose you have data x that you know can only be positive. You can let Stan know this by adding <lower = 0> to the declaration


data{

int n; // the number of observations

vector[n] y; //

vector<lower = 0>[n] x;

}

This lets Stan know that x is defined on the range [0, Inf].

If x also has an upper bound, say 100, you guessed it: <lower = 0, upper = 100>

When declared in the data block this isn’t all that useful, besides causing Stan to crash if you accidentally pass data in violation of the constraints (which can be handy as a check that you haven’t messed something up in your R script that prepares the data).

Where the bounds get really useful is in the parameters block. For example, suppose we are estimating a standard deviation \(\sigma\). We know that \(\sigma\) has to be positive. So, we could do something like estimate \(log(\sigma)\) and then use \(exp(log(\sigma))\) in our model.

Or, we can just declare the parameter \(\sigma\) to have a lower bound of zero


parameters{

  real<lower = 0> sigma;

}

By doing this, Stan knows not to look for negative values of \(\sigma\), and will even allow us to set normal priors on sigma


model{
  
  sigma ~ normal(0, 2);
  
}

This is equivalent to saying that our prior on sigma is half-normal with standard deviation 2. (The above example may have been a bit confusing since we haven’t gotten to the model block yet.)
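
If it helps to see why the bound plus a normal prior amounts to a half-normal, here’s a quick R sketch (just an aside, not part of any Stan model):

# draws from a normal(0, 2) that respect a lower bound of zero...
bounded_draws <- rnorm(1e5, 0, 2)
bounded_draws <- bounded_draws[bounded_draws >= 0]

# ...have the same distribution as the absolute value of normal(0, 2) draws,
# i.e. a half-normal with standard deviation 2
half_normal_draws <- abs(rnorm(1e5, 0, 2))

quantile(bounded_draws, c(0.25, 0.5, 0.75, 0.975))
quantile(half_normal_draws, c(0.25, 0.5, 0.75, 0.975))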

The model block is where we actually define our model (in terms of likelihoods and priors).

There are a few ways to do this, but for now we’ll stick with the generally preferred syntax of

some_thing ~ some_distribution(distribution_parameters)

So in the above example, we are saying that the parameter \(\sigma\) comes from a normal distribution with mean 0 and standard deviation 2. Stan has support for all kinds of different distributions, and you should see the documentation for descriptions and notes on the meaning of different distribution parameters.

Lastly, we should touch on for, while, and if statements. These all work more or less the same as they do in R, thankfully.


model{

real test[10];

int n;

int N;

N = 10;

for (i in 1:10){

print(test[i]);

}

n = 1;

while (n <= N) {
  
  n = n + 1; 
  
  print(n);

}

if (n == N){

print("hooray");
}

}

Notice the use of print there. print is a great way to see what’s going on inside your code once you’ve got it at least syntactically correct (meaning it will compile). If your code is syntactically correct, but you’re still getting errors or weird results, you can use print to debug.

For example, if you know that some model output thing has to be positive, you can add print(thing) to your code to make sure that it is in fact positive.

A Note on Priors

Setting priors is an art and a science that goes well beyond anything we can discuss here, and there are lots of resources out there to help you with this (e.g. Gelman et al. (2013)). You’ll notice though that Stan doesn’t force you to specify priors, so it can be tempting to say “hey, I like Stan, but priors scare me, so I just won’t specify any.” By default though, any parameter that isn’t supplied a specific prior gets assigned a uniform prior on the range \([-\infty, \infty]\), which might seem innocuous, but is not in fact an uninformative prior (in fact nothing may technically be), especially if your model is not actually defined along that range (e.g. if your parameter can only be positive). Saying that \(-\infty\) is just as likely as \(\infty\) is in most cases a pretty strong and unlikely statement.

So, while Stan won’t force you to specify priors explicitly, you should always take the time to think carefully about your priors and test the effect of your choices on your model outcomes.
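
One concrete way to do that is a prior predictive simulation: draw parameters from your candidate priors, push them through your model without touching the data, and check whether the implied observations look remotely plausible. Here is a rough sketch in R for a generic linear model (the priors, the slope and sigma names, and the range of x are all invented purely for illustration):

set.seed(123)

n_sims <- 100

prior_sims <- tibble(
  sim = 1:n_sims,
  slope = rnorm(n_sims, 0, 2), # candidate prior on a slope
  sigma = abs(rnorm(n_sims, 0, 1)) # half-normal prior on the residual standard deviation
) %>%
  crossing(x = seq(-2, 2, length.out = 50)) %>%
  mutate(y_sim = rnorm(n(), mean = slope * x, sd = sigma)) # data implied by each prior draw

prior_sims %>%
  ggplot(aes(x, y_sim, group = sim)) +
  geom_line(alpha = 0.2)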

That’s a small look at the key features of writing Stan code. There’s a ton more out there, but those are the basic tools for most coding. Feel free to play around with the scratch.stan file to test out different behaviors.


data{

int n; // the number of observations

vector[n] y; //

vector[n] x;

real z[n];


}
parameters{

  real beta;

  real<lower = 0> sigma;


}
model {

real test[10];

int cc;

int CC;

 CC = 10;

for (i in 1:10){

 print(test[i]);

}

cc= 1;

while (cc <= CC) {

  cc = cc + 1;

  print(cc);

}

if (cc == CC){

 print("hooray");
}

  y ~ normal(x * beta, sigma);

}
library(rstan) # the stan() call below needs rstan; this post also assumes the tidyverse, here, and ggsci packages are installed

beta <- 0.2

sigma <- 0.5

 x <- -1:200

data <- list(y = x * beta + rnorm(length(x), 0, sigma), n = length(x), x = x,
             z = x, g = x)

plot(x,data$y)

scratch <- stan(file = here::here('src', 'scratch.stan'),
                iter = 2000, 
                warmup = 1000,
                data = data)

plot(scratch)

Example Model: Predicting Salmon

Now that you’ve hopefully developed a general idea of what Stan is and why you might like to use it, we’re going to dig into actually using it.

Before you start frantically coding, it’s just generally good practice to sit down and write out your model. Hobbs and Hooten (2015) provides a great introduction on how to do this if you’re not familiar with the process.

You can certainly do your entire analysis in Stan by itself. However, every language has its purpose, and the purpose of Stan is not fast and easy data manipulation. The good news is that Stan easily interfaces with other programming languages like R and Python, allowing you to do a lot of the complex data manipulation in languages better suited to those tasks. You can then pass your processed data to Stan to do the model fitting, and then analyze your results back in say R.

We’ll now work through the process of using R and Stan to fit a model predicting the number of young fish (recruits) produced by a given amount of adult fish (spawning biomass). This type of model is generally called a stock-recruitment model, and is a core part of fisheries management. For our purposes though, it’s a handy way to see how to fit a non-linear model in R and Stan.

Write Your Statistical Model

We are going to use data from the RAM Legacy Stock Assessment Database to try and fit a Beverton-Holt stock recruitment relationship using the steepness parameterization from Dorn (2002) (for reasons that will become apparent).

To back up, a stock recruit relationship’s job is to say, for a given amount of observed spawning biomass (SSB) in a fishery, how many new fish (recruits, R) enter the population.

We can write a general BH model as

\[ R = \frac{\alpha SSB}{1 + \beta SSB}\]

Where SSB is spawning stock biomass (the biomass of reproductively mature fish in the population), \(\alpha\) is the maximum average number of recruits (new fish) possible in the population, and \(\beta\) is the unfished spawning biomass that produces \(\alpha\) recruits

Figure 1: Example Beverton-Holt stock recruitment relationship (black line); red line shows beta

The problem here is that \(\beta\) is obviously very specific to each species, making it difficult to really say much about the resilience of a species based on that parameter. To that end Mace & Doonan 1988 provided a reparameterization of the BH equation using a term called “steepness” (h), defined as the proportion of maximum recruitment produced when SSB is at 20% of its unfished level. This allows for species with vastly different stock sizes to be compared in terms of their steepness, with species with higher values of steepness being more resilient to fishing than those with lower steepness. \(\beta\) is now the unfished SSB that produces \(\alpha\) recruits.

\[ R = \frac{0.8 \alpha h SSB}{0.2 \beta (1 - h) + (h - 0.2) SSB}\]

alpha <- 1000

cross_df(list(ssb = 0:1000, h = c(0.2,0.6,1))) %>% 
  mutate(recruits = (0.8 * alpha * h * ssb) / (
            0.2 * alpha * (1 - h) +
              (h - 0.2) * ssb
          )) %>% 
  ggplot(aes(ssb, recruits, color = factor(h))) + 
  geom_line() + 
  scale_color_discrete(name = 'steepness')

Our goal then is going to be to try and fit a stock-recruitment model for some real data from some Pink Salmon in Alaska.

Examine spawning and recruitment data

Let’s look at one stock in particular. Data for this analysis can be found in the underlying repo here

# sr_data <- read_csv(here::here("data","rlsadb_v4.25_ssb_recruits.csv")) %>%
#   set_names(tolower)

sr_data <- read_rds(here::here("data","rlsadb_v4.25_ssb_recruits.rds")) %>%
  set_names(tolower)

sal_data <- sr_data %>% 
  filter(stockid == 'PSALMAKPSWUD') %>% 
  select(stocklong, year, ssb, r) %>% 
  na.omit()

sal_data %>% 
  ggplot(aes(ssb, r)) + 
  geom_point() + 
  labs(x = "SSB", y = "Recruits")

So, there seems to be a relationship between SSB and R, but it’s messy. Our goal is to use Stan to estimate a Beverton-Holt stock recruitment curve for this stock.

Let’s start by writing out the statistical model. We have four parameters to estimate: steepness h, maximum recruitment \(\alpha\), unfished spawning biomass \(\beta\), and an error term \(\sigma\).

We can write this as

\[[\alpha,h,\beta,\sigma| r] \propto [ r | \alpha,h,\beta,\sigma][\alpha][h][\beta][\sigma]\]

Where r are our recruitment data. That’s just a conceptual framework for the model; we can write it more clearly by specifying the model in terms of distributions.

\[[\alpha,h,\beta,\sigma | r] \propto N( log(r) | log(bh(h,\alpha,\beta,ssb)),\sigma) \times N(log(\beta)|10,5) \times unif(h|0.2,1) \times N(\alpha|2max(r),0.2max(r)) \times cauchy(\sigma|0,2.5)\]

In English, this says that we believe that recruitment is a log-normal process, while specifying appropriate priors for our other parameters. E.g. we know that h has to be between 0.2 and 1, and it’s reasonable to think that max recruitment is something larger than the largest recruitment ever observed (much more care needs to be taken in prior construction, this is just an example)

Write Your Stan Model

Now that we know what we need to model, we just have to code it. The model is below (though note that I store the actual .stan file in src/bh_model.stan)


data{

int<lower = 0> n; // number of observations

vector[n] ssb; // vector of observed ssb

vector[n] r; // vector of recruits

real max_r;  // max observed recruitment


}
transformed data{

vector[n] log_r; // log recruitment

log_r = log(r);


}

parameters{

real<lower = 0.2, upper = 1> h; //steepness

real<lower = 0> alpha; // max recruitment

real log_beta;

real<lower = 0> sigma; // recruitment standard deviation 


}
transformed parameters{

real beta = exp(log_beta);

vector[n] rhat;

vector[n] log_rhat;

rhat = (0.8 * alpha * h * ssb) ./ (0.2 * beta * (1 - h) +(h - 0.2) * ssb); // beverton holt model

log_rhat = log(rhat);

}


model{

log_r ~ normal(log_rhat, sigma); // normal likelihood on log recruits (i.e. lognormal recruitment)

sigma ~ cauchy(0,2.5);

alpha ~ normal(2*max_r, 0.2*max_r);

log_beta ~ normal(10,5);

}

generated quantities{

  vector[n] pp_rhat;

  for (i in 1:n) {

   pp_rhat[i] = exp(normal_rng(log_rhat[i] - 0.5 * sigma^2, sigma)); // generate posterior predictives

  }

}

Transformed Parameters Block

You’ll notice a new block in our .stan file: transformed parameters. This is where I usually write out the actual operations involved in my model; in this case, translating parameters into actual recruitment estimates, but in more complex cases I will usually put, for example, my model of a fish population in the transformed parameters block. My background is in population dynamics as opposed to pure statistics, so I found this terminology confusing: my instinct was to put all these steps in the model block. Stan, however, thinks of the model block as the statistical model. You can put intermediate calculations in the model block, but this has a few drawbacks.

First, Stan will produce draws from the posterior for anything you put in the transformed parameters block. In this case, suppose we want to easily plot the mean estimated recruits (rhat) and the credible intervals around our estimates. By putting this step in the transformed parameters block, Stan will return those values. We could put the calculation of rhat in the model block, and our code will run, but when we look at our results we won’t see the draws from the posterior for rhat. In addition, steps in the model block aren’t visible to the generated quantities block (we’ll talk about this later), but steps in transformed parameters are. In our case, we used log_rhat in generated quantities. If we’d put the calculation of log_rhat in the model block, we would have had to calculate it again in generated quantities.

I’m not entirely sure if there are speed benefits to putting things in the model block instead of the transformed parameters block, so my default is to put just about everything that I might have an interest in into the transformed parameters block, except of course for the actual statistical model. But, if you have a bunch of intermediate steps that you don’t really care about and don’t want to use the memory to store, consider putting them in the model block.

Fitting your model

Now that we have our .stan file written, we just need to pass our data to it and fit the model. The rstan package makes it really easy to interface between R and Stan. The first step is passing data from the R environment to Stan. Remember the data block in our .stan file? We simply need to create a list in R containing named objects matching each of the entries in that data block.

Once we’ve done that, we use the stan() function to fit our model.

There are a few options that are important to specify in the call to Stan

  • The file entry specifies the path to the .stan file containing your model

  • data is your list of data

  • chains specifies the number of chains used in the model fitting. Any actual model run should contain multiple chains to verify convergence, but you can start with one chain for diagnostics. If you have more than one chain, by default Stan will run them one after another, so if your model takes a long time this can be daunting. However, you can also specify cores. If you set cores to more than 1, then Stan will run each chain in parallel on different cores. So, if you specify 4 chains and 4 cores, each chain will be run simultaneously on separate cores, so your run time should be about the same as 1 chain on 1 core

  • warmup is the number of model iterations dedicated to burn-in/tuning/whatever you want to call it. This number defaults to half of iter (the total number of model iterations), but if you start to do large iteration runs (e.g. 20,000), there isn’t necessarily a need to do 10,000 warmup runs. If you’ve tested your model on lower iterations and diagnostics look good with say 1,000 warmup runs, there shouldn’t be any problem with leaving warmup at 1,000 and setting iterations to 20,000; it just gives you a lot more samples to play with.

  • init allows you to pass initial parameter values for each chain. This is optional, but can help A LOT. By default, Stan randomly draws numbers between -2 and 2 for initial values for each parameter. This works if your model is reasonably centered. But, if you’re working in a situation where parameters can vary wildly from that range (say estimating carrying capacity for a population), this range is going to be a really bad guess if the true parameter value is in the millions. If your model is correctly written, Stan will get to the right result eventually, but it will take a lot longer if you feed it a really poor starting guess. There are a few different ways to set init; I’m just going to cover passing explicit starting guesses. init must be a list of lists, of the general form list(chain_1 = list(h = 0.2), chain_2 = list(h = 0.8)). The inner lists contain named objects for any of the parameters in the model. In this case, I have a parameter named h in the model, and I’m going to specify an initial guess of h at 0.2 for the first chain, and 0.8 for the second chain. It is very important, if you are manually specifying starting guesses, to use different initial values for each chain, since a test of model convergence is whether or not different chains initiated at different values reach the same result. For any parameters you do not manually specify a starting guess for, Stan goes back to the default random number between -2 and 2.

warmups <- 1000

total_iterations <- 2000

max_treedepth <-  10

n_chains <-  4

n_cores <- 4

adapt_delta <- 0.95

data <- list(n = nrow(sal_data),
               r = sal_data$r,
             ssb = sal_data$ssb,
               max_r = max(sal_data$r)
             )

bh_fit <- stan(
  file = here::here("src", "bh_model.stan"),
  data = data,
  chains = n_chains,
  warmup = warmups,
  iter = total_iterations,
  cores = n_cores,
  refresh = 250,
  init = list(
    list(
      h = 0.4,
      log_alpha = log(1 * data$max_r),
      log_beta = log(2* max(sal_data$ssb))
    ),
    list(
      h = 0.21,
      log_alpha = log(3 * data$max_r),
      log_beta = log(.5 *max(sal_data$ssb))
    ),
    list(
      h = 0.8,
      log_alpha = log(1 * data$max_r),
      log_beta = log(1.1*max(sal_data$ssb))
    ),
    list(
      h = 0.3,
      log_alpha = log(.8 * data$max_r),
      log_beta = log(5*max(sal_data$ssb))
    )
  ),
  control = list(max_treedepth = max_treedepth,
                 adapt_delta = adapt_delta)
)

Run Diagnostics

Now that we have a model run, it’s time to examine our fits.

In my opinion, shinystan::launch_shinystan is by far the best way to do this. The folks at Stan built a pretty amazing interface (shinystan) that automatically puts together a wide array of numeric and graphical diagnostics that they recommend running on a Stan model.

I recommend running ?launch_shinystan and taking a look at their examples to get a feel for what it can do.

shinystan::launch_shinystan(bh_fit)

However, we might also want to be able to run some important diagnostics from within R, either for model comparison or inclusion in reports/publications, so we’ll now look at using the fitted Stan model in R.

Calling Stan creates an object of class stanfit

class(bh_fit)
## [1] "stanfit"
## attr(,"package")
## [1] "rstan"

stanfit objects are designed to interface with a few base R commands that you’re used to, like summary and plot (though if you have lots of parameters simply calling these can be pretty messy)

summary(bh_fit)$summary %>% head(20)
##                   mean      se_mean           sd         2.5%          25%
## h         7.078239e-01 3.169747e-03 1.442343e-01 3.939527e-01 6.102924e-01
## log_alpha 1.531332e+01 2.872326e-02 9.831090e-01 1.394481e+01 1.461657e+01
## log_beta  1.635728e+01 3.693697e-02 1.375199e+00 1.375117e+01 1.543265e+01
## sigma     1.249446e+00 3.212504e-03 1.542709e-01 9.865620e-01 1.137413e+00
## beta      3.529879e+07 2.161060e+06 8.762722e+07 9.376908e+05 5.038670e+06
## alpha     8.901170e+06 5.579195e+05 1.999969e+07 1.138033e+06 2.227905e+06
## rhat[1]   1.621490e+06 7.394711e+03 3.892462e+05 9.692065e+05 1.340103e+06
## rhat[2]   1.104467e+06 3.527194e+03 2.373059e+05 7.063650e+05 9.383191e+05
## rhat[3]   1.277909e+06 4.282790e+03 2.757634e+05 8.174251e+05 1.086051e+06
## rhat[4]   7.345194e+05 3.225063e+03 1.851484e+05 4.413389e+05 6.009212e+05
## rhat[5]   3.314767e+05 2.796518e+03 1.287098e+05 1.717287e+05 2.435863e+05
## rhat[6]   4.088632e+05 2.981960e+03 1.429569e+05 2.188126e+05 3.085639e+05
## rhat[7]   8.416381e+05 3.232160e+03 1.976493e+05 5.194235e+05 6.985731e+05
## rhat[8]   1.332869e+06 4.692239e+03 2.904921e+05 8.534010e+05 1.130690e+06
## rhat[9]   8.331789e+05 3.231805e+03 1.966209e+05 5.139237e+05 6.912395e+05
## rhat[10]  6.624290e+05 3.213212e+03 1.769783e+05 3.883628e+05 5.347691e+05
## rhat[11]  1.206105e+05 1.698646e+03 6.983547e+04 5.617102e+04 8.083893e+04
## rhat[12]  1.035079e+05 1.546167e+03 6.290990e+04 4.765039e+04 6.876601e+04
## rhat[13]  1.796375e+05 2.127475e+03 9.047238e+04 8.653524e+04 1.239481e+05
## rhat[14]  6.114202e+05 3.197734e+03 1.710481e+05 3.519857e+05 4.876340e+05
##                    50%          75%        97.5%    n_eff      Rhat
## h         7.270845e-01 8.181560e-01 9.333214e-01 2070.561 1.0024320
## log_alpha 1.511319e+01 1.581729e+01 1.781410e+01 1171.483 1.0014070
## log_beta  1.629629e+01 1.722079e+01 1.920069e+01 1386.145 1.0011261
## sigma     1.236081e+00 1.345266e+00 1.591512e+00 2306.113 0.9993572
## beta      1.195057e+07 3.012274e+07 2.181497e+08 1644.162 1.0010526
## alpha     3.660791e+06 7.402201e+06 5.452135e+07 1285.000 1.0021554
## rhat[1]   1.583968e+06 1.852581e+06 2.498254e+06 2770.808 1.0006444
## rhat[2]   1.078392e+06 1.247371e+06 1.642774e+06 4526.456 1.0006552
## rhat[3]   1.248251e+06 1.441385e+06 1.903794e+06 4145.907 1.0006514
## rhat[4]   7.125345e+05 8.401275e+05 1.164322e+06 3295.820 1.0005496
## rhat[5]   3.042651e+05 3.832535e+05 6.626600e+05 2118.304 1.0003367
## rhat[6]   3.811398e+05 4.719024e+05 7.782208e+05 2298.301 1.0003686
## rhat[7]   8.203252e+05 9.580455e+05 1.296886e+06 3739.424 1.0005990
## rhat[8]   1.301005e+06 1.506169e+06 1.991472e+06 3832.730 1.0006478
## rhat[9]   8.116127e+05 9.482468e+05 1.287068e+06 3701.428 1.0005956
## rhat[10]  6.392977e+05 7.593903e+05 1.077647e+06 3033.620 1.0005104
## rhat[11]  1.033913e+05 1.348004e+05 3.000998e+05 1690.233 1.0003505
## rhat[12]  8.797979e+04 1.150027e+05 2.620690e+05 1655.486 1.0003632
## rhat[13]  1.578682e+05 2.036789e+05 4.197118e+05 1808.435 1.0003235
## rhat[14]  5.868862e+05 7.032020e+05 1.020632e+06 2861.222 1.0004810

The plot command is a great way to get a first glance at your fits

plot(bh_fit)

Before we look at using our model though, let’s take a look at a few diagnostics to try to evaluate our model fit. DISCLAIMER: You should put a lot more thought and effort into model diagnosis in real cases; this is just an example of accessing some of the starting points in this process. And, all of these diagnostics are produced by shinystan, but it’s handy to know how to access these things on your own (and I’d rather not have to take a bunch of screenshots for this blog post)

Assessing and Fixing Divergences and Treedepth Problems

The first thing we can check is for the presence of “divergent” transitions (see the earlier section for a reminder on what these are). Divergent transitions during the sampling period of your model (the iterations after the burn-in) are a sign that there may be a problem with your model. We’ll talk about dealing with these later, but for now here’s how to see if they happened.

rstan has a few functions to check these things

rstan::check_divergences(bh_fit)

We can also look manually at these diagnostics using the output of rstan::get_sampler_params. get_sampler_params returns a list with one object per chain. Each object is a matrix showing diagnostics for each of the stored iterations from the model fitting (by default get_sampler_params includes the warmup iterations; you can set the option inc_warmup = FALSE to omit these from the report if you want). I prefer to wrangle things into tibbles for analysis (you can check out my introduction to purrr if these steps are confusing to you)

mack_diagnostics <- rstan::get_sampler_params(bh_fit) %>% 
  set_names(1:n_chains) %>% 
   map_df(as_data_frame,.id = 'chain') %>% 
  group_by(chain) %>% 
  mutate(iteration = 1:length(chain)) %>% 
  mutate(warmup = iteration <= warmups)
 

mack_diagnostics %>% 
  group_by(warmup, chain) %>% 
  summarise(percent_divergent = mean(divergent__ >0)) %>% 
  ggplot() +
  geom_col(aes(chain, percent_divergent, fill = warmup), position = 'dodge', color = 'black') + 
  scale_y_continuous(labels = scales::percent, name = "% Divergent Runs")  + 
  scale_fill_npg()

We see then that across all chains, we had no divergences during the sampling period (after the warm-ups), which is what we want to see!

treedepth is another really important thing to take a look at. Remember how the HMC algorithm works (more or less). Ideally, a new candidate draw from the parameter space is selected at the point where the sampler’s path starts to bend back on itself. If you think of the posterior probability space like a circular racetrack, the sampler is a runner on that racetrack. The runner starts off on the left side of the track and starts running north, goes around the bend, and then starts running south. The HMC algorithm would stop and try a new parameter at that point, where the runner has fully turned around. So that sounds great if you’ve got a small track to run on. Suppose though that you are on a 10,000 mile track. Your runner is going to have to run a looooong way before things start to bend around. Or, suppose your runner is on a straight line: the runner is never going to turn around, and so the HMC algorithm would just keep running forever! That’s where the max_treedepth option comes in. HMC will select a candidate parameter value when the path bends back on itself (a U-turn) OR when the number of steps specified by max_treedepth is reached. Basically, if the algorithm gets to max_treedepth, the runner says “phew, I’m tired, I’m stopping here,” evaluates that point, and then tries again for another iteration.

By default, max_treedepth is set to 10. So, we should check and make sure that our model isn’t bumping up against max_treedepth a bunch. If it is, that means that the model is selecting candidate draws based on hitting this cap, rather than properties of the posterior probability.

mack_diagnostics %>% 
  ggplot(aes(iteration, treedepth__, color = chain)) + 
  geom_line() + 
  geom_hline(aes(yintercept = max_treedepth), color = 'red') + 
  scale_color_locuszoom()

Looks like we’re good: the treedepth of all of our iterations was below max_treedepth, meaning that Stan selected each new parameter value based on a U-turn rather than by hitting the treedepth cap.

If you’re curious, you can also see the warmup process through the stepsize parameter. Stan uses the warmup period to tune the stepsize parameter to achieve a target acceptance rate (specified by adapt_delta). You can think of stepsize like resolution. A big stepsize means the model will quickly cover the entire picture of the posterior, but the picture will be really fuzzy, and if the posterior probability surface has important fine-scale variation, the model will miss it. A really small stepsize will produce a really high resolution picture, but it will take a lot longer to make that picture. So, a great feature of Stan is it uses this target acceptance rate to find the right stepsize for the model. You can see this process play out by plotting the stepsize as the model tunes itself during the warmup period (first 1000 iterations).

mack_diagnostics %>% 
  ggplot(aes(iteration, stepsize__, color = chain)) + 
  geom_line()  + 
  scale_color_locuszoom()

There are many more diagnostics for the actual sampler, but those are a few of the really critical ones. Just because the divergences and treedepth look good doesn’t mean that your model doesn’t have problems that deeper diagnostics would reveal, but seeing problems in those two diagnostics should give you a huge red flag right off the bat.

You can test these things by changing the control options in your call to Stan. Try for example setting adapt_delta = 0.5 or max_treedepth = 2. You’ll see that you start to develop divergences in the first case, since in order to achieve that lower target acceptance rate Stan sets the stepsize quite large, meaning that you miss important parts of the parameter space and create divergences. In the second case you’ll start to see the treedepth bump up against the max_treedepth.

While these are extreme examples, this also gives an idea of a first step at fixing these problems if they pop up: if you fit a model and you get divergences, the first thing you can try is to increase adapt_delta (in fact, Stan will suggest as much to you if this happens). If you’re bumping up against max_treedepth, increase max_treedepth! If either of these don’t solve the problem, then you’ll need to start thinking about the specification of your model, which we’ll cover a little later.

Feature Engineering and Non-Centering

If you’ve tried these tricks and you’re still getting lots of divergent transitions, or if your model is just taking forever to run, feature engineering and non-centered model parameterizations can help. Feature engineering refers to how you decide to encode your data. Suppose for example that you have a model of health outcomes that includes both birth weight (mean say 3 kg) and city population (mean say 20,000 people). These two variables are on vastly different scales, meaning that a numerical routine like HMC needs to jump around vastly different scales of numbers to find the right coefficient for each. You could instead center (subtract the mean value) and then scale (divide by the standard deviation) each variable (e.g. sc_pop = (pop - mean(pop)) / sd(pop)) before passing it to Stan, so that each variable is on the same scale (standard deviations from its mean). The information content of the data hasn’t changed, it’s now just easier for Stan to hunt over the parameter space. Though of course this does change the interpretation of any coefficients that come out of the model, so you’ll need to think about that. There are lots of ways to do feature engineering; I recommend checking out the recipes package for a great set of feature engineering tools and discussion.
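
As a rough sketch of what centering and scaling might look like in R before passing data to Stan (the health_data, birth_weight, and population names below are invented for illustration):

# invented example data: birth weights in kg, city populations in people
health_data <- tibble(birth_weight = rnorm(100, 3, 0.5),
                      population = rnorm(100, 20000, 5000))

# center and scale each variable so both are measured in standard deviations from their means
health_data <- health_data %>%
  mutate(sc_birth_weight = (birth_weight - mean(birth_weight)) / sd(birth_weight),
         sc_population = (population - mean(population)) / sd(population))

# base R's scale() does the same thing:
# health_data$sc_population <- as.numeric(scale(health_data$population))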

Suppose you do that and you’re still running into divergences. You also decide that you want to include random effects for each city in your model, of the form

\[city_{i} \sim normal(\mu_{city},\sigma_{city}) \]

To estimate this, you could simply estimate the coefficient for each city, using the above relationship as its prior. This can run into problems related to a fun thing called “Neal’s Funnel” (see the Stan Documentation for a good description) that causes the model to produce a bunch of divergences and have trouble converging (this phenomenon pops up all the time in hierarchical models).

Instead, you can do this

\[city_i = \mu_{city} + d_{i}\sigma_{city}\]

Where \(d_{i}\) represents the estimated deviation (in standard deviations) of city i’s coefficient from the mean city coefficient, and is distributed

\[d_{i} \sim normal(0,1) \]

This will produce the same estimates of the net coefficient \(city_{i}\), but will do so much more efficiently. This might seem like voodoo, but it works, see the Stan reference guide for more examples on this.
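
We can’t show the sampler’s behavior here, but a quick R simulation (a sketch with made-up values for \(\mu_{city}\) and \(\sigma_{city}\)) illustrates that the centered and non-centered forms describe exactly the same distribution of city effects; only the way the parameters are expressed changes:

set.seed(99)

mu_city <- 2 # mean city effect (made-up value)
sigma_city <- 0.5 # standard deviation of effects across cities (made-up value)

# centered form: draw each city effect directly from normal(mu_city, sigma_city)
centered <- rnorm(1e5, mu_city, sigma_city)

# non-centered form: draw standardized deviates d ~ normal(0, 1), then shift and scale
d <- rnorm(1e5, 0, 1)
non_centered <- mu_city + d * sigma_city

# the two sets of draws have the same distribution
summary(centered)
summary(non_centered)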

Parameter Diagnostics

Now that we’ve taken a look at the highest-level red flags (divergences and treedepth) and satisfied ourselves that we’re in the clear, we can start to diagnose individual parameter estimates. There are a lot of ways to look at these, but the most useful starting point in my opinion is to extract summary statistics on each model parameter. You can do that by summary(my_model)$summary. summary on its own prints out summaries of each parameter; the $summary part allows you to access and store the data behind the printed statistics.

 bh_summary <- summary(bh_fit)$summary %>% 
  as.data.frame() %>% 
  mutate(variable = rownames(.)) %>% 
  select(variable, everything()) %>% 
  as_data_frame()

bh_summary %>% head()
## # A tibble: 6 × 11
##   variable    mean se_mean      sd  `2.5%`   `25%`   `50%`   `75%` `97.5%` n_eff
##   <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>
## 1 h        7.08e-1 3.17e-3 1.44e-1 3.94e-1 6.10e-1 7.27e-1 8.18e-1 9.33e-1 2071.
## 2 log_alp… 1.53e+1 2.87e-2 9.83e-1 1.39e+1 1.46e+1 1.51e+1 1.58e+1 1.78e+1 1171.
## 3 log_beta 1.64e+1 3.69e-2 1.38e+0 1.38e+1 1.54e+1 1.63e+1 1.72e+1 1.92e+1 1386.
## 4 sigma    1.25e+0 3.21e-3 1.54e-1 9.87e-1 1.14e+0 1.24e+0 1.35e+0 1.59e+0 2306.
## 5 beta     3.53e+7 2.16e+6 8.76e+7 9.38e+5 5.04e+6 1.20e+7 3.01e+7 2.18e+8 1644.
## 6 alpha    8.90e+6 5.58e+5 2.00e+7 1.14e+6 2.23e+6 3.66e+6 7.40e+6 5.45e+7 1285.
## # … with 1 more variable: Rhat <dbl>

There are two really important diagnostic statistics hidden in this summary:

  • n_eff: the effective sample size

  • Rhat: the “Gelman and Rubin potential scale reduction statistic”

n_eff measures the effective sample size of that particular parameter. Remember that each iteration of the HMC is based off the parameter value on the previous iteration. Ideally though, if the algorithm works correctly, the parameter chosen at the next iteration will be independent of that early parameter value (this is what “thinning” looks to accomplish in other MCMCs, though you can also thin using HMC). If you’re not doing a very efficient job at sampling the parameter space though, parameter values at a given iteration are more likely to be close to the parameter values at the last iteration. This means that these parameters aren’t really independent, and so if you have 1000 draws from the posterior, you might not actually have 1000 independent samples of the parameter, but rather some smaller number of truly “independent” draws.

So, n_eff is the sum of the effectively independent sampling iterations across all chains. In this case, we have 4 chains, with 2000 iterations, half of which are warmup, meaning we sample 1000 iterations in each chain, so the max n_eff possible in this case is 4000

bh_summary %>% 
  ggplot(aes(n_eff)) + 
  geom_histogram() + 
  geom_vline(aes(xintercept = 4000), color = 'red')

Most of our parameters have a fairly high n_eff, though we see a few are somewhat lower. How do we know if an n_eff like 2000 (out of 4000 possible) is too low? The Rhat statistic helps tell us whether these parameters are so poorly sampled that we have a problem. More or less Rhat tells you whether or not each of the chains has reached a stable posterior distribution, despite starting at different starting values. Gelman recommends that Rhat for each parameter be less than 1.1

bh_summary %>% 
  ggplot(aes(Rhat)) + 
  geom_histogram() + 
  geom_vline(aes(xintercept = 1.1), color = 'red')

Looks like we’re good! However, if you’re concerned about the effective sample size of some of your parameters, the easiest thing to do is simply increase the total number of iterations and especially the warmup period. E.g. in this case we could move to 4000 iterations per chain with 2000 warmup iterations.

So now that we’ve checked some individual parameter diagnostics, we can take a look at our parameter estimates themselves.

Going back to the summary we created, you’ll notice that Stan kindly calculated mean values and credible intervals for us.

Remember the three key parameters in our model: steepness h, max recruits \(\alpha\), and our standard deviation of the log recruits \(\sigma\). Let’s take a look at the fits for those variables

bh_summary %>% 
  filter(variable %in% c('h','alpha','sigma')) %>% 
  ggplot() + 
  geom_linerange(aes(variable, ymin = `2.5%`,ymax = `97.5%`)) + 
  geom_crossbar(aes(variable, mean, ymin = `25%`, ymax = `75%`), fill= 'grey') + 
  facet_wrap(~variable, scales = 'free')

That’s all well and good, but what we might really want to see are the actual recruits estimated by our model. Stan is great for this as well: since we calculated our estimates of recruitment in the transformed parameters block, Stan automatically stores the values for recruitment associated with each draw from the posterior, giving us our credible intervals for our recruitment estimates as well!

Our recruitment estimates were stored in the rhat object

rhat <- bh_summary %>% 
  filter(str_detect(variable,'rhat') & !str_detect(variable,'log') & !str_detect(variable,'pp'))

sal_data <- sal_data %>% 
  mutate(mean_rhat = rhat$mean,
         lower = rhat$`2.5%`,
         upper = rhat$`97.5%`)

sal_data %>% 
  ggplot() + 
  geom_point(aes(ssb, r)) + 
  geom_line(aes(ssb, mean_rhat)) + 
  geom_ribbon(aes(ssb, ymin = lower, ymax = upper), alpha = 0.25)

Looks good! So, the black line is our model’s estimate of the expected recruits at a given observed SSB, and the grey ribbon corresponds to the 95% credible interval around this expected value.

If we’re sticklers for the truly raw data, we can also look at the parameter values at each of our samples, rather than using summary to process them for us. We can get at those using rstan::extract(). Note: by default rstan::extract() excludes the warmup iterations, and reshuffles the draws. If you want to keep the draws in their original order (for example to check for autocorrelation), you can set rstan::extract(permuted = FALSE), and if you want to include the warmup period rstan::extract(inc_warmup = TRUE)

bh_mcmc <- bh_fit %>% 
  rstan::extract()

bh_pars <- bh_mcmc[ c('h','alpha','sigma')] %>% 
  map_df(as_data_frame, .id = 'variable')

bh_pars %>% 
  ggplot(aes(value, fill = variable)) + 
  geom_density() + 
  facet_wrap(~variable, scales = 'free') + 
  coord_flip() + 
  scale_fill_locuszoom()

Posterior Predictive Analysis

So far, we’ve fit our model, checked some critical diagnostics, and examined our model fits. Stan also allows us to examine “posterior predictive” fits, an immensely powerful tool in diagnosing Bayesian models, and in using Bayesian models for prediction.

A huge advantage of Bayesian modeling is that it forces us to very explicitly write out our model in terms of our beliefs about the error structures of the model. For example, here we have assumed that our observed recruitment data come from a log-normal distribution, that steepness comes from a uniform distribution on the interval 0.2-1, etc.

Together then, each of these distributions make up the posterior probability distribution, which Stan helps us sample from. This also means though that we have a very clear hypothesis about the underlying process generating our data. So, if our hypothesis is right, our model should be able to generate data that looks very similar to the data we actually observed.

In this case, our data are observed recruits. We hypothesize that these observed recruits come from a distribution.

\[log(recruits) \sim normal(bh(h,\alpha,ssb), \sigma)\]

So, using our draws from the posterior of h, \(\alpha\), and \(\sigma\) and our observed SSB, we can use this model to generate draws of log_recruits, and compare those to the values that we actually see (we did this in the generated quantities block of bh_model.stan). The generated quantities block is only evaluated once per saved draw, rather than at every step of the sampler, and so is less time/memory intensive than the other blocks.

pp_rhat <- bh_mcmc[ 'pp_rhat'] %>% 
  map_df(as_data_frame, .id = 'variable') %>% 
  gather(observation,value, -variable)

ggplot() + 
  geom_density(data = pp_rhat, aes(log(value),fill = 'Posterior Predictive'), alpha = 0.5) + 
  geom_density(data = sal_data, aes(log(r), fill = 'Observed'), alpha = 0.5)

So, we see that our model does a reasonable job of reproducing the overall shape of the observed data, a good sign! But, we can also see that our model is not perfectly reproducing the data it was fit to; the observed data have both more small values and more large values than our model predicts (our model looks like it basically split the difference here). We can do a lot more with this type of analysis, for example testing the ability of the posterior predictive to estimate “test statistics” like the min, max, mean, and standard deviation of the observed data; launch_shinystan can do several of these for you if you set your model up correctly.

We can also use the posterior predictive process to make predictions for new data points. Suppose we get a new observation of SSB for our salmon stock; what does our model suggest is the 95% credible interval of recruits that that SSB will produce? We can of course take the mean of h, \(\alpha\), and \(\sigma\) and make a prediction. Ideally though we want to incorporate the uncertainty (and covariance) across our parameters. We can do this by taking our posteriors for all of our parameters, and generating new predictions from our model using those posterior draws.

rhat <- bh_summary %>% 
  filter(str_detect(variable,'rhat') & !str_detect(variable,'log') & !str_detect(variable,'pp'))

pp_rhat <- bh_summary %>% 
  filter(str_detect(variable,'pp_rhat')) %>% 
  mutate(ssb = sal_data$ssb)

new_ssb <-  mean(sal_data$ssb)

bh_prediction <- bh_pars %>%
  filter(variable %in% c("h", "alpha", "sigma")) %>%
  group_by(variable) %>%
  mutate(iteration = seq_along(value)) %>%
  spread(variable, value) %>%
  mutate(mean_predicted_recruits = (0.8 * alpha * h * new_ssb) / (0.2 * alpha * (1 - h) +
  (h - 0.2) * new_ssb)) %>% 
  ungroup() %>% 
  mutate(log_predicted_recruits = rnorm(nrow(.), log(mean_predicted_recruits) - 0.5 * sigma^2, sigma)) %>% 
  mutate(predicted_recruits = exp(log_predicted_recruits)) %>% 
  mutate(percentile = percent_rank(predicted_recruits)) %>% 
  filter(percentile >= 0.025 & percentile <= 0.975) # keep the central 95% of predicted recruits


sal_data <- sal_data %>% 
  mutate(mean_rhat = rhat$mean,
         lower = rhat$`2.5%`,
         upper = rhat$`97.5%`)

sal_data %>% 
  ggplot() + 
  geom_point(aes(ssb, r)) + 
  geom_line(aes(ssb, mean_rhat)) + 
  geom_ribbon(aes(ssb, ymin = lower, ymax = upper), alpha = 0.25) +
  geom_violin(data = bh_prediction, aes(new_ssb,predicted_recruits), color = "red") 

All together then, this simple model walks us through the basic steps of using Stan to fit models:

  1. Writing our model (in terms of likelihoods)

  2. Coding our model

  3. Passing our data from R to Stan

  4. Performing high-level diagnostics on our model fit (divergences, treedepth, Rhat, etc.)

  5. Examining the fitted coefficients of our model

  6. Examining the posterior predictive statistics

  7. Using our coefficients for prediction

Each of these steps warrants more careful consideration than we have given it here, but this is a solid foundation to base future analyses on; whether you’re working with simple models like this one or with much more complex models, the process stays the same.

Comparing Models

So far, we’ve been focused on diagnosing our model to make sure that it has in fact converged and to understand the behavior of our model if it has. However, to put it bluntly, if you’re only writing one model, you’re doing it wrong. We made a huge number of assumptions in the design of this model, from the functional form of our stock-recruitment relationship, to our choices of likelihoods and priors.

Our next step should be to test these assumptions by constructing alternative models and comparing them. Luckily, Stan has a number of tools available to help us with model comparison.

The Beverton-Holt model assumes a “compensatory” form of density dependence. A simple ecological example would be habitat filling: if there is a finite amount of available habitat for recruits, once those spots fill up, putting more and more eggs into the system leaves the total number of recruits the same. The Ricker model is a bit more flexible and allows for “overcompensatory” dynamics, meaning the SR curve can start to bend back down. An example of this would be a cannibalistic process, where once adult density (SSB) gets high enough, adults start to prey on recruits and actually drive recruitment back down.

We can use Stan to test the relative performance of the BH vs Ricker models in explaining the observed patterns of SSB and recruitment.

We’ll use a parameterization of the Ricker curve that still includes steepness

\[ \beta = log(5 * h) / (0.8 * ssb_{max})\]

\[\alpha = log(r_{ssb_{max}} / ssb_{max}) + \beta * ssb_{max}\]

\[ r = ssb \, e^{(\alpha - \beta * ssb)}\]

rzero <-  10 # recruitment at the reference spawning biomass (r_{ssb_max})

szero <- 200 # reference spawning biomass (ssb_max)

h <-  0.9 # steepness

beta <- log(5 * h) / (0.8 * szero)

ssb <- 0:500

alpha <- log(rzero / szero) + (beta * szero)

recruits <- (ssb * exp(alpha - beta * ssb))

data_frame(ssb = ssb, recruits = recruits) %>% 
  ggplot(aes(ssb, recruits)) +
  geom_point()

So how do we fit this? We could of course write a new .stan file called “ricker_model.stan.” However, that seems a little wasteful; the Beverton-Holt and Ricker parameterizations take the same data and do many of the same operations, just with slightly different parameters. We’d like to write one script that can fit either of the two models. I’m going to do this by writing a function in Stan that calculates recruits using either Beverton-Holt or Ricker dynamics, leaving the rest of the code the same. You specify functions in Stan in the functions block. I recommend reading the Stan documentation for more information on functions, but you’ll see that they work much like functions in other languages. The biggest sticking point is declaring the types of the function arguments in the function signature.


functions{

  vector calc_recruits(vector rec_pars, real h, vector ssb, int n, int bh){

    vector[n] rhat;

    real beta;

    real alpha;

     if (bh == 1) { // Beverton-Holt
      rhat = (0.8 * rec_pars[1] * h * ssb) ./ (0.2 * rec_pars[1] * (1 - h) + (h - 0.2) * ssb);

     } else { // Ricker

      beta = log(5 * h) / (0.8 * rec_pars[2]);

      alpha = log(rec_pars[1] / rec_pars[2]) + beta * rec_pars[2];

      rhat = ssb .* exp(alpha - beta * ssb);

      }

  return rhat;


} // close function

}


data{

int<lower = 0> n; // number of observations

int<lower = 0> n_sr_params; // number spawner recruit parameters

vector[n] ssb; // vector of observed ssb

vector[n] r; // vector of recruits

real max_r;  // max observed recruitment

real max_h; // max steepness

int<lower = 0, upper = 1> bh; // 1 = Beverton-Holt, 0 = Ricker

real<lower = 0> rec_par_mean[n_sr_params];

real<lower = 0> rec_par_cv[n_sr_params];


}
transformed data{

vector[n] log_r; // log recruitment

log_r = log(r);

}

parameters{

real<lower = 0.2, upper = max_h> h; //steepness

real<lower = 0> sigma;

vector<lower = 0 >[n_sr_params] rec_pars;

}
transformed parameters{

vector[n] rhat;

vector[n] log_rhat;

rhat = calc_recruits(rec_pars, h, ssb, n, bh);

log_rhat = log(rhat);

}

model{
log_r ~ normal(log_rhat, sigma);

sigma ~ cauchy(0,2.5);

for (i in 1:n_sr_params){

rec_pars[i] ~ normal(rec_par_mean[i],rec_par_cv[i] * rec_par_mean[i]);

}

}

generated quantities{

  vector[n] pp_rhat;

  vector[n] log_likelihood;

  for (i in 1:n) {

   pp_rhat[i] = exp(normal_rng(log_rhat[i] - 0.5 * sigma^2, sigma));

  }

   for (i in 1:n) {

   log_likelihood[i] = normal_lpdf(log_r[i] | log_rhat[i], sigma);

   }

}

To do this, I’m going to include a variable called bh, where if bh == 1 the model uses Beverton-Holt dynamics, and otherwise it uses Ricker dynamics (you can see how this can easily be generalized to more than two candidate models). I’ll call this model “generic_stock_recruit.stan,” and I’ll now use it to fit both the Ricker and Beverton-Holt models.

warmups <- 2500

total_iterations <- 5000

max_treedepth <-  10

adapt_delta <-  0.95

chains <- 4

data <- list(
  n = nrow(sal_data),
  r = sal_data$r,
  ssb = sal_data$ssb,
  max_r = max(sal_data$r),
  bh = 0,
  n_sr_params = 2,
  max_h = 3,
  rec_par_mean = c(1 * max(sal_data$r),  0.25 * max(sal_data$ssb)),
  rec_par_cv = c(5, 5)
  )


inits <-
  map(1:chains,
  ~ list(
  h = runif(1, 0.2, 1),
  rec_pars = rnorm(2, c(1 * data$max_r, 1 * data$max_r), .2 * data$max_r)
  ))
  

ricker_fit <-
stan(
file = here::here("src", "generic_stock_recruit.stan"),
data = data,
chains = chains,
warmup = warmups,
iter = total_iterations,
cores = 1,
refresh = 250,
init = inits,
control = list(max_treedepth = max_treedepth,
adapt_delta = adapt_delta)
)

You’ll notice below when I call the Beverton-Holt version of the model that I’m doing something slightly odd with arrays when I create my data list. This is because Stan is much more finicky (or exact, if you want to think of it that way) about data types than R. Part of the appeal of R from a data-wrangling perspective is that it is REALLY forgiving about data types: R is perfectly happy to construct a vector of length 1, or a 1-by-1 matrix, and these things will more or less behave in the same way (don’t hold me to that). Stan, though, is a bit pickier. If R hands Stan a single number where Stan expects a one-dimensional container, Stan says “nope, that’s a scalar,” and bad things happen. In this case, I declared rec_par_mean and rec_par_cv in the data block as arrays of reals with length n_sr_params. But when n_sr_params is 1, R passes rec_par_mean as a single number, Stan treats it as a scalar, and a scalar has no dimension to match the declared array. We get around that by explicitly passing rec_par_mean as an array with one dimension (see the Stan manual’s discussion of array data types).
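If it helps to see what Stan is reacting to, here’s a tiny illustration on the R side: a bare number has no dim attribute, while a length-1 array does, and that dimension is what lines up with the declared rec_par_mean[n_sr_params].

x <- 2 * max(sal_data$r)

dim(x)                 # NULL: R sees a plain scalar, which won't match a declared array
dim(array(x, dim = 1)) # 1: a one-dimensional array of length 1
dim(as.array(x))       # 1: as.array() gets you the same thing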

data <- list(
  n = nrow(sal_data),
  r = sal_data$r,
  ssb = sal_data$ssb,
  max_r = max(sal_data$r),
  bh = 1,
  n_sr_params = 1,
  max_h = 1,
  rec_par_mean = array(2 * max(sal_data$r), dim = 1),
  rec_par_cv = array(2, dim = 1)
  )

inits <-
  map(1:chains,
  ~ list(
  h = runif(1, 0.2, 1),
  rec_pars = as.array(rnorm(1, 2 * data$max_r, .2 * data$max_r))
  ))
  

bh_fit <- stan(file = here::here("src","generic_stock_recruit.stan"),
                 data = data,
                 chains = chains,
                 warmup = warmups,
                 iter = total_iterations,
                 cores = 1,
                 refresh = 250,
                 init = inits,
                 control = list(max_treedepth = max_treedepth,
                                adapt_delta = adapt_delta))
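As an aside, since the two calls differ only in their data lists and initial values, you could fit any number of candidate models in one go by mapping over a list of setups. A minimal sketch, where ricker_setup and bh_setup are hypothetical names for the data/inits pairs we built above:

# hypothetical containers: each element bundles the data list and inits for one model
setups <- list(ricker = ricker_setup, bh = bh_setup)

fits <- map(setups, ~ stan(
  file = here::here("src", "generic_stock_recruit.stan"),
  data = .x$data,
  init = .x$inits,
  chains = chains,
  warmup = warmups,
  iter = total_iterations,
  control = list(max_treedepth = max_treedepth, adapt_delta = adapt_delta)
))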

Great, we’ve used one script to fit both models, now let’s start comparing them.

 bh_summary <- summary(bh_fit)$summary %>% 
  as.data.frame() %>% 
  mutate(variable = rownames(.)) %>% 
  select(variable, everything()) %>% 
  as_data_frame()


 ricker_summary <- summary(ricker_fit)$summary %>% 
  as.data.frame() %>% 
  mutate(variable = rownames(.)) %>% 
  select(variable, everything()) %>% 
  as_data_frame()
 
ricker_rhat <- ricker_summary %>% 
  filter(str_detect(variable,'rhat') & !str_detect(variable,'log') & !str_detect(variable,'pp')) %>% 
  mutate(model = 'ricker',
         ssb = sal_data$ssb)


bh_rhat <- bh_summary %>% 
  filter(str_detect(variable,'rhat') & !str_detect(variable,'log') & !str_detect(variable,'pp')) %>% 
  mutate(model = 'bh',
         ssb = sal_data$ssb)

rhat <- ricker_rhat %>% 
  bind_rows(bh_rhat)

  rhat %>% 
  ggplot() + 
  geom_point(data = sal_data,aes(ssb, r)) + 
  geom_line(aes(ssb, mean, color = model)) + 
  geom_ribbon(aes(ssb, ymin = `2.5%`, ymax = `97.5%`, fill = model),alpha = 0.25) + 
  scale_fill_manual(values = wes_palette("Royal1"))  + 
  scale_color_manual(values = wes_palette("Royal1"))  

As a first step we should of course run all the same convergence checks for the Ricker model that we ran for the BH model, and base some of our judgement on those results (a quick scan of the sampler diagnostics might look like the sketch below). If, for example, the posterior predictive checks look much better for one model, that gives us some indication that the data support one model over the other. So, after the diagnostics sketch, let’s compare the posterior predictive distributions of the two models.
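A minimal sketch of those sampler checks for the Ricker fit (check_hmc_diagnostics() is available in recent versions of rstan; the Rhat and n_eff columns come from the summary we already built):

# sampler-level diagnostics: divergences, treedepth saturation, and energy (BFMI)
rstan::check_hmc_diagnostics(ricker_fit)

# scan the largest Rhat values (and their effective sample sizes) from the summary
ricker_summary %>% 
  select(variable, Rhat, n_eff) %>% 
  arrange(desc(Rhat)) %>% 
  head()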

bh_pp_rhat <- rstan::extract(bh_fit)['pp_rhat'] %>% # posterior predictive draws from the Beverton-Holt fit
  map_df(as_data_frame, .id = 'variable') %>% 
  gather(observation, value, -variable) %>% 
  mutate(model = "Beverton-Holt")


ricker_pp_rhat <- ricker_fit %>% 
  rstan::extract()
  
  ricker_pp_rhat <- ricker_pp_rhat["pp_rhat"] %>% 
  map_df(as_data_frame, .id = 'variable') %>% 
  gather(observation,value, -variable) %>% 
  mutate(model = "Ricker")


pp_rhat <- bh_pp_rhat %>% 
  bind_rows(ricker_pp_rhat)
  

ggplot() + 
  geom_density(data = pp_rhat, aes(log(value), fill = model), alpha = 0.5) + 
  geom_density(data = sal_data, aes(log(r), fill = 'Observed'), alpha = 0.5) + 
  scale_fill_manual(values = wes_palette("Zissou1"))  

We can also use the loo package to quantitatively compare the two models. loo stands for leave-one-out, and the loo function provides a powerful interface for approximate leave-one-out cross-validation of Bayesian models. Basically, it estimates the out-of-sample predictive accuracy of each of the models. You can think of it as an improvement over AIC/DIC for model comparison (see ?"loo-package"). Of course, if visual diagnostics like the posterior predictive checks strongly suggest that you should select one model over another, you can often trust that. But for more ambiguous cases, we’d like a quantitative way to judge model performance, and that’s where loo comes in.

There are a few ways to use loo, but the simplest requires a bit of prep work. loo needs to evaluate the likelihood as a function of leaving out data, so it needs access to the pointwise log-likelihood (the log-likelihood of each observation). You can either write a function to do this, which we won’t cover here (see ?loo::loo), or you can go back into your model and store the pointwise log-likelihood in the generated quantities block.


  vector[n] log_likelihood;

  for (i in 1:n) {

   log_likelihood[i] = normal_lpdf(log_r[i] | log_rhat[i], sigma);
    
  }

Once we’ve done this, we can extract the log-likelihood using loo::extract_log_lik.

loo::extract_log_lik() has a parameter_name argument that defaults to parameter_name = "log_lik", but in our model we named the log-likelihood object log_likelihood, so we need to tell extract_log_lik where to look.

log_lik_ricker <- extract_log_lik(ricker_fit, parameter_name = "log_likelihood")

Once we have this, we can pass the log-likelihood matrix to the loo function to get our diagnostics.

ricker_loo <- loo::loo(log_lik_ricker)
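Printing the resulting object shows the estimated elpd_loo, p_loo, and looic, each with a standard error:

print(ricker_loo) # elpd_loo, p_loo, and looic, each with a standard error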

On their own, these values aren’t too informative for us (in the same way that a lone AIC value doesn’t really tell you much).

But, we can now repeat this process with our BH model, and use loo to compare them.

log_lik_bh <- extract_log_lik(bh_fit, parameter_name = "log_likelihood")

bh_loo <- loo::loo(log_lik_bh)

compare(bh_loo, ricker_loo)
## elpd_diff        se 
##       0.1       0.8

The output of compare is a bit confusing, but basically, if elpd_diff is positive, the second model is preferred according to loo; if it’s negative, the first model is preferred. So in this case, per the loo criterion, there is a hair more support for the Ricker model, at least as we parameterized things here. But we can also see that elpd_diff is on about the same scale as its standard error se, which tells us there isn’t much of a difference between the models (if, for example, elpd_diff had been 2000, much bigger than the se of 0.8, that would be much stronger evidence of a real difference). Biologically, we also have reasons to think the Ricker model might be the better option for a salmon stock.
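A quick way to keep yourself honest here is to look at the ratio of the difference to its standard error. This is a rough gut check, not a formal test:

model_comparison <- compare(bh_loo, ricker_loo)

# how many standard errors apart are the models? anything on the order of 1
# or less suggests the data can't really distinguish between them
abs(model_comparison[["elpd_diff"]]) / model_comparison[["se"]]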

If you want to compare more than two models, you just pass more loo objects to compare! compare conveniently orders the results in descending order of model performance (here I’ve simply passed ricker_loo in twice to demonstrate).

compare(bh_loo, ricker_loo, ricker_loo)
##            elpd_diff se_diff elpd_loo p_loo looic
## ricker_loo   0.0       0.0   -58.1      2.4 116.2
## ricker_loo   0.0       0.0   -58.1      2.4 116.2
## bh_loo      -0.1       0.8   -58.2      2.2 116.4

And just like that, we have a solid sketch of going from raw data, to model fits, to model comparison, using Stan and R. There is clearly a lot more work that would have to go into doing this analysis properly (e.g. we haven’t tested how sensitive our results are to our choices of prior distributions), but the tools we’ve gone over here should serve as a useful template to build on for more complete analyses.

Hopefully this helps you start using Stan in your analyses; I’ll post some more complicated examples as time allows. Good luck!

rstanarm

The approaches I’ve gone through here are great for rolling your own model, which we often have to do. But it can be pretty annoying to do all that work just to run, say, a linear model. To that end, the folks at Stan put together the rstanarm package, which provides Bayesian methods for basically all your linear modeling needs, from lm to glmer and everything in between. Check it out!

library(rstanarm)
example_model <- 
  rstanarm::stan_glmer(cbind(incidence, size - incidence) ~ size + period + (1|herd),
             data = lme4::cbpp, family = binomial, QR = TRUE,
             # this next line is only to keep the example small in size!
             chains = 2, cores = 1, seed = 12345, iter = 500)

plot(example_model)

A note on brms

It seems like a lot more people are using the brms implementation of Stan for their Bayesian analyses. brms sits somewhere in between rstanarm and “raw” Stan, letting you write funkier models than the standard linear-model forms can represent, but without some of the complexity of writing raw Stan code. I chose not to cover brms here since a) I don’t use it or really know how, and b) I think there’s a lot of value in learning how to code in Stan itself if you’re really going to internalize it into your research. Not to take anything away from brms (it seems great), but this resource is here to help you learn the Stan language itself.

References

Dorn, Martin W. 2002. “Advice on West Coast Rockfish Harvest Rates from Bayesian Meta-Analysis of Stock-Recruit Relationships.” North American Journal of Fisheries Management 22 (1): 280–300. https://doi.org/10.1577/1548-8675(2002)022<0280:AOWCRH>2.0.CO;2.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. CRC press.
Hobbs, N. Thompson, and Mevin B. Hooten. 2015. Bayesian Models: A Statistical Primer for Ecologists. Princeton, New Jersey: Princeton University Press.
McElreath, Richard. 2016. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman & Hall/CRC Texts in Statistical Science Series 122. Boca Raton: CRC Press/Taylor & Francis Group.
Monnahan, Cole C., James T. Thorson, and Trevor A. Branch. 2016. “Faster Estimation of Bayesian Models in Ecology Using Hamiltonian Monte Carlo.” Methods in Ecology and Evolution. https://doi.org/10.1111/2041-210X.12681.