A Bayesian SIR model in JAGS

Introduction

In this notebook I am going to fit a Bayesian SIR model in JAGS. Some of this material is a simplification of the more advanced spatial SEIRS model found in this thesis by Grant Brown from 2015 which is well worth reading. Another very useful paper is Lekone and Finkelstadt which has no spatial aspect, and is the majority of what is re-created here. I have changed the notation to what I believe is a clearer version.

The main idea is to simplify the differential equations of the SIR model into difference equations and then use probability distributions and transmission matrices to quantify uncertainties between compartments. All time is treated as discrete, which is fine for current COVID-19 data which is almost all available at a daily time step. In another document I will extend this model to SIER, and further a spatial or stratified type model. In the next document I will fit this model to a chosen country from the ECDC data set.

It may be possible to fit a similar model in Stan but this would require integrating out many of the discrete parameters.

Notation

First data:

Let \(t\) be time for \(t = 1, \ldots, T\) with all time points known and equally spaced.
Let \(N\) be the population size
Let \(N_{S \rightarrow I}(t)\) be the number of susceptible people who get infected at time \(t\). This is included for the moment as data but often is actually unknown.
Let \(N_{I \rightarrow R}(t)\) be the number of infected people who are removed (died or recovered) at time \(t\). We have a (likely reasonable) estimate of the number of people who have died at time \(t\) (or at least before time \(t\)) but a much poorer estimate of the number who have recovered. The latter is reported by John Hopkins but likely to be a massive under-estimate.

Now latent variables:

Let \(S(t)\) be the number of susceptible individuals at time \(t\) with starting value \(S(1)\)
Let \(I(t)\) be the number of infected individuals at time \(t\) with starting value \(I(1)\)
Let \(R(t)\) be the number of removed individuals at time \(t\) with starting value \(R(1)\)

Some parameters:

\(\beta\) is the base transmission rate before control measures are introduced, i.e. the rate at which susceptible individuals become infected
\(q\) is a parameter that controls how the transmission rate decays with control measures
\(\gamma\) is a parameter that controls the rate at which infected individuals are removed. I think the inverse of this is the mean number of days before an infected individual becomes removed

Some extra parameters are derived:

Let \(R_0\) (or \(R_0(t)\)) be the basic reproduction number, possibly changing over time, that determines the expected number of people who turn susceptible from an infected person
Let \(\tau^*\) be the time point when the epidemic goes extinct, i.e. when there are no more infected individuals left

Model

The model equations are given by: \[S(t+1) = S(t) - N_{S \rightarrow I}(t)\] \[I(t+1) = I(t) + N_{S \rightarrow I}(t) - N_{I \rightarrow R}(t)\] \[S(t) + I(t) + R(t) = N\]

In English these equations mean:

The number of susceptible individuals at time \(t+1\) is equal to the previous number of susceptible individuals less the number of people who were susceptible but now infected at time \(t\)
The number of infected people at time \(t+1\) is the number of infected people at time \(t\) plus the number of people who move from susceptible to infected at time \(t\) less the number of removed individuals who have moved from exposed to removed.
The total number of people in each compartment is always equal to the population size

Each of the latent variables is given a binomial distribution

\[N_{S \rightarrow I}(t) \sim Bin(S(t), p_{S \rightarrow I}(t))\] \[N_{I \rightarrow R}(t) \sim Bin(I(t), p_{I \rightarrow R})\]

In words these equations mean:

The number of susceptible people who get infected at time \(t\) is a binomial distribution with maximum value \(S(t)\) and time-dependent probability \(p_{S \rightarrow I}(t)\). This means that the number of people moving from susceptible to infected at time \(t\) has to be a fraction of the total number of susceptible individuals
The number of infected individuals who get removed at time \(t\) is a binomial distribution with maximum value \(I(t)\) and probability \(p_{I \rightarrow R}\). This means that the number of people who get removed at time \(t\) is a fraction of the number of people who are infected at time \(t\)

These probabilities are set at

\[p_{S \rightarrow I}(t) = 1 - \exp \left[ - \frac{\beta(t)}{N} I(t) \right]\] This equation makes the proportion of people moving from susceptible to infected dependent on the proportion of people who are infected (\(I(t)/N\)), i.e. the greater the proportion of people infected, the fewer can become infected.

\[p_{I \rightarrow R} = 1 - \exp( - \gamma)\] where

\(\beta(t)\) is the time dependent transmission rate
\(1/\gamma\) is the mean infectious period

Finally, the transmission rate is assumed to be constant up until a time point \(t^*\) at which point control measures are introduced:

\[\beta(t) = \left\{ \begin{array}{ll} \beta, & t< t^* \\ \beta e^{-q(t - t^*)}, & t \ge t^* \end{array} \right.\]

One of the key parameters to estimate is the basic reproduction number which is defined as \(R_0\) or \(\beta / \gamma\) in this notation. This model has the cool feature that we can estimate a time dependent reproduction number as:

\[ R_0(t) = \frac{\beta(t)}{\gamma} \frac{S(t)}{N}\]

Ideally, the data for this model would be \(N, N_{S \rightarrow I}(t), N_{I \rightarrow R}(t)\), the latter three for all time values. In fact we often do not have access to \(N_{S \rightarrow I}(t)\) (the number of susceptible individuals who become infected) so we may need to integrate out these values during the Bayesian model.

JAGS model

Below is some code for a JAGS model:

jags_code = '
model {
  # Likelihood
  for (t in 1:T) {
    N_I_R[t] ~ dbinom(p_I_R, I[t])
    # These are imputed:
    N_S_I[t] ~ dbinom(p_S_I[t], S[t])
    R_0[t] <- (beta[t]/gamma)*(S[t]/N)
  }
  
  # These are the known time evolution steps:
  for(t in 2:T) {
    S[t] <- S[t-1] - N_S_I[t-1]
    I[t] <- I[t-1] + N_S_I[t-1] - N_I_R[t-1]
    R[t] <- N - S[t] - I[t]
  }

  # Need a value for S[1] and I[1]
  I[1] <- I_start # Assume the number of people infected on day 1 is the same as those transitioning from E to I
  R[1] <- R_start # As above but for removed
  S[1] <- N - I[1] - R[1] # Left over
  
  # This is the prior on p_S_I[t]
  for(t in 1:T) {
    p_S_I[t] <- 1 - exp( - beta[t] * I[t] / N )
  }
  
  # Sort out the change point
  for(t in 1:(t_star - 1)) {
    beta[t] <- beta_const
  } 
  for (t in t_star:T) {
    beta[t] <- beta_const * exp( - q * (t - t_star))
  }
  
  # These are the priors on the other probabilities
  p_I_R <- 1 - exp( -gamma )
  
  # Now the prior on the hyper-parameters
  beta_const ~ dunif(0.3, 0.6) #= 0.4
  gamma <- 1/gamma_inv
  gamma_inv ~ dunif(5,8) # 6
  q ~ dbeta(2, 10) # 0.2
  
  # Can now forecast into the future
  for (t in (T+1):T_max) {
  
    # Transitions
    N_I_R[t] ~ dbinom(p_I_R, I[t])
    N_S_I[t] ~ dbinom(p_S_I[t], S[t])
    
    # R_0
    R_0[t] = (beta[t]/gamma)*(S[t]/N)
    
    # Compartment values
    S[t] <- S[t-1] - N_S_I[t-1]
    I[t] <- I[t-1] + N_S_I[t-1] - N_I_R[t-1]
    R[t] = N - S[t] - I[t]
    
    # Probability values
    p_S_I[t] <- 1 - exp( - beta[t] * I[t] / N )
    beta[t] <- beta_const * exp( - q * (t - t_star))
  }
}
'

Fit to simulated data

I will first try simulating some data from this model and checking that JAGS actually works.

# Summary values
N = 1e3 # Population size
T = 30 # Maximum time steps
t = 1:T

# First the hyper-parameters
t_star = 20 # Days in before intervention
gamma_inv = 10; gamma = 1/gamma_inv
beta_const = 0.6
q = 0.4

# Now the probabilities
p_I_R = 1 - exp(-gamma)

# Give an initial values for everthing required
p_S_I = N_S_I = N_I_R = S = I = R = beta = rep(NA, T)
beta[1:(t_star - 1)] = beta_const
beta[t_star:T] = beta_const * exp( - q * (t[t_star:T] - t_star))

S[1] = N - 10; I[1] = 10; R[1] = 0
p_S_I[1] =  1 - exp( - beta[1] * I[1] / N )
N_S_I[1] = rbinom(1, S[1], p_S_I[1])
N_I_R[1] = rbinom(1, I[1], p_I_R)

# Now can loop through filling in the other values
for (t in 2:T) {
  S[t] <- S[t-1] - N_S_I[t-1]
  I[t] <- I[t-1] + N_S_I[t-1] - N_I_R[t-1]
  R[t] = N - S[t] - I[t] 
  p_S_I[t] <- 1 - exp( - beta[t] * I[t] / N )
  
  # Now get the actual values
  N_I_R[t] = rbinom(1, I[t], p_I_R)
  # These are usually imputed:
  N_S_I[t] = rbinom(1, S[t], p_S_I[t])
}

# Create R0
R_0 = (beta/gamma)*(S/N)

# Create a plot of the epidemic
library(ggplot2)
library(tidyr)
tibble(t = 1:T,
       S, I, R) %>% 
  pivot_longer(names_to = 'Compartment', values_to = 'People', -t) %>% 
  ggplot(aes(x = t, y = People, colour = Compartment)) + 
  geom_vline(xintercept = t_star) +
  #scale_y_log10() +
  geom_line()

Now we can fit this model by providing the data to JAGS

N_future = 50 # Forecast 100 days into the future
T_max = T + N_future 
jags_data = list(N = N,
                 T = T,
                 T_max = T_max, 
                 t_star = t_star,
                 I_start = I[1],
                 R_start = R[1],
                 N_S_I = c(N_S_I, rep(NA, N_future)),
                 N_I_R = c(N_I_R, rep(NA, N_future))) 

library(R2jags)
jags_run = jags(data = jags_data,
                parameters.to.save = c("gamma_inv", "q", "beta",
                                       "beta_const", "S","I", 
                                       "R", "R_0"),
                model.file = textConnection(jags_code))
plot(jags_run)

Now plot the number of infected over time:

S_post = jags_run$BUGSoutput$median$S
I_post = jags_run$BUGSoutput$median$I
R_post = jags_run$BUGSoutput$median$R

tibble(t = 1:T_max,
  S_post, I_post, R_post) %>% 
  pivot_longer(names_to = 'Compartment', values_to = 'People', -t) %>% 
  ggplot(aes(x = t, y = People, colour = Compartment)) + 
  geom_vline(xintercept = t_star) + 
  geom_line()

Also plot the dynamic \(R_0\) against it’s true value

R0_post = jags_run$BUGSoutput$median$R_0
tibble(t = 1:T_max,
       R0_true = c(R_0, rep(NA, N_future)),
       R0_post = R0_post) %>% 
  pivot_longer(names_to = 'Type', values_to = 'R0', -t) %>% 
  ggplot(aes(x = t, y = R0, colour = Type)) + 
  geom_vline(xintercept = t_star) + 
  geom_line()

## Warning: Removed 50 row(s) containing missing values (geom_path).