---
title: "Modelling monthly data"
author: "Adrian Barnett"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Modelling monthly data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r}
#| label: setup
#| include: FALSE
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(season)
library(dplyr)
library(tibble)
library(stringr)
library(ggplot2)
library(flextable)
library(nimble) # for Bayesian models
```

## Monthly data

Time series data are often collected as monthly counts, for example, the monthly number of deaths or hospital admissions. In this vignette we will use the monthly number of deaths from cardiovascular disease (CVD) in people aged 75 and over in Los-Angeles for the years 1987 to 2000 ($n=168$ observations). We will compare three models for examining a seasonal pattern using month.

First we load the data and then use `ggplot2` to draw a lineplot of the monthly death counts over time.

```{r}
#| label: plot-cvd-yearmon-cvd
ggplot(CVD, aes(x = yrmon, y = cvd)) +
  geom_line(colour = 'darkred', linewidth = 1.05) +
  labs(
    x = "Time",
    y = "Monthly number of CVD deaths"
  ) +
  theme_bw()
```

There is a clear seasonal pattern, with more deaths in the winter months and fewer in the summer.

## Month as a fixed effect

In the first model, we treat each month as a separate category with a fixed parameter. 

```{r}
#| label: month-fixed-effect
model1 <- monthglm(
  formula = cvd ~ 1,
  data = CVD,
  family = quasipoisson(),
  offsetpop = expression(pop / 100000),
  offsetmonth = TRUE,
  refmonth = 6
)
plot(model1, ylab = 'Rate ratio')
```

The outcome is counts, so it is best to use a Poisson distribution. Using the `family = quasipoisson` instead of `family = poisson` accounts for any over-dispersion in the data, where the variance in counts is greater than the mean. Without acknowledging this over-dispersion, the confidence intervals for the estimated effects would be too narrow. You can see this if you change the `quasipoisson` to `poisson`.

We adjust for the population size using an offset scaled to 100,000 people. This is useful for adjusting for changes over time in the population size.

We also adjust for the varying number of days in the month using the `offsetmonth=TRUE` option as longer months will have more deaths on average than shorter months (e.g., January vs February).

The estimates in the plot are rate ratios. We used June as the reference month, so the rate ratio in this month is 1. The plot shows a typical seasonal pattern with a winter increase in deaths that peaks in January (around 40% higher than June). 

## Month as a random effect

Next we fit month as a random instead of fixed effect. This allows the monthly estimates to follow a distribution. The distribution most often used for random effects is the Normal distribution. 

To fit the random effects model we switch to a Bayesian approach using the `nimble` package.

```{r}
#| label: code-random-month
code_random_month <- nimbleCode({
  ## Likelihood
  for (i in 1:N) {
    # loop through time
    cvd[i] ~ dpois(mu[i])
    log(mu[i]) <- log(offset[i]) + log(n_days_month[i]) + alpha + beta.c[month[i]]
  }
  # priors
  alpha ~ dnorm(0, sd = 100)
  for (k in 1:12) {
    beta[k] ~ dnorm(0, tau = tau.month)
    beta.c[k] <- beta[k] - mean.beta
  }
  mean.beta <- mean(beta[1:12])
  tau.month ~ dgamma(1, 1)
  # rescale to standard deviation
  sigma.month <- 1 / sqrt(tau.month) 
})
```

As before, we assume that the counts follow a Poisson distribution. We have two offsets, one for the population and one for the days of the month.

Each month is fitted as a random effect centred on zero. The ensure that the month effects sum to zero, we create a centred version by subtracting the mean.

Next we set up the data ready for `nimble` to use. The number of days in each month are scaled to a 30 day month.

```{r}
#| label: constants-data
constants <- list(
  N = nrow(CVD),
  month = CVD$month,
  # scaled to per 100,000 people
  offset = CVD$pop / 100000,
  # scaled to per 30 days
  n_days_month = (CVD$n_days_month) / 30
)
data <- list(cvd = CVD$cvd)

```

Next we create the initial values to get the chains started, using an estimate of the mean for the intercept.

```{r}
#| label: inits
inits <- list(
  tau.month = 1,
  alpha = log(mean(CVD$cvd / (CVD$pop / 100000))),
  beta = rep(0, 12)
)
```

Next we run the Markov chain Monte Carlo (MCMC) estimates using two chains. We plot the samples for both chains for the intercept as a quick check that the estimates have converged.

```{r}
#| label: run-mcmc
# parameters to store
parms <- c('alpha', 'beta.c', 'sigma.month')
# define the model
model <- nimbleModel(
  code_random_month,
  data = data,
  inits = inits,
  constants = constants
)
# MCMC samples
thin <- 3
MCMC <- 1000
mcmc_random <- nimbleMCMC(
  model = model,
  inits = inits,
  monitors = parms,
  # times 2 for burn-in
  niter = MCMC * 2 * thin, 
  thin = thin,
  nchains = 2,
  nburnin = MCMC,
  summary = TRUE,
  # one seed per chain
  setSeed = c(1, 2), 
  WAIC = TRUE
)
# check chains for intercept
plot(mcmc_random$samples$chain1[, 1], type = 'l')
lines(mcmc_random$samples$chain2[, 1], type = 'l', col = 'red')
```

Next we plot the estimated month effect with the 95% credible intervals.

```{r}
#| label: plot-month-effect
# prepare estimates for plot
to_plot <- as.data.frame(mcmc_random$summary$all.chains) |>
  rownames_to_column() |>
  filter(str_detect(rowname, pattern = 'beta')) |>
  mutate(month = as.numeric(str_extract(rowname, pattern = '[1-9][0-2]?')))
# labels for x-axis
month.letter <- substr(month.abb, 1, 1)
# plot
colour <- 'darkorange2'
ggplot(
  to_plot,
  aes(
    x = month,
    y = exp(Mean),
    ymin = exp(`95%CI_low`),
    ymax = exp(`95%CI_upp`)
  )
) +
  geom_line(linewidth = 1.05, col = colour) +
  geom_errorbar(width = 0, linewidth = 1.05, col = colour) +
  geom_point(col = colour, size = 3) +
  geom_hline(lty = 2, yintercept = 1) +
  ylab('Rate ratio') +
  xlab(NULL) +
  scale_x_continuous(breaks = 1:12, labels = month.letter) +
  theme_bw() +
  theme(panel.grid.minor = element_blank())

```

This approach can be described as a `global smooth' because we applied a shrinkage factor to all months equally using the Normal distribution.

## Month as a correlated random effect

The previous global smooth was likely not the best smoothing approach as we know that neighbouring months will have more in common than distant months. We can model this correlation using the neighbouring approach that is often used for spatial models. First, we create a neighbourhood matrix for the 12 months.

```{r}
#| label: adj-matrix
neighbours <- toeplitz(c(NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA))
# link January and December
neighbours[1, 12] <- neighbours[12, 1] <- 1 
neighbours
adj <- createAdj(neighbours)
adj
```

In the 12 by 12 matrix, each month has two neighbours. The `createAdj` function is to create the adjacency matrix needed for the spatial CAR function, where CAR stands for conditional autoregressive. In this case the autoregression will be applied to neighbouring months to create the smooth.

```{r}
#| label: neighbour-nimble
code_neighbour_month <- nimbleCode({
  ## Likelihood
  for (i in 1:N) {
    # loop through time
    cvd[i] ~ dpois(mu[i])
    log(mu[i]) <- log(offset[i]) + log(n_days_month[i]) + alpha + beta[month[i]]
  }
  # priors
  alpha ~ dnorm(0, sd = 100)
  beta[1:12] ~ dcar_normal(
    adj[1:L],
    weights[1:L],
    num[1:12],
    tau.neighbours,
    zero_mean = 1
  )
  tau.month ~ dgamma(1, 1)
  tau.neighbours ~ dgamma(1, 1)
  # rescale to standard deviation
  sigma.month <- 1 / sqrt(tau.month) 
  sigma.neighbours <- 1 / sqrt(tau.neighbours)
})

```

The data needs to be updated with the neighbourhood information.

```{r}
#| label: nimble-constants
constants <- list(
  N = nrow(CVD),
  L = length(adj$adj),
  adj = adj$adj,
  num = adj$num,
  weights = adj$weight,
  month = CVD$month,
  # scaled to per 100,000 people
  offset = CVD$pop / 100000, 
  # scaled to per 30 days
  n_days_month = (CVD$n_days_month) / 30
) 

```

```{r}
#| label: nimble-mcmc-plot
# parameters to store
parms <- c('alpha', 'beta', 'sigma.month', 'sigma.neighbours')

# define the model
model <- nimbleModel(
  code_neighbour_month,
  data = data,
  inits = inits,
  constants = constants
)

# MCMC samples
mcmc_neighbours <- nimbleMCMC(
  model = model,
  inits = inits,
  monitors = parms,
  # times 2 for burn-in
  niter = MCMC * 2 * thin, 
  thin = thin,
  nchains = 2,
  nburnin = MCMC,
  summary = TRUE,
  # one seed per chain
  setSeed = c(1, 2), 
  WAIC = TRUE
)
# check chains for intercept
plot(mcmc_neighbours$samples$chain1[, 1], type = 'l')
lines(mcmc_neighbours$samples$chain2[, 1], type = 'l', col = 'red')

```

```{r}
#| label: plot-estimates
# prepare estimates for plot
to_plot <- as.data.frame(mcmc_neighbours$summary$all.chains) |>
  rownames_to_column() |>
  filter(str_detect(rowname, pattern = 'beta')) |>
  mutate(month = as.numeric(str_extract(rowname, pattern = '[1-9][0-2]?')))
# plot
colour <- 'darkseagreen3'
ggplot(
  to_plot,
  aes(
    x = month,
    y = exp(Mean),
    ymin = exp(`95%CI_low`),
    ymax = exp(`95%CI_upp`)
  )
) +
  geom_line(linewidth = 1.05, col = colour) +
  geom_errorbar(width = 0, linewidth = 1.05, col = colour) +
  geom_point(col = colour, size = 3) +
  geom_hline(lty = 2, yintercept = 1) +
  ylab('Rate ratio') +
  xlab(NULL) +
  scale_x_continuous(breaks = 1:12, labels = month.letter) +
  theme_bw() +
  theme(panel.grid.minor = element_blank())

```

The plots of rate ratios for the two models are similar. We can compare the fit of the two models using the WAIC.

```{r}
#| label: compare-model-fit
waic1 <- data.frame(
  model = 'Random',
  parameters = mcmc_random$WAIC$pWAIC,
  WAIC = mcmc_random$WAIC$WAIC
)
waic2 <- data.frame(
  model = 'Neighbours',
  parameters = mcmc_neighbours$WAIC$pWAIC,
  WAIC = mcmc_neighbours$WAIC$WAIC
)
to_table <- bind_rows(waic1, waic2)
flextable(to_table) |>
  colformat_double(j = 2:3, digits = 1)

```

The neighbourhood model uses slightly more parameters but this extra complexity has not improved the fit. The strong seasonal pattern has meant that the neighbouring smoothing is not needed. 

Note, for simplicity, the above examples do not include over-dispersion in the Bayesian models.