2. Model Specification

The zero-inflated negative binomial (ZINB) framework

Maritime accidents are extremely rare events, resulting in a dataset overwhelmingly dominated by zeros (no accidents in a given grid cell during a specific month). To rigorously account for both this zero-inflation and the overdispersion typical of count data, I employ a zero-inflated negative binomial (ZINB) distribution.

The ZINB model assumes that the data is generated by a mixture of two processes: 1. A binary process that generates structural zeros with probability $\psi$. 2. A negative binomial count process that generates counts $y_i$ (including sampling zeros) with probability $1 - \psi$.

Mathematically, the probability mass function for an observation $Y_i$ is defined as:

\[ P(Y_i = y_i | \mu_i, \phi, \psi) = \begin{cases} \pi + (1 - \psi) \left( \frac{\phi}{\mu_i + \phi} \right)^\phi, & \text{if } y_i = 0 \\ (1 - \psi) \frac{\Gamma(y_i + \phi)}{\Gamma(y_i + 1)\Gamma(\phi)} \left( \frac{\mu_i}{\mu_i + \phi} \right)^{y_i} \left( \frac{\phi}{\mu_i + \phi} \right)^\phi, & \text{if } y_i > 0 \end{cases} \]

where: * $\mu_i$ is the expected count of the negative binomial component. * $\phi$ is the shape/dispersion parameter (as $\phi \to \infty$, the distribution approaches Poisson). * $\psi$ is the zero-inflation probability parameter (denoted as zi in the model output).

Log-linear predictor and exposure offset

To evaluate the impact of environmental and temporal factors on the accident rate, I model the expected count $\mu_i$ using a log-link function.

Crucially, to avoid confounding bias related to varying traffic volumes, I include vessel traffic density as an exposure offset. This ensures the model estimates the underlying risk per unit of traffic, rather than merely reflecting areas with more ships.

The full hierarchical regression equation is specified as follows:

\[ \log(\mu_i) = \alpha + \beta_{SST} \cdot \text{SST}_i + \beta_{year} \cdot \text{Year}_i + \sum_{m=2}^{12} \beta_{month(m)} \cdot \text{Month}_{m,i} + \log(\text{Traffic}_i) + u_{\text{spatial}_i} \]

Where: * $\alpha$ is the global intercept. * $\beta$ terms represent the fixed effects for Sea Surface Temperature (mean_sst), secular trend (year_num), and seasonality (month_factor). * $\log(\text{Traffic}_i)$ is the fixed offset for exposure (mean_traffic). * $u_{\text{spatial}_i}$ represents the nested spatial random effects (grid cells) to account for unobserved spatial heterogeneity.

Bayesian implementation in `brms`

The model was estimated using Hamiltonian Monte Carlo (HMC) sampling via the brms package in R. I ran 4 parallel chains, each with 4000 iterations (including 2000 warmup draws), resulting in 8000 post-warmup posterior samples.

Below is the exact brms syntax used to fit the model:

#| label: model-code #| eval: false #| code-fold: show #| code-summary: “Show brms model code”

library(brms)

ZINB Model Specification

fit_full_time_with_traffic <- brm( formula = total_count ~ mean_sst + month_factor + year_num + offset(log(mean_traffic)) + (1 | lat_grid / lon_grid), data = data, family = zero_inflated_negbinomial(link = “log”, link_zi = “logit”), chains = 4, iter = 4000, warmup = 2000, cores = 4, control = list(adapt_delta = 0.95), # Increased to ensure stable sampling seed = 12345 )

--- title: "2. Model Specification" engine: knitr format: html: toc: true code-fold: true warning: false message: false --- ## The zero-inflated negative binomial (ZINB) framework Maritime accidents are extremely rare events, resulting in a dataset overwhelmingly dominated by zeros (no accidents in a given grid cell during a specific month). To rigorously account for both this zero-inflation and the overdispersion typical of count data, I employ a zero-inflated negative binomial (ZINB) distribution. The ZINB model assumes that the data is generated by a mixture of two processes: 1. A binary process that generates structural zeros with probability $\psi$. 2. A negative binomial count process that generates counts $y_i$ (including sampling zeros) with probability $1 - \psi$. Mathematically, the probability mass function for an observation $Y_i$ is defined as: $$ P(Y_i = y_i | \mu_i, \phi, \psi) = \begin{cases} \pi + (1 - \psi) \left( \frac{\phi}{\mu_i + \phi} \right)^\phi, & \text{if } y_i = 0 \\ (1 - \psi) \frac{\Gamma(y_i + \phi)}{\Gamma(y_i + 1)\Gamma(\phi)} \left( \frac{\mu_i}{\mu_i + \phi} \right)^{y_i} \left( \frac{\phi}{\mu_i + \phi} \right)^\phi, & \text{if } y_i > 0 \end{cases} $$ where: * $\mu_i$ is the expected count of the negative binomial component. * $\phi$ is the shape/dispersion parameter (as $\phi \to \infty$, the distribution approaches Poisson). * $\psi$ is the zero-inflation probability parameter (denoted as `zi` in the model output). *** ## Log-linear predictor and exposure offset To evaluate the impact of environmental and temporal factors on the accident rate, I model the expected count $\mu_i$ using a log-link function. Crucially, to avoid confounding bias related to varying traffic volumes, I include vessel traffic density as an exposure offset. This ensures the model estimates the underlying risk *per unit of traffic*, rather than merely reflecting areas with more ships. The full hierarchical regression equation is specified as follows: $$ \log(\mu_i) = \alpha + \beta_{SST} \cdot \text{SST}_i + \beta_{year} \cdot \text{Year}_i + \sum_{m=2}^{12} \beta_{month(m)} \cdot \text{Month}_{m,i} + \log(\text{Traffic}_i) + u_{\text{spatial}_i} $$ Where: * $\alpha$ is the global intercept. * $\beta$ terms represent the fixed effects for Sea Surface Temperature (`mean_sst`), secular trend (`year_num`), and seasonality (`month_factor`). * $\log(\text{Traffic}_i)$ is the fixed offset for exposure (`mean_traffic`). * $u_{\text{spatial}_i}$ represents the nested spatial random effects (grid cells) to account for unobserved spatial heterogeneity. *** ## Bayesian implementation in `brms` The model was estimated using Hamiltonian Monte Carlo (HMC) sampling via the `brms` package in R. I ran 4 parallel chains, each with 4000 iterations (including 2000 warmup draws), resulting in 8000 post-warmup posterior samples. Below is the exact `brms` syntax used to fit the model: ```{r} #| label: model-code #| eval: false #| code-fold: show #| code-summary: "Show brms model code" library(brms) # ZINB Model Specification fit_full_time_with_traffic <- brm( formula = total_count ~ mean_sst + month_factor + year_num + offset(log(mean_traffic)) + (1 | lat_grid / lon_grid), data = data, family = zero_inflated_negbinomial(link = "log", link_zi = "logit"), chains = 4, iter = 4000, warmup = 2000, cores = 4, control = list(adapt_delta = 0.95), # Increased to ensure stable sampling seed = 12345 )

The zero-inflated negative binomial (ZINB) framework

Log-linear predictor and exposure offset

Bayesian implementation in brms

ZINB Model Specification

Bayesian implementation in `brms`