`R`

package `Bspatial`

for spatial only point referene data, `Bsptime`

for spatio-temporal point reference data and `Bcartime`

for areal unit data, which may also vary in time, perform the main modeling and validation tasks. Computations and inference in a Bayesian modeling framework are done using popular `R`

software packages such as Point referenced data are modeled using the Gaussian error distribution only but a top level generalized linear model is used for areal data modeling. The user of

`summary`

, `residuals`

and `plot`

are included so that a beginner user confident in model fitting using the base `R`

function `lm`

can quickly learn to analyzing data by fitting a range of appropriate spatial and spatio-temporal models. This vignette illustrates the package using five built-in data sets. Three of these are on point referenced data on air pollution and temperature at the deep ocean and the other two are areal unit data sets on Covid-19 mortality in England.
bmstdr 0.3.0

- 1 Introduction
- 2 Point reference spatial data modeling
- 2.1 Illustration data set nyspatial
- 2.2 The Bspatial function for fitting spatial regression models
- 2.3 Fitting independent error regression models
- 2.4 Fitting linear models with spatial error distribution
- 2.5 Fitting spatial models with nugget effect
- 2.6 Illustrating the model validation statistics

- 3 Point reference spatio-temporal data modeling
- 3.1 Illustration data set nysptime
- 3.2 The Bsptime function for fitting spatio-temporal models
- 3.3 Exploring the separable model
- 3.4 Spatio-temporal model fitting with the spTimer package
- 3.5 Marginal model fitting with the rstan package
- 3.6 Autoregressive model fitting with spTimer and INLA
- 3.7 Spatio-temporal dynamic models using the spTDyn package
- 3.8 Fitting dynamic models using spBayes
- 3.9 Autoregressive modeling using Gaussian predictive processes
- 3.10 Comparing all the spatio-temporal models
- 3.11 Modeling data from moving sensors in time

- 4 Modeling areal unit data
- 5 Discussion
- References

A fuller version of this document with additional graphical illustrations is also available.

Model-centric analysis of spatial and spatio-temporal data is essential in many applied areas of research such as atmospheric sciences, climatology, ecology, environmental health and oceanography. Such diversity in application areas is being serviced by the rich diversity of `R`

contributed packages listed in the abstract and many others, see the CRAN Task Views:
Handling and Analyzing Spatio-Temporal Data and Analysis of Spatial Data.
Moreover, there are a number of packages and text books discussing handling of spatial and spatio-temporal data. For example, see the references Pebesma and Bivand (2005), Bivand, Pebesma, and Gomez-Rubio (2013), Pebesma (2012), Millo and Piras (2012), Banerjee, Carlin, and Gelfand (2015), Wikle, Zammit-Mangion, and Cressie (2019) and Sahu (2022). The diversity in packages, however, is also a source of challenge for an applied scientist who is also interested in exploring solutions offered by other models from rival packages. The challenge comes from the essential requirement to learn the package specific commands and setting up of the prior distributions that are to be used for the applied problem at hand.

The current package *bmstdr* sets out to help researchers in applied sciences model a large variety of spatial and spatio-temporal data using a multiplicity of packages but by using only three commands with different options. Point reference spatial data, where each observation comes with a single geo-coded location reference such as a latitude-longitude pair, can be analyzed by fitting several spatial and spatio-temporal models using `R`

software packages such as
*spBayes* (Finley, Banerjee, and Gelfand 2015), *spTimer* (Bakar and Sahu 2015), *INLA* (Rue, Martino, and Chopin 2009),
*rstan* (Stan Development Team 2020), and *spTDyn* (Bakar, Kokic, and Jin 2016).

A particular package is chosen with the `package=`

option to the *bmstdr* model fitting routines `Bspatial`

for point reference spatial only data and
`Bsptime`

for spatio-temporal data. In each of these cases a Bayesian linear model, which can be fitted with the option `package="none"`

provides a base line for model comparison purposes. For areal unit data modeling the *bmstdr* function `Bcartime`

provides opportunities for model fitting using three packages:
*CARBayes* (Lee 2021), *CARBayesST* (Lee, Rushworth, and Napier 2018) and *INLA* (Blangiardo and Cameletti 2015). Here also a base line Bayesian generalized linear model for independent data, fitted using `CARBayes`

, is included for model comparison purposes.

Models fitted using *bmstdr* can be validated using the optional argument `validrows`

, which can be a vector of row numbers of the model fitting data frame, to any of the three model fitting functions. The package then automatically sets aside the nominated data rows as specified by the `validrows`

argument and use the remaining data rows for model fitting. Inclusion of this argument also automatically triggers calculation of four popular model validation statistics: root mean square error, mean absolute error, continuous ranked probability score (Gneiting and Raftery 2007) and coverage percentage. While performing validation the package also produces a scatter plot of predictions against observations with further options controlling the behavior of this plot.

The generality of the platforms *INLA* and *rstan* allows tremendous flexibility in modeling and validation. However, bespoke code must be written to implement each different model. The current version of `bmstdr`

provides a limited number of models which can be fitted using *INLA* and *rstan*. For point reference spatial only data a marginal model, after integrating the spatial random effects, with a known exponential correlation function is implemented using *rstan* and for spatio-temporal point reference data a marginal model has also been implemented using this package. A marginal model has also been implemented in a separate *bmstdr* model fitting function called `Bmoving_sptime`

which facilitates modeling of point reference temporal data from moving sensors such as Argo floats in the oceans, see Section 3.11.
*INLA* based models use a discretized Gaussian Markov random field with penalized complexity prior distribution (Fuglstad et al. 2018).

For areal unit data modeling the *bmstdr* function `Bcartime`

provides opportunities for selected model fitting using *CARBayes* (Lee 2021), and *CARBayesST* (Lee, Rushworth, and Napier 2018)
*CARBayesST*(Lee, Rushworth, and Napier 2018) and also *INLA* as illustrated by Blangiardo and Cameletti (2015). Here also a base line Bayesian generalized linear model for independent data, fitted using `CARBayes`

, is included for model comparison purposes. The *INLA* based models can fit the celebrated Besag, York, and MolliÃ© (1991) model and the Leroux model (Leroux, Lei, and Breslow 2000).

Models fitted using *bmstdr* can be validated using the optional argument `validrows`

, which can be a vector of row numbers of the model fitting data frame, to any of the three model fitting functions. The package then automatically sets aside the nominated data rows as specified by the `validrows`

argument and use the remaining data rows for model fitting. Inclusion of this argument also automatically triggers calculation of four popular model validation statistics: root mean square error, mean absolute error, continuous ranked probability score (Gneiting and Raftery 2007) and coverage percentage. While performing validation, the package also produces a scatter plot of predictions against observations with further options controlling the behavior of this plot.

The remainder of this vignette is organized as follows. Section 2 illustrates point reference spatial data modeling with Gaussian error distribution. Section 3 discusses Gaussian models for point reference spatio-temporal data. Area data are modeled in Section 4 where Section 4.3 illustrates models for static areal unit data and Section 4.4 considers areal temporal data. Some summary remarks are provided in Section 5.

To illustrate point reference spatio-temporal data modeling we use the `nyspatial`

data set included in the package. This data set has 28 rows and 9 columns containing average ground level ozone air pollution data from 28 sites in the state of New York. The averages are taken over the 62 days in July and August 2006. The full spatio-temporal data set from 28 sites for 62 days is used to illustrate spatio-temporal modeling, see Section 3.1. Figure 1 represents a map of the state of New York together with the 28 monitoring locations where the three sites 1, 5 and 10 have been identified.

For regression modeling purposes, the response variable is `yo3`

and the three important covariates are maximum temperature: `xmaxtemp`

in degree Celsius, wind speed: `xwdsp`

in nautical miles and percentage average relative humidity: `xrh`

.
This data set is included in the package and further information regarding this can be obtained from the help file `?nyspatial`

.

The *bmstdr* package includes the function `Bspatial`

for fitting regression models to point referenced spatial data.
The arguments to this function has been documented in the help file which can be viewed by issuing the `R`

command `?Bspatial`

. The package manual also contains the full documentation. The discussion below highlights the
main features of this model fitting function.

Besides the usual `data`

and `formula`

the argument `scale.transform`

can take one of three possible values:
`NONE, SQRT`

and `LOG`

. This argument defines the on the fly transformation for the response variable which
appears on the left hand side of the formula.

Default values of the arguments `prior.beta0, prior.M`

and `prior.sigma2`

defining the prior
distributions for \(\mathbf{\beta}\) and \(1/\sigma^2_{\epsilon}\) are provided.

The options `model="lm"`

and `model="spat"`

are respectively used for fitting and analysis using the independent spatial regression model with
exponential correlation function. If the latter regression model is to be fitted, the function requires
three additional arguments, `coordtype`

, `coords`

and `phi`

. The `coords`

argument provides the
coordinates of the data locations. The type of these coordinates, specified by
the `coordtype`

argument, taking one of three possible values: `utm`

, `lonlat`

and `plain`

determines various aspects of distance calculation and hence model fitting.
The default for this argument is `utm`

when it is expected that the coordinates are supplied in units of meter.
The `coords`

argument provides the actual coordinate values and this argument can be supplied as a vector of
size two identifying the two column numbers of the data frame to take as coordinates. Or this argument
can be given as a matrix of number of sites by 2 providing the coordinates of all the data locations.

The parameter `phi`

determines the rate of decay of the spatial correlation for the assumed
exponential covariance function. The default value, if not provided, is taken to be 3 over the
maximum distance between the data locations so that the effective range is
the maximum distance.

The argument `package`

chooses one package to fit the spatial model from among four possible choices. The default option `none`

is used to fit the independent linear regression model and the also the spatial regression model without the nugget effect when the parameter `phi`

is assumed to be known. The three other options are
`spBayes, stan`

and `inla`

. Each of these options use the corresponding
R packages for model fitting. The exact form of the models in each case is documented in Chapter 6 of the book Sahu (2022).

Calculation of model choice statistics is triggered by the option `mchoice=T`

. In this case the DIC, WAIC and PMCC
values are calculated.

An optional vector argument `validrows`

providing the row numbers of the supplied data frame for
model validation can also be given. The model choice statistics are calculated on the opted scale
but model validations and their uncertainties are calculated on the original scale of the response for ease of interpretation. This strategy of a possible transformed modeling scale but predictions on the original scale is adopted throughout the package.

There are other arguments of `Bspatial`

, e.g.Â `verbose`

, which control various
aspects of model fitting and return values. Some of these other arguments are only relevant for specifying prior
distributions and performing specific tasks as we will see throughout the remainder of this section.

The return value of `Bspatial`

is a list of class *bmstdr* providing parameter estimates, and if requested model choice statistics and validation predictions
and statistics. The S3methods `print, plot, summary, fitted`

, and `residuals`

have been
implemented for objects of the *bmstdr* class. Thus the user can give the commands such as `summary(M1)`

and `plot(M1)`

where `M1`

is the model fitted object .

The *bmstdr* package allows us to fit the base linear regression model given by:
\[\begin{equation}
Y_i = \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \epsilon_i,
i=1, \ldots, n \tag{1}
\end{equation}\]
where \(\beta_1, \ldots, \beta_p\) are unknown regression coefficients and \(\epsilon_i\) is the error term that we assume to
follow the normal distribution with mean zero and variance \(\sigma^2_{\epsilon}\). The usual linear model assumes the errors \(\epsilon_i\) to be independent for \(i=1, \ldots, n\).
With the suitable default assumptions regarding the prior distributions we can fit the above
model (1) by using the following command:

`M1 <- Bspatial(formula=yo3~xmaxtemp+xwdsp+xrh, data=nyspatial, mchoice=T)`

The independent linear regression model (1) is now extended to have spatially colored covariance matrix \(\sigma^{2}_{\epsilon}H\) where \(H\) is a known correlation matrix of the error vector \(\mathbf{\epsilon}\), i.e.Â \(H_{ij}=\mbox{Cor}(\epsilon_i, \epsilon_j)\) for \(i, j=1, \ldots,n\), \[\begin{equation} {\bf Y} \sim N_n \left(X{\mathbf \beta}, \sigma^{2}_{\epsilon} H\right) \tag{2} \end{equation}\] Assuming the exponential correlation function, i.e., \(H_{ij} = \exp(-\phi d_{ij})\) where \(d_{ij}\) is the distance between locations \({\bf s}_i\) and \({\bf s}_j\) we can fit the model (2) by issuing the command:

```
M2 <- Bspatial(model="spat", formula=yo3~xmaxtemp+xwdsp+xrh, data=nyspatial,
coordtype="utm", coords=4:5, phi=0.4, mchoice=T)
```

We discuss the choice of the fixed value of the spatial decay parameter \(\phi=0.4\) in `M2`

. We use cross-validation methods to find an optimal value for \(\phi\). We take a grid of values for \(\phi\) and calculate a cross-validation error statistics, e.g.Â root mean square-error (rmse), for each value of \(\phi\) in the grid. The optimal \(\phi\) is the one that minimizes the statistics.

To perform the grid search a simple `R`

function, `phichoice_sp`

is provided especially for the `nyspatial`

data set. The documentation of this function explains how to set the other arguments. For example, the following commands work:

```
asave <- phichoice_sp()
asave
```

For the `nyspatial`

data example 0.4 turns out to be the optimal value for \(\phi\).

A general spatial model with nugget effect is written as: \[\begin{equation} Y({\bf s}_i) = {\bf x}'({\bf s}_i) \mathbf{\beta} + w({\bf s}_i) + \epsilon( {\bf s}_i) \tag{3} \end{equation}\] for all \(i=1, \ldots, n\). In the above equation, the pure error term \(\epsilon({\bf s}_i)\) is assumed to follow the independent zero mean normal distribution with variance \(\sigma^2_{\epsilon}\), called the nugget effect, for all \(i=1. \ldots, n\). The stochastic process \(w({\bf s})\) is assumed to follow a zero mean Gaussian Process with the exponential covariance function, see Sahu (2022) for more details.

The un-observed random variables \(w({\bf s}_i)\), \(i=1, \ldots, n\), also known as
the spatial random effects can be integrated out to arrive at the marginal model
\[\begin{align}
{\bf Y} & \sim N\left(X{\mathbf \beta}, \sigma^2_{\epsilon} \, I + \sigma^2_w
S_w \right), \tag{4}
\end{align}\]
where the matrix \(S_w\) is determined using the exponential correlation function.
This marginal model is fitted using any of the three packages mentioned above.
The code for this model fitting is very similar to the one for fitting `M2`

above; the only
important change is in the `package=`

argument as noted below.

```
M3 <- Bspatial(package="spBayes", formula=yo3~xmaxtemp+xwdsp+xrh, data=nyspatial,
coordtype="utm", coords=4:5, prior.phi=c(0.005, 2), mchoice=T)
M4 <- Bspatial(package="stan", formula=yo3~xmaxtemp+xwdsp+xrh, data=nyspatial,
coordtype="utm", coords=4:5,phi=0.4, mchoice=T)
M5 <- Bspatial(package="inla",formula=yo3~xmaxtemp+xwdsp+xrh, data=nyspatial,
coordtype="utm", coords=4:5, mchoice=T)
```

Model fitting is very fast except for `M4`

with the `stan`

package. The model run
for `M4`

takes about 20 minutes on a fast personal computer. These run produces the values of various Bayesian model choice criteria shown in Table 1.

Criteria | M0 | M1 | M2 | M3 | M4 | M5 |
---|---|---|---|---|---|---|

pdic | 2.07 | 4.99 | 4.98 | 5.17 | 5.31 | 4.17 |

pdicalt | 13.58 | 5.17 | 5.16 | 7.83 | 6.46 | |

dic | 169.20 | 158.36 | 158.06 | 158.68 | 158.75 | 157.23 |

dicalt | 192.22 | 158.72 | 158.41 | 163.99 | 161.04 | |

pwaic1 | 1.82 | 5.20 | 4.93 | 4.88 | 4.93 | 4.73 |

pwaic2 | 2.52 | 6.32 | 5.91 | 6.77 | 5.96 | |

waic1 | 168.95 | 158.57 | 157.51 | 158.70 | 157.92 | 158.46 |

waic2 | 170.35 | 160.82 | 159.47 | 162.48 | 159.99 | |

gof | 591.82 | 327.98 | 330.08 | 323.56 | 316.67 | 334.03 |

penalty | 577.13 | 351.52 | 346.73 | 396.63 | 394.86 | 39.17 |

pmcc | 1168.95 | 679.50 | 676.82 | 720.18 | 711.52 | 373.19 |

Mathematical expressions for all the quantities in the above table are
provided in Sahu (2022). Here M0 is the intercept only model for which the results are obtained using the following *bmstdr* command,

`Bmchoice(case="MC.sigma2.unknown", y=ydata).`

The implementation using `inla`

does not calculate the alternative values of the DIC and WAIC.

The model fitting function `Bspatial`

also calculates the values of four validation
statistics:
- root mean square-error (rmse),
- mean absolute error (mae),
- continuous ranked probability score (crps) and
- coverage (cvg)
if an additional argument `validrows`

containing the row numbers of the supplied
data frame to be validated is provided.

Data from eight validation sites 8, 11, 12, 14, 18, 21, 24 and 28 are set aside and model fitting is performed using the data from the remaining 20 sites.

The *bmstdr* command for performing validation needs an additional argument `validrows`

which are the row numbers of the supplied data frame which should be used for validation.

```
s <- c(8,11,12,14,18,21,24,28)
f1 <- yo3~xmaxtemp+xwdsp+xrh
M1.v <- Bspatial(package="none", model="lm", formula=f1,
data=nyspatial, validrows=s)
M2.v <- Bspatial(package="none", model="spat", formula=f1,
data=nyspatial, coordtype="utm", coords=4:5,phi=0.4, validrows=s)
M3.v <- Bspatial(package="spBayes", prior.phi=c(0.005, 2), formula=f1,
data=nyspatial, coordtype="utm", coords=4:5, validrows=s)
M4.v <- Bspatial(package="stan",formula=f1,
data=nyspatial, coordtype="utm", coords=4:5,phi=0.4 , validrows=s)
M5.v <- Bspatial(package="inla", formula=f1, data=nyspatial,
coordtype="utm", coords=4:5, validrows=s)
```

Table 2 presents the validation statistics for all five models. Coverage is 100% for all five models and the validation performances are comparable. Model `M4`

with \(\phi=0.4\) can be used as the best model if it is imperative that one must be chosen using the rmse criterion.

Criteria | M1 | M2 | M3 | M4 | M5 |
---|---|---|---|---|---|

rmse | 2.447 | 2.400 | 2.428 | 2.422 | 2.392 |

mae | 2.135 | 2.015 | 2.043 | 2.054 | 1.985 |

crps | 2.891 | 2.885 | 2.891 | 2.037 | 2.324 |

To illustrate \(K\)-fold cross-validation, the 28 observations in the `nyspatial`

data set are randomly assigned to \(K=4\) groups of equal size.

```
set.seed(44)
x <- runif(n=28)
u <- order(x)
s1 <- u[1:7]
s2 <- u[8:14]
s3 <- u[15:21]
s4 <- u[22:28]
```

Now the `M2.v`

command is called four times with the `validrows`

argument taking values `s1, ... s4`

. Table 3 presents the 4-fold cross-validation statistics for `M2`

only. It shows a wide variability in performance
with a low coverage of 57.14% for Fold 3.

Criteria | Fold1 | Fold2 | Fold3 | Fold4 |
---|---|---|---|---|

rmse | 2.441 | 5.005 | 5.865 | 2.508 |

mae | 1.789 | 3.545 | 5.462 | 2.145 |

crps | 2.085 | 2.077 | 1.228 | 2.072 |

cvg | 100.000 | 85.714 | 57.143 | 100.000 |

A validation plot is automatically drawn each time a validation is performed. Below, we include the validation plot for fold-3 only.

```
M2.v3 <- Bspatial(model="spat", formula=yo3~xmaxtemp+xwdsp+xrh, data=nyspatial,
coordtype="utm", coords=4:5, validrows= s3, phi=0.4, verbose = FALSE)
```

In this particular instance four of the seven validation observations are over-predicted. The above figure shows low coverage and high rmse. However, these statistics are based on data from seven validation sites only and as a result these may have large variability explaining the differences in the \(K\)-fold validation results.

The above validation plot has been drawn using the *bmstdr* command `obs_v_pred_plot`

. This validation plot may be drawn without the line segments, which is recommended when there are a large number of validation observations. The plot may also use the `mean`

values of the predictions instead of the default `median`

values. The documentation of the function explains how to do this. For example, having the fitted object `M2.v3`

, we may issue the commands:

```
names(M2.v3)
psums <- get_validation_summaries(M2.v3$valpreds)
names(psums)
a <- obs_v_pred_plot(yobs=M2.v3$yobs_preds$yo3, predsums=psums, segments=FALSE, summarystat = "mean" )
```

To illustrate point reference spatio-temporal data modeling we use the `nysptime`

data set included in the package. This is a spatio-temporal version of the data set `nyspatial`

introduced in Section 2.1. This data set, taken from Sahu and Bakar (2012), has 1736 rows and 12 columns containing ground level ozone air pollution data from 28 sites in the state of New York for the 62 days in July and August 2006.

For regression modeling purposes, the response variable is `y8hrmax`

and the three important covariates are maximum temperature: `xmaxtemp`

in degree Celsius, wind speed: `xwdsp`

in nautical miles and percentage average relative humidity: `xrh`

.

In this section we extend the spatial model (3) to the following spatio-temporal model.
\[\begin{equation}
Y({\bf s}_i, t) = {\bf x}'({\bf s}_i, t) \mathbf{\beta} + w({\bf s}_i, t) + \epsilon(
{\bf s}_i, t)
\tag{5}
\end{equation}\]
for \(i=1, \ldots, n\) and \(t=1, \ldots, T.\) Different distribution specifications for the spatio-temporal random effects \(w({\bf s}_i, t)\) and the observational errors
\(\epsilon({\bf s}_i, t)\) give rise to different models. Variations of these models have been described in Sahu (2022). The *bmstdr* function `Bsptime`

has been developed to fit these models.

Similar to the `Bspatial`

function, the `Bsptime`

function takes a formula and a data argument. It is important to note that the `Bsptime`

function *always assumes* that the data frame is first sorted by space and then time within each site in space. Note that missing covariate values are not permitted.

The arguments defining the scale, `scale.transform`

,
and the hyper parameters of the prior distribution for the regression coefficients \({\mathbf \beta}\) and the variance parameters are also similar to the corresponding ones in the spatial model fitting case with `Bspatial`

. Other important arguments
are described below.

The arguments `coordtype`

, `coords`

, and `validrows`

are also similarly defined as before. However, note that when the separable model is fitted the `validrows`

argument must include all the rows of time points for each site to be validated.

The `package`

argument can take one of six values: `spBayes`

, `stan`

, `inla`

, `spTimer`

, `sptDyn`

and `none`

with `none`

being the default. Fittings using each of these package options are illustrated in the sections below.
Only a limited number
of models, specified by the `model`

argument, can be fitted with each of these six choices. The `model`

argument is described below.

In case the package is `none`

, the `model`

can either be
`lm`

or `seperable`

. The `lm`

option is for an independent error regression model while the other option fits a separable spatio-temporal model without any nugget effect. The separable model fitting method cannot handle missing data. All missing data points in the response variable will be replaced by the grand mean of the available observations. When the `package`

option is one of the five
named packages the `model`

argument is passed to the chosen package.

For fitting a `separable`

model `Bsptime`

requires specification of
two decay parameters \(\phi_s\) and \(\phi_t\). If these are not specified then values are chosen which correspond to the effective ranges as the maximum distance in space and length in time.

There are numerous other package specific arguments that define the prior distributions and many important behavioral aspects of the selected package. Those are not described here. Instead the user is directed to the documentation `?Bsptime`

and also the vignettes of the individual packages.

With the default value of `package="none"`

the independent error regression model `M1`

and the separable model `M2`

are fitted using the commands:

```
f2 <- y8hrmax~xmaxtemp+xwdsp+xrh
M1 <- Bsptime(model="lm", formula=f2, data=nysptime, scale.transform = "SQRT")
M2 <- Bsptime(model="separable", formula=f2, data=nysptime, scale.transform = "SQRT",
coordtype="utm", coords=4:5)
```

The fitted model objects `M1`

and `M2`

are of class *bmstdr* and these can be explored using the S3 methods `print, plot`

, `summary`

and `residuals`

.
To explore the model fitted object issue the command `names(M2)`

. Here we explore the residuals by issuing the command:

```
a <- residuals(M2)
#>
#> Note that the residuals are provided on the transformed scale. Please see the scale.transform argument.
#>
#> Summary of the residuals
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> -3.13367 -0.98683 -0.49008 -0.49402 0.02336 2.11039 24
```