This vignette describes the motivation for offsetreg and when its usage becomes necessary.

For certain use cases, offsets are supported in tidymodels. Generally
speaking, for models that allow for offsets to be specified in a model
formula, tidymodels works fine out of the box and offsetreg is not
needed. The `glm()`

function from the stats package is a good
example of this.

`glm()`

Below, a Poisson model is fit using the `us_deaths`

data
set with an offset equal to the log of population.

```
library(parsnip)
library(offsetreg)
library(broom)
library(recipes)
library(workflows)
library(rsample)
library(tune)
us_deaths$log_pop <- log(us_deaths$population)
poisson_reg() |>
set_engine("glm") |>
fit(deaths ~ gender + age_group + year + offset(log_pop),
data = us_deaths)
#> parsnip model object
#>
#>
#> Call: stats::glm(formula = deaths ~ gender + age_group + year + offset(log_pop),
#> family = stats::poisson, data = data)
#>
#> Coefficients:
#> (Intercept) genderMale age_group35-44 age_group45-54 age_group55-64
#> -18.337940 0.327632 0.442935 1.212463 1.990698
#> age_group65-74 age_group75-84 age_group85+ year
#> 2.713410 3.645763 4.770408 0.005683
#>
#> Degrees of Freedom: 139 Total (i.e. Null); 131 Residual
#> Null Deviance: 51700000
#> Residual Deviance: 237800 AIC: 239800
```

The code above works for a few reasons:

`fit()`

captures the formula expression passed to it, and that formula is allowed to contain calls to other functions, like`offset()`

.- Since there is no additional pre-processing of the data required,
that formula is passed to the
`glm()`

function as-is, as shown in the call printed above.

Let’s assume we want to use a recipe to pre-process our data. In the
example below, a bare bones recipe is used to verify that we can
reproduce the same coefficients as the original example. Unfortunately,
this creates a problem because `recipe()`

doesn’t allow
in-line functions like `offset()`

.

```
mod <- poisson_reg() |> set_engine("glm")
rec <- recipe(deaths ~ gender + age_group + year + offset(log_pop),
data = us_deaths)
#> Error in `inline_check()`:
#> ✖ No in-line functions should be used here.
#> ℹ The following function was found: `offset`.
#> ℹ Use steps to do transformations instead.
#> ℹ If your modeling engine uses special terms in formulas, pass that formula to
#> workflows as a model formula (`?parsnip::model_formula()`).
```

As the hint above explains, this error can be avoided by removing the
call to `offset()`

in the recipe and passing a second formula
to `add_model()`

as part of a workflow. Note that the
variable passed to `offset()`

must still be included in the
recipe.

```
rec <- recipe(deaths ~ gender + age_group + year + log_pop,
data = us_deaths)
workflow() |>
add_model(mod,
formula = deaths ~ gender + age_group + year + offset(log_pop)) |>
add_recipe(rec) |>
fit(us_deaths)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: poisson_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 0 Recipe Steps
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Call: stats::glm(formula = deaths ~ gender + age_group + year + offset(log_pop),
#> family = stats::poisson, data = data)
#>
#> Coefficients:
#> (Intercept) genderMale age_group35-44 age_group45-54 age_group55-64
#> -18.337940 0.327632 0.442935 1.212463 1.990698
#> age_group65-74 age_group75-84 age_group85+ year
#> 2.713410 3.645763 4.770408 0.005683
#>
#> Degrees of Freedom: 139 Total (i.e. Null); 131 Residual
#> Null Deviance: 51700000
#> Residual Deviance: 237800 AIC: 239800
```

These coefficients match the first example without a recipe, so we know this model was set up correctly.

`glmnet()`

Not all modeling engines allow for offsets to be passed via the
formula interface. For example, the `glmnet()`

function does
not not accept formulas; it requires model matrices. Instead, offsets
are passed as a numeric vector using an optional engine-specific
`offset`

argument.

```
poisson_reg(penalty = 1E-5) |>
set_engine("glmnet", offset = us_deaths$log_pop) |>
fit(deaths ~ year + gender + age_group,
data = us_deaths) |>
tidy()
#> Warning: package 'glmnet' was built under R version 4.2.3
#> Warning: package 'Matrix' was built under R version 4.2.3
#> Loaded glmnet 4.1-8
#> Warning: package 'poissonreg' was built under R version 4.2.3
#> # A tibble: 9 × 3
#> term estimate penalty
#> <chr> <dbl> <dbl>
#> 1 (Intercept) -17.7 0.00001
#> 2 year 0.00540 0.00001
#> 3 genderMale 0.326 0.00001
#> 4 age_group35-44 0.338 0.00001
#> 5 age_group45-54 1.11 0.00001
#> 6 age_group55-64 1.89 0.00001
#> 7 age_group65-74 2.62 0.00001
#> 8 age_group75-84 3.55 0.00001
#> 9 age_group85+ 4.68 0.00001
```

This code works because the argument
`offset = us_deaths$log_pop`

is captured and passed directly
into `glmnet()`

.

If we try to use a recipe with an offset passed to the
`formula`

argument of `add_model()`

, a
difficult-to-spot problem emerges. The model runs without errors, but a
completely different set of coefficients is returned.

```
mod_glmnet <- poisson_reg(penalty = 1E-5) |> set_engine("glmnet")
rec <- recipe(deaths ~ year + gender + age_group + log_pop,
data = us_deaths)
workflow() |>
add_model(mod_glmnet,
formula = deaths ~ year + gender + age_group + offset(log_pop)) |>
add_recipe(rec) |>
fit(us_deaths) |>
tidy()
#> # A tibble: 9 × 3
#> term estimate penalty
#> <chr> <dbl> <dbl>
#> 1 (Intercept) -42.9 0.00001
#> 2 year 0.0263 0.00001
#> 3 genderMale 0.0243 0.00001
#> 4 age_group35-44 0.303 0.00001
#> 5 age_group45-54 1.12 0.00001
#> 6 age_group55-64 1.85 0.00001
#> 7 age_group65-74 2.19 0.00001
#> 8 age_group75-84 2.45 0.00001
#> 9 age_group85+ 2.71 0.00001
```

What happened here? Since `glmnet()`

doesn’t natively
support the formula interface, it doesn’t know what to do with the
`offset()`

term passed to the formula. Under the hood, the
`offset()`

term is quietly dropped in a call to
`model.matrix()`

that is used to convert the formula to a
matrix format acceptable to `glmnet()`

.

```
model.matrix(deaths ~ year + gender + age_group + offset(log_pop),
us_deaths) |>
head()
#> (Intercept) year genderMale age_group35-44 age_group45-54 age_group55-64
#> 1 1 2011 0 0 0 0
#> 2 1 2012 0 0 0 0
#> 3 1 2013 0 0 0 0
#> 4 1 2014 0 0 0 0
#> 5 1 2015 0 0 0 0
#> 6 1 2016 0 0 0 0
#> age_group65-74 age_group75-84 age_group85+
#> 1 0 0 0
#> 2 0 0 0
#> 3 0 0 0
#> 4 0 0 0
#> 5 0 0 0
#> 6 0 0 0
```

As a result, the model is exactly what we would see if there were no offset terms to begin with. This is a situation when offsetreg is required.

offsetreg becomes necessary when the underlying modeling engine does
not support offsets in formulas **and** either of these
tasks are performed:

- A pre-processing recipe is applied to the data
- Resampling is performed, often in conjunction with hyperparameter tuning

`recipe()`

when offsets cannot be specified in a
formulaLet’s continue with the last example. The problem can be addressed using offsetreg as follows:

- Replace
`poisson_reg()`

with`poisson_reg_offset()`

- Replace the “glmnet” engine with “glmnet_offset” and provide the name of the offset column
- Remove the
`formula`

argument in`add_model()`

- Add a call to
`step_dummy()`

. This step was previously not necessary when`formula`

was passed to`add_model()`

.

```
mod_offset <- poisson_reg_offset(penalty = 1E-5) |>
set_engine("glmnet_offset", offset_col = "log_pop")
rec <- recipe(deaths ~ year + gender + age_group + log_pop,
data = us_deaths) |>
step_dummy(all_nominal_predictors())
workflow() |>
add_model(mod_offset) |>
add_recipe(rec) |>
fit(us_deaths) |>
tidy()
#> # A tibble: 9 × 3
#> term estimate penalty
#> <chr> <dbl> <dbl>
#> 1 (Intercept) -17.7 0.00001
#> 2 year 0.00540 0.00001
#> 3 gender_Male 0.326 0.00001
#> 4 age_group_X35.44 0.338 0.00001
#> 5 age_group_X45.54 1.11 0.00001
#> 6 age_group_X55.64 1.89 0.00001
#> 7 age_group_X65.74 2.62 0.00001
#> 8 age_group_X75.84 3.55 0.00001
#> 9 age_group_X85. 4.68 0.00001
```

For models like `glmnet()`

where offsets can only be
specified as a numeric vector in engine-specific arguments, resampling
presents a few challenges:

- tidymodels is only aware of the numeric vector of offsets that has been passed and there is no defined relationship between individual observations and their associated offsets. As a result, when resampling occurs, offsets aren’t carried over to resampled data sets.
- Related, and pertinent to
`glmnet()`

, if the`predict()`

function requires offset terms, there is no mechanism to pass those along, which will result in an error.

Below is what happens if we attempt to fit 5 bootstrap resamples of
the `us_deaths`

data set without offsetreg.

```
resamples <- bootstraps(us_deaths, times = 5)
mod_glmnet <- poisson_reg(penalty = 1E-5) |>
set_engine("glmnet", offset = us_deaths$log_pop)
workflow() |>
add_recipe(rec) |>
add_model(mod_glmnet) |>
fit_resamples(resamples) |>
collect_metrics()
#> → A | error: No newoffset provided for prediction, yet offset used in fit of glmnet
#> There were issues with some computations A: x1There were issues with some computations A: x1
#> → B | error: Cannot find current progress bar for `<environment: 0x0000013903c4a650>`
#> Warning: All models failed. Run `show_notes(.Last.tune.result)` for more
#> information.
#> Warning: More than one set of outcomes were used when tuning. This should never
#> happen. Review how the outcome is specified in your model.
#> Error in `estimate_tune_results()`:
#> ! All models failed. Run `show_notes(.Last.tune.result)` for more information.
```

All models failed to fit, and we receive a specific error message about no offsets being available for predictions.

```
show_notes(.Last.tune.result)
#> unique notes:
#> ──────────────────────────────────────────────────────────────────────
#> No newoffset provided for prediction, yet offset used in fit of glmnet
```

With offsetreg, this code performs as expected. offsetreg works because behind the scenes it ensures that offset terms are attached to the data at all times, which enables resampling and predictions to function without error.

```
workflow() |>
add_recipe(rec) |>
add_model(mod_offset) |>
fit_resamples(resamples) |>
collect_metrics()
#> # A tibble: 2 × 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 rmse standard 25075. 5 2241. Preprocessor1_Model1
#> 2 rsq standard 0.975 5 0.00472 Preprocessor1_Model1
```