{splitTools} is a fast, lightweight toolkit for data splitting.

Its two main functions `partition()` and `create_folds()` support

- data partitioning (e.g. into training, validation and test),
- creating (in- or out-of-sample) folds for cross-validation (CV),
- creating *repeated* folds for CV,
- stratified splitting,
- grouped splitting as well as
- blocked splitting (if the sequential order of the data should be retained).
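
The splitting strategy is chosen through the `type` argument of both functions. The following is a minimal sketch of the two less common variants, assuming the `type` values `"grouped"` and `"blocked"`; the grouping vector `group_id` is made up for illustration.

```r
library(splitTools)

# Hypothetical grouping variable: 10 groups with 5 rows each
group_id <- rep(1:10, each = 5)

# Grouped splitting: all rows of a group end up in the same partition
grouped <- partition(
  group_id, p = c(train = 0.7, test = 0.3), type = "grouped", seed = 1
)
# Groups shared between training and test rows (should be none)
intersect(group_id[grouped$train], group_id[grouped$test])

# Blocked splitting: the sequential order of the rows is retained
blocked <- partition(
  seq_len(50), p = c(train = 0.7, test = 0.3), type = "blocked"
)
range(blocked$train)
range(blocked$test)
```

With grouped splitting, the first argument is the grouping variable rather than the response, so the proportions in `p` apply to groups, not to individual rows.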

The function `create_timefolds()` does time-series splitting in the sense that the out-of-sample data follows the in-sample data.
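
As a small sketch of this property, the snippet below builds time-series folds for a made-up, time-ordered vector `y`. It assumes that each fold is returned as a list with the elements `insample` and `outsample`; please check the function's help page if your version differs.

```r
library(splitTools)

# Hypothetical series of 100 time-ordered observations
y <- seq_len(100)

# Assumption: each fold is a list with elements `insample` and `outsample`
folds <- create_timefolds(y, k = 4)

# Per fold: does the last in-sample index precede the first out-of-sample index?
sapply(folds, function(f) max(f$insample) < min(f$outsample))
```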

We will now illustrate how to use {splitTools} in a typical modeling workflow.

We will go through the following steps:

- We split the `iris` data into 60% training, 20% validation, and 20% test data, stratified by the variable `Sepal.Length`. Since this variable is numeric, stratification uses quantile binning.
- We will model the response `Sepal.Length` with a linear regression, once with and once without interaction between `Species` and `Sepal.Width`.
- After selecting the better of the two models via validation RMSE, we evaluate the final model on the test data.
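
To make the quantile binning mentioned in the first step more concrete, here is a base R illustration. It is only a sketch of the idea, not the package's internal code, and the number of bins `n_bins` is an assumption chosen for this example.

```r
# Base R illustration of quantile binning (not the package's internal code)
y <- iris$Sepal.Length
n_bins <- 10  # assumed number of bins for this sketch

# Quantile-based break points; unique() guards against tied quantiles
breaks <- unique(quantile(y, probs = seq(0, 1, length.out = n_bins + 1)))
bins <- cut(y, breaks = breaks, include.lowest = TRUE)

# Roughly equal-sized strata from which rows can then be sampled
table(bins)
```

Sampling within each bin keeps the distribution of `Sepal.Length` similar across the resulting partitions.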

```
library(splitTools)

# Split data into partitions
set.seed(3451)
inds <- partition(iris$Sepal.Length, p = c(train = 0.6, valid = 0.2, test = 0.2))
str(inds)
#> List of 3
#> $ train: int [1:81] 2 3 6 7 8 10 11 18 19 20 ...
#> $ valid: int [1:34] 1 12 14 15 27 34 36 38 42 48 ...
#> $ test : int [1:35] 4 5 9 13 16 17 25 39 41 45 ...

train <- iris[inds$train, ]
valid <- iris[inds$valid, ]
test <- iris[inds$test, ]

# Root mean squared error
rmse <- function(y, pred) {
  sqrt(mean((y - pred)^2))
}

# Use simple validation to decide on interaction yes/no...
fit1 <- lm(Sepal.Length ~ ., data = train)
fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = train)

rmse(valid$Sepal.Length, predict(fit1, valid))
#> [1] 0.3020855
rmse(valid$Sepal.Length, predict(fit2, valid))
#> [1] 0.2954321

# Yes! Choose and test final model
rmse(test$Sepal.Length, predict(fit2, test))
#> [1] 0.3482849
```

Since the `iris` data consists of only 150 rows, investing 20% of observations for validation seems like a waste of resources. Furthermore, the performance estimates might not be very robust. Let’s replace simple validation by five-fold CV, again using stratification on the response variable.

- Split `iris` into 80% training data and 20% test data, stratified by the variable `Sepal.Length`.
- Use stratified five-fold CV to choose between the two models.
- Evaluate the final model on the test data.

```
# Split into training and test
inds <- partition(iris$Sepal.Length, p = c(train = 0.8, test = 0.2), seed = 87)

train <- iris[inds$train, ]
test <- iris[inds$test, ]

# Get stratified CV in-sample indices
folds <- create_folds(train$Sepal.Length, k = 5, seed = 2734)

# Vectors with results per model and fold
cv_rmse1 <- cv_rmse2 <- numeric(5)

for (i in seq_along(folds)) {
  insample <- train[folds[[i]], ]
  out <- train[-folds[[i]], ]

  fit1 <- lm(Sepal.Length ~ ., data = insample)
  fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = insample)

  cv_rmse1[i] <- rmse(out$Sepal.Length, predict(fit1, out))
  cv_rmse2[i] <- rmse(out$Sepal.Length, predict(fit2, out))
}

# CV-RMSE of model 1 -> close winner
mean(cv_rmse1)
#> [1] 0.330189

# CV-RMSE of model 2
mean(cv_rmse2)
#> [1] 0.3306455

# Fit model 1 on full training data and evaluate on test data
final_fit <- lm(Sepal.Length ~ ., data = train)
rmse(test$Sepal.Length, predict(final_fit, test))
#> [1] 0.2892289
```
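
The loop above works with in-sample indices and obtains the out-of-sample rows by negative indexing. As a sketch, assuming the `invert` argument of `create_folds()` returns out-of-sample indices when set to `TRUE`, the two representations can be compared directly:

```r
library(splitTools)

y <- iris$Sepal.Length

# Same seed, so both calls should describe the same five folds
in_folds <- create_folds(y, k = 5, seed = 2734)
out_folds <- create_folds(y, k = 5, seed = 2734, invert = TRUE)

# Per fold: are the out-of-sample indices the complement of the in-sample ones?
complement <- mapply(
  function(i_in, i_out) setequal(i_out, setdiff(seq_along(y), i_in)),
  in_folds, out_folds
)
all(complement)
```

In-sample indices are convenient when models are refitted in a loop, while out-of-sample indices can be handy when only the held-out rows are needed.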

If feasible, *repeated* CV is recommended in order to reduce
uncertainty in decisions. Otherwise, the process remains the same.

```
# Train/test split as before

# 15 folds instead of 5
folds <- create_folds(train$Sepal.Length, k = 5, seed = 2734, m_rep = 3)
cv_rmse1 <- cv_rmse2 <- numeric(15)

# Rest as before...
for (i in seq_along(folds)) {
  insample <- train[folds[[i]], ]
  out <- train[-folds[[i]], ]

  fit1 <- lm(Sepal.Length ~ ., data = insample)
  fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = insample)

  cv_rmse1[i] <- rmse(out$Sepal.Length, predict(fit1, out))
  cv_rmse2[i] <- rmse(out$Sepal.Length, predict(fit2, out))
}

mean(cv_rmse1)
#> [1] 0.3296087
mean(cv_rmse2)
#> [1] 0.331373

# Refit and test as before
```
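
To see what `m_rep = 3` changes structurally, the following sketch counts how often each row is held out. It runs on the full `iris` data rather than the training subset, purely to stay self-contained.

```r
library(splitTools)

n <- nrow(iris)

# 5 folds repeated 3 times give 15 sets of in-sample indices
folds <- create_folds(iris$Sepal.Length, k = 5, seed = 2734, m_rep = 3)
length(folds)

# Out-of-sample rows per fold are the complement of the in-sample indices
out_rows <- lapply(folds, function(idx) setdiff(seq_len(n), idx))

# Number of times each row is held out across all folds
out_counts <- tabulate(unlist(out_rows), nbins = n)
table(out_counts)
```

Each repetition is a complete partition of the rows into five folds, so every row should be held out once per repetition.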

The function `multi_strata()` creates a stratification factor from multiple columns that can then be passed to `create_folds(, type = "stratified")` or `partition(, type = "stratified")`. The resulting partitions will be (quite) balanced regarding these columns.

Two grouping strategies are offered:

- k-means clustering based on scaled input.
- All combinations of columns, where numeric input is being binned.
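
The two strategies listed above can be sketched as follows. This assumes that `multi_strata()` selects them via a `strategy` argument with the values `"kmeans"` and `"interaction"`, and that `k` controls the number of clusters or the approximate number of bins per numeric column; please verify both against the function's help page.

```r
library(splitTools)

ir <- iris[c("Sepal.Length", "Species")]

# Strategy 1 (assumed value "kmeans"): k-means clustering on scaled input
set.seed(1)  # k-means starts from random centers
y_kmeans <- multi_strata(ir, strategy = "kmeans", k = 5)

# Strategy 2 (assumed value "interaction"): all combinations of columns,
# with numeric columns being binned
y_inter <- multi_strata(ir, strategy = "interaction", k = 5)

# Both results are factors with one entry per row of `ir`
table(y_kmeans)
table(y_inter)
```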

Let’s have a look at a simple example where we want to model `Sepal.Length` as a function of the other variables in the `iris` data set. We want to do a stratified train/valid/test split that is balanced not only regarding the response `Sepal.Length`, but also regarding the important predictor `Species`. In this case, we could use the following workflow:

```
set.seed(3451)
ir <- iris[c("Sepal.Length", "Species")]
y <- multi_strata(ir, k = 5)
inds <- partition(
  y, p = c(train = 0.6, valid = 0.2, test = 0.2), split_into_list = FALSE
)

# Check
by(ir, inds, summary)
#> inds: train
#>   Sepal.Length         Species
#>  Min.   :4.300   setosa    :30
#>  1st Qu.:5.100   versicolor:30
#>  Median :5.800   virginica :30
#>  Mean   :5.836
#>  3rd Qu.:6.400
#>  Max.   :7.700
#> ------------------------------------------------------------
#> inds: valid
#>   Sepal.Length         Species
#>  Min.   :4.400   setosa    :10
#>  1st Qu.:5.425   versicolor:10
#>  Median :5.900   virginica :10
#>  Mean   :5.903
#>  3rd Qu.:6.300
#>  Max.   :7.900
#> ------------------------------------------------------------
#> inds: test
#>   Sepal.Length         Species
#>  Min.   :4.700   setosa    :10
#>  1st Qu.:5.100   versicolor:10
#>  Median :5.700   virginica :10
#>  Mean   :5.807
#>  3rd Qu.:6.475
#>  Max.   :7.100
```