{splitTools} is a fast, lightweight toolkit for data splitting.

Its two main functions `partition()`

and
`create_folds()`

support

data partitioning (e.g. into training, validation and test),

creating folds for cross-validation,

creating

*repeated*folds for cross-validation,stratified splitting (e.g. for stratified cross-validation),

grouped splitting (e.g. for group-k-fold cross-validation) as well as

blocked splitting (if the sequential order of the data should be retained).

The function `create_timefolds()`

does time-series
splitting where the out-of-sample data follows the (growing) in-sample
data.

We will now illustrate how to use {splitTools} in a typical modelling workflow.

We will go through the following three steps:

Split the data set

`iris`

into 60% training data, 20% validation and 20% test, stratified by the variable`Sepal.Length`

. Since it is numeric, stratification is done internally by quantile binning.We then tune the parameter

`mtry`

of a random forest to predict`Sepal.Length`

as good as possible by the other variables. We do this on the validation data.After selecting the best

`mtry`

, we challenge the final model on the remaining test data set.

```
library(splitTools)
library(ranger)
# Split data into partitions
set.seed(3451)
<- partition(iris$Sepal.Length, p = c(train = 0.6, valid = 0.2, test = 0.2))
inds str(inds)
#> List of 3
#> $ train: int [1:81] 2 3 6 7 8 10 11 18 19 20 ...
#> $ valid: int [1:34] 1 12 14 15 27 34 36 38 42 48 ...
#> $ test : int [1:35] 4 5 9 13 16 17 25 39 41 45 ...
<- iris[inds$train, ]
train <- iris[inds$valid, ]
valid <- iris[inds$test, ]
test
# Root-mean-squared error function used to evaluate results
<- function(y, pred) {
rmse sqrt(mean((y - pred)^2))
}
# Tune mtry on validation data
<- numeric(ncol(train) - 1)
valid_mtry
for (i in seq_along(valid_mtry)) {
<- ranger(Sepal.Length ~ ., data = train, mtry = i)
fit <- rmse(valid$Sepal.Length, predict(fit, valid)$predictions)
valid_mtry[i]
}
valid_mtry#> [1] 0.3809234 0.3242173 0.3208119 0.3241518
<- which.min(valid_mtry))
(best_mtry #> [1] 3
# Fit and test final model
<- ranger(Sepal.Length ~ ., data = train, mtry = best_mtry)
final_fit rmse(test$Sepal.Length, predict(final_fit, test)$predictions)
#> [1] 0.3480947
```

Since our data set consists of only 150 rows, investing 20% of observations for validation is neither very robust nor data efficient. Let’s modify the modelling strategy by replacing simple validation by five-fold cross-validation, again using stratification on the response variable.

Split the data set

`iris`

into 80% training data and 20% test, stratified by the variable`Sepal.Length`

.Use stratified five-fold cross-validation to tune the parameter

`mtry`

. We do this on the training data.After selecting the best

`mtry`

by this simple “GridSearchCV”, we challenge the final model on the test data set.

```
# Split into training and test
<- partition(iris$Sepal.Length, p = c(train = 0.8, test = 0.2))
inds
<- iris[inds$train, ]
train <- iris[inds$test, ]
test
# Get stratified cross-validation in-sample indices
<- create_folds(train$Sepal.Length, k = 5)
folds
# Tune mtry by GridSearchCV
<- numeric(ncol(train) - 1)
valid_mtry
for (i in seq_along(valid_mtry)) {
<- numeric()
cv_mtry for (fold in folds) {
<- ranger(Sepal.Length ~ ., data = train[fold, ], mtry = i)
fit <- c(cv_mtry,
cv_mtry rmse(train[-fold, "Sepal.Length"], predict(fit, train[-fold, ])$predictions))
}<- mean(cv_mtry)
valid_mtry[i]
}
# Result of cross-validation
valid_mtry#> [1] 0.3870915 0.3460605 0.3337763 0.3302831
<- which.min(valid_mtry))
(best_mtry #> [1] 4
# Use optimal mtry to make model
<- ranger(Sepal.Length ~ ., data = train, mtry = best_mtry)
final_fit rmse(test$Sepal.Length, predict(final_fit, test)$predictions)
#> [1] 0.2970876
```

If feasible, repeated cross-validation is recommended in order to reduce uncertainty in decisions. The process is the same as above. Instead of getting five performance values per fold, we get five times the number of repetitions (here, three).

```
# We start by making repeated, stratified cross-validation folds
<- create_folds(train$Sepal.Length, k = 5, m_rep = 3)
folds length(folds)
#> [1] 15
for (i in seq_along(valid_mtry)) {
<- numeric()
cv_mtry for (fold in folds) {
<- ranger(Sepal.Length ~ ., data = train[fold, ], mtry = i)
fit <- c(cv_mtry,
cv_mtry rmse(train[-fold, "Sepal.Length"], predict(fit, train[-fold, ])$predictions))
}<- mean(cv_mtry)
valid_mtry[i]
}
# Result of cross-validation
valid_mtry#> [1] 0.3934294 0.3544207 0.3422013 0.3393454
<- which.min(valid_mtry))
(best_mtry #> [1] 4
# Use optimal mtry to make model
<- ranger(Sepal.Length ~ ., data = train, mtry = best_mtry)
final_fit rmse(test$Sepal.Length, predict(final_fit, test)$predictions)
#> [1] 0.2937055
```

When modelling time series, usual cross-validation destroys the sequential nature of the data. This can be avoided by the following modification of k-fold cross-validation:

The data is first split into \(k+1\)
blocks \(B_1, ..., B_{k+1}\), in
sequential order. Depending of `type = "extending"`

(default)
or `type == "moving"`

, the following data sets are used in
cross-validation:

First fold: \(B_1\) is used for training, \(B_2\) for evaluation.

Second fold: \(\{B_1, B_2\}\) resp. \(B_2\) is used for training if type is “extending” resp. “moving”, \(B_3\) for evaluation.

…

\(k\)-th fold: \(\{B_1, ..., B_k\}\) resp. \(B_{k}\) is used for training for type “extending” resp. “moving”, \(B_{k+1}\) for evaluation.

These schemata make sure that the evaluation data set always follows
the training data. Note that the training data grows over the whole
process linearly with `type = "extending"`

whereas its length
is approximately constant with `type = "moving"`

.

In order to have a final evaluation of the optimized model, typically an initial blocked split into sequential training and testing data is done.

We first create a time series and derive lagged features for
training. Then, again, we optimize `mtry`

of a random forest
by time-series cross-validation. We evaluate the optimized model on the
last 10% of the time series.

```
# Create data
set.seed(452)
<- 1000
n <- seq(0, 2 * pi, length.out = n)
t <- 0.2 * sin(t) - 0.1 * cos(t) + 0.2 * runif(n)
y plot(y ~ t, pch = ".", cex = 2)
```

```
# Helper function
<- function(z, k = 1) {
Lag c(z[-seq_len(k)], rep(NA, k))
}Lag(1:4, k = 1)
#> [1] 2 3 4 NA
# Add lagged features
<- data.frame(y,
dat lag1 = Lag(y),
lag2 = Lag(y, k = 2),
lag3 = Lag(y, k = 3))
<- dat[complete.cases(dat), ]
dat head(dat)
#> y lag1 lag2 lag3
#> 1 -0.085447174 0.075314649 -0.007841658 0.10081516
#> 2 0.075314649 -0.007841658 0.100815165 0.09635395
#> 3 -0.007841658 0.100815165 0.096353945 -0.06476294
#> 4 0.100815165 0.096353945 -0.064762935 0.10474890
#> 5 0.096353945 -0.064762935 0.104748904 0.07030228
#> 6 -0.064762935 0.104748904 0.070302283 0.01085425
cor(dat)
#> y lag1 lag2 lag3
#> y 1.0000000 0.8789100 0.8858208 0.8840929
#> lag1 0.8789100 1.0000000 0.8791226 0.8858995
#> lag2 0.8858208 0.8791226 1.0000000 0.8791388
#> lag3 0.8840929 0.8858995 0.8791388 1.0000000
# Block partitioning
<- partition(dat$y, p = c(train = 0.9, test = 0.1), type = "blocked")
inds str(inds)
#> List of 2
#> $ train: int [1:898] 1 2 3 4 5 6 7 8 9 10 ...
#> $ test : int [1:99] 899 900 901 902 903 904 905 906 907 908 ...
<- dat[inds$train, ]
train <- dat[inds$test, ]
test
# Get time series folds
<- create_timefolds(train$y, k = 5)
folds str(folds)
#> List of 5
#> $ Fold1:List of 2
#> ..$ insample : int [1:150] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ outsample: int [1:150] 151 152 153 154 155 156 157 158 159 160 ...
#> $ Fold2:List of 2
#> ..$ insample : int [1:300] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ outsample: int [1:150] 301 302 303 304 305 306 307 308 309 310 ...
#> $ Fold3:List of 2
#> ..$ insample : int [1:450] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ outsample: int [1:150] 451 452 453 454 455 456 457 458 459 460 ...
#> $ Fold4:List of 2
#> ..$ insample : int [1:600] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ outsample: int [1:150] 601 602 603 604 605 606 607 608 609 610 ...
#> $ Fold5:List of 2
#> ..$ insample : int [1:750] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ outsample: int [1:148] 751 752 753 754 755 756 757 758 759 760 ...
# Tune mtry by GridSearchCV
<- numeric(ncol(train) - 1)
valid_mtry
for (i in seq_along(valid_mtry)) {
<- numeric()
cv_mtry for (fold in folds) {
<- ranger(y ~ ., data = train[fold$insample, ], mtry = i)
fit <- c(cv_mtry,
cv_mtry rmse(train[fold$outsample, "y"],
predict(fit, train[fold$outsample, ])$predictions))
}<- mean(cv_mtry)
valid_mtry[i]
}
# Result of cross-validation
valid_mtry#> [1] 0.08227426 0.08188822 0.08230784
<- which.min(valid_mtry))
(best_mtry #> [1] 2
# Use optimal mtry to make model and evaluate on future test data
<- ranger(y ~ ., data = train, mtry = best_mtry)
final_fit <- predict(final_fit, test)$predictions
test_pred rmse(test$y, test_pred)
#> [1] 0.07184702
# Plot
<- seq_along(dat$y)
x plot(x, dat$y, pch = ".", cex = 2)
points(tail(x, length(test$y)), test$y, col = "red", pch = ".", cex = 2)
lines(tail(x, length(test$y)), test_pred)
```

The helper function `multi_strata()`

takes a data frame
with multiple columns and turns it into a factor. Each level represents
a stratification group containing (quite) similar rows. The factor can
then be passed to `create_folds(, type = "stratified")`

or
`partition(, type = "stratified")`

to produce (quite)
balanced folds regarding these strata.

Two grouping strategies are offered:

- k-means clustering based on scaled input.
- All combinations of columns, where numeric input is being binned.

Let’s have a look at a simple example where we want to model “Sepal.Width” as a function of the other variables in the iris data set. We want to do a stratified train/valid/test split, aiming at being balanced regarding not only the response “Sepal.Width”, but also regarding the important predictor “Species”. In this case, we could use the following workflow:

```
set.seed(3451)
<- iris[c("Sepal.Length", "Species")]
ir <- multi_strata(ir, k = 5)
y <- partition(
inds p = c(train = 0.6, valid = 0.2, test = 0.2), split_into_list = FALSE
y,
)
# Check
by(ir, inds, summary)
#> inds: train
#> Sepal.Length Species
#> Min. :4.300 setosa :30
#> 1st Qu.:5.100 versicolor:30
#> Median :5.800 virginica :30
#> Mean :5.836
#> 3rd Qu.:6.400
#> Max. :7.700
#> ------------------------------------------------------------
#> inds: valid
#> Sepal.Length Species
#> Min. :4.400 setosa :10
#> 1st Qu.:5.425 versicolor:10
#> Median :5.900 virginica :10
#> Mean :5.903
#> 3rd Qu.:6.300
#> Max. :7.900
#> ------------------------------------------------------------
#> inds: test
#> Sepal.Length Species
#> Min. :4.700 setosa :10
#> 1st Qu.:5.100 versicolor:10
#> Median :5.700 virginica :10
#> Mean :5.807
#> 3rd Qu.:6.475
#> Max. :7.100
```