The autocart package implements a version of the classification and regression tree (CART) algorithm that explicitly considers measures of spatial autocorrelation inside the splitting function itself. It is most useful for ecological datasets that cover a global spatial process that cannot be assumed to behave the same way at every local scale.
To get started, load the library into R.
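Loading the package is a single call. The sketch below guards it with `requireNamespace` so that it also runs on a machine where autocart is not installed; the `has_autocart` flag is only an illustrative name.

```r
# Check whether autocart is installed before attaching it
has_autocart <- requireNamespace("autocart", quietly = TRUE)

if (has_autocart) {
  library(autocart)
} else {
  message("autocart is not installed; install it before running the examples.")
}
```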
The provided snow dataset contains measures of ground snow load at a variety of sites located in and around Utah. The response variable “yr50” contains the ground snow load value, and a variety of predictor variables are provided for each location.
    head(snow)
    #>       STATION  STATION_NAME STATE LONGITUDE LATITUDE ELEVATION YRS maxobs
    #> 1 USC00480027         AFTON    WY  -110.933   42.733      1893  41  3.256
    #> 2 USS0012M26S   AGUA CANYON    UT  -112.270   37.520      2713  23  5.793
    #> 3 USC00420050 ALLEN S RANCH    UT  -109.143   40.892      1673  21  1.245
    #> 4 USC00420061        ALPINE    UT  -111.777   40.451      1526  32  2.059
    #> 5 USC00420072          ALTA    UT  -111.633   40.600      2661  40 18.577
    #> 6 USC00420074      ALTAMONT    UT  -110.283   40.361      1943  43  2.011
    #>     yr50      HUC   TD FFP MCMT MWMT PPTWT RH  MAT
    #> 1  3.064 17040105 25.4  70 -8.8 16.7    94 46  3.7
    #> 2  5.410 16030002 21.9  76 -5.7 16.2   186 51  4.3
    #> 3  1.101 14040106 27.7 130 -5.3 22.4    29 47  8.3
    #> 4  1.867 16020201 26.6 163 -2.8 23.7   129 56 10.0
    #> 5 19.391 16020204 22.3  88 -7.5 14.8   413 65  2.3
    #> 6  2.442 14060003 28.1 119 -7.8 20.3    49 49  6.4
There are a few NA values in this dataset. If we pass in a dataset with a lot of missing information, autocart will have a hard time making good splits. For this vignette, we remove every row that contains a missing observation.
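Dropping incomplete rows can be done with base R's `na.omit`. The sketch below uses a small made-up data frame, `snow_toy`, as a stand-in for the real data so that it runs on its own.

```r
# Toy stand-in for the snow data (values are made up for illustration)
snow_toy <- data.frame(
  yr50      = c(3.1, NA, 1.1, 1.9),
  ELEVATION = c(1893, 2713, NA, 1526),
  MAT       = c(3.7, 4.3, 8.3, 10.0)
)

# Count the missing values in each column
colSums(is.na(snow_toy))

# Keep only the rows with no missing values
snow_complete <- na.omit(snow_toy)
nrow(snow_complete)
```

Applied to the vignette data, the same step would be `snow <- na.omit(snow)`.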
Let’s split the data into 85% training data and 15% test data. We will create a model with the training data and then try to predict the response value in the test dataset.
    # Extract the response vector in the regression tree
    response <- as.matrix(snow$yr50)

    # Create a dataframe for the predictors used in the model
    predictors <- data.frame(snow$LONGITUDE, snow$LATITUDE, snow$ELEVATION, snow$YRS,
                             snow$HUC, snow$TD, snow$FFP, snow$MCMT, snow$MWMT,
                             snow$PPTWT, snow$RH, snow$MAT)

    # Create the matrix of locations so that autocart knows where our observations are located
    locations <- as.matrix(cbind(snow$LONGITUDE, snow$LATITUDE))

    # Split the data into 85% training data and 15% test data
    numtraining <- round(0.85 * nrow(snow))
    training_index <- rep(FALSE, nrow(snow))
    training_index[1:numtraining] <- TRUE
    training_index <- sample(training_index)

    train_response <- response[training_index]
    test_response <- response[!training_index]
    train_predictors <- predictors[training_index, ]
    test_predictors <- predictors[!training_index, ]
    train_locations <- locations[training_index, ]
    test_locations <- locations[!training_index, ]
One crucial parameter we pass into autocart is “alpha”. The splitting function considers both the reduction in variance and a statistic of spatial autocorrelation (either Moran’s I or Geary’s C), and we can weight the two measures differently. The alpha value sets how much weight the splitting function gives to the autocorrelation statistic. If we set alpha to 1, only autocorrelation is considered in the splitting. If we set alpha to 0, autocart behaves like a normal regression tree. For this example, let’s set alpha to 0.60 so that most of the influence goes to spatial autocorrelation.
Another parameter we can set is “beta”, which ranges from 0 to 1. It controls the shape of the regions that are formed. If beta is near 1, the regions will be very compact, with their observations close together. If beta is 0, region shape is not considered in the splitting. If beta is around 0.20, compact regions are encouraged, but the other terms in the splitting function may still dominate.
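To make the role of the two weights concrete, here is a minimal sketch that assumes the three terms are combined as a weighted sum whose weights add up to one. The `split_score` helper and the candidate scores are hypothetical and only illustrate the weighting; they are not autocart's internal code.

```r
# Hypothetical helper: combine three goodness scores (each assumed to lie
# in [0, 1]) into one split score. The weighted-sum form is an assumption.
split_score <- function(rv, ac, shape, alpha, beta) {
  stopifnot(alpha >= 0, beta >= 0, alpha + beta <= 1)
  (1 - alpha - beta) * rv + alpha * ac + beta * shape
}

alpha <- 0.60  # weight on spatial autocorrelation
beta  <- 0.20  # weight on region shape

# Two made-up candidate splits:
# A reduces variance more, B yields more spatially coherent regions
score_a <- split_score(rv = 0.70, ac = 0.30, shape = 0.50, alpha, beta)
score_b <- split_score(rv = 0.40, ac = 0.80, shape = 0.60, alpha, beta)

# With alpha = 0 and beta = 0 only the reduction in variance matters
plain_a <- split_score(rv = 0.70, ac = 0.30, shape = 0.50, alpha = 0, beta = 0)
plain_b <- split_score(rv = 0.40, ac = 0.80, shape = 0.60, alpha = 0, beta = 0)
```

Under these weights candidate B scores higher than candidate A, while with both weights at 0 the ordering reverses and the split with the larger reduction in variance wins, as in an ordinary regression tree.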
Although the alpha and beta parameters control most of the splitting done by autocart, the user may want a bit more control. The “autocartControl” object was developed specifically with this in mind. It exposes a variety of parameters that autocart uses when making splits. As an example, let’s use inverse distance squared instead of inverse distance when calculating Moran’s I in the splitting function.
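To see what the distance power changes, here is a minimal base-R sketch of global Moran's I with inverse-distance weights. The `morans_i` helper is hypothetical, uses plain Euclidean distances on toy coordinates, and is not autocart's internal implementation.

```r
# Hypothetical helper: global Moran's I with weights 1 / distance^power
morans_i <- function(x, coords, power = 1) {
  d <- as.matrix(dist(coords))  # pairwise Euclidean distances
  w <- 1 / d^power              # inverse-distance weights
  diag(w) <- 0                  # a site is not its own neighbor
  z <- x - mean(x)              # centered values
  n <- length(x)
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Six toy sites on a line with a smooth gradient in the response
coords <- cbind(x = 0:5, y = rep(0, 6))
values <- c(1, 2, 3, 4, 5, 6)

i_inverse         <- morans_i(values, coords, power = 1)
i_inverse_squared <- morans_i(values, coords, power = 2)
```

Squaring the distance makes the weights fall off faster, so nearby sites dominate the statistic. On this toy gradient that gives a larger Moran's I than plain inverse distance.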
Finally, we can create our model!
The autocart function returns an S3 object of class “autocart”. We can pass this object to the “predictAutocart” function to make predictions for the test dataset.
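The steps above can be sketched end to end on simulated data. The argument order of `autocart` and `predictAutocart`, and the `distpower` argument of `autocartControl`, are assumptions about the package interface and should be checked against the package documentation. The `sim_` objects are made up, and the block only runs the model when autocart is installed.

```r
# Assumed signatures (verify against the package documentation):
#   autocartControl(distpower = )
#   autocart(response, data, locations, alpha, beta, control)
#   predictAutocart(model, newdata)
has_autocart <- requireNamespace("autocart", quietly = TRUE)

if (has_autocart) {
  library(autocart)
  set.seed(42)

  # Simulated stand-in data: 200 sites with a response driven by elevation
  n <- 200
  sim_locations <- cbind(runif(n, -114, -109), runif(n, 37, 42))
  sim_predictors <- data.frame(
    LONGITUDE = sim_locations[, 1],
    LATITUDE  = sim_locations[, 2],
    ELEVATION = runif(n, 1200, 3000)
  )
  sim_response <- 0.004 * sim_predictors$ELEVATION + rnorm(n)

  # Use the first 170 rows for training and the last 30 for testing
  train <- seq_len(n) <= 170

  alpha <- 0.60
  beta <- 0.20
  my_control <- autocartControl(distpower = 2)  # inverse distance squared

  # Fit the tree, then predict the held-out sites
  sim_model <- autocart(sim_response[train], sim_predictors[train, ],
                        sim_locations[train, ], alpha, beta, my_control)
  sim_predictions <- predictAutocart(sim_model, sim_predictors[!train, ])
}
```

In the vignette's workflow, the same three calls would take `train_response`, `train_predictors`, `train_locations`, and `test_predictors` in place of the simulated objects.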
We can see how well we did by computing the root mean squared error (RMSE) of the predictions on the test data.
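The RMSE is the square root of the mean squared residual. Here is a small base-R sketch; the `rmse` helper and the observed and predicted values are made up for illustration.

```r
# Root mean squared error between observed and predicted values
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Made-up observed and predicted snow loads
observed  <- c(3.0, 5.5, 1.0, 2.0)
predicted <- c(2.5, 6.0, 1.5, 1.0)

toy_rmse <- rmse(observed, predicted)
```

For the model in this vignette, `observed` would be the test response and `predicted` the vector returned by predictAutocart.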