The package `wbacon`

implements a weighted variant of the
BACON (blocked adaptive computationally-efficient outlier nominators)
algorithms Billor et al. (2000) for multivariate
outlier detection and robust linear regression. The extension of the
BACON algorithm for outlier detection to allow for weighting is due to
Béguin and Hulliger (2008).

The details of the package are discussed in the accompanying paper; see Schoch (2021)

First, we attach the package to the search space.

`wBACON()`

is for multivariate outlier nomination and robust estimation of location/ center and covariance matrix`wBACON_reg()`

is for robust linear regression (the method is robust against outliers in the response variable and the model’s design matrix)

The BACON algorithms assume that the underlying model is an appropriate description of the non-outlying observations; Billor et al. (2000). More precisely,

- the outlier nomination method assumes that the “good” data have
(roughly) an
*elliptically contoured*distribution (this includes the Gaussian distribution as a special case); - the regression method assumes that the non-outlying (“good”) data
are described by a
*linear*(homoscedastic) regression model and that the independent variables (having removed the regression intercept/constant, if there is a constant) follow (roughly) an elliptically contoured distribution.

“Although the algorithms will often do something reasonable even when these assumptions are violated, it is hard to say what the results mean.” Billor et al. (2000, p. 290)

It is strongly recommended that the structure of the data be examined and whether the assumptions made about the “good” observations are reasonable.

In line with Billor et al. (2000, p. 290), we
use the term outlier “nomination” rather than “detection” to highlight
that algorithms should not go beyond nominating observations as
*potential* outliers; see also Béguin and
Hulliger (2008). It is left to the analyst to finally label outlying
observations as such.

The software provides the analyst with tools and measures to study potentially outlying observations. It is strongly recommended to use the tools.

Additional information on the BACON algorithms and the implementation can be found in the documents:

`methods.pdf`

: A mathematical description of the algorithms and their implementation;`doc_c_functions.pdf`

: A documentation of the`C`

functions.

Both documents can be found in the package folder
`doc`

.

In this section, we study multivariate outlier detection for the two datasets

- bushfire data (with sampling weights),
- philips data (without sampling weights).

The bushfire dataset is on satellite remote sensing. These data were used by Campbell (1984) to locate bushfire scars. The data are radiometer readings from polar-orbiting satellites of the National Oceanic and Atmospheric Administration (NOAA) which have been collected continuously since 1981. The measurements are taken on five frequency bands or channels. In the near infrared band, it is possible to distinguish vegetation types from burned surface. At visible wavelengths, the vegetation spectra are similar to burned surface. The spatial resolution is rather low (1.1 km per pixel).

The bushfire data contain radiometer readings for 38 pixels and have
been studied in Maronna and Yohai (1995), Béguin and Hulliger (2002), Béguin
and Hulliger (2008), and Hulliger and Schoch
(2009). The data can be obtained from the `R`

package
`modi`

(Hulliger, 2023).^{1}

The first 6 readings on the five frequency bands (variables) are

```
> head(bushfire)
X1 X2 X3 X4 X5
1 111 145 188 190 260
2 113 147 187 190 259
3 113 150 195 192 259
4 110 147 211 195 262
5 101 136 240 200 266
6 93 125 262 203 271
```

Béguin and Hulliger (2008) generated a set of sampling weights. The weights can be attached to the current session by

```
> fit <- wBACON(bushfire, w = bushfire.weights, alpha = 0.05)
> fit
Weighted BACON: Robust location, covariance, and distances
Converged in 3 iterations (alpha = 0.05)
Number of potential outliers: 13 (34.21%)
```

The argument `alpha`

determines the \((1-\alpha)\)-quantile \(\chi_{\alpha,d}^2\) of the chi-square
distribution with \(d\) degrees of
freedom.^{2} All observations whose squared
Mahalanobis distances are smaller than the quantile (times a correction
factor) are selected into the subset of outlier-free data. It is
recommended to choose `alpha`

on grounds of an educated guess
of the share of “good” observations in the data. Here, we suppose that
95% of the observations are not outliers.

By default, the initial subset is determined by the Euclidean norm
(initialization method: `version = "V2"`

).

- This initialization method is robust because it is based on the
coordinate-wise (weighted) median. The resulting estimators of center
and scatter are
*not affine equivariant*. Let \(T(\cdot)\) denote an estimator of a parameter of interest (e.g., covariance matrix) and let \(X\) denote the \((n \times p)\) data matrix. An estimator \(T\) is affine equivariant if and only if \(T(A X + b) = A T(X) + b\), for any nonsingular \((m \times n)\) matrix \(A\) and any \(n\)-vector \(b\). Although version`"V2"`

of the BACON method yields an estimator that is not affine equivariant in the above sense, Billor et al. (2000) point out that the method is nearly affine equivariant. - There exists an alternative initialization method
(
`"version = V1"`

) which is based on the coordinate-wise (weighted) means; therefore, it is affine equivariant but*not robust*.

From the above output, we see that the algorithm converged in three
iterations. In case the algorithm does not converge, we may increase the
maximum number of iterations (default: `maxiter = 50`

) and
toggle `verbose = TRUE`

to (hopefully) learn more why the
method did not converge.

In the next step, we want to study the result in more detail. In
particular, we are interested in the estimated center and scatter (or
covariance) matrix. To this end, we can call the `summary()`

method on the object `fit`

.

```
> summary(fit)
Weighted BACON: Robust location, covariance, and distances
Initialized by method: V2
Converged in 3 iterations (alpha = 0.05)
Number of potential outliers: 13 (34.21%)
Robust estimate of location:
X1 X2 X3 X4 X5
108.4 149.0 275.6 218.7 279.8
Robust estimate of covariance:
X1 X2 X3 X4 X5
X1 397.3 301.4 -1368.2 -268.1 -227.0
X2 301.4 258.9 -916.2 -161.3 -143.2
X3 -1368.2 -916.2 7262.8 1757.8 1406.6
X4 -268.1 -161.3 1757.8 472.2 368.5
X5 -227.0 -143.2 1406.6 368.5 290.4
Distances (cutoff: 5.675):
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.321 1.923 2.534 6.957 13.082 20.550
```

The method has detected 13 *potential* outliers. It is
important to study the diagnostic plot to learn more about the potential
outliers. The robust (Mahalanobis) distances vs. the index of the
observations (`1:n`

) can be plotted as follows.

The dashed horizontal line shows the cutoff threshold on the robust distances. Observations above the line are nominated as potential outliers by the BACON algorithm. It is left to the analyst to finally label outlying observations as such. In the next section, we introduce an alternative plotting method (see below).

The method `is_outlier()`

returns a vector of logicals
whether an observation has been flagged as a potential outlier.

The (robust) center and covariance (scatter) matrix can be extracted
with the auxiliary functions, respectively, `center()`

and
`cov()`

.

The robust Mahalanobis distances can be extracted with the
`distance()`

method.

Old television sets had a cathode ray tube with an electron gun. The
emitted beam runs through a diaphragm that lets pass only a partial beam
to the screen. The diaphragm consists of 9 components. The Philips data
set contains \(n = 667\) measurements
on the \(p = 9\) components
(variables); see Rousseeuw and van Driessen
(1999).^{3} These data do not have sampling
weights.

```
> data(philips)
> head(philips)
X1 X2 X3 X4 X5 X6 X7 X8 X9
[1,] 0.153 -0.259 0.140 0.514 2.242 0.443 -0.021 -0.035 -0.065
[2,] 0.119 -0.309 0.132 0.518 2.269 0.458 -0.018 -0.035 -0.053
[3,] 0.173 -0.296 0.138 0.516 2.266 0.461 -0.023 -0.026 -0.052
[4,] 0.135 -0.306 0.139 0.522 2.288 0.464 -0.015 -0.031 -0.051
[5,] 0.143 -0.278 0.139 0.519 2.284 0.465 -0.016 -0.018 -0.054
[6,] 0.140 -0.284 0.159 0.531 2.287 0.465 -0.004 -0.024 -0.052
```

We compute the BACON algorithm but this time with the initialization
method `version = "V1"`

.

```
> fit <- wBACON(philips, alpha = 0.05, version = "V1")
> fit
Weighted BACON: Robust location, covariance, and distances
Converged in 7 iterations (alpha = 0.05)
Number of potential outliers: 82 (12.11%)
```

The BACON algorithm detected 82 potential outliers. The robust (Mahalanobis) distances can be plotted against the univariate projection of the data, which maximizes the separation criterion of Qiu and Joe (2006). This kind of diagnostic graph attempts to separate outlying from non-outlying observations as much as possible; see Willems et al. (2009). It is helpful if the outliers are clustered. The graph is generated as follows.

From the visual display, we see a cluster of potential outliers in the top right corner. The dashed horizontal line indicates the cutoff threshold on the distances as imposed by the BACON algorithm.

For very large datasets, the plot method can be called with the
(additional) argument `hex = TRUE`

to show a hexagonally
binned scatter plot; see below. This plot method uses the functionality
of the R package `hexbin`

(Carr et al.,
2023).

The education data is on education expenditures in 50 US states in
1975 (Chatterjee and Hadi, 2012, Chap. 5.7). The
data can be loaded from the `robustbase`

package.

It is convenient to rename the variables.