`SimReg`

has a function `sim_reg`

for performing *Bayesian Similarity Regression* [1]. It returns an estimated probability of an association between ontological term sets and a binary variable, and conditional on the association: a *characteristic ontological profile*, such that ontological similarity to the profile increases the probability of the binary variable taking the value `TRUE`

. The procedure has been used in the context of linking an ontologically encoded phenotype (as HPO terms) to a binary genotype (indicating the presence or absence of a rare variant within given genes) [1], so in this guide, we’ll adopt the same theme.

The function accepts arguments including `logical`

response variable `y`

, ontologically encoded predictor variable `x`

, and additional arguments for tuning the compromise between execution speed and precision for the procedure.

It returns an object of class `sim_reg_output`

, which contains the pertinent information about the results of the inference. Of particular interest is the estimated probability of association, i.e. the probability that model selection indicator `gamma`

= 1. The function `prob_association`

can be used on the output to obtain such and estimate. Also, the posterior distribution of the characteristic ontological profile `phi`

may be of interest, for which the function `get_term_marginals`

can be used.

To set up an environment where we can run a simple example, we need an `ontology_index`

object. The `ontologyIndex`

package contains such an object - the Human Phenotype Ontology, `hpo`

- so we load the package and the data, and proceed to create an HPO profile `template`

and an enclosing set of terms, `terms`

, from which we’ll generate random term sets upon which to apply the function. In our setting, we’ll interpret this HPO profile `template`

as the typical phenotype of a hypothetical disease. We set `template`

to the set `HP:0005537, HP:0000729`

and `HP:0001873`

, corresponding to phenotype abnormalities ‘Decreased mean platelet volume’, ‘Autistic behavior’ and ‘Thrombocytopenia’ respectively.

```
library(ontologyIndex)
library(SimReg)
data(hpo)
# only use terms which are descended from HP:0000001
intersection_with_descendants(hpo, "HP:0000001", hpo$id)
pa_descs <- lapply(hpo, "[", pa_descs)
hpo <-class(hpo) <- "ontology_index"
set.seed(1)
c("HP:0005537", "HP:0000729", "HP:0001873")
template <- get_ancestors(hpo, c(template, sample(hpo$id, size=50))) terms <-
```

First, we’ll do an example where there is no association between `x`

and `y`

, and then one where there is an association.

In the example with no association, we’ll fix `y`

, with 10 `TRUE`

s and generate the `x`

randomly, with each set of ontological terms determined by sampling 5 random terms from `terms`

.

```
c(rep(TRUE, 10), rep(FALSE, 90))
y <- replicate(simplify=FALSE, n=100, expr=minimal_set(hpo, sample(terms, size=5))) x <-
```

Thus, our input data looks like:

` y`

```
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE
```

`head(x)`

```
## [[1]]
## [1] "HP:0031690" "HP:0002360" "HP:0100235" "HP:0500238"
##
## [[2]]
## [1] "HP:0012072" "HP:0033888" "HP:0034057" "HP:0000924"
##
## [[3]]
## [1] "HP:0008798" "HP:5200299" "HP:0040069" "HP:0010421"
##
## [[4]]
## [1] "HP:0009140" "HP:0034243" "HP:0011821" "HP:0007686" "HP:0002664"
##
## [[5]]
## [1] "HP:0000962" "HP:0034237" "HP:0410380" "HP:0100326" "HP:0025132"
##
## [[6]]
## [1] "HP:0001197" "HP:0100872" "HP:0033151" "HP:0010424"
```

Now we can call the `sim_reg`

function to estimate the probability of an association (note: by default, the probability of an association has a prior of 0.05 and this can be set by passing a `gamma_prior_prob`

argument), and print the mean posterior value of `gamma`

, corresponding to our estimated probability of association.

```
sim_reg(ontology=hpo, x=x, y=y)
no_assoc <-prob_association(no_assoc)
```

`## [1] 0.0001322736`

We note that there is a low probability of association. Now, we sample `x`

conditional on `y`

, so that if `y[i] == TRUE`

, then `x`

has 2 out of the 3 terms in `template`

added to its profile.

```
lapply(y, function(y_i) minimal_set(hpo, c(
x_assoc <-sample(terms, size=5), if (y_i) sample(template, size=2))))
```

If we look again at the first few values in `x`

for which `y[i] == TRUE`

, we notice that they contain terms from the template.

`head(x_assoc)`

```
## [[1]]
## [1] "HP:0034243" "HP:0033303" "HP:0033797" "HP:0001881" "HP:0001873"
## [6] "HP:0000729"
##
## [[2]]
## [1] "HP:0000119" "HP:0000479" "HP:0002817" "HP:0030272" "HP:0000600"
## [6] "HP:0001873" "HP:0000729"
##
## [[3]]
## [1] "HP:0000962" "HP:0008798" "HP:0011314" "HP:0011821" "HP:0002503"
## [6] "HP:0000729" "HP:0001873"
##
## [[4]]
## [1] "HP:0001574" "HP:0002085" "HP:0009116" "HP:4000137" "HP:0001120"
## [6] "HP:0001873" "HP:0000729"
##
## [[5]]
## [1] "HP:0100872" "HP:0001992" "HP:0010181" "HP:0010631" "HP:0000729"
## [6] "HP:0001873"
##
## [[6]]
## [1] "HP:0000077" "HP:0032973" "HP:0032183" "HP:0010355" "HP:0005537"
## [6] "HP:0001873"
```

Now we run the procedure again with the new `x`

and `y`

and print the mean posterior value of `gamma`

.

```
sim_reg(ontology=hpo, x=x_assoc, y=y)
assoc <-prob_association(assoc)
```

`## [1] 0.9999939`

We note that we infer a higher probability of association. We can also visualise the estimated characteristic ontological profile, using the function `plot_term_marginals`

, and note that the inferred characteristic phenotype corresponds well to `template`

.

`plot_term_marginals(hpo, get_term_marginals(assoc), max_terms=8, fontsize=30)`

- D. Greene, NIHR BioResource, S. Richardson, E. Turro, Phenotype similarity regression for identifying the genetic determinants of rare diseases, The American Journal of Human Genetics 98, 1-10, March 3, 2016.