---
title: "biogram package"
author: "Micha&#322; Burdukiewicz, Piotr Sobczyk"
date: "5.01.2017"
output: 
  rmarkdown::html_vignette:
    toc: true
bibliography: "biogram_pub.bib"
vignette: >
  %\VignetteIndexEntry{biogram package - an overview}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE, message = FALSE, results='asis'}
library(knitr)
opts_chunk$set(fig.width=7, fig.height=6)

library(biogram)

library(ggplot2)

size_mod <- -5

my_theme <- theme(plot.background=element_rect(fill = "transparent",
                                               colour = "transparent"),
                  panel.grid.major = element_line(colour="lightgrey", linetype = "dashed", size = 0.5),
                  panel.background = element_rect(fill = "transparent",colour = "black"),
                  legend.background = element_rect(fill = "NA"),
                  legend.position = "bottom",
                  axis.text = element_text(size=13 + size_mod),
                  axis.title.x = element_text(size=16 + size_mod, vjust = -1), 
                  axis.title.y = element_text(size=16 + size_mod, vjust = 1),
                  strip.text = element_text(size=17 + size_mod, face = "bold"),
                  legend.text = element_text(size=13 + size_mod), 
                  legend.title = element_text(size=17 + size_mod),
                  plot.title = element_text(size=20 + size_mod),
                  strip.background = element_rect(fill = "NA", colour = "NA"))

```

# Reduction of dimensionality

Since the number of potential n-grams grows exponentially with the $n$, n-gram datasets are often very large. To deal with the curse of dimensionality, the **biogram** package offers two solutions. The first relies on the reduction of an alphabet, which is a common approach in case of analysis of amino acid sequences (@murphy_simplified_2000) and less applied in studies of nucleic acids.  

The alternate solution is to filter the non-informative n-grams. The **biogram** package employs feature selection algorithm QuiPT, which allows very fast feature filtering.   

## Alphabet reduction

In many cases the properties of the sequences are not depending on the exact sequence of the amino acids, but rather on their physicochemical properties. In this case, the full amino acid alphabet may be replaced with a shorter alphabet, where amino acids are aggregated to larger groups using some design criteria as physicochemical properties. The **biogram** package supports creation of reduced amino acid alphabets and their analysis by two distance measures: similarity index and encoding distance.

### Similarity index

The similarity index, as computed by the *calc_si* function, was firstly introduced as the unnamed distance measure for reduced alphabets by @stephenson_unearthing_2013. Briefly, if a pair of elements is in both encodings in the same group or in different groups, the pair scores 1 and in the opposite case 0. The pairs of the identical elements are ignored. The score is later divided by the number of possible pairs ($20 \times 19$ in case of the amino acid alphabet). 

$A$: an alphabet.  

$a_1$: an element in the reduced alphabet 1.  

$a_2$: an element in the reduced alphabet 2.  

$$
S = \sum_{a_1 \in A} \sum_{a_2 \in A, a_2 \neq a_1 } \sum_{enc_1 \in encodings_1} \sum_{enc_2 \in encodings_2} 1_{a_1 \in enc_1 \land a_2 \in enc_2}
$$

$$
I_S = \frac{S}{ |A|^2 - |A| }
$$

### Encoding distance

The encoding distance, as computed per the *calc_ed* function, is defined as the minimum number of amino acids that have to be moved between subgroups of encoding to make **a** identical to **b** (the order of subgroups in the encoding and amino acids in a group is unimportant).

This measure may be further normalized by a factor reflecting how much moving amino acids between groups altered mean group properties. Such behavior if generated by specifying parameter *prop*.

```{r, echo = FALSE, message = FALSE, results='asis'}
group2df <- function(group_list, caption = NULL, label = NULL) {
  data.frame(ID = 1L:length(group_list), 
             Groups = sapply(group_list, function(i)
    paste0(toupper(sort(i)), collapse = ", ")))
}

a <- list(`1` = "p", 
          `2` = c("f", "i", "w", "y"), 
          `3` = c("a", "c", "d", "e", "g", "h", "k", "l", "m", "n", "q", "r", "s", "t", "v"))

kable(group2df(a), caption = "Encoding A")
```


```{r, echo = FALSE, message = FALSE, results='asis'}
b <- list(`1` = c("f", "r", "w", "y"), 
          `2` = c("c", "i", "l", "t", "v"), 
          `3` = c("a", "d", "e", "g", "h", "k", "m", "n", "p", "q", "s"))

kable(group2df(b), caption = "Encoding B")
```

The encoding distance between both groups is `r calc_ed(a, b, measure = "pi")`.  

```{r, echo = FALSE, message = FALSE, results='asis'}
data(aaprop)
a_prop <- aaprop[c(22, 211), ]

#b_prop <- aa_nprop[na.omit(traits_table[ao, ]), , drop = FALSE]

# must have unified lists of features

coords_a <- lapply(a, function(single_subgroup) rowMeans(a_prop[, single_subgroup, drop = FALSE]))
coords_b <- lapply(b, function(single_subgroup) rowMeans(a_prop[, single_subgroup, drop = FALSE]))

dat_a <- data.frame(enc = "a", do.call(rbind, coords_a), label = paste0("A", 1L:3))
dat_b <- data.frame(enc = "b", do.call(rbind, coords_b), label = paste0("B", 1L:3))

dat <- data.frame(do.call(rbind, lapply(1L:nrow(dat_a), function(id) 
  data.frame(id = id, rbind(do.call(rbind, lapply(1L:3, function(dummy) 
    dat_a[id, , drop = FALSE])),
    dat_b)))), pair = c(paste0("d", 1L:3), paste0("d", 1L:3)))

colnames(dat) <- c("id", "enc", "f1", "f2", "label", "pair")
dat[["id"]] <- paste0("Encoding a\nsubgroup ", dat[["id"]])


ggplot(dat, aes(x = f1, y = f2, colour = pair, label = label)) +
  geom_line() +
  geom_point(aes(x = f1, y = f2, colour = enc), size = 4) + 
  facet_wrap(~ id) + 
  geom_text(aes(x = f1, y = f2, colour = enc, label = label), vjust = 1.8, size = 4) + 
  scale_color_brewer(palette="Dark2", guide = "none") +
  my_theme
```

The figure above represents the distances between groups of encoding **a** (green dots) and groups of encoding **b** (red dots). The position of the dot defined by mean values of specific properties of all amino acids belonging to the group.

```{r, echo = FALSE, message = FALSE, results='asis'}
tmp <- sapply(coords_a, function(single_coords_a) {
  distances <- sapply(coords_b, function(single_coords_b) 
    #vector of distances between groups
    sqrt(sum((single_coords_a - single_coords_b)^2))
  )
  #c(dist = min(distances), id = unname(which.min(distances)))
  distances
})

colnames(tmp) <- paste0("Enc a, group ", colnames(tmp))
rownames(tmp) <- paste0("Enc b, group ", rownames(tmp))

kable(tmp, caption = "Distances between groups of encodings a and b.")
```

The scale factor $s$ for the encoding distance between the encoding 
**a** with $n$ subgroups (enumerated with $i$) and the encoding **b** 
with $m$ subgroups (enumerated with $j$), we first calculated $p_i$ and $q_j$, 
i.e. the mean values of corresponding physicochemical properties of all amino acids for each 
subgroup. 

$$ 
s_{ab} = \sum^n_{i = 1}  \left( \min_{j=1,\dots,m} \; \; \sum^L_{l=1} \sqrt{ 
(p_{i,l} - q_{j,l})^2} \right) 
$$

In the case depicted above, $s_{ab}$ is equal to the sum of `r paste0(round(apply(tmp, 2, min), 4), collapse = ", ")`: `r round(sum(apply(tmp, 2, min)), 4)`.

#### biogram implementation

The encoding distance is computed with the *calc_ed* function.

```{r, echo = TRUE}

# define two encodings
a <- list(`1` = "p", 
          `2` = c("f", "i", "w", "y"), 
          `3` = c("a", "c", "d", "e", "g", "h", "k", "l", "m", "n", "q", "r", "s", "t", "v"))

b <- list(`1` = c("f", "r", "w", "y"), 
          `2` = c("c", "i", "l", "t", "v"), 
          `3` = c("a", "d", "e", "g", "h", "k", "m", "n", "p", "q", "s"))

# calculate encoding distance
calc_ed(a = a, b = b, measure = "pi")

# get properties from aaprop dataset and calculate normalized encoding distance
data(aaprop)
calc_ed(a = a, b = b, measure = "pi", prop = aaprop[c(22, 211), ])
```

### Quick Permutation Test (QuiPT)

QuiPT is fast solution for filtering a large number of binary features provided 
that target vector is also binary. It is typical case for positioned n-gram data. 
Moreover, even considering unpositioned n-grams, if the $n$ is sufficiently high, 
the matrix of counts is almost sparse and little to no information is lost after 
data is artificially binarized.

The simplest use case of QuiPT is a case of two Bernoulli random variables. One of 
them is feature, the other is target. We are interested in deciding whether feature brings
some important information about target. The **biogram** package employs several measures 
of such a dependence (see `?calc_criterion`). We shall consider in detail only information 
gain, but similar conclusions can be done for other measures.

##### Definition - entropy
For a discrete random variable we define entropy by:
$$ H(X) = - \sum^m p_j log p_j$$
For a Bernoulli r.v. we get a simplified expression
$$ H(X) = -p \cdot log(p) - (1-p) \cdot log(1-p)$$

##### Definition -  conditional entropy
For two discrete r.v. X, Y we define average conditional entropy by:
$$H(Y|X) = \sum_j P(X=v_j) H(Y|X=v_j)$$
If X and Y are Bernoulli's then:
$$H(Y|X) = q \cdot H(Y|X=1) + (1-q) \cdot H(Y|X=0)$$

##### Definition - Information gain
$$IG(Y|X) = H(Y) - H(Y|X)$$

The intuition is that:
> IG tells you how interesting a 2-d contingency table is going to be

In our application, analysis of n-grams, we are interested in Bernoulli r.v.
Let us consider the contingency table we get:

| target\\feature | 1 | 0 |
|:---:|:---:|:---:|
| 1 | $n_{1,1}$ | $n_{1,0}$ |
| 0 | $n_{0,1}$ | $n_{0,0}$ |


#### Computing IG distribution under null hypothesis
As can be seen above we have 4 possible outcomes.
If probability that target equals 1 is $p$ and probability that feature equals
1 is $q$ and feature and target are independent then each of them has the 
following probabilities 
$$P((Target, Feature) = (1,1)) = p \cdot q$$
$$P((Target, Feature) = (1,0)) = p \cdot (1-q)$$
$$P((Target, Feature) = (0,1)) = (1-p) \cdot q$$
$$P((Target, Feature) = (0,0)) = (1-p) \cdot (1-q)$$

This means that a target-feature can be described as multinomial distribution:

$$ 
\begin{aligned}
{n \choose n_{1,1}} (p\cdot q)^{n_{1,1}} 
{n - n_{1,1} \choose n_{1,0}} (p\cdot (1-q))^{n_{1,0}}
{n - n_{1,1} - n_{1,0} \choose n_{0,1}} ((1-p)\cdot q)^{n_{0,1}} \\
{n - n_{1,1} - n_{1,0} -n_{0,1}\choose n_{0,0}} ((1-p)\cdot (1-q))^{n_{0,0}}
\end{aligned}
$$

However we have important restriction that $n_{1,\cdot} = n_{1,1} + n_{1,0}$ and
$n_{\cdot, 1} = n_{1,1} + n_{0,1}$ are known and fixed as they describe the number 
of "ones" for target and feature respectively.

This might look very complicated but this restriction in fact simplifies
our computation significantly.

Observe that $n_{1,1}$ is from range $[0,min(n_{\cdot, 1}, n_{1, \cdot})]$.
So we get probability of certain contingency table as conditional distribution,
as impose restrictions on two parameters $n_{\cdot, 1}$ and $n_{1, \cdot}$
We can compute IG for each possible value of $n_{1,1}$ and finally we get
distribution of Information Gain under hypothesis that target and feature
are independent.

The calculation of distributions is performed by `distr_crit` function.
To facilitate time-consuming computations when dealing with a very large
number of features, we introduce a possibility to set the limit of 
calculated contingence matrices using `iter_limit` parameter. By default, 
IG is calculated for 200 contingence matrices, therefore we get an 
approximate distribution of Information Gain. 

Having exact or even approximate (when limiting the number of calculated 
contingence matrices) distribution lets us perform permutation test much 
quicker as we no longer need to perform any replications. Furthermore, 
by using exact test we will get precise values of tails which was not 
guaranteed with random permutations.

# References