Title: | N-Gram Analysis of Biological Sequences |
---|---|
Description: | Tools for extraction and analysis of various n-grams (k-mers) derived from biological sequences (proteins or nucleic acids). Contains QuiPT (quick permutation test) for fast feature-filtering of the n-gram data. |
Authors: | Michal Burdukiewicz [cre, aut], Piotr Sobczyk [aut], Chris Lauber [aut], Dominik Rafacz [aut], Katarzyna Sidorczuk [ctb] |
Maintainer: | Michal Burdukiewicz <[email protected]> |
License: | GPL-3 |
Version: | 1.6.3 |
Built: | 2025-01-27 04:55:56 UTC |
Source: | https://github.com/michbur/biogram |
The biogram package is a toolbox for the analysis of nucleic acid and protein sequences using n-grams. Possible applications include motif discovery, feature selection, clustering, and classification.
n-grams (k-tuples) are sets of n characters derived from the input sequence(s). They may form continuous sub-sequences or be discontinuous. For example, from the sequence of nucleotides AATA one can extract the following continuous 2-grams (bigrams): AA, AT and TA. Moreover, there are two possible bigrams separated by a single space: A_T and A_A, and one bigram separated by two spaces: A__A.
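The AATA example can be reproduced with a few lines of base R. This is only an illustrative sketch: the helper gapped_bigrams is a hypothetical name, not a biogram function, and it handles bigrams only.

```r
# Illustrative helper (not part of biogram): extract all bigrams whose two
# elements are separated by `gap` positions; each gap is shown as "_".
gapped_bigrams <- function(seq_string, gap = 0) {
  chars <- strsplit(seq_string, "")[[1]]
  n_starts <- length(chars) - gap - 1
  if (n_starts < 1) return(character(0))
  starts <- seq_len(n_starts)
  paste0(chars[starts], strrep("_", gap), chars[starts + gap + 1])
}

gapped_bigrams("AATA", gap = 0)  # "AA" "AT" "TA"
gapped_bigrams("AATA", gap = 1)  # "A_T" "A_A"
gapped_bigrams("AATA", gap = 2)  # "A__A"
```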
Another important n-gram parameter is its position. Instead of just counting n-grams, one may want to count how many n-grams occur at a given position in multiple (e.g. related) sequences. For example, in the sequences AATA and AACA there is only one bigram at position 1: AA, but there are two bigrams at position 2: AT and AC. The following notation is used for position-specific n-grams: 1_AA, 2_AT, 2_AC.
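Position-specific counting can be sketched in base R as follows. The helper positioned_bigrams is a hypothetical name used for illustration and covers continuous bigrams only; it is not how biogram stores its results.

```r
# Illustrative helper (not part of biogram): label every continuous bigram
# with its start position, using the position_ngram notation.
positioned_bigrams <- function(seq_string) {
  chars <- strsplit(seq_string, "")[[1]]
  if (length(chars) < 2) return(character(0))
  starts <- seq_len(length(chars) - 1)
  paste0(starts, "_", chars[starts], chars[starts + 1])
}

seqs <- c("AATA", "AACA")
all_ngrams <- unlist(lapply(seqs, positioned_bigrams))

# How many sequences carry each positioned bigram
counts <- table(all_ngrams)
counts[["1_AA"]]  # 2: both sequences start with AA
counts[["2_AT"]]  # 1
counts[["2_AC"]]  # 1
```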
In the biogram package, the count_ngrams function is used for counting and extracting n-grams. Using the d argument, the user can specify the distance between elements of the n-grams. The pos argument can be used to enable position specificity.
We note that n-grams suffer from the curse of dimensionality. For example, for a peptide of length 6, 20^6 = 64,000,000 distinct n-grams are possible, and the number of positioned n-grams is larger still. Data sets of such an enormous size are hard to manage and analyze in R.
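The growth of the feature space is easy to check with simple arithmetic. The sketch below assumes an alphabet of 20 amino acids and continuous n-grams that may start at any admissible position; the helper n_positioned is a hypothetical name.

```r
# Number of distinct continuous n-grams over an alphabet of 20 amino acids
alphabet_size <- 20
n <- 1:6
n_possible <- alphabet_size^n
data.frame(n = n, possible_ngrams = n_possible)

# Positioned n-grams: one feature per (start position, n-gram) pair.
# A sequence of length seq_length has seq_length - n + 1 start positions.
n_positioned <- function(seq_length, n, alphabet_size) {
  (seq_length - n + 1) * alphabet_size^n
}

n_positioned(9, 3, 20)  # 56000 positioned trigrams for 9 residues
n_positioned(9, 3, 5)   # 875 after reducing the alphabet to 5 groups
```

The last two lines show why reducing the alphabet, as done later with five groups, shrinks the problem so sharply.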
The biogram package addresses both of the abovementioned problems, data management and analysis. It exploits an innate property of n-gram data: they can usually be represented by sparse matrices. Data storage relies on functionality from the slam package. To ease the selection of significant features, biogram provides the user with QuiPT, a very fast permutation test for binary data (see test_features).
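For orientation, an ordinary (slow) permutation test for a binary feature is sketched below in base R. This is not QuiPT and not biogram code: it is the brute-force procedure that a fast test is meant to replace. The use of information gain as the criterion, and the names entropy, info_gain and perm_test, are assumptions made for this illustration.

```r
# Shannon entropy (in bits) of a discrete vector
entropy <- function(x) {
  p <- table(x) / length(x)
  -sum(p * log2(p))
}

# Information gain of a binary feature with respect to a binary target
info_gain <- function(target, feature) {
  cond <- sapply(split(target, feature),
                 function(t) length(t) / length(target) * entropy(t))
  entropy(target) - sum(cond)
}

# Brute-force permutation test: shuffle the feature and compare the
# permuted information gain with the observed one
perm_test <- function(target, feature, times = 2000) {
  observed <- info_gain(target, feature)
  permuted <- replicate(times, info_gain(target, sample(feature)))
  (sum(permuted >= observed) + 1) / (times + 1)
}

set.seed(1)
target <- rep(c(0, 1), each = 20)

# A feature that agrees with the target in 38 of 40 cases
informative <- target
informative[c(1, 21)] <- 1 - informative[c(1, 21)]

# A feature that is balanced within both target groups
noise <- rep(c(0, 1), times = 20)

p_informative <- perm_test(target, informative)
p_noise <- perm_test(target, noise)
```

The informative feature should receive a small p-value and the balanced one a large p-value. The cost of this approach grows with the number of permutations and features, which is the motivation for a faster test on binary data.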
Another way of reducing dimensionality is the aggregation of sequence residues into more general groups. For example, all positively-charged amino acids may be aggregated into one group. This action can be performed using the degenerate function.
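The idea of aggregation can be sketched with a lookup table in base R. The grouping below is illustrative only (lysine, arginine and histidine as a "positive" group, aspartate and glutamate as a "negative" group), and the code does not reproduce the degenerate function itself.

```r
# Illustrative grouping of residues by charge (an assumption for this sketch)
groups <- list(p = c("K", "R", "H"),  # positively charged
               n = c("D", "E"))       # negatively charged

# Build a residue -> group lookup vector
lookup <- setNames(rep(names(groups), lengths(groups)),
                   unlist(groups, use.names = FALSE))

peptide <- c("K", "D", "R", "E", "H")
degenerated <- unname(lookup[peptide])
degenerated  # "p" "n" "p" "n" "p"
```

Residues that belong to no group would map to NA in this sketch, so a complete encoding should cover every element of the alphabet.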
Encoding of amino acids can ease sequence analysis, but multidimensional objects such as aggregations of amino acids are not easily comparable. We introduced the encoding distance, a measure defining the distance between encodings. It can be computed using the calc_ed function.
Michal Burdukiewicz, Piotr Sobczyk, Chris Lauber
library(biogram)

# use data set from package
data(human_cleave)

# first nine columns represent subsequent nine amino acids from cleavage sites
# degenerate the sequence to reduce the dimensionality of the problem
# (use five groups instead of 20 amino acids)
deg_seqs <- degenerate(human_cleave[, 1L:9],
                       list(`a` = c(1, 6, 8, 10, 11, 18),
                            `b` = c(2, 13, 14, 16, 17),
                            `c` = c(5, 19, 20),
                            `d` = c(7, 9, 12, 15),
                            `e` = c(3, 4)))

# EXAMPLE 1 - extract significant trigrams
# extract trigrams
trigrams <- count_ngrams(deg_seqs, 3, letters[1L:5], pos = TRUE)
# select features that differ between the two target groups using QuiPT
test1 <- test_features(human_cleave[, "tar"], trigrams)
# see a summary of the results
summary(test1)
# aggregate features in groups based on their p-value
gr <- cut(test1)
# get position map of the most significant n-grams
position_ngrams(gr[[1]])
# transform the most significant n-grams to more readable form
decode_ngrams(gr[[1]])

# EXAMPLE 2 - search for specific n-grams
# the n-grams of interest are a_a (a-gap-a) and c_c (c-gap-c) at the
# 3rd and 4th position
# first, code the n-grams in biogram notation
coded <- code_ngrams(c("a_a", "c_c"))
# add position information
coded <- c(paste0("3_", coded), paste0("4_", coded))
# count only the features of interest
bigrams <- count_specified(deg_seqs, coded)
# test which of the features of interest are significant
test2 <- test_features(human_cleave[, "tar"], bigrams)
cut(test2)
Normalized (0-1) 554 amino acid properties as retrieved from the AAIndex database (release 9.1), enriched with contactivity of amino acids.
A data frame with 20 columns and 600 rows.
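As an illustration of what a 0-1 normalization can look like, the sketch below applies min-max scaling to four Kyte-Doolittle hydropathy values. That the data set was normalized exactly this way is an assumption, and the helper min_max is a hypothetical name.

```r
# Min-max scaling to the [0, 1] range (assumed normalization scheme)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Kyte-Doolittle hydropathy index for four residues
hydropathy <- c(I = 4.5, A = 1.8, G = -0.4, R = -4.5)

scaled <- min_max(hydropathy)
round(scaled, 3)
#     I     A     G     R
# 1.000 0.700 0.456 0.000
```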
The following properties are included; each entry gives the description of the property and its source reference:
alpha-CH chemical shifts (Andersen et al., 1992)
Hydrophobicity index (Argos et al., 1982)
Signal sequence helical potential (Argos et al., 1982)
Membrane-buried preference parameters (Argos et al., 1982)
Conformational parameter of inner helix (Beghin-Dirkx, 1975)
Conformational parameter of beta-structure (Beghin-Dirkx, 1975)
Conformational parameter of beta-turn (Beghin-Dirkx, 1975)
Average flexibility indices (Bhaskaran-Ponnuswamy, 1988)
Residue volume (Bigelow, 1967)
Information value for accessibility; average fraction 35% (Biou et al., 1988)
Information value for accessibility; average fraction 23% (Biou et al., 1988)
Retention coefficient in TFA (Browne et al., 1982)
Retention coefficient in HFBA (Browne et al., 1982)
Transfer free energy to surface (Bull-Breese, 1974)
Apparent partial specific volume (Bull-Breese, 1974)
alpha-NH chemical shifts (Bundi-Wuthrich, 1979)
alpha-CH chemical shifts (Bundi-Wuthrich, 1979)
Spin-spin coupling constants 3JHalpha-NH (Bundi-Wuthrich, 1979)
Normalized frequency of alpha-helix (Burgess et al., 1974)
Normalized frequency of extended structure (Burgess et al., 1974)
Steric parameter (Charton, 1981)
Polarizability parameter (Charton-Charton, 1982)
Free energy of solution in water, kcal/mole (Charton-Charton, 1982)
The Chou-Fasman parameter of the coil conformation (Charton-Charton, 1983)
A parameter defined from the residuals obtained from the best correlation of the Chou-Fasman parameter of beta-sheet (Charton-Charton, 1983)
The number of atoms in the side chain labelled 1+1 (Charton-Charton, 1983)
The number of atoms in the side chain labelled 2+1 (Charton-Charton, 1983)
The number of atoms in the side chain labelled 3+1 (Charton-Charton, 1983)
The number of bonds in the longest chain (Charton-Charton, 1983)
A parameter of charge transfer capability (Charton-Charton, 1983)
A parameter of charge transfer donor capability (Charton-Charton, 1983)
Average volume of buried residue (Chothia, 1975)
Residue accessible surface area in tripeptide (Chothia, 1976)
Residue accessible surface area in folded protein (Chothia, 1976)
Proportion of residues 95% buried (Chothia, 1976)
Proportion of residues 100% buried (Chothia, 1976)
Normalized frequency of beta-turn (Chou-Fasman, 1978a)
Normalized frequency of alpha-helix (Chou-Fasman, 1978b)
Normalized frequency of beta-sheet (Chou-Fasman, 1978b)
Normalized frequency of beta-turn (Chou-Fasman, 1978b)
Normalized frequency of N-terminal helix (Chou-Fasman, 1978b)
Normalized frequency of C-terminal helix (Chou-Fasman, 1978b)
Normalized frequency of N-terminal non helical region (Chou-Fasman, 1978b)
Normalized frequency of C-terminal non helical region (Chou-Fasman, 1978b)
Normalized frequency of N-terminal beta-sheet (Chou-Fasman, 1978b)
Normalized frequency of C-terminal beta-sheet (Chou-Fasman, 1978b)
Normalized frequency of N-terminal non beta region (Chou-Fasman, 1978b)
Normalized frequency of C-terminal non beta region (Chou-Fasman, 1978b)
Frequency of the 1st residue in turn (Chou-Fasman, 1978b)
Frequency of the 2nd residue in turn (Chou-Fasman, 1978b)
Frequency of the 3rd residue in turn (Chou-Fasman, 1978b)
Frequency of the 4th residue in turn (Chou-Fasman, 1978b)
Normalized frequency of the 2nd and 3rd residues in turn (Chou-Fasman, 1978b)
Normalized hydrophobicity scales for alpha-proteins (Cid et al., 1992)
Normalized hydrophobicity scales for beta-proteins (Cid et al., 1992)
Normalized hydrophobicity scales for alpha+beta-proteins (Cid et al., 1992)
Normalized hydrophobicity scales for alpha/beta-proteins (Cid et al., 1992)
Normalized average hydrophobicity scales (Cid et al., 1992)
Partial specific volume (Cohn-Edsall, 1943)
Normalized frequency of middle helix (Crawford et al., 1973)
Normalized frequency of beta-sheet (Crawford et al., 1973)
Normalized frequency of turn (Crawford et al., 1973)
Size (Dawson, 1972)
Amino acid composition (Dayhoff et al., 1978a)
Relative mutability (Dayhoff et al., 1978b)
Membrane preference for cytochrome b: MPH89 (Degli Esposti et al., 1990)
Average membrane preference: AMP07 (Degli Esposti et al., 1990)
Consensus normalized hydrophobicity scale (Eisenberg, 1984)
Solvation free energy (Eisenberg-McLachlan, 1986)
Atom-based hydrophobic moment (Eisenberg-McLachlan, 1986)
Direction of hydrophobic moment (Eisenberg-McLachlan, 1986)
Molecular weight (Fasman, 1976)
Melting point (Fasman, 1976)
Optical rotation (Fasman, 1976)
pK-N (Fasman, 1976)
pK-C (Fasman, 1976)
Hydrophobic parameter pi (Fauchere-Pliska, 1983)
Graph shape index (Fauchere et al., 1988)
Smoothed upsilon steric parameter (Fauchere et al., 1988)
Normalized van der Waals volume (Fauchere et al., 1988)
STERIMOL length of the side chain (Fauchere et al., 1988)
STERIMOL minimum width of the side chain (Fauchere et al., 1988)
STERIMOL maximum width of the side chain (Fauchere et al., 1988)
N.m.r. chemical shift of alpha-carbon (Fauchere et al., 1988)
Localized electrical effect (Fauchere et al., 1988)
Number of hydrogen bond donors (Fauchere et al., 1988)
Number of full nonbonding orbitals (Fauchere et al., 1988)
Positive charge (Fauchere et al., 1988)
Negative charge (Fauchere et al., 1988)
pK-a(RCOOH) (Fauchere et al., 1988)
Helix-coil equilibrium constant (Finkelstein-Ptitsyn, 1977)
Helix initiation parameter at position i-1 (Finkelstein et al., 1991)
Helix initiation parameter at position i,i+1,i+2 (Finkelstein et al., 1991)
Helix termination parameter at position j-2,j-1,j (Finkelstein et al., 1991)
Helix termination parameter at position j+1 (Finkelstein et al., 1991)
Partition coefficient (Garel et al., 1973)
Alpha-helix indices (Geisow-Roberts, 1980)
Alpha-helix indices for alpha-proteins (Geisow-Roberts, 1980)
Alpha-helix indices for beta-proteins (Geisow-Roberts, 1980)
Alpha-helix indices for alpha/beta-proteins (Geisow-Roberts, 1980)
Beta-strand indices (Geisow-Roberts, 1980)
Beta-strand indices for beta-proteins (Geisow-Roberts, 1980)
Beta-strand indices for alpha/beta-proteins (Geisow-Roberts, 1980)
Aperiodic indices (Geisow-Roberts, 1980)
Aperiodic indices for alpha-proteins (Geisow-Roberts, 1980)
Aperiodic indices for beta-proteins (Geisow-Roberts, 1980)
Aperiodic indices for alpha/beta-proteins (Geisow-Roberts, 1980)
Hydrophobicity factor (Goldsack-Chalifoux, 1973)
Residue volume (Goldsack-Chalifoux, 1973)
Composition (Grantham, 1974)
Polarity (Grantham, 1974)
Volume (Grantham, 1974)
Partition energy (Guy, 1985)
Hydration number (Hopfinger, 1971), Cited by Charton-Charton (1982)
Hydrophilicity value (Hopp-Woods, 1981)
Heat capacity (Hutchens, 1970)
Absolute entropy (Hutchens, 1970)
Entropy of formation (Hutchens, 1970)
Normalized relative frequency of alpha-helix (Isogai et al., 1980)
Normalized relative frequency of extended structure (Isogai et al., 1980)
Normalized relative frequency of bend (Isogai et al., 1980)
Normalized relative frequency of bend R (Isogai et al., 1980)
Normalized relative frequency of bend S (Isogai et al., 1980)
Normalized relative frequency of helix end (Isogai et al., 1980)
Normalized relative frequency of double bend (Isogai et al., 1980)
Normalized relative frequency of coil (Isogai et al., 1980)
Average accessible surface area (Janin et al., 1978)
Percentage of buried residues (Janin et al., 1978)
Percentage of exposed residues (Janin et al., 1978)
Ratio of buried and accessible molar fractions (Janin, 1979)
Transfer free energy (Janin, 1979)
Hydrophobicity (Jones, 1975)
pK (-COOH) (Jones, 1975)
Relative frequency of occurrence (Jones et al., 1992)
Relative mutability (Jones et al., 1992)
Amino acid distribution (Jukes et al., 1975)
Sequence frequency (Jungck, 1978)
Average relative probability of helix (Kanehisa-Tsong, 1980)
Average relative probability of beta-sheet (Kanehisa-Tsong, 1980)
Average relative probability of inner helix (Kanehisa-Tsong, 1980)
Average relative probability of inner beta-sheet (Kanehisa-Tsong, 1980)
Flexibility parameter for no rigid neighbors (Karplus-Schulz, 1985)
Flexibility parameter for one rigid neighbor (Karplus-Schulz, 1985)
Flexibility parameter for two rigid neighbors (Karplus-Schulz, 1985)
The Kerr-constant increments (Khanarian-Moore, 1980)
Net charge (Klein et al., 1984)
Side chain interaction parameter (Krigbaum-Rubin, 1971)
Side chain interaction parameter (Krigbaum-Komoriya, 1979)
Fraction of site occupied by water (Krigbaum-Komoriya, 1979)
Side chain volume (Krigbaum-Komoriya, 1979)
Hydropathy index (Kyte-Doolittle, 1982)
Transfer free energy, CHP/water (Lawson et al., 1984)
Hydrophobic parameter (Levitt, 1976)
Distance between C-alpha and centroid of side chain (Levitt, 1976)
Side chain angle theta(AAR) (Levitt, 1976)
Side chain torsion angle phi(AAAR) (Levitt, 1976)
Radius of gyration of side chain (Levitt, 1976)
van der Waals parameter R0 (Levitt, 1976)
van der Waals parameter epsilon (Levitt, 1976)
Normalized frequency of alpha-helix, with weights (Levitt, 1978)
Normalized frequency of beta-sheet, with weights (Levitt, 1978)
Normalized frequency of reverse turn, with weights (Levitt, 1978)
Normalized frequency of alpha-helix, unweighted (Levitt, 1978)
Normalized frequency of beta-sheet, unweighted (Levitt, 1978)
Normalized frequency of reverse turn, unweighted (Levitt, 1978)
Frequency of occurrence in beta-bends (Lewis et al., 1971)
Conformational preference for all beta-strands (Lifson-Sander, 1979)
Conformational preference for parallel beta-strands (Lifson-Sander, 1979)
Conformational preference for antiparallel beta-strands (Lifson-Sander, 1979)
Average surrounding hydrophobicity (Manavalan-Ponnuswamy, 1978)
Normalized frequency of alpha-helix (Maxfield-Scheraga, 1976)
Normalized frequency of extended structure (Maxfield-Scheraga, 1976)
Normalized frequency of zeta R (Maxfield-Scheraga, 1976)
Normalized frequency of left-handed alpha-helix (Maxfield-Scheraga, 1976)
Normalized frequency of zeta L (Maxfield-Scheraga, 1976)
Normalized frequency of alpha region (Maxfield-Scheraga, 1976)
Refractivity (McMeekin et al., 1964), Cited by Jones (1975)
Retention coefficient in HPLC, pH7.4 (Meek, 1980)
Retention coefficient in HPLC, pH2.1 (Meek, 1980)
Retention coefficient in NaClO4 (Meek-Rossetti, 1981)
Retention coefficient in NaH2PO4 (Meek-Rossetti, 1981)
Average reduced distance for C-alpha (Meirovitch et al., 1980)
Average reduced distance for side chain (Meirovitch et al., 1980)
Average side chain orientation angle (Meirovitch et al., 1980)
Effective partition energy (Miyazawa-Jernigan, 1985)
Normalized frequency of alpha-helix (Nagano, 1973)
Normalized frequency of beta-structure (Nagano, 1973)
Normalized frequency of coil (Nagano, 1973)
AA composition of total proteins (Nakashima et al., 1990)
SD of AA composition of total proteins (Nakashima et al., 1990)
AA composition of mt-proteins (Nakashima et al., 1990)
Normalized composition of mt-proteins (Nakashima et al., 1990)
AA composition of mt-proteins from animal (Nakashima et al., 1990)
Normalized composition from animal (Nakashima et al., 1990)
AA composition of mt-proteins from fungi and plant (Nakashima et al., 1990)
Normalized composition from fungi and plant (Nakashima et al., 1990)
AA composition of membrane proteins (Nakashima et al., 1990)
Normalized composition of membrane proteins (Nakashima et al., 1990)
Transmembrane regions of non-mt-proteins (Nakashima et al., 1990)
Transmembrane regions of mt-proteins (Nakashima et al., 1990)
Ratio of average and computed composition (Nakashima et al., 1990)
AA composition of CYT of single-spanning proteins (Nakashima-Nishikawa, 1992)
AA composition of CYT2 of single-spanning proteins (Nakashima-Nishikawa, 1992)
AA composition of EXT of single-spanning proteins (Nakashima-Nishikawa, 1992)
AA composition of EXT2 of single-spanning proteins (Nakashima-Nishikawa, 1992)
AA composition of MEM of single-spanning proteins (Nakashima-Nishikawa, 1992)
AA composition of CYT of multi-spanning proteins (Nakashima-Nishikawa, 1992)
AA composition of EXT of multi-spanning proteins (Nakashima-Nishikawa, 1992)
AA composition of MEM of multi-spanning proteins (Nakashima-Nishikawa, 1992)
8 A contact number (Nishikawa-Ooi, 1980)
14 A contact number (Nishikawa-Ooi, 1986)
Transfer energy, organic solvent/water (Nozaki-Tanford, 1971)
Average non-bonded energy per atom (Oobatake-Ooi, 1977)
Short and medium range non-bonded energy per atom (Oobatake-Ooi, 1977)
Long range non-bonded energy per atom (Oobatake-Ooi, 1977)
Average non-bonded energy per residue (Oobatake-Ooi, 1977)
Short and medium range non-bonded energy per residue (Oobatake-Ooi, 1977)
Optimized beta-structure-coil equilibrium constant (Oobatake et al., 1985)
Optimized propensity to form reverse turn (Oobatake et al., 1985)
Optimized transfer energy parameter (Oobatake et al., 1985)
Optimized average non-bonded energy per atom (Oobatake et al., 1985)
Optimized side chain interaction parameter (Oobatake et al., 1985)
Normalized frequency of alpha-helix from LG (Palau et al., 1981)
Normalized frequency of alpha-helix from CF (Palau et al., 1981)
Normalized frequency of beta-sheet from LG (Palau et al., 1981)
Normalized frequency of beta-sheet from CF (Palau et al., 1981)
Normalized frequency of turn from LG (Palau et al., 1981)
Normalized frequency of turn from CF (Palau et al., 1981)
Normalized frequency of alpha-helix in all-alpha class (Palau et al., 1981)
Normalized frequency of alpha-helix in alpha+beta class (Palau et al., 1981)
Normalized frequency of alpha-helix in alpha/beta class (Palau et al., 1981)
Normalized frequency of beta-sheet in all-beta class (Palau et al., 1981)
Normalized frequency of beta-sheet in alpha+beta class (Palau et al., 1981)
Normalized frequency of beta-sheet in alpha/beta class (Palau et al., 1981)
Normalized frequency of turn in all-alpha class (Palau et al., 1981)
Normalized frequency of turn in all-beta class (Palau et al., 1981)
Normalized frequency of turn in alpha+beta class (Palau et al., 1981)
Normalized frequency of turn in alpha/beta class (Palau et al., 1981)
HPLC parameter (Parker et al., 1986)
Partition coefficient (Pliska et al., 1981)
Surrounding hydrophobicity in folded form (Ponnuswamy et al., 1980)
Average gain in surrounding hydrophobicity (Ponnuswamy et al., 1980)
Average gain ratio in surrounding hydrophobicity (Ponnuswamy et al., 1980)
Surrounding hydrophobicity in alpha-helix (Ponnuswamy et al., 1980)
Surrounding hydrophobicity in beta-sheet (Ponnuswamy et al., 1980)
Surrounding hydrophobicity in turn (Ponnuswamy et al., 1980)
Accessibility reduction ratio (Ponnuswamy et al., 1980)
Average number of surrounding residues (Ponnuswamy et al., 1980)
Intercept in regression analysis (Prabhakaran-Ponnuswamy, 1982)
Slope in regression analysis x 1.0E1 (Prabhakaran-Ponnuswamy, 1982)
Correlation coefficient in regression analysis (Prabhakaran-Ponnuswamy, 1982)
Hydrophobicity (Prabhakaran, 1990)
Relative frequency in alpha-helix (Prabhakaran, 1990)
Relative frequency in beta-sheet (Prabhakaran, 1990)
Relative frequency in reverse-turn (Prabhakaran, 1990)
Helix-coil equilibrium constant (Ptitsyn-Finkelstein, 1983)
Beta-coil equilibrium constant (Ptitsyn-Finkelstein, 1983)
Weights for alpha-helix at the window position of -6 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of -5 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of -4 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of -3 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of -2 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of -1 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of 0 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of 1 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of 2 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of 3 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of 4 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of 5 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of 6 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of -6 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of -5 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of -4 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of -3 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of -2 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of -1 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of 0 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of 1 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of 2 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of 3 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of 4 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of 5 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of 6 (Qian-Sejnowski, 1988)
Weights for coil at the window position of -6 (Qian-Sejnowski, 1988)
Weights for coil at the window position of -5 (Qian-Sejnowski, 1988)
Weights for coil at the window position of -4 (Qian-Sejnowski, 1988)
Weights for coil at the window position of -3 (Qian-Sejnowski, 1988)
Weights for coil at the window position of -2 (Qian-Sejnowski, 1988)
Weights for coil at the window position of -1 (Qian-Sejnowski, 1988)
Weights for coil at the window position of 0 (Qian-Sejnowski, 1988)
Weights for coil at the window position of 1 (Qian-Sejnowski, 1988)
Weights for coil at the window position of 2 (Qian-Sejnowski, 1988)
Weights for coil at the window position of 3 (Qian-Sejnowski, 1988)
Weights for coil at the window position of 4 (Qian-Sejnowski, 1988)
Weights for coil at the window position of 5 (Qian-Sejnowski, 1988)
Weights for coil at the window position of 6 (Qian-Sejnowski, 1988)
Average reduced distance for C-alpha (Rackovsky-Scheraga, 1977)
Average reduced distance for side chain (Rackovsky-Scheraga, 1977)
Side chain orientational preference (Rackovsky-Scheraga, 1977)
Average relative fractional occurrence in A0(i) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in AR(i) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in AL(i) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in EL(i) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in E0(i) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in ER(i) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in A0(i-1) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in AR(i-1) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in AL(i-1) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in EL(i-1) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in E0(i-1) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in ER(i-1) (Rackovsky-Scheraga, 1982)
Value of theta(i) (Rackovsky-Scheraga, 1982)
Value of theta(i-1) (Rackovsky-Scheraga, 1982)
Transfer free energy from chx to wat (Radzicka-Wolfenden, 1988)
Transfer free energy from oct to wat (Radzicka-Wolfenden, 1988)
Transfer free energy from vap to chx (Radzicka-Wolfenden, 1988)
Transfer free energy from chx to oct (Radzicka-Wolfenden, 1988)
Transfer free energy from vap to oct (Radzicka-Wolfenden, 1988)
Accessible surface area (Radzicka-Wolfenden, 1988)
Energy transfer from out to in (95% buried) (Radzicka-Wolfenden, 1988)
Mean polarity (Radzicka-Wolfenden, 1988)
Relative preference value at N" (Richardson-Richardson, 1988)
Relative preference value at N' (Richardson-Richardson, 1988)
Relative preference value at N-cap (Richardson-Richardson, 1988)
Relative preference value at N1 (Richardson-Richardson, 1988)
Relative preference value at N2 (Richardson-Richardson, 1988)
Relative preference value at N3 (Richardson-Richardson, 1988)
Relative preference value at N4 (Richardson-Richardson, 1988)
Relative preference value at N5 (Richardson-Richardson, 1988)
Relative preference value at Mid (Richardson-Richardson, 1988)
Relative preference value at C5 (Richardson-Richardson, 1988)
Relative preference value at C4 (Richardson-Richardson, 1988)
Relative preference value at C3 (Richardson-Richardson, 1988)
Relative preference value at C2 (Richardson-Richardson, 1988)
Relative preference value at C1 (Richardson-Richardson, 1988)
Relative preference value at C-cap (Richardson-Richardson, 1988)
Relative preference value at C' (Richardson-Richardson, 1988)
Relative preference value at C" (Richardson-Richardson, 1988)
Information measure for alpha-helix (Robson-Suzuki, 1976)
Information measure for N-terminal helix (Robson-Suzuki, 1976)
Information measure for middle helix (Robson-Suzuki, 1976)
Information measure for C-terminal helix (Robson-Suzuki, 1976)
Information measure for extended (Robson-Suzuki, 1976)
Information measure for pleated-sheet (Robson-Suzuki, 1976)
Information measure for extended without H-bond (Robson-Suzuki, 1976)
Information measure for turn (Robson-Suzuki, 1976)
Information measure for N-terminal turn (Robson-Suzuki, 1976)
Information measure for middle turn (Robson-Suzuki, 1976)
Information measure for C-terminal turn (Robson-Suzuki, 1976)
Information measure for coil (Robson-Suzuki, 1976)
Information measure for loop (Robson-Suzuki, 1976)
Hydration free energy (Robson-Osguthorpe, 1979)
Mean area buried on transfer (Rose et al., 1985)
Mean fractional area loss (Rose et al., 1985)
Side chain hydropathy, uncorrected for solvation (Roseman, 1988)
Side chain hydropathy, corrected for solvation (Roseman, 1988)
Loss of Side chain hydropathy by helix formation (Roseman, 1988)
Transfer free energy (Simon, 1976), Cited by Charton-Charton (1982)
Principal component I (Sneath, 1966)
Principal component II (Sneath, 1966)
Principal component III (Sneath, 1966)
Principal component IV (Sneath, 1966)
Zimm-Bragg parameter s at 20 C (Sueki et al., 1984)
Zimm-Bragg parameter sigma x 1.0E4 (Sueki et al., 1984)
Optimal matching hydrophobicity (Sweet-Eisenberg, 1983)
Normalized frequency of alpha-helix (Tanaka-Scheraga, 1977)
Normalized frequency of isolated helix (Tanaka-Scheraga, 1977)
Normalized frequency of extended structure (Tanaka-Scheraga, 1977)
Normalized frequency of chain reversal R (Tanaka-Scheraga, 1977)
Normalized frequency of chain reversal S (Tanaka-Scheraga, 1977)
Normalized frequency of chain reversal D (Tanaka-Scheraga, 1977)
Normalized frequency of left-handed helix (Tanaka-Scheraga, 1977)
Normalized frequency of zeta R (Tanaka-Scheraga, 1977)
Normalized frequency of coil (Tanaka-Scheraga, 1977)
Normalized frequency of chain reversal (Tanaka-Scheraga, 1977)
Relative population of conformational state A (Vasquez et al., 1983)
Relative population of conformational state C (Vasquez et al., 1983)
Relative population of conformational state E (Vasquez et al., 1983)
Electron-ion interaction potential (Veljkovic et al., 1985)
Bitterness (Venanzi, 1984)
Transfer free energy to lipophilic phase (von Heijne-Blomberg, 1979)
Average interactions per side chain atom (Warme-Morgan, 1978)
RF value in high salt chromatography (Weber-Lacey, 1978)
Propensity to be buried inside (Wertz-Scheraga, 1978)
Free energy change of epsilon(i) to epsilon(ex) (Wertz-Scheraga, 1978)
Free energy change of alpha(Ri) to alpha(Rh) (Wertz-Scheraga, 1978)
Free energy change of epsilon(i) to alpha(Rh) (Wertz-Scheraga, 1978)
Polar requirement (Woese, 1973)
Hydration potential (Wolfenden et al., 1981)
Principal property value z1 (Wold et al., 1987)
Principal property value z2 (Wold et al., 1987)
Principal property value z3 (Wold et al., 1987)
Unfolding Gibbs energy in water, pH7.0 (Yutani et al., 1987)
Unfolding Gibbs energy in water, pH9.0 (Yutani et al., 1987)
Activation Gibbs energy of unfolding, pH7.0 (Yutani et al., 1987)
Activation Gibbs energy of unfolding, pH9.0 (Yutani et al., 1987)
Dependence of partition coefficient on ionic strength (Zaslavsky et al., 1982)
Hydrophobicity (Zimmerman et al., 1968)
Bulkiness (Zimmerman et al., 1968)
Polarity (Zimmerman et al., 1968)
Isoelectric point (Zimmerman et al., 1968)
RF rank (Zimmerman et al., 1968)
Normalized positional residue frequency at helix termini N4' (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini N"' (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini N" (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini N' (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini Nc (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini N1 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini N2 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini N3 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini N4 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini N5 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C5 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C4 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C3 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C2 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C1 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini Cc (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C' (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C" (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C"' (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C4' (Aurora-Rose, 1998)
Delta G values for the peptides extrapolated to 0 M urea (O'Neil-DeGrado, 1990)
Helix formation parameters (delta delta G) (O'Neil-DeGrado, 1990)
Normalized flexibility parameters (B-values), average (Vihinen et al., 1994)
Normalized flexibility parameters (B-values) for each residue surrounded by none rigid neighbours (Vihinen et al., 1994)
Normalized flexibility parameters (B-values) for each residue surrounded by one rigid neighbours (Vihinen et al., 1994)
Normalized flexibility parameters (B-values) for each residue surrounded by two rigid neighbours (Vihinen et al., 1994)
Free energy in alpha-helical conformation (Munoz-Serrano, 1994)
Free energy in alpha-helical region (Munoz-Serrano, 1994)
Free energy in beta-strand conformation (Munoz-Serrano, 1994)
Free energy in beta-strand region (Munoz-Serrano, 1994)
Free energy in beta-strand region (Munoz-Serrano, 1994)
Free energies of transfer of AcWl-X-LL peptides from bilayer interface to water (Wimley-White, 1996)
Thermodynamic beta sheet propensity (Kim-Berg, 1993)
Turn propensity scale for transmembrane helices (Monne et al., 1999)
Alpha helix propensity of position 44 in T4 lysozyme (Blaber et al., 1993)
p-Values of mesophilic proteins based on the distributions of B values (Parthasarathy-Murthy, 2000)
p-Values of thermophilic proteins based on the distributions of B values (Parthasarathy-Murthy, 2000)
Distribution of amino acid residues in the 18 non-redundant families of thermophilic proteins (Kumar et al., 2000)
Distribution of amino acid residues in the 18 non-redundant families of mesophilic proteins (Kumar et al., 2000)
Distribution of amino acid residues in the alpha-helices in thermophilic proteins (Kumar et al., 2000)
Distribution of amino acid residues in the alpha-helices in mesophilic proteins (Kumar et al., 2000)
Side-chain contribution to protein stability (kJ/mol) (Takano-Yutani, 2001)
Propensity of amino acids within pi-helices (Fodje-Al-Karadaghi, 2002)
Hydropathy scale based on self-information values in the two-state model (5% accessibility) (Naderi-Manesh et al., 2001)
Hydropathy scale based on self-information values in the two-state model (9% accessibility) (Naderi-Manesh et al., 2001)
Hydropathy scale based on self-information values in the two-state model (16% accessibility) (Naderi-Manesh et al., 2001)
Hydropathy scale based on self-information values in the two-state model (20% accessibility) (Naderi-Manesh et al., 2001)
Hydropathy scale based on self-information values in the two-state model (25% accessibility) (Naderi-Manesh et al., 2001)
Hydropathy scale based on self-information values in the two-state model (36% accessibility) (Naderi-Manesh et al., 2001)
Hydropathy scale based on self-information values in the two-state model (50% accessibility) (Naderi-Manesh et al., 2001)
Averaged turn propensities in a transmembrane helix (Monne et al., 1999)
Alpha-helix propensity derived from designed sequences (Koehl-Levitt, 1999)
Beta-sheet propensity derived from designed sequences (Koehl-Levitt, 1999)
Composition of amino acids in extracellular proteins (percent) (Cedano et al., 1997)
Composition of amino acids in anchored proteins (percent) (Cedano et al., 1997)
Composition of amino acids in membrane proteins (percent) (Cedano et al., 1997)
Composition of amino acids in intracellular proteins (percent) (Cedano et al., 1997)
Composition of amino acids in nuclear proteins (percent) (Cedano et al., 1997)
Surface composition of amino acids in intracellular proteins of thermophiles (percent) (Fukuchi-Nishikawa, 2001)
Surface composition of amino acids in intracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
Surface composition of amino acids in extracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
Surface composition of amino acids in nuclear proteins (percent) (Fukuchi-Nishikawa, 2001)
Interior composition of amino acids in intracellular proteins of thermophiles (percent) (Fukuchi-Nishikawa, 2001)
Interior composition of amino acids in intracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
Interior composition of amino acids in extracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
Interior composition of amino acids in nuclear proteins (percent) (Fukuchi-Nishikawa, 2001)
Entire chain composition of amino acids in intracellular proteins of thermophiles (percent) (Fukuchi-Nishikawa, 2001)
Entire chain composition of amino acids in intracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
Entire chain composition of amino acids in extracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
Entire chain composition of amino acids in nuclear proteins (percent) (Fukuchi-Nishikawa, 2001)
Screening coefficients gamma, local (Avbelj, 2000)
Screening coefficients gamma, non-local (Avbelj, 2000)
Slopes tripeptide, FDPB VFF neutral (Avbelj, 2000)
Slopes tripeptides, LD VFF neutral (Avbelj, 2000)
Slopes tripeptide, FDPB VFF noside (Avbelj, 2000)
Slopes tripeptide FDPB VFF all (Avbelj, 2000)
Slopes tripeptide FDPB PARSE neutral (Avbelj, 2000)
Slopes dekapeptide, FDPB VFF neutral (Avbelj, 2000)
Slopes proteins, FDPB VFF neutral (Avbelj, 2000)
Side-chain conformation by gaussian evolutionary method (Yang et al., 2002)
Amphiphilicity index (Mitaku et al., 2002)
Volumes including the crystallographic waters using the ProtOr (Tsai et al., 1999)
Volumes not including the crystallographic waters using the ProtOr (Tsai et al., 1999)
Electron-ion interaction potential values (Cosic, 1994)
Hydrophobicity scales (Ponnuswamy, 1993)
Hydrophobicity coefficient in RP-HPLC, C18 with 0.1%TFA/MeCN/H2O (Wilce et al., 1995)
Hydrophobicity coefficient in RP-HPLC, C8 with 0.1%TFA/MeCN/H2O (Wilce et al., 1995)
Hydrophobicity coefficient in RP-HPLC, C4 with 0.1%TFA/MeCN/H2O (Wilce et al., 1995)
Hydrophobicity coefficient in RP-HPLC, C18 with 0.1%TFA/2-PrOH/MeCN/H2O (Wilce et al., 1995)
Hydrophilicity scale (Kuhn et al., 1995)
Retention coefficient at pH 2 (Guo et al., 1986)
Modified Kyte-Doolittle hydrophobicity scale (Juretic et al., 1998)
Interactivity scale obtained from the contact matrix (Bastolla et al., 2005)
Interactivity scale obtained by maximizing the mean of correlation coefficient over single-domain globular proteins (Bastolla et al., 2005)
Interactivity scale obtained by maximizing the mean of correlation coefficient over pairs of sequences sharing the TIM barrel fold (Bastolla et al., 2005)
Linker propensity index (Suyama-Ohara, 2003)
Knowledge-based membrane-propensity scale from 1D_Helix in MPtopo databases (Punta-Maritan, 2003)
Knowledge-based membrane-propensity scale from 3D_Helix in MPtopo databases (Punta-Maritan, 2003)
Linker propensity from all dataset (George-Heringa, 2003)
Linker propensity from 1-linker dataset (George-Heringa, 2003)
Linker propensity from 2-linker dataset (George-Heringa, 2003)
Linker propensity from 3-linker dataset (George-Heringa, 2003)
Linker propensity from small dataset (linker length is less than six residues) (George-Heringa, 2003)
Linker propensity from medium dataset (linker length is between six and 14 residues) (George-Heringa, 2003)
Linker propensity from long dataset (linker length is greater than 14 residues) (George-Heringa, 2003)
Linker propensity from helical (annotated by DSSP) dataset (George-Heringa, 2003)
Linker propensity from non-helical (annotated by DSSP) dataset (George-Heringa, 2003)
The stability scale from the knowledge-based atom-atom potential (Zhou-Zhou, 2004)
The relative stability scale extracted from mutation experiments (Zhou-Zhou, 2004)
Buriability (Zhou-Zhou, 2004)
Linker index (Bae et al., 2005)
Mean volumes of residues buried in protein interiors (Harpaz et al., 1994)
Average volumes of residues (Pontius et al., 1996)
Hydrostatic pressure asymmetry index, PAI (Di Giulio, 2005)
Hydrophobicity index (Wolfenden et al., 1979)
Average internal preferences (Olsen, 1980)
Hydrophobicity-related index (Kidera et al., 1985)
Apparent partition energies calculated from Wertz-Scheraga index (Guy, 1985)
Apparent partition energies calculated from Robson-Osguthorpe index (Guy, 1985)
Apparent partition energies calculated from Janin index (Guy, 1985)
Apparent partition energies calculated from Chothia index (Guy, 1985)
Hydropathies of amino acid side chains, neutral form (Roseman, 1988)
Hydropathies of amino acid side chains, pi-values in pH 7.0 (Roseman, 1988)
Weights from the IFH scale (Jacobs-White, 1989)
Hydrophobicity index, 3.0 pH (Cowan-Whittaker, 1990)
Scaled side chain hydrophobicity values (Black-Mould, 1991)
Hydrophobicity scale from native protein structures (Casari-Sippl, 1992)
NNEIG index (Cornette et al., 1987)
SWEIG index (Cornette et al., 1987)
PRIFT index (Cornette et al., 1987)
PRILS index (Cornette et al., 1987)
ALTFT index (Cornette et al., 1987)
ALTLS index (Cornette et al., 1987)
TOTFT index (Cornette et al., 1987)
TOTLS index (Cornette et al., 1987)
Relative partition energies derived by the Bethe approximation (Miyazawa-Jernigan, 1999)
Optimized relative partition energies - method A (Miyazawa-Jernigan, 1999)
Optimized relative partition energies - method B (Miyazawa-Jernigan, 1999)
Optimized relative partition energies - method C (Miyazawa-Jernigan, 1999)
Optimized relative partition energies - method D (Miyazawa-Jernigan, 1999)
Hydrophobicity index (Engelman et al., 1986)
Hydrophobicity index (Fasman, 1989)
Values of Wc in proteins from class Beta, cutoff 6 A, separation 5 (Wozniak, 2014)
Values of Wc in proteins from class Beta, cutoff 8 A, separation 5 (Wozniak, 2014)
Values of Wc in proteins from class Beta, cutoff 12 A, separation 5 (Wozniak, 2014)
Values of Wc in proteins from class Beta, cutoff 6 A, separation 15 (Wozniak, 2014)
Values of Wc in proteins from class Beta, cutoff 8 A, separation 15 (Wozniak, 2014)
Values of Wc in proteins from class Beta, cutoff 12 A, separation 15 (Wozniak, 2014)
AAIndex database.
Kawashima, S. and Kanehisa, M. (2000) AAindex: amino acid index database. Nucleic Acids Res., 28:374.
Wozniak, P. and Kotulska M. (2014) Characteristics of protein residue-residue contacts and their application in contact prediction. 20(11):2497
data(aaprop)
Builds (n+1)-grams from n-grams.
add_1grams(ngram, u, seq_length)
ngram | a single n-gram. |
u | |
seq_length | length of an origin sequence. |
(n+1)-grams are built by pasting every possible unigram into every possible free position of the input n-gram. The total length of an n-gram (n plus the total distance between its elements) cannot exceed the length of the origin sequence.
vector of n-grams (where n is equal to the n of the input plus one).
Reverse function: gap_ngrams.
add_1grams("1_2.3.4_3.0", 1L:4, 8)
add_1grams("a.a_1", c("a", "b", "c"), 4)
Coerce results of test_features function to a data.frame.
## S3 method for class 'feature_test'
as.data.frame(x, row.names = NULL, optional = FALSE, stringsAsFactors = FALSE, ...)
x | object of class |
row.names | ignored. |
optional | ignored. |
stringsAsFactors | logical: should the character vector be converted to a factor? |
... | additional arguments to be passed to or from methods. |
a data frame with four columns: names of n-gram, p-values, occurrences in positive and negative sequences.
Binarizes a matrix.
binarize(x)
x | |
a matrix or simple_triplet_matrix (depending on the input).
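As a rough illustration of what binarization means for n-gram counts, the base R sketch below maps every positive count to 1 and leaves zeros untouched. The helper name `binarize_sketch` and the "greater than zero" rule are assumptions made for this example; the package function may behave differently, in particular for simple_triplet_matrix input.

```r
# Hypothetical helper, not part of biogram: counts -> presence/absence.
binarize_sketch <- function(x) {
  stopifnot(is.matrix(x))
  # logical * integer yields an integer matrix with the same dimensions
  (x > 0) * 1L
}

counts <- matrix(c(0, 2, 1, 0, 5, 0), nrow = 2)
binarize_sketch(counts)
```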
Computes a chosen statistical criterion for each feature versus target vector.
calc_criterion(target, features, criterion_function)
target | |
features | |
criterion_function | a function calculating criterion. For a full list, see |
The permutation test implemented in biogram uses several criteria to filter important features. Each can be used by test_features by specifying the criterion parameter.
an integer vector of length equal to the number of features containing the computed criterion values.
Both target and features must be binary, i.e. contain only 0 and 1 values.
tar <- sample(0L:1, 100, replace = TRUE)
feats <- matrix(sample(0L:1, 400, replace = TRUE), ncol = 4)
# Information Gain
calc_criterion(tar, feats, calc_ig)
# Chi-squared-based measure
calc_criterion(tar, feats, calc_cs)
# Kullback-Leibler divergence
calc_criterion(tar, feats, calc_kl)
Computes Chi-squared-based measure between features and target vector.
calc_cs(feature, target, len_target, pos_target)
feature | feature vector. |
target | target. |
len_target | length of the target vector. |
pos_target | number of positive cases in the target vector. |
A numeric vector of length 1 representing the computed Chi-square value.
Both target and features must be binary, i.e. contain only 0 and 1 values.
The function was designed to be a fast subroutine of calc_criterion and might be cumbersome if called directly by a user.
chisq.test - Pearson's chi-squared test for count data.
tar <- sample(0L:1, 100, replace = TRUE)
feat <- sample(0L:1, 100, replace = TRUE)
calc_cs(feat, tar, 100, sum(tar))
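For orientation, the plain Pearson chi-squared statistic for a binary feature and a binary target can be written in a few lines of base R. This is only a sketch: `pearson_cs_sketch` is a hypothetical name, and because the documentation describes calc_cs as a "Chi-squared-based measure", the value it returns is not guaranteed to match this textbook statistic.

```r
# Hypothetical helper, not part of biogram: Pearson statistic of a 2x2 table.
pearson_cs_sketch <- function(feature, target) {
  # fix the levels so that the table is always 2x2, even if a value is absent
  obs <- table(factor(feature, levels = 0:1), factor(target, levels = 0:1))
  # expected counts under independence of feature and target
  expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  sum((obs - expected)^2 / expected)
}

feat <- c(1, 1, 1, 0, 0, 0, 1, 0)
tar  <- c(1, 1, 0, 0, 0, 1, 1, 0)
pearson_cs_sketch(feat, tar)
```

The sketch returns NaN when a whole row or column of the table is empty, so a constant feature or target needs separate handling.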
Computes the encoding distance between two encodings.
calc_ed(a, b, prop = NULL, measure)
a | encoding (see |
b | encoding to which |
prop | |
measure | See the package vignette for more details. |
an encoding distance.
calc_si: compute the similarity index of two encodings.
encoding2df: converts an encoding to a data frame.
validate_encoding: validate a structure of an encoding.
# calculate encoding distance between two encodings of amino acids
aa1 <- list(`1` = c("g", "a", "p", "v", "m", "l", "i"),
            `2` = c("k", "h"),
            `3` = c("d", "e"),
            `4` = c("f", "r", "w", "y", "s", "t", "c", "n", "q"))
aa2 <- list(`1` = c("g", "a", "p", "v", "m", "l", "q"),
            `2` = c("k", "h", "d", "e", "i"),
            `3` = c("f", "r", "w", "y", "s", "t", "c", "n"))
calc_ed(aa1, aa2, measure = "pi")
# the encoding distance between two identical encodings is 0
calc_ed(aa1, aa1, measure = "pi")
Computes information gain of single feature and target vector.
calc_ig(feature, target, len_target, pos_target)
feature | feature vector. |
target | target. |
len_target | length of the target vector. |
pos_target | number of positive cases in the target vector. |
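The information gain term is used here (improperly) as a synonym of mutual information. It is defined as:
$$IG(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \left( \frac{p(x, y)}{p(x)\, p(y)} \right)$$
In the biogram package, information gain is computed using the following relationship:
$$IG = E(S) - E(S \mid F)$$
where $E(S)$ is the entropy of the target and $E(S \mid F)$ is the entropy of the target conditional on the feature.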
A numeric vector of length 1 representing information gain in nats.
During calculations, $0 \log 0 = 0$. For a justification see References.
The function was designed to be a fast subroutine of calc_criterion and might be cumbersome if called directly by a user.
Cover, T.M. and Thomas, J.A. Elements of Information Theory, 2nd Edition. Wiley, 2006.
tar <- sample(0L:1, 100, replace = TRUE)
feat <- sample(0L:1, 100, replace = TRUE)
calc_ig(feat, tar, 100, sum(tar))
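To make the quantity concrete, the following base R sketch computes the mutual information (in nats) between a binary feature and a binary target as the entropy of the target minus its entropy conditional on the feature, treating 0 log 0 as 0. The names `entropy_nats` and `info_gain_sketch` are hypothetical, and the code is written for readability rather than speed, so it is not a stand-in for the optimized calc_ig.

```r
# Hypothetical helpers, not part of biogram.
entropy_nats <- function(p) {
  p <- p[p > 0]                 # convention: 0 * log(0) = 0
  -sum(p * log(p))
}

info_gain_sketch <- function(feature, target) {
  n <- length(target)
  # entropy of the target alone
  h_target <- entropy_nats(c(mean(target), 1 - mean(target)))
  # entropy of the target within each feature level, weighted by level size
  h_cond <- 0
  for (v in 0:1) {
    idx <- feature == v
    if (any(idx)) {
      p1 <- mean(target[idx])
      h_cond <- h_cond + sum(idx) / n * entropy_nats(c(p1, 1 - p1))
    }
  }
  h_target - h_cond
}

info_gain_sketch(c(1, 1, 0, 0), c(1, 1, 0, 0))  # feature determines the target
info_gain_sketch(c(1, 0, 1, 0), c(1, 1, 0, 0))  # feature independent of target
```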
Computes Kullback-Leibler divergence between features and target vector.
calc_kl(feature, target, len_target, pos_target)
feature | feature vector. |
target | target. |
len_target | length of the target vector. |
pos_target | number of positive cases in the target vector. |
A numeric vector of length 1 representing the Kullback-Leibler divergence value.
Both target and features must be binary, i.e. contain only 0 and 1 values.
The function was designed to be a fast subroutine of calc_criterion and might be cumbersome if called directly by a user.
Kullback, S. and Leibler, R.A. On information and sufficiency. Annals of Mathematical Statistics 22(1):79-86, 1951.
test_features.
Kullback-Leibler divergence is calculated using KL.plugin.
tar <- sample(0L:1, 100, replace = TRUE)
feat <- sample(0L:1, 100, replace = TRUE)
calc_kl(feat, tar, 100, sum(tar))
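A plug-in estimate of the Kullback-Leibler divergence simply inserts observed frequencies into the definition. The base R sketch below shows that generic calculation for two count vectors. The name `kl_plugin_sketch` is hypothetical, and the documentation does not state which two distributions calc_kl compares, so the inputs here are purely illustrative.

```r
# Hypothetical helper, not part of biogram: plug-in KL divergence in nats.
kl_plugin_sketch <- function(p, q) {
  p <- p / sum(p)               # normalize counts to frequencies
  q <- q / sum(q)
  keep <- p > 0                 # terms with p = 0 contribute nothing
  sum(p[keep] * log(p[keep] / q[keep]))
}

kl_plugin_sketch(c(30, 70), c(50, 50))
kl_plugin_sketch(c(50, 50), c(50, 50))  # identical distributions
```

The result is infinite whenever the second vector has a zero where the first does not.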
Computes the encoding distance between two encodings.
calc_pi(a, b)
a | encoding (see |
b | encoding to which |
The encoding distance between a and b is defined as the minimum number of amino acids that have to be moved between subgroups of the encoding to make a identical to b (the order of subgroups in the encoding and of amino acids in a group is unimportant).
If the parameter prop is supplied, the encoding distance is normalized by the factor equal to the sum of distances for each group in a and the closest group in b. The position of a group is defined as the mean value of properties of amino acids or nucleotides belonging to the group.
See the package vignette for more details.
an encoding distance.
calc_si: compute the similarity index of two encodings.
encoding2df: converts an encoding to a data frame.
validate_encoding: validate a structure of an encoding.
# calculate encoding distance between two encodings of amino acids
aa1 <- list(`1` = c("g", "a", "p", "v", "m", "l", "i"),
            `2` = c("k", "h"),
            `3` = c("d", "e"),
            `4` = c("f", "r", "w", "y", "s", "t", "c", "n", "q"))
aa2 <- list(`1` = c("g", "a", "p", "v", "m", "l", "q"),
            `2` = c("k", "h", "d", "e", "i"),
            `3` = c("f", "r", "w", "y", "s", "t", "c", "n"))
calc_pi(aa1, aa2)
# the encoding distance between two identical encodings is 0
calc_pi(aa1, aa1)
Computes similarity index between two encodings.
calc_si(a, b)
a | encoding (see |
b | encoding to which |
Briefly, the similarity index is the fraction of element pairs that have the same pairing in both encodings. Pairing is a binary variable that equals 1 if two elements are in the same group and 0 otherwise. For more details, see references.
the value of similarity index.
Stephenson, J.D., and Freeland, S.J. (2013). Unearthing the Root of Amino Acid Similarity. J Mol Evol 77, 159-169.
calc_ed: calculate the encoding distance between two encodings.
# example from Stephenson & Freeland, 2013 (Fig. 6)
enc1 <- list(`1` = "A", `2` = c("F", "E"), `3` = c("C", "D", "G"))
enc2 <- list(`1` = c("A", "G"), `2` = c("C", "D", "E", "F"))
enc3 <- list(`1` = c("D", "G"), `2` = c("E", "F"), `3` = c("A", "C"))
calc_si(enc1, enc2)
calc_si(enc2, enc3)
calc_si(enc1, enc3)
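The pairing idea can be spelled out in base R: for every unordered pair of elements, record whether the two elements share a group, then compare these records between the two encodings. The sketch below is one reading of that description and assumes that the denominator is the number of all element pairs. `pairing_sketch` and `similarity_index_sketch` are hypothetical names, and calc_si may normalize differently.

```r
# Hypothetical helpers, not part of biogram.
pairing_sketch <- function(enc) {
  # map every element to the index of its group
  groups <- rep(seq_along(enc), lengths(enc))
  names(groups) <- unlist(enc, use.names = FALSE)
  elems <- sort(names(groups))          # common element order for comparison
  same <- outer(groups[elems], groups[elems], "==")
  same[upper.tri(same)]                 # one entry per unordered pair
}

similarity_index_sketch <- function(a, b) {
  mean(pairing_sketch(a) == pairing_sketch(b))
}

enc1 <- list(`1` = "A", `2` = c("F", "E"), `3` = c("C", "D", "G"))
enc2 <- list(`1` = c("A", "G"), `2` = c("C", "D", "E", "F"))
similarity_index_sketch(enc1, enc2)
similarity_index_sketch(enc1, enc1)     # identical encodings
```

Both encodings must cover exactly the same set of elements for the comparison to be meaningful.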
Checks if the criterion is viable or matches it to the list of implemented criteria.
check_criterion(input_criterion, criterion_names = c("ig", "kl", "cs"))
input_criterion | |
criterion_names | list of implemented criteria, always in lowercase. |
a list of three:
criterion name,
its function,
nice name for outputs.
Calculate the value of criterion: calc_criterion.
Clusters sequences hierarchically with regular expressions. At each step we minimize the number of degrees of freedom for all regular expressions needed to describe the data.
cluster_reg_exp(ngrams)
ngrams | list of elements |
A regular expression is a list whose length equals the length of the input sequences. Each element of the list represents a position in the sequence and contains the amino acids that are likely to occur at this position.
List of four:
"regExps": regular expressions in the best clustering.
"seqClustering": clustering of sequences in the best clustering.
"allRegExps": all regular expressions.
"allIndices": all clusterings.
data(human_cleave)
# cluster_reg_exp is computationally expensive
results <- cluster_reg_exp(human_cleave[1L:10, 1L:4])
Code human-friendly representation of n-grams into a biogram format.
code_ngrams(decoded_ngrams)
decoded_ngrams | a |
a character vector of n-grams.
Inverse function: decode_ngrams.
code_ngrams(c("11_2", "1__12", "222"))
code_ngrams(c("aaa_b", "d__aa", "abd"))
Builds and selects important n-grams stepwise.
construct_ngrams( target, seq, u, n_max, conf_level = 0.95, gap = TRUE, use_heuristics = TRUE )
target | |
seq | a vector or matrix describing sequence(s). |
u | |
n_max | size of constructed n-grams. |
conf_level | confidence level. |
gap | |
use_heuristics | if |
construct_ngrams starts by extracting unigrams from the sequences, pasting them together in all combinations and choosing the significant features among them (with p-value below conf_level). The chosen n-grams are then extended to the size specified by n_max by pasting unigrams at both ends.
The gap parameter determines whether construct_ngrams performs the feature selection on exact n-grams (gap equal to FALSE) or on all features within Hamming distance 1 from the n-gram (gap equal to TRUE).
a vector of n-grams.
Feature filtering method: test_features.
# to make the example faster, we run construct_ngrams() on a
# subset of the data
deg_seqs <- degenerate(human_cleave[c(1L:100, 801L:900), 1L:9],
                       list(`1` = c(1, 6, 8, 10, 11, 18),
                            `2` = c(2, 13, 14, 16, 17),
                            `3` = c(5, 19, 20),
                            `4` = c(7, 9, 12, 15),
                            `5` = c(3, 4)))
bigrams <- construct_ngrams(human_cleave[c(1L:100, 801L:900), "tar"], deg_seqs, 1L:5, 2)
A convenient wrapper around count_ngrams for counting multiple values of n and d.
count_multigrams( ns, ds = rep(0, length(ns)), seq, u, pos = FALSE, scale = FALSE, threshold = 0 )
ns | |
ds | |
seq | a vector or matrix describing sequence(s). |
u | |
pos | |
scale | |
threshold | |
The ns and ds vectors must have equal length. Elements of ds are used as the d parameter for the respective values of ns. For example, if ns is c(4, 4, 4), ds must be a list of length 3. Each element of the ds list must have length 3 or 1, as appropriate for the d parameter of the count_ngrams function.
An integer matrix with named columns. The naming conventions are the same as in count_ngrams.
seqs <- matrix(sample(1L:4, 600, replace = TRUE), ncol = 50)
count_multigrams(c(3, 1), list(c(1, 0), 0), seqs, 1L:4, pos = TRUE)
# if ds parameter is not present, n-grams are calculated for distance 0
count_multigrams(c(3, 1), seq = seqs, u = 1L:4)
# calculate three times n-gram with the same length, but different distances between
# elements
count_multigrams(c(4, 4, 4), list(c(2, 0, 1), c(2, 1, 0), c(0, 1, 2)), seqs, 1L:4, pos = TRUE)
Counts all n-grams or position-specific n-grams present in the input sequence(s).
count_ngrams(seq, n, u, d = 0, pos = FALSE, scale = FALSE, threshold = 0)
seq | a vector or matrix describing sequence(s). |
n | |
u | |
d | |
pos | |
scale | |
threshold | |
A distance vector should always be n - 1 in length. For example, when n = 3, d = c(1,2) means A_A__A. For n = 4, d = c(2,0,1) means A__AA_A. If vector d has length 1, it is recycled to length n - 1.
n-gram names follow a specific convention and have three parts for position-specific n-grams and two parts otherwise. The parts are separated by _. The . symbol is used to separate elements within a part. The general naming scheme is POSITION_NGRAM_DISTANCE. The optional POSITION part of the name indicates the actual position of the n-gram in the sequence(s) and will be present only if pos = TRUE. This part is always a single integer. The NGRAM part of the name is a sequence of elements in the n-gram. For example, 4.2.2 indicates the n-gram 422 (e.g. TCC). The DISTANCE part of the name is a vector of distance(s). For example, 0.0 indicates zero distances (continuous n-grams), while 1.2 represents distances for the n-gram A_A__A.
Examples of n-gram names:
46_4.4.4_0.1 : trigram 44_4 on position 46
12_2.1_2 : bigram 2__1 on position 12
8_1.1.1_0.0 : continuous trigram 111 on position 8
1.1.1_0.0 : continuous trigram 111 without position information
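The naming scheme can be unpacked mechanically. The base R sketch below splits a name into its parts and rebuilds the human-readable form, inserting one underscore per unit of distance. `decode_name_sketch` is a hypothetical helper written only to illustrate the convention; the package's own decode_ngrams is the function to use in practice.

```r
# Hypothetical helper, not part of biogram: parse POSITION_NGRAM_DISTANCE.
decode_name_sketch <- function(name) {
  parts <- strsplit(name, "_", fixed = TRUE)[[1]]
  # three parts: position-specific n-gram; two parts: no position
  position <- if (length(parts) == 3) as.integer(parts[1]) else NA_integer_
  elems <- strsplit(parts[length(parts) - 1], ".", fixed = TRUE)[[1]]
  dists <- as.integer(strsplit(parts[length(parts)], ".", fixed = TRUE)[[1]])
  # one distance per gap between neighbouring elements (none for unigrams)
  dists <- rep_len(dists, length(elems) - 1)
  gaps <- c(strrep("_", dists), "")
  list(position = position, ngram = paste0(elems, gaps, collapse = ""))
}

decode_name_sketch("46_4.4.4_0.1")
decode_name_sketch("12_2.1_2")
decode_name_sketch("1.1.1_0.0")
```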
a simple_triplet_matrix where columns represent n-grams and rows sequences. See Details for specifics of the naming convention.
By default, the counted n-gram data is stored in a memory-saving format. To convert an object to a 'classical' matrix use the as.matrix function. See examples for further information.
Create vector of possible n-grams: create_ngrams.
Extract n-grams from sequence(s): seq2ngrams.
Get indices of n-grams: get_ngrams_ind.
Count n-grams for multiple values of n: count_multigrams.
Count only specified n-grams: count_specified.
# count trigrams without position information for nucleotides
count_ngrams(sample(1L:4, 50, replace = TRUE), 3, 1L:4, pos = FALSE)
# count position-specific trigrams from multiple nucleotide sequences
seqs <- matrix(sample(1L:4, 600, replace = TRUE), ncol = 50)
ngrams <- count_ngrams(seqs, 3, 1L:4, pos = TRUE)
# output results of the n-gram counting to screen
as.matrix(ngrams)
Counts specified n-grams in the input sequence(s).
count_specified(seq, ngrams)
seq | vector or matrix describing sequence(s). |
ngrams | vector of n-grams. |
count_specified counts only selected n-grams declared by the user in the ngrams parameter. Declared n-grams must be written using the biogram notation.
A simple_triplet_matrix where columns represent n-grams and rows sequences.
Count all possible n-grams: count_ngrams.
seqs <- matrix(c(1, 2, 2, 1, 2, 1, 2, 1, 1, 2, 3, 4, 1, 2, 2, 4), nrow = 2)
count_specified(seqs, ngrams = c("1.1.1_0.0", "2.2.2_0.0", "1.1.2_0.0"))
seqs <- matrix(sample(1L:5, 200, replace = TRUE), nrow = 20)
count_specified(seqs, ngrams = c("2_4.2_0", "2_1.4_0", "3_1.3_0",
                                 "2_4.2_1", "2_1.4_1", "3_1.3_1",
                                 "2_4.2_2", "2_1.4_2", "3_1.3_2"))
Computes total number of n-grams that can be extracted from sequences.
count_total(seq, n, d)
seq | a vector or matrix describing sequence(s). |
n | |
d | |
The maximum number of possible n-grams is limited by their length and the distance between elements of the n-gram.
An integer representing the total number of n-grams.
The format of the d vector is discussed in Details of count_ngrams.
seqs <- matrix(sample(1L:4, 600, replace = TRUE), ncol = 50)
# make several sequences shorter by replacing them partially with NA
seqs[8L:11, 46L:50] <- NA
seqs[1L, 31L:50] <- NA
count_total(seqs, 3, c(1, 0))
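The limit on the number of n-grams follows from simple arithmetic: an n-gram with distance vector d spans n + sum(d) positions, so a sequence of length L offers L - (n + sum(d)) + 1 start positions, or none if the span exceeds L. The base R sketch below applies this to a vector of sequence lengths. `total_ngrams_sketch` is a hypothetical helper, and it assumes that NA values only pad the end of a sequence; count_total may treat missing values differently.

```r
# Hypothetical helper, not part of biogram.
total_ngrams_sketch <- function(seq_lengths, n, d = 0) {
  d <- rep_len(d, max(n - 1, 0))       # recycle d to length n - 1
  span <- n + sum(d)                   # positions covered by one n-gram
  sum(pmax(seq_lengths - span + 1, 0)) # start positions, never negative
}

# three sequences of different lengths, trigrams of the form A_AA
total_ngrams_sketch(c(50, 45, 30), n = 3, d = c(1, 0))
```

For a matrix padded with NA, the lengths could be obtained with `rowSums(!is.na(seqs))` under the same assumption.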
Reduces an alphabet using physicochemical properties.
create_encoding(prop, len)
prop | |
len | length of the resulting encoding. Must be larger than zero and smaller than number of elements in the alphabet. |
The encoding is a list of groups to which elements of an alphabet should be reduced. All elements of the alphabet (all amino acids or all nucleotides) should appear in the encoding.
An encoding.
calc_ed: calculate the encoding distance between two encodings.
encoding2df: converts an encoding to a data frame.
validate_encoding: validate a structure of an encoding.
enc1 <- list(`1` = c("a", "t"), `2` = c("g", "c"))
encoding2df(enc1)
Creates a matrix of features and target based on the values from contingency matrix.
create_feature_target(n11, n01, n10, n00)
n11 | number of elements for which both target and feature equal 1. |
n01 | number of elements for which target and feature equal 1,0 respectively. |
n10 | number of elements for which target and feature equal 0,1 respectively. |
n00 | number of elements for which both target and feature equal 0. |
a matrix of 2 columns and n11+n10+n01+n00 rows. Columns represent target and feature vectors, respectively.
# equivalent of
#            target
# feature    10  375
#            15  600
target_feature <- create_feature_target(10, 375, 15, 600)
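How four contingency counts expand into a two-column matrix can be sketched in base R. The code follows the argument descriptions above (n01: target 1 and feature 0; n10: target 0 and feature 1). `feature_target_sketch` is a hypothetical name, and the row order it produces is an assumption that may not match the package's output.

```r
# Hypothetical helper, not part of biogram.
feature_target_sketch <- function(n11, n01, n10, n00) {
  target  <- c(rep(1L, n11), rep(1L, n01), rep(0L, n10), rep(0L, n00))
  feature <- c(rep(1L, n11), rep(0L, n01), rep(1L, n10), rep(0L, n00))
  cbind(target = target, feature = feature)
}

m <- feature_target_sketch(10, 375, 15, 600)
# cross-tabulating the columns recovers the four counts
table(target = m[, "target"], feature = m[, "feature"])
```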
Creates the vector of all possible n-grams (for given n).
create_ngrams(n, u, possible_grams = NULL)
n | |
u | |
possible_grams | number of possible n-grams. If not |
See the Details section of count_ngrams for more information about the n-gram naming convention. Any information about distance must be added by hand (see examples).
a character vector. Elements of n-gram are separated by dot.
Input data must be a matrix or data frame of numeric elements.
# bigrams for standard amino acids
create_ngrams(2, 1L:20)
# bigrams for standard amino acids with positions, 10 amino acid long sequence,
# so only 9 bigrams can be located in the sequence
create_ngrams(2, 1L:20, 9)
# bigrams for DNA with positions, 10 nucleotide long sequence, distance 1,
# so only 8 bigrams in the sequence
# paste0 adds information about distance at the end of the n-gram
paste0(create_ngrams(2, 1L:4, 8), "_1")
A result of the distr_crit function.
An object of class criterion_distribution is a numeric matrix.
possible values of criterion.
probability density function.
cumulative distribution function.
A matrix with values of the criterion and their probabilities.
'Nice' name of the criterion.
Categorizes results of the test_features function into groups based on their significance.
## S3 method for class 'feature_test'
cut(x, split = "significances", breaks = c(0, 1e-04, 0.01, 0.05, 1), ...)
x |
an object of class |
split |
attribute along which output should be categorized. Possible values are
|
breaks |
a vector of significances or frequencies along which n-grams are aggregated.
See description of |
... |
further parameters accepted by the |
The value of the function depends on the split parameter. The function returns a named list of length equal to the length of significances (when split equals "significances") or frequencies (when split equals "positives" or "negatives") minus one. Each element of the list contains names of the n-grams belonging to the given significance or frequency group.
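The "length of breaks minus one" behaviour can be reproduced with base R's `cut` and `split`. The sketch below is not the package method; the p-values and n-gram names are invented for illustration, and `include.lowest = TRUE` is an assumption made so that a p-value of exactly 0 would still fall into the first group.

```r
# Invented p-values for four hypothetical n-grams (illustration only).
p_values <- c(`1_a.b_0` = 0.00002, `2_b.c_0` = 0.004,
              `3_c.c_0` = 0.03, `4_a.a_0` = 0.4)
breaks <- c(0, 1e-04, 0.01, 0.05, 1)

# Assign every p-value to one of the length(breaks) - 1 intervals.
groups <- cut(p_values, breaks = breaks, include.lowest = TRUE)

# Collect the names of n-grams falling into each interval.
by_significance <- split(names(p_values), groups)
by_significance
```

Five break points define four intervals, so the resulting list has four elements, one per significance group.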
Transforms a vector of n-grams into a human-friendly form.
decode_ngrams(ngrams)
ngrams |
a |
A character vector of length equal to the number of n-grams.
Decoded n-grams lose the position information.
Validate n-gram structure: is_ngram.
Inverse function: code_ngrams.
decode_ngrams(c("2_1.1.2_0.1", "3_1.1.2_2.0", "3_2.2.2_0.0"))
'Degenerates' an amino acid or nucleic acid sequence by aggregating elements into bigger groups.
degenerate(seq, element_groups)
seq |
|
element_groups |
encoding of elements: list of groups to which elements of sequence should be aggregated. Must have unique names. |
A character vector or matrix (if the input is a matrix) containing aggregated elements.
Characters not present in element_groups will be converted to NA with a warning.
l2n to easily convert information stored in biological sequences from letters to numbers.
degenerate_ngrams to degenerate counts of n-grams instead of sequences.
calc_ed to calculate distance between encodings.
sample_seq <- c(1, 3, 1, 3, 4, 4, 3, 1, 2)
table(sample_seq)
# aggregate the sequence into two groups: w (elements 1 and 4) and s (elements 2 and 3)
deg_seq <- degenerate(sample_seq, list(w = c(1, 4), s = c(2, 3)))
table(deg_seq)
'Degenerates' n-grams by aggregating amino acid or nucleotide elements into bigger groups.
degenerate_ngrams(x, element_groups, binarize = FALSE)
x |
a |
element_groups |
encoding of elements: list of groups to which elements of n-grams should be aggregated. Must have unique names. |
binarize |
logical indicating if n-grams should be binarized |
Depending on x, a simple_triplet_matrix or a matrix of degenerated n-grams.
Computes criterion distribution under null hypothesis for all contingency tables possible for a feature and a target.
distr_crit(target, feature, criterion = "ig", iter_limit = 200)
target |
{0,1}-valued target vector. See Details. |
feature |
{0,1}-valued feature vector. See Details. |
criterion |
criterion used for calculations of distribution.
See |
iter_limit |
limit the number of calculated contingency matrices. If
|
Both target and feature vectors may contain only 0 and 1.
An object of class criterion_distribution.
target_feature <- create_feature_target(10, 375, 15, 600)
distr_crit(target = target_feature[, 1], feature = target_feature[, 2])
Converts an encoding to a data frame.
encoding2df(x, sort = FALSE)
x |
encoding. |
sort |
if |
The encoding is a list of groups to which elements of an alphabet should be reduced. All elements of the alphabet (all amino acids or all nucleotides) should appear in the encoding.
data frame with two columns. First column represents an index of a group in the supplied encoding and the second column contains all elements of the encoding.
calc_ed: calculates the encoding distance between two encodings.
encoding2df: converts an encoding to a data frame.
validate_encoding: validates the structure of an encoding.
create_encoding(aaprop[1L:5, ], 5)
Quickly cross-tabulates two binary vectors.
fast_crosstable(target, len_target, pos_target, feature)
target |
target. |
len_target |
length of the target vector. |
pos_target |
number of positive cases in the target vector. |
feature |
feature vector. |
The input looks odd, but the function was built to be a fast subroutine of calc_ig, which works on many features but only one target.
a vector of length four:
target +, feature+
target +, feature-
target -, feature+
target -, feature-
Binary vector means a numeric vector with 0 or 1.
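One plausible reason for passing `len_target` and `pos_target` separately is that they can be computed once and reused for every feature, leaving a single pass over the data per feature. The base R sketch below illustrates that idea; `crosstab_sketch` is a hypothetical name and the code is not the package's implementation.

```r
# Hypothetical sketch, not the biogram implementation: derive all four cells
# from precomputed target totals plus two sums over the data.
crosstab_sketch <- function(target, len_target, pos_target, feature) {
  n_both <- sum(target * feature)   # target +, feature +
  n_feat <- sum(feature)            # all cases with feature +
  c(
    n_both,                                      # target +, feature +
    pos_target - n_both,                         # target +, feature -
    n_feat - n_both,                             # target -, feature +
    len_target - pos_target - n_feat + n_both    # target -, feature -
  )
}

tar <- c(1, 1, 0, 0, 1, 0)
feat <- c(1, 0, 1, 0, 1, 0)
res <- crosstab_sketch(tar, length(tar), sum(tar), feat)
res
```

The four values always sum to the length of the target, which makes a convenient sanity check.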
tar <- sample(0L:1, 100, replace = TRUE)
feat <- sample(0L:1, 100, replace = TRUE)
fast_crosstable(tar, length(tar), sum(tar), feat)
A result of the test_features function.
An object of the feature_test class is a numeric vector of p-values. Additional attributes further characterize the details of the test that returned these p-values.
the criterion used in the permutation test.
the name of the p-value adjustment method.
the number of permutations. NA if QuiPT was chosen.
frequency of features, split into subsets based on the value of the target.
Methods:
Converts an encoding from the full format to the simple format.
full2simple(x)
x |
encoding. |
aa1 = list(`1` = c("g", "a", "p", "v", "m", "l", "i"),
           `2` = c("k", "h"),
           `3` = c("d", "e"),
           `4` = c("f", "r", "w", "y", "s", "t", "c", "n", "q"))
full2simple(aa1)
Introduces gaps in the n-grams.
gap_ngrams(ngrams)
ngrams |
a vector of positioned n-grams (as created by |
A single element of the input n-gram at a time will be replaced by a gap. For example, introducing gaps in the n-gram 2_1.1.2_0.1 results in three n-grams: 3_1.2_1 (where the 2_1_0 unigram was replaced by a gap), 2_1.2_2 and 2_1.1_0.
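The worked example can be traced with a small base R sketch that parses the notation, computes the absolute position of every element, and drops one element at a time. Both helper names are hypothetical, and the sketch only covers positioned n-grams with at least three elements; it is not the package's implementation.

```r
# Hypothetical helper: split a positioned n-gram such as "2_1.1.2_0.1"
# into start position, elements and distances.
parse_ngram <- function(ngram) {
  parts <- strsplit(ngram, "_", fixed = TRUE)[[1]]
  list(
    pos      = as.integer(parts[1]),
    elements = strsplit(parts[2], ".", fixed = TRUE)[[1]],
    dists    = as.integer(strsplit(parts[3], ".", fixed = TRUE)[[1]])
  )
}

# Hypothetical helper: remove each element in turn and rebuild the name.
gap_one_ngram <- function(ngram) {
  p <- parse_ngram(ngram)
  # absolute position of every element in the sequence
  abs_pos <- p$pos + cumsum(c(0L, p$dists + 1L))
  vapply(seq_along(p$elements), function(i) {
    kept_pos <- abs_pos[-i]
    new_dists <- diff(kept_pos) - 1L
    paste(kept_pos[1],
          paste(p$elements[-i], collapse = "."),
          paste(new_dists, collapse = "."),
          sep = "_")
  }, character(1))
}

gapped <- gap_one_ngram("2_1.1.2_0.1")
gapped
```

Removing the first element shifts the start position, while removing a middle element merges two distances into one larger gap.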
A character vector of (n-1)-grams with introduced gaps.
Reverse function: add_1grams.
gap_ngrams(c("2_1.1.2_0.1", "3_1.1.2_0.0", "3_2.2.2_0.0"))
gap_ngrams(c("1.1.2_0.1", "1.1.2_0.0", "2.2.2_0.0"))
Generates a sequence using an alphabet of unigrams and a set of rules.
generate_sequence(alphabet, regions)
alphabet |
the unigram alphabet. Columns are equivalent to unigrams and rows to particular properties. |
regions |
a list of rules describing regions. |
Generates a region using an alphabet of unigrams and the provided set of rules.
generate_single_region(alphabet, reg_len, prop_ranges, exactness)
alphabet |
the unigram alphabet. Columns are equivalent to unigrams and rows to particular properties. |
reg_len |
the number of unigrams inside the region. |
prop_ranges |
required intervals of properties of unigrams in the region. See Details. |
exactness |
a |
props1 <- list(P1 = c(0, 0.5), P2 = c(0.2, 0.4), P3 = c(0.5, 1), P4 = c(0, 0))
props2 <- list(P1 = c(0.5, 1), P2 = c(0.4, 1), P3 = c(0, 0.5), P4 = c(1, 1))
alph <- generate_unigrams(c(replicate(8, props1, simplify = FALSE),
                            replicate(12, props2, simplify = FALSE)),
                          unigram_names = letters[1L:20])
rules1 <- list(P1 = c(0.5, 1), P2 = c(0.4, 1), P3 = c(0, 0.5), P4 = c(1, 1))
generate_single_region(alph, 10, rules1, 0.9)
Assigns randomly generated properties to a single unigram.
generate_single_unigram(unigram_ranges)
unigram_ranges |
list of ranges containing respective properties. If named, names are preserved. |
generate_single_unigram is a helper function for generate_unigrams.
generate_single_unigram(list(P1 = c(0, 0.5), P2 = c(0.2, 0.4), P3 = c(0.5, 1), P4 = c(0, 0)))
Generates an alphabet of unigrams based on given list of properties.
generate_unigrams(unigram_list, unigram_names = NULL, prop_names = NULL)
unigram_list |
a list of unigrams' parameters. See Details. |
unigram_names |
names of unigrams. If not |
prop_names |
names of properties. If not |
Unigram parameters are represented as a list of intervals, where each interval corresponds to a different property. The function generates unigrams by randomly choosing values of properties from the given intervals using a uniform distribution. All lists of ranges should have the same length, which is equivalent to describing each unigram using the same properties.
props1 <- list(P1 = c(0, 0.5), P2 = c(0.2, 0.4), P3 = c(0.5, 1), P4 = c(0, 0))
props2 <- list(P1 = c(0.5, 1), P2 = c(0.4, 1), P3 = c(0, 0.5), P4 = c(1, 1))
alph <- generate_unigrams(c(replicate(8, props1, simplify = FALSE),
                            replicate(12, props2, simplify = FALSE)),
                          unigram_names = letters[1L:20])
Computes list of n-gram elements positions in sequence.
get_ngrams_ind(len_seq, n, d)
len_seq |
|
n |
|
d |
|
The format of the d vector is discussed in the Details of count_ngrams.
A list with the number of elements equal to n. Every element is a vector containing the locations of a given n-gram letter. For example, the first element of the list contains the indices of the first letter of all n-grams. The attribute d of the output contains the distances between letters used to compute the locations (see Details).
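The structure of such an index list can be sketched in base R. The function below is a hypothetical illustration, not the package code; it assumes that `d` is either a single distance or a vector of n - 1 distances, and it does not attach the `d` attribute.

```r
# Hypothetical sketch: positions of every n-gram element in a sequence of
# length len_seq, for n-grams of size n with distances d between elements.
ngram_indices <- function(len_seq, n, d = 0) {
  d <- rep_len(d, n - 1)
  # offset of each element relative to the first element of the n-gram
  offsets <- cumsum(c(0, d + 1))
  # number of start positions that leave room for the whole n-gram
  n_starts <- max(0, len_seq - offsets[n])
  starts <- seq_len(n_starts)
  lapply(offsets, function(off) starts + off)
}

# trigrams in a sequence of length 10, no gaps: 8 possible start positions
ind <- ngram_indices(10, 3, 0)
ind
```

Reading the list element-wise, the i-th entries of all vectors together give the positions occupied by the i-th n-gram.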
# positions of 9-grams in a sequence of length 10
get_ngrams_ind(10, 9, 0)
A set of 648 cleavage sites and 648 parts of mature proteins shortly after cleavage sites derived from human proteome.
A data frame with 1296 observations on the following 10 variables. Columns from P1 to P9 describe positions in an extracted peptide. tar is a target vector. It has value 1 if a peptide is a cleavage site and 0 if not.
Each peptide in the data set is nine amino acid residues long. In the case of cleavage sites, the cleavage is located between the fifth and sixth residues. The non-cleavage sites are parts of mature proteins starting five positions after the cleavage site.
Amino acid residues were recoded as integers.
data(human_cleave)
table(human_cleave[, 1])
Checks if the character string may be used as an n-gram and its notation follows the specific convention of the biogram package.
is_ngram(x)
x |
|
TRUE if the n-gram's notation is correct, FALSE if not.
print(is_ngram("1_1.1.1_0.0"))
print(is_ngram("not_ngram"))
Converts biological sequence from letter to number notation.
l2n(seq, seq_type)
seq |
|
seq_type |
the type of sequence. Can be |
A numeric vector or matrix containing converted elements.
l2n is a wrapper around degenerate.
Inverse function: n2l.
sample_seq <- c("a", "d", "d", "g", "a", "g", "n", "a", "l")
l2n(sample_seq, "prot")
Computes the length of n-grams.
lengths_ngrams(ngrams)
ngrams |
a |
A numeric vector of n-gram lengths.
lengths_ngrams(c("2_1.1.2_0.1", "3_1.1.2_2.0", "3_2.2.2_0.0"))
Converts list of sequences to matrix.
list2matrix(seq_list)
seq_list |
list of sequences (e.g. as returned by
the |
A matrix with the number of rows equal to the number of sequences and the number of columns equal to the length of the longest sequence.
Since a matrix must have a fixed number of columns, the ends of shorter sequences are padded with NAs.
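The padding step can be illustrated in base R. `pad_to_matrix` is a hypothetical name used only for this sketch, which assumes character sequences; it is not the package's implementation.

```r
# Hypothetical sketch: pad every sequence with NA up to the longest length,
# then stack the padded sequences as rows of a character matrix.
pad_to_matrix <- function(seq_list) {
  max_len <- max(lengths(seq_list))
  padded <- vapply(seq_list, function(s) {
    c(s, rep(NA_character_, max_len - length(s)))
  }, character(max_len))
  # vapply returns one column per sequence, so transpose to one row each
  t(padded)
}

m <- pad_to_matrix(list(s1 = c("c", "g", "g", "t"),
                        s2 = c("g", "t", "c", "t", "t", "g"),
                        s3 = c("a", "a", "t")))
m
```

Here the longest sequence has six elements, so the result has three rows and six columns, with NA filling the tails of `s1` and `s3`.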
list2matrix(list(s1 = c("c", "g", "g", "t"), s2 = c("g", "t", "c", "t", "t", "g"), s3 = c("a", "a", "t")))
Converts biological sequence from number to letter notation.
n2l(seq, seq_type)
seq |
|
seq_type |
the type of sequence. Can be |
A character vector or matrix containing converted elements.
n2l is a wrapper around degenerate.
Inverse function: l2n.
sample_seq <- c(1, 3, 3, 6, 1, 6, 12, 1, 10)
n2l(sample_seq, "prot")
Transforms a vector of n-grams into a data frame.
ngrams2df(ngrams)
ngrams |
a |
A data.frame with two columns (n-grams without known position) or three columns (n-grams with position information).
Decode n-grams: decode_ngrams.
ngrams2df(c("2_1.1.2_0.0", "3_1.1.2_0.0", "3_2.2.2_0.0", "2_1.1_0"))
Plots results of the distr_crit function.
## S3 method for class 'criterion_distribution'
plot(x, ...)
x |
object of class |
... |
further arguments passed to |
nothing.
target_feature <- create_feature_target(10, 375, 15, 600)
example_result <- distr_crit(target = target_feature[, 1],
                             feature = target_feature[, 2])
plot(example_result)

# a ggplot2 plot
library(ggplot2)
ggplot_distr <- function(x) {
  # combine the criterion values and probabilities with their x coordinates
  b <- data.frame(cbind(x = as.numeric(rownames(attr(x, "plot_data"))),
                        attr(x, "plot_data")))
  d1 <- cbind(b[, c(1, 2)], attr(x, "nice_name"))
  d2 <- cbind(b[, c(1, 3)], "Probability")
  colnames(d1) <- c("x", "y", "panel")
  colnames(d2) <- c("x", "y", "panel")
  d <- rbind(d1, d2)
  p <- ggplot(data = d, mapping = aes(x = x, y = y)) +
    facet_grid(panel ~ ., scales = "free") +
    geom_freqpoly(data = d2, aes(color = y), stat = "identity") +
    scale_fill_brewer(palette = "Set1") +
    geom_point(data = d1, aes(size = y), stat = "identity") +
    guides(color = "none") +
    guides(size = "none") +
    xlab("Number of cases with feature=1 and target=1") +
    ylab("")
  p
}
ggplot_distr(example_result)
Transforms a vector of positioned n-grams into a list of positions filled with the n-grams that start on them.
position_ngrams(ngrams, df = FALSE, unigrams_output = TRUE)
ngrams |
a vector of positioned n-grams (as created by |
df |
logical, if |
unigrams_output |
logical, if |
If df is FALSE, returns a list of length equal to the number of unique n-gram starts present in the n-grams. Each element of the list contains the n-grams that start on this position. If df is TRUE, returns a data frame where the first column contains n-grams and the second column represents their start positions.
Transform n-gram name to human-friendly form: decode_ngrams.
Validate n-gram structure: is_ngram.
# position data in the list format
position_ngrams(c("2_1.1.2_0.1", "3_1.1.2_0.0", "3_2.2.2_0.0"))
# position data in the data frame format
position_ngrams(c("2_1.1.2_0.1", "3_1.1.2_0.0", "3_2.2.2_0.0"), df = TRUE)
Prints results of the test_features function.
## S3 method for class 'feature_test'
print(x, ...)
x |
object of class |
... |
further arguments passed to |
nothing.
A lightweight tool to read nucleic or amino-acid sequences from a file in FASTA format.
read_fasta(file)
file |
the name of the file which the data are to be read from. |
a list of sequences.
read.fasta: heavier function for processing FASTA files.
## Not run:
read_fasta("https://www.uniprot.org/uniprot/P28307.fasta")
## End(Not run)
Reads sequence data saved in a text file.
read_txt(connection)
connection |
a |
The input file should contain one or more amino acid sequences separated by empty line(s).
a list of sequences.
sequences <- read_txt(system.file("PlastoGram/sequences.txt", package = "PlastoGram"))
'Regenerates' an amino acid or nucleic acid sequence written in a simplified alphabet by converting groups to a regular expression.
regenerate(x, element_groups)
x |
|
element_groups |
encoding of elements: list of groups to which elements of sequence should be aggregated. Must have unique names. |
A character string representing a POSIX regular expression.
Gaps (_) will be converted to any possible character from the alphabet (nucleotides or amino acids).
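The conversion from group names to a regular expression can be sketched in base R. `to_regex` is a hypothetical helper, and the exact string produced by the package may be formatted differently; the sketch assumes single-character group names and single-character alphabet elements.

```r
# Hypothetical sketch: turn every group name into a character class listing
# the group's members; a gap ("_") matches any element of the alphabet.
to_regex <- function(x, element_groups) {
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  alphabet <- unlist(element_groups, use.names = FALSE)
  classes <- vapply(chars, function(ch) {
    members <- if (ch == "_") alphabet else element_groups[[ch]]
    paste0("[", paste(members, collapse = ""), "]")
  }, character(1))
  paste0(classes, collapse = "")
}

groups <- list(w = c(1, 4), s = c(2, 3))
pattern <- to_regex("ssw", groups)
pattern
# the pattern matches any sequence built from two "s" elements and one "w"
grepl(pattern, "231")
```

The resulting pattern can be passed to `grepl` or `regmatches` to locate the reduced-alphabet motif in sequences written in the full alphabet.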
degenerate to easily convert information stored in biological sequences from letters to numbers.
calc_ed to calculate distance between simplified alphabets.
regenerate("ssw", list(w = c(1, 4), s = c(2, 3)))
List of rules defining the region.
An object of the regional_param class is a list consisting of all rules necessary to properly build a region.
the number of unigrams inside the region. Might be 0.
required intervals of properties of unigrams in the region.
a numeric value between 0 and 1 defining how strictly unigrams are kept within prop_ranges. If 1, only unigrams within prop_ranges are inside the region. If 0.9, there is a 10% chance that unigrams that are not in prop_ranges will be inside the region.
Extracts vector of n-grams present in sequence(s).
seq2ngrams(seq, n, u, d = 0, pos = FALSE)
seq |
a vector or matrix describing sequence(s). |
n |
|
u |
|
d |
|
pos |
|
The format of the d vector is discussed in the Details of count_ngrams.
A character matrix of n-grams, where every row corresponds to a different sequence.
# trigrams from multiple sequences
seqs <- matrix(sample(1L:4, 600, replace = TRUE), ncol = 50)
seq2ngrams(seqs, 3, 1L:4)
Converts an encoding from the simple format to the full format.
simple2full(x)
x |
encoding (see Details). |
The encoding should be named. Each name should correspond to a different amino acid or nucleotide.
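The relationship between the two formats can be shown with base R alone. This is a sketch under the assumption that the simple format is a named vector of group labels and the full format is a named list of group members; the toy nucleotide encoding below is invented, and the ordering of elements returned by the package functions may differ.

```r
# Simple format (assumed): one group label per alphabet element.
simple <- c(a = "1", c = "2", g = "2", t = "1")

# Simple -> full: collect the element names belonging to each group label.
full <- split(names(simple), simple)
full

# Full -> simple: repeat each group label once per member, then restore
# alphabetical order of the element names.
back <- setNames(rep(names(full), lengths(full)),
                 unlist(full, use.names = FALSE))
back <- back[order(names(back))]
back
```

The round trip returns the original named vector, which shows that both formats carry the same information.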
aa1 = structure(c("1", "4", "3", "3", "4", "1", "2", "1", "2", "1",
                  "1", "4", "1", "4", "4", "4", "4", "1", "4", "4"),
                .Names = c("a", "c", "d", "e", "f", "g", "h", "i", "k", "l",
                           "m", "n", "p", "q", "r", "s", "t", "v", "w", "y"))
simple2full(aa1)
Summarizes results of the test_features function.
## S3 method for class 'feature_test'
summary(object, conf_level = 0.95, ...)
object |
of class |
conf_level |
confidence level. A feature with p-value equal to or smaller than the confidence is considered significant. |
... |
ignored |
nothing.
Builds a contingency table of the n-gram counts versus their class labels.
table_ngrams(seq, ngrams, target)
seq |
vector or matrix describing sequence(s). |
ngrams |
vector of n-grams. |
target |
|
A data frame with the number of columns equal to the length of the target plus 1. The first column contains names of the n-grams. Further columns represent counts of n-grams for the respective value of the target.
seqs_pos <- matrix(sample(c("a", "c", "g", "t"), 100, replace = TRUE,
                          prob = c(0.2, 0.4, 0.35, 0.05)), ncol = 5)
seqs_neg <- matrix(sample(c("a", "c", "g", "t"), 100, replace = TRUE), ncol = 5)
tab <- table_ngrams(seq = rbind(seqs_pos, seqs_neg),
                    ngrams = c("1_c.t_0", "1_g.g_0", "2_t.c_0",
                               "2_g.g_0", "3_c.c_0", "3_g.c_0"),
                    target = c(rep(1, 20), rep(0, 20)))
# see the results
print(tab)
# easily plot the results using ggplot2
Performs a feature selection on positioned n-gram data using a Fisher's permutation test.
test_features(
  target,
  features,
  criterion = "ig",
  adjust = "BH",
  threshold = 1,
  quick = TRUE,
  times = 1e+05,
  occurrences = TRUE
)
target |
|
features |
|
criterion |
criterion used in the permutation test. See Details for the list of possible criteria. |
adjust |
name of p-value adjustment method. See |
threshold |
|
quick |
|
times |
number of times procedure should be repeated. Ignored if |
occurrences |
|
Since the procedure involves multiple testing, it is advisable to use one of the available p-value adjustment methods. Such methods can be used directly by specifying the adjust parameter.
Available criteria:
Information Gain: calc_ig.
Kullback-Leibler divergence: calc_kl.
Chi-squared-based measure: calc_cs.
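The general idea of a permutation test with an information-gain criterion, followed by p-value adjustment, can be sketched in base R. This is a plain permutation test written for illustration only: it is not QuiPT and not the package's implementation, the helper names `entropy`, `info_gain` and `perm_test` are hypothetical, and the data are simulated.

```r
# Shannon entropy (in bits) of a vector of probabilities.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of a binary feature with respect to a binary target.
info_gain <- function(target, feature) {
  h_target <- entropy(table(target) / length(target))
  h_cond <- sum(vapply(unique(feature), function(v) {
    idx <- feature == v
    mean(idx) * entropy(table(target[idx]) / sum(idx))
  }, numeric(1)))
  h_target - h_cond
}

# Share of permutations whose criterion is at least as large as observed.
perm_test <- function(target, feature, times = 1000) {
  observed <- info_gain(target, feature)
  permuted <- replicate(times, info_gain(sample(target), feature))
  mean(permuted >= observed)
}

set.seed(1)
target <- rep(c(1, 0), each = 50)
informative <- c(rep(1, 40), rep(0, 10), rep(0, 45), rep(1, 5))
noise <- sample(0:1, 100, replace = TRUE)

p_values <- c(informative = perm_test(target, informative),
              noise = perm_test(target, noise))
# Benjamini-Hochberg adjustment, corresponding to adjust = "BH"
p.adjust(p_values, method = "BH")
```

Each feature requires `times` recomputations of the criterion in this naive version, which is the cost that a quick test such as QuiPT is meant to avoid.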
An object of class feature_test.
Both target and features must be binary, i.e. contain only 0 and 1 values.
Features occurring too often or too rarely are considered uninformative and may be removed using the threshold parameter.
Radivojac P, Obradovic Z, Dunker AK, Vucetic S, Feature selection filters based on the permutation test in Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Springer, 2004.
binarize - binarizes input data.
calc_criterion - computes selected criterion.
distr_crit - distribution of criterion used in QuiPT.
summary.feature_test - summary of results.
cut.feature_test - aggregates test results in groups based on feature's p-value.
# significant feature
tar_feat1 <- create_feature_target(10, 390, 0, 600)
# significant feature
tar_feat2 <- create_feature_target(9, 391, 1, 599)
# insignificant feature
tar_feat3 <- create_feature_target(198, 202, 300, 300)
test_res <- test_features(tar_feat1[, 1], cbind(tar_feat1[, 2], tar_feat2[, 2],
                                                tar_feat3[, 2]))
summary(test_res)
cut(test_res)

# real data example
# we will analyze only a subsample of the dataset to make the analysis quicker
ids <- c(1L:100, 701L:800)
deg_seqs <- degenerate(human_cleave[ids, 1L:9],
                       list(`a` = c(1, 6, 8, 10, 11, 18),
                            `b` = c(2, 5, 13, 14, 16, 17, 19, 20),
                            `c` = c(3, 4, 7, 9, 12, 15)))

# positioned n-grams example
bigrams_pos <- count_ngrams(deg_seqs, 2, letters[1L:3], pos = TRUE)
test_features(human_cleave[ids, 10], bigrams_pos)

# unpositioned n-grams example, binarization required
bigrams_notpos <- count_ngrams(deg_seqs, 2, letters[1L:3], pos = FALSE)
test_features(human_cleave[ids, 10], binarize(bigrams_notpos))
Checks the structure of an encoding.
validate_encoding(x, u)
x |
encoding. |
u |
|
The encoding is a list of groups to which elements of an alphabet should be reduced. All elements of the alphabet (all amino acids or all nucleotides) should appear in the encoding.
TRUE if x is a correctly reduced u, FALSE in any other case.
calc_ed: calculates the encoding distance between two encodings.
encoding2df: converts an encoding to a data frame.
enc1 = list(`1` = c("a", "t"), `2` = c("g", "c"))
# see if enc1 is the correctly reduced nucleotide (DNA) alphabet
validate_encoding(enc1, c("a", "c", "g", "t"))
# enc1 is not the RNA alphabet, so the result is FALSE
validate_encoding(enc1, c("a", "c", "g", "u"))
# validate_encoding also works on other notations
enc2 = list(a = c(1, 4), b = c(2, 3))
validate_encoding(enc2, 1L:4)
Saves a list of encodings (or a single encoding) to a file.
write_encoding(x, file = "")
x |
encoding or list of encodings. |
file |
either a character string naming a file or a
|
aa1 = list(`1` = c("g", "a", "p", "v", "m", "l", "i"),
           `2` = c("k", "h"),
           `3` = c("d", "e"),
           `4` = c("f", "r", "w", "y", "s", "t", "c", "n", "q"))
write_encoding(aa1)
A lightweight tool to write nucleic or amino-acid sequences to a file in FASTA format.
write_fasta(seq, file, nchar = 80)
seq |
a list of sequences. |
file |
the name of the output file. |
nchar |
the number of characters per line. |
write.fasta: heavier function for writing FASTA files.