Package 'biogram' reference manual

Title:	N-Gram Analysis of Biological Sequences
Description:	Tools for extraction and analysis of various n-grams (k-mers) derived from biological sequences (proteins or nucleic acids). Contains QuiPT (quick permutation test) for fast feature-filtering of the n-gram data.
Authors:	Michal Burdukiewicz [cre, aut], Piotr Sobczyk [aut], Chris Lauber [aut], Dominik Rafacz [aut], Katarzyna Sidorczuk [ctb]
Maintainer:	Michal Burdukiewicz <[email protected]>
License:	GPL-3
Version:	1.6.3
Built:	2025-02-26 05:01:44 UTC
Source:	https://github.com/michbur/biogram

biogram - analysis of biological sequences using n-grams

Description

The biogram package is a toolbox for the analysis of nucleic acid and protein sequences using n-grams. Possible applications include motif discovery, feature selection, clustering, and classification.

n-grams

n-grams (k-tuples) are sets of n characters derived from the input sequence(s). They may form continuous sub-sequences or be discontinuous. For example, from the sequence of nucleotides AATA one can extract the following continuous 2-grams (bigrams): AA, AT and TA. Moreover, there are two possible bigrams separated by a single space: A_T and A_A, and one bigram separated by two spaces: A__A.
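The AATA example above can be reproduced with a short base R sketch. The helper gapped_bigrams is a hypothetical function written only for this illustration; it is not part of biogram and makes no claim about how the package extracts n-grams internally.

```r
# Hypothetical helper (not part of biogram): list every bigram whose two
# elements are separated by `d` skipped characters, marking each gap with "_".
gapped_bigrams <- function(seq_string, d = 0) {
  chars <- strsplit(seq_string, "")[[1]]
  n_starts <- length(chars) - d - 1
  # The sequence is too short to hold a bigram with this gap
  if (n_starts < 1) return(character(0))
  starts <- seq_len(n_starts)
  paste0(chars[starts], strrep("_", d), chars[starts + d + 1])
}

continuous <- gapped_bigrams("AATA", d = 0)  # "AA" "AT" "TA"
one_gap    <- gapped_bigrams("AATA", d = 1)  # "A_T" "A_A"
two_gaps   <- gapped_bigrams("AATA", d = 2)  # "A__A"
```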

Another important n-gram parameter is its position. Instead of just counting n-grams, one may want to count how many n-grams occur at a given position in multiple (e.g. related) sequences. For example, in the sequences AATA and AACA there is only one bigram at position 1: AA, but there are two bigrams at position 2: AT and AC. The following notation is used for position-specific n-grams: 1_AA, 2_AT, 2_AC.
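A minimal base R sketch of the position-specific counting described above, using the same two sequences. The helper positioned_bigrams is hypothetical and exists only for this illustration; the labels follow the position_n-gram notation from the text, not the full coded notation that biogram functions return.

```r
# Hypothetical helper (not part of biogram): continuous bigrams labelled
# with their start position, in the form <position>_<bigram>.
positioned_bigrams <- function(seq_string) {
  chars <- strsplit(seq_string, "")[[1]]
  if (length(chars) < 2) return(character(0))
  starts <- seq_len(length(chars) - 1)
  paste0(starts, "_", chars[starts], chars[starts + 1])
}

seqs <- c("AATA", "AACA")

# Tabulate positioned bigrams over all sequences
pos_counts <- table(unlist(lapply(seqs, positioned_bigrams)))

at_position_1 <- grep("^1_", names(pos_counts), value = TRUE)  # only 1_AA
at_position_2 <- grep("^2_", names(pos_counts), value = TRUE)  # 2_AC and 2_AT
```

Here pos_counts[["1_AA"]] is 2, because both sequences start with AA, while each of the two bigrams at position 2 is seen once.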

In the biogram package, the count_ngrams function is used for counting and extracting n-grams. Using the d argument the user can specify the distance between elements of the n-grams. The pos argument can be used to enable position specificity.

n-gram data dimensionality

We note that n-grams suffer from the curse of dimensionality. For example, for a peptide of length 6, up to $20^{n}$ distinct n-grams and up to $6 \times 20^{n}$ positioned n-grams are possible. Data sets of such enormous size are hard to manage and analyze in R.
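To give a feel for these numbers, the following sketch tabulates the size of the feature space for n = 1, 2, 3. The column positioned_continuous is an addition for illustration: it counts only continuous n-grams, which can start at 6 - n + 1 positions, so it never exceeds the $6 \times 20^{n}$ figure.

```r
alphabet_size  <- 20  # number of amino acids
peptide_length <- 6
n <- 1:3

dims <- data.frame(
  n = n,
  # number of distinct n-grams over the alphabet
  ngrams = alphabet_size^n,
  # figure quoted in the text: peptide length times number of n-grams
  positioned_bound = peptide_length * alphabet_size^n,
  # continuous n-grams only: (length - n + 1) possible start positions
  positioned_continuous = (peptide_length - n + 1) * alphabet_size^n
)
dims
```

Already for trigrams the table reaches 8000 unpositioned and 48000 positioned features.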

The biogram package addresses both problems, storage and analysis. It exploits the fact that n-gram data are usually sparse and can be represented by sparse matrices; data storage relies on the slam package. To ease the selection of significant features, biogram provides the user with QuiPT, a very fast permutation test for binary data (see test_features).
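For orientation, the sketch below shows a naive Monte Carlo permutation test for a single binary feature, that is, the slow procedure that a quick permutation test is meant to avoid. It is not QuiPT and does not use biogram; the toy data and the test statistic (number of positive sequences containing the n-gram) are assumptions made for this illustration.

```r
set.seed(1)

# Toy data: binary target and one binary n-gram feature (presence/absence)
target  <- rep(c(0, 1), each = 20)
feature <- target  # perfectly associated feature, for illustration only

# Assumed test statistic: number of positive sequences containing the n-gram
cooccurrence <- function(feature, target) sum(feature == 1 & target == 1)

observed <- cooccurrence(feature, target)

# Shuffle the target labels many times and recompute the statistic
n_perm <- 1000
permuted <- replicate(n_perm, cooccurrence(feature, sample(target)))

# Monte Carlo p-value with the usual +1 correction
p_value <- (1 + sum(permuted >= observed)) / (1 + n_perm)
```

Repeating such a loop for every n-gram in a large feature space is expensive, which is the motivation for a fast test such as the one offered by test_features.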

Another way of reducing dimensionality is the aggregation of sequence residues into more general groups. For example, all positively-charged amino acids may be aggregated into one group. This action can be performed using the degenerate function.
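The idea of aggregation can be sketched in base R as a simple lookup from residue to group. The helper aggregate_residues and the grouping below are hypothetical and chosen only for illustration; in biogram this step is performed by the degenerate function.

```r
# Hypothetical helper (not part of biogram): replace each residue by the
# name of the group it belongs to; residues in no group become NA.
aggregate_residues <- function(residues, groups) {
  lookup <- setNames(rep(names(groups), lengths(groups)),
                     unlist(groups, use.names = FALSE))
  unname(lookup[residues])
}

# Illustrative grouping: positively charged, negatively charged, other
groups <- list(pos = c("K", "R", "H"),
               neg = c("D", "E"),
               other = c("A", "G", "S"))

peptide <- c("K", "D", "A", "R", "G")
reduced <- aggregate_residues(peptide, groups)  # "pos" "neg" "other" "pos" "other"
```

With three groups instead of eight residues, the number of possible bigrams over this toy alphabet drops from 64 to 9.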

Encoding of amino acids can ease sequence analysis, but multidimensional objects such as aggregations of amino acids are not easily comparable. We introduced the encoding distance, a measure of the distance between encodings. It can be computed using the calc_ed function.

Author(s)

Michal Burdukiewicz, Piotr Sobczyk, Chris Lauber

Examples

# use data set from package
data(human_cleave)
# the first nine columns hold nine consecutive amino acids from cleavage sites
# degenerate the sequences to reduce the dimensionality of the problem
# (use five groups instead of 20 amino acids)
deg_seqs <- degenerate(human_cleave[, 1L:9],
                       list(a = c(1, 6, 8, 10, 11, 18),
                            b = c(2, 13, 14, 16, 17),
                            c = c(5, 19, 20),
                            d = c(7, 9, 12, 15),
                            e = c(3, 4)))
# EXAMPLE 1 - extract significant trigrams
# extract trigrams
trigrams <- count_ngrams(deg_seqs, 3, letters[1L:5], pos = TRUE)
# select features that differ between the two target groups using QuiPT
test1 <- test_features(human_cleave[, "tar"], trigrams)
# see a summary of the results
summary(test1)
# aggregate features in groups based on their p-value
gr <- cut(test1)
# get position map of the most significant n-grams
position_ngrams(gr[[1]])
# transform the most significant n-grams to more readable form
decode_ngrams(gr[[1]])

# EXAMPLE 2 - search for specific n-grams
# the n-grams of interest are a_a (a-gap-a) and c_c (c-gap-c) at the
# 3rd and 4th position
# first, code the n-grams in biogram notation
coded <- code_ngrams(c("a_a", "c_c"))
# add position information
coded <- c(paste0("3_", coded), paste0("4_", coded))
# count only the features of interest
bigrams <- count_specified(deg_seqs, coded)
# test which of the features of interest are significant
test2 <- test_features(human_cleave[, "tar"], bigrams)
cut(test2)

Normalized amino acid properties

Description

Normalized (0-1) values of 554 amino acid properties retrieved from the AAIndex database (release 9.1), enriched with the contactivity of amino acids.
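As an illustration of what 0-1 normalization of a property looks like, the sketch below rescales the Kyte-Doolittle hydropathy index (AAIndex key KYTJ820101). It assumes min-max scaling, which the description does not state explicitly, and the values are typed in by hand for this example rather than read from the package data.

```r
# Kyte-Doolittle hydropathy index, one value per amino acid
hydropathy <- c(A = 1.8, R = -4.5, N = -3.5, D = -3.5, C = 2.5,
                Q = -3.5, E = -3.5, G = -0.4, H = -3.2, I = 4.5,
                L = 3.8, K = -3.9, M = 1.9, F = 2.8, P = -1.6,
                S = -0.8, T = -0.7, W = -0.9, Y = -1.3, V = 4.2)

# Assumed normalization: min-max scaling to the interval [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

normalized <- min_max(hydropathy)
range(normalized)  # 0 and 1, reached by R (arginine) and I (isoleucine)
```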

Format

A data frame with 20 columns and 600 rows.

Details

The following properties are included (AAIndex key: description of the property):

ANDN920101: alpha-CH chemical shifts (Andersen et al., 1992)
ARGP820101: Hydrophobicity index (Argos et al., 1982)
ARGP820102: Signal sequence helical potential (Argos et al., 1982)
ARGP820103: Membrane-buried preference parameters (Argos et al., 1982)
BEGF750101: Conformational parameter of inner helix (Beghin-Dirkx, 1975)
BEGF750102: Conformational parameter of beta-structure (Beghin-Dirkx, 1975)
BEGF750103: Conformational parameter of beta-turn (Beghin-Dirkx, 1975)
BHAR880101: Average flexibility indices (Bhaskaran-Ponnuswamy, 1988)
BIGC670101: Residue volume (Bigelow, 1967)
BIOV880101: Information value for accessibility; average fraction 35% (Biou et al., 1988)
BIOV880102: Information value for accessibility; average fraction 23% (Biou et al., 1988)
BROC820101: Retention coefficient in TFA (Browne et al., 1982)
BROC820102: Retention coefficient in HFBA (Browne et al., 1982)
BULH740101: Transfer free energy to surface (Bull-Breese, 1974)
BULH740102: Apparent partial specific volume (Bull-Breese, 1974)
BUNA790101: alpha-NH chemical shifts (Bundi-Wuthrich, 1979)
BUNA790102: alpha-CH chemical shifts (Bundi-Wuthrich, 1979)
BUNA790103: Spin-spin coupling constants 3JHalpha-NH (Bundi-Wuthrich, 1979)
BURA740101: Normalized frequency of alpha-helix (Burgess et al., 1974)
BURA740102: Normalized frequency of extended structure (Burgess et al., 1974)
CHAM810101: Steric parameter (Charton, 1981)
CHAM820101: Polarizability parameter (Charton-Charton, 1982)
CHAM820102: Free energy of solution in water, kcal/mole (Charton-Charton, 1982)
CHAM830101: The Chou-Fasman parameter of the coil conformation (Charton-Charton, 1983)
CHAM830102: A parameter defined from the residuals obtained from the best correlation of the Chou-Fasman parameter of beta-sheet (Charton-Charton, 1983)
CHAM830103: The number of atoms in the side chain labelled 1+1 (Charton-Charton, 1983)
CHAM830104: The number of atoms in the side chain labelled 2+1 (Charton-Charton, 1983)
CHAM830105: The number of atoms in the side chain labelled 3+1 (Charton-Charton, 1983)
CHAM830106: The number of bonds in the longest chain (Charton-Charton, 1983)
CHAM830107: A parameter of charge transfer capability (Charton-Charton, 1983)
CHAM830108: A parameter of charge transfer donor capability (Charton-Charton, 1983)
CHOC750101: Average volume of buried residue (Chothia, 1975)
CHOC760101: Residue accessible surface area in tripeptide (Chothia, 1976)
CHOC760102: Residue accessible surface area in folded protein (Chothia, 1976)
CHOC760103: Proportion of residues 95% buried (Chothia, 1976)
CHOC760104: Proportion of residues 100% buried (Chothia, 1976)
CHOP780101: Normalized frequency of beta-turn (Chou-Fasman, 1978a)
CHOP780201: Normalized frequency of alpha-helix (Chou-Fasman, 1978b)
CHOP780202: Normalized frequency of beta-sheet (Chou-Fasman, 1978b)
CHOP780203: Normalized frequency of beta-turn (Chou-Fasman, 1978b)
CHOP780204: Normalized frequency of N-terminal helix (Chou-Fasman, 1978b)
CHOP780205: Normalized frequency of C-terminal helix (Chou-Fasman, 1978b)
CHOP780206: Normalized frequency of N-terminal non helical region (Chou-Fasman, 1978b)
CHOP780207: Normalized frequency of C-terminal non helical region (Chou-Fasman, 1978b)
CHOP780208: Normalized frequency of N-terminal beta-sheet (Chou-Fasman, 1978b)
CHOP780209: Normalized frequency of C-terminal beta-sheet (Chou-Fasman, 1978b)
CHOP780210: Normalized frequency of N-terminal non beta region (Chou-Fasman, 1978b)
CHOP780211: Normalized frequency of C-terminal non beta region (Chou-Fasman, 1978b)
CHOP780212: Frequency of the 1st residue in turn (Chou-Fasman, 1978b)
CHOP780213: Frequency of the 2nd residue in turn (Chou-Fasman, 1978b)
CHOP780214: Frequency of the 3rd residue in turn (Chou-Fasman, 1978b)
CHOP780215: Frequency of the 4th residue in turn (Chou-Fasman, 1978b)
CHOP780216: Normalized frequency of the 2nd and 3rd residues in turn (Chou-Fasman, 1978b)
CIDH920101: Normalized hydrophobicity scales for alpha-proteins (Cid et al., 1992)
CIDH920102: Normalized hydrophobicity scales for beta-proteins (Cid et al., 1992)
CIDH920103: Normalized hydrophobicity scales for alpha+beta-proteins (Cid et al., 1992)
CIDH920104: Normalized hydrophobicity scales for alpha/beta-proteins (Cid et al., 1992)
CIDH920105: Normalized average hydrophobicity scales (Cid et al., 1992)
COHE430101: Partial specific volume (Cohn-Edsall, 1943)
CRAJ730101: Normalized frequency of middle helix (Crawford et al., 1973)
CRAJ730102: Normalized frequency of beta-sheet (Crawford et al., 1973)
CRAJ730103: Normalized frequency of turn (Crawford et al., 1973)
DAWD720101: Size (Dawson, 1972)
DAYM780101: Amino acid composition (Dayhoff et al., 1978a)
DAYM780201: Relative mutability (Dayhoff et al., 1978b)
DESM900101: Membrane preference for cytochrome b: MPH89 (Degli Esposti et al., 1990)
DESM900102: Average membrane preference: AMP07 (Degli Esposti et al., 1990)
EISD840101: Consensus normalized hydrophobicity scale (Eisenberg, 1984)
EISD860101: Solvation free energy (Eisenberg-McLachlan, 1986)
EISD860102: Atom-based hydrophobic moment (Eisenberg-McLachlan, 1986)
EISD860103: Direction of hydrophobic moment (Eisenberg-McLachlan, 1986)
FASG760101: Molecular weight (Fasman, 1976)
FASG760102: Melting point (Fasman, 1976)
FASG760103: Optical rotation (Fasman, 1976)
FASG760104: pK-N (Fasman, 1976)
FASG760105: pK-C (Fasman, 1976)
FAUJ830101: Hydrophobic parameter pi (Fauchere-Pliska, 1983)
FAUJ880101: Graph shape index (Fauchere et al., 1988)
FAUJ880102: Smoothed upsilon steric parameter (Fauchere et al., 1988)
FAUJ880103: Normalized van der Waals volume (Fauchere et al., 1988)
FAUJ880104: STERIMOL length of the side chain (Fauchere et al., 1988)
FAUJ880105: STERIMOL minimum width of the side chain (Fauchere et al., 1988)
FAUJ880106: STERIMOL maximum width of the side chain (Fauchere et al., 1988)
FAUJ880107: N.m.r. chemical shift of alpha-carbon (Fauchere et al., 1988)
FAUJ880108: Localized electrical effect (Fauchere et al., 1988)
FAUJ880109: Number of hydrogen bond donors (Fauchere et al., 1988)
FAUJ880110: Number of full nonbonding orbitals (Fauchere et al., 1988)
FAUJ880111: Positive charge (Fauchere et al., 1988)
FAUJ880112: Negative charge (Fauchere et al., 1988)
FAUJ880113: pK-a(RCOOH) (Fauchere et al., 1988)
FINA770101: Helix-coil equilibrium constant (Finkelstein-Ptitsyn, 1977)
FINA910101: Helix initiation parameter at position i-1 (Finkelstein et al., 1991)
FINA910102: Helix initiation parameter at position i,i+1,i+2 (Finkelstein et al., 1991)
FINA910103: Helix termination parameter at position j-2,j-1,j (Finkelstein et al., 1991)
FINA910104: Helix termination parameter at position j+1 (Finkelstein et al., 1991)
GARJ730101: Partition coefficient (Garel et al., 1973)
GEIM800101: Alpha-helix indices (Geisow-Roberts, 1980)
GEIM800102: Alpha-helix indices for alpha-proteins (Geisow-Roberts, 1980)
GEIM800103: Alpha-helix indices for beta-proteins (Geisow-Roberts, 1980)
GEIM800104: Alpha-helix indices for alpha/beta-proteins (Geisow-Roberts, 1980)
GEIM800105: Beta-strand indices (Geisow-Roberts, 1980)
GEIM800106: Beta-strand indices for beta-proteins (Geisow-Roberts, 1980)
GEIM800107: Beta-strand indices for alpha/beta-proteins (Geisow-Roberts, 1980)
GEIM800108: Aperiodic indices (Geisow-Roberts, 1980)
GEIM800109: Aperiodic indices for alpha-proteins (Geisow-Roberts, 1980)
GEIM800110: Aperiodic indices for beta-proteins (Geisow-Roberts, 1980)
GEIM800111: Aperiodic indices for alpha/beta-proteins (Geisow-Roberts, 1980)
GOLD730101: Hydrophobicity factor (Goldsack-Chalifoux, 1973)
GOLD730102: Residue volume (Goldsack-Chalifoux, 1973)
GRAR740101: Composition (Grantham, 1974)
GRAR740102: Polarity (Grantham, 1974)
GRAR740103: Volume (Grantham, 1974)
GUYH850101: Partition energy (Guy, 1985)
HOPA770101: Hydration number (Hopfinger, 1971), Cited by Charton-Charton (1982)
HOPT810101: Hydrophilicity value (Hopp-Woods, 1981)
HUTJ700101: Heat capacity (Hutchens, 1970)
HUTJ700102: Absolute entropy (Hutchens, 1970)
HUTJ700103: Entropy of formation (Hutchens, 1970)
ISOY800101: Normalized relative frequency of alpha-helix (Isogai et al., 1980)
ISOY800102: Normalized relative frequency of extended structure (Isogai et al., 1980)
ISOY800103: Normalized relative frequency of bend (Isogai et al., 1980)
ISOY800104: Normalized relative frequency of bend R (Isogai et al., 1980)
ISOY800105: Normalized relative frequency of bend S (Isogai et al., 1980)
ISOY800106: Normalized relative frequency of helix end (Isogai et al., 1980)
ISOY800107: Normalized relative frequency of double bend (Isogai et al., 1980)
ISOY800108: Normalized relative frequency of coil (Isogai et al., 1980)
JANJ780101: Average accessible surface area (Janin et al., 1978)
JANJ780102: Percentage of buried residues (Janin et al., 1978)
JANJ780103: Percentage of exposed residues (Janin et al., 1978)
JANJ790101: Ratio of buried and accessible molar fractions (Janin, 1979)
JANJ790102: Transfer free energy (Janin, 1979)
JOND750101: Hydrophobicity (Jones, 1975)
JOND750102: pK (-COOH) (Jones, 1975)
JOND920101: Relative frequency of occurrence (Jones et al., 1992)
JOND920102: Relative mutability (Jones et al., 1992)
JUKT750101: Amino acid distribution (Jukes et al., 1975)
JUNJ780101: Sequence frequency (Jungck, 1978)
KANM800101: Average relative probability of helix (Kanehisa-Tsong, 1980)
KANM800102: Average relative probability of beta-sheet (Kanehisa-Tsong, 1980)
KANM800103: Average relative probability of inner helix (Kanehisa-Tsong, 1980)
KANM800104: Average relative probability of inner beta-sheet (Kanehisa-Tsong, 1980)
KARP850101: Flexibility parameter for no rigid neighbors (Karplus-Schulz, 1985)
KARP850102: Flexibility parameter for one rigid neighbor (Karplus-Schulz, 1985)
KARP850103: Flexibility parameter for two rigid neighbors (Karplus-Schulz, 1985)
KHAG800101: The Kerr-constant increments (Khanarian-Moore, 1980)
KLEP840101: Net charge (Klein et al., 1984)
KRIW710101: Side chain interaction parameter (Krigbaum-Rubin, 1971)
KRIW790101: Side chain interaction parameter (Krigbaum-Komoriya, 1979)
KRIW790102: Fraction of site occupied by water (Krigbaum-Komoriya, 1979)
KRIW790103: Side chain volume (Krigbaum-Komoriya, 1979)
KYTJ820101: Hydropathy index (Kyte-Doolittle, 1982)
LAWE840101: Transfer free energy, CHP/water (Lawson et al., 1984)
LEVM760101: Hydrophobic parameter (Levitt, 1976)
LEVM760102: Distance between C-alpha and centroid of side chain (Levitt, 1976)
LEVM760103: Side chain angle theta(AAR) (Levitt, 1976)
LEVM760104: Side chain torsion angle phi(AAAR) (Levitt, 1976)
LEVM760105: Radius of gyration of side chain (Levitt, 1976)
LEVM760106: van der Waals parameter R0 (Levitt, 1976)
LEVM760107: van der Waals parameter epsilon (Levitt, 1976)
LEVM780101: Normalized frequency of alpha-helix, with weights (Levitt, 1978)
LEVM780102: Normalized frequency of beta-sheet, with weights (Levitt, 1978)
LEVM780103: Normalized frequency of reverse turn, with weights (Levitt, 1978)
LEVM780104: Normalized frequency of alpha-helix, unweighted (Levitt, 1978)
LEVM780105: Normalized frequency of beta-sheet, unweighted (Levitt, 1978)
LEVM780106: Normalized frequency of reverse turn, unweighted (Levitt, 1978)
LEWP710101: Frequency of occurrence in beta-bends (Lewis et al., 1971)
LIFS790101: Conformational preference for all beta-strands (Lifson-Sander, 1979)
LIFS790102: Conformational preference for parallel beta-strands (Lifson-Sander, 1979)
LIFS790103: Conformational preference for antiparallel beta-strands (Lifson-Sander, 1979)
MANP780101: Average surrounding hydrophobicity (Manavalan-Ponnuswamy, 1978)
MAXF760101: Normalized frequency of alpha-helix (Maxfield-Scheraga, 1976)
MAXF760102: Normalized frequency of extended structure (Maxfield-Scheraga, 1976)
MAXF760103: Normalized frequency of zeta R (Maxfield-Scheraga, 1976)
MAXF760104: Normalized frequency of left-handed alpha-helix (Maxfield-Scheraga, 1976)
MAXF760105: Normalized frequency of zeta L (Maxfield-Scheraga, 1976)
MAXF760106: Normalized frequency of alpha region (Maxfield-Scheraga, 1976)
MCMT640101: Refractivity (McMeekin et al., 1964), Cited by Jones (1975)
MEEJ800101: Retention coefficient in HPLC, pH7.4 (Meek, 1980)
MEEJ800102: Retention coefficient in HPLC, pH2.1 (Meek, 1980)
MEEJ810101: Retention coefficient in NaClO4 (Meek-Rossetti, 1981)
MEEJ810102: Retention coefficient in NaH2PO4 (Meek-Rossetti, 1981)
MEIH800101: Average reduced distance for C-alpha (Meirovitch et al., 1980)
MEIH800102: Average reduced distance for side chain (Meirovitch et al., 1980)
MEIH800103: Average side chain orientation angle (Meirovitch et al., 1980)
MIYS850101: Effective partition energy (Miyazawa-Jernigan, 1985)
NAGK730101: Normalized frequency of alpha-helix (Nagano, 1973)
NAGK730102: Normalized frequency of beta-structure (Nagano, 1973)
NAGK730103: Normalized frequency of coil (Nagano, 1973)
NAKH900101: AA composition of total proteins (Nakashima et al., 1990)
NAKH900102: SD of AA composition of total proteins (Nakashima et al., 1990)
NAKH900103: AA composition of mt-proteins (Nakashima et al., 1990)
NAKH900104: Normalized composition of mt-proteins (Nakashima et al., 1990)
NAKH900105: AA composition of mt-proteins from animal (Nakashima et al., 1990)
NAKH900106: Normalized composition from animal (Nakashima et al., 1990)
NAKH900107: AA composition of mt-proteins from fungi and plant (Nakashima et al., 1990)
NAKH900108: Normalized composition from fungi and plant (Nakashima et al., 1990)
NAKH900109: AA composition of membrane proteins (Nakashima et al., 1990)
NAKH900110: Normalized composition of membrane proteins (Nakashima et al., 1990)
NAKH900111: Transmembrane regions of non-mt-proteins (Nakashima et al., 1990)
NAKH900112: Transmembrane regions of mt-proteins (Nakashima et al., 1990)
NAKH900113: Ratio of average and computed composition (Nakashima et al., 1990)
NAKH920101: AA composition of CYT of single-spanning proteins (Nakashima-Nishikawa, 1992)
NAKH920102: AA composition of CYT2 of single-spanning proteins (Nakashima-Nishikawa, 1992)
NAKH920103: AA composition of EXT of single-spanning proteins (Nakashima-Nishikawa, 1992)
NAKH920104: AA composition of EXT2 of single-spanning proteins (Nakashima-Nishikawa, 1992)
NAKH920105: AA composition of MEM of single-spanning proteins (Nakashima-Nishikawa, 1992)
NAKH920106: AA composition of CYT of multi-spanning proteins (Nakashima-Nishikawa, 1992)
NAKH920107: AA composition of EXT of multi-spanning proteins (Nakashima-Nishikawa, 1992)
NAKH920108: AA composition of MEM of multi-spanning proteins (Nakashima-Nishikawa, 1992)
NISK800101: 8 A contact number (Nishikawa-Ooi, 1980)
NISK860101: 14 A contact number (Nishikawa-Ooi, 1986)
NOZY710101: Transfer energy, organic solvent/water (Nozaki-Tanford, 1971)
OOBM770101: Average non-bonded energy per atom (Oobatake-Ooi, 1977)
OOBM770102: Short and medium range non-bonded energy per atom (Oobatake-Ooi, 1977)
OOBM770103: Long range non-bonded energy per atom (Oobatake-Ooi, 1977)
OOBM770104: Average non-bonded energy per residue (Oobatake-Ooi, 1977)
OOBM770105: Short and medium range non-bonded energy per residue (Oobatake-Ooi, 1977)
OOBM850101: Optimized beta-structure-coil equilibrium constant (Oobatake et al., 1985)
OOBM850102: Optimized propensity to form reverse turn (Oobatake et al., 1985)
OOBM850103: Optimized transfer energy parameter (Oobatake et al., 1985)
OOBM850104: Optimized average non-bonded energy per atom (Oobatake et al., 1985)
OOBM850105: Optimized side chain interaction parameter (Oobatake et al., 1985)
PALJ810101: Normalized frequency of alpha-helix from LG (Palau et al., 1981)
PALJ810102: Normalized frequency of alpha-helix from CF (Palau et al., 1981)
PALJ810103: Normalized frequency of beta-sheet from LG (Palau et al., 1981)
PALJ810104: Normalized frequency of beta-sheet from CF (Palau et al., 1981)
PALJ810105: Normalized frequency of turn from LG (Palau et al., 1981)
PALJ810106: Normalized frequency of turn from CF (Palau et al., 1981)
PALJ810107: Normalized frequency of alpha-helix in all-alpha class (Palau et al., 1981)
PALJ810108: Normalized frequency of alpha-helix in alpha+beta class (Palau et al., 1981)
PALJ810109: Normalized frequency of alpha-helix in alpha/beta class (Palau et al., 1981)
PALJ810110: Normalized frequency of beta-sheet in all-beta class (Palau et al., 1981)
PALJ810111: Normalized frequency of beta-sheet in alpha+beta class (Palau et al., 1981)
PALJ810112: Normalized frequency of beta-sheet in alpha/beta class (Palau et al., 1981)
PALJ810113: Normalized frequency of turn in all-alpha class (Palau et al., 1981)
PALJ810114: Normalized frequency of turn in all-beta class (Palau et al., 1981)
PALJ810115: Normalized frequency of turn in alpha+beta class (Palau et al., 1981)
PALJ810116: Normalized frequency of turn in alpha/beta class (Palau et al., 1981)
PARJ860101: HPLC parameter (Parker et al., 1986)
PLIV810101: Partition coefficient (Pliska et al., 1981)
PONP800101: Surrounding hydrophobicity in folded form (Ponnuswamy et al., 1980)
PONP800102: Average gain in surrounding hydrophobicity (Ponnuswamy et al., 1980)
PONP800103: Average gain ratio in surrounding hydrophobicity (Ponnuswamy et al., 1980)
PONP800104: Surrounding hydrophobicity in alpha-helix (Ponnuswamy et al., 1980)
PONP800105: Surrounding hydrophobicity in beta-sheet (Ponnuswamy et al., 1980)
PONP800106: Surrounding hydrophobicity in turn (Ponnuswamy et al., 1980)
PONP800107: Accessibility reduction ratio (Ponnuswamy et al., 1980)
PONP800108: Average number of surrounding residues (Ponnuswamy et al., 1980)
PRAM820101: Intercept in regression analysis (Prabhakaran-Ponnuswamy, 1982)
PRAM820102: Slope in regression analysis x 1.0E1 (Prabhakaran-Ponnuswamy, 1982)
PRAM820103: Correlation coefficient in regression analysis (Prabhakaran-Ponnuswamy, 1982)
PRAM900101: Hydrophobicity (Prabhakaran, 1990)
PRAM900102: Relative frequency in alpha-helix (Prabhakaran, 1990)
PRAM900103: Relative frequency in beta-sheet (Prabhakaran, 1990)
PRAM900104: Relative frequency in reverse-turn (Prabhakaran, 1990)
PTIO830101: Helix-coil equilibrium constant (Ptitsyn-Finkelstein, 1983)
PTIO830102: Beta-coil equilibrium constant (Ptitsyn-Finkelstein, 1983)
QIAN880101: Weights for alpha-helix at the window position of -6 (Qian-Sejnowski, 1988)
QIAN880102: Weights for alpha-helix at the window position of -5 (Qian-Sejnowski, 1988)
QIAN880103: Weights for alpha-helix at the window position of -4 (Qian-Sejnowski, 1988)
QIAN880104: Weights for alpha-helix at the window position of -3 (Qian-Sejnowski, 1988)
QIAN880105: Weights for alpha-helix at the window position of -2 (Qian-Sejnowski, 1988)
QIAN880106: Weights for alpha-helix at the window position of -1 (Qian-Sejnowski, 1988)
QIAN880107: Weights for alpha-helix at the window position of 0 (Qian-Sejnowski, 1988)
QIAN880108: Weights for alpha-helix at the window position of 1 (Qian-Sejnowski, 1988)
QIAN880109: Weights for alpha-helix at the window position of 2 (Qian-Sejnowski, 1988)
QIAN880110: Weights for alpha-helix at the window position of 3 (Qian-Sejnowski, 1988)
QIAN880111: Weights for alpha-helix at the window position of 4 (Qian-Sejnowski, 1988)
QIAN880112: Weights for alpha-helix at the window position of 5 (Qian-Sejnowski, 1988)
QIAN880113: Weights for alpha-helix at the window position of 6 (Qian-Sejnowski, 1988)
QIAN880114: Weights for beta-sheet at the window position of -6 (Qian-Sejnowski, 1988)
QIAN880115: Weights for beta-sheet at the window position of -5 (Qian-Sejnowski, 1988)
QIAN880116: Weights for beta-sheet at the window position of -4 (Qian-Sejnowski, 1988)
QIAN880117: Weights for beta-sheet at the window position of -3 (Qian-Sejnowski, 1988)
QIAN880118: Weights for beta-sheet at the window position of -2 (Qian-Sejnowski, 1988)
QIAN880119: Weights for beta-sheet at the window position of -1 (Qian-Sejnowski, 1988)
QIAN880120: Weights for beta-sheet at the window position of 0 (Qian-Sejnowski, 1988)
QIAN880121: Weights for beta-sheet at the window position of 1 (Qian-Sejnowski, 1988)
QIAN880122: Weights for beta-sheet at the window position of 2 (Qian-Sejnowski, 1988)
QIAN880123: Weights for beta-sheet at the window position of 3 (Qian-Sejnowski, 1988)
QIAN880124: Weights for beta-sheet at the window position of 4 (Qian-Sejnowski, 1988)
QIAN880125: Weights for beta-sheet at the window position of 5 (Qian-Sejnowski, 1988)
QIAN880126: Weights for beta-sheet at the window position of 6 (Qian-Sejnowski, 1988)
QIAN880127: Weights for coil at the window position of -6 (Qian-Sejnowski, 1988)
QIAN880128: Weights for coil at the window position of -5 (Qian-Sejnowski, 1988)
QIAN880129: Weights for coil at the window position of -4 (Qian-Sejnowski, 1988)
QIAN880130: Weights for coil at the window position of -3 (Qian-Sejnowski, 1988)
QIAN880131: Weights for coil at the window position of -2 (Qian-Sejnowski, 1988)
QIAN880132: Weights for coil at the window position of -1 (Qian-Sejnowski, 1988)
QIAN880133: Weights for coil at the window position of 0 (Qian-Sejnowski, 1988)
QIAN880134: Weights for coil at the window position of 1 (Qian-Sejnowski, 1988)
QIAN880135: Weights for coil at the window position of 2 (Qian-Sejnowski, 1988)
QIAN880136: Weights for coil at the window position of 3 (Qian-Sejnowski, 1988)
QIAN880137: Weights for coil at the window position of 4 (Qian-Sejnowski, 1988)
QIAN880138: Weights for coil at the window position of 5 (Qian-Sejnowski, 1988)
QIAN880139: Weights for coil at the window position of 6 (Qian-Sejnowski, 1988)
RACS770101: Average reduced distance for C-alpha (Rackovsky-Scheraga, 1977)
RACS770102: Average reduced distance for side chain (Rackovsky-Scheraga, 1977)
RACS770103: Side chain orientational preference (Rackovsky-Scheraga, 1977)
RACS820101: Average relative fractional occurrence in A0(i) (Rackovsky-Scheraga, 1982)
RACS820102: Average relative fractional occurrence in AR(i) (Rackovsky-Scheraga, 1982)
RACS820103: Average relative fractional occurrence in AL(i) (Rackovsky-Scheraga, 1982)
RACS820104: Average relative fractional occurrence in EL(i) (Rackovsky-Scheraga, 1982)
RACS820105: Average relative fractional occurrence in E0(i) (Rackovsky-Scheraga, 1982)
RACS820106: Average relative fractional occurrence in ER(i) (Rackovsky-Scheraga, 1982)
RACS820107: Average relative fractional occurrence in A0(i-1) (Rackovsky-Scheraga, 1982)
RACS820108: Average relative fractional occurrence in AR(i-1) (Rackovsky-Scheraga, 1982)
RACS820109: Average relative fractional occurrence in AL(i-1) (Rackovsky-Scheraga, 1982)
RACS820110: Average relative fractional occurrence in EL(i-1) (Rackovsky-Scheraga, 1982)
RACS820111: Average relative fractional occurrence in E0(i-1) (Rackovsky-Scheraga, 1982)
RACS820112: Average relative fractional occurrence in ER(i-1) (Rackovsky-Scheraga, 1982)
RACS820113: Value of theta(i) (Rackovsky-Scheraga, 1982)
RACS820114: Value of theta(i-1) (Rackovsky-Scheraga, 1982)
RADA880101: Transfer free energy from chx to wat (Radzicka-Wolfenden, 1988)
RADA880102: Transfer free energy from oct to wat (Radzicka-Wolfenden, 1988)
RADA880103: Transfer free energy from vap to chx (Radzicka-Wolfenden, 1988)
RADA880104: Transfer free energy from chx to oct (Radzicka-Wolfenden, 1988)
RADA880105: Transfer free energy from vap to oct (Radzicka-Wolfenden, 1988)
RADA880106: Accessible surface area (Radzicka-Wolfenden, 1988)
RADA880107: Energy transfer from out to in(95%buried) (Radzicka-Wolfenden, 1988)
RADA880108: Mean polarity (Radzicka-Wolfenden, 1988)
RICJ880101: Relative preference value at N" (Richardson-Richardson, 1988)
RICJ880102: Relative preference value at N' (Richardson-Richardson, 1988)
RICJ880103: Relative preference value at N-cap (Richardson-Richardson, 1988)
RICJ880104: Relative preference value at N1 (Richardson-Richardson, 1988)
RICJ880105: Relative preference value at N2 (Richardson-Richardson, 1988)
RICJ880106: Relative preference value at N3 (Richardson-Richardson, 1988)
RICJ880107: Relative preference value at N4 (Richardson-Richardson, 1988)
RICJ880108: Relative preference value at N5 (Richardson-Richardson, 1988)
RICJ880109: Relative preference value at Mid (Richardson-Richardson, 1988)
RICJ880110: Relative preference value at C5 (Richardson-Richardson, 1988)
RICJ880111: Relative preference value at C4 (Richardson-Richardson, 1988)
RICJ880112: Relative preference value at C3 (Richardson-Richardson, 1988)
RICJ880113: Relative preference value at C2 (Richardson-Richardson, 1988)
RICJ880114: Relative preference value at C1 (Richardson-Richardson, 1988)
RICJ880115: Relative preference value at C-cap (Richardson-Richardson, 1988)
RICJ880116: Relative preference value at C' (Richardson-Richardson, 1988)
RICJ880117: Relative preference value at C" (Richardson-Richardson, 1988)
ROBB760101: Information measure for alpha-helix (Robson-Suzuki, 1976)
ROBB760102: Information measure for N-terminal helix (Robson-Suzuki, 1976)
ROBB760103: Information measure for middle helix (Robson-Suzuki, 1976)
ROBB760104: Information measure for C-terminal helix (Robson-Suzuki, 1976)
ROBB760105: Information measure for extended (Robson-Suzuki, 1976)
ROBB760106: Information measure for pleated-sheet (Robson-Suzuki, 1976)
ROBB760107: Information measure for extended without H-bond (Robson-Suzuki, 1976)
ROBB760108: Information measure for turn (Robson-Suzuki, 1976)
ROBB760109: Information measure for N-terminal turn (Robson-Suzuki, 1976)
ROBB760110: Information measure for middle turn (Robson-Suzuki, 1976)
ROBB760111: Information measure for C-terminal turn (Robson-Suzuki, 1976)
ROBB760112: Information measure for coil (Robson-Suzuki, 1976)
ROBB760113: Information measure for loop (Robson-Suzuki, 1976)
ROBB790101: Hydration free energy (Robson-Osguthorpe, 1979)
ROSG850101: Mean area buried on transfer (Rose et al., 1985)
ROSG850102: Mean fractional area loss (Rose et al., 1985)
ROSM880101: Side chain hydropathy, uncorrected for solvation (Roseman, 1988)
ROSM880102: Side chain hydropathy, corrected for solvation (Roseman, 1988)
ROSM880103: Loss of Side chain hydropathy by helix formation (Roseman, 1988)
SIMZ760101: Transfer free energy (Simon, 1976), Cited by Charton-Charton (1982)
SNEP660101: Principal component I (Sneath, 1966)
SNEP660102: Principal component II (Sneath, 1966)
SNEP660103: Principal component III (Sneath, 1966)
SNEP660104: Principal component IV (Sneath, 1966)
SUEM840101: Zimm-Bragg parameter s at 20 C (Sueki et al., 1984)
SUEM840102: Zimm-Bragg parameter sigma x 1.0E4 (Sueki et al., 1984)
SWER830101: Optimal matching hydrophobicity (Sweet-Eisenberg, 1983)
TANS770101: Normalized frequency of alpha-helix (Tanaka-Scheraga, 1977)
TANS770102: Normalized frequency of isolated helix (Tanaka-Scheraga, 1977)
TANS770103: Normalized frequency of extended structure (Tanaka-Scheraga, 1977)
TANS770104: Normalized frequency of chain reversal R (Tanaka-Scheraga, 1977)
TANS770105: Normalized frequency of chain reversal S (Tanaka-Scheraga, 1977)
TANS770106: Normalized frequency of chain reversal D (Tanaka-Scheraga, 1977)
TANS770107: Normalized frequency of left-handed helix (Tanaka-Scheraga, 1977)
TANS770108: Normalized frequency of zeta R (Tanaka-Scheraga, 1977)
TANS770109: Normalized frequency of coil (Tanaka-Scheraga, 1977)
TANS770110: Normalized frequency of chain reversal (Tanaka-Scheraga, 1977)
VASM830101: Relative population of conformational state A (Vasquez et al., 1983)
VASM830102: Relative population of conformational state C (Vasquez et al., 1983)
VASM830103: Relative population of conformational state E (Vasquez et al., 1983)
VELV850101: Electron-ion interaction potential (Veljkovic et al., 1985)
VENT840101: Bitterness (Venanzi, 1984)
VHEG790101: Transfer free energy to lipophilic phase (von Heijne-Blomberg, 1979)
WARP780101: Average interactions per side chain atom (Warme-Morgan, 1978)
WEBA780101: RF value in high salt chromatography (Weber-Lacey, 1978)
WERD780101: Propensity to be buried inside (Wertz-Scheraga, 1978)
WERD780102: Free energy change of epsilon(i) to epsilon(ex) (Wertz-Scheraga, 1978)
WERD780103: Free energy change of alpha(Ri) to alpha(Rh) (Wertz-Scheraga, 1978)
WERD780104: Free energy change of epsilon(i) to alpha(Rh) (Wertz-Scheraga, 1978)
WOEC730101: Polar requirement (Woese, 1973)
WOLR810101: Hydration potential (Wolfenden et al., 1981)
WOLS870101: Principal property value z1 (Wold et al., 1987)
WOLS870102: Principal property value z2 (Wold et al., 1987)
WOLS870103: Principal property value z3 (Wold et al., 1987)
YUTK870101: Unfolding Gibbs energy in water, pH7.0 (Yutani et al., 1987)
YUTK870102: Unfolding Gibbs energy in water, pH9.0 (Yutani et al., 1987)
YUTK870103: Activation Gibbs energy of unfolding, pH7.0 (Yutani et al., 1987)
YUTK870104: Activation Gibbs energy of unfolding, pH9.0 (Yutani et al., 1987)
ZASB820101: Dependence of partition coefficient on ionic strength (Zaslavsky et al., 1982)
ZIMJ680101: Hydrophobicity (Zimmerman et al., 1968)
ZIMJ680102: Bulkiness (Zimmerman et al., 1968)
ZIMJ680103: Polarity (Zimmerman et al., 1968)
ZIMJ680104: Isoelectric point (Zimmerman et al., 1968)
ZIMJ680105: RF rank (Zimmerman et al., 1968)
AURR980101: Normalized positional residue frequency at helix termini N4' (Aurora-Rose, 1998)
AURR980102: Normalized positional residue frequency at helix termini N"' (Aurora-Rose, 1998)
AURR980103: Normalized positional residue frequency at helix termini N" (Aurora-Rose, 1998)
AURR980104: Normalized positional residue frequency at helix termini N' (Aurora-Rose, 1998)
AURR980105: Normalized positional residue frequency at helix termini Nc (Aurora-Rose, 1998)
AURR980106: Normalized positional residue frequency at helix termini N1 (Aurora-Rose, 1998)
AURR980107: Normalized positional residue frequency at helix termini N2 (Aurora-Rose, 1998)
AURR980108: Normalized positional residue frequency at helix termini N3 (Aurora-Rose, 1998)
AURR980109: Normalized positional residue frequency at helix termini N4 (Aurora-Rose, 1998)
AURR980110: Normalized positional residue frequency at helix termini N5 (Aurora-Rose, 1998)
AURR980111: Normalized positional residue frequency at helix termini C5 (Aurora-Rose, 1998)
AURR980112: Normalized positional residue frequency at helix termini C4 (Aurora-Rose, 1998)
AURR980113: Normalized positional residue frequency at helix termini C3 (Aurora-Rose, 1998)
AURR980114: Normalized positional residue frequency at helix termini C2 (Aurora-Rose, 1998)
AURR980115: Normalized positional residue frequency at helix termini C1 (Aurora-Rose, 1998)
AURR980116: Normalized positional residue frequency at helix termini Cc (Aurora-Rose, 1998)
AURR980117: Normalized positional residue frequency at helix termini C' (Aurora-Rose, 1998)
AURR980118: Normalized positional residue frequency at helix termini C" (Aurora-Rose, 1998)
AURR980119: Normalized positional residue frequency at helix termini C"' (Aurora-Rose, 1998)
AURR980120: Normalized positional residue frequency at helix termini C4' (Aurora-Rose, 1998)
ONEK900101: Delta G values for the peptides extrapolated to 0 M urea (O'Neil-DeGrado, 1990)
ONEK900102: Helix formation parameters (delta delta G) (O'Neil-DeGrado, 1990)
VINM940101: Normalized flexibility parameters (B-values), average (Vihinen et al., 1994)
VINM940102: Normalized flexibility parameters (B-values) for each residue surrounded by none rigid neighbours (Vihinen et al., 1994)
VINM940103: Normalized flexibility parameters (B-values) for each residue surrounded by one rigid neighbours (Vihinen et al., 1994)
VINM940104: Normalized flexibility parameters (B-values) for each residue surrounded by two rigid neighbours (Vihinen et al., 1994)
MUNV940101: Free energy in alpha-helical conformation (Munoz-Serrano, 1994)
MUNV940102: Free energy in alpha-helical region (Munoz-Serrano, 1994)
MUNV940103: Free energy in beta-strand conformation (Munoz-Serrano, 1994)
MUNV940104: Free energy in beta-strand region (Munoz-Serrano, 1994)
MUNV940105: Free energy in beta-strand region (Munoz-Serrano, 1994)
WIMW960101: Free energies of transfer of AcWl-X-LL peptides from bilayer interface to water (Wimley-White, 1996)
KIMC930101: Thermodynamic beta sheet propensity (Kim-Berg, 1993)
MONM990101: Turn propensity scale for transmembrane helices (Monne et al., 1999)
BLAM930101: Alpha helix propensity of position 44 in T4 lysozyme (Blaber et al., 1993)
PARS000101: p-Values of mesophilic proteins based on the distributions of B values (Parthasarathy-Murthy, 2000)
PARS000102: p-Values of thermophilic proteins based on the distributions of B values (Parthasarathy-Murthy, 2000)
KUMS000101: Distribution of amino acid residues in the 18 non-redundant families of thermophilic proteins (Kumar et al., 2000)
KUMS000102: Distribution of amino acid residues in the 18 non-redundant families of mesophilic proteins (Kumar et al., 2000)
KUMS000103: Distribution of amino acid residues in the alpha-helices in thermophilic proteins (Kumar et al., 2000)
KUMS000104: Distribution of amino acid residues in the alpha-helices in mesophilic proteins (Kumar et al., 2000)
TAKK010101: Side-chain contribution to protein stability (kJ/mol) (Takano-Yutani, 2001)
FODM020101: Propensity of amino acids within pi-helices (Fodje-Al-Karadaghi, 2002)
NADH010101: Hydropathy scale based on self-information values in the two-state model (5% accessibility) (Naderi-Manesh et al., 2001)
NADH010102: Hydropathy scale based on self-information values in the two-state model (9% accessibility) (Naderi-Manesh et al., 2001)
NADH010103: Hydropathy scale based on self-information values in the two-state model (16% accessibility) (Naderi-Manesh et al., 2001)
NADH010104: Hydropathy scale based on self-information values in the two-state model (20% accessibility) (Naderi-Manesh et al., 2001)
NADH010105: Hydropathy scale based on self-information values in the two-state model (25% accessibility) (Naderi-Manesh et al., 2001)
NADH010106: Hydropathy scale based on self-information values in the two-state model (36% accessibility) (Naderi-Manesh et al., 2001)
NADH010107: Hydropathy scale based on self-information values in the two-state model (50% accessibility) (Naderi-Manesh et al., 2001)
MONM990201: Averaged turn propensities in a transmembrane helix (Monne et al., 1999)
KOEP990101: Alpha-helix propensity derived from designed sequences (Koehl-Levitt, 1999)
KOEP990102: Beta-sheet propensity derived from designed sequences (Koehl-Levitt, 1999)
CEDJ970101: Composition of amino acids in extracellular proteins (percent) (Cedano et al., 1997)
CEDJ970102: Composition of amino acids in anchored proteins (percent) (Cedano et al., 1997)
CEDJ970103: Composition of amino acids in membrane proteins (percent) (Cedano et al., 1997)
CEDJ970104: Composition of amino acids in intracellular proteins (percent) (Cedano et al., 1997)
CEDJ970105: Composition of amino acids in nuclear proteins (percent) (Cedano et al., 1997)
FUKS010101: Surface composition of amino acids in intracellular proteins of thermophiles (percent) (Fukuchi-Nishikawa, 2001)
FUKS010102: Surface composition of amino acids in intracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
FUKS010103: Surface composition of amino acids in extracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
FUKS010104: Surface composition of amino acids in nuclear proteins (percent) (Fukuchi-Nishikawa, 2001)
FUKS010105: Interior composition of amino acids in intracellular proteins of thermophiles (percent) (Fukuchi-Nishikawa, 2001)
FUKS010106: Interior composition of amino acids in intracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
FUKS010107: Interior composition of amino acids in extracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
FUKS010108: Interior composition of amino acids in nuclear proteins (percent) (Fukuchi-Nishikawa, 2001)
FUKS010109: Entire chain composition of amino acids in intracellular proteins of thermophiles (percent) (Fukuchi-Nishikawa, 2001)
FUKS010110: Entire chain composition of amino acids in intracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
FUKS010111: Entire chain composition of amino acids in extracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
FUKS010112: Entire chain composition of amino acids in nuclear proteins (percent) (Fukuchi-Nishikawa, 2001)
AVBF000101: Screening coefficients gamma, local (Avbelj, 2000)
AVBF000102: Screening coefficients gamma, non-local (Avbelj, 2000)
AVBF000103: Slopes tripeptide, FDPB VFF neutral (Avbelj, 2000)
AVBF000104: Slopes tripeptides, LD VFF neutral (Avbelj, 2000)
AVBF000105: Slopes tripeptide, FDPB VFF noside (Avbelj, 2000)
AVBF000106: Slopes tripeptide FDPB VFF all (Avbelj, 2000)
AVBF000107: Slopes tripeptide FDPB PARSE neutral (Avbelj, 2000)
AVBF000108: Slopes dekapeptide, FDPB VFF neutral (Avbelj, 2000)
AVBF000109: Slopes proteins, FDPB VFF neutral (Avbelj, 2000)
YANJ020101: Side-chain conformation by gaussian evolutionary method (Yang et al., 2002)
MITS020101: Amphiphilicity index (Mitaku et al., 2002)
TSAJ990101: Volumes including the crystallographic waters using the ProtOr (Tsai et al., 1999)
TSAJ990102: Volumes not including the crystallographic waters using the ProtOr (Tsai et al., 1999)
COSI940101: Electron-ion interaction potential values (Cosic, 1994)
PONP930101: Hydrophobicity scales (Ponnuswamy, 1993)
WILM950101: Hydrophobicity coefficient in RP-HPLC, C18 with 0.1%TFA/MeCN/H2O (Wilce et al., 1995)
WILM950102: Hydrophobicity coefficient in RP-HPLC, C8 with 0.1%TFA/MeCN/H2O (Wilce et al., 1995)
WILM950103: Hydrophobicity coefficient in RP-HPLC, C4 with 0.1%TFA/MeCN/H2O (Wilce et al., 1995)
WILM950104: Hydrophobicity coefficient in RP-HPLC, C18 with 0.1%TFA/2-PrOH/MeCN/H2O (Wilce et al., 1995)
KUHL950101: Hydrophilicity scale (Kuhn et al., 1995)
GUOD860101: Retention coefficient at pH 2 (Guo et al., 1986)
JURD980101: Modified Kyte-Doolittle hydrophobicity scale (Juretic et al., 1998)
BASU050101: Interactivity scale obtained from the contact matrix (Bastolla et al., 2005)
BASU050102: Interactivity scale obtained by maximizing the mean of correlation coefficient over single-domain globular proteins (Bastolla et al., 2005)
BASU050103: Interactivity scale obtained by maximizing the mean of correlation coefficient over pairs of sequences sharing the TIM barrel fold (Bastolla et al., 2005)
SUYM030101: Linker propensity index (Suyama-Ohara, 2003)
PUNT030101: Knowledge-based membrane-propensity scale from 1D_Helix in MPtopo databases (Punta-Maritan, 2003)
PUNT030102: Knowledge-based membrane-propensity scale from 3D_Helix in MPtopo databases (Punta-Maritan, 2003)
GEOR030101: Linker propensity from all dataset (George-Heringa, 2003)
GEOR030102: Linker propensity from 1-linker dataset (George-Heringa, 2003)
GEOR030103: Linker propensity from 2-linker dataset (George-Heringa, 2003)
GEOR030104: Linker propensity from 3-linker dataset (George-Heringa, 2003)
GEOR030105: Linker propensity from small dataset (linker length is less than six residues) (George-Heringa, 2003)
GEOR030106: Linker propensity from medium dataset (linker length is between six and 14 residues) (George-Heringa, 2003)
GEOR030107: Linker propensity from long dataset (linker length is greater than 14 residues) (George-Heringa, 2003)
GEOR030108: Linker propensity from helical (annotated by DSSP) dataset (George-Heringa, 2003)
GEOR030109: Linker propensity from non-helical (annotated by DSSP) dataset (George-Heringa, 2003)
ZHOH040101: The stability scale from the knowledge-based atom-atom potential (Zhou-Zhou, 2004)
ZHOH040102: The relative stability scale extracted from mutation experiments (Zhou-Zhou, 2004)
ZHOH040103: Buriability (Zhou-Zhou, 2004)
BAEK050101: Linker index (Bae et al., 2005)
HARY940101: Mean volumes of residues buried in protein interiors (Harpaz et al., 1994)
PONJ960101: Average volumes of residues (Pontius et al., 1996)
DIGM050101: Hydrostatic pressure asymmetry index, PAI (Di Giulio, 2005)
WOLR790101: Hydrophobicity index (Wolfenden et al., 1979)
OLSK800101: Average internal preferences (Olsen, 1980)
KIDA850101: Hydrophobicity-related index (Kidera et al., 1985)
GUYH850102: Apparent partition energies calculated from Wertz-Scheraga index (Guy, 1985)
GUYH850103: Apparent partition energies calculated from Robson-Osguthorpe index (Guy, 1985)
GUYH850104: Apparent partition energies calculated from Janin index (Guy, 1985)
GUYH850105: Apparent partition energies calculated from Chothia index (Guy, 1985)
ROSM880104: Hydropathies of amino acid side chains, neutral form (Roseman, 1988)
ROSM880105: Hydropathies of amino acid side chains, pi-values in pH 7.0 (Roseman, 1988)
JACR890101: Weights from the IFH scale (Jacobs-White, 1989)
COWR900101: Hydrophobicity index, 3.0 pH (Cowan-Whittaker, 1990)
BLAS910101: Scaled side chain hydrophobicity values (Black-Mould, 1991)
CASG920101: Hydrophobicity scale from native protein structures (Casari-Sippl, 1992)
CORJ870101: NNEIG index (Cornette et al., 1987)
CORJ870102: SWEIG index (Cornette et al., 1987)
CORJ870103: PRIFT index (Cornette et al., 1987)
CORJ870104: PRILS index (Cornette et al., 1987)
CORJ870105: ALTFT index (Cornette et al., 1987)
CORJ870106: ALTLS index (Cornette et al., 1987)
CORJ870107: TOTFT index (Cornette et al., 1987)
CORJ870108: TOTLS index (Cornette et al., 1987)
MIYS990101: Relative partition energies derived by the Bethe approximation (Miyazawa-Jernigan, 1999)
MIYS990102: Optimized relative partition energies - method A (Miyazawa-Jernigan, 1999)
MIYS990103: Optimized relative partition energies - method B (Miyazawa-Jernigan, 1999)
MIYS990104: Optimized relative partition energies - method C (Miyazawa-Jernigan, 1999)
MIYS990105: Optimized relative partition energies - method D (Miyazawa-Jernigan, 1999)
ENGD860101: Hydrophobicity index (Engelman et al., 1986)
FASG890101: Hydrophobicity index (Fasman, 1989)
K6.5: Values of Wc in proteins from class Beta, cutoff 6 A, separation 5 (Wozniak, 2014)
K8.5: Values of Wc in proteins from class Beta, cutoff 8 A, separation 5 (Wozniak, 2014)
K12.5: Values of Wc in proteins from class Beta, cutoff 12 A, separation 5 (Wozniak, 2014)
K6.15: Values of Wc in proteins from class Beta, cutoff 6 A, separation 15 (Wozniak, 2014)
K8.15: Values of Wc in proteins from class Beta, cutoff 8 A, separation 15 (Wozniak, 2014)
K12.15: Values of Wc in proteins from class Beta, cutoff 12 A, separation 15 (Wozniak, 2014)

Source

AAindex database.

References

Kawashima, S. and Kanehisa, M. (2000) AAindex: amino acid index database. Nucleic Acids Res., 28:374.

Wozniak, P. and Kotulska, M. (2014) Characteristics of protein residue-residue contacts and their application in contact prediction. J. Mol. Model., 20(11):2497.

Examples

data(aaprop)

Add 1-grams

Description

Builds (n+1)-grams from n-grams.

Usage

add_1grams(ngram, u, seq_length)

Arguments

`ngram`	a single n-gram.
`u`	`integer`, `numeric` or `character` vector of all possible unigrams.
`seq_length`	length of the original sequence.

Details

n-grams are built by pasting every possible unigram in every possible free position. The total length of an n-gram (n plus the total distance between its elements) is limited by the length of the original sequence, because an n-gram cannot be longer than the sequence it comes from.

Value

vector of n-grams (where n is equal to the n of the input plus one).

Examples

add_1grams("1_2.3.4_3.0", 1L:4, 8)

add_1grams("a.a_1", c("a", "b", "c"), 4)

Coerce feature_test object to a data frame

Description

Coerce results of test_features function to a data.frame.

Usage

## S3 method for class 'feature_test'
as.data.frame(
  x,
  row.names = NULL,
  optional = FALSE,
  stringsAsFactors = FALSE,
  ...
)

Arguments

`x`	object of class `feature_test`.
`row.names`	ignored.
`optional`	ignored.
`stringsAsFactors`	logical: should the character vector be converted to a factor?
`...`	additional arguments to be passed to or from methods.

Value

a data frame with four columns: names of n-grams, p-values, and occurrences in positive and in negative sequences.

Binarize

Description

Binarizes a matrix.

Usage

binarize(x)

Arguments

`x`	`matrix` or `simple_triplet_matrix`.

Value

a matrix or simple_triplet_matrix (depending on the input).

Calculate value of criterion

Description

Computes a chosen statistical criterion for each feature versus target vector.

Usage

calc_criterion(target, features, criterion_function)

Arguments

`target`	`integer` vector with target information (e.g. class labels).
`features`	`integer` matrix of features with number of rows equal to the length of the target vector.
`criterion_function`	a function calculating criterion. For a full list, see `test_features`.

Details

The permutation test implemented in biogram can use several criteria to filter important features. Each of them can be selected in test_features by specifying the criterion parameter.
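
The column-by-column evaluation described above can be sketched in base R. This is only an illustration, assuming that a criterion function takes the four arguments shown in the Usage of calc_ig, calc_cs and calc_kl; `toy_criterion` and `apply_criterion` are hypothetical names and are not part of biogram.

```r
# Hypothetical criterion: absolute difference between the share of positive
# targets where the feature is present and where it is absent.
toy_criterion <- function(feature, target, len_target, pos_target) {
  n_present <- sum(feature)
  n_absent <- len_target - n_present
  if (n_present == 0 || n_absent == 0) return(0)
  pos_present <- sum(target[feature == 1])
  abs(pos_present / n_present - (pos_target - pos_present) / n_absent)
}

# Hypothetical helper: evaluate a criterion for every column of a binary
# feature matrix, computing the target summaries only once.
apply_criterion <- function(target, features, criterion_function) {
  len_target <- length(target)
  pos_target <- sum(target)
  vapply(seq_len(ncol(features)), function(j)
    criterion_function(features[, j], target, len_target, pos_target),
    numeric(1))
}

set.seed(1)
tar <- sample(0L:1, 100, replace = TRUE)
# first column equals the target, second is its complement, third is random
feats <- cbind(tar, 1L - tar, sample(0L:1, 100, replace = TRUE))
crit_values <- apply_criterion(tar, feats, toy_criterion)
```

Precomputing the length of the target and its number of positive cases once, instead of inside every call, is presumably why the single-feature functions take `len_target` and `pos_target` as arguments.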

Value

a numeric vector of length equal to the number of features, containing the computed values of the criterion.

Note

Both target and features must be binary, i.e. contain only 0 and 1 values.

Examples

tar <- sample(0L:1, 100, replace = TRUE)
feats <- matrix(sample(0L:1, 400, replace = TRUE), ncol = 4)

# Information Gain
calc_criterion(tar, feats, calc_ig)

# Chi-squared-based measure
calc_criterion(tar, feats, calc_cs)

# Kullback-Leibler divergence
calc_criterion(tar, feats, calc_kl)

Calculate Chi-squared-based measure

Description

Computes Chi-squared-based measure between features and target vector.

Usage

calc_cs(feature, target, len_target, pos_target)

Arguments

`feature`	feature vector.
`target`	target.
`len_target`	length of the target vector.
`pos_target`	number of positive cases in the target vector.

Value

A numeric vector of length 1 representing the computed Chi-squared-based measure.

Note

Both target and features must be binary, i.e. contain only 0 and 1 values.

The function was designed to be as fast as possible as a subroutine of calc_criterion and might be cumbersome if called directly by a user.
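
For orientation, the plain Pearson chi-squared statistic of two binary vectors can be computed in base R as sketched below. The measure in biogram is described only as "Chi-squared-based", so it may differ from this statistic; `chisq_binary` is a hypothetical helper and not the package's implementation.

```r
# Pearson chi-squared statistic for two binary vectors, without continuity
# correction. Returns 0 when a margin is empty and the statistic is undefined.
chisq_binary <- function(feature, target) {
  # observed 2 x 2 contingency table; both levels are always present
  obs <- table(factor(feature, levels = 0:1), factor(target, levels = 0:1))
  # expected counts under independence of feature and target
  expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  if (any(expected == 0)) return(0)
  sum((obs - expected)^2 / expected)
}

set.seed(1)
tar <- sample(0L:1, 100, replace = TRUE)
feat <- sample(0L:1, 100, replace = TRUE)
cs_value <- chisq_binary(feat, tar)
```

The result agrees with `chisq.test(table(feat, tar), correct = FALSE)$statistic` from the stats package, which is slower because it also computes p-values and performs input checks.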

Examples

tar <- sample(0L:1, 100, replace = TRUE)
feat <- sample(0L:1, 100, replace = TRUE)
calc_cs(feat, tar, 100, sum(tar))

Calculate encoding distance

Description

Computes the encoding distance between two encodings.

Usage

calc_ed(a, b, prop = NULL, measure)

Arguments

`a`	encoding (see `validate_encoding` for more information about the required structure of encoding).
`b`	encoding to which `a` should be compared. Must have the same number of groups as `a` or fewer. Both `a` and `b` must have the same number of elements.
`prop`	`matrix` of physicochemical properties to normalize the encoding distance. Each column should represent properties of a single amino acid/nucleotide. If `NULL`, encoding distance is not normalized.
`measure`	`character` vector of length one specifying the measure. Currently available measures are `"pi"` (partition index) and `"si"` (similarity index). If the parameter `prop` is supplied, the encoding distance is normalized by a factor equal to the sum of distances between each group in `a` and the closest group in `b`. The position of a group is defined as the mean value of the properties of the amino acids or nucleotides belonging to the group. See the package vignette for more details.

Value

an encoding distance.

Examples

# calculate encoding distance between two encodings of amino acids
aa1 = list(`1` = c("g", "a", "p", "v", "m", "l", "i"), 
           `2` = c("k", "h"), 
           `3` = c("d", "e"), 
           `4` = c("f", "r", "w", "y", "s", "t", "c", "n", "q"))

aa2 = list(`1` = c("g", "a", "p", "v", "m", "l", "q"), 
           `2` = c("k", "h", "d", "e", "i"), 
           `3` = c("f", "r", "w", "y", "s", "t", "c", "n"))
calc_ed(aa1, aa2, measure = "pi") 
    
# the encoding distance between two identical encodings is 0
calc_ed(aa1, aa1, measure = "pi") 

Calculate IG for single feature

Description

Computes information gain of single feature and target vector.

Usage

calc_ig(feature, target, len_target, pos_target)

Arguments

`feature`	feature vector.
`target`	target.
`len_target`	length of the target vector.
`pos_target`	number of positive cases in the target vector.

Details

The information gain term is used here (improperly) as a synonym of mutual information. It is defined as:

$IG(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \left(\frac{p(x, y)}{p(x) p(y)} \right)$

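In the biogram package, information gain is computed using the following relationship: $IG = E(S) - E(S|F)$, where $E(S)$ is the entropy of the target and $E(S|F)$ is its conditional entropy given the feature.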
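
Both expressions can be checked against each other in base R. The sketch below is not the package's implementation; `entropy_nats`, `ig_entropy` and `ig_double_sum` are hypothetical helpers that use natural logarithms and the convention $0 \log 0 = 0$.

```r
# Shannon entropy in nats of a probability vector, with 0 * log(0) taken as 0.
entropy_nats <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# Information gain as E(S) - E(S|F) for a binary feature and a binary target.
ig_entropy <- function(feature, target) {
  e_s <- entropy_nats(c(mean(target == 0), mean(target == 1)))
  e_s_given_f <- 0
  for (f in 0:1) {
    in_f <- feature == f
    if (any(in_f)) {
      p_t <- c(mean(target[in_f] == 0), mean(target[in_f] == 1))
      # entropy of the target within this feature level, weighted by p(f)
      e_s_given_f <- e_s_given_f + mean(in_f) * entropy_nats(p_t)
    }
  }
  e_s - e_s_given_f
}

# The same quantity from the double-sum definition of mutual information.
ig_double_sum <- function(feature, target) {
  # joint distribution: rows are feature levels, columns are target levels
  joint <- matrix(c(mean(feature == 0 & target == 0),
                    mean(feature == 1 & target == 0),
                    mean(feature == 0 & target == 1),
                    mean(feature == 1 & target == 1)), nrow = 2)
  indep <- outer(rowSums(joint), colSums(joint))
  nonzero <- joint > 0
  sum(joint[nonzero] * log(joint[nonzero] / indep[nonzero]))
}

set.seed(1)
tar <- sample(0L:1, 100, replace = TRUE)
feat <- sample(0L:1, 100, replace = TRUE)
ig_a <- ig_entropy(feat, tar)
ig_b <- ig_double_sum(feat, tar)
```

The two values agree up to floating-point error. When the feature is identical to the target, the information gain equals the entropy of the target.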

Value

A numeric vector of length 1 representing information gain in nats.

Note

During calculations $0 \log 0 = 0$. For a justification see References.

The function was designed to be a fast subroutine of calc_criterion and might be cumbersome if called directly by a user.

References

Cover, T.M. and Thomas, J.A. (2006) Elements of Information Theory, 2nd Edition. Wiley.

Examples

tar <- sample(0L:1, 100, replace = TRUE)
feat <- sample(0L:1, 100, replace = TRUE)
calc_ig(feat, tar, 100, sum(tar))

Calculate KL divergence of features

Description

Computes Kullback-Leibler divergence between features and target vector.

Usage

calc_kl(feature, target, len_target, pos_target)

Arguments

`feature`	feature vector.
`target`	target.
`len_target`	length of the target vector.
`pos_target`	number of positive cases in the target vector.

Value

A numeric vector of length 1 representing Kullback-Leibler divergence value.

Note

Both target and features must be binary, i.e. contain only 0 and 1 values.

The function was designed to be as fast as possible as a subroutine of calc_criterion and might be cumbersome if called directly by a user.

References

Kullback, S. and Leibler, R.A. (1951) On information and sufficiency. Annals of Mathematical Statistics, 22(1):79-86.

Examples

tar <- sample(0L:1, 100, replace = TRUE)
feat <- sample(0L:1, 100, replace = TRUE)
calc_kl(feat, tar, 100, sum(tar))

Calculate partition index

Description

Computes the partition index (an encoding distance) between two encodings.

Usage

calc_pi(a, b)

Arguments

`a`	encoding (see `validate_encoding` for more information about the required structure of encoding).
`b`	encoding to which `a` should be compared. Must have the same number of groups as `a` or fewer. Both `a` and `b` must have the same number of elements.

Details

The encoding distance between a and b is defined as the minimum number of amino acids that have to be moved between subgroups of encoding to make a identical to b (order of subgroups in the encoding and amino acids in a group is unimportant).

If the parameter prop is supplied (see calc_ed), the encoding distance is normalized by a factor equal to the sum of distances between each group in a and the closest group in b. The position of a group is defined as the mean value of the properties of the amino acids or nucleotides belonging to the group.

See the package vignette for more details.
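
The unnormalized definition can be read as an assignment problem: match every group of b to a distinct group of a so that as many elements as possible stay in place, and count the rest as moves. The brute-force sketch below follows this reading. It is not the algorithm used by the package; `permutations` and `min_moves` are hypothetical helpers, and the approach is practical only for a handful of groups.

```r
# All permutations of a vector (recursive; fine for a handful of groups).
permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in permutations(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], p)
    }
  }
  out
}

# Minimum number of elements to move so that encoding a becomes encoding b,
# assuming b has at most as many groups as a and both hold the same elements.
min_moves <- function(a, b) {
  # overlap[i, j]: number of elements shared by group i of a and group j of b
  overlap <- sapply(b, function(gb)
    sapply(a, function(ga) length(intersect(ga, gb))))
  overlap <- matrix(overlap, nrow = length(a), ncol = length(b))
  best_kept <- 0
  # try every ordering of the groups of a; the j-th group of b is matched
  # to the j-th group of a in that ordering
  for (perm in permutations(seq_along(a))) {
    kept <- sum(overlap[cbind(perm[seq_along(b)], seq_along(b))])
    best_kept <- max(best_kept, kept)
  }
  length(unlist(a)) - best_kept
}

aa1 <- list(`1` = c("g", "a", "p", "v", "m", "l", "i"),
            `2` = c("k", "h"),
            `3` = c("d", "e"),
            `4` = c("f", "r", "w", "y", "s", "t", "c", "n", "q"))
aa2 <- list(`1` = c("g", "a", "p", "v", "m", "l", "q"),
            `2` = c("k", "h", "d", "e", "i"),
            `3` = c("f", "r", "w", "y", "s", "t", "c", "n"))
moves_12 <- min_moves(aa1, aa2)
moves_11 <- min_moves(aa1, aa1)
```

For these two encodings the sketch counts four moves (for instance i, q, d and e change groups) and zero moves for identical encodings.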

Value

an encoding distance.

Examples

# calculate encoding distance between two encodings of amino acids
aa1 = list(`1` = c("g", "a", "p", "v", "m", "l", "i"), 
           `2` = c("k", "h"), 
           `3` = c("d", "e"), 
           `4` = c("f", "r", "w", "y", "s", "t", "c", "n", "q"))

aa2 = list(`1` = c("g", "a", "p", "v", "m", "l", "q"), 
           `2` = c("k", "h", "d", "e", "i"), 
           `3` = c("f", "r", "w", "y", "s", "t", "c", "n"))
calc_pi(aa1, aa2) 
    
# the encoding distance between two identical encodings is 0
calc_pi(aa1, aa1)

Compute similarity index

Description

Computes similarity index between two encodings.

Usage

calc_si(a, b)

Arguments

`a`	encoding (see `validate_encoding` for more information about the required structure of encoding).
`b`	encoding to which `a` should be compared. Must have the same number of groups as `a` or fewer. Both `a` and `b` must have the same number of elements.

Details

Briefly, the similarity index is the fraction of elements that have the same pairing in both encodings. Pairing is a binary variable that has the value 1 if two elements are in the same group and 0 if they are not. For more details, see References.
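
One way to make the notion of pairing concrete is to compare, for every unordered pair of elements, whether the pair shares a group in each encoding. The base-R sketch below computes this pairwise agreement. It is one possible reading of the description above, and whether it coincides numerically with calc_si is an assumption; `comembership` and `pair_agreement` are hypothetical helpers.

```r
# Co-membership matrix: TRUE where two elements share a group.
# Rows and columns are ordered alphabetically by element.
comembership <- function(enc) {
  elements <- unlist(enc, use.names = FALSE)
  group_of <- rep(seq_along(enc), lengths(enc))[order(elements)]
  same <- outer(group_of, group_of, "==")
  dimnames(same) <- list(sort(elements), sort(elements))
  same
}

# Fraction of unordered element pairs with the same pairing in both encodings,
# assuming both encodings contain exactly the same elements.
pair_agreement <- function(a, b) {
  pa <- comembership(a)
  pb <- comembership(b)
  upper <- upper.tri(pa)
  mean(pa[upper] == pb[upper])
}

enc1 <- list(`1` = "A",
             `2` = c("F", "E"),
             `3` = c("C", "D", "G"))
enc2 <- list(`1` = c("A", "G"),
             `2` = c("C", "D", "E", "F"))
agreement_12 <- pair_agreement(enc1, enc2)
agreement_11 <- pair_agreement(enc1, enc1)
```

With six elements there are 15 unordered pairs; in this sketch 8 of them have the same pairing in `enc1` and `enc2`.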

Value

the value of similarity index.

References

Stephenson, J.D., and Freeland, S.J. (2013). Unearthing the Root of Amino Acid Similarity. J Mol Evol 77, 159-169.

Examples

# example from Stephenson & Freeland, 2013 (Fig. 6)
enc1 <- list(`1` = "A",
             `2` = c("F", "E"),
             `3` = c("C", "D", "G"))

enc2 <- list(`1` = c("A", "G"),
             `2` = c("C", "D", "E", "F"))

enc3 <- list(`1` = c("D", "G"),
             `2` = c("E", "F"),
             `3` = c("A", "C"))
             
calc_si(enc1, enc2)
calc_si(enc2, enc3)
calc_si(enc1, enc3)

Check chosen criterion

Description

Checks if the criterion is viable or matches it to the list of implemented criteria.

Usage

check_criterion(input_criterion, criterion_names = c("ig", "kl", "cs"))

Arguments

`input_criterion`	`character` string, criterion from input.
`criterion_names`	list of implemented criteria, always in lowercase.

Value

a list of three:

criterion name,
its function,
nice name for outputs.

Clustering of sequences based on regular expression

Description

Clusters sequences hierarchically with regular expressions. At each step, the number of degrees of freedom of all regular expressions needed to describe the data is minimized.

Usage

cluster_reg_exp(ngrams)
cluster_reg_exp(ngrams)

Arguments

`ngrams`	list of elements.

Details

A regular expression is a list of length equal to the length of the input sequences. Each element of the list represents a position in the sequence and contains the amino acids that are likely to occur at this position.
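
The data structure described above can be illustrated in base R. The sketch below is only an example of such a list and of matching a sequence against it; `reg_exp` and `matches_reg_exp` are hypothetical and do not reproduce the clustering itself.

```r
# A regular expression in this sense: one character vector per position.
reg_exp <- list(c("A", "G"), "L", c("K", "R"), c("D", "E", "N"))

# Hypothetical helper: does every residue belong to the set of amino acids
# allowed at its position?
matches_reg_exp <- function(sequence, reg_exp) {
  length(sequence) == length(reg_exp) &&
    all(mapply(function(residue, allowed) residue %in% allowed,
               sequence, reg_exp))
}

match_ok <- matches_reg_exp(c("A", "L", "K", "E"), reg_exp)
match_bad <- matches_reg_exp(c("A", "L", "Q", "E"), reg_exp)
```

The first sequence matches because each residue is allowed at its position; the second does not, because Q is not among the amino acids allowed at the third position.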

Value

a list of four elements:

`regExps`	regular expressions of the best clustering.
`seqClustering`	clustering of sequences in the best clustering.
`allRegExps`	all regular expressions.
`allIndices`	all clusterings.

Examples

data(human_cleave)
#cluster_reg_exp is computationally expensive

results <- cluster_reg_exp(human_cleave[1L:10, 1L:4])

Code n-grams

Description

Code human-friendly representation of n-grams into a biogram format.

Usage

code_ngrams(decoded_ngrams)

Arguments

`decoded_ngrams`	a character vector of decoded n-grams.

Value

a character vector of n-grams.

Examples

code_ngrams(c("11_2", "1__12", "222"))
code_ngrams(c("aaa_b", "d__aa", "abd"))

Construct and filter n-grams

Description

Builds and selects important n-grams stepwise.

Usage

construct_ngrams(
  target,
  seq,
  u,
  n_max,
  conf_level = 0.95,
  gap = TRUE,
  use_heuristics = TRUE
)

Arguments

`target`	`integer` vector with target information (e.g. class labels).
`seq`	a vector or matrix describing sequence(s).
`u`	`integer`, `numeric` or `character` vector of all possible unigrams.
`n_max`	size of constructed n-grams.
`conf_level`	confidence level.
`gap`	`logical`, if `TRUE` gaps are used. See Details.
`use_heuristics`	if `FALSE` then all n-grams are tested. This may slow down computations significantly.

Details

construct_ngrams starts by extracting unigrams from the sequences, pasting them together in all combinations and choosing significant features among them (with p-value below conf_level). The chosen n-grams are further extended to the size specified by n_max by pasting unigrams at both ends.

The gap parameter determines if construct_ngrams performs the feature selection on exact n-grams (gap equal to FALSE) or on all features within Hamming distance 1 of the n-gram (gap equal to TRUE).

Value

a vector of n-grams.

Examples

# to make the example faster, we run construct_ngrams() on the 
# subset of data
deg_seqs <- degenerate(human_cleave[c(1L:100, 801L:900), 1L:9],
list(`1` = c(1, 6, 8, 10, 11, 18),
     `2` = c(2, 13, 14, 16, 17),
     `3` = c(5, 19, 20),
     `4` = c(7, 9, 12, 15),
     '5' = c(3, 4)))
bigrams <- construct_ngrams(human_cleave[c(1L:100, 801L:900), "tar"], deg_seqs, 1L:5, 2)

Detect and count multiple n-grams in sequences

Description

A convenient wrapper around count_ngrams for counting n-grams for multiple values of n and d.

Usage

count_multigrams(
  ns,
  ds = rep(0, length(ns)),
  seq,
  u,
  pos = FALSE,
  scale = FALSE,
  threshold = 0
)

Arguments

`ns`	`numeric` vector of n-grams' sizes. See Details.
`ds`	`list` of distances between elements of n-grams. Each element of the list is a vector used as distance for the respective n-gram size given by the `ns` parameter.
`seq`	a vector or matrix describing sequence(s).
`u`	`integer`, `numeric` or `character` vector of all possible unigrams.
`pos`	`logical`, if `TRUE` position-specific n-grams are counted.
`scale`	`logical`, if `TRUE` output data is normalized. May be applied only to the counts of n-grams without position information. See `Details`.
`threshold`	`integer`, if not equal to 0, data is binarized into two groups (larger or equal to threshold vs. smaller than threshold).

Details

The ns vector and the ds list must have equal length. Elements of ds are used as equivalents of the d parameter for the respective values of ns. For example, if ns is c(4, 4, 4), ds must be a list of length 3. Each element of the ds list must have length 3 or 1, as appropriate for the d parameter of the count_ngrams function.

Value

An integer matrix with named columns. The naming conventions are the same as in count_ngrams.

Examples

seqs <- matrix(sample(1L:4, 600, replace = TRUE), ncol = 50)
count_multigrams(c(3, 1), list(c(1, 0), 0), seqs, 1L:4, pos = TRUE)
# if ds parameter is not present, n-grams are calculated for distance 0
count_multigrams(c(3, 1), seq = seqs, u = 1L:4)

# calculate three times n-gram with the same length, but different distances between
# elements
count_multigrams(c(4, 4, 4), list(c(2, 0, 1), c(2, 1, 0), c(0, 1, 2)), 
                 seqs, 1L:4, pos = TRUE)

Count n-grams in sequences

Description

Counts all n-grams or position-specific n-grams present in the input sequence(s).

Usage

count_ngrams(seq, n, u, d = 0, pos = FALSE, scale = FALSE, threshold = 0)

Arguments

`seq`	a vector or matrix describing sequence(s).
`n`	`integer` size of n-gram.
`u`	`integer`, `numeric` or `character` vector of all possible unigrams.
`d`	`integer` vector of distances between elements of n-gram (0 means consecutive elements). See Details.
`pos`	`logical`, if `TRUE` position-specific n-grams are counted.
`scale`	`logical`, if `TRUE` output data is normalized. May be applied only to the counts of n-grams without position information. See `Details`.
`threshold`	`integer`, if not equal to 0, data is binarized into two groups (larger or equal to threshold vs. smaller than threshold).

Details

A distance vector should be always n - 1 in length. For example when n = 3, d = c(1,2) means A_A__A. For n = 4, d = c(2,0,1) means A__AA_A. If vector d has length 1, it is recycled to length n - 1.
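
The relation between d and the layout of an n-gram can be sketched in base R. The helper name ngram_offsets is hypothetical and not part of biogram; it only illustrates, under the rules stated above, how distances translate into offsets of the n-gram elements.

```r
# Hypothetical helper (not part of biogram): offsets of n-gram elements
# relative to the first element, for size n and distance vector d.
ngram_offsets <- function(n, d = 0) {
  # a length-1 d is recycled to length n - 1, as described above
  if (length(d) == 1) d <- rep(d, n - 1)
  stopifnot(length(d) == n - 1)
  # each element lies d + 1 positions after the previous one
  cumsum(c(0, d + 1))
}

ngram_offsets(3, c(1, 2))     # 0 2 5, i.e. A_A__A
ngram_offsets(4, c(2, 0, 1))  # 0 3 4 6, i.e. A__AA_A
```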

n-gram names follow a specific convention and have three parts for position-specific n-grams and two parts otherwise. The parts are separated by _. The . symbol is used to separate elements within a part. The general naming scheme is POSITION_NGRAM_DISTANCE. The optional POSITION part of the name indicates the actual position of the n-gram in the sequence(s) and will be present only if pos = TRUE. This part is always a single integer. The NGRAM part of the name is a sequence of elements in the n-gram. For example, 4.2.2 indicates the n-gram 422 (e.g. TCC). The DISTANCE part of the name is a vector of distance(s). For example, 0.0 indicates zero distances (continuous n-grams), while 1.2 represents distances for the n-gram A_A__A.

Examples of n-gram names:

46_4.4.4_0.1 : trigram 44_4 on position 46
12_2.1_2 : bigram 2__1 on position 12
8_1.1.1_0.0 : continuous trigram 111 on position 8
1.1.1_0.0 : continuous trigram 111 without position information
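
The naming scheme can be illustrated by splitting a name back into its parts. The function parse_ngram_name below is a hypothetical base R sketch, not a biogram function, and it assumes that unigram labels contain neither _ nor . characters.

```r
# Hypothetical helper (not part of biogram): split an n-gram name into parts.
parse_ngram_name <- function(x) {
  parts <- strsplit(x, "_", fixed = TRUE)[[1]]
  # three parts mean that the POSITION part is present
  has_pos <- length(parts) == 3
  split_dots <- function(p) strsplit(p, ".", fixed = TRUE)[[1]]
  list(
    position = if (has_pos) as.integer(parts[1]) else NA_integer_,
    ngram    = split_dots(parts[length(parts) - 1]),
    distance = as.integer(split_dots(parts[length(parts)]))
  )
}

parse_ngram_name("46_4.4.4_0.1")  # position 46, elements 4 4 4, distances 0 1
parse_ngram_name("1.1.1_0.0")     # no position, elements 1 1 1, distances 0 0
```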

Value

a simple_triplet_matrix where columns represent n-grams and rows sequences. See Details for specifics of the naming convention.

Note

By default, the counted n-gram data is stored in a memory-saving format. To convert an object to a 'classical' matrix use the as.matrix function. See examples for further information.

Examples

# count trigrams without position information for nucleotides
count_ngrams(sample(1L:4, 50, replace = TRUE), 3, 1L:4, pos = FALSE)
# count position-specific trigrams from multiple nucleotide sequences
seqs <- matrix(sample(1L:4, 600, replace = TRUE), ncol = 50)
ngrams <- count_ngrams(seqs, 3, 1L:4, pos = TRUE)
# output results of the n-gram counting to screen
as.matrix(ngrams)

Count specified n-grams

Description

Counts specified n-grams in the input sequence(s).

Usage

count_specified(seq, ngrams)

Arguments

`seq`	vector or matrix describing sequence(s).
`ngrams`	vector of n-grams.

Details

count_specified counts only the n-grams declared by the user in the ngrams parameter. Declared n-grams must be written using the biogram notation.

Value

A simple_triplet_matrix where columns represent n-grams and rows sequences.

Examples

seqs <- matrix(c(1, 2, 2, 1, 2, 1, 2, 1, 1, 2, 3, 4, 1, 2, 2, 4), nrow = 2)
count_specified(seqs, ngrams = c("1.1.1_0.0", "2.2.2_0.0", "1.1.2_0.0"))

seqs <- matrix(sample(1L:5, 200, replace = TRUE), nrow = 20)
count_specified(seqs, ngrams = c("2_4.2_0", "2_1.4_0", "3_1.3_0",
                                 "2_4.2_1", "2_1.4_1", "3_1.3_1",
                                 "2_4.2_2", "2_1.4_2", "3_1.3_2"))

Count total number of n-grams

Description

Computes total number of n-grams that can be extracted from sequences.

Usage

count_total(seq, n, d)

Arguments

`seq`	a vector or matrix describing sequence(s).
`n`	`integer` size of n-gram.
`d`	`integer` vector of distances between elements of n-gram (0 means consecutive elements). See Details.

Details

The maximum number of possible n-grams is limited by their length and the distance between elements of the n-gram.
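
The counting argument can be sketched as follows: an n-gram with distances d spans n + sum(d) positions, so a sequence of length L can hold at most L - n - sum(d) + 1 of them. The helper total_ngrams_sketch is hypothetical and is not the package implementation; it assumes a matrix input in which NA only pads the ends of shorter sequences.

```r
# Hypothetical helper (not part of biogram): number of n-grams that fit
# into each row of a sequence matrix, treating NA as padding at the end.
total_ngrams_sketch <- function(seq, n, d) {
  if (length(d) == 1) d <- rep(d, n - 1)
  span <- n + sum(d)                 # positions covered by one n-gram
  len_seq <- rowSums(!is.na(seq))    # effective length of every sequence
  sum(pmax(0, len_seq - span + 1))
}

seqs <- matrix(1L, nrow = 2, ncol = 10)
seqs[2, 8:10] <- NA                    # second sequence has length 7
total_ngrams_sketch(seqs, 3, c(1, 0))  # (10 - 4 + 1) + (7 - 4 + 1) = 11
```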

Value

An integer representing the total number of n-grams.

Note

The format of the d vector is discussed in Details of count_ngrams.

Examples

seqs <- matrix(sample(1L:4, 600, replace = TRUE), ncol = 50)
# make several sequences shorter by replacing them partially with NA
seqs[8L:11, 46L:50] <- NA
seqs[1L, 31L:50] <- NA
count_total(seqs, 3, c(1, 0))

Create encoding

Description

Reduces an alphabet using physicochemical properties.

Usage

create_encoding(prop, len)

Arguments

`prop`	`matrix` of properties with the number of columns equal to the length of the alphabet. Columns must be named after elements of the alphabet. Each row represents a different physicochemical property.
`len`	length of the resulting encoding. Must be larger than zero and smaller than number of elements in the alphabet.

Details

The encoding is a list of groups to which elements of an alphabet should be reduced. All elements of the alphabet (all amino acids or all nucleotides) should appear in the encoding.

Value

An encoding.

Examples

enc1 = list(`1` = c("a", "t"), 
            `2` = c("g", "c"))
encoding2df(enc1)

Create feature according to given contingency matrix

Description

Creates a matrix of features and target based on the values from contingency matrix.

Usage

create_feature_target(n11, n01, n10, n00)

Arguments

`n11`	number of elements for which both target and feature equal 1.
`n01`	number of elements for which target and feature equal 1,0 respectively.
`n10`	number of elements for which target and feature equal 0,1 respectively.
`n00`	number of elements for which both target and feature equal 0.

Value

a matrix of 2 columns and n11+n10+n01+n00 rows. Columns represent target and feature vectors, respectively.
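
A minimal base R sketch of how such a matrix can be built from the four counts is given below. The function feature_target_sketch is a hypothetical illustration, not the package source; in particular, the order of rows within the matrix is an assumption.

```r
# Hypothetical re-implementation for illustration (not the package source);
# the row order within the matrix is an assumption.
feature_target_sketch <- function(n11, n01, n10, n00) {
  # n01: target 1, feature 0; n10: target 0, feature 1 (see Arguments)
  target  <- c(rep(1, n11), rep(1, n01), rep(0, n10), rep(0, n00))
  feature <- c(rep(1, n11), rep(0, n01), rep(1, n10), rep(0, n00))
  cbind(target, feature)
}

tf <- feature_target_sketch(10, 375, 15, 600)
dim(tf)  # 1000 rows, 2 columns
table(feature = tf[, "feature"], target = tf[, "target"])
```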

Examples

# equivalent of 
#         target
# feature 10 375
#        15 600
target_feature <- create_feature_target(10, 375, 15, 600)

Get all possible n-Grams

Description

Creates the vector of all possible n-grams (for given n).

Usage

create_ngrams(n, u, possible_grams = NULL)

Arguments

`n`	`integer` size of n-gram.
`u`	`integer`, `numeric` or `character` vector of all possible unigrams.
`possible_grams`	number of possible n-grams. If `NULL`, n-grams do not contain information about position.

Details

See Details section of count_ngrams for more information about n-grams naming convention. The possible information about distance must be added by hand (see examples).

Value

a character vector. Elements of n-gram are separated by dot.

Note

Input data must be a matrix or data frame of numeric elements.

Examples

# bigrams for standard amino acids
create_ngrams(2, 1L:20)
# bigrams for standard amino acids with positions, 10 amino acid long sequence, so 
# only 9 bigrams can be located in sequence
create_ngrams(2, 1L:20, 9)
# bigrams for DNA with positions, 10 nucleotide long sequence, distance 1, so only 
# 8 bigrams in sequence
# paste0 adds information about distance at the end of n-gram
paste0(create_ngrams(2, 1L:4, 8), "_1")

criterion_distribution class

Description

A result of distr_crit function.

Details

An object of class criterion_distribution is a numeric matrix.

Data

1st column: possible values of criterion.
2nd column: probability density function.
3rd column: cumulative distribution function.

Attributes

plot_data: A matrix with values of the criterion and their probabilities.
nice_name: 'Nice' name of the criterion.

Categorize tested features

Description

Categorizes results of test_features function into groups based on their significance.

Usage

## S3 method for class 'feature_test'
cut(x, split = "significances", breaks = c(0, 1e-04, 0.01, 0.05, 1), ...)

Arguments

`x`	an object of class `feature_test`.
`split`	attribute along which output should be categorized. Possible values are `"significances"`, `"positives"` and `"negatives"`. See `Value`.
`breaks`	a vector of significances or frequencies along which n-grams are aggregated. See description of `cut` function and `Details`.
`...`	further parameters accepted by the `cut` function.

Value

The value of the function depends on the split parameter. The function returns a named list whose length equals the length of the vector of significances (when split equals "significances") or frequencies (when split equals "positives" or "negatives") minus one. Each element of the list contains the names of the n-grams belonging to the given significance or frequency group.
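
The grouping idea can be sketched with the base R cut and split functions. The snippet below is not the method's implementation; the p-values and their n-gram names are made up for illustration only.

```r
# Hypothetical p-values named after n-grams (made up for illustration).
p_values <- c("1_1.1_0" = 0.00001, "2_1.2_0" = 0.005,
              "1_2.2_0" = 0.03,    "3_2.1_0" = 0.5)
breaks <- c(0, 1e-04, 0.01, 0.05, 1)

# cut() assigns each p-value to an interval; split() groups the names
groups <- split(names(p_values),
                cut(p_values, breaks, include.lowest = TRUE))
length(groups)  # 4, i.e. length(breaks) - 1
groups
```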

Decode n-grams

Description

Transforms a vector of n-grams into a human-friendly form.

Usage

decode_ngrams(ngrams)

Arguments

ngrams

a character vector of n-grams.

Value

a character vector of length equal to the number of n-grams.

Note

Decoded n-grams lose the position information.

Examples

decode_ngrams(c("2_1.1.2_0.1", "3_1.1.2_2.0", "3_2.2.2_0.0"))

Degenerate protein sequence

Description

'Degenerates' amino acid or nucleic acid sequence by aggregating elements to bigger groups.

Usage

degenerate(seq, element_groups)

Arguments

`seq`	`character` vector or matrix representing single sequence.
`element_groups`	encoding of elements: list of groups to which elements of sequence should be aggregated. Must have unique names.

Value

A character vector or matrix (if input is a matrix) containing aggregated elements.

Note

Characters not present in the element_groups will be converted to NA with a warning.
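
The aggregation can be thought of as a lookup from each element to the name of its group. The function degenerate_sketch below is a hypothetical base R illustration of that idea and is not the package implementation.

```r
# Hypothetical helper (not part of biogram): map elements to group names.
degenerate_sketch <- function(seq, element_groups) {
  # lookup table: element (as character) -> name of its group
  lookup <- setNames(rep(names(element_groups), lengths(element_groups)),
                     unlist(element_groups))
  res <- unname(lookup[as.character(seq)])
  if (anyNA(res)) warning("Some elements are not present in element_groups.")
  # keep the matrix shape if the input was a matrix
  if (is.matrix(seq)) dim(res) <- dim(seq)
  res
}

degenerate_sketch(c(1, 3, 1, 3, 4, 4, 3, 1, 2),
                  list(w = c(1, 4), s = c(2, 3)))
# "w" "s" "w" "s" "w" "w" "s" "w" "s"
```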

Examples

sample_seq <- c(1, 3, 1, 3, 4, 4, 3, 1, 2)
table(sample_seq)

# aggregate sequence to purines and pyrimidines
deg_seq <- degenerate(sample_seq, list(w = c(1, 4), s = c(2, 3)))
table(deg_seq)

Degenerate n-grams

Description

'Degenerates' n-grams by aggregating amino acid or nucleotide elements into bigger groups.

Usage

degenerate_ngrams(x, element_groups, binarize = FALSE)

Arguments

`x`	a `simple_triplet_matrix` or matrix of n-grams.
`element_groups`	encoding of elements: list of groups to which elements of n-grams should be aggregated. Must have unique names.
`binarize`	logical indicating if n-grams should be binarized

Value

Depending on the class of x, a simple_triplet_matrix or matrix of degenerated n-grams.

Compute criterion distribution

Description

Computes criterion distribution under null hypothesis for all contingency tables possible for a feature and a target.

Usage

distr_crit(target, feature, criterion = "ig", iter_limit = 200)

Arguments

`target`	{0,1}-valued target vector. See Details.
`feature`	{0,1}-valued feature vector. See Details.
`criterion`	criterion used for calculations of distribution. See `calc_criterion` for the list of available criteria.
`iter_limit`	limit the number of calculated contingency matrices. If `NULL`, computes all possible contingency matrices.

Details

Both target and feature vectors may contain only 0 and 1.

Value

An object of class criterion_distribution.

Examples

target_feature <- create_feature_target(10, 375, 15, 600) 
distr_crit(target = target_feature[,1], feature = target_feature[,2])

Convert encoding to data frame

Description

Converts an encoding to a data frame.

Usage

encoding2df(x, sort = FALSE)

Arguments

`x`	encoding.
`sort`	if `TRUE` rows are sorted according to elements.

Details

The encoding is a list of groups to which elements of an alphabet should be reduced. All elements of the alphabet (all amino acids or all nucleotides) should appear in the encoding.

Value

data frame with two columns. First column represents an index of a group in the supplied encoding and the second column contains all elements of the encoding.
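
The conversion can be sketched in base R as follows. The function encoding_df_sketch is hypothetical and not the package implementation; the column names group and element are assumptions made for this illustration.

```r
# Hypothetical helper (not part of biogram); the column names are assumptions.
encoding_df_sketch <- function(x, sort = FALSE) {
  res <- data.frame(group = rep(names(x), lengths(x)),
                    element = unlist(x, use.names = FALSE),
                    stringsAsFactors = FALSE)
  # optionally order rows by element
  if (sort) res <- res[order(res[["element"]]), ]
  rownames(res) <- NULL
  res
}

enc1 <- list(`1` = c("a", "t"), `2` = c("g", "c"))
encoding_df_sketch(enc1, sort = TRUE)
```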

Examples

create_encoding(aaprop[1L:5, ], 5)

2d cross-tabulation

Description

Quickly cross-tabulates two binary vectors.

Usage

fast_crosstable(target, len_target, pos_target, feature)

Arguments

`target`	target.
`len_target`	length of the target vector.
`pos_target`	number of positive cases in the target vector.
`feature`	feature vector.

Details

Input looks odd, but the function was built to be a fast subroutine of calc_ig, which works on many features but only one target.

Value

a vector of length four:

target +, feature+
target +, feature-
target -, feature+
target -, feature-
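
The four counts can be derived with a single pass over the feature once the length and the number of positives of the target are known, which is why these two values are passed as arguments. The function crosstable_sketch is a hypothetical base R illustration, not the package implementation.

```r
# Hypothetical helper (not part of biogram): 2 x 2 counts of two binary vectors.
crosstable_sketch <- function(target, len_target, pos_target, feature) {
  n_tar_f   <- sum(target * feature)               # target +, feature +
  n_tar_nf  <- pos_target - n_tar_f                # target +, feature -
  n_ntar_f  <- sum(feature) - n_tar_f              # target -, feature +
  n_ntar_nf <- len_target - pos_target - n_ntar_f  # target -, feature -
  c(n_tar_f, n_tar_nf, n_ntar_f, n_ntar_nf)
}

tar  <- c(1, 1, 1, 0, 0, 0, 0, 1)
feat <- c(1, 0, 1, 1, 0, 0, 1, 1)
crosstable_sketch(tar, length(tar), sum(tar), feat)  # 3 1 2 2
```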

Note

Binary vector means a numeric vector with 0 or 1.

Examples

tar <- sample(0L:1, 100, replace = TRUE)
feat <- sample(0L:1, 100, replace = TRUE)
fast_crosstable(tar, length(tar), sum(tar),  feat)

feature_test class

Description

A result of test_features function.

Details

An object of the feature_test class is a numeric vector of p-values. Additional attributes further characterize the details of the test which returned these p-values.

Attributes

criterion: the criterion used in permutation test.
adjust: the name of p-value adjusting method.
times: the number of permutations. NA if QuiPT was chosen.
occ: frequency of features split into subsets based on the value of the target.

Convert encoding from full to simple format

Description

Converts an encoding from the full format to the simple format.

Usage

full2simple(x)

Arguments

x

encoding.

Examples

aa1 = list(`1` = c("g", "a", "p", "v", "m", "l", "i"), 
           `2` = c("k", "h"), 
           `3` = c("d", "e"), 
           `4` = c("f", "r", "w", "y", "s", "t", "c", "n", "q"))
full2simple(aa1)

Gap n-grams

Description

Introduces gaps in the n-grams.

Usage

gap_ngrams(ngrams)

Arguments

ngrams

a vector of positioned n-grams (as created by count_ngrams).

Details

A single element of the input n-gram at a time will be replaced by a gap. For example, introducing gaps in the n-gram 2_1.1.2_0.1 will result in three n-grams: 3_1.2_1 (where the 2_1_0 unigram was replaced by a gap), 2_1.2_2 and 2_1.1_0.
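
The example can be reproduced with a short base R sketch. The function gap_sketch is hypothetical and not the package implementation; it handles a single positioned n-gram and assumes the naming convention described in Details of count_ngrams.

```r
# Hypothetical helper (not part of biogram): gap a single positioned n-gram.
gap_sketch <- function(ngram) {
  parts <- strsplit(ngram, "_", fixed = TRUE)[[1]]
  elems <- strsplit(parts[2], ".", fixed = TRUE)[[1]]
  d <- as.integer(strsplit(parts[3], ".", fixed = TRUE)[[1]])
  # absolute position of every element in the sequence
  abs_pos <- as.integer(parts[1]) + cumsum(c(0, d + 1))
  vapply(seq_along(elems), function(i) {
    keep_pos <- abs_pos[-i]                 # drop the i-th element
    new_d <- diff(keep_pos) - 1             # distances between the rest
    if (length(new_d) == 0) new_d <- 0      # a unigram gets distance 0
    paste(keep_pos[1],
          paste(elems[-i], collapse = "."),
          paste(new_d, collapse = "."),
          sep = "_")
  }, character(1))
}

gap_sketch("2_1.1.2_0.1")  # "3_1.2_1" "2_1.2_2" "2_1.1_0"
```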

Value

A character vector of (n-1)-grams with introduced gaps.

Examples

gap_ngrams(c("2_1.1.2_0.1", "3_1.1.2_0.0", "3_2.2.2_0.0"))
gap_ngrams(c("1.1.2_0.1", "1.1.2_0.0", "2.2.2_0.0"))

Generate sequence

Description

Generates a sequence using an alphabet of unigrams and a set of rules.

Usage

generate_sequence(alphabet, regions)

Arguments

`alphabet`	the unigram alphabet. Columns are equivalent to unigrams and rows to particular properties.
`regions`	a list of rules describing regions.

Generate single region

Description

Generates a region using an alphabet of unigrams and the provided set of rules.

Usage

generate_single_region(alphabet, reg_len, prop_ranges, exactness)

Arguments

`alphabet`	the unigram alphabet. Columns are equivalent to unigrams and rows to particular properties.
`reg_len`	the number of unigrams inside the region.
`prop_ranges`	required intervals of properties of unigrams in the region. See Details.
`exactness`	a `numeric` value between 0 and 1 defining how strictly unigrams are kept within `prop_ranges`. If 1, only unigrams within `prop_ranges` are inside the region. If 0.9, there is a 10% chance that unigrams outside `prop_ranges` will be inside the region.

Examples

props1 <- list(P1 = c(0, 0.5), 
               P2 = c(0.2, 0.4),
               P3 = c(0.5, 1),
               P4 = c(0, 0))

props2 <- list(P1 = c(0.5, 1), 
               P2 = c(0.4, 1),
               P3 = c(0, 0.5),
               P4 = c(1, 1))


alph <- generate_unigrams(c(replicate(8, props1, simplify = FALSE),
                            replicate(12, props2, simplify = FALSE)),
                          unigram_names = letters[1L:20])

rules1 <- list(P1 = c(0.5, 1), 
               P2 = c(0.4, 1),
               P3 = c(0, 0.5),
               P4 = c(1, 1))

generate_single_region(alph, 10, rules1, 0.9)

Generate single unigram

Description

Assign randomly generated properties to a single unigram.

Usage

generate_single_unigram(unigram_ranges)

Arguments

unigram_ranges

list of ranges containing respective properties. If named, names are preserved.

Examples

generate_single_unigram(list(P1 = c(0, 0.5), 
                             P2 = c(0.2, 0.4),
                             P3 = c(0.5, 1),
                             P4 = c(0, 0)))

Generate unigrams

Description

Generates an alphabet of unigrams based on given list of properties.

Usage

generate_unigrams(unigram_list, unigram_names = NULL, prop_names = NULL)

Arguments

`unigram_list`	a list of unigrams' parameters. See Details.
`unigram_names`	names of unigrams. If not `NULL`, will overwrite any existing unigram names.
`prop_names`	names of properties. If not `NULL`, will overwrite any existing names.

Details

Unigram parameters are represented as a list of intervals, where each interval corresponds to a different property. The function generates unigrams by randomly choosing values of properties from the given intervals using a uniform distribution. All lists of ranges should have the same length, which is equivalent to describing each unigram using the same properties.
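
The sampling step can be sketched in base R with runif. The function draw_unigram is hypothetical and not the package implementation; it only illustrates drawing one value per property from its interval.

```r
# Hypothetical helper (not part of biogram): draw one unigram's properties.
draw_unigram <- function(unigram_ranges) {
  vapply(unigram_ranges,
         function(r) runif(1, min = r[1], max = r[2]),
         numeric(1))
}

set.seed(1)
props1 <- list(P1 = c(0, 0.5), P2 = c(0.2, 0.4), P3 = c(0.5, 1), P4 = c(0, 0))
# columns are unigrams and rows are properties, as in the alphabet format
alph <- replicate(3, draw_unigram(props1))
colnames(alph) <- letters[1:3]
alph
```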

Examples

props1 <- list(P1 = c(0, 0.5), 
               P2 = c(0.2, 0.4),
               P3 = c(0.5, 1),
               P4 = c(0, 0))

props2 <- list(P1 = c(0.5, 1), 
               P2 = c(0.4, 1),
               P3 = c(0, 0.5),
               P4 = c(1, 1))


alph <- generate_unigrams(c(replicate(8, props1, simplify = FALSE),
                          replicate(12, props2, simplify = FALSE)),
                          unigram_names = letters[1L:20])
                          
Get indices of n-grams

Description

Computes list of n-gram elements positions in sequence.

Usage

get_ngrams_ind(len_seq, n, d)

Arguments

`len_seq`	`integer` value describing sequence's length.
`n`	`integer` size of n-gram.
`d`	`integer` vector of distances between elements of n-gram (0 means consecutive elements). See Details.

Details

A format of d vector is discussed in Details of count_ngrams.

Value

A list with the number of elements equal to n. Every element is a vector containing locations of the given n-gram letter. For example, the first element of the list contains indices of the first letter of all n-grams. The attribute d of the output contains the distances between letters used to compute the locations (see Details).
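
The structure of the returned list can be sketched in base R. The function ngram_indices_sketch is hypothetical and not the package implementation; unlike the documented value, it does not set the d attribute.

```r
# Hypothetical helper (not part of biogram): indices of n-gram elements.
ngram_indices_sketch <- function(len_seq, n, d = 0) {
  if (length(d) == 1) d <- rep(d, n - 1)
  offsets <- cumsum(c(0, d + 1))      # offset of each element
  n_starts <- len_seq - max(offsets)  # number of possible start positions
  stopifnot(n_starts >= 1)
  starts <- seq_len(n_starts)
  lapply(offsets, function(o) starts + o)
}

# continuous trigrams in a sequence of length 10:
# first letters at 1..8, second at 2..9, third at 3..10
ngram_indices_sketch(10, 3, 0)
```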

Examples

# positions of trigrams in a sequence of length 10
get_ngrams_ind(10, 3, 0)

Human signal peptides cleavage sites

Description

A set of 648 cleavage sites and 648 parts of mature proteins shortly after cleavage sites derived from human proteome.

Format

A data frame with 1296 observations on the following 10 variables. Columns from P1 to P9 describe positions in an extracted peptide. tar is a target vector. It has value 1 if a peptide is a cleavage site and 0 if not.

Details

Each peptide in the data set is nine amino acid residues long. In the case of cleavage sites, the cleavage is located between the fifth and sixth residue. The non-cleavage sites are parts of mature proteins starting five positions after the cleavage site.

Note

Amino acid residues were recoded as integers.

Source

UniProt

Examples


data(human_cleave)
table(human_cleave[, 1])

Validate n-gram

Description

Checks if the character string may be used as an n-gram and its notation follows specific convention of biogram package.

Usage

is_ngram(x)

Arguments

`x`	`character` string representing single n-gram.

Value

TRUE if n-gram's notation is correct, FALSE if not.

Examples

print(is_ngram("1_1.1.1_0.0"))
print(is_ngram("not_ngram"))

Convert letters to numbers

Description

Converts biological sequence from letter to number notation.

Usage

l2n(seq, seq_type)

Arguments

`seq`	`character` vector or matrix representing single sequence.
`seq_type`	the type of sequence. Can be `rna`, `dna` or `prot`.

Value

a numeric vector or matrix containing converted elements.
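
The conversion can be pictured as a lookup in an ordered alphabet. The sketch below is base R only and not the package implementation; the alphabet ordering (one-letter amino acid codes sorted alphabetically) is an assumption inferred from the examples in this manual, and the function names are hypothetical.

```r
# Assumed ordering, inferred from the examples in this manual only:
# amino acids sorted alphabetically by their one-letter codes.
aa_alphabet <- c("a", "c", "d", "e", "f", "g", "h", "i", "k", "l",
                 "m", "n", "p", "q", "r", "s", "t", "v", "w", "y")

# letters -> numbers: position of each residue in the alphabet
letters_to_numbers <- function(seq) match(tolower(seq), aa_alphabet)
# numbers -> letters: the reverse lookup
numbers_to_letters <- function(seq) aa_alphabet[seq]

sample_seq <- c("a", "d", "d", "g", "a", "g", "n", "a", "l")
letters_to_numbers(sample_seq)  # 1 3 3 6 1 6 12 1 10
```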

Examples

sample_seq <- c("a", "d", "d", "g", "a", "g", "n", "a", "l")
l2n(sample_seq, "prot")

Get lengths of the n-grams

Description

Computes the length of n-grams.

Usage

lengths_ngrams(ngrams)

Arguments

ngrams

a character vector of n-grams.

Value

A numeric vector of n-gram lengths.

Examples

lengths_ngrams(c("2_1.1.2_0.1", "3_1.1.2_2.0", "3_2.2.2_0.0"))

Convert list of sequences to matrix

Description

Converts list of sequences to matrix.

Usage

list2matrix(seq_list)

Arguments

seq_list

list of sequences (e.g. as returned by the read.fasta function).

Value

A matrix with the number of rows equal to the number of sequences and the number of columns equal to the length of the longest sequence.

Note

Since a matrix must have a specified number of columns, the ends of shorter sequences are padded with NAs.
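
The padding can be sketched in base R by indexing each sequence past its end, which yields NA. The function list_to_matrix_sketch is hypothetical and not the package implementation.

```r
# Hypothetical helper (not part of biogram): pad sequences with NA.
list_to_matrix_sketch <- function(seq_list) {
  max_len <- max(lengths(seq_list))
  # indexing past the end of a vector yields NA, which pads short sequences
  padded <- lapply(seq_list, function(s) s[seq_len(max_len)])
  do.call(rbind, padded)
}

m <- list_to_matrix_sketch(list(s1 = c("c", "g", "g", "t"),
                                s2 = c("g", "t", "c", "t", "t", "g"),
                                s3 = c("a", "a", "t")))
dim(m)  # 3 6
```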

Examples

list2matrix(list(s1 = c("c", "g", "g", "t"),
                 s2 = c("g", "t", "c", "t", "t", "g"),
                 s3 = c("a", "a", "t")))

Convert numbers to letters

Description

Converts biological sequence from number to letter notation.

Usage

n2l(seq, seq_type)

Arguments

`seq`	`integer` vector or matrix representing single sequence.
`seq_type`	the type of sequence. Can be `rna`, `dna` or `prot`.

Value

a character vector or matrix containing converted elements.

Examples

sample_seq <- c(1, 3, 3, 6, 1, 6, 12, 1, 10)
n2l(sample_seq, "prot")

n-grams to data frame

Description

Transforms a vector of n-grams into a data frame.

Usage

ngrams2df(ngrams)

Arguments

ngrams

a character vector of n-grams.

Value

a data.frame with two (in case of n-grams without known position) or three columns (n-grams with position information).

Examples

ngrams2df(c("2_1.1.2_0.0", "3_1.1.2_0.0", "3_2.2.2_0.0", "2_1.1_0"))

Plot criterion distribution

Description

Plots results of distr_crit function.

Usage

## S3 method for class 'criterion_distribution'
plot(x, ...)

Arguments

`x`	object of class `criterion_distribution`.
`...`	further arguments passed to `plot`.

Value

nothing.

Examples

target_feature <- create_feature_target(10, 375, 15, 600) 
example_result <- distr_crit(target = target_feature[,1], 
                             feature = target_feature[,2])
plot(example_result)

# a ggplot2 plot
library(ggplot2)
ggplot_distr <- function(x) {
  # combine the x values (row names of plot_data) with the plotted columns
  b <- data.frame(cbind(x = as.numeric(rownames(attr(x, "plot_data"))), 
                        attr(x, "plot_data")))
  # one panel for the criterion, one for the probability
  d1 <- cbind(b[, c(1, 2)], attr(x, "nice_name"))
  d2 <- cbind(b[, c(1, 3)], "Probability")
  colnames(d1) <- c("x", "y", "panel")
  colnames(d2) <- c("x", "y", "panel")
  d <- rbind(d1, d2)
  p <- ggplot(data = d, mapping = aes(x = x, y = y)) + 
    facet_grid(panel ~ ., scales = "free") + 
    geom_freqpoly(data = d2, aes(color = y), stat = "identity") + 
    scale_fill_brewer(palette = "Set1") + 
    geom_point(data = d1, aes(size = y), stat = "identity") + 
    guides(color = "none") + 
    guides(size = "none") + 
    xlab("Number of cases with feature=1 and target=1") + ylab("")
  p
}
ggplot_distr(example_result)

Position n-grams

Description

Transforms a vector of positioned n-grams into a list of positions, each filled with the n-grams that start at that position.

Usage

position_ngrams(ngrams, df = FALSE, unigrams_output = TRUE)

Arguments

`ngrams`	a vector of positioned n-grams (as created by `count_ngrams`).
`df`	logical, if `TRUE` returns a data frame, if `FALSE` returns a list.
`unigrams_output`	logical, if `TRUE` extracts unigrams from the data and returns information about their position.

Value

if df is FALSE, returns a list of length equal to the number of unique n-gram starts present in n-grams. Each element of the list contains the n-grams that start at this position. If df is TRUE, returns a data frame where the first column contains n-grams and the second column represents their start positions.

Examples

# position data in the list format
position_ngrams(c("2_1.1.2_0.1", "3_1.1.2_0.0", "3_2.2.2_0.0"))
# position data in the data frame format
position_ngrams(c("2_1.1.2_0.1", "3_1.1.2_0.0", "3_2.2.2_0.0"), df = TRUE)
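The grouping by start position can be illustrated with base R. This is a rough sketch of the idea only: it does not reproduce the `unigrams_output` behaviour, and the exact shape of the values returned by `position_ngrams` may differ. The variable names are hypothetical.

```r
ngrams <- c("2_1.1.2_0.1", "3_1.1.2_0.0", "3_2.2.2_0.0")

# the start position is everything before the first underscore
starts <- as.integer(sub("_.*$", "", ngrams))
# the remainder describes the n-gram itself (elements and distances)
bodies <- sub("^[0-9]+_", "", ngrams)

# list format: one element per unique start position
by_start <- split(bodies, starts)

# data frame format: n-gram in the first column, start position in the second
as_df <- data.frame(ngram = bodies, position = starts, stringsAsFactors = FALSE)

print(by_start)
print(as_df)
```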

Print tested features

Description

Prints results of test_features function.

Usage

## S3 method for class 'feature_test'
print(x, ...)

Arguments

`x`	object of class `feature_test`.
`...`	further arguments passed to `print.default`.

Value

nothing.

Read FASTA files

Description

A lightweight tool to read nucleic or amino-acid sequences from a file in FASTA format.

Usage

read_fasta(file)

Arguments

`file`	the name of the file from which the data are to be read.

Value

a list of sequences.

Examples

## Not run: 
  read_fasta("https://www.uniprot.org/uniprot/P28307.fasta")
  
## End(Not run)

Read sequences from .txt file

Description

Reads sequence data saved in a text file.

Usage

read_txt(connection)

Arguments

`connection`	a connection to the text (.txt) file.

Details

The input file should contain one or more amino acid sequences separated by empty line(s).

Value

a list of sequences.

Examples

sequences <- read_txt(system.file("PlastoGram/sequences.txt", package = "PlastoGram"))
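The input layout described in Details (sequences separated by one or more empty lines) can be mimicked without any file by using a `textConnection`. The sketch below shows one possible way to split such input in base R; `read_blocks` is a hypothetical helper, and returning each sequence as a vector of single characters is an assumption rather than the documented behaviour of `read_txt`.

```r
# Hypothetical helper: read blank-line-separated sequences from a connection.
read_blocks <- function(con) {
  lines <- trimws(readLines(con))
  is_blank <- lines == ""
  # every blank line starts a new block; consecutive blanks are harmless
  block_id <- cumsum(is_blank)
  blocks <- split(lines[!is_blank], block_id[!is_blank])
  # join multi-line sequences and split them into single characters
  unname(lapply(blocks, function(b) strsplit(paste(b, collapse = ""), "")[[1]]))
}

con <- textConnection(c("MKV", "LA", "", "", "GGS"))
toy_seqs <- read_blocks(con)
close(con)
print(toy_seqs)
```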

Regenerate n-grams

Description

'Regenerates' an amino acid or nucleic acid sequence written in a simplified alphabet by converting groups to a regular expression.

Usage

regenerate(x, element_groups)

Arguments

`x`	`character` string representing single n-gram.
`element_groups`	encoding of elements: list of groups to which elements of sequence should be aggregated. Must have unique names.

Value

A character string representing a POSIX regular expression.

Note

Gaps (_) will be converted to any possible character from the alphabet (nucleotides or amino acids).

Examples

regenerate("ssw", list(w = c(1, 4), s = c(2, 3)))
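To make the idea concrete, the sketch below turns an n-gram written in a reduced alphabet into a regular expression built from character classes, with gaps (_) expanded to the whole alphabet as described in the Note. It is an illustration only: `to_regex` is a hypothetical helper, the groups here hold letters rather than numeric codes, and the exact string produced by `regenerate` may be formatted differently.

```r
# Hypothetical helper: one character class per symbol of the reduced n-gram.
to_regex <- function(ngram, groups, alphabet = unlist(groups)) {
  symbols <- strsplit(ngram, "")[[1]]
  classes <- vapply(symbols, function(s) {
    # a gap matches any element of the alphabet
    members <- if (s == "_") alphabet else groups[[s]]
    paste0("[", paste(members, collapse = ""), "]")
  }, character(1))
  paste(classes, collapse = "")
}

groups <- list(w = c("a", "t"), s = c("c", "g"))
pattern <- to_regex("s_w", groups)
print(pattern)

# the resulting pattern can be used with the usual regex tools
grepl(pattern, c("gta", "aaa"))
```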

regional_param class

Description

List of rules defining the region.

Details

An object of the regional_param class is a list consisting of all rules necessary to properly build a region.

Attributes

reg_len: the number of unigrams inside the region. Might be 0.
prop_ranges: required intervals of properties of unigrams in the region.
exactness: a numeric value between 0 and 1 defining how strictly unigrams are kept within prop_ranges. If 1, only unigrams within prop_ranges are inside the region. If 0.9, there is a 10% chance that unigrams that are not in prop_ranges will be inside the region.

Extract n-grams from sequence

Description

Extracts vector of n-grams present in sequence(s).

Usage

seq2ngrams(seq, n, u, d = 0, pos = FALSE)

Arguments

`seq`	a vector or matrix describing sequence(s).
`n`	`integer` size of n-gram.
`u`	`integer`, `numeric` or `character` vector of all possible unigrams.
`d`	`integer` vector of distances between elements of n-gram (0 means consecutive elements). See Details.
`pos`	`logical`, if `TRUE` position-specific n-grams are extracted.

Details

The format of the d vector is discussed in the Details of count_ngrams.

Value

A character matrix of n-grams, where every row corresponds to a different sequence.

Examples

# trigrams from multiple sequences
seqs <- matrix(sample(1L:4, 600, replace = TRUE), ncol = 50)
seq2ngrams(seqs, 3, 1L:4)
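How `n` and `d` interact can be shown on the AATA example from the package overview. The following base R sketch slides a window over a single sequence; `extract_ngrams` is a hypothetical helper, it assumes `d` holds one distance per pair of neighbouring elements, and it does not reproduce the output format of `seq2ngrams`.

```r
# Hypothetical helper: extract n-grams with optional gaps from one sequence.
extract_ngrams <- function(seq, n, d = rep(0, n - 1)) {
  # offsets of the n elements relative to the start of the n-gram
  offsets <- c(0, cumsum(d + 1))
  n_starts <- length(seq) - max(offsets)
  if (n_starts < 1) return(character(0))
  vapply(seq_len(n_starts),
         function(i) paste(seq[i + offsets], collapse = "."),
         character(1))
}

aata <- c("a", "a", "t", "a")
continuous <- extract_ngrams(aata, 2)        # AA, AT, TA
gap_one <- extract_ngrams(aata, 2, d = 1)    # A_T, A_A
gap_two <- extract_ngrams(aata, 2, d = 2)    # A__A
print(list(continuous, gap_one, gap_two))
```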

Convert encoding from simple to full format

Description

Converts an encoding from the simple format to the full format.

Usage

simple2full(x)

Arguments

`x`	encoding (see Details).

Details

The encoding should be named. Each name should correspond to a different amino acid or nucleotide.

Examples

aa1 = structure(c("1", "4", "3", "3", "4", "1", "2", "1", "2", "1", 
                  "1", "4", "1", "4", "4", "4", "4", "1", "4", "4"), 
                .Names = c("a", "c", "d", "e", "f", "g", "h", "i", 
                           "k", "l", "m", "n", "p", "q", 
                           "r", "s", "t", "v", "w", "y"))
simple2full(aa1)

Summarize tested features

Description

Summarizes results of test_features function.

Usage

## S3 method for class 'feature_test'
summary(object, conf_level = 0.95, ...)

Arguments

`object`	object of class `feature_test`.
`conf_level`	confidence level. A feature with p-value equal to or smaller than the confidence is considered significant.
`...`	ignored

Value

nothing.

Tabulate n-grams

Description

Builds a contingency table of the n-gram counts versus their class labels.

Usage

table_ngrams(seq, ngrams, target)

Arguments

`seq`	vector or matrix describing sequence(s).
`ngrams`	vector of n-grams.
`target`	`integer` vector with target information (e.g. class labels). Must have at least two values.

Value

a data frame with the number of columns equal to the number of distinct values of the target plus 1. The first column contains the names of the n-grams. Further columns represent counts of n-grams for the respective value of the target.

Examples

seqs_pos <- matrix(sample(c("a", "c", "g", "t"), 100, replace = TRUE, 
            prob = c(0.2, 0.4, 0.35, 0.05)), ncol = 5)
seqs_neg <- matrix(sample(c("a", "c", "g", "t"), 100, replace = TRUE), 
            ncol = 5)
tab <- table_ngrams(seq = rbind(seqs_pos, seqs_neg), 
                    ngrams = c("1_c.t_0", "1_g.g_0", "2_t.c_0", "2_g.g_0", "3_c.c_0", "3_g.c_0"), 
                    target = c(rep(1, 20), rep(0, 20)))
# see the results
print(tab)
# easily plot the results using ggplot2

Permutation test for feature selection

Description

Performs feature selection on positioned n-gram data using Fisher's permutation test.

Usage

test_features(
  target,
  features,
  criterion = "ig",
  adjust = "BH",
  threshold = 1,
  quick = TRUE,
  times = 1e+05,
  occurrences = TRUE
)

Arguments

`target`	`integer` vector with target information (e.g. class labels).
`features`	`integer` matrix of features with number of rows equal to the length of the target vector.
`criterion`	criterion used in permutation test. See Details for the list of possible criteria.
`adjust`	name of p-value adjustment method. See `p.adjust` for the list of possible values. If `NULL`, p-values are not adjusted.
`threshold`	`integer`. Features that occur fewer than `threshold` times or more than `nrow(features)-threshold` times are discarded from the permutation test.
`quick`	`logical`, if `TRUE` Quick Permutation Test (QuiPT) is used. If `FALSE`, normal permutation test is performed.
`times`	number of times procedure should be repeated. Ignored if `quick` is `TRUE`.
`occurrences`	`logical`, if `TRUE` occurrences of n-grams are computed.

Details

Since the procedure involves multiple testing, it is advisable to use one of the available p-value adjustment methods. Such methods can be used directly by specifying the adjust parameter.
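As an illustration of what such an adjustment does, the base R function `p.adjust` can be applied to a vector of raw p-values. The p-values below are made up for the example and do not come from any analysis.

```r
# made-up raw p-values for five features
p_raw <- c(0.001, 0.012, 0.03, 0.04, 0.5)

# Benjamini-Hochberg adjustment, the default ("BH") in test_features
p_bh <- p.adjust(p_raw, method = "BH")
print(p_bh)

# adjusted p-values are never smaller than the raw ones
all(p_bh >= p_raw)
```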

Available criteria:

ig: Information Gain: calc_ig.
kl: Kullback-Leibler divergence: calc_kl.
cs: Chi-squared-based measure: calc_cs.
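To show how a criterion enters a permutation test, the sketch below computes information gain for a binary feature and target and estimates a p-value by shuffling the target. It illustrates the ordinary permutation test, not QuiPT, and it is not the package code: `entropy`, `info_gain` and `perm_pvalue` are hypothetical helpers, and `calc_ig` may use a different logarithm base or scaling.

```r
# Shannon entropy (in bits) of a vector of probabilities
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# information gain: H(target) - H(target | feature)
info_gain <- function(target, feature) {
  h_target <- entropy(prop.table(table(target)))
  h_cond <- sum(vapply(split(target, feature), function(t) {
    length(t) / length(target) * entropy(prop.table(table(t)))
  }, numeric(1)))
  h_target - h_cond
}

# fraction of shuffled targets giving a criterion at least as large as observed
perm_pvalue <- function(target, feature, times = 1000) {
  observed <- info_gain(target, feature)
  permuted <- replicate(times, info_gain(sample(target), feature))
  mean(permuted >= observed)
}

set.seed(1)
target <- rep(c(0, 1), each = 50)
informative <- target                     # identical to the target
uninformative <- rep(c(0, 1), times = 50) # balanced within both classes

p_inf <- perm_pvalue(target, informative)
p_noise <- perm_pvalue(target, uninformative)
print(c(informative = p_inf, uninformative = p_noise))
```

The number of permutations limits the resolution of such a p-value, which is one reason the `times` argument defaults to a large value when `quick` is `FALSE`.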

Value

an object of class feature_test.

Note

Both target and features must be binary, i.e. contain only 0 and 1 values.

Features occurring too often or too rarely are considered uninformative and may be removed using the threshold parameter.
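The effect of the threshold rule on a binary feature matrix can be sketched in base R as follows. The matrix and variable names are invented for the illustration, and the filter is written from the description of the `threshold` argument rather than taken from the package source.

```r
# toy binary feature matrix with 10 observations
features <- cbind(
  rare = c(1, rep(0, 9)),       # occurs once
  common = c(rep(1, 9), 0),     # occurs nine times
  balanced = rep(c(0, 1), 5)    # occurs five times
)
threshold <- 2

# keep features occurring at least `threshold` times
# and at most nrow(features) - threshold times
occ <- colSums(features)
keep <- occ >= threshold & occ <= nrow(features) - threshold
filtered <- features[, keep, drop = FALSE]
print(colnames(filtered))
```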

References

Radivojac P, Obradovic Z, Dunker AK, Vucetic S. Feature selection filters based on the permutation test. In: Machine Learning: ECML 2004, 15th European Conference on Machine Learning. Springer, 2004.

Examples

# significant feature
tar_feat1 <- create_feature_target(10, 390, 0, 600) 
# significant feature
tar_feat2 <- create_feature_target(9, 391, 1, 599)
# insignificant feature
tar_feat3 <- create_feature_target(198, 202, 300, 300)
test_res <- test_features(tar_feat1[, 1], cbind(tar_feat1[, 2], tar_feat2[, 2], 
                          tar_feat3[, 2]))
summary(test_res)
cut(test_res)

# real data example
# we will analyze only a subsample of a dataset to make analysis quicker
ids <- c(1L:100, 701L:800)
deg_seqs <- degenerate(human_cleave[ids, 1L:9], 
                       list(`a` = c(1, 6, 8, 10, 11, 18), 
                            `b` = c(2, 5, 13, 14, 16, 17, 19, 20), 
                            `c` = c(3, 4, 7, 9, 12, 15)))

# positioned n-grams example
bigrams_pos <- count_ngrams(deg_seqs, 2, letters[1L:3], pos = TRUE)
test_features(human_cleave[ids, 10], bigrams_pos)

# unpositioned n-grams example, binarization required
bigrams_notpos <- count_ngrams(deg_seqs, 2, letters[1L:3], pos = FALSE)
test_features(human_cleave[ids, 10], binarize(bigrams_notpos))

Validate encoding

Description

Checks the structure of an encoding.

Usage

validate_encoding(x, u)

Arguments

`x`	encoding.
`u`	`integer`, `numeric` or `character` vector of all elements belonging to the encoding. See Details.

Details

The encoding is a list of groups to which elements of an alphabet should be reduced. All elements of the alphabet (all amino acids or all nucleotides) should appear in the encoding.

Value

TRUE if x is a correct reduction of u, FALSE otherwise.

Examples

enc1 = list(`1` = c("a", "t"), 
            `2` = c("g", "c"))
# see if enc1 is the correctly reduced nucleotide (DNA) alphabet
validate_encoding(enc1, c("a", "c", "g", "t"))

# enc1 is not the RNA alphabet, so the result is FALSE
validate_encoding(enc1, c("a", "c", "g", "u"))

# validate_encoding works also on other notations
enc2 = list(a = c(1, 4),
            b = c(2, 3))
validate_encoding(enc2, 1L:4)
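The condition stated in Details (every element of the alphabet appears in the encoding) can be approximated in base R. The sketch below additionally requires that no element is assigned to two groups, which is an assumption; `is_valid_reduction` is a hypothetical helper and may be stricter or looser than `validate_encoding`.

```r
# Hypothetical helper: does `enc` partition the alphabet `u`?
is_valid_reduction <- function(enc, u) {
  members <- unlist(enc, use.names = FALSE)
  # no element may appear twice, and the groups must cover exactly u
  !anyDuplicated(members) && setequal(members, u)
}

dna_enc <- list(`1` = c("a", "t"), `2` = c("g", "c"))
ok_dna <- is_valid_reduction(dna_enc, c("a", "c", "g", "t"))
ok_rna <- is_valid_reduction(dna_enc, c("a", "c", "g", "u"))
ok_num <- is_valid_reduction(list(a = c(1, 4), b = c(2, 3)), 1L:4)
print(c(ok_dna, ok_rna, ok_num))
```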

Write encodings to a file

Description

Saves a list of encodings (or a single encoding) to a file.

Usage

write_encoding(x, file = "")

Arguments

`x`	encoding or list of encodings.
`file`	either a character string naming a file or a `connection` open for writing. "" indicates output to the console.

Examples

aa1 = list(`1` = c("g", "a", "p", "v", "m", "l", "i"), 
           `2` = c("k", "h"), 
           `3` = c("d", "e"), 
           `4` = c("f", "r", "w", "y", "s", "t", "c", "n", "q"))
write_encoding(aa1)

Write FASTA files

Description

A lightweight tool to write nucleic or amino-acid sequences to a file in FASTA format.

Usage

write_fasta(seq, file, nchar = 80)

Arguments

`seq`	a list of sequences.
`file`	the name of the output file.
`nchar`	the number of characters per line.
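The role of the line-length argument can be illustrated with a small base R sketch that builds the lines of a FASTA file without writing them. `format_fasta` is a hypothetical helper, its `width` argument plays the role of `nchar`, and using the list names as header lines is an assumption about the expected input.

```r
# Hypothetical helper: format a named list of sequences as FASTA lines.
format_fasta <- function(seqs, width = 80) {
  unlist(lapply(names(seqs), function(nm) {
    s <- paste(seqs[[nm]], collapse = "")
    # start index of every output line of at most `width` characters
    starts <- seq(1, nchar(s), by = width)
    c(paste0(">", nm), substring(s, starts, starts + width - 1))
  }))
}

toy <- list(seq1 = c("A", "C", "G", "T", "A"), seq2 = c("G", "G"))
fasta_lines <- format_fasta(toy, width = 2)
print(fasta_lines)

# the lines could then be written with writeLines(fasta_lines, "out.fasta")
```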
