Title: | Predict Presence of Signal Peptides |
---|---|
Description: | Predicts the presence of signal peptides in eukaryotic protein using hidden semi-Markov models. The implemented algorithm can be accessed from both the command line and GUI. |
Authors: | Michal Burdukiewicz [cre, aut] , Piotr Sobczyk [aut], Jaroslaw Chilimoniuk [ctb] |
Maintainer: | Michal Burdukiewicz <[email protected]> |
License: | GPL-3 |
Version: | 1.5 |
Built: | 2024-11-03 03:19:53 UTC |
Source: | https://github.com/michbur/signalhsmm |
Amino acids are grouped together in larger sets based on their physicochemical properties important in the recognition of signal peptide.
aaaggregation
aaaggregation
a list of length four. Each element contains a character
vector
of amino acid names (one-letter abbreviations).
Changes parameters for Hidden Semi-Markov Model to add k-mer
add_k_mer_state(kMer, pipar, tpmpar, od, params, pState, nState, pTrans, d)
add_k_mer_state(kMer, pipar, tpmpar, od, params, pState, nState, pTrans, d)
kMer |
|
pipar |
Probabilities of initial state in Markov Model. |
tpmpar |
Matrix with transition probabilities between states. |
od |
Matrix of response probabilities. Eg. od[1,2] is a probability of signal 2 in state 1. |
params |
Matrix of probability distribution for duration. Eg. params[10,2] is probability of duration of time 10 in state 2. |
pState |
number denoting hidden state right before k-mer. |
nState |
number denoting hidden state right after k-mer. |
pTrans |
Probability of change from pState to k-mer hidden state. |
d |
Duration of the state. |
A list of length four:
pipar a vector of new probabilities of initial state in Markov Model,
tpmpar a matrix with new transition probabilities between states,
od matrix of new response probabilities,
params matrix of new probability distributions for duration.
Currently add only k-mers without distance.
Lists eukaryotic proteins added to UniProt database release 2015_06 between 1.01.2010 and 1.06.2015 (140 proteins with signal peptide and 280 randomly sampled proteins without signal peptide).
benchmark_dat
benchmark_dat
a list of SeqFastaAA
objects.
Slot sig
contains the range of signal peptide (if any).
summary(benchmark_dat)
summary(benchmark_dat)
Viterbi algorithm for Hidden Markov Model with duration
duration_viterbi(aa_sample, pipar, tpmpar, od, params)
duration_viterbi(aa_sample, pipar, tpmpar, od, params)
aa_sample |
|
pipar |
probabilities of initial state in Markov Model. |
tpmpar |
matrix of transition probabilities between states. |
od |
matrix of response probabilities. Eg. od[1,2] is a probability of signal 2 in state 1. |
params |
matrix of probability distribution for duration. Eg. params[10,2] is probability of duration of time 10 in state 2. |
A list of length four:
path a vector of most probable path
viterbi values of probability in all intermediate points,
psi matrix that gives for every signal and state the previous state in viterbi path,
duration matrix that gives for every signal and state gives the duration in that state on viterbi path.
All computations are on logarithms of probabilities.
Finds borders between distinct regions constituting signal peptides using a heuristic algorithm.
find_nhc(protein, signal = NULL)
find_nhc(protein, signal = NULL)
protein |
a vector of amino acids or object of class
|
signal |
range of signal peptide. If |
a vector of length 4 containing positions of:
start of n-region,
start of h-region,
start of c-region,
cleavage site.
Henrik Nielsen, Anders Krogh (1998). Prediction of signal peptides and signal anchors by a hidden Markov model. Proc. Sixth Int. Conf. on Intelligent Systems for Molecular Biology.
A graphical user interface for predicting presence of signal peptides.
gui_signalHsmm()
gui_signalHsmm()
null.
Any ad-blocking software may be cause of malfunctions.
A single prediction of signalHsmm
.
A stochastic model of signal peptide produced by signalHsmm
.
Always a named list of five elements
sp_probability
is a probability of signal peptide presence.
sp_start
is a start of potential signal peptide (naively 1 aminoacid).
sp_end
is a position of last amino acid of signal peptide.
struc
is numeric vector representing predicted structure of input
protein.
prot
is character vector containing input sequence of amino acids.
str_approx
has value bigger than 0 if the predicted signal peptide
structure was approximated (usually in case of sequences that have no signal peptides).
Always a named list of five elements
aa_group
encoding of amino acids. See aaaggregation
for an example.
pipar
probabilities of initial state in Markov Model.
tpmpar
matrix of transition probabilities between states.
od
matrix of response probabilities. Eg. od[1,2] is a probability
of signal 2 in state 1.
overall_probs_log
probabilities of amino acids in mature protein.
params
matrix of probability distribution for duration. Eg. params[10,2]
is probability of duration of time 10 in state 2.
summary.hsmm_pred
plot.hsmm_pred
train_hsmm
predict.sighsmm_model
A list of prediction(s) generated by run_signalHsmm
function.
A named list. Each element belongs to the hsmm_pred
class.
summary.hsmm_pred_list
, pred2df
Checks if an object is a protein (contains letters from one-letter amino acid code).
is_protein(object)
is_protein(object)
object |
|
TRUE
or FALSE
.
Plots objects of class hsmm_pred
.
## S3 method for class 'hsmm_pred' plot(x, add_legend = TRUE, only_sure = TRUE, ...)
## S3 method for class 'hsmm_pred' plot(x, add_legend = TRUE, only_sure = TRUE, ...)
x |
object of class |
add_legend |
|
only_sure |
|
... |
ignored. |
Nothing.
Converts objects of class hsmm_pred_list
to data frame.
pred2df(object)
pred2df(object)
object |
of class |
Data frame which columns contain respectively the probability of signal peptide presence as well as the start and the end of predicted signal peptide.
Predicts the presence of signal peptides using signalHsmm models.
## S3 method for class 'sighsmm_model' predict(object, newdata, ...)
## S3 method for class 'sighsmm_model' predict(object, newdata, ...)
object |
|
newdata |
unknown sequence of class |
... |
further arguments passed to or from other methods. |
#remember to remove it ## Not run: pos_train_ultrahard <- read_uniprot("pos_ultrahard_data.txt", euk = TRUE) model1 <- train_hsmm(pos_train_ultrahard, aa_group = aaaggregation) predict(model1, benchmark_dat[1L:5]) ## End(Not run)
#remember to remove it ## Not run: pos_train_ultrahard <- read_uniprot("pos_ultrahard_data.txt", euk = TRUE) model1 <- train_hsmm(pos_train_ultrahard, aa_group = aaaggregation) predict(model1, benchmark_dat[1L:5]) ## End(Not run)
Read sequence data saved in text file.
read_txt(connection)
read_txt(connection)
connection |
a |
The input file should contain one or more amino acid sequences separated by empty line(s).
a list of sequences. Each element has class SeqFastaAA
. If
connection contains no characters, function prompts warning and returns NULL
.
Read data saved in UniProt original flat text format.
read_uniprot(connection, ft_names, kwds = NULL)
read_uniprot(connection, ft_names, kwds = NULL)
connection |
a |
ft_names |
a character vector of UuniProt features to be extracted, for example
|
kwds |
a |
a list of sequences. Each element has a class SeqFastaAA
.
Attributes OS
and OC
represents respectively OS and OC fields in the
protein description. A value of each feature is preserved as an attribute named
after the feature.
Using the hidden semi-Markov model predict presence of signal peptide in eukaryotic proteins.
run_signalHsmm(test_data)
run_signalHsmm(test_data)
test_data |
single protein sequence ( |
Function signalHsmm
returns respectively probability of presence of
signal peptide, start of signal peptide and the probable cleavage site localization.
If input consists of more than one sequence, result is a data.frame where each column
contains above values for different proteins.
An object of class hsmm_pred_list
.
Currently start of signal peptide is naively set as 1 amino acid. The prediction of a cleavage site is still an experimental feature, use on your own risk.
#run signalHsmm on one sequence x1 <- run_signalHsmm(benchmark_dat[[1]]) #run signalHsmm on one sequence, but input is a character vector x2 <- run_signalHsmm(c("M", "A", "G", "K", "E", "V", "I", "F", "I", "M", "A", "L", "F", "I", "A", "V", "E", "S", "S", "P", "I", "F", "S", "F", "D", "D", "L", "V", "C", "P", "S", "V", "T", "S", "L", "R", "V", "N", "V", "E", "K", "N", "E", "C", "S", "T", "K", "K", "D", "C", "G", "R", "N", "L", "C", "C", "E", "N", "Q", "N", "K", "I", "N", "V", "C", "V", "G", "G", "I", "M", "P", "L", "P", "K", "P", "N", "L", "D", "V", "N", "N", "I", "G", "G", "A", "V", "S", "E", "S", "V", "K", "Q", "K", "R", "E", "T", "A", "E", "S", "L")) #run signalHsmm on list of sequences x3 <- run_signalHsmm(benchmark_dat[1:3]) #see summary of results summary(x3) #print results as data frame pred2df(x3) #summary one result summary(x3[[1]]) plot(x3[[1]])
#run signalHsmm on one sequence x1 <- run_signalHsmm(benchmark_dat[[1]]) #run signalHsmm on one sequence, but input is a character vector x2 <- run_signalHsmm(c("M", "A", "G", "K", "E", "V", "I", "F", "I", "M", "A", "L", "F", "I", "A", "V", "E", "S", "S", "P", "I", "F", "S", "F", "D", "D", "L", "V", "C", "P", "S", "V", "T", "S", "L", "R", "V", "N", "V", "E", "K", "N", "E", "C", "S", "T", "K", "K", "D", "C", "G", "R", "N", "L", "C", "C", "E", "N", "Q", "N", "K", "I", "N", "V", "C", "V", "G", "G", "I", "M", "P", "L", "P", "K", "P", "N", "L", "D", "V", "N", "N", "I", "G", "G", "A", "V", "S", "E", "S", "V", "K", "Q", "K", "R", "E", "T", "A", "E", "S", "L")) #run signalHsmm on list of sequences x3 <- run_signalHsmm(benchmark_dat[1:3]) #see summary of results summary(x3) #print results as data frame pred2df(x3) #summary one result summary(x3[[1]]) plot(x3[[1]])
Using hidden semi-Markov models as a probabilistic framework, signalHsmm is new, highly accurate signal peptide predictor for eukaryotic proteins.
Secretory signal peptides are short (20-30 residues) N-terminal amino acid sequences tagging among others tag among others hormons, immune system proteins, structural proteins, and metabolic enzymes. They direct a protein to the endomembrane system and next to the extracellular localization. All signal peptides possess three distinct domains with variable length and characteristic amino acid composition. Despite their variability, signal peptides are universal enough to direct properly proteins in different secretory systems. For example, artifically introduced bacterial signal peptides can guide proteins in mammals and plants.
The development of signalHsmm was funded by National Science Center (2015/17/N/NZ2/01845).
few_predictions <- run_signalHsmm(benchmark_dat[1:3]) #see all predictions pred2df(few_predictions) #summary one prediction summary(few_predictions[[1]]) #plot one prediction plot(few_predictions[[1]]) #have fun with GUI ## Not run: gui_signalHsmm() ## End(Not run)
few_predictions <- run_signalHsmm(benchmark_dat[1:3]) #see all predictions pred2df(few_predictions) #summary one prediction summary(few_predictions[[1]]) #plot one prediction plot(few_predictions[[1]]) #have fun with GUI ## Not run: gui_signalHsmm() ## End(Not run)
Summarizes objects of class hsmm_pred
.
## S3 method for class 'hsmm_pred' summary(object, only_sure = TRUE, double_linebreak = FALSE, ...)
## S3 method for class 'hsmm_pred' summary(object, only_sure = TRUE, double_linebreak = FALSE, ...)
object |
of class |
only_sure |
|
double_linebreak |
|
... |
ignored |
Nothing.
Summarizes objects of class hsmm_pred_list
.
## S3 method for class 'hsmm_pred_list' summary(object, ...)
## S3 method for class 'hsmm_pred_list' summary(object, ...)
object |
of class |
... |
ignored |
nothing.
Train sighsmm_model object
train_hsmm(train_data, aa_group, max_length = 32, region_fun = find_nhc)
train_hsmm(train_data, aa_group, max_length = 32, region_fun = find_nhc)
train_data |
training data. |
aa_group |
method of aggregating amino acids. |
max_length |
maximum length of signal peptide. |
region_fun |
function defining borders of regions (see |
object of class sighsmm_model
.