PyRBP.Features
- PyRBP.Features.generateBPFeatures(sequences, pseudoKNC=False, ktuple=3, zigzag_coding=False, guanine_cytosine_Quantity=False, nucleotide_tilt=False, percentage_of_bases=False, PGKM=False, gapValue=1, kValue=2, mValue=2, DPCP=False)
This function generates several types of sequence-based features (physicochemical features, sequence properties, base composition, etc.).
- Parameters:
- sequences:list or array, required
List or array of sequences used to generate features.
- pseudoKNC:bool, default=False
Whether to use the pseudoKNC algorithm to generate features. If
True, the value of the parameter ktuple is used for feature generation.
- ktuple:int, default=3
Number of bases in each tuple in pseudoKNC. Allowed values:
[3, 4, 5].
- zigzag_coding:bool, default=False
Whether to use the zigzag_coding algorithm to generate features.
- guanine_cytosine_Quantity:bool, default=False
Whether to use the guanine_cytosine_Quantity algorithm to generate features.
- nucleotide_tilt:bool, default=False
Whether to use the nucleotide_tilt algorithm to generate features.
- percentage_of_bases:bool, default=False
Whether to use the percentage_of_bases algorithm to generate features.
- PGKM:bool, default=False
Whether to use the PGKM algorithm to generate features. If
True, the values of the parameters gapValue, kValue and mValue are used for feature generation.
- gapValue:int, default=1
Number of gaps between each pair of tuples in PGKM. Allowed values:
[1, 2, 3, 4, 5].
- kValue:int, default=2
Number of bases in the first tuple in PGKM. Allowed values:
[1, 2].
- mValue:int, default=2
Number of bases in the second tuple in PGKM. Allowed values:
[1, 2].
- DPCP:bool, default=False
Whether to use the DPCP algorithm to generate features.
- Attributes:
- features:list of features, one entry per sequence
Used to store the final generated features for each sequence. It is converted to an array before the function returns.
- feature:list of feature values for one sequence
Used to store the various features of the sequence being processed.
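The relationship between feature (one row per sequence) and features (all rows) can be illustrated with a minimal sketch. It assumes that percentage_of_bases refers to the fraction of each base and that guanine_cytosine_Quantity relates to G+C content; this page does not define either descriptor, so base_fractions, gc_fraction and build_features are hypothetical helpers, not PyRBP functions.

```python
ALPHABET = "ACGU"  # assumed RNA alphabet; T is mapped to U below


def base_fractions(seq):
    """Fraction of each base in seq (hypothetical reading of percentage_of_bases)."""
    seq = seq.upper().replace("T", "U")
    return [seq.count(base) / len(seq) for base in ALPHABET]


def gc_fraction(seq):
    """Fraction of G and C in seq (hypothetical reading of guanine_cytosine_Quantity)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def build_features(sequences):
    """Collect one feature row per (non-empty) sequence."""
    features = []
    for seq in sequences:
        feature = []                      # descriptors of the current sequence
        feature.extend(base_fractions(seq))
        feature.append(gc_fraction(seq))
        features.append(feature)          # one row per sequence
    return features


rows = build_features(["AUGC", "GGCC"])
```

Each enabled descriptor appends its values to the current row, so the row length grows with the number of options switched on.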
Note
The values of gapValue, kValue and mValue strongly affect the number of feature dimensions generated by the PGKM algorithm. The larger these three values, the longer PGKM takes to run. Set them according to your needs.
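To see why these parameters drive the dimensionality, the following sketch counts gapped tuple pairs. It is one plausible reading of the parameter descriptions, not PyRBP's implementation: it assumes a 4-letter alphabet and one block of counts for every gap length from 1 to gap_value, which gives gap_value * 4 ** (k_value + m_value) dimensions. The function name gapped_pair_counts is hypothetical.

```python
from itertools import product

ALPHABET = "ACGU"  # assumed alphabet


def gapped_pair_counts(seq, gap_value=1, k_value=2, m_value=2):
    """Count (k-tuple, gap, m-tuple) patterns; illustrative only."""
    features = []
    for gap in range(1, gap_value + 1):
        # One counter slot for every possible (first tuple, second tuple) pair.
        counts = {
            ("".join(a), "".join(b)): 0
            for a in product(ALPHABET, repeat=k_value)
            for b in product(ALPHABET, repeat=m_value)
        }
        span = k_value + gap + m_value
        for i in range(len(seq) - span + 1):
            first = seq[i:i + k_value]
            second = seq[i + k_value + gap:i + span]
            if (first, second) in counts:  # skip windows with other symbols
                counts[(first, second)] += 1
        features.extend(counts.values())
    return features


small = gapped_pair_counts("ACGUACGU", gap_value=1, k_value=1, m_value=1)
large = gapped_pair_counts("ACGUACGU", gap_value=2, k_value=2, m_value=2)
```

Under these assumptions, moving from the smallest to the largest tuple sizes multiplies the vector length by 16 for each gap length, and each additional gap length adds another block of the same size.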
- PyRBP.Features.generateDynamicLMFeatures(sequences, kmer=3, model='')
This function generates the dynamic semantic information matrix of the sequences. We provide fine-tuned
BERT models (RBPBERT) for the RBP classification problem. The models can be downloaded from the figshare link. When extraction is complete, the feature matrix has dimensions (number of sequences, number of tokens per sequence, 768), where 768 is the number of hidden units (across 12 attention heads) in the last transformer layer.
- Parameters:
- sequences:list or array, required
List or array of sequences used to generate features.
- kmer:int, default=3
kmer specifies the window size used when tokenizing sequences. Four window sizes are available:
[3, 4, 5, 6].
- model:str, default=''
The path where the downloaded
RBPBERT model is stored. Pass only the absolute path to the folder that contains the model, for example: /home/wangyansong/PyRBP/src/dynamicRNALM/circleRNA/pytorch_model_3mer
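As an illustration of how the kmer window relates to the "number of tokens per sequence" axis, here is a minimal sketch of k-mer tokenization. It assumes overlapping windows with a stride of 1, a common choice for k-mer based BERT models; PyRBP's actual tokenizer is not described on this page, and kmer_tokens is a hypothetical helper.

```python
def kmer_tokens(seq, kmer=3):
    """Split seq into overlapping k-mers (stride 1); illustrative only."""
    if kmer not in (3, 4, 5, 6):
        raise ValueError("kmer must be one of 3, 4, 5, 6")
    return [seq[i:i + kmer] for i in range(len(seq) - kmer + 1)]


tokens = kmer_tokens("AUGGCUA", kmer=3)
# tokens == ['AUG', 'UGG', 'GGC', 'GCU', 'CUA']
```

Under this assumption a sequence of length n yields n - kmer + 1 tokens. A BERT tokenizer may add special tokens on top of that, so the token axis of the returned matrix can be slightly longer.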
- PyRBP.Features.generateStaticLMFeatures(sequences, kmer=3, model='')
There are four static semantic models available in PyRBP:
fasttext, GloVe, word2vec and doc2vec. These models can be downloaded from the figshare link.
- Parameters:
- sequences:list or array, required
List or array of sequences used to generate features.
- kmer:int, default=3
kmer specifies the window size used when tokenizing sequences. Four window sizes are available:
[3, 4, 5, 6].
- model:str, default=''
The path where the downloaded
static semantic model is stored. Make sure that the model path you pass in is consistent with kmer.
- Attributes:
- LM_type:str, ['word2vec', 'fasttext', 'doc2vec', 'GloVe']
Parsed from the incoming model file name and used to distinguish between models when extracting embeddings.
Note
Both of the functions above parse required information from the model names to perform matching checks. Do not rename the downloaded model files when using dynamic or static language models for semantic feature extraction; otherwise the module may not work properly.
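The kind of matching check described in this note might look like the sketch below. The naming scheme is an assumption: it presumes the model name carries a "<k>mer" tag, as in the pytorch_model_3mer folder from the example path, and, for static models, one of the four model type names. The file name word2vec_4mer.model and the helper check_model_name are hypothetical.

```python
import os
import re

LM_TYPES = ("word2vec", "fasttext", "doc2vec", "GloVe")


def check_model_name(model_path, kmer):
    """Return the LM type found in the model name, or None if there is none.

    Raises ValueError when the '<k>mer' tag is missing or disagrees with kmer.
    """
    name = os.path.basename(os.path.normpath(model_path))
    match = re.search(r"(\d+)mer", name)
    if match is None:
        raise ValueError(f"no '<k>mer' tag found in model name: {name!r}")
    if int(match.group(1)) != kmer:
        raise ValueError(f"model name {name!r} does not match kmer={kmer}")
    lowered = name.lower()
    return next((t for t in LM_TYPES if t.lower() in lowered), None)


lm_type = check_model_name("/models/word2vec_4mer.model", kmer=4)  # hypothetical path
```

A renamed file would either fail the tag lookup or produce a mismatch, which is why the downloaded names should be left untouched.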
- PyRBP.Features.generateStructureFeatures(dataset_path='', script_path='', basic_path='', W=101, L=70, u=1, dataset_name='')
This function calls
RNAplfold to calculate the pair probabilities of locally stable secondary structures [RNAplfold]. The corresponding RNAplfold scripts can be downloaded from the figshare link.
- Parameters:
- dataset_path:str, default=''
Path to the fasta file.
- script_path:str, default=''
The path where the
RNAplfold scripts are located.
- basic_path:str, default=''
The path where the generated secondary structure profile files are stored. Four folders (E, H, I and M) are created under it, along with the final structure information file combined_profile.txt.
- W:int, default=101
Average the pair probabilities over windows of given size.
- L:int, default=70
Set the maximum allowed separation (span) of a base pair: no pairs (i,j) with j−i > span are allowed. RNAplfold itself defaults to the window size when this option is omitted; this function passes 70 unless you override it.
- u:int, default=1
Compute the mean probability that regions of length 1 to a given length are unpaired.
- dataset_name:str, default=''
dataset_name labels each dataset so that structural information for multiple datasets can be stored separately. It is combined with
basic_path to form the storage path.
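The parameters above can be pictured with a minimal sketch that assembles an RNAplfold argument list and the storage folders. -W, -L and -u are the standard RNAplfold options for window size, maximum base pair span and unpaired-region length; whether the downloaded scripts accept exactly these flags, and whether the folders are laid out as basic_path/dataset_name/{E,H,I,M}, are assumptions. Both helper names are hypothetical.

```python
import os


def build_plfold_args(W=101, L=70, u=1):
    """Argument list for a standard RNAplfold call; illustrative only."""
    if L > W:
        # Sanity check added for this sketch: a span wider than the
        # averaging window is not meaningful.
        raise ValueError("L (span) should not exceed W (window size)")
    return ["RNAplfold", "-W", str(W), "-L", str(L), "-u", str(u)]


def storage_dirs(basic_path, dataset_name):
    """Assumed folder layout: basic_path/dataset_name/{E,H,I,M}."""
    root = os.path.join(basic_path, dataset_name)
    return {name: os.path.join(root, name) for name in ("E", "H", "I", "M")}


args = build_plfold_args()
dirs = storage_dirs("/tmp/structure", "demo_dataset")  # hypothetical paths
```

Nothing is executed or created here; the sketch only shows how the numeric parameters and the two path parameters could fit together.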
- Attributes:
- path:str
Used to store the final secondary structure information file.
- E_path:str
Used to store the E_RNAplfold secondary structure information file.
- M_path:str
Used to store the M_RNAplfold secondary structure information file.
- I_path:str
Used to store the I_RNAplfold secondary structure information file.
- H_path:str
Used to store the H_RNAplfold secondary structure information file.
- cmd:str
Used to store the command that summarizes the secondary structure information files.
Note
The RNAplfold scripts need executable permissions, which you can grant with the following command:
chmod 764 path_to_the_scripts
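If you prefer to set the permissions from Python, a minimal equivalent using the standard library is sketched below. It applies mode 764 to every regular file in a folder, which assumes that path_to_the_scripts is a directory holding only the RNAplfold scripts; make_executable is a hypothetical helper.

```python
import os
import stat


def make_executable(script_dir):
    """Apply mode 764 (rwxrw-r--) to every regular file in script_dir."""
    changed = []
    for name in sorted(os.listdir(script_dir)):
        path = os.path.join(script_dir, name)
        if os.path.isfile(path):
            os.chmod(path, 0o764)
            changed.append(path)
    return changed


def mode_of(path):
    """Return the permission bits of path, e.g. 0o764."""
    return stat.S_IMODE(os.stat(path).st_mode)
```

Calling make_executable(script_path) before generateStructureFeatures has the same effect as running chmod on each script by hand.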
[RNAplfold] R. Lorenz, S.H. Bernhart, C. Hoener zu Siederdissen, H. Tafer, C. Flamm, P.F. Stadler and I.L. Hofacker (2011), “ViennaRNA Package 2.0”, Algorithms for Molecular Biology 6:26.