PyRBP.Features

PyRBP.Features.generateBPFeatures(sequences, pseudoKNC=False, ktuple=3, zigzag_coding=False, guanine_cytosine_Quantity=False, nucleotide_tilt=False, percentage_of_bases=False, PGKM=False, gapValue=1, kValue=2, mValue=2, DPCP=False)

This function generates several types of features from the input sequences, including physicochemical features, sequence properties and base composition.

Parameters:

sequences:list or array, required: List or array of sequences used to generate features.

pseudoKNC:bool, default=False: Whether to use the pseudoKNC algorithm to generate features. If True, the value of ktuple is used for feature generation.

ktuple:int, default=3: Number of bases in each tuple for pseudoKNC. Allowed values are [3, 4, 5].

zigzag_coding:bool, default=False: Whether to use zigzag_coding algorithm to generate features.

guanine_cytosine_Quantity:bool, default=False: Whether to use guanine_cytosine_Quantity algorithm to generate features.

nucleotide_tilt:bool, default=False: Whether to use nucleotide_tilt algorithm to generate features.

percentage_of_bases:bool, default=False: Whether to use percentage_of_bases algorithm to generate features.

PGKM:bool, default=False: Whether to use the PGKM algorithm to generate features. If True, the values of gapValue, kValue and mValue are used for feature generation.

gapValue:int, default=1: Number of gaps between each pair of tuples. Allowed values are [1, 2, 3, 4, 5].

kValue:int, default=2: Number of bases in the first tuple in PGKM. Allowed values are [1, 2].

mValue:int, default=2: Number of bases in the second tuple in PGKM. Allowed values are [1, 2].

DPCP:bool, default=False: Whether to use DPCP algorithm to generate features.

Attributes:

features:list of features, one entry per sequence: Stores the final generated features for each sequence. It is converted to an array before the function returns.

feature:list of feature values for one sequence: Stores the features of the sequence currently being processed.

Note

The values of gapValue, kValue and mValue greatly affect the number of feature dimensions generated by the PGKM algorithm. The larger these three values are, the longer PGKM takes to run. Please set them according to your needs.
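Because the ktuple, gapValue, kValue and mValue parameters only accept the values listed above, it can help to check them before starting a long run. The following is a minimal sketch of such a check. The helper name check_bp_params and the ALLOWED table are hypothetical and not part of PyRBP; only the allowed ranges come from the parameter descriptions.

```python
# Allowed values, copied from the parameter descriptions above.
ALLOWED = {
    "ktuple": {3, 4, 5},
    "gapValue": {1, 2, 3, 4, 5},
    "kValue": {1, 2},
    "mValue": {1, 2},
}


def check_bp_params(**params):
    """Hypothetical helper (not part of PyRBP): reject out-of-range values."""
    for name, value in params.items():
        allowed = ALLOWED.get(name)
        # Parameters without a documented range (the bool switches) pass through.
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not in {sorted(allowed)}")
    return params


kwargs = check_bp_params(
    pseudoKNC=True, ktuple=3, PGKM=True, gapValue=1, kValue=2, mValue=2
)
# The checked keyword arguments could then be forwarded, for example:
# features = PyRBP.Features.generateBPFeatures(sequences, **kwargs)
```

Keeping the check separate from the call means an invalid combination fails immediately rather than partway through feature generation.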

PyRBP.Features.generateDynamicLMFeatures(sequences, kmer=3, model='')

This function generates the dynamic semantic information matrix of the sequences. Fine-tuned BERT models (RBPBERT) are provided for the RBP classification problem and can be downloaded from the figshare link. When extraction is complete, the feature matrix has dimensions (number of sequences, number of tokens per sequence, 768), where 768 is the number of hidden units in the last transformer layer, which has 12 attention heads.

Parameters:

sequences:list or array, required: List or array of sequences used to generate features.

kmer:int, default=3: Window size used when tokenizing sequences. Four window sizes are available: [3, 4, 5, 6].

model:str, default=''

The path where the downloaded RBPBERT model is stored. Pass only the absolute path to the folder that contains the model, for example:

/home/wangyansong/PyRBP/src/dynamicRNALM/circleRNA/pytorch_model_3mer
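To illustrate what the kmer window size means, here is a minimal sketch of k-mer tokenization. It assumes overlapping windows that advance one base at a time, which is a common convention for BERT-style nucleotide models. PyRBP's own tokenizer may differ, for instance by adding special tokens, and the helper name kmer_tokens is hypothetical.

```python
def kmer_tokens(sequence, kmer=3):
    """Hypothetical helper: split a sequence into overlapping k-mers (stride 1)."""
    if kmer not in (3, 4, 5, 6):
        raise ValueError("kmer must be one of 3, 4, 5, 6")
    # A sequence of length n yields n - kmer + 1 windows.
    return [sequence[i:i + kmer] for i in range(len(sequence) - kmer + 1)]


tokens = kmer_tokens("AUGGCUA", kmer=3)
print(tokens)  # ['AUG', 'UGG', 'GGC', 'GCU', 'CUA']
```

Under this assumption, a larger kmer produces fewer tokens per sequence, which changes the second dimension of the feature matrix.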

PyRBP.Features.generateStaticLMFeatures(sequences, kmer=3, model='')

Four static semantic models are available in PyRBP: fasttext, GloVe, word2vec and doc2vec. These models can be downloaded from the figshare link.

Parameters:

sequences:list or array, required: List or array of sequences used to generate features.

kmer:int, default=3: Window size used when tokenizing sequences. Four window sizes are available: [3, 4, 5, 6].

model:str, default='': The path where the downloaded static semantic model is stored. Make sure the model path you pass in is consistent with kmer.

Attributes:

LM_type:str, ['word2vec', 'fasttext', 'doc2vec', 'GloVe']: Separated from the incoming model file name, used to distinguish different models when extracting embedding.

Note

Both of the functions above extract required information from the model names to perform matching checks. Do not rename the downloaded model files when using dynamic or static language models for semantic feature extraction; otherwise the module may not work properly.
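The kind of information that can be read from a model name is sketched below. This is not PyRBP's actual parsing code. It assumes the name contains a fragment such as 3mer, as in the example folder pytorch_model_3mer, and, for static models, one of the four model type names. The helper describe_model_path and the file name word2vec_4mer.model are hypothetical.

```python
import os
import re

LM_TYPES = ("word2vec", "fasttext", "doc2vec", "GloVe")


def describe_model_path(model_path):
    """Hypothetical helper: guess (LM type, kmer) from the last path component."""
    name = os.path.basename(os.path.normpath(model_path))
    match = re.search(r"(\d+)mer", name)
    kmer = int(match.group(1)) if match else None
    # Case-insensitive search for one of the four static model names.
    lm_type = next((t for t in LM_TYPES if t.lower() in name.lower()), None)
    return lm_type, kmer


dynamic_info = describe_model_path(
    "/home/wangyansong/PyRBP/src/dynamicRNALM/circleRNA/pytorch_model_3mer"
)
static_info = describe_model_path("/tmp/models/word2vec_4mer.model")  # made-up name
print(dynamic_info)  # (None, 3)
print(static_info)   # ('word2vec', 4)
```

A check of this kind shows why renaming a file can break the match: once the 3mer fragment or the model type is gone, nothing in the name ties the model to the kmer argument.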

PyRBP.Features.generateStructureFeatures(dataset_path='', script_path='', basic_path='', W=101, L=70, u=1, dataset_name='')

This function calls RNAplfold to compute pair probabilities of locally stable secondary structures [RNAplfold]. The RNAplfold scripts can be downloaded from the figshare link.

Parameters:

dataset_path:str, default='': Path to the fasta file.

script_path:str, default='': The path where the RNAplfold scripts are located.

basic_path:str, default='': The path where the generated secondary structure profile files are stored. Four folders, E, H, I and M, are created under it, along with the final structure information file combined_profile.txt.

W:int, default=101: Average the pair probabilities over windows of the given size.

L:int, default=70: Maximum allowed separation of a base pair (the span). No pair (i, j) with j−i > L is allowed. RNAplfold itself falls back to the window size when no span is given.

u:int, default=1: Compute the mean probability that regions of length 1 to u are unpaired.

dataset_name:str, default='': To make it easier to store structural information for multiple datasets, dataset_name labels each dataset and is combined with basic_path to form the storage path.
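The W, L and u parameters correspond to the -W, -L and -u options of the RNAplfold command line. The sketch below builds such an argument list and spells out the span rule for a base pair (i, j). Both helper names are hypothetical, and the scripts shipped with PyRBP may be invoked differently, so treat this as an illustration of the parameters only.

```python
def rnaplfold_args(W=101, L=70, u=1):
    """Hypothetical helper: argument list for a plain RNAplfold call."""
    return ["RNAplfold", "-W", str(W), "-L", str(L), "-u", str(u)]


def pair_allowed(i, j, L=70):
    """A base pair (i, j) with i < j is allowed only if j - i does not exceed L."""
    return j - i <= L


args = rnaplfold_args()
print(" ".join(args))        # RNAplfold -W 101 -L 70 -u 1
print(pair_allowed(1, 71))   # True
print(pair_allowed(1, 72))   # False
```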

Attributes:

path:str: Used to store the final secondary structure information file.

E_path:str: Used to store the E_RNAplfold secondary structure information file.

M_path:str: Used to store the M_RNAplfold secondary structure information file.

I_path:str: Used to store the I_RNAplfold secondary structure information file.

H_path:str: Used to store the H_RNAplfold secondary structure information file.

cmd:str: Used to store the command that summarizes the secondary structure information files.
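The output layout described above can be sketched as follows. It assumes that dataset_name is appended to basic_path as a subdirectory, which the documentation does not state explicitly. The helper structure_paths and the example values are hypothetical.

```python
import os


def structure_paths(basic_path, dataset_name):
    """Hypothetical helper: expected output locations for one dataset."""
    # Assumption: dataset_name is a subdirectory of basic_path.
    root = os.path.join(basic_path, dataset_name)
    paths = {letter: os.path.join(root, letter) for letter in ("E", "H", "I", "M")}
    paths["combined"] = os.path.join(root, "combined_profile.txt")
    return paths


paths = structure_paths("structure_out", "demo_dataset")
for key in ("E", "H", "I", "M", "combined"):
    print(key, paths[key])
```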

Note

Note that you need to give the RNAplfold scripts executable permissions using the following command:

chmod 764 path_to_the_scripts
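The same permission bits can be set from Python with os.chmod, which may be convenient in a setup script. This is a minimal sketch for POSIX systems. The helper make_executable is hypothetical, and the temporary file only stands in for one of the RNAplfold scripts.

```python
import os
import stat
import tempfile


def make_executable(script_path):
    """Apply mode 764 (rwxrw-r--), the same bits as `chmod 764`."""
    os.chmod(script_path, 0o764)
    return stat.S_IMODE(os.stat(script_path).st_mode)


# Demonstration on a throwaway file instead of a real script.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_script = handle.name
mode = make_executable(demo_script)
print(oct(mode))  # 0o764 on POSIX systems
os.remove(demo_script)
```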

[RNAplfold]

R. Lorenz, S.H. Bernhart, C. Hoener zu Siederdissen, H. Tafer, C. Flamm, P.F. Stadler and I.L. Hofacker (2011), “ViennaRNA Package 2.0”, Algorithms for Molecular Biology 6:26.