Feature generation examples

This page shows how to generate three types of features using the PyRBP.Features module.

Generating biological features for the AGO1 dataset

The dimensionality of these features is independent of sequence length. When all seven feature types are generated together and each parameter is set to its maximum value, every sequence yields a 3472-dimensional feature vector.

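# The feature-generation functions live in the PyRBP.Features module.
from PyRBP.Features import generateBPFeatures

# Read the AGO1 dataset as an example. Replace the path to load the sequences of your own dataset.
fasta_path = '/home/wangyansong/PyRBP/src/RNA_datasets/circRNAdataset/AGO1/seq'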

sequences = read_fasta_file(fasta_path)
biological_features = generateBPFeatures(
    sequences,
    pseudoKNC=True, ktuple=5,
    zigzag_coding=True,
    guanine_cytosine_Quantity=True,
    nucleotide_tilt=True,
    percentage_of_bases=True,
    PGKM=True, gapValue=5, kValue=2, mValue=2,
    DPCP=True,
)

print(type(biological_features))
print(biological_features.shape)
output:
<class 'numpy.ndarray'>
(34636, 3472)
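To illustrate why such features do not depend on sequence length, here is a minimal sketch of a k-mer frequency vector. It is not PyRBP's implementation: the function name `kmer_frequency_vector` and the `ACGU` alphabet are assumptions made for this example. Because the vector has one slot per possible k-mer, its length is always 4**k, however long the input sequence is.

```python
from itertools import product

import numpy as np


def kmer_frequency_vector(sequence, k=2, alphabet="ACGU"):
    """Return normalized counts of every possible k-mer (length 4**k).

    Hypothetical helper for illustration only; not part of PyRBP.
    """
    # Fixed vocabulary: one slot per possible k-mer, independent of sequence length.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = np.zeros(len(index), dtype=float)
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if kmer in index:  # skip k-mers containing symbols outside the alphabet
            counts[index[kmer]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


short_vec = kmer_frequency_vector("ACGUACGU", k=2)         # 8 nt
long_vec = kmer_frequency_vector("ACGU" * 25 + "A", k=2)   # 101 nt
print(short_vec.shape, long_vec.shape)  # (16,) (16,)
```

Both inputs map to a 16-dimensional vector, which is the property that lets sequences of different lengths share one feature matrix.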

Generating semantic information for the AGO1 dataset

When generating sequence semantic information, we need to use various language models trained for RBP sequences, which can be downloaded from figshare.

Dynamic semantic information generation

Generating dynamic semantic information requires the RBPBERT model (available for download here). When calling the generation function, set the model parameter to the storage path of the k-mer RBPBERT model (k = 3, 4, 5 or 6).

# Generate dynamic semantic information. The value of kmer must match the k-mer model given in the model path.
bert_features = generateDynamicLMFeatures(sequences, kmer=4, model='/home/wangyansong/PyRBP/src/dynamicRNALM/circleRNA/pytorch_model_4mer')
print(type(bert_features))
print(bert_features.shape)

output:

Some weights of the model checkpoint at /home/wangyansong/PyRBP/src/dynamicRNALM/circleRNA/pytorch_model_4mer were not used when initializing BertModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
<class 'numpy.ndarray'>
(34636, 98, 768)
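The second dimension of the shape above (98) is consistent with splitting each sequence into overlapping 4-mers, one token per position. The sketch below shows the arithmetic under two assumptions that the page does not state explicitly: that the sequences are 101 nucleotides long and that the k-mers are extracted with a stride of 1. The helper `to_kmers` is hypothetical and is not PyRBP's tokenizer.

```python
def to_kmers(sequence, k):
    """Split a sequence into overlapping k-mers with stride 1 (illustrative helper)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


toy_sequence = "ACGU" * 25 + "A"  # a made-up 101-nt sequence

tokens_4mer = to_kmers(toy_sequence, 4)
tokens_3mer = to_kmers(toy_sequence, 3)

# A sequence of length L yields L - k + 1 overlapping k-mers.
print(len(tokens_4mer))  # 98
print(len(tokens_3mer))  # 99
print(tokens_4mer[:3])   # ['ACGU', 'CGUA', 'GUAC']
```

Under these assumptions, choosing a different k changes the number of tokens per sequence, which is why the kmer argument has to match the model that was trained on that k.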

Static semantic information generation

Generating static semantic features requires static language models trained on RBP sequences: fasttext, word2vec, doc2vec and GloVe (available for download here). When calling the generation function, set the model parameter to the storage path of the k-mer model (k = 3, 4, 5 or 6).

# Generate static semantic information. The value of kmer must match the k-mer model given in the model path.
fasttext_features = generateStaticLMFeatures(sequences, kmer=3, model='/home/wangyansong/PyRBP/src/staticRNALM/circleRNA/circRNA_3mer_fasttext')
GloVe_features = generateStaticLMFeatures(sequences, kmer=3, model='/home/wangyansong/PyRBP/src/staticRNALM/circleRNA/circRNA_3mer_GloVe')
word2vec_features = generateStaticLMFeatures(sequences, kmer=3, model='/home/wangyansong/PyRBP/src/staticRNALM/circleRNA/circRNA_3mer_word2vec')
doc2vec_features = generateStaticLMFeatures(sequences, kmer=3, model='/home/wangyansong/PyRBP/src/staticRNALM/circleRNA/circRNA_3mer_doc2vec')

# All four model types produce features of the same shape.
print(fasttext_features.shape)
print(GloVe_features.shape)
print(word2vec_features.shape)
print(doc2vec_features.shape)

output:

(34636, 99, 100)
(34636, 99, 100)
(34636, 99, 100)
(34636, 99, 100)
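As a rough picture of what "static" means here, the sketch below maps each 3-mer to one fixed 100-dimensional vector, so the same k-mer always receives the same embedding regardless of its context. The lookup table is filled with random numbers and stands in for a trained model; the names `table` and `embed` and the toy sequences are assumptions for this example, not PyRBP internals.

```python
import numpy as np

rng = np.random.default_rng(0)
k, dim = 3, 100
toy_sequences = ["ACGU" * 25 + "A", "UGCA" * 25 + "U"]  # two made-up 101-nt sequences

# Hypothetical lookup table: one fixed vector per k-mer seen in the toy data.
vocab = sorted({s[i:i + k] for s in toy_sequences for i in range(len(s) - k + 1)})
table = {kmer: rng.normal(size=dim) for kmer in vocab}


def embed(sequence):
    """Replace every overlapping k-mer with its fixed vector."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return np.stack([table[kmer] for kmer in kmers])


features = np.stack([embed(s) for s in toy_sequences])
print(features.shape)  # (2, 99, 100)

# "ACG" occurs at positions 0 and 4 of the first sequence and gets the same vector.
print(np.array_equal(features[0, 0], features[0, 4]))  # True
```

A dynamic model such as RBPBERT differs in that the vector for a k-mer also depends on the surrounding tokens.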

Secondary structure information generation

Generating secondary structure information requires the RNAplfold scripts, which are located in the folder of the same name in the code package.

# Here we use only the positive samples of the AGO1 dataset as an example.
fasta_path = '/home/wangyansong/PyRBP/src/RNA_datasets/circRNAdataset/AGO1/positive'
script_path = '/home/wangyansong/PyRBP/src/PyRBP/RNAplfold' # directory containing the RNAplfold scripts
# Four subfolders (E, H, I and M) and the final combined_profile.txt will be created under basic_path.
basic_path = '/home/wangyansong/PyRBP/src/circRNAdatasetAGO1'
structure_features = generateStructureFeatures(fasta_path, script_path=script_path, basic_path=basic_path, W=101, L=70, u=1)
print(structure_features.shape)

If the basic_path you specified already exists, the following messages are printed first. They do not affect the subsequent generation of structural features.

Can not make directory: /home/wangyansong/PyRBP/src/circRNAdatasetAGO1/E/
Can not make directory: /home/wangyansong/PyRBP/src/circRNAdatasetAGO1/H/
Can not make directory: /home/wangyansong/PyRBP/src/circRNAdatasetAGO1/I/
Can not make directory: /home/wangyansong/PyRBP/src/circRNAdatasetAGO1/M/

output:

(17318, 101, 5)
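If you would like to know in advance whether the "Can not make directory" messages will appear, you can check which of the four subfolders already exist under basic_path. This is a small sketch using only the standard library; the helper `existing_profile_dirs` is hypothetical and is not part of PyRBP. The temporary directory stands in for a real basic_path.

```python
import tempfile
from pathlib import Path


def existing_profile_dirs(basic_path, names=("E", "H", "I", "M")):
    """Return the subfolder names that are already present under basic_path."""
    base = Path(basic_path)
    return [name for name in names if (base / name).is_dir()]


# Demonstration on a temporary directory in which only E exists.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "E").mkdir()
    found = existing_profile_dirs(tmp)

print(found)  # ['E']
```

An empty list means none of the subfolders exist yet, so no such message would be expected for that path.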

Note

Extracting secondary structure information takes a long time, so please be patient.
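Because the extraction is slow, it can be worth saving the resulting array to disk and reloading it on later runs. The sketch below shows one way to do this with NumPy. The helper `load_or_compute` and the file name are assumptions for this example, and `slow_stand_in` is a placeholder for the real feature-generation call.

```python
import os
import tempfile

import numpy as np


def load_or_compute(cache_file, compute):
    """Return the cached array if it exists; otherwise compute it and cache it."""
    if os.path.exists(cache_file):
        return np.load(cache_file)
    features = np.asarray(compute())
    np.save(cache_file, features)
    return features


calls = []


def slow_stand_in():
    """Placeholder for an expensive computation; records how often it runs."""
    calls.append(1)
    return np.zeros((4, 101, 5))


with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "structure_features.npy")
    first = load_or_compute(cache_file, slow_stand_in)   # computed and saved
    second = load_or_compute(cache_file, slow_stand_in)  # loaded from disk

print(first.shape, len(calls))  # (4, 101, 5) 1
```

In practice, the compute argument could be a lambda that wraps generateStructureFeatures with your own paths, assuming it returns a NumPy array as the printed shape suggests.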