Feature selection examples

The PyRBP integrates several feature selection methods and provides a simple interface, which requires only the features to be selected, the dataset label, and the number of features you want to selected.

Prepare three types of features for feature selection

The AGO1 dataset is used to generate biological features, semantic information and secondary structure information respectively.

fasta_path = '/home/wangyansong/PyRBP/src/RNA_datasets/circRNAdataset/AGO1/seq'
label_path = '/home/wangyansong/PyRBP/src/RNA_datasets/circRNAdataset/AGO1/label'

sequences = read_fasta_file(fasta_path)  # read sequences and labels from given path
label = read_label(label_path)

# generate biological features
biological_features = generateBPFeatures(sequences, PGKM=True)
print(biological_features.shape)

# generate static semantic information
static_semantic_information = generateStaticLMFeatures(sequences, kmer=3, model='/home/wangyansong/PyRBP/src/staticRNALM/circleRNA/circRNA_3mer_fasttext')
print(static_semantic_information.shape)

# generate secondary structure information
structure_features = generateStructureFeatures(fasta_path, script_path='/home/wangyansong/PyRBP/RNAplfold', basic_path='/home/wangyansong/PyRBP/src/circRNAdatasetAGO1', W=101, L=70, u=1)
print(structure_features.shape)

output:

(34636, 400)
(34636, 99, 100)
(34636, 101, 5)

Feature selection procedure

It should be noted that the feature dimension to be passed into the feature selection method needs to be two-dimensional, so features with semantic information and secondary structure information need to be downscaled for feature selection (the same applies to machine learning classifiers).

# The first method of dimensionality reduction: multiplying the last two dimensions.
print('-------------------first method------------------------')
static_semantic_information_1 = np.reshape(static_semantic_information, (static_semantic_information.shape[0], static_semantic_information.shape[1] * static_semantic_information.shape[2]))
print(static_semantic_information_1.shape)

structure_features_1 = np.reshape(structure_features, (structure_features.shape[0], structure_features.shape[1] * structure_features.shape[2]))
print(structure_features_1.shape)


# The second method of dimensionality reduction: sum operation according to one of the last two dimensions.
print('------------------second method-----------------------')
print('----------compress the third dimension----------------')
static_semantic_information_2 = np.sum(static_semantic_information, axis=1)
print(static_semantic_information_2.shape)
static_semantic_information_3 = np.sum(static_semantic_information, axis=2)
print(static_semantic_information_3.shape)

print('---------compress the second dimension----------------')
structure_features_2 = np.sum(structure_features, axis=1)
print(structure_features_2.shape)
structure_features_3 = np.sum(structure_features, axis=2)
print(structure_features_3.shape)

output:

-------------------first method------------------------
(34636, 9900)
(34636, 505)
------------------second method-----------------------
----------compress the third dimension----------------
(34636, 100)
(34636, 99)
---------compress the second dimension----------------
(34636, 5)
(34636, 101)

Input the three processed features into the feature selection method for refinement (here we use the CIFE method as an example).

refined_biological_features = cife(biological_features, label, num_features=10)
print(refined_biological_features.shape)

refined_static_semantic_information = cife(static_semantic_information_1, label, num_features=10)
print(refined_static_semantic_information.shape)

refined_structure_features = cife(structure_features_1, label, num_features=10)
print(refined_structure_features.shape)

output:

(34636, 10)
(34636, 10)
(34636, 10)

Note

The calculation process of some feature selection methods is more complicated, so the running time is longer, please be patient.