rag_tutorial_clustering | Schlick Group at NYU

Clustering RNA Motifs

We have implemented a clustering method that distinguishes between RNA-like and non-RNA-like motifs. In order to create a cluster, we use the topological data of RNA secondary structures and training data from currently existing topologies. Every graph can be represented by a Laplacian matrix, and hence an eigenvalue spectrum, however, spectra of different vertex graphs cannot be compared. As a result, we created new variables that will help us analyze and compare different graphs.

By using a least squares method plot, we find that our slope (called beta) measures the average spacing between positive eigenvalues, and that the intercept (called alpha) represents the second largest eigenvalue. Since the eigenvalue spectra of a graph is dependent on the number of vertices, we normalize beta by multiplying it by the number of vertices. In conclusion, we represent each graph by (V*Beta₁, alpha₁, V*Beta₂, alpha₂), where the subscript 1 represents the Laplacian, and the subscript 2 represents the square of the Laplacian. These values are used to cluster RNA-like and non-RNA-like motifs.

The clustering method that we now use as of 2010 is called k-Nearest Neighbor (k-NN), which is an instance-based supervised classification algorithm for classifying objects based on k closest training data in description space. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors. Typically, the parameter k is a small number from 1 to 3. We use k=3 because we found it to yield the best accuracy. The result is two main clusters consisting of RNA motifs, one which we call "RNA-like" and the other we call "non-RNA-like". We distinguish between these two groups because more motifs that have been found in nature fall in the RNA-like group than in the non-RNA-like group. Those graphs that fall in the RNA-like group that have not yet been found in nature are considered "candidates", in that they have a higher chance of being found than those motifs in the non-RNA-like group. We represent these candidates as blue graphs in our database.

Previously, we clusted by using Partitioning Around Medoids (PAM), which computes two medoids. PAM places each object of our data set next to the nearest medoid and determines a distance matrix. We further reduced the clustering to two dimensions by using multi-dimensional scaling (MDS). The distance matrix was used to plot points in two dimensions. Our decision to use k-NN over PAM was motivated by the massive increase in the availability of RNA secondary structures in the recent years since 2004. By supplementing the clustering of the topological descriptors (described above) with information from topologies found in nature, we expect the accuracy of our predictions to have improved.