Clustering RNA Motifs

We have developed a clustering method that distinguishes between RNA-like and non-RNA-like motifs. In order to create a cluster, we use the topological data of RNA secondary structures. Every graph can be represented by a Laplacian matrix, and hence an eigenvalue spectrum, however, spectra of different vertex graphs cannot be compared. As a result, we created new variables that will help us analyze and compare different graphs.

By using a least squares method plot, we find that our slope (called beta) measures the average spacing between positive eigenvalues, and that the intercept (called alpha) represents the second largest eigenvalue. Since the eigenvalue spectra of a graph is dependent on the number of vertices, we normalize beta by multiplying it by the number of vertices. In conclusion, we represent each graph by (V*Beta₁, alpha₁, V*Beta₂, alpha₂), where the subscript 1 represents the Laplacian, and the subscript 2 represents the square of the Laplacian. These values are used to cluster RNA-like and non-RNA-like motifs.

The clustering method that we used is called Partitioning Around Medoids (PAM), which computes two medoids. Each object of our data set is put next to the nearest medoid and PAM determines a distance matrix. We further reduce the clustering to two dimensions by using multi-dimensional scaling (MDS). The distance matrix can be used to plot points in two dimensions. The result is two main clusters consisting of RNA motifs, one which we call "RNA-like" and the other we call "non-RNA-like". We distinguish between these two groups because more motifs that have been found in Nature fall in the RNA-like group than in the non-RNA-like group. Those graphs that fall in the RNA-like group that have not yet been found in Nature are considered "candidates", in that they have a higher chance of being found than those motifs in the non-RNA-like group. We represent these candidates as blue graphs in our database.