On Pure Ruby Sparse Matrices
October 18, 2011
A while ago I was working on a web app built on Rails that needed some very basic tag analysis. Having once upon a time obtained an MS in Computational Mathematics, I will always view tags as a big matrix: a big, sparse matrix with tag names along one dimension and the objects being tagged along the other. When you think about tags like this, all kinds of avenues of analysis open up.
You can think of objects as tag vectors (or tags as object vectors) with non-zero entries representing the tags on an object. By multiplying two tag vectors together you get the dot product, which is the cosine similarity when both vectors are normalized to unit length. This is a nice rough estimate of how similar two objects are with respect to their given tags. Multiplying a tag vector (a.k.a. the query vector) with the whole matrix gives you a similarity score for every object (or tag), and the highest scores are the most similar ones. Math is fun!
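To make that concrete, here is a minimal sketch of the idea in plain Ruby. It assumes each tag vector is stored as a Hash from tag name to weight, holding only the non-zero entries. The names `SparseTagVector` and `most_similar` are made up for this illustration and are not the API of any existing gem.

```ruby
# Hypothetical sparse tag vector: only non-zero entries live in the Hash.
class SparseTagVector
  attr_reader :entries

  def initialize(entries = {})
    # Drop explicit zeros so the Hash holds only non-zero weights.
    @entries = entries.reject { |_tag, weight| weight.zero? }
  end

  # Dot product: iterate over the smaller Hash and look each tag up in
  # the larger one, so the cost depends on the non-zero entries only.
  def dot(other)
    small, large = [entries, other.entries].sort_by(&:size)
    small.sum { |tag, weight| weight * large.fetch(tag, 0) }
  end

  # Euclidean length of the vector.
  def norm
    Math.sqrt(entries.values.sum { |weight| weight * weight })
  end

  # Cosine similarity: the dot product divided by both lengths.
  # Returns 0.0 when either vector is empty.
  def cosine(other)
    denominator = norm * other.norm
    denominator.zero? ? 0.0 : dot(other) / denominator
  end
end

# Score every object against the query vector, most similar first.
# `objects` is assumed to be a Hash of name => SparseTagVector, which
# plays the role of the rows of the tag matrix.
def most_similar(query, objects)
  objects
    .map { |name, vector| [name, query.cosine(vector)] }
    .sort_by { |_name, score| -score }
end
```

Calling `most_similar(query, objects)` returns `[name, score]` pairs, so the first few entries are the closest matches. Storing a Hash per row is the simplest thing that works. It is not necessarily the best layout for a real sparse matrix.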
To this end I started poking around trying to find a sparse matrix library written for Ruby to make my life easier. Alas, no glory. Now, I know that Ruby doesn’t have the same math support that, say, Python has, but I would have thought someone would have thrown something usable together.
Side note: NumPy and SciPy are awesome.
I ended up hacking together a very basic SparseVector class to take care of my immediate needs, but felt that there was a chance to create something new for the community. I looked at things like the LAPACK Ruby bindings and NArray, but these all dealt with dense matrices. My data was incredibly sparse, so they didn’t really cut it. With that said, I have been working on a sparse matrix gem. It’s a pure Ruby implementation, so it should be nice and portable, and good enough for small prototypes and datasets. I wanted it to be in Ruby so that I could subclass the standard library Matrix class and have it run anywhere Ruby runs. Not having any bindings to compile is a plus for me, since it means I don’t have to worry about the different environments it might be running in. If you are working with really large datasets, then this probably isn’t for you right now.
Check out my progress over on GitHub.