December 27, 2011 § 1 Comment
Rails 3.2 RC1 was released last week and it brings some pretty cool new things. One of those is something I have played with building out myself (tagged logging) and another is something I actually did (ActiveRecord Store). When I wrote my data-attributes gem, I was inspired to do so because I had a users table with an ever-growing number of email permission checks. I never queried against these columns, and they made a SELECT * FROM users; really painful to try and read. I decided to just throw them all in a serialized column and be done with it.
When I first read the release notes for 3.2 I had two thoughts go through my head.
1) Well, crap. data-attributes is now dead and useless. (Yes, I use <code> tags in my head.)
2) Awesome, the solution I came up with for a problem I was having is the same one that the Rails team came up with. Maybe I know a thing or two about what I am doing.
But as I started to compare the new ActiveRecord::Store with data-attributes, I began to realize that my little project isn’t quite dead yet. The accessors generated by the store call in AR::S aren’t full attribute accessors. By this I mean they don’t go through write_attribute before committing the new data to the serialized hash. This prevents you from intercepting the accessor call and doing some pre/post processing. You also don’t get default values as with other attribute accessors. Definitely not a huge deal, but something to be aware of. AR::S does mark attributes as dirty, though; data-attributes does not do this as of yet.
As for the future of data-attributes, I think that it is actually dead in the long run. If I were to work on this problem more in the future, I’d probably do so by adding to AR::S instead of continuing to work on data-attributes.
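For context, the accessors in question come from Rails 3.2’s store :settings, accessors: [...] declaration. The heart of what a store-style accessor does — read and write keys in a single serialized hash, without going through write_attribute — can be sketched in plain Ruby (the class and attribute names below are made up for illustration):

```ruby
# A plain-Ruby sketch of store-style accessors: each generated method reads
# or writes a key in one backing Hash (standing in for the serialized
# column), bypassing the normal attribute read/write machinery.
class Settings
  def self.store_accessor(*keys)
    keys.each do |key|
      define_method(key) { @store[key] }
      define_method("#{key}=") { |value| @store[key] = value }
    end
  end

  store_accessor :email_on_comment, :email_on_follow

  def initialize
    @store = {} # stands in for the serialized column
  end
end

s = Settings.new
s.email_on_comment = true
s.email_on_comment # => true
```

Because the writer pokes the hash directly, there is no single choke point like write_attribute to intercept, which is exactly the limitation described above.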
November 14, 2011 § 1 Comment
Database performance is always something that has fascinated me. Not on an admin level, but from the point of view of a developer. Even with a stock database setup there are things that a developer can do to help optimize access. If you do any middle- or backend web development you will be familiar with running queries against a database. You will also be familiar with how adding indexes on the commonly queried columns of your tables can increase performance. But it turns out that not all indexes are created equal.
I have been working with the InnoDB storage engine of MySQL in Ruby on Rails web development for a while now. Over the years I began to notice something but was never able to figure out what was causing it. Whenever I queried tables using the id column, the queries would return much faster than when I queried against some other indexed column, even if that column was an indexed integer. For those not familiar with Rails conventions, every table is created with an integer column, id, that has the table’s primary index placed on it. It is auto-incrementing so every row has a unique value, a requirement of it being a primary key. Now, InnoDB constructs the primary key in a special way, using a clustered index. What this means is that the primary key index points directly to the page the row data resides on, either on disk or in memory. Non-clustered, or secondary, indexes point to the primary key instead of the row data. This little bit of indirection causes lookups on secondary indexes to be slower than lookups on the primary key.
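The extra hop can be illustrated with a toy model in Ruby, where each index is just a Hash (the table and values are made up):

```ruby
# Toy model: a clustered (primary) index maps the key straight to the row
# data, while a secondary index maps its key to the primary key, which then
# requires a second lookup to reach the row.
rows_by_id  = { 1 => { id: 1, email: "user@example.com" } } # clustered index
id_by_email = { "user@example.com" => 1 }                   # secondary index

row_via_primary   = rows_by_id[1]                               # one lookup
row_via_secondary = rows_by_id[id_by_email["user@example.com"]] # two lookups
```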
It’s always satisfying figuring out the reasons behind behavior you do not initially understand.
For some more information you can check out the MySQL Reference Manual. This investigation was prompted by an answer to a question I posted on Stackoverflow about partial indexes and the group by operation.
November 10, 2011 § Leave a comment
When developing code, there are some things that you should leave to the experts. Encryption is one of them. When I wrote my encrypted cookies and encrypted cookie sessions gems, one of the things I didn’t want to do was write any sort of encryption routines. Luckily ActiveSupport has some help in the way of ActiveSupport::MessageEncryptor. It’s used much like ActiveSupport::MessageVerifier, which is used for signed cookies in Rails. People much smarter than me have put these pieces together, so it just makes sense to use them. Almost nothing good can come from trying to do this stuff yourself. Here are some examples of a verifier and an encryptor being used.
> secret = ActiveSupport::SecureRandom.hex(10)
 => "379af645b8dcce20b607"
> verifier = ActiveSupport::MessageVerifier.new(secret)
 => #<ActiveSupport::MessageVerifier:0x00000103d7ce50 @secret="379af645b8dcce20b607", @digest="SHA1">
> signed_message = verifier.generate("sign this!")
 => "BAhJIg9zaWduIHRoaXMhBjoGRVQ=--af1e810b074b1abd6d9dcd775f71b1fafa53c218"
# this is "<base 64 encoded and serialized string>--<digest of string>"
> verifier.verify(signed_message)
 => "sign this!"
> verifier.verify(signed_message + "alittleextraontheend")
ActiveSupport::MessageVerifier::InvalidSignature
> verifier.verify("alittleextraatthebeginning" + signed_message)
ActiveSupport::MessageVerifier::InvalidSignature
> secret = ActiveSupport::SecureRandom.hex(20)
 => "c1578de6ec2e1789940729dc9d97b335fc7df588"
> encryptor = ActiveSupport::MessageEncryptor.new(secret)
 => #<ActiveSupport::MessageEncryptor:0x000001295337c8 @secret="c1578de6ec2e1789940729dc9d97b335fc7df588", @cipher="aes-256-cbc">
> encrypted_message = encryptor.encrypt_and_sign("Nothing to see here...")
 => "BAhJIl9YbDZkK0czS3o0ZkI0Yml6K05uYzgzM05meDJjWWU4QWh0YzdFeFFrbC85b3BocHFORWtRWXdDVWIxaW45TEQ5LS1yVkxGTURJYzFWb2pva0UrVkkwTkFnPT0GOgZFRg==--e61e02a818960d66c7865f5624fad63b1564283f"
> encryptor.decrypt_and_verify(encrypted_message)
 => "Nothing to see here..."
If your secret is too short you’ll get an OpenSSL::Cipher::CipherError. Make sure your secret is at least 32 bytes. Using ActiveSupport::SecureRandom.hex(16) should satisfy this requirement, but obviously longer is better. You can also pass a :digest => <digest> option as a second argument to both initializers to specify a different algorithm to use.
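For the curious, the signed-message format above can be approximated in plain Ruby with nothing but the standard library. This is a rough sketch of the idea, not MessageVerifier’s actual implementation — serialization and constant-time comparison details differ:

```ruby
require "openssl"
require "base64"

# Sketch of sign-then-verify: "<base64 payload>--<HMAC-SHA1 of payload>".
SECRET = "379af645b8dcce20b607"

def sign(message)
  data = Base64.strict_encode64(message)
  "#{data}--#{OpenSSL::HMAC.hexdigest("SHA1", SECRET, data)}"
end

def verify(signed)
  data, digest = signed.split("--")
  raise "InvalidSignature" unless digest == OpenSSL::HMAC.hexdigest("SHA1", SECRET, data)
  Base64.strict_decode64(data)
end

verify(sign("sign this!"))                # => "sign this!"
# verify(sign("sign this!") + "tampered") # raises "InvalidSignature"
```

This is exactly why it makes sense to lean on ActiveSupport: the real thing also handles serialization of arbitrary objects and timing-safe comparison for you.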
One of my thoughts is to submit these two gems into Rails so that people don’t make the mistake of trying to roll their own encryption systems for cookies. We’ll see how it goes.
November 7, 2011 § Leave a comment
Earlier this year I got the itch to start doing some open source stuff. I hadn’t done anything of the sort before, so I didn’t really have any idea where to begin. There have definitely been some bumps along the way, but I think that I have made some progress. Looking back, I identified a few milestones I hit along the way. They weren’t obvious at the time, and I certainly can’t say that I meant to achieve each one, but I can clearly see them now. Each one helped me a little more on my way to being an open source contributor. Each one also elicited different thoughts and feelings. Now, in no particular order and mostly for me to remember what I was thinking…
First Repository Uploaded to GitHub
I signed up for GitHub back at the end of 2008. I forked a few repos here and there just because there was a button to do it. I didn’t really know anything about git and didn’t do anything with the repos I had forked. I didn’t actually upload a repo of my own until the end of March 2011. I had been working with subdomains in Rails 3 and experimenting with the new cookie jar chaining. I found a use case for adding another cookie jar that assigned the domain of the cookie. I abstracted it out, gave it a poor project name, and put it up on GitHub. [tld-cookies]
It was a bit nerve-racking putting something you have made out there for the whole world to see. I am sure artists feel the same way when they show their paintings or writing to the general public. I had no expectations that people would look at it or care about it, but still, it was out there for people to scoff at if they saw fit. Nervous as I was, I felt something that I wasn’t expecting… liberation. Even if no one ever used my code, I was now a contributing member of the open source community. It felt great.
Two more repos went up almost immediately.
First Gem Published
When I pushed my first repos, the code was more or less in its finalized state, so the gem(s) followed quickly behind. I remember having my RubyGems.org dashboard up all day watching those first downloads trickle in. I quickly realized that the first dozen or so were mirrors downloading all the new and updated gems. Not going to lie, I was a little sad when I realized that. That first gem still sits at 50 downloads, probably because of its bad name, but some of the other ones have a few hundred downloads.
It’s awesome. Some of my peers, not my coworkers, are using my code. Building things that people use is awesome, whether it’s a website or a library. As a software developer, I get a thrill when websites I work on get traction and usage. Building successful businesses is my end goal, but with that being said, I get a very different and more personal thrill when I see some of my code being used by other developers. It is a type of validation of your technical skills and it feels great.
First Gist Uploaded
We are a Rails shop at work. One of the things that annoyed me when dealing with debug statements was where to print them. Do I print them to the log file or to standard out? The log file is the obvious choice, but if you are working in the console, then using puts so you don’t have to switch terminal windows might be preferable. I spent some time and came up with a useful little function. I realized that this might be useful to others, so I put it up on GitHub as a gist and called it a day. Looking back, this was really the first time I just casually threw something out there that I thought might be interesting but wasn’t a fully functioning library.
First Issue Filed Against Me
Then came the bugs… I had my first GitHub issue opened against me about two months after I published my first gems. Now here is the funny thing: I got excited. Like watching the downloads trickle in on RubyGems.org, someone opening a bug against you means they care enough about what you are doing to want/need it to work right. Obviously I want it to work as well, but I know that I am bound to let bugs slip through. When someone took the time to report a bug and identify the potential source of the problem, I knew that I was doing something right. Well, something was wrong but the situation was right… right? [bug]
First Issue Filed Against Me Fixed
Fixing the issue and pushing out a new version was satisfying. I had the satisfaction of knowing that I cared enough about a little side project and the people using it to fix the issues and rerelease the gem.
Dealing With Future Compatibility
When I first pushed out my encrypted-cookie-store gem it was Rails 3.0 only. I figured I’d update it when 3.1 was released because I wasn’t expecting people to really care that much about it. Turns out I was mistaken. Not long after it was pushed out I got a request to add 3.1 compatibility. Well, one of my customers was asking for it and I was going to have to do it sooner or later, so I fired up rvm and got to work on figuring out what changed between 3.0 and 3.1. It wasn’t that big of a deal, but having different methods defined based on which gems are installed seems like a less than ideal way to go about things. Thus concluded my first attempt at programming against an unreleased version of a project.
First Pull Request for One of My Projects
Now having someone open an issue against you and pointing you in the direction of the problem is nice and all, but having someone open a pull request against your project because they liked/needed it enough to fix the issue themselves, that is an awesome feeling. My little sparse matrix library was getting some love from across the Atlantic. Couldn’t have been happier.
First Issue Opened Against Another Project
Almost as scary as putting your own code out there is opening an issue against a well established project like Rails. You keep asking yourself, “Am I doing something wrong?” or “Do I just not understand what is going on?” I mean, what if your issue is just a result of you being stupid and not knowing what you are doing? These guys are busy and don’t really need to deal with bugs that probably aren’t actually bugs. So you run every test case you can think of, and then some that have nothing to do with your issue, you know, just in case. Then you hope that everyone is nice to the newbie. PS – They were.
First Pull Request to Another Project
It made me even more nervous to offer up a fix. Opening that pull request was nerve-racking. I was patching one of the most used methods in ActiveRecord, so I obviously didn’t want to mess up. My pull request went through a couple of iterations, mostly on my test cases, but was finally accepted and merged into master 11 days after I opened the initial issue. Once again everyone was great and very helpful.
And then the blog started. I won’t lie to you, the last bit of motivation I needed to finally get this blog up and running was a bit of selfishness. I realized that I needed some way to let people know about the stuff I had put out there that might be useful to them. This obvious thought occurred when I came across the blog of someone working on a similar problem with encrypted cookies, saying that they were thinking about packaging it up and submitting it to Rails. Since I liked the way I was doing it better, I figured I needed something other than the GitHub page as a way to promote it. Since then I have started writing on more topics than just the gems I have written. A blog, when properly utilized, is much more than just an advertising platform for your own code.
Being Asked to Contribute to a Project
The most recent development in my open source career is being asked to contribute to SciRuby. They liked the work I was doing on my sparse matrix library and asked if they could include it in their offerings. Obviously I was ecstatic. Then I started thinking about the state of the code and how much work was left to do on it. Then I started worrying. It is very much still in an alpha/incomplete state. But now that I know that I have some people that are interested in it, I should be able to get working on it some more. Give that math side of my education some more use.
I have come a long way with my contributions to the open source community. From using to creating to contributing, each of these milestones has been a new challenge and a new experience. This isn’t a how-to for getting a start in OSS, but it is a list of things that you can look to for next steps when you are stuck and want to get more involved. It’s easy, and most people will be excited that you want to help or are offering something up they can use. Even if you don’t want to write something, use OSS. That is your true first step.
November 1, 2011 § Leave a comment
One doesn’t go through the process of starting a new ruby gem all that often. Even if you have published a few, it is easy to forget how to get one started. So here, for my benefit as much as anyone else’s, is a quick and easy way to get a gem started and hosted on GitHub.
On the command line navigate to the directory you want to use to work on your different gems.
> cd ~/Projects/gems
I like using bundler, so this is what I do:
> bundle gem <gem name>
      create  <gem name>/Gemfile
      create  <gem name>/Rakefile
      create  <gem name>/.gitignore
      create  <gem name>/<gem name>.gemspec
      create  <gem name>/lib/<gem name>.rb
      create  <gem name>/lib/<gem name>/version.rb
It will create a directory for you with the gem name and initialize a git repo with everything already staged to be committed. Now you need to create a new repo on GitHub. It’s pretty straightforward, so I’ll let GitHub’s on-screen instructions take it from there. My only suggestion would be to make the project name the same as your gem name to avoid confusion.
Note: If you use dashes “-” instead of underscores “_” in your gem name, bundler will treat them as module hierarchies and build folders to accommodate.
Now cd into your new gem’s directory.
> cd <gem name>
Hook the git repo up with GitHub. The link should be shown on the empty repo’s main page. Just copy that.
> git remote add origin git@github.com:<GitHub username>/<GitHub Project Name>.git
Now commit your skeleton. Everything is already staged for you.
> git commit -m "Initial commit"
Now push it out to GitHub
> git push origin master
And just like that you are ready to start working on your new ruby gem. You’ll also want to remember the following commands:
> rake build    # build <gem name>-<version>.gem into the pkg directory
> rake install  # build and install <gem name>-<version> on the local machine
> rake release  # tag the repo with the version number and push the new version out to RubyGems.org
Google will be able to tell you more about filling out the .gemspec file than I can. This is just to get you up and running.
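That said, for a rough idea of what goes in one, here is a minimal sketch — every value is a placeholder, and bundler’s generated template will look a bit different:

```ruby
require "rubygems"

# A minimal .gemspec sketch; the name, version, author, and globs below are
# all made-up placeholders to show the shape of the file.
spec = Gem::Specification.new do |s|
  s.name          = "my_gem"
  s.version       = "0.0.1"
  s.authors       = ["Your Name"]
  s.summary       = "One-line summary of what my_gem does"
  s.files         = Dir["lib/**/*.rb"] # ship everything under lib/
  s.require_paths = ["lib"]
end
```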
October 31, 2011 § Leave a comment
The Texas Rangers. I love these guys. The 2010 and 2011 seasons have been ones that I’ll remember for the rest of my life. Now, it’s always awesome when your team wins and goes deep into the playoffs, but that isn’t the main reason I’ll be remembering these years.
About six years ago my dad and I started a yearly pilgrimage to Surprise, AZ to take in some Spring Training. Around that same time I also started receiving the Newberg Report. And as soon as it was released, I started using the MLB At Bat app for my iPhone. I have always been a Rangers fan, comes from growing up in the Dallas area, but these three things got me back into it in a big way. Actually seeing the new players in March and watching them progress and come up through the system over the years, reading what Jamey Newberg and Scott Lucas had to say about them and the team, and being able to listen to all the games (Eric Nadel is the best sports announcer as far as I can tell) makes it so much easier to get emotionally invested.
But these still aren’t why I’ll remember these runs. My dad has been a Rangers fan since before they were the Rangers. The Washington Senators were his team, and when they moved to Texas, his loyalty followed. My mom has great stories of his devotion to the Rangers from back when they were dating and attending the University of Southern Mississippi in Hattiesburg. There’d be evenings when they’d be hanging out together, but when the game came on, he’d go sit out in the car to listen to it. The radio in my mom’s apartment couldn’t pick up the station the games were broadcast on, but my dad’s car radio had no such issues. He’d sit there for two or three hours and come back in when the game was over.
As fellow Rangers fans, we experienced the ups and downs of the ’90s teams that managed to make it to the playoffs and only win a single game between the three trips. I hated the Yankees before that, but having them knock us out all three times made me hate them even more. Then Jon Daniels changed everything. He put together The Plan: a five-year push meant to turn the Rangers into a serious contender by 2011. He very methodically pulled in key players at the right times and developed the Rangers farm system into the envy of the league. Did I mention that he wasn’t even 30 when he started this? Every year we were able to see more fruits of his labor, and the Rangers that weren’t supposed to come about until 2011 showed up a year early in 2010 to make an awesome run.
Oh, how I enjoyed these games. I am not going to lie, over the last two years I have been very near useless at work during the postseason. I watched most of the games at my parents’ house, even if they were afternoon games. My dad and I would sit in the two chairs in front of the tv and yell and scream our heads off, sharing the fist bumps of victory and the agony of defeat. First up in the 2010 playoffs was Tampa Bay, and winning the Division Series against them is probably the memory I’ll treasure most. When Upton popped up to Andrus to end the game, my dad turned to me and said, “Les, I literally have been waiting my whole life for this.” There may have been some wet eyes… It was an awesome thing to get to experience with my dad. The following roller coaster ride of beating the Yankees, with Feliz striking out A-Rod to send us to the World Series for the first time in franchise history with all the symbolism and irony it entailed, and then losing to the Giants in the Fall Classic just added to the memories. This year has been a great encore to that 2010 season. Back in the World Series again, but coming up short once again.
I was pretty depressed for about 2 hours after that final game, but then I shook it off. One of the things the Rangers did better than anyone else this year was shake off a bad game and show up ready to play. Josh Hamilton said after game 6 that one’s ability to put yesterday’s game behind them is the difference between being a professional and a fan. I can’t say that I can completely let go, but I’ll do my best. I’ll be a professional fan.
October 24, 2011 § Leave a comment
I’m a little late to the game with this post, but I had an interesting conversation with my buddy David Gleich, so I wanted to put this up. When I first heard about Netflix splitting its DVD and Instant Watch services into two different business entities, I thought it was a brilliant idea. It looked to me like another example of Netflix taking the lead and innovating in the media distribution business. Their two distribution channels are very, very different even though they aim to deliver similar content. So why keep them tied to each other when the only thing they really have in common is whether or not the user has already seen the content? Everywhere you looked there were good reasons to go through with the split, and in all honesty I can think of only a few reasons not to proceed, and only one I actually cared about.
By separating the two services, each new business unit would have been able to more aggressively pursue new opportunities and experiments in both content acquisition and innovative business models, at the cost of a little bit of extra work for customers using both services. Even after the price hike (which was totally reasonable; the people who jumped ship and rose up in protest couldn’t wrap their heads around the fact that they had been drastically underpaying for the service), I loved Netflix and was getting excited about all the possibilities for them.
A House of Cards is Stronger than You Think
Doubling down on a Netflix production studio, or even partnering with an existing one, is an amazing opportunity. They are working on House of Cards, but I’d love to see Netflix do some more content. Being a big fan of Doctor Who, Torchwood, and a number of other British tv shows, I have come to appreciate the short series. Torchwood season three was only five episodes, but was masterfully done. Jekyll was another miniseries that was only six episodes, but managed to be quite entertaining. If Netflix were to put out two or three miniseries a year as exclusive content only available on Netflix, that could be a huge draw. Not a huge investment in any single story line, with the ability to bring them back the following season if they are successful. I’d love it if Netflix turned into a variation of HBO, Showtime or Bravo that did short-form TV. Imagine if the only place you could watch the new episode of True Blood was on Netflix every week… This path is my personal favorite.
The Hulu+ Experience
While it is very nice to not have to watch commercials, I could see a subsidized plan that factors in limited commercial breaks being a possible success. I mean come on, Hulu makes us pay for Hulu+ and still runs ads during the normal commercial breaks.
Slice and Dice and Serve it Up
There are an untold number of ways Netflix could slice up their content and charge for it. Charge extra for HD. Have a plan that only has access to movies. Another that has access to TV only. A third with both movie and TV access. Charge a premium for early access to new releases. Netflix could even split up access to the content based on who is licensing it to them. What about an HBO, Showtime and Bravo only TV package? Or an NBC and Fox one? I know I’d love to have access to a greater variety of SyFy’s programming. Now I’m not saying that I would enjoy it if they started doing some of these, I’m just throwing them out there as possibilities. And as unlikely as they may be, these pricing strategies aren’t even on the table as possibilities while the DVDs are part of the package.
The DVD rental business would have had the same opportunities. Charging more for BluRays doesn’t have anything to do with my streaming usage. Throwing video game rentals in there, as it has been said they are planning, doesn’t really jibe with Instant Watch.
When it comes down to it, the only things these two businesses really need to share are a queue and a ratings system, but Netflix had said they were going to be splitting the two systems up and not sharing any data between them. This was really my only complaint. I have spent a fair bit of time rating over a thousand items on Netflix. I would be less than happy if I had to start maintaining multiple ratings databases.
And that is a nice lead-in to the reason I first thought to write this post.
Monetize the Platforms
David mentioned that the advantage Netflix had over its competitors early on was its ability to turn around DVDs: get them back from one customer and back out to the next one waiting. With the streaming and DVD sides of the business separated, the DVD side could put some effort into monetizing one of its greatest assets, its distribution and logistics know-how. Much like Amazon has Fulfillment by Amazon, Netflix could easily do something like Rentals by Netflix. Other companies could outsource the logistical part of turning around rented items. They could start with the form factors they know best, DVDs and CDs, and expand from there. Amazon has pioneered the way for them and has proven you can be successful as this type of platform.
Interesting side thought: what is the breakdown of DVDs on the shelf, being processed, in transit, and in customers’ hands? Ideally you’d want to optimize your processes to drive the percentage of stock on the shelves down to near zero by having enough demand to send a DVD back out as soon as it has been returned. Turnaround time is probably bottlenecked by the pickup and delivery times of the USPS, so optimizing that process increases throughput but only decreases turnaround time to a point. Transit time is more or less fixed at around two days, so all that really varies is how long a person holds on to the DVD. Netflix would obviously like the user to hold on to the DVD as long as possible to drive down postage costs. Anyways, just a side thought, nothing really going on.
The commoditization of the logistics of DVD rentals, and how much I would have disliked having my ratings separated between streaming and DVDs, got me thinking about another opportunity Netflix misses out on: turning their recommendation engine into a service and selling access to their curated datasets. Their recommendations are the best on the web as far as I am concerned. I trust that when Netflix says it thinks I would give something 4.1 stars, I will enjoy it. I do not have nearly the same confidence in a book or movie that Amazon suggests in my personal recommendations. By spinning this out into a separate service, both my Instant Watch and DVD queues could have benefited from it. They could even do deals with other video services to provide recommendations for them. Would I like Hulu better if my Netflix recommendations were integrated into it, guiding me through its offerings? Absolutely. And it doesn’t have to stop with movies and TV shows. I would love it if Netflix could use its recommendation engine to suggest books to me on Amazon. They have been adding to and refining it over the years and have committed a lot of money to it through things like the Netflix Prize. Put together an API for categorizing objects and adding user ratings, and you’ll have startups beating down your door to use its algorithms on their datasets.
Even though I was getting really excited about all that could have happened, Netflix demonstrated some good old-fashioned stupidity with the way they went about it. The only thing I’ll say about the name is that it was obvious they didn’t focus group test it. It would never have passed the mid-20s demographic.
Split Then Raise
I also wonder if it might have been a better plan to split the two services first and then do the price hikes. The way they did it, the customers got mad first, and then forcing them to sign up for a new site would have been a recipe for even bigger churn than they saw. If they had gotten the switch done first and then, after the users were settled, hiked the prices, people would probably have been less inclined to leave.
One is Better than Two
I wouldn’t mind two different services, but I would, as mentioned above, mind having to duplicate my efforts in rating movies and shows and maintaining a queue. Netflix should have foreseen this problem. Nothing about this part of the plan should have been able to get through early exploratory conversations. There had to have been feasible ways around it.
Anyways, this has been a fun little thought exercise, but it has gone on quite long enough. In conclusion, I think that Netflix has missed out on a very big opportunity here. They flashed a glimpse of some true brilliance only to quickly hide it under a blanket of stupidity.
I truly believe this is one instance where the total is going to be far less than the sum of the parts.
Looks like Netflix blew their earnings and took a 25% nose dive in after hours trading. Suck.
October 18, 2011 § Leave a comment
A while ago I was working on a web app built on Rails that needed some very basic tag analysis. Having once upon a time obtained an MS in Computational Mathematics, I will always view tags as a big matrix. A big, sparse matrix with tag names along one dimension and the objects being tagged along the other. When you think about tags like this, all kinds of avenues of analysis open up.
You can think of objects as tag vectors (or tags as object vectors), with non-zero entries representing the tags on an object. Multiplying two tag vectors together gives you their dot product (or, once the vectors are normalized, their cosine similarity). This is a nice rough estimate of how similar two objects are with respect to their given tags. Multiplying a tag vector (a.k.a. the query vector) with the whole matrix will find you the most similar objects (or tags). Math is fun!
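The idea can be sketched in plain Ruby, representing sparse tag vectors as Hashes keyed by tag name (the posts and weights below are made up for illustration):

```ruby
# Sparse tag vectors as Hashes: only non-zero entries are stored, and the
# dot product only walks the smaller vector's keys.
def dot(a, b)
  small, large = a.size <= b.size ? [a, b] : [b, a]
  small.sum { |tag, weight| weight * large.fetch(tag, 0.0) }
end

def cosine_similarity(a, b)
  norms = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b))
  norms.zero? ? 0.0 : dot(a, b) / norms
end

post_a = { "ruby" => 1.0, "rails" => 1.0, "gems" => 1.0 }
post_b = { "ruby" => 1.0, "mysql" => 1.0 }

dot(post_a, post_b)               # => 1.0 (only "ruby" overlaps)
cosine_similarity(post_a, post_b)
```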
To this end I started poking around trying to find a sparse matrix library written for ruby to make my life easier. Alas, no glory. Now, I know that Ruby doesn’t have the same math support that, say, Python has, but I would have thought someone would have thrown something usable together.
I ended up hacking together a very basic SparseVector class to take care of my immediate needs, but felt that there was a chance to create something new for the community. I looked at things like the LAPACK Ruby bindings and NArray, but these all deal with dense matrices. My data was incredibly sparse, so they didn’t really cut it. With that said, I have been working on a sparse matrix gem. It’s a pure Ruby implementation, so it should be nice and portable, and good enough for small prototypes and datasets. I wanted it to be in Ruby so that I could subclass the standard library Matrix class and have it run anywhere Ruby runs. Not having any bindings to compile is a plus for me, since it means I don’t have to worry about the different environments it might be running in. If you are working with really large datasets then this probably isn’t for you right now.
Check out my progress over on GitHub.
July 28, 2011 § Leave a comment
Dealing with dates is hard. Dealing with date calculations is harder. Luckily, in the Rails world we have ActiveSupport to help us with a lot of this. It actually does so much that I usually forget how much of a pain dealing with dates is supposed to be. However, there are times now and then when I am reminded. Little edge cases that haven’t yet been built into ActiveSupport. One such edge case is determining the number of months between two dates. Why do months have to have different numbers of days?
This particular calculation can be very useful when dealing with recurring payments. Calculating the number of payment cycles a subscriber has gone through can tell you how much revenue they have generated. I have put together two class methods for the Time object that will calculate just this. One is a simple loop that takes time proportional to the number of months between the start and end times, and the other is a more efficient direct calculation of the same number.
The simple loop looks like:
def months_between2(start_date, end_date)
  return -months_between2(end_date, start_date) if end_date < start_date
  count = 1
  while true
    return count - 1 if (start_date + count.months) > end_date
    count += 1
  end
end
It just starts at the start_date and keeps adding months until it passes the end_date. Not particularly difficult, but it should get the job done for most cases. All the complexity of the more efficient version comes from checks dealing with the various cases arising from different months having different numbers of days. Anyways, the source code for that one looks much nicer over on GitHub.
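The core idea of the direct calculation can be sketched in plain Ruby, using Date instead of ActiveSupport’s Time extensions. This is a simplified illustration that ignores the month-end subtleties (e.g. January 31 plus one month) the real version has to handle:

```ruby
require "date"

# Simplified direct calculation: take the whole-month difference from the
# year/month fields, then back off one month if the final partial month
# hasn't completed yet. Month-end edge cases are deliberately ignored.
def months_between(start_date, end_date)
  return -months_between(end_date, start_date) if end_date < start_date
  months = (end_date.year - start_date.year) * 12 +
           (end_date.month - start_date.month)
  months -= 1 if end_date.day < start_date.day
  months
end

months_between(Date.new(2011, 1, 15), Date.new(2011, 3, 15)) # => 2
months_between(Date.new(2011, 1, 15), Date.new(2011, 3, 14)) # => 1
```

Unlike the loop, this runs in constant time no matter how far apart the dates are.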
July 21, 2011 § Leave a comment
In computer programs, number constants can be interesting and bewildering things. Trying to figure out why one was chosen over another can be really confusing. For a good while I was confused as to why ActiveRecord would set a string attribute to be a VARCHAR(255) in the database. It limits the size of string attributes to 255 characters. 256 is a bit more natural when choosing constants in computer science; 255 is commonly used to denote the last index in a 0-based array of 256 elements. So why 255? The short answer is “because of InnoDB and UTF-8 character sets.”
InnoDB has a limitation on the size of a key for a single-column index of 767 bytes. When the table is encoded with a UTF-8 character set, each character can use up to 3 bytes to represent its intended character. That means, in order to fully index a UTF-8 encoded varchar column, the string must be representable in 767 bytes. 767 / 3 = 255 2/3. So the longest UTF-8 encoded varchar column that can be fully indexed is 255 characters, hence the ActiveRecord default string attribute size.
Problems on the Way
As bigger and bigger pushes are made for complete internationalization, we’ll see more things encoded with UTF-16 and UTF-32. Characters in these encodings might require up to 4 bytes to represent their value. When this happens, ActiveRecord will need to reduce the size of indexable string attributes to 191 characters.
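The arithmetic behind both limits fits in a couple of lines:

```ruby
# Integer division of InnoDB's single-column index key limit by the maximum
# bytes per character gives the default column sizes.
INNODB_KEY_LIMIT = 767 # bytes

chars_3_byte = INNODB_KEY_LIMIT / 3 # => 255 (3-byte UTF-8 characters)
chars_4_byte = INNODB_KEY_LIMIT / 4 # => 191 (4-byte encodings)
```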
Here is a truly awesome magic number that seems to come out of nowhere: 0x5f3759d5.