On the Size of a String
July 21, 2011 § Leave a comment
In computer programs, number constants can be interesting and bewildering things. Trying to figure out why one was chosen over another can be really confusing. For a good while I was confused as to why ActiveRecord would set a string attribute to be a VARCHAR(255) in the database. It limits the size of the string attributes to 255 characters long. 256 is a bit more natural when choosing constants in computer science. 255 is commonly used to denoted the last index in a 0-based array of 256 elements. So why 255? The short answer is “because of InnoDB and UTF-8 character sets.”
InnoDB has a limitation on the size of a key for a single column index of 767 bytes. When the table is encoded with a UTF-8 character set, each characters has the possibility of using 3-bytes to represent its intended character. That means in order to be able to fully index a UTF-8 encoded varchar column, the string must be able to be represented in 767 bytes. 767 / 3 = 255 2/3. This means that the largest UTF-8 encoded varchar column can be 255 characters long, hence the ActiveRecord default string attribute size.
Problems on the Way
As bigger and bigger pushes are made for complete internationalization, we’ll see more things encoded with UTF-16 and UTF-32. Characters using these encodings might require up to 4 bytes to represent their value. When this happens, ActiveRecord will need to reduce the size of indexable string attributes to 191 characters.
Here is a truly awesome magic number that seems to come out of no where, 0x5f3759d5.