November 14, 2011 § 1 Comment
Database performance is something that has always fascinated me. Not on an admin level, but from the point of view of a developer. Even with a stock database setup there are things a developer can do to help optimize access. If you do any middle- or backend web development you will be familiar with running queries against a database. You will also be familiar with how adding indexes on the commonly queried columns of your tables can increase performance. But it turns out that not all indexes are created the same.
I have been working with the InnoDB storage engine of MySQL on a Ruby on Rails web app for a while now. Over the years I began to notice something but was never able to figure out what was causing it. Whenever I would query tables using the id column, the queries would return much faster than when I queried against some other indexed column, even if that column was an indexed integer. For those not familiar with Rails conventions, every table is created with an integer column, id, that has the table’s primary index placed on it. It is auto-incrementing so every row has a unique value, a requirement of being a primary key. Now, InnoDB constructs the primary key in a special way, using a clustered index. What this means is that the primary key index points directly to the page the row data resides on, either on disk or in memory. Non-clustered, or secondary, indexes point to the primary key instead of the row data. This little bit of indirection causes secondary indexes to be slower than a lookup on the primary key.
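The extra hop can be sketched with a toy model in plain Ruby. This is only an illustration of the indirection, not InnoDB internals; the hash names and data are made up:

```ruby
# Toy model: the clustered index maps primary key straight to row data,
# while a secondary index maps an indexed value to the primary key,
# so a secondary lookup takes an extra hop.
clustered = {
  1 => { id: 1, email: "a@example.com" },
  2 => { id: 2, email: "b@example.com" }
}
secondary = { "b@example.com" => 2 } # index on email points at the id

row_by_pk    = clustered[2]                          # one lookup
row_by_email = clustered[secondary["b@example.com"]] # two lookups
```

Both lookups land on the same row; the secondary path just has to go through the primary key first.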
It’s always satisfying figuring out the reasons behind behavior you do not initially understand.
For some more information you can check out the MySQL Reference Manual. This investigation was prompted by an answer to a question I posted on Stack Overflow about partial indexes and the GROUP BY operation.
July 21, 2011 § Leave a Comment
In computer programs, number constants can be interesting and bewildering things. Trying to figure out why one was chosen over another can be really confusing. For a good while I was confused as to why ActiveRecord would set a string attribute to be a VARCHAR(255) in the database. It limits the size of string attributes to 255 characters. 256 is a bit more natural when choosing constants in computer science; 255 is commonly used to denote the last index of a 0-based array of 256 elements. So why 255? The short answer is “because of InnoDB and UTF-8 character sets.”
InnoDB has a limitation on the size of a key for a single-column index: 767 bytes. When the table is encoded with a UTF-8 character set, each character can take up to 3 bytes. That means in order to fully index a UTF-8 encoded varchar column, the string must be representable in 767 bytes. 767 / 3 = 255 2/3. This means the largest fully indexable UTF-8 encoded varchar column is 255 characters long, hence the ActiveRecord default string attribute size.
Problems on the Way
As bigger and bigger pushes are made for complete internationalization, we’ll see more things encoded with UTF-16 and UTF-32. Characters in these encodings might require up to 4 bytes to represent their value. When this happens, ActiveRecord will need to reduce the size of indexable string attributes to 191 characters (767 / 4 = 191 3/4).
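The arithmetic behind both limits fits in a few lines of Ruby (integer division rounds down, which is exactly what we want here):

```ruby
# InnoDB's single-column index key limit, in bytes.
INNODB_KEY_LIMIT = 767

max_chars_3_byte = INNODB_KEY_LIMIT / 3 # 3 bytes per character => 255
max_chars_4_byte = INNODB_KEY_LIMIT / 4 # 4 bytes per character => 191
```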
Here is a truly awesome magic number that seems to come out of nowhere: 0x5f3759df.
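That constant comes from the fast inverse square root trick made famous by the Quake III source code. Here is a Ruby translation of the classic C routine, as a sketch, using pack/unpack to reinterpret the float’s bits (the original relies on C pointer casts):

```ruby
# Classic fast inverse square root, translated from the famous C version.
def fast_inv_sqrt(x)
  i = [x].pack('f').unpack1('l') # reinterpret float32 bits as int32
  i = 0x5f3759df - (i >> 1)      # the magic: an initial guess via bit math
  y = [i].pack('l').unpack1('f') # back to a float
  y * (1.5 - 0.5 * x * y * y)    # one Newton-Raphson refinement step
end

fast_inv_sqrt(4.0) # roughly 0.5, i.e. 1 / sqrt(4)
```

Even with a single Newton-Raphson step the approximation lands within a fraction of a percent of the true value.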
July 9, 2011 § 1 Comment
Many times when we are selecting a rows out of the database we just want a single column and have no need for the entire object. There are a number of ways to accomplish this with ActiveRecord. One can get all the records from the database and then collect the attribute needed:
Posts.where(:status => 'published').collect(&:id) => [ 1, 5, 8, 10 ]
This has the benefit of being able to use any overridden accessors, but has a lot of overhead associated with instantiating the objects. Another way to do it is to go directly to the database:
ActiveRecord::Base.connection.select_values("SELECT id FROM posts WHERE status = 'published'") => [ 1, 5, 8, 10 ]
This is much faster, but requires using the database connection directly and having the SQL literal prepared. Not particularly user-friendly, even if you can get the SQL literal using to_sql:
ActiveRecord::Base.connection.select_values(Posts.where(:status => 'published').select(:id).to_sql) => [ 1, 5, 8, 10 ]
Wouldn’t it be nicer if you could just do the following:
Post.where(:status => 'published').select_column(:id) => [ 1, 5, 8, 10 ]
The select-column gem provides this functionality. You can use it in your Rails 3 app or check out the source code over on GitHub.
select_column accepts a single optional argument. This is the column that you want to have returned in an array. The returned column can also be specified using the select query method.
If neither a select nor an argument is given, :id is assumed to be the column to be returned. If multiple select query methods are present, the first one defined will be the column returned.
# selects an array of ids
Post.select_column

# selects an array of titles
Post.select_column(:title)

# selects an array of ids
Post.where(:status => 'published').select_column

# selects an array of titles
Post.where(:status => 'published').select_column(:title)

# selects an array of titles
Post.select(:title).where(:status => 'published').select_column
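The column-resolution rule behind those examples (explicit argument first, then the first select()'d column, then :id) can be sketched in plain Ruby. resolve_column here is a hypothetical helper for illustration, not the gem’s actual source:

```ruby
# Hypothetical sketch of how select_column might pick its column:
# an explicit argument wins, then the first select() value, then :id.
def resolve_column(argument, select_values)
  return argument if argument
  select_values.first || :id
end

resolve_column(:title, [])    # => :title  (explicit argument)
resolve_column(nil, [:title]) # => :title  (from select)
resolve_column(nil, [])       # => :id     (the default)
```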
Update (Jan 21, 2012): It’s like they keep looking at my gems and integrating them into Rails. As of Rails 3.2 this gem’s functionality has been replicated by ActiveRecord::Relation#pluck. Check it out in the release notes.
June 28, 2011 § 1 Comment
MySQL partial indexes are a great way to reduce the size of your indexes. In Rails apps, the default string column is a VARCHAR(255), and adding an index to it can create large indexes. Since very few of the columns you use will ever actually hold 255 characters, and many everyday attributes and columns have high entropy in some prefix substring, partial indexes make for a great compromise.
Another quick thing to note: if you are using the InnoDB storage engine, you can’t use full indexes on VARCHAR(255) columns in compound indexes because of the 767-byte limit on index key size.
When working with partial indexes it can be helpful to know exactly how much of the column is covered uniquely by an index of a given size. Fernando Ipar has a pretty nifty little SQL query that will give you a rudimentary peek into how well a partial index will perform. The query will tell you what percentage of rows are uniquely identified by the index. You can check out his blog post about it over here. Here is the general form of the query:
-- SELECT COUNT(DISTINCT(SUBSTR(<column>,1,<partial index length>))) / COUNT(DISTINCT(<column>)) * 100 FROM <table>;
SELECT COUNT(DISTINCT(SUBSTR(name,1,10))) / COUNT(DISTINCT(name)) * 100 FROM customers;
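The same percentage can be computed outside the database as well. Here is a plain-Ruby sketch of the ratio over an in-memory list of values, handy for eyeballing a prefix length before creating the index:

```ruby
# Percentage of distinct values still distinguishable by their prefix.
def prefix_selectivity(values, prefix_length)
  distinct_prefixes = values.map { |v| v[0, prefix_length] }.uniq.size
  100.0 * distinct_prefixes / values.uniq.size
end

names = %w[alpha alphabet beta]
prefix_selectivity(names, 4) # "alph", "alph", "beta" => 2 of 3 distinct
```

A result of 100 means a prefix of that length identifies values as well as the full column does.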
A Little Problem
With all the goodness that partial indexes offer, I have found at least one drawback. It seems that partial indexes cannot be used with grouping operations like GROUP BY. Even if the partial index does not uniquely identify each row in the table, one would think that MySQL would be able to use the partial index to at least help the grouping along.
Update (11/8/2011): Someone posted an interesting answer to my question about this problem on Stack Overflow. They made the point that using an index for hints can’t really buy you anything when doing grouping operations. If the index doesn’t cover the entire string, the partial index might be able to tell that two values are different, but it can’t tell for sure that they are the same. It has to go to the table itself for confirmation, and if it is going to the table that often for confirmation then it might as well just do a table scan. The table scan is more likely to have the nicer properties of a sequential read, while using a partial index for hints and then going to the table for confirmation could create a bunch of random reads. There is probably some tipping point here that would make using the partial index’s hints favorable, but you would probably be better served shrinking the size of the column and indexing the whole thing if you want to use the index with grouping operations.
Update (10/30/2011): It turns out this post shows up when someone searches for mysql partial index in Google. I figured I should make it a little more helpful for those who end up here.
-- The most basic way to create a new partial index on a column:
-- CREATE INDEX <index name> ON <table name> (<column name>(<number of characters to index>));
CREATE INDEX part_of_name ON customers (name(10));
# To create a partial index with a Rails migration:
# add_index(<table name>, <column name>, :name => <index name>, :length => <partial length>)
add_index(:customers, :name, :name => 'part_of_name', :length => 10)