Another ObjectId Trick

A while back I posted A Few ObjectId Tricks. While not a new trick, this one is relatively new to me and quite useful.

As you know, the key _id is automatically indexed for each document in a collection. What you might not know is that you can use objects for _id.

It seems obvious now, but I had not thought of it. A few days back, I re-read foursquare’s article on MongoDB Strategies for the Disk-Averse. They make heavy use of objects for _id in their analytics.

I work on Gauges, a web analytics tool, and I use several techniques to more efficiently store data. What I have been doing up until now is string _id’s with multiple values smushed in.

{"_id":  "<oid>:<hash>"}

oid is typically a BSON ObjectId and hash is some kind of hash of whatever I’m doing a write for. I do writes against _id, and the hash determines uniqueness without storing the full url of the page we are tracking.

The crappy part is strings can’t be serialized as efficiently as an object id or an integer. The foursquare article pointed out that I could just store my _id’s like this:

{"_id":  {"i": <oid>, "h": <hash>}}
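Here is a minimal sketch of building such an _id. The helper name compound_id is hypothetical, Zlib.crc32 is only a stand-in for "some kind of hash" (not what Gauges actually uses), and a plain string stands in for the ObjectId so the snippet runs without the bson gem.

```ruby
require 'zlib'

# Hypothetical helper: builds a compound _id from a site id and a page url.
# Zlib.crc32 is an assumed stand-in hash; it returns an unsigned 32-bit integer.
def compound_id(site_id, url)
  {'i' => site_id, 'h' => Zlib.crc32(url)}
end

# In real use 'i' would be a BSON::ObjectId rather than a hex string.
id = compound_id('4fc62a0f4c114f273c000001', 'http://example.com/pricing')
puts id.inspect
```

A 32-bit hash can collide, so pick a hash width that fits how many distinct urls you expect per oid. Also note that an unsigned 32-bit value can be larger than a signed int32, in which case a driver would likely store it as a 64-bit integer.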

So what is the benefit of objects as _id’s? I see at least four.

1. Fewer indexes

If you were not using objects or concatenated values for _id, those values would need to be stored under their own keys inside the document. If you wanted to write against those keys, you would need to index them. This means you would have an _id that is indexed and also a secondary compound index (i.e., [[i, 1], [h, 1]]).

More indexes mean more work on every write and more RAM. Using an object or a concatenated value for _id avoids this extra index, which saves write overhead and RAM on the server side.

2. Document Size

Object ids and integers can be serialized in bson more efficiently than strings. Switching from a string that is a mashup of object id and hash to an object can cut several bytes from each document.

require 'rubygems'
require 'bson'

oid = BSON::ObjectId.new

puts BSON.serialize(:_id => "#{oid}:123456789").size # 49
puts BSON.serialize(:_id => {:i => oid, :h => 123456789}).size # 37

In the example above, the size difference was 37 versus 49 bytes. That saves almost a fourth of the document size, simply by using an object instead of a string mashup. Your mileage may vary, but applied across the millions of documents we track every month, this is a non-trivial amount of savings.
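To see where those bytes go, the sizes can be tallied by hand from the BSON layout: a document is a 4-byte length, its elements, and a terminating NUL; an element is a type byte, the key, a NUL, and the value. The sketch below does only that arithmetic, without the bson gem. The helper names are mine, and the hash is assumed to fit in a 32-bit integer.

```ruby
# Hand-tallied BSON sizes (no bson gem needed). All helper names are hypothetical.

OBJECT_ID_SIZE = 12 # an ObjectId is 12 raw bytes
INT32_SIZE     = 4  # assumes the hash fits in a 32-bit integer

# Element = 1 type byte + key bytes + 1 NUL + value bytes
def element_size(key, value_size)
  1 + key.bytesize + 1 + value_size
end

# Document = 4-byte length prefix + elements + 1 terminating NUL
def document_size(element_sizes)
  4 + element_sizes.inject(0) { |sum, n| sum + n } + 1
end

# String value = 4-byte length prefix + bytes + 1 NUL
def string_size(str)
  4 + str.bytesize + 1
end

def string_id_doc_size
  mashup = ('0' * 24) + ':123456789' # 24 hex chars stand in for the oid
  document_size([element_size('_id', string_size(mashup))])
end

def object_id_doc_size
  inner = document_size([element_size('i', OBJECT_ID_SIZE),
                         element_size('h', INT32_SIZE)])
  document_size([element_size('_id', inner)])
end

puts string_id_doc_size # 49
puts object_id_doc_size # 37
```

The string form spends 24 bytes on the hex text of a 12-byte ObjectId and 9 bytes on the digits of a 4-byte integer, which more than pays for the extra framing of the embedded document.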

3. More Query-able (than concatenated values)

The example below runs a range query against objects as _id, using greater than or equal to and less than or equal to. This would be painful with concatenated values in some scenarios and impossible in others.

require 'pp'
require 'rubygems'
require 'mongo'

conn = Mongo::Connection.new
db = conn.db('test')
col = db['test']
col.remove

oid = BSON::ObjectId.new

(1..3).each do |day|
  col.save(:_id => {:i => oid, :d => day})
end

# Get all documents
pp col.find.to_a

# [{"_id"=>{"i"=>BSON::ObjectId('4fc62a0f4c114f273c000001'), "d"=>1}},
#  {"_id"=>{"i"=>BSON::ObjectId('4fc62a0f4c114f273c000001'), "d"=>2}},
#  {"_id"=>{"i"=>BSON::ObjectId('4fc62a0f4c114f273c000001'), "d"=>3}}]

# Only get those matching a given day for a given oid
pp col.find({
  :_id => {
    '$gte' => {:i => oid, :d => 1},
    '$lte' => {:i => oid, :d => 1},
  },
}).to_a

# [{"_id"=>{"i"=>BSON::ObjectId('4fc62a0f4c114f273c000001'), "d"=>1}}]

We could change the $lte :d value to 2 or to 3 and those documents would then be included as well. Sure, you can get roughly the same result using strings, but you would need to generate every _id up front and use an $in query to pull them all out. An added bonus is that you get the documents back sorted ascending, which is what I have wanted when using objects as _id’s.
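If it helps to picture why the range works, here is a simplified in-memory model of the comparison, with no MongoDB involved. It assumes Ruby 1.9+ (ordered hashes) and compares two _id objects pair by pair in key order. Real BSON comparison also takes value types into account, which this sketch ignores, and the method names are mine.

```ruby
# Simplified model of comparing embedded documents: walk the key/value
# pairs in order and let the first difference decide. Ignores BSON types.
def compare_ids(a, b)
  a.to_a <=> b.to_a
end

def in_range?(id, low, high)
  compare_ids(id, low) >= 0 && compare_ids(id, high) <= 0
end

OID = '4fc62a0f4c114f273c000001' # a string stands in for the ObjectId
IDS = (1..3).map { |day| {'i' => OID, 'd' => day} }

# Returns the days whose _id falls between the two bounds, inclusive.
def days_between(first, last)
  low  = {'i' => OID, 'd' => first}
  high = {'i' => OID, 'd' => last}
  IDS.select { |id| in_range?(id, low, high) }.map { |id| id['d'] }
end

p days_between(1, 1) # [1]
p days_between(1, 2) # [1, 2]
```

Because i is equal in every case, the comparison falls through to d, which is why moving only the $lte bound widens the result.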

Note: if you try to execute the code above on Ruby 1.8, it will not work. Ruby 1.8 hashes are not ordered and ordering is important for _id objects. Use BSON::OrderedHash instead of a plain hash if on 1.8. The same applies to whatever language you use if you aren’t using Ruby and said language does not default to ordered hashes or dictionaries.
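A quick way to see why key order matters, assuming Ruby 1.9 or later: two hashes with the same pairs in a different order are equal as far as Ruby is concerned, but they serialize with their keys in a different order. JSON is used here only as a convenient stand-in to make the order visible; BSON likewise writes keys in iteration order, so the two would be different _id values.

```ruby
require 'json'

ID_A = {'i' => 1, 'd' => 2}
ID_B = {'d' => 2, 'i' => 1}

puts ID_A == ID_B           # true, Hash equality ignores key order
puts ID_A.to_a == ID_B.to_a # false, the pairs come out in a different order
puts JSON.generate(ID_A)    # {"i":1,"d":2}
puts JSON.generate(ID_B)    # {"d":2,"i":1}
```

So it pays to build _id objects in one place, always with the same key order.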

4. Objects are easier to grok

Values that are concatenated are not very intent revealing. With objects, however, the keys reveal what the values are.

Granted, if you shorten the keys to save space, they might not reveal as much, but typically in this scenario you create a mapping of short to long keys, which can be used to quickly look up the key names and thus what the values are.
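Such a mapping can be as small as a constant and one method. The long names below (site_id, url_hash, day) are made up for illustration, as are KEY_NAMES and expand_keys.

```ruby
# Hypothetical mapping of short _id keys to descriptive names.
KEY_NAMES = {'i' => 'site_id', 'h' => 'url_hash', 'd' => 'day'}.freeze

# Returns a copy of the _id with short keys swapped for long ones.
# Keys that are not in the mapping are passed through untouched.
def expand_keys(id)
  id.each_with_object({}) do |(short, value), expanded|
    expanded[KEY_NAMES.fetch(short, short)] = value
  end
end

p expand_keys('i' => 'abc', 'd' => 3)
```

This is handy in a console or in logs, while the stored documents keep the short keys.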

This point might not be a major one, but objects definitely feel cleaner and more obvious to me than concatenated values.
