Appendix B. Design patterns – MongoDB in Action

Appendix B. Design patterns

B.1. Patterns

The early chapters of this book implicitly advocate a certain set of design patterns. Here I’ll summarize those patterns and augment them with a few patterns that fall outside the flow of the text.

B.1.1. Embed versus reference

Suppose you’re building a simple application in MongoDB that stores blog posts and comments. How do you represent this data? Do you embed the comments in their respective blog post documents? Or is it better to create two collections, one for posts and the other for comments, and then relate the comments to the posts with an object id reference?

This is the problem of embedding versus referencing, and it’s a common source of confusion for new users of MongoDB. Fortunately, there’s a simple rule of thumb that works for most schema design scenarios: Embed when the child objects always appear in the context of their parent. Otherwise, store the child objects in a separate collection.

What does this mean for blog posts and comments? It depends on the application. If the comments always appear within a blog post, and if they don’t need to be ordered in arbitrary ways (by post date, comment rank, and so on), then embedding is fine. But if, say, you want to be able to display the most recent comments, regardless of which post they appear on, then you’ll want to reference. Embedding may provide a slight performance advantage, but referencing is far more flexible.

B.1.2. One-to-many

As stated in the previous section, you can represent a one-to-many relationship by either embedding or referencing. You should embed when the many object intrinsically belongs with its parent and rarely changes. The schema for a how-to application illustrates this well. The steps in each guide can be represented as an array of sub-documents because these steps are an intrinsic part of the guide, and rarely change:

{ title: "How to soft-boil an egg",
          steps: [
          { desc: "Bring a pot of water to boil.",
            materials: ["water", "eggs"] },
          { desc: "Gently add the eggs a cook for four minutes.",
            materials: ["egg timer"]},
          { desc: "Cool the eggs under running water." },
         ]
}

When the two related entities will appear independently in the application, you’ll want to relate. Many articles on MongoDB suggest that embedding comments in blog posts is a good idea. But relating is far more flexible. For one thing, you can easily show users a list of all their comments. You can also show all recent comments across all posts. These features, considered de rigueur for most sites, aren’t possible with embedded documents at this time.[1] You typically relate documents using an object ID. Here’s a sample post:

1 There’s a popular feature request for virtual collections, which could provide the best of both worlds. See http://jira.mongodb.org/browse/SERVER-142 to track this issue.

{ _id: ObjectId("4d650d4cf32639266022018d"),
  title: "Cultivating herbs",
  text: "Herbs require occasional watering..."
}

And here’s a comment, related by the post_id field:

{ _id: ObjectId("4d650d4cf32639266022ac01"),
  post_id: ObjectId("4d650d4cf32639266022018d"),
  username: "zjones",
  text: "Indeed, basil is a hearty herb!"
}

The post and the comment live in their own collections, and it takes two queries to display a post with its comments. Because you’ll be querying comments on their post_id field, you’ll want an index there:

db.comments.ensureIndex({post_id: 1})

We used this one-to-many pattern extensively in chapters 4, 5, and 6; look there for more examples.

B.1.3. Many-to-many

In RDBMSs, you use a join table to represent many-to-many relationships; in MongoDB, you use array keys. You can see a clear example of this technique earlier in the book where we relate products and categories. Each product contains an array of category IDs, and both products and categories get their own collections. If you have two simple category documents

{ _id: ObjectId("4d6574baa6b804ea563c132a"),
  title: "Epiphytes"
}
{ _id: ObjectId("4d6574baa6b804ea563c459d"),
  title: "Greenhouse flowers"
}

then a product belonging to both categories will look like this:

{ _id: ObjectId("4d6574baa6b804ea563ca982"),
  name: "Dragon Orchid",
  category_ids: [ ObjectId("4d6574baa6b804ea563c132a"),
                 ObjectId("4d6574baa6b804ea563c459d") ]
}

For efficient queries, you should index the array of category IDs:

db.products.ensureIndex({category_ids: 1})

Then, to find all products in the Epiphytes category, simply match against the category_id field:

db.products.find({category_id: ObjectId("4d6574baa6b804ea563c132a")})

And to return all category documents related to the Dragon Orchid product, first get the list of that product’s category IDs:

product = db.products.findOne({_id: ObjectId("4d6574baa6b804ea563c132a")})

And then query the categories collection using the $in operator:

db.categories.find({_id: {$in: product['category_ids']}})

You’ll notice that finding the categories requires two queries, whereas the product search takes just one. This optimizes for the common case, as you’re more likely to search for products in a category than the other way around.

B.1.4. Trees

Like most RDBMSs, MongoDB has no built-in facility for tree representation and traversal. Thus if you need tree-like behavior, then you’ve got to roll your own solution. I presented a solution to the category hierarchy problem in chapters 5 and 6. The strategy there was to store a snapshot of the category’s ancestors within each category document. This denormalization makes updates more complicated but greatly simplifies reads.

Alas, the denormalized ancestor approach isn’t great for all problems. Another common tree scenario is the online forum, where hundreds of messages are frequently nested many levels deep. There’s too much nesting, and too much data, for the ancestor approach to work well here. A good alternative is the materialized path.

Following the materialized path pattern, each node in the tree contains a path field. This field stores the concatenation of each of the node’s ancestor’s IDs, and root-level nodes have a null path because they have no ancestors. Let’s flesh out an example to see how this works. First, look at the comment thread in figure B.1. This represents a few questions and answers in thread about Greek history.

Figure B.1. Threaded comments in a forum

Let’s see how these comments look as documents organized with a materialized path. The first is a root-level comment, so the path is null:

{ _id: ObjectId("4d692b5d59e212384d95001"),
  depth: 0,
  path: null,
  created: ISODate("2011-02-26T17:18:01.251Z"),
  username: "plotinus",
  body: "Who was Alexander the Great's teacher?",
  thread_id: ObjectId("4d692b5d59e212384d95223a")
}

The other root-level question, the one by user seuclid, will have the same structure. More illustrative are the follow-up comments to the question about Alexander the Great’s teacher. Examine the first of these, and note that path contains the _id of the immediate parent:

{ _id: ObjectId("4d692b5d59e212384d951002"),
  depth: 1,
  path: "4d692b5d59e212384d95001",
  created: ISODate("2011-02-26T17:21:01.251Z"),
  username: "asophist",
  body: "It was definitely Socrates.",
  thread_id: ObjectId("4d692b5d59e212384d95223a")
}

The next deeper comment’s path contains both the IDs of the original and immediate parents, in that order and separated by a colon:

{ _id: ObjectId("4d692b5d59e212384d95003"),
  depth: 2,
  path: "4d692b5d59e212384d95001:4d692b5d59e212384d951002",
  created: ISODate("2011-02-26T17:21:01.251Z"),
  username: "daletheia",
  body: "Oh you sophist...It was actually Aristotle!",
  thread_id: ObjectId("4d692b5d59e212384d95223a")
}

At a minimum, you’ll want indexes on the thread_id and path fields, as you’ll always be querying on exactly one of these fields:

db.comments.ensureIndex({thread_id: 1})
db.comments.ensureIndex({path: 1})

Now the question is how you go about querying and displaying the tree. One of the advantages of the materialized path pattern is that you query the database only once, whether you’re displaying the entire comment thread or just a sub-tree within the thread. The query for the first of these is straightforward:

db.comments.find({thread_id: ObjectId("4d692b5d59e212384d95223a")})

The query for a particular sub-tree is more subtle because it uses a prefix query:

db.comments.find({path: /^4d692b5d59e212384d95001/})

This returns all comments with a path beginning with the specified string. This string represents the _id of the comment with the username plotinus, and if you examine the path field on each child comment, it’s easy to see that they’ll all satisfy the query. And they’ll do so quickly because these prefix queries can use the index on path.

Getting the list of comments is easy, since it requires just one database query. Displaying them is trickier because you need a list that preserves thread order. This requires a bit of client-side processing, which you can achieve with the following Ruby methods.[2] The first method, threaded_list, builds a list of all root-level comments and a map that keys parent IDs to lists of child nodes:

2 This book’s source code includes a complete example of threaded comments with materialized paths using the display methods presented here.

def threaded_list(cursor, opts={})
  list      = []
  child_map = {}
  start_depth = opts[:start_depth] || 0

  cursor.each do |comment|
    if comment['depth'] == start_depth
      list.push(comment)
    else
      matches = comment['path'].match(/([d|w]+)$/)
      immediate_parent_id = matches[1]
      if immediate_parent_id
        child_map[immediate_parent_id] ||= []
        child_map[immediate_parent_id] << comment
      end
    end
  end

  assemble(list, child_map)
end

The assemble method takes the list of root nodes and the child map and then builds a new list in display order:

def assemble(comments, map)
  list = []
  comments.each do |comment|
    list.push(comment)
    child_comments = map[comment['_id'].to_s]
    if child_comments
      list.concat(assemble(child_comments, map))
    end
  end

  list
end

To print the comments, you merely iterate over the list, indenting appropriately for each comment’s depth:

def print_threaded_list(cursor, opts={})
  threaded_list(cursor, opts).each do |item|
    indent = " " * item['depth']
    puts indent + item['body'] + " #{item['path']}"
  end
end

Querying for the comments and printing them is then straightforward:

cursor = @comments.find.sort("created")
print_threaded_list(cursor)

B.1.5. Worker queues

You can implement worker queues in MongoDB using either standard or capped collections. In both cases, the findAndModify command will permit you to process queue entries atomically.

A queue entry requires a state and a timestamp plus any remaining fields to contain the payload. The state can be encoded as a string, but an integer is more space-efficient. We’ll use 0 and 1 to indicate processed and unprocessed, respectively. The timestamp is the standard BSON date. And the payload here is a simple plaintext message, but could be anything in principle:

{ state: 0,
  created: ISODate("2011-02-24T16:29:36.697Z"),
  message: "hello world" }

You’ll need to declare an index that allows you to efficiently fetch the oldest unprocessed entry (FIFO). A compound index on state and created fits the bill:

db.queue.ensureIndex({state: 1, created: 1})

You then use findAndModify to return the next entry and mark it as processed:

q = {state: 0}
s = {created: 1}
u = {$set: {state: 1}}
db.queue.findAndModify({query: q, sort: s, update: u})

If you’re using a standard collection, then you’ll need to be sure to remove old queue entries. It’s possible to remove them at processing time using findAndModify’s {remove: true} option. But some applications may want to postpone removal for a later time, once the processing is complete.

Capped collections may also serve as the basis for a worker queue. Without the default index on _id, a capped collection has potentially faster insert speed, but the difference will be negligible for most applications. The other potential advantage is automatic deletion. But this feature is a double-edged sword: you’ll have to make sure that the collection is large enough to prevent unprocessed entries from aging out. Thus if you do use a capped collection, make it extra large. The ideal collection size will depend on your queue write throughput and the average payload size.

Once you’ve decided on the size of capped collection to use, the schema, index, and findAndModify will be identical to those of the standard collection just described.

B.1.6. Dynamic attributes

MongoDB’s document data model is useful for representing entities whose attributes vary. Products are the canonical example of this, and you saw some ways of modeling these attributes earlier in the book. One viable way to model these attributes is to scope them to a sub-document. In a single products collection, you can then store disparate product types. You might store a set of headphones

{ _id: ObjectId("4d669c225d3a52568ce07646")
  sku: "ebd-123"
  name: "Hi-Fi Earbuds",
  type: "Headphone",
  attrs: { color: "silver",
           freq_low: 20,
           freq_hi: 22000,
           weight: 0.5
         }
}

and an SSD drive:

{ _id: ObjectId("4d669c225d3a52568ce07646")
  sku: "ssd-456"
  name: "Mini SSD Drive",
  type: "Hard Drive",
  attrs: { interface: "SATA",
           capacity: 1.2 * 1024 * 1024 * 1024,
           rotation: 7200,
           form_factor: 2.5
         }
}

If you need to frequently query on these attributes, you can create sparse indexes for them. For example, you can optimize for range queries in headphone frequency response:

db.products.ensureIndex({"attrs.freq_low": 1, "attrs.freq_hi": 1},
  {sparse: true})

You can also efficiently search hard disks by rotation speed with the following index:

db.products.ensureIndex({"attrs.rotation": 1}, {sparse: true})

The overall strategy here is to scope your attributes for readability and app discoverability and to use sparse indexes to keep null values out of the indexes.

If your attributes are completely unpredictable, then you can’t build a separate index for each one. You have to use a different strategy in this case as illustrated by the following sample document:

{ _id: ObjectId("4d669c225d3a52568ce07646")
  sku: "ebd-123"
  name: "Hi-Fi Earbuds",
  type: "Headphone",
  attrs: [ {n: "color", v: "silver"},
           {n: "freq_low", v: 20},
           {n: "freq_hi", v: 22000},
           {n: "weight", v: 0.5}
         ]
}

Here attrs points to an array of sub-documents. Each of these documents has two values, n and v, corresponding to each dynamic attribute’s name and value. This normalized representation allows you to index these attributes using a single compound index:

db.products.ensureIndex({"attrs.n": 1, "attrs.v": 1})

You can then query using these attributes, but to do that, you must use the $elemMatch query operator:

db.products.find({attrs: {$elemMatch: {n: "color", v: "silver"}}})

Note that this strategy incurs a lot of overhead since it requires storing the key names in the index. It would be important to test this for performance on a representative data set before going into production.

B.1.7. Transactions

MongoDB doesn’t provide ACID guarantees over a series of operations, and no equivalent of RDBMSs’ BEGIN, COMMIT, and ROLLBACK semantics exists. When you need these features, use a different database (either for the data that needs proper transactions or for the application as a whole). Still MongoDB supports atomic, durable updates on individual documents and consistent reads, and these features, though primitive, can be used to implement transaction-like behavior in an application.

You saw an extended example of this in chapter 6’s treatments of order authorization and inventory management. And the worker queue implementation earlier in this appendix could easily be modified to support rollback. In both cases, the foundation for transaction-like behavior is the ever-versatile findAndModify command, which is used to atomically manipulate a state field on one or more documents.

The transactional strategy used in all these cases can be described as compensation-driven.[3] The compensation process in abstract works like this:

3 Two pieces of literature covering compensation-driven transactions are worth studying. The original is Garcia-Molina and Salem’s “Sagas” paper (http://mng.bz/73is). The less formal but no less interesting “Your Coffee Shop Doesn’t Use Two-Phase Commit” by Gregor Hohpe (http://mng.bz/kpAq) is also a great read.

  1. Atomically modify a document’s state.
  2. Perform some unit of work, which may include atomically modifying other documents.
  3. Ensure that the system as a whole (all documents involved) is in a valid state. If so, mark the transaction complete; otherwise, revert each document to its pre-transaction state.

It’s worth noting that the compensation-driven strategy is all but necessary for long-running and multistep transactions. The whole process of authorizing, shipping, and canceling an order is just one example. For these cases, even an RDBMS with full transactional semantics must implement a similar strategy.

There may be no getting around certain applications’ requirements for multi-object ACID transactions. But with the right patterns, MongoDB can pull some transactional weight and might support the transactional semantic your application needs.

B.1.8. Locality and precomputation

MongoDB is frequently billed as an analytics database, and plenty of users store analytics data in MongoDB. A combination of atomic increments and rich documents seems to work best. For example, here’s a document representing total page views for each day of the month along with the total for the month as a whole. For brevity, the following document contains totals only for the first five days of the month:

{ base: "org.mongodb", path: "/",
  total: 99234,
  days: {
    "1": 4500,
    "2": 4324,
    "3": 2700,
    "4": 2300,
    "5": 0
  }
}

You can update the totals for the day and month with a simple targeted update using the $inc operator:

use stats-2011
db.sites-nov.update({ base: "org.mongodb", path: "/" },
  $inc: {total: 1, "days.5": 1 });

Take a moment to notice the collection and database names. The collection, sites-nov, is scoped to a given month, and the database, stats-2011, to a particular year.

This gives the application good locality. When you query for recent visits, you’re always querying a single collection that’s relatively small compared with the overall analytics history. If you need to delete data, you can drop a time-scoped collection rather than removing some subset of documents from a larger collection. That latter operation may result in on-disk fragmentation.

The other principle at work here is precomputation. Sometime near the beginning of the month, you insert a template document with zeroed values for each day of the month. As a result, the document will never change size as you increment the counters therein because you’ll never actually be adding fields; you’ll only be changing their values in-place. This is important because it keeps the document from being relocated on disk as you write to it. Relocation is slow and often results in fragmentation.

B.2. Anti-patterns

MongoDB lacks constraints, which can lead to poorly organized data. Here are a few issues commonly found in problematic production deployments.

B.2.1. Careless indexing

When users experience performance problems, it’s not unusual to discover a whole slew of unused or inefficient indexes. The most efficient set of indexes for an application will always be based on an analysis of the queries being run. Be disciplined about the optimization methods presented in chapter 7.

B.2.2. Motley types

Ensure that keys of the same name within a collection all share the same type. If you store a phone number, for instance, then store it consistently, either as a string or an integer (but not as both). The mixing of types in a single key’s value makes the application logic complex, and makes BSON documents difficult to parse in certain strongly typed languages.

B.2.3. Bucket collections

Collections should be used for one type of entity only; don’t put products and users in the same collection. Because collections are cheap, each type within your application should get its own collection.

B.2.4. Large, deeply nested documents

There are two misunderstandings about MongoDB’s document data model. One is that you should never build relationships between collections, but rather represent all relationships in the same document. This frequently degenerates into a mess, but users nevertheless sometimes try it. The second misunderstanding stems from an overly literal interpretation of the word document. A document, these users reason, is a single entity just like a real-life document. This leads to large documents that are difficult to query and update, let alone comprehend.

The bottom line here is that you should keep documents small (well under 100 KB per document unless you’re storing raw binary data) and that you shouldn’t nest more than a few levels deep. A smaller size makes document updates cheaper because, in the case where a document needs to be rewritten on disk completely, there’s less to rewrite. The other advantage is that the documents remain comprehensible, which makes life easier for developers needing to understand the data model.

B.2.5. One collection per user

It’s rarely a good idea to build out one collection per user. One problem with this is that the namespaces (indexes plus collections) max out at 24,000 by default. Once you grow beyond that, you have to allocate a new database. In addition, each collection and its indexes introduce extra overhead, making this strategy a waste of space.

B.2.6. Unshardable collections

If you expect a collection to grow large enough to merit sharding, be sure that you can eventually shard it. A collection is shardable if you can define an efficient shard key for that collection. Review the tips in chapter 9 on choosing a shard key.